Copying text from a protected pdf file

Deron Meranda deron.meranda at gmail.com
Thu Sep 15 22:06:53 UTC 2005


On 9/15/05, Paul Smith <phhs80 at gmail.com> wrote:
> On 9/15/05, Leonard Isham <leonard.isham at gmail.com> wrote:
> > > > > I have got a pdf file, whose text I would like to copy to a word
> > > > > processor. However, it seems to be protected, as when I copy and paste
> > > > > a piece of text from there into a word processor, I only see garbage.
...
> Thanks, Leonard. I have just checked: the pdf file is not copy
> protected, but, even so, what I can copy into a word processor is
> garbage. It may be something relating with encodings.

It could be encodings.  Text in a PDF is really stored in terms of
glyphs, not characters, which makes text extraction particularly
difficult and font-specific.  Fortunately there are a few standard PDF
encodings defined by Adobe (these map "characters" to glyphs, and are
not quite what you'd normally think of as an "encoding"), but each PDF
file can create its own custom encodings as well, and visually you'd
see nothing different.  There's also nothing to keep the "text" in a
PDF file from being written in an odd order (such as right-to-left),
since it's just graphics instructions; but most PDF-generating
programs do it in the obvious way.
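
To make the glyph-versus-character point concrete, here is a toy
sketch in Python.  It is not real PDF parsing: the encoding tables and
the function names (build_custom_encoding, naive_copy, and so on) are
made up for illustration, and I'm assuming a generator that simply
numbers glyphs in order of first use.  It only shows why a page can
look right on screen while copy-and-paste gives garbage.

```python
# Toy model only: real PDF fonts and encodings are far more involved.

# A "standard" encoding: code numbers happen to match ASCII characters.
STANDARD = {code: chr(code) for code in range(32, 127)}


def build_custom_encoding(text):
    """Number glyphs 1, 2, 3... in order of first use (an assumed scheme)."""
    encoding = {}
    for ch in text:
        if ch not in encoding.values():
            encoding[len(encoding) + 1] = ch
    return encoding


def encode(text, encoding):
    """Turn text into the code numbers that would be stored in the file."""
    reverse = {ch: code for code, ch in encoding.items()}
    return bytes(reverse[ch] for ch in text)


def naive_copy(codes):
    """What you get if the codes are assumed to be ordinary characters."""
    return codes.decode("latin-1")


def correct_copy(codes, encoding):
    """What you get if the font's own code-to-character table is used."""
    return "".join(encoding[c] for c in codes)


word = "Hello"
custom = build_custom_encoding(word)
custom_codes = encode(word, custom)

print(repr(naive_copy(encode(word, STANDARD))))  # readable text
print(repr(naive_copy(custom_codes)))            # control characters
print(repr(correct_copy(custom_codes, custom)))  # readable again
```

Both versions would draw the same word on the page; only the second
one pastes as junk unless the extractor knows the custom table.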

You might want to look at the "pdftotext" program (which is part of
the xpdf package, obsoleted in FC4).  It generally can do a good job
of extracting text.
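
If you'd rather drive it from a script, something like the following
Python sketch should work.  It assumes pdftotext is installed and on
your PATH; the helper names and the file names are just placeholders
of mine.  The -layout and -enc options are regular pdftotext options,
but whether they help will depend on your particular file.

```python
import shutil
import subprocess


def pdftotext_command(pdf_path, txt_path, keep_layout=True):
    """Build the pdftotext command line as a list of arguments."""
    cmd = ["pdftotext", "-enc", "UTF-8"]
    if keep_layout:
        # Try to preserve the physical layout (columns, indentation).
        cmd.append("-layout")
    cmd += [pdf_path, txt_path]
    return cmd


def extract_text(pdf_path, txt_path):
    """Run pdftotext; raises if the tool is missing or exits with an error."""
    if shutil.which("pdftotext") is None:
        raise FileNotFoundError("pdftotext not found on PATH")
    subprocess.run(pdftotext_command(pdf_path, txt_path), check=True)


# Example (placeholder file names):
# extract_text("document.pdf", "document.txt")
```

If the output is still garbage, that points toward a custom encoding
in the file rather than a problem with your viewer or word processor.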

Just some more information... are your documents generally
written in English (or at least in the English alphabet)?  And are
they more like plain prose (paragraphs of text), or fancier, like
marketing materials with lots of interspersed graphics, panels, and
so forth?
-- 
Deron Meranda




More information about the fedora-list mailing list