Copying text from a protected pdf file

George White aa056 at chebucto.ns.ca
Fri Sep 16 12:44:01 UTC 2005


Quoting Paul Smith <phhs80 at gmail.com>:

> I have got a pdf file, whose text I would like to copy to a word
> processor. However, it seems to be protected, as when I copy and paste
> a piece of text from there into a word processor, I only see garbage.
> Is there some way of getting clean text from the pdf file?

The PDF format has many ways to display text.  To be able to extract text
you need a file that stores the text as strings and uses font information to
render them in the viewer.  You may instead be seeing page images that were
rasterized long ago.  Please provide the output of the "pdffonts" command,
preferably for a minimal document (a big document could combine sections
that use fonts with sections that are images).

For example, the simplest case is a document that uses the PostScript Type 1
fonts provided by the viewer:

$ pdffonts /usr/share/doc/cups-1.1.20/ssr.pdf
name                                 type         emb sub uni object ID
------------------------------------ ------------ --- --- --- ---------
Times-Roman                          Type 1       no  no  no       4  0
Helvetica                            Type 1       no  no  no       7  0
Helvetica-Bold                       Type 1       no  no  no       8  0
Times-Bold                           Type 1       no  no  no       5  0
Courier                              Type 1       no  no  no       3  0
Symbol                               Type 1       no  no  no       9  0
Times-Italic                         Type 1       no  no  no       6  0
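
To read a longer font table at a glance, the listing can be summarized with
a few lines of shell.  This is only a rough sketch, not part of pdffonts:
the function name "summarize_fonts" is made up for illustration, and it
assumes the column layout shown above (two header lines, then one font per
line ending in emb, sub, uni and the two-part object ID).  The sample data
is the table from this message.

```shell
#!/bin/sh
# summarize_fonts (hypothetical helper): read "pdffonts" output on stdin and
# print how many fonts are listed, how many are embedded, and how many have
# a Unicode map.  Fields are counted from the end of each line because the
# "type" column may contain spaces (e.g. "Type 1").
summarize_fonts() {
    awk 'NR > 2 && NF >= 7 {
             total++
             if ($(NF-4) == "yes") embedded++   # emb column
             if ($(NF-2) == "yes") unicode++    # uni column
         }
         END {
             printf "fonts=%d embedded=%d unicode=%d\n", total, embedded, unicode
         }'
}

# Sample input copied from the listing above.
summarize_fonts <<'EOF'
name                                 type         emb sub uni object ID
------------------------------------ ------------ --- --- --- ---------
Times-Roman                          Type 1       no  no  no       4  0
Helvetica                            Type 1       no  no  no       7  0
Helvetica-Bold                       Type 1       no  no  no       8  0
Times-Bold                           Type 1       no  no  no       5  0
Courier                              Type 1       no  no  no       3  0
Symbol                               Type 1       no  no  no       9  0
Times-Italic                         Type 1       no  no  no       6  0
EOF
# prints: fonts=7 embedded=0 unicode=0
```

With a real file this would be run as "pdffonts file.pdf | summarize_fonts".
A result of fonts=0 would suggest the pages are images rather than text;
other results are only a hint and still need to be read alongside the full
table.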


-- 
George N. White III
Head of St. Margarets Bay, Nova Scotia




More information about the fedora-list mailing list