Extracting ASCII text from a PDF Document

Chris Brannon cmbrannon79 at gmail.com
Thu Aug 12 12:40:05 UTC 2010


Martin McCormick wrote:
> I have a PDF document that does have embedded ASCII text in it.
> I need to use the file on a Debian system so I hope I am
> just using a2ps and pstotext wrong.

Don't do that!  Use pdftotext instead.
On my distribution, Arch Linux, pdftotext is provided by the "poppler"
package.  I don't know which package you need for Debian.
Perhaps it's in xpdf.
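
If you want to script the conversion, here is a minimal Python sketch
that simply shells out to pdftotext.  The function names and the
keep_layout parameter are made up for illustration; the only pdftotext
option used is -layout, which asks it to preserve the page layout.
It assumes pdftotext is already installed and on your PATH.

```python
import shutil
import subprocess


def build_pdftotext_command(pdf_path, txt_path, keep_layout=True):
    """Return the argument list for one pdftotext run (hypothetical helper)."""
    command = ["pdftotext"]
    if keep_layout:
        # -layout asks pdftotext to keep the physical layout of the page.
        command.append("-layout")
    command += [pdf_path, txt_path]
    return command


def pdf_to_text(pdf_path, txt_path):
    """Convert pdf_path to plain text at txt_path by running pdftotext."""
    # Assumption: pdftotext is installed; fail early with a clear message if not.
    if shutil.which("pdftotext") is None:
        raise FileNotFoundError("pdftotext is not on PATH")
    # check=True raises CalledProcessError if pdftotext exits with an error.
    subprocess.run(build_pdftotext_command(pdf_path, txt_path), check=True)
```

Calling pdf_to_text("document.pdf", "document.txt") should then leave
the extracted text in document.txt; both file names are placeholders.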

One thing you'll notice when converting PDF to plain text is that certain
letter combinations come out as single Unicode ligature characters,
encoded as UTF-8.  The PDF's fonts typeset those pairs as one glyph, and
the extracted text keeps that glyph instead of the separate letters.
Common examples are fi, fl, and ff.
Of course, most screenreaders won't render those correctly.
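
One way to clean this up afterwards is to map the ligature characters
back to plain letters.  The following is only a sketch: the table covers
the five Latin f-ligatures (U+FB00 through U+FB04), and the names
LIGATURES and expand_ligatures are mine, not part of any tool.
pdftotext also has an -enc option, and "-enc ASCII7" may avoid the
problem at the source, but I have not checked that on every build.

```python
# Map each Unicode ligature character to the plain letters it stands for.
LIGATURES = {
    "\ufb00": "ff",
    "\ufb01": "fi",
    "\ufb02": "fl",
    "\ufb03": "ffi",
    "\ufb04": "ffl",
}

# str.maketrans accepts single-character keys with multi-character values.
_LIGATURE_TABLE = str.maketrans(LIGATURES)


def expand_ligatures(text):
    """Return text with the f-ligatures replaced by ordinary ASCII letters."""
    return text.translate(_LIGATURE_TABLE)
```

For example, running expand_ligatures over the contents of the text
file produced by pdftotext should give output that a screenreader can
speak normally, at least as far as these five characters go.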

-- Chris




More information about the Blinux-list mailing list