Extracting ASCII text from a PDF Document
Chris Brannon
cmbrannon79 at gmail.com
Thu Aug 12 12:40:05 UTC 2010
Martin McCormick wrote:
> I have a PDF document that does have embedded ASCII text in it.
> I need to use the file on a Debian system so I hope I am
> just using a2ps and pstotext wrong.
Don't do that! Use pdftotext instead.
On my distribution, ArchLinux, pdftotext is provided by the "poppler"
package. I don't know which package you need for Debian.
Perhaps it's in xpdf.
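
If you want to script the conversion, here is a minimal sketch in Python.
It assumes pdftotext is already installed and on your PATH. The function
names and the file name are made up for illustration. The "-enc", "-layout",
and "-" (write to standard output) arguments are standard pdftotext options.

```python
import subprocess


def pdftotext_command(pdf_path, layout=False):
    """Build the pdftotext argument list; "-" sends the text to stdout."""
    cmd = ["pdftotext", "-enc", "UTF-8"]
    if layout:
        # -layout tries to keep the physical layout of the page.
        cmd.append("-layout")
    cmd += [pdf_path, "-"]
    return cmd


def pdf_to_text(pdf_path, layout=False):
    """Run pdftotext and return the extracted text as a str."""
    result = subprocess.run(
        pdftotext_command(pdf_path, layout),
        check=True,            # raise CalledProcessError if pdftotext fails
        capture_output=True,
    )
    return result.stdout.decode("utf-8", errors="replace")


if __name__ == "__main__":
    # "document.pdf" is a placeholder file name.
    print(pdf_to_text("document.pdf"))
```

From the shell, the equivalent is simply "pdftotext document.pdf document.txt".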
One thing you'll notice when converting PDF to plain text is that certain
letter combinations are replaced with single UTF-8-encoded Unicode characters.
These are typographic ligatures, which the PDF stores as one glyph each.
Common examples are fi (U+FB01), fl (U+FB02), and ff (U+FB00).
Of course, most screenreaders won't render those correctly.
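
One way to work around this is to turn those characters back into plain
letters before reading the text. The following is a rough sketch in Python,
not a complete solution: the table covers only the five Latin f-ligatures
from the Unicode "Alphabetic Presentation Forms" block, and the names
LIGATURES and expand_ligatures are my own.

```python
# Map each ligature code point to the ASCII letters it stands for.
LIGATURES = {
    "\ufb00": "ff",
    "\ufb01": "fi",
    "\ufb02": "fl",
    "\ufb03": "ffi",
    "\ufb04": "ffl",
}

# str.translate wants a table keyed by code point; maketrans builds it.
_TABLE = str.maketrans(LIGATURES)


def expand_ligatures(text):
    """Return text with the ligature characters above spelled out."""
    return text.translate(_TABLE)


if __name__ == "__main__":
    import sys

    # Filter standard input to standard output. This assumes the
    # environment is set up so that Python reads and writes UTF-8.
    for line in sys.stdin:
        sys.stdout.write(expand_ligatures(line))
```

Saved under a name of your choice, say expand_ligatures.py, it can sit at
the end of a pipeline that starts with "pdftotext -enc UTF-8 document.pdf -".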
-- Chris