pdf documents

Fri Dec 28 09:14:03 UTC 2007

Yes, but unless I'm badly mistaken, Tesseract is very old and doesn't 
support extracting images directly from PDF files.  You would still need 
to install the xpdf package to get the pdfimages utility, which writes 
each embedded image out as a separate file that the OCR can then process.  
I read about the OCR package you describe, but I'm fairly sure it's 
unmaintained.  Maybe someone was going to take over development; I'm not 
sure.  I've noticed that most PDF files contain real text rather than 
page images, and when they do contain images, those are usually pictures, 
so OCR would be useless on them anyway.  Also, what is the accuracy rate 
for this OCR package?  What about accessibility?
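
To make the two-step workflow concrete, here is a rough sketch of how it 
could be scripted.  It assumes pdfimages (from xpdf) and tesseract are 
both on the PATH, and it uses only their basic forms, "pdfimages PDF-file 
image-root" and "tesseract imagename outputbase".  The function names and 
the "page" image root are made up for illustration.  It also assumes the 
Tesseract build can read the PPM/PBM files pdfimages writes by default; 
as far as I know older releases only accepted TIFF, in which case a 
conversion step would be needed in between.

```python
import shutil
import subprocess
from pathlib import Path


def pdfimages_command(pdf_path, image_root):
    """Build the basic call: pdfimages PDF-file image-root."""
    return ["pdfimages", str(pdf_path), str(image_root)]


def tesseract_command(image_path, output_base):
    """Build the basic call: tesseract imagename outputbase."""
    return ["tesseract", str(image_path), str(output_base)]


def ocr_pdf_images(pdf_path, work_dir):
    """Extract embedded images from a PDF, then OCR each one.

    Hypothetical helper: returns the paths of the text files that
    tesseract is expected to write (outputbase + ".txt").
    """
    work_dir = Path(work_dir)
    work_dir.mkdir(parents=True, exist_ok=True)

    # Fail early if either external tool is missing.
    for tool in ("pdfimages", "tesseract"):
        if shutil.which(tool) is None:
            raise FileNotFoundError(f"{tool} not found on PATH")

    # Step 1: write each embedded image to work_dir/page-NNN.ppm or .pbm.
    subprocess.run(pdfimages_command(pdf_path, work_dir / "page"), check=True)

    # Step 2: run the OCR on each extracted image file.
    text_files = []
    for image in sorted(work_dir.glob("page-*")):
        if image.suffix not in (".ppm", ".pbm"):
            continue  # skip anything that is not a pdfimages output
        output_base = image.with_suffix("")
        subprocess.run(tesseract_command(image, output_base), check=True)
        text_files.append(output_base.with_suffix(".txt"))
    return text_files
```

Calling ocr_pdf_images("scan.pdf", "out") would only give useful text for 
PDFs that actually contain scanned page images, which is the limitation 
mentioned above.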

Matt Barnes wrote:
> Tesseract is an OCR engine and can convert PDFs and images to text. I 
> haven't gotten around to installing it and trying it out, but it seems 
> like the OCR of choice, located here:
> http://sourceforge.net/project/showfiles.php?group_id=158586