Extracting text from PDF

Wed Jun 27 07:28:14 PDT 2007

> I'm trying to extract text automatically from PDFs with pdftotext to
> make them searchable. This usually works well, except with PDFs
> generated by cups-pdf. There is no text output at all. What is the
> reason for this and is there a way to change this?

The reason is that no fonts are properly embedded or referenced in your PDF.

Most likely, your PDF-generating software chain does a poor job here. (Which software components are involved? With which settings? Is Ghostscript part of it? If so, what exactly is the Ghostscript command line being used?)

The text you see on screen (or on paper, when printed) is then probably stored in the PDF only as bitmaps, not as characters set in proper fonts, so pdftotext finds nothing to extract.
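
One way to check this yourself, assuming the pdffonts utility (it ships alongside pdftotext in xpdf and poppler) is installed and on your PATH: list the fonts each PDF uses and flag the files that report none. The sketch below is only an illustration; the function names count_fonts and pdf_has_fonts are made up for this example, and the parsing assumes pdffonts prints a header line, a dashed separator line, and then one row per font.

```python
import subprocess
import sys


def count_fonts(pdffonts_output: str) -> int:
    """Count the font rows that follow the dashed separator in pdffonts output."""
    lines = pdffonts_output.splitlines()
    for i, line in enumerate(lines):
        if line.startswith("---"):
            # Every non-blank line after the separator describes one font.
            return sum(1 for row in lines[i + 1:] if row.strip())
    # No separator found: treat the output as listing no fonts.
    return 0


def pdf_has_fonts(path: str) -> bool:
    """Run pdffonts on a file and report whether it lists any font at all."""
    result = subprocess.run(
        ["pdffonts", path], capture_output=True, text=True, check=True
    )
    return count_fonts(result.stdout) > 0


if __name__ == "__main__":
    # Usage: python check_fonts.py file1.pdf file2.pdf ...
    for pdf_path in sys.argv[1:]:
        if pdf_has_fonts(pdf_path):
            print(f"{pdf_path}: fonts found, pdftotext should have text to work with")
        else:
            print(f"{pdf_path}: no fonts listed, text is probably stored as images")
```

A file that lists no fonts at all will not give pdftotext anything to work with; getting searchable text out of it would need OCR or a fix in the software that generates the PDF. A file that does list fonts is not guaranteed to extract cleanly either, but it rules out this particular cause.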

--
Kurt Pfeifle
System & Network Printing Consultant --- Linux/Unix/Windows/Samba/CUPS
Infotec Deutschland GmbH - A RICOH Company ......... Stuttgart/Germany