Extracting text from PDF

Thu Jun 28 07:58:26 PDT 2007

> > I'm trying to extract text automatically from PDFs with pdftotext to
> > make them searchable. This usually works well, except with PDFs
> > generated by cups-pdf. There is no text output at all. What is the
> > reason for this and is there a way to change this?
>
> The reason for this is that there are no fonts properly embedded or referenced in your PDF.
>
> Most likely, your PDF-generating software chain does a poor job here. (Which software components are involved? With which settings? Ghostscript? What exactly is the Ghostscript command line being used?)

It's a web page rendered by Firefox and printed via cups-pdf (version 2.4.2-3, Debian package); the Ghostscript used was ESP Ghostscript 815.03 (Debian package).

> Most likely, your PDF contains what text you see on screen (or on paper, when printed) only in the form of bitmaps, not proper fonts...

How can I check this?
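
One rough check is sketched below in Python. It only scans the raw bytes of the file for font and image dictionaries, so it assumes that those dictionaries are stored uncompressed (likely for PDFs from a Ghostscript of this age, but not guaranteed). The function name count_fonts_and_images is made up for this illustration. The pdffonts tool that ships alongside pdftotext in xpdf gives a proper font listing and is the more reliable check; this is only a heuristic.

```python
import re
import sys

# Heuristic patterns: a font dictionary carries "/Type /Font", an embedded
# bitmap carries "/Subtype /Image". \b keeps "/FontDescriptor" from matching.
FONT_RE = re.compile(rb"/Type\s*/Font\b")
IMAGE_RE = re.compile(rb"/Subtype\s*/Image\b")


def count_fonts_and_images(pdf_bytes):
    """Return (font_count, image_count) found by a raw byte scan.

    Assumption: the dictionaries are not hidden inside compressed
    object streams; if they are, both counts will be too low.
    """
    fonts = len(FONT_RE.findall(pdf_bytes))
    images = len(IMAGE_RE.findall(pdf_bytes))
    return fonts, images


def main(path):
    with open(path, "rb") as fh:
        fonts, images = count_fonts_and_images(fh.read())
    print(f"{path}: {fonts} font object(s), {images} image object(s)")
    if fonts == 0 and images > 0:
        print("No fonts but some images: the text is probably bitmaps.")


if __name__ == "__main__":
    if len(sys.argv) > 1:
        main(sys.argv[1])
```

If this reports zero fonts and at least one image, that would be consistent with the page having been rendered to bitmaps, in which case pdftotext has nothing to extract. A nonzero font count does not prove the text is extractable (the fonts may themselves be bitmap fonts), so it only rules things out in one direction.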

regards, Rolf