Extracting text from PDF
kutz at netcologne.de
Thu Jun 28 07:58:26 PDT 2007
> > I'm trying to extract text automatically from PDFs with pdftotext to
> > make them searchable. This usually works well, except with PDFs
> > generated by cups-pdf. There is no text output at all. What is the
> > reason for this and is there a way to change this?
> The reason for this is that there are no fonts properly embedded or referenced in your PDF.
> Most likely, your PDF generating software chain does a poor job here. (Which software components are involved? With which settings? Ghostscript? What exactly is the Ghostscript commandline being used?)
It's a web page rendered by Firefox and printed with cups-pdf (version 2.4.2-3, Debian package). The Ghostscript used was ESP Ghostscript 815.03 (Debian package).
> Most likely, your PDF contains what text you see on screen (or on paper, when printed) only in the form of bitmaps, not proper fonts...
How can I check this?
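
One possible way to check, as a rough sketch only: the xpdf/poppler suite that ships pdftotext also ships pdffonts, which lists the fonts a PDF references. Assuming both tools are on the PATH, a file with no fonts listed and no extractable text most likely contains the page only as bitmaps. The helper names count_fonts and diagnose below are made up for illustration, and the parsing assumes that pdffonts prints two header lines followed by one row per font.

```python
import shutil
import subprocess


def count_fonts(pdffonts_output: str) -> int:
    """Count font rows in pdffonts output.

    Assumption: the output is a two-line header (column names, then a
    row of dashes) followed by one line per font.
    """
    lines = [ln for ln in pdffonts_output.splitlines() if ln.strip()]
    return max(0, len(lines) - 2)


def diagnose(pdf_path: str) -> str:
    """Return a rough guess at why text extraction from pdf_path fails."""
    for tool in ("pdffonts", "pdftotext"):
        if shutil.which(tool) is None:
            return f"{tool} not found on PATH"

    # List the fonts the PDF references.
    fonts = subprocess.run(
        ["pdffonts", pdf_path],
        capture_output=True, text=True, check=True,
    ).stdout

    # "-" as the output file makes pdftotext write to stdout.
    text = subprocess.run(
        ["pdftotext", pdf_path, "-"],
        capture_output=True, text=True, check=True,
    ).stdout

    n_fonts = count_fonts(fonts)
    has_text = bool(text.strip())  # strip() also drops page-break form feeds

    if n_fonts == 0 and not has_text:
        return "no fonts, no text: the pages are probably bitmaps only"
    if n_fonts > 0 and not has_text:
        return f"{n_fonts} font(s) but no text: possibly an encoding problem"
    return f"{n_fonts} font(s) and extractable text found"


# Example call (output.pdf is a placeholder file name):
#     print(diagnose("output.pdf"))
```

Running pdffonts by hand on the file gives the same information; the script only combines it with the pdftotext result.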