Extracting text from PDF

Rolf kutz at netcologne.de
Wed Jun 27 08:00:04 PDT 2007


> Rolf wrote:
> > I'm trying to extract text automatically from PDFs with pdftotext to make them searchable. This usually works well, except with PDFs generated by cups-pdf. There is no text output at all. What is the reason for this and is there a way to change this?
> >
> > regards, Rolf
>
> cups-pdf uses Ghostscript to generate the PDF. Your problem is
> most likely due to a Ghostscript version that does not
> generate Unicode maps for the fonts used in the PDF, a feature
> pdftotext depends on.

cups-pdf is using ESP Ghostscript 815.03:

$ pdfinfo 1004.pdf
Producer:       ESP Ghostscript 815.03
CreationDate:   Wed Jun 27 12:36:14 2007
ModDate:        Wed Jun 27 12:36:14 2007
Tagged:         no
Pages:          2
Encrypted:      no
Page size:      595 x 842 pts (A4)
File size:      54363 bytes
Optimized:      no
PDF version:    1.4

Generating a PDF from the same source with html2ps and ps2pdf does produce extractable text, but the resulting PDF doesn't look as good. That route goes through ESP Ghostscript 815.03, too:

Producer:       ESP Ghostscript 815.03
CreationDate:   Wed Jun 27 16:42:49 2007
ModDate:        Wed Jun 27 16:42:49 2007
Tagged:         no
Pages:          3
Encrypted:      no
Page size:      595 x 842 pts (A4)
File size:      12745 bytes
Optimized:      no
PDF version:    1.4

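Since both PDFs were produced by the same Ghostscript version and only the cups-pdf one yields no text, the problem seems to be related to cups-pdf itself rather than to Ghostscript, I guess.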
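
For anyone who wants to run the same comparison over more files, here is a minimal sketch of the check in Python. It assumes pdfinfo and pdftotext (from xpdf or poppler) are on the PATH; the function names are made up for illustration, and it only reports the Producer field and whether any text came out, nothing more.

```python
import subprocess


def parse_pdfinfo(output: str) -> dict:
    """Turn pdfinfo's "Key:   value" lines into a dict of strings."""
    info = {}
    for line in output.splitlines():
        # Split at the first colon only, so timestamps keep their colons.
        key, sep, value = line.partition(":")
        if sep:
            info[key.strip()] = value.strip()
    return info


def has_extractable_text(text: str) -> bool:
    """True if the pdftotext output holds more than whitespace and form feeds."""
    return bool(text.strip())


def check_pdf(path: str) -> tuple:
    """Return (producer, has_text) for one PDF by calling pdfinfo and pdftotext."""
    info = subprocess.run(
        ["pdfinfo", path],
        capture_output=True, encoding="utf-8", errors="replace", check=True,
    ).stdout
    # Passing "-" as the output file makes pdftotext write to stdout.
    text = subprocess.run(
        ["pdftotext", path, "-"],
        capture_output=True, encoding="utf-8", errors="replace", check=True,
    ).stdout
    producer = parse_pdfinfo(info).get("Producer", "unknown")
    return producer, has_extractable_text(text)
```

Calling check_pdf("1004.pdf") on the cups-pdf file above would be expected to give the ESP Ghostscript producer string together with False, since pdftotext returns no text for it.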

regards, Rolf
