Extracting text from PDF

Helge Blischke h.blischke at srz.de
Wed Jun 27 08:44:52 PDT 2007


Rolf wrote:
>>Rolf wrote:
>>
>>>I'm trying to extract text automatically from PDFs with pdftotext to make them searchable. This usually works well, except with PDFs generated by cups-pdf. There is no text output at all. What is the reason for this and is there a way to change this?
>>>
>>>regards, Rolf
>>
>>cups-pdf uses Ghostscript to generate the PDF. Your problem is
>>most likely due to a Ghostscript version that does not
>>generate Unicode maps for the fonts used in the PDF, a feature
>>pdftotext depends on.
> 
> 
> Cups-pdf is using ESP Ghostscript 815.03:
> $ pdfinfo 1004.pdf
> Producer:       ESP Ghostscript 815.03
> CreationDate:   Wed Jun 27 12:36:14 2007
> ModDate:        Wed Jun 27 12:36:14 2007
> Tagged:         no
> Pages:          2
> Encrypted:      no
> Page size:      595 x 842 pts (A4)
> File size:      54363 bytes
> Optimized:      no
> PDF version:    1.4
> 
> Generating from the same source with html2ps and ps2pdf does produce textual output, but the resulting PDF doesn't look as good. html2ps does use ESP Ghostscript 815.03, too:
> 
> Producer:       ESP Ghostscript 815.03
> CreationDate:   Wed Jun 27 16:42:49 2007
> ModDate:        Wed Jun 27 16:42:49 2007
> Tagged:         no
> Pages:          3
> Encrypted:      no
> Page size:      595 x 842 pts (A4)
> File size:      12745 bytes
> Optimized:      no
> PDF version:    1.4
> 
> It seems to be more related to cups-pdf, I guess.
> 
> regards, Rolf

From some PDFs I just tested, I draw the following conclusion:
if the fonts used for rendering text are CID fonts and lack
a ToUnicode map, the strings rendered with those fonts cannot
be converted to readable text.
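
As a rough way to check a given file for this condition, here is a
minimal sketch in Python. It is not a real PDF parser: it assumes the
font dictionaries are stored uncompressed in the file body (which should
hold for PDF 1.4 output, since that version has no object streams), and
the function name and regular expressions are my own, not part of any
tool. It lists composite (Type0, i.e. CID-keyed) fonts whose dictionary
has no /ToUnicode entry.

```python
import re

# Assumption: objects are written as "N G obj ... endobj" in plain text.
OBJ_RE = re.compile(rb"\d+\s+\d+\s+obj\b(.*?)\bendobj", re.S)
# Matches "/Type /Font" but not "/Type /FontDescriptor".
FONT_RE = re.compile(rb"/Type\s*/Font\b")
# Composite fonts are the ones that wrap a CID font.
TYPE0_RE = re.compile(rb"/Subtype\s*/Type0\b")
NAME_RE = re.compile(rb"/BaseFont\s*/([^\s/<>\[\]()]+)")


def type0_fonts_without_tounicode(pdf_bytes):
    """Return BaseFont names of Type0 fonts that lack a /ToUnicode entry."""
    missing = []
    for match in OBJ_RE.finditer(pdf_bytes):
        body = match.group(1)
        if not (FONT_RE.search(body) and TYPE0_RE.search(body)):
            continue  # not a composite font dictionary
        if b"/ToUnicode" in body:
            continue  # this font carries a Unicode map
        name = NAME_RE.search(body)
        missing.append(name.group(1).decode("latin-1") if name else "(unnamed)")
    return missing
```

To try it, read the PDF in binary mode (for example
open("1004.pdf", "rb").read()) and pass the bytes to the function. An
empty result only means that no such font was found by this simple scan;
it does not prove that text extraction will work.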

I do not know which version of Ghostscript was the first to
generate ToUnicode maps for those fonts; 8.15 may not have done so.

To be sure, please post (a URL to) both the PDF and the source
PostScript file.

Helge


-- 
Helge Blischke
Softwareentwicklung

H.Blischke at acm.org



