Extracting text from PDF
Rolf
kutz at netcologne.de
Wed Jun 27 08:00:04 PDT 2007
> Rolf wrote:
> > I'm trying to extract text automatically from PDFs with pdftotext to make them searchable. This usually works well, except with PDFs generated by cups-pdf. There is no text output at all. What is the reason for this and is there a way to change this?
> >
> > regards, Rolf
>
> cups-pdf uses Ghostscript to generate the PDF. Your problem is
> most likely due to a Ghostscript version that does not
> generate Unicode maps for the used fonts in the PDF, a feature
> pdftotext depends on.
cups-pdf is using ESP Ghostscript 815.03:
$ pdfinfo 1004.pdf
Producer: ESP Ghostscript 815.03
CreationDate: Wed Jun 27 12:36:14 2007
ModDate: Wed Jun 27 12:36:14 2007
Tagged: no
Pages: 2
Encrypted: no
Page size: 595 x 842 pts (A4)
File size: 54363 bytes
Optimized: no
PDF version: 1.4
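pdfinfo does not show whether the embedded fonts carry Unicode maps. pdffonts, which ships with the same tools as pdftotext, should list that in its "uni" column. Below is a minimal sketch for filtering that output. The helper name fonts_without_unicode_map is made up for illustration, and the sketch assumes the last three fields of each font row are the "uni" flag, the object number and the generation number, which may differ between versions.

```shell
#!/bin/sh
# Hypothetical helper: reads pdffonts output on stdin and prints the name
# of every font whose "uni" column says "no".
# Assumption: each font row ends with "<uni> <object> <generation>", so the
# uni flag is the third field from the end. The first two lines (column
# header and dashes) are skipped.
fonts_without_unicode_map() {
    awk 'NR > 2 && NF >= 6 && $(NF-2) == "no" { print $1 }'
}

# Usage (file name taken from the pdfinfo call above):
#   pdffonts 1004.pdf | fonts_without_unicode_map
```

If this prints font names for the cups-pdf file, the missing Unicode maps would be a plausible cause of the empty pdftotext output.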
Generating a PDF from the same source with html2ps and ps2pdf does produce text output, but the resulting PDF doesn't look as good. That path goes through ESP Ghostscript 815.03, too (ps2pdf is a wrapper around Ghostscript):
Producer: ESP Ghostscript 815.03
CreationDate: Wed Jun 27 16:42:49 2007
ModDate: Wed Jun 27 16:42:49 2007
Tagged: no
Pages: 3
Encrypted: no
Page size: 595 x 842 pts (A4)
File size: 12745 bytes
Optimized: no
PDF version: 1.4
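Since both PDFs were produced by the same Ghostscript version, the problem seems to be related to cups-pdf rather than to Ghostscript itself, I guess.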
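To compare the two files directly, the amount of text that pdftotext extracts can be counted. This is only a sketch: the helper name count_text_chars is made up, and other.pdf stands in for the html2ps/ps2pdf file, whose name is not given here.

```shell
#!/bin/sh
# Hypothetical helper: reads text on stdin and prints the number of
# non-whitespace bytes. The form feeds that pdftotext writes between pages
# count as whitespace, so a PDF without extractable text yields 0.
count_text_chars() {
    tr -d '[:space:]' | wc -c | tr -d ' '
}

# Usage ("-" makes pdftotext write to stdout; other.pdf is a placeholder):
#   pdftotext 1004.pdf - | count_text_chars
#   pdftotext other.pdf - | count_text_chars
```

A result of 0 for the cups-pdf file and a non-zero count for the other one would confirm the difference described above.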
regards, Rolf