Extracting text from PDF

Helge Blischke h.blischke at srz.de
Fri Jun 29 10:04:45 PDT 2007


Kurt Pfeifle wrote:
>>>>Most likely, your PDF contains what text you see on screen (or on
>>>>paper, when printed) only in the form of bitmaps, not proper fonts...
>>>
>>>How can I check this?
> 
> 
> 
> I forgot one more method I wanted to mention: a commandline. Try this:
> 
>    pdffonts /path/to/your/pdf
> 
> If you don't have that command, install package "xpdf-utils" or "poppler-utils" (depending on your distribution).
> 
> 
> 
>>In acroread or in kpdf look for the menu entry where you can look at
>>the document properties. There you should see a tab which allows you
>>to check for the fonts.
>>
>>See if the fonts are there, and what kind of names they have.
>>
>>That said, this problem ("a bitmap font was used") usually does not
>>appear with Firefox. Helge's guess about the root of the problem may
>>be a much better one.
>>
>>If you use your firefox to "print to file" your job, please upload
>>the resulting PostScript. I/we can then try to convert with a
>>CUPS/pstops + Ghostscript commandline chain (using different versions
>>of Ghostscript and parameter variations) to see if we find one which
>>does not show your problem....
> 
> 
> 
> --
> Kurt Pfeifle
> System & Network Printing Consultant --- Linux/Unix/Windows/Samba/CUPS
> Infotec Deutschland GmbH - A RICOH Company ......... Stuttgart/Germany
> 

Just for fun, I did a test: printing from Firefox 1.0 (Solaris) to a file
and then distilling that file to PDF using both Ghostscript and Distiller
(4.05 on Solaris and 7.x on WinXP).
Both PDFs displayed fine, but pdftotext (3.02 from the xpdf suite) produced
no output for either of them. Neither PDF contains a ToUnicode map, so this
is just what I suspected.
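
For anyone who wants to check their own file for missing ToUnicode maps,
here is a minimal sketch that reads the text printed by pdffonts and lists
the fonts whose "uni" column says "no". The function name and the sample
table are made up for illustration (the sample is not the output for the
PDFs discussed here), and the column widths are taken from the dashed
ruler line rather than hard-coded, since they may differ between versions.

```python
def fonts_without_tounicode(pdffonts_output):
    """Return names of fonts whose 'uni' column in pdffonts output is 'no'."""
    lines = pdffonts_output.splitlines()
    # Locate the dashed ruler line; it marks the column boundaries.
    ruler_idx = next(
        (i for i, line in enumerate(lines) if line.startswith("---")), None
    )
    if ruler_idx is None or ruler_idx == 0:
        return []
    header, ruler = lines[ruler_idx - 1], lines[ruler_idx]
    # Derive (start, end) spans for each column from the runs of dashes.
    spans, start = [], None
    for pos, ch in enumerate(ruler + " "):
        if ch == "-" and start is None:
            start = pos
        elif ch != "-" and start is not None:
            spans.append((start, pos))
            start = None
    titles = [header[a:b].strip() for a, b in spans]
    if "name" not in titles or "uni" not in titles:
        return []
    name_span = spans[titles.index("name")]
    uni_span = spans[titles.index("uni")]
    missing = []
    for line in lines[ruler_idx + 1:]:
        if not line.strip():
            continue
        if line[uni_span[0]:uni_span[1]].strip() == "no":
            missing.append(line[name_span[0]:name_span[1]].strip())
    return missing


# Hypothetical sample in the layout pdffonts prints (illustration only).
SAMPLE = "\n".join([
    f"{'name':<36} {'type':<12} emb sub uni object ID",
    f"{'-' * 36} {'-' * 12} --- --- --- ---------",
    f"{'Times-Roman':<36} {'Type 1':<12} no  no  no  {'8  0':>9}",
    f"{'ABCDEF+Helvetica':<36} {'Type 1C':<12} yes yes yes {'12  0':>9}",
])

print(fonts_without_tounicode(SAMPLE))
```

In practice the input would be the captured output of
"pdffonts /path/to/your/pdf" instead of the sample string.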

Interestingly enough, both PDFs spit out reasonably readable text
when fed into Ghostscript's ps2ascii.

Looking into the PostScript job created by Firefox, character codes are
represented as 2-byte codes resembling UTF-16, and these strings are
rendered by a procedure, unicodeshow, which (at least for Type 1 fonts)
boils down to rendering each character using glyphshow.
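
To illustrate what such 2-byte codes look like when read as text, here is
a small sketch that decodes a PostScript-style hex string as UTF-16. It
assumes the codes are big-endian and written as a hex string in angle
brackets; neither detail is confirmed above, and the sample string is
invented, not copied from an actual Firefox job. The helper name is
hypothetical as well.

```python
def decode_two_byte_codes(hex_string):
    """Decode a PostScript-style hex string of 2-byte codes as UTF-16BE.

    Assumption: big-endian byte order; the job itself may differ.
    """
    cleaned = hex_string.strip().lstrip("<").rstrip(">")
    # PostScript ignores whitespace inside hex strings, so drop it here too.
    cleaned = "".join(cleaned.split())
    return bytes.fromhex(cleaned).decode("utf-16-be")


# Invented sample string (illustration only).
print(decode_two_byte_codes("<00480065006C006C006F>"))
```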

Helge

-- 
Helge Blischke
Softwareentwicklung

H.Blischke at acm.org



