Extracting text from PDF

Helge Blischke h.blischke at srz.de
Tue Jul 3 09:11:34 PDT 2007


Rolf Kutz wrote:
> Kurt Pfeifle wrote:
> 
>>If you use your firefox to "print to file" your job, please upload the resulting PostScript. I/we can then try to convert with a CUPS/pstops + Ghostscript commandline chain (using different versions of Ghostscript and parameter variations) to see if we find one which does not show your problem....
> 
> 
> Here is a link to the Postscript:
> 
> http://www.technology-forum.com/tmp/1004.ps
> 
> Regards, Rolf

This is what I suspected: the fonts used were converted from TrueType
to Type 1, and the encoding uses glyph names derived from the Unicode
code points (like /uni0031 etc.); see the snippet below:
---snip---
%!PS-AdobeFont-1.0-3.0 DejaVu_Serif.Book.0.0.Set0 1.0
%%Creator: Mozilla Freetype2 Printing code 2.0
%%Title: DejaVu_Serif.Book.0.0.Set0
%%Pages: 0
%%EndComments
8 dict begin
/FontName /DejaVu_Serif.Book.0.0.Set0 def
/FontType 1 def
/FontMatrix [ 0.001 0 0 0.001 0 0 ]readonly def
/PaintType 0 def
/FontBBox [-769 -401 1679 1242]readonly def
/Encoding [
/.notdef
/uni0031/uni0030/uni0034/uni0068/uni0074/uni0070/uni003A/uni002F
/uni0077/uni002E/uni0065/uni0063/uni006E/uni006F/uni006C/uni0067
/uni0079/uni002D/uni0066/uni0072/uni0075/uni006D/uni0033/uni0073
/uni005F/uni0020/uni0032/uni0037/uni0035/.notdef/.notdef/.notdef
....
---snip---
The glyph naming scheme used here is quite proprietary, so the
pstotext utilities cannot map these glyph names back to characters.
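
To illustrate what such a mapping would involve, here is a minimal Python
sketch. It is not how pstotext works, and the names ENCODING,
glyph_name_to_text and decode_codes are made up for this example. It
assumes every glyph name is "uni" followed by exactly four uppercase hex
digits, as in the snippet above, and it only uses the first few entries
of the quoted Encoding array.

```python
import re

# First entries of the Encoding array quoted above; the index in this
# list is the character code used in the PostScript show strings.
ENCODING = [
    ".notdef",
    "uni0031", "uni0030", "uni0034", "uni0068",
    "uni0074", "uni0070", "uni003A", "uni002F",
]

# Assumption: names look like "uni" + four uppercase hex digits.
_UNI_NAME = re.compile(r"^uni([0-9A-F]{4})$")


def glyph_name_to_text(name):
    """Return the character for a uniXXXX glyph name, or '' if the
    name does not match the assumed pattern (e.g. .notdef)."""
    match = _UNI_NAME.match(name)
    if match is None:
        return ""
    return chr(int(match.group(1), 16))


def decode_codes(codes, encoding):
    """Translate a sequence of character codes into text by looking
    up each code in the font's encoding array."""
    return "".join(glyph_name_to_text(encoding[code]) for code in codes)


if __name__ == "__main__":
    # Codes 4,5,5,6,7,8,8 select uni0068 uni0074 uni0074 uni0070
    # uni003A uni002F uni002F from the array above.
    print(decode_codes([4, 5, 5, 6, 7, 8, 8], ENCODING))
```

A real tool would also have to read the Encoding array out of each
embedded font, since the code-to-glyph assignment differs per font subset.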

Helge



-- 
Helge Blischke
Softwareentwicklung

H.Blischke at acm.org



