Extracting text from PDF

Tue Jul 3 11:52:41 PDT 2007

Rolf Kutz wrote:
> Kurt Pfeifle wrote:
>>>>> Most likely, your PDF contains what text you see on screen (or on
>>>>> paper, when printed) only in the form of bitmaps, not proper fonts...
>>>> How can I check this?
>>
>> I forgot one more method I wanted to mention: the command line. Try this:
>>
>>    pdffonts /path/to/your/pdf
> 
> rk@hydra:~$ pdffonts PDF/1004.pdf
> name                                 type         emb sub uni object ID
> ------------------------------------ ------------ --- --- --- ---------
> RCISND+Nimbus_Sans_L.Bold.0.0.Set0   Type 1C      yes yes no      14  0
> QPHYDB+Nimbus_Roman_No9_L.Regular.0.0.Set0 Type 1C      yes yes no
>  9  0
> VKJNGT+Nimbus_Sans_L.Regular.0.0.Set0 Type 1C      yes yes no      12  0
> 
> Same result as above. I hope you can see something from this.

Yes. After some re-formatting...  :-)

 name                                       type         emb sub uni object ID
 ------------------------------------------ ------------ --- --- --- ---------
 RCISND+Nimbus_Sans_L.Bold.0.0.Set0         Type 1C      yes yes no      14  0
 QPHYDB+Nimbus_Roman_No9_L.Regular.0.0.Set0 Type 1C      yes yes no       9  0
 VKJNGT+Nimbus_Sans_L.Regular.0.0.Set0      Type 1C      yes yes no      12  0

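The "uni" column shows "no" for every font that has no "ToUnicode" map in the
PDF. Without that map, extraction tools may not be able to translate the
font's character codes back into readable text.
AFAIR, this is what Helge pointed out too.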
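
If you want to run this check over many files, something like the following
Python sketch might help. It only parses the text that pdffonts prints and
lists the fonts whose "uni" column says "no". The function names and the
SAMPLE string are made up for illustration, and it assumes font names contain
no spaces and that the last five columns are emb, sub, uni and the two
object ID numbers, as in the output above.

```python
import subprocess

# Sample pdffonts output, copied from the listing above.
SAMPLE = """\
name                                       type         emb sub uni object ID
------------------------------------------ ------------ --- --- --- ---------
RCISND+Nimbus_Sans_L.Bold.0.0.Set0         Type 1C      yes yes no      14  0
QPHYDB+Nimbus_Roman_No9_L.Regular.0.0.Set0 Type 1C      yes yes no       9  0
VKJNGT+Nimbus_Sans_L.Regular.0.0.Set0      Type 1C      yes yes no      12  0
"""


def fonts_without_tounicode(pdffonts_output):
    """Return the names of fonts whose "uni" column is "no"."""
    missing = []
    for line in pdffonts_output.splitlines():
        fields = line.split()
        # Skip blank lines, the header line and the row of dashes.
        if len(fields) < 6 or fields[0] == "name" or set(fields[0]) == {"-"}:
            continue
        # Read from the right, because the "type" column contains a space:
        # ... emb sub uni object-number generation
        uni = fields[-3]
        if uni == "no":
            missing.append(fields[0])  # assumes no spaces in the font name
    return missing


def check_pdf(path):
    """Run pdffonts on a file (it must be installed) and parse its output."""
    result = subprocess.run(
        ["pdffonts", path], capture_output=True, text=True, check=True
    )
    return fonts_without_tounicode(result.stdout)
```

For the file discussed here, fonts_without_tounicode(SAMPLE) returns all three
Nimbus fonts, which matches what the table shows.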

-- 
Kurt Pfeifle
System & Network Printing Consultant ---- Linux/Unix/Windows/Samba/CUPS
Infotec Deutschland GmbH  .....................  Hedelfinger Strasse 58
A RICOH Company  ...........................  D-70327 Stuttgart/Germany