I'm looking through a couple of documents: 5014.CIDFont_Spec.pdf and 5411.ToUnicode.pdf. My ultimate goal is text extraction that includes the position of each character.
So in a CMap file the following sequences can appear:
3 begincidrange
<20> <7e> 1
<8140> <817e> 633
<8180> <81ac> 696
endcidrange
2 beginbfrange
<10FE> <10FF> <4E00>
<1100> <1101> <4E02>
endbfrange
The bfrange is supposed to produce the following mapping, which I believe maps CIDs to UTF-16BE:
CID=4350 -> U+4E00
CID=4351 -> U+4E01
CID=4352 -> U+4E02
CID=4353 -> U+4E03
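To check that I'm reading the bfrange lines correctly, here is a minimal Python sketch that expands them into the table above. The helper name `expand_bfrange` is my own, and it assumes each destination is a single UTF-16BE code unit (a BMP character), so it can simply be incremented as an integer. The keys are whatever the source values turn out to be (I'm calling them CIDs above).

```python
def expand_bfrange(lo_hex, hi_hex, dst_hex):
    """Expand one bfrange line into {source_value: unicode_character}."""
    lo, hi = int(lo_hex, 16), int(hi_hex, 16)
    dst = int(dst_hex, 16)
    # Assumption: dst is one UTF-16BE code unit, so adding the offset
    # to it as an integer gives the next destination character.
    return {src: chr(dst + (src - lo)) for src in range(lo, hi + 1)}


# The two bfrange lines from the CMap excerpt above.
bfranges = [("10FE", "10FF", "4E00"), ("1100", "1101", "4E02")]

mapping = {}
for lo, hi, dst in bfranges:
    mapping.update(expand_bfrange(lo, hi, dst))

for src, ch in sorted(mapping.items()):
    print(f"{src} -> U+{ord(ch):04X}")
# 4350 -> U+4E00
# 4351 -> U+4E01
# 4352 -> U+4E02
# 4353 -> U+4E03
```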
And the cidrange is supposed to map char codes to CIDs.
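As I understand it, each cidrange line means "the first code maps to the given CID, and each following code maps to the next CID". A rough sketch of that lookup follows; `CID_RANGES` and `code_to_cid` are names I made up, and it compares codes by integer value only, ignoring the byte length of the code, which a real CMap reader would also have to check.

```python
# (first_code, last_code, first_cid), taken from the cidrange lines above.
CID_RANGES = [
    (0x20, 0x7E, 1),
    (0x8140, 0x817E, 633),
    (0x8180, 0x81AC, 696),
]


def code_to_cid(code, ranges=CID_RANGES):
    """Return the CID for a char code, or None if no range covers it."""
    for lo, hi, first_cid in ranges:
        if lo <= code <= hi:
            # Offset within the range is added to the range's first CID.
            return first_cid + (code - lo)
    # Assumption: unmapped codes are reported as None in this sketch.
    return None


print(code_to_cid(0x20))    # 1
print(code_to_cid(0x7E))    # 95
print(code_to_cid(0x8140))  # 633
print(code_to_cid(0x817E))  # 695
print(code_to_cid(0x8180))  # 696
```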
OK. According to section 5.9.1 of the PDF spec, a CID font can be mapped to Unicode in one of three ways:
-ToUnicode map
-use one of the encodings: MacRomanEncoding, MacExpertEncoding, or WinAnsiEncoding
-map the char code to a CID using the font's CMap, and then the CID to Unicode based on the registry-ordering CMap
Where I'm confused is that content using these fonts will have something like this:
<002600520051004900550044005700480055> Tj
Based on the CMap rules, I can parse this string into char codes. That makes sense for a begincidrange, because it converts char codes to CIDs. But a ToUnicode CMap with beginbfrange is supposed to convert CIDs to Unicode.
So my guess is that the hex string passed to Tj contains CIDs if we have a ToUnicode map, and char codes if we don't?
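For reference, this is roughly how I split the Tj string into codes. It is only a sketch: `split_codes` is my own helper, and it assumes every code is exactly two bytes, whereas the real code length should come from the codespace ranges of the font's CMap.

```python
def split_codes(hex_string, code_len=2):
    """Split a hex string operand into fixed-width big-endian codes."""
    data = bytes.fromhex(hex_string)
    if len(data) % code_len:
        raise ValueError("string length is not a multiple of the code length")
    return [
        int.from_bytes(data[i:i + code_len], "big")
        for i in range(0, len(data), code_len)
    ]


codes = split_codes("002600520051004900550044005700480055")
print([f"{c:04X}" for c in codes])
# ['0026', '0052', '0051', '0049', '0055', '0044', '0057', '0048', '0055']
```

Whether each of those nine values should then be looked up as a char code or as a CID is exactly the part I'm unsure about.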