Channel: Adobe Community : Popular Discussions - PDF Language and Specifications

CMaps and ToUnicode CMaps

I'm looking through a couple of documents: 5014.CIDFont_Spec.pdf and 5411.ToUnicode.pdf. My ultimate goal is text extraction that includes the positions of the characters.

So in a CMap file the following sequences can appear:

3 begincidrange
<20> <7e> 1
<8140> <817e> 633
<8180> <81ac> 696
endcidrange

2 beginbfrange
<10FE> <10FF> <4E00>
<1100> <1101> <4E02>
endbfrange

The bfrange is supposed to produce the mapping below, which I believe maps CIDs to UTF-16BE values:

CID=4350 -> U+4E00
CID=4351 -> U+4E01
CID=4352 -> U+4E02
CID=4353 -> U+4E03
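
As a minimal sketch in Python (not taken from either Adobe document), the expansion above can be reproduced as follows. The function name `expand_bfrange` is hypothetical, and the sketch assumes the destination is a UTF-16BE string whose last byte is incremented once per source value, which is how the bfrange rule is usually read. The keys are plain integers; whether they should be called CIDs or character codes is the question this post is asking.

```python
def expand_bfrange(lo_hex, hi_hex, dst_hex):
    """Expand one '<lo> <hi> <dst>' bfrange entry into {source value: text}.

    Assumption: dst is UTF-16BE and only its last byte is incremented
    across the range.
    """
    lo, hi = int(lo_hex, 16), int(hi_hex, 16)
    dst = bytes.fromhex(dst_hex)
    mapping = {}
    for offset in range(hi - lo + 1):
        last = dst[-1] + offset
        if last > 0xFF:
            # A range that would overflow the last byte is treated as malformed here.
            raise ValueError("bfrange overflows the last destination byte")
        mapping[lo + offset] = (dst[:-1] + bytes([last])).decode("utf-16-be")
    return mapping


# The two entries from the beginbfrange block above.
table = {}
table.update(expand_bfrange("10FE", "10FF", "4E00"))
table.update(expand_bfrange("1100", "1101", "4E02"))

for src, text in sorted(table.items()):
    print(f"{src} -> U+{ord(text):04X}")
```

Since 0x10FE is 4350 in decimal, the four keys come out as 4350 through 4353, matching the list above.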

And the cidrange is supposed to map character codes to CIDs.
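
A rough Python sketch of that lookup, using the three begincidrange entries quoted earlier. The names `CID_RANGES` and `code_to_cid` are made up for illustration, and the sketch ignores codespace ranges and code length (a real CMap would distinguish the one-byte code <20> from a two-byte code with the same value).

```python
# (first code, last code, CID assigned to the first code)
CID_RANGES = [
    (0x20, 0x7E, 1),
    (0x8140, 0x817E, 633),
    (0x8180, 0x81AC, 696),
]


def code_to_cid(code, ranges=CID_RANGES):
    """Return the CID for a character code, or None if no range covers it."""
    for lo, hi, first_cid in ranges:
        if lo <= code <= hi:
            # CIDs advance one-for-one with the code inside a range.
            return first_cid + (code - lo)
    return None


print(code_to_cid(0x20))    # first code of the first range
print(code_to_cid(0x8141))  # second code of the second range
print(code_to_cid(0x817F))  # falls in the gap between the last two ranges
```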

OK. Looking through section 5.9.1 of the PDF spec, a CID font can be mapped to Unicode in one of three ways:

-use the ToUnicode map
-use one of the encodings: MacRomanEncoding, MacExpertEncoding, or WinAnsiEncoding
-map the char code to a CID using the font's CMap, and then the CID to Unicode based on the registry-ordering CMap
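
To show how an extractor might choose between those three routes, here is a small Python sketch. It assumes the list is tried from top to bottom; the function name, its parameters, and the returned labels are all hypothetical and only mirror the wording of the list above.

```python
PREDEFINED_ENCODINGS = {"MacRomanEncoding", "MacExpertEncoding", "WinAnsiEncoding"}


def pick_unicode_route(has_tounicode, encoding_name=None, has_registry_ordering=False):
    """Pick a mapping route, assuming the three options are tried in list order."""
    if has_tounicode:
        return "tounicode"
    if encoding_name in PREDEFINED_ENCODINGS:
        return "predefined-encoding"
    if has_registry_ordering:
        # Two steps: char code -> CID (font CMap), then CID -> Unicode.
        return "cmap-then-registry-ordering"
    return None  # nothing in the list applies


print(pick_unicode_route(False, None, True))
```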

Where I'm confused is that content streams using these fonts will have something like this:

<002600520051004900550044005700480055> Tj

Based on the CMap rules, I can parse this string into char codes. That makes sense for a begincidrange, because that converts char codes to CIDs. But if I have a ToUnicode CMap with beginbfrange, it is supposed to convert CIDs to Unicode.
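
For reference, splitting the hex string from the Tj example into codes can be sketched like this in Python. The helper name `split_codes` is hypothetical, and the fixed two-byte code width is an assumption; in general the codespace ranges of the font's CMap decide how many bytes each code takes.

```python
def split_codes(hex_string, code_bytes=2):
    """Split a PDF hex string such as '<0026...>' into integer codes.

    Assumption: every code is exactly `code_bytes` bytes wide.
    """
    raw = bytes.fromhex(hex_string.strip("<>"))
    if len(raw) % code_bytes:
        raise ValueError("string length is not a multiple of the code width")
    return [
        int.from_bytes(raw[i:i + code_bytes], "big")
        for i in range(0, len(raw), code_bytes)
    ]


codes = split_codes("<002600520051004900550044005700480055>")
print([f"{c:04X}" for c in codes])
```

This yields nine two-byte values, starting with 0026 and 0052. What those values are called (char codes or CIDs) depends on which map they are fed into, which is the open question here.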

So my guess is that the hex string passed to Tj contains CIDs if we have a ToUnicode map, and char codes if we don't?
