CMaps and ToUnicode CMaps

I'm working through two documents: 5014.CIDFont_Spec.pdf and 5411.ToUnicode.pdf. My ultimate goal is text extraction that includes the position of each character.

So in a CMap file the following sequences can appear:

3 begincidrange
<20> <7e> 1
<8140> <817e> 633
<8180> <81ac> 696
endcidrange

2 beginbfrange
<10FE> <10FF> <4E00>
<1100> <1101> <4E02>
endbfrange

The bfrange is supposed to produce the following mapping, which I believe maps CIDs to UTF-16BE:

CID=4350 -> U+4E00
CID=4351 -> U+4E01
CID=4352 -> U+4E02
CID=4353 -> U+4E03
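
To check my reading of the bfrange lines, here is a minimal Python sketch of the expansion. The names `expand_bfrange` and `to_unicode` are my own, not from either document, and it assumes the simple three-operand form where the destination is a single UTF-16BE code unit (no surrogate pairs, no array destinations).

```python
def expand_bfrange(lo_hex, hi_hex, dst_hex):
    """Expand one '<lo> <hi> <dst>' bfrange line into {source code: character}.

    Assumes dst is a single BMP code point written as one UTF-16BE unit.
    """
    lo = int(lo_hex, 16)
    hi = int(hi_hex, 16)
    dst = int(dst_hex, 16)
    # Each source code in [lo, hi] maps to dst plus its offset from lo.
    return {code: chr(dst + (code - lo)) for code in range(lo, hi + 1)}


# The two bfrange lines from the example above.
ranges = [("10FE", "10FF", "4E00"), ("1100", "1101", "4E02")]

to_unicode = {}
for lo, hi, dst in ranges:
    to_unicode.update(expand_bfrange(lo, hi, dst))

for code, ch in sorted(to_unicode.items()):
    print(f"{code} (0x{code:04X}) -> U+{ord(ch):04X}")
```

The source values 0x10FE through 0x1101 are 4350 through 4353 in decimal, which is where the numbers in the list above come from.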

And the cidrange is supposed to map char codes to CIDs.
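
As I understand it, the cidrange lookup works roughly like this sketch. `CID_RANGES` and `code_to_cid` are hypothetical names of mine, and the sketch is simplified: it ignores codespace ranges and the byte length of the code, which a real CMap parser would have to respect.

```python
# (first code, last code, CID of the first code), from the cidrange example.
CID_RANGES = [
    (0x20, 0x7E, 1),
    (0x8140, 0x817E, 633),
    (0x8180, 0x81AC, 696),
]


def code_to_cid(code):
    """Return the CID for a char code, or None if no range covers it."""
    for lo, hi, first_cid in CID_RANGES:
        if lo <= code <= hi:
            # CIDs run consecutively from first_cid across the range.
            return first_cid + (code - lo)
    return None


print(code_to_cid(0x20))    # first code of the first range
print(code_to_cid(0x8140))  # first code of the second range
```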

OK. Looking through section 5.9.1 of the PDF spec, a CID font can be mapped to Unicode in one of three ways:

-use the ToUnicode map
-use one of the encodings: MacRomanEncoding, MacExpertEncoding, or WinAnsiEncoding
-map the char code to a CID using the font's CMap, and then the CID to Unicode using the registry-ordering CMap

Where I'm confused is that text shown with these fonts looks something like this:

<002600520051004900550044005700480055> Tj

Based on the CMap rules, I can parse this string into char codes. That makes sense for a begincidrange, because it converts char codes to CIDs. But a ToUnicode CMap with beginbfrange is supposed to convert CIDs to Unicode.
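
For reference, this is how I'm splitting the string into codes. It is only a sketch: `split_codes` is my own helper name, and it assumes a fixed two-byte codespace, which is not guaranteed in general since the font's CMap codespace ranges decide how many bytes each code takes.

```python
def split_codes(hex_string, code_bytes=2):
    """Split a hex string operand into fixed-width big-endian codes.

    code_bytes=2 is an assumption; the real width comes from the
    codespace ranges of the font's CMap.
    """
    raw = bytes.fromhex(hex_string)
    if len(raw) % code_bytes:
        raise ValueError("string length is not a multiple of the code width")
    return [
        int.from_bytes(raw[i:i + code_bytes], "big")
        for i in range(0, len(raw), code_bytes)
    ]


codes = split_codes("002600520051004900550044005700480055")
print([f"0x{c:04X}" for c in codes])
```

Whether each resulting value should then be treated as a char code or as a CID is exactly what I'm unsure about.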

So my guess is that the hex in the Tj string holds CIDs if we have a ToUnicode map, and char codes if we don't. Is that right?
