Channel: Adobe Community : Popular Discussions - PDF Language and Specifications

Identity-H, CMap and troubles choosing predefined encodings


I'm trying to parse a certain PDF document on Mac OS X. The pages have embedded CID fonts with Identity-H encoding. The font itself is a Type0 font with a CIDFontType2 descendant font. I'm able to extract text from any page by reading the 2-byte CIDs and mapping them to the characters defined in the ToUnicode stream. However, there are a few character mismatches which (IMHO) are caused by a wrongly chosen encoding (MacRomanEncoding instead of PDFDocEncoding).
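For reference, here is a minimal Python sketch of the extraction step described above. It assumes the ToUnicode CMap has already been parsed into a plain dictionary from character code to Unicode string. The function name `decode_identity_h` and the sample mapping are hypothetical and do not come from any PDF library or from the actual document.

```python
from __future__ import annotations


def decode_identity_h(raw: bytes, to_unicode: dict[int, str]) -> str:
    """Decode a string shown with an Identity-H font.

    With Identity-H every character code is 2 bytes, big-endian, and the
    code is used directly as the CID. `to_unicode` is assumed to be the
    already-parsed ToUnicode mapping (code -> Unicode string).
    """
    if len(raw) % 2 != 0:
        raise ValueError("Identity-H strings must have an even byte length")

    chars = []
    for i in range(0, len(raw), 2):
        code = int.from_bytes(raw[i:i + 2], "big")
        # Codes missing from the mapping become U+FFFD rather than a guess.
        chars.append(to_unicode.get(code, "\ufffd"))
    return "".join(chars)


# Hypothetical mapping, for illustration only.
sample_map = {0x0024: "A", 0x00D8: "\u00d8"}
decoded = decode_identity_h(bytes.fromhex("002400D8"), sample_map)
```

Note that in this sketch the ToUnicode values are already Unicode, so no single-byte encoding such as MacRomanEncoding or PDFDocEncoding is applied anywhere on this path.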

 

One of the mismatched characters in the document is Ø (Latin capital letter O with stroke, empty set symbol); the character I'm extracting is ÿ (Latin small letter y with diaeresis). According to the PDF 1.7 specification, Ø and ÿ have the same octal code, 330, but in different encodings (PDFDocEncoding and MacRomanEncoding respectively).
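The collision can be reproduced with a small Python sketch. Python has no built-in PDFDocEncoding codec, so `latin-1` is used here as a stand-in. That is an assumption which holds for this particular byte, since both place Ø at octal 330. It only illustrates how one byte value reads under the two encodings and is not a diagnosis of the actual parser.

```python
BYTE = 0o330  # octal 330 == 0xD8

# The same byte value interpreted under two different single-byte encodings.
as_mac_roman = bytes([BYTE]).decode("mac_roman")
as_latin1 = bytes([BYTE]).decode("latin-1")  # stand-in for PDFDocEncoding at this code

# In MacRoman, Ø lives at a different code (octal 257).
o_stroke_in_mac_roman = "\u00d8".encode("mac_roman")

print(as_mac_roman, as_latin1, oct(o_stroke_in_mac_roman[0]))
```

If a Unicode value such as U+00D8 were ever narrowed to a single byte and then re-read as MacRoman, it would produce exactly this Ø to ÿ swap.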

 

My question is: how can I be sure to select the correct encoding for the text? Is it PDFDocEncoding by default unless specified otherwise?

