I'm trying to parse a certain PDF document on Mac OS X. The pages have embedded CID fonts with Identity-H encoding. Each font is a Type0 font with a CIDFontType2 descendant font. I'm able to extract text from any page by reading 2-byte CIDs and mapping them to the characters defined in the ToUnicode stream. However, there are a few character mismatches which (IMHO) are caused by a wrongly chosen encoding (MacRomanEncoding instead of PDFDocEncoding).
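To make the extraction step concrete, here is a minimal Python sketch of what I mean by mapping 2-byte CIDs through ToUnicode. The table contents and the name `decode_identity_h` are made up for illustration; a real table would be parsed from the font's ToUnicode CMap stream.

```python
import struct

# Hypothetical ToUnicode table: 2-byte character code -> Unicode string.
# These entries are invented for the example, not taken from the document.
TO_UNICODE = {0x0024: "A", 0x0025: "B", 0x00D8: "\u00d8"}


def decode_identity_h(data: bytes, to_unicode: dict) -> str:
    """Split a string operand into big-endian 2-byte codes and map each one."""
    if len(data) % 2:
        raise ValueError("Identity-H strings must have an even byte length")
    codes = struct.unpack(">%dH" % (len(data) // 2), data)
    # U+FFFD marks codes that the table does not cover.
    return "".join(to_unicode.get(code, "\ufffd") for code in codes)


text = decode_identity_h(b"\x00\x24\x00\xd8\x00\x25", TO_UNICODE)
print(text)
```

With Identity-H the 2-byte code is the CID itself, so no single-byte encoding table is consulted in this path.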
One of the mismatched characters in the document is Ø (Latin capital letter O with stroke, which resembles the empty set symbol); the character I'm extracting instead is ÿ (Latin small letter y with diaeresis). According to the PDF 1.7 specification, Ø and ÿ share the same octal code, 330, but in different encodings (PDFDocEncoding and MacRomanEncoding respectively).
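The collision can be reproduced with a short Python sketch. Python has no PDFDocEncoding codec, so this assumes `latin-1` as a stand-in, which agrees with PDFDocEncoding at this particular code; `mac_roman` is Python's built-in Mac OS Roman codec.

```python
code = 0o330  # octal 330 == 0xD8 == 216

raw = bytes([code])

# The same byte read under the two single-byte encodings.
as_mac_roman = raw.decode("mac_roman")
as_latin1 = raw.decode("latin-1")  # assumption: stands in for PDFDocEncoding here

print(hex(code), repr(as_mac_roman), repr(as_latin1))
```

So a byte 0xD8 read as MacRomanEncoding gives ÿ, while the same byte read as PDFDocEncoding gives Ø, which matches the mismatch I'm seeing.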
My question is: how can I be sure to select the correct encoding for the text? Is it PDFDocEncoding by default unless specified otherwise?