I printed a Microsoft Word file to PDF with Distiller, embedding the SakkalMajalla TrueType font. I want to extract the Unicode text from the PDF file, but I found that the ToUnicode CMap is missing part of the mapping. For example, CID 06B4 doesn't have any mapping. I guess 06B4 should be mapped to U+0644. There are some glyph substitutions in SakkalMajalla, so uni0644.medi (U+FEE0) is replaced by liga.0758.medi.alt1 (U+10354). Why can't Distiller deal with this situation? How can I recover the missing mapping from PDF objects other than ToUnicode? Thanks.
P.S. I also asked this question a couple of days ago. Please see
I haven't received any answers there, and I don't have the privilege to move or delete that discussion. Sorry for asking the same question in two communities.
/GS1 gs
BT
/TT1 1 Tf
24 0 0 24 513.84 764.1203 Tm
0 g
0 Tc
0 Tw
<0284>Tj
.495 .5925 TD
<0551>Tj
-.1675 -.5925 TD
<06b4>Tj
.4 .4225 TD
<0551>Tj
-.12 -.4225 TD
<024f>Tj
/TT2 1 Tf
12 0 0 12 506.58 764.1203 Tm
( )Tj
ET
/CIDInit /ProcSet findresource begin 12 dict begin begincmap /CIDSystemInfo <<
/Registry (JJEELB+TT1+0) /Ordering (T42UV) /Supplement 0 >> def
/CMapName /JJEELB+TT1+0 def
/CMapType 2 def
1 begincodespacerange <024f> <0551> endcodespacerange
3 beginbfchar
<024f> <0639>
<0284> <0649>
<0551> <064E>
endbfchar
endcmap CMapName currentdict /CMap defineresource pop end end
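
To make the gap concrete, here is a minimal sketch that compares the codes shown in the content stream with the codes covered by the ToUnicode CMap. It assumes two-byte codes, hex-string operands of `Tj` only (no `TJ` arrays), and `bfchar` entries only (no `bfrange`); the function names are hypothetical, and the two excerpts are copied from the dumps above.

```python
import re

# Excerpts copied from the content stream and ToUnicode CMap in the post.
CONTENT_STREAM = """
<0284>Tj
<0551>Tj
<06b4>Tj
<0551>Tj
<024f>Tj
"""

TOUNICODE = """
3 beginbfchar
<024f> <0639>
<0284> <0649>
<0551> <064E>
endbfchar
"""


def shown_codes(stream):
    """Return the codes shown by hex-string Tj operators, in order.

    Assumption: every code is two bytes (four hex digits).
    """
    codes = []
    for hex_string in re.findall(r"<([0-9A-Fa-f]+)>\s*Tj", stream):
        for i in range(0, len(hex_string), 4):
            codes.append(int(hex_string[i:i + 4], 16))
    return codes


def bfchar_map(cmap):
    """Parse bfchar entries into {code: unicode string} (bfrange is ignored)."""
    mapping = {}
    for block in re.findall(r"beginbfchar(.*?)endbfchar", cmap, re.S):
        for src, dst in re.findall(r"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>", block):
            # Destination values in a ToUnicode CMap are UTF-16BE.
            mapping[int(src, 16)] = bytes.fromhex(dst).decode("utf-16-be")
    return mapping


def unmapped_codes(stream, cmap):
    """Return the sorted codes that are shown but have no ToUnicode entry."""
    mapping = bfchar_map(cmap)
    return sorted({code for code in shown_codes(stream) if code not in mapping})


missing = unmapped_codes(CONTENT_STREAM, TOUNICODE)
print([f"{code:04X}" for code in missing])  # prints ['06B4']
```

This only identifies which codes lack a mapping; it does not recover the Unicode value for 06B4, which would need information from outside the ToUnicode CMap.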