I have a font that does not have a ToUnicode map. Its encoding is /90ms-RKSJ-H.
Looking that up in the PDF Reference, I see these two entries:
90ms-RKSJ-H Microsoft Code Page 932 (lfCharSet 0x80), JIS X 0208 character set with NEC and IBM® extensions
90ms-RKSJ-H/V Adobe-Japan1-2 Adobe-Japan1-2 Adobe-Japan1-2 Adobe-Japan1-2
That tells me that to extract the text content I need to use the Adobe-Japan1-2 CMap to convert character codes to CIDs, and then use the Adobe-Japan1-UCS2 CMap to convert CIDs to Unicode. (The first CMap has a Registry of Adobe and an Ordering of Japan1.)
Well, that makes sense. But then I look at Adobe-Japan1-2, which has this codespacerange:
1 begincodespacerange
<0000> <22FF>
endcodespacerange
So character codes are two bytes long, and the first byte must be no greater than 0x22.
Here's the first draw string I get from the PDF:
<8DE092639640906C2091538D918E7392AC91BA90558BBB8BA689EF8AF1958D8D7388D 720>Tj
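For comparison, here is a minimal Python sketch that treats the same bytes as plain Code Page 932 text, which is what the first entry above suggests they might be. That reading is an assumption on my part, not something the font dictionary confirms. It uses Python's built-in `cp932` codec and strips the whitespace inside the hex string, which PDF ignores.

```python
# Hex payload of the Tj operand above. PDF ignores whitespace inside
# a hex string, so it is stripped before conversion.
HEX_STRING = "8DE092639640906C2091538D918E7392AC91BA90558BBB8BA689EF8AF1958D8D7388D 720"

data = bytes.fromhex("".join(HEX_STRING.split()))

# Assumption: the bytes are Code Page 932 (Shift-JIS with Microsoft
# extensions), which mixes one-byte and two-byte character codes.
text = data.decode("cp932")

print(len(data), "bytes ->", len(text), "characters")
print(text)
```

If the decode succeeds, the character count is smaller than the byte count but larger than half of it, which would mean the string mixes one-byte and two-byte codes rather than being uniformly two bytes per character.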
I believe that only one of the two-byte codes in that string actually fits the range.
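That count can be checked with a short Python sketch. It assumes, as I did above, that the string splits into fixed two-byte codes, and it tests each byte against the corresponding byte of the <0000> <22FF> codespace range. The helper names are my own and are not part of any PDF library.

```python
HEX_STRING = "8DE092639640906C2091538D918E7392AC91BA90558BBB8BA689EF8AF1958D8D7388D 720"

def split_two_byte_codes(hex_string):
    """Split a PDF hex string into consecutive two-byte codes (hypothetical helper)."""
    data = bytes.fromhex("".join(hex_string.split()))
    if len(data) % 2:
        raise ValueError("odd number of bytes; cannot split into two-byte codes")
    return [data[i:i + 2] for i in range(0, len(data), 2)]

def in_codespace(code, low=bytes.fromhex("0000"), high=bytes.fromhex("22FF")):
    """A code matches when every byte lies between the matching bytes of low and high."""
    return len(code) == len(low) and all(
        lo <= b <= hi for b, lo, hi in zip(code, low, high)
    )

codes = split_two_byte_codes(HEX_STRING)
fitting = [code.hex().upper() for code in codes if in_codespace(code)]

print(len(codes), "codes,", len(fitting), "inside the codespace range:", fitting)
```

Under that assumption the string yields 18 two-byte codes, and only <2091> falls inside the range.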