CMaps and ToUnicode CMaps

June 2, 2008, 6:11 am

I'm looking through a couple of documents: 5014.CIDFont_Spec.pdf and 5411.ToUnicode.pdf. My ultimate goal is basically text extraction that includes the positions of the characters.

So in a CMap file the following sequences can appear:

3 begincidrange
<20> <7e> 1
<8140> <817e> 633
<8180> <81ac> 696
endcidrange

2 beginbfrange
<10FE> <10FF> <4E00>
<1100> <1101> <4E02>
endbfrange

The bfrange is supposed to do the following mapping, which I believe maps CIDs to UTF-16BE:

CID=4350 -> U+4E00
CID=4351 -> U+4E01
CID=4352 -> U+4E02
CID=4353 -> U+4E03

And the cidrange is supposed to map char codes to CID's.
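
As a rough illustration of the two range operators above, here is a minimal Python sketch. The function names `expand_bfrange` and `cid_from_cidrange` are made up for this example, and only the simple bfrange form (one starting destination of a single UTF-16BE code unit) is handled.

```python
def expand_bfrange(lo_hex, hi_hex, dst_hex):
    """Expand one bfrange line into {source code: Unicode string}.

    Assumes the destination is a single UTF-16BE code unit (4 hex digits).
    """
    lo, hi, dst = int(lo_hex, 16), int(hi_hex, 16), int(dst_hex, 16)
    return {code: chr(dst + (code - lo)) for code in range(lo, hi + 1)}


def cid_from_cidrange(code, ranges):
    """Look up a character code in a list of (lo, hi, first_cid) tuples."""
    for lo, hi, first_cid in ranges:
        if lo <= code <= hi:
            return first_cid + (code - lo)
    return None  # code is not covered by any range


# The two bfrange lines from the post
to_unicode = {}
to_unicode.update(expand_bfrange("10FE", "10FF", "4E00"))
to_unicode.update(expand_bfrange("1100", "1101", "4E02"))

# The three cidrange lines from the post
cid_ranges = [(0x20, 0x7E, 1), (0x8140, 0x817E, 633), (0x8180, 0x81AC, 696)]
```

Each source value in a range is offset from the start of the range, and the same offset is added to the destination, which is how 0x10FE (4350) ends up at U+4E00 and 0x1101 (4353) at U+4E03.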

Ok. So looking through the pdf spec 5.9.1, a CID Font can be mapped to unicode in one of three ways:

-ToUnicode map
-use one of the encodings: MacRomanEncoding, MacExpertEncoding, or WinAnsiEncoding
-map charcode to CID using font CMap and then CID to unicode based on the registry-ordering CMap

So I guess where I'm confused is that these fonts will have something like this:

<002600520051004900550044005700480055> Tj

And based on CMap rules I can parse this string into char codes. That makes sense for a begincidrange, because that converts char codes to CID's. But if I have a ToUnicode CMap with beginbfrange, it is supposed to convert CID's to Unicode.

So my guess is that the hex in the Tj array is CID's if we have a ToUnicode map, and it is char codes if we don't?
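
For what it is worth, a ToUnicode CMap is keyed by the character codes exactly as they appear in the string, and with an Identity-H encoding each 2-byte code is numerically equal to the CID. A minimal Python sketch of the first step, splitting the hex string into codes, assuming a fixed 2-byte codespace (the real code lengths come from the font CMap's codespacerange; `split_codes` is a name invented here):

```python
def split_codes(hex_string, bytes_per_code=2):
    """Split the hex operand of a Tj into fixed-width character codes."""
    raw = bytes.fromhex(hex_string)
    return [
        int.from_bytes(raw[i:i + bytes_per_code], "big")
        for i in range(0, len(raw), bytes_per_code)
    ]


codes = split_codes("002600520051004900550044005700480055")
```

Each resulting code would then be looked up in the ToUnicode table if there is one, or run through the font CMap and the registry-ordering CMap otherwise.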

↧

PDF/A vs. Form Fields (checkboxes, radiobuttons)

October 16, 2009, 8:25 am

I have a question regarding the PDF/A specification and would appreciate some clarification.

The spec says:

"If an annotation dictionary contains the AP key, the appearance dictionary that it defines as its value shall contain only the N key, whose value shall be a stream defining the appearance of the annotation. [...] Every form field shall have an appearance dictionary associated with the field's data. A conforming reader shall render the field according to the appearance dictionary without regard to the form data."

Now I have a document which is fully PDF/A compliant; however, the Acrobat preflight check (8 & 9) fails with the message that my checkboxes do not contain only the /N key. Yet they do contain just the /N (normal) appearances.

1.) So what's the problem? (I can't imagine this is a bug, since the intarsys PDF/A validation reports it as well...)

The PDF Spec 1.4 defines the 'normal' appearance of a checkbox like that:

/AP << /N << /On formXObject1 /Off formXObject2>>>>

2.) Or does the spec mean that there can be only one appearance inside the /N entry?

Then PDF 1.4 states:

"The appearance for the off state is optional, but if present must be stored in the appearance dictionary under the name Off."

3.) So if I define only the "On" state, will the document become fully PDF/A compliant?

4.) And what about the Appearance state (AS)? PDF 1.4 states:

"The choice between the checked and unchecked appearance states is determined by the AS entry in the annotation dictionary"

But PDF/A does not mention the /AS key, even though it is vital for checkboxes...?! According to the spec a viewer has to "render the field according to the appearance dictionary", but checkboxes can have two states...
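
To make the role of /AS concrete, here is a small Python sketch of how a viewer might pick the appearance to draw. It models the annotation as a plain dict and the appearance streams as placeholder strings; `select_normal_appearance` is a hypothetical helper, not part of any library.

```python
def select_normal_appearance(annot):
    """Return the normal appearance a viewer would draw for this widget."""
    normal = annot["AP"]["N"]
    if isinstance(normal, dict):
        # /N is a subdictionary of named states; /AS names the one to use.
        return normal.get(annot.get("AS"))
    # /N is a single appearance stream; /AS is not needed.
    return normal


checkbox = {
    "AP": {"N": {"On": "formXObject1", "Off": "formXObject2"}},
    "AS": "Off",
}
chosen = select_normal_appearance(checkbox)
```

In this model the appearance dictionary still has only the N key; the two checkbox states live one level below it.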

Thanks for bringing any light on this!

/ToM

↧

Cid to unicode

March 16, 2010, 7:01 am

Hi, I'm new to the forum and to development on the PDF format, so please be patient.

I'm writing a script to extract text from PDF, but I cannot understand how to translate CIDs to Unicode characters.

I've taken the CID numbers from the beginbfchar and beginbfrange sections, but how can I translate them to Unicode?

I have already read similar threads on the forum about this, but I still haven't understood how to do it.
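
A minimal Python sketch of one way to do this, assuming the ToUnicode stream has already been decompressed and only contains simple beginbfchar entries (ranges, odd-length hex strings and comments are not handled; `parse_bfchar` is a name made up for the example):

```python
import re

BFCHAR_BLOCK = re.compile(rb"beginbfchar(.*?)endbfchar", re.S)
HEX_PAIR = re.compile(rb"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>")


def parse_bfchar(cmap_bytes):
    """Build {character code: Unicode string} from beginbfchar sections."""
    mapping = {}
    for block in BFCHAR_BLOCK.findall(cmap_bytes):
        for src, dst in HEX_PAIR.findall(block):
            code = int(src.decode("ascii"), 16)
            # The destination is a UTF-16BE string, possibly several code units.
            mapping[code] = bytes.fromhex(dst.decode("ascii")).decode("utf-16-be")
    return mapping


sample = b"""2 beginbfchar
<0024> <0041>
<0025> <0042>
endbfchar"""
table = parse_bfchar(sample)

# Translate a sequence of codes taken from a text-showing operator.
text = "".join(table.get(code, "\ufffd") for code in (0x24, 0x25))
```

The left-hand value of each pair is the code as it appears in the content stream, and the right-hand value is the text it stands for, so translation is a dictionary lookup per code.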

Thanks.

↧

Problem with Fonts (DR, DA)

June 10, 2010, 4:53 am

My last question was regarding the usage of the DR attribute, which should be defined in the interactive form dictionary.

I did that, but different readers, Acrobat included, still seem to have problems with the PDFs...

I defined fonts as part of the interactive form dictionary:

/DR<</Font<</Times#20New#20Roman 11 0 R >>>>

In a text field I referenced that font in a /DA:

/DA(/Times#20New#20Roman 0 Tf 0 g)

I was told the #20 is allowed (it stands for a space character).
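
As a quick illustration of the #xx escape in names, here is a short Python sketch (the helper name `decode_pdf_name` is invented for this example):

```python
import re


def decode_pdf_name(name):
    """Replace each #xx escape in a PDF name with the character it encodes."""
    return re.sub(
        r"#([0-9A-Fa-f]{2})",
        lambda m: chr(int(m.group(1), 16)),
        name,
    )


decoded = decode_pdf_name("Times#20New#20Roman")
```

Both the /DR key and the name inside the /DA string have to decode to the same name for the lookup to succeed.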

Besides strange visual behaviour in Acrobat 8.1, after I save the document Acrobat uses Helvetica for the field, as if it couldn't find the font...?

Strange thing ;-)

Thanks for helping,

ToM

↧

Unicode Text Extract

February 8, 2011, 4:53 pm

It looks like this problem is not uncommon... I have a PostScript file that is generated by a third-party product. Distiller converts it to PDF, but I cannot search or copy the text. It looks like this is because the ToUnicode CMap information is not available. I created a very simple document that contains only A, B, C, 1 and 9, stacked vertically, and tried a few things, but none of them worked.

Can somebody shed some light on how to get this working?

So far I have tried the following lines in the CMap section of the PostScript file. I have read the tech note 5411 (ToUnicode Mapping File) and other Adobe references and manuals.

        5 begincidchar
        <0031> 20
        <001C> 28
        <0041> 36
        <0042> 37
        <0043> 38
        endcidchar

        5 beginbfchar
        <0014> <0031>
        <001C> <0039>
        <0024> <0041>
        <0025> <0042>
        <0026> <0043>
        endbfchar

        5 beginbfchar
        <14> <0031>
        <1C> <0039>
        <24> <0041>
        <25> <0042>
        <26> <0043>
        endbfchar

Thank you for your help.

* No attachment is possible in this forum(?)

%!PS-Adobe-3.0
%%Keywords: (atend)
%%Subject: (atend)
%%Title: (atend)
%%Creator: Document Sciences DCPI Build {dcpi_BUILD_3.0SP1.48.124}
%%Author: (atend)
%%CreationDate: Wed Feb 09 09:04:42 NZDT 2011
%%PageOrder: Ascend
%%Pages: 2
%%BoundingBox: 0 0 612 792
%%Orientation: Portrait
%%DocumentProcessColors: Black
%%DocumentCustomColors:
%%CMYKCustomColor:
%%RGBCustomColor:
%%EndComments
%%BeginProlog
%%BeginResource: definicoes
%%EndResource
/PslibDict 300 dict def PslibDict begin/N{def}def/B{bind def}N/eN{exch def}B
/ForcePS true N
/strcat{/str1 eN /str2 eN str1 length str2 length add string /str3 eN str3 0 str2 putinterval
str3 str2 length str1 putinterval str3}N [/setvpsjobname where ForcePS not and
{pop /VPS? true N
/Concat {grestore endinline gsave systemdict /concat get exec}def}
{/setvpsjobname
{256 string cvs /VPSJobName eN}N
/Concat {systemdict /concat get exec}def
/startbooklet{pop}N /endbooklet{}N /VPS? false N /endelement{}N}ifelse
cleartomark /ImgData {(ImgData_) ImgName strcat cvn}B /ImgForm {(ImgForm_) ImgName strcat cvn}B
/EmitterForm{VPS? {EPS_INIT exch
placeelement EPS_CLEANUP grestore PageBBoxDict defineinline gsave
}{EPS_INIT exch /Form findresource execform EPS_CLEANUP}ifelse}B
/BeginImgData{/img_size eN /img_ury eN /img_urx eN /img_lly eN /img_llx eN /ImgName eN
<</BBox [img_llx img_lly img_urx img_ury] img_size 0 eq{/F (ImgForm_) ImgName strcat}
{/Length img_size /EODCount img_size}ifelse /Flexible true /Type /Reusable >> VPS?{ImgForm exch defineelement}
{ImgData exch currentfile exch /SubFileDecode filter /ReusableStreamDecode filter}ifelse}N
/EndImageData{VPS?{endelement}{N <</FormType 1 /BBox [img_llx img_lly img_urx img_ury]/Matrix [1 0 0 1 0 0]
/PaintProc {pop ImgData cvx exec 0 setfileposition ImgData cvx exec cvx exec}>> ImgForm exch
/Form defineresource pop}ifelse}N
/PageBBoxDict {<</BBox [0 0 currentpagedevice /PageSize get aload pop ] >>}B
/Save {PslibDict /PageSave save put VPS? {PageBBoxDict defineinline}if}N
/Restore {VPS?{endinline}if PslibDict /PageSave get restore}N
/p{show}N/w{0 rmoveto}B/a{moveto}B/l{lineto}B/qs{currentpoint
currentpoint newpath moveto 3 2 roll dup true charpath stroke
stringwidth pop 3 2 roll add exch moveto}B/qf{currentpoint
currentpoint newpath moveto 3 2 roll dup true charpath fill
stringwidth pop 3 2 roll add exch moveto}B/qsf{currentpoint

currentpoint newpath moveto 3 2 roll dup true charpath gsave stroke grestore fill
stringwidth pop 3 2 roll add exch moveto}B/qc{currentpoint
currentpoint newpath moveto 3 2 roll dup true charpath clip
stringwidth pop 3 2 roll add exch moveto}B/qsc{currentpoint
currentpoint initclip newpath moveto 3 2 roll dup true charpath clip stroke
stringwidth pop 3 2 roll add exch moveto}B/qfc{currentpoint
currentpoint initclip newpath moveto 3 2 roll dup true charpath clip fill
stringwidth pop 3 2 roll add exch moveto}B/qfsc{currentpoint
currentpoint initclip newpath moveto 3 2 roll dup true charpath gsave stroke grestore clip fill
stringwidth pop 3 2 roll add exch moveto}B/qi{currentpoint
3 2 roll
stringwidth pop 3 2 roll add exch moveto}B/tr{currentpoint currentpoint 5 4 roll add moveto}B/rt{moveto}B
/reencdict 12 dict def /ReEncode { reencdict begin
/newcodesandnames exch def /newfontname exch def /basefontname exch def
/basefontdict basefontname findfont def /newfont basefontdict maxlength dict def
basefontdict { exch dup /FID ne { dup /Encoding eq
{ exch dup length array copy newfont 3 1 roll put }
{ exch newfont 3 1 roll put } ifelse } { pop pop } ifelse } forall
newfont /FontName newfontname put newcodesandnames aload pop
128 1 255 { newfont /Encoding get exch /.notdef put } for
newcodesandnames length 2 idiv { newfont /Encoding get 3 1 roll put } repeat
newfontname newfont definefont pop end } def
/MarkerBegin{countdictstack 50464 mark}B
/MarkerCleanup{stopped {(A Marker caused a PostScript error, continuing processing...) =}if
{cleartomark dup 50464 eq{pop exit}if}loop countdictstack exch sub dup 0 gt{{end}repeat}{pop}ifelse}B
/EPS_INIT{/EPS_SAVE save N gsave newpath 100 dict begin /DictStackDepth countdictstack N
/showpage{}N/erasepage{}N/copypage{}N mark}B
/EPS_CLEANUP{cleartomark DictStackDepth 1 dup countdictstack exch sub {end}for
end grestore EPS_SAVE restore} B
/setcustomcolor
{exch aload pop pop 4 {4 index mul 4 1 roll} repeat
5 -1 roll pop setcmykcolor} 1 index where {pop pop pop} {dup xcheck {bind} if N} ifelse
/ccdef {5 packedarray def} B
/cc {setcustomcolor} N
/ds_ComposeFont
{
    1 index /CMap resourcestatus
    { pop pop }
    {
        /CIDInit /ProcSet findresource
        begin
        12 dict
        begin
        begincmap
        /CMapName 2 index def
        /CMapVersion 1.000 def
        /CMapType 1 def
        /WMode 0 def

        /CIDSystemInfo 3 dict dup
        begin
        /Registry (Adobe) def
        /Ordering (Identity) def
        /Supplement 0 def
        end def
        1 begincodespacerange
        <0000> <FFFF>
        endcodespacerange
        1 begincidrange
        <0000> <FFFF> 0
        endcidrange
        endcmap
        CMapName currentdict /CMap defineresource pop
        end
        end
    } ifelse
    composefont
} bind def
end %PSlibdict
/pdfmark where {pop} {userdict /pdfmark /cleartomark load put} ifelse
[/Keywords(atend)
/Subject(atend)
/Title(atend)
/Creator (Document Sciences DCPI Build {dcpi_BUILD_3.0SP1.48.124})
/Author(atend)
/CreationDate (Wed Feb 09 09:04:42 NZDT 2011)
/DOCINFO pdfmark
%%EndProlog
%%BeginSetup
PslibDict begin

%%EndSetup
%DocScienceBeginComposeFont: EWMKLC+TimesNewRomanPSMT-Identity-H
16 dict
begin
/FontType 42 def
/FontMatrix [1 0 0 1 0 0] def
/FontBBox [-1164 2062 4096 -628] def
/CIDFontName /EWMKLC+TimesNewRomanPSMT def
/CIDCount 65535 def
/PaintType 0 def
/CIDFontType 2 def
/GDBytes 2 def
/CIDSystemInfo 3 dict dup
begin
/Registry (Adobe) def
/Ordering (Identity) def

/Supplement 0 def

end def
/CharStrings 1 dict dup begin /.notdef 0 def end def
/Encoding 1 array dup 0 /.notdef put readonly def
/CIDMap 0 def
/GlyphDirectory 16 dict dup begin
0 <0002011C0000051C050000030007004DB10201BB02BE0006000702BFB2000504B802BEB403000A07
04B802BEB5010019080605BF02BE0002000301290009016B015E00182B10F63CFD3C4E10F43C4DFD
3C003F3CFD3C10FC3CFD3C31302111211125211121011C0400FC2003C0FC400500FB002004C0> def
end def
/sfnts [

....

] def
/EWMKLC+TimesNewRomanPSMT currentdict end /CIDFont defineresource pop
/EWMKLC+TimesNewRomanPSMT-Identity-H /Identity-H [/EWMKLC+TimesNewRomanPSMT] ds_ComposeFont pop
%DocScienceEndComposeFont: EWMKLC+TimesNewRomanPSMT-Identity-H
%DocScienceBeginEmbedGlyphs
/EWMKLC+TimesNewRomanPSMT /CIDFont findresource /GlyphDirectory get
begin
37 <00030022000004E6054C001E002B0038027D40305A005A1E8900881E8933991A9D27AC1AAC27E91A
EA27E72F0C38007A2779339A329933A524AA33D81AD827D8280A043AB802E7B30F67363AB8FFC0B3
1C22343AB8FFC040E31517343340212C343340191E34324023283432401B1D3444244425891AD901
D624DA33E52507042401250D32100315061B1A141E1624162815302E3245244A34570158195A2796
02111000103A55015A24603A703A803AA03A081A301A3250000310071A241E28192F040602031E17
1E4F3388249A24D93307203A403A503A6302630360056006600760307606731A731B701E74247327
7A288406861B861E8F33803ACA2FDA2FEB24FA241959080F1F1B092122101F1B1621233324000304
2C00352B1F24032229382C33032E2228353509162928171716022E280808090890260126B8FFC0B2
3A3526B8FFC0B2423526B8FF80B33F413426B8FFC0B343463426B8FFC040144235264C5F1C010A1E
301C021C55042B1F382C31B8FFC040104535124004A004020004A004E0040304B8FFC0400A0D1134
00040120040104B801D140252C06131302552C0C0F0F02552C0C0D0D02552C22100F0E0F1002550F
200D0D02550F9E393ABC01D100210061011800182B2B4EF42B2B3C4DED2B2B2BFD5D712B5D714358
B90031032DE91BB90031032DED592B103C3C3C10F45D72ED2B2B2B2B2B72003F3C10ED3F3C10ED11
12392FED12173911121739113901111217392B2B3130437940412F342328181E01071A1B191B0206
062624250225332628182633012F07313301231E2633033401313303271B29330130052E3300251D
2233001E32033533010100103C2B3C2B2B2B012B2B2B2B2B2B2B2B2A81818181015D710172727200
7271002B2B2B2B012B2B2B005D005D01161716151406062321353332373635113427262323352132
171616151406251616333236363534262322071116333236353426262322060703B28D466180DFE5
FD80335525171D274D33024AA463969E7CFD7B255F3992934EC2BA64507471B5BE56C28F3E581B02
B41E425C8565B95525362372036C7E212C251824B77766A10F07073F824D77A816FB6F1BA3784F92
54040500> def
end
%DocScienceEndEmbedGlyphs

...

%%Page: 1 1
%%PageBoundingBox: 0 0 612 792
%%BeginPageSetup
<</PageSize [612.0 792.0]>> setpagedevice
[ /CropBox [0 0 612 792] /PAGE pdfmark
%%PageOrientation: Portrait
%%EndPageSetup
Save
/EWMKLC+TimesNewRomanPSMT-Identity-H findfont [12 0 0 12 0 0] makefont setfont
90 709.30 a
0 0 0 setrgbcolor
<0024> p
90 695.50 a
<0025> p
90 681.70 a
<0026> p
90 667.91 a
<0014> p
90 654.11 a
<001C> p
Restore
showpage

%%Page: 2 2
%%PageBoundingBox: 0 0 612 792
%%BeginPageSetup
[ /CropBox [0 0 612 792] /PAGE pdfmark
%%PageOrientation: Portrait
%%EndPageSetup
Save
/EWMKLC+TimesNewRomanPSMT-Identity-H findfont [12 0 0 12 0 0] makefont setfont
90 709.30 a
0 0 0 setrgbcolor
<0024> p
90 695.50 a
<0025> p
90 681.70 a
<0026> p
90 667.91 a
<0014> p
90 654.11 a
<001C> p
Restore
showpage

%%EOF

↧

Unicode Passwords

March 30, 2011, 9:26 am

Hi,

I wonder how Acrobat handles a password that contains Unicode characters when generating an encryption key. That is, what byte representation (i.e. encoding of the password string) is used before padding the password in Algorithm 3.2 (PDF Reference 1.7)?

In another thread (http://forums.adobe.com/message/2235717#2235717) it was stated that only characters that fall into the PDFDocEncoding (basically ISO-8859-1) are allowed. However, I found that in Acrobat I can enter non ISO-8859-1 characters for my password (and successfully reopen the file with Acrobat Reader on another machine using this password). However I cannot open the file with other PDF-readers (evince, PDF-XChange Viewer) that just work fine with non-unicode passwords.

Is there a document that explains these details? Are there any differences between the different PDF versions?
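
For context, this is roughly what the padding step of Algorithm 3.2 looks like in Python. The padding constant is the one given in the PDF Reference; the choice of Latin-1 as the password encoding is only an assumption standing in for PDFDocEncoding, and it is exactly the part this question is about (`pad_password` is a name made up here).

```python
# 32-byte padding string from Algorithm 3.2 of the PDF Reference
PAD = bytes.fromhex(
    "28BF4E5E4E758A4164004E56FFFA0108"
    "2E2E00B6D0683E802F0CA9FE6453697A"
)


def pad_password(password, encoding="latin-1"):
    """Truncate the encoded password to 32 bytes, or fill it up from PAD."""
    raw = password.encode(encoding)[:32]  # assumption: Latin-1 approximates PDFDocEncoding
    return raw + PAD[:32 - len(raw)]


padded = pad_password("abc")
```

With this assumption a password containing characters outside Latin-1 raises `UnicodeEncodeError`, so whatever Acrobat does for such passwords must use a different byte representation.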

Thanks,

Michael

↧

Is there a way to tell what settings were used to export a PDF?

June 18, 2014, 11:42 am

Is there any way to tell what export settings were used to create a PDF? I'm trying to recreate settings used in a previous PDF.

↧

ISO 32000 - 7.5.8 Cross-Reference Streams

July 9, 2014, 1:32 am

I'm having trouble interpreting Example 4 on page 48 of PDF 32000-1:2008, specifically in relation to the blue bolded text below.

The first field of these entries is the entry type (2) ------ What does the 2 represent?

number of the object stream (15) ------ What does the 15 represent?

position within the sequence of objects in the object stream (0, 1, and 2) ------ What does this refer to?

It's just not clear what the 15 is if it's the second entry. Would somebody be able to explain where these numbers come from?

The following shows the same objects from the previous example stored in an object stream in a PDF 1.5 file, along with a cross-reference stream.

The cross-reference stream (see 7.5.8, "Cross-Reference Streams") contains entries for the fonts (objects 11 and 13) and the descriptor (object 12), which are compressed objects in an object stream. The first field of these entries is the entry type (2), the second field is the number of the object stream (15), and the third field is the position within the sequence of objects in the object stream (0, 1, and 2). The cross-reference stream also contains a type 1 entry for the object stream itself.
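
Reading the quoted paragraph literally, each cross-reference stream entry is a short run of big-endian integers whose sizes come from /W. A Python sketch with made-up row bytes and an assumed /W of [1 2 1] (the spec example's actual /W and stream bytes are not reproduced here, and `split_fields` is a hypothetical helper):

```python
def split_fields(row, widths):
    """Split one cross-reference stream row into big-endian integer fields."""
    fields, pos = [], 0
    for w in widths:
        fields.append(int.from_bytes(row[pos:pos + w], "big"))
        pos += w
    return tuple(fields)


# Hypothetical rows for objects 11, 12 and 13, assuming /W [1 2 1]
rows = [bytes([2, 0, 15, 0]), bytes([2, 0, 15, 1]), bytes([2, 0, 15, 2])]
entries = [split_fields(row, [1, 2, 1]) for row in rows]
```

Under that reading, the 2 marks the entry as a compressed object, the 15 is the object number of the object stream that contains it, and 0, 1, 2 are the positions of the three objects inside that object stream.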

↧

PDF CID keyed font support

November 10, 2014, 4:00 am

Hi,

I understand that PDF supports embedded CFF font programs for composite fonts of type CIDFontType0; the spec is clear on that.

But does it support embedded font programs that are not CFF, i.e. a CIDFont with a Type 1 descendant, for example?

The spec only mentions CFF when talking about Type 1 composite fonts, but it isn't totally clear.

So, does PDF support embedded non-CFF programs for composite Type 1 fonts?

regards

Steve

↧

Can't decrypt attachment with Crypt filter

August 27, 2015, 2:42 pm

This question has been asked a couple of times previously and the answers are not helpful.

A PDF encrypted with attachments only (attachments use the /Crypt filter, /StdCF) does not work the same as a PDF encrypted with all-data-encrypted.

I have software that can read one but not the other.

Conversely, I have two PDFs created (by me) with identical format except for this setting.

The trailer ID is the same and the Crypt settings are the same (/EncryptMetadata false, /U, /O, /R, /V, etc.),

except that one is set up with attachments only (/StmF /Identity /StrF /Identity, and /StdCF /AuthEvent /EFOpen)

and one with everything (/StmF /StdCF /StrF /StdCF, and /StdCF /AuthEvent /DocOpen).

Let me stress: these files are *identical* except for these settings and the encrypted v non-encrypted data.

All of the objects have the same object id's, etc.

The contents of the encrypted attachment stream is *identical* byte for byte, between the two files,

and Acrobat can extract the attachment from the fully encrypted file but generates a 0 byte file for the other.

3rd party software: QPdf (5.1.3) cannot read the attachment from the Acrobat generated file, but *can* read it from the file I generated.

Foxit reader (7.03.916) can read the one generated by Acrobat, but extracts garbage from the file I generated.

I've been over the PDF reference 1.7 and the supplements and can't find any magic tricks. Any help would be appreciated.

↧

SigFlags - Signature flags meaning

May 24, 2017, 3:52 am

There are currently two values (bit positions) for the SigFlags key defined:

1 - SignaturesExist

2 - AppendOnly

I have questions regarding the signing state of a signature:

I) Does the SignaturesExist bit position mean that there are unsigned/empty signature fields?

II) Does the AppendOnly bit position mean that there is at least one signed signature (which would be invalidated if the file were changed "normally")? Or can the AppendOnly mode also be used to indicate that I (as the document creator) want all changes to be made in append mode?

III) Those are bit values, which can result in the following values in the final PDF:

0 - nothing set

1 - SignaturesExist=true, (AppendOnly=false) - Signed or unsigned signature exists

2 - AppendOnly=true, (SignaturesExist=false) - All changes should be done in append mode, don't know whether there are signatures

3 - SignaturesExist=true, AppendOnly=true - There are signatures; all changes should be done in append mode

Correct?
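
The arithmetic in III) can be written down as a small Python sketch (constant and function names are chosen for this example only):

```python
SIGNATURES_EXIST = 1 << 0  # bit position 1, value 1
APPEND_ONLY = 1 << 1       # bit position 2, value 2


def describe_sigflags(value):
    """Decode a /SigFlags integer into its two defined flags."""
    return {
        "SignaturesExist": bool(value & SIGNATURES_EXIST),
        "AppendOnly": bool(value & APPEND_ONLY),
    }


flags = describe_sigflags(3)
```

This only covers how the integer is composed; what each flag means for the signing state is still the open question above.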

↧

Insert a page from an external pdf

June 16, 2017, 8:14 pm

I want to insert a page from an external PDF file. I understand that I will have to use an XObject for that, but my attempts have failed. Can anyone share a sample use of an XObject with an external reference?

↧

startxref entries

October 18, 2017, 1:20 pm

I see the following all the time and I just can't seem to find any reason for this:

%PDF-1.5

%4 extended chars were here

26 0 obj

<</Linearized 1/L 165618/O 28/E 160858/N 1/T 165312/H [ 540 170]>>

endobj

39 0 obj

<</DecodeParms<</Columns 5/Predictor 12>>/Filter/FlateDecode/ID[<7B2B288DDC9D317F7D8371774B51C538><51EF5FB138E5AC44821CEBA03B86B240>]/Index[26 56]/Info 25 0 R/Length 82/Prev 165313/Root 27 0 R/Size 82/Type/XRef/W[1 3 1]>>stream

some stream data (obviously an xref stream)

endstream

endobj

startxref

0

%%EOF

The rest of the PDF follows.

...and here's the end of it:

startxref

116

%%EOF

So here's my question:

What is the point of the first startxref statement? Obviously at offset 0 is the header comment specifying the version.

I see this all the time and it has never made any sense to me.
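
For comparison, readers conventionally start from the last startxref in the file, searching backwards from the end. A minimal Python sketch of that lookup (`last_startxref` is a name invented for the example, and a real reader would only scan the tail of the file rather than all of it):

```python
import re


def last_startxref(pdf_bytes):
    """Return the offset that follows the final startxref keyword, if any."""
    matches = re.findall(rb"startxref\s+(\d+)", pdf_bytes)
    return int(matches[-1]) if matches else None
```

With a file laid out like the one above, this would return 116, so the earlier startxref near the top is not the one used to locate the cross-reference data when reading from the end.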

↧

Cascading Filters - Streams

April 10, 2018, 11:59 pm

I am trying to decode cascading filters. I have written my own ASCII85 encoding and decoding filters.

For Flate encoding and decoding, I am using zlib.

When I first do an ASCII85 decoding and then a FLATE decoding on a stream using

/Filter [/FlateDecode /ASCII85Decode] it works fine

but

when I do a FLATE encoding and then an ASCII85 encoding on the same stream using

/Filter [/ASCII85Decode /FlateDecode] it does not.

Any pointers as to why this happens?
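
One thing worth checking is the order: the names in a /Filter array are listed in the order the decoders have to be applied, so /Filter [/ASCII85Decode /FlateDecode] means "ASCII85-decode first, then inflate", and the writer must have encoded in the opposite order. A Python sketch of that rule using the standard library instead of hand-written filters (helper names are made up; only these two filters are covered):

```python
import base64
import zlib


def ascii85_decode(data):
    """Decode PDF-style ASCII85 data, which ends with the ~> marker."""
    data = data.strip()
    if data.endswith(b"~>"):
        data = data[:-2]
    return base64.a85decode(data)


def ascii85_encode(data):
    return base64.a85encode(data) + b"~>"


DECODERS = {"ASCII85Decode": ascii85_decode, "FlateDecode": zlib.decompress}
ENCODERS = {"ASCII85Decode": ascii85_encode, "FlateDecode": zlib.compress}


def decode_stream(data, filters):
    """Apply the decoders in the order they are listed in /Filter."""
    for name in filters:
        data = DECODERS[name](data)
    return data


def encode_stream(data, filters):
    """Encode so that decode_stream(..., filters) gives the data back."""
    for name in reversed(filters):  # encoding runs in the reverse order
        data = ENCODERS[name](data)
    return data
```

If the data was deflated and then ASCII85-encoded, the matching array is [/ASCII85Decode /FlateDecode]; writing the array the other way round makes a reader try to inflate the ASCII85 text.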

Thanks

P.Chellappan

↧

Type 3 font created from Image

June 26, 2018, 12:45 am

Hi guys

I am trying to convert a document that is in AFP and uses bitmap fonts into a PDF document.

I am considering the different solutions available in PDF, and right now I am trying to create a Type 3 font that uses an image for a letter glyph.

While I can correctly create a 'letter' as a solid rectangle drawn with lines (example below), I cannot do it with an image that is defined as an XObject.

Is it possible to do this? I am presenting an example where the letter 'a' is correctly displayed as a rectangle, while the letter 'b', which is an XObject image, is not displayed correctly.

Or did I make a simple mistake that prevents 'b' from being displayed? Maybe someone has tried this before and has a working example?

I am not using any tool, just Notepad++, for creating the PDF. (The Arial font that is also defined in the PDF, 6 0 obj, is not used.)

Example:

%PDF-1.7

%ÇìÇì

1 0 obj

<</Type /Catalog

/Pages 2 0 R

endobj

2 0 obj

<< /Type /Pages

/Kids [

3 0 R

] /Count 1

endobj

3 0 obj

<</Type /Page

/MediaBox [0 0 595 842]

/Rotate 0

/Parent 2 0 R

/Resources 4 0 R

/Contents 5 0 R

endobj

4 0 obj

/Font << /F13 7 0 R >>

endobj

5 0 obj

<< /Length 51 >>

stream

/F13 28 Tf

72 360 Td

(ababacccccccabab) Tj

endstream

endobj

6 0 obj

<< /Type /Font

/Subtype /TrueType

/BaseFont /Arial

endobj

7 0 obj

<< /Type /Font

/Subtype /Type3

/FontBBox [ 0 0 750 750 ]

/FontMatrix [ 0.001 0 0 0.001 0 0 ]

/CharProcs 9 0 R

/Encoding 8 0 R

/FirstChar 97

/LastChar 98

/Resources 12 0 R

/Widths [ 1000 1000 ]

endobj

8 0 obj

<< /Type /Encoding

/Differences [ 97 /square /triangle ]

endobj

9 0 obj

<< /square 10 0 R

/triangle 11 0 R

endobj

10 0 obj

<< /Length 39 >>

stream

1000 0 0 0 750 750 d1

0 0 750 750 re

endstream

endobj

11 0 obj

<< /Length 52 >>

stream

1000 0 0 0 750 750 d1

10 0 0 10 2 2 cm

/Img1 Do

endstream

endobj

12 0 obj

/ProcSet [ /PDF /ImageB ]

/XObject<</Img1 13 0 R>>

endobj

13 0 obj

/Length 32

/Subtype /Image

/ImageMask true

/Type /XObject

/BitsPerComponent 1

/Height 14

/Width 16

/Decode [ 1 0 ]

/Filter /RunLengthDecode

stream

ü 8 8 8 0 p p8p8ð8àx€>

endstream

endobj

xref

0 14

0000000000 65535 f

0000000015 00000 n

0000000065 00000 n

0000000148 00000 n

0000000271 00000 n

0000000350 00000 n

0000000450 00000 n

0000000527 00000 n

0000000738 00000 n

0000000813 00000 n

0000000866 00000 n

0000000955 00000 n

0000001057 00000 n

0000001130 00000 n

trailer

<< /Size 7

/Root 1 0 R

/ID [<43C69728DA7CC038F5B0DFF857BC0967><43C69728DA7CC038F5B0DFF857BC0967>]

startxref

1342

%%EOF

↧

Is any tool to decompress stream from reference stream

July 19, 2018, 7:21 am

Guys

I am trying to look at the content of a cross-reference stream that is deflated, and I can't do it.

What I have done so far is write a tool that decompresses the data using the zlib library.

I also took into consideration the suggestions in this forum to cut the first 2 bytes from the stream and add a header. Nothing helps.

I guess someone has had this problem before. Maybe there is a tool that I could use.

I've got a tool that inflates the whole PDF document, but unfortunately it also replaces the cross-reference stream with a cross-reference table and expands each object stream into many objects, so for my educational purposes it is useless. I need to see what is in the cross-reference stream for a few of the old PDF files I am investigating, because I struggle to create one correctly for a new document.
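
A minimal Python sketch of inflating a single stream in place, assuming the object's raw bytes have already been cut out of the file and the only filter is /FlateDecode. The helper name is made up, and a robust tool would use /Length rather than searching for endstream.

```python
import re
import zlib


def inflate_first_stream(obj_bytes):
    """Inflate the data between the stream and endstream keywords."""
    start = re.search(rb"stream\r?\n", obj_bytes)
    if start is None:
        raise ValueError("no stream keyword found")
    end = obj_bytes.index(b"endstream", start.end())
    body = obj_bytes[start.end():end]
    # decompressobj stops at the end of the zlib data, so the end-of-line
    # bytes that precede "endstream" do not cause an error.
    return zlib.decompressobj().decompress(body)
```

If the stream dictionary has /DecodeParms with a /Predictor, the inflated bytes are still predicted rows and need one more step before the entries become readable.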

Regards,

Jerzy

↧

Create unicode supported pdf

May 6, 2008, 2:22 am

Hallo,

Please tell me how to create a Unicode PDF file with VB.

I need to know the structure of those PDF files.

Thanks

az

↧

Dealing with Predictors when decoding PDFs

June 21, 2010, 1:55 pm

I've seen a lot of questions about how to "unpredict" Xref streams. The answer usually goes something like "RTFM!"

Well, I have R'd the FM, and it's still complicated. So, to save others the pain I've gone through, here's a slightly more detailed rundown.

A great blog explaining how PNG prediction works (which, yes, can be applied to text) is here:

http://www.atalasoft.com/cs/blogs/rickm/archive/2008/05/02/using-png-predictors-to-enhance-gzip-pkzip-flate-compression.aspx

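A fun fact not covered in the blog: with PNG Up prediction each row is predicted from the row above it, so when you decode you need to run your algorithm top-down; a row can only be reconstructed once the row above it has been decoded.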

Basic steps:

Assuming you've already decoded (inflated, un-gzipped or whatever) the string...

You'll need to know the predictor type and the row width, at a bare minimum. Both are provided in the Xref stream dictionary.

-Predictor type will be /Predictor in the /DecodeParms subdictionary. Usually this is 12 (PNG Up prediction)

-Row width will be /Columns in the same /DecodeParms subdictionary. For an Xref stream it normally equals the sum of the field widths in /W, which is usually [1 2 1]

Armed with this information:

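1) Strip any trailing bytes that are not row data. I strip the last 10 characters of the string, which I took to be the CRC and which are unnecessary to extract the raw data. Note that a complete zlib inflate normally consumes the stream's checksum itself, so check first whether the decoded length is already a multiple of the row length (see step 3) before stripping anything.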

2) Sum up the column widths. For the example above [1 2 1] would be 4. This is one less than the number of bytes in each row.

3) Split the string into rows by the column width: sum+1, or in our example, 5.

4) The first byte on the row will be the predictor type. You can actually change the predictor line-by-line, though I haven't seen an example of this actually happening. For PNG Up prediction (12, as above), the first byte should be 0x02. You should either strip this off (i.e. assume all lines use the same prediction), or write code to change the algorithm based on this number. The simpler solution, albeit potentially hazardous for your reader, is to strip it off.

5) Either way, you should now have a row whose length equals the <width> (4 in our example). You now convert the row byte-by-byte. Since PNG Up unprediction works top-down, the first row has nothing above it and is already effectively decoded. I create a "prevRow" array with <width> entries, filled with zeroes.

6) Loop through each row, and within that loop, loop through each byte in the row. Convert the byte from binary to int and add it to the same byte in the previous (already decoded) row, keeping only the low 8 bits. Pseudocode: unpredictedByte = (prevRow[currentByte] + row[currentByte]) & 0xFF

7) Convert the int back to binary. You'll see why in a second.

8) Once you have all your rows unpredicted, loop through them again, splitting them into binary strings based on the W parameter. In our example: 1 2 1. I.e. the first string is 1 byte, the second string is 2 bytes, the third string is 1 byte.

9) Convert the ENTIRE STRING to int (this is why we had to go back to binary, since the strings can be of arbitrary length).

10) Your first column should be the Xref entry type: 0 = f. I.e. a "free" or deleted object. My reader ignores these. 1 = n. I.e. an "in use" object. You'll want to save these for the future. 2 = a compressed object. You'll also want to save these, but they'll require a little more work before they're usable.

11) If our first column is 1 (an in-use object), the second column is the offset address for that object. I add it to my Xref table array.

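12) If our first column is 2 (a compressed object), the second column is the object number of the object stream that contains it and the third column is its index within that stream. You have to decompress the object stream to get the actual object references and offsets. See PDF Specification section 7.5.7 (page 45).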

The code in PHP looks like this:

          // Assumes $rows (raw row strings, each starting with a filter-type byte)
          // and $totalWidth (sum of the /W widths) have been set by the steps above.
          $prevRow = array_fill(0, $totalWidth, 0); // Previous decoded row as ints; the first row has zeroes above it
          for ($j = 0; $j < count($rows); $j++) {
              $row = substr($rows[$j], 1); // Chop off the filter-type byte (should be 02 for Up prediction)
              $offsets[$j] = '';

              // Reconstruct the row byte by byte
              for ($i = 0; $i < strlen($row); $i++) {
                  $decodedByte = ord($row[$i]);               // Convert the binary character to an int so we can do math on it
                  $decodedByte = $decodedByte + $prevRow[$i]; // Add the byte directly above in the previous decoded row
                  $decodedByte = $decodedByte & 0xFF;         // PNG filter arithmetic is modulo 256, so keep the low byte
                  $prevRow[$i] = $decodedByte;                // Update for our next pass

                  // WARNING: Assumes 1 2 1 column structure.
                  if ($i == 0)
                      $types[$j] = $decodedByte;
                  else if ($i == 1 || $i == 2)
                      $offsets[$j] .= chr($decodedByte); // Convert back to binary
                  else if ($i == 3)
                      $generations[$j] = $decodedByte;
              } // Close for $i
          } // Close for $j
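
The same steps can be sketched in Python without hard-coding the 1 2 1 layout. This is only an illustration under the assumptions of this post: the data is already inflated, only PNG filter types 0 (None) and 2 (Up) occur, and the function names are made up.

```python
def unpredict_png_up(data, columns):
    """Undo PNG None/Up prediction; returns a list of decoded rows."""
    row_len = columns + 1  # one filter-type byte in front of every row
    if len(data) % row_len:
        raise ValueError("data length is not a multiple of the row length")
    prev = bytes(columns)  # the row "above" the first row is all zeroes
    rows = []
    for start in range(0, len(data), row_len):
        filter_type = data[start]
        raw = data[start + 1:start + row_len]
        if filter_type == 0:    # None: bytes are stored as-is
            cur = raw
        elif filter_type == 2:  # Up: add the byte above, modulo 256
            cur = bytes((a + b) & 0xFF for a, b in zip(raw, prev))
        else:
            raise NotImplementedError(f"PNG filter type {filter_type}")
        rows.append(cur)
        prev = cur
    return rows


def split_entry(row, widths):
    """Split a decoded row into big-endian integers according to /W."""
    fields, pos = [], 0
    for w in widths:
        fields.append(int.from_bytes(row[pos:pos + w], "big"))
        pos += w
    return tuple(fields)
```

Calling `split_entry(row, [1, 2, 1])` on each decoded row yields (type, second field, third field) tuples, which correspond to steps 8 to 10 above.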

↧

Identity-H, CMap and troubles choosing predefined encodings

September 8, 2011, 3:33 am

I'm trying to parse a certain PDF document on Mac OS X. The pages have embedded CID fonts with Identity-H encoding. The font itself is a Type0 font with a CIDFontType2 descendant font. I'm able to extract text from any page by using 2-byte CIDs and mapping them to the characters defined in the ToUnicode stream. However, there are a few character mismatches which (IMHO) are caused by a wrongly chosen encoding (MacRomanEncoding instead of PDFDocEncoding).

One of the mismatched characters in the document is Ø (Latin capital letter O with stroke, which resembles the empty set symbol); the character I'm extracting is ÿ (Latin small letter y with diaeresis). According to the PDF 1.7 specification, the characters Ø and ÿ have the same octal code, 330, but in different encodings (PDFDocEncoding and MacRomanEncoding respectively).

My question is: how can I be sure to select the correct encoding for the text? Is it PDFDocEncoding by default unless specified otherwise?
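
The collision described above is easy to reproduce in Python. Latin-1 is used here as a stand-in for PDFDocEncoding, which agrees with it at this code; the last line assumes the ToUnicode entry for the glyph is the 2-byte value <00D8>, which is a guess about this particular file.

```python
byte = bytes([0o330])  # octal 330 == 0xD8

as_mac_roman = byte.decode("mac_roman")  # how MacRomanEncoding reads the byte
as_latin1 = byte.decode("latin-1")       # how PDFDocEncoding reads it at this code

# ToUnicode destinations are UTF-16BE, so a value such as <00D8> can be
# decoded directly without going through either single-byte encoding.
from_tounicode = bytes.fromhex("00D8").decode("utf-16-be")
```

If the extracted character is ÿ, one possibility worth checking is that the UTF-16BE value from the ToUnicode stream is being passed through a MacRoman conversion somewhere in the extraction code.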

↧