Dealing with Predictors when decoding PDFs

I've seen a lot of questions about how to "unpredict" Xref streams. The answer usually goes something like "RTFM!"

Well, I have R'd the FM, and it's still complicated. So, to save others the pain I've gone through, here's a slightly more detailed rundown.

A great blog explaining how PNG prediction works (which, yes, can be applied to text) is here:

http://www.atalasoft.com/cs/blogs/rickm/archive/2008/05/02/using-png-predictors-to-enhance-gzip-pkzip-flate-compression.aspx

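A fun fact not covered in the blog: with PNG Up prediction each row is stored as the difference from the row above it, so when you decode you need to run your algorithm top-down. Each row can only be rebuilt once the row above it has been decoded.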

Basic steps:

Assuming you've already decoded (inflated, un-gzipped or whatever) the string...

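You'll need to know the predictor type and the column widths, at a bare minimum. Both are provided in the Xref stream dictionary.

-Predictor type will be /Predictor in the /DecodeParms subdictionary. Usually this is 12 (PNG Up prediction). Any value from 10 to 15 means PNG prediction; the byte at the start of each row says which filter that row actually uses.

-Column widths will be /W. Usually this is [1 2 1]. The /Columns entry in /DecodeParms normally equals the sum of these.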
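As a rough illustration of pulling those values out, here is a minimal sketch. The function name xrefStreamParams and the sample dictionary are made up for this example, and the regular expressions only cope with simple dictionaries like the one shown. A real reader should use its own dictionary parser.

```php
<?php
// Hypothetical helper: pulls /Predictor, /Columns and /W out of the raw text
// of an Xref stream dictionary. Regexes are only good enough for simple cases
// like the sample below; this is not a PDF parser.
function xrefStreamParams(string $dict): array
{
    // PDF defaults when the entries are missing: no prediction, one column
    $params = ['predictor' => 1, 'columns' => 1, 'w' => []];

    if (preg_match('/\/Predictor\s+(\d+)/', $dict, $m)) {
        $params['predictor'] = (int) $m[1];
    }
    if (preg_match('/\/Columns\s+(\d+)/', $dict, $m)) {
        $params['columns'] = (int) $m[1];
    }
    if (preg_match('/\/W\s*\[([\d\s]+)\]/', $dict, $m)) {
        $params['w'] = array_map('intval', preg_split('/\s+/', trim($m[1])));
    }
    return $params;
}

// Made-up sample dictionary for illustration
$dict = '<< /Type /XRef /W [1 2 1] /DecodeParms << /Columns 4 /Predictor 12 >> >>';
$p = xrefStreamParams($dict);
$totalWidth = array_sum($p['w']); // sum of the column widths
```

With the sample dictionary, $p['predictor'] comes out as 12 and $totalWidth as 4, which are the numbers used in the steps below.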

Armed with this information:

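1) Check the length of the string. In my case there were 10 extra characters on the end (checksum data), which I strip off because they aren't needed to extract the raw data. If your inflate routine already handles the zlib wrapper, the length should be an exact multiple of the row size from step 3 and there is nothing to strip.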

2) Sum up the column widths. For the example above [1 2 1] would be 4. This is one less than the number of bytes in each row.

3) Split the string into rows by the column width: sum+1, or in our example, 5.

4) The first byte on the row will be the predictor type. You can actually change the predictor line-by-line, though I haven't seen an example of this actually happening. For PNG Up prediction (12, as above), the first byte should be 0x02. You should either strip this off (i.e. assume all lines use the same prediction), or write code to change the algorithm based on this number. The simpler solution, albeit potentially hazardous for your reader, is to strip it off.

5) Either way, you should now have a row of <width> bytes (4 in our example). You now convert the row byte-by-byte. Since PNG Up unprediction works top-down, the first row is effectively already decoded: the row "above" it is treated as all zeroes. I create a "prevRow" array with <width> entries, filled with zeroes.

6) Loop through each row, and within that loop, loop through each byte in the row. Convert the byte from binary to int and add it to the same byte in the previous (already decoded) row, wrapping around at 256 so the result still fits in one byte. Pseudocode: unpredictedByte = (prevRow[currentByte] + row[currentByte]) % 256

7) Convert the int back to binary. You'll see why in a second.

8) Once you have all your rows unpredicted, loop through them again, splitting them into binary strings based on the W parameter. In our example: 1 2 1. I.e. the first string is 1 byte, the second string is 2 bytes, the third string is 1 byte.

9) Convert the ENTIRE STRING for each field to int, most significant byte first (this is why we had to go back to binary, since the strings can be of arbitrary length).

10) Your first column should be the Xref entry type: 0 = f. I.e. a "free" or deleted object. My reader ignores these. 1 = n. I.e. an "in use" object. You'll want to save these for the future. 2 = a compressed object. You'll also want to save these, but they'll require a little more work before they're usable.

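11) If our first column is 1 (an in-use object), the second column is the byte offset of that object from the start of the file, and the third column is its generation number. I add it to my Xref table array.

12) If our first column is 2 (a compressed object), the second column is the object number of the object stream that contains it, and the third column is its index within that stream. You have to decompress the object stream to get the actual object references and offsets. See PDF Specification section 7.5.7 (page 45).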
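Steps 8 through 11 can be sketched roughly like this. The helper names bytesToInt and parseXrefRow are mine, not from any library, and the sample row is invented. It assumes the row has already been unpredicted and had its filter byte removed.

```php
<?php
// Big-endian bytes to int: "\x01\x2C" becomes 0x012C = 300.
function bytesToInt(string $bytes): int
{
    $value = 0;
    for ($i = 0; $i < strlen($bytes); $i++) {
        $value = ($value << 8) | ord($bytes[$i]);
    }
    return $value;
}

// Hypothetical helper: splits one unpredicted row into its /W fields
// and converts each field to an int.
function parseXrefRow(string $row, array $w): array
{
    $fields = [];
    $pos = 0;
    foreach ($w as $width) {
        $fields[] = bytesToInt(substr($row, $pos, $width));
        $pos += $width;
    }
    return $fields;
}

// Invented sample row: type 1, offset 0x012C, generation 0
[$type, $second, $third] = parseXrefRow("\x01\x01\x2C\x00", [1, 2, 1]);

if ($type === 1) {
    $xref = ['offset' => $second, 'generation' => $third];       // in-use object
} elseif ($type === 2) {
    $xref = ['objectStream' => $second, 'index' => $third];      // compressed object
} else {
    $xref = null;                                                // free entry, ignored
}
```

One thing this sketch does not handle: a width of 0 in /W means the field is absent from the stream and takes its default value (the type field defaults to 1), whereas here it simply comes out as 0.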

The code in PHP looks like this:

    //Assumes $rows holds the raw rows from step 3 and $totalWidth is the sum of /W (4 here)
    $prevRow = array_fill(0, $totalWidth, 0); //Treat prevRow as an array of ints for math
    $types = $offsets = $generations = array();
    for($j=0; $j<count($rows); $j++) {
        $row = substr($rows[$j], 1); //Chop off the filter-type byte (should be 02 for UP prediction)
        $offsets[$j] = '';

        //Reconstruct the string character by character
        for($i=0; $i<strlen($row); $i++) {
            $decodedByte = ord($row[$i]); //Convert the binary character to an int so we can do math on it
            $decodedByte = $decodedByte+$prevRow[$i]; //Add the current row's character to the previous row's character
            $decodedByte = $decodedByte & 0xFF; //Keep the sum within one byte (PNG arithmetic is modulo 256)
            $prevRow[$i] = $decodedByte; //Update for our next pass

            //WARNING: Assumes 1 2 1 column structure.
            if($i == 0)
                $types[$j] = $decodedByte;
            else if($i == 1 || $i == 2)
                $offsets[$j] .= chr($decodedByte); //Convert back to binary
            else if($i == 3)
                $generations[$j] = $decodedByte;
        } //Close for $i

        $offsets[$j] = hexdec(bin2hex($offsets[$j])); //Step 9: convert the whole binary string to an int
    }//Close for $j
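The loop above assumes every row uses Up prediction. If you would rather switch on the filter-type byte from step 4, something along these lines should work. This is a sketch based on the PNG filter definitions, not code from any particular library. The name pngUnpredict and the $bpp parameter (bytes per pixel, 1 for the usual Xref stream) are my own choices.

```php
<?php
// Undo PNG prediction on data made of rows of ($columns + 1) bytes each:
// one filter-type byte followed by $columns data bytes.
function pngUnpredict(string $data, int $columns, int $bpp = 1): string
{
    $out = '';
    $prev = array_fill(0, $columns, 0); // the row above the first row counts as zeroes
    foreach (str_split($data, $columns + 1) as $line) {
        if (strlen($line) < $columns + 1) {
            break; // ignore a trailing partial row
        }
        $filter = ord($line[0]);
        $cur = [];
        for ($i = 0; $i < $columns; $i++) {
            $x = ord($line[$i + 1]);
            $a = $i >= $bpp ? $cur[$i - $bpp] : 0;  // decoded byte to the left
            $b = $prev[$i];                         // decoded byte above
            $c = $i >= $bpp ? $prev[$i - $bpp] : 0; // decoded byte above-left
            switch ($filter) {
                case 0: $pred = 0; break;                   // None
                case 1: $pred = $a; break;                  // Sub
                case 2: $pred = $b; break;                  // Up
                case 3: $pred = intdiv($a + $b, 2); break;  // Average
                case 4:                                     // Paeth
                    $p = $a + $b - $c;
                    $pa = abs($p - $a);
                    $pb = abs($p - $b);
                    $pc = abs($p - $c);
                    $pred = ($pa <= $pb && $pa <= $pc) ? $a : ($pb <= $pc ? $b : $c);
                    break;
                default:
                    throw new UnexpectedValueException("Unknown PNG filter type $filter");
            }
            $cur[$i] = ($x + $pred) & 0xFF; // arithmetic is modulo 256
        }
        $out .= implode('', array_map('chr', $cur));
        $prev = $cur;
    }
    return $out;
}

// Two invented rows, both using Up (filter byte 02), 4 data bytes each
$decoded = pngUnpredict("\x02\x01\x00\x10\x00" . "\x02\x00\x00\x20\x00", 4);
```

The result is the unpredicted rows joined back together with the filter bytes removed, so it can be split into <width>-byte rows and fed straight into steps 8 onward.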
