Hi guys,
I need to implement full-text search in a PDF document. To do so, I am parsing the PDF with Quartz 2D (the Core Graphics C API, which I call from Objective-C). I can get all of the content stream operators, e.g. BT, ET, etc.
This is the log from parsing a PDF file that contains the text "From sample to result".
2010-09-03 09:35:10.141 testSearch[352:207] op_q begin
2010-09-03 09:35:10.142 testSearch[352:207] op_q end
2010-09-03 09:35:10.142 testSearch[352:207] op_cm begin
Float value: 156.933105
2010-09-03 09:35:10.143 testSearch[352:207] op_cm end
2010-09-03 09:35:10.143 testSearch[352:207] op_gs begin
Name value: GS0
2010-09-03 09:35:10.143 testSearch[352:207] op_gs end
2010-09-03 09:35:10.144 testSearch[352:207] op_m begin
Integer value: 0
2010-09-03 09:35:10.144 testSearch[352:207] op_m end
2010-09-03 09:35:10.145 testSearch[352:207] op_l begin
Integer value: 0
2010-09-03 09:35:10.145 testSearch[352:207] op_l end
2010-09-03 09:35:10.145 testSearch[352:207] op_Q begin
Integer value: 0
2010-09-03 09:35:10.146 testSearch[352:207] op_Q end
2010-09-03 09:35:10.146 testSearch[352:207] op_q begin
Integer value: 0
2010-09-03 09:35:10.146 testSearch[352:207] op_q end
2010-09-03 09:35:10.147 testSearch[352:207] op_cm begin
Float value: 143.933105
2010-09-03 09:35:10.147 testSearch[352:207] op_cm end
2010-09-03 09:35:10.148 testSearch[352:207] op_gs begin
Name value: GS0
2010-09-03 09:35:10.148 testSearch[352:207] op_gs end
2010-09-03 09:35:10.149 testSearch[352:207] op_m begin
Integer value: 0
2010-09-03 09:35:10.149 testSearch[352:207] op_m end
2010-09-03 09:35:10.149 testSearch[352:207] op_l begin
Integer value: 0
2010-09-03 09:35:10.150 testSearch[352:207] op_l end
2010-09-03 09:35:10.150 testSearch[352:207] op_Q begin
Integer value: 0
2010-09-03 09:35:10.151 testSearch[352:207] op_Q end
2010-09-03 09:35:10.151 testSearch[352:207] op_BT begin
Integer value: 0
2010-09-03 09:35:10.152 testSearch[352:207] op_BT end
2010-09-03 09:35:10.152 testSearch[352:207] op_gs begin
Name value: GS0
2010-09-03 09:35:10.152 testSearch[352:207] op_gs end
2010-09-03 09:35:10.153 testSearch[352:207] op_Tf begin
Integer value: 1
2010-09-03 09:35:10.153 testSearch[352:207] op_Tf end
2010-09-03 09:35:10.154 testSearch[352:207] op_Tm begin
Float value: 559.436035
2010-09-03 09:35:10.154 testSearch[352:207] op_Tm end
2010-09-03 09:35:10.155 testSearch[352:207] op_TJ begin
2010-09-03 11:04:25.616 testSearch[838:207] Array string value [0]: F
2010-09-03 11:04:25.616 testSearch[838:207] Array integer value [1]: 34
2010-09-03 11:04:25.617 testSearch[838:207] Array string value [2]: r
2010-09-03 11:04:25.617 testSearch[838:207] Array integer value [3]: 7
2010-09-03 11:04:25.618 testSearch[838:207] Array string value [4]: o
2010-09-03 11:04:25.618 testSearch[838:207] Array integer value [5]: 6
2010-09-03 11:04:25.618 testSearch[838:207] Array string value [6]: m s
2010-09-03 11:04:25.619 testSearch[838:207] Array integer value [7]: 7
2010-09-03 11:04:25.619 testSearch[838:207] Array string value [8]: a
2010-09-03 11:04:25.619 testSearch[838:207] Array integer value [9]: 10
2010-09-03 11:04:25.620 testSearch[838:207] Array string value [10]: m
2010-09-03 11:04:25.620 testSearch[838:207] Array integer value [11]: 9
2010-09-03 11:04:25.620 testSearch[838:207] Array string value [12]: p
2010-09-03 11:04:25.621 testSearch[838:207] Array integer value [13]: 3
2010-09-03 11:04:25.621 testSearch[838:207] Array string value [14]: l
2010-09-03 11:04:25.622 testSearch[838:207] Array integer value [15]: 3
2010-09-03 11:04:25.622 testSearch[838:207] Array string value [16]: e t
2010-09-03 11:04:25.622 testSearch[838:207] Array integer value [17]: -6
2010-09-03 11:04:25.623 testSearch[838:207] Array string value [18]: o r
2010-09-03 11:04:25.623 testSearch[838:207] Array integer value [19]: 6
2010-09-03 11:04:25.623 testSearch[838:207] Array string value [20]: e
2010-09-03 11:04:25.624 testSearch[838:207] Array integer value [21]: 14
2010-09-03 11:04:25.624 testSearch[838:207] Array string value [22]: s
2010-09-03 11:04:25.625 testSearch[838:207] Array integer value [23]: 17
2010-09-03 11:04:25.625 testSearch[838:207] Array string value [24]: u
2010-09-03 11:04:25.625 testSearch[838:207] Array integer value [25]: 10
2010-09-03 11:04:25.626 testSearch[838:207] Array string value [26]: l
2010-09-03 11:04:25.626 testSearch[838:207] Array integer value [27]: 7
2010-09-03 11:04:25.627 testSearch[838:207] Array string value [28]: t
2010-09-03 09:35:10.169 testSearch[352:207] op_TJ end
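As a sanity check on the TJ operands logged above, here is a minimal sketch in plain Python (not Quartz code). It assumes the string elements are plain ASCII, which happens to hold for this log but not for PDFs in general, where strings are byte sequences in the font's encoding. The function names are made up for illustration. As far as I understand the PDF Reference, the numbers between the strings are position adjustments in thousandths of a text-space unit that are subtracted from the current position, so they carry no characters.

```python
# Operands of the TJ array, copied from the log above:
# strings are shown glyphs, numbers are position adjustments.
TJ_ARRAY = [
    "F", 34, "r", 7, "o", 6, "m s", 7, "a", 10, "m", 9, "p", 3, "l", 3,
    "e t", -6, "o r", 6, "e", 14, "s", 17, "u", 10, "l", 7, "t",
]


def tj_text(elements):
    """Join the string elements; the numeric elements carry no characters."""
    return "".join(e for e in elements if isinstance(e, str))


def tj_total_adjustment(elements):
    """Sum the numeric elements (thousandths of a text-space unit)."""
    return sum(e for e in elements if not isinstance(e, str))


if __name__ == "__main__":
    print(tj_text(TJ_ARRAY))              # the searchable text of this TJ call
    print(tj_total_adjustment(TJ_ARRAY))  # total kerning-style adjustment
```

So the text itself can be recovered by concatenating the strings. Finding where each glyph sits would also need the glyph widths from the font, which do not appear in this log.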
I have read the "Text" section of the PDF Reference, but it is still not clear to me how all of this works.
For example, according to the PDF Reference, the cm operator means: "Modify the current transformation matrix (CTM) by concatenating the
specified matrix (see Section 4.2.1, “Coordinate Spaces”). Although the
operands specify a matrix, they are written as six separate numbers, not as
an array". But as you can see from the log, I get only one float value. Am I missing something?
The same thing happens with the Tm operator: I get only a single float value, not a transformation matrix.
2010-09-03 09:35:10.154 testSearch[352:207] op_Tm begin
Float value: 559.436035
2010-09-03 09:35:10.154 testSearch[352:207] op_Tm end
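To make my reading of the spec concrete, here is a minimal sketch in plain Python (not Quartz code) of how I understand the six operands `a b c d e f` of cm and Tm to form a matrix, and how a point is mapped through it. The helper names and the operand values in the usage lines are hypothetical. My log shows only one number per operator, so I cannot fill in the real ones.

```python
def matrix_from_operands(a, b, c, d, e, f):
    """Build the 3x3 matrix that the six operands of cm or Tm describe."""
    return ((a, b, 0.0), (c, d, 0.0), (e, f, 1.0))


def concat(m, n):
    """Return the matrix product m x n (m is applied first, then n)."""
    return tuple(
        tuple(sum(m[i][k] * n[k][j] for k in range(3)) for j in range(3))
        for i in range(3)
    )


def apply(m, x, y):
    """Map the point (x, y), treated as the row vector [x y 1], through m."""
    return (
        x * m[0][0] + y * m[1][0] + m[2][0],
        x * m[0][1] + y * m[1][1] + m[2][1],
    )


if __name__ == "__main__":
    # Hypothetical operands, chosen only to show the mechanics.
    ctm = matrix_from_operands(1, 0, 0, 1, 10, 20)  # "1 0 0 1 10 20 cm"
    tm = matrix_from_operands(2, 0, 0, 2, 5, 5)     # "2 0 0 2 5 5 Tm"
    # Text space -> user space (Tm) -> device space (CTM).
    print(apply(concat(tm, ctm), 0, 0))             # origin of the text line
```

If this reading is right, the position of a text line would come from mapping the text-space origin through Tm and then through the CTM, which is why I need all six numbers and not just one.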
So the main question is: how do I find the position of a particular text phrase on the page? (I assume I need to calculate glyph sizes, etc.)
Please point me in the right direction. I would really appreciate it if you could describe the overall picture of how text rendering and positioning work.