Channel: Adobe Community : Popular Discussions - PDF Language and Specifications

Getting text position while parsing PDF


Hi guys,

 

I need to implement full-text search in a PDF document. To do that, I am parsing the PDF with Quartz 2D from Objective-C. I can get all the instructions (operators), e.g. BT, ET, etc.

 

This is the log from parsing a PDF file that contains the text "From sample to result".

 

 

2010-09-03 09:35:10.141 testSearch[352:207] op_q begin

2010-09-03 09:35:10.142 testSearch[352:207] op_q end

2010-09-03 09:35:10.142 testSearch[352:207] op_cm begin

Float value: 156.933105

2010-09-03 09:35:10.143 testSearch[352:207] op_cm end

2010-09-03 09:35:10.143 testSearch[352:207] op_gs begin

Name value: GS0

2010-09-03 09:35:10.143 testSearch[352:207] op_gs end

2010-09-03 09:35:10.144 testSearch[352:207] op_m begin

Integer value: 0

2010-09-03 09:35:10.144 testSearch[352:207] op_m end

2010-09-03 09:35:10.145 testSearch[352:207] op_l begin

Integer value: 0

2010-09-03 09:35:10.145 testSearch[352:207] op_l end

2010-09-03 09:35:10.145 testSearch[352:207] op_Q begin

Integer value: 0

2010-09-03 09:35:10.146 testSearch[352:207] op_Q end

2010-09-03 09:35:10.146 testSearch[352:207] op_q begin

Integer value: 0

2010-09-03 09:35:10.146 testSearch[352:207] op_q end

2010-09-03 09:35:10.147 testSearch[352:207] op_cm begin

Float value: 143.933105

2010-09-03 09:35:10.147 testSearch[352:207] op_cm end

2010-09-03 09:35:10.148 testSearch[352:207] op_gs begin

Name value: GS0

2010-09-03 09:35:10.148 testSearch[352:207] op_gs end

2010-09-03 09:35:10.149 testSearch[352:207] op_m begin

Integer value: 0

2010-09-03 09:35:10.149 testSearch[352:207] op_m end

2010-09-03 09:35:10.149 testSearch[352:207] op_l begin

Integer value: 0

2010-09-03 09:35:10.150 testSearch[352:207] op_l end

2010-09-03 09:35:10.150 testSearch[352:207] op_Q begin

Integer value: 0

2010-09-03 09:35:10.151 testSearch[352:207] op_Q end

2010-09-03 09:35:10.151 testSearch[352:207] op_BT begin

Integer value: 0

2010-09-03 09:35:10.152 testSearch[352:207] op_BT end

2010-09-03 09:35:10.152 testSearch[352:207] op_gs begin

Name value: GS0

2010-09-03 09:35:10.152 testSearch[352:207] op_gs end

2010-09-03 09:35:10.153 testSearch[352:207] op_Tf begin

Integer value: 1

2010-09-03 09:35:10.153 testSearch[352:207] op_Tf end

2010-09-03 09:35:10.154 testSearch[352:207] op_Tm begin

Float value: 559.436035

2010-09-03 09:35:10.154 testSearch[352:207] op_Tm end

2010-09-03 09:35:10.155 testSearch[352:207] op_TJ begin

2010-09-03 11:04:25.616 testSearch[838:207] Array string value [0]: F

2010-09-03 11:04:25.616 testSearch[838:207] Array integer value [1]: 34

2010-09-03 11:04:25.617 testSearch[838:207] Array string value [2]: r

2010-09-03 11:04:25.617 testSearch[838:207] Array integer value [3]: 7

2010-09-03 11:04:25.618 testSearch[838:207] Array string value [4]: o

2010-09-03 11:04:25.618 testSearch[838:207] Array integer value [5]: 6

2010-09-03 11:04:25.618 testSearch[838:207] Array string value [6]: m s

2010-09-03 11:04:25.619 testSearch[838:207] Array integer value [7]: 7

2010-09-03 11:04:25.619 testSearch[838:207] Array string value [8]: a

2010-09-03 11:04:25.619 testSearch[838:207] Array integer value [9]: 10

2010-09-03 11:04:25.620 testSearch[838:207] Array string value [10]: m

2010-09-03 11:04:25.620 testSearch[838:207] Array integer value [11]: 9

2010-09-03 11:04:25.620 testSearch[838:207] Array string value [12]: p

2010-09-03 11:04:25.621 testSearch[838:207] Array integer value [13]: 3

2010-09-03 11:04:25.621 testSearch[838:207] Array string value [14]: l

2010-09-03 11:04:25.622 testSearch[838:207] Array integer value [15]: 3

2010-09-03 11:04:25.622 testSearch[838:207] Array string value [16]: e t

2010-09-03 11:04:25.622 testSearch[838:207] Array integer value [17]: -6

2010-09-03 11:04:25.623 testSearch[838:207] Array string value [18]: o r

2010-09-03 11:04:25.623 testSearch[838:207] Array integer value [19]: 6

2010-09-03 11:04:25.623 testSearch[838:207] Array string value [20]: e

2010-09-03 11:04:25.624 testSearch[838:207] Array integer value [21]: 14

2010-09-03 11:04:25.624 testSearch[838:207] Array string value [22]: s

2010-09-03 11:04:25.625 testSearch[838:207] Array integer value [23]: 17

2010-09-03 11:04:25.625 testSearch[838:207] Array string value [24]: u

2010-09-03 11:04:25.625 testSearch[838:207] Array integer value [25]: 10

2010-09-03 11:04:25.626 testSearch[838:207] Array string value [26]: l

2010-09-03 11:04:25.626 testSearch[838:207] Array integer value [27]: 7

2010-09-03 11:04:25.627 testSearch[838:207] Array string value [28]: t

2010-09-03 09:35:10.169 testSearch[352:207] op_TJ end
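The TJ array in the log alternates strings with numeric adjustments. A minimal sketch in plain C of how the searchable text could be rebuilt from such an array follows. It assumes the strings use a single-byte encoding that maps directly to ASCII, as the log suggests; `TJElement` and `tj_join` are hypothetical names for illustration and are not part of Quartz.

```c
#include <stddef.h>
#include <string.h>

/* One element of a TJ array: either a string or a numeric adjustment.
   Hypothetical type for illustration only. */
typedef struct {
    const char *text;  /* non-NULL for a string element */
    double adjust;     /* used when text is NULL, in thousandths of a text space unit */
} TJElement;

/* Concatenates the string elements of a TJ array into out and skips the
   numeric adjustments. Returns the number of characters written, not
   counting the terminating NUL. Stops early if out would overflow. */
size_t tj_join(const TJElement *items, size_t count, char *out, size_t cap)
{
    size_t used = 0;
    if (cap == 0) {
        return 0;
    }
    for (size_t i = 0; i < count; i++) {
        if (items[i].text == NULL) {
            continue;  /* a number only shifts the next glyph */
        }
        size_t len = strlen(items[i].text);
        if (used + len >= cap) {
            break;     /* keep room for the terminator */
        }
        memcpy(out + used, items[i].text, len);
        used += len;
    }
    out[used] = '\0';
    return used;
}
```

Fed the elements from the log ("F", 34, "r", 7, "o", 6, "m s", ...), this yields "From sample to result". The numbers are subtracted from the current position in thousandths of a text space unit, so they matter for positioning but not for the text itself. Some producers use a large negative number instead of a space character, which this sketch does not handle.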

 

 

I have read the "Text" section of the PDF Reference, but it is still not clear to me how all this works.

 

For example, according to the PDF Reference, the cm operator means: "Modify the current transformation matrix (CTM) by concatenating the specified matrix (see Section 4.2.1, “Coordinate Spaces”). Although the operands specify a matrix, they are written as six separate numbers, not as an array". But as you can see from the log, I got only one float value. Am I missing something?
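One possible explanation, offered as an assumption since the callback code is not shown: in Quartz, each call to `CGPDFScannerPopNumber` pops a single operand from the scanner's operand stack, so a callback that pops once would see only the last of the six numbers. The sketch below illustrates the idea in plain C with a hypothetical `OperandStack` and `pop_number` standing in for the scanner. It is not the Quartz API itself.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the scanner's operand stack. */
typedef struct {
    double values[16];
    size_t count;
} OperandStack;

/* Pops the top number, mimicking one pop call on the scanner. */
bool pop_number(OperandStack *s, double *out)
{
    if (s->count == 0) {
        return false;
    }
    *out = s->values[--s->count];
    return true;
}

/* Reads the six operands "a b c d e f" of cm or Tm into m[0..5].
   Operands are pushed in the order they are written, so they come
   off the stack last-first: f, e, d, c, b, a. */
bool pop_matrix(OperandStack *s, double m[6])
{
    for (int i = 5; i >= 0; i--) {
        if (!pop_number(s, &m[i])) {
            return false;  /* fewer than six operands available */
        }
    }
    return true;
}
```

Under this assumption, the single value in the log (156.933105 for cm) would be the last operand, f, which is the vertical translation of the matrix. The other five values would still be on the stack, waiting to be popped.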

 

 

The same happens with the Tm operator: I get a single float value, not a transformation matrix.

 

2010-09-03 09:35:10.154 testSearch[352:207] op_Tm begin

Float value: 559.436035

2010-09-03 09:35:10.154 testSearch[352:207] op_Tm end

 

 

So the main question is: how do I find the position of a particular text phrase? (I assume I need to calculate glyph sizes, etc.)

Please point me to where to start. I would very much appreciate an overall description of how text rendering and positioning are performed.
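As a rough starting point, here is a sketch in plain C of the positioning arithmetic described in the PDF Reference for horizontal text. Each glyph advances the pen by ((w0 − Tj/1000) × Tfs + Tc + Tw) × Th, and the pen position is mapped to page space through Tm and then the CTM. The `Matrix` type and the function names are hypothetical. The glyph width would have to come from the font's /Widths array, which is not shown here, and the values in the usage note are made up for illustration.

```c
/* PDF matrix [a b c d e f], used with row vectors: [x y 1] * M. */
typedef struct { double a, b, c, d, e, f; } Matrix;

/* Returns m * n: apply m first, then n. */
Matrix matrix_concat(Matrix m, Matrix n)
{
    Matrix r;
    r.a = m.a * n.a + m.b * n.c;
    r.b = m.a * n.b + m.b * n.d;
    r.c = m.c * n.a + m.d * n.c;
    r.d = m.c * n.b + m.d * n.d;
    r.e = m.e * n.a + m.f * n.c + n.e;
    r.f = m.e * n.b + m.f * n.d + n.f;
    return r;
}

/* Horizontal advance for one glyph. width_1000 is the glyph width in
   thousandths of a text space unit. word_space should be 0 unless the
   glyph is the space character (code 32) of a simple font. */
double glyph_advance(double width_1000, double font_size,
                     double char_space, double word_space, double h_scale)
{
    return (width_1000 / 1000.0 * font_size + char_space + word_space) * h_scale;
}

/* Horizontal shift caused by a number inside a TJ array.
   Positive numbers move the next glyph to the left. */
double tj_adjust_advance(double adjust, double font_size, double h_scale)
{
    return -adjust / 1000.0 * font_size * h_scale;
}

/* Maps the point tx units along the baseline (in text space) to page
   space, given the text matrix tm and the current CTM. */
void glyph_origin(Matrix tm, Matrix ctm, double tx, double *x, double *y)
{
    Matrix shift = {1, 0, 0, 1, tx, 0};
    Matrix m = matrix_concat(matrix_concat(shift, tm), ctm);
    *x = m.e;
    *y = m.f;
}
```

To locate a phrase, one would accumulate `glyph_advance` and `tj_adjust_advance` over the TJ array while recording `glyph_origin` for each character, then report the origins of the first and last matched characters. For instance, with an assumed glyph width of 500 and font size 10, the first glyph advances the pen by 5.0 units, and the following adjustment of 34 pulls it back by 0.34.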

