Channel: Adobe Community : Popular Discussions - PDF Language and Specifications

Determining word boundaries when no space exists in text

I am developing a text search feature for a viewer application, and I often run into PDFs that do not use the space character to delineate word boundaries.  For example, a text-showing operator with individual glyph positioning will contain strings and positioning information like this:

 

[(de)15(grees)-262(and)-262(who)-262(w)10(ould)-262(contrib)20(ute)-262(an)]TJ

 

When the strings are concatenated the result is:

 

"degreesandwhowouldcontributean"

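For reference, here is a minimal Python sketch of that concatenation. The name `parse_tj` and the regular expression are illustrative, not part of any PDF library, and the tokenizer assumes the strings contain no escaped or nested parentheses:

```python
import re

# Matches either a literal string "(...)" or a number inside a TJ array.
# Assumption: strings contain no escaped or nested parentheses.
TOKEN = re.compile(r"\(([^()]*)\)|([-+]?(?:\d+\.?\d*|\.\d+))")

def parse_tj(array_body):
    """Return the array items in order: str for shown text, float for adjustments."""
    items = []
    for text, number in TOKEN.findall(array_body):
        if number:
            items.append(float(number))
        else:
            items.append(text)
    return items

tj = "[(de)15(grees)-262(and)-262(who)-262(w)10(ould)-262(contrib)20(ute)-262(an)]"
items = parse_tj(tj)

# Dropping the numeric adjustments loses every word boundary.
concatenated = "".join(item for item in items if isinstance(item, str))
print(concatenated)
```

The numbers are kept in the parsed list because they are the only place the word gaps are recorded.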
 

Without spaces it's not possible to split the string into words based on character information.  It would appear the only information that could be used to guess word boundaries is the glyph positioning.  I have tested the documents in Adobe Reader and the application is able to correctly determine where word boundaries are, and it must be doing so by examining the glyph positioning and metrics.

 

My first approach was to get the glyph width for the space character, and assume a space is any position advance greater than the glyph width of a space.  The problem with that is the case where the font has been subsetted and the 'space' glyph is missing from the font.
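
A rough sketch of this first approach in Python, applied to the example array. The function name `split_on_space_width` and the space width of 250 are assumptions for illustration. A real viewer would read the width from the font, which is exactly what fails when the space glyph has been subsetted out. Numbers in a TJ array are in thousandths of a text space unit and are subtracted from the current position, so a negative number opens a gap to the right:

```python
def split_on_space_width(items, space_width):
    """Insert a space wherever the rightward gap exceeds the space glyph width.

    items: strings and numeric TJ adjustments, in reading order.
    space_width: width of the space glyph in thousandths of a text space unit.
    """
    out = []
    for item in items:
        if isinstance(item, str):
            out.append(item)
        elif -item > space_width:
            # Negative adjustment = move right; treat a wide gap as a word break.
            out.append(" ")
    return "".join(out)

# The example array, already split into strings and adjustments.
items = ["de", 15, "grees", -262, "and", -262, "who", -262, "w", 10,
         "ould", -262, "contrib", 20, "ute", -262, "an"]

# 250 is an assumed space width, not a value taken from the actual font.
print(split_on_space_width(items, space_width=250))
```

The small positive values (15, 10, 20) tighten the spacing and never trigger a break. Only the -262 adjustments exceed the assumed space width.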

 

My second approach was to calculate the average glyph width for the font, then assume any text advance greater than 33% of the average glyph width is a space.  This works better, but it is still not a reliable general solution.
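
A sketch of the second approach, again in Python. The names `gap_threshold` and `split_by_average` are hypothetical, and the widths list stands in for the font's width table with made-up values. Skipping zero-width entries is my assumption, so that unused slots in a subsetted font do not drag the average down:

```python
def gap_threshold(widths, fraction=0.33):
    """Return fraction * average glyph width, ignoring zero-width entries."""
    nonzero = [w for w in widths if w > 0]
    if not nonzero:
        raise ValueError("font has no usable glyph widths")
    return fraction * sum(nonzero) / len(nonzero)

def split_by_average(items, widths, fraction=0.33):
    """Insert a space wherever the rightward gap exceeds the threshold."""
    limit = gap_threshold(widths, fraction)
    return "".join(
        item if isinstance(item, str) else (" " if -item > limit else "")
        for item in items
    )

# Made-up widths in thousandths of a text space unit; zeros mark unused slots.
widths = [400, 500, 600, 0]

items = ["de", 15, "grees", -262, "and", -262, "who", -262, "w", 10,
         "ould", -262, "contrib", 20, "ute", -262, "an"]
print(split_by_average(items, widths))
```

This no longer depends on the space glyph being present, but the 33% cutoff is a heuristic. With these widths the threshold is about 165, so any kerning adjustment beyond that would be read as a space even inside a word.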

 

My question: does Adobe have a standard method for determining word boundaries when space characters are missing?

