There are four distinct processes in the MaxTract workflow: image analysis, PDF extraction, linearisation and parsing. I will briefly describe them here; for a more detailed technical description, please have a look at the following papers: Baker, Sexton, Sorge ’09 and Baker, Sexton, Sorge ’10.
In the image analysis stage, a PDF file is converted into a TIF image, which is analysed to identify the precise bounding box of each glyph on a page. Whilst approximate positional information can be obtained from the PDF file itself, it is not sufficient for working out the relationships between mathematical symbols.
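To make the idea of glyph bounding boxes concrete, here is a minimal sketch that finds the boxes of connected ink regions in a tiny bitmap. It is only an illustration: connected-component labelling is my assumption about how such boxes could be found, not a description of MaxTract's own analysis, and the function name `glyph_boxes` and the `'#'` pixel convention are made up for the example.

```python
from collections import deque


def glyph_boxes(bitmap):
    """Return (left, top, right, bottom) boxes of 8-connected ink regions.

    `bitmap` is a list of equal-length strings where '#' marks an ink pixel.
    This is a hypothetical stand-in for a real TIF image.
    """
    height = len(bitmap)
    width = len(bitmap[0]) if bitmap else 0
    seen = [[False] * width for _ in range(height)]
    boxes = []
    for y in range(height):
        for x in range(width):
            if bitmap[y][x] != "#" or seen[y][x]:
                continue
            # Breadth-first flood fill over one connected component,
            # growing the box as new pixels are visited.
            left, top, right, bottom = x, y, x, y
            queue = deque([(x, y)])
            seen[y][x] = True
            while queue:
                cx, cy = queue.popleft()
                left, right = min(left, cx), max(right, cx)
                top, bottom = min(top, cy), max(bottom, cy)
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        nx, ny = cx + dx, cy + dy
                        if (0 <= nx < width and 0 <= ny < height
                                and bitmap[ny][nx] == "#"
                                and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((nx, ny))
            boxes.append((left, top, right, bottom))
    return boxes


page = [
    "##...#",
    "##...#",
    "......",
    "..##..",
]
print(glyph_boxes(page))  # [(0, 0, 1, 1), (5, 0, 5, 1), (2, 3, 3, 3)]
```

Note that under this simplification a glyph made of several disconnected parts, such as "i" or "=", comes out as more than one box, so a real system needs a further step to group them.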
The PDF file itself is parsed in order to extract the name, font and size of each character on the page. To do this we use our own bespoke PDF parser; we tried other open source tools, but found that they did not provide sufficient output for dealing with mathematics. The parser essentially follows all of the text and drawing commands in an uncompressed PDF file, then maps the results of these commands to the bounding boxes found in the previous step.
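The mapping step can be pictured roughly as follows. This is a sketch under my own assumptions, not our parser's actual logic: the `PdfChar` record, the `match_chars_to_boxes` function and the greedy nearest-centre rule are all hypothetical, and the example assumes the PDF positions have already been converted into the same coordinate system as the image boxes.

```python
from dataclasses import dataclass


@dataclass
class PdfChar:
    """Hypothetical record for one character extracted from the PDF."""
    name: str
    font: str
    size: float
    x: float  # approximate position reported by the PDF commands
    y: float


def match_chars_to_boxes(chars, boxes):
    """Greedily pair each character with the nearest unused box centre.

    Boxes are (left, top, right, bottom) tuples from the image analysis.
    """
    remaining = list(boxes)
    matched = []
    for ch in chars:
        if not remaining:
            break  # more characters than boxes; leave the rest unmatched

        def dist2(box):
            left, top, right, bottom = box
            cx, cy = (left + right) / 2, (top + bottom) / 2
            return (cx - ch.x) ** 2 + (cy - ch.y) ** 2

        best = min(remaining, key=dist2)
        remaining.remove(best)
        matched.append((ch, best))
    return matched


# Illustrative data only: an "x" with a smaller "two" to its upper right.
chars = [
    PdfChar("x", "CMMI10", 10.0, 1.0, 1.0),
    PdfChar("two", "CMR7", 7.0, 6.0, 0.0),
]
boxes = [(5, 0, 7, 2), (0, 0, 2, 3)]
for ch, box in match_chars_to_boxes(chars, boxes):
    print(ch.name, ch.font, ch.size, box)
```

A greedy match like this is easy to follow but can go wrong on dense formulae, which is one reason the precise boxes from the image are needed alongside the PDF data.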
After the data has been gathered and collated from the analysis and extraction stages, it is converted into a parse tree, an intermediate format suitable for later parsing and analysis. A special linear grammar is used here, which can not only identify words and lines but, more importantly, capture the subtle nuances of mathematical notation. The resultant tree contains information about the relationships between neighbouring symbols, their sizes, positions and names, and is rich enough to be converted into a number of different formats.
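As a rough picture of what "relationships between neighbouring symbols" means, the sketch below attaches superscripts and subscripts to the symbol on their left using nothing but box positions. It is a deliberately naive stand-in for the linear grammar: the `relation` and `linearise` functions, the dictionary node layout and the 25% thresholds are all assumptions made for the example.

```python
def relation(base, other):
    """Classify `other` relative to `base`.

    Boxes are (left, top, right, bottom) in image coordinates, so y grows
    downward. The 0.25 thresholds are arbitrary illustrative values.
    """
    _, b_top, _, b_bottom = base
    _, o_top, _, o_bottom = other
    b_height = b_bottom - b_top
    o_centre = (o_top + o_bottom) / 2
    if o_centre < b_top + 0.25 * b_height:
        return "sup"
    if o_centre > b_bottom - 0.25 * b_height:
        return "sub"
    return "next"


def linearise(symbols):
    """Turn (name, box) pairs into a list of nodes with sup/sub children."""
    nodes = []
    # Process symbols from left to right by the left edge of their box.
    for name, box in sorted(symbols, key=lambda s: s[1][0]):
        node = {"name": name, "box": box, "sup": [], "sub": []}
        if nodes:
            rel = relation(nodes[-1]["box"], box)
            if rel in ("sup", "sub"):
                # Raised or lowered symbols hang off the previous base symbol.
                nodes[-1][rel].append(node)
                continue
        nodes.append(node)
    return nodes


symbols = [
    ("x", (0, 10, 10, 20)),
    ("2", (11, 5, 15, 11)),
    ("+", (18, 12, 26, 18)),
]
tree = linearise(symbols)
print([n["name"] for n in tree])             # ['x', '+']
print([n["name"] for n in tree[0]["sup"]])   # ['2']
```

Real mathematical layout needs far more than this (fractions, roots, limits, font and size cues), which is exactly what the grammar is for.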
The final processing stage is the use of specialist drivers to walk the tree and produce appropriate markup. We currently have three main drivers that can produce text, LaTeX and MathML, the results of which can be seen here. The system has been designed so that different drivers, focusing on layout, markup and presentation, can easily be plugged in and combined, making the system extensible and easily customisable.
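The plug-in idea can be sketched as a single tree walker that delegates all formatting decisions to a driver object. The class names, the `script` and `join` methods and the dictionary tree below are hypothetical and chosen for the example; they are not MaxTract's real driver interface, and a MathML driver is left out for brevity.

```python
class LatexDriver:
    """Hypothetical driver that emits LaTeX-style scripts."""

    def script(self, kind, body):
        return ("^" if kind == "sup" else "_") + "{" + body + "}"

    def join(self, parts):
        return " ".join(parts)


class TextDriver:
    """Hypothetical driver that emits plain text."""

    def script(self, kind, body):
        return ("^" if kind == "sup" else "_") + body

    def join(self, parts):
        return " ".join(parts)


def walk(nodes, driver):
    """Walk a list of nodes and let the driver decide how to render them."""
    parts = []
    for node in nodes:
        text = node["name"]
        for kind in ("sub", "sup"):
            children = node.get(kind, [])
            if children:
                text += driver.script(kind, walk(children, driver))
        parts.append(text)
    return driver.join(parts)


tree = [
    {"name": "x", "sup": [{"name": "2"}]},
    {"name": "+"},
    {"name": "y", "sub": [{"name": "i"}]},
]
print(walk(tree, LatexDriver()))  # x^{2} + y_{i}
print(walk(tree, TextDriver()))   # x^2 + y_i
```

Because the walker never builds output itself, supporting another format in this sketch means writing one more small driver class rather than touching the tree or the walker.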
My next post will be about the current limitations of our system, and how we intend to improve it throughout this project.