Extracting and structuring content from text- or image-based tables has long been a challenge. Transforming tabular content into a structured model such as XML or HTML is nearly always a manual or semi-manual process. Tabular content is particularly important in regulatory, financial, and scientific documents, where complex alphanumeric content is often presented in tabular format. Tables are difficult to structure because of inconsistencies in tabular content, a high diversity of layouts, complicated elements such as straddle headings, varied alignment of contents, the presence of empty cells, and other intricacies.
Data Conversion Laboratory and Fuse Machines created an AI model that finds and extracts information from all tables in a document using a combination of Computer Vision (CV) and Natural Language Processing (NLP). We'll review how we developed and managed a hybrid approach of rules-based processes and machine learning to identify and extract tabular data, and how we augmented training data to build an AI model that automates table-to-XML extraction. This presentation dives into the details of why automating table structuring is important, why we took the approaches we did, and how one can measure the efficacy of table identification and extraction.
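One common way to measure the efficacy of table identification is to compare predicted table bounding boxes against ground-truth annotations using intersection-over-union (IoU), counting a detection as correct above some threshold and reporting precision and recall. The sketch below illustrates that idea only; it is a generic evaluation pattern, not the specific metric the presenters used, and the 0.5 threshold and greedy matching are assumptions.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

def detection_scores(predicted, ground_truth, threshold=0.5):
    """Precision and recall at an IoU threshold, with greedy
    one-to-one matching of predictions to ground-truth tables."""
    unmatched = list(ground_truth)
    true_pos = 0
    for pred in predicted:
        best = max(unmatched, key=lambda gt: iou(pred, gt), default=None)
        if best is not None and iou(pred, best) >= threshold:
            true_pos += 1
            unmatched.remove(best)  # each ground-truth table matches once
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(ground_truth) if ground_truth else 0.0
    return precision, recall
```

Cell-level extraction quality can be scored analogously, by matching extracted cells to annotated cells and comparing their text content.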
Mark Gross, President, Data Conversion Laboratory
Isu Shrestha, Senior Machine Learning Engineer