Democratizing Machine Reading

Computers, the Internet, and cheap storage promote the acquisition and collection of vast quantities of data. This applies not just to the pictures on our phones; more importantly, there is a seemingly infinite supply of text documents that contain critical scientific, socio-political, and business insights, far more than any human can read. Within the natural language processing (NLP) domain, the field of information extraction (IE) targets exactly this problem, but it requires its practitioners to have expertise in linguistics, machine learning, or both. Consequently, most of the advancements in IE remain largely inaccessible to those who need them the most. Here we propose techniques to allow domain experts to develop and deploy IE systems targeting their own particular needs without requiring this expertise. Lowering the barrier to entry for IE would make the developments in the field available to the wider community.

We propose to do this through a human-technology partnership, where domain experts specify their information need at the level they find intuitive (e.g., that phosphorylation acts on a protein or that humanitarian interventions have a cost and a measurable outcome). Our system will then extend techniques from the adjacent field of program synthesis to convert these high-level, abstract specifications into executable ones that extract the desired information. Importantly, we argue that the resulting model must be readily interpretable and easily extensible to support its long-term deployment and maintenance. To this end, we propose to inform our final representations through a rigorous series of user studies aimed at discovering which aspects of these representations are more or less difficult to understand.

Democratizing machine reading would make its benefits (i.e., accessing knowledge from unstructured data, and doing so at scale) available to all, not only those who have access to NLP experts or the funding for their services. This has the potential to dramatically impact the process and pace of conducting scientific research, as well as collaboration, as experts could have more ready access to the knowledge in adjacent fields. Further, when NLP or linguistics experts are involved, their time can be spent more efficiently: the rules that result from our proposed approach serve as a valuable starting point for IE, removing the tedium of initial grammar development and enabling these experts to move directly to more productive aspects of the work.