Computers, the Internet, and cheap storage promote the acquisition and collection of vast quantities of data. There is a seemingly inexhaustible supply of text documents containing critical scientific, socio-political, and business insights – far more than any human can read. Within natural language processing (NLP), the field of information extraction (IE) targets exactly this problem, but it requires its practitioners to have expertise in linguistics, machine learning, or both. Consequently, most advances in IE remain inaccessible to those who need them the most. No one is better equipped to address societal issues than the experts who dedicate their entire careers to a specific slice of research: sociologists who study race relations, biologists who seek cures for infectious diseases, economists and epidemiologists who develop models to understand trends and anticipate problems, and so on. Proposed here are techniques to empower these domain experts to develop and deploy IE systems targeting their own particular needs, without requiring expertise in NLP, linguistics, or machine learning. This has the potential to dramatically impact the process, pace, and productivity of critical scientific research and collaboration, as experts would have far readier access to the knowledge most essential to them and their research (both in their own domain and in adjacent domains). Beyond the currently proposed work, the momentum achieved here will be used to help establish a broader dialogue within the scientific community around these issues through a series of outreach efforts, including tutorials, publications, and a workshop at a high-visibility conference. To broaden participation, outreach activities (including deepening collaborations with institutional colleagues and local community outreach) will emphasize groups who are historically underrepresented in academia.
This proposed work will be accomplished through a human-technology partnership, where domain experts specify their information need at the level they find intuitive (e.g., phosphorylation acts on proteins, or humanitarian interventions have a cost and a measurable outcome). The system will then extend techniques from the adjacent field of program synthesis to convert these high-level, abstract specifications into low-level grammars (i.e., sets of hierarchical information extraction rules) which can be executed to extract the desired information from text. Crucially, the specification requires no linguistic knowledge, making it accessible to a broader population. The need for domain-specific entities (e.g., names of proteins, instances of protests, etc.) will be addressed through an entity discovery procedure that incorporates techniques for detecting multi-word entity candidates and inferring their semantic types (e.g., PROTEIN, LOCATION). To ensure that the product of the system is readily interpretable and easily extensible, a series of user studies will be conducted to discover the key characteristics of rules and grammars that affect their interpretability and maintainability. Through this combined effort, several datasets and software products will be produced and made available to the wider community, including (but not limited to) …
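As a rough illustration of the intended workflow (and only an illustration: the Spec class, the synthesize_rule heuristic, the regex-style rule format, and the token/TYPE tagging convention below are invented for this sketch and are not components of the proposed system), the following Python snippet shows how a high-level specification such as "phosphorylation acts on proteins" might be lowered into an executable extraction rule and run over text whose entity mentions have already been typed by an entity-discovery step.

```python
# Minimal, self-contained sketch; all names and heuristics here are
# hypothetical simplifications for illustration only.
import re
from dataclasses import dataclass, field


@dataclass
class Spec:
    """A domain expert's high-level statement of an information need."""
    predicate: str                              # e.g., "phosphorylation"
    roles: dict = field(default_factory=dict)   # role name -> required entity type


def synthesize_rule(spec: Spec) -> dict:
    """Turn a high-level spec into a low-level, executable extraction rule.

    Here a 'rule' is just a regex over a tagged token stream; the proposed
    system would instead synthesize hierarchical grammar rules over full
    syntactic analyses.
    """
    # Crude trigger pattern derived from the predicate name (toy heuristic).
    trigger = spec.predicate.rstrip("ion")
    # One named capture group per role, constrained to the requested type,
    # assuming entity mentions appear as token/TYPE pairs (e.g., AKT1/PROTEIN).
    role_patterns = {
        role: rf"(?P<{role}>\S+/{etype})" for role, etype in spec.roles.items()
    }
    theme = role_patterns.get("theme", r"\S+")
    return {
        "name": f"{spec.predicate}-rule",
        "pattern": rf"\b{trigger}\w*\s+of\s+{theme}",
    }


def apply_rule(rule: dict, tagged_sentence: str) -> list:
    """Execute a synthesized rule over a sentence whose entity mentions have
    already been typed (the role of the proposed entity-discovery step)."""
    return [m.groupdict() for m in re.finditer(rule["pattern"], tagged_sentence)]


if __name__ == "__main__":
    spec = Spec(predicate="phosphorylation", roles={"theme": "PROTEIN"})
    rule = synthesize_rule(spec)
    sentence = "Results show phosphorylation of AKT1/PROTEIN in treated cells ."
    print(rule["pattern"])             # the synthesized low-level rule
    print(apply_rule(rule, sentence))  # [{'theme': 'AKT1/PROTEIN'}]
```

In the proposed work, the specification layer would remain at this level of abstraction for the user, while the synthesized rules themselves would be hierarchical and operate over syntactic structure rather than surface strings; their interpretability and maintainability are exactly what the user studies described above are designed to characterize.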
All data will be anonymized and released under the Open Data Commons Public Domain Dedication & License, which allows users to freely share, modify, and use the data, in the hope that this effort will be built upon further. To reach as wide an audience as possible, the software and techniques developed in this work, including the rule synthesis framework, a pipeline for entity discovery, and any generated user interfaces, will be released as open-source software products (under the Apache 2.0 open source license).