Presenting BUSTER: A NER Benchmark for the Finance Domain

Expert.ai Team - 19 December 2023

Among the many AI-based technologies that are helping businesses make the most of their information, natural language processing (NLP) is a critical tool. In language-intensive domains like insurance, pharmaceuticals, finance and media, NLP is being used to capture the unstructured, non-tabular data that contains crucial information for enterprise processes.

However, transferring such technologies into industry applications isn’t seamless. One reason for this complexity is the disconnect that exists between popular benchmarks and actual data. Lack of supervision, unbalanced information classes, noisy data and long documents often cause real problems for information management and data discovery.  

In response to this need, the expert.ai R&D team has created a special dataset that can be used as a reference benchmark in the field of entity recognition, in particular for the financial domain. The dataset is known as “BUSTER” (BUSiness Transaction Entity Recognition) and consists of 3,779 manually annotated documents on financial transactions. 

The research paper describing BUSTER will be presented at the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP), taking place December 6-10 in Singapore.

The Vertical Domain Challenge for AI Models 

In the enterprise, NLP is often adopted as an assistance tool to support human experts in a wide range of tasks, such as document classification, information extraction and text summarization.  

However, adapting these technologies to industry applications takes considerable effort, starting with a workable data set. For example, the large language models (LLMs) that have captured headlines this year, such as the GPT models behind ChatGPT, have been trained on internet content up to a certain cutoff date and therefore contain a broad store of general knowledge.

However, to be functional in the enterprise, companies need models that are relevant for their specific domains. That’s why adapting an LLM to a vertical domain usually requires fine-tuning on domain-specific annotated data. Here, labeling is often a time-consuming, expensive process, especially when experts in the field are involved. This is where external data sets come in. 

Benchmark data sets provide a starting corpus or set of annotated documents that can be used to train an AI model. This is helpful when a company does not have enough training data available or would like to jump start implementation using a validated industry-focused data set.  

However, many existing data sets target the standard Named Entity Recognition (NER) task, which focuses on a limited set of entity types, such as people, organizations and locations. While this is certainly useful, a much broader set of tags is needed to capture the many nuanced distinctions within a single vertical domain.

Hence, our research team created BUSTER: a BUSiness Transaction Entity Recognition dataset for the NLP community.  

The Importance of Named Entity Recognition in NLP 

Natural language processing (NLP) turns unstructured text into structured, machine-readable data that downstream applications can process. Named Entity Recognition, also known as entity extraction, is an NLP function that identifies key entities in text and classifies them into predefined categories. Such entities could include the names of people, organizations, products and locations, numerical quantities or amounts, and dates. The relevant types depend on the domain in question.

This capability plays a pivotal role in many NLP tasks that serve to make sense of unstructured data by identifying the key information it contains. Practically speaking, NER helps provide a quick understanding of the topics or themes contained in a large text. This could include identifying specific terms to help customer service agents identify and route issues to the appropriate department or help claims operators extract essential information from submission packets.  
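To make the idea of entity extraction concrete, here is a minimal Python sketch of how token-level NER predictions are commonly turned into labeled entities. It assumes the widely used BIO tagging convention (B- begins an entity, I- continues it, O is outside any entity). The sentence, the company names and the BUYER and TARGET labels are hypothetical illustrations, not the actual BUSTER tag set.

```python
from typing import List, Tuple


def bio_to_spans(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group BIO-tagged tokens into (label, text) entity spans."""
    spans: List[Tuple[str, str]] = []
    current_label = None
    current_tokens: List[str] = []

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A new entity starts; flush any entity that is still open.
            if current_label is not None:
                spans.append((current_label, " ".join(current_tokens)))
            current_label, current_tokens = tag[2:], [token]
        elif tag.startswith("I-") and current_label == tag[2:]:
            # Continuation of the entity that is currently open.
            current_tokens.append(token)
        else:
            # An "O" tag or an inconsistent "I-" tag closes the open entity.
            if current_label is not None:
                spans.append((current_label, " ".join(current_tokens)))
            current_label, current_tokens = None, []

    # Flush an entity that runs to the end of the sentence.
    if current_label is not None:
        spans.append((current_label, " ".join(current_tokens)))
    return spans


# Hypothetical example sentence and labels, for illustration only.
tokens = ["Acme", "Corp", "agreed", "to", "acquire", "Widget", "Inc", "."]
tags = ["B-BUYER", "I-BUYER", "O", "O", "O", "B-TARGET", "I-TARGET", "O"]
print(bio_to_spans(tokens, tags))
```

In practice, the tags would come from a trained model rather than being written by hand. The decoding step stays the same whatever label set the domain uses.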

What is BUSTER? 

BUSTER is an Entity Recognition document-level benchmark that focuses on the main actors involved in a business transaction.  

As with any data set, the type of data used to train the model is critical. BUSTER was created based on a collection of around 10,000 business transaction documents from EDGAR company acquisition reports.  

EDGAR (Electronic Data Gathering, Analysis, and Retrieval system) is a service of the U.S. Securities and Exchange Commission (SEC) through which both domestic and foreign organizations doing business in the US are required to submit certain reports. The EDGAR service provides more than 150 different form types. One notable form is the 8-K, which gives investors timely notification of significant changes at listed companies, such as acquisitions, bankruptcy, the resignation of directors, or changes in the fiscal year. Exhibit 99.1, or EX-99, is a disclosure document sometimes included in the 8-K. It summarizes the operation announced in the form and is designed to give investors a complete and detailed view of it, including details such as company acquisitions, ownership changes and share purchase prices.

From these roughly 10,000 documents, the team manually annotated 3,779 to form the Gold corpus. To establish baselines, the team experimented with both generic and domain-specific language models. Out of several transformer-based models, four with reported state-of-the-art NLP performance were selected as baselines: BERT, RoBERTa, SEC-BERT and Longformer. The best-performing model in the experiment, RoBERTa, was used to automatically annotate the remaining 6,196 documents. The resulting annotated data was released as an additional “silver” corpus in the BUSTER benchmark.
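The gold and silver split described above can be sketched in a few lines of Python. This is only an illustration of the general "annotate the remainder with the best model" pattern, not the team's actual pipeline. The `build_silver_corpus` and `toy_annotate` names, the record layout and the COMPANY label are all hypothetical, and the toy annotator is a keyword lookup standing in for a trained model such as RoBERTa.

```python
from typing import Callable, Dict, List, Tuple

# Corpus sizes as stated in the article.
GOLD_DOCS = 3_779    # manually annotated documents
SILVER_DOCS = 6_196  # automatically annotated documents

Entity = Tuple[str, str]  # (label, text)


def total_documents(gold: int, silver: int) -> int:
    """Total corpus size, to check against the 'around 10,000' figure."""
    return gold + silver


def build_silver_corpus(
    unlabeled_docs: List[str],
    annotate: Callable[[str], List[Entity]],
) -> List[Dict[str, object]]:
    """Label each unannotated document with the supplied annotator."""
    return [
        {"text": doc, "entities": annotate(doc), "source": "silver"}
        for doc in unlabeled_docs
    ]


def toy_annotate(text: str) -> List[Entity]:
    """Hypothetical stand-in for a trained model: a simple name lookup."""
    known_names = ["Acme Corp", "Widget Inc"]  # made-up company names
    return [("COMPANY", name) for name in known_names if name in text]


print(total_documents(GOLD_DOCS, SILVER_DOCS))  # 9975
silver = build_silver_corpus(["Acme Corp agreed to acquire Widget Inc."], toy_annotate)
print(silver[0]["entities"])
```

Keeping a `source` field on each record is one way to let downstream users separate human-verified annotations from model-generated ones when training or evaluating.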

Today, the full BUSTER benchmark is publicly available and free to download from the expert.ai website and on HuggingFace. We believe it could become a reference benchmark in the field of Entity Recognition, in particular for the financial domain. 

Learn more about BUSTER: Download the paper and access the dataset.