DeBEIR: A Python Package for Dense Bi-Encoder Information Retrieval

Information Retrieval (IR) is the task of retrieving documents given a query or information need. These documents are retrieved and ranked based on a relevance function or relevance model such as Best-Matching 25 (BM25) (Robertson et al., 1995). Deep learning has been successful in other computer science fields, such as computer vision with AlexNet (Krizhevsky et al., 2012) and Inception (Szegedy et al., 2014) and natural language processing with transformers (Devlin et al., 2019; Lee et al., 2019; Yang Liu & Lapata, 2019). In information retrieval, however, reported gains were called into question because they were measured against weak baselines (Yang et al., 2019). In 2019, deep learning models were shown to surpass less computationally intensive keyword-based statistical models in retrieval effectiveness (Lin, 2019), sparking a resurgence in the field of dense retrieval. Dense retrieval is the task of retrieving documents given a query or information need using a dense vector representation of the query and documents (Lin et al., 2021). The dense vector representation is obtained by passing the query and documents through a neural network, usually a pre-trained language model such as BERT (Devlin et al., 2019) or RoBERTa (Yinhan Liu et al., 2019). The dense query vector is then used to retrieve documents using a similarity function such as cosine similarity.

Unlike statistical learning, tuning deep learning retrieval methods is often costly and time-consuming. This cost makes it essential to efficiently automate much of the training, tuning and evaluation processes.
We present DeBEIR, a library for (1) facilitating dense retrieval research, primarily focusing on bi-encoder dense retrieval, where query and document dense vectors are generated separately (Reimers & Gurevych, 2019); (2) expediting experimentation in dense retrieval research by reducing boilerplate code through an interchangeable pipeline API and by making code extendable through inheritance from general classes; and (3) providing abstractions for standard training loops and hyperparameter tuning driven by easy-to-define configuration files.
DeBEIR is aimed at helping practitioners, researchers and data scientists experimenting with bi-encoders by providing them with dense retrieval methods that are easy to use out of the box but also have additional extendability for more nuanced research. Furthermore, our pipeline runs asynchronously to reduce I/O performance bottlenecks, facilitating faster experiments and research.
A brief summary of the pipeline is (Figure 2):
1. Configuration based on Tom's Obvious Minimal Language (TOML) files; these are loaded in a class factory to create pipeline objects.
2. An executor object takes in a query builder object. The purpose of the query builder object is to define the mapping of the documents and which parts of the query to use for query execution.
3. The executor object asynchronously runs the queries.
4. Finally, an evaluator object uses the results to list metrics defined by a configuration file against an oracle test set.
This pipeline is condensed into a single class that can be built from a configuration file.
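The stages summarised above can be illustrated with a minimal sketch. Every name in it (QueryBuilder, Executor, precision_at_k, toy_search and the configuration keys) is hypothetical and chosen for illustration; none is DeBEIR's actual API. A plain dictionary stands in for the parsed configuration file, and a term-overlap function stands in for the search engine.

```python
import asyncio
from dataclasses import dataclass

# Hypothetical names throughout; this is not DeBEIR's actual API.


@dataclass
class QueryBuilder:
    """Chooses which topic fields make up the query text."""
    query_fields: tuple

    def build(self, topic: dict) -> str:
        return " ".join(topic[f] for f in self.query_fields if f in topic)


class Executor:
    """Runs all queries concurrently against an async search function."""

    def __init__(self, builder, search_fn, max_concurrency=4):
        self.builder = builder
        self.search_fn = search_fn
        self.max_concurrency = max_concurrency

    async def _run_one(self, semaphore, topic_id, topic):
        # The semaphore caps the number of in-flight requests.
        async with semaphore:
            return topic_id, await self.search_fn(self.builder.build(topic))

    async def run_all(self, topics: dict) -> dict:
        semaphore = asyncio.Semaphore(self.max_concurrency)
        pairs = await asyncio.gather(
            *(self._run_one(semaphore, tid, t) for tid, t in topics.items())
        )
        return dict(pairs)


def precision_at_k(results: dict, qrels: dict, k: int) -> float:
    """Evaluator: mean fraction of the top-k results judged relevant."""
    per_topic = [
        sum(doc in qrels.get(tid, set()) for doc in ranked[:k]) / k
        for tid, ranked in results.items()
    ]
    return sum(per_topic) / len(per_topic)


async def toy_search(query: str) -> list:
    """Stand-in for a search engine call: ranks documents by term overlap."""
    await asyncio.sleep(0)  # yield control, as a real network call would
    corpus = {"d1": "dense retrieval with bi-encoders", "d2": "cooking pasta at home"}
    terms = set(query.lower().split())
    return sorted(corpus, key=lambda d: -len(terms & set(corpus[d].split())))


# A dictionary standing in for the parsed configuration file.
config = {"query_fields": ("title",), "k": 1}
topics = {"q1": {"title": "dense retrieval", "narrative": "unused"}}
qrels = {"q1": {"d1"}}

executor = Executor(QueryBuilder(config["query_fields"]), toy_search)
results = asyncio.run(executor.run_all(topics))
score = precision_at_k(results, qrels, config["k"])
```

Running the queries through asyncio.gather lets slow I/O calls overlap instead of executing one after another, which is the motivation the text gives for an asynchronous executor.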

Statement of Need
Dense retrieval has been popular in Information Retrieval since 2015 (Guo et al., 2017; Hui et al., 2017; Yin et al., 2015). However, the retrieval effectiveness of these methods was often compared against weak baselines, and they were not shown to be significantly more effective than statistical models such as a well-tuned BM25 model, while being considerably slower. This situation is similar to what happened in the early 2000s, when progress in retrieval effectiveness slowed because new methods were evaluated against less robust baselines (Armstrong et al., 2009).
However, attitudes towards dense retrieval changed when transformer models were found to be effective once fine-tuned on Natural Language Inference tasks or MS MARCO (T. Nguyen et al., 2016) as a cross-encoder (Lin, 2019), significantly overtaking even the best BM25 models.
There are generally two classes of dense retrieval models for IR: (1) the cross-encoder, which encodes queries and documents together at query time, and (2) the bi-encoder, which can encode documents at index time and queries at query time. The cross-encoder is generally more effective than the bi-encoder model for retrieval (Lin et al., 2021). However, this increased effectiveness requires substantially more computation and can be a bottleneck in production systems. Therefore, a less expensive model such as BM25 is typically used to retrieve smaller candidate lists (first-stage retrieval) to be fed to second-stage re-ranking by a cross-encoder.
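The two-stage arrangement can be sketched with toy scoring functions. Both scorers below are assumptions made for illustration: a shared-term count stands in for a cheap first-stage model such as BM25, and a length-normalised overlap stands in for an expensive second-stage model such as a cross-encoder. Neither is a real implementation of those models.

```python
def cheap_score(query: str, doc: str) -> int:
    """First-stage stand-in (for example BM25): number of shared terms."""
    return len(set(query.split()) & set(doc.split()))


def expensive_score(query: str, doc: str) -> float:
    """Second-stage stand-in (for example a cross-encoder).

    Scores the query-document pair jointly: shared terms divided by
    document length, so focused documents score higher.
    """
    query_terms = set(query.split())
    doc_terms = doc.split()
    return sum(term in query_terms for term in doc_terms) / len(doc_terms)


def retrieve_then_rerank(query: str, corpus: dict, first_stage_k: int) -> list:
    # Stage 1: score every document cheaply and keep a short candidate list.
    candidates = sorted(
        corpus, key=lambda d: cheap_score(query, corpus[d]), reverse=True
    )[:first_stage_k]
    # Stage 2: apply the expensive scorer to the candidates only.
    return sorted(
        candidates, key=lambda d: expensive_score(query, corpus[d]), reverse=True
    )


corpus = {
    "d1": "bi-encoder retrieval and many other unrelated topics in one long survey",
    "d2": "bi-encoder retrieval",
    "d3": "gardening tips",
}
ranking = retrieve_then_rerank("bi-encoder retrieval", corpus, first_stage_k=2)
```

The expensive scorer runs on at most first_stage_k documents per query, which is how the second stage stays affordable regardless of corpus size.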
Although cross-encoders are more accurate than bi-encoders, bi-encoders are more effective than BM25 (V. Nguyen et al., 2022) and are faster than cross-encoders. Therefore, a gap in the IR literature is the use of a bi-encoder either to replace BM25 in first-stage retrieval or as the sole ranking system, without a second-stage re-ranker. However, current libraries do not address this use case because it requires integration with the indexing and querying pipeline of the search engine.
DeBEIR is a library that addresses this gap by facilitating bi-encoder research and providing base classes with flexible functionality through inheritance. While we provide cross-encoder re-rankers for feature completeness, the library's priority is facilitating bi-encoder research.
The strength of bi-encoders lies in the offline indexing of dense vectors. These vectors can then be used for first-stage retrieval and potentially passed to a second-stage retrieval system such as a cross-encoder. Bi-encoders can be used as the sole retrieval system when there is a lack of training data (V. Nguyen et al., 2022) and, therefore, can be more useful in areas such as biomedical IR, where training data is expensive to annotate and therefore scarce. Cross-encoders, however, require large amounts of training data for effectiveness.
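A minimal sketch of offline indexing follows. The encode function is a toy stand-in (a hashed bag-of-words vector) for a neural encoder such as BERT, and the in-memory dictionary stands in for a search engine index; both are assumptions made so that the example runs without a trained model.

```python
import math
import zlib

DIM = 64  # toy embedding size


def encode(text: str) -> list:
    """Toy stand-in for a neural encoder: hashed bag-of-words, L2-normalised."""
    vec = [0.0] * DIM
    for token in text.lower().split():
        # crc32 is used because it is stable across Python processes.
        vec[zlib.crc32(token.encode("utf-8")) % DIM] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def cosine(u: list, v: list) -> float:
    # Both vectors are unit length, so the dot product equals the cosine.
    return sum(a * b for a, b in zip(u, v))


corpus = {
    "d1": "randomised trial of a new asthma treatment",
    "d2": "bi-encoder models for biomedical literature search",
    "d3": "annotating training data is expensive",
}

# Index time (offline): every document is encoded exactly once.
index = {doc_id: encode(text) for doc_id, text in corpus.items()}


def search(query: str, k: int = 2) -> list:
    # Query time: only the query is encoded; document vectors are reused.
    q = encode(query)
    return sorted(index, key=lambda d: cosine(q, index[d]), reverse=True)[:k]


top_hits = search("biomedical literature search")
```

Because document vectors are computed once and stored, the per-query cost is a single encoder call plus similarity computations, whereas a cross-encoder would need a forward pass for every query-document pair.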
The DeBEIR library offers an API for commonly used functions for training, hyper-parameter tuning (Figure 2) and evaluation of transformer-based models. The pipeline can be broken up into multiple stages: parsing, query building, query execution, serialization and evaluation (Figure 1). Furthermore, we package our caching mechanism for the expensive encoding operations to speed up the pipeline during repeated experimentation.
Although similar libraries exist, such as sentence-transformers (Reimers & Gurevych, 2019) and OpenNIR (MacAvaney, 2020), they focus less on the early stages of the dense retrieval pipeline. These stages involve indexing the textual data from the corpora together with their dense vector representations, which is only helpful for bi-encoder models rather than traditional cross-encoders and is thus not typically explored by other libraries. Other limitations include a lack of extendability, which restricts users' options for training customization (we provide base classes that can be inherited), or a design tailored to general-purpose machine learning rather than information retrieval. Finally, these libraries have limited caching mechanisms, as cross-encoders typically do not require this capability because they are decoupled from the index. Bi-encoders can have queries cached at query time to make repeated query calls to the index significantly faster.
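Query caching can be sketched with the standard library alone. This is an in-memory illustration under stated assumptions and not DeBEIR's caching mechanism: expensive_encode is a hypothetical stand-in for a transformer forward pass, and the call counter exists only to make the saving visible.

```python
from functools import lru_cache

encode_calls = 0  # counts how often the expensive encoder actually runs


def expensive_encode(query: str) -> tuple:
    """Hypothetical stand-in for a transformer forward pass."""
    global encode_calls
    encode_calls += 1
    return tuple(float(ord(ch)) for ch in query[:8])


@lru_cache(maxsize=10_000)
def cached_encode(query: str) -> tuple:
    # A tuple is returned so that cached vectors cannot be mutated by callers.
    return expensive_encode(query)


# Repeated experiments issue the same query several times.
for _ in range(3):
    vector = cached_encode("covid-19 treatment")

cache_stats = cached_encode.cache_info()
```

Only the first call reaches the encoder; later calls with the same query string are served from the cache. A cache that persists between runs would need to write vectors to disk, which this sketch does not attempt.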
DeBEIR will help facilitate early-stage dense retrieval and rapid experimentation research with bi-encoders. It is also flexible enough for second-stage retrieval using cross-encoders from this library or other libraries. We will continue to improve this tool over time.

Examples
Pipeline
The pipeline is a single class that can be built from a configuration file. The configuration file is a TOML file that defines the pipeline stages and their parameters. The pipeline is built using a class factory that takes in the configuration file and creates the pipeline stages. The pipeline stages are then executed in order.

    # Run optuna with wandb integration
    study = run_optuna_with_wandb(trainer, wandb_kwargs={
        "project": "my-hparam-tuning-project"
    })

    # Print optuna stats and best run
    print_optuna_stats(study)
More information on the library is found on the GitHub page, DeBEIR. Any feedback and suggestions are welcome by opening a thread in DeBEIR issues.