BibDedupe: An Open-Source Python Library for Bibliographic Record Deduplication

BibDedupe is a Python library developed for bibliographic record deduplication in meta-analysis and research synthesis. It is constructed with a focus on four requirements: (1) Zero false positives: The primary objective is to prevent incorrectly merging distinct entries. This focus on zero false positives is crucial to ensure trustworthiness and prevent biased conclusions in the analysis. (2) Reproducibility: BibDedupe implements fixed rules to produce consistent results, in line with the scientific standard of reproducibility. (3) Efficiency: The library is also tuned for low false-negative rates and rapid processing, to ensure scalability of the duplicate identification process. (4) Continuous evaluation and improvement: It is continuously evaluated on over 160,000 records from 10 datasets to ensure its effectiveness, especially in follow-up refinements. Unlike general-purpose deduplication tools, BibDedupe is specifically designed for the unique requirements of bibliographic data in meta-analysis and research synthesis. In this context, BibDedupe aims to provide a Python library that improves the effectiveness and efficiency of duplicate identification, potentially benefitting review papers across scientific disciplines.


Statement of Need
Handling duplicates is a critical step in meta-analysis and research synthesis (Harrer et al., 2021), given that errors in this step can directly affect conclusions (Wood, 2008). Prior research has invested considerable effort in evaluating duplicate identification software for bibliographic data (Binette & Steorts, 2022; Bramer et al., 2016; Koumarelas et al., 2020; Rathbone et al., 2015). While methodologists have repeatedly cautioned against the risk of treating identical studies independently when they are published in different papers (Fairfield et al., 2017; Senn, 2009), the risk of erroneously classifying papers as duplicates has arguably received less attention. However, once records are removed from the process, it is rarely possible to recover false positives, or to quantify and correct their effect on meta-analytic results. As such, preventing false positives is of critical importance, while false negatives can be detected and merged in the subsequent screening and analysis steps (McLoughlin, 2022).
Proprietary software for duplicate identification often suffers from shortcomings related to the four requirements. Tools like EndNote or Covidence require compromises related to false positives, have limited transparency of black-box algorithms, or lack peer-review and external validation. Moreover, the use of proprietary software incurs costs, and restricts the combination of research tools, because data is hard to access and programmatic interfaces are not offered.
General-purpose deduplication libraries often lack the specificity needed for bibliographic data, requiring skills and excessive amounts of effort to develop and evaluate algorithms. For example, libraries such as the Python Record Linkage Toolkit (De Bruin, 2019) and dedupe (io) (Gregg & Eder, 2022) provide an arsenal of similarity measures, blocking rules, and utility functions. As such, they provide a valuable basis to support the design of domain-specific duplicate identification tools, but they are rarely used directly by researchers conducting a meta-analysis (Nguyen et al., 2022). When developing a custom deduplication algorithm, its effectiveness can only be evaluated by creating an independently deduplicated dataset. More severely, developing an accurate algorithm requires in-depth knowledge of publication practices and of the errors typically introduced by academic databases, or other systems handling bibliographic metadata. Experience shows that minor changes can have significant effects on overall performance. Finally, machine-learning libraries, such as dedupe (io), involve the learning of blocking rules and similarity functions from each dataset, based on user input. Such manual processing steps reduce efficiency and limit reproducibility.
Open-source research software for duplicate identification is scarce, and to date, peer-reviewed software is non-existent in this area. In the Python ecosystem, the only library I found is ASReview Datatools, provided by the team behind the ASReview screening tool (Van De Schoot et al., 2021). My evaluations show that this library introduces a considerable number of false positives, and cannot be used for meta-analyses. R users, or Python users willing to switch ecosystems, may use ASySD (Hair et al., 2023), a recently published R package with a Shiny web interface. The code of this package resembles BibDedupe, but it does not achieve zero false positives, uses a relatively small test dataset from medicine (n=1845) in the unit tests, and was not evaluated in the peer review process.
In conclusion, researchers are not served well by proprietary tools or by general-purpose deduplication libraries. Effective and peer-reviewed libraries are urgently needed for meta-analyses and research synthesis to facilitate researchers' trust and adoption of open-source libraries in the area of literature reviews.

Example usage
    import pandas as pd
    from bib_dedupe.bib_dedupe import merge

    # Load your bibliographic dataset into a pandas DataFrame
    records_df = pd.read_csv("records.csv")

    # Get the merged_df
    merged_df = merge(records_df)

For advanced use cases, it is also possible to complete and customize each step individually.

Implementation
I define duplicates as potentially differing bibliographic representations of the same real-world record (cf. Rathbone et al., 2015). This conceptual definition is operationalized as follows.
The following are considered non-duplicates:
• Papers reporting on the same study if they are published separately (e.g., involving different stages of the study such as pilots and protocols, or differences in outcomes, interventions, or populations)
• A conference paper and its extended journal publication
• A journal paper and a reprint in another journal
It is noted that the focus is on duplicates of bibliographic records. The linking of multiple records reporting results from the same study is typically done in a separate step after full-text retrieval, using information from the full-text document, querying dedicated registers, and potentially corresponding with the authors (see Higgins et al., 2023, sec. 4.6.2 and 4.6.2).
These clarifications are necessary for the evaluation dataset, and for users to understand what will (not) be considered a duplicate. The rationale is that cases of duplicates are rarely or never cited as separate items in a reference section, while non-duplicates can in principle be cited separately. It is a different issue whether the corresponding research and administrative practices are considered questionable or ethical (e.g., salami publications, or registering multiple DOIs for the same paper).
To accurately identify and merge duplicates, BibDedupe implements the steps of preprocessing, blocking, rule-based matching, and merging. As seen in the usage example, each step can be adapted.

Preprocessing
Preprocessing involves an array of standardizations across fields, including replacement of special characters. For titles and journals, stop words are removed to give more weight to distinctive words in the similarity measures. For the author field, name particles are removed because they are often handled incorrectly in the data creation process. Additional notes and translations are removed from the title field. For translated journal names, the English version is used as a replacement.
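The standardizations described above can be sketched with the Python standard library. This is an illustrative sketch under stated assumptions, not BibDedupe's actual code: the function names, the short stop-word and particle lists, and the rule that bracketed text marks a note or translation are all hypothetical.

```python
import re
import unicodedata

# Hypothetical, deliberately short lists; real lists would be much longer.
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "on", "for", "to"}
NAME_PARTICLES = {"van", "von", "de", "der", "den", "la", "le", "di"}


def strip_special_characters(text: str) -> str:
    """Lowercase, drop diacritics, and replace non-alphanumerics with spaces."""
    decomposed = unicodedata.normalize("NFKD", text)
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    return re.sub(r"[^a-z0-9 ]+", " ", ascii_only.lower())


def normalize_title(title: str) -> str:
    """Remove bracketed notes (assumed to hold notes/translations) and stop words."""
    without_notes = re.sub(r"\[[^\]]*\]", " ", title)
    words = strip_special_characters(without_notes).split()
    return " ".join(w for w in words if w not in STOP_WORDS)


def normalize_authors(authors: str) -> str:
    """Remove name particles, which are often handled inconsistently."""
    words = strip_special_characters(authors).split()
    return " ".join(w for w in words if w not in NAME_PARTICLES)


title = normalize_title(
    "The Effects of Caffeine on Sleep: A Meta-Analysis [Article in German]"
)
authors = normalize_authors("van der Müller, José")
print(title)    # effects caffeine sleep meta analysis
print(authors)  # muller jose
```

Removing stop words before comparison means that two titles differing only in "the" or "of" normalize to the same string, so the similarity measures operate on the distinctive words.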

Blocking
To avoid checking all possible combinations of papers, blocking selects the pairs that are likely to be duplicates. This is a common technique in deduplication where only records within the same block are compared for potential duplication.
BibDedupe relies on a comprehensive set of blocking rules to avoid false negatives in this step. After the set of blocking rules is applied, pairs not sharing a minimum number of words in the titles are removed, effectively reducing the number of pairs by 50-95% without losing true pairs. This leads to a more efficient matching step.
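A minimal sketch of the two-stage idea (blocking rules, followed by a shared-title-word filter) is given below. The two blocking keys, the threshold of two shared words, and the toy records are assumptions chosen for illustration; they are not the rules BibDedupe ships with.

```python
from collections import defaultdict
from itertools import combinations

# Toy records with already preprocessed titles (hypothetical data).
records = {
    "r1": {"title": "effects caffeine sleep meta analysis", "year": "2019"},
    "r2": {"title": "effects caffeine sleep quality meta analysis", "year": "2019"},
    "r3": {"title": "deep learning protein folding", "year": "2019"},
    "r4": {"title": "effects caffeine sleep", "year": "2020"},
}


def block_pairs(records, key_functions):
    """Return candidate pairs that share a key under at least one blocking rule."""
    pairs = set()
    for key_function in key_functions:
        blocks = defaultdict(list)
        for record_id, record in records.items():
            key = key_function(record)
            if key:  # records without a key are not blocked by this rule
                blocks[key].append(record_id)
        for ids in blocks.values():
            pairs.update(combinations(sorted(ids), 2))
    return pairs


def shares_title_words(records, pair, minimum=2):
    """Keep a pair only if the titles have at least `minimum` words in common."""
    left, right = (set(records[i]["title"].split()) for i in pair)
    return len(left & right) >= minimum


rules = [
    lambda r: r["year"],  # same publication year
    lambda r: r["title"].split()[0] if r["title"] else "",  # same first title word
]
candidates = block_pairs(records, rules)
filtered = {pair for pair in candidates if shares_title_words(records, pair)}
print(len(candidates), len(filtered))  # 5 3
```

Using several permissive rules in the first stage guards against false negatives, because a pair only needs to agree on one key; the word-overlap filter then discards pairs such as `r1`/`r3` that merely share a publication year.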

Matching
The matching function selects duplicates or potential duplicates from the list of blocked record pairs. Potential duplicates, also known as "maybe cases", are marked separately for manual verification. To achieve accurate and interpretable matching, I specified an array of human-readable conditions, which are based on pre-calculated and context-specific similarities between fields.
The conditions and similarity functions account for bibliographic errors commonly introduced between duplicates. I summarize the key design decisions of BibDedupe, which differ from other approaches (notably ASySD):
• Robust author similarities: The most substantial format variation is observed in the author field, requiring robust similarity measures. This is particularly challenging for non-Western names, which are not supported well by current citation style conventions, or name-parsing software (see nameparser). Given that Chinese authors are leading in many research output and impact rankings (Brainard & Normile, 2022), this is a limitation. After testing multiple similarity measures, I found that the agreement between capital or beginning-of-word letters provided the most robust measure of author similarity, suggesting that common similarity measures like Jaro-Winkler are less appropriate in this case. I briefly illustrate this with an example of non-Western names that were erroneously abbreviated:
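The following is a minimal sketch of such a comparison, not BibDedupe's actual implementation. The author names are made up for illustration, the helper names `initials` and `initials_similarity` are hypothetical, and `difflib.SequenceMatcher` is used as a stand-in for a character-level measure such as Jaro-Winkler, which is not part of the Python standard library.

```python
import re
from difflib import SequenceMatcher


def initials(author_string: str) -> str:
    """Collect beginning-of-word letters and further capital letters."""
    letters = []
    for word in re.split(r"[\s,.;\-]+", author_string):
        if not word:
            continue
        letters.append(word[0].upper())
        letters.extend(ch for ch in word[1:] if ch.isupper())
    return "".join(letters)


def initials_similarity(left: str, right: str) -> float:
    """Agreement between the initials of two author strings (0.0 to 1.0)."""
    left_initials, right_initials = initials(left), initials(right)
    if not left_initials or not right_initials:
        return 0.0
    return SequenceMatcher(None, left_initials, right_initials).ratio()


def character_similarity(left: str, right: str) -> float:
    """Character-level similarity of the full strings (stand-in measure)."""
    return SequenceMatcher(None, left.lower(), right.lower()).ratio()


# Hypothetical author fields of the same paper in two databases.
full_names = "Zhang, Wei-Ming; Li, Xiao-Hua"
abbreviated = "Zhang, W. M.; Li, X. H."

print(initials(full_names), initials(abbreviated))  # ZWMLXH ZWMLXH
print(initials_similarity(full_names, abbreviated))  # 1.0
print(character_similarity(full_names, abbreviated) < 1.0)  # True
```

In this sketch the initials agree completely even though most characters of the given names are missing in the abbreviated variant, whereas the character-level score is penalized by the abbreviation.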

Merging
Upon merging a set of records, BibDedupe keeps track of the original IDs in the origin field.
Compared to the common approach of deleting n-1 records from the set of duplicates, this approach has three distinct advantages: (1) validation: together with the original dataset, it allows users to validate whether duplicate decisions are accurate, (2) undo: it is possible to restore selected cases where erroneous duplicates were merged, and (3) evaluation: it enables subsequent use of datasets to evaluate and tune duplicate detection algorithms.
The merging function uses heuristics to select the most appropriate fields from duplicate records, instead of selecting all fields from one record regardless of field-level quality.For instance, proper capitalization is preferred when one record has author or title fields in all-caps, and DOIs are selected when other DOI fields are empty.
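A minimal sketch of origin tracking and field-level selection could look as follows. The function names, the semicolon-separated format of the origin field, and the sample records are assumptions for illustration; BibDedupe's own heuristics are more extensive.

```python
def pick_field(values):
    """Prefer non-empty values and, among those, values that are not all-caps."""
    non_empty = [value for value in values if value]
    if not non_empty:
        return ""
    properly_cased = [value for value in non_empty if not value.isupper()]
    return (properly_cased or non_empty)[0]


def merge_records(duplicates):
    """Merge a list of record dicts (each with an 'ID') into a single record."""
    fields = []
    for record in duplicates:
        for field in record:
            if field not in ("ID", "origin") and field not in fields:
                fields.append(field)
    merged = {
        field: pick_field([record.get(field, "") for record in duplicates])
        for field in fields
    }
    merged["ID"] = duplicates[0]["ID"]
    # Assumed format: original IDs joined with ";" so that merges can be traced.
    merged["origin"] = ";".join(record["ID"] for record in duplicates)
    return merged


# Hypothetical duplicate records.
duplicates = [
    {"ID": "a1", "title": "EFFECTS OF CAFFEINE ON SLEEP", "doi": ""},
    {"ID": "b7", "title": "Effects of caffeine on sleep", "doi": "10.1000/example"},
]
merged = merge_records(duplicates)
print(merged["title"])   # Effects of caffeine on sleep
print(merged["doi"])     # 10.1000/example
print(merged["origin"])  # a1;b7
```

Because the origin field lists every source ID, an erroneous merge can be undone by looking up `merged["origin"].split(";")` in the original dataset.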

Evaluation
To evaluate BibDedupe, I collected 10 datasets comprising over 160,000 records and 34,900 duplicates (Hair et al., 2023; Rathbone et al., 2015; Wagner et al., 2021). The results are displayed in Table 1. This is, to the best of my knowledge, the only evaluation that is updated automatically on a regular basis, and the most comprehensive evaluation of bibliographic duplicate detection algorithms to date. Complementary evaluation data, including proprietary software and tools that do not offer programmatic access, is reported by Hair et al. (2023).
I completed over 3,000 iterations to evaluate and improve BibDedupe based on these datasets. The efforts involved tuning the preprocessing, blocking, and matching steps, vetting different similarity measures, and validating the false positives and negatives based on the definition of (non)-duplicates. I carefully reviewed the conditions to combine and generalize narrowly defined cases. In addition, I implemented unit tests to ensure consistency, and understand how changes in the code affect each step. Runtime was optimized by implementing and evaluating different approaches to parallel processing, such as processing NumPy arrays vs. splitting dataframes horizontally. As a result, the depression dataset with approx. 80,000 records is processed in under 10 minutes with 8 CPUs.
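Counting false positives and false negatives against an independently deduplicated dataset can be sketched as follows. The sketch assumes a pair-level comparison, which may differ from the counting scheme used in the actual evaluation; the function name and the sample pairs are hypothetical.

```python
def evaluate(predicted_pairs, true_pairs):
    """Compare predicted duplicate pairs with ground-truth pairs (order-insensitive)."""
    predicted = {frozenset(pair) for pair in predicted_pairs}
    truth = {frozenset(pair) for pair in true_pairs}
    true_positives = len(predicted & truth)
    false_positives = len(predicted - truth)  # distinct records merged in error
    false_negatives = len(truth - predicted)  # duplicates that were missed
    found = true_positives + false_positives
    expected = true_positives + false_negatives
    return {
        "TP": true_positives,
        "FP": false_positives,
        "FN": false_negatives,
        "precision": true_positives / found if found else 1.0,
        "recall": true_positives / expected if expected else 1.0,
    }


# Hypothetical output of a deduplication run and the corresponding ground truth.
predicted_pairs = [("r1", "r2"), ("r4", "r3")]
true_pairs = [("r2", "r1"), ("r3", "r4"), ("r5", "r6")]
result = evaluate(predicted_pairs, true_pairs)
print(result["FP"], result["FN"])  # 0 1
```

Under the zero-false-positives requirement, `FP` is the count to drive to zero first; the remaining false negatives lower recall but can still be caught in later screening steps.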

Ongoing improvements
BibDedupe provides duplicate identification functionality, which performs with zero false positives on a dataset comprising over 160,000 records. It builds on carefully crafted rules and high-quality training data to ensure effectiveness, transparency, and reproducibility. The evaluation runs automatically and provides a solid foundation for continuous improvements and additions of datasets. I intend to incorporate additional datasets and continue refining the rules and procedures.
The following are considered duplicates (see the definition in the Implementation section):
• Papers referring to the same record (per definition)
• Paper versions, including the author's original, submitted, accepted, proof, and corrected versions (NISO/ALPSP JAV Working Group, 2008)
• Papers that are continuously updated (e.g., versions of Cochrane reviews)
• Papers with different DOIs if they refer to the same record (e.g., redundantly registered DOIs for online and print versions)