Web-based text anonymization with Node.js: Introducing NETANOS (Named entity-based Text Anonymization for Open Science)

Summary

NETANOS (Named Entity-based Text ANonymization for Open Science) is natural language processing software that anonymizes texts by identifying and replacing named entities. Its key feature is that the anonymization preserves critical context, which allows secondary linguistic analyses to be run on the anonymized texts.

Consider the example string “Max and Ben spent more than 1000 hours on writing the software. They started in August 2016 in Amsterdam.” Coarse anonymization such as simple “XXX” replacement would suffice to mask the true content of the string, but it destroys essential text properties that secondary analyses need. For example, content-based deception detection approaches rely on the number of specific times and dates to differentiate between deceptive and truthful texts (Warmelink et al. 2013).

The architecture of NETANOS relies on two software libraries capable of identifying named entities: (1) the Stanford Named Entity Recognizer (NER) (Finkel, Grenager, and Manning 2005), integrated via the ner Node.js package (Srivastava 2016), and (2) the NLP-compromise JavaScript frontend library (Kelly 2016). Both libraries are used in a layered architecture to identify persons (e.g. “Max”, “Ben”), locations (e.g. “Amsterdam”, “Munich”), organizations (e.g. “Google”), dates (e.g. “August 2016”), and values (e.g. “42”).

Specifically, the text anonymization is achieved with the following stepwise procedure: (1) The input string is analyzed by Stanford's NER, identifying organizations, locations, persons, and dates. (2) All identified entities are replaced with their context-preserving anonymized versions. (3) NLP-compromise's named entity recognition tool is applied to identify potentially remaining, unrecognized entities.
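The layering described above can be sketched in a few lines of JavaScript. This is only an illustration under stated assumptions: the two recognizers are toy dictionary lookups standing in for Stanford's NER and NLP-compromise, and the names makeLookupRecognizer and anonymizeLayered are hypothetical, not part of NETANOS or of either library.

```javascript
// Toy recognizer factory (hypothetical stand-in for a real NER layer).
// `known` is an array of [entityText, entityType] pairs.
function makeLookupRecognizer(known) {
  return (text) =>
    known
      .filter(([entity]) => text.includes(entity))
      .map(([entity, type]) => ({ text: entity, type }));
}

// Run the recognizers one after another. Each layer only sees the text
// that earlier layers left unmasked, so it can pick up missed entities.
function anonymizeLayered(input, recognizers) {
  const counters = {}; // running index per entity type, shared by all layers
  let output = input;
  for (const recognize of recognizers) {
    for (const { text, type } of recognize(output)) {
      counters[type] = (counters[type] || 0) + 1;
      // split/join replaces every occurrence without regex escaping
      output = output.split(text).join(`[${type}_${counters[type]}]`);
    }
  }
  return output;
}

// Assumed example data: the first layer misses "Ben", the second catches it.
const firstLayer = makeLookupRecognizer([
  ["Max", "PERSON"],
  ["Amsterdam", "LOCATION"],
]);
const secondLayer = makeLookupRecognizer([
  ["Max", "PERSON"],
  ["Ben", "PERSON"],
]);

const layeredResult = anonymizeLayered("Max and Ben started in Amsterdam.", [
  firstLayer,
  secondLayer,
]);
console.log(layeredResult);
// [PERSON_1] and [PERSON_2] started in [LOCATION_1].
```

Sharing one set of counters across both layers keeps the placeholder indices consistent regardless of which layer found an entity.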

Besides its key feature, context-preserving text anonymization, NETANOS provides three alternative anonymization types.

  • Context-preserving anonymization (key feature): Identified named entities are replaced with a composite string consisting of the entity type and the index of occurrence. “[PERSON_1] and [PERSON_2] spent more than [DATE/TIME_1] on writing the software. They started in [DATE/TIME_2] in [LOCATION_1].”

  • Named entity-based replacement: Identified entities are replaced with a different, randomly chosen named entity of the same type. “Barry and Rick spent more than 997 hours on writing the software. They started in January 14 2016 in Odessa.”

  • Non-context preserving anonymization: This replacement type is inspired by the anonymization procedure suggested by the UK Data Service (UK Data Service, n.d.). It replaces all strings having a capital first letter and all numeric values with XXX. “XXX and XXX spent more than XXX hours on writing the software. XXX started in XXX XXX in XXX.”

  • Combined, non-context preserving anonymization: The context-preserving replacement is used to identify candidates for replacement, which are then replaced following the non-context preserving procedure. “XXX and XXX spent more than XXX XXX on writing the software. XXX started in XXX XXX in XXX.”
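The non-context preserving type in the list above is simple enough to approximate with a single regular expression. The following JavaScript sketch is an assumption about how such a rule could be written, not the NETANOS source: the function name anonymizeXXX is hypothetical, and the pattern only covers ASCII capital letters and plain digit sequences (optionally with "." or "," separators).

```javascript
// Replace every word starting with a capital letter and every numeric
// value with "XXX". ASCII-only by assumption; accented capitals are not
// matched by [A-Z].
function anonymizeXXX(input) {
  const capitalizedOrNumeric = /\b(?:[A-Z][A-Za-z]*|\d+(?:[.,]\d+)*)\b/g;
  return input.replace(capitalizedOrNumeric, "XXX");
}

const xxxResult = anonymizeXXX(
  "Max and Ben spent more than 1000 hours on writing the software. " +
    "They started in August 2016 in Amsterdam."
);
console.log(xxxResult);
// XXX and XXX spent more than XXX hours on writing the software. XXX started in XXX XXX in XXX.
```

Note that the rule also masks sentence-initial words such as “They”, which matches the example output given for this anonymization type.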

Note that all replacements are applied globally across the input string.
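To make the effect of global replacement concrete, here is a minimal JavaScript sketch of the context-preserving type. It assumes the entities have already been identified and are passed in as { text, type } objects; that shape, and the names escapeRegExp and anonymizeContextPreserving, are illustrative assumptions rather than the NETANOS API.

```javascript
// Escape regex metacharacters so entity text is matched literally.
function escapeRegExp(s) {
  return s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// entities: array of { text, type } in order of first occurrence
// (assumed shape of a recognizer's output).
function anonymizeContextPreserving(input, entities) {
  const counters = {}; // running index per entity type
  const seen = new Set(); // entity texts that already have a placeholder
  let output = input;
  for (const { text, type } of entities) {
    if (seen.has(text)) continue; // a repeated mention reuses its placeholder
    seen.add(text);
    counters[type] = (counters[type] || 0) + 1;
    const placeholder = `[${type}_${counters[type]}]`;
    // "g" flag: replace every occurrence; the lookarounds avoid
    // matching inside longer words (e.g. "Max" in "Maximum").
    const pattern = new RegExp(`(?<!\\w)${escapeRegExp(text)}(?!\\w)`, "g");
    output = output.replace(pattern, () => placeholder);
  }
  return output;
}

const contextResult = anonymizeContextPreserving(
  "Max and Ben spent more than 1000 hours on writing the software. " +
    "They started in August 2016 in Amsterdam. Max still lives in Amsterdam.",
  [
    { text: "Max", type: "PERSON" },
    { text: "Ben", type: "PERSON" },
    { text: "1000 hours", type: "DATE/TIME" },
    { text: "August 2016", type: "DATE/TIME" },
    { text: "Amsterdam", type: "LOCATION" },
    { text: "Max", type: "PERSON" }, // duplicate mention, skipped
  ]
);
console.log(contextResult);
// [PERSON_1] and [PERSON_2] spent more than [DATE/TIME_1] on writing the software. They started in [DATE/TIME_2] in [LOCATION_1]. [PERSON_1] still lives in [LOCATION_1].
```

Because the replacement is global, both mentions of “Max” map to [PERSON_1], so co-reference within the text survives anonymization.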

Technical Pipeline

The software architecture of NETANOS is illustrated in a technical pipeline diagram available on FigShare.

Note

The software documentation for NETANOS with working examples and installation guidelines is available here.

The potential re-identifiability of texts anonymized with NETANOS has been tested experimentally. A preprint of that paper is available on the Open Science Framework preprint server.

References

Finkel, J. R., T. Grenager, and C. Manning. 2005. “Incorporating Non-Local Information into Information Extraction Systems by Gibbs Sampling.” Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics, 363–70. doi:10.3115/1219840.1219885.

Kelly, S. 2016. “NLP Compromise: Natural Language Processing in Javascript.” https://github.com/nlpcompromise/compromise.

UK Data Service. n.d. “Ukds.tools.textAnonHelper / Home [Bitbucket Wiki].” https://bitbucket.org/ukda/ukds.tools.textanonhelper/wiki/Home.

Srivastava, N. 2016. “Ner: Client for Stanford Named Entity Recognition.” https://github.com/niksrc/ner.

Warmelink, L., A. Vrij, S. Mann, and P. A Granhag. 2013. “Spatial and Temporal Details in Intentions: A Cue to Detecting Deception.” Applied Cognitive Psychology 27: 101–6. doi:10.1002/acp.2878.