PII-Codex: a Python library for PII detection, categorization, and severity assessment

Summary
There have been a number of advancements in personally identifiable information (PII) detection and scrubbing libraries to aid developers and researchers in their detection and anonymization efforts. With the recent shift in data handling procedures and global policy implementations regarding identifying information, it is increasingly important for data consumers to be aware of what data needs to be scrubbed, why it is being scrubbed, and to have the means to perform that scrubbing.
PII-Codex is a collection of extended theoretical, conceptual, and policy works in PII categorization and severity assessment (Milne et al., 2016; Schwartz & Solove, 2011), and the integration thereof with PII detection software and API client adapters. It allows researchers to analyze a body of text, or a collection thereof, and determine whether any PII detected within these texts is considered identifiable. Furthermore, it allows end-users to determine the severity and associated categorizations of detected PII tokens.

Challenges
While a number of open-source PII detection libraries have been created and PII detection APIs are provided by cloud service providers (Azure, 2022; Services, 2022), the detection results typically include only the type of PII detected, an index reference of where the detection occurs within the text, and a confidence score associated with the detection. Those receiving these results are not provided with a means of understanding why the text token is classified as PII, what framework, policy, or convention labels it as such, or just how severe its exposure is.

Statement of Need
The general knowledge base of identifiable data, the usage restrictions of this data, and the associated policies surrounding it have shifted drastically over the years. Between the mid-1990s and 2000s, during the dotcom bubble, the industry saw a rise in data capitalism by way of making information freely accessible, fostering a way to make the web personal, and, finally, placing value on data and its potential to impact consumerism (West, 2017). Alongside the rise in data capitalism came early data policy initiatives. In 1995, the EU Data Protection Directive was created to establish minimum data privacy and security standards (2022), and the US Health Insurance Portability and Accountability Act (HIPAA) was enacted in 1996, with the final regulation published in 2000 (OCR, 2022), to help battle healthcare fraud and to regulate the privacy and security of an individual's patient details. Both of these policies have evolved over the years to include protected entities and have paved the way for the policies and protective technologies the world sees today aimed at protecting PII.
The tech industry specifically has had to adjust to these policy changes regarding the tracking of individuals, the usage of data from online profiles and platforms, and the right to be forgotten entirely from a service or platform (Right to Erasure, 2022). While the shift has provided data protections around the globe, the majority of technology users continue to have little to no control over their personal information held by third-party data consumers (Tene & Polonetsky, 2012; Trepte, 2020). From an individual researcher's perspective, understanding whether identifiable data types exist in a data set can prevent accidental sharing of such data by allowing its detection in the first place and, in the case of this software package, permits results to be published by sanitizing the text tokens while providing transparency on why each token was considered PII. From a platform user's perspective, detecting PII ahead of publication and understanding why it is considered PII can prevent an accidental disclosure that could later be exploited by adversaries. This need is what drives the development of PII-Codex. PII-Codex assigns detected PII to the categories of Milne et al.'s Information Sensitivity Typology (Milne et al., 2016) and combines these categories to rate each detection on a scale of 1 to 3, labeling it Non-Identifiable, Semi-Identifiable, or Identifiable as presented in the risk continuum by Schwartz and Solove (Schwartz & Solove, 2011). The package provides a subset of Milne et al.'s Information Sensitivity Typology, as some technologies group entries into a single category or detection of an entry may not yet be available.

The PII-Codex Package
Built into the package is an analyzer service that leverages Microsoft's Presidio library for PII detection and anonymization (Microsoft, n.d.) as well as the option to use the built-in detection adapters for Microsoft Presidio, Azure Detection Cognitive Skill (Azure, 2022), and AWS Comprehend (Services, 2022) for pre-existing detections. The output of the adapters and the analysis service are analysis objects with a listing of detections, detection frequencies, severities, mean risk scores for each string processed, and summary statistics on the analysis made.
The final outputs do not contain the original texts but provide the sanitized or anonymized texts and where to find the detections, should the end-user require this information. In providing this capability, one can prevent the accidental dissemination of private information in downstream research efforts, an issue commonly discussed in cybersecurity research (Beigi & Liu, 2020; Bélanger & Crossler, 2011; Moura & Serrão, 2019).
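To illustrate what span-based sanitization of this kind involves, the following is a minimal sketch in plain Python; the `sanitize` helper and its signature are illustrative assumptions, not the PII-Codex API:

```python
# Illustrative sketch (not the PII-Codex API): replace each detected span
# with a placeholder token, keeping only the sanitized text downstream.
def sanitize(text, detections, token="<REDACTED>"):
    """Replace each (start, end) detection span in text with the token."""
    out = []
    cursor = 0
    for start, end in sorted(detections):
        out.append(text[cursor:start])  # untouched text before the span
        out.append(token)               # placeholder for the PII token
        cursor = end
    out.append(text[cursor:])           # remainder after the last span
    return "".join(out)

original = "Hi! My phone number is 555-0100"
# A detector might report the phone number at character offsets 23-31:
print(sanitize(original, [(23, 31)]))  # → Hi! My phone number is <REDACTED>
```

The detection offsets themselves are retained alongside the sanitized string, so downstream consumers can locate the redactions without ever seeing the original PII.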

Design
PII-Codex is broken down into a series of services, utilities, and adapters. In the majority of cases, end-users may already have used Microsoft Presidio, Azure, AWS Comprehend, or some other solution to detect PII in text. To account for these cases, adapters are provided to convert the varying detection results into a common form, the DetectionResultItem and DetectionResult objects, which are later used by the Analysis Service and Assessment Service. This usage flow is presented in Figure 1. As shown in Figure 2, for end-users that still require detections to be carried out, Microsoft Presidio is integrated as the primary analysis provider within the Analysis Service. The Analysis and Assessment services expose functions for those defining their own detectors and enable the conversion to a common detection type so that the full Analysis Result set can be built.
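A conversion of this kind can be sketched as follows; the DetectionResultItem field names here are assumptions for illustration (the real PII-Codex class may differ), while the input dictionary mirrors the entity shape returned by AWS Comprehend's DetectPiiEntities API:

```python
from dataclasses import dataclass

# Hypothetical common detection form, for illustration only.
@dataclass
class DetectionResultItem:
    entity_type: str
    score: float
    start: int
    end: int

def from_aws_comprehend(entity: dict) -> DetectionResultItem:
    """Map one AWS Comprehend DetectPiiEntities entity to the common form."""
    return DetectionResultItem(
        entity_type=entity["Type"],
        score=entity["Score"],
        start=entity["BeginOffset"],
        end=entity["EndOffset"],
    )

item = from_aws_comprehend(
    {"Type": "EMAIL", "Score": 0.99, "BeginOffset": 10, "EndOffset": 28}
)
```

Normalizing every provider's output into one shape is what lets the Analysis and Assessment services stay provider-agnostic.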

Example Usage
The collection analysis accepts either a list of strings via the texts parameter or a DataFrame with a text column via the data parameter. The collection is analyzed and a summary provided in an AnalysisResultSet object, which shows individual detections and their risk assessments, including the risk score assessment and associated PII categories. Each analysis is provided with the sanitized input text when using the default analysis service. Unless another replacement token is supplied, the sanitized input text will contain <REDACTED> in place of detected PII tokens, e.g., "Hi! My phone number is <REDACTED>." Email detections, for example, are categorized as Identifiable, which automatically places them at a risk level of 3, the highest level a token can be assigned. A URL is considered Semi-Identifiable and is therefore assigned a risk level of 2. All other texts fall under Non-Identifiable and are assigned a risk level of 1.
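The category-to-risk-level assignment described above can be sketched as a simple lookup; the dictionary and function names are illustrative assumptions, not the PII-Codex API:

```python
# Illustrative mapping (not the PII-Codex API): the three labels of the
# Schwartz & Solove risk continuum mapped to risk levels 1-3.
RISK_LEVELS = {
    "Non-Identifiable": 1,
    "Semi-Identifiable": 2,
    "Identifiable": 3,
}

# Example categorizations from the text: emails are Identifiable,
# URLs are Semi-Identifiable; anything uncategorized defaults to
# Non-Identifiable in this sketch.
CATEGORY_BY_ENTITY = {
    "EMAIL_ADDRESS": "Identifiable",
    "URL": "Semi-Identifiable",
}

def risk_level(entity_type: str) -> int:
    """Return the 1-3 risk level for a detected entity type."""
    category = CATEGORY_BY_ENTITY.get(entity_type, "Non-Identifiable")
    return RISK_LEVELS[category]

risk_level("EMAIL_ADDRESS")  # → 3
risk_level("URL")            # → 2
```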
For collections of strings being analyzed, each per-string risk score mean is taken into account to provide a collection-wide risk score mean. Given that a collection can have n analyzed strings, the collection risk score mean can be calculated with the mean of means formula below.

$\bar{x} = \frac{\bar{x}_1 + \bar{x}_2 + \cdots + \bar{x}_n}{n}$ (2)

In the AnalysisResult object, the mean risk score of all detected tokens in a string is provided as the risk score mean. In the AnalysisResultSet object, the mean of means, that is, the average of all per-string risk score means, is provided as the risk score mean.
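The per-string means and the collection-wide mean of means can be sketched in plain Python; the function names are illustrative, not the PII-Codex API:

```python
from statistics import mean

def risk_score_mean(detection_risk_levels):
    """Mean risk level over all detected tokens in one string."""
    return mean(detection_risk_levels)

def collection_risk_score_mean(per_string_detections):
    """Mean of the per-string means across the whole collection."""
    return mean(risk_score_mean(d) for d in per_string_detections)

# Three analyzed strings with the risk levels of their detected tokens:
# per-string means are 3.0, 1.5, and 2.0, so the mean of means is 6.5 / 3.
scores = [[3, 3], [1, 2], [2]]
collection_risk_score_mean(scores)
```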