Personal Information detectors

From the Settings tab, you can view all Personal Information (PI) detectors and tags, add new detectors and tags, and turn detector tags on and off.

Definitions

The following terms are used throughout the Personal Information detectors documentation:

  • Detector (PI)— A combination of AI models, regular expressions (RegExes) and keywords that detect a string of text and classify it as a form of Personal Information.
    Custom PI detectors must have at least one regular expression.
  • Primary Deduplication Identifier—detectors where the expected ratio between a person and the PI Type is 1:1, meaning a single person can only have one of that PI value and that PI value can only apply to one person. Examples of this include Social Security Number and Passport Number.
  • Secondary Deduplication Identifier—detectors where the expected ratio between a person and the PI Type is 1:many in either direction: a single person can have many values of this PI type while each value applies to only one person, or a single person has only one value while the same value can appear on different entities. Examples of this include email address and date of birth.
  • Tertiary Deduplication Identifier—detectors where the expected ratio between a person and the PI Type is many:many, meaning a single person can have many PI values of this type and a given PI value can apply to many people. Examples of this include address and phone number.
  • Out of the Box (OOTB) Detectors - pre-built detectors created for Data Breach Response.
  • Custom Detectors - detectors created by users to meet the particular needs of a given project that the OOTB offerings do not cover.
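
The three deduplication tiers above are defined by cardinality, which can be illustrated with a short Python sketch. This is only an illustration of the definitions: in the product the identifier is a setting chosen per detector, not something inferred from data, and the function name and sample values below are hypothetical.

```python
from collections import defaultdict
from typing import Iterable, Tuple


def classify_identifier(pairs: Iterable[Tuple[str, str]]) -> str:
    """Name the tier implied by observed (person, PI value) pairs (hypothetical helper)."""
    values_per_person = defaultdict(set)
    people_per_value = defaultdict(set)
    for person, value in pairs:
        values_per_person[person].add(value)
        people_per_value[value].add(person)

    # Does any person hold more than one value of this PI type?
    person_has_many = any(len(v) > 1 for v in values_per_person.values())
    # Does any single value belong to more than one person?
    value_is_shared = any(len(p) > 1 for p in people_per_value.values())

    if person_has_many and value_is_shared:
        return "Tertiary"   # many:many
    if person_has_many or value_is_shared:
        return "Secondary"  # "many" on exactly one side
    return "Primary"        # 1:1


# Made-up sample data for each tier
ssn = [("alice", "123-45-6789"), ("bob", "987-65-4321")]
email = [("alice", "a@example.com"), ("alice", "alice@work.example"), ("bob", "b@example.com")]
dob = [("alice", "1990-01-01"), ("bob", "1990-01-01")]
phone = [("alice", "555-0100"), ("alice", "555-0199"), ("bob", "555-0100")]

print(classify_identifier(ssn))    # Primary
print(classify_identifier(email))  # Secondary
print(classify_identifier(dob))    # Secondary
print(classify_identifier(phone))  # Tertiary
```

Email address and date of birth both land in the Secondary tier, but for opposite reasons: one person may hold several email addresses, while one date of birth may be shared by several people.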

Permissions

Settings are available for users assigned the role of Lead.

Viewing detectors

To view the Detectors table, open the Settings tab and click the Detectors subtab.

Detectors fields

The following fields appear on the Detectors table:

  • Detector—this includes names of Out of the Box Detectors and Custom Detectors
  • Description—a description of each detector
  • Enabled—whether the detector is turned on (Yes/No)
    • Detectors set to No will not be run during Data Analysis.
  • Created By—lists whether the detector was created by the system or a user.
  • Category—lists the detector category
  • Deduplication Identifier—the settings that will be used in entity normalization.

Click on any detector to see the details for that detector.

You can sort or filter the list using the filter icon in the top right corner of the table.

Creating custom detectors

To create a Custom PI Detector:

  1. Click Add Detector in the top right corner of the Detectors table.
  2. Fill out the fields about the new Detector. Name and Category are required.
    If you do not select a value for Deduplication Identifier, it will default to Tertiary.

  3. Click Next.
  4. Add Regexes for your Custom Detector.
    1. For more information on regular expressions, see Frequently asked questions.
    2. Specify a Match Group for the regular expression, if necessary.
      • The match group indicates which capture group contains the PI.
        For example, take the following regular expression: (ssn|social security number)\s*+:\s*+(\d{3}-\d{2}-\d{4})
        This regular expression contains two capture groups, (ssn|social security number) and (\d{3}-\d{2}-\d{4}), but only group 2 contains the personal information to be captured. Therefore, the match group would be set to 2.
    3. Click Add.
    4. Repeat as necessary. You can include several regexes for a single Custom Detector.
  5. Add Keywords for your Custom Detector.
    1. Specify a Type for the keyword:
      • Global Keyword— A global keyword term is a term that must appear somewhere in the body of the document. If a global keyword is not found in the document, the detector will not return PI matches.
      • Global Blocklist Keyword— A global blocklist term is a term that must not appear anywhere in the body of the document. If a global blocklist term is found in the document, the detector will not return any PI matches.
      • Local Keyword— A local keyword term is a term that must appear near a PI matched via a regex pattern. You can specify a maximum distance in characters to indicate how far away the term should be on either side of the PI found. If the term is not found within the specified distance, the detector will not return that PI match.
      • Local Blocklist Keyword— A local blocklist term is a term that must not appear in the vicinity of a PI match. You can specify a maximum distance. If the local blocklist term appears within that distance of a PI match, the PI match will not be returned.
    2. If you select a Local Keyword or a Local Blocklist Keyword, specify a Max Keyword Distance.
      • The Max Keyword Distance dictates how far away a keyword is permitted to be on either side of information found by a regular expression.
      • The default value is 40 characters.
  6. When complete, click Save.
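
The Match Group setting from step 4 can be tried outside the product with a short Python sketch. It uses the example regular expression from that step, with one assumption: the possessive quantifier \s*+ is written as \s* so that the code also runs on Python versions before 3.11. For this pattern both forms match the same text. The function name and sample text are hypothetical.

```python
import re
from typing import List

# Example pattern from the procedure, with \s*+ relaxed to \s* (see lead-in).
PATTERN = re.compile(r"(ssn|social security number)\s*:\s*(\d{3}-\d{2}-\d{4})")

# Group 1 is only the label; group 2 holds the PI to capture.
MATCH_GROUP = 2


def extract_pi(text: str) -> List[str]:
    """Return the text of the configured match group for every match."""
    return [m.group(MATCH_GROUP) for m in PATTERN.finditer(text)]


sample = "ssn: 123-45-6789 on file; social security number : 987-65-4321"
print(extract_pi(sample))  # ['123-45-6789', '987-65-4321']
```

Setting MATCH_GROUP to 1 would return the labels ("ssn", "social security number") instead of the numbers, which is why the match group must point at the group that contains the PI.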

The Custom Detector will now appear in your Detectors list.
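
The four keyword types from step 5 can be read as filters applied to regex matches. The following Python sketch shows one possible reading and is not the product's implementation. It assumes that keyword matching is case-insensitive, that one keyword of a required type is enough, and that distance is counted in characters from the edges of the matched PI. All names in it are hypothetical.

```python
import re
from typing import List, Sequence

MAX_KEYWORD_DISTANCE = 40  # default value stated in the procedure, in characters


def _near(text: str, keyword: str, start: int, end: int, distance: int) -> bool:
    """True if keyword lies within `distance` characters on either side of text[start:end]."""
    window = text[max(0, start - distance): end + distance]
    return keyword.lower() in window.lower()


def filter_matches(text: str, pattern: str, match_group: int = 0,
                   global_kw: Sequence[str] = (), global_block: Sequence[str] = (),
                   local_kw: Sequence[str] = (), local_block: Sequence[str] = (),
                   distance: int = MAX_KEYWORD_DISTANCE) -> List[str]:
    lowered = text.lower()
    # Global Keyword: must appear somewhere in the document (assumption: any one suffices).
    if global_kw and not any(k.lower() in lowered for k in global_kw):
        return []
    # Global Blocklist Keyword: must not appear anywhere in the document.
    if any(k.lower() in lowered for k in global_block):
        return []

    kept = []
    for m in re.finditer(pattern, text):
        start, end = m.span(match_group)
        # Local Keyword: must appear near this particular match.
        if local_kw and not any(_near(text, k, start, end, distance) for k in local_kw):
            continue
        # Local Blocklist Keyword: must not appear near this particular match.
        if any(_near(text, k, start, end, distance) for k in local_block):
            continue
        kept.append(m.group(match_group))
    return kept


PATTERN = r"\b\d{3}-\d{2}-\d{4}\b"
doc = "Employee record. SSN 123-45-6789." + " " * 60 + "Order ref 111-22-3333 shipped."

print(filter_matches(doc, PATTERN))                        # ['123-45-6789', '111-22-3333']
print(filter_matches(doc, PATTERN, local_kw=["ssn"]))      # ['123-45-6789']
print(filter_matches(doc, PATTERN, global_block=["shipped"]))  # []
```

In the sample, the second number is dropped by the local keyword because "SSN" sits more than 40 characters away from it, while a global blocklist term suppresses every match in the document regardless of position.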

Editing detectors

To enable or disable a Detector:

  1. Click on the detector you would like to modify.
  2. From the detector detail view, adjust the Enabled toggle.
  3. Click Save.

Limitations

The quality of PI detections may be impacted by the quality of unstructured documents.

Data Breach Response uses extracted text to create PI detections on unstructured documents. The formatting of the incoming text will affect the performance of PI detections. The following are examples of what will impact detection performance:

  • Optical character recognition (OCR) quality. If your source data consists of images and you use OCR to generate the text, incorrectly recognized text may degrade detector performance.
  • Lack of standard punctuation or casing.
  • CJK characters, which Data Breach Response does not support.

Data Breach Response’s PI detectors support only single-language text. If a document includes multiple languages, the output may not be accurate. For documents whose primary language is not English, run structured analytics language identification against your document set. Data Breach Response will use the primary language identified when running Data Analysis.
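
Because CJK characters are not supported, it can help to flag affected documents before analysis. The Python sketch below is one possible pre-check on extracted text and is not a Data Breach Response feature. The Unicode ranges are an assumed, non-exhaustive selection, and the function names and threshold are hypothetical.

```python
from typing import Tuple

# Assumed, non-exhaustive Unicode ranges treated as "CJK" for this sketch.
CJK_RANGES: Tuple[Tuple[int, int], ...] = (
    (0x3040, 0x309F),  # Hiragana
    (0x30A0, 0x30FF),  # Katakana
    (0x3400, 0x4DBF),  # CJK Unified Ideographs Extension A
    (0x4E00, 0x9FFF),  # CJK Unified Ideographs
    (0xAC00, 0xD7AF),  # Hangul Syllables
)


def cjk_ratio(text: str) -> float:
    """Share of non-whitespace characters that fall inside CJK_RANGES."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    hits = sum(1 for c in chars if any(lo <= ord(c) <= hi for lo, hi in CJK_RANGES))
    return hits / len(chars)


def needs_review(text: str, threshold: float = 0.0) -> bool:
    """Flag a document whose CJK share exceeds the threshold (hypothetical cut-off)."""
    return cjk_ratio(text) > threshold


print(needs_review("SSN 123-45-6789"))  # False
print(needs_review("氏名: 山田"))         # True
```

A threshold above zero would tolerate occasional stray characters, for example from OCR noise, while still flagging documents that are largely written in a CJK script.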

Frequently asked questions