Quality control

Data Breach Response is pre-configured with PI Detectors, also referred to as Out of the Box (OOTB) detectors, in addition to any custom detectors or classifiers added for a project. The Quality Control (QC) process is an optional step that occurs prior to review by the Project Lead. The goal of this step is to minimize the number of documents with over and under predictions. QC also reduces over and under predictions from token level detectors by refining existing detectors or adding new ones. Feedback can also be provided to the machine learning models if the documents are reviewed and ‘Marked Complete’. QC is an iterative process that may vary from project to project. Below are recommended workflows for QCing non-spreadsheet and spreadsheet documents.

Note: Changes made during any of the following steps do not take effect until Incorporate Feedback is re-run. Marking a document complete prevents further changes to that document, so do not mark documents complete during pre-review QC.

QCing unstructured documents for recall

After running Incorporate Feedback, the Project Lead should make sure that all documents with applicable PI within the data set are identified. This is referred to as Document Recall. QCing for recall involves:

  1. Identifying documents that hit on PI search terms, but do not contain PI hits.

  2. Reviewing samples of documents.

  3. Updating the PI Detectors by either:

    • Refining existing detectors.

    • Or adding new, custom PI detectors.

Create QC review queues with no identified PI

To make sure all documents containing personal information are identified within the data set, it is important to QC for documents with no PI hits so that no information is missed.

Note: We recommend QCing documents with PI misses for high frequency PI types such as SSN and project specific PI types. High frequency PI types will vary from project to project. As a baseline, we recommend QC for the following:

  • Date of birth
  • Date of death
  • Social Security Number

To QC Recall:

  1. Use PI and Entity Search to search for documents that do not contain the PI type that you will be QCing.

    • For example, to search for documents that do not contain social security numbers, search for ‘PI Types’ != ‘Social Security Number’. For more information on search syntax, see Search Syntax.

  2. Create a saved search containing these documents.

    • To keep track of what the search contains, state in the saved search name that the documents do not contain that PI type. For example, No PI: Social Security Number.

  3. Create a Review Center queue using the saved search.

  4. Repeat for the remaining PI types you want to QC.
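
Because the search expression and the saved search name follow the same pattern for every PI type, they can be drafted in advance. The sketch below is illustrative only: the helper recall_qc_searches is hypothetical and is not part of Data Breach Response. It only builds the strings you would enter in PI and Entity Search and in the saved search name, using the baseline PI types listed above. Check Search Syntax for the exact quoting your environment expects.

```python
# Hypothetical helper: drafts the search expression and saved search name
# for each PI type to QC for recall. It only builds strings; it does not
# call any Data Breach Response API.
BASELINE_PI_TYPES = ["Date of birth", "Date of death", "Social Security Number"]


def recall_qc_searches(pi_types):
    """Return one (saved_search_name, search_expression) pair per PI type."""
    searches = []
    for pi_type in pi_types:
        expression = f"'PI Types' != '{pi_type}'"
        name = f"No PI: {pi_type}"
        searches.append((name, expression))
    return searches


for name, expression in recall_qc_searches(BASELINE_PI_TYPES):
    print(f"{name}  ->  {expression}")
```

Each pair corresponds to one saved search and one Review Center queue.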

Review documents

Once you create QC batches, you should review them using the instructions outlined in Detector QC. As you review documents, store suggestions for detector changes in a shared spreadsheet or other platform to collaborate with others executing QC. You should not mark documents complete during this review.

Refining detectors

If you discover any information during document review that should be captured by a token level detector, you can update detectors during this process. There are three possible actions you can take. For more information on updating detectors, see Personal information detectors.

Update an existing detector

  1. Update the keywords, regexes, or both.
  2. Validate that the detector and keywords capture the target information.
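
One way to approach the validation step is to test the updated pattern against text collected during review before saving it to the detector. The following is a minimal sketch using Python's standard re module. The SSN pattern, keyword list, and sample strings are assumptions made up for illustration and are not the OOTB detector definitions. It also assumes a detector registers a hit only when both a keyword and a regex match are present, which may differ from how your detector is configured.

```python
import re

# Assumed, simplified pattern for illustration; not the OOTB SSN detector.
SSN_REGEX = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
KEYWORDS = ["ssn", "social security"]  # hypothetical detector keywords


def detector_hits(text):
    """Return regex matches, but only if a keyword also appears in the text."""
    lowered = text.lower()
    if not any(keyword in lowered for keyword in KEYWORDS):
        return []
    return SSN_REGEX.findall(text)


# Samples gathered during QC: text that should hit and text that should not.
should_hit = ["Employee SSN: 123-45-6789", "Social Security no. 987-65-4321"]
should_miss = ["Order ref 123-45-6789 shipped", "SSN on file: see attachment"]

missed = [s for s in should_hit if not detector_hits(s)]
over_captured = [s for s in should_miss if detector_hits(s)]
print("Missed:", missed)
print("Over captured:", over_captured)
```

If either list is not empty, adjust the keywords or regex and run the samples again.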

Add a new detector

Add a new detector when the required PI can’t be captured by an OOTB detector.

Add a document level classifier

Add a new document level classifier when:

  • The required PI can't be consistently located within a document, but the document itself can be located with keywords or groups of keywords. This could include handwritten or poorly scanned documents.
  • You want to group types of documents, such as Birth Certificates, Medical Records, or Curriculum Vitae.

Document-level classifiers apply to the document as a whole and do not locally identify or highlight individual pieces of PI.
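
To illustrate how groups of keywords can locate a document even when individual PI values cannot be read, here is a rough sketch. The matching rule (a document matches when every keyword in at least one group is present) and the keyword groups themselves are assumptions for the example, not the product's classifier logic.

```python
# Hypothetical keyword groups: a document is tagged when every keyword in
# at least one group appears in its text. Group contents are made up.
BIRTH_CERTIFICATE_GROUPS = [
    ["certificate of birth"],
    ["birth", "registrar", "place of birth"],
]


def matches_classifier(text, keyword_groups):
    """Return True if all keywords of any one group occur in the text."""
    lowered = text.lower()
    return any(
        all(keyword in lowered for keyword in group) for group in keyword_groups
    )


# Poorly scanned text: the date of birth itself is unreadable, but enough
# surrounding keywords survive to identify the document type.
scanned_text = "STATE REGISTRAR ... Place of Birth: Sp1ngfield ... Date of b1rth"
print(matches_classifier(scanned_text, BIRTH_CERTIFICATE_GROUPS))
```

The result applies to the document as a whole; nothing in the text is highlighted.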

QCing unstructured documents for precision

Reducing the number of documents with over captures is another important step in the QC process. It lowers document and annotation volumes, which reduces the work for the review team.

To QC for precision on unstructured documents:

  1. Perform Blocklisting on PI with a large number of hits, especially a large number of PI-to-Document hits. The most common PI types where this occurs are phone number and full address.

  2. Review samples of documents to identify common rules to apply to reduce over captures. The steps below outline this approach.

Create QC review queues containing PI

Note: We recommend QCing documents with each of the following attributes:

  • Contains PI. This allows you to generate a random sample of documents.
  • Contains phone number
  • Contains full address
  • Contains partial date of birth
  • Contains PI types that look “off”
    • To prioritize remaining PI types for QC, look at PI counts for each PI type in the document report. An example of a PI type looking “off” is if the project contains data from an accounting firm and there are a high number of Patient IDs. This is unexpected and should be investigated.
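
If the PI counts are exported from the document report, spotting PI types that look “off” can be sketched as a simple tally. The row layout, sample values, and expected PI type list below are hypothetical and only illustrate the idea.

```python
from collections import Counter

# Hypothetical rows exported from a document report: (document ID, PI type).
report_rows = [
    ("DOC-1", "Phone number"),
    ("DOC-1", "Patient ID"),
    ("DOC-2", "Patient ID"),
    ("DOC-3", "Full address"),
    ("DOC-3", "Patient ID"),
]

# PI types you would expect for this data set (assumed for the example).
EXPECTED_PI_TYPES = {"Phone number", "Full address", "Social Security Number"}


def unexpected_pi_types(rows, expected):
    """Count hits per PI type and return the unexpected ones, most hits first."""
    counts = Counter(pi_type for _doc_id, pi_type in rows)
    return [(t, n) for t, n in counts.most_common() if t not in expected]


print(unexpected_pi_types(report_rows, EXPECTED_PI_TYPES))
```

PI types returned by this tally are candidates for their own QC queue.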

To QC for precision:

  1. Use PI and Entity Search to search for documents that contain the PI type that you will be QCing.

    • For example, to search for documents that contain PI, search for CONTAINS PI. For more information, see Search Syntax.

  2. Create a saved search containing these documents.

    • To keep track of what the search contains, state the PI type the documents contain in the saved search name. For example, Contains PI.

  3. Create a Review Center queue using the saved search.
  4. Repeat for the remaining PI types you want to QC.

Review documents

Once you create QC batches, you should review them using the instructions outlined in Detector QC. As you review documents, store suggestions for detector changes in a shared spreadsheet or other platform to collaborate with others executing QC. You should not mark documents complete during this review.

Refining detectors

For consistent over captures, you can take the following actions:

  1. Add a local blocklist keyword, which blocks a hit when the term is within a set distance of the regex match.

    • For example, telling the system to ignore all account numbers within 50 characters of the term index.

  2. Add a global blocklist keyword that prevents that detector from registering if the global blocklist keyword is present on the document.

  3. If there are hundreds or thousands of the same phone number or address located in the dataset, use the Blocklist tool to blocklist this information. Phone numbers and mailing addresses are the most common examples. For more information, see Blocklisting.
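
The difference between a local and a global blocklist keyword can be sketched as follows. This is not the product's implementation: the account number pattern, the function name, and the way distance is measured (characters on either side of the match) are assumptions for illustration.

```python
import re

# Assumed, simplified account number pattern for illustration only.
ACCOUNT_REGEX = re.compile(r"\b\d{8,12}\b")


def account_number_hits(text, local_terms=(), distance=50, global_terms=()):
    """Return account number matches that survive both blocklists."""
    lowered = text.lower()
    # Global blocklist: a term anywhere in the document suppresses the detector.
    if any(term.lower() in lowered for term in global_terms):
        return []
    hits = []
    for match in ACCOUNT_REGEX.finditer(text):
        # Local blocklist: check only text within `distance` characters of the match.
        window = lowered[max(0, match.start() - distance): match.end() + distance]
        if not any(term.lower() in window for term in local_terms):
            hits.append(match.group())
    return hits


text = "Index 0012345678 " + "." * 80 + " Account 9988776655"
print(account_number_hits(text, local_terms=["index"]))
print(account_number_hits(text, global_terms=["index"]))
```

With index as a local term, only the number next to it is ignored. With index as a global term, the detector registers nothing on the document.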

QC of detectors and boundaries in spreadsheets

In addition to identifying personal information within non-spreadsheet documents, Data Breach Response also identifies table boundaries in spreadsheets and the personal information within them.

Spreadsheet QC for recall

When reviewing spreadsheets for recall, the goal is to ensure that table boundaries are being properly detected, and that detectors are capturing any unstructured PI within the tables. To achieve this, create Recall batches for spreadsheet documents.

Table Boundaries

When table boundaries are off, the likelihood of under captures increases. For documents where the boundaries need to be adjusted, make the column annotations and mark the document complete.

Unstructured PI

Spreadsheets can also have localized text-based PI detections. Be sure to scan through Notes and Comments columns, as they could contain a mix of PI types that may not be appropriate for full column annotation.

Spreadsheet QC for precision

To review spreadsheet headers for precision, use the Spreadsheet QC tool. For instructions on using the Spreadsheet QC tool, see Spreadsheet QC tool.

Frequently asked questions