Last date modified: 2026-May-05

Quality control

aiR for Data Breach Response includes pre-configured PI detectors (also called out-of-the-box, or OOTB, detectors) and supports any custom detectors you add. Quality control (QC) is an optional step that you perform before the Project Lead begins review. The goal is to minimize the number of documents with over-predictions and under-predictions by refining existing detectors or adding new ones. Because QC is iterative and varies from project to project, the workflows below are recommendations for both unstructured and structured (spreadsheet) documents.

Changes you make during these steps take effect only after you re-run Data Analysis. Keep documents unlocked during pre-review QC so that you can continue making updates to them.

QCing unstructured documents for recall

After you run Data Analysis, verify that the system identifies all documents in the data set that contain applicable PI. This verification process is called Document Recall. It involves the following steps:

  1. Identify documents that match PI search terms but do not have PI hits.
  2. Review samples of those documents.
  3. Update PI detectors by refining existing detectors or adding new custom detectors.
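
Step 1 amounts to a set difference: documents that match a PI search term, minus documents where the detectors already registered that PI type. The Python sketch below illustrates the idea only. The `Doc` record, its `pi_hits` field, and the search term are hypothetical and are not part of the product.

```python
import re
from dataclasses import dataclass, field

# Hypothetical in-memory record; the product stores this information differently.
@dataclass
class Doc:
    doc_id: str
    text: str
    pi_hits: set = field(default_factory=set)  # PI types the detectors found

# Illustrative search term: wording that suggests an SSN may be present.
SSN_TERM = re.compile(r"\b(ssn|social security)\b", re.IGNORECASE)

def recall_candidates(docs, term, pi_type):
    """Return docs that match the search term but have no hit for pi_type."""
    return [d for d in docs if term.search(d.text) and pi_type not in d.pi_hits]

docs = [
    Doc("A", "SSN: 123-45-6789", {"Social Security Number"}),
    Doc("B", "Employee social security no. on file: 123 45 6789"),
    Doc("C", "Quarterly revenue summary"),
]
candidates = recall_candidates(docs, SSN_TERM, "Social Security Number")
candidate_ids = [d.doc_id for d in candidates]
```

Here only document B is a recall candidate: it mentions a Social Security number, but its space-separated format produced no detector hit.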

Create QC review queues with no identified PI

To confirm that no personal information in the data set is missed, QC the documents that have no PI hits.

We recommend QCing documents with PI misses for high-frequency PI types, such as SSN, and for project-specific PI types. High-frequency PI types vary from project to project. As a baseline, we recommend QCing for the following:

  • Date of birth
  • Date of death
  • Social Security Number

To QC Recall:

  1. Use PI and Entity Search to search for documents that do not contain the PI type that you will be QCing.
    • For example, to search for documents that do not contain social security numbers, search for ‘PI Types’ != ‘Social Security Number’. For more information on search syntax, see Search Syntax.
      An image of PI and entity search used to search for documents without social security numbers.
  2. Create a saved search containing these documents.
    An image of a saved search being created.
    • To keep track of what the search contains, state in the saved search name that the documents lack that PI type. For example, No PI: Social Security Number.
  3. Create a Review Center queue using the saved search.
    • The number of documents and time spent on QCing will vary from matter to matter. For recommendations on how to structure the QC process, see Quality control.
  4. Repeat for the remaining PI types you want to QC.
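
Because step 4 repeats the same search for each PI type, it can help to write out the search expressions and saved search names as a checklist first. The sketch below only builds strings that follow the example and naming convention above. It does not call any product API, and the exact PI type names in your workspace may differ.

```python
# PI types to QC for recall; the baseline list comes from the recommendation above.
PI_TYPES = ["Date of birth", "Date of death", "Social Security Number"]

def recall_search(pi_type):
    """Build the search expression and saved search name for one PI type."""
    return {
        "query": f"‘PI Types’ != ‘{pi_type}’",
        "saved_search_name": f"No PI: {pi_type}",
    }

plan = [recall_search(t) for t in PI_TYPES]
for entry in plan:
    print(entry["saved_search_name"], "->", entry["query"])
```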

Review documents

Once you create QC batches, you should review them using the instructions outlined in Detector QC. As you review documents, store suggestions for detector changes in a shared spreadsheet or other platform to collaborate with others executing QC. You should not mark documents complete during this review.

Refining custom detectors

If you find information during document review that a token-level detector should capture, you can update detectors as you go. You can take either of the two actions below. For more information on updating detectors, see Personal information detectors.

Update an existing detector

  1. Update the keywords, regexes, or both.
  2. Validate that the detector and keywords capture the target information.
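
Before saving an updated detector, you can test the regex and keywords offline against sample strings gathered during review. The sketch below assumes a simplified rule in which a detector fires when the regex matches and at least one keyword appears in the text. The product's actual matching logic may differ, and the employee ID format shown is hypothetical.

```python
import re

def validate_detector(pattern, keywords, should_match, should_not_match):
    """Return the samples on which the candidate detector behaves unexpectedly."""
    regex = re.compile(pattern)
    lowered = [k.lower() for k in keywords]

    def fires(text):
        # Assumed rule: the regex must match and at least one keyword must appear.
        has_keyword = any(k in text.lower() for k in lowered)
        return bool(regex.search(text)) and has_keyword

    missed = [s for s in should_match if not fires(s)]
    false_hits = [s for s in should_not_match if fires(s)]
    return missed, false_hits

# Hypothetical employee ID format: "EMP-" followed by six digits.
missed, false_hits = validate_detector(
    pattern=r"\bEMP-\d{6}\b",
    keywords=["employee id", "emp no"],
    should_match=["Employee ID: EMP-004217", "emp no EMP-118830"],
    should_not_match=["Invoice EMP-12 for employee id badge", "Order EMP-004217"],
)
```

Both returned lists are empty when the candidate behaves as intended on every sample. Any string in `missed` or `false_hits` points to a keyword or regex that needs another pass.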

Add a new detector

Add a new detector when the required PI can’t be captured by an OOTB detector.

QCing unstructured documents for precision

Reducing the number of documents with over captures is another important QC step. It lowers document and annotation volumes, which reduces the work for the review team.

To QC for precision on unstructured documents:

  1. Perform Blocklisting on PI with a large number of hits, especially a large number of PI-to-Document hits. The most common PI types where this occurs are phone number and full address.
  2. Review samples of documents to identify common rules to apply to reduce over captures. The steps below outline this approach.
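
For step 1, one way to shortlist blocklist candidates is to count how many documents each detected value appears in. The sketch below assumes you have the hits available as `(doc_id, pi_type, value)` tuples. That layout, the sample values, and the `min_docs` threshold are illustrative assumptions, not a product export format.

```python
from collections import defaultdict

def blocklist_candidates(hits, min_docs):
    """Return (pi_type, value, doc_count) for values found in at least min_docs documents."""
    docs_by_value = defaultdict(set)
    for doc_id, pi_type, value in hits:
        docs_by_value[(pi_type, value)].add(doc_id)
    rows = [
        (pi_type, value, len(doc_ids))
        for (pi_type, value), doc_ids in docs_by_value.items()
        if len(doc_ids) >= min_docs
    ]
    # Most widespread values first; ties broken alphabetically for stable output.
    return sorted(rows, key=lambda r: (-r[2], r[0], r[1]))

hits = [
    ("D1", "Phone Number", "555-0100"),
    ("D2", "Phone Number", "555-0100"),
    ("D3", "Phone Number", "555-0100"),
    ("D3", "Phone Number", "555-0199"),
    ("D1", "Full Address", "1 Main St"),
    ("D2", "Full Address", "1 Main St"),
]
candidates = blocklist_candidates(hits, min_docs=2)
```

Values that appear in many documents, such as a company switchboard number or office address, are usually the first ones worth reviewing for blocklisting.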

Create QC review queues containing PI

We recommend QCing documents with each of the following attributes:
  • Contains PI — Use this to generate a random sample of documents.
  • Contains phone number
  • Contains full address
  • Contains partial date of birth
  • Contains unexpected PI types — For example, if a project contains data from an accounting firm and has a high number of Patient IDs, this warrants investigation.

To QC for precision:

  1. Use PI and Entity Search to search for documents that contain the PI type that you will be QCing.
    • For example, to search for documents that contain PI, search for CONTAINS PI. For more information, see Search Syntax.
      An image of PI and entity search used to search for documents containing PI.
  2. Create a saved search containing these documents.
    An image of a saved search being created.
    • Use the saved search name to describe the criteria. For example, Contains PI or Contains Phone Number.
  3. Create a Review Center queue using the saved search.
    • The number of documents and time spent on QCing will vary from matter to matter. For recommendations on how to structure the QC process, see Frequently asked questions.
  4. Repeat for the remaining PI types you want to QC.

Review documents

Once you create QC batches, you should review them using the instructions outlined in Detector QC. As you review documents, store suggestions for detector changes in a shared spreadsheet or other platform to collaborate with others executing QC. You should not mark documents complete during this review.

Refining detectors

For consistent over captures, you can take the following actions:

  1. Add a local blocklist keyword, which suppresses a hit when the keyword appears within a set distance of the regex match.
    • For example, tell the system to ignore all account numbers within 50 characters of the term index.
  2. Add a global blocklist keyword, which prevents the detector from registering at all when the keyword is present anywhere in the document.
  3. Blocklist values that repeat across the data set. Common examples include phone numbers and mailing addresses.
    • If the same phone number or address appears hundreds or thousands of times in the data set, use the Blocklist tool to blocklist it. For more information, see Blocklisting.
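
The difference between local and global blocklist keywords can be sketched in a few lines of Python. This is an illustration of the concept, not the product's implementation. The account number pattern is hypothetical, and measuring the 50-character window from the edges of the regex match is an assumption.

```python
import re

def filter_hits(text, pattern, local_terms=(), window=50, global_terms=()):
    """Return regex matches that survive the global and local blocklist rules."""
    lowered = text.lower()
    # Global rule: any global term anywhere in the document suppresses the detector.
    if any(term.lower() in lowered for term in global_terms):
        return []
    kept = []
    for match in re.finditer(pattern, text):
        # Local rule: look only at the text within `window` characters of the match.
        start = max(0, match.start() - window)
        context = lowered[start:match.end() + window]
        if not any(term.lower() in context for term in local_terms):
            kept.append(match.group())
    return kept

ACCOUNT = r"\b\d{8}\b"  # hypothetical account number pattern
text = (
    "Index 10000001 appears in the header. "
    + "x" * 60
    + " Account 20000002 belongs to the customer."
)
local = filter_hits(text, ACCOUNT, local_terms=["index"])
suppressed = filter_hits(text, ACCOUNT, global_terms=["header"])
```

With the local keyword, only the number next to "Index" is dropped and `20000002` is kept. With the global keyword, the detector returns nothing for the whole document.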

QC of detectors and boundaries in spreadsheets

In addition to identifying personal information within non-spreadsheet documents, aiR for Data Breach Response also identifies table boundaries in spreadsheets and the potential personal information within them.

Spreadsheet QC for recall

When reviewing spreadsheets for recall, the goal is to ensure that table boundaries are detected properly and that detectors capture any unstructured PI within the tables. To achieve this, create Recall batches for spreadsheet documents.

Table boundaries

When table boundaries are off, the likelihood of under captures increases. For documents where the boundaries need to be adjusted, make the column annotations and mark the document complete.

Unstructured PI

Spreadsheets can also have localized text-based PI detections. Be sure to scan through Notes and Comments columns, as they could contain a mix of PI types that may not be appropriate for full column annotation.
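
If you have a spreadsheet available as CSV text, a quick offline pass over the free-text columns can show which cells deserve a closer look. The sketch below is a rough triage aid under stated assumptions: the header names and the SSN-like pattern are illustrative, and it does not replace review in the product.

```python
import csv
import io
import re

FREE_TEXT_HEADERS = {"notes", "comments"}  # assumed header names
SSN_LIKE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")  # illustrative pattern only

def flag_free_text_cells(csv_text):
    """Return (row_number, column, value) for free-text cells with an SSN-like value."""
    flagged = []
    reader = csv.DictReader(io.StringIO(csv_text))
    for row_number, row in enumerate(reader, start=2):  # row 1 is the header
        for column, value in row.items():
            if not column or not isinstance(value, str):
                continue  # skip overflow or missing cells in malformed rows
            if column.strip().lower() in FREE_TEXT_HEADERS and SSN_LIKE.search(value):
                flagged.append((row_number, column, value))
    return flagged

sample = (
    "Name,Notes\n"
    "J. Doe,Called about claim\n"
    "A. Roe,SSN on file is 123-45-6789\n"
)
flagged = flag_free_text_cells(sample)
```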

Spreadsheet QC for precision

To review spreadsheet headers for precision, use the Spreadsheet QC tool. For instructions, see Spreadsheet QC tool.

Frequently asked questions
