Quality control
PI Detect is pre-configured with PI Detectors, also referred to as Out of the Box (OOTB) detectors, in addition to any custom detectors or classifiers added for a project. The Quality Control (QC) process is an optional step that occurs prior to review by the Project Lead. The goal of this step is to minimize the number of documents with over and under predictions. QC also reduces the number of over and under predictions from token level detectors by refining existing detectors or adding new ones. If documents are reviewed and marked complete, feedback can also be provided to the machine learning models. QC is an iterative process that may vary from project to project. Below are the recommended workflows for QCing non-spreadsheet and spreadsheet documents.
Note: Changes made during any of the following steps do not take effect until Incorporate Feedback is re-run. Marking a document complete prevents further changes to that document. Therefore, do not mark documents complete during pre-review QC.
QCing unstructured documents for recall
After running Incorporate Feedback, the Project Lead should make sure that all documents with applicable PI within the data set are identified. This is referred to as Document Recall. QCing for recall involves:
- Identifying documents that hit on PI search terms, but do not contain PI hits.
- Reviewing samples of documents.
- Updating the PI Detectors by either:
- Refining existing detectors.
- Or adding new, custom PI detectors.
Create QC review queues with no identified PI
To make sure all documents containing personal information are identified within the data set, QC documents with no PI hits so that no information is missed. Consider creating a queue for each high-priority PI type, such as:
- Date of birth
- Date of death
- Social Security Number
To QC Recall:
- Use PI and Entity Search to search for documents that do not contain the PI type that you will be QCing.
- For example, to search for documents that do not contain social security numbers, search for ‘PI Types’ != ‘Social Security Number’. For more information on search syntax, see Search Syntax.
- Create a saved search containing these documents.
- To keep track of information the search contains, specify that there is no personal information contained within the docs via the saved search name. For example, No PI: Social Security Number
- Create a Review Center queue using the saved search.
- The number of documents and time spent on QCing will vary from matter to matter. For recommendations on how to structure the QC process, see Quality control.
- Repeat for the remaining PI types you want to QC.
Review documents
Once you create QC batches, you should review them using the instructions outlined in Detector QC. As you review documents, store suggestions for detector changes in a shared spreadsheet or other platform to collaborate with others executing QC. You should not mark documents complete during this review.
Refining detectors
If you discover any information during document review that should be captured by a token level detector, you can update detectors during this process. There are three possible actions you can take. For more information on updating detectors, see Personal Information detectors.
Update an existing detector
- Update the keywords, regexes, or both.
- Validate that the detector and keywords capture the target information.
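Before updating a detector, it can help to check a candidate regex against the examples collected during QC. The following is a minimal sketch in Python, run outside of PI Detect. The pattern, the sample strings, and the `validate_detector` helper are all hypothetical and are not the regexes or tooling that PI Detect ships with.

```python
import re

# Hypothetical SSN-style pattern for illustration only.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def validate_detector(pattern, should_match, should_not_match):
    """Return (missed, over_captured) sample lists for a candidate regex."""
    # Under captures: text that contains the target PI but is not matched.
    missed = [s for s in should_match if not pattern.search(s)]
    # Over captures: text without the target PI that is matched anyway.
    over_captured = [s for s in should_not_match if pattern.search(s)]
    return missed, over_captured


missed, over_captured = validate_detector(
    SSN_PATTERN,
    should_match=["SSN: 123-45-6789", "SSN 123 45 6789"],
    should_not_match=["Invoice 2023-01-0042", "Call 555-0100"],
)
print("Under captures:", missed)
print("Over captures:", over_captured)
```

In this sample the space-separated value is reported as an under capture, which suggests the regex needs to allow other separators before the detector is updated.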
Add a new detector
- Add a new detector when the required PI can’t be captured by an OOTB detector.
Add a document level classifier
- Add a new document level classifier when:
- The required PI can't be consistently located within a document, but the document itself can be located with keywords or groups of keywords. This could include handwritten or poorly scanned documents.
- You need to group types of documents, such as Birth Certificates, Medical Records, or Curriculum Vitae.
- Document-level classifiers apply to the document as a whole and do not locally identify or highlight individual pieces of PI.
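The idea of locating a document by keywords or groups of keywords can be sketched as follows. This is only an illustration of the concept, assuming a simple rule where at least one keyword from every group must appear; it is not how PI Detect evaluates classifiers, and the group contents and the `matches_classifier` helper are hypothetical.

```python
# Hypothetical keyword groups for a "Birth Certificate" classifier.
BIRTH_CERT_GROUPS = [
    ["certificate of birth", "birth certificate"],
    ["registrar", "county clerk"],
]


def matches_classifier(text, keyword_groups):
    """Return True when every group has at least one keyword in the text."""
    lowered = text.lower()
    return all(
        any(keyword in lowered for keyword in group)
        for group in keyword_groups
    )


scanned = "STATE OF OHIO  CERTIFICATE OF BIRTH  filed with the County Clerk"
memo = "Please send a copy of the birth certificate by Friday."

print(matches_classifier(scanned, BIRTH_CERT_GROUPS))
print(matches_classifier(memo, BIRTH_CERT_GROUPS))
```

The result applies to the document as a whole: the sketch returns a single yes or no and does not point to where any individual piece of PI sits in the text.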
QCing unstructured documents for precision
Reducing the number of documents with over captures is another important step in the QC process that helps reduce document and annotation volumes, therefore reducing the work for the review team. To QC for precision on unstructured documents:
- Perform Blocklisting on PI with many hits, especially PI with a large number of PI-to-Document hits. The most common PI types where this occurs are phone number and full address.
- Review samples of documents to identify common rules to apply to reduce over captures. The steps below outline this approach.
Create QC review queues containing PI
Create QC review queues for documents that match the following:
- Contains PI. This allows you to generate a random sample of documents.
- Contains phone number
- Contains full address
- Contains partial date of birth
- Contains PI types that look “off”
- To prioritize remaining PI types for QC, look at PI counts for each PI type in the document report. An example of a PI type looking “off” is if the project contains data from an accounting firm and there are a high number of Patient IDs. This is unexpected and should be investigated.
To QC for precision:
- Use PI and Entity Search to search for documents that contain the PI type that you will be QCing.
- For example, to search for documents that contain PI, search for CONTAINS PI. For more information see Search syntax.
- Create a saved search containing these documents.
- To keep track of information the search contains, specify the PI type contained within the docs via the saved search name. For example, Contains PI.
- Create a Review Center queue using the saved search.
- The number of documents and time spent on QCing will vary from matter to matter. For recommendations on how to structure the QC process, see Quality control.
- Repeat for the remaining PI types you want to QC.
Review documents
Once you create QC batches, you should review them using the instructions outlined in Detector QC. As you review documents, store suggestions for detector changes in a shared spreadsheet or other platform to collaborate with others executing QC. You should not mark documents complete during this review.
Refining detectors
For consistent over captures, you can take the following actions:
- Add a local blocklist keyword, which blocklists a hit when the keyword appears within a set distance of the regex match.
- For example, telling the system to ignore all account numbers within 50 characters of the term “index.”
- Add a global blocklist keyword that prevents the detector from registering if the keyword is present anywhere in the document.
- Common examples include phone numbers and mailing addresses.
- If the same phone number or address appears hundreds or thousands of times in the data set, use the Blocklist tool to blocklist this information. For more information on blocklisting, see Blocklisting.
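The difference between local and global blocklist keywords can be sketched in a few lines of Python. This is an illustration of the behavior described above, not PI Detect's implementation. The `filter_hits` helper, the eight-digit account number pattern, and the sample text are assumptions, and distance is measured here in characters on either side of the match.

```python
import re

# Hypothetical account-number pattern for illustration only.
ACCOUNT_PATTERN = re.compile(r"\b\d{8}\b")


def filter_hits(text, pattern, local_terms=(), distance=50, global_terms=()):
    """Return regex hits that survive local and global blocklist keywords."""
    lowered = text.lower()
    # Global keyword: one occurrence anywhere suppresses the whole detector.
    if any(term.lower() in lowered for term in global_terms):
        return []
    kept = []
    for match in pattern.finditer(text):
        # Local keyword: only suppress hits with the term close by.
        start = max(0, match.start() - distance)
        window = lowered[start:match.end() + distance]
        if any(term.lower() in window for term in local_terms):
            continue
        kept.append(match.group())
    return kept


text = (
    "Index 12345678 appears in the appendix. "
    + "x" * 60
    + " Account 87654321 is overdue."
)

local_result = filter_hits(text, ACCOUNT_PATTERN, local_terms=["index"])
global_result = filter_hits(text, ACCOUNT_PATTERN, global_terms=["appendix"])
print(local_result)
print(global_result)
```

With the local keyword, only the number next to "Index" is dropped and the overdue account is kept. With the global keyword, the detector returns nothing for the document, which is why global keywords carry more risk of under captures.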
QC of detectors and boundaries in spreadsheets
In addition to identifying personal information within non-spreadsheet documents, PI Detect and Data Breach Response also identify table boundaries in spreadsheets and the personal information within them.
Spreadsheet QC for recall
When reviewing spreadsheets for recall, the goal is to make sure that table boundaries are detected and that detectors are capturing any unstructured PI within the tables. To achieve this, create Recall batches for spreadsheet documents. When creating the batches, set File Type to All Excel and CSV Files.
Table Boundaries:
When table boundaries are off, the likelihood of under captures increases. For documents where the boundaries need to be adjusted, make the column annotations and mark the document complete.
Unstructured PI:
Spreadsheets can also have localized text-based PI detections. Be sure to scan through "Notes" and "Comments" columns, as they could contain a mix of PI types that may not be appropriate for full column annotation.
Spreadsheet QC for precision
To review spreadsheet headers for precision, use the Spreadsheet QC tool. For instructions on using the Spreadsheet QC tool, see Spreadsheet QC tool.
Frequently asked questions
Quality Control is an iterative process. The time spent on it varies from project to project, depending on how much time you can devote to it and what type of information is contained within the data set. For example, if project A requires 10 new custom detectors and project B requires 5, more QC time may be allotted to project A. Usually, only 2-3 iterations of feedback incorporation are required to complete the QC process after including all detectors. To achieve this, follow the guidelines below:
- QCing for Recall: For high-priority PI types (such as SSNs and DOBs), perform QC until there are no under captures in your sample QC batches. Perform the same for any important custom detectors.
- QCing for Precision: Focus on the number of documents, not the accuracy of the PI annotations. QC until the number of documents you are removing per round is less than 15% of the remaining population.
- Identifying the target document population is the most important input for the review process. Therefore, evaluate the number of documents that are identified as having 1+ PI after each Incorporate Feedback round. Once a round of detector updates reduces the population by 15% or less, you can start the review.
- For example, if you identify 1000 documents with PI after the first round of predictions, you may reduce the population by 20% to 800. After the next round of feedback, you are only able to make a 10% reduction in the number of documents, to 720. As more feedback is provided, it becomes more difficult to reduce the population, and the level of effort needed for each additional percent decrease becomes greater.
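The arithmetic behind this stopping rule can be checked with a short script. This is a minimal sketch, assuming you record the count of documents with 1+ PI after each Incorporate Feedback round by hand. The function names are hypothetical, and the sketch treats a reduction of exactly 15% as meeting the threshold.

```python
def reduction_per_round(doc_counts):
    """Fractional drop in document count between consecutive rounds."""
    return [
        (previous - current) / previous
        for previous, current in zip(doc_counts, doc_counts[1:])
    ]


def ready_for_review(doc_counts, threshold=0.15):
    """True once the latest round reduced the population by <= threshold."""
    reductions = reduction_per_round(doc_counts)
    return bool(reductions) and reductions[-1] <= threshold


# Counts from the example above: 1000 -> 800 -> 720.
counts = [1000, 800, 720]
for round_number, drop in enumerate(reduction_per_round(counts), start=1):
    print(f"Feedback round {round_number}: {drop:.0%} reduction")

print(ready_for_review(counts[:2]))  # after the 20% round
print(ready_for_review(counts))      # after the 10% round
```

For the counts in the example, the first round of feedback (20%) is still above the threshold, while the second round (10%) falls below it, so review could begin after the second round.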
There is a balance between precision and recall. If you make your detectors too precise, you could miss PI in other documents. Prioritize your document recall and refine the precision of your results during review. Because this is an iterative process, you can continue to update your detectors with feedback from the review team.