Entity normalization
The following are definitions of terms used throughout this documentation.
Entity—a unique individual identified by their name and their PI, such as SSN, date of birth, and full address. Entities are assigned a unique Entity ID, appear in the Entity Centric Report, and are composed of one or more records.
Record—a pairing of raw name and PI data from documents. Records are evaluated against each other to determine if they should be consolidated into an entity. A single record can be transformed into an entity if no related records are found.
Entity Cluster—a group of entities based on PI conflicts and name similarity. Entities with conflicting PI or similar names are grouped together to aid in review and potential merging. Entities are not merged if there is no PI match or if a conflict exists.
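To illustrate how these pieces relate, here is a minimal data-model sketch (the class and field names are illustrative, not the product's internal schema):

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class Record:
    """A raw name paired with the PI values extracted from a document."""
    record_id: str
    raw_name: str
    pi: Dict[str, str]          # e.g. {"SSN": "123-45-6789", "Date of Birth": "01/20/1990"}
    source_document: str

@dataclass
class Entity:
    """A unique individual, composed of one or more consolidated records."""
    entity_id: str
    full_name: str
    records: List[Record] = field(default_factory=list)

@dataclass
class EntityCluster:
    """Entities grouped for review because of name similarity or PI conflicts."""
    cluster_id: str
    entities: List[Entity] = field(default_factory=list)

    @property
    def cluster_size(self) -> int:
        return len(self.entities)
```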
Configure settings
To configure entity normalization settings, click on the Settings icon and select Deduplication Preferences.
PI identifier types
PI types are grouped into the following categories to facilitate the conflict flagging and merging logic:
Identifiers that can generate conflicts
- Primary—identifier that must match for two entities to be considered the same person. Two different entities cannot share the same primary identifier.
Examples: SSN, Passport Number, Driver’s License, DEA Number, Patient ID
- Secondary—identifier that is not unique to an individual, but an individual cannot have more than one value for a specific type. Two individuals can have the same value and not be the same person.
Examples: Date of Birth, Date of Death, Partial SSN
- Tertiary—identifier that does not have to match for two entities to be considered the same person. For example, if two entities have a different value for address, this does not disqualify those two entities from being merged.
Examples: Credit Card Number, Personal Email Address, Personal Address, Financial Account Number, Partial Account Number, IBAN Code, Phone IMEI Number, Personal Phone Number, Swift Code, Username
Identifiers that do not generate conflicts
- Other—identifier that does not have to match for two entities to be considered the same person. Two entities can have the same value and not be the same person.
Examples: CVV Number, Expiration date associated with Credit/Debit Card, PIN, Password, Financial Password, Financial Institution, Medical Information
- Yes or No (Ignore for Consolidation)—identifier that is not recorded as a token value and is not used in normalization. Appears as a Yes or No record on the report.
Note: All custom PI types are automatically set to Yes/No.
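To make the categories concrete, the following is a simplified sketch of how they could drive pairwise match and conflict decisions (the category assignments and decision logic are illustrative, not the product's exact rules):

```python
# Example category assignments; the actual assignment of PI types is configurable.
PRIMARY = {"SSN", "Passport Number", "Driver's License", "DEA Number", "Patient ID"}
SECONDARY = {"Date of Birth", "Date of Death", "Partial SSN"}
TERTIARY = {"Credit Card Number", "Personal Email Address", "Personal Address"}


def compare_pi(pi_a: dict, pi_b: dict) -> str:
    """Roughly classify a pair of entities' PI as 'conflict', 'match', or 'not enough info'."""
    shared = set(pi_a) & set(pi_b)
    # Primary and secondary identifiers cannot differ for the same individual.
    if any(pi_a[t] != pi_b[t] for t in shared & (PRIMARY | SECONDARY)):
        return "conflict"
    # A shared primary value is strong evidence the entities are the same person.
    if any(pi_a[t] == pi_b[t] for t in shared & PRIMARY):
        return "match"
    # Tertiary identifiers never disqualify a merge; matching values merely support it.
    if any(pi_a[t] == pi_b[t] for t in shared & (SECONDARY | TERTIARY)):
        return "match"
    return "not enough info"


print(compare_pi({"SSN": "123-45-6789"}, {"SSN": "123-45-6789", "Date of Birth": "01/20/1990"}))  # match
print(compare_pi({"Date of Birth": "01/20/1990"}, {"Date of Birth": "12/10/1970"}))               # conflict
```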
Configure entity report settings
You can configure the following entity report settings by scrolling to Report Settings on the Deduplication Preferences tab:
- Include Partial Addresses In Entity Report—determines how to handle partial addresses in the Entity Report.
- Determine An Entity's Primary Address—determines the logic used to select the Entity's primary address when an Entity has multiple addresses associated with it.
- Created By Date—a field provided in the DAT file for the document.
- Modified By Date—a field provided in the DAT file for the document.
- Frequency—the frequency the address was found in the data set for an individual entity.
- Entity Report Format for Multiple Addresses—choose between two options for displaying addresses: Single Row per Entity or Multiple Rows per Entity.
- Address Filter By Country—choose whether to include or exclude addresses from the Entity report based on country.
Running the entity normalization process
This section outlines the entity normalization process and provides a description of each step.
To run the entity normalization process:
- Navigate to the Privacy Workflow tab and select the Incorporate Feedback icon from the sidebar.
- Click the Run Incorporate Feedback button.
- Select the following stages:
- Click Run.
Once you click Run, the selected stages run automatically and generate the Normalization results. You can then review the results using Cluster review.
The following sections describe each stage of the Entity Normalization process in detail.
Process Excel detections
The Process Excel Detections process iterates through confirmed Excel header mappings and performs the following actions:
- Generates detections and annotations from table columns identified as containing PI by the mapping.
- Generates records and PI mappings for name-based PI mappings.
The remaining document set is then searched for occurrences of the contents of each column row and any matches are also added as detections and annotations.
Text-based annotations use start and end offsets to allow the highlighting or redaction of an annotation’s text. For Excel-based annotations, however, this is not enough. Therefore, the annotation data structure contains the concept of PI coordinates which can be used to indicate the Excel sheet, row, and column location for each annotation.
Some PI types may span many columns, but for reporting purposes it is best to merge their contents.
For example, a name may appear in an Excel file under the columns First Name and Last Name. During Excel content detections the contents of rows where such compound mappings occur are concatenated to form a single string. In such cases, the PI coordinate for the annotation will contain references to each of the columns where the name portions occurred to facilitate redaction.
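As an illustration, an Excel-based annotation for such a compound name mapping might look like the following (the field names and structure are assumptions, not the actual annotation schema):

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class PICoordinate:
    """Location of one piece of an annotation within a spreadsheet."""
    sheet: str
    row: int
    column: str

@dataclass
class Annotation:
    pi_type: str
    text: str
    # Text-based annotations only need character offsets within the extracted text.
    start_offset: Optional[int] = None
    end_offset: Optional[int] = None
    # Excel-based annotations also record sheet/row/column coordinates; a compound
    # mapping (e.g. First Name + Last Name) references every contributing column.
    coordinates: Optional[List[PICoordinate]] = None

# A name split across "First Name" and "Last Name" columns, concatenated into one string:
name_annotation = Annotation(
    pi_type="Name",
    text="John Smith",
    coordinates=[
        PICoordinate(sheet="Sheet1", row=2, column="A"),  # First Name
        PICoordinate(sheet="Sheet1", row=2, column="B"),  # Last Name
    ],
)
```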
Deduplicate individuals
The consolidation process includes the following steps:
- Name Clustering
- Nickname Matching
- Hierarchical Merging
- Address Standardization
- PI Matching
Name clustering
The name clustering step retrieves all names that were linked to PI from the data set and groups similar names into clusters. Each cluster contains several names that are nearest-neighbors in terms of character group similarity. To create these groupings, the algorithm looks at the entire name, not just first or last.
Once these clusters form, the algorithm then refines them by checking the PI linkages associated with each name. This happens by generating a graph of PI-compatible names. This graph has names as its nodes, and the edges represent whether one name is compatible with another.
Each pair of names in the cluster is compared, and an edge is created between them if their PI is compatible. The resulting graph consists of several connected clusters of names.
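As an example, a minimal sketch of building the compatibility graph and extracting its connected components (the PI-compatibility check is passed in as a placeholder):

```python
from itertools import combinations
from typing import Dict, List, Set

def refine_cluster(names: List[str], pi: Dict[str, dict], compatible) -> List[Set[str]]:
    """Split one name cluster into connected components of PI-compatible names."""
    # Build the graph: an edge means the two names' PI does not conflict.
    edges: Dict[str, Set[str]] = {name: set() for name in names}
    for a, b in combinations(names, 2):
        if compatible(pi[a], pi[b]):
            edges[a].add(b)
            edges[b].add(a)

    # Extract connected components with a simple depth-first search.
    components, seen = [], set()
    for name in names:
        if name in seen:
            continue
        stack, component = [name], set()
        while stack:
            current = stack.pop()
            if current in seen:
                continue
            seen.add(current)
            component.add(current)
            stack.extend(edges[current] - seen)
        components.append(component)
    return components
```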
Nickname matching
In tandem with name clustering, the algorithm also performs nickname matching to identify possible variations of a record's name and maximize the possible merges. A nickname is an alternative version of a name and differs from a name variant in that it can be very different from the original.
For example, the name John can have Johnny and Jonathon as nicknames. Nicknames are determined based on a pre-determined dictionary of possible nicknames. Therefore, when the entity John Smith is clustered and compared with other similarly named entities, all of his aliases are also compared simultaneously:
[John Smith, Johnny Smith, Jonathon Smith, Smith John, Smith Johnny, Smith Jonathon]
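A small sketch of dictionary-based nickname expansion (the dictionary contents here are only an example):

```python
# A tiny stand-in for the pre-determined nickname dictionary.
NICKNAMES = {"john": {"johnny", "jonathon"}}


def aliases(full_name: str) -> set:
    """Generate alias variants of a name using the nickname dictionary."""
    first, *rest = full_name.split()
    variants = {first} | {n.capitalize() for n in NICKNAMES.get(first.lower(), set())}
    result = set()
    for variant in variants:
        parts = [variant] + rest
        result.add(" ".join(parts))            # e.g. "Johnny Smith"
        result.add(" ".join(reversed(parts)))  # e.g. "Smith Johnny"
    return result


print(sorted(aliases("John Smith")))
# ['John Smith', 'Johnny Smith', 'Jonathon Smith', 'Smith John', 'Smith Johnny', 'Smith Jonathon']
```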
The ability to turn this feature on or off is available but requires Relativity support.
Hierarchical merging
Nickname matching relies on names being similar enough to end up in the same initial cluster, but there are cases where this fails to happen. Consider the name Janet Theresa Margaret Doe. This name may not be similar enough to the other names in the cluster above, so the longer name will not have the chance to merge with potential matches.
To address this problem, an extra step is performed after the initial clustering and is run on the entire data set. This step, named hierarchical merging, attempts to merge larger names with smaller ones whose name-parts are all contained in the larger parent name. This step is configurable and is turned on by default.
For example, Janet Theresa Margaret Doe has four name-parts, three of which are contained in Janet Theresa Doe, so the two could be merged if they have compatible PI.
However, this can only happen if there is a single compatible larger name available to merge. Suppose there was also Janet Theresa Mary Doe in the data set. Now Janet Theresa Doe could merge with either Janet Theresa Margaret Doe or Janet Theresa Mary Doe. In this situation, the algorithm attempts to choose the larger name with the most similar PI, and in cases where that is not possible, it leaves the smaller name un-merged.
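A simplified sketch of the name-part containment check behind hierarchical merging (the PI-similarity tie-break is only described in a comment):

```python
def name_parts(name: str) -> set:
    return set(name.lower().split())


def merge_candidates(smaller_name: str, larger_names: list) -> list:
    """Return the larger names whose name-parts contain every part of the smaller name."""
    small = name_parts(smaller_name)
    return [larger for larger in larger_names if small <= name_parts(larger)]


candidates = merge_candidates(
    "Janet Theresa Doe",
    ["Janet Theresa Margaret Doe", "Janet Theresa Mary Doe", "John Smith"],
)
print(candidates)  # ['Janet Theresa Margaret Doe', 'Janet Theresa Mary Doe']

# With more than one compatible candidate, the merge proceeds only if PI similarity
# clearly favors one larger name; otherwise the smaller name is left un-merged.
```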
Address standardization
Most addresses are extracted from spreadsheets, where they can be arranged in many different ways. As a result, address text can be very noisy and the same address can be captured in multiple forms. For example, the address 123 Some Street, Brooklyn N.Y. 12166 could appear as:
- 123 Some st., Brooklyn, 12166
- 123 Some Street, Brooklyn, New York 12166
- 123 Some str., Brooklyn NY 12166
- 123 Some Street
Since the goal of data breach response is to notify the affected parties, having multiple copies of the same address is inconvenient. The address normalization service is responsible for transforming an address into a standardized format. Using the example above, it should transform all the addresses into the same string. For example: 123 Some Street, Brooklyn, New York, 12166.
The address normalization service will take an address, expand it, and parse it to examine the constituent parts of the address. For example, street number, street, city, postcode.
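As a rough illustration, a naive expand-and-parse routine might look like this (a real normalization service uses a proper address parser; the abbreviation list and regular expression are assumptions):

```python
import re

# Illustrative abbreviation expansions applied before parsing.
EXPANSIONS = {"st.": "Street", "st": "Street", "str.": "Street",
              "n.y.": "New York", "ny": "New York"}

def normalize_address(raw: str) -> str:
    """Expand abbreviations, then reassemble the constituent parts in a consistent order."""
    tokens = [EXPANSIONS.get(t.lower(), t)
              for t in (tok.strip(",") for tok in raw.split())]
    expanded = " ".join(tokens)
    # Naive parse into constituent parts: street, city, state, postcode.
    match = re.match(r"(?P<street>\d+ .*?Street) (?P<city>\w+)"
                     r"(?: (?P<state>New York))?(?: (?P<zip>\d{5}))?$", expanded)
    if not match:
        return expanded
    return ", ".join(part for part in match.groups() if part)

for variant in ("123 Some st., Brooklyn, 12166",
                "123 Some Street, Brooklyn, New York 12166",
                "123 Some str., Brooklyn NY 12166"):
    print(normalize_address(variant))
```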
If the address is a partial address and the annotation came from a .csv file, the normalization service will attempt to map the address parts starting with their most frequently observed column mapping.
For example, consider these address detections from three documents:
Column mapping | Address text | Doc ID |
---|---|---|
ADDRESS_LINE_3 | Brooklyn | DOC1 |
ADDRESS_LINE_4 | New York | DOC1 |
ADDRESS_LINE_3 | Brooklyn | DOC2 |
ADDRESS_LINE_4 | New York | DOC2 |
ADDRESS_LINE_2 | Brooklyn | DOC3 |
ADDRESS_LINE_3 | New York | DOC3 |
The address will have been concatenated as Brooklyn, New York. In most cases, the address begins at ADDRESS_LINE_3, so the address normalizer will return a normalized address with Brooklyn in the Address 3 field.
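A short sketch of choosing the most frequently observed column mapping for each address part, using the detections from the table above (the data structures are illustrative):

```python
from collections import Counter

# (column mapping, address text, doc ID) detections from the example above.
detections = [
    ("ADDRESS_LINE_3", "Brooklyn", "DOC1"), ("ADDRESS_LINE_4", "New York", "DOC1"),
    ("ADDRESS_LINE_3", "Brooklyn", "DOC2"), ("ADDRESS_LINE_4", "New York", "DOC2"),
    ("ADDRESS_LINE_2", "Brooklyn", "DOC3"), ("ADDRESS_LINE_3", "New York", "DOC3"),
]

# For each address part, count how often it appears under each column mapping
# and keep the most frequently observed mapping.
counts = Counter((text, column) for column, text, _ in detections)
most_common = {}
for (text, column), n in counts.items():
    if text not in most_common or n > counts[(text, most_common[text])]:
        most_common[text] = column

print(most_common)
# {'Brooklyn': 'ADDRESS_LINE_3', 'New York': 'ADDRESS_LINE_4'}
```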
PI matching
Each PI type has an associated matcher which is responsible for determining if two instances of PI of that type are equivalent. For some PI types, a degree of fuzziness is desirable, while for others it most definitely is not.
For example, we might choose to tolerate a degree of fuzziness for postal addresses that we would not consider appropriate for social security numbers.
The following table illustrates the built-in matchers and the PI types that they are usually associated with:
Matcher | PI types | Description |
---|---|---|
Date matcher | Date of birth, Date of death | The matcher is fuzzy in the sense that it can parse a large variety of date formats, but it only returns matches for dates that precisely match. For example, 18 Feb 2000 and 18/02/2000. |
Number-based matcher | SSN, credit cards, bank account numbers | The matcher ignores any non-number character and will only match exact matches. For example, 123-12-1234 and 123121234. |
Fuzzy string matcher | Used for custom detectors | Returns matches based on the Levenshtein distance between the strings. Can be configured to require exact matches or be more lenient. |
Address matcher | Used for addresses | The matcher determines a match based on a look-up in the normalized addresses produced by the data set. |
DEA number matcher | DEA numbers | The matcher determines a match based on an exact comparison of DEA numbers. |
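For illustration, simplified stand-ins for the number-based and fuzzy string matchers might look like this (difflib's SequenceMatcher is used here in place of a true Levenshtein implementation, and the threshold is an assumption):

```python
import re
from difflib import SequenceMatcher

def number_match(a: str, b: str) -> bool:
    """Ignore any non-digit characters and require an exact match on the digits."""
    return re.sub(r"\D", "", a) == re.sub(r"\D", "", b)

def fuzzy_match(a: str, b: str, threshold: float = 0.9) -> bool:
    """Similarity-based comparison; a threshold of 1.0 requires an exact match."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

print(number_match("123-12-1234", "123121234"))        # True
print(fuzzy_match("Jonathon Smith", "Jonathan Smith"))  # True
```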
Report generation
Report generation is the final stage required to generate the Normalization results.
Cluster review
Once the Entity normalization process is complete and Normalization results are generated, you should review those results in the entity cluster table.
To view the entity cluster table, click on the entity cluster icon on the side panel and select Deduplicate and Normalize Entities.
Report fields
The following are definitions of fields found within the entity cluster table.
Field | Description |
---|---|
Entity Cluster | Unique ID of the cluster.
Cluster Size | The number of Entities in a cluster. |
Total Conflicts | Number of conflicts between Entities in the cluster based on the number of conflicting Records within an entity. |
# Primary Conflicts | Number of conflicts with primary identifiers. |
# Secondary Conflicts | Number of conflicts with secondary identifiers. |
# Tertiary Conflicts | Number of conflicts with tertiary identifiers. |
Conflicts? | Yes or No field based on whether the number of conflicts is greater than zero.
Reviewed Status | Review status of the cluster.
Example conflict logic
Scenario description | Name | Primary | Secondary | Tertiary | Example Alias 1 | Example Alias 2 |
---|---|---|---|---|---|---|
Names match or are similar, but primary identifiers do not match. Tertiary identifiers match. | Match | No match | - | Match | Name: John Smith; SSN: 223-55-6788; Email: jsmith@gmail.com | Name: John Smith; SSN: 123-45-6789; Email: jsmith@gmail.com |
Names match or are similar. There is not enough information on primary identifiers to compare. Secondary identifiers do not match. Tertiary identifiers match. | Match | Not enough info | No match | Match | Name: John Smith; Email: jsmith@gmail.com; DOB: 01/20/1990 | Name: John Smith; Email: jsmith@gmail.com; DOB: 12/10/1970 |
Names are not similar, but primary identifiers match. | No match | Match | - | - | Name: John Smith; SSN: 123-45-6789; DOB: 01/20/1990 | Name: David Johnson; SSN: 123-45-6789 |
Names are not similar and primary identifiers do not match. Tertiary identifiers match. | No match | No match | - | Match | Name: John Smith; SSN: 123-45-6789; Email: jsmith@gmail.com | Name: Adam Brown; SSN: 623-65-6678; Email: jsmith@gmail.com |
Names are not similar and there is not enough information on primary identifiers to compare. Tertiary identifiers match. | No match | Not enough info | - | Match | Name: John Smith; Email: jsmith@gmail.com | Name: David Johnson; Email: jsmith@gmail.com |
Recommended workflow
We recommend using the following workflow for review:
- Sort the table to review all clusters with Primary Conflicts > 0.
- Sort the table to review all clusters with Secondary Conflicts > 0.
- Sort the table to review all clusters with Tertiary Conflicts > 0. This may not be necessary based on project requirements.
- If there are too many unnecessary conflicts, update the Identifier Types, located in the Project Settings, to reduce the number of conflicts.
- After Conflicts are resolved or Clusters with Conflicts are reviewed, sort Clusters by size in descending order and review for similarly named entities.
- This is best for large clusters.
- It may not be beneficial to review all clusters with a size > 1 unless time permits.
For Cluster review:
- Depending on the cluster size, it can be helpful to select the Expand All button.
- Select two entities and then select the Compare button to compare them.
- Once you expand the entity information for two entities and compare them, decide if they should be merged. Information that is different across entities is red.
- To merge two entities together, select the entities and select the Merge button. After you merge the entities, the new entity moves to the bottom of the list and is represented with a merge icon.
- To unmerge, select the entity with the merge icon and then select the Unmerge button.
Merge reason table
To view the Merge Reason table, select the Merge Reason tab.
Report fields
The following report fields appear on the Merge Reason table:
Field | Description |
---|---|
Entity ID | Unique Entity ID. Maps to the Entity ID in the Entity Report. |
Name | Entity full name. |
Number of Records | Number of Records that make up the Entity. |
Merge Reason | Summary of why Records were merged. |
Main Overlapping PI | The Primary PI Type that resulted in the merged Records. |
Second Overlapping PI | The Secondary PI Type that resulted in the merged Records. |
Source | Whether an entity was merged automatically during deduplication or by a user. Source will be System Generated for automatically merged entities, and User Generated for entities merged by a user. |
Source User | If the Source is User Generated, this will be the username of the user who merged the entity. It will be empty if the Source is System Generated. |
Review Status | The review status of the entity. |
Entity records review
The Review Entities screen is composed of two primary sections:
- Entity—this section displays the information that is shown in the Entity Report for the Entity. You can edit what is shown in the Entity’s information by selecting the Edit button. You can only select the raw values from the data and cannot edit the text directly.
- Records—this section shows the Name and PI for every Record that was merged to create the Entity.
- Other Actions—Mark Complete: You can select the Mark Complete button to mark the Entity as reviewed.
Entity centric report
The Entity Centric Report is found within the PI Detect and Data Breach Response Application under the Entity Analysis tab.
Note: We are moving the Entity Deduplication experience from the Report Generation page within the Privacy workflows tab. The data within the entity reports should match and both will exist until the full Entity Deduplication experience is moved under the Entity Analysis tab in the application. This is expected to complete in Q4 2024.
Show PI
The Show PI toggle is turned off by default. When this toggle is turned off, PI values will be replaced with Y/N values in the Entity Report.
Report fields
The following are the static and dynamic fields that appear on the Entity Centric Report:
Static fields
Field | Description |
---|---|
Entity ID | Unique identifier for the Entity |
Full Name | Entity Full Name normalized from the underlying entity records |
First Name | Entity First Name normalized from the underlying entity records. The machine attempts to parse the First Name from the Full Name. |
Middle Name(s) | Entity Middle Name normalized from the underlying entity records. The machine attempts to parse the Middle Name from the Full Name. |
Last Name | Entity Last Name normalized from the underlying entity records. The machine attempts to parse the Last Name from the Full Name. |
Title | Entity Title normalized from the underlying entity records. The machine attempts to parse the Title from the Full Name. |
Total PII | Total number of PI values in the report for the entity |
Total Document Count | Number of documents in which an entity record is found |
Total Reviewed Document Count | Number of reviewed documents where an entity record is found. |
Total Unreviewed Document Count | Number of unreviewed documents where an entity record is found. |
Document Identifiers | Document IDs where entity records are located. |
Cluster ID | The unique identifier of the cluster in which the entity resides |
Dynamic fields
There are two types of dynamic fields: address and PI type.
Address fields
- Single Row per Entity (Default)
- Primary Address Line 1-4, Zip Code, Country
- Secondary Address Line 1-4, Zip Code, Country
- Other Addresses
- Multiple Rows per Entity
- Address Line 1-4, Zip Code, Country
PI type fields
- Every PI Type with at least one value tagged to an Entity will appear as a separate column.
- Original Value Fields—each field will display the original value for each PI Type found in the data.
- Normalized Fields—every PI Type field will include a field with a name ending in “_Normalized”. These fields are cleaned and standardized by Data Breach Response automatically. For custom PI types, the normalized value is the same as the original value.
Download
Select the Download Report button to save an .xlsx file to your downloads folder. Any filtering applied to the report in the UI will be maintained in the downloaded file.
Troubleshooting
Data Breach Response's entity normalization and consolidation experience is designed to be iterative. The settings may need refinement before producing a final notification report. This flexibility maximizes the potential for consolidated entities and accommodates the unique characteristics of each data set.
Over merging
When two or more entities are merged incorrectly, this is known as over merging. All over merging issues can be resolved through the following:
- Adjusting the normalization settings.
- Turning on and off nickname matching. This is not yet available in the Data Breach Response UI, and requires Data Breach Response support.
- Adjusting the sensitivity of name clustering. This is not yet available in the Data Breach Response UI, and requires Data Breach Response support.
- Editing the Entity Record via the merge reason table.
There are two primary reasons why any over merging issue could occur:
- The system has found some matching PI based on the value and the assigned identifier type: primary, secondary, tertiary, other, or yes/no. See Configure settings for more details on identifiers that can generate conflicts.
- The name similarity scores are high enough to consider the names a match.
It is also possible that an individual manually merged the entities together via the Cluster review experience.
Scenarios
The normalization settings are the most likely reason why two or more entities are over merged. The following steps assume that the documents were reviewed correctly, and the entities are properly linked to their personal information.
Scenario 1A: Two entities were over merged but have shared PI.
- Identify the over merged entities.
- Identify the shared PI and the PI type.
- Confirm that the two entities should have shared PI based on reviewing the native documents. Sometimes this can occur because of a user error or data issue.
- Adjust the identifier settings of the PI type to Tertiary, Secondary, or Primary. If the identifier type is already one of these three types, proceed to Scenario 1B.
- Rerun normalization.
Scenario 1B: Two different entities were merged but share PI.
At some point, you may not feel that increasing the identifier type is the right approach. For example, you likely will never want Address to be a Primary identifier because multiple people can share addresses and there are numerous ways to write an address which might prevent correct merges from happening.
Scenario 2: Two entities were over merged but have no overlapping PI.
This scenario often occurs when there are records with very few linked PI and the identifier types are set to Yes/No or Other.
- Identify the over merged entities.
- Identify the PI types of each entity. Start with the entity with the least amount of PI; these are usually where the issues arise.
- Adjust the identifier type of the PI to Tertiary, Secondary, or Primary.
- Rerun Normalization.
If this still does not resolve the issue, please refer to Scenario 1B.
Under merging
Under merging occurs when two or more entities that should have been merged were not merged. The reasons two entities may not merge are as follows:
- The full names were not identified as being similar enough.
- The PI types do not match.
- A conflict has occurred due to the identifier type of one or more PI.
Data Breach Response attempts to normalize names and PI to maximize the potential merges that are possible. However, there are instances where the machine may not be able to normalize the names and PI sufficiently to identify similar or matching values in other records. This could also be the reason for conflicts. In this case, the only resolution is to manually merge the entities through the Cluster Review Experience.
Sometimes, it may be appropriate to adjust the identifier type of one or more PI values. For example, we have seen projects where multiple individuals were associated with the same driver’s license because of an issue with the underlying data. In this case, a decision was made to ignore conflicts from driver’s licenses. So, we adjusted the driver’s license from Primary to Tertiary.