Running name normalization on email headers
Email header formats in the extracted text vary widely and are generally less clean than the top-level headers. For this reason, you may want to first run name normalization on only the top-level headers (To, From, Cc, Bcc) to produce cleaner results. You can then use these results to seed additional runs of name normalization.
This workflow assumes you have the following:
- A structured analytics set to be used only for name normalization.
- High-quality, clean data for the Email From, Email To, Email Cc, and Email Bcc fields.
- An Analytics profile where the Email From, Email To, Email Cc, and Email Bcc fields are properly mapped.
See these related pages:
Running name normalization
To run name normalization on email header fields, perform the following steps:
- From the Repeated Content Filters tab, create a filter with the following settings:
- Name - For Name Normalization ONLY
- Type - Regular Expression
- Configuration - enter (?s).*+
Note: You must set the configuration to the seven characters specified above, exactly as they appear, with no extra spaces.
The regular expression filter is the key to this solution. The filter works as follows:
- (?s) - turns on single-line (dotall) mode, so that the . in the .* wildcard also matches line breaks.
- .* - denotes "any character, any number of times." Combined with the (?s) above, this matches any character, any number of times, including line breaks.
- + - makes the preceding .* possessive, so it matches as many characters as possible and never backtracks. This makes it more memory efficient.
In other words, this filters out every character of the extracted text, including line breaks, as it is being sent to the Analytics engine.
Note: We highly discourage using this regular expression anywhere else. Only use this regular expression with name normalization. If you apply this regular expression to other operations, such as email threading, the results will be unusable.
- Configure the following settings on the structured analytics set running name normalization:
- Structured Analytics Set Information
- Operations to run - select only Name Normalization.
- Email Headers
- Analytics profile - select the Analytics profile with properly mapped email header fields.
- Use email header fields - set to Yes.
- Optional Settings
- Regular expression filter - select For Name Normalization ONLY.
- Run the structured analytics set. Once the set completes, name normalization should have found entities and aliases based solely on the headers. One way to confirm this is to examine the Entity Participant field. It should contain only the entities listed in the Entity From and Entity Recipient fields, nothing more. Similarly, the Alias Participant field should contain only the aliases listed in the Alias From and Alias Recipient fields.
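The confirmation check in the last step can be sketched in a few lines of Python. This is only an illustration, not a Relativity feature: it assumes you have exported the six fields for each document into a dictionary keyed by field name, and the helper name unexpected_participants and the sample values are hypothetical.

```python
# Hypothetical records: each dict stands in for one document's exported field values.
# The keys mirror the field names used in the steps above; the export format is an assumption.

def unexpected_participants(doc, kind):
    """Return participant values of the given kind ("Entity" or "Alias")
    that appear in neither the matching From field nor the Recipient field."""
    from_header = set(doc[f"{kind} From"])
    recipients = set(doc[f"{kind} Recipient"])
    participants = set(doc[f"{kind} Participant"])
    return participants - (from_header | recipients)


# Made-up sample of a document whose participants come only from the headers.
header_only_doc = {
    "Entity From": ["Jane Doe"],
    "Entity Recipient": ["John Smith", "Ann Lee"],
    "Entity Participant": ["Jane Doe", "John Smith", "Ann Lee"],
    "Alias From": ["jane.doe@example.com"],
    "Alias Recipient": ["john.smith@example.com", "ann.lee@example.com"],
    "Alias Participant": [
        "jane.doe@example.com",
        "john.smith@example.com",
        "ann.lee@example.com",
    ],
}

# Made-up sample where a name from the extracted text leaked into the participants.
leaky_doc = dict(header_only_doc)
leaky_doc["Entity Participant"] = ["Jane Doe", "John Smith", "Ann Lee", "Bob Ray"]

for kind in ("Entity", "Alias"):
    # An empty set means the participants are based solely on the headers.
    print(kind, unexpected_participants(header_only_doc, kind))

# A non-empty set lists the values that did not come from the headers.
print(unexpected_participants(leaky_doc, "Entity"))
```

An empty result for both Entity and Alias on every document suggests the regular expression filter was applied as intended. Any leftover values point to names that were picked up from somewhere other than the header fields.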
After executing this process, you can work with the entities and aliases as-is, or you may later choose to bring the extracted text into consideration. To bring in the extracted text, remove the regular expression filter from the structured analytics set, and then re-run the set with the Repopulate Text setting enabled.
Note: There are other regular expressions, such as ^.*$, that can achieve the same result. However, they are more memory intensive. We recommend using the (?s).*+ expression for best performance, especially if your document set includes large documents.
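To see how the (?s).*+ expression behaves, the following minimal sketch applies it to a small multi-line string using Python's re module. Python is only a stand-in here: the Analytics engine's regular expression flavor and the way it applies filters are not reproduced, and the sample text is made up. Python accepts possessive quantifiers such as *+ only from version 3.11, so the sketch falls back to the greedy form on older interpreters.

```python
import re
import sys

# Possessive quantifiers (*+) are supported by Python's re module from 3.11 on.
# On older interpreters, fall back to the greedy form, which matches the same text here.
FILTER = r"(?s).*+" if sys.version_info >= (3, 11) else r"(?s).*"

# Made-up extracted text containing an embedded header block and a body.
extracted_text = "From: Jane Doe\nTo: John Smith\n\nHi John,\nSee attached.\n"

# fullmatch succeeds only if the pattern consumes the entire string,
# so a match here shows that line breaks are included.
whole = re.fullmatch(FILTER, extracted_text)

# Without (?s), the dot stops at the first line break.
first_line = re.match(r".*", extracted_text)

# Removing what the filter matches leaves no text behind.
filtered = re.sub(FILTER, "", extracted_text, count=1)

print(whole is not None)
print(repr(first_line.group()))
print(repr(filtered))
```

Because the filtered text is empty, nothing from the body is left for name normalization to pick up, which is why only the mapped header fields contribute entities and aliases.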
Additional regular expression resources
For more information on using regular expressions (regex) in Relativity, see: