Using EDRM MIH to identify duplicates in emails

An EDRM Message ID Hash (EDRM MIH) is an MD5 hash of an email message's Message ID value, generated according to EDRM guidelines. Relativity calculates this hash during discovery and stores it in the Email/EDRMMessageIdentificationHash metadata. You can use the EDRM MIH with the Processing Duplicate Workflow script in Relativity to identify potential duplicate emails across platforms, including emails that were not processed in Relativity.

How the hash is generated:

Process                              Value
Message-ID header line from email    Message-ID: <CALckR-a8UDkRjO4xJyjd_s0GPxQWw@mail.gmail.com>
Value passed to MIH generator        <CALckR-a8UDkRjO4xJyjd_s0GPxQWw@mail.gmail.com>
Generated MIH                        1de319c276884bd0c9e2f1621ada26cc
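
The steps above can be sketched in a few lines of Python. This is only an illustration, not Relativity's implementation: it assumes the value is hashed exactly as shown (angle brackets included) and encoded as UTF-8, and the function names edrm_mih and message_id_from_header are hypothetical. Relativity may normalize the value differently, so compare the output against a known MIH before relying on it.

```python
import hashlib


def message_id_from_header(header_line: str) -> str:
    """Return the value part of a 'Message-ID:' header line."""
    name, _, value = header_line.partition(":")
    if name.strip().lower() != "message-id":
        raise ValueError("not a Message-ID header line")
    return value.strip()


def edrm_mih(message_id_value: str) -> str:
    """Return the MD5 hex digest of a Message-ID value.

    Assumption: the value is hashed as passed (angle brackets included)
    and encoded as UTF-8.
    """
    return hashlib.md5(message_id_value.encode("utf-8")).hexdigest()


header = "Message-ID: <CALckR-a8UDkRjO4xJyjd_s0GPxQWw@mail.gmail.com>"
value = message_id_from_header(header)
print(value)            # value passed to the hash function
print(edrm_mih(value))  # 32-character hexadecimal digest
```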

See DupeID > Duplicate Identification Project Overview for more information on EDRM's duplicate identification project.

How to use EDRM MIH in Relativity

To use the EDRM MIH, first create a new field in Relativity that stores the hash value. For emails processed in Relativity, map the new field to the Email/EDRMMessageIdentificationHash processing field. For load files produced on other platforms, map the field to the load file metadata that contains the EDRM MIH. You can then run the processing deduplication script to identify cross-platform duplicates.

Fields and saved search

The field and saved search names used below are suggestions. You can use any naming convention that suits your organization's needs.

  • Fields—create two new fields
    • EDRM MIH—fixed-length text, mapped to metadata source, Email/EDRMMessageIdentificationHash
      The Email/EDRMMessageIdentificationHash source is not displayed until you process at least one email. If you do not see the field, process at least one email, then return to mapping the source.
    • EDRM Duplicate Status—single choice with options, Unique, Master, Duplicate
  • Saved Search—create a new saved search
    • Name—EDRM Saved Search
    • Fields
      • Control Number
      • EDRM MIH
      • EDRM Duplicate Status
  • Processing Item Level Script: 03. Update Duplicate Status
    • Saved Search—EDRM Saved Search
    • Duplicate Status Field—EDRM Duplicate Status
    • Duplicate Hash Field—EDRM MIH
      [Figure: Update duplicate status script for EDRM MIH]
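
To make the three status values concrete, the following sketch shows one way a hash-based status could be assigned. It does not reproduce the logic of the Update Duplicate Status script. It assumes that the document with the lowest control number in each hash group becomes the Master, and that documents without an MIH receive no status. The function name assign_duplicate_status and the sample control numbers are hypothetical.

```python
from collections import defaultdict


def assign_duplicate_status(documents):
    """Map control number -> 'Unique', 'Master', or 'Duplicate'.

    documents: iterable of (control_number, mih) pairs.
    Assumptions: the lowest control number in a hash group is the Master;
    documents with an empty MIH are skipped.
    """
    groups = defaultdict(list)
    for control_number, mih in documents:
        if mih:  # skip documents that have no hash value
            groups[mih.lower()].append(control_number)

    status = {}
    for control_numbers in groups.values():
        control_numbers.sort()
        if len(control_numbers) == 1:
            status[control_numbers[0]] = "Unique"
        else:
            status[control_numbers[0]] = "Master"
            for control_number in control_numbers[1:]:
                status[control_number] = "Duplicate"
    return status


docs = [
    ("DOC0003", "1de319c276884bd0c9e2f1621ada26cc"),
    ("DOC0001", "1de319c276884bd0c9e2f1621ada26cc"),
    ("DOC0002", "ffffffffffffffffffffffffffffffff"),  # made-up hash
    ("DOC0004", None),                                # no Message ID
]
print(assign_duplicate_status(docs))
```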

Viewing EDRM MIH results

The EDRM Saved Search lists the potential duplicates.

  • Sort the results by EDRM MIH and by EDRM Duplicate Status to view the report with information on the possible duplicates.
    [Figure: List view showing potential duplicate documents returned by using the EDRM MIH hash.]
  • Filter the EDRM Duplicate Status column for Master and Duplicate to see the list of potential duplicate documents.
    [Figure: List view showing potential duplicate documents filtered by the duplicate status of master and duplicate.]

How to map the EDRM MIH field from a load file

When you import the load file, map the load file field that contains the EDRM MIH to the corresponding hash field in the workspace.

[Figure: Import Export load file field mappings showing the EDRM MIH field mapped to the workspace corresponding hash field.]
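
Because the MIH is an MD5 hash, every value is 32 hexadecimal characters long, so it can be worth checking a load file column before mapping it. The sketch below is one possible check, independent of Relativity. It assumes the load file rows have already been read into dictionaries (for example with csv.DictReader), and the column name "EDRM MIH" is hypothetical.

```python
import re

# An MD5 digest written in hexadecimal is exactly 32 characters long.
MD5_HEX = re.compile(r"[0-9a-fA-F]{32}")


def invalid_mih_rows(rows, column="EDRM MIH"):
    """Return (row_number, value) for non-empty values that are not MD5 hex.

    Empty values are allowed, since not every item has an MIH.
    """
    problems = []
    for number, row in enumerate(rows, start=1):
        value = (row.get(column) or "").strip()
        if value and not MD5_HEX.fullmatch(value):
            problems.append((number, value))
    return problems


rows = [
    {"Control Number": "DOC0001", "EDRM MIH": "1de319c276884bd0c9e2f1621ada26cc"},
    {"Control Number": "DOC0002", "EDRM MIH": ""},
    {"Control Number": "DOC0003", "EDRM MIH": "<abc@example.com>"},  # unhashed value
]
print(invalid_mih_rows(rows))
```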

How to export the EDRM MIH values

When you set up an export job, include the EDRM MIH field so that the exported data can still be used for duplicate identification.

[Figure: Sample export job showing the EDRM MIH field selected, or included in the export properties.]

Limitations

In the following scenarios, the EDRM MIH will not be calculated:

  • If the file does not have a Message ID value
  • If the file is not an email (.eml, .msg) file
  • If the file was discovered in Relativity before this functionality was enabled. In this scenario, consider re-discovering the email files.
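
The first two conditions can be expressed as a simple pre-check. The sketch below is illustrative only: it assumes the header text has already been extracted from the file (the Python standard library cannot read .msg files directly), and the function name mih_skip_reason is hypothetical. The third condition depends on when the file was discovered, so it cannot be detected from the file itself.

```python
from email import message_from_string
from pathlib import PurePath

EMAIL_EXTENSIONS = {".eml", ".msg"}


def mih_skip_reason(filename, raw_headers):
    """Return why no MIH would be calculated, or None if one is expected."""
    if PurePath(filename).suffix.lower() not in EMAIL_EXTENSIONS:
        return "not an email file"
    # Header lookup is case-insensitive; a missing header yields "".
    message_id = message_from_string(raw_headers).get("Message-ID", "")
    if not message_id.strip():
        return "no Message ID value"
    return None


print(mih_skip_reason("report.pdf", ""))
print(mih_skip_reason("draft.eml", "Subject: Draft\n\nBody text"))
print(mih_skip_reason("mail.eml", "Message-ID: <abc@example.com>\n\nBody text"))
```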

In the following scenarios, the MIH on its own may not be adequate to perform deduplication. Consider using a combination of the MIH and the email date (for example, Sent Date & Time):

  • Draft messages without Message IDs
  • SPAM and Fraudulent Messages
  • System Generated Emails
  • Malformed or Corrupted Message IDs
  • Messages with Prepended or Appended Headers, Footers and Signatures
  • Messages with Message Group and Alias Addressing
  • Messages with BCCs
  • Messages with Stripped or Corrupted Attachments
  • Messages with Time Anomalies
  • Items that are Not Email Messages
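
The combination of the MIH and the email date mentioned above could be modeled as a composite key. This is a minimal sketch rather than a documented Relativity feature. It assumes the sent date is available as a timezone-aware datetime, normalizes it to UTC, and truncates it to whole seconds. The function name composite_key is hypothetical.

```python
from datetime import datetime, timedelta, timezone


def composite_key(mih, sent):
    """Combine an MIH with the sent date, normalized to UTC seconds.

    Assumption: 'sent' is timezone-aware. A naive datetime is rejected
    because its offset from UTC would be ambiguous.
    """
    if sent.tzinfo is None:
        raise ValueError("sent date must be timezone-aware")
    sent_utc = sent.astimezone(timezone.utc).replace(microsecond=0)
    return (mih.lower(), sent_utc.isoformat())


mih = "1de319c276884bd0c9e2f1621ada26cc"
utc_copy = datetime(2024, 1, 5, 12, 0, tzinfo=timezone.utc)
est_copy = datetime(2024, 1, 5, 7, 0, tzinfo=timezone(timedelta(hours=-5)))
later_copy = utc_copy + timedelta(hours=1)

# Same instant in two time zones gives the same key.
print(composite_key(mih, utc_copy) == composite_key(mih, est_copy))
# Same MIH with a different sent time gives a different key.
print(composite_key(mih, utc_copy) == composite_key(mih, later_copy))
```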

See DupeID > Duplicate Identification Project Overview for more information on limitations and use cases.