

Note the following special considerations regarding deduplication:
Deduplication is applied only to Level 1 non-container parent files. If a child file (Level 2 or higher) has the same processing duplicate hash as a parent file or another child file, the files are not deduplicated; both are published to Relativity even though their hash fields have the same value. This preserves family integrity. To find the level value for a file, map the Level metadata field from the Field Catalog.
If you change the deduplication method between publications of the same data, even if you're using different processing sets, you may encounter unintended behavior. For example, if you publish a processing set with None selected for the deduplication method on the profile and then make a new set with the same data and publish it with Global selected, Relativity won't publish any new documents because they will all be considered duplicates. In addition, the All Custodians field will display unexpected data. This is because the second publish operation assumed that all previous publications were completed with the same deduplication settings.
When you select Global as the deduplication method on the profile, documents that are duplicates of documents that were already published to the workspace in a previous processing set aren't published again.
When you select Custodial as the deduplication method on the profile, documents that are duplicates of documents owned by the custodian specified on the data source aren't published to the workspace.
When you select None as the deduplication method on the profile, all documents and their duplicates are published to the workspace.
When you select Global as the deduplication method on the profile and you publish a processing set that includes documents with attachments, and those attachments are duplicates of each other, all documents and their attachments are published to the workspace.
When you select Global as the deduplication method on the profile and you publish a processing set that contains a password-protected document inside a zip file, you receive an error. When you unlock that document and republish the processing set, the document is published to the workspace. If you follow the same steps with a subsequent processing set, the unlocked document is deduplicated and not published to the workspace.
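The behavior of the three deduplication methods described above can be illustrated with a small simulation. This is only a sketch under stated assumptions, not how Relativity implements publishing: the `Doc` class, the `publish` function, and the `published` set of (custodian, hash) pairs are hypothetical names, the container check is omitted, and the sketch assumes that duplicates within the same processing set are treated the same way as duplicates from earlier sets.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Doc:
    name: str
    dedup_hash: str
    custodian: str
    level: int = 1  # 1 = parent file; 2 or higher = child file


def publish(docs, method, published):
    """Return the docs that would be published.

    `published` is a hypothetical record of (custodian, hash) pairs for
    Level 1 documents already in the workspace; it is updated in place.
    """
    if method not in ("Global", "Custodial", "None"):
        raise ValueError(f"unknown deduplication method: {method}")
    out = []
    for doc in docs:
        key = (doc.custodian, doc.dedup_hash)
        # Only Level 1 files are candidates for deduplication;
        # child files are always published to keep families intact.
        if doc.level == 1 and method != "None":
            if method == "Global":
                # Duplicate of any previously published document
                is_dup = any(h == doc.dedup_hash for _, h in published)
            else:
                # Custodial: duplicate only within the same custodian
                is_dup = key in published
            if is_dup:
                continue
        out.append(doc)
        if doc.level == 1:
            published.add(key)
    return out
```

Because the record is keyed on both custodian and hash, the same `published` set can be passed to successive calls to model several processing sets published to one workspace.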
The system uses the algorithms described below to calculate hashes when performing deduplication on both loose files (standalone files not attached to emails) and emails for processing jobs that include either global or custodial deduplication.
The system calculates hashes in a standard way: it reads all of the bits and bytes that make up the content of the file, creates a hash from them, and compares that hash to the hashes of other files to identify duplicates.
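As a rough illustration of content-based duplicate detection, the sketch below hashes the raw bytes of each file and groups files whose hashes match. The function names `content_hash` and `group_duplicates` are hypothetical, and the choice of SHA-256 here is an assumption made for the example.

```python
import hashlib
from collections import defaultdict


def content_hash(data: bytes) -> str:
    # Hash every byte of the file content (SHA-256 is an assumption here)
    return hashlib.sha256(data).hexdigest()


def group_duplicates(files: dict[str, bytes]) -> list[list[str]]:
    """Group file names whose contents produce the same hash."""
    by_hash = defaultdict(list)
    for name, data in files.items():
        by_hash[content_hash(data)].append(name)
    # Keep only the groups that contain more than one file
    return [names for names in by_hash.values() if len(names) > 1]
```

File names play no part in the comparison, so two files with different names but identical bytes land in the same group.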
The following hashes are involved in deduplication:
To calculate a file hash for native files, the system:
Note: Relativity can't calculate the MD5 hash value if you have FIPS (Federal Information Processing Standards cryptography) enabled for the worker manager server.
To calculate an email’s MessageBodyHash, the system:
Note: Removing all of the components mentioned above is necessary. Otherwise, an email containing a carriage return and a line feed would not be deduplicated against an otherwise identical email containing only a line feed, because the first would have two whitespace characters and the second would have only one.
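The effect described in the note can be seen in a minimal sketch. It assumes, for illustration only, that normalization removes every whitespace character from the body, that the text is encoded as UTF-8, and that SHA-256 is the hash; the function name `normalized_body_hash` is hypothetical and is not the system's actual MessageBodyHash routine.

```python
import hashlib


def normalized_body_hash(body: str) -> str:
    # str.split() with no arguments splits on any run of whitespace,
    # so joining the pieces removes spaces, tabs, CR, and LF alike.
    stripped = "".join(body.split())
    # UTF-8 and SHA-256 are assumptions made for this example
    return hashlib.sha256(stripped.encode("utf-8")).hexdigest()
```

With this normalization, a body that uses CRLF line endings and one that uses LF line endings hash to the same value, whereas hashing the raw text would produce two different values.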
To calculate an email’s HeaderHash, the system:
RE: Your last email
Robert Simpson
robert@relativity.com
10/4/2010 05:42:01 PM
The system calculates an email’s RecipientHash through the following steps:
Russell Scarcella
rscarcella@relativity.com
Kristen Vercellino
kvercellino@relativity.com
To calculate an email’s AttachmentHash, the system:
Beginning in
To derive the Relativity deduplication hash, the system:
`6283cfb34e4831c97e363a9247f1f01beaaed01db3a65a47be310c27e3729a3ee05dce5acaec3696c681cd7eb646a221a8fc376478b655c81214dca7419aabee6283cfb34e4831c97e363a9247f1f01beaaed01db3a65a47be310c27e3729a3ee3843222f1805623930029bad6f32a7604e2a7acc10db9126e34d7be289cf86e`
Note: If two emails have an identical body, attachment, recipient, and header hash, they are duplicates.
Note: For loose files, the Processing Duplicate Hash is a hash of the file's SHA256 hash.
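The two notes above can be sketched as follows. Emails are compared on all four component hashes, and a loose file's duplicate hash is derived from its SHA256 hash. The names `EmailHashes`, `emails_are_duplicates`, and `loose_file_duplicate_hash` are hypothetical. The outer hash algorithm and the encoding of the inner hash are not stated here, so the use of SHA-256 over the hexadecimal string is an assumption and will not necessarily reproduce the values Relativity stores.

```python
import hashlib
from typing import NamedTuple


class EmailHashes(NamedTuple):
    body: str
    header: str
    recipient: str
    attachment: str


def emails_are_duplicates(a: EmailHashes, b: EmailHashes) -> bool:
    # Duplicates only when all four component hashes are identical
    return a == b


def loose_file_duplicate_hash(data: bytes) -> str:
    # Step 1: SHA256 hash of the file content
    sha256_hex = hashlib.sha256(data).hexdigest()
    # Step 2: hash of that hash (algorithm and encoding are assumptions)
    return hashlib.sha256(sha256_hex.encode("ascii")).hexdigest()
```

Comparing the four hashes as a tuple means a difference in any single component, such as one extra recipient, is enough to keep two emails from being treated as duplicates.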