

Analytics email threading greatly reduces the time and complexity of reviewing emails by gathering all forwards, replies, and reply-all messages together. Email threading identifies email relationships, and then extracts and normalizes email metadata. Email relationships identified by email threading include:
An email thread is a single email conversation that starts with an original email and includes all of the subsequent replies and forwards pertaining to that original email. The Analytics engine uses a combination of email headers and email bodies to determine whether emails belong to the same thread. Analytics allows for data inconsistencies that can occur, such as timestamp differences generated by different servers. The Analytics engine then determines which emails are inclusive, meaning that they contain unique content and should be reviewed. See Inclusive emails for additional information.
This process includes the following steps at a high level:
In order for a document to be recognized as an email and threaded, it must have the Email From field and at least one of the following:
Sent Date
Email To
Email Subject
Email CC
Email BCC
These fields can be either in the document metadata or in the extracted text. If a document is recognized as an email but does not have a Sent Date field, it is categorized as a draft.
For more information on email headers, see Supported email header formats.
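The recognition rule above can be sketched in a few lines of Python. This is only an illustration, not the Analytics engine's implementation: the function name `classify_document` and the dictionary representation of a document are hypothetical, and the sketch checks only mapped field values, whereas Analytics can also find these fields in the extracted text.

```python
from typing import Mapping

# Field names taken from the list above.
REQUIRED_FIELD = "Email From"
SUPPORTING_FIELDS = ("Sent Date", "Email To", "Email Subject", "Email CC", "Email BCC")


def _has_value(doc: Mapping[str, object], field: str) -> bool:
    """True when the field is present and not blank."""
    value = doc.get(field)
    return value is not None and str(value).strip() != ""


def classify_document(doc: Mapping[str, object]) -> str:
    """Return 'not_email', 'draft', or 'email' (hypothetical labels)."""
    # Email From is mandatory.
    if not _has_value(doc, REQUIRED_FIELD):
        return "not_email"
    # At least one of the supporting fields must also be present.
    if not any(_has_value(doc, field) for field in SUPPORTING_FIELDS):
        return "not_email"
    # A recognized email without a sent date is treated as a draft.
    if not _has_value(doc, "Sent Date"):
        return "draft"
    return "email"
```

For example, a document with only Email From and Email To populated would be classified as a draft under this sketch.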
After completing an email threading operation, Analytics automatically creates and populates the following fields for each document included in the operation:
The Email Threading ID is formatted as follows:
Note: In cases where email segments are modified, such as when a confidentiality footer is inserted, the Email Threading ID may have a greater number of blocks than segments in the email. In such cases, the indentation will reflect the actual number of found segments in the document. See Email threading and the Indentation field.
The Email Threading Display field also indicates which emails are both inclusive and non-duplicate by displaying the indentation level number bubble in black.
See Email duplicate spare messages for more information.
Note: On incremental runs, the primary email in the Email Duplicate Spare group will never change.
Duplicate spare email messages contain the exact same content as another message, but they are not necessarily exact duplicates at the file level, as measured by a hash such as MD5 or SHA-256.
The identification of duplicate spare email messages happens during email threading, and the following properties are examined during the identification of email duplicate spares:
Note: A differing “Email To” alias would cause two otherwise duplicate emails to end up with different email thread groups.
The “aliases” for an author are other textual representations of the author that are treated as the same entity. For example, John Doe sends an email using the email address john.doe@example.com. He may have another email address, such as john.doe@gmail.com. Based on these email addresses, the Analytics engine finds that they are related and can build an alias list that includes "John Doe", "Doe, John", "john.doe@gmail.com", and "john.doe@example.com". Anytime email threading encounters any one of these four representations in the Sender field of an email segment, it treats them as the same person or entity.
The exception to the general case is when the authors do not match. For example, if SMTP addresses and LDAP addresses cannot be matched as the author values, but the subject and text are exact matches, a more stringent time frame applies. In such cases, the times must be within 24 hours of each other, and their minute values must be within one minute of each other. For example, 15:35 would match 18:36, but 15:35 would not match 18:37.
Note: It is very important that the attachments are included in the Email Threading Structured Analytics Set. If only the parent emails are threaded, Analytics will not be able to pick up these differences.
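The stricter time rule for unmatched authors can be illustrated with a short Python sketch. The function name `strict_time_match` is hypothetical, and two details are assumptions because the text does not specify them: the 24-hour limit is treated as inclusive, and minute values are compared directly without wrapping around the hour boundary (so minute 59 and minute 00 are not treated as adjacent here).

```python
from datetime import datetime, timedelta


def strict_time_match(a: datetime, b: datetime) -> bool:
    """Check the stricter time frame used when authors cannot be matched.

    Assumptions (not confirmed by the documentation): the 24-hour limit is
    inclusive, and the minute comparison does not wrap around the hour.
    """
    within_day = abs(a - b) <= timedelta(hours=24)
    minute_gap = abs(a.minute - b.minute)
    return within_day and minute_gap <= 1


# The example from the text: 15:35 matches 18:36 but not 18:37.
base = datetime(2020, 1, 6, 15, 35)
print(strict_time_match(base, datetime(2020, 1, 6, 18, 36)))
print(strict_time_match(base, datetime(2020, 1, 6, 18, 37)))
```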
The following properties are not considered during the Email Duplicate Spare analysis:
Changing johnsmith into john smith counts as a difference. White space changes within the header also count under certain circumstances.
Duplicate spare information saves to the following fields in Relativity after email threading completes:
See Email threading fields for more information on the Email Duplicate Spare and Email Duplicate ID fields.
Consider the following before running a structured analytics email threading operation:
Email threading supports a limited set of language formats for email headers.
Processing engines typically insert English-language headers on top of extracted email body text when they process container files such as .pst, .ost, or .nsf. These headers, such as "To," "From," "Subject," etc., take their contents from specific fields in the container file. The container file’s email body text does not, strictly speaking, contain the headers. For this reason, we always recommend that you keep English selected in the list of email header languages.
When the Analytics engine parses emails, it looks for cues based on supported header formats and languages to determine whether it is in an email header. In particular, it looks for words like "To," "From," "CC," and "Subject" in the case of traditional English headers, or “An,” “Von,” “Cc,” and “Betreff” in the case of standard German headers. It also looks for other header styles, such as "on <date>, <author> wrote:" for single-line replies (English) or “在 <date>, <author> 写道:” (Chinese). There are many other variations and languages beyond the ones shown here. For more information, see Supported email header formats.
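To make the idea of header cues concrete, here is a minimal Python sketch that checks a single line of text against the English and German cue words and the English single-line reply style quoted above. It is an illustration only, not the Analytics engine's parser: the names `HEADER_CUES`, `SINGLE_LINE_REPLY`, and `detect_header_cue` are hypothetical, the regular expression is a rough approximation, and only a small subset of the supported formats is covered.

```python
import re
from typing import Optional, Tuple

# Cue words quoted in the text; real header support is much broader.
HEADER_CUES = {
    "English": ("To", "From", "CC", "Subject"),
    "German": ("An", "Von", "Cc", "Betreff"),
}

# Rough approximation of the English single-line reply header.
SINGLE_LINE_REPLY = re.compile(r"^\s*on\s+.+,\s*.+\s+wrote:\s*$", re.IGNORECASE)


def detect_header_cue(line: str) -> Optional[Tuple[str, str]]:
    """Return (language, cue) if the line looks like a header line, else None."""
    if SINGLE_LINE_REPLY.match(line):
        return ("English", "single-line reply")
    label, separator, _ = line.partition(":")
    if not separator:
        return None
    label = label.strip().lower()
    # Languages are checked in insertion order, so a shared cue such as
    # "Cc" is reported as English in this sketch.
    for language, cues in HEADER_CUES.items():
        for cue in cues:
            if label == cue.lower():
                return (language, cue)
    return None
```

A line such as "Betreff: Angebot" would be reported as a German header cue, while ordinary body text returns `None`.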
Email threading will be affected as follows by unsupported email header formats and/or headers in unsupported languages:
When mapped on the Analytics Profile, email threading uses the Microsoft Conversation Index to bring emails together into threads. For example, if you replied to an email thread, but deleted everything below your signature, and changed recipients, email threading could group all emails together based on the Microsoft Conversation Index. If that field weren't present, email threading would not group those emails together.
If the Conversation ID field is present for an email and mapped on the profile, it's used first to group the email together with other emails. The text is not examined to validate the Conversation ID data. If a match is found based upon Conversation ID, no further analysis is done on the email for grouping purposes. If no match is found, the system analyzes all other data to thread the email. If some emails have this field and others do not, such as emails from non-Microsoft email clients, they still may be grouped together in the same email thread group when determined necessary.
Email threading does not use the Microsoft Conversation Index to break threads apart. Please note that inaccurate Conversation ID data will harm the quality of email threading results. Email threading uses the Conversation ID to group together emails with similar Conversation IDs, even when their Extracted Text differs. The Conversation ID is not typically recommended, as email threading is highly accurate without the use of Conversation ID. Only when the email headers are widely corrupt or in unsupported formats do we recommend the use of this field.
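The order of operations described above (Conversation ID first, then all other data) can be sketched as follows. This is a simplified illustration under stated assumptions, not how Analytics is implemented: `assign_thread_group` and the `analyze_headers_and_text` callback are hypothetical names, Conversation IDs are matched exactly here even though the text speaks of "similar" IDs, and remembering the ID after a fallback analysis is a choice made for this sketch.

```python
from typing import Callable, Dict, Mapping


def assign_thread_group(
    email: Mapping[str, str],
    groups_by_conversation_id: Dict[str, str],
    analyze_headers_and_text: Callable[[Mapping[str, str]], str],
    conversation_id_mapped: bool = True,
) -> str:
    """Pick a thread group, trying the Conversation ID before other data."""
    conv_id = email.get("Conversation ID") if conversation_id_mapped else None

    # A Conversation ID match ends the grouping analysis; the text is not
    # examined to validate it.
    if conv_id and conv_id in groups_by_conversation_id:
        return groups_by_conversation_id[conv_id]

    # No match: fall back to analyzing headers and body text.
    group = analyze_headers_and_text(email)

    # Assumption for this sketch: remember the ID so later emails can match it.
    if conv_id:
        groups_by_conversation_id[conv_id] = group
    return group
```

Because the match on Conversation ID short-circuits the text analysis, inaccurate ID data would place an email in the wrong group in this sketch as well, which mirrors the caution above.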
When the Email Message ID, In Reply To, and Message References fields are mapped on the Analytics profile, emails imported from Gmail can be threaded according to Google's native threading system. This creates more accurate threading results and fewer false inclusives. These results are then threaded normally with any non-Gmail messages included in the document set.
The Gmail metadata fields are located in the Message ID Email Metadata section of the Analytics profile. For more information, see Message ID Email Metadata.
Note: The Message ID Email Metadata fields cannot be mapped on the same Analytics profile as the Conversation ID field. If you want to thread emails using both, create two separate sets of Analytics profiles and structured analytics sets.
The Analytics server is queried for the true number of found segments in the email. This indentation level is both in the document field and in the bubble/square that is present in the email threading visualization field.
In most cases, the Email Threading ID consists of one "block" per email segment. See Sample 1. Thus, "F00000abc-0000-0000+" would be a three-segment email. However, there are cases where the number of segments in the email does not match the total number of blocks in the Email Threading ID. When there are fewer blocks in the threading ID than segments in the email, this indicates that the top segment matches a lower segment on subject, segment body, normalized author, and date. When there are more blocks in the Email Threading ID than segments in the email, this indicates that there is segment corruption or changes. See Sample 2.
The standard case is that we have three documents: Document1, Document2, and Document3. The first document has two segments, the result of someone replying to an email from a colleague. The second document has three segments, a reply to Document1. The third document is exactly like the second.
We call the segments in the documents "A," the original email, "B," the reply, and "C," the subsequent reply. The table below shows the Email Threading ID, the indentation level, inclusiveness, and whether or not each document is classified as a duplicate spare.
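The relationship between blocks and segments can be checked mechanically. The Python sketch below splits an Email Threading ID into blocks and compares the block count with a segment count. It is an illustration only: `split_blocks` and `compare_blocks_to_segments` are hypothetical names, and the assumption that every block ends in "-" or "+" is inferred from the sample IDs on this page rather than from a format specification.

```python
import re
from typing import List

# Assumption inferred from the sample IDs: each block is a run of
# characters terminated by "-" or "+".
BLOCK = re.compile(r"[^-+]+[-+]")


def split_blocks(threading_id: str) -> List[str]:
    """Split an Email Threading ID into its blocks."""
    blocks = BLOCK.findall(threading_id)
    if "".join(blocks) != threading_id:
        raise ValueError(f"unexpected threading ID format: {threading_id!r}")
    return blocks


def compare_blocks_to_segments(threading_id: str, segment_count: int) -> str:
    """Describe how the block count relates to the number of segments."""
    block_count = len(split_blocks(threading_id))
    if block_count == segment_count:
        return "one block per segment"
    if block_count < segment_count:
        return "fewer blocks: top segment matches a lower segment"
    return "more blocks: segment corruption or changes"


print(split_blocks("F00000abc-0000-0000+"))
```

Under this sketch, "F00000abc-0000-0000+" splits into three blocks, so pairing it with a two-segment document would be reported as the "more blocks" case.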
Control number | Document1 | Document2 | Document3 |
---|---|---|---|
Document layout (segments and arrangement) | Segment B | Segment C | Segment C |
Email threading ID | F00000abc-0000+ | F00000abc-0000+0000+ | F00000abc-0000+0000+ |
Indentation level (segments) | 2 | 3 | 3 |
Inclusive email | No | Yes | Yes |
Duplicate Spare | No | No | Yes |
As you can see, the Email threading ID of Document1 is the first part of the ID of Document2 and Document3, just as the segments of Document1 make up the bottom part of documents 2 and 3. In other words, "F00000abc-" corresponds directly to "A", the first "0000+" to B, and the second "0000+" to C.
Now, suppose there is a corruption of segment A due to a server-applied confidentiality footer. In this case, we might have "A" at the bottom of Document1, "A" at the bottom of Document2, but "X" at the bottom of Document3, assuming Document2 was collected from the sending party and Document3 from the receiving party, who sees the footer added by the sending party's server. Because B is a match, Analytics can successfully thread the documents. However, it cannot assert that the bottom segments are the same.
Control number | Document1 | Document2 | Document3 |
---|---|---|---|
Document layout (segments and arrangement) | Segment B | Segment C | Segment C |
Email threading ID | F00000abc-0000-0000+ | F00000abc-0000-0000+0000+ | F00000abc-0000-0000+0000+ |
Indentation level (segments) | 2 | 3 | 3 |
Inclusive email | No | Yes | Yes |
Duplicate Spare | No | No | No |
As you can see, an additional "0000-" was added after the "F00000abc-". This "phantom" node represents the fact that two different segments, A and X, occurred in the root segment's position. You can think of "A" as being associated with "F00000abc-" again, and "X" with "0000-". But since each ID must begin with the thread group number, we have to list both A's and X's nodes in all documents. If there were a third bottom segment, for example if Document2 had "Y" at the bottom rather than "A", then all three Email Threading IDs would have an additional "phantom" 0000-. So Document1 in that case would have an ID of F00000abc-0000-0000-0000+.