

Analytics email threading greatly reduces the time and complexity of reviewing emails by gathering all forwards, replies, and reply-all messages together. Email threading identifies email relationships, and then extracts and normalizes email metadata. Email relationships identified by email threading include:
An email thread is a single email conversation that starts with an original email and includes all of the subsequent replies and forwards pertaining to that original email. The Analytics engine uses a combination of email headers and email bodies to determine if emails belong to the same thread. Analytics allows for data inconsistencies that can occur, such as timestamp differences generated by different servers. The Analytics engine then determines which emails are inclusive, meaning that they contain unique content and should be reviewed. See Inclusive emails for additional information.
This process includes the following steps at a high level:
See these related pages:
For a document to be recognized as an email and threaded, it must have the Email From field and at least one of the following:
These fields can either be in the metadata or in the extracted text.
Other considerations include:
For more information on email headers, see Supported email header formats.
After completing an email threading operation, Analytics automatically creates and populates the following fields for each document included in the operation:
Duplicate spare email messages contain the exact same content as another message, but they are not necessarily exact duplicates according to comparison methods such as MD5Hash or SHA-256. Email threading considers or ignores the following properties when marking items as duplicate spares.
During email threading, the following properties are examined to identify email duplicate spares:
The following properties are not considered during the Email Duplicate Spare analysis:
Changing johnsmith into john smith counts as a difference. White space changes within the header also count under certain circumstances.
Duplicate spare information saves to the following fields after email threading completes:
See Email threading fields for more information on the Email Duplicate Spare and Email Duplicate ID fields.
Email threading supports a limited set of language formats for email headers.
Processing engines typically insert English-language headers on top of extracted email body text when they process container files such as .pst files, .ost files, or .nsf files. These headers, such as To, From, Subject, and others, take their contents from specific fields in the container file. The container file’s email body text does not, strictly speaking, contain the headers. For this reason, we always recommend that you keep English selected in the list of email header languages.
When the Analytics engine parses emails, it looks for cues based on supported header formats and languages to determine when it is or is not in an email header. In particular, it is looking for words like To, From, CC, Subject in the case of traditional English headers, or An, Von, Cc, Betreff in the case of standard German headers. It also looks for other header styles such as on <date>, <author> wrote: for single-line replies (English) or 在 <date>, <author> 写道: (Chinese). There are many other variations and languages other than the ones shown here. For more information, see Supported email header formats.
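The cue-based detection described above can be illustrated with a minimal sketch. It is not the Analytics engine's parser: the function name `header_cue`, the label table, and the regular expressions are assumptions made for illustration, and they cover only the English, German, and single-line reply examples quoted in the paragraph.

```python
import re

# Header labels quoted in the text above (illustrative subset only).
HEADER_LABELS = {
    "english": ("To", "From", "CC", "Subject"),
    "german": ("An", "Von", "Cc", "Betreff"),
}

# Single-line reply styles quoted in the text above (English and Chinese).
SINGLE_LINE_REPLY = (
    re.compile(r"^\s*on\s+.+,\s*.+\s+wrote:\s*$", re.IGNORECASE),
    re.compile(r"^\s*在\s*.+,\s*.+\s*写道:\s*$"),
)


def header_cue(line):
    """Return a label for the header cue found on this line, or None."""
    for language, labels in HEADER_LABELS.items():
        for label in labels:
            # A cue is a known label at the start of the line, followed by a colon.
            if re.match(rf"^\s*{re.escape(label)}\s*:", line, re.IGNORECASE):
                return f"{language}:{label}"
    for pattern in SINGLE_LINE_REPLY:
        if pattern.match(line):
            return "single-line reply"
    return None
```

A line such as `Betreff: Hallo` is recognized only because the German labels are in the table. With an unsupported language, the same line would return `None` and be treated as body text, which is the situation the following section describes.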
Email threading will be affected as follows by unsupported email header formats or headers in unsupported languages:
Some email providers use their own fields to identify email threads. If your email set includes these fields, you can use them to improve your email threading results.
If you map the Email Message ID, In Reply To, and Message References fields on the Analytics profile, emails imported from Gmail can be threaded according to Google's native threading system. This creates more accurate threading results and fewer false inclusives. These results are then threaded normally with any non-Gmail messages included in the document set.
The Gmail metadata fields are located in the Message ID Email Metadata section of the Analytics profile. For more information, see Message ID Email Metadata.
Note: The Message ID Email Metadata fields cannot be mapped on the same Analytics profile as the Conversation ID field. If you want to thread emails using both, create two separate sets of Analytics profiles and structured analytics sets.
If you map the Microsoft conversation index number to the Conversation ID field, email threading uses it to thread emails according to Microsoft's threading system. For example, if you replied to an email thread, but deleted everything below your signature and changed recipients, email threading could still group those emails together based on the Conversation ID. If that field were not present, email threading would not group those emails together.
If the Conversation ID field is present for an email and mapped on the profile, it is used first to group the email with other emails. The text is not examined to validate the Conversation ID data. If a match is found based on Conversation ID, no further analysis is done on the email for grouping purposes. If no match is found, the system analyzes all other data to thread the email. If some emails have this field and others do not, such as emails from non-Microsoft email clients, they may still be grouped together in the same email thread group. Email threading only uses the Conversation ID to group emails; it does not use it to break threads apart.
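The order of operations above can be sketched as a two-pass grouping. This is a simplified reading, not the engine's algorithm: `group_emails`, the dictionary keys, and the `content_key` callable (a stand-in for the header and body analysis) are all hypothetical, and the sketch does not model how two separate Conversation ID groups might be reconciled with each other.

```python
from collections import defaultdict


def group_emails(emails, content_key):
    """Map each email id to a thread group label.

    emails: list of dicts with "id" and an optional "conversation_id".
    content_key: hypothetical callable standing in for header/body analysis.
    """
    by_conversation = defaultdict(list)
    for email in emails:
        if email.get("conversation_id"):
            by_conversation[email["conversation_id"]].append(email)

    group_of = {}        # email id -> group label
    content_groups = {}  # content key -> group label

    # Pass 1: a Conversation ID match groups emails without examining their text.
    for conversation_id, members in by_conversation.items():
        if len(members) > 1:
            label = f"conv:{conversation_id}"
            for email in members:
                group_of[email["id"]] = label
                content_groups.setdefault(content_key(email), label)

    # Pass 2: emails without a match fall back to content analysis and
    # may join a group that was formed by Conversation ID.
    for email in emails:
        if email["id"] not in group_of:
            key = content_key(email)
            group_of[email["id"]] = content_groups.setdefault(key, f"content:{key}")
    return group_of
```

In this sketch, an email whose body was deleted still lands in the same group as its siblings when the Conversation ID matches, and an email from a client that does not set the field can still join that group through the fallback analysis.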
Note: We only recommend using the Conversation ID field if the email headers are widely corrupt or in unsupported formats. Email threading uses the Conversation ID to group together emails with similar Conversation IDs, even when their extracted text differs. Inaccurate Conversation ID data will harm the quality of email threading results.
If your document set includes emails that come from system notifications or a shared email address, email threading sometimes considers all emails from this address to be from the same participant. To prevent this, you can add shared email addresses to an exclusion list.
When you apply an exclusion list to a structured analytics set, email threading creates unique participants for any email addresses that match the entries in the exclusion list. For example, if you apply an exclusion list that includes notifications@acme.com, Joe Smith <notifications@acme.com> and Kate Doe <notifications@acme.com> will each be treated as a unique participant. notifications@acme.com participants will never be merged with any others. Joe Smith <notifications@acme.com> will not be merged with Joe Smith <joe.smith@acme.com>. This can sometimes artificially increase the number of participants, but it also lowers the chances of over-merging thread groups.
When creating a structured analytics set, you can set up an exclusion list by enabling the Apply Exclusion List option. By default, the list includes several common keywords for notification and no-reply email addresses.
When modifying the exclusion list:
If "reply" is in the exclusion list, reply@[domain], no.reply@[domain], Reply_To@[domain], and all other variants containing the word "reply" will be excluded. If "John" is in the exclusion list, Johnny Doe <email>, scott john <email>, [person] <johnson@[domain]>, and all other variants containing "John" will be excluded.
Entry | Participants that would be excluded | Participants that would not be excluded |
---|---|---|
reply | [person] <reply@[domain]>, do_not_reply@[domain], no.reply@[domain], reply_to@[domain], Replyable@[domain] | Any participant not containing the text "reply" |
notifications | notifications@[domain], more-notifications@[domain], notifications from@[domain], stopnotifications@[domain] | Any participant not containing the text "notifications" |
@jira.com | [person]@jira.com | jiraffe@email.com, [person]@newjira.com, [person]@jira-new.com, [person]@jira.net |
John | john.Doe <email>, john_smith <email>, Johnny <email>, ScottJohn <email>, [person] <johnson@[domain]> | Any participant not containing the text "John" |
John Doe | Any participant containing the text "John Doe" with a space | John <email>, john.Doe <email>, johnDoe <email>, [person] john-Doe@[domain], John Smith <email> |
www | www.[domain].com, shawwwilliams@[domain] | login.[domain].com, http://[domain].com, shaw.w.williams@[domain] |
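The rows above are consistent with a case-insensitive substring match against the full participant string. The following sketch assumes exactly that rule. The function name `is_excluded` is hypothetical, and the matching behavior is inferred from the table rather than taken from the product's implementation.

```python
def is_excluded(participant, exclusion_list):
    """Return True if any entry appears anywhere in the participant string.

    Assumption inferred from the table: matching ignores case and does not
    require whole words, so "reply" also matches "Replyable".
    """
    text = participant.lower()
    return any(entry.lower() in text for entry in exclusion_list)
```

Under this reading, `www` excludes `shawwwilliams@[domain]` because the three letters appear consecutively, but not `shaw.w.williams@[domain]`. This is why short entries can exclude more participants than intended.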
In most cases, the email threading ID consists of one block per email segment (see Example 1). Thus, F00000abc-0000-0000+ would be a three-segment email. However, there are cases where the number of segments in the email does not match the number of blocks in the email threading ID. When there are fewer blocks in the threading ID than segments in the email, the top segment matches a lower segment on subject, segment body, normalized author, and date. When there are more blocks in the email threading ID than segments in the email, there is segment corruption or a change to a segment (see Example 2).
In the standard case, we have three documents: Document1, Document2, and Document3. The first document has two segments, the result of someone replying to an email from a colleague. The second document has three segments, a reply to Document1. The third document is exactly like the second.
We call the segments in the documents A (the original email), B (the reply), and C (the subsequent reply). The table below describes the email threading ID, indentation, inclusivity, and whether or not each document is classified as a duplicate spare.
Control number | Document1 | Document2 | Document3 |
---|---|---|---|
Document layout (segments and arrangement) | Segment B above segment A | Segment C above segments B and A | Segment C above segments B and A |
Email threading ID | F00000abc-0000+ | F00000abc-0000+0000+ | F00000abc-0000+0000+ |
Indentation level (segments) | 2 | 3 | 3 |
Inclusive email | No | Yes | Yes |
Duplicate Spare | No | No | Yes |
As you can see, the email threading ID of Document1 is the first part of the ID of Document2 and Document3, just as the segments of Document1 make up the bottom part of Document2 and Document3. In other words, F00000abc- corresponds directly to A, the first 0000+ to B, and the second 0000+ to C.
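The prefix relationship in Example 1 can be checked mechanically. The sketch below assumes only what the IDs shown here suggest, namely that each block is a run of characters ending in `+` or `-`. The helper names `split_blocks` and `is_ancestor` are hypothetical and are not part of the product.

```python
import re

# Assumption: a block is any run of characters terminated by "+" or "-".
BLOCK = re.compile(r"[^+-]+[+-]")


def split_blocks(threading_id):
    """Split an email threading ID into its blocks, bottom segment first."""
    blocks = BLOCK.findall(threading_id)
    if not blocks or "".join(blocks) != threading_id:
        raise ValueError(f"unexpected threading ID format: {threading_id!r}")
    return blocks


def is_ancestor(shorter_id, longer_id):
    """True if shorter_id's blocks are a strict prefix of longer_id's blocks."""
    a, b = split_blocks(shorter_id), split_blocks(longer_id)
    return len(a) < len(b) and b[: len(a)] == a
```

For the IDs in the table, `split_blocks("F00000abc-0000+0000+")` yields one block each for segments A, B, and C, and `is_ancestor` confirms that Document1's ID sits at the start of Document2's ID.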
Now, suppose there is a corruption of segment A due to a server-applied confidentiality footer. In this case, we might have A at the bottom of Document1, A at the bottom of Document2, but X at the bottom of Document3, assuming Document2 was collected from the sending party and Document3 from the receiving party, who sees the footer added by the sending party's server. Because B is a match, Analytics can successfully thread the documents. However, it cannot assert that the bottom segments are the same.
Control number | Document1 | Document2 | Document3 |
---|---|---|---|
Document layout (segments and arrangement) | Segment B above segment A | Segment C above segments B and A | Segment C above segments B and X |
Email threading ID | F00000abc-0000-0000+ | F00000abc-0000-0000+0000+ | F00000abc-0000-0000+0000+ |
Indentation level (segments) | 2 | 3 | 3 |
Inclusive Email | No | Yes | Yes |
Duplicate Spare | No | No | No |
As you can see, there is an additional 0000- that was added after the F00000abc-. This phantom node represents the fact that there are two different segments that occurred in the root segment's position, A and X. You can think of A as being associated with F00000abc- again, and X with 0000-. But since each ID must begin with the thread group number, we have to list both A's and X's nodes in all documents. If there were a third bottom segment, for example, if Document2 had Y at the bottom rather than A, then all three email threading IDs would have an additional phantom 0000-. So Document1 in that case would have an ID of F00000abc-0000-0000-0000+.
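The comparison between block count and segment count can be sketched as follows. The function names and the returned descriptions are illustrative assumptions. The segment count is supplied by the caller, and the block pattern again assumes that every block ends in `+` or `-`.

```python
import re


def count_blocks(threading_id):
    """Count blocks, assuming each block ends in "+" or "-"."""
    return len(re.findall(r"[^+-]+[+-]", threading_id))


def compare_blocks_to_segments(threading_id, segment_count):
    """Describe how the block count relates to the number of email segments."""
    blocks = count_blocks(threading_id)
    if blocks == segment_count:
        return "one block per segment"
    if blocks < segment_count:
        return "top segment matches a lower segment"
    return f"segment corruption or changes ({blocks - segment_count} extra block(s))"
```

For Document1 in Example 2, the ID F00000abc-0000-0000+ has three blocks for a two-segment email, so the sketch reports one extra block, which corresponds to the phantom node described above.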
Consider the following before running a structured analytics email threading operation: