Email threading
Analytics email threading greatly reduces the time and complexity of reviewing emails by gathering all forwards, replies, and reply-all messages together. Email threading identifies email relationships, and then extracts and normalizes email metadata. Email relationships identified by email threading include:
- Email threads
- People involved in an email conversation
- Email attachments, if the Parent ID is provided along with the attachment item
- Duplicate emails
An email thread is a single email conversation that starts with an original email, the beginning of the conversation, and includes all of the subsequent replies and forwards pertaining to that original email. The analytics engine uses a combination of email headers and email bodies to determine if emails belong to the same thread. Analytics allows for data inconsistencies that can occur, such as time stamp differences generated by different servers. The Analytics engine then determines which emails are inclusive, meaning that it contains unique content and should be reviewed. See Inclusive emails for additional information regarding inclusive emails.
This process includes the following steps at a high level:
- Segment emails into the component emails, such as when a reply, reply-all, or forward occurred.
- Examine and normalize the header data, such as senders, recipients, and dates. This happens with both the component emails and the parent email, whose headers are usually passed explicitly as metadata.
- Recognize when emails belong to the same conversation, referred to as the email thread group, using the body segments along with headers, and determine where in the conversation the emails occur. Email thread groups are created using the body segments, email from, email to, email date, and email subject headers.
- Determine inclusiveness within conversations by analyzing the text, the sent time, and the sender of each email.
See these related pages:
- Supported email header formats
- Inclusive emails
- Email threading results
- Structured analytics
- Running structured analytics
- Email thread visualization
Minimum threading requirements
For a document to be recognized as an email and threaded, it must have the Email From field and at least one of the following:
- Sent Date
- Email To
- Email Subject
- Email CC
- Email BCC
These fields can either be in the metadata or in the extracted text.
Other considerations include:
- The Email From field must have a value other than special characters. For example, if the field only contains "<>", the document will not be considered an email.
- If a document is recognized as an email but does not have a Date Sent field, it will be categorized as a draft.
For more information on email headers, see Supported email header formats.
Email threading fields
After completing an email threading operation, Analytics automatically creates and populates the following fields for each document included in the operation:
- <Structured Analytics Set prefix>::Email Author Date ID—contains a string of hashes that corresponds to each email segment of the current document's email conversation. Each hash is the generated MD5 hash value of the email segment’s normalized author field and the date field. The email thread visualization (ETV) tool uses this field to create the visual depiction of email threading. For more information, see Email thread visualization.
- <Structured Analytics Set prefix>::Email Threading ID—indicates which conversation an email belongs to and where in that conversation it occurred. This field groups all emails into their email thread groups and identifies each email document’s position within the group by Sent Date, beginning with the oldest.
The Email Threading ID format is:- The first nine characters represent the email thread group and root node of the email thread group tree. This is always a letter followed by eight hexadecimal digits.
- The next character is either a + (plus) sign or a - (minus) sign. The + sign means that an actual root email document exists in the data set provided, and the - sign means that the root email is not in the data set. For example, an Email Threading ID of "F00000abc-" indicates that the Analytics engine found evidence of a root email for the email thread group, but it did not find the root email itself as a separate document in the data set. In these cases, reviewers may be able to read the text of the missing root email by looking at the bottom segment of later emails in the thread.
- The next characters are zero or more node identifiers. These identifiers are each made of four hexadecimal digits, followed by a + or - sign that indicates whether the email for that node exists within the data set. Each identifier defines one node in the tree. For example, an email sent from A to B, then replied to by B, would have the initial ten characters for the root message from A to B, plus one five-character set for the reply.Note: Each email threading node, including the root node, represents one email. The node contains all copies of the email, even if the email was sent to multiple people in the data set.
- <Structured Analytics Set prefix>::Email Thread Group—initial nine characters of the email threading ID. This value is the same for all members of an email thread group which are all of the replies, forwards, and attachments following an initial sent email. This field is not relational.
- <Structured Analytics Set prefix>::Email Action—the action the sender of the email performed to generate this email. Email threading records one of the following action values:
- SEND
- REPLY
- REPLY-ALL
- FORWARD
- DRAFT
- <Structured Analytics Set prefix>::Indentation—a whole number field indicating when the document was created within the thread. This is derived from the # of Segments Analytics engine metadata field.Note: In cases where email segments are modified, for example, a confidentiality footer is inserted, the Email Threading ID may have a greater number of blocks than segments in the email. In such cases, the Indentation will reflect the actual number of found segments in the document. See Indentation levels and the email threading ID.
- <Structured Analytics Set prefix>::Email Threading Display—visualization of the properties of a document including the following:
- Sender
- Email subject
- File type icon, accounting for email action
- Indentation level number bubble
- <Structured Analytics Set prefix>::Inclusive Email—inclusive email indicator. An inclusive email message contains the most complete prior message content and lets you bypass redundant content. Reviewing documents specified as inclusive and ignoring duplicates covers all authored content in an email thread group.
- <Structured Analytics Set prefix>::Inclusive Reason—lists the reasons a message is marked inclusive. Each reason indicates the type of review required for the inclusive email:
- Message—the message contains content in the email body requiring review.
- Attachment—the message contains an attachment that requires review. The rest of the content does not necessarily require review.
- Inferred match—email threading identified the message as part of a thread where the header information matches with the rest of the conversation, but the body is different. This reason only occurs on the root email of a thread and is often caused by mail servers inserting confidentiality footers in a reply. Review to verify the message actually belongs to the conversation.
- Unanalyzed Attachment—the message contains attachments that could not be analyzed by the Analytics engine. In most cases this is due to the attachment being larger than 30 MB.
- <Structured Analytics Set prefix>::Email Duplicate Spare—this Yes/No field indicates whether an email is a duplicate. Duplicate spare refers to a message that is an exact duplicate of another message such as a newsletter sent to a distribution list. A No value in this field indicates that an email is either not a duplicate of any other emails in the set, or that it is the primary email in the Email Duplicate group. The primary email is usually the email with the lowest Document Identifier/Control Number, identifier field in Relativity. A Yes value indicates the document is in an Email Duplicate group, but it is not the primary document.
See Email duplicate spare messages for more information.Note: On incremental runs, the primary email in the Email Duplicate Spare group will never change.
- <Structured Analytics Set prefix>::Email Duplicate ID—this field contains a unique identifier only if the email is a duplicate of another email in the set. If the email is not a part of an email duplicate group, this field is not set.
- <Structured Analytics Set prefix>::Email Thread Hash—a hash generated by Relativity to facilitate email thread visualization.
Email duplicate spare messages
Duplicate spare email messages contain the exact same content as another message, but they are not necessarily exact duplicates according to comparison methods such as MD5Hash or SHA-256. Email threading considers or ignores the following properties when marking items as duplicate spares.
Properties considered during Email Duplicate Spare analysis
During email threading, the following properties are examined to identify email duplicate spares:
- Email Thread Group—the emails must have the same email thread group in order to be email duplicate spares. Please note that a different Email To alias would cause two otherwise duplicate emails to end up with different email thread groups.
- Email From—the email authors must be the same email alias.
The aliases for an author are other textual representations of the author that are equated as the same entity. For example, John Doe sends an email using the email address john.doe@example.com. He may have another email address, such as john.doe@gmail.com. Based on these email addresses, the Analytics engine finds they are related and can make an alias list that would include John Doe and Doe, John and john.doe@gmail.com and john.doe@example.com. Anytime email threading encounters any one of these four entities, email addresses, in the Sender field of an email segment, it considers them the same person or entity.Note: If the sender's name includes a single character by itself, such as J Smith, that character is generally ignored during the alias matching process. The only single-character names that are used for matches are names consisting of a Chinese, Japanese, or Korean ideograph. - Email Subject—the trimmed email subject must match exactly. The analytics engine trims and ignores any prefixes such as RE:, FW:or FWD: or the equivalent in the non-English languages Analytics supports.
- Email Body—the email body must match exactly, although white space is ignored.
- Sent Date—the identification considers the sent date, but permits for a level of variance. In general, the allowed time window is 30 hours for a valid match of email messages with no minute matching involved.
The exception to the general case are the specific cases where the authors do not match. For example, if it is impossible to match SMTP addresses and LDAP addresses as the author values, but the subject and text are exact matches, there is a more stringent time frame. In such cases, the time must be within 24 hours, and the minute must be within one minute of each other. For example, 15:35 would match with 18:36, but 15:35 would not match with 18:37. - Attachments—attachment text must match exactly. As long as attachments were included in the Document Set to Analyze, it will examine the extracted text of the attachments and detect changes in the text. Duplicate emails with attachments that have the same names but different text are not identified as Email Duplicate Spares. Blank attachments are considered unique by the Analytics engine. A duplicate email with a blank attachment is considered inclusive.Note: It is very important that the attachments are included in the Email Threading Structured Analytics Set. If only the parent emails are threaded, then it will not be able to pick up these differences.
- Signatures—signature text must match exactly. If the same signature is found across multiple emails, only one instance of the signature will be considered inclusive, even if other emails have unique combinations of body text and signatures. For more information, see How signature blocks affect inclusivity.
Properties ignored during Email Duplicate Spare analysis
The following properties are not considered during the Email Duplicate Spare analysis:
- Email To—while Email To is not considered during the Email Duplicate Spare analysis, the Email To must be the same alias for otherwise duplicate emails for them to have the same Email Threading ID. If the Email To alias differs between two emails, the emails will not receive the same Email Threading ID. Emails must have the same Email Threading ID in order to be email duplicate spares.
- Email CC/Email BCC—these two fields are not considered for this identification.
- Microsoft Conversation index
- Message ID
- In Reply To
- Message References
- White space—white space in the email subject or email body.
- Email Action—email action is not considered. The indicators of the email action, such as RE, FW, and FWD, are trimmed from the email subject line for this identification.
Email duplicate spare information storage
Duplicate spare information saves to the following fields after email threading completes:
- <Structured Analytics Set prefix>::Email Duplicate ID and the field selected for the Destination Email Duplicate ID field on the structured analytics set layout
- <Structured Analytics Set prefix>::Email Duplicate Spare
See Email threading fields for more information on the Email Duplicate Spare and Email Duplicate ID fields.
How email threading handles unsupported header languages
Email threading supports a limited set of language formats for email headers.
- When email headers themselves are in a supported language, such as English, then the Analytics engine will thread them even if the header’s contents are not in a supported language, such as a subject written in Thai after the header Subject:.
- When the headers themselves are not in a supported language, the Analytics engine will not be able to parse them out for email threading. This commonly happens when a speaker of an unsupported language hits Reply in his or her email client and the email client then includes the headers in the embedded message.
Processing engines typically insert English-language headers on top of extracted email body text when they process container files such as .pst files, .ost files, or .nsf files. These headers, such as To, From, Subject, and others, take their contents from specific fields in the container file. The container file’s email body text does not, strictly speaking, contain the headers. For this reason, we always recommend that you keep English selected in the list of email header languages.
When the Analytics engine parses emails, it looks for cues based on supported header formats and languages to determine when it is or is not in an email header. In particular, it is looking for words like To, From, CC, Subject in the case of traditional English headers, or An, Von, Cc, Betreff in the case of standard German headers. It also looks for other header styles such as on <date>, <author> wrote: for single-line replies (English) or 在 <date>, <author> 写道: (Chinese). There are many other variations and languages other than the ones shown here. For more information, see Supported email header formats.
Email threading will be affected as follows by unsupported email header formats or headers in unsupported languages:
- Groups of emails that should be in a single thread will split into multiple threads. This is due to not matching up the unsupported-header-surrounded segment with its supported-header-surrounded version. Either when the email itself is in the collection, or when the email was replied to by both a supported and a non-supported language email client.
- There will be fewer segments than desired in the email thread group of a document that contains unsupported language email headers.
- If emails contain mixed languages in the header fields, such as a mix of English and an unsupported language, your Indentation field level may be lower than expected because Analytics did not identify all of the email segments.
Optional email threading fields
Some email providers use their own fields to identify email threads. If your email set includes these fields, you can use them to improve your email threading results.
Email threading and the Gmail metadata fields
If you map the Email Message ID, In Reply To, and Message References fields on the Analytics profile, emails imported from Gmail can be threaded according to Google's native threading system. This creates more accurate threading results and fewer false inclusives. These results are then threaded normally with any non-Gmail messages included in the document set.
The Gmail metadata fields are located in the Message ID Email Metadata section of the Analytics profile. For more information, see Message ID Email Metadata.
Note: The Message ID Email Metadata fields cannot be mapped on the same Analytics profile as the Conversation ID field. If you want to thread emails using both, create two separate sets of Analytics profiles and structured analytics sets.
Email threading and the Microsoft Conversation ID field
If you map the Microsoft conversation index number to the Conversation ID field, email threading uses it to thread emails according to Microsoft's threading system. For example, if you replied to an email thread, but deleted everything below your signature and changed recipients, email threading could still group those emails together based on the Conversation ID. If that field were not present, email threading would not group those emails together.
If the Conversation ID field is present for an email and mapped on the profile, it is used to group the email together with other emails first. The text is not examined to validate the Conversation ID data. If a match is found based upon Conversation ID, no further analysis is done on the email for grouping purposes. If no match is found, the system analyzes all other data to thread the email. If some emails have this field and others do not, such as emails from non-Microsoft email clients, they may still be grouped together in the same email thread group. Email threading only uses the Conversation ID to group emails; it does not use them to break threads apart.
Note: We only recommend using the Conversation ID field if the email headers are widely corrupt or in unsupported formats. Email threading uses the Conversation ID to group together emails with similar Conversation IDs, even when their extracted text differs. Inaccurate Conversation ID data will harm the quality of email threading results.
Using an exclusion list for email threading
If your document set includes emails that come from system notifications or a shared email address, email threading sometimes considers all emails from this address to be from the same participant. To prevent this, you can add shared email addresses to an exclusion list.
When you apply an exclusion list to a structured analytics set, email threading creates unique participants for any email addresses that match the entries in the exclusion list. For example, if you apply an exclusion list that includes notifications@acme.com
:
- Senders such as
Joe Smith <notifications@acme.com>
andKate Doe <notifications@acme.com>
will each be treated as a unique participant. notifications@acme.com
participants will never be merged with any others.Joe Smith <notifications@acme.com>
will not be merged withJoe Smith <joe.smith@acme.com>
. This can sometimes artificially increase the number of participants, but it also lowers the chances of over-merging thread groups.
Guidelines for writing exclusion lists
When creating a structured analytics set, you can set up an exclusion list by enabling the Apply Exclusion List option. By default, the list includes several common keywords for notification and no-reply email addresses.
When modifying the exclusion list:
- Type each entry on a separate line.
- Entries can be any keyword. They do not need to be a whole email address.
- White space and special characters matter. Writing "John Doe" with a space will not exclude "johndoe" without a space.
- Entries are not case sensitive. "no-reply" and "NO-REPLY" are treated the same.
- Any participant that contains the text of an exclusion list entry will be excluded. For example, if "reply" is in the exclusion list,
reply@[domain]
,no.reply@[domain]
,Reply_To@[domain]
, and all other variants containing the word "reply" will be excluded. If “John” is in the exclusion list,Johnny Doe <email>
,scott john <email>
,[person] <johnson@[domain]>
, and all other variants containing "John" will be excluded.
Entry |
Participants that would be excluded |
Participants that would not be excluded |
---|---|---|
reply |
[person] <reply@[domain]> do_not_reply@[domain] no.reply@[domain] reply_to@[domain] Replyable@[domain] |
Any participant not containing the text "reply" |
notifications |
notifications@[domain] more-notifications@[domain] notifications from@[domain] stopnotifications@[domain] |
Any participant not containing the text "notifications" |
@jira.com |
[person]@jira.com
|
jiraffe@email.com [person]@newjira.com [person]@jira-new.com [person]@jira.net |
John |
john.Doe <email> john_smith <email> Johnny <email> ScottJohn <email> [person] <johnson@[domain]> |
Any participant not containing the text "John" |
John Doe |
Any participant containing the text "John Doe" with a space |
John <email> john.Doe <email> johnDoe <email> [person] john-Doe@[domain] John Smith <email> |
www |
www.[domain].com shawwwilliams@[domain] |
login.[domain].com http://[domain].com shaw.w.williams@[domain] |
Indentation levels and the email threading ID
In most cases, the email threading ID consists of one block per email segment, see Example 1. Thus, F00000abc-0000-0000+ would be a three-segment email. However, there are cases where the number of segments in the email does not match the total blocks in the email threading ID. When there are fewer blocks in the threading ID than segments in the email, this indicates that the top segment matches, subject, segment body, normalized author and date, with a lower segment. When there are more blocks in the email threading ID than segments in the email, this indicates there is segment corruption or changes, see Example 2.
Example 1
The standard case is that we have three documents, Document1, Document2, Document3. The first document has two segments, the result of someone replying to an email from a colleague. The second document has three segments, a reply to Document1. The third document is exactly like the second.
We call the segments in the documents A the original email, B the reply, and C the subsequent reply. The table below describes both the email threading ID, indentation, inclusivity, and whether or not the document is classified as a duplicate spare.
Control number |
Document1 |
Document2 |
Document3 |
---|---|---|---|
Document layout (segments and arrangement) |
Segment B
|
Segment C
|
Segment C
|
Email threading ID |
F00000abc-0000+ |
F00000abc-0000+0000+ |
F00000abc-0000+0000+ |
Indentation level (segments) |
2 |
3 |
3 |
Inclusive email |
No |
Yes |
Yes |
Duplicate Spare |
No |
No |
Yes |
As you can see, the email threading ID of Document1 is the first part of the ID of Document2 and Document3, just as the segments of Document1 make up the bottom part of documents 2 and 3. In other words, F00000abc-"corresponds directly to A, the first 0000+ to B, and the second 0000+ to C.
Example 2
Now, suppose there is a corruption of segment A due to a server-applied confidentiality footer. In this case, we might have A at the bottom of Document1, A at the bottom of Document2, but X at the bottom of Document3, assuming Document2 was collected from the sending party and Document3 from the receiving party, who sees the footer added by the sending party's server. Because B is a match, Analytics can successfully thread the documents. However, it cannot assert that the bottom segments are the same.
Control number |
Document1 |
Document2 |
Document3 |
---|---|---|---|
Document layout (segments and arrangement) |
Segment
B
|
Segment
C
|
Segment
C
|
Email threading ID |
F00000abc-0000-0000+ |
F00000abc-0000-0000+0000+ |
F00000abc-0000-0000+0000+ |
Indentation level (segments) |
2 |
3 |
3 |
Inclusive Email |
No |
Yes |
Yes |
Duplicate Spare |
No |
No |
No |
As you can see, there is an additional 0000- that was added after the F00000abc-. This phantom node represents the fact that there are two different segments that occurred in the root segment's position, A and X. You can think of A being associated with F00000abc- again, and X with 0000-" But, since each ID must begin with the thread group number, we have to list both A's and X's nodes in all documents. If there were a third bottom segment, for example, if Document2 had Y at the bottom rather than A, then all three email threading IDs would have an additional phantom 0000-. So Document1 in that case would have an ID of F00000abc-0000-0000-0000+.
General email threading considerations
Consider the following before running a structured analytics email threading operation:
- If email headers contain typos or extra characters as a result of incorrect OCR, the operation does not recognize the text as email header fields and the files are not recognized as email files.
- If you have produced documents with Bates stamps in the text, this results in extra inclusive flags. As such, email duplicate spares are separated because they have differing text. Filter out the Bates stamps using a regular expression filter linked under Optional Settings when you set up the Structured Analytics Set. See Repeated content filters.
- If some emails in a thread contain extra text in the bottom segment, most commonly found from confidentiality footers being applied by a server, this results in extra inclusive flags.
- When running email threading, any loose electronic documents, such as a non-email that is not an attachment, that are analyzed by the Analytics engine will receive an inclusive email value of No.