Email threading
Analytics email threading greatly reduces the time and complexity of reviewing emails by gathering all forwards, replies, and reply-all messages together. Email threading identifies email relationships, and then extracts and normalizes email metadata. Email relationships identified by email threading include:
- Email threads
- People involved in an email conversation
- Email attachments, if the Parent ID is provided along with the attachment item
- Duplicate emails
An email thread is a single email conversation that starts with an original email, the beginning of the conversation, and includes all of the subsequent replies and forwards pertaining to that original email. The analytics engine uses a combination of email headers and email bodies to determine if emails belong to the same thread. Analytics allows for data inconsistencies that can occur, such as timestamp differences generated by different servers. The Analytics engine then determines which emails are inclusive, meaning that it contains unique content and should be reviewed. See Inclusive emails for additional information regarding inclusive emails.
This process includes the following steps at a high level:
- Segment emails into the component emails, such as when a reply, reply-all, or forward occurred.
- Examine and normalize the header data, such as senders, recipients, and dates. This happens with both the component emails and the parent email, whose headers are usually passed explicitly as metadata.
- Recognize when emails belong to the same conversation, referred to as the Email thread group, using the body segments along with headers, and determine where in the conversation the emails occur. Email thread groups are created using the body segments, Email From, Email To, Email Date, and Email Subject headers.
- Determine inclusiveness within conversations by analyzing the text, the sent time, and the sender of each email.
See these related pages:
- Supported email header formats
- Inclusive emails
- Email threading results
- Structured analytics
- Running structured analytics
- Email thread visualization
Minimum threading requirements
In order for a document to be recognized as an email and threaded, it must have the Email From field and at least one of the following:
-
Sent Date
-
Email To
-
Email Subject
-
Email CC
-
Email BCC
These fields can either be in the document metadata or in the extracted text. If a document is recognized as an email but does not have a Date Sent field, it will be categorized as a draft.
For more information on email headers, see Supported email header formats.
Email threading fields
After completing an email threading operation, Analytics automatically creates and populates the following fields for each document included in the operation:
- <Structured Analytics Set prefix>::Email Author Date ID—this ID value contains a string of hashes that corresponds to each email segment of the current document's email conversation. Each hash is the generated MD5 hash value of the email segment’s normalized author field and the date field. This field is used for creating the visual depiction of email threading in the email thread visualization (ETV) tool. See Email thread visualization.
- <Structured Analytics Set prefix>::Email Threading ID—ID value that indicates which conversation an email belongs to and where in that conversation it occurred. The first nine characters indicate the Email thread group, all emails with a common root, and the subsequent five-character components show each level in the thread leading up to an email's top segment.
For example, an email sent from A to B, and then replied to by B, has the initial nine characters for the message from A to B, plus one five-character set for the reply. This field groups all emails into their email thread groups and identifies each email document’s position within a thread group by Sent Date beginning with the oldest. The Email Threading ID helps you analyze your email threading results by obtaining email relationship information. - The first nine characters define the Email Thread Group and root node of the email thread group tree. This is always a letter followed by eight hexadecimal digits.
- The Email Thread Group ID is followed by a plus sign (+) which indicates that an actual root email document exists in the data set provided, or a minus sign (-) which indicates that the document is not in the data set. For example, the Email Threading ID value, F00000abc-, indicates that the Analytics engine found evidence of a root email for the email thread group, but it did not find the root email itself as a separate document in the data set. In these cases, reviewers may be able to read the text of the missing root email by looking at the bottom segment of later emails in the thread.
- The + or - symbol is followed by zero or more node identifiers comprised of four hexadecimal digits. Each node identifier is also followed by a + or - sign indicating whether or not a corresponding email item exists within the data set. Each subsequent set of four characters followed by a + or - sign defines another node in the tree and whether or not any email items exist for that node. Each node, including the root node, represents one email sent to potentially multiple people, and it will contain all copies of this email.
- <Structured Analytics Set prefix>::Email Thread Group—initial nine characters of the Email Threading ID. This value is the same for all members of an Email thread group which are all of the replies, forwards, and attachments following an initial sent email. This field is not relational.
- <Structured Analytics Set prefix>::Email Action—the action the sender of the email performed to generate this email. Email threading records one of the following action values:
- SEND
- REPLY
- REPLY-ALL
- FORWARD
- DRAFT
- <Structured Analytics Set prefix>::Indentation—a whole number field indicating when the document was created within the thread. This is derived from the number of Segments Analytics engine metadata field.
The Email Threading ID is formatted as follows:
Note: In cases where email segments are modified, such as when a confidentiality footer is inserted, the Email Threading ID may have a greater number of blocks than segments in the email. In such cases, the indentation will reflect the actual number of found segments in the document. See Email threading and the Indentation field.
- <Structured Analytics Set prefix>::Email Threading Display—visualization of the properties of a document including the following:
- Sender
- Email subject
- File type icon, accounting for email action
- Indentation level number bubble
- <Structured Analytics Set prefix>::Inclusive Email—inclusive email indicator. An inclusive email message contains the most complete prior message content and lets you bypass redundant content. Reviewing documents specified as inclusive and ignoring duplicates covers all authored content in an email thread group.
- <Structured Analytics Set prefix>::Inclusive Reason—lists the reasons a message is marked inclusive. Each reason indicates the type of review required for the inclusive email:
- Message—the message contains content in the email body requiring review.
- Attachment—the message contains an attachment that requires review. The rest of the content does not necessarily require review.
- Inferred match—email threading identified the message as part of a thread where the header information matches with the rest of the conversation, but the body is different. This reason only occurs on the root email of a thread and is often caused by mail servers inserting confidentiality footers in a reply. Review to verify the message actually belongs to the conversation.
- Unanalyzed Attachment—the message contains attachments that could not be analyzed by the Analytics engine. In most cases this is due to the attachment being larger than 30 MB.
- <Structured Analytics Set prefix>::Email Duplicate Spare—this Yes/No field indicates whether an email is a duplicate. Duplicate spare refers to a message that is an exact duplicate of another message such as a newsletter sent to a distribution list. A No value in this field indicates that an email is either not a duplicate of any other emails in the set, or that it is the primary email in the Email Duplicate group. The primary email is usually the email with the lowest Document Identifier/Control Number, an identifier field in Relativity. A Yes value indicates the document is in an Email Duplicate group, but it is not the primary document.
The Email Threading Display field also indicates which emails are both inclusive and non-duplicate by displaying the indentation level number bubble in black.
See Email duplicate spare messages for more information.
Note: On incremental runs, the primary email in the Email Duplicate Spare group will never change.
- <Structured Analytics Set prefix>::Email Duplicate ID—this field contains a unique identifier only if the email is a duplicate of another email in the set. If the email is not a part of an email duplicate group, this field is not set.
- <Structured Analytics Set prefix>::Email Thread Hash—a hash generated by Relativity to facilitate email thread visualization.
Email duplicate spare messages
Duplicate spare email messages contain the exact same content as another message, but they are not necessarily exact duplicates, such as MD5Hash, SHA256.
Properties considered during Email Duplicate Spare analysis
The identification of duplicate spare email messages happens during email threading, and the following properties are examined during the identification of email duplicate spares:
- Email Thread Group—the emails must have the same email thread group in order to be email duplicate spares.
Note: A differing “Email To” alias would cause two otherwise duplicate emails to end up with different email thread groups.
- Email From—the email authors must be the same, that is, the same email alias.
- Email Subject—the trimmed email subject must match exactly. The analytics engine trims and ignores any prefixes such as "RE:", "FW:" or "FWD:" or the equivalent in the non-English languages Analytics supports.
- Email Body—the email body must match exactly, although white space is ignored.
- Sent Date—the identification considers the sent date, but permits for a level of variance. In general, the allowed time window is 30 hours for a valid match of email messages with no minute matching involved.
- Attachments—attachment text must match exactly. As long as attachments were included in the Document Set to Analyze, it will examine the extracted text of the attachments and detect changes in the text. Duplicate emails with attachments that have the same names but different text are not identified as email duplicate spares. Blank attachments are considered unique by the Analytics engine. A duplicate email with a blank attachment is considered inclusive.
The “aliases” for an author are other textual representations of the author that are equated as the same entity. For example, John Doe sends an email using the email address john.doe@example.com. He may have another email address, such as john.doe@gmail.com. Based on these email addresses, the Analytics engine finds they are related and can make an alias list that would include "John Doe" and "Doe, John" and "john.doe@gmail.com" and "john.doe@example.com." Anytime email threading encounters any one of these four entities, email addresses, in the Sender field of an email segment, it considers them one and the same person/entity.
The exception to the general case are the specific cases where the authors do not match. For example, if it is impossible to match SMTP addresses and LDAP addresses as the author values, but the subject and text are exact matches, there is a more stringent time frame. In such cases, the time must be within 24 hours, and the minute must be within one minute of each other. For example, 15:35 would match with 18:36, but 15:35 would not match with 18:37.
Note: It is very important that the attachments are included in the Email Threading Structured Analytics Set. If only the parent emails are threaded, then it will not be able to pick up these differences.
Properties ignored during Email Duplicate Spare analysis
The following properties are not considered during the Email Duplicate Spare analysis:
- Email To—while Email To is not considered during the Email Duplicate Spare analysis, the Email To must be the same alias for otherwise duplicative emails for them to have the same Email Threading ID. If the Email To alias differs between two emails, the emails will not receive the same Email Threading ID. Emails must have the same Email Threading ID in order to be email duplicate spares.
- Email CC / Email BCC—these two fields are not considered for this identification.
- Microsoft Conversation index
- White space—white space in the email subject or email body.
- Email Action—email action is not considered, and the indicators of the email action like RE, FW, FWD are trimmed from the email subject line for this identification.
Email duplicate spare information storage
Duplicate spare information saves to the following fields in Relativity after email threading completes:
- <Structured Analytics Set prefix>::Email Duplicate ID and the field selected for the Destination Email Duplicate ID field on the structured analytics set layout
- <Structured Analytics Set prefix>::Email Duplicate Spare
See Email threading fields for more information on the Email Duplicate Spare and Email Duplicate ID fields.
Email threading behavior considerations
General considerations
Consider the following before running a structured analytics email threading operation:
- If email headers contain typos or extra characters as a result of incorrect OCR, the operation does not recognize the text as email header fields and the files are not recognized as email files.
- If you have produced documents with Bates stamps in the text, this results in extra inclusive flags. As such, email duplicate spares are separated because they have differing text. Filter out the Bates stamps using a regular expression filter linked under Advanced Settings when you set up the Structured Analytics Set. See Repeated content filters.
- If some emails in a thread contain extra text in the bottom segment, most commonly found from confidentiality footers being applied by a server, this results in extra inclusive flags.
- When running email threading, any loose e-docs, such as a non-email that is not an attachment, that are analyzed by the Analytics Engine will receive an inclusive email value of No.
Considerations for unsupported header languages
Email threading supports a limited set of language formats for email headers.
- When email headers themselves are in a supported language, such as English, then the Analytics engine will thread them even if the header’s contents are not in a supported language. For example, a subject written in Thai after the header “Subject:”.
- When the headers themselves are not in a supported language, this commonly happens when a speaker of an unsupported language hits "reply" in his or her email client and the email client then includes the headers in the embedded message, the Analytics engine will not be able to parse them out for email threading.
Processing engines typically insert English-language headers on top of extracted email body text when they process container files such as .pst, .ost, or .nsf. These headers, such as "To," "From," "Subject," etc., take their contents from specific fields in the container file. The container file’s email body text does not, strictly speaking, contain the headers. For this reason, we always recommend that you keep English selected in the list of email header languages.
When the Analytics engine parses emails, it looks for cues based on supported header formats and languages to determine when it is or is not in an email header. In particular, it is looking for words like "To, From, CC, Subject" in the case of traditional English headers, or “An, Von, Cc, Betreff” in the case of standard German headers. It also looks for other header styles such as "on <date>, <author> wrote:" for single-line replies (English) or “在 <date>, <author> 写道:” (Chinese). There are many other variations and languages other than the ones shown here. For more information, see Supported email header formats
Email threading will be affected as follows by unsupported email header formats and/or headers in unsupported languages:
- Groups of emails which should be in a single thread will split into multiple threads. This is due to not matching up the unsupported-header-surrounded segment with its supported-header-surrounded version, either when the email itself is in the collection, or when the email was replied to by both a supported and a non-supported language email client.
- There will be fewer segments than desired in the email thread group of a document which contains unsupported language email headers.
- If emails contain mixed languages in header fields, for examples some field names are in English and some are in an unsupported language, your Indentation field is lower than expected because Analytics does not identify all of the email segments.
Email threading and the Conversation ID field
When mapped on the Analytics Profile, email threading uses the Microsoft Conversation Index to bring emails together into threads. For example, if you replied to an email thread, but deleted everything below your signature, and changed recipients, email threading could group all emails together based on the Microsoft Conversation Index. If that field weren't present, email threading would not group those emails together.
If the Conversation ID field is present for an email and mapped on the profile, it's used to group the email together with other emails first. The text is not examined to validate the Conversation ID data. If a match is found based upon Conversation ID, no further analysis is done on the email for grouping purposes. If no match is found, the system analyzes all other data to thread the email. If some emails have this field and others do not, such as non-Microsoft email clients, they still may be grouped together in the same email thread group when determined necessary.
Email threading does not use the Microsoft Conversation Index to break threads apart. Please note that inaccurate Conversation ID data will harm the quality email threading results. Email threading uses the Conversation ID to group together emails with similar Conversation IDs, even when their Extracted Text differs. The Conversation ID is not typically recommended, as email threading is highly accurate without the use of Conversation ID. Only when the email headers are widely corrupt or in unsupported formats do we recommend the use of this field.
Email threading and the Indentation field
The Analytics server is queried for the true number of found segments in the email. This indentation level is both in the document field and in the bubble/square that is present in the email threading visualization field.
In most cases, the Email threading ID consists of one "block" per email segment. See Sample 1. Thus, "F00000abc-0000-0000+" would be a three-segment email. However, there are cases where the number of segments in the email does not match the total blocks in the email threading ID. When there are fewer blocks in the threading ID than segments in the email, this indicates that the top segment matches, subject, segment body, normalized author and date, with a lower segment. When there are more blocks in the email threading ID than segments in the email, this indicates there is segment corruption or changes. See Sample 2.
Sample 1
The standard case is that we have three documents, Document1, Document2, Document3. The first document has two segments, the result of someone replying to an email from a colleague. The second document has three segments, a reply to Document1. the third document is exactly like the second.
We call the segments in the documents "A," the original email, "B," the reply, and "C," the subsequent reply. The table below describes both the Email Threading ID, Indentation, inclusiveness, and whether or not the document is classified as a duplicate spare.
Control number |
Document1 |
Document2 |
Document3 |
---|---|---|---|
Document layout (segments and arrangement) |
Segment B
|
Segment C
|
Segment C
|
Email threading ID |
F00000abc-0000+ |
F00000abc-0000+0000+ |
F00000abc-0000+0000+ |
Indentation level (segments) |
2 |
3 |
3 |
Inclusive email |
No |
Yes |
Yes |
Duplicate Spare |
No |
No |
Yes |
As you can see, the Email threading ID of Document1 is the first part of the ID of Document2 and Document3, just as the segments of Document1 make up the bottom part of documents 2 and 3. In other words, "F00000abc-" corresponds directly to "A", the first "0000+" to B, and the second "0000+" to C.
Sample 2
Now, suppose there is a corruption of segment A due to a server-applied confidentiality footer. In this case, we might have "A" at the bottom of Document1, "A" at the bottom of Document2, but "X" at the bottom of Document3, assuming Document2 was collected from the sending party and Document3 from the receiving party, who sees the footer added by the sending party's server. Because B is a match, Analytics can successfully thread the documents. However, it cannot assert that the bottom segments are the same.
Control number |
Document1 |
Document2 |
Document3 |
---|---|---|---|
Document layout (segments and arrangement) |
Segment
B
|
Segment
C
|
Segment
C
|
Email threading ID |
F00000abc-0000-0000+ |
F00000abc-0000-0000+0000+ |
F00000abc-0000-0000+0000+ |
Indentation level (segments) |
2 |
3 |
3 |
Inclusive Email |
No |
Yes |
Yes |
Duplicate Spare |
No |
No |
No |
As you can see, there is an additional "0000-" that was added after the F00000abc-. This "phantom" node represents the fact that there are two different segments that occurred in the root segment's position, A and X. You can think of "A" being associated with "F00000abc-" again, and "X" with "0000-". But since each ID must begin with the thread group number, we have to list both As and Xs nodes in all documents. If there were a third bottom segment, for example if Document2 had "Y" at the bottom rather than A, then all three email threading IDs would have an additional "phantom" 0000-. So Document1 in that case would have an ID of F00000abc-0000-0000-0000+.