Email threading

Relativity Analytics Email Threading greatly reduces the time and complexity of reviewing emails by gathering all forwards, replies, and reply-all messages together. Email threading identifies email relationships, and then extracts and normalizes email metadata. Email relationships identified by email threading include:

  • Email threads
  • People involved in an email conversation
  • Email attachments (if the Parent ID is provided along with the attachment item)
  • Duplicate emails

An email thread is a single email conversation that starts with an original email, (the beginning of the conversation), and includes all of the subsequent replies and forwards pertaining to that original email. The analytics engine uses a combination of email headers and email bodies to determine if emails belong to the same thread. Analytics allows for data inconsistencies that can occur, such as timestamp differences generated by different servers. The Analytics engine then determines which emails are inclusive(meaning that it contains unique content and should be reviewed). See Inclusive emails for additional information regarding inclusive emails.

This process includes the following steps at a high level:

  1. Segment emails into the component emails, such as when a reply, reply-all, or forward occurred.
  2. Examine and normalize the header data, such as senders, recipients, and dates. This happens with both the component emails and the parent email, whose headers are usually passed explicitly as metadata.
  3. Recognize when emails belong to the same conversation, (referred to as the Email Thread Group), using the body segments along with headers, and determine where in the conversation the emails occur. Email Thread Groups are created using the body segments, Email From, Email To, Email Date, and Email Subject headers.
  4. Determine inclusiveness within conversations by analyzing the text, the sent time, and the sender of each email.

This page contains the following sections:

See these related pages:

Email threading fields

After completing an email threading operation, Analytics automatically creates and populates the following fields for each document included in the operation:

  • <Structured Analytics Set prefix>::Email Author Date ID - this ID value contains a string of hashes that corresponds to each email segment of the current document's email conversation. Each hash is the generated MD5 hash value of the email segment’s normalized author field and the date field. This field is used for creating the visual depiction of email threading in the email thread visualization (ETV) tool. See Email thread visualization.
  • <Structured Analytics Set prefix>::Email Threading ID - ID value that indicates which conversation an email belongs to and where in that conversation it occurred. The first nine characters indicate the Email Thread Group, (all emails with a common root), and the subsequent five-character components show each level in the thread leading up to an email's top segment. For example, an email sent from A to B, and then replied to by B, has the initial nine characters for the message from A to B, plus one five-character set for the reply. This field groups all emails into their email thread groups and identifies each email document’s position within a thread group by Sent Date beginning with the oldest. The Email Threading ID helps you analyze your email threading results by obtaining email relationship information.
  • The Email Threading ID is formatted as follows:

    • The first nine characters define the Email Thread Group and root node of the email thread group tree. This is always a letter followed by eight hexadecimal digits.
    • The Email Thread Group ID is followed by a + (plus) sign which indicates that an actual root email document exists in the data set provided, or a - (minus) sign which indicates that the document is not in the data set. For example, the Email Threading ID value, F00000abc-, indicates that the Analytics engine found a root email for the email thread group, but the root email document was not actually included in the data set used for email threading.
    • The + or - symbol is followed by zero or more node identifiers comprised of four hexadecimal digits. Each node identifier is also followed by a + or - sign indicating whether or not a corresponding email item exists within the data set. Each subsequent set of four characters followed by a + or - sign defines another node in the tree and whether or not any email items exist for that node. Each node, including the root node, represents one email sent to potentially multiple people, and it will contain all copies of this email.

  • <Structured Analytics Set prefix>::Email Thread Group - initial nine characters of the Email Threading ID. This value is the same for all members of an Email Thread Group which are all of the replies, forwards, and attachments following an initial sent email. This field is not relational.
  • <Structured Analytics Set prefix>::Email Action - the action the sender of the email performed to generate this email. Email threading records one of the following action values:
    • SEND
    • REPLY
    • REPLY-ALL
    • FORWARD
    • DRAFT
  • <Structured Analytics Set prefix>::Indentation - a whole number field indicating when the document was created within the thread. This is derived from the # of Segments Analytics engine metadata field.
  • Note: In cases where email segments are modified (e.g., a confidentiality footer is inserted), the Email Threading ID may have a greater number of blocks than segments in the email. In such cases, the Indentation will reflect the actual number of found segments in the document. See Email threading and the Indentation field.

  • <Structured Analytics Set prefix>::Email Threading Display - visualization of the properties of a document including the following:
    • Sender
    • Email subject
    • File type icon (accounting for email action)
    • Indentation level number bubble
  • The Email Threading Display field also indicates which emails are both inclusive and non-duplicate by displaying the indentation level number bubble in black.

  • <Structured Analytics Set prefix>::Inclusive Email - inclusive email indicator. An inclusive email message contains the most complete prior message content and lets you bypass redundant content. Reviewing documents specified as inclusive and ignoring duplicates covers all authored content in an email thread group.
  • <Structured Analytics Set prefix>::Inclusive Reason - lists the reason(s) a message is marked inclusive. Each reason indicates the type of review required for the inclusive email:
    • Message - the message contains content in the email body requiring review.
    • Attachment - the message contains an attachment that requires review. The rest of the content doesn't necessarily require review.
    • Inferred match - email threading identified the message as part of a thread where the header information matches with the rest of the conversation, but the body is different. This reason only occurs on the root email of a thread and is often caused by mail servers inserting confidentiality footers in a reply. Review to verify the message actually belongs to the conversation.
    • Unanalyzed Attachment - the message contains attachments that could not be analyzed by the Analytics engine. In most cases this is due to the attachment being larger than 30 MB.
  • <Structured Analytics Set prefix>::Email Duplicate Spare - this Yes/No field indicates whether an email is a duplicate. Duplicate spare refers to a message that is an exact duplicate of another message such as a newsletter sent to a distribution list. A No value in this field indicates that an email is either not a duplicate of any other emails in the set, or that it is the master email in the Email Duplicate group. The master email is the email with the lowest Document Identifier/Control Number (identifier field in Relativity). A Yes value indicates the document is in an Email Duplicate group, but it is not the master document.
  • See Email duplicate spare messages for more information.

    Note: On incremental runs, the master email in the Email Duplicate Spare group will never change.

  • <Structured Analytics Set prefix>::Email Duplicate ID - this field contains a unique identifier only if the email is a duplicate of another email in the set. If the email is not a part of an email duplicate group, this field isn't set.
  • <Structured Analytics Set prefix>::Email Thread Hash - a hash generated by Relativity to facilitate Email Thread Visualization.

Email duplicate spare messages

Duplicate spare email messages contain the exact same content as another message, but they are not necessarily exact duplicates, i.e., MD5Hash, SHA-256.

Properties considered during Email Duplicate Spare analysis

The identification of duplicate spare email messages happens during Email threading, and the following properties are examined during the identification of email duplicate spares:

  • Email Thread Group – the emails must have the same Email Thread Group in order to be email duplicate spares. Please note that a differing “Email To” alias would cause two otherwise duplicative emails to end up with different Email Thread Groups.
  • Email From - the email authors must be the same, i.e., the same email alias.
  • The “aliases” for an author are other textual representations of the author that are equated as the same entity. For example, John Doe sends an email using the email address john.doe@example.com. He may have another email address, such as john.doe@gmail.com. Based on these email addresses, the Analytics engine finds they are related and can make an alias list that would include "John Doe" and "Doe, John" and "john.doe@gmail.com" and "john.doe@example.com." Anytime email threading encounters any one of these four entities (email addresses) in the Sender field of an email segment, it considers them one and the same person (entity).

  • Email Subject - the trimmed email subject must match exactly. The analytics engine trims and ignores any prefixes such as "RE:", "FW:" or "FWD:" or the equivalent in the non-English languages Analytics supports.
  • Email Body - the email body must match exactly, although white space is ignored.
  • Sent Date - the identification considers the sent date, but permits for a level of variance. In general, the allowed time window is 30 hours for a valid match of email messages with no minute matching involved.
  • The exception to the general case are the specific cases where the authors don't match. For example, if it is impossible to match SMTP addresses and LDAP addresses as the author values, but the subject and text are exact matches, there is a more stringent time frame. In such cases, the time must be within 24 hours, and the minute must be within one minute of each other. For example, 15:35 would match with 18:36, but 15:35 would not match with 18:37.

  • Attachments - attachment text must match exactly, although white space is ignored. As long as attachments were included in the Document Set to Analyze, it will examine the Extracted Text of the attachments and detect changes in the text. Duplicate emails with attachments that have the same names but different text aren't identified as Email Duplicate Spares. Blank attachments are considered unique by the Analytics engine. A duplicate email with a blank attachment is considered inclusive.
  • Note: It's very important that the attachments are included in the Email Threading Structured Analytics Set. If only the parent emails are threaded, then it will not be able to pick up these differences.

Properties ignored during Email Duplicate Spare analysis

The following properties aren't considered during the Email Duplicate Spare analysis:

  • Email To – While Email To is not considered during the Email Duplicate Spare analysis, the Email To must be the same alias for otherwise duplicative emails for them to have the same Email Threading ID. If the Email To alias differs between two emails, the emails will not receive the same Email Threading ID. Emails must have the same Email Threading ID in order to be email duplicate spares.
  • Email CC / Email BCC - these two fields are not considered for this identification.
  • Microsoft Conversation index
  • White space - white space in the email subject, email body, or attachment text.
  • Email Action - email action isn't considered, and the indicators of the email action (RE, FW, FWD) are trimmed from the email subject line for this identification.

Email duplicate spare information storage

Duplicate spare information saves to the following fields in Relativity after Email threading completes:

  • <Structured Analytics Set prefix>::Email Duplicate ID and the field selected for the Destination Email Duplicate ID field on the structured analytics set layout
  • <Structured Analytics Set prefix>::Email Duplicate Spare

See Email threading fields for more information on the Email Duplicate Spare and Email Duplicate ID fields.

Email threading behavior considerations

General considerations

Consider the following before running a structured analytics email threading operation:

  • If email headers contain typos or extra characters as a result of incorrect OCR, the operation doesn't recognize the text as email header fields and the files aren't recognized as email files.
  • If you have produced documents with Bates stamps in the text, this results in extra inclusive flags. As such, email duplicate spares are separated because they have differing text. Filter out the Bates stamps using a regular expression filter linked under Advanced Settings when you set up the Structured Analytics Set. See Repeated content filters.
  • If some emails in a thread contain extra text in the bottom segment (most commonly found from confidentiality footers being applied by a server), this results in extra inclusive flags.
  • When running email threading, any loose e-docs (ex. a non-email that is not an attachment) that are analyzed by the Analytics Engine will receive an inclusive email value of No.

Considerations for unsupported header languages

Email threading supports a limited set of language formats for email headers.

  • When email headers themselves are in a supported language, such as English, then the Analytics engine will thread them even if the header’s contents are not in a supported language (e.g., a subject written in Thai after the header “Subject:”).
  • When the headers themselves are not in a supported language (this commonly happens when a speaker of an unsupported language hits "reply" in his or her email client and the email client then includes the headers in the embedded message), the Analytics engine won't be able to parse them out for email threading.

Processing engines typically insert English-language headers on top of extracted email body text when they process container files such as PST, OST, or NSF. These headers, such as "To," "From," "Subject," etc., take their contents from specific fields in the container file. The container file’s email body text does not, strictly speaking, contain the headers. For this reason, we always recommend that you keep English selected in the list of email header languages.

When the Analytics engine parses emails, it looks for cues based on supported header formats and languages to determine when it is or isn't in an email header. In particular, it's looking for words like "To, From, CC, Subject" in the case of traditional English headers, or “An, Von, Cc, Betreff” in the case of standard German headers. It also looks for other header styles such as "on <date>, <author> wrote:" for single-line replies (English) or “在 <date>, <author> 写道:” (Chinese). There are many other variations and languages other than the ones shown here. For more information, see Supported email header formats

Email threading will be affected as follows by unsupported email header formats and/or headers in unsupported languages:

  • Groups of emails which should be in a single thread will split into multiple threads. This is due to not matching up the unsupported-header-surrounded segment with its supported-header-surrounded version (either when the email itself is in the collection, or when the email was replied to by both a supported and a non-supported language email client).
  • There will be fewer segments than desired in the Email Thread Group of a document which contains unsupported language email headers.
  • If emails contain mixed languages in header fields, (for examples some field names are in English and some are in an unsupported language) your Indentation field is lower than expected because Analytics doesn't identify all of the email segments.

Email threading and the Conversation ID field

When mapped on the Analytics Profile, email threading uses the Microsoft Conversation Index to bring emails together into threads. For example, if you replied to an email thread, but deleted everything below your signature, and changed recipients, email threading could group all emails together based on the Microsoft Conversation Index. If that field weren't present, email threading would not group those emails together.

If the Conversation ID field is present for an email and mapped on the profile, it's used to group the email together with other emails first. The text is not examined to validate the Conversation ID data. If a match is found based upon Conversation ID, no further analysis is done on the email for grouping purposes. If no match is found, the system analyzes all other data to thread the email. If some emails have this field and others don't (e.g., non-Microsoft email clients) they still may be grouped together in the same email thread group when determined necessary.

Email threading doesn't use the Microsoft Conversation Index to break threads apart. Please note that inaccurate Conversation ID data will harm the quality email threading results. Email threading uses the Conversation ID to group together emails with similar Conversation ID’s, even when their Extracted Text differs. The Conversation ID is not typically recommended, as email threading is highly accurate without the use of Conversation ID. Only when the email headers are widely corrupt or in unsupported formats do we recommend the use of this field.

Email threading and the Indentation field

The Analytics server is queried for the true number of found segments in the email. This indentation level is both in the document field and in the bubble/square that is present in the email threading visualization field.

In most cases, the Email threading ID consists of one "block" per email segment (see Sample 1). Thus, "F00000abc-0000-0000+" would be a three-segment email. However, there are cases where the number of segments in the email does not match the total blocks in the email threading ID. When there are fewer blocks in the threading ID than segments in the email, this indicates that the top segment matches (subject, segment body, normalized author and date) with a lower segment. When there are more blocks in the email threading ID than segments in the email, this indicates there is segment corruption or changes (see Sample 2).

Sample 1

The standard case is that we have three documents, Document1, Document2, Document3. The first document has two segments, the result of someone replying to an email from a colleague. The second document has three segments, a reply to Document1. the third document is exactly like the second.

We call the segments in the documents "A" (the original email), "B" (the reply), and "C" (the subsequent reply). The table below describes both the Email Threading ID, Indentation, inclusivity, and whether or not the document is classified as a duplicate spare.

Control number

Document1 

Document2 

Document3 

Document layout (segments and arrangement)

Segment B
- Segment A

Segment C
- Segment B
-- Segment A

Segment C
- Segment B
-- Segment A

Email threading ID

F00000abc-0000+

F00000abc-0000+0000+

F00000abc-0000+0000+

Indentation level (segments)

2

3

3

Inclusive email

No

Yes

Yes

Duplicate Spare

No

No

Yes

As you can see, the Email threading ID of Document1 is the first part of the ID of Document2 and Document3, just as the segments of Document1 make up the bottom part of documents 2 and 3. In other words, "F00000abc-" corresponds directly to "A", the first "0000+" to B, and the second "0000+" to C.

Sample 2

Now, suppose there is a corruption of segment A due to a server-applied confidentiality footer. In this case, we might have "A" at the bottom of Document1, "A" at the bottom of Document2, but "X" at the bottom of Document3 (assuming Document2 was collected from the sending party and Document3 from the receiving party, who sees the footer added by the sending party's server). Because B is a match, Relativity Analytics can successfully thread the documents. However, it can't assert that the bottom segments are the same.

Control number 

Document1 

Document2 

Document3 

Document layout (segments and arrangement) 

Segment B 
- Segment A 

Segment C 
- Segment B
-- Segment A 

Segment C
- Segment B
-- Segment X (A + footer)

Email threading ID 

F00000abc-0000-0000+

F00000abc-0000-0000+0000+

F00000abc-0000-0000+0000+

Indentation level (segments)

Inclusive Email

No

Yes

Yes

Duplicate Spare

No

No

No

As you can see, there is an additional "0000-" that was added after the F00000abc-. This "phantom" node represents the fact that there are two different segments that occurred in the root segment's position (A and X). You can think of "A" being associated with "F00000abc-" again, and "X" with "0000-". But since each ID must begin with the thread group number, we have to list both A's and X's nodes in all documents. If there were a third bottom segment (e.g. if Document2 had "Y" at the bottom rather than A), then all three email threading IDs would have an additional "phantom" 0000-. So Document1 in that case would have an ID of F00000abc-0000-0000-0000+.