Using regular expressions with structured analytics

When running a structured analytics set for email threading or textual near duplication analysis, it is important to filter out extraneous text, such as Bates numbers, from each document. Extraneous text can cause emails that should be identified as duplicates to be identified as unique instead. Likewise, when you build a conceptual analytics index, exclude extraneous text such as confidentiality footers so that it is not used to train the system.

By using regular expressions (RegEx), you can filter out extraneous text from structured analytics sets and conceptual analytics indexes. Analytics uses Java's implementation of RegEx.

Regular expressions provide a powerful method for removing extraneous text from your data set, ensuring that emails are threaded properly and textual near duplicates are identified accurately.

Textual Analysis

Extracted text is first sent to a pipeline where filters apply to the text. This process happens prior to submitting the extracted text to the Analytics engine for analysis. When you construct any RegEx to run with Analytics, it must match the text in the Analytics pipeline, not the text in the extracted text viewer. To view text in the Analytics pipeline, you need access to the user interface (UI) on your Analytics server. For more information, see How to access the Analytics server UI to validate RegEx filters.

Note: The extracted text for all documents transforms into one long string in the Analytics pipeline.

Consider the following when matching your RegEx filter to how the text appears in the Analytics pipeline:

Line breaks

  • In pattern matching, the symbols “^” and “$” match the beginning and end of the full file, not the beginning and end of a line. To indicate a line break when you construct your RegEx, use the sequence “\r\n”. Whether you need line breaks in your expression depends on what you are trying to match. Line breaks can be useful “anchors” that define where a pattern occurs in relation to the beginning or end of a line. Note that some Analytics operations, such as textual near duplicate identification, do not pay attention to line breaks, while others, such as email header parsing in name normalization or email threading, are very sensitive to them.

    [Images: extracted text with line breaks in the viewer, followed by the same text in the Analytics pipeline]
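The anchoring behavior described above can be tried out in plain Java. The following is a minimal sketch using java.util.regex, on the assumption that an Analytics filter behaves like a standard Java pattern; the class name, method name, and sample text are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LineBreakAnchors {
    // Counts how many times the pattern is found in the text.
    static int countMatches(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // Hypothetical pipeline text: three lines joined by \r\n.
        String text = "ACME000001\r\nHello team\r\nACME000002";

        // Unanchored: both Bates numbers are found.
        System.out.println(countMatches("ACME[0-9]{6}", text));
        // ^ anchors to the start of the whole string, so only the first one is found.
        System.out.println(countMatches("^ACME[0-9]{6}", text));
        // An explicit \r\n finds only the number that directly follows a line break.
        System.out.println(countMatches("\\r\\nACME[0-9]{6}", text));
    }
}
```

In Java source code each backslash is doubled inside a string literal; the filter text itself would contain a single backslash, as in \r\n.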

Spaces

Any run of more than one space in the extracted text collapses to a single space in the pipeline. If a data set contains multiple consecutive whitespace characters, you do not need to account for them when constructing your RegEx.

[Images: extracted text with extra spaces in the viewer, followed by the same text in the Analytics pipeline]
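To see why extra spaces need no special handling, the sketch below imitates the space collapsing with a simple replacement and then applies a filter written with a single space. The collapseSpaces method is only a stand-in for the pipeline behavior described above, not the actual Analytics implementation, and all names are hypothetical.

```java
import java.util.regex.Pattern;

class SpaceNormalization {
    // Stand-in for the pipeline: collapse every run of two or more spaces to one space.
    // Illustrative assumption only; the real pipeline may normalize text differently.
    static String collapseSpaces(String text) {
        return text.replaceAll(" {2,}", " ");
    }

    // Returns true if the pattern is found anywhere in the text.
    static boolean filterFinds(String regex, String text) {
        return Pattern.compile(regex).matcher(text).find();
    }

    public static void main(String[] args) {
        String extracted = "ACME    123456";           // as shown in the viewer
        String pipeline = collapseSpaces(extracted);   // as the filter would see it

        // A pattern written with one space misses the raw text but finds the collapsed text.
        System.out.println(filterFinds("ACME [0-9]{6}", extracted));
        System.out.println(filterFinds("ACME [0-9]{6}", pipeline));
    }
}
```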

Create a sample set

Before constructing any regular expressions to filter out extraneous text, run a structured analytics set on a small sample of documents that contain extraneous text. This pushes the text into the Analytics pipeline. Then, open the Analytics server UI to inspect how the extracted text for each document appears. Seeing the text as the pipeline sees it helps you construct RegEx that filters out all of the extraneous text.

Note: Viewing the extraneous text with the Analytics server UI is especially important if you need to filter out text that spans multiple lines, such as confidentiality footers.

RegEx metacharacters

Metacharacters are the building blocks of regular expressions. Characters in RegEx are understood to be either:

  • a metacharacter with a special meaning, or
  • a regular character with its literal meaning

RegEx groups

With RegEx groups you can match for groups of characters within a string. The following table provides examples of how to use groups in your RegEx. Groups are most useful when you use them in conjunction with alternation and quantifiers.

Metacharacter: (abc), (123)
Description: Character group. Matches the characters abc or 123 in that exact order.
Example:
  • pand(ora) = pandora
  • pand(123) = pand123
  • pand(oar) ≠ pandora

pand(oar) does not match pandora because it looks for the exact phrase pandoar.
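The group examples can be checked directly in Java. This is a minimal sketch, assuming standard java.util.regex behavior; the class and method names are hypothetical.

```java
import java.util.regex.Pattern;

class GroupExamples {
    // Returns true only if the whole text matches the pattern.
    static boolean matchesWhole(String regex, String text) {
        return Pattern.matches(regex, text);
    }

    public static void main(String[] args) {
        System.out.println(matchesWhole("pand(ora)", "pandora")); // group matched in order
        System.out.println(matchesWhole("pand(123)", "pand123")); // digits work the same way
        System.out.println(matchesWhole("pand(oar)", "pandora")); // wrong order, no match
        System.out.println(matchesWhole("pand(oar)", "pandoar")); // exact phrase, match
    }
}
```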

RegEx alternation and quantifiers

You can set up your RegEx for alternate matches within a single search string via the pipe (|) alternation metacharacter. RegEx uses quantifiers to indicate the scope of a search string. You can use multiple quantifiers in your search string.
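As an illustration, the sketch below combines a group, the pipe alternation metacharacter, and three common Java quantifiers: ? (zero or one), + (one or more), and {6} (exactly six). The patterns and names are hypothetical examples, not filters taken from a real workspace.

```java
import java.util.regex.Pattern;

class AlternationAndQuantifiers {
    // Either prefix, followed by exactly six digits.
    static final String EITHER_PREFIX = "(acme|enron)[0-9]{6}";
    // Prefix, an optional whitespace character, then one or more digits.
    static final String FLEXIBLE = "acme\\s?[0-9]+";

    // Returns true only if the whole text matches the pattern.
    static boolean matchesWhole(String regex, String text) {
        return Pattern.matches(regex, text);
    }

    public static void main(String[] args) {
        System.out.println(matchesWhole(EITHER_PREFIX, "acme123456"));  // first alternative
        System.out.println(matchesWhole(EITHER_PREFIX, "enron123456")); // second alternative
        System.out.println(matchesWhole(EITHER_PREFIX, "acme12345"));   // only five digits
        System.out.println(matchesWhole(FLEXIBLE, "acme 12"));          // space is optional
        System.out.println(matchesWhole(FLEXIBLE, "acme"));             // + needs a digit
    }
}
```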

RegEx flags

With RegEx flags you can change how a particular string gets interpreted. RegEx flags only work in Analytics, not in dtSearch. While there are various RegEx flags, the following example focuses on the flag for forcing a case-insensitive match. This is important because text in the Analytics pipeline retains case.

Flag: (?i)
Description: Forces a case-insensitive match.
Examples:
  • (?i)(acme) = ACME
  • (?i)(acme) = Acme
  • (?i)(acme) = acme

By default, Analytics RegEx employs global matching. There is no need to turn on the global searching flag (?g). With global matching, a single RegEx filters out multiple terms in a document's extracted text that match a particular pattern, not just the first instance of that pattern.
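The effect of the (?i) flag and of global matching can be sketched in plain Java. The example assumes that a filter removes every match, which is modeled here with String.replaceAll; how Analytics applies the filter internally is not shown in this document, and the names are hypothetical.

```java
class CaseInsensitiveFilter {
    // Models a filter by deleting every match of the pattern from the text.
    static String strip(String regex, String text) {
        return text.replaceAll(regex, "");
    }

    public static void main(String[] args) {
        String text = "ACME Acme acme";

        // With (?i), all three spellings are removed; only the two spaces remain.
        System.out.println("[" + strip("(?i)(acme)", text) + "]");
        // Without the flag, only the lowercase occurrence is removed.
        System.out.println("[" + strip("(acme)", text) + "]");
    }
}
```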

Escaping RegEx metacharacters

When using RegEx to search for a character that is a reserved metacharacter, use the backslash \ to escape the character so that it is treated as a literal. The following table gives an example of how to escape a reserved metacharacter when searching.

Search for: International phone number (UK)
RegEx: \+[0-9]{12}
Match results:
  • +447700900954
  • +447700900312

If the + sign is not escaped with a backslash, RegEx treats + as a quantifier instead of the literal plus sign character.
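A short Java sketch of the escaping rule follows. It assumes standard java.util.regex behavior, where a pattern that begins with an unescaped + has nothing to quantify and is rejected when it is compiled. The class and method names are hypothetical.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

class EscapedPlus {
    // Escaped plus sign followed by exactly twelve digits.
    static boolean matchesPhone(String text) {
        return Pattern.matches("\\+[0-9]{12}", text);
    }

    // Returns true if Java accepts the pattern, false if it is rejected as invalid.
    static boolean compiles(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(matchesPhone("+447700900954")); // literal + is matched
        System.out.println(matchesPhone("447700900954"));  // no leading +, no match
        System.out.println(compiles("+[0-9]{12}"));        // unescaped + is a dangling quantifier
    }
}
```

The doubled backslash is required only inside a Java string literal; the RegEx itself is \+[0-9]{12}.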

Analytics RegEx filter examples

The following table provides common examples of extraneous text and the corresponding RegEx that filter out extraneous text.

Extraneous text: Bates numbers with a specific number of digits following the Bates prefix
Examples:
  • ACME028561
  • Acme373894
  • acme
RegEx filter: (?i)(acme[0-9]{6})
This RegEx matches the Bates prefix ACME, regardless of case, followed by six digits.

Extraneous text: Bates numbers with different numbers of digits, some with spaces between the Bates prefix and the numbers
Examples:
  • ACME 123456
  • ACME012345678
  • Acme 794028475612
  • acme 012
RegEx filter: (?i)(acme\s?[0-9]+)
This RegEx matches the Bates prefix ACME, regardless of case, followed by an optional whitespace character and one or more digits. Recall that whitespace characters in the extracted text truncate to a single space in the Analytics pipeline.

Extraneous text: Confidentiality footers
Example: This message and any attached documents contain information which may be confidential, subject to privilege or exempt from disclosure under applicable law. These materials are intended only for the use of the intended recipient. If you are not the intended recipient of this transmission, you are hereby notified that any distribution, disclosure, printing, copying, storage, modification or the taking of any action in reliance upon this transmission is strictly prohibited.
RegEx filter: (?i)(This message and.*?strictly prohibited\.?)
This RegEx matches a footer block that begins with the phrase "This message and", along with any text that follows this phrase, up to the next instance of the phrase "strictly prohibited", with or without a period at the end of the footer block.

Extraneous text: Combination of different Bates numbers and confidentiality footers
Examples:
  • ENRON 123456
  • enron123456789
  • Acme 13894037
  • ACME003846
  • This message and any attached documents contain information which may be confidential, subject to privilege or exempt from disclosure under applicable law. These materials are intended only for the use of the intended recipient. If you are not the intended recipient of this transmission, you are hereby notified that any distribution, disclosure, printing, copying, storage, modification or the taking of any action in reliance upon this transmission is strictly prohibited.
RegEx filter: (?i)(enron\s?[0-9]+)|(acme\s?[0-9]+)|(This message and.*?strictly prohibited\.?)

The alternation metacharacter | permits multiple regular expressions in a single RegEx filter. Also, even though you are constructing three RegEx patterns, you only need to apply (?i) once at the beginning of the RegEx string to force a case-insensitive match across all three patterns.
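The combined filter can be exercised in plain Java as a sanity check before it is tested against real pipeline text. This is a minimal sketch: filtering is modeled with String.replaceAll, which is an assumption about how matches are removed, and the class name and sample text are hypothetical.

```java
class CombinedFilter {
    // The combined filter from the table, written as a Java string literal
    // (each backslash is doubled in Java source).
    static final String FILTER =
        "(?i)(enron\\s?[0-9]+)|(acme\\s?[0-9]+)|(This message and.*?strictly prohibited\\.?)";

    // Models the filter by deleting every match from the text.
    static String apply(String text) {
        return text.replaceAll(FILTER, "");
    }

    public static void main(String[] args) {
        // Hypothetical single-line sample with mixed case to exercise (?i).
        String text = "Review the draft. ENRON 123456 this message and any attached documents "
                    + "may be confidential. Copying is STRICTLY PROHIBITED. Acme 13894037 Thanks.";

        // Both Bates numbers and the footer are removed; the surrounding text remains.
        System.out.println(apply(text));
    }
}
```

In standard Java, the dot does not match line break characters unless a flag such as (?s) is set, so a footer that contains \r\n in the pipeline text may need a pattern that accounts for the line breaks. Checking the text in the Analytics server UI shows whether that applies.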

RegEx: dtSearch vs. structured analytics

The following table outlines the differences between how RegEx operates in structured analytics versus dtSearch.

dtSearch: Uses the TR1 implementation of RegEx.
Analytics: Uses the Java implementation of RegEx.

dtSearch: All characters in a dtSearch index are normalized to lower case. A dtSearch RegEx must contain lower case characters only.
Analytics: Case is preserved in the Analytics pipeline. Use the flag (?i) to force a case-insensitive pattern match.

dtSearch: Spaces do not exist in dtSearch, but are instead interpreted as word breaks. The metacharacter \s never matches and returns no results.
Analytics: Any spaces are normalized to a single space. The metacharacter \s matches a space.

dtSearch: You need to encapsulate the RegEx string with "##" so that it can be recognized as a regular expression.
Analytics: No special syntax is needed for the RegEx string to be recognized as a regular expression.

dtSearch: The RegEx anchors \b, ^, and $ do not work in dtSearch.
Analytics: You can use RegEx anchors to match positions in a string.
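As an illustration of anchors in Java's RegEx, the sketch below uses the word boundary anchor \b so that a Bates pattern is removed only when it stands alone as a token. Filtering is modeled with String.replaceAll, and the names and sample text are hypothetical.

```java
class WordBoundaryAnchor {
    // Models a filter by deleting every match of the pattern from the text.
    static String strip(String regex, String text) {
        return text.replaceAll(regex, "");
    }

    public static void main(String[] args) {
        String text = "acme123456 xacme1234567";

        // Without anchors, the pattern also removes part of the longer second token.
        System.out.println("[" + strip("acme[0-9]{6}", text) + "]");
        // With \b on both sides, only the standalone Bates number is removed.
        System.out.println("[" + strip("\\bacme[0-9]{6}\\b", text) + "]");
    }
}
```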