Textual near duplicate identification

Textual near duplicate identification is simple in concept, but the implementation is complex and relies on several optimizations to deliver results in a reasonable amount of time. The following is a simplified explanation of the process:

  1. The task extracts the text from the Extracted Text field for all documents.
  2. The task scans the text and saves various statistics for later use. It operates on text only, which has been converted to lowercase. White space and punctuation characters are ignored, except to identify word and sentence boundaries.
  3. The task sorts the documents by size, from largest to smallest. This is the order in which they are processed.
  4. The most visible optimization and organizing concept is the principal document. The principal document is the largest document in a group and is the document that all others are compared to when determining whether they are near duplicates. If the current document is a close enough match to a principal document, as defined by the Minimum Similarity Percentage, it is placed in that principal's group. If it matches no existing group, the current document becomes a new principal document.

    Note: Analyzed documents that are not textually similar enough to any other documents will not have fields populated for Textual Near Duplicate Principal or Textual Near Duplicate Group. Documents that only contain numbers or that do not contain text will have the Textual Near Duplicate Group field set to Numbers Only or Empty, respectively.

  5. When the process is complete, only principal documents that have one or more near duplicates are shown in groups. Documents that have the Textual Near Duplicate Group field set to Empty or Numbers Only are also grouped together.
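
The steps above can be sketched in a few lines of Python. This is an illustration only, not the product's implementation: the function names (normalize, similarity, group_near_duplicates) are hypothetical, document size is assumed to mean word count, and difflib.SequenceMatcher stands in for the actual similarity measure, which this page does not describe.

```python
import re
from difflib import SequenceMatcher


def normalize(text):
    """Lowercase the text and keep only word tokens.

    White space and punctuation act purely as word separators here.
    """
    return re.findall(r"\w+", text.lower())


def similarity(a, b):
    """Stand-in similarity between two token lists, in percent (assumption)."""
    return 100.0 * SequenceMatcher(None, a, b, autojunk=False).ratio()


def group_near_duplicates(docs, min_similarity=90.0):
    """Map each document id to its principal's id, a special label, or None.

    docs is a dict of document id -> extracted text.
    """
    tokens = {doc_id: normalize(text) for doc_id, text in docs.items()}
    result = {}
    members = {}  # principal id -> ids in its group (principal included)

    # Process documents from largest to smallest (size = word count here).
    for doc_id in sorted(docs, key=lambda d: len(tokens[d]), reverse=True):
        words = tokens[doc_id]
        if not words:
            result[doc_id] = "Empty"
            continue
        if all(w.isdigit() for w in words):
            result[doc_id] = "Numbers Only"
            continue
        # Compare only against existing principals, not every document.
        for principal in members:
            if similarity(tokens[principal], words) >= min_similarity:
                members[principal].append(doc_id)
                break
        else:
            # No group matched, so this document becomes a new principal.
            members[doc_id] = [doc_id]

    # Only principals with at least one near duplicate form a visible group.
    for principal, ids in members.items():
        for doc_id in ids:
            result[doc_id] = principal if len(ids) > 1 else None
    return result


docs = {
    "a": "The quick brown fox jumps over the lazy dog near the river bank today.",
    "b": "The quick brown fox jumps over the lazy dog near the river bank.",
    "c": "Quarterly revenue figures are attached for review.",
    "d": "12345 678",
    "e": "",
}
groups = group_near_duplicates(docs)
print(groups)
```

In this sample, documents a and b land in the group whose principal is a, document c has no near duplicates and so gets no group, and d and e receive the Numbers Only and Empty labels.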

Minimum Similarity Percentage

The Minimum Similarity Percentage parameter controls how strictly the task groups documents. It indicates how similar a document must be to a principal document to be placed into that principal's group. A value of 100% indicates an exact textual duplicate. A higher setting requires more similarity and generally results in smaller groups. A higher setting also makes the process run faster because fewer comparisons have to be made.
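
The effect of the setting on group size can be illustrated with a small Python sketch. The helper names (percent_similar, members_of_group) are hypothetical, and difflib.SequenceMatcher over lowercase words is an assumed stand-in for the product's similarity measure, so the percentages are illustrative only.

```python
from difflib import SequenceMatcher


def percent_similar(a, b):
    """Stand-in word-level similarity between two texts, in percent (assumption)."""
    matcher = SequenceMatcher(None, a.lower().split(), b.lower().split(), autojunk=False)
    return 100.0 * matcher.ratio()


def members_of_group(principal, candidates, min_similarity):
    """Return the candidates similar enough to join the principal's group."""
    return [c for c in candidates if percent_similar(principal, c) >= min_similarity]


principal = "please review the attached contract before the friday meeting"
candidates = [
    "please review the attached contract before the friday meeting",  # identical text
    "please review the attached contract before the monday meeting",  # one word changed
    "please review the contract before friday",                       # shortened version
]

# Raising the threshold admits fewer documents into the group.
for threshold in (75, 85, 100):
    matched = members_of_group(principal, candidates, threshold)
    print(threshold, len(matched))
```

With these sample texts, all three candidates qualify at 75, two qualify at 85, and only the identical text qualifies at 100.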