While textual near duplicate identification is simple to understand, the implementation is very complex and relies on several optimizations so that results can be delivered in a reasonable amount of time. The following is a simplified explanation of this process:
- Extracts the text from the Extracted Text field for all documents.
- Scans the text and saves various statistics for later use. The task operates on text only (which has been converted to lowercase). White space and punctuation characters are also ignored, except to identify word and sentence boundaries.
- The documents are sorted by size—from largest to smallest. This is the order in which they are processed.
- When the process is complete, only principal documents that have one or more near duplicates are shown in groups. Documents that have the Textual Near Duplicate Group field set to empty or numbers-only are also grouped together.
- Documents that are not textually similar to any other documents in your analysis group, based on the minimum similarity percentage chosen, end up as “standalone” documents that do not belong to a near duplicate group.
The most visible optimization and organizing notion is the principal document. The principal document is the largest document in a group and is the document that all others are compared to when determining whether they are near duplicates. If the current document is a close enough match to the principal document—as defined by the Minimum Similarity Percentage—it is placed in that group. If no current groups are matches, the current document becomes a new principal document.
Note: Analyzed documents that are not textually similar enough to any other documents will not have fields populated for Textual Near Duplicate Principal or Textual Near Duplicate Group. Documents that only contain numbers or that do not contain text will have the Textual Near Duplicate Group field set to numbers-only or empty, respectively.
See the following related pages:
Minimum Similarity Percentage
The Minimum Similarity Percentage parameter controls how the task works. This parameter indicates how similar a document must be to a principal document to be placed into that principal's group. A value of 100% would indicate an exact textual duplicate. A higher setting requires more similarity and generally results in smaller groups. A higher setting also makes the process run faster because fewer comparisons have to be made.
The following fields are created when you run textual near duplicate identification:
- <Structured Analytics Set prefix>::Textual Near Duplicate Principal - identifies the principal document with a "Yes" value. The principal is the largest document in the duplicate group. It acts as an anchor document to which all other documents in the near duplicate group are compared. If the document does not match with any other document in the data set, this field is set to No.
- <Structured Analytics Set prefix>::Textual Near Duplicate Similarity - the percent value of similarity between the near duplicate document and its principal document. If the document does not match with any other document in the data set, this field is set to 0.
- <Structured Analytics Set prefix>::Textual Near Duplicate Group - this is the field that acts as the identifier for a given group of textual near duplicate documents. If the document contains text but does not match with any other document in the data set, this field is empty. Documents that only contain numbers or that do not contain text will have the <Structured Analytics Set prefix>::Textual Near Duplicate Group field set to numbers-only or empty, respectively.