This recipe shows how to use textual near duplicate detection to further divide your documents into groups of textual exact duplicates.
Requirements:
- Structured Analytics
- Relativity 18.104.22.168 or above
You can run this analysis across all documents, including emails, or across a subset of documents. For example, the subset could include only certain file types, such as Microsoft Word, PDF, and text documents. Either setup works.
- Create a field to store the duplicate group ID. It should be a relational Fixed-Length Text field:
- Name: Textual Exact Duplicate Group
- Field Type: Fixed-Length Text
- Length: 255
- Under Relational Field Properties, set the following fields:
- Relational: Yes
- Friendly Name: Text Duplicates
- Import Behavior: Leave blank values unchanged
- Pane icon: duplicates.png (you can also use the near duplicate icon, or something else entirely)
- Order: 100
- Relational View: Textual Near Duplicates Relational View (or create your own)
- Create a new Structured Analytics Set with the following properties:
- Name: Text Exact Duplicates
- Set prefix: X1
- Select document set to analyze: choose a saved search that you want to run this analysis on
- Select operations: Textual near duplicate identification
- Under Textual Near Duplicate Identification settings, set the following fields:
- Minimum similarity percentage: 100
- Ignore numbers: No
- Destination Textual Near Duplicate Group: Textual Exact Duplicate Group (the field you created in the previous step)
- Click Save.
- Click Run Structured Analytics. A pop-up appears.
- Click Run. The first run on a new set always populates all documents in the set.
- Add your textual exact duplicate groups to your views, saved searches, etc.
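To make the idea behind the steps above concrete, here is a minimal Python sketch of what grouping by 100% textual similarity amounts to: documents whose extracted text is identical end up in the same group. This is an illustration only, not how Relativity implements the analysis. The function names, the whitespace normalization, and the `X1-0001` group ID format are all assumptions made for this example.

```python
import hashlib
from collections import defaultdict


def normalize(text: str) -> str:
    # Assumption: collapse runs of whitespace so layout differences are ignored.
    # Numbers are kept, mirroring the "Ignore numbers: No" setting.
    return " ".join(text.split())


def group_exact_text_duplicates(docs: dict[str, str], prefix: str = "X1") -> dict[str, str]:
    """Map each document ID to a hypothetical group ID shared by identical texts."""
    by_hash: dict[str, list[str]] = defaultdict(list)
    for doc_id, text in docs.items():
        digest = hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()
        by_hash[digest].append(doc_id)

    groups: dict[str, str] = {}
    # Number the groups in the order their first document was seen.
    for n, members in enumerate(by_hash.values(), start=1):
        group_id = f"{prefix}-{n:04d}"  # hypothetical ID format
        for doc_id in members:
            groups[doc_id] = group_id
    return groups


docs = {
    "DOC001": "Quarterly report for 2019",
    "DOC002": "Quarterly  report for 2019",  # differs only in spacing
    "DOC003": "Quarterly report for 2020",   # differs in a number
}
groups = group_exact_text_duplicates(docs)
```

In this sketch, `DOC001` and `DOC002` share a group, while `DOC003` gets its own because the year differs and numbers are not ignored. Unlike this sketch, the real analysis may leave documents without any duplicates ungrouped.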