Repeated content identification setup basics
This quick reference guide contains a basic workflow for setting up repeated content identification. For more detailed information, see Analytics.
Repeated content identification setup
The setup for running repeated content identification consists of two components:
- Saved search
- Structured analytics set
1. Saved Search Setup
Use the following conditions and fields to create the saved search used for repeated content identification. You do not need to set a sort order on this search.
Search Name
There is no recommendation for the saved search name. Follow your team’s normal protocol for naming searches.
Conditions
You can reuse the conditions from your Conceptual Index search, even if they differ from the conditions listed below.
- Extracted text size is greater than 0 KB.
- Extracted text size is less than 30,000 KB.
Note: For workspaces with millions of documents, we recommend that you consider a sampling workflow. For more information, see Sampling for Repeated Content on the documentation site.
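As a rough illustration of the two size conditions above, the following Python sketch filters a set of documents by extracted text size. It does not use any Relativity API; the function name, the document IDs, and the sample sizes are all hypothetical and exist only to show how the bounds behave at the edges.

```python
# Hypothetical sketch: mirrors the two saved search conditions outside Relativity.
MIN_SIZE_KB = 0        # exclusive lower bound (documents with no text are skipped)
MAX_SIZE_KB = 30_000   # exclusive upper bound


def eligible_for_search(extracted_text_size_kb: float) -> bool:
    """Return True when the size is greater than 0 KB and less than 30,000 KB."""
    return MIN_SIZE_KB < extracted_text_size_kb < MAX_SIZE_KB


# Made-up sample data: document ID -> extracted text size in KB.
sizes_kb = {
    "DOC-001": 0,        # no extracted text, excluded
    "DOC-002": 12.5,     # included
    "DOC-003": 30_000,   # exactly at the upper bound, excluded
    "DOC-004": 29_999,   # included
}

eligible = [doc_id for doc_id, size in sizes_kb.items() if eligible_for_search(size)]
print(eligible)
```

Both bounds are strict, so a document of exactly 30,000 KB falls outside the search, as does a document with no extracted text.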
Fields
Any fields are acceptable.
2. Structured Analytics Set
Here are the steps and choices for creating a structured analytics set.
Structured Analytics Set Information
- Name: enter a name for the structured analytics set.
- Prefix: keep the default prefix or add your own prefix. Shorter prefixes, even just two characters, such as “LI,” take up less space in your views.
- Operations to run: select Repeated content identification.
- Data source: select the saved search you created above.
Repeated Content Identification
- Minimum number of occurrences: the minimum number of times a phrase must appear to be considered repeated content. We typically set this to 0.005 times (0.5% of) the number of documents in your saved search.
- Minimum number of words: leave as default.
- Maximum number of words: leave as default.
- Maximum number of lines to return: leave as default.
- Number of tail lines to analyze: leave as default.
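The 0.005 guideline for the minimum number of occurrences can be worked out with a short calculation. The Python sketch below is one way to do it. The function name is hypothetical, and rounding up to a whole number with a floor of 1 is an assumption made here for small document sets, not a rule stated in this guide.

```python
def suggested_min_occurrences(document_count: int) -> int:
    """Suggest a minimum occurrences value: 0.005 times the document count.

    Assumptions (not from the guide): the result is rounded up to a whole
    number and is never lower than 1.
    """
    if document_count < 0:
        raise ValueError("document_count must not be negative")
    # 0.005 == 5 / 1000; integer ceiling division avoids float rounding issues.
    scaled = document_count * 5
    return max(1, (scaled + 999) // 1000)


# Example: a saved search that returns 100,000 documents.
print(suggested_min_occurrences(100_000))
```

For a saved search of 100,000 documents this yields 500, so a phrase would need to appear at least 500 times before it is treated as repeated content.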
Optional Settings
Choose the appropriate analytics server.