Published October 20, 2016
Relativity Analytics uses only the documents and fields you provide to make a search index. Generally, one index is built for each workspace. However, if you wish to limit search results to certain document groups or have more than one language in the document set, multiple indexes might give you better results.
- Existing decimal-type field
- The saved search is based on text size. Your processing tool might already provide the text size. If not, select the documents in your search and run the mass operation described in the following steps.
- Run the mass operation Set Long Text Field Size.
- Choose the long text field that holds your text information and a decimal field to hold the size. This updates the decimal field with the text size in KB.
- Create a saved search of the data and name it Analytics Data Search or something similar.
- Return all documents with less than 30 MB of text, which translates to the conditions “Extracted Text Size is less than 30,000” and “Extracted Text Size is greater than 0”. (This returns only documents whose text size has been set and excludes anything large enough to encumber the system.)
- Ensure that the Extracted Text field is the only one returned.
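The two search conditions can be sketched in Python to make the boundaries explicit. This is only an illustration of the filter logic, not part of Relativity: the function name, the document IDs, and the sizes are hypothetical, and it assumes the decimal field already holds the text size in KB, with `None` standing in for a field that was never populated.

```python
# Upper bound from the saved search: 30,000 KB is roughly 30 MB.
MAX_TEXT_SIZE_KB = 30_000


def in_analytics_data_search(text_size_kb):
    """Return True when 0 < size < 30,000 KB.

    None models a document whose size field was never populated
    (the mass operation has not been run on it).
    """
    return text_size_kb is not None and 0 < text_size_kb < MAX_TEXT_SIZE_KB


# Hypothetical documents mapped to their Extracted Text Size in KB.
docs = {
    "DOC-001": 12.5,      # ordinary document: included
    "DOC-002": 0,         # no extracted text: excluded
    "DOC-003": 45_000,    # about 45 MB: excluded as too large
    "DOC-004": None,      # size never set: excluded
    "DOC-005": 29_999.9,  # just under the limit: included
}

selected = [doc_id for doc_id, size in docs.items() if in_analytics_data_search(size)]
print(selected)
```

Both bounds are strict, matching the "less than" and "greater than" wording of the conditions.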
If the search contains 100,000 documents or more, we recommend that you follow the recipe Sampling for repeated content.
- If the search contains fewer than 100,000 documents, set the Minimum number of occurrences to 0.5% of the total population. Leave the rest of the settings at their defaults.
- Review the results of repeated content identification, and flag only the text blocks which correspond to unwanted (non-authored) content.
- Use the Analytics Data Search created above as both the training set and the searchable set (unless it is for a very large workspace).
- In the Advanced Settings, ensure that you set Optimize Training Set, Remove English Signatures and Footers, and Email Header Filter to Yes.
- Link the appropriate Repeated Content filter results from Step 5 above.
Note: If you have a large number of Repeated Content results, the results with the most occurrences and the most words will typically suffice.
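As a worked example of the arithmetic, the sketch below computes 0.5% of the population for the Minimum number of occurrences setting and ranks repeated content results by occurrences and word count. It is an illustration under stated assumptions rather than anything Relativity provides: the function names, the sample results, rounding up to a whole number, and the floor of 1 are all hypothetical choices.

```python
SAMPLING_THRESHOLD = 100_000  # at or above this, the sampling recipe is recommended


def minimum_occurrences(total_documents):
    """Return 0.5% of the population, rounded up (rounding is an assumption)."""
    if total_documents >= SAMPLING_THRESHOLD:
        raise ValueError("100,000 documents or more: use the sampling recipe instead")
    # Integer arithmetic for ceil(total * 0.005); avoids floating-point surprises.
    return max(1, (total_documents * 5 + 999) // 1000)


def top_repeated_content(results, limit):
    """Rank (name, occurrences, word_count) tuples, highest values first."""
    ranked = sorted(results, key=lambda r: (r[1], r[2]), reverse=True)
    return ranked[:limit]


print(minimum_occurrences(80_000))  # 0.5% of 80,000 documents

# Hypothetical repeated content results: (name, occurrences, word count).
results = [
    ("Short sign-off", 300, 6),
    ("Confidentiality footer", 1200, 85),
    ("Virus scan notice", 1200, 20),
]
for name, occurrences, words in top_repeated_content(results, limit=2):
    print(name, occurrences, words)
```

Sorting by occurrences first and word count second is one reasonable reading of the note; weighting the two differently may suit some data sets better.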
The following are explanations of the various settings selected above:
- Email Header Filter - removes email headers throughout email-formatted text.
- Repeated Content - removes blocks of text, such as email footer information. Instances of repeated content may be detected and converted into filters automatically by a structured analytics set.
- Optimize Training Set - this option automatically excludes poor quality index training documents based on an analysis of their text. Commonly excluded content includes extremely large documents and files predominated by tables of numbers, long garbage strings of characters, and a preponderance of symbols (rather than words).
- Remove English Signatures and Footers - this option excludes email signatures and footers when it finds them (although it is intentionally conservative to avoid over-exclusion). This feature works for English signatures and footers only.
- Single-word queries tend not to return valid results, except with keyword expansion.
- Analytics indexes are not available for searching while they are in the build phase of an incremental update or while they are being completely repopulated. A second, inactive index can serve as a backup and allows continued access during updates. Keep in mind, however, that this secondary index takes up server space.
- Prior to population, remember to link the desired repeated content filters to the Analytics profile to be used with your index.