Analytics indexes

Unlike traditional searching methods like dtSearch, Analytics is an entirely mathematical approach to indexing documents. It doesn’t use any outside word lists, such as dictionaries or thesauri, and it isn’t limited to a specific set of languages. Unlike textual indexing, word order is not a factor.

There are two types of Analytics indexes:

  • Conceptual - uses Latent Semantic Indexing (LSI) to discover concepts between documents. This indexing process is based solely on term co-occurrence. The language, concepts, and relationships are defined entirely by the contents of your documents and learned by the index. For more information, see Analytics and Latent Semantic Indexing (LSI).
  • Classification - uses coded examples to build a Support Vector Machine (SVM) to predict a document's relevance. This index is used solely by the Active Learning application. Classification indexes learn how terms are related to categories based on the contents of your documents and coding decisions made within the Active Learning project. For more information, see Analytics and Support Vector Machine learning (SVM).

You can run the following Analytics operations on documents indexed by a conceptual index:

This page contains the following information:

Analytics and Latent Semantic Indexing (LSI)

Analytics and Support Vector Machine learning (SVM)

Creating an Analytics index

Analytics uses only the documents you provide to make a search index. Because no outside word lists are used, you must create saved searches to dictate which documents are used to build the index. If you want to limit search results to certain document groups, or if the document set contains more than one language, multiple indexes might give you better results.

Note: Permissions for the Search Index object must be kept in sync with permissions on the Analytics Index object. Refer to the Analytics Index object description in the Workspace permissions.

Conceptual index

Classification index

Securing an Analytics index

If you want to apply item-level or workspace-level security to an Analytics index, you must secure both the Analytics Index object and the Search Index object for that particular index.

Restricting a group from viewing an Analytics Index does not restrict them from searching on the index unless access to the corresponding Search Index is also restricted.

Note: If you’re applying item-level security from the Search Indexes tab, you may need to create a new view and add the security field to the view.

Training and searchable set considerations

Analytics index console operations

Once you save the Analytics index, the Analytics index console appears. From the Analytics index console, you can perform the following operations:

Note: Population statistics and index statistics are only available for conceptual indexes.

Populating the index

To populate the Analytics index on the full set of documents, click Populate Index: Full on the Analytics Index console. This adds all documents from the training set and searchable set to the ready-to-index list. Document “preprocessing” also occurs to clean up text. This includes the following:

  • Numbers and symbols are ignored.
  • All words are made lowercase.
  • Filters found under Advanced Settings are applied (for example, email header filter).
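The first two preprocessing rules can be illustrated with a minimal Python sketch. This is not the Analytics engine's own code. The function name `preprocess` and the tokenization rule are assumptions made for illustration, and the Advanced Settings filters (such as the email header filter) are not modeled.

```python
import re

def preprocess(text: str) -> list[str]:
    """Hypothetical helper: lowercase the text and keep only runs of letters."""
    lowered = text.lower()
    # [^\W\d_]+ matches runs of Unicode letters only, so digits,
    # punctuation, and other symbols are ignored.
    return re.findall(r"[^\W\d_]+", lowered)

print(preprocess("Invoice #4521: Payment DUE on 03/15!"))
# ['invoice', 'payment', 'due', 'on']
```

Because the pattern is Unicode-aware, the sketch does not depend on any particular language or word list.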

Once population is complete, the index builds.

Note: If you have access to SQL, you can change the priority of any Analytics index-related job (index build, population, etc.) by changing the value in the Priority column of the ContentAnalystIndexJob database table for that index. This column is null by default, and null is the lowest priority. The higher the number in the Priority column, the higher the priority of that job. If you change the priority of a job while another job is in progress, Analytics doesn't stop the in-progress job. Instead, that job finishes before Analytics starts the job that now has the highest priority.
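The ordering rule in this note can be sketched in Python as follows. The `IndexJob` class and `processing_order` function are hypothetical names used only for illustration and do not reflect the actual schema of the ContentAnalystIndexJob table. Breaking ties by submission order is an assumption based on the default behavior described elsewhere on this page.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class IndexJob:
    name: str
    submitted_order: int
    priority: Optional[int] = None  # null by default, which is the lowest priority

def processing_order(jobs: list[IndexJob]) -> list[IndexJob]:
    """Sort waiting jobs: higher Priority first, null last, ties by submission order."""
    def key(job: IndexJob):
        is_null = job.priority is None
        # False sorts before True, so jobs with a priority come first.
        return (is_null, -(job.priority or 0), job.submitted_order)
    return sorted(jobs, key=key)

waiting = [
    IndexJob("A", 1),
    IndexJob("B", 2, priority=5),
    IndexJob("C", 3, priority=10),
    IndexJob("D", 4),
]
print([job.name for job in processing_order(waiting)])
# ['C', 'B', 'A', 'D']
```

The sketch only orders jobs that are waiting. A job already in progress is not interrupted.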

Canceling population

While the index is populating, the following console option becomes available:

  • Cancel Population - cancels a full or incremental population. Canceling population requires you to perform a full population later. After you click this button, any document with a status of Populated is indexed. After the indexing of those documents is complete, population stops, leaving an unusable partial index. To repair the index, perform a Full Population to purge the existing data. You can also delete the index from Relativity entirely.

Incremental population

Once population is complete, you can populate incrementally to add new documents from the training and/or searchable sets to the ready-to-index list and to account for documents removed from those sets. To perform an incremental population, click Populate Index: Incremental in the console. See Incremental population considerations for conceptual indexes for more information.

    Notes:
  • If, after building your index, you want to add any documents that were previously excluded from training back into the training set document pool, you must disable the Optimize training set field on the index and perform another full population. An incremental population does not re-introduce these previously excluded documents.
  • Beginning in 9.6.134.78, incremental population automatically triggers a rebuild of the classification index.

Documents greater than 30 MB

Beginning in 9.6.134.78, Analytics indexes automatically suppress documents greater than 30 MB before sending them to the Analytics engine. Suppressed large documents appear in the Document Errors. You can also view suppressed documents from the Document list by using the Excluded from Training and Excluded from Searchable Set choices on the Analytics Index Document field.
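As a rough illustration of this size rule, the sketch below splits documents into those sent to the engine and those suppressed. The function name, the input shape (artifact ID mapped to text size in bytes), and the reading of 30 MB as 30 × 1024 × 1024 bytes are all assumptions. The source does not state how the size is measured.

```python
# Assumption: "30 MB" is interpreted here as 30 * 1024 * 1024 bytes.
MAX_DOCUMENT_BYTES = 30 * 1024 * 1024

def split_by_size(doc_sizes: dict[int, int]) -> tuple[list[int], list[int]]:
    """Return (accepted, suppressed) artifact IDs based on document size."""
    accepted, suppressed = [], []
    for artifact_id, size_in_bytes in doc_sizes.items():
        if size_in_bytes > MAX_DOCUMENT_BYTES:
            suppressed.append(artifact_id)  # would be reported as a document error
        else:
            accepted.append(artifact_id)
    return accepted, suppressed

accepted, suppressed = split_by_size({1001: 2_000, 1002: 45 * 1024 * 1024})
print(accepted, suppressed)
# [1001] [1002]
```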

Building the index

Once population is complete, the index is ready to be built. This step is required after both full and incremental populations. If you selected No from Continue index steps to completion when you started population, you must manually build the index from the console when the option becomes available. To build the index, click Build Index.

During this phase, training set documents and Latent Semantic Indexing (LSI) are used to build the concept space based on the relationships between words and documents. Searchable set documents are mapped into the concept space, and concept stop words (very common words) are filtered from the index to improve quality.

Please note that the index is unavailable for searching during this phase.

Monitoring population/build status

You can monitor the progress of any Analytics index process with the status progress panel at the top of the layout.

Analytics index status progress panel

Population and index building occur in the following stages:

  • Step 0 of 3 – Waiting – Indexing Job in Queue
  • Step 1 of 3 – Populating
    • Constructing Population Table
    • Populating
  • Step 2 of 3 – Building
    • Preparing to build
    • Building

      • Starting
      • Copying item data
      • Feature weighting
      • Computing correlations
      • Initializing vector spaces
      • Updating searchable items
      • Optimizing vector space queries
      • Finalizing
  • Step 3 of 3 – Activating
    • Preparing to Enable Queries
    • Enabling Queries
    • Activating

Index status fields

The following fields appear in the Index Status section:

  • Active status - whether the index is active or inactive.
  • Status - where the index is in the build process. Also displays the system recommendation for the next step you should take in the index process (for example, Activation).

Document Breakdown section

  • Training Set - the number of indexed training set documents.
  • Searchable Set - the number of indexed searchable set documents.
  • Document Errors - whether or not there are document errors in your index after the most recent population.

Note: If an Analytics index goes unused for 30 days, it is automatically disabled to conserve server resources. It then has a status of Inactive and is not available for use until it is activated again. This setting is determined by the MaxAnalyticsIndexIdleDays entry in the Instance setting table. You can edit the default value for this entry to change the maximum number of idle days for an index.
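The idle-day rule can be expressed as a simple date comparison. The sketch below is illustrative only. The function name `should_deactivate` is hypothetical, and treating exactly 30 idle days as already idle is an assumption about the boundary.

```python
from datetime import datetime, timedelta

MAX_ANALYTICS_INDEX_IDLE_DAYS = 30  # default value described in the note above

def should_deactivate(last_used_on: datetime, now: datetime,
                      max_idle_days: int = MAX_ANALYTICS_INDEX_IDLE_DAYS) -> bool:
    """Return True when the index has gone unused for max_idle_days or longer."""
    return now - last_used_on >= timedelta(days=max_idle_days)

print(should_deactivate(datetime(2024, 1, 1), datetime(2024, 1, 31)))  # True
print(should_deactivate(datetime(2024, 1, 1), datetime(2024, 1, 30)))  # False
```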

Usage Information fields

The Usage Information fields also provide useful data.

The following fields appear in the Usage Information section:

  • Created On - the date and time at which the index was created.
  • Last Used On - the date and time at which the index was last used.
  • Last Error - the last error encountered.

Note: If you automated the index building process but return to check on its progress, you might still see status values containing the word recommended even though the next process is started automatically.

Activating the index

Once you build a conceptual index, you can activate the index. This loads the index to server memory and prepares it for use, and then makes the index available for users by adding the index to the search drop-down menu on the Documents tab and to the right-click menu in the viewer. You must activate the index in order to make it searchable.

If you selected No from Continue index steps to completion when you started population, you must manually activate the index from the console when the option becomes available.

Note: Activating an index loads the index's data into RAM on the Analytics server. Activating a large number of indexes at the same time can consume much of the memory on the Analytics server, so you should typically only activate indexes that are actively being used to query or classify documents.

To activate the index, click Activate Index.

Deactivating the index

Once a conceptual index is activated, you have the option of deactivating it.

You may need to deactivate an index for the following reasons:

  • You need to rebuild the index and did not select the Continue index steps to completion setting.
  • You need to shut the index down so it doesn't continue using RAM.
  • The index is inactive but you don't want to completely remove it.

To deactivate an index, click Deactivate Index on the console.

Note: If you deactivate an index, you can't run concept searches against the index and keyword expansion becomes unavailable on the index.

Retrying errors

If errors occur during population or index build, you have the option of retrying them from the console.

To do this, click Retry Errors.

Retrying errors attempts to populate the index again.

Note: You can only populate one index at a time. If you submit more than one index for population, they'll be processed in order of submission by default.

Showing document errors

To see a list of all document errors, click Show Document Errors.

This displays a record of all document errors in the index and includes the following fields:

Example document errors report opened from the Analytics Index console

  • ArtifactID - the artifact ID of the document that received the error.
  • Message - the system-generated message accompanying the error.
  • Status - the current state of the errored document. The possible values are:
    • Removed From Index - indicates that the errored document was removed from the index.
    • Included in Index - indicates that the errored document was included in the index because you didn't select the option to remove it.
  • Date Removed - the date and time at which the errored document was removed from the index.

Showing population statistics

To see a list of population statistics, click Show Population Statistics.

Note: Population statistics are only available for conceptual indexes.

This option is available immediately after you save the index, but all rows in this window display a value of 0 until population is started.

This displays a list of population statistics that includes the following fields:

Example Population Statistics window for an Analytics index

  • Status - the state of the documents included in the index. This contains the following values:
    • Pending - documents waiting to be included in either population or index build.
    • Processing - documents currently in the process of being populated or indexed.
    • Processed - documents that have finished being populated or indexed.
    • Error - documents that encountered errors in either population or index build.
    • Excluded - documents that were removed from the index as per the Optimize training set field setting or by removing documents in error.
    • Total - the total number of documents in the index, including errored documents.
  • Training Set - documents designated for the training set that are currently in one of the statuses listed in the Status field.
  • Searchable Set - documents designated for the searchable set that are currently in one of the statuses listed in the Status field.
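The layout of this window can be pictured as a tally of documents by status and by set. The following sketch is hypothetical and is not how Relativity computes the values. In particular, computing Total as the sum of every status row is an assumption.

```python
from collections import Counter

STATUSES = ["Pending", "Processing", "Processed", "Error", "Excluded"]
COLUMNS = ["Training Set", "Searchable Set"]

def population_statistics(documents):
    """documents: iterable of (set_name, status) pairs, one per document."""
    counts = Counter(documents)
    rows = {
        status: {column: counts[(column, status)] for column in COLUMNS}
        for status in STATUSES
    }
    # Assumption: Total is the sum of all status rows for each column.
    rows["Total"] = {
        column: sum(rows[status][column] for status in STATUSES)
        for column in COLUMNS
    }
    return rows

docs = [
    ("Training Set", "Processed"),
    ("Training Set", "Error"),
    ("Searchable Set", "Processed"),
    ("Searchable Set", "Processed"),
    ("Searchable Set", "Pending"),
]
for status, row in population_statistics(docs).items():
    print(status, row)
```

Before population starts, every row in such a tally would be 0, which matches the behavior described above.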

Showing index statistics

To see an in-depth set of index details, click Show Index Statistics. This information can be helpful when investigating issues with your index.

Note: Index statistics are only available for conceptual indexes.

Clicking this displays a view with the following fields:

  • Initial Build Date - the date and time at which the index was first built.
  • Dimensions - the number of concept space dimensions specified by the Analytics profile used for this index.
  • Index ID - the automatically generated ID created with a new index. Its format is: {Workspace ID}_{incrementing number}.
  • Unique Words in the Index - the total number of words in all documents in the training set, excluding duplicates. If a word occurs in multiple documents or multiple times in the same document, it's only counted once.
  • Searchable Documents - the number of documents in the searchable set, determined by the saved search you selected in the Searchable Set field when creating the index.
  • Training Documents - the number of documents in the training set, determined by the saved search you selected for the Training Set field when creating the index. The normal range is two-thirds of the searchable set up to five million documents, after which it is half of the searchable set. If this value is outside that range, you receive a note next to the value.
  • Unique Words per Document - the total number of words, excluding duplicates, per document in the training set. The normal range is 0.80 - 10.00. If this field shows a value lower or higher than this range, a note appears next to the value. If your dataset has many long technical manuals, this number may be higher for your index. However, a high value might also indicate a problem with the data, such as poor quality OCR.
  • Average Document Size in Words - the average number of words in each document in the training set. The normal range is 120-200. If this field displays a value lower or higher than this range, you receive a note next to the value. If the data contains many very short emails, or errors in the extracted text field, the number might be smaller than usual. If the saved search did not return long text fields, you may also see a value below the normal range. If it contains long documents, the number could be higher than usual. If this number is extremely low (under 10), it's likely the saved searches for the index were set up incorrectly.
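To make the normal ranges above concrete, here is a hedged sketch that computes two of these statistics from tokenized training documents and flags out-of-range values. The function names are hypothetical. Defining Unique Words per Document as the unique words in the index divided by the number of training documents is an assumption, since the source does not give the exact formula.

```python
UNIQUE_WORDS_PER_DOC_RANGE = (0.80, 10.00)  # normal range from the text
AVERAGE_DOC_SIZE_RANGE = (120, 200)         # normal range from the text
EXTREMELY_LOW_AVERAGE = 10                  # "extremely low" threshold from the text

def index_statistics(training_docs: list[list[str]]) -> dict:
    """training_docs: one list of word tokens per training set document."""
    if not training_docs:
        raise ValueError("training set is empty")
    unique_words = {word for doc in training_docs for word in doc}
    doc_count = len(training_docs)
    return {
        "unique_words": len(unique_words),
        # Assumption: ratio of unique words in the index to training documents.
        "unique_words_per_document": len(unique_words) / doc_count,
        "average_document_size": sum(len(doc) for doc in training_docs) / doc_count,
    }

def range_notes(stats: dict) -> list[str]:
    """Return a note for each statistic that falls outside its normal range."""
    notes = []
    low, high = UNIQUE_WORDS_PER_DOC_RANGE
    if not low <= stats["unique_words_per_document"] <= high:
        notes.append("Unique Words per Document is outside 0.80 - 10.00")
    low, high = AVERAGE_DOC_SIZE_RANGE
    if not low <= stats["average_document_size"] <= high:
        notes.append("Average Document Size in Words is outside 120 - 200")
    if stats["average_document_size"] < EXTREMELY_LOW_AVERAGE:
        notes.append("Average size is under 10: check the saved searches for the index")
    return notes

stats = index_statistics([
    ["contract", "payment", "due"],
    ["payment", "late", "fee", "payment"],
])
print(stats)
print(range_notes(stats))
```

With this tiny sample, the average document size is 3.5 words, so both size-related notes are returned.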

Best practices for updating a conceptual index

There may be times when you need to update your index. Depending on the update you’re making, you can save time by running an incremental population or only running a build. The following table outlines various workflows for different index updates.

Index update: Adding new documents that:

  • Introduce new concepts
  • Make up more than 10% - 30% of your document population

Workflow:

  1. Add documents to both the training and searchable set saved search.
  2. Click Populate Index: Incremental.

Index update: Adding new documents that:

  • Don’t introduce new concepts
  • Make up less than 10% - 30% of your document population

Workflow:

  1. Add documents to the searchable set saved search only.
  2. Click Populate Index: Incremental.

Index update: Removing documents from the training or searchable set

Workflow:

  1. Remove documents from the training or searchable set saved search.
  2. Click Populate Index: Incremental.

Index update: Updating concept stop words

Workflow:

  1. Click Deactivate Index.
  2. Click Build Index.

Index update: Updating extracted text (for example, updating poor-quality OCR text)

Workflow:

  1. Update extracted text.
  2. Click Populate Index: Full.

Index update: Updating filters (email header, repeated content)

Workflow:

  1. Update filters.
  2. Click Populate Index: Full.
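The workflows above amount to a lookup from the kind of update to the console steps that follow it. A minimal sketch of that lookup is shown here. The dictionary keys and the function name are hypothetical labels chosen for illustration, and only the step text comes from this page.

```python
# Hypothetical keys; the step text mirrors the workflows listed above.
UPDATE_WORKFLOWS = {
    "add_documents_with_new_concepts": [
        "Add documents to both the training and searchable set saved search.",
        "Click Populate Index: Incremental.",
    ],
    "add_documents_without_new_concepts": [
        "Add documents to the searchable set saved search only.",
        "Click Populate Index: Incremental.",
    ],
    "remove_documents": [
        "Remove documents from the training or searchable set saved search.",
        "Click Populate Index: Incremental.",
    ],
    "update_concept_stop_words": [
        "Click Deactivate Index.",
        "Click Build Index.",
    ],
    "update_extracted_text": [
        "Update extracted text.",
        "Click Populate Index: Full.",
    ],
    "update_filters": [
        "Update filters.",
        "Click Populate Index: Full.",
    ],
}

def requires_full_population(update: str) -> bool:
    """Return True when the workflow for this update ends in a full population."""
    return UPDATE_WORKFLOWS[update][-1] == "Click Populate Index: Full."

print(requires_full_population("update_filters"))    # True
print(requires_full_population("remove_documents"))  # False
```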

Incremental population considerations for conceptual indexes

Incremental populations don't necessarily force Analytics to go through every stage of an index build.

When managing or updating indexes with new documents, consider the following guidelines:

  • If your index has 1 million records and you're adding 100,000 more, those documents could potentially teach a substantial amount of new information to your index. In this instance, you would update both the Training Set and Searchable Set. However, if you were only adding 5,000 documents, they are unlikely to contain many new concepts in relation to the rest of your index. You would most likely only need to add these new documents to your Searchable Set.
  • If the newly imported data is drastically different from the existing data, you need to train on it. If the new data is similar in nature and subject matter, then it would likely be safe to only add it to the Searchable Set.

You can run an incremental population to add or remove documents from your training and searchable set. This results in an index taking substantially less time to build, and therefore less downtime.

If extracted text has changed or if you have applied different filters, you must run a full population.

Note: You only need to manually deactivate, build, and activate the index if you did not select the Continue index steps to completion setting.

  1. Click the Populate Index: Incremental button on the console to populate the index. This checks the Data Sources for both the Training Set and the Searchable Set. If there are any new documents in either, it populates them. If there are documents that no longer appear in the search, it deletes them from the index. At this stage, the Analytics index can be queried against.
  2. Click Deactivate Index.
  3. Click Build Index to re-index the documents. If the Training Set has changed at all, the Analytics engine performs a full re-build of the concept index. If only Searchable Set documents have been added or removed, the engine starts this build at a later step (Updating searchable items). The index must be deactivated in order to perform the build.
  4. Click Activate Index.
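The logic in steps 1 and 3 can be sketched as a set comparison followed by a choice of where the build starts. This is an illustration under stated assumptions, not the Analytics engine's implementation. The function names are hypothetical, and the stage names are taken from the build stages listed under Monitoring population/build status.

```python
BUILD_STAGES = [
    "Starting",
    "Copying item data",
    "Feature weighting",
    "Computing correlations",
    "Initializing vector spaces",
    "Updating searchable items",
    "Optimizing vector space queries",
    "Finalizing",
]

def diff_data_source(indexed_ids: set[int], saved_search_ids: set[int]):
    """Return (to_populate, to_delete) for one data source."""
    to_populate = saved_search_ids - indexed_ids  # new documents in the saved search
    to_delete = indexed_ids - saved_search_ids    # documents no longer returned
    return to_populate, to_delete

def stages_for_build(training_set_changed: bool) -> list[str]:
    """Full rebuild if the Training Set changed, otherwise start at a later step."""
    if training_set_changed:
        return list(BUILD_STAGES)
    start = BUILD_STAGES.index("Updating searchable items")
    return BUILD_STAGES[start:]

train_add, train_remove = diff_data_source({1, 2, 3}, {1, 2, 3})
search_add, search_remove = diff_data_source({1, 2, 3}, {2, 3, 4, 5})
training_changed = bool(train_add or train_remove)

print(sorted(search_add), sorted(search_remove))  # [4, 5] [1]
print(stages_for_build(training_changed)[0])      # Updating searchable items
```

When only the Searchable Set changes, the shorter stage list is what makes an incremental update faster than a full rebuild.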

Linking repeated content filters to a conceptual index

Use the Repeated Content Filter (Repeated Content) section on an Analytics index layout to link repeated content filters when the Analytics index is not open in Edit mode. These linked filters will only apply to the currently open Analytics conceptual index; they will not be applied to Structured Analytics Sets. This only applies to conceptual indexes, not classification indexes.

To link one or more existing repeated content filters to an Analytics index, perform the following steps:

  1. Click Link.
  2. Find and select the repeated content filter(s) to link to the index. If you tagged the Ready to index field with Yes on filters you want to apply, filter for Ready to index = Yes to easily find your predetermined filters.
  3. Click Add followed by Set.

See Repeated content filters tab for more information on repeated content and regular expression filters.