Configuring the Analytics server

Use the following guidelines to configure your Analytics server(s) for optimum performance with Relativity.

Memory requirements

Analytics indexing

Server memory is the most important component in building an Analytics index. The more memory your server has, the larger the data sets that can be indexed without significant memory paging. Insufficient memory will slow down index build performance.

The following factors affect RAM consumption during indexing:

  • Number of documents in the training set
  • Number of documents in the searchable set
  • Number of unique words across all the documents in the data set being indexed
  • Mean document size (as measured in unique words)

Use the following equation to estimate how much free RAM is needed to complete an index build:

(Number of Training Documents) * 6000 = Amount of RAM needed in bytes

An easy way to remember this equation is that every 1 million training documents in the index require about 6 GB of free RAM. The equation is based on the average document set in Relativity. If your data set has more unique terms than an average data set, the build requires more RAM. We recommend accounting for slightly more RAM than the equation estimates.
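The estimate above can be sketched as a small Python helper. This is only an illustration of the arithmetic: the function name and the `headroom` parameter are hypothetical, and the 10% margin shown in the usage line is an assumed stand-in for "slightly more RAM", not a documented value.

```python
# Bytes of free RAM per training document, from the equation above.
BYTES_PER_TRAINING_DOC = 6000


def index_build_ram_bytes(training_docs, headroom=0.0):
    """Estimate the free RAM (in bytes) needed to build an Analytics index.

    headroom is a hypothetical safety margin (0.10 means 10% extra).
    """
    if training_docs < 0 or headroom < 0:
        raise ValueError("training_docs and headroom must be non-negative")
    return round(training_docs * BYTES_PER_TRAINING_DOC * (1 + headroom))


# 1 million training documents, expressed in decimal GB (1 GB = 10**9 bytes).
print(index_build_ram_bytes(1_000_000) / 1e9)  # 6.0
```

Calling `index_build_ram_bytes(1_000_000, headroom=0.10)` would add the assumed 10% margin on top of the 6 GB baseline.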

For information on performance baselines for Analytics indexing and other features, see Performance baselines and recommendations.

Structured analytics

Structured analytics operations can require substantial resources on the Analytics server. The structured analytics features are run by the Java process, as well as PostgreSQL. One of the most important factors in a successful structured analytics operation is ensuring that Java has access to a sufficient amount of RAM. Use the following equation to estimate how much RAM is required for a given structured analytics set:

(Number of Documents) * 6000 = Amount of Java heap (JVM) needed in bytes

An easy way to remember this equation is that every 1 million documents in the set require about 6 GB of RAM for the Java process. If your data set consists of very long documents, it may require more JVM heap; if it consists of very short documents, it may require less. If Java does not have sufficient memory to complete a structured analytics operation, you will not often receive an out-of-memory (OOM) error. More often, Java will heap dump and garbage collect endlessly without ever completing the operation. Using this equation as a starting point helps avoid these problems. See the section Java heap size (JVM) for information on how to configure JVM.
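Read in the other direction, the same equation gives a rough idea of how many documents a given Java heap can handle. The sketch below is an illustration only; the function name is hypothetical, and it assumes decimal gigabytes (1 GB = 10^9 bytes) to match the "1 million documents per 6 GB" rule of thumb.

```python
# Bytes of Java heap per document, from the equation above.
BYTES_PER_DOC = 6000


def max_docs_for_heap(heap_gb):
    """Estimate how many documents a structured analytics set can hold
    for a given Java heap size in decimal GB (an assumption of this sketch)."""
    if heap_gb < 0:
        raise ValueError("heap_gb must be non-negative")
    return int(heap_gb * 1_000_000_000) // BYTES_PER_DOC


print(max_docs_for_heap(24))  # 4000000
print(max_docs_for_heap(48))  # 8000000
```

Treat the result as a starting estimate; very long documents reduce the number a given heap can handle.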

For information on performance baselines for structured analytics, see Performance baselines and recommendations.

Enabled Analytics indexes

An Analytics index is stored on disk and is loaded into memory when the index has queries enabled (the Enable Queries button on the index). An index with queries enabled may be used for all Analytics functionality, such as clustering and categorization, as well as querying. When you enable queries on an Analytics index, Relativity loads the vectors associated with all searchable documents and words in the conceptual space into RAM in an LSIApp.exe process. For indexes with millions of documents and words, this RAM requirement may be thousands of MB. The number of words per document can range widely, from about 0.80 to 10, depending on the type of data in the index. These ranges indicate the amount of RAM needed for an index to be enabled:

  • # of searchable documents * 5,000 = High end of # of bytes required
  • # of searchable documents * 400 = Low end of # of bytes required
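As a quick illustration of the two bounds above, the following Python sketch returns the low and high estimates in decimal megabytes. The function name is hypothetical and the unit conversion (1 MB = 10^6 bytes) is an assumption of this sketch.

```python
# Per-document byte multipliers from the ranges above.
LOW_BYTES_PER_DOC = 400
HIGH_BYTES_PER_DOC = 5000


def enabled_index_ram_mb(searchable_docs):
    """Return (low, high) RAM estimates in decimal MB for an index
    with queries enabled."""
    if searchable_docs < 0:
        raise ValueError("searchable_docs must be non-negative")
    low = searchable_docs * LOW_BYTES_PER_DOC / 1_000_000
    high = searchable_docs * HIGH_BYTES_PER_DOC / 1_000_000
    return low, high


# An index with 2 million searchable documents.
print(enabled_index_ram_mb(2_000_000))  # (800.0, 10000.0)
```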

Click Disable Queries on any Analytics indexes that aren’t in use to free up RAM on the Analytics server. The MaxAnalyticsIndexIdleDays instance setting helps with this issue. This value is the number of days that an Analytics index can remain inactive before the Case Manager agent disables queries on it. Inactivity on the Analytics index is defined as not having any categorization, clustering, or any type of searches using the index. This feature ensures that indexes that are not being used are not using up RAM on the Analytics server. If the index needs to be used again, simply navigate to the index in Relativity and click Enable Queries. It will be available for searching again within seconds.

Java heap size (JVM)

Depending on the amount of RAM on your Analytics server, as well as its role, you will need to modify the Java Heap Size setting. This setting controls how much RAM the Java process may consume. Java is used for index populations, as well as structured analytics operations, clustering, and categorization.

Here are some general guidelines:

  • If the Analytics server is used for both indexing and structured analytics, set this value to about 50% of the server's total RAM. You need to leave RAM available for the LSIApp.exe process, which is used for building conceptual indexes.
  • If the Analytics server is used solely for structured analytics, set this value to about 75% of the server's total RAM. Be sure to leave about a quarter of the RAM available for the underlying database processes.
  • If the Analytics server is used solely for indexing, set this value to about one-third of the server's total RAM. You need to leave RAM available for the LSIApp.exe process, which is used for building conceptual indexes.

Due to a limitation in the Java application, do not configure JVM with a value between 32 GB and 47 GB (inclusive). When JVM is set between 32 GB and 47 GB, the application only has access to 20-22 GB of heap space. For example, if the server has 64 GB RAM, set JVM to either 31 GB or 48 GB so that the Java application can access all of the RAM allocated to it.
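The guidelines above can be combined into a short Python sketch. It is an illustration, not a sizing tool: the function name and role labels are hypothetical, and rounding the target to a whole number of GB is an assumption. When the target falls in the range to avoid, the sketch returns both neighboring values and leaves the choice to you, since the right one depends on how much RAM must remain for LSIApp.exe and the database processes.

```python
# Approximate share of total RAM to give Java, per server role.
ROLE_FRACTION = {"both": 0.50, "structured": 0.75, "indexing": 1 / 3}

# Heap sizes in this inclusive range (GB) should not be configured.
AVOID_LOW_GB, AVOID_HIGH_GB = 32, 47


def heap_options_gb(total_ram_gb, role):
    """Return candidate Java heap sizes (GB) for a server role.

    Yields one value normally, or the two nearest safe values when the
    target lands inside the 32-47 GB range.
    """
    if role not in ROLE_FRACTION:
        raise ValueError("role must be one of: " + ", ".join(ROLE_FRACTION))
    target = round(total_ram_gb * ROLE_FRACTION[role])
    if AVOID_LOW_GB <= target <= AVOID_HIGH_GB:
        return [AVOID_LOW_GB - 1, AVOID_HIGH_GB + 1]
    return [target]


print(heap_options_gb(64, "both"))        # [31, 48]
print(heap_options_gb(32, "structured"))  # [24]
```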

To modify the Java Heap Size setting, perform the following steps:

  1. Navigate to <CAAT install drive>\CAAT\bin.
  2. Edit the env.cmd file.
  3. Locate the line similar to the following: set HEAP_OPTS=-Xms4096m -Xmx16383m. The option starting with -Xmx sets the maximum amount of RAM available to Java (in this example, in megabytes).
  4. Modify this value as needed. Both megabyte and gigabyte values are supported for these settings. The change won't take effect until you stop and start the Content Analyst CAAT Windows Service.

Note: Never set the Java maximum (-Xmx) to be less than the Java minimum (-Xms). Don't modify the Java minimum setting unless instructed by us.
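For illustration, the edit to the HEAP_OPTS line can be sketched in Python as a plain string rewrite. This is not a Relativity or CAAT tool; the function names are hypothetical, and the sketch only handles the m and g suffixes shown or implied above. It also refuses a maximum that is lower than the minimum already on the line, mirroring the note above.

```python
import re

_SIZE = re.compile(r"(\d+)([mMgG])")


def _to_mb(size):
    """Convert a JVM size such as '4096m' or '31g' to megabytes."""
    match = _SIZE.fullmatch(size)
    if match is None:
        raise ValueError("unsupported heap size: %r" % size)
    value, unit = int(match.group(1)), match.group(2).lower()
    return value * 1024 if unit == "g" else value


def set_max_heap(line, new_xmx):
    """Return the HEAP_OPTS line with its -Xmx value replaced."""
    new_mb = _to_mb(new_xmx)
    xms = re.search(r"-Xms(\d+[mMgG])", line)
    if xms and new_mb < _to_mb(xms.group(1)):
        raise ValueError("-Xmx must not be lower than -Xms")
    updated, count = re.subn(r"-Xmx\d+[mMgG]", "-Xmx" + new_xmx, line)
    if count != 1:
        raise ValueError("expected exactly one -Xmx option on the line")
    return updated


print(set_max_heap("set HEAP_OPTS=-Xms4096m -Xmx16383m", "31g"))
# set HEAP_OPTS=-Xms4096m -Xmx31g
```

The -Xms value is left untouched, in line with the note above.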

Page file size

We recommend the following settings regarding the page file size for the Analytics server:

  • Set the size of the paging file to 4095 MB or higher. This is because the OS array generally only has enough room for what’s required and is not able to support a page file size of 1.5 times the amount of physical RAM.
  • Set the initial and maximum size settings for the paging file to the same value so that no processing resources are lost to dynamic resizing of the paging file.
  • Ensure that the paging area on a disk is one single, contiguous area, which improves disk seek time.
  • For servers with a large amount of RAM installed, set the page file to a size no greater than 50 GB. Microsoft has no specific recommendations about performance gains for page files larger than 50 GB.
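The size-related recommendations above can be expressed as a simple check. The Python sketch below is illustrative only: the function name is hypothetical, it treats 50 GB as 50 * 1024 MB (an assumption), and it does not verify whether the paging area is contiguous on disk.

```python
MIN_PAGE_FILE_MB = 4095
MAX_PAGE_FILE_MB = 50 * 1024  # 50 GB, assuming 1 GB = 1024 MB


def page_file_issues(initial_mb, maximum_mb):
    """Return a list of recommendations that the given settings violate."""
    issues = []
    if initial_mb != maximum_mb:
        issues.append("initial and maximum sizes differ")
    if maximum_mb < MIN_PAGE_FILE_MB:
        issues.append("smaller than 4095 MB")
    if maximum_mb > MAX_PAGE_FILE_MB:
        issues.append("larger than 50 GB")
    return issues


print(page_file_issues(16384, 16384))  # []
print(page_file_issues(2048, 4096))    # ['initial and maximum sizes differ']
```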

Index directory requirements

The index directory stores both the Analytics indexes and the structured analytics sets. Using default settings, the average amount of disk space for the Analytics index or structured analytics set is equal to about 20% of the size of the MDF file of the workspace database. This metric indicates an average amount of disk space usage, but actual indexes may vary considerably in size. The amount of space required depends on the size of the extracted text being indexed, as well as the number of documents, unique words, and settings used to build the index. An Analytics server may not have multiple index locations; it may only reference one disk location for the server’s Analytics indexes and structured analytics sets.

Due to the size requirements, we recommend you don't store indexes on the local drive where the CAAT directory is installed. Upon installation or upgrade, the Relativity Analytics Server installer will prompt for the index directory location. If you would like to move the index directory location after upgrade, see Moving Analytics indexes and Structured Analytics sets.

CAAT® uses the database software PostgreSQL which requires guaranteed synchronous writes to the index directory. The connection from the analytics server to the index directory should be one that guarantees synchronous writes, such as Fibre Channel or iSCSI, rather than NFS or SMB/CIFS. We recommend storing the indexes and structured analytics sets on locally mounted storage rather than a remotely mounted file system.

CPU requirements

Analytics indexing

The Analytics index creation process also depends on CPU and I/O resources at various stages of the build. Ensuring that your server has multiple processors and fast I/O increases efficiency during the population build process. Adding more CPU cores to an Analytics server helps index populations run as fast as possible, especially for large indexes. When you add CPU cores, you may also increase the number of Maximum Connectors used per index on the server. To modify this value, see Editing an Analytics server.

Structured analytics

The structured analytics features are run by the Java process as well as PostgreSQL. In order for these processes to operate most efficiently, allocate sufficient CPU cores to the Analytics server. For optimal Textual Near Duplicate Identification performance, the Analytics server needs at least 8 CPU cores. Textual Near Duplicate Identification performance improves as additional cores are added.

The following charts illustrate the “Running Analytics Operations” step of a Textual Near Duplicate Identification structured analytics set with the default Minimum Similarity Percentage of 90 in CAAT® 3.17:

Data set 1: Enron data – 1M documents, 6.2 GB total text

Textual Near Duplicate analysis chart - Enron data

Data set 2: Wikipedia data - 1M records, 2.4 GB total text

Textual Near Duplicate analysis chart - Wikipedia data

Data set 3: Emails - 768K documents, 11.7 GB total text

Textual Near Duplicate analysis chart - Email data

The performance of a given data set varies based on certain factors outside of the number of documents or the total text size. The following types of data sets require more time to analyze:

  • Data sets with a very high number of similarly-sized documents
  • Data sets with a very low number of textually similar documents

Additionally, if you lower the Minimum Similarity Percentage from the default of 90, more time is required to analyze the data set.

Scaling

It is often beneficial to add multiple Analytics servers to the Relativity environment so that jobs can run concurrently without adversely affecting each other. Multiple servers also allow each server to be dedicated to a feature set (structured or indexing), which makes RAM management easier. The following tables show some example environment configurations as well as the typical upper limit that will be encountered. The upper limit assumes no other concurrent activity on the server. It is intended to serve as an estimate and is not a guarantee. Data sets vary widely, and some may require more server resources than usual.

Tier 1 example

For an entry level environment with 100 or more users, usually one Analytics server is enough.

Here is an example environment configuration:

| Server Name | Role | RAM | JVM | CPU Cores | Upper limit - Structured | Upper limit - Indexing |
|---|---|---|---|---|---|---|
| ANA-01 | Both Structured and Indexing | 32 GB | 16 GB | 8 | 3 MM documents | 3 MM documents |

Tier 2 example

For a mid-level environment with over 300 users, you may need to add an additional server to allow for concurrent jobs. Splitting the server roles also allows the servers to work on more data due to the allocation of Java Heap.

Here is an example environment configuration:

| Server Name | Role | RAM | JVM | CPU Cores | Upper limit - Structured | Upper limit - Conceptual Indexing |
|---|---|---|---|---|---|---|
| ANA-01 | Structured | 32 GB | 24 GB | 8 | 4 MM documents | N/A |
| ANA-02 | Indexing | 32 GB | 10 GB | 4 | N/A | 4 MM documents |

Tier 3 examples

For a large-scale environment, you will likely need to both scale up each server and add servers. Splitting the server roles allows the servers to work on more data due to the allocation of Java Heap. Adding more Analytics servers lets more jobs run concurrently, whereas adding a large amount of RAM to one server allows a very large job to complete successfully. The balance needs to be determined based on the client needs.

The following are example Tier 3 environment configurations:

Example 1

| Server Name | Role | RAM | JVM | CPU Cores | Upper limit - Structured | Upper limit - Conceptual Indexing |
|---|---|---|---|---|---|---|
| ANA-01 | Structured | 64 GB | 48 GB | 16 | 8 MM documents | N/A |
| ANA-02 | Indexing | 64 GB | 22 GB | 12 | N/A | 7 MM documents |

Example 2

| Server Name | Role | RAM | JVM | CPU Cores | Upper limit - Structured | Upper limit - Conceptual Indexing |
|---|---|---|---|---|---|---|
| ANA-01 | Structured | 128 GB | 96 GB | 32 | 16 MM documents | N/A |
| ANA-02 | Indexing | 128 GB | 48 GB | 24 | N/A | 14 MM documents |