Publishing files

Publishing files to a workspace is the step that loads processed data into the environment so reviewers can access the files. At any point after file discovery is complete, you can publish the discovered files to a workspace. During publish, Relativity:

  • Applies all the settings you specified on the profile to the documents you bring into the workspace.
  • Determines which is the master document and master custodian and which are the duplicates.
  • Populates the All Custodians, Other Sources, and other fields with data.
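
The last two points can be pictured with a small sketch. It assumes duplicates are identified by a content hash and that the first file encountered becomes the master; both are simplifications chosen for the example, and the names used (`content_hash`, `MasterRecord`, `deduplicate`) are hypothetical rather than part of Relativity.

```python
from dataclasses import dataclass, field


@dataclass
class MasterRecord:
    master_path: str        # file chosen as the master document
    master_custodian: str   # custodian of the master document
    all_custodians: list = field(default_factory=list)  # every custodian holding a copy
    duplicates: list = field(default_factory=list)      # paths of suppressed duplicates


def deduplicate(files):
    """Group files by hash; the first file seen for a hash is the master (assumed rule)."""
    masters = {}
    for f in files:
        key = f["content_hash"]
        if key not in masters:
            masters[key] = MasterRecord(f["path"], f["custodian"], [f["custodian"]])
            continue
        record = masters[key]
        record.duplicates.append(f["path"])
        if f["custodian"] not in record.all_custodians:
            record.all_custodians.append(f["custodian"])
    return list(masters.values())


files = [
    {"path": "a/report.docx", "custodian": "Alice", "content_hash": "h1"},
    {"path": "b/report.docx", "custodian": "Bob", "content_hash": "h1"},
    {"path": "b/notes.txt", "custodian": "Bob", "content_hash": "h2"},
]
records = deduplicate(files)
```

Here the first record would carry Alice as master custodian and list both Alice and Bob as custodians, which is roughly the role the All Custodians field plays after publish.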

Use the following guidelines when publishing files to a workspace:

  • If you intend to use both the RDC and Relativity Processing to bring data into the same workspace, note that if you select Custodial or Global as the deduplication method on your processing profile(s), the processing engine won't deduplicate against files brought in through the RDC. This is because the processing engine doesn’t recognize RDC-imported data. In addition, you could see general performance degradation in these cases, as well as possible Bates numbering collisions.
  • Beginning in Relativity 9.5.253.62, with the release of the distributed publish enhancement, publish includes the three distinct steps of deduplication document ID creation, master document publish, and overlaying deduplication metadata. Because of this, it’s possible for multiple processing sets to be publishing at the same time in the same workspace.

The following graphic depicts how publish fits into the basic workflow you'd use to reduce the file size of a data set through processing. This workflow assumes that you’re applying some method of de-NIST and deduplication.


The following is a typical workflow that incorporates publish:

  1. Create a processing set or select an existing set.
  2. Add data sources to the processing set.
  3. Inventory the files in that processing set to extract top-level metadata.
  4. Apply filters to the inventoried data.
  5. Run discovery on the refined data.
  6. Publish the discovered files to the workspace.


Note: The LoadImportedFullTextFromServer instance setting controls whether the Extracted Text field data is loaded directly from its file path during the publish phase of processing, rather than as part of a client-generated bulk load file.

Running file publish

To publish files, click Publish Files. You only need to manually start publish if you disabled the Auto-publish set field on the profile used by this processing set.

Note: When processing documents without an actual date, Relativity provides a null value for the following fields: Created Date, Created Date/Time, Created Time, Last Accessed Date, Last Accessed Date/Time, Last Accessed Time, Last Modified Date, Last Modified Date/Time, Last Modified Time, and Primary Date/Time.


When you click Publish Files, you're presented with a confirmation message containing information about the job you're about to submit. If you haven't mapped any fields in the workspace, the message will reflect this. Click Publish to proceed or Cancel to return to the processing set layout.

Consider the following when publishing files:

  • During publish, Relativity assigns control numbers to documents from the top of the directory (source location) down, unless documents are duplicates, in which case that order is reversed.
  • The publish process includes the three distinct steps of deduplication document ID creation, master document publish, and overlaying deduplication metadata; as a result, it’s possible for multiple processing sets to be publishing at the same time in the same workspace.
  • After data is published, we recommend that you not change the Control Number (Document Identifier) value, as issues can arise in future publish jobs if a data overlay occurs on the modified files.
  • If you have multiple data sources attached to a single processing set, Relativity starts the second source as soon as the first source reaches the DeDuplication and Document ID generation stage. Previously, Relativity waited until the entire source was published before starting the next one.
  • Never disable a worker while it's completing a publish job.
  • The Publish option is available even after publish is complete. This means you can republish data sources that have been previously published with or without errors.
  • If you've arranged for auto-publish on the processing profile, then starting discovery also starts publish as soon as discovery is complete, even if errors occur during discovery. This means that the Publish button is never enabled.
  • Once you publish files, you are unable to delete or edit the data sources containing those files. You are also unable to change the deduplication method you originally applied to the set.
  • If you delete a document from Relativity after it's been published, that deletion has no effect on the already specified processing and deduplication settings. This means that the processing engine still regards the deleted file as published, and that file isn’t republished if Publish is run again. In addition, global deduplication still removes any duplicates of that deleted file.
  • If you arrange to copy source files to the Relativity file share, Relativity no longer needs to access them once you publish them. In this case, you aren't required to keep your source files in the location from which they were processed after you've published them.
  • If the DeNIST field is set to Yes on the profile but the Invariant database table is empty for the DeNIST field, you can't publish files.
  • Publish is a distributed process that is broken up into separate jobs. This improves stability by removing the single point of failure of one long-running process and allows work to be distributed across multiple workers. These changes enable publish to operate more consistently with the other processing job types in the worker manager server, where batches of data are processed for a specific amount of time before completing each transactional job and moving on. Note the following upgrade-relevant details regarding distributed publish:
    • The following instance settings have been added to facilitate the work of distributed publish. Due to the change in publish behavior caused by these new instance settings, we recommend contacting Support for guidance on what values to specify for these settings before performing an upgrade.
      • ProcessingMaxPublishSubJobCountPerRelativitySQLServer - the maximum number of publish jobs per Relativity SQL server that may be worked on in parallel.
        • You can't allocate more jobs per workspace than what is allowed per SQL server. This means that if this value is set to be lower than the value for the ProcessingMaxPublishSubJobCountPerWorkspace instance setting, then Relativity only permits the maximum of jobs per SQL server.
        • The default value is 7. Leaving this setting at its default value will result in increased throughput; however, we recommend contacting Support before you upgrade for guidance on what value will be most beneficial to you based on your environment setup.
        • This updates on a 30-second interval.
        • If you change the default value, note that setting it too high could result in web server, SQL server, or BCP/file server issues. In addition, other jobs in Relativity that use worker threads, such as discovery or imaging, may see a performance decrease. If you set it too low, publish speeds may be lower than expected.
      • ProcessingMaxPublishSubJobCountPerWorkspace - the maximum number of publish jobs per workspace that may be worked on in parallel.
        • You can't allocate more jobs per workspace than what is allowed per SQL server. This means that if this value is set to be higher than the value for the ProcessingMaxPublishSubJobCountPerRelativitySQLServer instance setting, then Relativity only permits the maximum of jobs per SQL server. For example, if you have a workspace limit of 4 and a server limit of 8 and all of your workspaces are on the same SQL server, you will have at most 8 publish sub jobs running concurrently.
        • The default value is 5. Leaving this setting at its default value will result in increased throughput; however, we recommend contacting Support before you upgrade for guidance on what value will be most beneficial to you based on your environment setup.
        • Note: The default value of this setting was changed from 3 to 5 in Relativity 9.6.202.10.
        • This updates on a 30-second interval.
        • If you change the default value, note that setting it too high could result in web server, SQL server, or BCP/file server issues. In addition, other jobs in Relativity that use worker threads, such as discovery or imaging, may see a performance decrease. If you set it too low, publish speeds may be lower than expected.

    The following table provides the recommended values for each instance setting per environment setup:

    Environment setup | ProcessingMaxPublishSubJobCountPerWorkspace | ProcessingMaxPublishSubJobCountPerRelativitySQLServer
    Tier 1 | 3 | 7
    Tier 2 | 6 | 12
    RelativityOne baseline | 3 | 7
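
The interaction between the two limits can be sketched as follows. This is a minimal model of the rule described above (each workspace is capped by the workspace limit, and the total is capped by the SQL server limit), not Relativity's actual scheduler; the function name and the first-come ordering of workspaces are assumptions made for the example.

```python
def allocate_publish_sub_jobs(requested, workspace_limit, server_limit):
    """Return how many publish sub jobs each workspace may run at once.

    requested maps a workspace name to the number of sub jobs waiting to run.
    All workspaces are assumed to live on the same SQL server and are served
    in insertion order (an assumption, not documented behavior).
    """
    remaining = server_limit
    allocation = {}
    for workspace, wanted in requested.items():
        granted = min(wanted, workspace_limit, remaining)
        allocation[workspace] = granted
        remaining -= granted
    return allocation


# Workspace limit of 4, server limit of 8, three busy workspaces on one server.
allocation = allocate_publish_sub_jobs({"A": 10, "B": 10, "C": 10}, 4, 8)
total_running = sum(allocation.values())  # 8, the server limit
```

With a workspace limit higher than the server limit, a single workspace in this model is still held to the server limit, which matches the behavior described for both settings.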

Note: Once you publish data into Relativity, you have the option of exporting it through the Relativity Desktop Client.

When you start publish, the Publish Files button changes to Cancel. You can use this to cancel the processing set. For more information, see Canceling publishing.

Publish process

The following graphic and corresponding steps depict what happens behind the scenes when you start publish. This information is meant for reference purposes only.

  1. You click Publish Files on the processing set console. If you’ve arranged for auto-publish after discovery, publish will begin automatically and you aren’t required to start it manually.
  2. A console event handler checks to make sure that the set is valid and ready to proceed.
  3. The event handler inserts all data sources on the processing set into the processing set queue.
  4. The data sources wait in the queue to be picked up by an agent, during which time you can change their priority.
  5. The processing set manager agent picks up each data source based on its order, all password bank entries are synced, and the agent submits each data source as an individual publish job to the processing engine. The agent then provides updates on the status of each job to Relativity, which then displays this information on the processing set layout.
  6. The processing engine publishes the files to the workspace. Relativity updates the reports to include all applicable publish data. You can generate these reports to see how many and what kind of files you published to your workspace.
     Note: Beginning in Relativity 9.5.253.62, publish is no longer a single long-running process and instead is a distributed process that is broken up into separate jobs, which leads to more stability by removing this single point of failure and improves performance by allowing the distribution of work across multiple workers. Thus, publish is consistent with the other types of processing jobs performed by the worker manager server, in that it operates on batches of data for a specific amount of time before completing each transactional job and moving on.

  7. Any errors that occurred during publish are logged in the errors tabs. You can view these errors and attempt to retry them. See Processing error workflow for details.
  8. You set up a review project on the documents you published to your workspace, during which you can search across them and eventually produce them.
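
The queueing behavior in steps 3 through 5 can be illustrated with a short sketch. It assumes that a lower priority number is picked up first and that ties are broken by the data source's order on the set; the class name, the default priority of 100, and the function name are hypothetical and used only for illustration.

```python
from dataclasses import dataclass


@dataclass
class QueuedSource:
    name: str
    order: int           # position of the data source on the processing set
    priority: int = 100  # lower number is picked up first (assumed convention)


def pickup_sequence(queue):
    """Return data source names in the order an agent would submit them."""
    ranked = sorted(queue, key=lambda source: (source.priority, source.order))
    return [source.name for source in ranked]


queue = [
    QueuedSource("source-b", order=2),
    QueuedSource("source-a", order=1),
    QueuedSource("source-c", order=3),
]
default_sequence = pickup_sequence(queue)

# Raising the priority of one source while it waits moves it to the front.
queue[2].priority = 10
reprioritized_sequence = pickup_sequence(queue)
```

Each name in the resulting sequence corresponds to one data source being submitted as an individual publish job.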

Monitoring publish status

You can monitor the progress of the publish job through the information provided in the Processing Set Status display on the set layout.


Through this display, you can monitor the following:

  • # of Data Sources - the number of data sources currently in the processing queue.
  • Publish | Documents Published - the number of files across all data sources submitted that have been published to the workspace.
  • Publish | Unpublished Files - the number of files across all data sources submitted that have yet to be published to the workspace.
  • Errors - the number of errors that have occurred across all data sources submitted, which fall into the following categories:
    • Unresolvable - errors that you can't retry.
    • Available to Retry - errors that are available for retry.
    • In Queue - errors that you have submitted for retry and are currently in the processing queue.

See Processing error workflow for details.

Once publish is complete, the status section displays a blue check mark and you have the option of republishing your files, if need be. For details, see Republishing files.

Canceling publishing

If the need arises, you can cancel your publish job before it completes.

To cancel publish, click Cancel.


Consider the following about canceling publish:

  • You can't cancel a republish job. The cancel option is disabled during republish.
  • Once the agent picks up the cancel publish job, no more errors are created for the data sources.
  • If you click Cancel Publishing while the status is still Waiting, you can re-submit the publish job.
  • If you click Cancel Publishing after the job has already been sent to the processing engine, then the set is canceled, meaning all options are disabled and it is unusable. Deduplication isn’t run against documents in canceled processing sets.
  • Errors that result from a job that is canceled are given a canceled status and can't be retried.
  • Once the agent picks up the cancel publish job, you can't delete or edit those data sources.

Once you cancel publish, the status section is updated to display the canceled set.

When you publish multiple sets with global deduplication, dependencies are put in place across the sets to ensure correct deduplication results. Because of this, cancel behavior for publish has been adjusted in the following ways:

  • If you need to cancel three different processing sets that are all set to global or custodial deduplication, you must do so in the reverse order in which you started those publish jobs; in other words, if you started them in 1-2-3 order, you must cancel them in 3-2-1 order.
  • When Global deduplication is set, cancel is available on all processing sets in which the DeDuplication and Document ID generation phase has not yet completed. Once the DeDuplication and Document ID generation phase is complete for all data sources on the set and there are other processing sets in the workspace that are also set to be deduped, the cancel button is disabled on the processing set.
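
The reverse-order rule behaves like a stack: the most recently started publish job is the only one that can be canceled next. The sketch below models that rule only; the class and method names are hypothetical, and it does not reflect how Relativity tracks these dependencies internally.

```python
class PublishCancelTracker:
    """Tracks started publish jobs and enforces last-started, first-canceled."""

    def __init__(self):
        self._started = []

    @property
    def active(self):
        return list(self._started)

    def start(self, set_name):
        self._started.append(set_name)

    def cancel(self, set_name):
        # Only the most recently started set may be canceled next.
        if not self._started or self._started[-1] != set_name:
            raise ValueError(f"Cancel {self._started[-1:]} before {set_name!r}")
        self._started.pop()


tracker = PublishCancelTracker()
for name in ("Set 1", "Set 2", "Set 3"):
    tracker.start(name)

# Canceling in 3-2-1 order succeeds; any other order raises ValueError.
for name in ("Set 3", "Set 2", "Set 1"):
    tracker.cancel(name)
```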

Republishing files

You can republish a processing set any time after the previous publish job is complete and the Publish Files option is enabled again. Republishing is required after retrying errors if you want to see the previously errored documents in your workspace.

To republish, click Publish Files. The same workflow for publishing files applies to republish with the exception that Relativity doesn't re-copy the settings from the profile to the data sources that you are publishing.

When you click Publish Files again, you're presented with a confirmation message containing information about the job you're about to submit. If you haven't mapped any fields in the workspace, the message will reflect this. Click Publish to proceed or Cancel to return to the processing set layout.

The status section is updated to display the in-progress republish job.


Consider the following when republishing files:

  • All ready-to-retry errors resulting from this publish job are retried when you republish.
  • Deduplication is respected on republish.
  • When you resolve errors and republish the documents that contained those errors, Relativity performs an overlay, meaning that there's only one file for the republished document in the Documents tab.
  • When you republish data, Relativity only updates field mappings for files that previously returned errors.
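
The overlay behavior can be sketched as an update keyed by control number: a republished document replaces the fields of the existing entry instead of creating a second one. The dictionary layout, field names, and function name below are assumptions made for the example, not Relativity's data model.

```python
def republish(workspace_docs, retried_docs):
    """Overlay retried documents onto existing entries, keyed by control number."""
    for doc in retried_docs:
        control_number = doc["control_number"]
        if control_number in workspace_docs:
            # Overlay: update the existing document rather than adding a copy.
            workspace_docs[control_number].update(doc)
        else:
            workspace_docs[control_number] = dict(doc)
    return workspace_docs


# A password-protected file was first published without extracted text.
workspace_docs = {
    "DOC0001": {"control_number": "DOC0001", "file_name": "plan.docx", "extracted_text": None},
}
retried_docs = [
    {"control_number": "DOC0001", "extracted_text": "Quarterly plan"},
]
republish(workspace_docs, retried_docs)
```

After the call there is still a single entry for DOC0001, now carrying the extracted text recovered by the retry.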

Retrying errors after publish

You have the option of retrying errors generated during file discovery. When you discover corrupt or password-protected documents, these error files are still published into a Relativity workspace with their file metadata. This is important to remember if you have Auto-publish enabled. However, for documents with these types of errors, neither the document metadata nor the extracted text is available in the workspace.


For resolvable issues such as password-protected files, you can retry these errors even after you publish the files into a workspace. If you provide a password via the password bank and successfully retry the file, then its document metadata and extracted text are made available in the workspace after the documents are republished.