Publishing files

Publishing files to a workspace is the step that loads processed data into the environment so reviewers can access the files. At any point after file discovery is complete, you can publish the discovered files to a workspace. During publish, Relativity:

  • Applies all the settings you specified on the profile to the documents you bring into the workspace.
  • Determines which is the master document and master custodian and which are the duplicates.
  • Populates the All Custodians, Other Sources, and other fields with data.

Use the following guidelines when publishing files to a workspace:

  • If you intend to use both the RDC and Relativity Processing to bring data into the same workspace, note that if you select Custodial or Global as the deduplication method on your processing profile(s), the processing engine won't deduplicate against files brought in through the RDC. This is because the processing engine doesn’t recognize RDC-imported data. In addition, you could see general performance degradation in these cases, as well as possible Bates numbering collisions.
  • Beginning in July 2017, with the release of the distributed publish enhancement, publish includes the three distinct steps of deduplication document ID creation, master document publish, and overlaying deduplication metadata. Because of this, it’s possible for multiple processing sets to be publishing at the same time in the same workspace.
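
As a rough illustration of the three publish steps and of when the final overlay step has work to do, the following Python sketch models that decision. It is not Relativity code. The first two phase names and both function names are hypothetical labels chosen for this example; only UpdateMastersWithDedupeInformation, the deduplication field names, and the tracking log message come from this page. The handling of a set with no deduplication method is also an assumption.

```python
# Illustrative model only; this is not Relativity's implementation.
DEDUPE_FIELDS = {
    "All Custodians",
    "Deduped Custodians",
    "All Paths/Locations",
    "Deduped Count",
    "Deduped Paths",
}

# The first two names are hypothetical labels for the steps described above.
PUBLISH_PHASES = (
    "DedupeAndDocumentIdCreation",
    "PublishMasterDocuments",
    "UpdateMastersWithDedupeInformation",
)

SKIP_MESSAGE = (
    "Overlaying dedupe information will not be performed on the masters. "
    "The deduplication fields are not mapped."
)


def overlay_needed(dedupe_method, mapped_fields):
    """Return True when the third phase has dedupe metadata to overlay."""
    if dedupe_method not in ("Global", "Custodial"):
        # Assumption: without deduplication there is nothing to overlay.
        return False
    return bool(DEDUPE_FIELDS & set(mapped_fields))


def describe_third_phase(dedupe_method, mapped_fields):
    """Summarize what the third phase would do for a given configuration."""
    if overlay_needed(dedupe_method, mapped_fields):
        return "overlay dedupe metadata on master documents"
    if dedupe_method in ("Global", "Custodial"):
        return SKIP_MESSAGE
    return "no deduplication method selected"
```

For example, describe_third_phase("Global", []) returns the tracking log message, because no deduplication fields are mapped.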

The following graphic depicts how publish fits into the basic workflow you'd use to reduce the file size of a data set through processing. This workflow assumes that you’re applying some method of de-NIST and deduplication.

File size reduction through processing diagram

Note: Your environment has been enabled to dynamically scale your Invariant worker servers dependent on load. Sustained activity is automatically detected by the system, and Relativity will add workers to handle this work. Once the work is done, they will automatically scale back down. This feature is continually being improved to be smarter about when we add workers and how many we add.

The following is a typical workflow that incorporates publish:

  1. Create a processing set or select an existing set.
  2. Add data sources to the processing set.
  3. Inventory the files in that processing set to extract top-level metadata.
  4. Apply filters to the inventoried data.
  5. Run discovery on the refined data.
  6. Publish the discovered files to the workspace.

Running file publish

To publish files, click Publish Files. You only need to manually start publish if you disabled the Auto-publish set field on the profile used by this processing set.

Note: When processing documents without an actual date, Relativity provides a null value for the following fields: Created Date, Created Date/Time, Created Time, Last Accessed Date, Last Accessed Date/Time, Last Accessed Time, Last Modified Date, Last Modified Date/Time, Last Modified Time, and Primary Date/Time. The null value is excluded and not represented in the filtered list.

Publish files button

When you click Publish Files, you're presented with a confirmation message containing information about the job you're about to submit. If you haven't mapped any fields in the workspace, the message will reflect this. Click Publish to proceed or Cancel to return to the processing set layout.

Publish files confirmation message

Consider the following when publishing files:

  • During publish, Relativity assigns control numbers to documents from the top of the directory (source location) down. Duplicates do not receive unique control numbers.
  • The publish process includes the three distinct steps of deduplication document ID creation, master document publish, and overlaying deduplication metadata; as a result, it’s possible for multiple processing sets to be publishing at the same time in the same workspace.
  • After data is published, we recommend that you not change the Control Number (Document Identifier) value, as issues can arise in future publish jobs if a data overlay occurs on the modified files.
  • If you have multiple data sources attached to a single processing set, Relativity starts the second source as soon as the first source reaches the DeDuplication and Document ID generation stage. Previously, Relativity waited until the entire source was published before starting the next one.
  • Never disable a worker while it's completing a publish job.
  • The Publish option is available even after publish is complete. This means you can republish data sources that have been previously published with or without errors.
  • If you've arranged for auto-publish on the processing profile, then starting discovery also means that publish starts automatically once discovery is complete, even if errors occur during discovery. In this case, the Publish button is never enabled.
  • Once you publish files, you are unable to delete or edit the data sources containing those files. You are also unable to change the deduplication method you originally applied to the set.
  • If you delete a document from Relativity after it's been published, that deletion has no effect on the already specified processing and deduplication settings. This means that the processing engine still regards the deleted file as published, and that file isn’t republished if Publish is run again. In addition, global deduplication still removes any duplicates of that deleted file.
  • If you arrange to copy source files to the Relativity file share, Relativity no longer needs to access them once you publish them. In this case, you aren't required to keep your source files in the location from which they were processed after you've published them.
  • If the DeNIST field is set to Yes on the profile but the Invariant database table is empty for the DeNIST field, you can't publish files.
  • Publish is a distributed process that is broken up into separate jobs. This improves stability by removing the single point of failure of one long-running process, and it allows work to be distributed across multiple workers. These changes enable publish to operate more consistently like the other processing job types in the worker manager server, where batches of data are processed for a specific amount of time before completing each transactional job and moving on. Note the following upgrade-relevant details regarding distributed publish:
    • UpdateMastersWithDedupeInformation - the third phase of publish, which finishes before any metadata updates if no deduplication fields are mapped.
      • The deduplication fields are All Custodians, Deduped Custodians, All Paths/Locations, Deduped Count, and Deduped Paths.
      • If no deduplication fields are mapped for a publish job where the deduplication method is either Global or Custodial, then the UpdateMastersWithDedupeInformation job should finish before overlaying or updating any metadata.
      • The tracking log will read "Overlaying dedupe information will not be performed on the masters. The deduplication fields are not mapped."
    • The following instance settings have been added to facilitate the work of distributed publish. Due to the change in publish behavior caused by these new instance settings, we recommend contacting Support for guidance on what values to specify for these settings before performing an upgrade.
      • ProcessingMaxPublishJobCountPerRelativitySQLServer - the maximum number of publish jobs per Relativity SQL server that may be worked on in parallel.
        • The default value is 21. Leaving this setting at its default value will result in increased throughput.
        • This updates on a 30-second interval.
        • If you change the default value, note that setting it too high could result in web server, SQL server, or BCP/file server issues. In addition, other jobs in Relativity that use worker threads, such as discovery or imaging, may see a performance decrease. If you set it too low, publish speeds may be lower than expected.
        • You can't allocate more jobs per workspace than what is allowed per SQL server.
      • ProcessingMaxPublishSubJobCountPerWorkspace - the maximum number of publish jobs per workspace that may be worked on in parallel.
        • The default value is 7. Leaving this setting at its default value will result in increased throughput.
        • This updates on a 30-second interval.
        • If you change the default value, note that setting it too high could result in web server, SQL server, or BCP/file server issues. In addition, other jobs in Relativity that use worker threads, such as discovery or imaging, may see a performance decrease. If you set it too low, publish speeds may be lower than expected.
        • You can't allocate more jobs per workspace than what is allowed per SQL server.

    The following table provides the recommended values for each instance setting per environment setup:

    Environment setup | ProcessingMaxPublishSubJobCountPerWorkspace | ProcessingMaxPublishJobCountPerRelativitySQLServer
    Tier 1 | 3 | 7
    Tier 2 | 6 | 12
    RelativityOne baseline | 3 | 7
Note: Once you publish data into Relativity, you have the option of exporting it through the Relativity Desktop Client.

When you start publish, the Publish Files button changes to Cancel. You can use this to cancel the processing set. For more information, see Canceling publishing.

Publish process

The following graphic and corresponding steps depict what happens behind the scenes when you start publish. This information is meant for reference purposes only.

Publish process diagram

  1. You click Publish Files on the processing set console. If you’ve arranged for auto-publish after discovery, publish will begin automatically and you aren’t required to start it manually.
  2. A console event handler checks to make sure that the set is valid and ready to proceed.
  3. The event handler inserts all data sources on the processing set into the processing set queue.
  4. The data sources wait in the queue to be picked up by an agent, during which time you can change their priority.
  5. The processing set manager agent picks up each data source based on its order, all password bank entries are synced, and the agent submits each data source as an individual publish job to the processing engine. The agent then provides updates on the status of each job to Relativity, which then displays this information on the processing set layout.
  6. The processing engine publishes the files to the workspace. Relativity updates the reports to include all applicable publish data. You can generate these reports to see how many and what kind of files you published to your workspace.
     Note: Beginning in July 2017, publish is no longer a single long-running process. It is a distributed process that is broken up into separate jobs, which improves stability by removing that single point of failure and improves performance by distributing work across multiple workers. Publish is therefore consistent with the other types of processing jobs performed by the worker manager server, in that it operates on batches of data for a specific amount of time before completing each transactional job and moving on.
  7. Any errors that occurred during publish are logged in the errors tabs. You can view these errors and attempt to retry them. See Processing error workflow for details.
  8. You set up a review project on the documents you published to your workspace, during which you can search across them and eventually produce them.
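
The queueing steps above (data sources wait in the queue, their priority can change while they wait, and the agent submits each one as its own publish job) can be pictured with a small model. This is a toy sketch for reference only. The class, the default priority of 100, and the rule that a lower number is picked up first are all assumptions made for the example, not documented Relativity behavior.

```python
import itertools


class ProcessingSetQueue:
    """Toy queue: lowest priority number first, then insertion order."""

    def __init__(self):
        self._waiting = {}              # data source -> (priority, sequence)
        self._seq = itertools.count()   # preserves insertion order for ties

    def add(self, data_source, priority=100):
        self._waiting[data_source] = (priority, next(self._seq))

    def change_priority(self, data_source, priority):
        # Only sources still waiting in the queue can be reprioritized.
        _, seq = self._waiting[data_source]
        self._waiting[data_source] = (priority, seq)

    def pick_up(self):
        """Remove and return the next data source, or None if the queue is empty."""
        if not self._waiting:
            return None
        data_source = min(self._waiting, key=self._waiting.get)
        del self._waiting[data_source]
        return data_source


def run_agent(queue):
    """Submit every waiting data source as an individual publish job."""
    jobs = []
    while (source := queue.pick_up()) is not None:
        jobs.append({"type": "publish", "data_source": source})
    return jobs
```

In this model, raising the priority of one data source while it waits moves it ahead of the others, and each data source still becomes a separate job.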

Monitoring publish status

You can monitor the progress of the publish job through the information provided in the Processing Set Status display on the set layout.

Publish status monitoring items

Through this display, you can monitor the following:

  • # of Data Sources - the number of data sources currently in the processing queue.
  • Publish | Documents Published - the number of files across all data sources submitted that have been published to the workspace.
  • Publish | Unpublished Files - the number of files across all data sources submitted that have yet to be published to the workspace.
  • Errors - the number of errors that have occurred across all data sources submitted, which fall into the following categories:
    • Unresolvable - errors that you can't retry.
    • Available to Retry - errors that are available for retry.
    • In Queue - errors that you have submitted for retry and are currently in the processing queue.

See Processing error workflow for details.

Once publish is complete, the status section displays a blue check mark and you have the option of republishing your files, if need be. For details, see Republishing files.

Canceling publishing

If the need arises, you can cancel your publish job before it completes.

To cancel publish, click Cancel.

Cancel publishing button

Consider the following about canceling publish:

  • You can't cancel a republish job. The cancel option is disabled during republish.
  • Once the agent picks up the cancel publish job, no more errors are created for the data sources.
  • If you click Cancel Publishing while the status is still Waiting, you can re-submit the publish job.
  • If you click Cancel Publishing after the job has already been sent to the processing engine, then the set is canceled, meaning all options are disabled and it is unusable. Deduplication isn’t run against documents in canceled processing sets.
  • Errors that result from a job that is canceled are given a canceled status and can't be retried.
  • Once the agent picks up the cancel publish job, you can't delete or edit those data sources.

Once you cancel publish, the status section is updated to display the canceled set.

When you publish multiple sets with global deduplication, dependencies are put in place across the sets to ensure correct deduplication results. Because of this, cancel behavior for publish has been adjusted in the following ways:

  • If you need to cancel three different processing sets that are all set to global or custodial deduplication, you must do so in the reverse order in which you started those publish jobs; in other words, if you started them in 1-2-3 order, you must cancel them in 3-2-1 order.
  • When Global deduplication is set, cancel is available on all processing sets in which the DeDuplication and Document ID generation phase has not yet completed. Once the DeDuplication and Document ID generation phase is complete for all data sources on the set and there are other processing sets in the workspace that are also set to be deduped, the cancel button is disabled on the processing set.
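
The reverse-order rule for canceling deduplicated sets can be written out as a short sketch. The function names and the set labels are hypothetical; the code only restates the 1-2-3 and 3-2-1 ordering described above.

```python
def cancel_order(started):
    """Given sets in the order their publish jobs started, return the cancel order."""
    return list(reversed(started))


def can_cancel_next(started, already_canceled, candidate):
    """Return True if the candidate is the most recently started set not yet canceled."""
    remaining = [s for s in started if s not in already_canceled]
    return bool(remaining) and remaining[-1] == candidate
```

With sets started as Set 1, Set 2, Set 3, cancel_order returns Set 3, Set 2, Set 1, and can_cancel_next rejects Set 1 until the other two are canceled.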

Republishing files

You can republish a processing set any time after the previous publish job is complete and the Publish Files option is enabled again. Republishing is required after retrying errors if you want to see the previously errored documents in your workspace.

To republish, click Publish Files. The same workflow for publishing files applies to republish with the exception that Relativity doesn't re-copy the settings from the profile to the data sources that you are publishing.

When you click Publish Files again, you're presented with a confirmation message containing information about the job you're about to submit. If you haven't mapped any fields in the workspace, the message will reflect this. Click Publish to proceed or Cancel to return to the processing set layout.

Publish files confirmation message

The status section is updated to display the in-progress republish job.

Republish status display

Consider the following when republishing files:

  • All ready-to-retry errors resulting from this publish job are retried when you republish.
  • Deduplication is respected on republish.
  • When you resolve errors and republish the documents that contained those errors, Relativity performs an overlay, meaning that there's only one file for the republished document in the Documents tab.
  • When you republish data, Relativity only updates field mappings for files that previously returned errors.
  • Once published, a processing set may not be republished if the numbering type (default or level) on the set’s profile has been changed.
  • Once published, the start number(s) on a processing set can't be changed; Relativity blocks any attempt to do so.
  • Changes made to numbering type in a processing profile will not be respected after initial publishing. Data Source information cannot be changed after initial publishing.
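
To picture why an overlay leaves only one entry per republished document, consider the following minimal sketch. It treats the workspace as a dictionary keyed by control number, which is a simplification made for this example. The function name, the record layout, and the sample control number are hypothetical.

```python
def publish(workspace, documents):
    """Add new documents, or overlay fields on documents that already exist.

    workspace maps control number -> dict of field values.
    """
    for doc in documents:
        existing = workspace.get(doc["control_number"])
        if existing is None:
            workspace[doc["control_number"]] = dict(doc)
        else:
            # Overlay: update the existing record instead of adding a second one.
            existing.update(doc)
    return workspace


workspace = {}

# First publish: the document had an error, so no extracted text is available.
publish(workspace, [{"control_number": "DOC0000001", "extracted_text": None}])

# Republish after the error is resolved: same control number, so it overlays.
publish(workspace, [{"control_number": "DOC0000001", "extracted_text": "hello"}])
```

After both calls the workspace still holds a single record for the document, now with its extracted text filled in.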

Retrying errors after publish

You have the option of retrying errors generated during file discovery. When you discover corrupt or password-protected documents, these error files are still published into a Relativity workspace with their file metadata. This is important to remember if you have Auto-publish enabled. However, for documents with these types of errors, neither the document metadata nor the extracted text is available in the workspace.

Note: File metadata is derived from the file’s operating system (e.g., File Extension) whereas document metadata is contained in the document itself (e.g., Is Embedded).

Extracted text unavailable example

For resolvable issues such as password-protected files, you can retry these errors even after you publish the files into a workspace. If you provide a password via the password bank and successfully retry the file, then its document metadata and extracted text are made available in the workspace after the documents are republished.
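
The distinction between file metadata and document metadata helps explain why an errored file can still be published with some fields populated. The following sketch is an analogy only, assuming a local file system: it gathers values the operating system can report without opening the file, which is the kind of information that remains available even when a password blocks access to the contents. The function name is hypothetical, and apart from File Extension the dictionary keys are illustrative labels rather than confirmed Relativity field names.

```python
from pathlib import Path


def file_metadata(path):
    """Collect values the file system reports without reading the file's contents."""
    p = Path(path)
    info = p.stat()
    return {
        "File Name": p.name,
        "File Extension": p.suffix.lstrip("."),
        "File Size": info.st_size,
    }
```

Document metadata and extracted text, by contrast, require parsing the file itself, which is why they only appear after the error is retried successfully and the documents are republished.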