Discovering files

Discovery is the phase of processing in which the processing engine retrieves deeper levels of metadata not accessible during Inventory and prepares files for publishing to a workspace.

The following graphic depicts how discovery fits into the basic workflow you'd use to reduce the size of a data set through processing. This workflow assumes that you’re applying some method of de-NIST and deduplication.


The following is a typical workflow that incorporates discovery:

  1. Create a processing set or select an existing set.
  2. Add data sources to the processing set.
  3. Inventory the files in that processing set to extract top-level metadata.
  4. Apply filters to the inventoried data.
  5. Run discovery on the refined data.
  6. Publish the discovered files to the workspace.


Running file discovery

To start discovery, click Discover Files on the processing set console. You can click this whether or not you've inventoried or filtered your files.

Note: When processing documents without an actual date, Relativity provides a null value for the following fields: Created Date, Created Date/Time, Created Time, Last Accessed Date, Last Accessed Date/Time, Last Accessed Time, Last Modified Date, Last Modified Date/Time, Last Modified Time, and Primary Date/Time.

A confirmation message pops up reminding you of the settings you're about to use to discover the files. Click Discover to proceed with discovery or Cancel to return to the processing set layout.

If you enabled auto-publish, the confirmation message provides a Discover & Publish option. Click this to proceed with discovery and publishing, or click Cancel to return to the processing set layout.

Note: The default priority for all discovery jobs is determined by the current value of the ProcessingDiscoverJobPriorityDefault entry in the Instance setting table.

Consider the following when discovering files:

  • Relativity doesn't re-extract text for a re-discovered file unless an extraction error occurred. If you discover the same file twice and change any settings on the profile, or select a different profile, between the two discovery jobs, the text is still not re-extracted, because processing always refers to the original/master document and the original text stored in the database.
  • If you've arranged for auto-publish on the processing set's profile, the publish process begins when discovery finishes, even if errors occur during discovery. This means that the Publish button is not enabled for the set until after the job is finished. You'll also see a status display for both discover and publish on the set layout.
  • If your discovery job becomes stuck for an inordinate length of time, don't disable the worker associated with that processing job, as that worker may also be performing other processing jobs in the environment.
  • When discovering file types, Relativity refers to the file header information to detect the file type.
  • You can’t change the settings on any processing job at any point after file discovery begins. This means that once you click Discover, you can’t go back and edit the settings of the processing set and re-click Discover Files. You would need to create a new processing set with the desired settings.
  • You can't start discovery while inventory is running for that processing set.
  • When you start discovery or retry discovery for a processing job, the list of passwords specified in the password bank accompanies the processing job so that password-protected files are processed in that job. For more information, see Password bank.
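
To illustrate the point above about file headers, here is a minimal Python sketch of header-based type detection. It is not Relativity's detection logic: the `SIGNATURES` table and the `detect_file_type` function are hypothetical names, and only a handful of well-known signatures are included.

```python
# Illustrative only: a small table of well-known leading byte signatures.
SIGNATURES = [
    (b"%PDF-", "pdf"),
    (b"PK\x03\x04", "zip"),                          # also used by DOCX/XLSX/PPTX
    (b"\x89PNG\r\n\x1a\n", "png"),
    (b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1", "ole2"),   # legacy Office compound files
]


def detect_file_type(header: bytes) -> str:
    """Return a type label from the leading bytes, ignoring the file extension."""
    for magic, label in SIGNATURES:
        if header.startswith(magic):
            return label
    return "unknown"


# A PDF that was renamed to .txt is still identified by its header bytes.
print(detect_file_type(b"%PDF-1.7\n"))
```

The design point is that the decision depends on the bytes at the start of the file rather than on its name, so a renamed file is still classified by its content.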

When you start discovery, the Discover button changes to Cancel. Click this to stop discovery. See Canceling discovery for details.

Discovery process

The following graphic and corresponding steps depict what happens behind the scenes when you start discovery. This information is meant for reference purposes only.

  1. You click Discover Files on the processing set console.
  2. A console event handler copies all settings from the processing profile to the data sources on the processing set and then checks to make sure that the set is valid and ready to proceed.
  3. The event handler inserts all data sources into the processing set queue.
  4. The data sources wait in the queue to be picked up by an agent, during which time you can change their priority.
  5. The processing set manager agent picks up each data source based on its order, all password bank entries are synced, and the agent submits each data source as an individual discovery job to the processing engine. The agent then provides updates on the status of each job to Relativity, which then displays this information on the processing set layout.
  6. The processing engine discovers the files and applies the filters you specified in the Inventory tab. It then sends the finalized discovery results back to Relativity, which then updates the reports to include all applicable discovery data. You can generate these reports to see how much discovery has narrowed down your data set.
  7. Any errors that occurred during discovery are logged in the errors tabs. You can view these errors and attempt to retry them. See Processing error workflow for details.
  8. You can now publish the discovered files to your workspace. If you’ve arranged for auto-publish after discovery, publish will begin automatically and you aren’t required to perform it manually.
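
Steps 3 through 5 above can be pictured as a priority queue. The following Python sketch is a rough model, not the actual agent implementation: the `QueuedSource` class, the `pick_up_all` function, and the rule that a lower priority value is picked up first are all assumptions made for illustration.

```python
import heapq
from dataclasses import dataclass, field


@dataclass(order=True)
class QueuedSource:
    priority: int                       # assumption: lower value is picked up first
    order: int                          # position of the data source on the set
    name: str = field(compare=False)    # not used when comparing queue entries


def pick_up_all(sources):
    """Return names in the order a hypothetical agent would submit them."""
    heap = list(sources)
    heapq.heapify(heap)
    submitted = []
    while heap:
        submitted.append(heapq.heappop(heap).name)
    return submitted


queue = [
    QueuedSource(priority=100, order=2, name="Custodian B"),
    QueuedSource(priority=100, order=1, name="Custodian A"),
    QueuedSource(priority=10, order=3, name="Custodian C"),
]
print(pick_up_all(queue))  # ['Custodian C', 'Custodian A', 'Custodian B']
```

In this model, changing a data source's priority while it waits moves it ahead of or behind the others, and data sources with the same priority are taken in their original order.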

Container extraction

It may be useful to understand how the processing engine handles container files during discovery. Specifically, the following graphic depicts how the engine continues to open multiple levels of container files until there are no more containers left in the data source.

This graphic is meant for reference purposes only.

Animated GIF for container extraction
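
The idea of opening containers until none remain can be sketched as a short recursive function. This Python example is a simplification and not the processing engine's code: it treats only ZIP archives as containers, and the names `expand_containers` and `make_zip` are hypothetical.

```python
import io
import zipfile


def expand_containers(name: str, data: bytes) -> list[str]:
    """Recursively open ZIP containers and return paths of all non-container files."""
    buffer = io.BytesIO(data)
    if not zipfile.is_zipfile(buffer):
        return [name]                   # not a container: nothing left to open
    leaves = []
    with zipfile.ZipFile(buffer) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue
            child = archive.read(info)
            # A child may itself be a container, so recurse into it.
            leaves.extend(expand_containers(f"{name}/{info.filename}", child))
    return leaves


def make_zip(files: dict[str, bytes]) -> bytes:
    """Build an in-memory ZIP archive for the demonstration below."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for filename, content in files.items():
            archive.writestr(filename, content)
    return buffer.getvalue()


inner = make_zip({"memo.txt": b"hello"})
outer = make_zip({"inner.zip": inner, "notes.txt": b"world"})
print(expand_containers("source.zip", outer))
# ['source.zip/inner.zip/memo.txt', 'source.zip/notes.txt']
```

The recursion stops only when a file is not a container, which mirrors the behavior described above: every nested level is opened before the data source is considered fully expanded.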

Special considerations - OCR and text extraction

Consider the following regarding OCR and text extraction during discovery:

  • During discovery, the processing engine copies native files and OCR results to the document repository. Whether or not you publish these files, they remain in the repository, and they aren't automatically deleted or removed.
  • Relativity populates the Extracted Text field when performing OCR during discovery. Relativity doesn’t overwrite metadata fields during OCR.
  • For multi-page records with a mix of native text and images, Relativity segments out OCR and extracted text at the page level, not the document level. For each page of a document containing both native text and images, Relativity stores extracted text and OCR text separately.
  • When a record's extracted text contains both native text and OCR text, a header in the Extracted Text field indicates which text was extracted through OCR.
  • Relativity extracts OCR to Unicode.

Monitoring discovery status

You can monitor the progress of the discovery job through the information provided in the Processing Set Status display on the set layout.

Through this display, you can monitor the following:

  • # of Data Sources - the number of data sources currently in the processing queue.
  • Inventory | Files Inventoried - the number of files across all data sources submitted that the processing engine inventoried.
  • Inventory | Filtered Inventory - the number of files you excluded from discovery by applying any of the available filters in the Inventory tab. For example, if you applied only a Date Range filter and excluded only 10 .exe files from your data after you inventoried it, this will display a value of 10. If you applied no filters in the Inventory tab, this value will be 0. This value doesn't include files that were excluded via the DeNIST setting on the processing profile associated with this set.
  • Discover | Files Discovered - the number of files across all data sources submitted that the processing engine has discovered.
  • Errors - the number of errors that have occurred across all data sources submitted, which fall into the following categories:
    • Unresolvable - errors that you can't retry.
    • Available to Retry - errors that are available for retry.
    • In Queue - errors that you have submitted for retry and are currently in the processing queue.

Note: During a discovery job, the different phases of discovery are represented by the following percentage ranges: 1-25% ingestion, 26-50% data extraction (encompassing OCR), and 51-100% finalizing discovery/updating tables.
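
As a quick reference, the percentage ranges in the note above can be expressed as a lookup. This is a minimal Python sketch; the function name `discovery_phase` is hypothetical and is not part of any Relativity API.

```python
def discovery_phase(percent: int) -> str:
    """Map a discovery progress percentage to the phase named in the note above."""
    if not 1 <= percent <= 100:
        raise ValueError("percent must be between 1 and 100")
    if percent <= 25:
        return "ingestion"
    if percent <= 50:
        return "data extraction (encompassing OCR)"
    return "finalizing discovery/updating tables"


print(discovery_phase(40))  # data extraction (encompassing OCR)
```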

If you enabled the auto-publish set option on the profile used by this set, you can monitor the progress for both discovery and publish.

Status with auto publish

See Processing error workflow for details.

Once discovery is complete, the status section displays a blue check mark, indicating that you can move on to publishing your files. For more information, see Publishing files.

Canceling discovery


Once you start discovery, you can cancel it before the job reaches a status of Discovered with errors or Discover files complete.

To cancel discovery, click Cancel.

Cancel discovery button

Consider the following regarding canceling discovery:

  • If you click Cancel while the status is still Waiting, you can re-submit the discovery job.
  • If you click Cancel after the job has already been sent to the processing engine, the set is canceled, meaning all options are disabled and it is unusable. Deduplication isn’t run against documents in canceled processing sets.
  • If you have auto-publish enabled and you cancel discovery, file publishing does not start.
  • Once the agent picks up the cancel discovery job, no more errors are created for the processing set.
  • Errors resulting from a canceled job are given a canceled status and can't be retried.
  • Once you cancel discovery, you can't resume discovery on those data sources. You must create new data sources to fully discover those files.
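
The cancel rules above can be summarized as a small decision function. This Python sketch is only a reading aid under stated assumptions: the `cancel_outcome` function and its return strings are hypothetical, and any status other than the three named in this section (such as the made-up "Discovering" below) is assumed to mean the job has already been sent to the processing engine.

```python
# Statuses after which Cancel is no longer available, per this section.
FINISHED = {"Discovered with errors", "Discover files complete"}


def cancel_outcome(status: str) -> str:
    """Describe what clicking Cancel does for a given discovery status."""
    if status in FINISHED:
        return "cannot cancel"
    if status == "Waiting":
        return "canceled, can re-submit"
    # Assumption: any other status means the engine already has the job.
    return "canceled, set unusable"


print(cancel_outcome("Waiting"))      # canceled, can re-submit
print(cancel_outcome("Discovering"))  # canceled, set unusable
```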

Once you cancel discovery, the status section is updated to display the canceled state.

Canceled discovery status