Discovering files

Discovery is the phase of processing in which the processing engine retrieves deeper levels of metadata not accessible during Inventory and prepares files for publishing to a workspace.

Note: Your environment has been enabled to dynamically scale your Invariant worker servers dependent on load. Sustained activity is automatically detected by the system, and Relativity will add workers to handle this work. Once the work is done, they will automatically scale back down. This feature is continually being improved to be smarter about when we add workers and how many we add.

The following graphic depicts how discovery fits into the basic workflow used to reduce the file size of a data set through processing. This workflow assumes that you are applying some method of deNIST and deduplication.

File size reduction through processing diagram

The following is a typical workflow that incorporates discovery:

  1. Create a processing set or select an existing set.
  2. Add data sources to the processing set.
  3. Inventory the files in that processing set to extract top-level metadata.
  4. Apply filters to the inventoried data.
  5. Run discovery on the refined data.
  6. Publish the discovered files to the workspace.

Running file discovery

To start discovery click Discover Files on the processing set console. You can click this whether or not you have inventoried or filtered your files.

Note: When processing documents without an actual date, Relativity provides a null value for the following fields: Created Date, Created Date/Time, Created Time, Last Accessed Date, Last Accessed Date/Time, Last Accessed Time, Last Modified Date, Last Modified Date/Time, Last Modified Time, and Primary Date/Time. The null value is excluded and not represented in the filtered list.

Discover files button on processing set

A confirmation message appears with the discovery settings and filters applied. Click Discover to proceed with discovery or Cancel to return to the processing set layout. If you enabled auto-publish, the confirmation message provides the option to Discover & Publish.

Discover comfirmation prompt

Consider the following when discovering files:

  • Relativity does not re-extract text for a re-discovered file unless an extraction error occurred. This means that if you discover the same file twice and you change any settings on the profile, or select a different profile, between the two discovery jobs, Relativity will not re-extract the text from that file unless there was an extraction error. This is because processing always refers to the original/primary document and the original text stored in the database.
  • If you have arranged for auto-publish on the processing set's profile, the publish process begins when discovery finishes, even if errors occur during discovery. This means that the Publish button is not enabled for the set until after the job is finished. You also see a status display for both discover and publish on the set layout.
  • If your discovery job becomes stuck for an inordinate length of time, do not disable the worker associated with that processing job, as that worker may also be performing other processing jobs in the environment.
  • When discovering file types, Relativity refers to the file header information to detect the file type.
  • You can’t change the settings on any processing job at any point after file discovery begins. This means that once you click Discover, you can’t go back and edit the settings of the processing set and re-click Discover Files. You would need to create a new processing set with the desired settings.
  • You cannot start discovery while inventory is running for that processing set.
  • When you start discovery or retry discovery for a processing job, the list of passwords specified in the password bank accompanies the processing job so that password-protected files are processed in that job. For more information, see Password bank.

Note: Relativity prioritizes application metadata over operating system file properties where possible. For example, if a file type stores application metadata, such as Date Created and Date Modified, Relativity retains those values for the file. If the application metadata fields are empty or the file type does not store application metadata, Relativity uses the operating system's file properties instead. Application metadata is more reliable since it is stored in the file itself. Operating system file properties can often change. For example, moving a file from one folder to another may change property values. Examples of file types that store application metadata include Microsoft Office files such as Word or Excel.

When you start discovery, the Discover button changes to Cancel. Click this to stop discovery. See Canceling discovery for details.

Discovery process

The following graphic and corresponding steps depict what happens behind the scenes when you start discovery. This information is meant for reference purposes only.

Discovery workflow

  1. You click Discover Files on the processing set console.
  2. A console event handler copies all settings from the processing profile to the data sources on the processing set and then checks to make sure that the set is valid and ready to proceed.
  3. The event handler inserts all data sources into the processing set queue.
  4. The data sources wait in the queue to be picked up by an agent, during which time you can change their priority.
  5. The processing set manager agent picks up each data source based on its order, all password bank entries are synced, and the agent submits each data source as an individual discovery job to the processing engine. The agent then provides updates on the status of each job to Relativity, which then displays this information on the processing set layout.
  6. The processing engine discovers the files and applies the filters you specified in the Inventory tab. It then sends the finalized discovery results back to Relativity, which then updates the reports to include all applicable discovery data.
  7. Any errors that occurred during discovery are logged in the errors tabs. You can view these errors and attempt to retry them. See Processing error resolution for details.
  8. You can now publish the discovered files to your workspace. If you’ve arranged for auto-publish after discovery, publish will begin automatically and you will not invoke it manually.

Container extraction

It may be useful to understand how the processing engine handles container files during discovery. Specifically, the following graphic depicts how the engine continues to open multiple levels of container files until there are no more containers left in the data source.

This graphic is meant for reference purposes only.

Container extraction

Special considerations - OCR and text extraction

Consider the following regarding OCR and text extraction during discovery:

  • During discovery, the processing engine copies native files and OCR results to the document repository. Whether or not you publish these files, they remain in the repository, and they are not automatically deleted or removed.
  • Relativity populates the Extracted Text field when performing OCR during discovery. Relativity doesn’t overwrite metadata fields during OCR.
  • For multi-page records with a mix of native text and images, Relativity segments out OCR and extracted text at the page level, not the document level. For each page of a document containing both native text and images, Relativity stores extracted text and OCR text separately.
  • In the case where a file contains both native text and OCR within the extracted text of the record, there is a header in the Extracted Text field indicating the text that was extracted through OCR.
  • Relativity extracts OCR to Unicode.
  • Relativity displays a missing extracted text error if it cannot OCR one or more pages in a document. The error reads: An error occurred on X (number of) page(s) when attempting to OCR. Consider retrying. Page numbers missing OCR text : (page numbers)

Monitoring discovery status

You can monitor the progress of the discovery job through the information provided on the Processing set layout screen.

The Processing Set Layout view during Discovery:

Discovery status monitoring screen

The Processing Set Layout view after Discovery:

Completed Discovery Process

Throughout the processing job, you can monitor the following:

  • # of Data Sources—the number of data sources currently in the processing queue.
  • Inventory (Optional) > Files Inventoried—the number of files across all data sources submitted that the processing engine inventoried.
  • Discover > Files Discovered—the number of files across all data sources submitted that the processing engine discovered.
  • Discover > Files with Extracted Text—the number of files across all data sources submitted that have extracted text. This value only displays while the discovery jobs are still in progress. If the value is 0, text extraction has not started yet.
  • Discover > Filtered Files—the number of files you excluded from discovery by applying any of the available filters from the Inventory tab. For example, if you applied a Date Range filter that excluded 10 files from your data after you inventoried it, the value displayed is 10. If you applied no filters in the Inventory tab, the value displayed is 0. This value does not include files excluded via the DeNIST setting on the processing profile associated with this set. This value only displays when the job is complete and takes the place of the Files with Extracted Text field.
  • Errors—the number of errors that have occurred across all data sources submitted, which fall into the following categories:
    • Unresolvable Job Errors—errors that you cannot retry.
    • Available to Retry Job Errors—job errors that are available for retry.
    • Available to Retry File Errors—file errors that you have submitted for retry and are currently in the processing queue.

Note: Overall progress is calculated based on the number of data sources and percentage complete for each source.

Note: In some rare cases, you may see an additional field called Files removed after Retry. If you retry discovery errors, this field appears when the total file count after retrying is less than the original discovery file count. For example, you see the Files removed after Retry field is 10. This number tells you the total number of files discovered after retrying errors is ten less than the original number of discovered files.

If you enabled the auto-publish set option on the profile used by this set, you can monitor the progress for both discovery and publish.

Status with auto publish

See Processing error overview for details.

Once discovery is complete, the status section displays a check mark, indicating that you can move on to publishing your files. For more information, see Publishing files.

Discovery complete status icon

Viewing text extraction progress in processing sets

This feature is located below the Discover > Files Discovered count and shows the progress of text extraction by displaying an incrementing count of the number of files containing text within the processing sets. The extracted text files count remains at zero until the status bar passes 25%. The final count appears when the Discover process completes.

Text extraction after 25 %

Canceling discovery

Once you start discovery, you can cancel it before the job reaches a status of Discovered with errors or Discover files complete.

To cancel discovery, click Cancel.

Cancel discovery button

Consider the following regarding canceling discovery:

  • If you click Cancel while the status is still Waiting, you can re-submit the discovery job.
  • If you click Cancel after the job has already been sent to the processing engine, the set is canceled, meaning all options are disabled and it is unusable. Deduplication isn’t run against documents in canceled processing sets.
  • If you have auto-publish enabled and you cancel discovery, file publishing does not start.
  • Once the agent picks up the cancel discovery job, no more errors are created for the processing set.
  • Errors resulting from a canceled job are given a canceled status and cannot be retried.
  • Once you cancel discovery, you cannot resume discovery on those data sources. You must create new data sources to fully discover those files.

Once you cancel discovery, the status section is updated to display the canceled state.

Canceled discovery status