OCR

Optical Character Recognition (OCR) translates images of text, such as scanned and redacted documents, into actual text characters. There are three main steps involved in OCRing documents:

  1. Defining a production or saved search that contains the documents you want to OCR. See Creating or editing a saved search or Production sets.
  2. Creating an OCR profile. See Creating and editing an OCR profile.
  3. Creating an OCR set that references your OCR profile. See Creating and editing an OCR set.

With OCR, you can view and search text that's normally locked inside images. The OCR engine uses pattern recognition to identify individual text characters on a page, such as letters, numbers, punctuation marks, spaces, and ends of lines.

See this related recipe:

  • OCR redacted production documents and export text

Creating and editing an OCR profile

An OCR profile is a saved, reusable set of parameters that you use when creating an OCR set. To run an OCR job, you must first create an OCR profile.

You don't have to create a profile for every OCR set you create. You can use only one profile for all sets. However, you may want to have multiple profiles saved with different accuracy or language settings to use for different document sets you plan to OCR.

To create an OCR profile, follow these steps:

  1. Click the OCR Profiles tab under the OCR tab.
  2. Click New OCR Profile.
  3. Complete the fields on the form. See Fields.
  4. Click Save.

Fields

Complete the following OCR profile fields:

OCR Profile layout

  • Name - the name of the profile.
  • Preprocess Images - enhances the images to get rid of distortions before OCRing. If you set this to Yes, preprocessing occurs before the OCR engine attempts to recognize characters. This improves the accuracy of the results while significantly slowing down job completion. Setting preprocess images to Yes also arranges for any or all of the following sub-processes:
    • To improve visibility, resolution enhancement increases pixel density to 1.5 to 2 times that of the original image.
    • Text line straightening removes the distortion that occurs when capturing curved book pages.
    • Removing parallax distortion assists in situations in which the camera is not perpendicular to the page and the image is flawed as a result; for best results, the image should contain at least six lines of justified text.
    • Deskewing corrects documents that became slanted during scanning.
  • Auto-Rotate Images - makes the OCR engine detect page positioning and then reposition the page accordingly. This can affect the accuracy of OCR results. The image is not saved back to Relativity in its rotated position.
  • Languages - the language(s) you want the OCR engine to recognize while running the job. Click the ellipsis button to choose from a list of languages. If the saved search or production you plan to use as your document set contains multiple languages, you may want to select more than one from this list. See Supported languages matrix.
    Note: If the saved search or production you use contains multiple languages and you select only one language from the list, the OCR engine uses the individual characters of the selected language to OCR all the text.

  • Accuracy - determines the desired accuracy of your OCR results and the speed with which you want the job completed. This drop-down menu contains three options:
    • High (Slowest Speed) - runs the OCR job with the highest accuracy and the slowest speed.
    • Medium (Average Speed) - runs the OCR job with medium accuracy and average speed.
    • Low (Fastest Speed) - runs the OCR job with the lowest accuracy and fastest speed.
  • On Partial Error - determines the behavior when the OCR engine encounters an error on an image:
    • Leave Empty - records no results if an error is encountered in a document; even images without errors are excluded from being written. For example, if one document contains five images and one of the images errors, no results are written for that document.
    • Write Partial Results - records all successfully OCRed text while excluding text from errored images. With this option you can see potentially relevant text that would not be visible if you chose to leave the results of documents containing errored images empty. This option runs the risk of excluding potentially relevant text.
  • Image Timeout (Seconds) - determines the maximum number of seconds the OCR engine spends on an image before it times out. If the engine doesn't finish an image in this amount of time, it records an error for that image. The default value is 60 seconds.
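
The two On Partial Error options can be illustrated with a short Python sketch. This is not Relativity code. The function name document_text and the use of None to stand for an errored image are assumptions made only for illustration.

```python
from typing import List, Optional


def document_text(image_texts: List[Optional[str]], on_partial_error: str) -> str:
    """Combine per-image OCR text for one document.

    image_texts holds one entry per image. None stands for an image
    that errored (a hypothetical representation, not Relativity's).
    """
    has_error = any(text is None for text in image_texts)
    if on_partial_error == "Leave Empty":
        # One errored image means nothing is written for the document.
        if has_error:
            return ""
        return "\n".join(image_texts)
    if on_partial_error == "Write Partial Results":
        # Keep text from successful images and skip the errored ones.
        return "\n".join(text for text in image_texts if text is not None)
    raise ValueError(f"Unknown On Partial Error option: {on_partial_error}")


# A document with five images, one of which errored.
pages = ["page 1", "page 2", None, "page 4", "page 5"]
leave_empty = document_text(pages, "Leave Empty")
partial = document_text(pages, "Write Partial Results")
```

In this sketch, leave_empty is an empty string, while partial contains the text of the four images that succeeded and silently omits the third.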

If you'd like to further distinguish the profile, click the Other tab and enter information in the Keywords and/or Notes fields.

Other tab of the OCR Profile layout

Creating and editing an OCR set

Using the OCR Sets tab you can submit groups of documents defined by a saved search or production to be OCRed based on the settings defined by the OCR profile. Relativity writes the results to the destination field that you specify.

To create an OCR set, you can copy an existing OCR set. If you copy an OCR set, every current setting in that set copies over. This includes the status the original set is currently in, as well as all items in the Documents (OCR Results) list. For this reason, it's recommended that you only copy those sets that haven't been run and that have a status of Staging to avoid potential issues with copied-over results from original OCR sets.

Before you create an OCR set, you first need to create an OCR profile. See Creating and editing an OCR profile. To create an OCR set, follow these steps:

  1. Click the OCR Sets tab under the OCR tab.

    Note: On the default OCR Set list, notice that the Image Completion field contains no values for any of the sets, even if those sets are processing or completed. The Image Completion value only appears when clicking the OCR set and entering its view or edit page.

  2. Click New OCR Set. If you want to edit an existing OCR set, click the Edit link next to the OCR set name.
  3. Complete the fields on the form. See Fields.
  4. Click Save.
    The OCR Set Console appears. See Running an OCR set.

Fields

OCR Set Information

  • Name - the name of the OCR set.
  • Email notification recipients - list all email addresses you want to receive email notification upon OCR completion. Separate each email address with a semicolon.
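
If you assemble the recipient list outside Relativity, a small helper such as the following Python sketch can check the semicolon-separated value before you paste it in. The function name split_recipients and the example.com addresses are hypothetical. How the field itself treats extra spaces or a trailing semicolon is not stated here, so the sketch simply normalizes them.

```python
from typing import List


def split_recipients(value: str) -> List[str]:
    """Split a semicolon-separated recipient string into addresses."""
    # Drop surrounding whitespace and ignore empty entries, such as
    # the one produced by a trailing semicolon.
    return [part.strip() for part in value.split(";") if part.strip()]


recipients = split_recipients("reviewer@example.com; admin@example.com;")
field_value = ";".join(recipients)  # normalized value to enter in the field
```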

OCR Document Set

  • Data Source - if you're OCRing documents using a saved search, select the saved search containing the appropriate set of documents you plan to OCR.
    Choosing a data source only OCRs the original image and not redactions unless there are redactions on the image itself. The OCR engine only processes files that have been imaged in Relativity or uploaded as image files.
  • Production - if you're OCRing documents using a production set, select the production set containing the documents you plan to OCR.
    Click the ellipsis button to open the Production Picker on OCR Set view, which displays all production sets with a status of Produced that you have access to. The engine OCRs all burned-in redactions, branding, headers and footers, and text. All documents with images in the production are OCRed, not only those with redactions.
  • Only OCR Production Documents Containing Redactions - set this to Yes to OCR only produced documents that contain redactions. You can set this to Yes only if you selected a production set in the Production field under OCR Document Set. However, this setting doesn't check the selected production for images with redactions before running the OCR set. By default, this is set to No.
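
The dependency between these fields can be sketched as a simple check in Python. The class and field names below are hypothetical and only mirror the form labels. The rule that a set needs either a saved search or a production is an assumption drawn from the overview, while the redaction rule comes from the field description above.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class OcrDocumentSet:
    # Hypothetical field names that mirror the form labels.
    data_source: Optional[str] = None  # saved search name
    production: Optional[str] = None   # production set name
    only_redacted_production_docs: bool = False  # defaults to No


def validate(doc_set: OcrDocumentSet) -> List[str]:
    """Return a list of problems; an empty list means no problem was found."""
    problems = []
    if doc_set.data_source is None and doc_set.production is None:
        # Assumption: a set needs a saved search or a production.
        problems.append("Select a Data Source or a Production.")
    if doc_set.only_redacted_production_docs and doc_set.production is None:
        # Stated in the field description: Yes requires a production set.
        problems.append(
            "Only OCR Production Documents Containing Redactions "
            "requires a Production."
        )
    return problems


ok = validate(OcrDocumentSet(production="Prod001",
                             only_redacted_production_docs=True))
bad = validate(OcrDocumentSet(data_source="Saved Search A",
                              only_redacted_production_docs=True))
```

Here ok is an empty list, and bad contains one message because the redaction option was enabled without a production set.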

OCR settings

  • OCR Profile - select the OCR Profile that contains the parameters you want to run when you execute the OCR Set.
    Click the ellipsis button to bring up the OCR Profile Picker on OCR Set view, which lists profiles that have already been created in the OCR Profiles tab.
  • Destination Field - specifies the field where you want the OCR text to reside after you run the OCR. This includes Data Grid-enabled long text fields and the extracted text field.
    Click the ellipsis button to bring up the Field Picker on OCR Set view, which lists all document long text fields you have access to. If you selected non-Western European languages in your OCR Profile, the destination field should be Unicode-enabled. The contents of this field are overwritten each time a document is OCRed with that destination field selected.

OCR status

The following fields are read-only:

  • Status - view where the OCR set is in the process of running. When you save the set, this field shows a value of Staging until you click OCR Documents in the OCR Set console. The following statuses occur after you start the job:
    • Waiting
    • Processing – Building Tables
    • Processing – Inserting Records
    • Processing – OCRing
    • Processing – Compiling Results
    • Completed (if no errors occurred)

    If errors occur or the job is canceled for any reason, the following statuses are possible:

    • Error – Job Failed
    • Completed With Errors
    • Stopping
    • Stopped by User
  • Image Completion - view the count of images completed in the OCR set, the number of images with errors, and the number of images left to be OCRed. Any errors appear in red.
  • Last Run Error - view the last job error that occurred in the running of the OCR set.
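
The status values listed above can be grouped into phases, for example when you track several OCR sets in a script or spreadsheet of your own. The following Python sketch is illustrative only. The grouping, including the choice to treat Stopping as still running, and the function names phase and needs_attention are assumptions rather than part of Relativity.

```python
# Status strings exactly as they are listed for the Status field.
NOT_STARTED = {"Staging"}
RUNNING = {
    "Waiting",
    "Processing – Building Tables",
    "Processing – Inserting Records",
    "Processing – OCRing",
    "Processing – Compiling Results",
    "Stopping",  # assumption: the job is still winding down
}
FINISHED = {
    "Completed",
    "Completed With Errors",
    "Error – Job Failed",
    "Stopped by User",
}


def phase(status: str) -> str:
    """Map an OCR set status to a coarse phase."""
    if status in NOT_STARTED:
        return "not started"
    if status in RUNNING:
        return "running"
    if status in FINISHED:
        return "finished"
    raise ValueError(f"Unknown status: {status}")


def needs_attention(status: str) -> bool:
    """True for statuses that indicate errors occurred during the job."""
    return status in {"Completed With Errors", "Error – Job Failed"}
```

For example, phase("Processing – OCRing") returns "running", and needs_attention("Completed With Errors") returns True.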

Running an OCR set

When you save an OCR set, the OCR Set console appears. Use the console to run the OCR job.

OCR set layout and console

The OCR Set console provides the following action buttons:

  • OCR Documents - starts the OCR job. This processes all images in the selected data source or production.
    If a user stops the job, it completes with errors, or it fails, click OCR Documents to start the job again. If there are documents in the OCR Results list, these aren't immediately cleared when you click OCR Documents on the console. They are cleared only when the job goes into processing, which is reflected in the Status field when you click the Refresh Page link.

    Note: Only existing images are OCRed when you click OCR Documents. Images that are currently being loaded will NOT be OCRed if those images are added after you click OCR Documents. Changes made to an OCR profile that's referenced by an OCR set aren't reflected until you click OCR Documents on that set.

  • Stop OCR - terminates the running OCR job. This button is enabled after you click OCR Documents. Once you stop a job, you can't resume the job from the point at which it stopped. You have to click OCR Documents to begin the job over again.
  • Retry Errors - attempts to re-run a job with errors.
    Selecting this for a job with a status of Error – Job Failed runs the job from the point at which it failed. Selecting this for a job with a status of Completed With Errors attempts to run those images in the OCR set that previously resulted in errors. Only errored documents are processed when the system tries to resolve errors.
  • Show Errors - displays all image-level errors encountered during the OCR job. This link is only enabled if image-level errors occur. Clicking Show Errors brings up a filterable errors item list. Note the error fields that appear:
    • Document ID
    • Control Number
    • Page Number
    • Message
  • Refresh Page - updates the Status and Image Completion fields while the set is running. Clicking this button reloads the page and may reflect different values in those fields depending on what happened during the OCR job.
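
The difference between the two Retry Errors cases can be sketched in Python. This is one reading of the behavior described above, not Relativity's implementation. The per-image state labels ("done", "error", "pending"), the image names, and the function name images_to_retry are all hypothetical.

```python
from typing import Dict, List


def images_to_retry(status: str, image_states: Dict[str, str]) -> List[str]:
    """Return the images a retry would process under this sketch's reading.

    image_states maps an image name to "done", "error", or "pending".
    """
    if status == "Completed With Errors":
        # The job finished, so only images that errored are run again.
        return [img for img, state in image_states.items() if state == "error"]
    if status == "Error – Job Failed":
        # The job stopped partway, so it continues with every image that
        # is not done yet (assumed meaning of "from the point it failed").
        return [img for img, state in image_states.items() if state != "done"]
    # For any other status, this sketch has nothing to retry.
    return []


finished_with_errors = {"IMG_001": "done", "IMG_002": "error", "IMG_003": "done"}
failed_midway = {"IMG_001": "done", "IMG_002": "pending", "IMG_003": "pending"}

retry_a = images_to_retry("Completed With Errors", finished_with_errors)
retry_b = images_to_retry("Error – Job Failed", failed_midway)
```

In this sketch, retry_a contains only IMG_002, while retry_b contains IMG_002 and IMG_003. In neither case are images that already completed processed again, which contrasts with clicking OCR Documents, where the whole job starts over.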

Once the OCR job completes, the Document (OCR Results) list displays all documents successfully OCRed. The fields in this view are Control Number and File Icon.

Viewing OCR text

Once you run the OCR set, you can review your OCRed text. The most effective way of viewing your OCR text is by following these steps:

  1. To launch the core reviewer interface, click the control number of a document.
  2. Change the viewer mode to either Image or Production, depending on what you OCRed.
  3. Click the stand-alone document viewer icon to launch the Stand Alone Viewer.
  4. Click the Unsynced icon to sync the Stand Alone Viewer with the main window.
  5. Change the mode of the Stand Alone Viewer by using the long text drop-down menu. Select the destination field you specified for the results of the OCR set.

    Note: If this field is not visible in the drop-down menu, then you must edit that field to make the Available in Viewer value Yes.

  6. Compare the OCR text to that of the document’s original or produced image.

Filtering and searching on the OCR text field

After running an OCR set, you can add the OCR text field to the view so that you can filter for specific terms in the OCR text field.

View with OCR Results field added and filter applied

You can also select the Save as search icon to save the view as a search so that it's accessible in the saved searches folder.