Processing duplication workflow

The Processing Duplication Workflow is a way to identify and mark duplicate documents after discovery. This way you can review only one copy of each document, while leaving all copies of the document in place across the workspace. You can control how documents are published through the processing profile selected when creating a processing set. You can also create your own processing profile, then choose the deduplication method (global, custodial, or none) to use.

Note: The scripts described on this page help identify duplicate files and their source locations. Use the scripts if you have deduplication set to No Deduplication or are working with files uploaded outside of Relativity. You should not use these scripts if you are running deduplication within Relativity processing.

Downloading and installing the solution to your instance

Before you begin, confirm that the Processing Duplication Workflow application is in your environment's application library. If it is, skip to the section Adding the solution to a workspace. If not, download the solution from the Community site, then deploy it to your environment. After it is deployed, you can add it to any workspace within your environment.

Supported versions

Click the following link to access the solution files on the Community site.

Note: Some versions of this application may not be eligible for support by Relativity Customer Support. For more information, see the Version support policy.

Solution version | Supported Relativity version
2022.1.7 | All supported versions of Relativity.

Note: You must have valid Relativity Community credentials to download files from the Community site. After accessing the ProcessingDuplicationWorkflow page, click the Download tab. If the file successfully downloads to your local drive, you will not see any other dialog. If you see an error stating "URL No Longer Exists," it may be due to a single sign-on error related to the SAML Assertion Validator, and you should contact your IT department.

Components

This custom solution consists of the following components:

  • The Processing Duplication Workflow application.
  • Item-level and family-level scripts that run at the workspace level. The scripts are described in greater detail in the Before you run the scripts section. They include:
    • All Custodians (item-level and family-level)
    • All Source Locations (item-level and family-level)
    • Update Duplicate Status (item-level and family-level)

Considerations

  • Only a system administrator should run these scripts.
  • The solution may require you to create fields in your environment. Be sure to read the section Before you run the scripts to confirm you have the necessary fields and saved searches in place.
  • You have the option of tagging a document as Responsive and then propagating that value to the document’s family members. For more information, see Applying propagation to documents.

Deploying and configuring the solution

After downloading the solution from the Community site, you must deploy it to your environment's Application Library.

To deploy the solution to your environment:

  1. Use the search bar to navigate to your environment's Application Library.
  2. Click the New Library Application button.
  3. Next to Application File, click Select File.
  4. Navigate to and select the ProcessingDuplicationWorkflow.rap file you downloaded from the Community site.
  5. Click the Save button.

Adding the solution to a workspace

The Processing Duplication Workflow solution is already deployed to all RelativityOne instances. To install the solution on a workspace, perform the following:

  1. Use the search bar to navigate to the Application Library in the Admin environment.
  2. Use the Name filter to locate the Processing Duplication Workflow application.
  3. Select the application to open the application information page.
  4. Locate the Workspaces Installed section and click Select.
  5. Use the Workspace filter to locate your target workspace(s), then use the Move Selected Left to Right arrow to lock in your selection(s).
  6. Click Apply.
    You are redirected back to the Workspaces Installed screen. When the installation is complete, your workspace appears with a status of Installed.

Before you run the scripts

To avoid delays later, take a few minutes to confirm you have the prerequisites required by the scripts. These include required fields, populated entities, and saved searches. If you have one or more missing items, create them now before proceeding.

Fields

The following is a list of all the fields used in the Processing Duplication Workflow scripts, along with their equivalent Relativity fields. We recommend you confirm that the Relativity field equivalents exist before proceeding. If they do not, use the field type and associative objects to create them.

Field name in script | Relativity field equivalent | Used in script | Field type | Associative object | Field purpose
Saved Search | [Your Saved Search Name] | All scripts | Single Object | Search | This field serves two purposes. First, it defines the data set to run the script against. Second, after the script runs, Relativity updates the fields in the saved search with the results. You can use one saved search for both the item-level and family-level scripts because the results use the same fields.
Custodian Object | Custodian | All Custodians - Item; All Custodians - Family | Single Object | Entity | The custodian associated with a file. More than one custodian can be associated with a single file.
Custodian Object Field | Full Name | All Custodians - Item; All Custodians - Family | Fixed-Length Text (255) | Entity | This processing field displays the first and last name of the custodian.
Destination Field (output) | All Custodians (Long Text) | All Custodians - Item; All Custodians - Family; All Source Locations - Item; All Source Locations - Family | Long Text | Document | When used with the All Custodians scripts, this field displays a semi-colon delimited list of all custodians associated with a file.
Destination Field (output) | All Paths/Locations | All Custodians - Item; All Custodians - Family; All Source Locations - Item; All Source Locations - Family | Long Text | Document | When used with the All Source Locations scripts, this field displays a semi-colon delimited list of the source locations for a file.
Duplicate Hash Field | Processing Duplicate Hash | All scripts | Fixed-Length Text (64) | Document | The identifier of the physical native file. Alternatively, you can select another hash field in its place, such as MD5 or SHA1.
Duplicate Sort Order Field | Custodian Sort Order | Update Duplicate Status - Item; Update Duplicate Status - Family | Whole Number | Entity | An optional field that, when selected, sorts the script results according to the custodian sort order rank. To use this field, you must manually assign a sort order value to the custodians in your saved search. See Other considerations for details on how to assign a sort order to custodians.
Duplicate Status Field | Duplicate Status | Update Duplicate Status - Item; Update Duplicate Status - Family | Single Choice | Document | Displays one of three values, Unique, Master, or Duplicate, based on the relational field (hash). Unique files have a relational ID not shared with any other file. Master files share a relational ID with at least one other file and have the lowest ID in that group. Duplicate files share a relational ID with at least one other file and do not have the lowest ID in that group.
Family Identifier Field | Family Group | All Custodians - Family; All Source Locations - Family; Update Duplicate Status - Family | Fixed-Length Text (64) | Document | Identifies the family group a file belongs to.
Level Field | Level | All Custodians - Family; Update Duplicate Status - Family | Whole Number | Document | Numeric value that represents how deeply nested a file is within a family. The higher the number, the deeper the file is nested.
Source Field | Source Path | All Source Locations - Item; All Source Locations - Family | Long Text | Document | The location of the file in the data set.

Saved searches

Saved searches have two functions when using the duplication scripts:

  • They identify the data set you are running the script against.
  • After the script runs, it updates the saved search so that you can identify duplicate documents and their locations.

You can create three saved searches, one for each script type, and use them for both item and family levels.

All custodians

Create a saved search with the following fields:

  • Control Number—the document identifier.
  • Custodian—the custodian associated with the document.
  • Family Group—the relational field that defines groups of related documents.
  • All Custodians (Long Text)—the script uses this field to output a semicolon delimited list of custodians associated with the document or relational group (for family-level results).

All source locations

Create a saved search with the following fields:

  • Control Number—the document identifier.
  • Source Path—the location of the document in the data set.
  • Family Group—the relational field that defines groups of related documents.
  • Level—the nested level of the document. This field is useful when running the family-level script and tells you how deeply nested the document is.
  • All Paths/Locations—the script uses this field to output a semicolon delimited list of source locations associated with the document.

Update duplicate status

Create a saved search with the following fields:

  • Artifact ID—the unique identifier of the database object in Relativity.
  • Control Number—the document identifier.
  • Processing Duplicate Hash—the identifier of the physical native file, such as MD5 or SHA1.
  • Custodian—the custodian associated with the document.
  • Level—the numeric value that represents how deeply nested a file is within a family. The higher the number, the deeper nested.
  • Duplicate Status—the script updates this field with one of the following options: Unique, Master, Duplicate.

Other considerations

Finally, there are a few other housekeeping items to consider.

  • Script access and visibility—after installing the Processing Duplication Workflow application, Relativity adds links to the scripts in the workspace's general menu. You can access the menu by clicking the hamburger link at the bottom of the left column. However, if you intend to use the scripts often and want a quicker way to access them, read the section on Enabling the workflow tabs.
  • Populated entities—if you uploaded your data from a solution outside of Relativity, confirm that you have populated entities. For details, read the section on Populating entities for data uploaded outside of Relativity.
  • Custodian sort order—(optional) some scripts give you the option of ranking the results based on the custodian's sort order. To use this setting, custodians must have a sort order value, which you manually assign to the custodian's entity object. For example, Custodian A has a sort order of 10, while Custodian B's sort order is 20. If everything else is equal between the two custodians, Custodian A will rank higher in the results than Custodian B, based on the sort order value. If you want to use the sort order to rank custodians, read the section on Setting a custodian's sort order.

Running the scripts

You can run the processing duplication scripts at the item level or family level. When deciding which level you should choose, consider the following:

  • Item level scripts identify all instances of duplicate documents, regardless of email families.
    • For example, an email sent to five custodians has one master and four duplicates.
    • An Excel file attached separately to two different emails is identified as a duplicate.
  • Family level scripts only identify entire duplicate email families, such as an email and its attachments.
    • For example, an email sent to five custodians has one master and four duplicates.
    • An Excel file attached separately to two different emails is not identified as a duplicate.
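
The difference between the two levels can be sketched in Python using the Excel attachment example above. This is an illustration, not the logic of the shipped scripts: it assumes that two families count as duplicates only when their full sets of member hashes match, which this page does not state. The record layout, control numbers, hash values, and function names are all hypothetical.

```python
from collections import defaultdict


def item_level_groups(docs):
    """Group control numbers by hash, ignoring family membership."""
    groups = defaultdict(list)
    for doc in docs:
        groups[doc["hash"]].append(doc["control_number"])
    # Keep only hashes shared by more than one document.
    return {h: ids for h, ids in groups.items() if len(ids) > 1}


def family_level_groups(docs):
    """Group families whose complete set of member hashes matches (assumed rule)."""
    members = defaultdict(list)
    for doc in docs:
        members[doc["family"]].append(doc["hash"])
    by_signature = defaultdict(list)
    for family, hashes in members.items():
        by_signature[tuple(sorted(hashes))].append(family)
    # Keep only signatures shared by more than one family.
    return {sig: fams for sig, fams in by_signature.items() if len(fams) > 1}


# Two different emails, each carrying the same Excel attachment.
docs = [
    {"control_number": "DOC001", "family": "F1", "hash": "email-a"},
    {"control_number": "DOC001_0001", "family": "F1", "hash": "xlsx"},
    {"control_number": "DOC002", "family": "F2", "hash": "email-b"},
    {"control_number": "DOC002_0001", "family": "F2", "hash": "xlsx"},
]

item_dupes = item_level_groups(docs)      # the attachment is flagged
family_dupes = family_level_groups(docs)  # no whole family matches
```

With this data, the item-level grouping flags the two copies of the attachment, while the family-level grouping flags nothing because the parent emails differ.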

For each level, there are three scripts:

  • All Custodians—populates a field with a list of all custodians owning a document. Use this script to identify all duplicates in your data set, regardless of family identity. For more information, see All custodians script.
  • All Source Locations—identifies the source locations of duplicate documents. For more information, see All Source Locations script.
  • Update Duplicate Status—populates a single-choice field with one of three values: Master, Duplicate, or Unique. Rerunning the script overwrites any existing values. Use this script to update the duplicate status of the document. Running this script at the family level also identifies the level, such as parent or child, of the document. For more information, see Update Duplicate Status script.

All Custodians script

The All Custodians script populates a field with the names of all custodians who own a document (item level) or own a document within a given family (family level). At the family level, the script populates a field with the level of the document: 1 for a parent document, or 2 or greater for a child. You can use this script to identify duplicates regardless of family. The steps for running the script are similar for the item-level and family-level searches. The instructions below use the item-level UI as a base, with any family script differences highlighted.

To run the All Custodians script:

  1. Navigate to the Processing Item Level Scripts sub-tab.
    • Family-level: Navigate to the Processing Family Scripts sub-tab.
  2. From the landing page, select the All Custodians (Item Level) option.
    • Family-level: Select the All Custodians (Family) option.
    [Figure: landing pages for the All Custodians item-level and family-level scripts]
  3. Complete the following fields: 
    • Saved Search—select the saved search that contains the documents to run the script against. Include every file you want the script to evaluate, regardless of family relationships, because the script processes each document independently.
    • Duplicate Hash Field—select the relational field which defines groups of duplicate documents. For example, MD5, SHA1, or Processing Duplicate Hash.
    • Family Identifier Field (Family-level only)—select the relational field which defines groups of family documents.
    • Level Field (Family-level only)—select the whole number field which defines a numeric value indicating how deeply nested the document is within the family. This is commonly called the Level field if you discovered and published the data through Relativity Processing.
    • Custodian Object—select the Custodian object that has your custodian information.
    • Custodian Object Field—select the field that has the custodian's full name.
    • Destination Field (output)—select the long text field to store the semi-colon delimited list of custodians.
    • Batch Size—(optional; not available in all product versions) the number of files in a saved search batch. Leave this field blank to use the default value of 50,000, or enter a value below 50,000 to increase the script execution speed.
  4. Click Run.
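
To picture what the Batch Size setting means, the Python sketch below splits a list of document IDs into batches of at most a given size. How the scripts batch documents internally is not described on this page, so treat this purely as an illustration. The function name `batches` and the sample IDs are hypothetical.

```python
def batches(document_ids, batch_size=50_000):
    """Yield consecutive slices of document_ids, each at most batch_size long."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(document_ids), batch_size):
        yield document_ids[start:start + batch_size]


# Example: 120,000 hypothetical document IDs with the default batch size.
ids = list(range(120_000))
sizes = [len(batch) for batch in batches(ids)]
```

With the default of 50,000, the 120,000 IDs fall into three batches, the last of which is smaller than the others.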

Viewing the results

Return to the saved search used in the script and refresh the list if necessary.

[Figure: document lists showing the results of the All Custodians item-level and family-level scripts]

The following table describes the two report results:

Column | Description
Control Number | This is the file's ID. Child documents have an underscore and secondary number that tells you where the file is located within a family. For example, _0003 indicates the file is the third duplicate in the family group.
Custodian | The primary custodian associated with the document. When more than one custodian is associated with a file, the primary custodian is the entity with the lowest Artifact ID.
Family Group | The relational field that defines groups of related documents. Use this field to identify and filter families of documents.
All Custodians (Long Text) | The semi-colon delimited list of all custodians associated with the document or relational group.
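
The All Custodians (Long Text) value can be pictured with the item-level Python sketch below, which collects the custodians of every document sharing a hash and joins them with semi-colons. It rests on assumptions this page does not confirm: that names appear in first-seen order, that each name is listed once, and that a space follows each semi-colon. The function name, record layout, and custodian names are hypothetical.

```python
from collections import defaultdict


def all_custodians_by_hash(docs):
    """Map each hash to a semi-colon delimited list of its custodians."""
    names = defaultdict(list)
    for doc in docs:
        # Record each custodian once per hash, in first-seen order (assumed).
        if doc["custodian"] not in names[doc["hash"]]:
            names[doc["hash"]].append(doc["custodian"])
    return {h: "; ".join(custodians) for h, custodians in names.items()}


docs = [
    {"control_number": "DOC001", "hash": "h1", "custodian": "Jane Smith"},
    {"control_number": "DOC002", "hash": "h1", "custodian": "John Doe"},
    {"control_number": "DOC003", "hash": "h2", "custodian": "Jane Smith"},
]

custodian_lists = all_custodians_by_hash(docs)
```

Here both documents with hash `h1` would show the same two-name list, while the document with hash `h2` would show only its own custodian.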

All Source Locations script

The All Source Locations script identifies all source locations for a duplicate document. The steps for running the script are similar for the item-level and family-level searches. The instructions use the item-level UI as a base, highlighting family script differences.

To run the All Source Locations script:

  1. Navigate to the Processing Item Level Scripts sub-tab.
    • (Family-level) Navigate to the Processing Family Scripts sub-tab.
  2. From the landing page, select the All Source Locations (Item Level) option.
    • (Family-level) Select the All Source Locations (Family) option.
    [Figure: landing pages for the All Source Locations item-level and family-level scripts]
  3. Complete the following fields: 
    • Saved Search—select the saved search that contains the documents to run the script against. Include every file you want the script to evaluate, regardless of family relationships, because the script processes each document independently.
    • Duplicate Hash Field—select the relational field which defines groups of duplicate documents. For example, MD5, SHA1, or Processing Duplicate Hash.
    • Family Identifier Field (Family-level only)—select the relational field which defines groups of family documents.
    • Source Field—select the long text field that has the source location for documents.
    • Level Field (Family-level only)—select the whole number field which defines a numeric value indicating how deeply nested the document is within the family. This is commonly called the Level field if you discovered and published the data through Relativity Processing.
    • Destination Field (output)—select the long text field to store the semi-colon delimited list of source paths.
    • Batch Size—(optional; not available in all product versions) the number of files in a saved search batch. Leave this field blank to use the default value of 50,000, or enter a value below 50,000 to increase the script execution speed.
  4. Click Run.

Viewing the results

Return to the saved search used in the script and refresh the list if necessary.

[Figure: document lists showing the results of the All Source Locations item-level and family-level scripts]

The following table lists and describes the columns in the report:

Column | Description
Control Number | This is the file's ID. Child documents have an underscore and secondary number that tells you where the file is located within a family. For example, _0003 indicates the file is the third duplicate in the family group.
Source Path | The file location associated with the document.
Family Group | The relational field that defines groups of related documents. Use this field to identify and filter families of documents.
Level | Use this field value when running the family-level script. This number tells you the nested level of the document. The higher the number, the deeper the document is nested.
All Paths/Locations | The semi-colon delimited list of all source locations associated with the document or relational group.
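
If you export the All Paths/Locations value for further analysis, a small helper like the Python sketch below can split it back into individual locations. The function name and sample paths are hypothetical, and the sketch assumes that no source path contains a semi-colon itself.

```python
def split_locations(value):
    """Split a semi-colon delimited field value into a list of trimmed paths."""
    return [part.strip() for part in value.split(";") if part.strip()]


# Hypothetical field value copied from a document's All Paths/Locations field.
field_value = "C:/Export/Smith/inbox.pst; C:/Export/Doe/inbox.pst"
locations = split_locations(field_value)
```

The length of the returned list tells you how many source locations were recorded for that document or relational group.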

Update Duplicate Status script

The Update Duplicate Status script assigns a duplicate status value to each document: Unique, Master, or Duplicate. Rerunning the script overwrites any existing values with new ones. The steps for running the script are similar for the item-level and family-level searches. The instructions use the item-level UI as a base, highlighting family script differences.

To run the Update Duplicate Status script:

  1. Navigate to the Processing Item Level Scripts sub-tab.
    • (Family-level) Navigate to the Processing Family Scripts sub-tab.
  2. From the landing page, select the Update Duplicate Status (Item Level) option.
    • (Family-level) Select the Update Duplicate Status (Family) option.
    [Figure: landing pages for the Update Duplicate Status item-level and family-level scripts]
  3. Complete the following fields: 
    • Saved Search—select the saved search that contains the documents to run the script against. Include every file you want the script to evaluate, regardless of family relationships, because the script processes each document independently.
    • Duplicate Status Field—select the field where Relativity outputs the duplicate status.
    • Duplicate Hash Field—select the relational field which defines groups of duplicate documents. For example, MD5, SHA1, or Processing Duplicate Hash.
    • Family Identifier Field (Family-level only)—select the relational field which defines groups of family documents.
    • Level Field (Family-level only)—select the whole number field which defines a numeric value indicating how deeply nested the document is within the family. This is commonly called the Level field if you discovered and published the data through Relativity Processing.
    • Duplicate Sort Order Field—a field located on the Entity object. Leave this blank unless you have set up a field on the Entity object to store a priority sort order value for the custodian. When this field is blank, the system sorts on the Artifact ID field from the Document object, so the first document loaded in the workspace becomes the master document when duplicates are identified. See Viewing the results for more information on how this field operates.
    • Batch Size—(optional; not available in all product versions) the number of files in a saved search batch. Leave this field blank to use the default value of 50,000, or enter a value below 50,000 to increase the script execution speed.
  4. Click Run.
  5. A message appears indicating that the results are permanent. Select Accept.
  6. Return to your saved search to view the results.

Viewing the results

When the script runs, it clears the Duplicate Status field for all documents in the workspace. It then updates the field with one of the following values for the documents in the saved search:

  • Unique—the document in the saved search has a relational identifier that is unique compared to all the other documents.
  • Master—documents in the saved search where more than one document has the same relational identifier.
    • If you specify the Duplicate Sort Order field, the document with the lowest order of the associated custodian is the master. If multiple documents in the same group share the same custodian, the document with the lowest Artifact ID becomes the master.
    • If you do not specify the Duplicate Sort Order field, the document having the lowest document Artifact ID in the relational group is the master.
  • Duplicate—documents in the saved search where more than one document has the same relational identifier.
    • If you specify the Duplicate Sort Order field, documents not having the lowest ordered custodian in the relational group are duplicates.
    • If you do not specify the Duplicate Sort Order field, documents not having the lowest Artifact ID in the relational group are duplicates.
  • Not Set—the relational identifier for the document in the saved search is not set.
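
The rules in this list can be sketched as a small Python function at the item level. This is an illustration of the described behavior, not the code of the shipped script. The record layout and the name `assign_duplicate_status` are hypothetical, and the sketch assumes that every custodian has a sort order value whenever `use_sort_order` is enabled.

```python
from collections import defaultdict


def assign_duplicate_status(docs, use_sort_order=False):
    """Return {artifact_id: status} for the documents in a saved search."""

    def rank(doc):
        # Lowest custodian sort order wins; Artifact ID breaks ties.
        # Without a sort order, only the Artifact ID decides.
        order = doc["sort_order"] if use_sort_order else 0
        return (order, doc["artifact_id"])

    status = {}
    groups = defaultdict(list)
    for doc in docs:
        if not doc.get("hash"):
            status[doc["artifact_id"]] = "Not Set"
        else:
            groups[doc["hash"]].append(doc)

    for members in groups.values():
        if len(members) == 1:
            status[members[0]["artifact_id"]] = "Unique"
            continue
        master = min(members, key=rank)
        for doc in members:
            status[doc["artifact_id"]] = "Master" if doc is master else "Duplicate"
    return status


docs = [
    {"artifact_id": 101, "hash": "aaa", "sort_order": 20},
    {"artifact_id": 102, "hash": "aaa", "sort_order": 10},
    {"artifact_id": 103, "hash": "bbb", "sort_order": 10},
    {"artifact_id": 104, "hash": None, "sort_order": 10},
]

by_artifact_id = assign_duplicate_status(docs)
by_sort_order = assign_duplicate_status(docs, use_sort_order=True)
```

Without a sort order, document 101 is the master of the `aaa` group because it has the lowest Artifact ID. With the sort order enabled, document 102 becomes the master because its custodian has the lower sort order value.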

Documents not included in the selected saved search are excluded from the duplicate status calculation, and their Duplicate Status field is not populated.

You will see an Update Complete message when the script completes. Return to the saved search selected in the script to view the results.

[Figure: saved search views of documents updated by the Update Duplicate Status item-level and family-level scripts]

Column | Description
Control Number | This is the file's ID. Child documents have an underscore and secondary number that tells you where the file is located within a family. For example, _0003 indicates the file is the third duplicate in the family group.
Processing Folder Path | The file location associated with the document.
Family Group | The relational field that defines groups of related documents. Use this field to identify and filter families of documents.
All Paths/Locations | The script uses this field to output a semicolon delimited list of source locations associated with the document.

Note: If you mapped the Duplicate Sort Order Field in the script to the Custodian Sort Order field, the script results will display a list based on the custodian sort order rank. The Custodian Sort Order field itself does not appear in the results.