A processing profile is an object that stores the numbering, deNIST, extraction, and deduplication settings that the processing engine refers to when publishing the documents in each data source that you attach to your processing set. You can create a profile specifically for one set or you can reuse the same profile for multiple sets.
Relativity provides a Default profile upon installation of processing.
You're a litigation support specialist, and your firm has asked you to bring a custodian's data into Relativity without bringing in any embedded Microsoft Office objects or images. You have to create a new processing profile for this because none of the profiles in the workspace are set to exclude embedded images or objects when extracting children from a data set.
To do this, you simply create a new profile with those specifications and select that profile when creating the processing set that you want to use to bring the data into Relativity.
To create or edit a processing profile:
- Go to the Processing Profile tab.
- Click New Processing Profile or select any profile in the list.
- Complete or modify the fields on the Processing Profile layout.
- Click Save. Once you save the processing profile, you can associate it with a processing set.
For more information, see Processing sets.
Note: You can't delete the Default processing profile. If you delete a profile that is associated with a processing set that you've already started, the in-progress processing job will continue with the original profile settings you applied when you submitted the job. If you have an existing processing set that you haven't started that refers to a profile that you deleted after associating it to the set, you must associate a new profile with the set before you can start that processing job.
Note: Relativity doesn't re-extract text for a re-discovered file unless an extraction error occurred. This means that if you discover the same file twice and you change any settings on the profile, or select a different profile, between the two discovery jobs, Relativity will not re-extract the text from that file unless there was an extraction error. This is because processing always refers to the original/master document and the original text stored in the database.
The Processing Profile Information category of the profile layout provides the following fields:
- Name - the name you want to give the profile.
The Numbering Settings category of the profile layout provides the following fields.
- Default document numbering prefix - the prefix applied to each file in a processing set once it is published to a workspace. The default value for this field is REL.
- When applied to documents, this appears as the prefix, followed by the number of digits you specify. For example, <Prefix>xxxxxxxxxx.
- If you use a different prefix for the Custodian field on the processing data source(s) that you add to your processing set, the custodian's prefix takes precedence over the profile's.
- The character limit for this prefix is 75.
- Numbering Type - determines how the documents in each data source are numbered when published to the workspace. This field gives you the option of defining your document numbering schema. This is useful when you're importing from alternative sources and you need to keep your document numbering consistent. The choices for this field are:
- Auto Numbering - determines that the next published document will be identified by the next available number of that prefix.
- Define Start Number - define the starting number of the documents you intend to publish to the workspace.
- If this number is already published to the workspace, Relativity uses the next available number for that prefix.
- This option is useful when you're processing from a third-party tool that doesn't provide a suffix for your documents, and you'd like to define a new start number for the next set of documents to keep the numbering continuous.
- Selecting this choice makes the Default Start Number field available below, as well as the Start Number field on the data source layout.
- Default Start Number - the starting number for documents that are published from the processing set(s) that use this profile.
- This field is only visible if you selected the Define Start Number choice for the Numbering Type field above.
- If you use a different start number for the Start Number field on the data source that you attach to the processing set, that number takes precedence over the value you enter here.
- The maximum value you can enter here is 2,147,483,647. If you enter a higher value, you'll receive an Invalid Integer warning next to the field value and you won't be able to save the profile.
- Number of Digits (padded with zeros) - determines how many digits the document's control number contains. The range of available values is 1 to 10. By default, this field is set to 10.
- Parent/Child Numbering - determines how parent and child documents are numbered relative to each other when published to the workspace. The choices for this field are as follows. For examples of each type, see Parent/child numbering type examples.
- Suffix Always - arranges for child documents to be appended to their parent with a delimiter.
- Continuous Always - arranges for child documents to receive a sequential control number after their parent.
- Continuous, Suffix on Retry - arranges for child documents to receive a sequential control number after their parent except for child documents that weren't published to the workspace. When these unpublished child documents are retried and published, they will receive the parent's number with a suffix. If you resolve the error post-publish, the control number doesn’t change.
Note: It's possible for your workspace to contain a document family that has both suffixed and non-suffixed child documents. See Suffix special considerations for details.
- Delimiter - the delimiter you want to appear between the different fragments of the control number of your published child documents. The choices for this field are:
- - (hyphen) - adds a hyphen as the delimiter to the control number of child documents. For example, REL0000000001-0001-0001.
- . (period) - adds a period as the delimiter to the control number of child documents. For example, REL0000000001.0001.0001.
- _ (underscore) - adds an underscore as the delimiter to the control number of child documents. For example, REL0000000001_0001_0001.
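The numbering fields above can be pictured with a minimal Python sketch. It is not Relativity's implementation: the `NumberingSettings` class, the `control_number` and `next_available` helpers, and the default start number of 1 are hypothetical names and values chosen for illustration. Only the limits (75-character prefix, 1 to 10 digits, maximum start number of 2,147,483,647) and the precedence of the custodian's prefix come from the field descriptions.

```python
from dataclasses import dataclass
from typing import Optional, Set

MAX_PREFIX_LENGTH = 75            # character limit for the prefix
MAX_START_NUMBER = 2_147_483_647  # largest accepted Default Start Number


@dataclass
class NumberingSettings:
    """Hypothetical container for the profile's numbering fields."""
    prefix: str = "REL"
    number_of_digits: int = 10
    default_start_number: int = 1  # assumed default, for illustration only

    def validate(self) -> None:
        if len(self.prefix) > MAX_PREFIX_LENGTH:
            raise ValueError("prefix exceeds 75 characters")
        if not 1 <= self.number_of_digits <= 10:
            raise ValueError("number of digits must be between 1 and 10")
        if self.default_start_number > MAX_START_NUMBER:
            raise ValueError("invalid integer: start number is too large")


def control_number(settings: NumberingSettings, number: int,
                   custodian_prefix: Optional[str] = None) -> str:
    """Build prefix + zero-padded number; a custodian prefix takes precedence."""
    settings.validate()
    prefix = custodian_prefix or settings.prefix
    return f"{prefix}{number:0{settings.number_of_digits}d}"


def next_available(start: int, used: Set[int]) -> int:
    """Return the first number at or after `start` that is not already used."""
    number = start
    while number in used:
        number += 1
    return number


print(control_number(NumberingSettings(), 1))
print(control_number(NumberingSettings(number_of_digits=4), 12, "JONES"))
print(next_available(5, {5, 6}))
```

The `next_available` helper mirrors the Define Start Number behavior described above: if the requested number is already taken for that prefix, the next free number is used.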
The Inventory | Discovery Settings category of the profile layout provides the following fields.
- DeNIST - if set to Yes, processing separates and removes files found on the National Institute of Standards and Technology (NIST) list from the data you plan to process so that they don't make it into Relativity when you publish a processing set. The NIST list contains file signatures—or hash values—for millions of files that hold little evidentiary value for litigation purposes because they are not user-generated. This list may not contain every known junk or system file, so deNISTing may not remove 100% of undesirable material. If you know that the data you intend to process contains no system files, you can select No. If the DeNIST field is set to Yes on the profile but the Invariant database table is empty for the DeNIST field, you can't publish files. If the DeNIST field is set to No on the processing profile, the DeNIST filter doesn't appear by default in Inventory, and you don't have the option to add it. Likewise, if the DeNIST field is set to Yes on the profile, the corresponding filter is enabled in Inventory, and you can't disable it for that processing set. The choices for this field are:
- Yes - removes all files found on the NIST list. You can further define DeNIST options by specifying a value for the DeNIST Mode field.
Note: When DeNISTing, the processing engine takes into consideration everything about the file, including extension, header information, and the content of the file itself. Even if header information is removed and the extension is changed, the engine is still able to identify and remove a NIST file. This is because it references the hashes of the system files that are found in the NIST database and matches up the hash of, for example, a Windows DLL to the hashes of known DLLs in the database table.
- No - doesn't remove any files found on the NIST list. Files found on the NIST list are then published with the processing set.
Note: The same NIST list is used for all workspaces in the environment because it's stored on the worker manager server. You should not edit the NIST list. Relativity makes new versions of the NIST list available shortly after the National Software Reference Library (NSRL) releases them quarterly. Email Support to obtain a link to the latest version.
- DeNIST Mode - specify DeNIST options in your documents if DeNIST is set to Yes.
- DeNIST all files - breaks any parent/child groups and removes any attached files found on the NIST list from your document set.
- Do not break parent/child groups - doesn't break any parent/child groups, regardless of whether the files are on the NIST list. Any loose NIST files are removed.
- Default OCR languages - the language used to OCR files where text extraction isn't possible, such as for image files containing text. This selection determines the default language on the processing data sources that you create and then associate with a processing set.
For more information, see Adding a processing data source.
- Default time zone - the time zone used to display date and time on a processed document. This selection determines the default time zone on the processing data sources that you create and then associate with a processing set. The default time zone is applied from the processing profile during the discovery stage.
For more information, see Adding a processing data source.
Note: The processing engine discovers all natives in UTC and then converts metadata dates and times into the value you enter for the Default Time Zone field. The engine needs the time zone at the time of text extraction to write the date/time into the extracted text and automatically applies the daylight saving time for each file based on its metadata during the publishing stage.
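As a rough illustration of the DeNIST and DeNIST Mode fields described in this category, the following Python sketch filters files by hash. It is a simplification under stated assumptions: the `File` class, the `sha1_of` and `denist` helpers, and the use of SHA-1 are hypothetical, since this page does not say which hash algorithm the processing engine uses or how it represents families internally.

```python
import hashlib
from dataclasses import dataclass
from typing import List, Optional, Set


def sha1_of(content: bytes) -> str:
    """Hash file content; SHA-1 is an assumption made for this sketch."""
    return hashlib.sha1(content).hexdigest()


@dataclass(frozen=True)
class File:
    """Hypothetical record for a discovered file."""
    name: str
    digest: str
    family_id: Optional[str] = None  # None means a loose file


def denist(files: List[File], nist_hashes: Set[str],
           break_families: bool) -> List[File]:
    """Drop files whose hash is on the NIST list.

    break_families=True  models "DeNIST all files".
    break_families=False models "Do not break parent/child groups":
    only loose files on the list are removed.
    """
    kept = []
    for f in files:
        on_list = f.digest in nist_hashes
        in_family = f.family_id is not None
        if on_list and (break_families or not in_family):
            continue
        kept.append(f)
    return kept


nist = {sha1_of(b"system dll")}
files = [
    File("loose.dll", sha1_of(b"system dll")),
    File("mail.msg", sha1_of(b"hello"), family_id="fam1"),
    File("attached.dll", sha1_of(b"system dll"), family_id="fam1"),
]
print([f.name for f in denist(files, nist, break_families=True)])
print([f.name for f in denist(files, nist, break_families=False)])
```

Because the match is made on content hashes, renaming `loose.dll` or changing its extension would not change the outcome in this sketch, which is consistent with the note above about identifying NIST files regardless of extension.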
The Extraction Settings category of the profile layout provides the following fields.
- Extract children - arranges for the extraction of child items during discovery, including attachments, embedded objects and images, and other non-parent files. The options are:
- Yes - extracts all child files during discovery so that both children and parents are included in the processing job.
- No - does not extract children, so that only parents are included in the processing job.
Note: You don’t need to set the Extract children field to Yes to have the files within PST and other container files extracted and processed. This is because Relativity breaks down container files by default without the need to specify to extract children.
- When extracting children, do not extract - exclude one or all of the following file types when extracting children. You can't make a selection here if you set the Extract children field to No.
- MS Office embedded images - excludes images of various file types found inside Microsoft Office files—such as .jpg, .bmp, or .png in a Word file—from discovery so that embedded images aren't published separately in Relativity.
- MS Office embedded objects - excludes objects of various file types found inside Microsoft Office files—such as an Excel spreadsheet inside a Word file—from discovery so that the embedded objects aren't published separately in Relativity. MS Office embedded objects will not have text extracted and will not be searchable.
Note: Relativity currently doesn't support the extraction of embedded images or objects from Visio, Project, or OpenOffice files. In addition, Relativity never extracts any embedded objects or images that were added to any files as links. For a detailed list of the Office file extensions from which Relativity does and does not extract embedded objects and images, see Microsoft Office child extraction support.
- Email inline images - excludes images of various file types found inside emails (such as .jpg, .bmp, or .png in an email) from discovery so that inline images aren't published separately in Relativity.
Note: For a detailed list of the kinds of attachments that Relativity treats as inline, or embedded, images during processing, see Tracking inline/embedded images.
- Email Output - determines the file format in which emails will be published to the workspace. The options are:
- MSG - publishes all emails as MSG to the workspace.
- MHT - converts all emails in your data source from MSG to MHT and publishes them to the workspace.
- This conversion happens during discovery.
- MSG files take up unnecessary space because attachments to an MSG are stored twice, once with the MSG itself and again when they’re extracted and saved as their own records. As a result, when you convert an MSG to an MHT, you significantly reduce your file storage because MHT files do not require duplicative storage of attachments.
- If you need to produce a native email file while excluding all privileged or irrelevant files, convert the email native from MSG to MHT by using the Email Output field. After an email is converted from MSG to MHT, the MHT email is published to the workspace separately from any attachments, reducing the chance of accidentally producing privileged attachments.
- Once you convert an MSG file to MHT, you cannot revert this conversion after the files have been published. For a list of differences between how Relativity handles MSG and MHT files, see MSG to MHT conversion considerations.
Note: There is also a Yes/No Relativity Processing field called Converted Email Format that tracks whether an email was converted to MHT.
- Excel Text Extraction Method - determines whether the processing engine uses Excel or dtSearch to extract text from Excel files during publish.
- Native - tells Relativity to use Excel to extract text from Excel files.
- Native (failover to dtSearch) - tells Relativity to use Excel to extract text from Excel files with dtSearch as a backup text extraction method if Native text extraction is unsuccessful.
- dtSearch (failover to Native) - tells Relativity to use dtSearch to extract text from Excel files with Native as a backup text extraction method if dtSearch text extraction is unsuccessful. This typically results in faster extraction speeds; however, we recommend considering some differences between dtSearch and Native extraction. For example, dtSearch doesn't support extracting the Track Changes text from Excel files. For more considerations like this, see dtSearch special considerations.
Note: When you select the dtSearch Text Extraction Method, the Excel Header/Footer Extraction field below is made unavailable for selection because dtSearch automatically extracts header and footer information and places it at the end of the text; if you selected a value for that field before selecting dtSearch here, then that selection is nullified.
- Excel Header/Footer Extraction - extract header and footer information from Excel files when you publish them. This is useful for instances in which the header and footer information in your Excel files is relevant to the case. This field isn't available if you selected dtSearch for the Excel Text Extraction Method field above because dtSearch automatically extracts header and footer information and places it at the end of the text; if you selected a value for this field and then select dtSearch above, your selection here is nullified. The options are:
- Do not extract - doesn't extract any of the header or footer information from the Excel files and publishes the files with the header and footer in their normal positions. This option is selected by default; however, if you change the value for the Excel Text Extraction Method field above from dtSearch, back to Native, this option will be de-selected and you'll have to select one of these options in order to save the profile.
- Extract and place at end - extracts the header and footer information and stacks the header on top of the footer at the end of the text of each sheet of the Excel file. Note that the native file will still have its header and footer.
- Extract and place inline (slows text extraction) - extracts the header and footer information and puts it inline into the file. The header appears inline directly above the text in each sheet of the file, while the footer appears directly below the text. This could impact text extraction performance if your data set includes many Excel files with headers and footers. Note that the native file will still have its header and footer.
- PowerPoint Text Extraction Method - determines whether the processing engine uses PowerPoint or dtSearch to extract text from PowerPoint files during publish.
- Native - tells Relativity to use PowerPoint to extract text from PowerPoint files.
- Native (failover to dtSearch) - tells Relativity to use PowerPoint to extract text from PowerPoint files with dtSearch as a backup text extraction method if Native text extraction is unsuccessful.
- dtSearch (failover to Native) - tells Relativity to use dtSearch to extract text from PowerPoint files with Native as a backup text extraction method if dtSearch text extraction is unsuccessful. This typically results in faster extraction speeds; however, we recommend considering some differences between dtSearch and Native extraction. For example, dtSearch doesn't support extracting watermarks from pre-2007 PowerPoint files, and also certain metadata fields aren't populated when using dtSearch. For more considerations like this, see dtSearch special considerations.
- Word Text Extraction Method - determines whether the processing engine uses Word or dtSearch to extract text from Word files during publish.
- Native - tells Relativity to use Word to extract text from Word files.
- Native (failover to dtSearch) - tells Relativity to use Word to extract text from Word files with dtSearch as a backup text extraction method if Native text extraction is unsuccessful.
- dtSearch (failover to Native) - tells Relativity to use dtSearch to extract text from Word files with Native as a backup text extraction method if dtSearch text extraction is unsuccessful. This typically results in faster extraction speeds; however, we recommend considering some differences between dtSearch and Native extraction. For example, dtSearch doesn't support extracting watermarks from pre-2007 Word files, and also certain metadata fields aren't populated when using dtSearch. For more considerations like this, see dtSearch special considerations.
- OCR - select Enable to run OCR during processing. If you select Disable, Relativity won't provide any OCR text in the Extracted Text view.
- OCR Accuracy - determines the desired accuracy of your OCR results and the speed with which you want the job completed. This drop-down menu contains three options:
- High (Slowest Speed) - Runs the OCR job with the highest accuracy and the slowest speed.
- Medium (Average Speed) - Runs the OCR job with medium accuracy and average speed.
- Low (Fastest Speed) - Runs the OCR job with the lowest accuracy and fastest speed.
- OCR Text Separator - select Enable to display a separator between extracted text at the top of a page and text derived from OCR at the bottom of the page in the Extracted Text view. The separator reads as “--- OCR From Images ---”. With the separator disabled, the OCR text will still be on the page beneath the extracted text, but there will be nothing to indicate where one begins and the other ends. By default, this option is enabled.
Note: If OCR isn't essential to your processing job, it's recommended to disable the OCR field on your processing profile, as doing so can significantly reduce processing time and prevent irrelevant documents from having OCR performed on them. You can then perform OCR on only relevant documents outside of the processing job.
Note: When you process files with both the OCR and the OCR Text Separator fields enabled, any section of a document that required OCR will include text that says "OCR from Image." This can then pollute a dtSearch index because that index is typically built off of the extracted text field, and "OCR from Image" is text that was not originally in the document.
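The failover choices for the Excel, PowerPoint, and Word Text Extraction Method fields follow the same pattern: try one extractor, then fall back to the other if the first is unsuccessful. The Python sketch below shows that pattern only. The `METHODS` mapping, the `extract_text` function, and the extractor callables are hypothetical stand-ins and do not represent the processing engine's actual interfaces.

```python
from typing import Callable, Dict, Optional

Extractor = Callable[[bytes], str]

# Order in which extractors are attempted for each field choice.
METHODS = {
    "Native": ("native",),
    "Native (failover to dtSearch)": ("native", "dtsearch"),
    "dtSearch (failover to Native)": ("dtsearch", "native"),
}


def extract_text(data: bytes, method: str,
                 extractors: Dict[str, Extractor]) -> str:
    """Try each extractor in the configured order and return the first success."""
    last_error: Optional[Exception] = None
    for name in METHODS[method]:
        try:
            return extractors[name](data)
        except Exception as exc:  # an unsuccessful extraction triggers failover
            last_error = exc
    raise RuntimeError(f"text extraction failed for {method!r}") from last_error


def failing_native(data: bytes) -> str:
    # Stand-in for a native extraction that is unsuccessful.
    raise ValueError("native extraction was unsuccessful")


def working_dtsearch(data: bytes) -> str:
    # Stand-in for a dtSearch extraction that succeeds.
    return data.decode("utf-8")


extractors = {"native": failing_native, "dtsearch": working_dtsearch}
print(extract_text(b"cell text", "Native (failover to dtSearch)", extractors))
```

With the plain Native choice there is no second extractor, so the same failing input would raise an error instead of falling back.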
The Deduplication Settings category of the profile layout provides the following fields:
- Deduplication method - the method for separating duplicate files. During deduplication, the system compares documents based on certain characteristics and keeps just one instance of an item when two or more copies exist. The system performs deduplication against published files only. Deduplication doesn't occur during inventory or discovery. Deduplication only applies to parent files; it doesn't apply to children. If a parent is published, all of its children are also published. Select from the following options.
For details on how these settings work, see Deduplication considerations.
Note: Don't change the deduplication method in the middle of running a processing set, as doing so could result in blank DeDuped Custodians or DeDuped paths fields after publish, when those fields would otherwise display deduplication information.
- None - no deduplication occurs.
- Even when you select None as the deduplication method, Relativity identifies duplicates by storing one copy of the native document on the file repository and using metadata markers for all duplicates of that document.
- Relativity doesn't repopulate duplicate documents if you change the deduplication method from None after processing is complete. Changing the deduplication method only affects subsequent processing sets. This means that if you select global deduplication for your processing settings, you can't then tell Relativity to include all duplicates when you go to run a production.
- Global - arranges for documents from each processing data source to be de-duplicated against all documents in all other data sources in your workspace. Selecting this makes the Propagate deduplication data field below visible and required.
Note: If you select Global, there should be no exact email duplicates in the workspace after you publish. The only exception is a scenario in which two different email systems are involved, and the emails are different enough that the processing engine can't exactly match them. In the rare case that this happens, you may see email duplicates in the workspace.
- Custodial - arranges for documents from each processing data source to be de-duplicated against only documents in data sources owned by that custodian. Selecting this makes the Propagate deduplication data field below visible and required.
Note: Deduplication is run on custodian IDs; there's no consequence to changing a custodian's name after their files have already been published.
- Propagate deduplication data - applies the deduplication field data you mapped (Deduped Custodians, Deduped Paths, All Custodians, and All Paths) to child documents, which allows you to meet production specifications and perform searches on those fields without having to include family or overlay those fields manually. This field is only available if you selected Global or Custodial for the deduplication method above. You have the following options:
- Select Yes to have the metadata fields you mapped populated for parent and child documents out of the following: All Custodians, Deduped Custodians, All Paths/Locations, Deduped Paths, and Dedupe Count.
- Select No to have the following metadata fields populated for parent documents only: All Custodians, Deduped Custodians, All Paths/Locations, and Deduped Paths.
- If you republish a processing set that originally contained a password-protected error without first resolving that error, then the deduplication data won’t be propagated correctly to the children of the document that received the error.
- In certain cases, the Propagate deduplication data setting can override the Extract children setting on your profile. For example, you have two processing sets, Processing Set 1 and Processing Set 2, that both contain an email message with a Word document attached. You publish Processing Set 1 with the Extract children field set to Yes, which means that the Word attachment is published. You then publish Processing Set 2 with the Extract children field set to No but with the Deduplication method field set to Global and the Propagate deduplication data field set to Yes. When you do this, given that the emails are duplicates, the deduplication data is propagated to the Word attachment published in Processing Set 1, even though you didn’t extract it in Processing Set 2.
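The difference between the None, Global, and Custodial methods can be sketched in a few lines of Python. This is an illustrative model only: the `Parent` class, the `deduplicate` function, and the use of a single `digest` string to decide whether two parents are duplicates are assumptions, not a description of how the processing engine compares documents.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Parent:
    """Hypothetical parent document; children are not deduplicated."""
    name: str
    digest: str        # assumed stand-in for the comparison characteristics
    custodian_id: int  # deduplication runs on custodian IDs, not names


def deduplicate(parents: List[Parent],
                method: str) -> Tuple[List[Parent], Dict[str, List[int]]]:
    """Return published parents and, per published parent, the custodian IDs
    of the duplicates that were held back (a rough "deduped custodians")."""
    if method not in ("None", "Global", "Custodial"):
        raise ValueError(f"unknown deduplication method: {method}")
    published: List[Parent] = []
    deduped: Dict[str, List[int]] = {}
    masters: Dict[tuple, str] = {}
    for parent in parents:
        if method == "None":
            published.append(parent)
            continue
        # Global compares across all data sources; Custodial only within
        # data sources owned by the same custodian.
        key = (parent.digest,) if method == "Global" else (
            parent.digest, parent.custodian_id)
        if key in masters:
            deduped[masters[key]].append(parent.custodian_id)
        else:
            masters[key] = parent.name
            deduped[parent.name] = []
            published.append(parent)
    return published, deduped


docs = [Parent("a.msg", "h1", 1), Parent("b.msg", "h1", 2), Parent("c.msg", "h1", 1)]
for method in ("None", "Global", "Custodial"):
    kept, info = deduplicate(docs, method)
    print(method, [p.name for p in kept], info)
```

In this sketch Global keeps a single copy across both custodians, while Custodial keeps one copy per custodian.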
The Publish Settings category of the profile layout provides the following fields.
- Auto-publish set - arranges for the processing engine to automatically kick off publish after the completion of discovery, with or without errors. By default, this is set to No. Leaving this at No means that you must manually start publish.
- Default destination folder - the folder in Relativity into which documents are placed once they're published to the workspace. This value determines the default value of the destination folder field on the processing data source. You have the option of overriding this value when you add or edit a data source on the processing set. Publish jobs read the destination folder field on the data source, not on the profile. You can select an existing folder or create a new one by right-clicking the base folder and selecting Create.
- If the source path you selected is an individual file or a container, such as a zip, then the folder tree does not include the folder name that contains the individual file or container.
- If the source path you selected is a folder, then the folder tree includes the name of the folder you selected.
- Do you want to use source folder structure - maintain the folder structure of the source of the files you process when you bring these files into Relativity.
Note: If you select Yes for Do you want to use source folder structure, subfolders matching the source folder structure are created under the destination folder. See the following examples:
Example 1 (recommended)
- Select Source for files to process: \\server.ourcompany.com\Fileshare\Processing Data\Jones, Bob\
- Select Destination folder for published files: Processing Workspace \ Custodians \
Results: A subfolder named Jones, Bob is created under the Processing Workspace \ Custodians \ destination folder, resulting in the following folder structure in Relativity: Processing Workspace \ Custodians \ Jones, Bob \
Example 2 (not recommended)
- Select Source for files to process: \\server.ourcompany.com\Fileshare\Processing Data\Jones, Bob\
- Select Destination folder for published files: Processing Workspace \ Custodians \ Jones, Bob \
Results: A subfolder named Jones, Bob is created under the Processing Workspace \ Custodians \ Jones, Bob \ destination folder, resulting in the following folder structure in Relativity: Processing Workspace \ Custodians \ Jones, Bob \ Jones, Bob \. Any folder structure in the original source data is retained underneath.
If you select No for Do you want to use source folder structure, no subfolders are created under the destination folder in Relativity. Any folder structure that may have existed in the original source data is lost.
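The two examples above can be reproduced with a short Python sketch. The `published_folder` function is hypothetical and only models the case where the source path is a folder; it ignores the rule that an individual file or container does not contribute its containing folder name.

```python
from pathlib import PureWindowsPath


def published_folder(source: str, destination: str,
                     use_source_structure: bool) -> str:
    """Return the top-level Relativity folder for a folder source path."""
    parts = [p.strip() for p in destination.split("\\") if p.strip()]
    if use_source_structure:
        # The selected source folder's own name becomes a subfolder.
        parts.append(PureWindowsPath(source).name)
    return " \\ ".join(parts) + " \\"


source = r"\\server.ourcompany.com\Fileshare\Processing Data\Jones, Bob"

# Example 1 (recommended)
print(published_folder(source, r"Processing Workspace \ Custodians", True))
# Example 2 (not recommended): the custodian folder is duplicated
print(published_folder(source, r"Processing Workspace \ Custodians \ Jones, Bob", True))
# Source folder structure not used: files land directly in the destination
print(published_folder(source, r"Processing Workspace \ Custodians", False))
```

Example 2 shows why pointing the destination at a folder that already carries the custodian's name yields a doubled Jones, Bob level.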
To better understand how each parent/child numbering option appears for published documents, consider the following scenario.
Your data source includes an MSG file containing three Word documents, one of which is password protected:
- Word Child 1
- Word Child 2
- Word Child 3 (password protected)
- sub child 1
- sub child 2
When you process the .msg file, three documents are discovered and published, and there’s an error on the one password-protected child document. You then retry discovery, and an additional two sub-child documents are discovered. You then republish the processing set, and the two new documents are published to the workspace.
If you’d chosen Suffix Always for the Parent/Child Numbering field on the profile, the identifiers of the published documents would appear as the following:
If you’d chosen Continuous Always for the Parent/Child Numbering field on the profile, the identifiers of the published documents would appear as the following:
- In this case, the .msg file was the last document processed, and Word Child 3.docx was the first error reprocessed in a larger workspace. Thus, the sub child documents of Word Child 3.docx do not appear in the screen shot because they received sequence numbers after the last document in the set.
If you’d chosen Continuous, Suffix on Retry for the Parent/Child Numbering field on the profile, the identifiers of the published documents would appear as the following:
- Suffix on retry only applies to errors that haven’t been published to the workspace. If a document has an error and has been published, it will have a continuous number. If you resolve the error post-publish, the control number doesn’t change.
Note the following details regarding how Relativity uses suffixes:
- For suffix child document numbering, Relativity indicates secondary levels of documents with a delimiter and another four digits appended for additional sub-levels. For example, a grandchild document with the assigned prefix REL would be numbered REL0000000001.0001.0001.
- Note the following differences between unpublished documents and published documents with errors:
- If a file is unpublished, and Continuous Always is the numbering option on the profile, Relativity will not add a suffix.
- If a file is unpublished, and Suffix Always is the numbering option on the profile, Relativity will add a suffix to it.
- If a file has an error and is published, and Continuous, Suffix on Retry is the numbering option on the profile, Relativity will add a suffix to it.
- It's possible for your workspace to contain a document family that contains both suffixed and non-suffixed child documents. This can happen in the following example scenario:
- You discover a master (level 1) MSG file that contains child (level 2) documents and grandchild (level 3) documents, none of which contain suffixes.
- One of the child documents yields an error.
- You retry the error child document, and in the process you discover two grandchildren.
- The newly discovered grandchildren are suffixed because they came from an error retry job, while the master and non-error child documents remain without suffixes, based on the original discovery.
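To make the suffix format concrete, here is a minimal Python sketch of Suffix Always numbering applied to the MSG scenario described earlier. The `Doc` class and `suffix_always` function are hypothetical, and the sketch assumes that each parent's children are counted from 0001 with four-digit suffixes, as the REL0000000001.0001.0001 example suggests. It does not model Continuous Always or Continuous, Suffix on Retry.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Doc:
    """Hypothetical document node in a family tree."""
    name: str
    children: List["Doc"] = field(default_factory=list)


def suffix_always(doc: Doc, control_number: str,
                  delimiter: str = ".") -> List[Tuple[str, str]]:
    """Return (control number, name) pairs, appending a four-digit
    suffix for every level below the parent."""
    numbered = [(control_number, doc.name)]
    for index, child in enumerate(doc.children, start=1):
        child_number = f"{control_number}{delimiter}{index:04d}"
        numbered.extend(suffix_always(child, child_number, delimiter))
    return numbered


family = Doc("Email.msg", [
    Doc("Word Child 1"),
    Doc("Word Child 2"),
    Doc("Word Child 3", [Doc("sub child 1"), Doc("sub child 2")]),
])

for number, name in suffix_always(family, "REL0000000001"):
    print(number, name)
```

Under these assumptions, the two sub-child documents of Word Child 3 receive identifiers ending in .0003.0001 and .0003.0002.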
When you publish Word, Excel, and PowerPoint files with the text extraction method set to dtSearch on the profile, you'll typically see faster extraction speeds, but note that certain file properties may or may not be populated in their corresponding metadata fields or included in the Extracted Text value.
The dtSearch text extraction method does not populate the following properties:
- In Excel, Track Changes in the extracted text.
- In Word, Has Hidden Data in the corresponding metadata field.
- In Word, Track Changes in the corresponding metadata field.
- In PowerPoint, Has Hidden Data in the corresponding metadata field.
- In PowerPoint, Speaker Notes in the corresponding metadata field.
Note: The dtSearch text extraction method will display track changes extracted text in-line, but changes may be poorly formatted. The type of change made is not indicated. The Native text extraction method will append track changes extracted text in a Tracked Change section.
The following table breaks down which file properties are populated in corresponding metadata fields and/or Extracted Text for the dtSearch text extraction method:
|File type||Property||Included in dtSearch corresponding metadata field||Included in dtSearch extracted text|
|Excel (.xls, .xlsx)||Has Hidden Data||✓||✓|
|Excel (.xls, .xlsx)||Track Changes (Inserted cell, moved cell, modified cell, cleared cell, inserted column, deleted column, inserted row, deleted row, inserted sheet, renamed sheet)||✓|| |
|Word (.doc, .docx)||Has Hidden Data|| ||✓|
|Word (.doc, .docx)||Track Changes (Insertions, deletions, moves)|| ||✓|
|PowerPoint (.ppt, .pptx)||Has Hidden Data|| ||✓|
|PowerPoint (.ppt, .pptx)||Speaker Notes|| ||✓|
Note: Check marks do not apply to .xlsb files.
Note: Relativity does not possess a comprehensive list of all differences between the Native application and dtSearch text extraction methods. For additional information, see support.dtsearch.com.