Processing profiles

A processing profile is an object that stores the numbering, deNIST, extraction, and deduplication settings that the processing engine refers to when publishing the documents in each data source that you attach to your processing set. You can create a profile specifically for one set or you can reuse the same profile for multiple sets.

Relativity provides a Default profile upon installation of processing.

Creating or editing a processing profile

To create or edit a processing profile:

  1. Go to the Processing Profile tab.
  2. Click New Processing Profile or select any profile in the list.
  3. Complete or modify the fields on the Processing Profile layout.
  4. Click Save. Once you save the processing profile, you can associate it with a processing set.

Note: You can't delete the Default processing profile. If you delete a profile that is associated with a processing set you've already started, the in-progress processing phase will continue with the original profile settings you applied when you submitted the job, but you won't be able to proceed to the next phase. For example, if you delete a profile during discovery, you won't be able to publish those discovered files until you add a new profile to the set. If you have an existing processing set that you haven't started that refers to a profile that you deleted after associating it to the set, you must associate a new profile with the set before you can start that processing job.

Fields

Note: Relativity doesn't re-extract text for a re-discovered file unless an extraction error occurred. This means that if you discover the same file twice and you change any settings on the profile, or select a different profile, between the two discovery jobs, Relativity will not re-extract the text from that file unless there was an extraction error. This is because processing always refers to the original/master document and the original text stored in the database.

The Processing Profile Information category of the profile layout provides the following fields:

Processing profile information

  • Name—the name you want to give the profile.

The Numbering Settings category of the profile layout provides the following fields.

Numbering settings

  • Default document numbering prefix—the prefix applied to each file in a processing set once it is published to a workspace. The default value for this field is REL.
    • When applied to documents, this appears as the prefix, followed by the number of digits you specify. For example, <Prefix>xxxxxxxxxx.
    • If you use a different prefix for the Custodian field on the processing data source(s) that you add to your processing set, the custodian's prefix takes precedence over the profile's.
    • The character limit for this prefix is 75.
    • Note: When Level numbering is selected, the prefix corresponds to the PPP section in the PPP.BBBB.FFFF.NNNN format and it can be used to identify the source or owner of the documents also known as ‘party code’ or ‘source’.

  • Numbering Type—determines how the documents in each data source are numbered when published to the workspace. This field gives you the option of defining your document numbering schema. It is useful in keeping your document numbering consistent when importing documents from alternate sources. The choices for this field are:
    • Auto Numbering—determines that the next published document will be identified by the next available number of that prefix.
    • Define Start Number—sets the starting number of the documents you intend to publish to the workspace.
      • Relativity uses the next available number for that prefix if the number is already published to the workspace.
      • To ensure continuity, Relativity will never assign a control number below the defined starting number in future processing sets. For example, if you define a starting number of 100, the numbers 0-99 become unavailable for future use for that prefix.

      • This option is useful when you process from a third-party tool that does not provide a suffix for your documents and you want to define a new start number for the next set of documents to keep the numbering continuous.
      • Selecting this choice makes the Default Start Number field available below and the Start Number field on the data source layout.
        • Default Start Number—the starting number for documents that are published from the processing set(s) that use this profile.
          • This field is only visible if you selected the Define Start Number choice for the Numbering Type field above.
          • If you use a different start number for the Start Number field on the data source that you attach to the processing set, that number takes precedence over the value you enter here.
          • The maximum value you can enter here is 2,147,483,647. If you enter a higher value, you'll receive an Invalid Integer warning next to the field value and you won't be able to save the profile.
      • Number of Digits—determines how many digits the document's control number contains. The range of available values is 1 to 10 when Define Start Number is selected. By default, this field is set to 10 characters.
      • Parent/Child Numbering—determines how parent and child documents are numbered relative to each other when published to the workspace. The choices for this field are as follows. For examples of each type, see Parent/child numbering type examples.
        • Suffix Always—arranges for child documents to be appended to their parent with a delimiter.
        • Continuous Always—arranges for child documents to receive a sequential control number after their parent.
        • Continuous, Suffix on Retry—arranges for child documents to receive a sequential control number after their parent except for child documents that weren't published to the workspace. When these unpublished child documents are retried and published, they will receive the parent's number with a suffix. If you resolve the error post-publish, the control number doesn’t change.
        • Note: It's possible for your workspace to contain a document family that has both suffixed and non-suffixed child documents. See Suffix special considerations for details.

      • Delimiter—the delimiter you want to appear between the different fragments of the control number of your published child documents. The choices for this field are:
        • - (hyphen)—adds a hyphen as the delimiter to the control number of child documents. For example, REL0000000001-0001-0001.
        • . (period)—adds a period as the delimiter to the control number of child documents. For example, REL0000000001.0001.0001.
        • _ (underscore)—adds an underscore as the delimiter to the control number of child documents. For example, REL0000000001_0001_0001.
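
The numbering fields above can be read as a simple formatting rule. The following Python sketch is illustrative only: the function names are hypothetical, the four-digit child suffix is inferred from the delimiter examples, and the way the next available number is found (counting upward from the start number) is an assumption rather than documented Relativity behavior.

```python
MAX_PREFIX_LENGTH = 75            # character limit for the prefix
MAX_START_NUMBER = 2_147_483_647  # largest accepted Default Start Number
DELIMITERS = ("-", ".", "_")


def format_control_number(prefix, number, digits=10, suffixes=(), delimiter="."):
    """Build a control number such as REL0000000001 or REL0000000001.0001."""
    if len(prefix) > MAX_PREFIX_LENGTH:
        raise ValueError("prefix is longer than 75 characters")
    if not 1 <= digits <= 10:
        raise ValueError("number of digits must be between 1 and 10")
    if delimiter not in DELIMITERS:
        raise ValueError("delimiter must be a hyphen, period, or underscore")
    base = f"{prefix}{number:0{digits}d}"
    # Assumption: each child level adds the delimiter plus a four-digit suffix.
    return delimiter.join([base] + [f"{s:04d}" for s in suffixes])


def next_start_number(published, start):
    """Pick the first unused number at or above the defined start number."""
    if start > MAX_START_NUMBER:
        raise ValueError("Invalid Integer: start number is too large")
    number = start
    while number in published:  # assumption: skip numbers already in use
        number += 1
    return number


print(format_control_number("REL", 1))
print(format_control_number("REL", 1, suffixes=(1, 1), delimiter="-"))
print(next_start_number({100, 101}, start=100))
```

With the default prefix REL, these calls print REL0000000001, REL0000000001-0001-0001, and 102. Because the search never moves below the start number, a start of 100 leaves 0-99 unused for that prefix.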

The Inventory | Discovery Settings category of the profile layout provides the following fields.

Inventory discovery settings

  • DeNIST—if set to Yes, processing separates and removes files found on the National Institute of Standards and Technology (NIST) list from the data you plan to process so that they don't make it into Relativity when you publish a processing set. The NIST list contains file signatures—or hash values—for millions of files that hold little evidentiary value for litigation purposes because they are not user-generated. This list may not contain every known junk or system file, so deNISTing may not remove 100% of undesirable material. If you know that the data you intend to process contains no system files, you can select No. If the DeNIST field is set to Yes on the profile but the Invariant database table is empty for the DeNIST field, you can't publish files. If the DeNIST field is set to No on the processing profile, the DeNIST filter doesn't appear by default in Inventory, and you don't have the option to add it. Likewise, if the DeNIST field is set to Yes on the profile, the corresponding filter is enabled in Inventory, and you can't disable it for that processing set. The choices for this field are:
    • Yes—removes all files found on the NIST list. You can further define DeNIST options by specifying a value for the DeNIST Mode field.
    • Note: When DeNISTing, the processing engine takes into consideration everything about the file, including extension, header information, and the content of the file itself. Even if header information is removed and the extension is changed, the engine is still able to identify and remove a NIST file. This is because it references the hashes of the system files that are found in the NIST database and matches up the hash of, for example, a Windows DLL to the hashes of known DLLs in the database table.

    • No—doesn't remove any files found on the NIST list. Files found on the NIST list are then published with the processing set.
    • Note: The same NIST list is used for all workspaces in the environment because it's stored on the worker manager server. You should not edit the NIST list. Relativity makes new versions of the NIST list available shortly after the National Software Reference Library (NSRL) releases them quarterly. Log in to the NIST Package Download webpage on the Relativity Community website to download the latest package and installer files.

  • DeNIST Mode—specify DeNIST options in your documents if DeNIST is set to Yes.
    • DeNIST all files—breaks any parent/child groups and removes any attached files found on the NIST list from your document set.
    • Do not break parent/child groups—doesn't break any parent/child groups, regardless of whether the files are on the NIST list. Any loose NIST files are removed.
  • Default OCR languages—the language used to OCR files where text extraction isn't possible, such as for image files containing text. This selection determines the default language on the processing data sources that you create and then associate with a processing set.
  • Default time zone - the time zone used to display date and time on a processed document. This selection determines the default time zone on the processing data sources that you create and then associate with a processing set. The default time zone is applied from the processing profile during the discovery stage. For more information, see Adding a processing data source.
  • Note: The processing engine discovers all natives in UTC and then converts metadata dates and times into the value you enter for the Default Time Zone field. The engine needs the time zone at the time of text extraction to write the date/time into the extracted text and automatically applies the daylight saving time for each file based on its metadata during the publishing stage.
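
The UTC-to-local conversion described in the note above can be illustrated with Python's standard zoneinfo module. This is a sketch of the general technique, not Relativity's implementation; the function name and the America/Chicago zone are chosen only for illustration.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo


def to_profile_time_zone(utc_value, zone_name):
    """Convert a UTC metadata timestamp to the profile's time zone."""
    if utc_value.tzinfo is None:
        utc_value = utc_value.replace(tzinfo=timezone.utc)  # treat naive values as UTC
    return utc_value.astimezone(ZoneInfo(zone_name))


summer = to_profile_time_zone(datetime(2023, 7, 1, 12, 0), "America/Chicago")
winter = to_profile_time_zone(datetime(2023, 1, 15, 12, 0), "America/Chicago")
print(summer.isoformat())  # 2023-07-01T07:00:00-05:00
print(winter.isoformat())  # 2023-01-15T06:00:00-06:00
```

The two results differ by an hour because daylight saving time is resolved from each file's own date, not from the date the job runs.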

The Extraction Settings category of the profile layout provides the following fields.

Note: For all text extraction methods described below, Relativity is recommended over both Native settings and dtSearch for performance and accuracy.

  • Extract children—arranges for the extraction of child items during discovery, including attachments, embedded objects and images, and other non-parent files. The options are:
    • Yes—extracts all children files during discovery so that both children and parents are included in the processing job.
    • No—does not extract children, so that only parents are included in the processing job.
    • Note: You don’t need to set the Extract children field to Yes to have the files within PST and other container files extracted and processed. This is because Relativity breaks down container files by default without the need to specify to extract children.

  • When extracting children, do not extract—exclude one or all of the following file types when extracting children. You can't make a selection here if you set the Extract children field to No.
    • MS Office embedded images—excludes images of various file types found inside Microsoft Office files—such as .jpg, .bmp, or .png in a Word file—from discovery so that embedded images aren't published separately in Relativity.
    • MS Office embedded objects—excludes objects of various file types found inside Microsoft Office files—such as an Excel spreadsheet inside a Word file—from discovery so that the embedded objects aren't published separately in Relativity. MS Office embedded objects will not have text extracted and will not be searchable.
    • Note: Relativity currently doesn't support the extraction of embedded images or objects from Visio, Project, or OpenOffice files. In addition, Relativity never extracts any embedded objects or images that were added to any files as links. For a detailed list of the Office file extensions from which Relativity does and does not extract embedded objects and images, see Microsoft Office child extraction support.

    • Email inline images—excludes images of various file types found inside emails—such as .jpg, .bmp, or .png in an email—from discovery so that inline images aren't published separately in Relativity.
    • Note: For a detailed list of the kinds of attachments that Relativity treats as inline, or embedded, images during processing, see Tracking inline/embedded images.

  • Email Output—determines the file format in which emails will be published to the workspace. The options are:
    • MSG—publishes emails which are handled as MSGs during processing as MSG.
    • MHT—converts and publishes emails which are handled as MSGs during processing as MHT.

      Note: This option affects the following file types: Outlook files, Lotus Notes files, Bloomberg files

      Note: Hashing for deduplication is performed on emails before conversion to MHT. The Processing Duplicate Hash value contains the Body, Header, Recipient, and Attachment hashes instead of the SHA256 hash used on native MHTs. After conversion, unique information from MSGs may render the same in the resulting MHT due to the file's format. An example is two MSGs that contain "[www.test.com [http//:www.test.com]" and "www.test.com<http://www.test.com/>" in their respective text. During hash generation, these MSGs result in unique body hashes. When converted to an MHT, this text renders as "www.test.com<http://www.test.com/>". You can view or map individual Body, Header, Recipient, and Attachment hashes from the Files tab.

      • This conversion happens during discovery.
      • MSG files take up unnecessary space because attachments to an MSG are stored twice, once with the MSG itself and again when they’re extracted and saved as their own records. As a result, when you convert an MSG to an MHT, you significantly reduce your file storage because MHT files do not require duplicative storage of attachments.
      • If you need to produce a native email file while excluding all privileged or irrelevant files, convert the email native from MSG to MHT by using the Email Output field. After an email is converted from MSG to MHT, the MHT email is published to the workspace separately from any attachments, reducing the chance of accidentally producing privileged attachments.
      • Once you convert an MSG file to MHT, you cannot revert this conversion after the files have been published. For a list of differences between how Relativity handles MSG and MHT files, see MSG to MHT conversion considerations.
    • Note: There is also a Yes/No Processing field called Converted Email Format that tracks whether an email was converted to MHT.

  • Excel Text Extraction Method—determines whether the processing engine uses Excel, Relativity, or dtSearch to extract text from Excel files during publish.
    • Relativity (Recommended)—Relativity uses its built-in engine to extract text from Excel files.
      Note: Using Relativity's built-in engine is the recommended method for performance and accuracy.
    • Native—Relativity uses Excel to extract text from Excel files.
    • Native (failover to dtSearch) —Relativity uses Excel to extract text from Excel files with dtSearch as a backup text extraction method if extraction fails.
    • dtSearch (failover to Native)—Relativity uses dtSearch to extract text from Excel files with Native as a backup text extraction method if extraction fails. This typically results in faster extraction speeds; however, we recommend considering some differences between dtSearch and Native extraction. For example, dtSearch doesn't support extracting the Track Changes text from Excel files. For more considerations like this, see dtSearch special considerations.
  • Excel Header/Footer Extraction—extract header and footer information from Excel files when you publish them. This is useful for instances in which the header and footer information in your Excel files is relevant to the case. This field isn't available if you selected dtSearch for the Excel Text Extraction Method field above because dtSearch automatically extracts header and footer information and places it at the end of the text; if you selected a value for this field and then select dtSearch above, your selection here is nullified. The options are:
    • Do not extract—doesn't extract any of the header or footer information from the Excel files and publishes the files with the header and footer in their normal positions. This option is selected by default; however, if you change the value for the Excel Text Extraction Method field above from dtSearch back to Native, this option will be de-selected and you'll have to select one of these options in order to save the profile.
    • Extract and place at end—extracts the header and footer information and stacks the header on top of the footer at the end of the text of each sheet of the Excel file. Note that the native file will still have its header and footer.
    • Extract and place inline (slows text extraction)—extracts the header and footer information and puts it inline into the file. The header appears inline directly above the text in each sheet of the file, while the footer appears directly below the text. This could impact text extraction performance if your data set includes many Excel files with headers and footers. Note that the native file will still have its header and footer.
  • PowerPoint Text Extraction Method—determines whether the processing engine uses PowerPoint, Relativity, or dtSearch to extract text from PowerPoint files during publish.
    • Relativity (Recommended)—Relativity uses its built-in engine to extract text from PowerPoint files.
      Note: Using Relativity's built-in engine is the recommended method for performance and accuracy.
    • Native—Relativity uses PowerPoint to extract text from PowerPoint files.
    • Native (failover to dtSearch)—Relativity uses PowerPoint to extract text from PowerPoint files with dtSearch as a backup text extraction method if extraction fails.
    • dtSearch (failover to Native)—Relativity uses dtSearch to extract text from PowerPoint files with Native as a backup text extraction method if extraction fails. This typically results in faster extraction speeds; however, we recommend considering some differences between dtSearch and Native extraction. For example, dtSearch doesn't support extracting watermarks from pre-2007 PowerPoint files, and also certain metadata fields aren't populated when using dtSearch. For more considerations like this, see dtSearch special considerations.
  • Word Text Extraction Method—determines whether the processing engine uses Word, Relativity, or dtSearch to extract text from Word files during publish.
    • Relativity (Recommended)—Relativity uses its built-in engine to extract text from Word files.
      Note: Using Relativity's built-in engine is the recommended method for performance and accuracy.
    • Native—Relativity uses Word to extract text from Word files.
    • Native (failover to dtSearch)—Relativity uses Word to extract text from Word files with dtSearch as a backup text extraction method if extraction fails.
    • dtSearch (failover to Native)—Relativity uses dtSearch to extract text from Word files with Native as a backup text extraction method if extraction fails. This typically results in faster extraction speeds; however, we recommend considering some differences between dtSearch and Native extraction. For example, dtSearch doesn't support extracting watermarks from pre-2007 Word files, and also certain metadata fields aren't populated when using dtSearch. For more considerations like this, see dtSearch special considerations.
  • OCR—select Enable to run OCR during processing. If you select Disable, Relativity won't provide any OCR text in the Extracted Text view.
  • Note: If OCR isn't essential to your processing job, it's recommended to disable the OCR field on your processing profile, as doing so can significantly reduce processing time and prevent irrelevant documents from having OCR performed on them. You can then perform OCR on only relevant documents outside of the processing job.

  • OCR Accuracy—determines the desired accuracy of your OCR results and the speed with which you want the job completed. This drop-down menu contains three options:
    • High (Slowest Speed)—Runs the OCR job with the highest accuracy and the slowest speed.
    • Medium (Average Speed)—Runs the OCR job with medium accuracy and average speed.
    • Low (Fastest Speed)—Runs the OCR job with the lowest accuracy and fastest speed.
  • OCR Text Separator—select Enable to display a separator between extracted text at the top of a page and text derived from OCR at the bottom of the page in the Extracted Text view. The separator reads as, “--- OCR From Images ---”. With the separator disabled, the OCR text will still be on the page beneath the extracted text, but there will be nothing to indicate where one begins and the other ends. By default, this option is enabled.
  • Note: When you process files with both the OCR and the OCR Text Separator fields enabled, any section of a document that required OCR includes text that says OCR from Image. This can then pollute a dtSearch index because that index is typically built off of the extracted text field, and OCR from Image is text that was not originally in the document.
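
The failover choices for the Excel, PowerPoint, and Word text extraction methods above follow a common try-then-fall-back pattern. The sketch below shows that pattern in Python under stated assumptions: the extractor functions are hypothetical stand-ins, not Relativity or dtSearch APIs, and any exception is treated as a failed extraction.

```python
def extract_with_failover(data, extractors):
    """Try each (name, extractor) pair in order; return the first success."""
    failures = []
    for name, extractor in extractors:
        try:
            return name, extractor(data)
        except Exception as exc:  # any failure triggers the next method
            failures.append(f"{name}: {exc}")
    raise RuntimeError("all extraction methods failed: " + "; ".join(failures))


# Hypothetical stand-ins for the real engines.
def native_extract(data):
    raise ValueError("application could not open the file")


def dtsearch_extract(data):
    return data.decode("utf-8", errors="replace")


# Mirrors "Native (failover to dtSearch)": Native first, dtSearch as backup.
method, text = extract_with_failover(
    b"Quarterly totals",
    [("Native", native_extract), ("dtSearch", dtsearch_extract)],
)
print(method, text)
```

Reversing the order of the list models the dtSearch (failover to Native) choice.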

The Deduplication Settings category of the profile layout provides the following fields:

  • Deduplication method—the method for separating duplicate files when they are published. During deduplication, the system compares documents based on certain characteristics and keeps just one instance of an item when two or more copies exist. The system performs deduplication against published files only; it doesn't occur during inventory or discovery. Deduplication only applies to parent files; it doesn't apply to children. If a parent is published, all of its children are also published. Select from the following options:

    Note: Don't change the deduplication method in the middle of running a processing set, as doing so could result in blank DeDuped Custodians or DeDuped paths fields after publish, when those fields would otherwise display deduplication information.

    • None—no deduplication occurs.
      • Even when you select None as the deduplication method, Relativity identifies duplicates by storing one copy of the native document on the file repository and using metadata markers for all duplicates of that document.
      • Relativity doesn't repopulate duplicate documents if you change the deduplication method from None after processing is complete. Changing the deduplication method only affects subsequent processing sets. This means that if you select global deduplication for your processing settings, you can't then tell Relativity to include all duplicates when you go to run a production.
    • Global—arranges for documents from each processing data source to be de-duplicated against all documents in all other data sources in your workspace. Selecting this makes the Propagate deduplication data field below visible and required.
    • Note: If you select Global, there should be no exact email duplicates in the workspace after you publish. The only exception is a scenario in which two different email systems are involved, and the emails are different enough that the processing engine can't exactly match them. In the rare case that this happens, you may see email duplicates in the workspace.

    • Custodial—arranges for documents from each processing data source to be de-duplicated against only documents in data sources owned by that custodian. Selecting this makes the Propagate deduplication data field below visible and required.
    • Note: Deduplication is run on custodian IDs; there's no consequence to changing a custodian's name after their files have already been published.

  • Propagate deduplication data—applies the data from the deduplication fields you mapped (Deduped Custodians, Deduped Paths, All Custodians, and All Paths) to children documents, which allows you to meet production specifications and perform searches on those fields without having to include family or overlay those fields manually. This field is only available if you selected Global or Custodial for the deduplication method above. You have the following options:
    • Select Yes to have the metadata fields you mapped populated for parent and children documents out of the following: All Custodians, Deduped Custodians, All Paths/Locations, Deduped Paths, and Dedupe Count.
    • Select No to have the following metadata fields populated for parent documents only: All Custodians, Deduped Custodians, All Paths/Locations, and Deduped Paths.
    • If you republish a processing set that originally contained a password-protected error without first resolving that error, then the deduplication data won’t be propagated correctly to the children of the document that received the error.
    • In certain cases, the Propagate deduplication data setting can override the extract children setting on your profile. For example, you have two processing sets that both contain an email message with an attachment of a Word document, Processing Set 1 and 2. You publish Processing Set 1 with the Extract children field set to Yes, which means that the Word attachment is published. You then publish Processing Set 2 with the Extract children field set to No but with the Deduplication method field set to Global and the Propagate deduplication data field set to Yes. When you do this, given that the emails are duplicates, the deduplication data is propagated to the Word attachment published in Processing Set 1, even though you didn’t extract it in Processing Set 2.
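
The difference in scope between the None, Global, and Custodial methods can be sketched in a few lines of Python. This is a simplified model, not the processing engine's logic: the ParentDoc class and its fields are hypothetical, the hash is a stand-in for the Processing Duplicate Hash, and the input list is assumed to be in publish order. Only parent documents are modeled, since children follow their parent.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ParentDoc:
    name: str
    dedupe_hash: str   # stand-in for the Processing Duplicate Hash
    custodian_id: int


def deduplicate(parents, method):
    """Return (published, duplicates) for 'None', 'Global', or 'Custodial'."""
    if method not in ("None", "Global", "Custodial"):
        raise ValueError(f"unknown deduplication method: {method}")
    seen = set()
    published, duplicates = [], []
    for doc in parents:
        if method == "None":
            published.append(doc)
            continue
        # Global compares across every custodian; Custodial only within one.
        key = doc.dedupe_hash if method == "Global" else (doc.custodian_id, doc.dedupe_hash)
        if key in seen:
            duplicates.append(doc)
        else:
            seen.add(key)
            published.append(doc)
    return published, duplicates


docs = [
    ParentDoc("mail-a.msg", "h1", custodian_id=1),
    ParentDoc("mail-b.msg", "h1", custodian_id=2),
    ParentDoc("mail-c.msg", "h1", custodian_id=1),
]
for method in ("None", "Global", "Custodial"):
    published, _ = deduplicate(docs, method)
    print(method, [d.name for d in published])
```

In this example None publishes all three emails, Global publishes only mail-a.msg, and Custodial publishes mail-a.msg and mail-b.msg because the two custodians are deduplicated separately. Keying on the custodian ID rather than the name reflects the note that renaming a custodian has no effect.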

The Publish Settings category of the profile layout provides the following fields.

  • Auto-publish set—arranges for the processing engine to automatically kick off publish after the completion of discovery, with or without errors. By default, this is set to No. Leaving this at No means that you must manually start publish.
  • Default destination folder—the folder in Relativity into which documents are placed once they're published to the workspace. This value determines the default value of the destination folder field on the processing data source. You have the option of overriding this value when you add or edit a data source on the processing set. Publish jobs read the destination folder field on the data source, not on the profile. You can select an existing folder or create a new one by right-clicking the base folder and selecting Create.
    • If the source path you selected is an individual file or a container, such as a zip, then the folder tree does not include the folder name that contains the individual file or container.
    • If the source path you selected is a folder, then the folder tree includes the name of the folder you selected.
  • Do you want to use source folder structure—maintain the folder structure of the source of the files you process when you bring these files into Relativity.
  • Note: If you select Yes for Use source folder structure, subfolders matching the source folder structure are created under this folder. See the following examples:

    Example 1 (recommended)
    - Select Source for files to process: \\server.ourcompany.com\Fileshare\Processing Data\Jones, Bob\
    - Select Destination folder for published files: Processing Workspace \ Custodians \

    Results: A subfolder named Jones, Bob is created under the Processing Workspace \ Custodians \ destination folder, resulting in the following folder structure in Relativity: Processing Workspace \ Custodians \ Jones, Bob \

    Example 2 (not recommended)
    - Select Source for files to process: \\server.ourcompany.com\Fileshare\Processing Data\Jones, Bob\
    - Select Destination folder for published files: Processing Workspace \ Custodians \ Jones, Bob \

    Results: A subfolder named Jones, Bob is created under the Processing Workspace \ Custodians \ Jones, Bob \ destination folder, resulting in the following folder structure in Relativity: Processing Workspace \ Custodians \ Jones, Bob \ Jones, Bob \. Any folder structure in the original source data is retained underneath.

    If you select No for Do you want to use source folder structure, no subfolders are created under the destination folder in Relativity. Any folder structure that may have existed in the original source data is lost.
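
The two examples above can be reproduced with a short Python sketch. The function name and parameters are hypothetical, and the sketch covers only the case where the source path is a folder; it does not model the individual-file or container case described earlier.

```python
from pathlib import PureWindowsPath


def published_folder(destination, source_folder, use_source_structure, subfolders=()):
    """Return the workspace folder path, as a list of names, for one file."""
    if not use_source_structure:
        return list(destination)  # source folder structure is dropped
    source_name = PureWindowsPath(source_folder).name
    return list(destination) + [source_name] + list(subfolders)


source = "\\\\server.ourcompany.com\\Fileshare\\Processing Data\\Jones, Bob\\"
example_1 = published_folder(["Processing Workspace", "Custodians"], source, True)
example_2 = published_folder(["Processing Workspace", "Custodians", "Jones, Bob"], source, True)
print(" \\ ".join(example_1))
print(" \\ ".join(example_2))
```

The first call yields Processing Workspace \ Custodians \ Jones, Bob, and the second yields the doubled Jones, Bob \ Jones, Bob path, which is why Example 2 is not recommended.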

Parent/child numbering type examples

To better understand how each parent/child numbering option appears for published documents, consider the following scenario.

Your data source includes an MSG file containing three Word documents, one of which is password protected:

  • MSG
    • Word Child 1
    • Word Child 2
    • Word Child 3 (password protected)
      • sub child 1
      • sub child 2

When you process the .msg file, three documents are discovered and published, and there’s an error on the one password-protected child document. You then retry discovery, and an additional two sub-child documents are discovered. You then republish the processing set, and the two new documents are published to the workspace.

If you’d chosen Suffix Always for the Parent/Child Numbering field on the profile, the identifiers of the published documents would appear as the following:

Suffix always choice

If you’d chosen Continuous Always for the Parent/Child Numbering field on the profile, the identifiers of the published documents would appear as the following:

Continuous always choice

  • In this case, the .msg file was the last document processed, and Word Child 3.docx was the first error reprocessed in a larger workspace. Thus, the sub child documents of Word Child 3.docx do not appear in the screen shot because they received sequence numbers after the last document in the set.

If you’d chosen Continuous, Suffix on Retry for the Parent/Child Numbering field on the profile, the identifiers of the published documents would appear as the following:

Continuous, suffix on retry choice

  • Suffix on retry only applies to errors that haven’t been published to the workspace. If a document has an error and has been published, it will have a continuous number. If you resolve the error post-publish, the control number doesn’t change.
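
As a rough illustration of the Suffix Always and Continuous Always choices, the Python sketch below numbers the example family from this scenario. It makes several assumptions: the family is the only content in the set, suffixes count from 0001 under each parent, and the retry step is ignored so the whole family is numbered in one pass. The identifiers it prints are the output of this sketch, not a copy of what Relativity displays.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    name: str
    children: list["Doc"] = field(default_factory=list)


def number_suffix_always(root, prefix="REL", number=1, digits=10, delimiter="."):
    """Children take the parent's control number plus a four-digit suffix."""
    result = {}

    def visit(doc, control):
        result[doc.name] = control
        for index, child in enumerate(doc.children, start=1):
            visit(child, f"{control}{delimiter}{index:04d}")

    visit(root, f"{prefix}{number:0{digits}d}")
    return result


def number_continuous_always(root, prefix="REL", number=1, digits=10):
    """Every document takes the next sequential number after its parent."""
    result = {}
    counter = number

    def visit(doc):
        nonlocal counter
        result[doc.name] = f"{prefix}{counter:0{digits}d}"
        counter += 1
        for child in doc.children:
            visit(child)

    visit(root)
    return result


family = Doc("MSG", [
    Doc("Word Child 1"),
    Doc("Word Child 2"),
    Doc("Word Child 3", [Doc("sub child 1"), Doc("sub child 2")]),
])
for name, control in number_suffix_always(family).items():
    print(name, control)
for name, control in number_continuous_always(family).items():
    print(name, control)
```

Under these assumptions, sub child 2 is numbered REL0000000001.0003.0002 with Suffix Always and REL0000000006 with Continuous Always.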

Prioritizing publishing speed special considerations

Publishing speed can be prioritized by performing one of the following actions:

  • setting the Deduplication method to None

  • setting the Create Source Folder Structure to No

Suffix special considerations

Note the following details regarding how Relativity uses suffixes:

  • For suffix child document numbering, Relativity indicates secondary levels of documents with a delimiter and another four digits appended for additional sub-levels. For example, a grandchild document with the assigned prefix REL would be numbered REL0000000001.0001.0001.
  • Note the following differences between unpublished documents and published documents with errors:
    • If a file is unpublished, and Continuous Always is the numbering option on the profile, Relativity will not add a suffix to it.
    • If a file is unpublished, and Suffix Always is the numbering option on the profile, Relativity will add a suffix to it.
    • If a file has an error and is published, and Continuous, Suffix on Retry is the numbering option on the profile, Relativity will add a suffix to it.
  • It's possible for your workspace to contain a document family that contains both suffixed and non-suffixed child documents. This can happen in the following example scenario:
    • You discover a master (level 1) MSG file that contains child (level 2) documents and grandchild (level 3) documents, none of which contain suffixes.
    • One of the child documents yields an error.
    • You retry the error child document, and in the process you discover two grandchildren.
    • The newly discovered grandchildren are suffixed because they came from an error retry job, while the master and non-error child documents remain without suffixes, based on the original discovery.
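
The suffix format described above can be sketched in a few lines. This is an illustration only, not Relativity's implementation: the function name format_control_number is hypothetical, and the ten-digit base number is an assumption inferred from the REL0000000001.0001.0001 example.

```python
def format_control_number(prefix, sequence, suffixes=(), digits=10,
                          suffix_digits=4, delimiter="."):
    """Build an identifier such as REL0000000001.0001.0001.

    digits=10 is an assumption taken from the example in the text;
    each entry in suffixes adds one delimited four-digit sub-level.
    """
    base = f"{prefix}{sequence:0{digits}d}"
    parts = [f"{s:0{suffix_digits}d}" for s in suffixes]
    return delimiter.join([base, *parts])


# Grandchild from the example in the text: one suffix per level below the parent.
print(format_control_number("REL", 1, suffixes=(1, 1)))

# A mixed family like the scenario above. The assumption here is that a
# grandchild found on retry is numbered off the errored child's identifier.
family = {
    "master.msg": format_control_number("REL", 1),
    "child_ok.docx": format_control_number("REL", 2),
    "child_error.docx": format_control_number("REL", 3),
    "grandchild_found_on_retry.txt": format_control_number("REL", 3, suffixes=(1,)),
}
for name, control_number in family.items():
    print(name, control_number)
```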

dtSearch special considerations

When you publish Word, Excel, and PowerPoint files with the text extraction method set to dtSearch on the profile, you'll typically see faster extraction speeds. Note, however, that certain file properties may not be populated in their corresponding metadata fields or included in the Extracted Text value.

The dtSearch text extraction method does not populate the following properties:

  • In Excel, Track Changes in the extracted text.
  • In Word, Has Hidden Data in the corresponding metadata field.
  • In Word, Track Changes in the corresponding metadata field.
  • In PowerPoint, Has Hidden Data in the corresponding metadata field.
  • In PowerPoint, Speaker Notes in the corresponding metadata field.

Note: The dtSearch text extraction method will display track changes extracted text in-line, but changes may be poorly formatted. The type of change made is not indicated. The Native text extraction method will append track changes extracted text in a Tracked Change section.
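
For quick reference in a review script, the list of unpopulated properties above can be encoded as a lookup. This is a minimal sketch: the names DTSEARCH_NOT_POPULATED and dtsearch_gap are hypothetical, and the data is copied only from the list above.

```python
# Properties the dtSearch method does not populate, keyed by (application, property).
# Each value names the destination that is left empty, as stated in the list above.
DTSEARCH_NOT_POPULATED = {
    ("Excel", "Track Changes"): "extracted text",
    ("Word", "Has Hidden Data"): "metadata field",
    ("Word", "Track Changes"): "metadata field",
    ("PowerPoint", "Has Hidden Data"): "metadata field",
    ("PowerPoint", "Speaker Notes"): "metadata field",
}


def dtsearch_gap(application, prop):
    """Return the destination dtSearch leaves empty, or None if the list names no gap."""
    return DTSEARCH_NOT_POPULATED.get((application, prop))


print(dtsearch_gap("Word", "Track Changes"))
print(dtsearch_gap("Excel", "Has Hidden Data"))
```

A result of None means only that the list names no gap for that pair; it does not guarantee that the property is populated.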

The following table breaks down which file properties are populated in corresponding metadata fields and/or Extracted Text for the dtSearch text extraction method:

| File type | Property | Included in dtSearch corresponding metadata field | Included in dtSearch extracted text |
|---|---|---|---|
| Excel (.xls, .xlsx) | Has Hidden Data | ✓ | ✓ |
| Excel (.xls, .xlsx) | Track Changes (inserted cell, moved cell, modified cell, cleared cell, inserted column, deleted column, inserted row, deleted row, inserted sheet, renamed sheet) | ✓ | |
| Word (.doc, .docx) | Has Hidden Data | | ✓ |
| Word (.doc, .docx) | Track Changes (insertions, deletions, moves) | | ✓ |
| PowerPoint (.ppt, .pptx) | Has Hidden Data | | ✓ |
| PowerPoint (.ppt, .pptx) | Speaker Notes | | ✓ |

Note: Check marks do not apply to .xlsb files.

Note: Relativity does not possess a comprehensive list of all differences between the Native application and dtSearch text extraction methods. For additional information, see support.dtsearch.com.

 

Text extraction method considerations

As text extraction directly impacts search results, the following table lists which features are supported by the Relativity, Native, and dtSearch methods:

FEATURES | Relativity: Excel, Word, PowerPoint features supported | Native: Excel, Word, PowerPoint features supported | dtSearch: Excel, Word, PowerPoint features supported
FEATURE DIFFERENCES
Math equations Not Supported Not Supported Not Supported
Math formulas (sum, avg, etc.) Not Supported Not Supported Not Supported Not Supported Not Supported Not Supported
SmartArt ✓ * ✓ * ✓ * ✓ * ✓ * ✓ * ✓ * ✓ * ✓ *
Speaker notes N/A N/A ✓ ** N/A N/A N/A N/A ✓ ***
Track changes N/A N/A ✓ *** ✓ *** N/A
Hidden data ✓ *** ✓ *** ✓ ***
2016+ new chart styles Not Supported Not Supported Not Supported

* Pre-2007 Office SmartArt are considered attachments and will be extracted and OCRd.

** When a header or footer is in the Speaker Notes section, field codes are not extracted.

*** For more information, see dtSearch special considerations.

 
FULLY COMPATIBLE AND SUPPORTED FEATURES
Bullet lists
Chart box
CJK and other foreign language characters
Clip art
Comments and replies
Currency format N/A N/A N/A
Date / Time format
Field codes
Footer
Header
Hidden slide N/A N/A N/A N/A N/A N/A
Macros N/A N/A N/A N/A N/A N/A
Margins / Alignment Format N/A N/A N/A N/A N/A N/A
Merged cell (horizontal) N/A N/A N/A N/A N/A N/A
Merged cell (vertical) N/A N/A N/A N/A N/A N/A
Number format (positive / negative) N/A N/A N/A N/A N/A N/A
Number format (fraction) N/A N/A N/A N/A N/A N/A
Number format (with comma) N/A N/A N/A N/A N/A N/A
Number format (with decimal point) N/A N/A N/A N/A N/A N/A
Password protected (cell level) N/A N/A N/A N/A N/A N/A
Password protected (column level) N/A N/A N/A N/A N/A N/A
Password protected (file level)
Password protected (row level) N/A N/A N/A N/A N/A N/A
Password protected (sheet / page level) N/A N/A N/A N/A N/A N/A
Phone number format N/A N/A N/A N/A N/A N/A
Pivot table N/A N/A N/A N/A N/A N/A
Right to left text format N/A N/A N/A N/A N/A N/A
Slide numbers N/A N/A N/A N/A N/A N/A N/A
Table
Text box
Transitions N/A N/A N/A N/A N/A N/A
WordArt
Word wrapping format N/A N/A N/A N/A N/A N/A