Setting up CJK (Chinese, Japanese, Korean) and other Unicode document workspaces in Relativity

When you work with documents that contain certain Asian character sets, a few specific settings can shorten your setup time in Relativity. This information can help you build a clean workspace whose fields accurately store the text in your Asian-language document sets.

Recipe overview

This recipe highlights items to be aware of as you work with workspaces that contain Chinese, Japanese, Korean, Thai, and other non-European character sets.

Requirements

  • Workspace access with the following permissions:
    • Field: edit/delete
    • Searches: edit/delete
  • Relativity Desktop Client

Directions

  1. In your workspace, set the Unicode field to Yes for your text and choice fields. This will ensure that you can store non-Western European characters in these fields.

  2. Set up fields such as Custodian, email fields, File Name, and Extracted Text as Unicode-compliant to capture characters accurately.
  3. Your processing vendor should have provided you with data files that are Unicode/CJK character-compliant. If this is the case, when you import the data, choose the encoding that matches the load file and any extracted text files. If the source encoding is set incorrectly, the text stored in your Relativity fields will be corrupted.
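
Before choosing an encoding, it can help to check what the vendor actually delivered. The following is a minimal Python sketch, not part of Relativity, that guesses the encoding of a file from its byte order mark (BOM) and confirms that the bytes decode cleanly. The function names guess_encoding and decodes_cleanly are hypothetical, and a file without a BOM still needs to be confirmed with the vendor.

```python
import codecs
from typing import Optional

# BOMs are checked longest first so that a UTF-32 LE file is not
# mistaken for UTF-16 LE (their first two bytes are identical).
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def guess_encoding(data: bytes) -> Optional[str]:
    """Return a Python codec name based on the BOM, or None if there is no BOM."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return None


def decodes_cleanly(data: bytes, encoding: str) -> bool:
    """Return True if every byte sequence in data is valid for the given encoding."""
    try:
        data.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False
```

To check a file, read its bytes with open(path, "rb").read() and pass them to both functions. A None result from guess_encoding only means that no BOM is present; it does not identify the encoding.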

Searching considerations

Consider the following when you use keyword (SQL) search or dtSearch on Chinese, Japanese, and Thai documents.

Keyword Search

  • Within a Relativity workspace, a system admin can select the language to use as the SQL Full Text Language. This setting determines the stemming rules and word-break characters used in the full-text index. For workspaces that contain multiple languages, Microsoft recommends setting the most complex prevalent language as the SQL Full Text Language.

  • Once you configure your SQL full-text language to the correct language (Japanese in this example), you can perform keyword searching and filtering in that language.

dtSearch

Index

Set up CJKRanges correctly in the dtSearch index (dtSearch index alphabet list: CJKRanges = 0e00-0e4e 3040-30ff 4e00-9fff) to make each Thai, Chinese, and Japanese character a separate word.
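
To see which characters a CJKRanges value covers, the ranges can be read as pairs of hexadecimal Unicode code points. The following Python sketch only illustrates how to interpret the setting; it is not how dtSearch or Relativity processes it, and the function names parse_ranges and in_ranges are hypothetical.

```python
from typing import List, Tuple


def parse_ranges(spec: str) -> List[Tuple[int, int]]:
    """Parse a space-separated list of hex ranges such as '3040-30ff'."""
    ranges = []
    for part in spec.split():
        start, end = part.split("-")
        ranges.append((int(start, 16), int(end, 16)))
    return ranges


def in_ranges(ch: str, ranges: List[Tuple[int, int]]) -> bool:
    """Return True if the character's code point falls inside any range."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in ranges)


# The value given in this recipe.
RANGES = parse_ranges("0e00-0e4e 3040-30ff 4e00-9fff")

for ch in ["ก", "あ", "所", "한", "A"]:
    print(ch, hex(ord(ch)), in_ranges(ch, RANGES))
```

With this value, the Thai, Japanese kana, and Chinese sample characters fall inside the ranges, while the Korean Hangul syllable and the Latin letter do not.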

Word/Character Search

As explained previously, you can store Chinese, Japanese, and Thai text as Unicode so that you can use dtSearch to search for words in these languages. However, while dtSearch can search for literal word matches (or wildcard or fuzzy matches), there are some limitations on the support in dtSearch for Chinese, Japanese, and Thai text. Those limitations include:

  • These languages typically do not separate the words with spaces or punctuation. Instead, the characters in a document run together, and a language-specific dictionary is needed to find word breaks. dtSearch does not have the ability to identify word breaks in these documents because it doesn't include any language-specific dictionaries. To make this type of text searchable, enable an option in dtSearch to automatically insert word breaks around Chinese, Japanese, and Thai characters. Once you enable this option, Relativity treats each character as a single word for indexing and searching purposes.
  • This feature is turned on by default in Relativity dtSearch, though you should adjust CJKRanges as described above to add Thai and remove Korean from this treatment. Because each character is searchable on its own, we recommend searching for multiple characters together to retrieve more accurate results. When you search for multiple characters, use the proximity connectors to help find the desired results.
  • The same text can be presented in different ways depending on the context. dtSearch searches for a word as it is provided in the search request and does not generate additional grammatical or script variations for words in Chinese, Japanese, and Thai.
  • The dtSearch Engine has an API that you can use to integrate with dictionary-based language analyzers from companies such as Basis Technology. However, while the standalone dtSearch desktop product allows this integration, the dtSearch instance within the Relativity environment does not support it.
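
The effect of inserting word breaks around each character can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the dtSearch implementation: it splits only on whitespace and on the character ranges from this recipe, it ignores punctuation, and it does not model how digits next to CJK characters are handled. The names tokenize and contains_adjacent are hypothetical.

```python
from typing import List

# Ranges from the CJKRanges value in this recipe (Thai, kana, CJK ideographs).
CJK_RANGES = [(0x0E00, 0x0E4E), (0x3040, 0x30FF), (0x4E00, 0x9FFF)]


def is_single_char_word(ch: str) -> bool:
    """Return True if the character should be indexed as its own word."""
    return any(lo <= ord(ch) <= hi for lo, hi in CJK_RANGES)


def tokenize(text: str) -> List[str]:
    """Split text into words, making each in-range character its own word."""
    tokens: List[str] = []
    buf: List[str] = []
    for ch in text:
        if is_single_char_word(ch) or ch.isspace():
            if buf:  # close any word that was being built
                tokens.append("".join(buf))
                buf = []
            if not ch.isspace():
                tokens.append(ch)
        else:
            buf.append(ch)
    if buf:
        tokens.append("".join(buf))
    return tokens


def contains_adjacent(tokens: List[str], term: str) -> bool:
    """Return True if the characters of term appear as consecutive tokens."""
    chars = list(term)
    n = len(chars)
    return any(tokens[i:i + n] == chars for i in range(len(tokens) - n + 1))


tokens = tokenize("東京都 office")
print(tokens)
print(contains_adjacent(tokens, "京都"), contains_adjacent(tokens, "都京"))
```

A search for the single character 京 would match any document that contains it in any word, which is why searching for several characters together, in the right order and proximity, tends to return more precise results.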

For non-Western languages (such as Chinese, Japanese, Hindi, and so forth) there are additional considerations and workarounds that may help in locating search hits.

Chinese considerations

  • Chinese characters are treated as words. Therefore, the Chinese characters do not need an extra space because each character can be compared against a list. If there is a match, then it's highlighted.
  • Numbers are not treated like Chinese characters. For example, when the number 54 is followed by one or more Chinese characters, the match must cover everything up to the next space. To get the term to highlight, add a space to the search term so that it reads 54 所[space], followed by the rest of the line.

  • The term 54 所 on its own does not match the subject, because the subject contains additional characters.
  • The Chinese characters do not need the space because each character, such as 所, is treated as its own word and therefore highlights.
  • Keyword Search – Once you configure your SQL full-text language to the correct language, you can perform keyword searching and filtering in that language.
  • dtSearch – Remember that hit highlighting and search hits are separate implementations. Thus, there are cases where a document is returned as a hit, but the highlighting does not show how the hit occurred.

Analytics considerations

  • Use language identification (in structured analytics) to identify the languages of the documents in your workspace.
  • Training your Analytics index in one language, as opposed to multiple languages, generally produces a better index. However, this can be challenging if many of your documents contain multiple languages. As a general rule, the training set should have higher purity (that is, documents predominantly in the language of choice), while the searchable documents can be more mixed-language.
  • Be sure to set up stop words appropriate for the languages in your documents. Note that Relativity does not provide stop words for languages other than English; configuration is left up to the administrator.
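
The purity rule for the training set can be expressed as a simple filter over language identification results. The following Python sketch assumes the results have been exported into a dictionary that maps each document identifier to its per-language share; that data shape, the function name select_training_docs, and the 0.9 threshold are all illustrative assumptions rather than Relativity features or recommended values.

```python
from typing import Dict, List


def select_training_docs(
    lang_shares: Dict[str, Dict[str, float]],
    language: str,
    min_share: float = 0.9,  # arbitrary example threshold, not a recommendation
) -> List[str]:
    """Return IDs of documents that are predominantly in the given language."""
    return sorted(
        doc_id
        for doc_id, shares in lang_shares.items()
        if shares.get(language, 0.0) >= min_share
    )


# Hypothetical language identification output, as fractions per document.
docs = {
    "DOC001": {"Japanese": 0.97, "English": 0.03},
    "DOC002": {"Japanese": 0.55, "English": 0.45},
    "DOC003": {"English": 1.0},
}
print(select_training_docs(docs, "Japanese"))
```

Documents that fall below the threshold, such as the mixed-language DOC002 here, can still be included in the searchable set.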

References