Language identification

Language identification examines the extracted text of each document to determine the primary language and up to two secondary languages present. This allows you to see how many languages are present in your collection and the percentage of each language in each document. You can then easily separate documents by language and batch out files to native speakers for review.

For multi-language documents, it returns the top three languages found and their approximate percentages of the total text bytes (e.g., 80% English and 20% French out of 1,000 bytes of text means about 800 bytes of English and 200 bytes of French). Language identification does not use a conceptual index; it runs as part of a Structured Analytics set. The Analytics engine is trained on the prose of each supported language. The operation analyzes each document for the following qualities to determine whether it contains a known language:
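The percentage-to-bytes arithmetic above can be sketched as follows; the function name and the example values are illustrative, not output from the product:

```python
# Hypothetical helper: convert reported language percentages back into
# approximate byte counts, using the 80/20 example from the text.
def bytes_per_language(total_text_bytes, percentages):
    """Return the approximate number of text bytes per detected language."""
    return {lang: round(total_text_bytes * pct / 100)
            for lang, pct in percentages.items()}

shares = bytes_per_language(1000, {"English": 80, "French": 20})
print(shares)  # {'English': 800, 'French': 200}
```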

  • Character set (e.g., Thai and Greek are particularly distinctive)
  • Letters and the presence or absence of accent marks
  • Spelling of words (e.g., words that end in “-ing” are likely English)

Language identification uses a naïve Bayesian classifier with one of three different token algorithms:

  • For Unicode scripts such as Greek and Thai that map one-to-one to detected languages, the script defines the result.
  • For CJK languages, single letters (rather than multi-letter combinations) are scored.
  • For all other languages, language identification ignores single letters and instead uses sequences of four letters (quadgrams).
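The quadgram case can be sketched as a toy naïve Bayes scorer. The per-language log-probability tables below are invented for illustration; a real engine would use quantized tables covering far more languages and quadgrams:

```python
# Toy quadgram-based naive Bayes language scoring. LOG_PROBS holds made-up
# log probabilities for a handful of quadgrams per language.
LOG_PROBS = {
    "english": {"tion": -2.0, "ing ": -2.2, "the ": -1.8},
    "french":  {"tion": -1.9, "ment": -2.1, "les ": -2.0},
}
UNSEEN = -8.0  # log probability assigned to quadgrams missing from a table

def quadgrams(text):
    """Return overlapping four-letter sequences from lowercased text."""
    text = text.lower()
    return [text[i:i + 4] for i in range(len(text) - 3)]

def score(text):
    """Sum quadgram log probabilities per language; the highest total wins."""
    totals = {}
    for lang, table in LOG_PROBS.items():
        totals[lang] = sum(table.get(q, UNSEEN) for q in quadgrams(text))
    return max(totals, key=totals.get), totals

best, totals = score("the meeting documentation")
print(best)  # english
```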

Punctuation and HTML tags are ignored: language identification operates exclusively on lowercase Unicode letters and marks, after expanding HTML entities (&xyz;) and deleting digits, punctuation, and <tags>. For each letter sequence, scoring uses the three to six most likely languages and their quantized log probabilities.
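The preprocessing described above can be approximated with standard library tools; this is a hedged sketch, not the engine's actual implementation:

```python
import html
import re

def prepare(raw):
    """Expand entities, strip tags, digits, and punctuation, then lowercase."""
    text = html.unescape(raw)                  # expand entities like &amp;
    text = re.sub(r"<[^>]*>", " ", text)       # delete <tags>
    text = re.sub(r"[^\w\s]|\d|_", " ", text)  # delete punctuation and digits
    return re.sub(r"\s+", " ", text).strip().lower()

print(prepare("<p>Caf&eacute; No. 42 &amp; friends!</p>"))  # café no friends
```

Note that `\w` in Python matches Unicode letters by default, so accented characters such as "é" survive the cleanup, consistent with the engine's reliance on Unicode letters and marks.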

The analysis does not use a word list or dictionary. Instead, the engine examines the writing itself to determine the language. The training corpus is manually constructed from chosen web pages for each language, then augmented by careful automated scraping of over 100 million additional web pages. Though the algorithm is designed to work best on text of at least 200 characters (about two sentences), testing has shown that it also performs well on small fragments, such as tweets.

Note: Language identification in Relativity 9+ supports 173 languages. Language identification considers all Unicode characters and understands which characters are associated with each of the supported languages. For example, Japanese uses several writing systems (Kanji, Katakana, and Hiragana), all of which are supported. See the Supported languages matrix for a complete list of languages that the language identification operation can detect.

See the following related pages: