org.apache.tika.language.detect.LanguageDetector

org.apache.tika.langdetect.opennlp.OpenNLPDetector

public class OpenNLPDetector extends LanguageDetector

This is based on OpenNLP's language detector. However, we've built our own ProbingLanguageDetector and our own language models.

To build our model, we followed OpenNLP's lead and used the Leipzig corpus as gathered and preprocessed in the big-data corpus. We removed azj, plt, sun and zsm because our models could not reliably distinguish them from related languages. We removed cmn in favor of the finer-grained zho-trad and zho-simp.

We then added the following languages from cc-100: ben-rom (Bengali Romanized), ful, gla, gug, hau, hin-rom, ibo, lin, mya-zaw, nso, orm, quz, roh, srd, ssw, tam-rom, tel-rom, tsn, urd-rom, wol, yor.

We ran our own train/devtest/test code because OpenNLP's code required more sentences/data than were available for some languages.

Please open an issue on our JIRA if we made mistakes or misunderstood something in our design choices, or if you need other languages added.

Citations for the cc-100 corpus:

Unsupervised Cross-lingual Representation Learning at Scale, Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, Veselin Stoyanov, Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL), p. 8440-8451, July 2020.

CCNet: Extracting High Quality Monolingual Datasets from Web Crawl Data, Guillaume Wenzek, Marie-Anne Lachaux, Alexis Conneau, Vishrav Chaudhary, Francisco Guzmán, Armand Joulin, Edouard Grave, Proceedings of the 12th Language Resources and Evaluation Conference (LREC), p. 4003-4012, May 2020.

Field Summary

Fields inherited from class org.apache.tika.language.detect.LanguageDetector
mixedLanguages, shortText

Constructor Summary

OpenNLPDetector()

Method Summary

| Modifier and Type | Method | Description |
| --- | --- | --- |
| void | addText(char[] cbuf, int off, int len) | Buffers text up to the maximum length set via setMaxLength(int) and ignores the rest of the text. |
| List<LanguageResult> | detectAll() | Detect languages based on previously submitted text (via addText calls). |
| String[] | getSupportedLanguages() | |
| boolean | hasModel(String language) | Provide information about whether a model exists for a specific language. |
| LanguageDetector | loadModels() | No-op. |
| LanguageDetector | loadModels(Set<String> languages) | NOT SUPPORTED. |
| void | reset() | Reset statistics about the current document being processed. |
| void | setMaxLength(int maxLength) | |
| LanguageDetector | setPriors(Map<String,Float> languageProbabilities) | NOT YET SUPPORTED. |

Methods inherited from class org.apache.tika.language.detect.LanguageDetector
addText, detect, detect, detectAll, getDefaultLanguageDetector, getLanguageDetectors, getLanguageDetectors, hasEnoughText, isMixedLanguages, isShortText, setMixedLanguages, setShortText

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
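
The buffering behavior described for addText and setMaxLength(int) can be pictured with a small stand-alone sketch. This is not Tika's implementation: the class name BoundedTextBuffer and its text() accessor are hypothetical, and the sketch assumes the limit is counted in characters. It only illustrates the idea of keeping text up to a maximum length and silently dropping whatever arrives after that.

```java
/**
 * Hypothetical stand-in (not Tika code): keeps at most maxLength characters
 * and ignores any text submitted after the limit is reached.
 */
final class BoundedTextBuffer {
    private final StringBuilder buffer = new StringBuilder();
    private int maxLength;

    BoundedTextBuffer(int maxLength) {
        setMaxLength(maxLength);
    }

    /** Sets the maximum number of characters that will be kept. */
    void setMaxLength(int maxLength) {
        if (maxLength < 0) {
            throw new IllegalArgumentException("maxLength must be >= 0");
        }
        this.maxLength = maxLength;
    }

    /** Same parameter shape as addText(char[] cbuf, int off, int len). */
    void addText(char[] cbuf, int off, int len) {
        if (off < 0 || len < 0 || off > cbuf.length - len) {
            throw new IndexOutOfBoundsException("off=" + off + ", len=" + len);
        }
        int remaining = maxLength - buffer.length();
        if (remaining <= 0) {
            return; // limit reached: the rest of the text is ignored
        }
        // Append only as many characters as still fit under the limit.
        buffer.append(cbuf, off, Math.min(len, remaining));
    }

    /** Clears the buffered text so a new document starts from empty. */
    void reset() {
        buffer.setLength(0);
    }

    /** Returns the text kept so far (accessor added for illustration only). */
    String text() {
        return buffer.toString();
    }
}
```

With a limit of 5, submitting "hello world" in two calls (the first 3 characters, then the remaining 8) leaves "hello" in the buffer. Calling reset() clears it so the next document starts empty.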

Constructor Details
- OpenNLPDetector
  
  public OpenNLPDetector()
Method Details
- loadModels
  
  public LanguageDetector loadModels() throws IOException
  
  No-op. Models are loaded statically.
  
  Specified by:
  
  loadModels in class LanguageDetector
  
  Returns:
  
  Throws:
  
  IOException
- loadModels
  
  public LanguageDetector loadModels(Set<String> languages) throws IOException
  
  NOT SUPPORTED. Throws UnsupportedOperationException.
  
  Specified by:
  
  loadModels in class LanguageDetector
  
  Parameters:
  
  languages - list of target languages.
  
  Returns:
  
  Throws:
  
  IOException
- hasModel
  
  public boolean hasModel(String language)
  
  Description copied from class: LanguageDetector
  
  Provide information about whether a model exists for a specific language.
  
  Specified by:
  
  hasModel in class LanguageDetector
  
  Parameters:
  
  language - ISO 639-1 name for language
  
  Returns:
  
  true if a model for this language exists.
- setPriors
  
  public LanguageDetector setPriors(Map<String,Float> languageProbabilities) throws IOException
  
  NOT YET SUPPORTED. Throws UnsupportedOperationException.
  
  Specified by:
  
  setPriors in class LanguageDetector
  
  Parameters:
  
  languageProbabilities - Map from language to probability
  
  Returns:
  
  Throws:
  
  IOException
- reset
  
  public void reset()
  
  Description copied from class: LanguageDetector
  
  Reset statistics about the current document being processed.
  
  Specified by:
  
  reset in class LanguageDetector
- addText
  
  public void addText(char[] cbuf, int off, int len)
  
  Buffers text up to the maximum length set via setMaxLength(int) and ignores the rest of the text.
  
  Specified by:
  
  addText in class LanguageDetector
  
  Parameters:
  
  cbuf - Character buffer
  
  off - Offset into cbuf to first character in the run of text
  
  len - Number of characters in the run of text.
- detectAll
  
  public List<LanguageResult> detectAll()
  
  Description copied from class: LanguageDetector
  
  Detect languages based on previously submitted text (via addText calls).
  
  Specified by:
  
  detectAll in class LanguageDetector
  
  Returns:
  
  list of all possible languages with at least medium confidence, sorted by confidence from highest to lowest. There will always be at least one result, which might have a confidence of NONE.
- setMaxLength
  
  public void setMaxLength(int maxLength)
- getSupportedLanguages
  
  public String[] getSupportedLanguages()
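
The return contract described for detectAll (results with at least medium confidence, sorted from highest to lowest, and never an empty list) can be sketched as follows. Everything here is hypothetical and stands in for the real types: the Confidence enum, the Result class, the selectResults helper, and the empty-language placeholder are assumptions made for illustration, not the detector's actual code.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of the detectAll return contract (not Tika code). */
final class DetectAllContractSketch {

    /** Hypothetical confidence levels, declared from weakest to strongest. */
    enum Confidence { NONE, LOW, MEDIUM, HIGH }

    /** Hypothetical result type standing in for LanguageResult. */
    static final class Result {
        final String language;
        final Confidence confidence;
        final float score;

        Result(String language, Confidence confidence, float score) {
            this.language = language;
            this.confidence = confidence;
            this.score = score;
        }
    }

    /**
     * Keeps results with at least MEDIUM confidence, sorts them by score
     * from highest to lowest, and falls back to a single NONE placeholder
     * so that the returned list is never empty.
     */
    static List<Result> selectResults(List<Result> scored) {
        List<Result> kept = new ArrayList<>();
        for (Result r : scored) {
            if (r.confidence.compareTo(Confidence.MEDIUM) >= 0) {
                kept.add(r);
            }
        }
        // Descending order: compare b to a.
        kept.sort((a, b) -> Float.compare(b.score, a.score));
        if (kept.isEmpty()) {
            // Assumed placeholder: empty language code with NONE confidence.
            kept.add(new Result("", Confidence.NONE, 0.0f));
        }
        return kept;
    }
}
```

A caller can therefore read the first element without checking for an empty list, but should still inspect its confidence before trusting the language code.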
