Class OptimaizeLangDetector

java.lang.Object
org.apache.tika.language.detect.LanguageDetector
org.apache.tika.langdetect.optimaize.OptimaizeLangDetector

public class OptimaizeLangDetector extends LanguageDetector
Implementation of the LanguageDetector API that uses the Optimaize language-detector library (https://github.com/optimaize/language-detector).
  • Field Details

    • DEFAULT_MAX_CHARS_FOR_DETECTION

      public static final int DEFAULT_MAX_CHARS_FOR_DETECTION
    • DEFAULT_MAX_CHARS_FOR_SHORT_DETECTION

      public static final int DEFAULT_MAX_CHARS_FOR_SHORT_DETECTION
  • Constructor Details

    • OptimaizeLangDetector

      public OptimaizeLangDetector()
    • OptimaizeLangDetector

      public OptimaizeLangDetector(int maxCharsForDetection)
  • Method Details

    • loadModels

      public LanguageDetector loadModels()
      Description copied from class: LanguageDetector
      Load (or re-load) all available language models. This must be called after any settings that would impact the models being loaded (e.g. mixed language/short text), but before any of the document processing routines (below) are called. Note that it only needs to be called once.
      Specified by:
      loadModels in class LanguageDetector
      Returns:
      this
    • loadModels

      public LanguageDetector loadModels(Set<String> languages) throws IOException
      Description copied from class: LanguageDetector
      Load (or re-load) the models specified in languages. These use the ISO 639-1 names, with an optional "-" suffix for a more specific variant (e.g. "zh-CN" for Chinese in China).
      Specified by:
      loadModels in class LanguageDetector
      Parameters:
      languages - set of target languages.
      Returns:
      this
      Throws:
      IOException
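The tag format described for loadModels(Set<String>) (an ISO 639-1 code with an optional "-" suffix) has the same shape as the language tags the JDK's java.util.Locale can parse. The sketch below uses Locale.forLanguageTag to split such a tag into its language and region parts. LanguageTagSketch and its helper names are hypothetical, and whether the detector parses tags this way internally is an assumption, not something this page states.

```java
import java.util.Locale;

// Hypothetical helper, not part of Tika: splits tags such as "zh-CN"
// using only the JDK's Locale class.
final class LanguageTagSketch {

    /** Returns the language part of the tag, e.g. "zh" for "zh-CN". */
    static String languageOf(String tag) {
        return Locale.forLanguageTag(tag).getLanguage();
    }

    /** Returns the region part of the tag, or "" when none is given. */
    static String regionOf(String tag) {
        return Locale.forLanguageTag(tag).getCountry();
    }

    public static void main(String[] args) {
        System.out.println(languageOf("zh-CN")); // prints "zh"
        System.out.println(regionOf("zh-CN"));   // prints "CN"
        System.out.println(regionOf("en").isEmpty()); // prints "true"
    }
}
```

A helper like this can be handy for validating caller-supplied tags before building the Set passed to loadModels.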
    • hasModel

      public boolean hasModel(String language)
      Description copied from class: LanguageDetector
      Provide information about whether a model exists for a specific language.
      Specified by:
      hasModel in class LanguageDetector
      Parameters:
      language - ISO 639-1 name for language
      Returns:
      true if a model for this language exists.
    • setPriors

      public LanguageDetector setPriors(Map<String,Float> languageProbabilities) throws IOException
      Description copied from class: LanguageDetector
      Set the a-priori probabilities for these languages. The provided map uses the language as the key, and the probability (0.0 < probability < 1.0) of text being in that language as the value. Note that if the probabilities don't sum to 1.0, these values will be normalized.

      If hasModel() returns false for any of the languages, an IllegalArgumentException is thrown.

      Use of these probabilities is detector-specific, and thus might not impact the results at all. As such, these should be viewed as a hint.

      Specified by:
      setPriors in class LanguageDetector
      Parameters:
      languageProbabilities - Map from language to probability
      Returns:
      this
      Throws:
      IOException
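The setPriors description says that probabilities which do not sum to 1.0 are normalized. One plausible reading of that step is sketched below: each value is divided by the total. PriorNormalizationSketch is a hypothetical class written for illustration; the detector's actual normalization code is not shown on this page and may differ.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration, not Tika code: rescales prior weights so
// that they sum to 1.0 while keeping their relative proportions.
final class PriorNormalizationSketch {

    /** Returns a new map whose values are the inputs divided by their sum. */
    static Map<String, Float> normalize(Map<String, Float> priors) {
        double sum = 0.0;
        for (float p : priors.values()) {
            if (!(p > 0.0f)) {
                // Assumption: non-positive (or NaN) weights are rejected.
                throw new IllegalArgumentException("prior must be positive: " + p);
            }
            sum += p;
        }
        Map<String, Float> normalized = new LinkedHashMap<>();
        for (Map.Entry<String, Float> e : priors.entrySet()) {
            normalized.put(e.getKey(), (float) (e.getValue() / sum));
        }
        return normalized;
    }

    public static void main(String[] args) {
        Map<String, Float> priors = new LinkedHashMap<>();
        priors.put("en", 3.0f);
        priors.put("de", 1.0f);
        System.out.println(normalize(priors)); // prints {en=0.75, de=0.25}
    }
}
```

Normalizing on the caller's side before calling setPriors makes the effective weights explicit, although the detector may still treat them only as a hint.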
    • reset

      public void reset()
      Description copied from class: LanguageDetector
      Reset statistics about the current document being processed.
      Specified by:
      reset in class LanguageDetector
    • addText

      public void addText(char[] cbuf, int off, int len)
      Description copied from class: LanguageDetector
      Add statistics about this text for the current document. Note that we assume an implicit word break exists before/after each of these runs of text.
      Specified by:
      addText in class LanguageDetector
      Parameters:
      cbuf - Character buffer
      off - Offset into cbuf to first character in the run of text
      len - Number of characters in the run of text.
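The (cbuf, off, len) triple taken by addText follows the usual Java convention for character buffers: off is the index of the first character of the run and len is how many characters it spans. The sketch below shows which characters such a triple selects, using only the JDK. AddTextRunSketch is a hypothetical helper, not part of Tika.

```java
// Hypothetical illustration, not Tika code: shows which slice of a
// buffer a (cbuf, off, len) triple refers to.
final class AddTextRunSketch {

    /** Returns the run of text denoted by the triple as a String. */
    static String run(char[] cbuf, int off, int len) {
        // String(char[], int, int) throws IndexOutOfBoundsException
        // when the requested range falls outside the buffer.
        return new String(cbuf, off, len);
    }

    public static void main(String[] args) {
        char[] buffer = "Hello, world. Bonjour le monde.".toCharArray();
        // The French sentence starts at index 14; its first word has 7 characters.
        System.out.println(run(buffer, 14, 7)); // prints "Bonjour"
    }
}
```

Passing a sub-range like this lets a caller reuse one read buffer across many addText calls without copying each run into a new array.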
    • detectAll

      public List<LanguageResult> detectAll()
      Detect languages based on previously submitted text (via addText calls).
      Specified by:
      detectAll in class LanguageDetector
      Returns:
      the detected list of languages
      Throws:
      IllegalStateException - if no models have been loaded with loadModels() or loadModels(java.util.Set)
    • hasEnoughText

      public boolean hasEnoughText()
      Description copied from class: LanguageDetector
      Tell the caller whether more text is required for the current document before the language can be reliably detected.

      Implementations can override this to do early termination of stats collection, which can improve performance with longer documents.

      Note that detect() can be called even when this returns false.

      Overrides:
      hasEnoughText in class LanguageDetector
      Returns:
      true if we have enough text for reliable detection.
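The early-termination idea behind hasEnoughText can be sketched with a small stand-in class that stops accepting text once a character budget is reached. CappedCollectorSketch, its budget field and the feed helper are all hypothetical and written only for illustration; this is not the OptimaizeLangDetector implementation, and the assumption that the budget corresponds to the maxCharsForDetection constructor argument is not confirmed by this page.

```java
// Hypothetical stand-in, not Tika code: collects text up to a fixed
// character budget and reports when the budget has been filled.
final class CappedCollectorSketch {
    private final int maxChars;
    private final StringBuilder text = new StringBuilder();

    CappedCollectorSketch(int maxChars) {
        this.maxChars = maxChars;
    }

    /** Appends at most as many characters as the remaining budget allows. */
    void addText(char[] cbuf, int off, int len) {
        int room = maxChars - text.length();
        if (room > 0) {
            text.append(cbuf, off, Math.min(len, room));
        }
    }

    /** True once the budget is full, so the caller can stop feeding text. */
    boolean hasEnoughText() {
        return text.length() >= maxChars;
    }

    int length() {
        return text.length();
    }

    /** Feeds chunks until the collector has enough; returns chunks consumed. */
    static int feed(CappedCollectorSketch collector, String[] chunks) {
        int consumed = 0;
        for (String chunk : chunks) {
            if (collector.hasEnoughText()) {
                break; // early termination: skip the rest of the document
            }
            char[] buf = chunk.toCharArray();
            collector.addText(buf, 0, buf.length);
            consumed++;
        }
        return consumed;
    }

    public static void main(String[] args) {
        CappedCollectorSketch collector = new CappedCollectorSketch(10);
        String[] chunks = {"abcdef", "ghijkl", "mnopqr"};
        System.out.println(feed(collector, chunks)); // prints 2
        System.out.println(collector.length());      // prints 10
    }
}
```

The same loop shape applies to a real detector: check hasEnoughText() before each addText call and stop reading the document once it returns true.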