Package org.apache.lucene.analysis.ja
Class JapaneseTokenizer
java.lang.Object
  org.apache.lucene.util.AttributeSource
    org.apache.lucene.analysis.TokenStream
      org.apache.lucene.analysis.Tokenizer
        org.apache.lucene.analysis.ja.JapaneseTokenizer
- All Implemented Interfaces:
Closeable, AutoCloseable
public final class JapaneseTokenizer extends Tokenizer
Tokenizer for Japanese that uses morphological analysis.
This tokenizer sets a number of additional attributes:
- BaseFormAttribute: contains the base form for inflected adjectives and verbs.
- PartOfSpeechAttribute: contains the part of speech.
- ReadingAttribute: contains the reading and pronunciation.
- InflectionAttribute: contains additional part-of-speech information for inflected forms.
This tokenizer uses a rolling Viterbi search to find the least-cost segmentation (path) of the incoming characters. For tokens that appear to be compound (longer than 2 characters if all Kanji, or longer than 7 characters otherwise), it checks whether a second-best segmentation of that token exists after penalties are applied to the long tokens. If one exists and the Mode is JapaneseTokenizer.Mode.SEARCH, the alternate segmentation is output as well.
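To make the least-cost search described above concrete, here is a minimal sketch of Viterbi-style segmentation over a toy dictionary. It is not Lucene's implementation: the class name `LeastCostSegmenterSketch`, the word costs, and `UNKNOWN_CHAR_COST` are hypothetical, connection costs between adjacent tokens are ignored, and the whole input is processed at once rather than with a rolling window. The sample text is romanized only so that the source stays ASCII.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy least-cost segmentation; an illustration only, not Lucene's algorithm. */
final class LeastCostSegmenterSketch {

    /** Assumed cost of emitting a single character that no dictionary word covers. */
    static final int UNKNOWN_CHAR_COST = 10_000;

    /**
     * Returns the segmentation of text with the lowest total word cost.
     * wordCosts is a hypothetical dictionary mapping surface forms to costs.
     * Indices are UTF-16 units, which is sufficient for this sketch.
     */
    static List<String> segment(String text, Map<String, Integer> wordCosts) {
        int n = text.length();
        long[] best = new long[n + 1]; // best[i]: minimal cost of segmenting text[0, i)
        int[] back = new int[n + 1];   // back[i]: start index of the last token on that path
        Arrays.fill(best, Long.MAX_VALUE);
        best[0] = 0;

        for (int end = 1; end <= n; end++) {
            for (int start = 0; start < end; start++) {
                Integer cost = wordCosts.get(text.substring(start, end));
                if (cost == null) {
                    if (end - start != 1) {
                        continue; // unknown multi-character spans are not allowed
                    }
                    cost = UNKNOWN_CHAR_COST; // fall back to one unknown character
                }
                long total = best[start] + cost;
                if (total < best[end]) {
                    best[end] = total;
                    back[end] = start;
                }
            }
        }

        // Walk the back pointers from the end of the text to recover the tokens.
        List<String> tokens = new ArrayList<>();
        for (int end = n; end > 0; end = back[end]) {
            tokens.add(text.substring(back[end], end));
        }
        Collections.reverse(tokens);
        return tokens;
    }

    public static void main(String[] args) {
        Map<String, Integer> costs = new HashMap<>();
        costs.put("kansai", 100);
        costs.put("kokusai", 100);
        costs.put("kuko", 100);
        costs.put("kansaikokusaikuko", 250); // one long entry, cheaper than its three parts

        System.out.println(segment("kansaikokusaikuko", costs)); // prints [kansaikokusaikuko]

        costs.put("kansaikokusaikuko", 400); // simulate a penalty on the long token
        System.out.println(segment("kansaikokusaikuko", costs)); // prints [kansai, kokusai, kuko]
    }
}
```

Raising the cost of the long entry from 250 to 400 flips the result to the three shorter words, which is the kind of effect a penalty on long tokens is meant to produce.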
-
-
Nested Class Summary
Nested Classes
Modifier and Type    Class                     Description
static class         JapaneseTokenizer.Mode    Tokenization mode: this determines how the tokenizer handles compound and unknown words.
static class         JapaneseTokenizer.Type    Token type reflecting the original source of this token.
-
Nested classes/interfaces inherited from class org.apache.lucene.util.AttributeSource
AttributeSource.State
-
-
Field Summary
Fields
Modifier and Type                Field           Description
static JapaneseTokenizer.Mode    DEFAULT_MODE    Default tokenization mode.
-
Fields inherited from class org.apache.lucene.analysis.TokenStream
DEFAULT_TOKEN_ATTRIBUTE_FACTORY
-
-
Constructor Summary
Constructors
Constructor and Description
JapaneseTokenizer(UserDictionary userDictionary, boolean discardPunctuation, JapaneseTokenizer.Mode mode)
    Create a new JapaneseTokenizer.
JapaneseTokenizer(AttributeFactory factory, UserDictionary userDictionary, boolean discardPunctuation, JapaneseTokenizer.Mode mode)
    Create a new JapaneseTokenizer.
-
Method Summary
Modifier and Type    Method                                            Description
int                  calcNBestCost(String examples)
void                 close()
void                 end()
boolean              incrementToken()
void                 reset()
void                 setGraphvizFormatter(GraphvizFormatter dotOut)    Expert: set this to produce graphviz (dot) output of the Viterbi lattice.
void                 setNBestCost(int value)
-
Methods inherited from class org.apache.lucene.analysis.Tokenizer
correctOffset, setReader
-
Methods inherited from class org.apache.lucene.util.AttributeSource
addAttribute, addAttributeImpl, captureState, clearAttributes, cloneAttributes, copyTo, endAttributes, equals, getAttribute, getAttributeClassesIterator, getAttributeFactory, getAttributeImplsIterator, hasAttribute, hasAttributes, hashCode, reflectAsString, reflectWith, removeAllAttributes, restoreState, toString
-
Field Detail
-
DEFAULT_MODE
public static final JapaneseTokenizer.Mode DEFAULT_MODE
Default tokenization mode. Currently this is JapaneseTokenizer.Mode.SEARCH.
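With SEARCH as the default mode, the compound check from the class description (longer than 2 characters for all-Kanji tokens, longer than 7 otherwise) decides which tokens are examined for an alternate segmentation. The following is a minimal sketch of that length rule only. The class name `CompoundCandidateSketch` and both method names are hypothetical, and treating "Kanji" as the Unicode HAN script is an assumption; the tokenizer itself may use a different test. Sample words are written as Unicode escapes so that the source stays ASCII.

```java
/** Sketch of the "appears to be compound" length rule; not Lucene's code. */
final class CompoundCandidateSketch {

    /** Thresholds taken from the class description. */
    static final int KANJI_LENGTH_THRESHOLD = 2;
    static final int OTHER_LENGTH_THRESHOLD = 7;

    /** Assumption: a character counts as Kanji if its Unicode script is HAN. */
    static boolean isAllKanji(String token) {
        return !token.isEmpty()
            && token.codePoints()
                    .allMatch(cp -> Character.UnicodeScript.of(cp) == Character.UnicodeScript.HAN);
    }

    /** True if the token is long enough to be treated as a possible compound. */
    static boolean looksCompound(String token) {
        int length = token.codePointCount(0, token.length());
        int threshold = isAllKanji(token) ? KANJI_LENGTH_THRESHOLD : OTHER_LENGTH_THRESHOLD;
        return length > threshold;
    }

    public static void main(String[] args) {
        // Six Kanji ("Kansai International Airport"): longer than 2, so a candidate.
        System.out.println(looksCompound("\u95A2\u897F\u56FD\u969B\u7A7A\u6E2F")); // prints true
        // Two Kanji ("airport"): not longer than 2.
        System.out.println(looksCompound("\u7A7A\u6E2F")); // prints false
        // Non-Kanji tokens use the threshold of 7.
        System.out.println(looksCompound("tokenize")); // prints true (8 characters)
        System.out.println(looksCompound("lucene"));   // prints false (6 characters)
    }
}
```

The sketch only flags candidates. Whether an alternate segmentation is actually emitted still depends on a second-best path existing after penalties are applied.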
-
-
Constructor Detail
-
JapaneseTokenizer
public JapaneseTokenizer(UserDictionary userDictionary, boolean discardPunctuation, JapaneseTokenizer.Mode mode)
Create a new JapaneseTokenizer. Uses the default AttributeFactory.
- Parameters:
userDictionary - Optional: if non-null, user dictionary.
discardPunctuation - true if punctuation tokens should be dropped from the output.
mode - tokenization mode.
-
JapaneseTokenizer
public JapaneseTokenizer(AttributeFactory factory, UserDictionary userDictionary, boolean discardPunctuation, JapaneseTokenizer.Mode mode)
Create a new JapaneseTokenizer.
- Parameters:
factory - the AttributeFactory to use.
userDictionary - Optional: if non-null, user dictionary.
discardPunctuation - true if punctuation tokens should be dropped from the output.
mode - tokenization mode.
-
-
Method Detail
-
setGraphvizFormatter
public void setGraphvizFormatter(GraphvizFormatter dotOut)
Expert: set this to produce graphviz (dot) output of the Viterbi lattice.
-
close
public void close() throws IOException
- Specified by:
close in interface AutoCloseable
- Specified by:
close in interface Closeable
- Overrides:
close in class Tokenizer
- Throws:
IOException
-
reset
public void reset() throws IOException
- Overrides:
reset in class Tokenizer
- Throws:
IOException
-
end
public void end() throws IOException
- Overrides:
end in class TokenStream
- Throws:
IOException
-
incrementToken
public boolean incrementToken() throws IOException
- Specified by:
incrementToken in class TokenStream
- Throws:
IOException
-
calcNBestCost
public int calcNBestCost(String examples)
-
setNBestCost
public void setNBestCost(int value)
-
-