org.apache.tika.detect.TextDetector

All Implemented Interfaces: Serializable, Detector

public class TextDetector extends Object implements Detector

Content type detection of plain text documents. This detector looks at the beginning of the document input stream and considers the document to be text if it finds no control bytes, as defined by ASCII and ASCII-compatible encodings (ISO-Latin-1, UTF-8, etc.). As a special case, a small share of control bytes (up to 2% of all characters) is tolerated if the document also contains few or no characters (less than 10%) above the 7-bit ASCII range.
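The rule described above can be sketched in plain Java. This is only an illustration under stated assumptions, not Tika's actual implementation: the class and method names are hypothetical, the set of bytes treated as control bytes is a guess, and an empty input is assumed not to count as text. Only the 2% and 10% thresholds come from the description.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical sketch of the control-byte heuristic; not Tika code. */
final class TextHeuristicSketch {

    // Assumption: tab, LF, FF, CR and ESC are ordinary text;
    // every other byte below 0x20 counts as a control byte.
    static boolean isControl(int b) {
        return b < 0x20 && b != '\t' && b != '\n' && b != '\f' && b != '\r' && b != 0x1B;
    }

    /** Classifies the bytes read from the beginning of a document. */
    static boolean looksLikeText(byte[] prefix) {
        if (prefix.length == 0) {
            return false; // Assumption: empty input is not reported as text.
        }
        int control = 0;
        int high = 0; // bytes above the 7-bit ASCII range
        for (byte raw : prefix) {
            int b = raw & 0xFF; // treat the byte as unsigned
            if (isControl(b)) {
                control++;
            } else if (b > 0x7F) {
                high++;
            }
        }
        if (control == 0) {
            return true; // main rule: no control bytes at all
        }
        // Special case: up to 2% control bytes, and less than 10% high bytes.
        return control * 100L <= prefix.length * 2L
                && high * 100L < prefix.length * 10L;
    }

    public static void main(String[] args) {
        byte[] sample = "Hello, world\n".getBytes(StandardCharsets.US_ASCII);
        System.out.println(looksLikeText(sample));
    }
}
```

The percentages are compared with integer arithmetic to avoid rounding surprises on short inputs.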

Note that text documents with a character encoding like UTF-16 are better detected with MagicDetector and an appropriate magic byte pattern.
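To illustrate why a byte pattern suits UTF-16 better: UTF-16 encodes each ASCII character with an accompanying zero byte, which a control-byte scan would likely count against the document, whereas a byte order mark at the start of the stream is an unambiguous signal. The sketch below shows such a prefix check in plain Java. It does not use MagicDetector; the class and method names are hypothetical, and it only recognizes UTF-16 text that actually begins with a byte order mark.

```java
import java.util.Arrays;

/** Hypothetical sketch of a magic byte check for UTF-16 byte order marks. */
final class Utf16BomSketch {

    private static final byte[] UTF16_BE_BOM = {(byte) 0xFE, (byte) 0xFF};
    private static final byte[] UTF16_LE_BOM = {(byte) 0xFF, (byte) 0xFE};

    /** Returns true if data begins with the given byte pattern. */
    static boolean startsWith(byte[] data, byte[] pattern) {
        if (data.length < pattern.length) {
            return false;
        }
        return Arrays.equals(Arrays.copyOf(data, pattern.length), pattern);
    }

    /** Returns true if data begins with a big- or little-endian UTF-16 BOM. */
    static boolean hasUtf16Bom(byte[] data) {
        return startsWith(data, UTF16_BE_BOM) || startsWith(data, UTF16_LE_BOM);
    }
}
```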

Since:

Apache Tika 0.3

See Also:

Serialized Form

Constructor Summary

| Constructor | Description |
| --- | --- |
| TextDetector() | Constructs a TextDetector which will look at the default number of bytes from the beginning of the document. |
| TextDetector(int bytesToTest) | Constructs a TextDetector which will look at a given number of bytes from the beginning of the document. |

Method Summary

| Modifier and Type | Method | Description |
| --- | --- | --- |
| MediaType | detect(InputStream input, Metadata metadata) | Looks at the beginning of the document input stream to determine whether the document is text or not. |

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Constructor Details
- TextDetector
  
  public TextDetector()
  
  Constructs a TextDetector which will look at the default number of bytes from the beginning of the document.
- TextDetector
  
  public TextDetector(int bytesToTest)
  
  Constructs a TextDetector which will look at a given number of bytes from the beginning of the document.
Method Details
- detect
  
  public MediaType detect(InputStream input, Metadata metadata) throws IOException
  
  Looks at the beginning of the document input stream to determine whether the document is text or not.
  
  Specified by:
  
  detect in interface Detector
  
  Parameters:
  
  input - document input stream, or null
  
  metadata - ignored
  
  Returns:
  
  "text/plain" if the input stream suggests a text document, "application/octet-stream" otherwise
  
  Throws:
  
  IOException - if the document input stream could not be read
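The contract above (null input allowed, only the beginning of the stream inspected, IOException propagated) can be sketched without Tika as follows. Everything here is hypothetical: the class name, the String return type standing in for MediaType, and the simplified rule that any control byte other than tab, LF or CR means binary. The sketch also assumes the stream supports mark/reset, so that the caller can re-read the inspected bytes afterwards.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

/** Hypothetical sketch of prefix-based detection on a stream; not Tika code. */
final class PrefixDetectSketch {

    static final String TEXT = "text/plain";
    static final String BINARY = "application/octet-stream";

    static String detect(InputStream input, int bytesToTest) throws IOException {
        if (input == null) {
            return BINARY; // a null stream cannot suggest text
        }
        input.mark(bytesToTest); // assumption: mark/reset is supported
        try {
            // Read at most bytesToTest bytes from the beginning of the stream.
            byte[] buf = new byte[bytesToTest];
            int n = 0;
            while (n < bytesToTest) {
                int r = input.read(buf, n, bytesToTest - n);
                if (r == -1) {
                    break; // end of stream reached early
                }
                n += r;
            }
            if (n == 0) {
                return BINARY; // assumption: nothing to inspect means not text
            }
            for (int i = 0; i < n; i++) {
                int b = buf[i] & 0xFF;
                // Simplified rule: any control byte except tab, LF, CR is binary.
                if (b < 0x20 && b != '\t' && b != '\n' && b != '\r') {
                    return BINARY;
                }
            }
            return TEXT;
        } finally {
            input.reset(); // leave the stream where the caller handed it over
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "plain text\n".getBytes(StandardCharsets.US_ASCII);
        System.out.println(detect(new ByteArrayInputStream(data), 512));
    }
}
```

The value 512 passed in `main` is an arbitrary choice for the example, not a documented default.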
