org.apache.tika.io
Class TikaInputStream

java.lang.Object
  extended by java.io.InputStream
      extended by java.io.FilterInputStream
          extended by org.apache.tika.io.ProxyInputStream
              extended by org.apache.tika.io.TaggedInputStream
                  extended by org.apache.tika.io.TikaInputStream
All Implemented Interfaces:
java.io.Closeable

public class TikaInputStream
extends TaggedInputStream

Input stream with extended capabilities. The purpose of this class is to allow files, other resources, and related information to be associated with the InputStream instance passed through the Parser interface and other similar APIs.

TikaInputStream instances can be created using the various static get() factory methods. Most of these methods take an optional Metadata argument that is then filled with the available input metadata from the given resource. The created TikaInputStream instance keeps track of the original resource used to create it, while behaving otherwise just like a normal, buffered InputStream. A TikaInputStream instance is also guaranteed to support the mark(int) feature.

Code that wants to access the underlying file or other resources associated with a TikaInputStream should first use the get(InputStream) factory method to cast or wrap a given InputStream into a TikaInputStream instance.

Since:
Apache Tika 0.8

Field Summary
 
Fields inherited from class java.io.FilterInputStream
in
 
Method Summary
protected  void afterRead(int n)
          Invoked by the read methods after the proxied call has returned successfully.
static TikaInputStream cast(java.io.InputStream stream)
          Returns the given stream cast to a TikaInputStream, or null if the stream is not a TikaInputStream.
 void close()
          Invokes the delegate's close() method.
static TikaInputStream get(java.sql.Blob blob)
          Creates a TikaInputStream from the given database BLOB.
static TikaInputStream get(java.sql.Blob blob, Metadata metadata)
          Creates a TikaInputStream from the given database BLOB.
static TikaInputStream get(byte[] data)
          Creates a TikaInputStream from the given array of bytes.
static TikaInputStream get(byte[] data, Metadata metadata)
          Creates a TikaInputStream from the given array of bytes.
static TikaInputStream get(java.io.File file)
          Creates a TikaInputStream from the given file.
static TikaInputStream get(java.io.File file, Metadata metadata)
          Creates a TikaInputStream from the given file.
static TikaInputStream get(java.io.InputStream stream)
          Casts or wraps the given stream to a TikaInputStream instance.
static TikaInputStream get(java.io.InputStream stream, TemporaryFiles tmp)
          Deprecated. Use get(InputStream, TemporaryResources) instead.
static TikaInputStream get(java.io.InputStream stream, TemporaryResources tmp)
          Casts or wraps the given stream to a TikaInputStream instance.
static TikaInputStream get(java.net.URI uri)
          Creates a TikaInputStream from the resource at the given URI.
static TikaInputStream get(java.net.URI uri, Metadata metadata)
          Creates a TikaInputStream from the resource at the given URI.
static TikaInputStream get(java.net.URL url)
          Creates a TikaInputStream from the resource at the given URL.
static TikaInputStream get(java.net.URL url, Metadata metadata)
          Creates a TikaInputStream from the resource at the given URL.
 java.io.File getFile()
           
 java.nio.channels.FileChannel getFileChannel()
           
 long getLength()
          Returns the length (in bytes) of this stream.
 java.lang.Object getOpenContainer()
          Returns the open container object, such as a POIFS FileSystem when an OLE2 document has been detected and processed by the OLE2 detector.
 long getPosition()
          Returns the current position within the stream.
 boolean hasFile()
           
 boolean hasLength()
           
static boolean isTikaInputStream(java.io.InputStream stream)
          Checks whether the given stream is a TikaInputStream instance.
 void mark(int readlimit)
          Invokes the delegate's mark(int) method.
 boolean markSupported()
          Invokes the delegate's markSupported() method.
 int peek(byte[] buffer)
          Fills the given buffer with upcoming bytes from this stream without advancing the current stream position.
 void reset()
          Invokes the delegate's reset() method.
 void setOpenContainer(java.lang.Object container)
          Stores the open container object against the stream, e.g. after a Zip contents detector has loaded the file to decide what it contains.
 long skip(long ln)
          Invokes the delegate's skip(long) method.
 java.lang.String toString()
           
 
Methods inherited from class org.apache.tika.io.TaggedInputStream
handleIOException, isCauseOf, throwIfCauseOf
 
Methods inherited from class org.apache.tika.io.ProxyInputStream
available, beforeRead, read, read, read
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait
 

Method Detail

isTikaInputStream

public static boolean isTikaInputStream(java.io.InputStream stream)
Checks whether the given stream is a TikaInputStream instance. The given stream can be null, in which case the return value is false.

Parameters:
stream - input stream, possibly null
Returns:
true if the stream is a TikaInputStream instance, false otherwise

get

public static TikaInputStream get(java.io.InputStream stream,
                                  TemporaryResources tmp)
Casts or wraps the given stream to a TikaInputStream instance. This method can be used to access the functionality of this class even when given just a normal input stream instance.

The given TemporaryResources instance is used for any temporary files, and should be closed once the returned stream is no longer in use.

Use this method instead of the get(InputStream) alternative when you don't explicitly close the returned stream. The recommended access pattern is:

 TemporaryResources tmp = new TemporaryResources();
 try {
     TikaInputStream stream = TikaInputStream.get(..., tmp);
     // process stream but don't close it
 } finally {
     tmp.close();
 }
 

The given stream instance will not be closed when the TemporaryResources.close() method is called. The caller is expected to explicitly close the original stream when it's no longer used.

Parameters:
stream - normal input stream
Returns:
a TikaInputStream instance
Since:
Apache Tika 0.10
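
As a rough illustration of the ownership rule described for get(InputStream, TemporaryResources), the following standard-library sketch models a caller-owned tracker that deletes its temporary files on close() while leaving the original stream untouched. TempTracker and createTempFile() are hypothetical stand-ins written for this example; they are not the TemporaryResources API.

```java
import java.io.ByteArrayInputStream;
import java.io.Closeable;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

public class TempTrackerSketch {

    /** Hypothetical stand-in for a temporary resource tracker (not Tika's class). */
    public static class TempTracker implements Closeable {
        private final List<Path> files = new ArrayList<>();

        /** Creates a temporary file that is deleted when this tracker is closed. */
        public Path createTempFile() throws IOException {
            Path file = Files.createTempFile("sketch-", ".tmp");
            files.add(file);
            return file;
        }

        /** Deletes all tracked files; streams owned by the caller are not touched. */
        @Override
        public void close() throws IOException {
            for (Path file : files) {
                Files.deleteIfExists(file);
            }
            files.clear();
        }
    }

    public static void main(String[] args) throws IOException {
        InputStream original = new ByteArrayInputStream(new byte[] {1, 2, 3});
        TempTracker tmp = new TempTracker();
        Path spooled;
        try {
            spooled = tmp.createTempFile();
            Files.write(spooled, new byte[] {1, 2, 3});
        } finally {
            tmp.close(); // removes the temporary file only
        }
        System.out.println("temp file exists: " + Files.exists(spooled));
        original.close(); // the caller still closes the original stream itself
    }
}
```

The point of the split is that the temporary files and the original stream have different owners: the tracker cleans up what it created, and the caller remains responsible for the stream it opened.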

get

public static TikaInputStream get(java.io.InputStream stream,
                                  TemporaryFiles tmp)
Deprecated. Use get(InputStream, TemporaryResources) instead.


get

public static TikaInputStream get(java.io.InputStream stream)
Casts or wraps the given stream to a TikaInputStream instance. This method can be used to access the functionality of this class even when given just a normal input stream instance.

Use this method instead of the get(InputStream, TemporaryResources) alternative when you do explicitly close the returned stream. The recommended access pattern is:

 TikaInputStream stream = TikaInputStream.get(...);
 try {
     // process stream
 } finally {
     stream.close();
 }
 

The given stream instance will be closed along with any other resources associated with the returned TikaInputStream instance when the close() method is called.

Parameters:
stream - normal input stream
Returns:
a TikaInputStream instance

cast

public static TikaInputStream cast(java.io.InputStream stream)
Returns the given stream cast to a TikaInputStream, or null if the stream is not a TikaInputStream.

Parameters:
stream - normal input stream
Returns:
a TikaInputStream instance, or null if the given stream is not a TikaInputStream
Since:
Apache Tika 0.10
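
The difference between isTikaInputStream, cast, and get(InputStream) can be sketched with the standard library alone. In the example below, java.io.BufferedInputStream plays the role of the extended stream type; the class and method names are hypothetical and chosen only to mirror the three methods described above. It is an analogy, not Tika's implementation.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.InputStream;

public class CastOrWrapSketch {

    /** Type check; instanceof is false for null, so null input yields false. */
    public static boolean isBuffered(InputStream stream) {
        return stream instanceof BufferedInputStream;
    }

    /** Cast only: returns null when the stream is not already the extended type. */
    public static BufferedInputStream cast(InputStream stream) {
        if (stream instanceof BufferedInputStream) {
            return (BufferedInputStream) stream;
        }
        return null;
    }

    /** Cast or wrap: reuses an existing instance, otherwise wraps the stream. */
    public static BufferedInputStream get(InputStream stream) {
        BufferedInputStream existing = cast(stream);
        return existing != null ? existing : new BufferedInputStream(stream);
    }

    public static void main(String[] args) {
        InputStream raw = new ByteArrayInputStream(new byte[0]);
        BufferedInputStream wrapped = get(raw);
        System.out.println(isBuffered(null));        // null is never an instance
        System.out.println(cast(raw) == null);       // plain stream cannot be cast
        System.out.println(get(wrapped) == wrapped); // already wrapped: same object
    }
}
```

Reusing an existing instance avoids stacking wrappers when a stream passes through several layers that each ask for the extended type.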

get

public static TikaInputStream get(byte[] data)
Creates a TikaInputStream from the given array of bytes.

Note that you must always explicitly close the returned stream as in some cases it may end up writing the given data to a temporary file.

Parameters:
data - input data
Returns:
a TikaInputStream instance

get

public static TikaInputStream get(byte[] data,
                                  Metadata metadata)
Creates a TikaInputStream from the given array of bytes. The length of the array is stored as input metadata in the given metadata instance.

Note that you must always explicitly close the returned stream as in some cases it may end up writing the given data to a temporary file.

Parameters:
data - input data
metadata - metadata instance
Returns:
a TikaInputStream instance
Throws:
java.io.IOException

get

public static TikaInputStream get(java.io.File file)
                           throws java.io.FileNotFoundException
Creates a TikaInputStream from the given file.

Note that you must always explicitly close the returned stream to prevent leaking open file handles.

Parameters:
file - input file
Returns:
a TikaInputStream instance
Throws:
java.io.FileNotFoundException - if the file does not exist

get

public static TikaInputStream get(java.io.File file,
                                  Metadata metadata)
                           throws java.io.FileNotFoundException
Creates a TikaInputStream from the given file. The file name and length are stored as input metadata in the given metadata instance.

Note that you must always explicitly close the returned stream to prevent leaking open file handles.

Parameters:
file - input file
metadata - metadata instance
Returns:
a TikaInputStream instance
Throws:
java.io.FileNotFoundException - if the file does not exist

get

public static TikaInputStream get(java.sql.Blob blob)
                           throws java.sql.SQLException
Creates a TikaInputStream from the given database BLOB.

Note that the result set containing the BLOB may need to be kept open until the returned TikaInputStream has been processed and closed. You must also always explicitly close the returned stream as in some cases it may end up writing the blob data to a temporary file.

Parameters:
blob - database BLOB
Returns:
a TikaInputStream instance
Throws:
java.sql.SQLException - if BLOB data can not be accessed

get

public static TikaInputStream get(java.sql.Blob blob,
                                  Metadata metadata)
                           throws java.sql.SQLException
Creates a TikaInputStream from the given database BLOB. The BLOB length (if available) is stored as input metadata in the given metadata instance.

Note that the result set containing the BLOB may need to be kept open until the returned TikaInputStream has been processed and closed. You must also always explicitly close the returned stream as in some cases it may end up writing the blob data to a temporary file.

Parameters:
blob - database BLOB
metadata - metadata instance
Returns:
a TikaInputStream instance
Throws:
java.sql.SQLException - if BLOB data can not be accessed

get

public static TikaInputStream get(java.net.URI uri)
                           throws java.io.IOException
Creates a TikaInputStream from the resource at the given URI.

Note that you must always explicitly close the returned stream as in some cases it may end up writing the resource to a temporary file.

Parameters:
uri - resource URI
Returns:
a TikaInputStream instance
Throws:
java.io.IOException - if the resource can not be accessed

get

public static TikaInputStream get(java.net.URI uri,
                                  Metadata metadata)
                           throws java.io.IOException
Creates a TikaInputStream from the resource at the given URI. The available input metadata is stored in the given metadata instance.

Note that you must always explicitly close the returned stream as in some cases it may end up writing the resource to a temporary file.

Parameters:
uri - resource URI
metadata - metadata instance
Returns:
a TikaInputStream instance
Throws:
java.io.IOException - if the resource can not be accessed

get

public static TikaInputStream get(java.net.URL url)
                           throws java.io.IOException
Creates a TikaInputStream from the resource at the given URL.

Note that you must always explicitly close the returned stream as in some cases it may end up writing the resource to a temporary file.

Parameters:
url - resource URL
Returns:
a TikaInputStream instance
Throws:
java.io.IOException - if the resource can not be accessed

get

public static TikaInputStream get(java.net.URL url,
                                  Metadata metadata)
                           throws java.io.IOException
Creates a TikaInputStream from the resource at the given URL. The available input metadata is stored in the given metadata instance.

Note that you must always explicitly close the returned stream as in some cases it may end up writing the resource to a temporary file.

Parameters:
url - resource URL
metadata - metadata instance
Returns:
a TikaInputStream instance
Throws:
java.io.IOException - if the resource can not be accessed

peek

public int peek(byte[] buffer)
         throws java.io.IOException
Fills the given buffer with upcoming bytes from this stream without advancing the current stream position. The buffer is filled up unless the end of stream is encountered before that. This method will block if not enough bytes are immediately available.

Parameters:
buffer - byte buffer
Returns:
number of bytes written to the buffer
Throws:
java.io.IOException - if the stream can not be read
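
The behaviour of peek can be approximated on any stream that supports mark and reset. The sketch below uses only java.io and assumes a mark-supporting stream such as BufferedInputStream; PeekSketch and its peek helper are hypothetical names for this example and may differ from how TikaInputStream implements the method.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

public class PeekSketch {

    /**
     * Fills the buffer with upcoming bytes and then rewinds the stream,
     * so the caller's read position is unchanged.
     */
    public static int peek(InputStream in, byte[] buffer) throws IOException {
        if (!in.markSupported()) {
            throw new IOException("mark/reset not supported");
        }
        in.mark(buffer.length);
        int total = 0;
        try {
            // A single read may return fewer bytes than requested,
            // so keep reading until the buffer is full or the stream ends.
            while (total < buffer.length) {
                int n = in.read(buffer, total, buffer.length - total);
                if (n == -1) {
                    break;
                }
                total += n;
            }
        } finally {
            in.reset();
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        InputStream in = new BufferedInputStream(
                new ByteArrayInputStream(new byte[] {10, 20, 30}));
        byte[] head = new byte[2];
        int n = peek(in, head);
        System.out.println(n + " bytes peeked, next read = " + in.read());
    }
}
```

Because the stream is reset afterwards, the read that follows the peek still returns the first byte.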

getOpenContainer

public java.lang.Object getOpenContainer()
Returns the open container object, such as a POIFS FileSystem when an OLE2 document has been detected and processed by the OLE2 detector.


setOpenContainer

public void setOpenContainer(java.lang.Object container)
Stores the open container object against the stream, e.g. after a Zip contents detector has loaded the file to decide what it contains.


hasFile

public boolean hasFile()

getFile

public java.io.File getFile()
                     throws java.io.IOException
Throws:
java.io.IOException

getFileChannel

public java.nio.channels.FileChannel getFileChannel()
                                             throws java.io.IOException
Throws:
java.io.IOException

hasLength

public boolean hasLength()

getLength

public long getLength()
               throws java.io.IOException
Returns the length (in bytes) of this stream. Note that if the length was not available when this stream was instantiated, then this method will use the getFile() method to buffer the entire stream to a temporary file in order to calculate the stream length. This case will only work if the stream has not yet been consumed.

Returns:
stream length
Throws:
java.io.IOException - if the length can not be determined
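
To see why the fallback described for getLength only works on an unconsumed stream, consider this standard-library sketch. It spools whatever remains in a stream to a temporary file and reports the file size. The spooledLength helper is hypothetical, and unlike the behaviour described above it deletes the file immediately instead of keeping it for later reads.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class LengthSketch {

    /** Copies the remaining bytes of the stream to a temp file and returns their count. */
    public static long spooledLength(InputStream in) throws IOException {
        Path tmp = Files.createTempFile("length-", ".tmp");
        try {
            // createTempFile already created the file, so it must be replaced.
            Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);
            return Files.size(tmp);
        } finally {
            Files.deleteIfExists(tmp);
        }
    }

    public static void main(String[] args) throws IOException {
        InputStream fresh = new ByteArrayInputStream(new byte[1024]);
        System.out.println("fresh stream: " + spooledLength(fresh));

        InputStream partlyRead = new ByteArrayInputStream(new byte[1024]);
        partlyRead.skip(24); // bytes consumed before spooling are not counted
        System.out.println("partly consumed stream: " + spooledLength(partlyRead));
    }
}
```

Only the bytes still unread at the time of spooling end up in the file, so a partly consumed stream reports a shorter length than the original resource.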

getPosition

public long getPosition()
Returns the current position within the stream.

Returns:
stream position

skip

public long skip(long ln)
          throws java.io.IOException
Description copied from class: ProxyInputStream
Invokes the delegate's skip(long) method.

Overrides:
skip in class ProxyInputStream
Parameters:
ln - the number of bytes to skip
Returns:
the actual number of bytes skipped
Throws:
java.io.IOException - if an I/O error occurs

mark

public void mark(int readlimit)
Description copied from class: ProxyInputStream
Invokes the delegate's mark(int) method.

Overrides:
mark in class ProxyInputStream
Parameters:
readlimit - read ahead limit

markSupported

public boolean markSupported()
Description copied from class: ProxyInputStream
Invokes the delegate's markSupported() method.

Overrides:
markSupported in class ProxyInputStream
Returns:
true if mark is supported, otherwise false

reset

public void reset()
           throws java.io.IOException
Description copied from class: ProxyInputStream
Invokes the delegate's reset() method.

Overrides:
reset in class ProxyInputStream
Throws:
java.io.IOException - if an I/O error occurs

close

public void close()
           throws java.io.IOException
Description copied from class: ProxyInputStream
Invokes the delegate's close() method.

Specified by:
close in interface java.io.Closeable
Overrides:
close in class ProxyInputStream
Throws:
java.io.IOException - if an I/O error occurs

afterRead

protected void afterRead(int n)
Description copied from class: ProxyInputStream
Invoked by the read methods after the proxied call has returned successfully. The number of bytes returned to the caller (or -1 if the end of stream was reached) is given as an argument.

Subclasses can override this method to add common post-processing functionality without having to override all the read methods. The default implementation does nothing.

Note this method is not called from ProxyInputStream.skip(long) or ProxyInputStream.reset(). You need to explicitly override those methods if you want to add post-processing steps also to them.

Overrides:
afterRead in class ProxyInputStream
Parameters:
n - number of bytes read, or -1 if the end of stream was reached
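
One plausible use of the afterRead hook is tracking the stream position, as reported by getPosition(). The sketch below rebuilds the idea on top of java.io.FilterInputStream. HookedStream is a hypothetical stand-in for ProxyInputStream, and the sketch does not claim to match how TikaInputStream maintains its position; in particular, it disables mark/reset to keep the counter simple.

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

public class AfterReadSketch {

    /** Hypothetical proxy stream with an afterRead hook that counts bytes. */
    public static class HookedStream extends FilterInputStream {
        private long position = 0;

        public HookedStream(InputStream in) {
            super(in);
        }

        /** Called after each successful read; n is -1 at end of stream. */
        protected void afterRead(int n) {
            if (n != -1) {
                position += n;
            }
        }

        @Override
        public int read() throws IOException {
            int b = in.read();
            afterRead(b == -1 ? -1 : 1);
            return b;
        }

        // FilterInputStream.read(byte[]) delegates here, so it is covered too.
        @Override
        public int read(byte[] buf, int off, int len) throws IOException {
            int n = in.read(buf, off, len);
            afterRead(n);
            return n;
        }

        // The hook is not invoked for skip, so skip updates the counter itself.
        @Override
        public long skip(long n) throws IOException {
            long skipped = in.skip(n);
            position += skipped;
            return skipped;
        }

        // Mark/reset is disabled in this sketch so the counter never goes stale.
        @Override
        public boolean markSupported() {
            return false;
        }

        public long getPosition() {
            return position;
        }
    }

    public static void main(String[] args) throws IOException {
        HookedStream s = new HookedStream(new ByteArrayInputStream(new byte[10]));
        s.read();            // 1 byte
        s.read(new byte[4]); // 4 bytes
        s.skip(3);           // 3 bytes
        System.out.println("position = " + s.getPosition());
    }
}
```

Routing every read through one hook keeps the bookkeeping in a single place, which is the benefit the description above attributes to overriding afterRead instead of each read method.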

toString

public java.lang.String toString()
Overrides:
toString in class TaggedInputStream


Copyright © 2007-2011 The Apache Software Foundation. All Rights Reserved.