org.apache.tika.extractor.ParserContainerExtractor

All Implemented Interfaces: Serializable, ContainerExtractor

public class ParserContainerExtractor extends Object implements ContainerExtractor

An implementation of ContainerExtractor powered by the regular Parser API. It lets you extract all the embedded resources from container files supported by the normal Tika parsers. By default the AutoDetectParser is used, so that extraction works on the widest range of containers.
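A minimal usage sketch follows, assuming tika-core and a parser bundle are on the classpath and that EmbeddedResourceHandler exposes handle(String, MediaType, InputStream) as in the Tika 2.x API. The class name ListEmbedded, the method list, and the default file name are hypothetical. It lists the name and media type of each embedded resource, passing the extractor itself as the recurseExtractor so that nested containers are processed inline.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

import org.apache.tika.exception.TikaException;
import org.apache.tika.extractor.ContainerExtractor;
import org.apache.tika.extractor.EmbeddedResourceHandler;
import org.apache.tika.extractor.ParserContainerExtractor;
import org.apache.tika.io.TikaInputStream;
import org.apache.tika.mime.MediaType;

public class ListEmbedded {

    /** Returns "name (media type)" for each embedded resource found in the file. */
    static List<String> list(Path file) throws IOException, TikaException {
        List<String> found = new ArrayList<>();

        // The no-argument constructor uses the default AutoDetectParser.
        ContainerExtractor extractor = new ParserContainerExtractor();

        // Called once per embedded resource; here the content stream is left unread.
        EmbeddedResourceHandler handler = new EmbeddedResourceHandler() {
            @Override
            public void handle(String filename, MediaType mediaType, InputStream stream) {
                found.add(filename + " (" + mediaType + ")");
            }
        };

        // extract() consumes the stream but does not close it, so the caller closes it.
        try (TikaInputStream stream = TikaInputStream.get(file)) {
            if (extractor.isSupported(stream)) {
                extractor.extract(stream, extractor, handler);
            }
        }
        return found;
    }

    public static void main(String[] args) throws Exception {
        // "archive.zip" is a placeholder path, not a file shipped with Tika.
        Path file = Paths.get(args.length > 0 ? args[0] : "archive.zip");
        for (String entry : list(file)) {
            System.out.println(entry);
        }
    }
}
```

Which containers are reported as supported depends on the parsers available at run time; with tika-core alone, few or none may be.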

See Also:

Serialized Form

Constructor Summary

Constructors

- ParserContainerExtractor()
- ParserContainerExtractor(TikaConfig config)
- ParserContainerExtractor(Parser parser, Detector detector)

Method Summary

- void extract(TikaInputStream stream, ContainerExtractor recurseExtractor, EmbeddedResourceHandler handler)
  
  Processes a container file, and extracts all the embedded resources from within it.
- boolean isSupported(TikaInputStream input)
  
  Is this Container Extractor able to process the supplied container?

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Constructor Details
- ParserContainerExtractor
  
  public ParserContainerExtractor()
- ParserContainerExtractor
  
  public ParserContainerExtractor(TikaConfig config)
- ParserContainerExtractor
  
  public ParserContainerExtractor(Parser parser, Detector detector)

Method Details
- isSupported
  
  public boolean isSupported(TikaInputStream input) throws IOException
  
  Description copied from interface: ContainerExtractor
  
  Is this Container Extractor able to process the supplied container?
  
  Specified by:
  
  isSupported in interface ContainerExtractor
  
  Throws:
  
  IOException
- extract
  
  public void extract(TikaInputStream stream, ContainerExtractor recurseExtractor, EmbeddedResourceHandler handler) throws IOException, TikaException
  
  Description copied from interface: ContainerExtractor
  
  Processes a container file, and extracts all the embedded resources from within it.
  The EmbeddedResourceHandler you supply will be called for each embedded resource in the container. It is up to you whether you process the contents of the resource or not.
  The given document stream is consumed but not closed by this method. The responsibility to close the stream remains on the caller.
  If required, nested containers (such as a .docx within a .zip) can automatically be recursed into and processed inline. If no recurseExtractor is given, nested containers are treated like any other embedded resource.
  
  Specified by:
  
  extract in interface ContainerExtractor
  
  Parameters:
  
  stream - the document stream (input)
  
  recurseExtractor - the extractor to use on any embedded containers
  
  handler - handler for the embedded files (output)
  
  Throws:
  
  IOException - if the document stream could not be read
  
  TikaException - if the container could not be parsed
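The effect of supplying or omitting a recurseExtractor can be illustrated without Tika on the classpath. The sketch below is not Tika's implementation: it uses only java.util.zip, and the names NestedZipSketch, ResourceHandler, walk, zip and names are hypothetical stand-ins. With recursion enabled, a nested .zip is opened and its entries are reported inline; without it, the nested .zip is reported like any other embedded resource. As in the contract described above, walk consumes the stream it is given but leaves closing it to the caller.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

public class NestedZipSketch {

    /** Hypothetical stand-in for a per-resource callback. */
    interface ResourceHandler {
        void handle(String name, InputStream content) throws IOException;
    }

    /** Reports each entry; descends into nested .zip entries only when recurse is true. */
    static void walk(InputStream container, boolean recurse, ResourceHandler handler)
            throws IOException {
        // Deliberately not closed here: the caller owns the underlying stream.
        ZipInputStream zin = new ZipInputStream(container);
        ZipEntry entry;
        while ((entry = zin.getNextEntry()) != null) {
            if (entry.isDirectory()) {
                continue;
            }
            if (recurse && entry.getName().endsWith(".zip")) {
                // The current entry's bytes are themselves a container: process inline.
                walk(zin, true, handler);
            } else {
                // The handler may read the content or ignore it.
                handler.handle(entry.getName(), zin);
            }
        }
    }

    /** Builds an in-memory zip from parallel arrays of names and contents. */
    static byte[] zip(String[] names, byte[][] contents) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zout = new ZipOutputStream(bytes)) {
            for (int i = 0; i < names.length; i++) {
                zout.putNextEntry(new ZipEntry(names[i]));
                zout.write(contents[i]);
                zout.closeEntry();
            }
        }
        return bytes.toByteArray();
    }

    /** Collects the names reported for a container, closing the stream afterwards. */
    static List<String> names(byte[] container, boolean recurse) throws IOException {
        List<String> seen = new ArrayList<>();
        try (InputStream in = new ByteArrayInputStream(container)) {
            walk(in, recurse, (name, content) -> seen.add(name));
        }
        return seen;
    }

    public static void main(String[] args) throws IOException {
        byte[] text = "hello".getBytes(StandardCharsets.UTF_8);
        byte[] inner = zip(new String[] {"inner.txt"}, new byte[][] {text});
        byte[] outer = zip(new String[] {"a.txt", "nested.zip"}, new byte[][] {text, inner});

        System.out.println(names(outer, false)); // prints [a.txt, nested.zip]
        System.out.println(names(outer, true));  // prints [a.txt, inner.txt]
    }
}
```

Detecting nested containers by file extension is a simplification made for this sketch; ParserContainerExtractor relies on a Detector and a Parser instead.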
