Class ExternalParser

All Implemented Interfaces:
Serializable, Parser

public class ExternalParser extends AbstractParser
Parser that uses an external program (like catdoc or pdf2txt) to extract text content and metadata from a given document.
  • Field Details


    • INPUT_FILE_TOKEN

      public static final String INPUT_FILE_TOKEN
      Token that, if present in the command string, is replaced with the input filename. Alternatively, the input data can be streamed over STDIN.

    • OUTPUT_FILE_TOKEN

      public static final String OUTPUT_FILE_TOKEN
      Token that, if present in the command string, is replaced with the output filename. Alternatively, the output data can be collected from STDOUT.
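The token replacement described for the two fields can be pictured with the following minimal sketch. It is not the ExternalParser implementation. The class name, the method name, and the literal token values "${INPUT}" and "${OUTPUT}" are assumptions made for illustration; real code should reference the INPUT_FILE_TOKEN and OUTPUT_FILE_TOKEN constants instead.

```java
final class TokenSubstitutionSketch {
    // Assumed placeholder values, for illustration only. Real code should use
    // ExternalParser.INPUT_FILE_TOKEN and ExternalParser.OUTPUT_FILE_TOKEN.
    static final String INPUT_TOKEN = "${INPUT}";
    static final String OUTPUT_TOKEN = "${OUTPUT}";

    /** Returns a copy of the command with each token replaced by a file path. */
    static String[] substitute(String[] command, String inputPath, String outputPath) {
        String[] result = new String[command.length];
        for (int i = 0; i < command.length; i++) {
            // String.replace performs literal (non-regex) replacement.
            result[i] = command[i]
                    .replace(INPUT_TOKEN, inputPath)
                    .replace(OUTPUT_TOKEN, outputPath);
        }
        return result;
    }
}
```

With a hypothetical command such as {"myapp", "${INPUT}", "${OUTPUT}"}, the sketch yields the same array with the two tokens swapped for the supplied paths. Arguments that contain neither token are copied unchanged.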
  • Constructor Details

    • ExternalParser

      public ExternalParser()
  • Method Details

    • check

      public static boolean check(String checkCmd, int... errorValue)
      Checks whether the command can be run. Typically used with something like "myapp --version" to verify that "myapp" is installed and on the path.
      Parameters:
      checkCmd - the check command to run
      errorValue - the exit value(s) to treat as an error
    • check

      public static boolean check(String[] checkCmd, int... errorValue)
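As a rough illustration of what such a check involves, the sketch below starts the command with the standard ProcessBuilder API and compares its exit value against the supplied error values. It is a sketch under stated assumptions, not the ExternalParser source. The class name is hypothetical, and falling back to exit value 127 when no error values are given is an assumption based on the common shell convention for "command not found".

```java
import java.io.IOException;

final class CommandCheckSketch {
    /** Returns true if the command starts and does not exit with an error value. */
    static boolean check(String[] checkCmd, int... errorValues) {
        // Assumption: with no error values given, treat 127 as the only error.
        if (errorValues.length == 0) {
            errorValues = new int[] {127};
        }
        try {
            Process process = new ProcessBuilder(checkCmd)
                    .redirectErrorStream(true)                        // merge stderr into stdout
                    .redirectOutput(ProcessBuilder.Redirect.DISCARD)  // output is not needed
                    .start();
            int exit = process.waitFor();
            for (int err : errorValues) {
                if (exit == err) {
                    return false;
                }
            }
            return true;
        } catch (IOException e) {
            // The program could not be started, for example because it is not on the path.
            return false;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```

Discarding the output keeps the child process from blocking on a full pipe, which matters for commands that print a long version banner.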
    • getSupportedTypes

      public Set<MediaType> getSupportedTypes(ParseContext context)
      Description copied from interface: Parser
      Returns the set of media types supported by this parser when used with the given parse context.
      Parameters:
      context - parse context
      Returns:
      immutable set of media types
    • getSupportedTypes

      public Set<MediaType> getSupportedTypes()
    • setSupportedTypes

      public void setSupportedTypes(Set<MediaType> supportedTypes)
    • getCommand

      public String[] getCommand()
    • setCommand

      public void setCommand(String... command)
      Sets the command to be run. The command may include INPUT_FILE_TOKEN, OUTPUT_FILE_TOKEN, or both if the external program needs filenames.
    • getIgnoredLineConsumer

      public ExternalParser.LineConsumer getIgnoredLineConsumer()
      Gets the consumer for the lines ignored by the parse functions.
      Returns:
      consumer instance
    • setIgnoredLineConsumer

      public void setIgnoredLineConsumer(ExternalParser.LineConsumer ignoredLineConsumer)
      Sets a consumer for the lines ignored by the parse functions.
      Parameters:
      ignoredLineConsumer - consumer instance
    • getMetadataExtractionPatterns

      public Map<Pattern,String> getMetadataExtractionPatterns()
    • setMetadataExtractionPatterns

      public void setMetadataExtractionPatterns(Map<Pattern,String> patterns)
      Sets the map of regular expression patterns to Metadata keys. When a pattern matches, the Metadata entry for its associated key is set. Set this to null to disable Metadata extraction.
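The following sketch shows one way a pattern-to-key map like this could be applied to the lines of a command's output. It is illustrative only and uses plain JDK types. A Map<String, String> stands in for Metadata, a Consumer<String> stands in for ExternalParser.LineConsumer, and taking the value from the first capture group is an assumption, not something this page states.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class MetadataPatternSketch {
    /**
     * Applies every pattern to every line. Matching lines set the key mapped
     * to the pattern; lines that match nothing go to the ignored-line consumer.
     */
    static Map<String, String> extract(List<String> lines,
                                       Map<Pattern, String> patterns,
                                       Consumer<String> ignored) {
        Map<String, String> metadata = new LinkedHashMap<>();
        if (patterns == null) {
            return metadata; // null disables extraction in this sketch
        }
        for (String line : lines) {
            boolean matched = false;
            for (Map.Entry<Pattern, String> entry : patterns.entrySet()) {
                Matcher m = entry.getKey().matcher(line);
                // Assumption: the value is the pattern's first capture group.
                if (m.find() && m.groupCount() >= 1) {
                    metadata.put(entry.getValue(), m.group(1));
                    matched = true;
                }
            }
            if (!matched) {
                ignored.accept(line);
            }
        }
        return metadata;
    }
}
```

For instance, a pattern such as `^Title: (.*)$` mapped to a hypothetical key "title" would turn the output line "Title: Report" into the entry title=Report, while unrelated lines would be handed to the consumer.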
    • parse

      public void parse(InputStream stream, ContentHandler handler, Metadata metadata, ParseContext context) throws IOException, SAXException, TikaException
      Executes the configured external command on the given document stream and passes the extracted content as a simple XHTML document to the given SAX content handler. Metadata is extracted only if setMetadataExtractionPatterns(Map) has been called to set patterns.
      Parameters:
      stream - the document stream (input)
      handler - handler for the XHTML SAX events (output)
      metadata - document metadata (input and output)
      context - parse context
      Throws:
      IOException - if the document stream could not be read
      SAXException - if the SAX events could not be processed
      TikaException - if the document could not be parsed
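To make "a simple XHTML document" passed to a SAX content handler more concrete, the sketch below emits a block of extracted text as a minimal sequence of SAX events using only the JDK's org.xml.sax classes. The element layout (html, body, a single p) and all class and method names are assumptions for illustration. The exact events ExternalParser produces are not specified on this page.

```java
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

final class XhtmlEmitSketch {
    static final String XHTML = "http://www.w3.org/1999/xhtml";

    /** Sends the text to the handler wrapped in html/body/p elements. */
    static void emit(String text, ContentHandler handler) throws SAXException {
        AttributesImpl none = new AttributesImpl();
        handler.startDocument();
        handler.startElement(XHTML, "html", "html", none);
        handler.startElement(XHTML, "body", "body", none);
        handler.startElement(XHTML, "p", "p", none);
        char[] chars = text.toCharArray();
        handler.characters(chars, 0, chars.length);
        handler.endElement(XHTML, "p", "p");
        handler.endElement(XHTML, "body", "body");
        handler.endElement(XHTML, "html", "html");
        handler.endDocument();
    }

    /** Collects character data, standing in for a caller-supplied handler. */
    static final class TextCollector extends DefaultHandler {
        final StringBuilder text = new StringBuilder();

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length);
        }
    }

    /** Convenience wrapper: emits the text and returns what the handler saw. */
    static String collectText(String text) {
        TextCollector collector = new TextCollector();
        try {
            emit(text, collector);
        } catch (SAXException e) {
            throw new IllegalStateException(e);
        }
        return collector.text.toString();
    }
}
```

A caller that only wants plain text can supply a handler like TextCollector and read the accumulated characters once the end of the document has been signalled.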