Class ExternalParser

All Implemented Interfaces: Serializable, Parser
public class ExternalParser extends AbstractParser
Parser that uses an external program (like catdoc or pdf2txt) to extract text content and metadata from a given document.
  • Field Details


      public static final String INPUT_FILE_TOKEN
      The token that, if present in the command string, is replaced with the input filename. Alternatively, the input data can be streamed over STDIN.
      public static final String OUTPUT_FILE_TOKEN
      The token that, if present in the command string, is replaced with the output filename. Alternatively, the output data can be collected on STDOUT.
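The substitution described for the two tokens can be sketched with the standard library alone. This is an illustration, not Tika's implementation: the placeholder strings assigned to the constants, the program name "mytool", and the helper methods substitute and usesToken are all assumptions made for the example.

```java
import java.util.Arrays;

// Illustrative sketch only; not the ExternalParser implementation.
class TokenSubstitutionSketch {
    // Assumed placeholder values, chosen for this example.
    static final String INPUT_FILE_TOKEN = "${INPUT}";
    static final String OUTPUT_FILE_TOKEN = "${OUTPUT}";

    // Returns a copy of the command with both tokens replaced by concrete paths.
    static String[] substitute(String[] command, String inputPath, String outputPath) {
        String[] resolved = new String[command.length];
        for (int i = 0; i < command.length; i++) {
            resolved[i] = command[i]
                    .replace(INPUT_FILE_TOKEN, inputPath)
                    .replace(OUTPUT_FILE_TOKEN, outputPath);
        }
        return resolved;
    }

    // True if any part of the command mentions the token. A caller could use
    // this to choose between temporary files and STDIN/STDOUT streaming.
    static boolean usesToken(String[] command, String token) {
        return Arrays.stream(command).anyMatch(part -> part.contains(token));
    }

    public static void main(String[] args) {
        // "mytool" is a hypothetical program that takes input and output paths.
        String[] command = {"mytool", INPUT_FILE_TOKEN, OUTPUT_FILE_TOKEN};
        String[] resolved = substitute(command, "/tmp/in.doc", "/tmp/out.txt");
        System.out.println(String.join(" ", resolved));
        System.out.println("reads a file: " + usesToken(command, INPUT_FILE_TOKEN));
    }
}
```

A command that contains neither token would leave both streams to STDIN and STDOUT, which is the alternative the field descriptions mention.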
  • Constructor Details

    • ExternalParser

      public ExternalParser()
  • Method Details

    • check

      public static boolean check(String checkCmd, int... errorValue)
      Checks whether the command can be run. Typically used with something like "myapp --version" to verify that "myapp" is installed and on the path.
      checkCmd - The check command to run
      errorValue - the exit values that are considered errors
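A check of this kind can be approximated with ProcessBuilder. The sketch below is an assumption about the general technique, not the library's code: it treats a command that cannot be started, or that exits with one of the listed values, as unavailable. The class name CommandCheckSketch is hypothetical.

```java
import java.io.IOException;

// Illustrative sketch only; not the ExternalParser implementation.
class CommandCheckSketch {
    // Returns true if the command starts and its exit value is not one of errorValues.
    static boolean check(String[] checkCmd, int... errorValues) {
        try {
            Process process = new ProcessBuilder(checkCmd)
                    .redirectErrorStream(true)                        // merge stderr into stdout
                    .redirectOutput(ProcessBuilder.Redirect.DISCARD)  // output is not needed
                    .start();
            int exit = process.waitFor();
            for (int err : errorValues) {
                if (exit == err) {
                    return false; // exit value was declared an error
                }
            }
            return true;
        } catch (IOException e) {
            return false; // the program could not be started, e.g. not on the path
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt status
            return false;
        }
    }

    public static void main(String[] args) {
        // Uses the running JVM's own launcher as a command that should exist.
        String javaBin = System.getProperty("java.home") + "/bin/java";
        System.out.println(check(new String[]{javaBin, "-version"}, 127));
        System.out.println(check(new String[]{"no-such-program-xyz123", "--version"}, 127));
    }
}
```

The value 127 is used here because many shells report it for "command not found"; which values count as errors depends on the program being checked.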
    • check

      public static boolean check(String[] checkCmd, int... errorValue)
    • getSupportedTypes

      public Set<MediaType> getSupportedTypes(ParseContext context)
      Description copied from interface: Parser
      Returns the set of media types supported by this parser when used with the given parse context.
      context - parse context
      Returns: immutable set of media types
    • getSupportedTypes

      public Set<MediaType> getSupportedTypes()
    • setSupportedTypes

      public void setSupportedTypes(Set<MediaType> supportedTypes)
    • getCommand

      public String[] getCommand()
    • setCommand

      public void setCommand(String... command)
      Sets the command to be run. This can include either of INPUT_FILE_TOKEN or OUTPUT_FILE_TOKEN if the command needs filenames.
    • getIgnoredLineConsumer

      public ExternalParser.LineConsumer getIgnoredLineConsumer()
      Gets the consumer for the lines ignored by the parse functions.
      Returns: the consumer instance
    • setIgnoredLineConsumer

      public void setIgnoredLineConsumer(ExternalParser.LineConsumer ignoredLineConsumer)
      Sets a consumer for the lines ignored by the parse functions.
      ignoredLineConsumer - consumer instance
    • getMetadataExtractionPatterns

      public Map<Pattern,String> getMetadataExtractionPatterns()
    • setMetadataExtractionPatterns

      public void setMetadataExtractionPatterns(Map<Pattern,String> patterns)
      Sets the map of regular expression patterns to Metadata keys. When a pattern matches, the Metadata entry for its key is set. Set this to null to disable Metadata extraction.
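To make the pattern-to-key mapping concrete, here is a minimal sketch that applies such a map to lines of text. It rests on assumptions not stated above: that each pattern carries one capturing group whose content becomes the value, and that a plain Map<String,String> can stand in for Metadata. The class and method names are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative sketch only; a plain map stands in for the Metadata object.
class MetadataPatternSketch {
    // Applies every pattern to every line; group 1 of a match becomes the value.
    static Map<String, String> extract(List<String> lines, Map<Pattern, String> patterns) {
        Map<String, String> metadata = new LinkedHashMap<>();
        if (patterns == null) {
            return metadata; // null patterns: extraction is disabled
        }
        for (String line : lines) {
            for (Map.Entry<Pattern, String> entry : patterns.entrySet()) {
                Matcher matcher = entry.getKey().matcher(line);
                if (matcher.find() && matcher.groupCount() >= 1) {
                    metadata.put(entry.getValue(), matcher.group(1));
                }
            }
        }
        return metadata;
    }

    public static void main(String[] args) {
        Map<Pattern, String> patterns = new LinkedHashMap<>();
        patterns.put(Pattern.compile("^Title:\\s*(.+)$"), "title");
        patterns.put(Pattern.compile("^Pages:\\s*(\\d+)$"), "pageCount");

        // Hypothetical output lines from an external tool.
        List<String> lines = List.of("Title: Annual Report", "Pages: 12", "Some body text");
        System.out.println(extract(lines, patterns));
    }
}
```

Lines that match no pattern are simply skipped in this sketch.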
    • parse

      public void parse(InputStream stream, ContentHandler handler, Metadata metadata, ParseContext context) throws IOException, SAXException, TikaException
      Executes the configured external command on the given document stream and passes the extracted text as a simple XHTML document to the given SAX content handler. Metadata is extracted only if setMetadataExtractionPatterns(Map) has been called to set patterns.
      stream - the document stream (input)
      handler - handler for the XHTML SAX events (output)
      metadata - document metadata (input and output)
      context - parse context
      IOException - if the document stream could not be read
      SAXException - if the SAX events could not be processed
      TikaException - if the document could not be parsed