Class BoilerpipeContentHandler

java.lang.Object
de.l3s.boilerpipe.sax.BoilerpipeHTMLContentHandler
org.apache.tika.sax.boilerpipe.BoilerpipeContentHandler
All Implemented Interfaces:
ContentHandler

public class BoilerpipeContentHandler extends de.l3s.boilerpipe.sax.BoilerpipeHTMLContentHandler
Uses the boilerpipe library to automatically extract the main content from a web page.

Use this as the ContentHandler passed to HtmlParser#parse(java.io.InputStream, ContentHandler, Metadata, org.apache.tika.parser.ParseContext).

  • Constructor Details

    • BoilerpipeContentHandler

      public BoilerpipeContentHandler(ContentHandler delegate)
      Creates a new boilerpipe-based content extractor, using the DefaultExtractor extraction rules and "delegate" as the content handler.
      Parameters:
      delegate - The ContentHandler that receives the extracted content
    • BoilerpipeContentHandler

      public BoilerpipeContentHandler(Writer writer)
      Creates a content handler that writes XHTML body character events to the given writer.
      Parameters:
      writer - The Writer that receives the XHTML body character events
    • BoilerpipeContentHandler

      public BoilerpipeContentHandler(ContentHandler delegate, de.l3s.boilerpipe.BoilerpipeExtractor extractor)
      Creates a new boilerpipe-based content extractor, using the given extraction rules. The extracted main content will be passed to the content handler.
      Parameters:
      delegate - The ContentHandler that receives the extracted main content
      extractor - Extraction rules to use, e.g. ArticleExtractor
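The Writer-based constructor above describes a handler that writes XHTML body character events to a writer. As a rough illustration of what that means in SAX terms, here is a minimal JDK-only sketch. The `WriterSink` class and its `bodyText` helper are hypothetical names introduced for this example; they are not part of Tika or boilerpipe, and no main-content extraction is performed.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.Writer;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical illustration only: forwards character events that occur
// inside <body> to a Writer. This is not Tika's implementation.
public class WriterSink extends DefaultHandler {
    private final Writer writer;
    private boolean inBody;

    public WriterSink(Writer writer) {
        this.writer = writer;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        // The default JDK parser is not namespace-aware, so match on qName.
        if ("body".equals(qName)) {
            inBody = true;
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("body".equals(qName)) {
            inBody = false;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        if (!inBody) {
            return; // skip text in <head>, e.g. the title
        }
        try {
            writer.write(ch, start, length);
        } catch (IOException e) {
            throw new SAXException(e);
        }
    }

    // Convenience helper: parses well-formed XHTML and returns the body text.
    public static String bodyText(String xhtml) {
        StringWriter out = new StringWriter();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xhtml)), new WriterSink(out));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return out.toString();
    }
}
```

The delegate-based constructors follow the same chaining idea, except that events are handed to another ContentHandler instead of being written directly to a Writer.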
  • Method Details

    • isIncludeMarkup

      public boolean isIncludeMarkup()
    • setIncludeMarkup

      public void setIncludeMarkup(boolean includeMarkup)
    • getTextDocument

      public de.l3s.boilerpipe.document.TextDocument getTextDocument()
      Retrieves the built TextDocument.
      Returns:
      The TextDocument that was built
    • startDocument

      public void startDocument() throws SAXException
      Specified by:
      startDocument in interface ContentHandler
      Overrides:
      startDocument in class de.l3s.boilerpipe.sax.BoilerpipeHTMLContentHandler
      Throws:
      SAXException
    • startPrefixMapping

      public void startPrefixMapping(String prefix, String uri) throws SAXException
      Specified by:
      startPrefixMapping in interface ContentHandler
      Overrides:
      startPrefixMapping in class de.l3s.boilerpipe.sax.BoilerpipeHTMLContentHandler
      Throws:
      SAXException
    • startElement

      public void startElement(String uri, String localName, String qName, Attributes atts) throws SAXException
      Specified by:
      startElement in interface ContentHandler
      Overrides:
      startElement in class de.l3s.boilerpipe.sax.BoilerpipeHTMLContentHandler
      Throws:
      SAXException
    • characters

      public void characters(char[] chars, int offset, int length) throws SAXException
      Specified by:
      characters in interface ContentHandler
      Overrides:
      characters in class de.l3s.boilerpipe.sax.BoilerpipeHTMLContentHandler
      Throws:
      SAXException
    • endElement

      public void endElement(String uri, String localName, String qName) throws SAXException
      Specified by:
      endElement in interface ContentHandler
      Overrides:
      endElement in class de.l3s.boilerpipe.sax.BoilerpipeHTMLContentHandler
      Throws:
      SAXException
    • endDocument

      public void endDocument() throws SAXException
      Specified by:
      endDocument in interface ContentHandler
      Overrides:
      endDocument in class de.l3s.boilerpipe.sax.BoilerpipeHTMLContentHandler
      Throws:
      SAXException
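The SAX callbacks listed above arrive in a fixed order: startDocument, then nested startElement / characters / endElement events, then endDocument. A handler that selects content before passing it on has to buffer during the parse and emit to its delegate once the document ends. The sketch below shows that shape using only the JDK. `LongestBlockHandler` is a hypothetical class, and its "keep the longest text block" rule is a toy stand-in chosen for brevity; it is not boilerpipe's extraction algorithm and not Tika's code.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical illustration only: buffers text blocks while parsing and
// forwards a single selected block to the delegate at endDocument.
public class LongestBlockHandler extends DefaultHandler {
    private final ContentHandler delegate;
    private final List<String> blocks = new ArrayList<>();
    private final StringBuilder current = new StringBuilder();

    public LongestBlockHandler(ContentHandler delegate) {
        this.delegate = delegate;
    }

    @Override
    public void startDocument() {
        blocks.clear();
        current.setLength(0);
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        flush(); // an element boundary ends the current text block
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        current.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        flush();
    }

    @Override
    public void endDocument() throws SAXException {
        flush();
        // Toy selection rule (assumption): keep the longest block.
        String best = "";
        for (String block : blocks) {
            if (block.length() > best.length()) {
                best = block;
            }
        }
        // Only now does the delegate see any events.
        delegate.startDocument();
        delegate.characters(best.toCharArray(), 0, best.length());
        delegate.endDocument();
    }

    private void flush() {
        String text = current.toString().trim();
        if (!text.isEmpty()) {
            blocks.add(text);
        }
        current.setLength(0);
    }

    // Convenience helper: runs the handler over well-formed XHTML and
    // returns whatever text reached the delegate.
    public static String extract(String xhtml) {
        StringBuilder out = new StringBuilder();
        ContentHandler sink = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                out.append(ch, start, length);
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xhtml)), new LongestBlockHandler(sink));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return out.toString();
    }
}
```

Because nothing is forwarded until endDocument, a delegate attached to a handler of this kind stays empty if the parse is aborted partway through.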