ApacheCon Europe 2012

Rhein-Neckar-Arena, Sinsheim, Germany

5–8 November 2012

Content extraction with Apache Tika

Jukka Zitting

Audience level: Intermediate
Track: Lucene, Solr & Friends

Tuesday 4:15 p.m.–5 p.m. in Press Room

Description

How do you index and search things like PDF documents, Excel spreadsheets, or Keynote presentations? The Apache Tika toolkit lets you easily extract the text content from these and dozens of other document formats. This talk shows how to use Tika to feed your Lucene- or Solr-based full-text search index.

Abstract

Apache Tika is a toolkit for detecting and extracting metadata and structured text content from various document formats using existing parser libraries. To show how the toolkit can be used with a Lucene or Solr search index, this talk covers:

  • Introduction to Apache Tika
  • Full text extraction with Tika
  • Using the Tika-based ExtractingRequestHandler in Solr
  • Integrating Tika directly with Lucene
  • Link extraction for web crawlers
  • Advanced features like forked parsing and the Tika server

This talk assumes basic knowledge of Lucene or Solr and of Java programming.