Content extraction with Apache Tika

Jukka Zitting

Audience level: Intermediate
Track: Lucene, Solr & Friends

Tuesday 4:15 p.m.–5 p.m. in Press Room

Description

How do you index and search things like PDF documents, Excel spreadsheets, or Keynote presentations? The Apache Tika toolkit lets you easily extract the text content from these and dozens of other document formats. This talk shows how to use Tika to feed your Lucene- or Solr-based full-text search index.

Abstract

Apache Tika is a toolkit for detecting and extracting metadata and structured text content from various documents using existing parser libraries. To show how the toolkit can be used with a Lucene or Solr search index, this talk covers:

Introduction to Apache Tika
Full text extraction with Tika
Using the Tika-based ExtractingRequestHandler in Solr
Integrating Tika directly with Lucene
Link extraction for web crawlers
Advanced features like forked parsing and the Tika server

This talk assumes basic knowledge of Lucene or Solr and of Java programming.

ApacheCon Europe 2012

Rhein-Neckar-Arena, Sinsheim, Germany

5–8 November 2012
