ApacheCon Europe 2012

Rhein-Neckar-Arena, Sinsheim, Germany

5–8 November 2012

The secrets of a file

Jukka Zitting

Audience level: Intermediate
Track: Apache Daily

Thursday 10:15 a.m.–10:45 a.m. in Level 1 Left

Description

Your files contain metadata and other information that normally remains hidden but can be extracted with the right tools. This talk shows how to use Apache Tika to detect and extract that information, and how to use it to make your applications more perceptive.

Abstract

Apache Tika is a toolkit for detecting and extracting metadata and structured text content from various documents using existing parser libraries. In this talk you'll learn to:

  • Automatically determine the type of a document
  • Extract hidden metadata from all kinds of documents
  • Understand what common bits of metadata mean and how to use them
  • Extract embedded or attached documents from within another file

This talk assumes a basic understanding of common file formats and the Internet media type system. Knowledge of Java programming is assumed for some examples, but not required for the overall presentation.
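
To give a rough feel for what "automatically determine the type of a document" involves, here is a minimal sketch in plain Java that matches the first bytes of a file against a few well-known signatures. It is not Tika's implementation and does not use the Tika API: the class name MagicSniffer and its detect method are hypothetical, and the signature table is a tiny hand-picked sample chosen for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for illustration only; not part of Apache Tika.
class MagicSniffer {

    // Media type -> leading byte signature ("magic number").
    // A small, hand-picked sample; real detectors know far more formats.
    private static final Map<String, byte[]> SIGNATURES = new LinkedHashMap<>();

    static {
        SIGNATURES.put("application/pdf",
                "%PDF-".getBytes(StandardCharsets.US_ASCII));
        SIGNATURES.put("image/png",
                new byte[] {(byte) 0x89, 'P', 'N', 'G', '\r', '\n', 0x1A, '\n'});
        SIGNATURES.put("image/gif",
                "GIF8".getBytes(StandardCharsets.US_ASCII));
        SIGNATURES.put("application/zip",
                new byte[] {'P', 'K', 3, 4});
    }

    // Returns the media type of the first matching signature,
    // or the generic fallback type when nothing matches.
    static String detect(byte[] prefix) {
        for (Map.Entry<String, byte[]> entry : SIGNATURES.entrySet()) {
            if (startsWith(prefix, entry.getValue())) {
                return entry.getKey();
            }
        }
        return "application/octet-stream";
    }

    // True when data begins with exactly the bytes in magic.
    private static boolean startsWith(byte[] data, byte[] magic) {
        if (data.length < magic.length) {
            return false;
        }
        for (int i = 0; i < magic.length; i++) {
            if (data[i] != magic[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] sample = "%PDF-1.7\n".getBytes(StandardCharsets.US_ASCII);
        System.out.println(detect(sample)); // prints "application/pdf"
    }
}
```

A signature check like this only identifies the outer container. Many document formats are ZIP archives underneath, so a match on application/zip says little about what the file really is, which is one reason a dedicated toolkit is worth using for type detection.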