ApacheCon Europe 2012

Rhein-Neckar-Arena, Sinsheim, Germany

5–8 November 2012

Extracting Custom Entities with the Stanbol Enhancer

Rupert Westenthaler

Audience level:
Intermediate
Track:
Linked Data

Wednesday 3:45 p.m.–4:30 p.m. in Level 2 Left

Description

Apache Stanbol is a set of reusable components for extending content management systems (CMS) with semantic features. This talk shows how to extract domain- or company-specific Entities (e.g. Contacts, Products) from documents using the Stanbol Enhancer.

Abstract

Extracting custom Entities with the Stanbol Enhancer

Intention of the Talk

This talk will mainly cover the Content Enhancement component of Apache Stanbol. By default, Content Enhancement extracts Persons, Organizations and Places from parsed Content and links them to the Entities defined by DBpedia, the Linked Data version of Wikipedia. Most users, however, need to extract domain-specific Entities (e.g. Drugs, Ingredients, Side Effects, ... in the Life Science domain) or want to link to custom Entities such as Customers, Products, Projects, or Tags used by their Intranet Users.

The intention of this talk is to cover all steps necessary to customize the Stanbol Enhancer for these kinds of use cases.

Outline of the Talk

Following a short overview of Stanbol and the Stanbol Enhancer, this talk will be structured around the following three topics:

  • Using Domain Vocabularies: This part will explain how to index, share and load domain vocabularies into the Stanbol Entityhub. A domain vocabulary is a set of Entities (e.g. a Thesaurus) that is managed by some Organization for a Domain (e.g. e-Government datasets, a list of authorized Drugs).
  • Manage Custom Entities: This section shows how to manage user-specific or frequently updated Entities. This makes it possible, for example, to immediately sync a created or changed Entity (e.g. a Contact in a CRM, or a Tag in a CMS) so that the Stanbol Enhancer can find and link it in processed documents.
  • Configure the Content Enhancement: This section will focus on how to use imported/managed vocabularies for content enhancement and provide an overview of the available Enhancement Engines and their usage. In addition, this section will also discuss multilingual support.
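
As an illustration of the "Manage Custom Entities" step, the following minimal Python sketch builds (but does not send) a request that would push a single Contact to the Entityhub. The base URL, the `/entityhub/entity` path, the `id` and `update` query parameters, and the use of Turtle as the payload format are assumptions about a default local Stanbol installation and should be checked against the documentation of the version in use. The helper names and the example URI are hypothetical.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Assumption: a Stanbol launcher running locally with default settings.
ENTITYHUB_ENTITY_URL = "http://localhost:8080/entityhub/entity"


def contact_to_turtle(uri, name):
    """Serialize a contact as a tiny Turtle document (hypothetical helper)."""
    # Escape backslashes and double quotes so the label is a valid Turtle string.
    escaped = name.replace("\\", "\\\\").replace('"', '\\"')
    return (
        "@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .\n"
        "@prefix foaf: <http://xmlns.com/foaf/0.1/> .\n"
        f"<{uri}> a foaf:Person ;\n"
        f'    rdfs:label "{escaped}" .\n'
    )


def build_sync_request(uri, name, base_url=ENTITYHUB_ENTITY_URL):
    """Build a POST request that creates or updates one entity.

    The 'id' and 'update' parameters are assumptions about the Entityhub API.
    """
    query = urlencode({"id": uri, "update": "true"})
    return Request(
        f"{base_url}?{query}",
        data=contact_to_turtle(uri, name).encode("utf-8"),
        headers={"Content-Type": "text/turtle"},
        method="POST",
    )
```

A CRM or CMS hook could call `build_sync_request(...)` whenever a Contact or Tag is saved and pass the result to `urllib.request.urlopen`, so that the Enhancer sees the change without re-indexing a whole vocabulary.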

Apache Stanbol Overview

Apache Stanbol (currently incubating) provides a set of reusable components for semantic content management. Its functionality is exposed as RESTful services that return results as RDF (Resource Description Framework) and JSON.
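
To make the RESTful style concrete, here is a minimal Python sketch of a client-side request to the Enhancer. It only constructs the request. The URL assumes a Stanbol launcher listening on `localhost:8080` with the Enhancer at `/enhancer`, and the function name is hypothetical. The `Accept` header selects the serialization of the results.

```python
from urllib.request import Request

# Assumption: default local launcher; adjust host, port and path as needed.
ENHANCER_URL = "http://localhost:8080/enhancer"


def build_enhance_request(text, accept="application/json", url=ENHANCER_URL):
    """Build (but do not send) a POST request carrying plain text to enhance.

    'accept' chooses the result format, e.g. JSON or an RDF serialization
    such as 'text/turtle'.
    """
    return Request(
        url,
        data=text.encode("utf-8"),
        headers={
            "Content-Type": "text/plain; charset=utf-8",
            "Accept": accept,
        },
        method="POST",
    )
```

Passing the returned object to `urllib.request.urlopen` would send the text to a running instance. Keeping request construction separate from sending makes the sketch easy to inspect without a server.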

Apache Stanbol's main features are:

  • Content Enhancement
    Services that add semantic information to “non-semantic” pieces of content.
  • Reasoning
    Services that are able to retrieve additional semantic information about the content based on the semantic information retrieved via content enhancement.
  • Knowledge Models
    Services that are used to define and manipulate the data models (e.g. ontologies) that are used to store the semantic information.
  • Persistence
    Services that store (or cache) semantic information, i.e. enhanced content, entities, facts, and make it searchable.

Content enhancement - extraction of knowledge from parsed content -