ApacheCon Europe 2012

Rhein-Neckar-Arena, Sinsheim, Germany

5–8 November 2012

Semantic Indexing and Search for Content Management Systems

Suat Gonul

Audience level:
Intermediate
Track:
Linked Data

Thursday 9:15 a.m.–10 a.m. in Level 2 Left

Description

Apache Stanbol supports the creation of semantically meaningful, Apache Solr-based indexes for specific domains or needs. A content management system (CMS) administrator can create multiple such indexes and associate them with the actual CMS. As a result, the documents of the CMS can be indexed in custom semantic indexes. During this process, the documents are also enhanced using the Linked Open Data (LOD) cloud.

Abstract


To provide semantic indexing and search facilities for content management systems (CMSes), we have designed a two-layer architecture within Apache Stanbol. Related JIRA issue here.

  • Layer 1: Storage
  • Layer 2: Indexing

Storage Layer

This layer provides storage for the documents managed in a CMS. It can store each document together with its semantic enhancements.

During the enhancement process, named entities are first recognized within the textual content of the documents. Stanbol then tries to link each recognized entity to a known entity, which may reside in a custom RDF dataset or in other RDF data contained in the LOD cloud. Stanbol can be configured with custom, domain-specific RDF datasets, which then serve as source vocabularies of domain-specific named entities during the enhancement process.
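The two enhancement steps can be illustrated with a deliberately simplified sketch. It is not Stanbol's implementation: recognition and linking are collapsed into a single dictionary lookup against a small vocabulary, and the vocabulary entries, the example.org URIs, and the function name enhance are all assumptions made for illustration.

```python
import re

# Hypothetical domain vocabulary: surface label -> entity URI.
# The URIs are illustrative and do not come from a real dataset.
VOCABULARY = {
    "Apache Stanbol": "http://example.org/entity/ApacheStanbol",
    "Apache Solr": "http://example.org/entity/ApacheSolr",
    "Sinsheim": "http://example.org/entity/Sinsheim",
}


def enhance(text, vocabulary=VOCABULARY):
    """Return (mention, uri, offset) tuples for vocabulary labels found in text."""
    enhancements = []
    for label, uri in vocabulary.items():
        # Whole-word, case-insensitive match of the label in the text.
        pattern = r"\b" + re.escape(label) + r"\b"
        for match in re.finditer(pattern, text, flags=re.IGNORECASE):
            enhancements.append((match.group(0), uri, match.start()))
    # Report the mentions in the order in which they occur in the document.
    return sorted(enhancements, key=lambda item: item[2])
```

Swapping in a different vocabulary dictionary corresponds, very roughly, to configuring the enhancement process with a different domain-specific dataset.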

This layer also aims to integrate with the actual CMS. It keeps track of changes in the CMS and updates the enhancements of the affected documents accordingly.

The related JIRA issue for this layer is here.

Indexing Layer

This layer provides facilities for creating semantically meaningful, Apache Solr-based indexes for specific domains or needs. We use the LDPath language for this purpose. An LDPath program defines the index fields to be created, along with their Solr-specific properties. After the index has been created, the same LDPath program is also used during the actual indexing operation, as explained below.

The Indexing Layer keeps track of changes in the Storage Layer, so each semantic index instance automatically indexes new items managed in the Storage Layer according to its own configuration.
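The relationship between the two layers resembles a publish/subscribe arrangement. The following minimal sketch assumes that reading; the class names Store and SemanticIndex, the on_change callback, and the two example field extractors are hypothetical and are not part of Stanbol's API.

```python
class Store:
    """Toy stand-in for the storage layer; notifies listeners on every change."""

    def __init__(self):
        self._documents = {}
        self._listeners = []

    def subscribe(self, listener):
        """Register an index that wants to be told about stored documents."""
        self._listeners.append(listener)

    def put(self, doc_id, content):
        """Store (or overwrite) a document and notify every subscribed index."""
        self._documents[doc_id] = content
        for listener in self._listeners:
            listener.on_change(doc_id, content)


class SemanticIndex:
    """Toy index that derives its fields through its own configuration."""

    def __init__(self, name, extract):
        self.name = name
        self.extract = extract  # callable: document content -> dict of fields
        self.entries = {}

    def on_change(self, doc_id, content):
        # Re-index the changed document according to this index's configuration.
        self.entries[doc_id] = self.extract(content)


# Two differently configured indexes attached to the same store.
store = Store()
titles = SemanticIndex("titles", lambda text: {"title": text.split("\n")[0]})
lengths = SemanticIndex("lengths", lambda text: {"length": len(text)})
store.subscribe(titles)
store.subscribe(lengths)
store.put("doc-1", "Hello\nworld")
```

After the call to put, both indexes hold an entry for doc-1, each built from the same stored content but with different fields.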

If the enhancement facilities of Stanbol are configured with custom, domain-specific datasets, those datasets can be used as additional information sources for the named entities recognized within a document. In this case, the LDPath program that was used to configure the underlying Solr index is also used to gather additional, semantically related information. Thanks to the integration of LDPath and Stanbol, it is possible to gather information not only from a single dataset but also from several of them, by defining RDF paths that traverse multiple datasets.
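To make the idea of a path traversing multiple datasets concrete, here is a small sketch in plain Python rather than in LDPath syntax. It models each dataset as a list of triples and each index field as a sequence of predicates to follow. The triples, the prefixed names, and the field names are invented for illustration, and real LDPath offers far more than simple predicate chains.

```python
# Triples are (subject, predicate, object) tuples; all identifiers are made up.
DOCUMENT_ENHANCEMENTS = [
    ("doc:1", "ex:mentions", "ent:Sinsheim"),
]
DOMAIN_DATASET = [
    ("ent:Sinsheim", "ex:locatedIn", "ent:Germany"),
]
LABEL_DATASET = [
    ("ent:Germany", "rdfs:label", "Germany"),
    ("ent:Sinsheim", "rdfs:label", "Sinsheim"),
]

# Hypothetical index configuration: field name -> path of predicates.
FIELD_PATHS = {
    "places": ["ex:mentions", "rdfs:label"],
    "countries": ["ex:mentions", "ex:locatedIn", "rdfs:label"],
}


def follow(triples, nodes, predicate):
    """Return the objects reachable from any of `nodes` via `predicate`."""
    return {o for s, p, o in triples if p == predicate and s in nodes}


def build_index_document(doc_id, field_paths, *datasets):
    """Evaluate every configured path over the union of the given datasets."""
    triples = [triple for dataset in datasets for triple in dataset]
    fields = {"id": doc_id}
    for field, path in field_paths.items():
        nodes = {doc_id}
        for predicate in path:
            nodes = follow(triples, nodes, predicate)
        fields[field] = sorted(nodes)
    return fields
```

In this sketch the "countries" path starts in the document's enhancements, continues in the domain dataset, and ends in the label dataset, so the resulting field value cannot be obtained from any one of the three on its own.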

The managed Solr indexes are exposed through RESTful endpoints, so CMS administrators can easily make use of them. It is also possible to configure more than one semantic index for a single storage instance, which means that the same documents within the CMS can be indexed according to different configurations at the same time.
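As a rough illustration of querying such an endpoint, the sketch below only builds a search URL. It assumes that the index accepts the standard Solr select parameters q, rows, and wt. The base URL, the index name, and the field name places are hypothetical, and the actual path under which Stanbol exposes a managed index may differ.

```python
from urllib.parse import urlencode


def build_search_url(base_url, query, rows=10):
    """Build a Solr-style select URL for the index located at base_url."""
    # q, rows and wt are standard Solr query parameters; base_url is an assumption.
    params = {"q": query, "rows": rows, "wt": "json"}
    return base_url.rstrip("/") + "/select?" + urlencode(params)


# Hypothetical endpoint of one semantic index.
url = build_search_url("http://localhost:8080/solr/myindex/", "places:Sinsheim", rows=5)
```

Because several indexes can be configured for the same storage instance, the same query could be sent to a different base URL to search the same documents under a different configuration.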

The related JIRA issue for this layer is here.