NOTE: DRAFT REVISION
Welcome to Apache UIMA, a project of the Apache Incubator. Our goal is
to create and sustain a thriving community of users and developers of
UIMA frameworks, which support components for analysing unstructured
content such as text, audio and video.
What is UIMA?
UIMA is the name of a component-based architecture designed to facilitate building
applications that extract structured information from unstructured artifacts (e.g.
text or video).
Apache UIMA
Apache UIMA is a Java implementation of the UIMA framework, plus tooling, examples, and
documentation, all licensed under the Apache License, Version 2.0. We expect to add
a C++ implementation of the UIMA framework soon. It interoperates efficiently with
the Java version and has add-ons that enable calling Analysis Engine components
written in popular scripting languages such as Python, Perl, and Tcl.
UIMA is working!
Organizations that previously used or developed stand-alone,
complex solutions are switching to the UIMA
framework to assemble components such as language identifiers, language parsers, named-entity
detectors, video scene detectors, and audio speech recognizers into new solutions.
A few things differentiate UIMA from other component framework architectures; here is a
simplified view of the central concepts:
- There is a subject of analysis, which is an unstructured information artifact.
- This artifact and the structured information extracted from it are kept in an object
called the Common Analysis System (CAS), which has a standardized interface;
this object is passed from component to component in a pipeline.
- Each component implements the same fixed API, which the framework uses to
initialize the component and to pass instances of the CAS through it.
- All components have XML-formatted metadata to enable tooling for development and
deployment.
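The central concepts above can be illustrated with a minimal Java sketch. It is not the Apache UIMA API: the names ToyCas, ToyComponent, ToyLanguageIdentifier, ToyNameDetector and ToyPipeline are hypothetical, the "analysis" is a deliberately naive heuristic, and the XML metadata is left out. It only shows one shared object carrying the artifact and the extracted results through components that all implement the same interface.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in for the CAS (not the Apache UIMA API): it holds the
// subject of analysis and the structured results that components add to it.
final class ToyCas {
    private final String text;
    private String language = "unknown";
    private final List<int[]> nameSpans = new ArrayList<>();

    ToyCas(String text) { this.text = text; }

    String getText() { return text; }
    String getLanguage() { return language; }
    void setLanguage(String language) { this.language = language; }
    void addNameSpan(int begin, int end) { nameSpans.add(new int[] {begin, end}); }
    List<int[]> getNameSpans() { return nameSpans; }
    String coveredText(int[] span) { return text.substring(span[0], span[1]); }
}

// The single fixed interface that every component implements (assumed shape).
interface ToyComponent {
    void initialize();
    void process(ToyCas cas);
}

// Toy language identifier: guesses English from a few common function words.
final class ToyLanguageIdentifier implements ToyComponent {
    public void initialize() { }

    public void process(ToyCas cas) {
        String padded = " " + cas.getText().toLowerCase() + " ";
        if (padded.contains(" the ") || padded.contains(" for ") || padded.contains(" in ")) {
            cas.setLanguage("en");
        }
    }
}

// Toy name detector: reads the language set by an earlier component and,
// for English text only, marks every capitalized word as a candidate name.
final class ToyNameDetector implements ToyComponent {
    private Pattern capitalizedWord;

    public void initialize() { capitalizedWord = Pattern.compile("\\b[A-Z][a-z]+\\b"); }

    public void process(ToyCas cas) {
        if (!"en".equals(cas.getLanguage())) {
            return;
        }
        Matcher m = capitalizedWord.matcher(cas.getText());
        while (m.find()) {
            cas.addNameSpan(m.start(), m.end());
        }
    }
}

// Plays the role of the framework: initializes the components once, then
// passes one CAS per artifact through them in order.
final class ToyPipeline {
    private final List<ToyComponent> components;

    ToyPipeline(List<ToyComponent> components) {
        this.components = components;
        for (ToyComponent c : components) {
            c.initialize();
        }
    }

    ToyCas run(String text) {
        ToyCas cas = new ToyCas(text);
        for (ToyComponent c : components) {
            c.process(cas);
        }
        return cas;
    }
}

class ToyPipelineDemo {
    public static void main(String[] args) {
        ToyPipeline pipeline = new ToyPipeline(
                Arrays.asList(new ToyLanguageIdentifier(), new ToyNameDetector()));
        ToyCas cas = pipeline.run("Alice works for Acme in Boston.");
        System.out.println("language = " + cas.getLanguage());
        for (int[] span : cas.getNameSpans()) {
            System.out.println(cas.coveredText(span));
        }
    }
}
```

Because the name detector depends only on what is stored in the shared object, it could be swapped for a better detector without touching the language identifier or the pipeline code.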
Unstructured Information Management applications are software systems
that analyze large volumes of unstructured information in order to
discover knowledge that is relevant to an end user. UIMA is a
framework and SDK for developing such applications. An example UIM
application might ingest plain text and identify entities, such as
persons, places, organizations; or relations, such as works-for or
located-at. UIMA enables such an application to be decomposed into
components, for example "language identification" >> "language
specific segmentation" >> "sentence boundary detection" >> "entity
detection (person/place names etc.)". Each component must implement
interfaces defined by the framework and must provide self-describing
metadata via XML descriptor files. The framework manages these
components and the data flow between them. Components are written in
Java or C++; the data that flows between components is designed for
efficient mapping between these languages. UIMA additionally provides
capabilities to wrap components as network services, and can scale to
very large volumes by replicating processing pipelines over a cluster
of networked nodes.
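The idea of scaling by replicating a pipeline can be sketched on a single machine, with worker threads standing in for networked nodes. This is an assumption-laden illustration, not how Apache UIMA distributes work: ReplicatedPipelineSketch, newPipelineInstance and processAll are hypothetical names, and the "pipeline" is a trivial token counter.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

final class ReplicatedPipelineSketch {

    // Hypothetical pipeline factory: each replica is an independent instance.
    // Here the whole "pipeline" just counts whitespace-separated tokens.
    static Function<String, Integer> newPipelineInstance() {
        return text -> {
            String trimmed = text.trim();
            return trimmed.isEmpty() ? 0 : trimmed.split("\\s+").length;
        };
    }

    // Processes all documents using the given number of replicas.
    // Each worker thread lazily creates and then reuses its own pipeline instance.
    static List<Integer> processAll(List<String> documents, int replicas) {
        ExecutorService pool = Executors.newFixedThreadPool(replicas);
        ThreadLocal<Function<String, Integer>> replica =
                ThreadLocal.withInitial(ReplicatedPipelineSketch::newPipelineInstance);
        try {
            List<Future<Integer>> pending = new ArrayList<>();
            for (String doc : documents) {
                Callable<Integer> task = () -> replica.get().apply(doc);
                pending.add(pool.submit(task));
            }
            // Collect results in the same order as the input documents.
            List<Integer> results = new ArrayList<>();
            for (Future<Integer> f : pending) {
                results.add(f.get());
            }
            return results;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("pipeline replica failed", e);
        } finally {
            pool.shutdown();
        }
    }
}
```

The design point is that documents are independent of one another, so adding replicas increases throughput without changing any component.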
Apache UIMA is an open source implementation of the UIMA specification
(that specification is, in turn, being
developed concurrently by a technical committee
within OASIS, a standards organization).
We invite and encourage you to participate in both the implementation and specification efforts.
How is this project related to earlier UIMA development?
UIMA was previously released on IBM's
alphaWorks, where you can find version 1.4 and a beta of
version 2.0.
Apache UIMA (this project) will continue development of UIMA, using the Apache
Open Source development model - all are welcome to join and participate.
The first version we plan to release within the Apache Incubator will be
version 2.1; we are working hard on it and hope to release it
very early next year.