Welcome to Apache UIMA

NOTE: DRAFT REVISION

Welcome to Apache UIMA, a project of the Apache Incubator. Our goal is to create and sustain a thriving community of users and developers of UIMA frameworks, which support components for analysing unstructured content such as text, audio and video.

What is UIMA?

UIMA is a component-based architecture designed to facilitate building applications that extract structured information from unstructured artifacts (e.g. text, audio, video).

Apache UIMA

Apache UIMA is a Java implementation of the UIMA framework, plus tooling, examples, and documentation, all licensed under the Apache License, Version 2.0. We expect to add a C++ implementation of the UIMA framework soon. It interoperates efficiently with the Java version and has add-ons that enable calling Analysis Engine components written in popular scripting languages such as Python, Perl, and Tcl.

UIMA is working!

Organizations that previously were using or developing stand-alone, complex solutions are switching to using the UIMA framework to assemble components such as language identifiers, language parsers, named-entity detectors, video scene detectors, audio speech recognizers, etc., into new solutions.

A few things differentiate UIMA from other component framework architectures; here is a simplified view of the central concepts:

  • There is a subject of analysis, which is an unstructured information artifact.
  • This artifact and the structured information extracted from it are kept in an object called the Common Analysis System (CAS), which has a standardized interface; this object is passed from component to component in a pipeline.
  • Each component implements the same fixed API, which the framework uses to initialize the component and to pass CAS instances through it for processing.
  • All components have XML-formatted metadata to enable tooling for development and deployment.

Unstructured Information Management (UIM) applications are software systems that analyze large volumes of unstructured information in order to discover knowledge that is relevant to an end user. UIMA is a framework and SDK for developing such applications. An example UIM application might ingest plain text and identify entities, such as persons, places, and organizations, or relations, such as works-for or located-at.

UIMA enables such an application to be decomposed into components, for example "language identification" >> "language specific segmentation" >> "sentence boundary detection" >> "entity detection (person/place names etc.)". Each component must implement interfaces defined by the framework and must provide self-describing metadata via XML descriptor files. The framework manages these components and the data flow between them. Components are written in Java or C++; the data that flows between components is designed for efficient mapping between these languages. UIMA additionally provides capabilities to wrap components as network services, and can scale to very large volumes by replicating processing pipelines over a cluster of networked nodes.
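To make the pipeline idea concrete, here is a minimal sketch in plain Java. It is an illustration only and does not use the Apache UIMA API: the names PipelineSketch, Cas, Annotation, Component, SentenceDetector and PersonDetector, and the "persons" configuration key, are all hypothetical. It assumes a toy CAS that holds the text plus a list of typed spans, and it implements only two of the stages named above (sentence boundary detection and a dictionary-based person detector).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy illustration only: every type here is hypothetical, NOT the Apache UIMA API.
public class PipelineSketch {

    // One piece of extracted structure: a typed span over the artifact text.
    public static final class Annotation {
        public final String type;
        public final int begin;
        public final int end;
        Annotation(String type, int begin, int end) {
            this.type = type; this.begin = begin; this.end = end;
        }
    }

    // Stand-in for the CAS: the artifact plus everything extracted so far.
    public static final class Cas {
        public final String text;
        public final List<Annotation> annotations = new ArrayList<>();
        Cas(String text) { this.text = text; }

        void add(String type, int begin, int end) {
            annotations.add(new Annotation(type, begin, end));
        }

        // Returns a fresh list, so callers may add annotations while iterating it.
        public List<Annotation> select(String type) {
            List<Annotation> out = new ArrayList<>();
            for (Annotation a : annotations) if (a.type.equals(type)) out.add(a);
            return out;
        }
    }

    // Every component exposes the same two operations to the framework.
    public interface Component {
        void initialize(Map<String, String> config);
        void process(Cas cas);
    }

    // Marks a sentence at each '.', '!' or '?', skipping leading whitespace.
    public static final class SentenceDetector implements Component {
        public void initialize(Map<String, String> config) { }
        public void process(Cas cas) {
            int start = 0;
            for (int i = 0; i < cas.text.length(); i++) {
                char c = cas.text.charAt(i);
                if (c == '.' || c == '!' || c == '?') {
                    int begin = start;
                    while (begin < i && Character.isWhitespace(cas.text.charAt(begin))) begin++;
                    cas.add("Sentence", begin, i + 1);
                    start = i + 1;
                }
            }
        }
    }

    // Finds the first occurrence of each configured name inside each sentence,
    // reading the Sentence annotations left in the CAS by the upstream component.
    public static final class PersonDetector implements Component {
        private String[] names = new String[0];
        public void initialize(Map<String, String> config) {
            String list = config.get("persons"); // hypothetical configuration key
            if (list != null) names = list.split(",");
        }
        public void process(Cas cas) {
            for (Annotation s : cas.select("Sentence")) {
                String sentence = cas.text.substring(s.begin, s.end);
                for (String name : names) {
                    int at = sentence.indexOf(name);
                    if (at >= 0) cas.add("Person", s.begin + at, s.begin + at + name.length());
                }
            }
        }
    }

    // The "framework": initializes each component, then passes one CAS through all of them in order.
    public static Cas run(String text, List<Component> pipeline, Map<String, String> config) {
        for (Component c : pipeline) c.initialize(config);
        Cas cas = new Cas(text);
        for (Component c : pipeline) c.process(cas);
        return cas;
    }

    public static void main(String[] args) {
        Map<String, String> config = new HashMap<>();
        config.put("persons", "Alice,Bob");
        Cas cas = run("Alice works for Acme. Bob lives in Paris.",
                Arrays.asList(new SentenceDetector(), new PersonDetector()), config);
        for (Annotation a : cas.annotations) {
            System.out.println(a.type + " [" + a.begin + "," + a.end + ") "
                    + cas.text.substring(a.begin, a.end));
        }
    }
}
```

The point of the design is that PersonDetector never calls SentenceDetector directly. It only reads what an earlier stage left in the shared CAS, which is what lets stages be reordered, replaced, or wrapped as remote services without changing one another. A real UIMA component would additionally be described by an XML descriptor, which this sketch omits.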

Apache UIMA is an open source implementation of the UIMA specification, which is being developed concurrently by a technical committee within OASIS, a standards organization. We invite and encourage you to participate in both the implementation and specification efforts.

How is this project related to earlier UIMA development?

UIMA was previously released on IBM's alphaWorks, where you can find version 1.4 and a beta of version 2.0.

Apache UIMA (this project) will continue development of UIMA using the Apache open source development model; all are welcome to join and participate. The first version we plan to release within the Apache Incubator will be version 2.1. We're working hard on it and hope to release it very early next year.


Incubator Notice and Disclaimer

Note: Apache UIMA is an effort undergoing incubation at the Apache Software Foundation (ASF). Incubation is required of all newly accepted projects until a further review indicates that the infrastructure, communications, and decision-making process have stabilized in a manner consistent with other successful ASF projects. While incubation status is not necessarily a reflection of the completeness or stability of the code, it does indicate that the project has yet to be fully endorsed by the ASF.


UIMA News

November, 2006: UIMA has been accepted into the Apache Incubator. Work begins on converting the previous code base to Apache standards.



Copyright © 2006-2008, The Apache Software Foundation