|
| | | | I have a set of (DTD or XML Schema) grammars that I
use a lot. How can I make Xerces
reuse the representations it builds for these grammars,
instead of parsing them anew with every new document?
| | | | |
| |
Before answering this question, it will greatly help to
understand how Xerces handles grammars internally. To do
this, here are some terms:
- Grammar: defined in the
org.apache.xerces.xni.grammars.Grammar
interface; simply differentiates objects that are Xerces
grammars from other objects, and provides a means
to get at the location information (XMLGrammarDescription) for the grammar represented.
- XMLGrammarDescription: defined by the
org.apache.xerces.xni.grammars.XMLGrammarDescription
interface; holds some basic location information common to all grammars.
This can be used to distinguish one
Grammar object from another, and also
contains information about the type of the grammar.
- Validator: A generic term used in Xerces to denote
an object which compares the structure of an XML document
with the expectations of a certain type of grammar.
Currently, we have DTD and XML Schema validators.
- XMLGrammarPool: defined by the
org.apache.xerces.xni.grammars.XMLGrammarPool
interface; this object is owned by the application and
is the means by which the application and Xerces pass
complex grammars to one another.
- Grammar bucket: An internal data structure owned by
a Xerces validator in which grammars--and information
related to grammars--to be used in a given validation
episode are stored.
- XMLGrammarLoader: defined in the
org.apache.xerces.xni.grammars.XMLGrammarLoader
interface; this defines an object that "knows how" to
read the XML representation of a particular kind of
grammar and construct a Xerces-internal representation (a
Grammar object) out of it. These objects
may interact with validators during parsing of instance
documents, or with external code during grammar
preparsing.
Now that the terminology is out of the way, it's possible
to relate all these objects together. At the commencement of
a validation episode, a validator will call the
retrieveInitialGrammarSet(String grammarType) method of the
XMLGrammarPool instance to which it has access. It
will use the Grammar objects it procures in this
way to seed its grammar bucket.
When the validator determines that it needs a grammar, it
will consult its grammar bucket. If it finds a matching
grammar, it will attempt to use it. Otherwise, if it has
access to an XMLGrammarPool instance, it
will request a grammar from that object with the
retrieveGrammar(XMLGrammarDescription desc)
method. Only if both of these steps fail will it fall
back to attempting to resolve the grammar entity and
calling the appropriate XMLGrammarLoader
to actually create a new Grammar object.
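The lookup order just described (grammar bucket first, then the grammar pool, then a loader) can be pictured with a small toy model. This is only a sketch and not Xerces code: the class GrammarLookupSketch is hypothetical, grammars are modelled as plain Strings keyed by an identifier, and a java.util.function.Function stands in for an XMLGrammarLoader.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

/** Toy model of a validator's grammar lookup; hypothetical, not Xerces code. */
final class GrammarLookupSketch {
    private final Map<String, String> bucket = new HashMap<>(); // validator-owned
    private final Map<String, String> pool;                     // application-owned, may be null
    private final Function<String, String> loader;              // stands in for an XMLGrammarLoader
    int loaderCalls = 0;                                        // counts fall-backs to the loader

    GrammarLookupSketch(Map<String, String> pool, Function<String, String> loader) {
        this.pool = pool;
        this.loader = loader;
    }

    String find(String key) {
        String grammar = bucket.get(key);          // step 1: consult the bucket
        if (grammar == null && pool != null) {
            grammar = pool.get(key);               // step 2: ask the pool, if there is one
        }
        if (grammar == null) {
            loaderCalls++;
            grammar = loader.apply(key);           // step 3: build a new grammar
        }
        if (grammar != null) {
            bucket.put(key, grammar);              // remember it for the rest of the episode
        }
        return grammar;
    }

    public static void main(String[] args) {
        Map<String, String> pool = new HashMap<>();
        pool.put("urn:a", "pooled-a");
        GrammarLookupSketch validator = new GrammarLookupSketch(pool, k -> "loaded-" + k);
        System.out.println(validator.find("urn:a"));
        System.out.println(validator.find("urn:b"));
    }
}
```

The point of the sketch is only the ordering: the loader is reached when neither the bucket nor the pool can supply a grammar.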
At the end of the validation episode, the validator will
call the cacheGrammars(String grammarType,
Grammar[] grammars) method of the
XMLGrammarPool (if any) to which it has
access. There is no guarantee that grammars which the grammar
pool itself supplied to the validator will not be
included in this set, so a grammar pool implementation
cannot rely on only new grammars being passed back in
this situation.
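To illustrate why a pool has to tolerate being handed back its own grammars, here is a toy pool whose caching step skips anything it already holds. It is a sketch under simplifying assumptions: DedupingPoolSketch and its methods are hypothetical names, and grammars are modelled as Strings keyed by an identifier rather than by XMLGrammarDescription.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy pool that ignores grammars it already holds; hypothetical, not Xerces code. */
final class DedupingPoolSketch {
    private final Map<String, String> grammars = new LinkedHashMap<>();

    /** Rough analogue of retrieveInitialGrammarSet: hand out the keys of everything held. */
    List<String> initialSet() {
        return new ArrayList<>(grammars.keySet());
    }

    /** Rough analogue of cacheGrammars: returns how many grammars were newly stored. */
    int cache(Map<String, String> offered) {
        int added = 0;
        for (Map.Entry<String, String> e : offered.entrySet()) {
            // putIfAbsent returns null only when the key was not present before
            if (grammars.putIfAbsent(e.getKey(), e.getValue()) == null) {
                added++;
            }
        }
        return added;
    }

    int size() {
        return grammars.size();
    }
}
```

A validator that was seeded with "urn:a" and later hands back both "urn:a" and "urn:b" therefore adds only one new entry.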
At long last, it's now possible to answer the original
question--how can one cache grammars? Assuming one has a
reasonable XMLGrammarPool
implementation--such as that provided with Xerces--there are two
answers:
- The "passive" approach: Don't do any preparsing,
just register the grammar pool implementation with the
parser, and as new grammars are requested by instance
documents, simply let the validators add them to the
pool. This is very unobtrusive to the application, but
doesn't provide that much control over what grammars are
added; even if a custom EntityResolver is registered,
it's still possible that unwanted grammars will make it
into the pool.
- The "active" approach: Preload a grammar pool
implementation with all the grammars you'll need, then
lock it so that no new grammars will be added.
Registering the locked pool on the configuration will allow
validators to make use of this set; registering a
do-nothing EntityResolver will allow the application to
prevent validators from using any but the "approved" grammar
set. This will oblige the application to use more Xerces
code, but provides a far more fine-grained approach to
controlling what grammars may be used.
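The difference between the two approaches comes down to whether the pool still accepts grammars once validation starts. The following toy model shows the effect of locking; LockablePoolSketch and its method names are hypothetical and only imitate the behaviour described above, they are not the Xerces API.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy pool that can be locked against additions; hypothetical, not Xerces code. */
final class LockablePoolSketch {
    private final Map<String, String> grammars = new HashMap<>();
    private boolean locked = false;

    /** After this call the pool keeps serving grammars but accepts no new ones. */
    void lock() {
        locked = true;
    }

    /** Returns true only if the grammar was actually stored. */
    boolean offer(String key, String grammar) {
        if (locked || grammars.containsKey(key)) {
            return false;
        }
        grammars.put(key, grammar);
        return true;
    }

    String retrieve(String key) {
        return grammars.get(key);
    }

    public static void main(String[] args) {
        LockablePoolSketch pool = new LockablePoolSketch();
        pool.offer("urn:approved", "approved grammar"); // "active" preloading
        pool.lock();
        System.out.println(pool.offer("urn:unwanted", "unwanted grammar"));
        System.out.println(pool.retrieve("urn:approved"));
    }
}
```

In the passive approach the pool would simply never be locked, so every grammar a validator offers ends up in it.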
We discuss both these approaches in a bit more detail
below, complete with some (broad) examples.
First, though, note that the
XMLGrammarBuilder sample, from the
xni package, should provide a starting point
for implementing either the active or passive approach.
|
| | | | Exactly how does Xerces's default implementation of things
like the grammar pool work? | | | | |
| |
Before proceeding further, let there be no doubt that, by default, Xerces
does not cache grammars at all. To trigger Xerces grammar caching, an XMLGrammarPool
must be set, using the setProperty method,
on a Xerces configuration that supports grammar pools. Alternatively,
you could simply use the XMLGrammarCachingConfiguration, as
discussed briefly below.
When caching is enabled, Xerces's default grammar pool implementation stores
any grammar offered to it (provided it does not already
have a reference matching that grammar). It also makes
available all grammars it has, of a particular type, on
calls to retrieveInitialGrammarSet. It will
also try to retrieve a matching grammar on calls to
retrieveGrammar.
Xerces uses hashing to distinguish different grammar
objects, by hashing on the
XMLGrammarDescription objects that those
grammars contain. Thus, both of Xerces's implementations
of XMLGrammarDescription--for DTDs and XML
Schemas--provide implementations of hashCode():
int and equals(Object): boolean that
are used by the hashing algorithm.
In XML Schemas, hashing is simply carried out on the
target namespace of the schema. Thus, two grammars are
considered equal (by our default implementation) if and
only if their XMLGrammarDescriptions are instances of
org.apache.xerces.impl.xs.XSDDescription (our schema implementation of
XMLGrammarDescription) and the targetNamespace fields of
those objects are identical.
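As a rough illustration of what "hashing on the target namespace" implies, the sketch below defines a key whose equals and hashCode look at the target namespace alone. SchemaKeySketch and its locationHint field are hypothetical names introduced for this example; the sketch does not reproduce XSDDescription itself.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

/** Toy schema description keyed on target namespace only; hypothetical, not XSDDescription. */
final class SchemaKeySketch {
    final String targetNamespace; // null models a schema with no target namespace
    final String locationHint;    // deliberately ignored by equals and hashCode

    SchemaKeySketch(String targetNamespace, String locationHint) {
        this.targetNamespace = targetNamespace;
        this.locationHint = locationHint;
    }

    @Override
    public boolean equals(Object other) {
        return other instanceof SchemaKeySketch
                && Objects.equals(targetNamespace, ((SchemaKeySketch) other).targetNamespace);
    }

    @Override
    public int hashCode() {
        return Objects.hashCode(targetNamespace);
    }

    public static void main(String[] args) {
        Map<SchemaKeySketch, String> pool = new HashMap<>();
        pool.put(new SchemaKeySketch("urn:orders", "orders-v1.xsd"), "cached grammar");
        // Same namespace, different location: the cached entry is still found.
        System.out.println(pool.get(new SchemaKeySketch("urn:orders", "orders-v2.xsd")));
    }
}
```

One consequence worth keeping in mind: under such a scheme, two different schema documents that declare the same target namespace are indistinguishable to the pool.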
The case of DTDs is more involved. Here is the
algorithm, which describes the conditions under which two
DTD grammars will be considered equal:
- Both grammars must have XMLGrammarDescriptions that
are instances of
org.apache.xerces.impl.dtd.XMLDTDDescription.
- If their publicId or expandedSystemId fields are
non-null they must be identical.
- If one of the descriptions has a root element
defined, it must be the same as the root element defined
in the other description, or be in the list of global
elements stored in that description.
- If neither has a root element defined, then they must
share at least one global element declaration in
common.
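The conditions above can be written out as a single predicate. The sketch below is one possible reading of them, not the code in XMLDTDDescription: the class DtdMatchSketch and its fields are hypothetical, the first condition is modelled simply by both arguments having the same Java type, and the second condition is read strictly (an identifier that is null on one side and non-null on the other counts as a mismatch), which is an assumption.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.Objects;
import java.util.Set;

/** Toy DTD description with the matching rules listed above; hypothetical, not Xerces code. */
final class DtdMatchSketch {
    final String publicId;
    final String expandedSystemId;
    final String rootName;            // null when no root element is defined
    final Set<String> globalElements; // global element declarations known to this description

    DtdMatchSketch(String publicId, String expandedSystemId, String rootName, String... globals) {
        this.publicId = publicId;
        this.expandedSystemId = expandedSystemId;
        this.rootName = rootName;
        this.globalElements = new HashSet<>(Arrays.asList(globals));
    }

    /** True if the given root is this description's root or one of its global elements. */
    private boolean accepts(String root) {
        return root.equals(rootName) || globalElements.contains(root);
    }

    static boolean matches(DtdMatchSketch a, DtdMatchSketch b) {
        // Identifiers must agree (strict reading, see the assumption stated in the lead-in).
        if (!Objects.equals(a.publicId, b.publicId)
                || !Objects.equals(a.expandedSystemId, b.expandedSystemId)) {
            return false;
        }
        // If either side defines a root, the other side must accept it.
        if (a.rootName != null || b.rootName != null) {
            return (a.rootName == null || b.accepts(a.rootName))
                    && (b.rootName == null || a.accepts(b.rootName));
        }
        // Neither defines a root: they must share at least one global element.
        return !Collections.disjoint(a.globalElements, b.globalElements);
    }
}
```

For example, a cached description for "my.dtd" with no root but with global elements myDoc and item would match an instance document whose DOCTYPE names myDoc and the same system identifier.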
The DTD grammar caching also assumes that the entirety of
the cached grammar will lie in an external subset; i.e.,
in the example below, Xerces will happily cache--or use a
cached version of--the DTD in "my.dtd". If the document
contained an internal subset, its declarations would be
ignored.
| | | | <!DOCTYPE myDoc SYSTEM "my.dtd">
<myDoc ...>...</myDoc> | | | | |
Using these heuristics, Xerces's default grammar caching
implementation appears to do a reasonable job at matching
grammars up with appropriate instance documents. This
functionality is very new, so in addition to bug reports
we'd very much appreciate, especially on the DTD front,
feedback on whether this form of caching is indeed useful or
whether--for instance--it would be better if internal
declarations were somehow incorporated into the grammar
that's been cached.
|
| | | | I like the idea of "active" caching (or I want the grammar
object for some purpose); how do I go about parsing a grammar
independent of an instance document? | | | | |
| |
First, if you haven't read the first FAQ on this page and
have trouble with the terminology, you will hopefully find
the answers there.
Preparsing of grammars in Xerces is accomplished with
implementations of the XMLGrammarLoader
interface. Each implementation needs to know how to
parse a particular type of grammar and how to build a
data structure representing that grammar that Xerces can
efficiently make use of in validation. Since most
application programs won't want to deal with Xerces
implementations per se, we have provided a handy utility
class to handle grammar preparsing generally:
org.apache.xerces.parsers.XMLGrammarPreparser.
This FAQ describes the use of this class.
For a live example, check out the
XMLGrammarBuilder sample in the
samples/xni directory of the binary
distribution.
XMLGrammarPreparser has methods for
installing XNI error handlers and entity resolvers, setting
the Locale, and generally doing much the same things as an XNI
configuration. Any object passed to XMLGrammarPreparser
by any of these methods will be passed on to all
XMLGrammarLoaders registered with
XMLGrammarPreparser.
Before XMLGrammarPreparser can be used, its
registerPreparser(String, XMLGrammarLoader):
boolean method must be called. This allows a
String identifying an arbitrary grammar type to be
associated with a loader for that type. To make people's
lives easier, if you want DTD or XML Schema
grammar support, you can pass null for the
second parameter and XMLGrammarPreparser
will try to instantiate the appropriate default grammar
loader. For DTDs, for instance, just call
registerPreparser like:
| | | | grammarPreparser.registerPreparser("http://www.w3.org/TR/REC-xml", null) | | | | |
Schema grammars correspond to the URI
"http://www.w3.org/2001/XMLSchema"; both these constants
can be found in the
org.apache.xerces.xni.grammars.XMLGrammarDescription
interface. The method returns true if an
XMLGrammarLoader was successfully associated with the
given grammar String, false otherwise.
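The registration contract described above (a type string mapped to a loader, null meaning "use the default loader if one is known", and a boolean result) can be mimicked in a few lines. This is a toy model and not the XMLGrammarPreparser implementation: PreparserRegistrySketch and its nested Loader interface are hypothetical, and the "grammars" it produces are just labelled Strings. Only the two URI constants are taken from the text.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

/** Toy registry imitating the registerPreparser contract; hypothetical, not Xerces code. */
final class PreparserRegistrySketch {
    static final String XML_DTD = "http://www.w3.org/TR/REC-xml";
    static final String XML_SCHEMA = "http://www.w3.org/2001/XMLSchema";

    /** Stand-in for XMLGrammarLoader: turns a system identifier into a "grammar". */
    interface Loader {
        String load(String systemId);
    }

    // Default loaders known for the two built-in grammar types.
    private static final Map<String, Supplier<Loader>> DEFAULTS = new HashMap<>();
    static {
        DEFAULTS.put(XML_DTD, () -> id -> "dtd:" + id);
        DEFAULTS.put(XML_SCHEMA, () -> id -> "xsd:" + id);
    }

    private final Map<String, Loader> loaders = new HashMap<>();

    /** Returns true if a loader ended up associated with the given type. */
    boolean registerPreparser(String type, Loader loader) {
        if (loader == null) {
            Supplier<Loader> factory = DEFAULTS.get(type);
            if (factory == null) {
                return false; // no default loader known for this type
            }
            loader = factory.get();
        }
        loaders.put(type, loader);
        return true;
    }

    /** Returns null when no loader was registered for the type (a choice made for this sketch). */
    String preparseGrammar(String type, String systemId) {
        Loader loader = loaders.get(type);
        return loader == null ? null : loader.load(systemId);
    }
}
```

Registering a custom loader under an arbitrary type string works the same way, except that the second argument is then the loader itself rather than null.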
XMLGrammarPreparser also contains methods for setting
features and properties on particular loaders--keyed
on the same string that was used to register the
loader. It also allows features and properties the
application believes to be general to all loaders to be
set; it transmits such features and properties to each
loader that is registered. These methods also silently consume any
notRecognized/notSupported exceptions that the loaders throw. Particularly useful here is
registering an XMLGrammarPool
implementation, such as that found in
org.apache.xerces.util.XMLGrammarPoolImpl.
To actually parse a grammar, one simply calls the
preparseGrammar(String grammarType, XMLInputSource
source): Grammar method. As above, the String
represents the type of the grammar to be parsed, and the
XMLInputSource is the location of the grammar to be
parsed; this will not be subjected to entity expansion.
It's worth noting that Xerces's default grammar loaders
will attempt to cache the resulting grammar(s) if a
grammar pool implementation is registered with them.
This is particularly useful in the case of schema
grammars: If a schema grammar imports another grammar,
the Grammar object returned will be the schema doing the
importing, not the one being imported. For caching,
this means that if this grammar is cached by itself, the grammars
that it imports won't be available to the grammar pool
implementation. Since our Schema Loader knows about this
idiosyncrasy, if a grammar pool is registered with it,
it will cache all schema grammars it encounters,
including the one which it was specifically called to
parse. In general, it is probably advisable to register
grammar pool implementations with grammar loaders for
this reason; generally, one would want to cache--and make
available to the grammar pool implementation--imported
grammars as well as specific schema grammars, since the
specific schemas cannot be used without those that they
import.
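The import behaviour described above can be sketched as follows: the loader returns only the grammar it was asked for, but when a pool is present it stores every grammar it came across along the way. This is a toy model under stated assumptions: ImportCachingSketch is a hypothetical class, the import graph is supplied as a plain map instead of being read from schema documents, and grammars are keyed by location purely for simplicity.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy loader that caches imported grammars too; hypothetical, not the Xerces schema loader. */
final class ImportCachingSketch {
    private final Map<String, List<String>> imports; // location -> locations it imports
    private final Map<String, String> pool;          // may be null when no pool is registered

    ImportCachingSketch(Map<String, List<String>> imports, Map<String, String> pool) {
        this.imports = imports;
        this.pool = pool;
    }

    /** Returns the grammar for the requested location only, as preparsing does. */
    String load(String location) {
        Map<String, String> seen = new LinkedHashMap<>();
        collect(location, seen);
        if (pool != null) {
            // Cache the requested grammar and everything it (transitively) imports.
            for (Map.Entry<String, String> e : seen.entrySet()) {
                pool.putIfAbsent(e.getKey(), e.getValue());
            }
        }
        return seen.get(location);
    }

    private void collect(String location, Map<String, String> seen) {
        if (seen.containsKey(location)) {
            return; // already visited; also guards against circular imports
        }
        seen.put(location, "grammar(" + location + ")");
        for (String imported : imports.getOrDefault(location, Collections.emptyList())) {
            collect(imported, seen);
        }
    }
}
```

Without a pool, the caller receives the importing grammar and nothing else, which is exactly the situation in which caching that single object by hand would leave the imported grammars out of the pool.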
|
|