------ Apache Any23 - Validation and Fixing ------ The Apache Software Foundation ------ 2011-2012 ~~ Licensed to the Apache Software Foundation (ASF) under one or more ~~ contributor license agreements. See the NOTICE file distributed with ~~ this work for additional information regarding copyright ownership. ~~ The ASF licenses this file to You under the Apache License, Version 2.0 ~~ (the "License"); you may not use this file except in compliance with ~~ the License. You may obtain a copy of the License at ~~ ~~ http://www.apache.org/licenses/LICENSE-2.0 ~~ ~~ Unless required by applicable law or agreed to in writing, software ~~ distributed under the License is distributed on an "AS IS" BASIS, ~~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. ~~ See the License for the specific language governing permissions and ~~ limitations under the License. Validation and Fixing Introduction <> Is able to detect <> and apply fixes over it. This section will show how to write RDFa validation Rule and Fix for RDFa. It's widely recognized that RDFa is subjected to a plethora of different and {{{http://rdfa.info/wiki/Common-publishing-mistakes}common mistakes}}. These errors may lead to a failures during RDF extraction process from HTML pages but since they are, typically, syntax errors they could be easily detected and fixed with some heuristics. This pages describes the <> rule-based approach, that allows it to detect, fix and correctly extract RDF from those ill-formed RDFa in XHTML pages. More specifically, <> allows you to write a {{{./xref/org/apache/any23/validator/Rule.html}Rule}} able to detect the errors, a {{{./xref/org/apache/any23/validator/Fix.html}Fix}} containing the logic to fix the problem and a {{{./xref/org/apache/any23/validator/Validator.html}Validator}} which acts as a register of rules and fixes. The Validator calls all the registered rules and when one of them is applied it calls the associated Fix. The following code snipped shows how to programmatically detect and fix a very common data error with <>. Fix Missing Prefix Mappings Declaration Sometimes, web authors forget to declare prefix mappings. For example, you can't just use something like dcterms:title without first declaring the dcterms prefix mapping. If a prefix mapping isn't declared, the RDFa parser won't understand the prefix when it is used in your document. This may lead <> to don't extract such embedded RDF triples. This: +------------------------------------------------------------------------------------------
The title of this document is Why RDFa is Awesome.
+------------------------------------------------------------------------------------------ Should be: +------------------------------------------------------------------------------------------
The title of this document is Why RDFa is Awesome.
+------------------------------------------------------------------------------------------ With the <> {{{./xref/org/apache/any23/validator/package-summary.html}Validator}} classes it's possible to solve this problem simply implementing the {{{./xref/org/apache/any23/validator/Rule.html}Rule}} interface as described below: +------------------------------------------------------------------------------------------ public class MissingOpenGraphNamespaceRule implements Rule { public String getHRName() { return "missing-opengraph-namespace-rule"; } public boolean applyOn(DOMDocument document, RuleContext context, ValidationReport validationReport) { List metas = document.getNodes("/HTML/HEAD/META"); boolean foundPrecondition = false; for (Node meta : metas) { Node propertyNode = meta.getAttributes().getNamedItem("property"); if( propertyNode != null && propertyNode.getTextContent().indexOf("og:") == 0) { foundPrecondition = true; break; } } if (foundPrecondition) { Node htmlNode = document.getNode("/HTML"); if (htmlNode.getAttributes().getNamedItem("xmlns:og") == null) { validationReport.reportIssue( ValidationReport.IssueLevel.error, "Missing OpenGraph namespace declaration.", htmlNode ); return true; } } return false; } } +------------------------------------------------------------------------------------------ The {{{./xref/org/apache/any23/validator/rule/MissingOpenGraphNamespaceRule.html}MissingOpenGraphNamespaceRule}} inspects the DOM structure of the HTML page and if it finds some META tags with some RDFa property (of the OpenGraph Protocol vocabulary, in this case) it looks for the declaration of that name space. If there is no declaration it return <>, that means that an error has been detected within the document. Writing a fix for the Rule depicted above it's quite simple: +------------------------------------------------------------------------------------------ public class OpenGraphNamespaceFix implements Fix { public static final String OPENGRAPH_PROTOCOL_NS = "http://opengraphprotocol.org/schema/"; public String getHRName() { return "opengraph-namespace-fix"; } public void execute(Rule rule, RuleContext context, DOMDocument document) { document.addAttribute("/HTML", "xmlns:og", OPENGRAPH_PROTOCOL_NS); } } +------------------------------------------------------------------------------------------ At this point it's enough to register the Rule and the relative Fix to the Validator: +------------------------------------------------------------------------------------------ validator.addRule(MissingOpenGraphNamespaceRule.class, OpenGraphNamespaceFix.class); +------------------------------------------------------------------------------------------ When the Rule precondition is matched, then the Fix is triggered modifying the DOM structure.