 ------
 Apache Any23 - Data Extraction
 ------
 The Apache Software Foundation
 ------
 2011-2012

~~  Licensed to the Apache Software Foundation (ASF) under one or more
~~  contributor license agreements.  See the NOTICE file distributed with
~~  this work for additional information regarding copyright ownership.
~~  The ASF licenses this file to You under the Apache License, Version 2.0
~~  (the "License"); you may not use this file except in compliance with
~~  the License.  You may obtain a copy of the License at
~~
~~       http://www.apache.org/licenses/LICENSE-2.0
~~
~~  Unless required by applicable law or agreed to in writing, software
~~  distributed under the License is distributed on an "AS IS" BASIS,
~~  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
~~  See the License for the specific language governing permissions and
~~  limitations under the License.

Data Extraction

+----------------------------------------------------------------------------------------------
/*1*/ Any23 runner = new Any23();
/*2*/ runner.setHTTPUserAgent("test-user-agent");
/*3*/ HTTPClient httpClient = runner.getHTTPClient();
/*4*/ DocumentSource source = new HTTPDocumentSource(
         httpClient,
         "http://www.rentalinrome.com/semanticloft/semanticloft.htm"
      );
/*5*/ ByteArrayOutputStream out = new ByteArrayOutputStream();
/*6*/ TripleHandler handler = new NTriplesWriter(out);
      try {
/*7*/     runner.extract(source, handler);
      } finally {
/*8*/     handler.close();
      }
/*9*/ String n3 = out.toString("UTF-8");
+----------------------------------------------------------------------------------------------

 This example demonstrates data extraction, the main purpose of the <<Any23>> library.
 At <<line 1>> we define the <<Any23>> facade instance. As described before,
 the constructor lets you enforce the usage of specific extractors.
 <<Line 2>> defines the <HTTP user agent>, used to identify the client during data collection.
 At <<line 3>> we use the runner to create an instance of
 {{{./xref/org/apache/any23/http/HTTPClient.html}HTTPClient}}, used by
 {{{./xref/org/apache/any23/source/HTTPDocumentSource.html}HTTPDocumentSource}} for content fetching.
 <<Line 4>> instantiates an
 {{{./xref/org/apache/any23/source/HTTPDocumentSource.html}HTTPDocumentSource}},
 specifying the {{{./xref/org/apache/any23/http/HTTPClient.html}HTTPClient}} and the URL
 addressing the content to be processed.
 At <<line 5>> we define an in-memory output stream that stores the data produced by the
 {{{./xref/org/apache/any23/writer/TripleHandler.html}TripleHandler}} defined at <<line 6>>.
 The extraction method at <<line 7>> runs the metadata extraction. The produced metadata
 is written to the passed {{{./xref/org/apache/any23/writer/TripleHandler.html}TripleHandler}} instance.
 The {{{./xref/org/apache/any23/writer/TripleHandler.html}TripleHandler}} must be closed
 explicitly; this is done safely in a <<finally>> block at <<line 8>>.
 At <<line 9>> the buffered output is decoded as UTF-8. The expected output is:

+----------------------------------------------------------------------------------------------
"Semantic Loft (beta) - Trastevere apartments | Rental in Rome - rentalinrome.com" . . . . . _:node14r93a8dex1 .
[The complete output is omitted for brevity.]
+----------------------------------------------------------------------------------------------

Filter Out Accidental Triples

 To remove accidental triples, <<Any23>> provides a set of useful filters,
 located within the <<org.apache.any23.filter>> package.
 The filter {{{./xref/org/apache/any23/filter/IgnoreTitlesOfEmptyDocuments.html}IgnoreTitlesOfEmptyDocuments}}
 removes the triples generated by the
 {{{./xref/org/apache/any23/extractor/html/TitleExtractor.html}TitleExtractor}}
 when the document is empty.
 The filter {{{./xref/org/apache/any23/filter/IgnoreAccidentalRDFa.html}IgnoreAccidentalRDFa}}
 removes accidental <<CSS>> related triples.

+------------------------------------
RDFWriter rdfWriter = ...
TripleHandler rdfWriterHandler = new RDFWriterTripleHandler(rdfWriter);
TripleHandler tripleHandler = new ReportingTripleHandler(
    new IgnoreAccidentalRDFa(
        new IgnoreTitlesOfEmptyDocuments(rdfWriterHandler),
        true // if true the CSS triples will be removed in any case.
    )
);
DocumentSource documentSource = ...
// Pass the outermost handler of the chain, otherwise the filters are bypassed.
any23.extract(documentSource, tripleHandler);
+------------------------------------
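
 The filters above are composed by wrapping one handler inside another, so the order of
 wrapping and the handler finally passed to the extraction both matter. The following is a
 minimal, self-contained sketch of that wrapping idea only. It does not use the <<Any23>>
 API: <<<SimpleTripleHandler>>>, <<<CollectingHandler>>>, <<<PrefixFilter>>> and the
 <<<css:>>> prefix rule are hypothetical stand-ins invented for illustration, and the real
 filters may decide what to drop in a different way.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified triple sink. NOT the real Any23 TripleHandler interface.
interface SimpleTripleHandler {
    void receive(String subject, String predicate, String object);
}

// Terminal handler: stores every triple it receives as one line of text.
final class CollectingHandler implements SimpleTripleHandler {
    final List<String> triples = new ArrayList<>();

    @Override
    public void receive(String subject, String predicate, String object) {
        triples.add(subject + " " + predicate + " " + object);
    }
}

// Hypothetical filter: forwards a triple to the wrapped handler unless its
// predicate starts with the blocked prefix.
final class PrefixFilter implements SimpleTripleHandler {
    private final SimpleTripleHandler wrapped;
    private final String blockedPrefix;

    PrefixFilter(SimpleTripleHandler wrapped, String blockedPrefix) {
        this.wrapped = wrapped;
        this.blockedPrefix = blockedPrefix;
    }

    @Override
    public void receive(String subject, String predicate, String object) {
        if (!predicate.startsWith(blockedPrefix)) {
            wrapped.receive(subject, predicate, object);
        }
    }
}

final class FilterChainSketch {
    // Stand-in for an extraction run: emits one wanted and one accidental triple.
    static void emitSample(SimpleTripleHandler handler) {
        handler.receive("doc", "dc:title", "\"Home\"");
        handler.receive("doc", "css:stylesheet", "style.css");
    }

    // Runs the sample either through the filter chain or straight into the sink.
    static List<String> collect(boolean useFilter) {
        CollectingHandler sink = new CollectingHandler();
        SimpleTripleHandler handler = useFilter ? new PrefixFilter(sink, "css:") : sink;
        emitSample(handler);
        return sink.triples;
    }

    public static void main(String[] args) {
        System.out.println("filtered:   " + collect(true));
        System.out.println("unfiltered: " + collect(false));
    }
}
```

 In this sketch, handing the inner sink directly to <<<emitSample>>> lets the accidental
 triple through, while handing it the outermost wrapper drops that triple. The same reasoning
 is why the outermost handler of a filter chain should be the one passed to the extraction.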