------ Apache Any23 - Getting started ------ The Apache Software Foundation ------ 2011-2012 ~~ Licensed to the Apache Software Foundation (ASF) under one or more ~~ contributor license agreements. See the NOTICE file distributed with ~~ this work for additional information regarding copyright ownership. ~~ The ASF licenses this file to You under the Apache License, Version 2.0 ~~ (the "License"); you may not use this file except in compliance with ~~ the License. You may obtain a copy of the License at ~~ ~~ http://www.apache.org/licenses/LICENSE-2.0 ~~ ~~ Unless required by applicable law or agreed to in writing, software ~~ distributed under the License is distributed on an "AS IS" BASIS, ~~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. ~~ See the License for the specific language governing permissions and ~~ limitations under the License. Getting started with <> <> can be used: * via CLI (command line interface) from your preferred shell environment; * as a RESTful Webservice; * as a library. * <> Modules <> is composed of the following modules: * <<>> The core library. * <<>> The REST service. * <<>> The core additional plugins. * Use the <> CLI The command-line tools support is provided by the <> module. Once <> has been correctly {{{./install.html}installed}}, if you want to use it as a command line tool, use the shell script within the <<>> directory. These are provided both for Unix (Linux/OSX) and Windows. The <<>> script provides analysis, documentation, testing and debugging utilities. Simply running <./any23> without options will show the options. +------------------------------------------- any23-core$ ./bin/any23 A command must be specified. Usage: any23 [options] [command] [command options] Options: -h, --help Display help information. Default: false --plugins-dir The Any23 plugins directory. Default: ~/.any23/plugins -X, --verbose Produce execution verbose output. Default: false -v, --version Display version information. Default: false Commands: extractor Utility for obtaining documentation about metadata extractors. Usage: extractor [options] Extractor name Options: -a, --all shows a report about all available extractors Default: false -i, --input shows example input for the given extractor Default: false -l, --list shows the names of all available extractors Default: false -o, --outut shows example output for the given extractor Default: false microdata Commandline Tool for extracting Microdata from file/HTTP source. Usage: microdata [options] Input document URL, {http://path/to/resource.html|file:/path/to/local.file} mimes MIME Type Detector Tool. Usage: mimes [options] Input document URL, {http://path/to/resource.html|file:///path/to/local.file|inline:// some inline content} verify Utility for plugin management verification. Usage: verify [options] plugins-dir rover Any23 Command Line Tool. Usage: rover [options] input URIs {|}+ Options: -d, --defaultns Override the default namespace used to produce statements. -e, --extractors a comma-separated list of extractors, e.g. rdf-xml,rdf-turtle Default: [] -f, --format the output format Default: turtle -l, --log Produce log within a file. -n, --nesting Disable production of nesting triples. Default: false -t, --notrivial Filter trivial statements (e.g. CSS related ones). Default: false -o, --output Specify Output file (defaults to standard output) Default: java.io.PrintStream@79dfc547 -p, --pedantic Validate and fixes HTML content detecting commons issues. Default: false -s, --stats Print out extraction statistics. Default: false vocab Prints out the RDF Schema of the vocabularies used by Any23. Usage: vocab [options] Options: -f, --format Vocabulary output format Default: NQuads +------------------------------------------- The <<>> script detects a list of available utilities within the <> and <> classpath and allows to activate them. The CLI tools are: * <<>>: a utility for obtaining useful information about extractors. * <<>>: commandline parser to extract specific Microdata content from a web page (local or remote) and produce a JSON output compliant with the Microdata specification ({{{http://www.w3.org/TR/microdata/}http://www.w3.org/TR/microdata/}}). * <<>>: detects the MIME Type for any HTTP / file / direct input resource. * <<>>: a utility for verifying plugins. * <<>>: the RDF extraction tool. * <<>>: allows to dump all the <> vocabularies declared within Apache Any23. ** The Rover tool Rover is the main extraction tool. It allows to extract metadata from local and remote (HTTP) resources, specify a custom list of extractors, specify the desired output format and other flags to suppress noise and generate advanced reports. Extract metadata from an <> page: +----------------------------------------- any23-core$ ./bin/any23 rover http://yourdomain/yourfile +----------------------------------------- Extract metadata from a <> resource: +-------------------------------------- any23-core$ ./bin/any23 rover myfoaf.rdf +-------------------------------------- Specify the output format, use the option <<"-f">> or <<"--format">>: (Default output format is <>). +-------------------------------------- any23-core$ ./bin/any23 rover -f quad myfoaf.rdf +-------------------------------------- Filtering trivial statements By default, <> will extract meta information, such as links to or meta information like the author or the software used to create the . Hence, if the user is only interested in the structured content from the tag we offer a filter functionality, activated by the <<"-t">> command line argument. +------------------------- any23-core$ ./bin/any23 rover -t -f quad myfoaf.rdf +------------------------- ** The ExtractorDocumentation tool The ExtractorDocumentation returns human readable information about the registered extractors. List all the available extractors: +-------------------------------------- any23-core/core$ ./bin/any23 extractor --list csv [class org.apache.any23.extractor.csv.CSVExtractor] html-head-icbm [class org.apache.any23.extractor.html.ICBMExtractor] html-head-links [class org.apache.any23.extractor.html.HeadLinkExtractor] html-head-title [class org.apache.any23.extractor.html.TitleExtractor] html-mf-adr [class org.apache.any23.extractor.html.AdrExtractor] html-mf-geo [class org.apache.any23.extractor.html.GeoExtractor] html-mf-hcalendar [class org.apache.any23.extractor.html.HCalendarExtractor] html-mf-hcard [class org.apache.any23.extractor.html.HCardExtractor] html-mf-hlisting [class org.apache.any23.extractor.html.HListingExtractor] html-mf-hrecipe [class org.apache.any23.extractor.html.HRecipeExtractor] html-mf-hresume [class org.apache.any23.extractor.html.HResumeExtractor] html-mf-hreview [class org.apache.any23.extractor.html.HReviewExtractor] html-mf-license [class org.apache.any23.extractor.html.LicenseExtractor] html-mf-species [class org.apache.any23.extractor.html.SpeciesExtractor] html-mf-xfn [class org.apache.any23.extractor.html.XFNExtractor] html-microdata [class org.apache.any23.extractor.microdata.MicrodataExtractor] html-rdfa11 [class org.apache.any23.extractor.rdfa.RDFa11Extractor] html-script-turtle [class org.apache.any23.extractor.html.TurtleHTMLExtractor] rdf-nq [class org.apache.any23.extractor.rdf.NQuadsExtractor] rdf-nt [class org.apache.any23.extractor.rdf.NTriplesExtractor] rdf-trix [class org.apache.any23.extractor.rdf.TriXExtractor] rdf-turtle [class org.apache.any23.extractor.rdf.TurtleExtractor] rdf-xml [class org.apache.any23.extractor.rdf.RDFXMLExtractor] +-------------------------------------- ** The MicrodataParser tool The tool allows to apply the only MicrodataExtractor on a specific input source and returns the extracted data in the JSON format declared in the Microdata specification section {{{http://www.w3.org/TR/microdata/#json}JSON}}. +-------------------------------------- any23-core/core$ ./bin/any23 microdata http://path/to/resource.html +-------------------------------------- ** The VocabPrinter tool The VocabPrinter Tool prints out the RDFSchema declared by all the <> declared vocabularies. Just launch the command below to see all the managed vocabularies. +-------------------------------------- any23-core/core$ ./bin/any23 vocab +-------------------------------------- : <> ** The MimeDetector tool The MimeDetector Tool extracts the <> for a given source (http:// file:// inline://). Examples: +-------------------------------------- any23-core$ ./bin/any23 mimes http://www.michelemostarda.com/foaf.rdf application/rdf+xml +-------------------------------------- +-------------------------------------- any23-core$ ./bin/any23 mimes file://../src/test/resources/application/trix/test1.trx application/trix +-------------------------------------- +-------------------------------------- any23-core$ ./bin/any23 mimes 'inline:// .' text/n3 +-------------------------------------- ** The PluginVerifier tool The PluginVerifier tool allows checking installed plugin in the specified input directory Just launch the command below to sanity-check the input plugins directory +-------------------------------------- any23-core$ ./bin/any23 verify [/path/to/plugins/dir] +-------------------------------------- * <> CLI The <> ToolRunner CLI () supports the auto detection of Tool plugins within the classpath. For further details see {{{./any23-plugins.html}Plugins}} section. The default <> CLI plugins are enlisted below. ** Crawler Plugin {crawler-tool} The provides basic site crawling and metadata extraction capabilities. +---------------------------------------------------------------------------- any23-core$ ./bin/any23 -h [...] crawler Any23 Crawler Command Line Tool. Usage: crawler [options] input URIs {|}+ Options: -d, --defaultns Override the default namespace used to produce statements. -e, --extractors a comma-separated list of extractors, e.g. rdf-xml,rdf-turtle Default: [] -f, --format the output format Default: turtle -l, --log Produce log within a file. -md, --maxdepth Max allowed crawler depth. Default: 2147483647 -mp, --maxpages Max number of pages before interrupting crawl. Default: 2147483647 -n, --nesting Disable production of nesting triples. Default: false -t, --notrivial Filter trivial statements (e.g. CSS related ones). Default: false -nc, --numcrawlers Sets the number of crawlers. Default: 10 -o, --output Specify Output file (defaults to standard output) Default: java.io.PrintStream@2911a3a4 -pf, --pagefilter Regex used to filter out page URLs during crawling. Default: .*(\.(css|js|bmp|gif|jpe?g|png|tiff?|mid|mp2|mp3|mp4|wav|wma|avi|mov|mpeg|ram|m4v|wmv|rm|smil|pdf|swf|zip|rar|gz|xml|txt))$ -p, --pedantic Validate and fixes HTML content detecting commons issues. Default: false -pd, --politenessdelay Politeness delay in milliseconds. Default: 2147483647 -s, --stats Print out extraction statistics. Default: false -sf, --storagefolder Folder used to store crawler temporary data. Default: /var/folders/zz/9vvv_lbn1cs8dpwz859nmq080000gn/T/crawler-metadata-9ff4c650-10c2-41a1-9d99-ebeb3e7d21ce +---------------------------------------------------------------------------- A usage example: +---------------------------------------------------------------------------- any23-core$ ./bin/any23 crawler -s -f ntriples http://www.repubblica.it 1> out.nt 2> repubblica.log +---------------------------------------------------------------------------- * Use <> as a RESTful Web Service <> provides a Web Service that can be used to extract from Web documents. <> services can be accessed through a {{{./service.html}RESTful API}}. Running the server The server command line tool is defined within the <> module. Run the <<>> script +-------------------------- any23-service$ ./bin/any23server +-------------------------- from the command line in order to start up the server, then go to {{{http://localhost:8080/}}} to access the web interface. A live demo version of such service is running at {{{http://any23.org/}}}. You can also start the server from Java by running the {{{./xref/org/apache/any23/servlet/Servlet.html}Apache Any23 Servlet}} class. Maven can be used to create a WAR file for deployment into an existing servlet container such as {{{http://tomcat.apache.org/}Apache Tomcat}}. * Use <> as a Library See our {{{./developers.html}Developers guide}} for more details.