------
                                    Apache Any23 - Getting started
                                    ------
                              The Apache Software Foundation
                                    ------
                                     2011-2012

~~  Licensed to the Apache Software Foundation (ASF) under one or more
~~  contributor license agreements.  See the NOTICE file distributed with
~~  this work for additional information regarding copyright ownership.
~~  The ASF licenses this file to You under the Apache License, Version 2.0
~~  (the "License"); you may not use this file except in compliance with
~~  the License.  You may obtain a copy of the License at
~~
~~     http://www.apache.org/licenses/LICENSE-2.0
~~
~~  Unless required by applicable law or agreed to in writing, software
~~  distributed under the License is distributed on an "AS IS" BASIS,
~~  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
~~  See the License for the specific language governing permissions and
~~  limitations under the License.

Getting started with <<Apache Any23>>

    <<Apache Any23>> can be used:

      * via CLI (command line interface) from your preferred shell environment;
      * as a RESTful Webservice;
      * as a library.

* <<Apache Any23>> Modules

    <<Apache Any23>> is composed of the following modules:

      * <<<any23-core/>>>      The core library.

      * <<<any23-service/>>>   The REST service.

      * <<<any23-plugins/>>>   The core additional plugins.

* Use the <<Apache Any23>> CLI

   The command-line tools support is provided by the <<any23-core>> module.

   Once <<Apache Any23>> has been correctly {{{./install.html}installed}}, if you want to use it as a command line tool,
   use the shell script within the <<<any23-core/bin>>> directory.
   These are provided both for Unix (Linux/OSX) and Windows.

   The <<<any23>>> script provides analysis, documentation, testing and debugging utilities.

   Simply running <./any23> without options will show the <usage> options.

+-------------------------------------------
any23-core$ ./bin/any23
A command must be specified.
Usage: any23 [options] [command] [command options]
  Options:
    -h, --help          Display help information.
                        Default: false
        --plugins-dir   The Any23 plugins directory.
                        Default: ~/.any23/plugins
    -X, --verbose       Produce execution verbose output.
                        Default: false
    -v, --version       Display version information.
                        Default: false
  Commands:
    extractor      Utility for obtaining documentation about metadata extractors.
      Usage: extractor [options] Extractor name      
  Options:
          -a, --all     shows a report about all available extractors
                        Default: false
          -i, --input   shows example input for the given extractor
                        Default: false
          -l, --list    shows the names of all available extractors
                        Default: false
          -o, --outut   shows example output for the given extractor
                        Default: false

    microdata      Commandline Tool for extracting Microdata from file/HTTP source.
      Usage: microdata [options] Input document URL, {http://path/to/resource.html|file:/path/to/local.file}
    mimes      MIME Type Detector Tool.
      Usage: mimes [options] Input document URL, {http://path/to/resource.html|file:///path/to/local.file|inline:// some inline content}
    verify      Utility for plugin management verification.
      Usage: verify [options] plugins-dir
    rover      Any23 Command Line Tool.
      Usage: rover [options] input URIs {<url>|<file>}+
  Options:
          -d, --defaultns    Override the default namespace used to produce
                             statements.
          -e, --extractors   a comma-separated list of extractors, e.g.
                             rdf-xml,rdf-turtle
                             Default: []
          -f, --format       the output format
                             Default: turtle
          -l, --log          Produce log within a file.
          -n, --nesting      Disable production of nesting triples.
                             Default: false
          -t, --notrivial    Filter trivial statements (e.g. CSS related ones).
                             Default: false
          -o, --output       Specify Output file (defaults to standard output)
                             Default: java.io.PrintStream@79dfc547
          -p, --pedantic     Validate and fixes HTML content detecting commons
                             issues.
                             Default: false
          -s, --stats        Print out extraction statistics.
                             Default: false

    vocab      Prints out the RDF Schema of the vocabularies used by Any23.
      Usage: vocab [options]      
  Options:
          -f, --format   Vocabulary output format
                         Default: NQuads
+-------------------------------------------

   The <<<any23>>> script detects a list of available utilities within the <<any23-core>> and <<plugins>>
   classpath and allows to activate them.

   The <any23-core> CLI tools are:

       * <<<extractor>>>: a utility for obtaining useful information about extractors.

       * <<<microdata>>>:  commandline parser to extract specific Microdata content from a web page
         (local or remote) and produce a JSON output compliant with the Microdata
         specification ({{{http://www.w3.org/TR/microdata/}http://www.w3.org/TR/microdata/}}).

       * <<<mimes>>>: detects the MIME Type for any HTTP / file / direct input resource.

       * <<<verify>>>: a utility for verifying <Apache Any23> plugins.

       * <<<rover>>>: the RDF extraction tool.

       * <<<vocab>>>: allows to dump all the <<RDFSchema>> vocabularies declared within Apache Any23.

** The Rover tool

   Rover is the main extraction tool. It allows to extract metadata from local and remote (HTTP)
   resources, specify a custom list of extractors, specify the desired output format and other flags
   to suppress noise and generate advanced reports.

  Extract metadata from an <<HTML>> page:

+-----------------------------------------
any23-core$ ./bin/any23 rover http://yourdomain/yourfile
+-----------------------------------------

  Extract metadata from a <<local>> resource:

+--------------------------------------
any23-core$ ./bin/any23 rover myfoaf.rdf
+--------------------------------------

  Specify the output format, use the option <<"-f">> or <<"--format">>:
  (Default output format is <<TURTLE>>).

+--------------------------------------
any23-core$ ./bin/any23 rover -f quad myfoaf.rdf
+--------------------------------------

  Filtering trivial statements

    By default, <<Apache Any23>> will extract <HTML/head> meta information, such as links to <CSS stylesheets> or meta
    information like the author or the software used to create the <html>. Hence, if the user is only interested
    in the structured content from the <HTML/body> tag we offer a filter functionality, activated by the <<"-t">>
    command line argument.

+-------------------------
any23-core$ ./bin/any23 rover -t -f quad myfoaf.rdf
+-------------------------

** The ExtractorDocumentation tool

   The ExtractorDocumentation returns human readable information
   about the registered extractors.

   List all the available extractors:

+--------------------------------------
any23-core/core$ ./bin/any23 extractor --list
                      csv [class org.apache.any23.extractor.csv.CSVExtractor]
           html-head-icbm [class org.apache.any23.extractor.html.ICBMExtractor]
          html-head-links [class org.apache.any23.extractor.html.HeadLinkExtractor]
          html-head-title [class org.apache.any23.extractor.html.TitleExtractor]
              html-mf-adr [class org.apache.any23.extractor.html.AdrExtractor]
              html-mf-geo [class org.apache.any23.extractor.html.GeoExtractor]
        html-mf-hcalendar [class org.apache.any23.extractor.html.HCalendarExtractor]
            html-mf-hcard [class org.apache.any23.extractor.html.HCardExtractor]
         html-mf-hlisting [class org.apache.any23.extractor.html.HListingExtractor]
          html-mf-hrecipe [class org.apache.any23.extractor.html.HRecipeExtractor]
          html-mf-hresume [class org.apache.any23.extractor.html.HResumeExtractor]
          html-mf-hreview [class org.apache.any23.extractor.html.HReviewExtractor]
          html-mf-license [class org.apache.any23.extractor.html.LicenseExtractor]
          html-mf-species [class org.apache.any23.extractor.html.SpeciesExtractor]
              html-mf-xfn [class org.apache.any23.extractor.html.XFNExtractor]
           html-microdata [class org.apache.any23.extractor.microdata.MicrodataExtractor]
              html-rdfa11 [class org.apache.any23.extractor.rdfa.RDFa11Extractor]
       html-script-turtle [class org.apache.any23.extractor.html.TurtleHTMLExtractor]
                   rdf-nq [class org.apache.any23.extractor.rdf.NQuadsExtractor]
                   rdf-nt [class org.apache.any23.extractor.rdf.NTriplesExtractor]
                 rdf-trix [class org.apache.any23.extractor.rdf.TriXExtractor]
               rdf-turtle [class org.apache.any23.extractor.rdf.TurtleExtractor]
                  rdf-xml [class org.apache.any23.extractor.rdf.RDFXMLExtractor]
+--------------------------------------

** The MicrodataParser tool

   The <MicrodataParser> tool allows to apply the only MicrodataExtractor 
   on a specific input source and returns the extracted data in the JSON format
   declared in the Microdata specification section {{{http://www.w3.org/TR/microdata/#json}JSON}}.

+--------------------------------------
any23-core/core$ ./bin/any23 microdata http://path/to/resource.html
+--------------------------------------


** The VocabPrinter tool

   The VocabPrinter Tool prints out the RDFSchema declared by all the <<Apache Any23>>
   declared vocabularies.

  Just launch the command below to see all the managed vocabularies.

+--------------------------------------
any23-core/core$ ./bin/any23 vocab
+--------------------------------------

   <NOTE>: <<This tool is still in beta version.>>

** The MimeDetector tool

   The MimeDetector Tool extracts the <<MIME Type>> for a given source (http:// file:// inline://).

   Examples:

+--------------------------------------
any23-core$ ./bin/any23 mimes http://www.michelemostarda.com/foaf.rdf
application/rdf+xml
+--------------------------------------

+--------------------------------------
any23-core$ ./bin/any23 mimes file://../src/test/resources/application/trix/test1.trx
application/trix
+--------------------------------------

+--------------------------------------
any23-core$ ./bin/any23 mimes 'inline://<http://s> <http://p> <http://o> .'
text/n3
+--------------------------------------

** The PluginVerifier tool

  The PluginVerifier tool allows checking installed plugin in the specified input directory

  Just launch the command below to sanity-check the input plugins directory

+--------------------------------------
any23-core$ ./bin/any23 verify [/path/to/plugins/dir]
+--------------------------------------

* <<Apache Any23>> CLI <Plugins>

   The <<Apache Any23>> ToolRunner CLI (<bin/any23tools>) supports the auto detection of Tool plugins within the classpath.
   For further details see {{{./any23-plugins.html}Plugins}} section.

   The default <<any23>> CLI plugins are enlisted below.

** Crawler Plugin

   {crawler-tool}
   The <Crawler Plugin> provides basic site crawling and metadata extraction capabilities.

+----------------------------------------------------------------------------
any23-core$ ./bin/any23 -h
[...]
    crawler      Any23 Crawler Command Line Tool.
      Usage: crawler [options] input URIs {<url>|<file>}+
  Options:
          -d, --defaultns          Override the default namespace used to
                                   produce statements.
          -e, --extractors         a comma-separated list of extractors, e.g.
                                   rdf-xml,rdf-turtle
                                   Default: []
          -f, --format             the output format
                                   Default: turtle
          -l, --log                Produce log within a file.
          -md, --maxdepth          Max allowed crawler depth.
                                   Default: 2147483647
          -mp, --maxpages          Max number of pages before interrupting
                                   crawl.
                                   Default: 2147483647
          -n, --nesting            Disable production of nesting triples.
                                   Default: false
          -t, --notrivial          Filter trivial statements (e.g. CSS related
                                   ones).
                                   Default: false
          -nc, --numcrawlers       Sets the number of crawlers.
                                   Default: 10
          -o, --output             Specify Output file (defaults to standard
                                   output)
                                   Default: java.io.PrintStream@2911a3a4
          -pf, --pagefilter        Regex used to filter out page URLs during
                                   crawling.
                                   Default: .*(\.(css|js|bmp|gif|jpe?g|png|tiff?|mid|mp2|mp3|mp4|wav|wma|avi|mov|mpeg|ram|m4v|wmv|rm|smil|pdf|swf|zip|rar|gz|xml|txt))$
          -p, --pedantic           Validate and fixes HTML content detecting
                                   commons issues.
                                   Default: false
          -pd, --politenessdelay   Politeness delay in milliseconds.
                                   Default: 2147483647
          -s, --stats              Print out extraction statistics.
                                   Default: false
          -sf, --storagefolder     Folder used to store crawler temporary data.
                                   Default: /var/folders/zz/9vvv_lbn1cs8dpwz859nmq080000gn/T/crawler-metadata-9ff4c650-10c2-41a1-9d99-ebeb3e7d21ce
+----------------------------------------------------------------------------

    A usage example:

+----------------------------------------------------------------------------
any23-core$ ./bin/any23 crawler -s -f ntriples http://www.repubblica.it 1> out.nt 2> repubblica.log
+----------------------------------------------------------------------------

* Use <<Apache Any23>> as a RESTful Web Service

   <<Apache Any23>> provides a Web Service that can be used to extract <RDF> from Web documents.
   <<Apache Any23>> services can be accessed through a {{{./service.html}RESTful API}}.

   Running the server

    The server command line tool is defined within the <<any23-service>> module.
    Run the <<<any23server>>> script

+--------------------------
any23-service$ ./bin/any23server
+--------------------------

    from the command line in order to start up the server, then go to {{{http://localhost:8080/}}}
    to access the web interface. A live demo version of such service is running at {{{http://any23.org/}}}.
    You can also start the server from Java by running the
    {{{./xref/org/apache/any23/servlet/Servlet.html}Apache Any23 Servlet}} class. Maven can be used to create a WAR
    file for deployment into an existing servlet container such as {{{http://tomcat.apache.org/}Apache Tomcat}}.

* Use <<Apache Any23>> as a Library

   See our {{{./developers.html}Developers guide}} for more details.