Basic Crawler Plugin
The Basic Crawler Plugin provides a command-line tool that extends Rover with site-crawling capabilities.
The tool can be used to extract semantic content from small and medium-sized sites.
To use it, make sure the basic-crawler plugin is configured so that the any23tools script can find it (follow the instructions in the Plugins section):
    core/bin/$ ./any23tools Crawler
    usage: [{<url>|<file>}]+ [-d <arg>] [-e <arg>] [-f <arg>] [-h] [-l <arg>]
           [-maxdepth <arg>] [-maxpages <arg>] [-n] [-numcrawlers <arg>]
           [-o <arg>] [-p] [-pagefilter <arg>] [-politenessdelay <arg>] [-s]
           [-storagefolder <arg>] [-t] [-v]
     -d,--defaultns <arg>       Override the default namespace used to produce
                                statements.
     -e <arg>                   Specify a comma-separated list of extractors,
                                e.g. rdf-xml,rdf-turtle.
     -f,--Output format <arg>   [turtle (default), rdfxml, ntriples, nquads,
                                trix, json, uri]
     -h,--help                  Print this help.
     -l,--log <arg>             Produce log within a file.
     -maxdepth <arg>            Max allowed crawler depth. Default: no limit.
     -maxpages <arg>            Max number of pages before interrupting crawl.
                                Default: no limit.
     -n,--nesting               Disable production of nesting triples.
     -numcrawlers <arg>         Sets the number of crawlers. Default: 10
     -o,--output <arg>          Specify Output file (defaults to standard
                                output).
     -p,--pedantic              Validate and fixes HTML content detecting
                                commons issues.
     -pagefilter <arg>          Regex used to filter out page URLs during
                                crawling. Default:
                                '.*(\.(css|js|bmp|gif|jpe?g|png|tiff?|mid|mp2|mp3|mp4|wav|wma|avi|mov|mpeg|ram|m4v|wmv|rm|smil|pdf|swf|zip|rar|gz|xml|txt))$'
     -politenessdelay <arg>     Politeness delay in milliseconds. Default: no
                                limit.
     -s,--stats                 Print out extraction statistics.
     -storagefolder <arg>       Folder used to store crawler temporary data.
                                Default:
                                [/var/folders/d5/c_0b4h1d7t1gx6tzz_dn5cj40000gq/T/]
     -t,--notrivial             Filter trivial statements (e.g. CSS related
                                ones).
     -v,--verbose               Show debug and progress information.
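As an illustration of how the options above combine, the following is a minimal sketch of a bounded crawl that writes N-Triples to a file. Every flag comes from the usage text; the seed URL, the output file name, the chosen limits, the SEED_URL, OUTPUT_FILE, CRAWL_CMD and RUN_CRAWL names, and the placement of options before the URL are assumptions made for the example. The script only prints the command unless RUN_CRAWL=1 is set, so it can be reviewed before a real crawl is started from core/bin/.

```shell
#!/bin/sh
# Placeholder values (assumptions for this sketch, not taken from the plugin docs).
SEED_URL="http://example.org/"
OUTPUT_FILE="crawl.nt"

# All flags below appear in the usage text:
#   -f ntriples          output format
#   -o <file>            output file instead of standard output
#   -maxdepth 2          follow links at most two levels deep
#   -maxpages 100        stop after 100 pages
#   -numcrawlers 4       use 4 crawlers instead of the default 10
#   -politenessdelay 500 politeness delay of 500 ms
CRAWL_CMD="./any23tools Crawler -f ntriples -o $OUTPUT_FILE -maxdepth 2 -maxpages 100 -numcrawlers 4 -politenessdelay 500 $SEED_URL"

# Print the command for review.
echo "$CRAWL_CMD"

# Run it only when explicitly requested (expects to be started from core/bin/).
if [ "${RUN_CRAWL:-0}" = "1" ]; then
    $CRAWL_CMD
fi
```

Keeping -maxdepth and -maxpages explicit is a deliberate choice here: both default to no limit, so an unbounded crawl is easy to start by accident on a site larger than expected.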