org.apache.any23.plugin.crawler
Class SiteCrawler

java.lang.Object
  extended by org.apache.any23.plugin.crawler.SiteCrawler

public class SiteCrawler
extends Object

A basic site crawler for extracting semantic content from small and medium-sized sites.

Author:
Michele Mostarda (mostarda@fbk.eu)

Field Summary
static int DEFAULT_NUM_OF_CRAWLERS
          Default number of crawler instances.
static String DEFAULT_PAGE_FILTER_RE
           
static Class<? extends edu.uci.ics.crawler4j.crawler.WebCrawler> DEFAULT_WEB_CRAWLER
          Default crawler implementation.
 Pattern defaultFilters
          Default filter applied to skip contents.
 
Constructor Summary
SiteCrawler(File storageFolder)
          Constructor.
 
Method Summary
 void addListener(CrawlerListener listener)
          Registers a CrawlerListener to this crawler.
 int getMaxDepth()
           
 int getMaxPages()
           
 int getNumOfCrawlers()
           
 int getPolitenessDelay()
           
 Class<? extends edu.uci.ics.crawler4j.crawler.WebCrawler> getWebCrawler()
           
 void removeListener(CrawlerListener listener)
          Deregisters a CrawlerListener from this crawler.
 void setMaxDepth(int maxDepth)
          Sets the maximum depth.
 void setMaxPages(int maxPages)
          Sets the maximum collected pages.
 void setNumOfCrawlers(int n)
          Sets the number of crawler instances.
 void setPolitenessDelay(int millis)
          Sets the politeness delay.
 void setWebCrawler(Class<? extends edu.uci.ics.crawler4j.crawler.WebCrawler> c)
          Sets the actual crawler class.
 void start(URL seed, boolean wait)
          Starts the crawler process with the defaultFilters.
 void start(URL seed, Pattern filters, boolean wait)
          Starts the crawling process.
 void stop()
          Interrupts the crawler process, if it was started with the wait flag set to false.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

DEFAULT_PAGE_FILTER_RE

public static final String DEFAULT_PAGE_FILTER_RE
See Also:
Constant Field Values

DEFAULT_NUM_OF_CRAWLERS

public static final int DEFAULT_NUM_OF_CRAWLERS
Default number of crawler instances.

See Also:
Constant Field Values

DEFAULT_WEB_CRAWLER

public static final Class<? extends edu.uci.ics.crawler4j.crawler.WebCrawler> DEFAULT_WEB_CRAWLER
Default crawler implementation.


defaultFilters

public final Pattern defaultFilters
Default filter applied to skip contents.

Constructor Detail

SiteCrawler

public SiteCrawler(File storageFolder)
Constructor.

Parameters:
storageFolder - location used to store the temporary data structures used by the crawler.
Method Detail

getNumOfCrawlers

public int getNumOfCrawlers()
Returns:
number of crawler instances.

setNumOfCrawlers

public void setNumOfCrawlers(int n)
Sets the number of crawler instances.

Parameters:
n - an integer >= 0.

getWebCrawler

public Class<? extends edu.uci.ics.crawler4j.crawler.WebCrawler> getWebCrawler()

setWebCrawler

public void setWebCrawler(Class<? extends edu.uci.ics.crawler4j.crawler.WebCrawler> c)
Sets the actual crawler class.

Parameters:
c - a non-null class.

getMaxDepth

public int getMaxDepth()
Returns:
the maximum allowed crawl depth; -1 means no limit.

setMaxDepth

public void setMaxDepth(int maxDepth)
Sets the maximum depth.

Parameters:
maxDepth - maximum allowed depth. -1 means no limit.

getMaxPages

public int getMaxPages()
Returns:
max number of allowed pages.

setMaxPages

public void setMaxPages(int maxPages)
Sets the maximum collected pages.

Parameters:
maxPages - maximum allowed pages. -1 means no limit.
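
The -1 convention documented for setMaxDepth and setMaxPages can be sketched as follows. This is a stand-alone illustration, not code taken from SiteCrawler: the class CrawlLimitSketch, the constant NO_LIMIT and the inclusive comparison are assumptions made for the example.

```java
// Illustrative only: not part of Any23. Shows one way to interpret a limit
// where -1 means "no limit", as documented for maxDepth and maxPages.
class CrawlLimitSketch {

    // Sentinel value documented for setMaxDepth and setMaxPages.
    static final int NO_LIMIT = -1;

    // Returns true if value is allowed under limit.
    // Assumption: the bound is inclusive (value == limit is still allowed).
    static boolean withinLimit(int value, int limit) {
        return limit == NO_LIMIT || value <= limit;
    }

    public static void main(String[] args) {
        System.out.println(withinLimit(3, 5));           // true
        System.out.println(withinLimit(6, 5));           // false
        System.out.println(withinLimit(1000, NO_LIMIT)); // true
    }
}
```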

getPolitenessDelay

public int getPolitenessDelay()
Returns:
the politeness delay in milliseconds.

setPolitenessDelay

public void setPolitenessDelay(int millis)
Sets the politeness delay. -1 means no politeness.

Parameters:
millis - delay in milliseconds.

addListener

public void addListener(CrawlerListener listener)
Registers a CrawlerListener to this crawler.

Parameters:
listener - the listener to register.

removeListener

public void removeListener(CrawlerListener listener)
Deregisters a CrawlerListener from this crawler.

Parameters:
listener - the listener to deregister.
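
The register/deregister contract of addListener and removeListener can be illustrated with a small stand-alone sketch. The methods of CrawlerListener are not shown on this page, so the example uses a hypothetical PageListener interface and a hypothetical notifyVisited method; it only demonstrates the general listener pattern and is not the Any23 implementation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

// Illustrative stand-in for the listener handling of a crawler.
class ListenerSketch {

    // Hypothetical replacement for CrawlerListener, whose real methods
    // are not listed on this page.
    interface PageListener {
        void pageVisited(String url);
    }

    // Thread-safe list, assuming several crawler instances may notify concurrently.
    private final List<PageListener> listeners = new CopyOnWriteArrayList<>();

    void addListener(PageListener listener) {
        listeners.add(listener);
    }

    void removeListener(PageListener listener) {
        listeners.remove(listener);
    }

    // Hypothetical hook called by the simulated crawler for every visited page.
    void notifyVisited(String url) {
        for (PageListener listener : listeners) {
            listener.pageVisited(url);
        }
    }

    // Registers a collector, visits one page, deregisters, visits another.
    static List<String> demo() {
        List<String> seen = new ArrayList<>();
        ListenerSketch crawler = new ListenerSketch();
        PageListener collector = seen::add;
        crawler.addListener(collector);
        crawler.notifyVisited("http://example.org/a");
        crawler.removeListener(collector);
        crawler.notifyVisited("http://example.org/b"); // no longer delivered
        return seen;
    }

    public static void main(String[] args) {
        System.out.println(demo()); // [http://example.org/a]
    }
}
```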

start

public void start(URL seed,
                  Pattern filters,
                  boolean wait)
           throws Exception
Starts the crawling process.

Parameters:
seed - the starting URL for the crawler process.
filters - filters to be applied to the crawler process. Can be null.
wait - if true the process will wait for the crawler termination.
Throws:
Exception
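
A filters argument could be built with java.util.regex as in the sketch below. The regular expression is hypothetical and is not the value of DEFAULT_PAGE_FILTER_RE, which this page does not list. The helper shouldSkip and the assumptions that the pattern is matched against the whole URL and that a null filter skips nothing are made for illustration only.

```java
import java.util.regex.Pattern;

// Stand-alone sketch: shows how a skip filter could be built and evaluated.
class SkipFilterSketch {

    // Hypothetical filter: matches URLs ending in common non-HTML extensions.
    static final Pattern FILTERS = Pattern.compile(
            ".*\\.(css|js|gif|jpe?g|png|pdf|zip)$", Pattern.CASE_INSENSITIVE);

    // Assumption: a URL is skipped when the whole string matches the filter;
    // a null filter (allowed by start) is taken to mean "skip nothing".
    static boolean shouldSkip(Pattern filters, String url) {
        return filters != null && filters.matcher(url).matches();
    }

    public static void main(String[] args) {
        System.out.println(shouldSkip(FILTERS, "http://example.org/logo.PNG"));   // true
        System.out.println(shouldSkip(FILTERS, "http://example.org/index.html")); // false
        System.out.println(shouldSkip(null, "http://example.org/logo.png"));      // false
    }
}
```

A pattern built this way would then be passed as the filters parameter of start.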

start

public void start(URL seed,
                  boolean wait)
           throws Exception
Starts the crawler process with the defaultFilters.

Parameters:
seed - the starting URL for the crawler process.
wait - if true the process will wait for the crawler termination.
Throws:
Exception

stop

public void stop()
Interrupts the crawler process, if it was started with the wait flag set to false.
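
The interaction between the wait flag of start and stop can be sketched with a plain worker thread. This is a simplified analogue under stated assumptions, not the SiteCrawler implementation: the class StartStopSketch, its crawlMillis parameter and the isRunning method are hypothetical, and the crawl is simulated by a sleep.

```java
// Illustrative stand-in, not the Any23 implementation: mimics the documented
// behaviour of start(..., wait) and stop() with a plain worker thread.
class StartStopSketch {

    private final long crawlMillis; // simulated crawl duration (hypothetical)
    private Thread worker;

    StartStopSketch(long crawlMillis) {
        this.crawlMillis = crawlMillis;
    }

    // wait == true: returns only when the simulated crawl has terminated.
    // wait == false: returns immediately, the crawl continues in background.
    void start(boolean wait) {
        worker = new Thread(() -> {
            try {
                Thread.sleep(crawlMillis);
            } catch (InterruptedException e) {
                // stop() was called: end the simulated crawl early.
            }
        });
        worker.start();
        if (wait) {
            awaitWorker();
        }
    }

    // Interrupts a crawl started with wait == false and waits for it to end.
    void stop() {
        if (worker != null) {
            worker.interrupt();
            awaitWorker();
        }
    }

    boolean isRunning() {
        return worker != null && worker.isAlive();
    }

    private void awaitWorker() {
        try {
            worker.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the caller's interrupt status
        }
    }

    public static void main(String[] args) {
        StartStopSketch blocking = new StartStopSketch(50);
        blocking.start(true);
        System.out.println(blocking.isRunning());   // false: crawl already finished

        StartStopSketch background = new StartStopSketch(60_000);
        background.start(false);
        System.out.println(background.isRunning()); // true: still crawling
        background.stop();
        System.out.println(background.isRunning()); // false: interrupted
    }
}
```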



Copyright © 2010-2012 The Apache Software Foundation. All Rights Reserved.