java.lang.Object
  org.apache.any23.plugin.crawler.SiteCrawler
public class SiteCrawler
A basic site crawler for extracting semantic content from small and medium-sized sites.
Field Summary | |
---|---|
static int | DEFAULT_NUM_OF_CRAWLERS: Default number of crawler instances. |
static String | DEFAULT_PAGE_FILTER_RE |
static Class<? extends edu.uci.ics.crawler4j.crawler.WebCrawler> | DEFAULT_WEB_CRAWLER: Default crawler implementation. |
Pattern | defaultFilters: Default filter applied to skip contents. |
Constructor Summary | |
---|---|
SiteCrawler(File storageFolder) | Constructor. |
Method Summary | |
---|---|
void | addListener(CrawlerListener listener): Registers a CrawlerListener with this crawler. |
int | getMaxDepth() |
int | getMaxPages() |
int | getNumOfCrawlers() |
int | getPolitenessDelay() |
Class<? extends edu.uci.ics.crawler4j.crawler.WebCrawler> | getWebCrawler() |
void | removeListener(CrawlerListener listener): Deregisters a CrawlerListener from this crawler. |
void | setMaxDepth(int maxDepth): Sets the maximum depth. |
void | setMaxPages(int maxPages): Sets the maximum number of collected pages. |
void | setNumOfCrawlers(int n): Sets the number of crawler instances. |
void | setPolitenessDelay(int millis): Sets the politeness delay. |
void | setWebCrawler(Class<? extends edu.uci.ics.crawler4j.crawler.WebCrawler> c): Sets the actual crawler class. |
void | start(URL seed, boolean wait): Starts the crawler process with the defaultFilters. |
void | start(URL seed, Pattern filters, boolean wait): Starts the crawling process. |
void | stop(): Interrupts the crawler process if it was started with wait flag == false. |
Methods inherited from class java.lang.Object |
---|
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
Field Detail |
---|
public static final String DEFAULT_PAGE_FILTER_RE
public static final int DEFAULT_NUM_OF_CRAWLERS
public static final Class<? extends edu.uci.ics.crawler4j.crawler.WebCrawler> DEFAULT_WEB_CRAWLER
public final Pattern defaultFilters
Constructor Detail |
---|
public SiteCrawler(File storageFolder)
Parameters:
storageFolder - location used to store the temporary data structures used by the crawler.

Method Detail |
---|
public int getNumOfCrawlers()

public void setNumOfCrawlers(int n)
Sets the number of crawler instances.
Parameters:
n - an integer >= 0.

public Class<? extends edu.uci.ics.crawler4j.crawler.WebCrawler> getWebCrawler()

public void setWebCrawler(Class<? extends edu.uci.ics.crawler4j.crawler.WebCrawler> c)
Sets the actual crawler class.
Parameters:
c - a non-null class.

public int getMaxDepth()
-1 means no limit.

public void setMaxDepth(int maxDepth)
Sets the maximum depth.
Parameters:
maxDepth - maximum allowed depth. -1 means no limit.

public int getMaxPages()

public void setMaxPages(int maxPages)
Sets the maximum number of collected pages.
Parameters:
maxPages - maximum allowed pages. -1 means no limit.

public int getPolitenessDelay()

public void setPolitenessDelay(int millis)
Sets the politeness delay. -1 means no politeness delay.
Parameters:
millis - delay in milliseconds.

public void addListener(CrawlerListener listener)
Registers a CrawlerListener with this crawler.
Parameters:
listener - the listener to register.

public void removeListener(CrawlerListener listener)
Deregisters a CrawlerListener from this crawler.
Parameters:
listener - the listener to deregister.

public void start(URL seed, Pattern filters, boolean wait) throws Exception
Starts the crawling process.
Parameters:
seed - the starting URL for the crawler process.
filters - filters to be applied to the crawler process. Can be null.
wait - if true, the process will wait for the crawler to terminate.
Throws:
Exception
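
The filters argument described above is a java.util.regex.Pattern that may be null. The following is a minimal sketch of how such a filter could be applied to candidate URLs, using only the JDK. The class PageFilterSketch, the helper isSkipped, and the expression EXAMPLE_FILTER_RE are hypothetical names introduced for illustration; the expression is not the actual value of DEFAULT_PAGE_FILTER_RE, and treating a null filter as "skip nothing" is an assumption rather than documented behavior.

```java
import java.util.Locale;
import java.util.regex.Pattern;

final class PageFilterSketch {

    // Hypothetical filter expression: skip common static resources by extension.
    // This is NOT the actual value of DEFAULT_PAGE_FILTER_RE.
    static final String EXAMPLE_FILTER_RE = ".*\\.(css|js|gif|jpe?g|png|pdf|zip)$";

    /**
     * Returns true when the URL matches the filter and should be skipped.
     * A null filter is treated as "skip nothing" (an assumption).
     */
    static boolean isSkipped(String url, Pattern filters) {
        if (filters == null) {
            return false;
        }
        // Lower-case the URL so that ".PNG" and ".png" are treated alike.
        return filters.matcher(url.toLowerCase(Locale.ROOT)).matches();
    }

    public static void main(String[] args) {
        Pattern filters = Pattern.compile(EXAMPLE_FILTER_RE);
        System.out.println(isSkipped("http://example.org/style.css", filters));  // true
        System.out.println(isSkipped("http://example.org/about.html", filters)); // false
        System.out.println(isSkipped("http://example.org/logo.PNG", null));      // false
    }
}
```

A pattern compiled this way could then be passed as the filters argument of start(URL seed, Pattern filters, boolean wait).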

public void start(URL seed, boolean wait) throws Exception
Starts the crawler process with the defaultFilters.
Parameters:
seed - the starting URL for the crawler process.
wait - if true, the process will wait for the crawler to terminate.
Throws:
Exception

public void stop()
Interrupts the crawler process if it was started with wait flag == false.
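
The relationship between the wait flag and stop() can be illustrated with a generic, JDK-only sketch. This is not SiteCrawler's implementation: the class WaitFlagSketch, its page counter, and the 5 ms pause are hypothetical stand-ins, chosen only to show a blocking start (wait == true) next to a background start (wait == false) that is later interrupted.

```java
import java.util.concurrent.atomic.AtomicInteger;

final class WaitFlagSketch {

    private final AtomicInteger visited = new AtomicInteger();
    private volatile Thread worker;

    /** Starts the work; blocks until it finishes only when wait is true. */
    void start(int pages, boolean wait) {
        Thread t = new Thread(() -> {
            for (int i = 0; i < pages && !Thread.currentThread().isInterrupted(); i++) {
                visited.incrementAndGet(); // stand-in for visiting one page
                try {
                    Thread.sleep(5);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt(); // restore the flag so the loop exits
                }
            }
        });
        worker = t;
        t.start();
        if (wait) {
            join(t);
        }
    }

    /** Interrupts a run started with wait == false, then waits for it to end. */
    void stop() {
        Thread t = worker;
        if (t != null) {
            t.interrupt();
            join(t); // joining here is a choice made for this sketch
        }
    }

    int visitedCount() {
        return visited.get();
    }

    private static void join(Thread t) {
        try {
            t.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        WaitFlagSketch blocking = new WaitFlagSketch();
        blocking.start(3, true);                       // returns only after all 3 "pages"
        System.out.println(blocking.visitedCount());   // 3

        WaitFlagSketch background = new WaitFlagSketch();
        background.start(1_000_000, false);            // returns immediately
        background.stop();                             // interrupts the background run
        System.out.println(background.visitedCount() < 1_000_000); // true
    }
}
```

With wait == true the caller has nothing left to interrupt once start returns, which is consistent with stop() being documented only for the wait == false case.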