CrawlerX represents an individual crawler that crawls a single site at a time. It is a subclass of Abot's PoliteWebCrawler and adds some useful functionality.
This tutorial assumes that AbotX is configured using app/web.config. A more thorough tutorial on configuration is available here.
1. Create an instance of AbotX.Crawler.CrawlerX.
2. Register for events and create processing methods (both synchronous and asynchronous versions are available).
3. Run the crawl synchronously, or
4. Run the crawl asynchronously.
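The steps above can be sketched as follows. This is a minimal sketch, not a definitive implementation: the `PageCrawlCompleted` event and `CrawledPage`/`CrawlResult` members follow Abot's public API, and exact signatures may differ between AbotX versions.

```csharp
using System;
using System.Threading.Tasks;
using AbotX.Crawler; // namespace assumed from AbotX.Crawler.CrawlerX above

public class Program
{
    public static async Task Main()
    {
        // Uses configuration from app/web.config by default
        var crawler = new CrawlerX();

        // Register for events (synchronous and asynchronous variants exist)
        crawler.PageCrawlCompleted += (sender, e) =>
            Console.WriteLine($"Crawled: {e.CrawledPage.Uri}");

        // Run the crawl asynchronously...
        var result = await crawler.CrawlAsync(new Uri("http://example.com"));

        // ...or synchronously:
        // var result = crawler.Crawl(new Uri("http://example.com"));

        Console.WriteLine($"Error occurred: {result.ErrorOccurred}");
    }
}
```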
CrawlerX has default implementations for all of its dependencies. However, there are times when you may want to override one or all of those implementations. Below is an example of how you would plug in your own implementations. The new ImplementationOverride class makes plugging in nested dependencies much easier than it used to be with Abot. It will handle finding exactly where that implementation is needed.
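A sketch of overriding a single dependency. Here `MyCustomHyperLinkParser` is a hypothetical custom implementation, and the `CrawlConfigurationX` class name and `ImplementationOverride` constructor/property shapes are assumptions based on AbotX's documented usage:

```csharp
var config = new CrawlConfigurationX(); // AbotX's extended crawl configuration (assumed name)

// Plug in your own implementation; ImplementationOverride wires it into
// every nested component that depends on it.
var impls = new ImplementationOverride(config)
{
    HyperLinkParser = new MyCustomHyperLinkParser() // hypothetical custom class
};

var crawler = new CrawlerX(config, impls);
```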
Pause and resume work as you would expect. Just be aware that any in-progress HTTP requests will be finished and processed, and any events related to those requests will be fired.
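A sketch of pausing and resuming a running crawl (method names `Pause()`/`Resume()` follow AbotX's documented API; the surrounding setup is illustrative):

```csharp
var crawler = new CrawlerX();
var crawlTask = crawler.CrawlAsync(new Uri("http://example.com"));

crawler.Pause();   // no new requests; in-progress requests still finish,
                   // are processed, and fire their events
// ... do something while paused ...
crawler.Resume();

var result = await crawlTask;
```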
Stopping the crawl is as simple as calling Stop(). The call to Stop() tells AbotX to make no new HTTP requests but to finish any that are in progress. Any events and processing of the in-progress requests will finish before CrawlerX stops the crawl.
By passing true to the Stop() method, AbotX will stop the crawl more abruptly. Anything in progress will be aborted.
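The two stop modes described above can be sketched as (assuming a `crawler` instance with a crawl in progress):

```csharp
// Graceful stop: no new requests are made; in-progress requests, their
// processing, and their events all finish before the crawl ends.
crawler.Stop();

// Hard stop: pass true to abort anything still in progress.
crawler.Stop(true);
```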
CrawlerX can be "sped up" by calling the SpeedUp() method. The call to SpeedUp() tells AbotX to increase the number of concurrent HTTP requests to the currently running sites. You can call this method as many times as you like. Adjustments are made instantly, so you should see more concurrency immediately.
Name | Description | Used By |
---|---|---|
config.Accelerator.ConcurrentSiteCrawlsIncrement | The number by which to increment MaxConcurrentSiteCrawls on each call to the SpeedUp() method. This deals with site crawl concurrency, NOT the number of concurrent HTTP requests within a single site crawl. | ParallelCrawlerEngine |
config.Accelerator.ConcurrentRequestIncrement | The number by which to increment MaxConcurrentThreads on each call to the SpeedUp() method. This deals with the number of concurrent HTTP requests for a single crawl. | CrawlerX |
config.Accelerator.DelayDecrementInMilliseconds | If there is a configured (manually or programmatically determined) delay between requests to a site, this is the number of milliseconds removed from that configured value on every call to the SpeedUp() method. | CrawlerX |
config.Accelerator.MinDelayInMilliseconds | If there is a configured (manually or programmatically determined) delay between requests to a site, this is the minimum delay in milliseconds, no matter how many times SpeedUp() is called. | CrawlerX |
config.Accelerator.ConcurrentSiteCrawlsMax | The maximum number of concurrent site crawls to allow, no matter how many times SpeedUp() is called. | ParallelCrawlerEngine |
config.Accelerator.ConcurrentRequestMax | The maximum number of concurrent HTTP requests to a single site, no matter how many times SpeedUp() is called. | CrawlerX |
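A sketch of configuring the accelerator in code and then speeding up a running crawl. The property names come from the table above; the `CrawlConfigurationX` and `AcceleratorConfig` class names are assumptions:

```csharp
var config = new CrawlConfigurationX // assumed class name
{
    Accelerator = new AcceleratorConfig // assumed class name
    {
        ConcurrentRequestIncrement = 2,     // +2 concurrent requests per SpeedUp() call
        ConcurrentRequestMax = 10,          // hard ceiling no matter how many calls
        DelayDecrementInMilliseconds = 100, // shave 100 ms off any configured delay per call
        MinDelayInMilliseconds = 50         // never delay less than 50 ms
    }
};

var crawler = new CrawlerX(config);
var crawlTask = crawler.CrawlAsync(new Uri("http://example.com"));

crawler.SpeedUp(); // adjustments take effect immediately
```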
CrawlerX can be "slowed down" by calling the SlowDown() method. The call to SlowDown() tells AbotX to reduce the number of concurrent HTTP requests to the currently running sites. You can call this method as many times as you like. Any currently executing HTTP requests will finish normally before any adjustments are made.
Name | Description | Used By |
---|---|---|
config.Decelerator.ConcurrentSiteCrawlsDecrement | The number by which to decrement MaxConcurrentSiteCrawls on each call to the SlowDown() method. This deals with site crawl concurrency, NOT the number of concurrent HTTP requests within a single site crawl. | ParallelCrawlerEngine |
config.Decelerator.ConcurrentRequestDecrement | The number by which to decrement MaxConcurrentThreads on each call to the SlowDown() method. This deals with the number of concurrent HTTP requests for a single crawl. | CrawlerX |
config.Decelerator.DelayIncrementInMilliseconds | If there is a configured (manually or programmatically determined) delay between requests to a site, this is the number of milliseconds added to that configured value on every call to the SlowDown() method. | CrawlerX |
config.Decelerator.MaxDelayInMilliseconds | The maximum value, in milliseconds, that the delay can reach. | CrawlerX |
config.Decelerator.ConcurrentSiteCrawlsMin | The minimum number of concurrent site crawls to allow, no matter how many times SlowDown() is called. | ParallelCrawlerEngine |
config.Decelerator.ConcurrentRequestMin | The minimum number of concurrent HTTP requests to a single site, no matter how many times SlowDown() is called. | CrawlerX |
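And the mirror-image sketch for the decelerator. Again, the property names come from the table above, while the `CrawlConfigurationX` and `DeceleratorConfig` class names are assumptions:

```csharp
var config = new CrawlConfigurationX // assumed class name
{
    Decelerator = new DeceleratorConfig // assumed class name
    {
        ConcurrentRequestDecrement = 2,     // -2 concurrent requests per SlowDown() call
        ConcurrentRequestMin = 1,           // hard floor no matter how many calls
        DelayIncrementInMilliseconds = 250, // add 250 ms to any configured delay per call
        MaxDelayInMilliseconds = 5000       // never delay more than 5 seconds
    }
};

var crawler = new CrawlerX(config);
var crawlTask = crawler.CrawlAsync(new Uri("http://example.com"));

crawler.SlowDown(); // in-flight requests finish normally before the adjustment
```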