Parallel Crawler Engine

A single crawler instance can crawl one site quickly. However, if you have to crawl 10,000 sites quickly, you need the ParallelCrawlerEngine. It lets you crawl a configurable number of sites concurrently to maximize throughput.

Example Usage

The concurrency is configurable by setting maxConcurrentSiteCrawls in the web/app.config file. The default value is 3, so the following block of code will crawl three sites simultaneously.
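If you would rather set this value in code than in app.config, a minimal sketch might look like the following. The MaxConcurrentSiteCrawls property name comes from the configuration tables later in this document, but the CrawlConfigurationX type and where the property lives are assumptions, not confirmed AbotX API:

```csharp
// Hypothetical sketch -- the CrawlConfigurationX type name is an assumption
// based on typical AbotX conventions; check your AbotX version for the
// actual configuration type.
var config = new CrawlConfigurationX
{
    MaxConcurrentSiteCrawls = 3 // crawl up to three sites at the same time
};
```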

static void Main(string[] args)
{
    XmlConfigurator.Configure(); //So the logger works

    var siteToCrawlProvider = new SiteToCrawlProvider();
    siteToCrawlProvider.AddSitesToCrawl(new List<SiteToCrawl>
    {
        new SiteToCrawl{ Uri = new Uri("http://somesitetocrawl1.com/") },
        new SiteToCrawl{ Uri = new Uri("http://somesitetocrawl2.com/") },
        new SiteToCrawl{ Uri = new Uri("http://somesitetocrawl3.com/") },
    });

    //Create the crawl engine instance
    var crawlEngine = new ParallelCrawlerEngine(new ParallelImplementationContainer {
        SiteToCrawlProvider = siteToCrawlProvider
    });

    //Register for site level events
    crawlEngine.AllCrawlsCompleted += (sender, eventArgs) =>
    {
        Console.WriteLine("Completed crawling all sites");
    };
    crawlEngine.SiteCrawlCompleted += (sender, eventArgs) =>
    {
        Console.WriteLine("Completed crawling site {0}", eventArgs.CrawledSite.SiteToCrawl.Uri);       
    };
    crawlEngine.CrawlerInstanceCreated += (sender, eventArgs) =>
    {
        //Register for crawler level events. These are Abot's events!!!
        eventArgs.Crawler.PageCrawlCompleted += (abotSender, abotEventArgs) =>
        {
            Console.WriteLine("You have the crawled page here in abotEventArgs.CrawledPage...");
        };
    };

    crawlEngine.StartAsync();

    Console.WriteLine("Press enter key to stop");
    Console.Read();
}

Easy Override Of Default Implementations

ParallelCrawlerEngine allows easy override of one or all of its dependent implementations. Below is an example of how you would plug in your own implementations (constructed the same way as above). The ParallelImplementationOverride class makes plugging in nested dependencies much easier than it used to be; it handles finding exactly where each implementation is needed.

var crawlEngine = new ParallelCrawlerEngine(new ParallelImplementationContainer {
    SiteToCrawlProvider = yourSiteToCrawlProvider,
    WebCrawlerFactory = yourFactory,
    //...(Excluded)
});
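For example, a custom provider could pull its sites from your own storage. The sketch below only relies on AddSitesToCrawl(), which appears in the example above; subclassing SiteToCrawlProvider this way is an assumption about the extension point, so treat it as illustrative rather than the confirmed AbotX API:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Illustrative only: assumes SiteToCrawlProvider can be subclassed and that
// AddSitesToCrawl (shown in the example above) is the seeding mechanism.
public class StoreBackedSiteToCrawlProvider : SiteToCrawlProvider
{
    public StoreBackedSiteToCrawlProvider(IEnumerable<Uri> urisFromYourStore)
    {
        // Seed the provider with sites loaded from your own database/queue.
        AddSitesToCrawl(urisFromYourStore
            .Select(u => new SiteToCrawl { Uri = u })
            .ToList());
    }
}
```

An instance of this class would then be passed as the SiteToCrawlProvider in the container above.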

Pause And Resume

Pause and resume on the ParallelCrawlerEngine simply relay the command to each active CrawlerX instance. Be aware, however, that any in-progress HTTP requests will be finished and processed, and any events related to those requests will still fire.

crawlEngine.StartAsync();

System.Threading.Thread.Sleep(3000);
crawlEngine.Pause();
System.Threading.Thread.Sleep(10000);
crawlEngine.Resume();

Stop

Stopping the crawl is as simple as calling Stop(). The call to Stop() tells AbotX to make no new HTTP requests but to finish any that are in progress. All events and processing for the in-progress requests will complete before each CrawlerX instance stops its crawl.

crawlEngine.StartAsync();

System.Threading.Thread.Sleep(3000);
crawlEngine.Stop();

By passing true to the Stop() method, each CrawlerX instance is stopped more abruptly. Anything in progress will be aborted.

crawlEngine.Stop(true);

Speed Up

The ParallelCrawlerEngine can be "sped up" by calling the SpeedUp() method. The call to SpeedUp() tells AbotX to increase the number of concurrent site crawls that are currently running. You can call this method as many times as you like. Adjustments are made instantly, so you should see more concurrency immediately.

crawlEngine.StartAsync();

System.Threading.Thread.Sleep(3000);
crawlEngine.SpeedUp();

System.Threading.Thread.Sleep(3000);
crawlEngine.SpeedUp();

Configure how aggressively to speed up

config.Accelerator.ConcurrentSiteCrawlsIncrement (used by ParallelCrawlerEngine)
    The number to add to MaxConcurrentSiteCrawls on each call to the SpeedUp() method. This controls site crawl concurrency, NOT the number of concurrent HTTP requests within a single site crawl.

config.Accelerator.ConcurrentRequestIncrement (used by CrawlerX)
    The number to add to MaxConcurrentThreads on each call to the SpeedUp() method. This controls the number of concurrent HTTP requests within a single site crawl.

config.Accelerator.DelayDecrementInMilliseconds (used by CrawlerX)
    If there is a configured (manual or programmatically determined) delay between requests to a site, this is the number of milliseconds removed from that value on each call to the SpeedUp() method.

config.Accelerator.MinDelayInMilliseconds (used by CrawlerX)
    If there is a configured (manual or programmatically determined) delay between requests to a site, this is the minimum delay in milliseconds, no matter how many times SpeedUp() is called.

config.Accelerator.ConcurrentSiteCrawlsMax (used by ParallelCrawlerEngine)
    The maximum number of concurrent site crawls to allow, no matter how many times SpeedUp() is called.

config.Accelerator.ConcurrentRequestMax (used by CrawlerX)
    The maximum number of concurrent HTTP requests to a single site, no matter how many times SpeedUp() is called.
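Put together, tuning the accelerator might look like the sketch below. The property names come from the settings described above; the CrawlConfigurationX type holding the Accelerator object is an assumption about where these settings live:

```csharp
// Sketch: tune how aggressively SpeedUp() ramps concurrency.
// The CrawlConfigurationX type name is assumed, not confirmed API.
var config = new CrawlConfigurationX();
config.Accelerator.ConcurrentSiteCrawlsIncrement = 2;  // +2 site crawls per SpeedUp() call
config.Accelerator.ConcurrentRequestIncrement = 5;     // +5 requests per site per SpeedUp() call
config.Accelerator.DelayDecrementInMilliseconds = 100; // shave 100 ms off any per-request delay
config.Accelerator.MinDelayInMilliseconds = 50;        // but never delay less than 50 ms
config.Accelerator.ConcurrentSiteCrawlsMax = 10;       // hard ceiling on site concurrency
config.Accelerator.ConcurrentRequestMax = 25;          // hard ceiling on per-site requests
```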

Slow Down

The ParallelCrawlerEngine can be "slowed down" by calling the SlowDown() method. The call to SlowDown() tells AbotX to reduce the number of concurrent site crawls that are currently running. You can call this method as many times as you like. Any currently executing crawls will finish normally before any adjustments are made.

crawlEngine.StartAsync();

System.Threading.Thread.Sleep(3000);
crawlEngine.SlowDown();

System.Threading.Thread.Sleep(3000);
crawlEngine.SlowDown();

Configure how aggressively to slow down

config.Decelerator.ConcurrentSiteCrawlsDecrement (used by ParallelCrawlerEngine)
    The number to subtract from MaxConcurrentSiteCrawls on each call to the SlowDown() method. This controls site crawl concurrency, NOT the number of concurrent HTTP requests within a single site crawl.

config.Decelerator.ConcurrentRequestDecrement (used by CrawlerX)
    The number to subtract from MaxConcurrentThreads on each call to the SlowDown() method. This controls the number of concurrent HTTP requests within a single site crawl.

config.Decelerator.DelayIncrementInMilliseconds (used by CrawlerX)
    If there is a configured (manual or programmatically determined) delay between requests to a site, this is the number of milliseconds added to that value on each call to the SlowDown() method.

config.Decelerator.MaxDelayInMilliseconds (used by CrawlerX)
    The maximum delay in milliseconds between requests, no matter how many times SlowDown() is called.

config.Decelerator.ConcurrentSiteCrawlsMin (used by ParallelCrawlerEngine)
    The minimum number of concurrent site crawls to allow, no matter how many times SlowDown() is called.

config.Decelerator.ConcurrentRequestMin (used by CrawlerX)
    The minimum number of concurrent HTTP requests to a single site, no matter how many times SlowDown() is called.
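Mirroring the accelerator, tuning the decelerator might look like the sketch below. The property names come from the settings described above; the CrawlConfigurationX type holding the Decelerator object is an assumption about where these settings live:

```csharp
// Sketch: tune how aggressively SlowDown() backs off.
// The CrawlConfigurationX type name is assumed, not confirmed API.
var config = new CrawlConfigurationX();
config.Decelerator.ConcurrentSiteCrawlsDecrement = 1;  // -1 site crawl per SlowDown() call
config.Decelerator.ConcurrentRequestDecrement = 2;     // -2 requests per site per SlowDown() call
config.Decelerator.DelayIncrementInMilliseconds = 500; // add 500 ms delay per SlowDown() call
config.Decelerator.MaxDelayInMilliseconds = 5000;      // delay never exceeds 5 seconds
config.Decelerator.ConcurrentSiteCrawlsMin = 1;        // always keep at least one site crawling
config.Decelerator.ConcurrentRequestMin = 1;           // and at least one request per site
```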
