Configuration

Configuration of AbotX is very flexible. You can use an XML config file (web.config or app.config), POCO (Plain Old C# Objects), or a combination of both. All of these are demonstrated below.

Like its functionality, AbotX's configuration builds on top of Abot's. For more information on how to configure and use Abot (not AbotX), see the quickstart here. To find out exactly what each Abot config value does, see the code comments here.

See the full xml example at the bottom of this page or by clicking here.

Configure AbotX

The following is an example of a full XML configuration. AbotX loads these values automatically when they are placed in the app.config or web.config file.

<?xml version="1.0" encoding="utf-8" ?>
<configuration>
  <configSections>
    <section name="abotX" type="AbotX.Core.AbotXConfigurationSectionHandler, AbotX" />
  </configSections>
  <abotX
      maxConcurrentSiteCrawls="3"
      sitesToCrawlBatchSizePerRequest="25"
      minSiteToCrawlRequestDelayInSecs="15"
      isJavascriptRenderingEnabled="false"
      javascriptRenderingWaitTimeInMilliseconds="3500"
    >
    <autoThrottling
      isEnabled="false"
      thresholdMed="5"
      thresholdHigh="10"
      thresholdTimeInMilliseconds="5000"
      minAdjustmentWaitTimeInSecs="30"
    />
    <autoTuning
      isEnabled="false"
      cpuThresholdMed="65"
      cpuThresholdHigh="85"
      minAdjustmentWaitTimeInSecs="30"
    />
    <accelerator
      concurrentSiteCrawlsIncrement="2"
      concurrentRequestIncrement="2"
      delayDecrementInMilliseconds="2000"
      minDelayInMilliseconds="0"
      concurrentRequestMax="10"
      concurrentSiteCrawlsMax="3"  
    />
    <decelerator
      concurrentSiteCrawlsDecrement="2"
      concurrentRequestDecrement="2"      
      delayIncrementInMilliseconds="2000"
      maxDelayInMilliseconds="15000"
      concurrentRequestMin="1"
      concurrentSiteCrawlsMin="1"         
    />
  </abotX>
</configuration>

The following is the equivalent configuration using POCO (Plain Old C# Objects). Notice the X in CrawlConfigurationX, which is a subclass of Abot's CrawlConfiguration class. Unlike the XML values, a POCO config is not picked up automatically; you must pass the config object into the constructor of CrawlerX or ParallelCrawlerEngine.

var config = new CrawlConfigurationX
{
    MaxConcurrentSiteCrawls = 3,
    SitesToCrawlBatchSizePerRequest = 25,
    MinSiteToCrawlRequestDelayInSecs = 15,
    IsJavascriptRenderingEnabled = false,
    JavascriptRenderingWaitTimeInMilliseconds = 3500
    //etc...
};

//Now you must pass it into the constructor of CrawlerX or ParallelCrawlerEngine
var crawler = new CrawlerX(config);
var crawlerEngine = new ParallelCrawlerEngine(config);
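
The //etc... in the object initializer stands in for the nested sections of the XML example. A minimal sketch of how two of them might look as POCO follows. The property and type names (AutoThrottling, AutoThrottlingConfig, Decelerator, DeceleratorConfig) and the AbotX.Poco namespace are assumptions that mirror the XML element and attribute names; check them against the AbotX version you have installed.

```csharp
using AbotX.Poco; // assumed namespace for CrawlConfigurationX

public static class NestedConfigExample
{
    public static CrawlConfigurationX BuildConfig()
    {
        return new CrawlConfigurationX
        {
            MaxConcurrentSiteCrawls = 3,

            // Assumed to mirror the <autoThrottling> element
            AutoThrottling = new AutoThrottlingConfig
            {
                IsEnabled = false,
                ThresholdMed = 5,
                ThresholdHigh = 10,
                ThresholdTimeInMilliseconds = 5000,
                MinAdjustmentWaitTimeInSecs = 30
            },

            // Assumed to mirror the <decelerator> element
            Decelerator = new DeceleratorConfig
            {
                ConcurrentSiteCrawlsDecrement = 2,
                ConcurrentRequestDecrement = 2,
                DelayIncrementInMilliseconds = 2000,
                MaxDelayInMilliseconds = 15000,
                ConcurrentRequestMin = 1,
                ConcurrentSiteCrawlsMin = 1
            }
        };
    }
}
```

The autoTuning and accelerator sections would presumably follow the same pattern.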

You can also use a combination of both. The following creates a config object using the xml values then overrides some of the values manually.

var config = AbotXConfigurationSectionHandler.LoadFromXml().Convert();
config.MaxConcurrentSiteCrawls = 10;//Override an AbotX config value
config.MaxConcurrentThreads = 5;//Override an Abot config value
//etc...

//Now you must pass it into the constructor of CrawlerX or ParallelCrawlerEngine
var crawler = new CrawlerX(config);
var crawlerEngine = new ParallelCrawlerEngine(config);

Configure Abot

Since CrawlerX and the ParallelCrawlerEngine classes extend or use Abot, we need to configure Abot as well. To learn more about configuring Abot see the quickstart here.

<?xml version="1.0" encoding="utf-8" ?>
<configuration>
  <configSections>
    <section name="abot" type="Abot.Core.AbotConfigurationSectionHandler, Abot" />
  </configSections>
  <abot>
    <crawlBehavior
      maxConcurrentThreads="10"
      maxPagesToCrawl="1000"
      maxPagesToCrawlPerDomain="0"
      maxPageSizeInBytes="1048576"
      userAgentString="Mozilla/5.0 (Windows NT 6.3; Trident/7.0; rv:11.0) like Gecko"
      crawlTimeoutSeconds="10"
      downloadableContentTypes="text/html"
      isUriRecrawlingEnabled="false"
      isExternalPageCrawlingEnabled="false"
      isExternalPageLinksCrawlingEnabled="false"
      httpServicePointConnectionLimit="200"
      httpRequestTimeoutInSeconds="15"
      httpRequestMaxAutoRedirects="7"
      isHttpRequestAutoRedirectsEnabled="true"
      isHttpRequestAutomaticDecompressionEnabled="false"
      isSendingCookiesEnabled="false"
      isSslCertificateValidationEnabled="false"
      isRespectUrlNamedAnchorOrHashbangEnabled="false"
      minAvailableMemoryRequiredInMb="0"
      maxMemoryUsageInMb="0"
      maxMemoryUsageCacheTimeInSeconds="0"
      maxCrawlDepth="1000"
      maxLinksPerPage="1000"
      isForcedLinkParsingEnabled="false"
      maxRetryCount="0"
      minRetryDelayInMilliseconds="0" />
    <politeness
      isRespectRobotsDotTextEnabled="false"
      isRespectMetaRobotsNoFollowEnabled="false"
      isRespectHttpXRobotsTagHeaderNoFollowEnabled="false"
      isRespectAnchorRelNoFollowEnabled="false"
      robotsDotTextUserAgentString="abotagent"
      maxRobotsDotTextCrawlDelayInSeconds="5"
      minCrawlDelayPerDomainMilliSeconds="0" />
    <extensionValues>
      <add key="key1" value="value1" />
      <add key="key2" value="value2" />
    </extensionValues>
  </abot>

</configuration>
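
Because CrawlConfigurationX is a subclass of Abot's CrawlConfiguration, the same Abot values can also be set in code. The sketch below sets a handful of the values from the XML above. It assumes the property names are the PascalCase form of the XML attribute names (as with MaxConcurrentThreads in the override example earlier); the AbotX.Poco namespace and the ConfigurationExtensions dictionary are likewise assumptions to verify against your installed version.

```csharp
using AbotX.Poco; // assumed namespace for CrawlConfigurationX

public static class AbotPocoExample
{
    public static CrawlConfigurationX BuildConfig()
    {
        var config = new CrawlConfigurationX
        {
            // <crawlBehavior> values
            MaxConcurrentThreads = 10,
            MaxPagesToCrawl = 1000,
            CrawlTimeoutSeconds = 10,
            IsExternalPageCrawlingEnabled = false,
            HttpRequestTimeoutInSeconds = 15,

            // <politeness> values
            IsRespectRobotsDotTextEnabled = false,
            RobotsDotTextUserAgentString = "abotagent",
            MinCrawlDelayPerDomainMilliSeconds = 0
        };

        // <extensionValues> entries (assumed to map to this dictionary)
        config.ConfigurationExtensions["key1"] = "value1";
        config.ConfigurationExtensions["key2"] = "value2";

        return config;
    }
}
```

As with the AbotX values, a config built this way must be passed into the constructor of CrawlerX or ParallelCrawlerEngine.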

Configure Logging

Abot and AbotX use log4net to log messages. These log statements are a great way to see what's going on during a crawl. However, if you don't want to use log4net, you can skip this section.

Below is an example log4net configuration. Read more about log4net at their website.

Add a using statement for log4net.

using log4net.Config;

Be sure to call the following method to tell log4net to read in the config file. This is usually called at the beginning of a console app or service, or in the global.asax of a web app.

XmlConfigurator.Configure();
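
Put together, the entry point of a console app might look like the sketch below. It assumes a .NET Framework console app whose app.config contains the log4net section shown in this guide; the class name Program and the log message are illustrative only.

```csharp
using log4net;
using log4net.Config;

public static class Program
{
    public static void Main(string[] args)
    {
        // Read the <log4net> section from app.config before any crawling starts
        XmlConfigurator.Configure();

        // Look up a logger by the name used in a <logger> element of the config
        ILog log = LogManager.GetLogger("AbotLogger");
        log.Info("log4net configured; starting crawl setup");

        // ... create the config and the CrawlerX or ParallelCrawlerEngine here
    }
}
```

If nothing is written to the console or the Logs directory, a likely cause is that Configure() was never called or that the logger name does not match one defined in the config file.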

The following configuration data should be added to the app.config file of the application that will be running AbotX. When run, this will create a "Logs" directory with two log files: AbotLog.txt, which contains the logs that Abot writes, and AbotXLog.txt, which contains the logs that AbotX writes. To learn more about modifying the logging behavior, see the log4net website.

<?xml version="1.0" encoding="utf-8" ?>
<configuration>
  <configSections>
    <section name="log4net" type="log4net.Config.Log4NetConfigurationSectionHandler, log4net" />
  </configSections>
  <log4net>
    <appender name="ConsoleAppender" type="log4net.Appender.ConsoleAppender">
      <layout type="log4net.Layout.PatternLayout">
        <conversionPattern value="[%date] [%thread] [%-5level] - %message%newline" />
      </layout>
    </appender>
    <appender name="AbotAppender" type="log4net.Appender.RollingFileAppender">
      <file value="Logs\AbotLog.txt" />
      <appendToFile value="true" />
      <rollingStyle value="Size" />
      <maxSizeRollBackups value="10" />
      <maximumFileSize value="10240KB" />
      <staticLogFileName value="true" />
      <preserveLogFileNameExtension value="true" />
      <layout type="log4net.Layout.PatternLayout">
        <conversionPattern value="[%date] [%thread] [%-5level] - %message%newline" />
      </layout>
    </appender>
    <appender name="AbotXAppender" type="log4net.Appender.RollingFileAppender">
      <file value="Logs\AbotXLog.txt" />
      <appendToFile value="true" />
      <rollingStyle value="Size" />
      <maxSizeRollBackups value="10" />
      <maximumFileSize value="10240KB" />
      <staticLogFileName value="true" />
      <preserveLogFileNameExtension value="true" />
      <layout type="log4net.Layout.PatternLayout">
        <conversionPattern value="[%date] [%-3thread] [%-5level] - %message%newline" />
      </layout>
    </appender>
    <logger name="AbotLogger">
      <level value="INFO" />
      <appender-ref ref="ConsoleAppender" />
      <appender-ref ref="AbotAppender" />
    </logger>
    <logger name="AbotXLogger">
      <level value="INFO" />
      <appender-ref ref="ConsoleAppender" />
      <appender-ref ref="AbotXAppender" />
    </logger>
  </log4net>
</configuration>

Full Configuration Example

The following is a full xml example. It includes all the config sections needed to fully configure AbotX through xml.

<?xml version="1.0" encoding="utf-8" ?>
<configuration>
  <configSections>
    <section name="log4net" type="log4net.Config.Log4NetConfigurationSectionHandler, log4net" />
    <section name="abot" type="Abot.Core.AbotConfigurationSectionHandler, Abot" />
    <section name="abotX" type="AbotX.Core.AbotXConfigurationSectionHandler, AbotX" />
  </configSections>
  <log4net>
    <appender name="ConsoleAppender" type="log4net.Appender.ConsoleAppender">
      <layout type="log4net.Layout.PatternLayout">
        <conversionPattern value="[%date] [%thread] [%-5level] - %message%newline" />
      </layout>
    </appender>
    <appender name="AbotAppender" type="log4net.Appender.RollingFileAppender">
      <file value="Logs\AbotLog.txt" />
      <appendToFile value="true" />
      <rollingStyle value="Size" />
      <maxSizeRollBackups value="10" />
      <maximumFileSize value="10240KB" />
      <staticLogFileName value="true" />
      <preserveLogFileNameExtension value="true" />
      <layout type="log4net.Layout.PatternLayout">
        <conversionPattern value="[%date] [%thread] [%-5level] - %message%newline" />
      </layout>
    </appender>
    <appender name="AbotXAppender" type="log4net.Appender.RollingFileAppender">
      <file value="Logs\AbotXLog.txt" />
      <appendToFile value="true" />
      <rollingStyle value="Size" />
      <maxSizeRollBackups value="10" />
      <maximumFileSize value="10240KB" />
      <staticLogFileName value="true" />
      <preserveLogFileNameExtension value="true" />
      <layout type="log4net.Layout.PatternLayout">
        <conversionPattern value="[%date] [%-3thread] [%-5level] - %message%newline" />
      </layout>
    </appender>
    <logger name="AbotLogger">
      <level value="INFO" />
      <appender-ref ref="ConsoleAppender" />
      <appender-ref ref="AbotAppender" />
    </logger>
    <logger name="AbotXLogger">
      <level value="INFO" />
      <appender-ref ref="ConsoleAppender" />
      <appender-ref ref="AbotXAppender" />
    </logger>
  </log4net>

  <abot>
    <crawlBehavior
      maxConcurrentThreads="10"
      maxPagesToCrawl="1000"
      maxPagesToCrawlPerDomain="0"
      maxPageSizeInBytes="1048576"
      userAgentString="Mozilla/5.0 (Windows NT 6.3; Trident/7.0; rv:11.0) like Gecko"
      crawlTimeoutSeconds="0"
      downloadableContentTypes="text/html"
      isUriRecrawlingEnabled="false"
      isExternalPageCrawlingEnabled="false"
      isExternalPageLinksCrawlingEnabled="false"
      httpServicePointConnectionLimit="200"
      httpRequestTimeoutInSeconds="15"
      httpRequestMaxAutoRedirects="7"
      isHttpRequestAutoRedirectsEnabled="true"
      isHttpRequestAutomaticDecompressionEnabled="false"
      isSendingCookiesEnabled="false"
      isSslCertificateValidationEnabled="false"
      isRespectUrlNamedAnchorOrHashbangEnabled="false"
      minAvailableMemoryRequiredInMb="0"
      maxMemoryUsageInMb="0"
      maxMemoryUsageCacheTimeInSeconds="0"
      maxCrawlDepth="1000"
      maxLinksPerPage="1000"
      isForcedLinkParsingEnabled="false"
      maxRetryCount="0"
      minRetryDelayInMilliseconds="0" />
    <politeness
      isRespectRobotsDotTextEnabled="false"
      isRespectMetaRobotsNoFollowEnabled="false"
      isRespectHttpXRobotsTagHeaderNoFollowEnabled="false"
      isRespectAnchorRelNoFollowEnabled="false"
      robotsDotTextUserAgentString="abotagent"
      maxRobotsDotTextCrawlDelayInSeconds="5"
      minCrawlDelayPerDomainMilliSeconds="0" />
    <extensionValues>
      <add key="key1" value="value1" />
      <add key="key2" value="value2" />
    </extensionValues>
  </abot>

  <abotX
      maxConcurrentSiteCrawls="3"
      sitesToCrawlBatchSizePerRequest="25"
      minSiteToCrawlRequestDelayInSecs="15"
      isJavascriptRenderingEnabled="false"
      javascriptRenderingWaitTimeInMilliseconds="3500"
    >
    <autoThrottling
      isEnabled="false"
      thresholdMed="5"
      thresholdHigh="10"
      thresholdTimeInMilliseconds="5000"
      minAdjustmentWaitTimeInSecs="30"
    />
    <autoTuning
      isEnabled="false"
      cpuThresholdMed="65"
      cpuThresholdHigh="85"
      minAdjustmentWaitTimeInSecs="30"
    />
    <accelerator
      concurrentSiteCrawlsIncrement="2"
      concurrentRequestIncrement="2"
      delayDecrementInMilliseconds="2000"
      minDelayInMilliseconds="0"
      concurrentRequestMax="10"
      concurrentSiteCrawlsMax="3"  
    />
    <decelerator
      concurrentSiteCrawlsDecrement="2"
      concurrentRequestDecrement="2"      
      delayIncrementInMilliseconds="2000"
      maxDelayInMilliseconds="15000"
      concurrentRequestMin="1"
      concurrentSiteCrawlsMin="1"         
    />
  </abotX>
</configuration>
