Site Audits are performed by our crawler, AhrefsSiteAudit. More information about this specific bot can be found here.
At Ahrefs, we provide comprehensive options for how Site Audits are performed, giving you full flexibility and control over your technical site audits.
Overview of each settings section
First, navigate to the Site Audit settings of your project. You will find three sections:
1. Schedule
This configures when Site Audit crawls your website on a recurring schedule, and how often it does so. You can adjust everything from the day to the time and timezone at which the scheduled crawl runs:
Please note that the actual crawl may start at any time within the selected hour. If you don't want Site Audit to run automatically, toggle "Run scheduled Crawls" off.
2. URL sources
URL sources specify the "seed URLs", or starting pages, that Site Audit will try to visit first. By default, only the "Website" and "Auto-detected sitemaps" options are selected, which is best if you just want to crawl all pages in this project's scope.
💡 If you only want to crawl URLs of a specific sitemap, follow this guide.
💡 Click on each toggle below for more information about all 5 URL source, or seed URL, options:
Website. Checking this box means Site Audit will take the project URL as the starting point for the crawl, i.e. whatever URL you've entered as this project's scope:
Auto-detected sitemaps. Checking this box means Site Audit will start crawling from the sitemap files listed in your website’s robots.txt file. If the robots.txt file does not list sitemaps, it will check the default sitemap locations:
<your website>.com/sitemap.xml
<your website>.com/sitemap_index.xml
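For reference, sitemaps are listed in robots.txt with the Sitemap directive. A minimal sketch of the lines the crawler looks for, using a hypothetical example.com site:
Sitemap: https://example.com/sitemap.xml
Sitemap: https://example.com/sitemap_index.xml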
Specific sitemaps. Checking this box allows you to start crawling from a custom list of sitemap files. An input box for entering sitemap URLs will open once the box is checked:
Custom URL list. Checking this box allows you to enter a list of URLs for Site Audit to start crawling from, either in the provided input box or by uploading a CSV / TXT file. The file size limit is 16 MB. Do note that only URLs within the project's scope will be crawled.
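For illustration, an uploaded TXT file is simply a plain list of URLs, typically one per line. A minimal sketch, assuming a hypothetical project scoped to example.com:
https://example.com/
https://example.com/blog/first-post
https://example.com/pricing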
Backlinks. Checking this box means Site Audit will start crawling your website from the URLs that have external backlinks in our database. You can check which URLs these are by entering your project URL into Site Explorer and checking its Backlinks report:
3. Crawl settings
There is a considerable list of settings available here, and each of them has a tooltip providing more information about that option:
Click on each toggle below for more information about each setting:
Speed settings
Controls how quickly the crawler "follows" links on your website. In the example below, 30,000 URLs are crawled per minute.
Settings
With the example settings below:
The crawler will not execute JavaScript when checking pages, but it will check image, CSS, and JavaScript links for any issues.
The crawler will follow links on non-canonical pages, as well as nofollow links.
The crawler will completely ignore any links outside of your project's website scope
The crawler will also check links exactly as they are found, without removing URL parameters
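To illustrate that last point with a hypothetical URL: a link discovered as https://example.com/page?utm_source=newsletter would be crawled with the parameter intact, rather than being trimmed to https://example.com/page.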
Limits
This section sets the thresholds at which the crawler will stop trying to crawl new pages. In the example below, the crawler will stop if any of these limits are hit:
10,000 pages are crawled
Crawl takes 48 hours
And any pages matching the following are ignored:
Deeper than 16 levels from seed
More than 16 folders deep
Have a URL longer than 2048 characters
Have more than 12 URL query parameters
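As a hypothetical illustration of the last three limits: the URL https://example.com/a/b/c/page?x=1&y=2 is 3 folders deep, is 38 characters long, and has 2 query parameters, so it falls well within the example thresholds. Depth from seed, by contrast, counts link "hops" from a seed URL rather than folders in the path.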
Robots instructions
In this section, you can instruct the crawler to ignore robots.txt and change the user agent from Desktop to Mobile. The full user agent strings for both can be found on AhrefsSiteAudit's own page.
This feature is only available for verified projects. It is useful for auditing parts of the website that bots are disallowed from crawling.
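For context, such a disallow rule in the site's robots.txt might look like this (hypothetical staging path):
User-agent: *
Disallow: /staging/
With "ignore robots.txt" enabled on a verified project, Site Audit can still crawl and audit the pages under /staging/.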
PageSpeed Insights
PageSpeed Insights (PSI) scores the speed and user experience of a webpage. Site Audit will flag any pages where the PSI score is low. You'll need to enter your API key from Google to use this feature.
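For reference, this is a standard Google API key for the PageSpeed Insights API. Under the hood, a PSI request looks roughly like the following (hypothetical URL and placeholder key):
https://www.googleapis.com/pagespeedonline/v5/runPagespeed?url=https://example.com/&key=YOUR_API_KEY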
Include and exclude URLs
Use this if you want to crawl only specific pages, or avoid crawling specific pages, using regular expressions. View this article for more information on how to use regex, along with some examples you can try.
Please note that only valid regular expressions will be accepted. If the data entered into the box does not form a valid regex, it will be ignored. Please also do not enter blank lines in the box.
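For illustration, two hypothetical patterns, one per line (exact matching behavior may depend on how your URLs are structured):
blog/.* limits the crawl to blog URLs when used as an include rule.
\?page=[0-9]+ skips paginated URLs when used as an exclude rule.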
URL rewrite rules
You can view examples of how to use this field (especially with regex expressions) here.
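As a minimal sketch of the general idea, assuming a find-pattern and replacement pair and a hypothetical tracking parameter:
Find: ^(.*)\?utm_source=.*$
Replace: \1
A rule like this would strip ?utm_source=... from discovered URLs, so duplicate variants of the same page collapse into one crawled URL.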
Frequently asked questions
I'm new to Site Audit. Which settings should I use?
If you just want to crawl your website fully, you can leave the settings at their defaults. We would recommend turning the "Execute Javascript" toggle on if your website relies heavily on JavaScript to generate page content. If you're still not sure, you can contact our support team by email or by live chat.
I made changes to the site audit settings but nothing changed in my site audit reports. Why?
Any saved changes to project settings will only apply to new Site Audit crawls. Past or ongoing Site Audit crawls will not be affected.
The data in Site Explorer for my website is wrong/incomplete. Is it because I'm not crawling the website properly in Site Audit?
Crawling in Site Audit does not update any data in Site Explorer. Site Explorer's data is populated by AhrefsBot, a different crawler from the one used for Site Audit. If the website is new, it can take some time for our crawler to get to it. Otherwise, please check your website here to see if there are any issues with our crawler visiting it.
Related