In general, you can use “Remove URL Parameters” or use exclusion rules to avoid crawling URLs matching specific queries or query patterns.
To access these settings, go to Project Settings > Site Audit > Crawl Settings. Note: This option is also available when creating a new project.
Include and Exclude URLs
Only crawl URLs matching the pattern - we match against the full URL of the sitemap links.
For example, /products will only include URLs that contain /products.
Don’t crawl URLs matching the pattern - we match against the full URL of the sitemap links.
For example, /news will exclude URLs that contain /news.
*Note: these rules accept regular expressions. Here is a more detailed help article on how to use Regular Expressions in Site Audit advanced filters and how to set "Include" and "Exclude" rules for the crawl.
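As a rough illustration of how include and exclude patterns behave, here is a minimal Python sketch. It is not how Site Audit is implemented; the function name should_crawl, the example.com URLs, and the choice to match the pattern anywhere in the full URL are assumptions made for the example.

```python
import re


def should_crawl(url, include=None, exclude=None):
    """Hypothetical helper: decide whether a URL passes the crawl rules."""
    # An include pattern, when set, must match somewhere in the full URL.
    if include and not re.search(include, url):
        return False
    # An exclude pattern, when set, must not match anywhere in the full URL.
    if exclude and re.search(exclude, url):
        return False
    return True


# Made-up URLs for illustration only.
urls = [
    "https://example.com/products/shoes",
    "https://example.com/news/launch",
    "https://example.com/about",
]

only_products = [u for u in urls if should_crawl(u, include=r"/products")]
without_news = [u for u in urls if should_crawl(u, exclude=r"/news")]
```

In this sketch, only_products keeps the single /products URL, and without_news keeps every URL except the /news one.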
Removing URL Parameters:
Toggling this on means that when a URL has parameters, we remove them and crawl the URL without its parameters.
For example, for a URL under ahrefs.com/help/ that carries a query string, only ahrefs.com/help/ will be crawled.
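The effect of removing URL parameters can be sketched in Python with the standard library. This is only an illustration under stated assumptions: strip_parameters is a hypothetical helper, the query strings below are invented, and the sketch treats "parameters" as the query string after the question mark.

```python
from urllib.parse import urlsplit, urlunsplit


def strip_parameters(url):
    """Hypothetical helper: return the URL with its query string removed."""
    parts = urlsplit(url)
    # Rebuild the URL with an empty query; scheme, host, path and fragment stay.
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", parts.fragment))


# Invented parameterized variants of the same page.
found = [
    "https://ahrefs.com/help/?page=2",
    "https://ahrefs.com/help/?page=3&sort=asc",
    "https://ahrefs.com/help/",
]

# After stripping, all variants collapse into a single URL to crawl.
to_crawl = sorted({strip_parameters(u) for u in found})
```

Here to_crawl ends up holding one entry, https://ahrefs.com/help/, because every variant reduces to the same parameter-free URL.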
Removing nofollow links:
Go to Site Audit > + New Project > Crawl Settings.
From here, you can toggle “Follow nofollow links” on and off.
Removing noindex pages:
As our bots are not able to tell whether a page is set to noindex before crawling it, it's not possible to exclude noindex pages before a crawl.
To exclude specific URLs manually, see the include and exclude rules above.
*Please note that when a page is set to noindex, it is still crawled by Google, but it won't be shown in search results.
While you cannot exclude noindex pages in your crawl, you can exclude noindex data from your Site Audit reports.
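To see why noindex can only be detected after a page is fetched, consider where the directive lives: in a robots meta tag inside the HTML, or in an X-Robots-Tag response header. The Python sketch below is an illustration, not the crawler's actual logic; RobotsMetaParser and is_noindex are hypothetical names, and the sample HTML is invented.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Hypothetical parser that looks for <meta name="robots" content="...noindex...">."""

    def __init__(self):
        super().__init__()
        self.noindex = False

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        name = (attr.get("name") or "").lower()
        content = (attr.get("content") or "").lower()
        if name == "robots" and "noindex" in content:
            self.noindex = True


def is_noindex(html, headers=None):
    """Return True if the fetched response declares noindex."""
    # Check the X-Robots-Tag response header first (header names are case-insensitive).
    for key, value in (headers or {}).items():
        if key.lower() == "x-robots-tag" and "noindex" in value.lower():
            return True
    # Otherwise look for a robots meta tag in the page body.
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.noindex


page = '<html><head><meta name="robots" content="noindex, follow"></head></html>'
flagged = is_noindex(page)
```

Both inputs to is_noindex, the HTML body and the response headers, exist only once the page has been requested, which is why the check cannot run before the crawl.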