Ahrefs’ Site Audit tool allows you to crawl a specific set of URLs and exclude any URLs from the crawl. To do that, you must define some patterns using regular expressions (aka regex or regexp).

Keep in mind that these rules also apply to the seeds. So whenever you set a new pattern, make sure that our crawler has something to begin the crawl with.

If you’ve never encountered regular expressions before, they may seem like random gibberish to you :)

I will show you how to use the regex in the advanced settings of Ahrefs’ Site Audit. It’s not hard at all.

The basics

If you put some pattern into the “Include” field of the crawl settings, Site Audit will crawl only the URLs that match this pattern.

The “Exclude” field works similarly. Our crawler will skip the URLs that match the pattern.

If you use both fields and some URL matches both “Include” and “Exclude” patterns, Site Audit will exclude that URL from the crawl.

To create a pattern, you must use regex.

Let’s move from the simplest examples to more advanced ones.

The following configuration includes all URLs that contain the word “blog” and excludes URLs containing the word “product.”

This instructs our bot to crawl:
https://ahrefs.com/blog
https://ahrefs.com/blog/seo-techniques/
https://ahrefs.com/academy/blogging-for-business

And to ignore:
https://ahrefs.com/blog/category/product-blog/
https://ahrefs.com/blog/ecommerce-out-of-stock-products/

That was easy, right?

But what if you want to include the URLs from /blog/ subfolder specifically but not https://ahrefs.com/academy/blogging-for-business?

You can use a slightly more advanced pattern:

You probably wonder what those symbols before and after “blog” are. 

In the regex, you have to “escape” some symbols so that they are not recognized as special characters. To do that, use the backslash \ before the character.

A simple dot . in the regex, for example, stands for any character. But \. matches only a literal full stop. That’s why I escaped the slash characters in the example above like this: \/
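
To see escaping in action, here is a small Go sketch. The pattern itself is not shown in the text here, so the sketch assumes it is \/blog\/ (the word “blog” with an escaped slash on each side); matches is a hypothetical helper and the ahrefs-com URL is made up for the comparison.

```go
package main

import (
	"fmt"
	"regexp"
)

// matches reports whether pattern finds a match anywhere in s.
func matches(pattern, s string) bool {
	return regexp.MustCompile(pattern).MatchString(s)
}

func main() {
	// Assumed pattern: "blog" with an escaped slash before and after it.
	subfolder := `\/blog\/`
	fmt.Println(matches(subfolder, "https://ahrefs.com/blog/seo-techniques/"))          // true
	fmt.Println(matches(subfolder, "https://ahrefs.com/academy/blogging-for-business")) // false

	// An unescaped dot matches any character; an escaped dot only a full stop.
	fmt.Println(matches(`ahrefs.com`, "https://ahrefs-com/"))  // true
	fmt.Println(matches(`ahrefs\.com`, "https://ahrefs-com/")) // false
}
```

Note that under this assumption the directory URL https://ahrefs.com/blog (no trailing slash) would not match; the second practical example in this article covers that case.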

Some handy Regex tokens

^ - this symbol indicates the beginning of the URL

$ - this symbol indicates the end of the URL

. - the dot matches any single character

* - matches the preceding expression 0 or more times

+ - matches the preceding expression 1 or more times

? - matches the preceding expression 0 or 1 times

| - is equivalent to OR.

[__] - matches any one of the characters between the brackets; similar to |, but can also define ranges such as [0-9]

(__) - parentheses group the regex between them

\d - matches one digit

\D - matches one non-digit

\w - matches a word character

\W - matches a non-word character
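
The tokens above can be tried out one at a time. The following Go sketch pairs each token with a made-up sample string and the result you should expect; the names tokenCase, tokenCases and check are hypothetical, and none of the sample strings come from Site Audit.

```go
package main

import (
	"fmt"
	"regexp"
)

// tokenCase pairs a pattern with a sample string and the expected result.
type tokenCase struct {
	pattern string
	input   string
	want    bool
}

// tokenCases are made-up illustrations, one or two per token.
var tokenCases = []tokenCase{
	{`^https`, "https://ahrefs.com/", true},      // ^ anchors to the start
	{`^https`, "see https://ahrefs.com/", false}, // "https" is not at the start
	{`\/blog$`, "https://ahrefs.com/blog", true}, // $ anchors to the end
	{`\/blog$`, "https://ahrefs.com/blog/", false},
	{`b.g`, "bug", true},         // . is any single character
	{`blo*g`, "blg", true},       // * allows zero "o"
	{`blo+g`, "blg", false},      // + needs at least one "o"
	{`https?:`, "http:", true},   // ? makes the "s" optional
	{`cat|dog`, "hotdog", true},  // | is OR
	{`page[1-3]`, "page2", true}, // [] matches one character from the range
	{`page[1-3]`, "page7", false},
	{`^(ab)+$`, "ababab", true},  // () groups "ab" so + repeats the pair
	{`^\d\D\w\W$`, "7x_!", true}, // digit, non-digit, word, non-word
}

// check reports whether the pattern behaves as the table says it should.
func check(c tokenCase) bool {
	return regexp.MustCompile(c.pattern).MatchString(c.input) == c.want
}

func main() {
	for _, c := range tokenCases {
		fmt.Printf("%-12s %-28q as expected: %v\n", c.pattern, c.input, check(c))
	}
}
```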

Now let’s see how they work together.

Practical examples

1. HTTPS URLs in the /wp-content/ subfolder

^https:.*\/wp-content\/

^ indicates the beginning of the URL. This rule will match all URLs that start with “https:”, followed by any number of characters (.* matches zero or more) and then “/wp-content/”.

URLs matching the pattern:
https://ahrefs.com/blog/wp-content/uploads/2019/03/fb-ranking-1-image-1.png

URLs not matching the pattern:
http://ahrefs.com/blog/wp-content/uploads/2019/03/fb-ranking-1-image-1.png
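
To show what the ^ anchor contributes, this Go sketch compares the rule with a variant that drops it. The unanchored variant and the example.com URL are hypothetical and exist only for the comparison.

```go
package main

import (
	"fmt"
	"regexp"
)

var (
	// anchored is the rule from this example.
	anchored = regexp.MustCompile(`^https:.*\/wp-content\/`)
	// unanchored drops the ^ and is shown only for comparison.
	unanchored = regexp.MustCompile(`https:.*\/wp-content\/`)
)

func main() {
	urls := []string{
		"https://ahrefs.com/blog/wp-content/uploads/2019/03/fb-ranking-1-image-1.png",
		"http://ahrefs.com/blog/wp-content/uploads/2019/03/fb-ranking-1-image-1.png",
		// Hypothetical URL with an https address inside a query string.
		"http://example.com/?next=https://example.com/wp-content/",
	}
	for _, u := range urls {
		fmt.Println(anchored.MatchString(u), unanchored.MatchString(u), u)
	}
}
```

Only the first URL matches the anchored rule. Without ^, the third URL would match as well, because “https:” appears later in the string.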

2. URLs in a subfolder, including the directory URL itself.

\/blog(\/.*)?$

This rule will match all URLs that end ($) with “/blog”, optionally followed by a slash and any number of characters: (\/.*)?. The question mark in this pattern matches the expression in the parentheses zero or one time, making it optional.

URLs matching the pattern:
https://ahrefs.com/blog
https://ahrefs.com/blog/301-redirects/

URLs not matching the pattern:
https://ahrefs.com/blogging
https://ahrefs.com/academy/blogging-for-business
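
Here is the same rule checked in Go against the URLs above, as a rough sketch. inBlogFolder is a hypothetical helper name, and the last URL (with a query string) is made up to show an edge case.

```go
package main

import (
	"fmt"
	"regexp"
)

// blogFolder matches "/blog" at the end of the URL, or "/blog/" plus anything.
var blogFolder = regexp.MustCompile(`\/blog(\/.*)?$`)

// inBlogFolder reports whether url is the /blog directory or something inside it.
func inBlogFolder(url string) bool {
	return blogFolder.MatchString(url)
}

func main() {
	urls := []string{
		"https://ahrefs.com/blog",
		"https://ahrefs.com/blog/301-redirects/",
		"https://ahrefs.com/blogging",
		"https://ahrefs.com/academy/blogging-for-business",
		// Hypothetical: a query string directly after /blog.
		"https://ahrefs.com/blog?page=2",
	}
	for _, u := range urls {
		fmt.Println(inBlogFolder(u), u)
	}
}
```

The first two URLs match and the rest do not. The query-string URL fails because “/blog” is followed by neither a slash nor the end of the URL.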

3. URLs that contain @ or % symbols

@|% or [@%] 

Both | and [__] work as OR here: the URL only needs to contain one of the two characters.

URLs matching the pattern:
https://ahrefs.com/@timsoulo
https://ahrefs.com/%D1%81%D0%B5%D0%BE

URLs not matching the pattern:
https://ahrefs.com/blog/nofollow-links
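
A quick way to convince yourself that the two spellings behave the same is to run both against the same URLs. This Go sketch does that; the variable names are hypothetical.

```go
package main

import (
	"fmt"
	"regexp"
)

var (
	// alternation uses | to say "@ or %".
	alternation = regexp.MustCompile(`@|%`)
	// charClass uses a character class to say the same thing.
	charClass = regexp.MustCompile(`[@%]`)
)

func main() {
	urls := []string{
		"https://ahrefs.com/@timsoulo",
		"https://ahrefs.com/%D1%81%D0%B5%D0%BE",
		"https://ahrefs.com/blog/nofollow-links",
	}
	for _, u := range urls {
		fmt.Println(alternation.MatchString(u), charClass.MatchString(u), u)
	}
}
```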

4. “Add to Cart” URLs in WooCommerce

\?add-to-cart=

Keep in mind that ? is a special symbol in the regex. To use it as a simple question mark, don’t forget to escape it like this: \?

URLs matching the pattern:
https://yourdomain.com/?add-to-cart=25

URLs not matching the pattern:
https://yourdomain.com/smartphones
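
The sketch below shows why the backslash matters, using Go's regexp package. There, an unescaped leading ? is a quantifier with nothing to repeat, so the pattern does not even compile. How Site Audit reports an invalid pattern is not covered here; compiles is a hypothetical helper.

```go
package main

import (
	"fmt"
	"regexp"
)

// addToCart is the rule from this example, with the question mark escaped.
var addToCart = regexp.MustCompile(`\?add-to-cart=`)

// compiles reports whether pattern is a valid regular expression in Go.
func compiles(pattern string) bool {
	_, err := regexp.Compile(pattern)
	return err == nil
}

func main() {
	// Unescaped, the leading ? has no preceding expression to apply to.
	fmt.Println(compiles(`?add-to-cart=`))  // false
	fmt.Println(compiles(`\?add-to-cart=`)) // true

	fmt.Println(addToCart.MatchString("https://yourdomain.com/?add-to-cart=25")) // true
	fmt.Println(addToCart.MatchString("https://yourdomain.com/smartphones"))     // false
}
```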

5. URLs containing a year (4 digits)

[0-9]{4} or \d{4}

[0-9]{4} will match all URLs containing four {4} digits [0-9] in a row

\d{4} does the same, since \d stands for one digit

URLs matching the pattern:
https://yourdomain.com/best-smartphones-2019

URLs not matching the pattern:
https://yourdomain.com/smartphones
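
One thing worth knowing about this rule: it looks for four digits in a row anywhere in the URL, so longer numbers match too. The Go sketch below illustrates that; the product-123456 URL is made up, and the variable names are hypothetical.

```go
package main

import (
	"fmt"
	"regexp"
)

var (
	// yearRange and yearDigit are the two spellings from this example.
	yearRange = regexp.MustCompile(`[0-9]{4}`)
	yearDigit = regexp.MustCompile(`\d{4}`)
)

func main() {
	urls := []string{
		"https://yourdomain.com/best-smartphones-2019",
		"https://yourdomain.com/smartphones",
		// Hypothetical URL with a six-digit ID rather than a year.
		"https://yourdomain.com/product-123456",
	}
	for _, u := range urls {
		// FindString returns the first matched text, or "" if there is none.
		fmt.Printf("%v %q %s\n", yearRange.MatchString(u), yearDigit.FindString(u), u)
	}
}
```

The third URL matches as well (the matched text is “1234”), so treat this rule as “contains at least four consecutive digits” rather than a strict year check.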

6. All URLs of the subdomain (both http and https)

^https?:\/\/help\.ahrefs\.com

This rule will match all URLs that begin with "http://help.ahrefs.com" or "https://help.ahrefs.com".

The question mark in s? indicates that “s” is optional, so both http and https will match this rule.

URLs matching the pattern:
https://help.ahrefs.com
http://help.ahrefs.com/
http://help.ahrefs.com/site-audit

URLs not matching the pattern:
https://ahrefs.com/site-audit
ftp://help.ahrefs.com
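
Whether the dots in the hostname are escaped makes a small difference, since an unescaped dot matches any character. This Go sketch compares the two spellings; the names loose and strict and the helpxahrefs.com URL are hypothetical.

```go
package main

import (
	"fmt"
	"regexp"
)

var (
	// loose leaves the dots unescaped, so each one matches any character.
	loose = regexp.MustCompile(`^https?:\/\/help.ahrefs.com`)
	// strict escapes the dots, so only literal full stops match.
	strict = regexp.MustCompile(`^https?:\/\/help\.ahrefs\.com`)
)

func main() {
	urls := []string{
		"https://help.ahrefs.com",
		"http://help.ahrefs.com/site-audit",
		"https://ahrefs.com/site-audit",
		"ftp://help.ahrefs.com",
		// Hypothetical look-alike host with a letter where the dot would be.
		"https://helpxahrefs.com/",
	}
	for _, u := range urls {
		fmt.Println(loose.MatchString(u), strict.MatchString(u), u)
	}
}
```

Both spellings agree on the URLs listed above. They differ only on the look-alike host, which the loose form accepts and the strict form rejects.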

7. Various file URLs 

\.(jpg|gif|bmp|png|css|pdf)$

This rule will match all URLs that end $ with .jpg OR .gif OR .bmp OR .png OR .css OR .pdf.

Parentheses (__) group the regex between them and | stands for OR

URLs matching the pattern:
https://ahrefs.com/blog/wp-content/uploads/2019/03/fb-ranking-1-image-1.png
http://ahrefs.com/blog/wp-content/uploads/2019/03/fb-ranking-1-image-1.png 

URLs not matching the pattern:
https://ahrefs.com/site-audit
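
Because of the $ anchor, the extension has to be the very last thing in the URL. The Go sketch below shows two edge cases that follow from this in Go's regexp package, where matching is also case-sensitive by default. The yourdomain.com URLs are made up, and fileURL is a hypothetical name.

```go
package main

import (
	"fmt"
	"regexp"
)

// fileURL is the rule from this example.
var fileURL = regexp.MustCompile(`\.(jpg|gif|bmp|png|css|pdf)$`)

func main() {
	urls := []string{
		"https://ahrefs.com/blog/wp-content/uploads/2019/03/fb-ranking-1-image-1.png",
		"https://ahrefs.com/site-audit",
		// Hypothetical: a version parameter after the extension.
		"https://yourdomain.com/logo.png?ver=2",
		// Hypothetical: an upper-case extension.
		"https://yourdomain.com/LOGO.PNG",
	}
	for _, u := range urls {
		fmt.Println(fileURL.MatchString(u), u)
	}
}
```

Only the first URL matches. If your files are served with query strings, you may want to test a pattern without the $ anchor.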

Final words

I hope this article helped you learn a few cool tricks you can do with Regular Expressions.

Please note that you can apply multiple patterns to include or exclude URLs in the crawl settings. To add an extra rule, click on the “+” icon:

The rules above will instruct our crawler to crawl the URLs that contain the words:

“blog” or “product”

AND do not contain the words:

“blogging” or “productive.”

And remember that these rules also apply to the seeds. So whenever you set them, make sure that our crawler has something to begin the crawl with.

You can test your Regex on this website: https://regex101.com/. Note that you should select “Golang” from the left menu.
