How to Use Regular Expressions in Ahrefs' Site Audit

Learn how to use Regular Expressions in Site Audit advanced filters and how to set "Include" and "Exclude" rules for the crawl.

Written by Nick Churick
Updated over a week ago

Ahrefs’ Site Audit tool lets you crawl only a specific set of URLs or exclude certain URLs from the crawl. To do that, you define patterns using regular expressions (aka regex or regexp).

Keep in mind that these rules also apply to the seeds. So whenever you set a new pattern, make sure that our crawler has something to begin the crawl with.

You can also use regex in the advanced filters inside Site Audit reports.

If you’ve never encountered regular expressions before, they may seem like random gibberish to you :)

I will show you how to use regex in the advanced settings of Ahrefs’ Site Audit. It’s not hard at all.

The basics

If you put some pattern into the “Include” field of the crawl settings, Site Audit will crawl only the URLs that match this pattern.

The “Exclude” field works similarly. Our crawler will skip the URLs that match the pattern.

If you use both fields and some URL matches both “Include” and “Exclude” patterns, Site Audit will exclude that URL from the crawl.

To create a pattern, you must use regex.

Let’s move from the simplest examples to more advanced ones.

The following configuration includes all URLs that contain the word “blog” and excludes URLs containing the word “product.”

This instructs our bot to crawl:
https://ahrefs.com/blog
https://ahrefs.com/blog/seo-techniques/
https://ahrefs.com/academy/blogging-for-business

And to ignore:
https://ahrefs.com/blog/category/product-blog/
https://ahrefs.com/blog/ecommerce-out-of-stock-products/
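
The include/exclude logic described above can be sketched in a few lines of Go, whose `regexp` package uses RE2 syntax. This is only an illustration, not how Site Audit is implemented: the `include` and `exclude` variables and the `shouldCrawl` helper are hypothetical names standing in for the two fields.

```go
package main

import (
	"fmt"
	"regexp"
)

// Hypothetical stand-ins for the "Include" and "Exclude" fields.
var (
	include = regexp.MustCompile(`blog`)
	exclude = regexp.MustCompile(`product`)
)

// shouldCrawl reports whether a URL matches the include pattern
// and does not match the exclude pattern (exclude wins on conflict).
func shouldCrawl(url string) bool {
	return include.MatchString(url) && !exclude.MatchString(url)
}

func main() {
	urls := []string{
		"https://ahrefs.com/blog",
		"https://ahrefs.com/academy/blogging-for-business",
		"https://ahrefs.com/blog/category/product-blog/",
	}
	for _, u := range urls {
		fmt.Println(u, shouldCrawl(u))
	}
}
```

The third URL contains both “blog” and “product”, so the exclude rule takes priority and it is skipped.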

That was easy, right?

But what if you want to include the URLs from the /blog/ subfolder specifically but not https://ahrefs.com/academy/blogging-for-business?

You can use a slightly more advanced pattern:

You probably wonder what those symbols before and after “blog” are. 

In regex, you have to “escape” some symbols so that they are not recognized as special characters. To do that, put a backslash \ before the character.

A plain dot . in regex, for example, stands for any character, whereas \. matches only a literal dot. In the same way, I escaped the slash character in the example above like this: \/
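
Here is a minimal Go sketch of what escaping changes. It assumes the subfolder pattern in question is `\/blog\/`; the `index.html` patterns and all variable names are made up for illustration.

```go
package main

import (
	"fmt"
	"regexp"
)

// Assumed subfolder pattern: "/blog/" with both slashes escaped.
var blogFolder = regexp.MustCompile(`\/blog\/`)

// An unescaped dot matches any character; an escaped dot matches only ".".
var (
	looseDot   = regexp.MustCompile(`index.html`)
	literalDot = regexp.MustCompile(`index\.html`)
)

func main() {
	fmt.Println(blogFolder.MatchString("https://ahrefs.com/blog/seo-techniques/"))
	fmt.Println(blogFolder.MatchString("https://ahrefs.com/academy/blogging-for-business"))

	// "index-html" contains no dot, yet the loose pattern still matches it.
	fmt.Println(looseDot.MatchString("https://yourdomain.com/index-html"))
	fmt.Println(literalDot.MatchString("https://yourdomain.com/index-html"))
}
```

Note that a pattern requiring both slashes will not match https://ahrefs.com/blog without a trailing slash.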

Some handy Regex tokens

^ - this symbol indicates the beginning of the URL

$ - this symbol indicates the end of the URL

. - the dot matches any single character

* - matches the preceding expression 0 or more times

+ - matches the preceding expression 1 or more times

? - matches the preceding expression 0 or 1 time

| - is equivalent to OR

[__] - matches any one of the characters listed between the brackets; similar to |, but can also define ranges such as [0-9]

(__) - parentheses group the regex between them

\d - matches one digit

\D - matches one non-digit

\w - matches one word character (a letter, digit, or underscore)

\W - matches one non-word character

Different tools and platforms may use different regex libraries. Site Audit uses RE2. You can find its full syntax here.
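
To see a few of these tokens in action, here is a small Go sketch (Go's `regexp` package follows RE2 syntax). The `matches` helper and the sample strings are invented for this example.

```go
package main

import (
	"fmt"
	"regexp"
)

// matches reports whether pattern (RE2 syntax) matches anywhere in s.
func matches(pattern, s string) bool {
	return regexp.MustCompile(pattern).MatchString(s)
}

func main() {
	// ^ anchors the match to the start of the string.
	fmt.Println(matches(`^https`, "https://ahrefs.com/"))
	fmt.Println(matches(`^blog`, "https://ahrefs.com/blog"))

	// $ anchors the match to the end of the string.
	fmt.Println(matches(`blog$`, "https://ahrefs.com/blog"))

	// ? makes the preceding character optional.
	fmt.Println(matches(`colou?r`, "/color-guide"))

	// \d+ requires one or more digits.
	fmt.Println(matches(`page-\d+`, "/page-12"))

	// | inside a group means OR; [a-c] is a character range.
	fmt.Println(matches(`(cat|dog)s`, "/dogs"))
	fmt.Println(matches(`[a-c]x`, "/bx"))
}
```

Only the second call prints false: the string starts with “https”, not “blog”.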

Some Practical Examples

1. HTTPS URLs in /wp-content/ subfolder

^https:.*\/wp-content\/

^ indicates the beginning of the URL. This rule will match all URLs that start with “https:”, followed by any number of characters (.*), followed by “/wp-content/”.

URLs matching the pattern:
https://ahrefs.com/blog/wp-content/uploads/2019/03/fb-ranking-1-image-1.png

URLs not matching the pattern:
http://ahrefs.com/blog/wp-content/uploads/2019/03/fb-ranking-1-image-1.png

2. URLs in a subfolder, including the directory URL itself

\/blog(\/.*)?$

This rule will match all URLs that contain “/blog” either at the very end ($) or followed by a slash and any number of characters (\/.*)?. The question mark in this pattern matches the expression in the parentheses zero or one time, making it optional.

URLs matching the pattern:
https://ahrefs.com/blog
https://ahrefs.com/blog/301-redirects/

URLs not matching the pattern:
https://ahrefs.com/blogging
https://ahrefs.com/academy/blogging-for-business
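
The URLs above can be checked with a short Go sketch. The `inBlogSection` helper is a hypothetical name, and the last URL in `main` is an extra illustration that is not part of the lists above.

```go
package main

import (
	"fmt"
	"regexp"
)

// "/blog" at the end of the URL, or followed by "/" and anything else.
var blogSection = regexp.MustCompile(`\/blog(\/.*)?$`)

// inBlogSection reports whether the URL matches the subfolder pattern.
func inBlogSection(url string) bool {
	return blogSection.MatchString(url)
}

func main() {
	urls := []string{
		"https://ahrefs.com/blog",
		"https://ahrefs.com/blog/301-redirects/",
		"https://ahrefs.com/blogging",
		"https://ahrefs.com/academy/blogging-for-business",
		"https://ahrefs.com/news/blog/", // also matches: the pattern is not tied to the site root
	}
	for _, u := range urls {
		fmt.Println(u, inBlogSection(u))
	}
}
```

Because the pattern does not start with ^, it matches a “/blog” folder at any depth. Prefixing it with the scheme and domain would restrict it to the top-level folder.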

3. URLs that contain @ or % symbols

@|% or [@%] 

| and [__] work as OR

URLs matching the pattern:
https://ahrefs.com/@timsoulo
https://ahrefs.com/%D1%81%D0%B5%D0%BE

URLs not matching the pattern:
https://ahrefs.com/blog/nofollow-links

4. “Add to Cart” URLs in WooCommerce

\?add-to-cart=

Keep in mind that ? is a special symbol in the regex. To use it as a simple question mark, don’t forget to escape it like this: \?

URLs matching the pattern:
https://yourdomain.com/?add-to-cart=25

URLs not matching the pattern:
https://yourdomain.com/smartphones

5. URLs containing a year (4 digits)

[0-9]{4} or \d{4}

[0-9]{4} will match all URLs containing four {4} digits [0-9] in a row

\d{4} does the same, since \d stands for one digit

URLs matching the pattern:
https://yourdomain.com/best-smartphones-2019

URLs not matching the pattern:
https://yourdomain.com/smartphones
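
One thing to keep in mind is that \d{4} also matches inside longer runs of digits, such as a six-digit product ID. The Go sketch below shows this and a stricter variant; the stricter pattern, the helper names, and the product URL are assumptions added for illustration, not something Site Audit requires.

```go
package main

import (
	"fmt"
	"regexp"
)

var (
	// Four digits in a row, anywhere in the URL.
	fourDigits = regexp.MustCompile(`\d{4}`)
	// Assumed stricter variant: exactly four digits, bounded by
	// non-digits or by the start/end of the URL.
	exactlyFour = regexp.MustCompile(`(^|\D)\d{4}(\D|$)`)
)

func hasFourDigits(url string) bool        { return fourDigits.MatchString(url) }
func hasExactlyFourDigits(url string) bool { return exactlyFour.MatchString(url) }

func main() {
	urls := []string{
		"https://yourdomain.com/best-smartphones-2019",
		"https://yourdomain.com/smartphones",
		"https://yourdomain.com/product-123456",
	}
	for _, u := range urls {
		fmt.Println(u, hasFourDigits(u), hasExactlyFourDigits(u))
	}
}
```

The product URL matches the simple pattern but not the stricter one, because its digits form a run of six.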

6. All URLs of the subdomain (both http and https)

^https?:\/\/help\.ahrefs\.com

This rule will match all URLs that begin with "http://help.ahrefs.com" or "https://help.ahrefs.com". The dots are escaped so that they match only literal dots.

The question mark here s? indicates that “s” is optional, so both http and https will match this rule.

URLs matching the pattern:
https://help.ahrefs.com
http://help.ahrefs.com/
http://help.ahrefs.com/site-audit

URLs not matching the pattern:
https://ahrefs.com/site-audit
ftp://help.ahrefs.com
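
A quick Go sketch of this rule, with the dots in the hostname escaped so that they match only literal dots. The `isHelpURL` helper, the `looseHelp` comparison pattern, and the “helpXahrefs” hostname are invented for the example.

```go
package main

import (
	"fmt"
	"regexp"
)

var (
	// Dots escaped: only "help.ahrefs.com" is accepted.
	strictHelp = regexp.MustCompile(`^https?:\/\/help\.ahrefs\.com`)
	// Dots left unescaped for comparison: each dot matches any character.
	looseHelp = regexp.MustCompile(`^https?:\/\/help.ahrefs.com`)
)

// isHelpURL reports whether the URL belongs to the help subdomain.
func isHelpURL(url string) bool {
	return strictHelp.MatchString(url)
}

func main() {
	fmt.Println(isHelpURL("https://help.ahrefs.com"))
	fmt.Println(isHelpURL("http://help.ahrefs.com/site-audit"))
	fmt.Println(isHelpURL("https://ahrefs.com/site-audit"))
	fmt.Println(isHelpURL("ftp://help.ahrefs.com"))

	// The unescaped version accepts a different hostname.
	fmt.Println(looseHelp.MatchString("https://helpXahrefs.com"))
}
```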

7. Various file URLs 

\.(jpg|gif|bmp|png|css|pdf)$

This rule will match all URLs that end $ with .jpg OR .gif OR .bmp OR .png OR .css OR .pdf.

You can shorten this expression to target fewer extensions. For example, to match only URLs that end $ with .jpg OR .png:

.*\.(jpg|png)$

Parentheses (__) group the regex between them and | stands for OR

URLs matching the pattern:
https://ahrefs.com/blog/wp-content/uploads/2019/03/fb-ranking-1-image-1.png
http://ahrefs.com/blog/wp-content/uploads/2019/03/fb-ranking-1-image-1.png 

URLs not matching the pattern:
https://ahrefs.com/site-audit
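
Because of the $ anchor, the extension has to be the very last thing in the URL. The Go sketch below illustrates that with a query string; the `isStaticFile` helper and the `?size=large` URL are made up for the example.

```go
package main

import (
	"fmt"
	"regexp"
)

// A dot, one of the listed extensions, then the end of the URL.
var fileExt = regexp.MustCompile(`\.(jpg|gif|bmp|png|css|pdf)$`)

// isStaticFile reports whether the URL ends with one of the extensions.
func isStaticFile(url string) bool {
	return fileExt.MatchString(url)
}

func main() {
	urls := []string{
		"https://ahrefs.com/blog/wp-content/uploads/2019/03/fb-ranking-1-image-1.png",
		"https://ahrefs.com/site-audit",
		"https://yourdomain.com/photo.png?size=large", // query string follows the extension
	}
	for _, u := range urls {
		fmt.Println(u, isStaticFile(u))
	}
}
```

The last URL does not match, since “?size=large” comes after “.png”.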

8. Using URL rewrite rules to crawl a staging website

You can use URL rewrite rules in Site Audit settings to replace some parts of the URLs with another value.

For example, suppose your staging website is hosted on a subdomain such as staging.ahrefs.com.

A rewrite rule can then replace "ahrefs.com" in every URL with "staging.ahrefs.com"
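
As a rough sketch of what such a rewrite does, here is the same replacement in Go. The pattern `ahrefs\.com` and the `toStaging` helper are assumptions for illustration; the exact values you enter in the Site Audit settings may differ.

```go
package main

import (
	"fmt"
	"regexp"
)

// Assumed "pattern to match" for the staging rewrite.
var mainDomain = regexp.MustCompile(`ahrefs\.com`)

// toStaging replaces the main domain with the staging subdomain.
func toStaging(url string) string {
	return mainDomain.ReplaceAllString(url, "staging.ahrefs.com")
}

func main() {
	fmt.Println(toStaging("https://ahrefs.com/blog"))
	fmt.Println(toStaging("https://ahrefs.com/site-audit"))
}
```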

9. Using URL rewrite rules with numbered capturing groups

With capturing groups, you can replace several parts of the URL with one rule.

Example

Pattern to match:

www\.ahrefs\.com([^\?|#]*)?([#]?[^\?]*)\??(.*)

Replace with:

www.ahrefs.com\1?parameter1=5273&\2?parameter2=7465&\3

([^\?|#]*) is the 1st capturing group

([#]?[^\?]*) is the 2nd capturing group

(.*) is the 3rd capturing group

\1?parameter1=5273& inserts the value of the 1st capturing group, followed by "?parameter1=5273&"

\2?parameter2=7465& inserts the value of the 2nd capturing group, followed by "?parameter2=7465&"

\3 inserts the value matching the 3rd capturing group
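
To trace how the groups are filled in, here is the same pattern in a Go sketch. Go writes group references as ${1}, ${2}, ${3} rather than \1, \2, \3, so the replacement string is adapted accordingly. The `rewrite` helper and the sample URL are assumptions for illustration.

```go
package main

import (
	"fmt"
	"regexp"
)

// The pattern from the example, unchanged.
var urlParts = regexp.MustCompile(`www\.ahrefs\.com([^\?|#]*)?([#]?[^\?]*)\??(.*)`)

// Same replacement as above, written with Go's ${n} group syntax.
const replacement = "www.ahrefs.com${1}?parameter1=5273&${2}?parameter2=7465&${3}"

// rewrite applies the rule to a single URL.
func rewrite(url string) string {
	return urlParts.ReplaceAllString(url, replacement)
}

func main() {
	// Group 1 captures "/blog", group 2 is empty, group 3 captures "x=1".
	fmt.Println(rewrite("https://www.ahrefs.com/blog?x=1"))
}
```

For this sample URL the result is https://www.ahrefs.com/blog?parameter1=5273&?parameter2=7465&x=1, which shows each group's value being inserted ahead of the literal text that follows it.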

Final words

I hope this article helped you learn a few cool tricks you can do with Regular Expressions.

Please note that you can apply multiple patterns to include or exclude URLs in the crawl settings. If no matches are found for one of the rules, the rule will be skipped.

The rules above will instruct our crawler to crawl the URLs that contain the words:

“blog” or “product”

AND do not contain the words:

“blogging” or “productive.”
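
The way several include and exclude patterns combine can be sketched in Go as follows. The slices, the `anyMatch` and `shouldCrawl` helpers, and the sample URLs are hypothetical, and the sketch does not model what happens when a field is left empty.

```go
package main

import (
	"fmt"
	"regexp"
)

// Hypothetical include and exclude lists matching the words above.
var (
	include = []*regexp.Regexp{regexp.MustCompile(`blog`), regexp.MustCompile(`product`)}
	exclude = []*regexp.Regexp{regexp.MustCompile(`blogging`), regexp.MustCompile(`productive`)}
)

// anyMatch reports whether at least one pattern matches the URL.
func anyMatch(patterns []*regexp.Regexp, url string) bool {
	for _, p := range patterns {
		if p.MatchString(url) {
			return true
		}
	}
	return false
}

// shouldCrawl: some include pattern matches AND no exclude pattern matches.
func shouldCrawl(url string) bool {
	return anyMatch(include, url) && !anyMatch(exclude, url)
}

func main() {
	for _, u := range []string{
		"https://ahrefs.com/blog",
		"https://ahrefs.com/academy/blogging-for-business",
		"https://yourdomain.com/product/phone",
		"https://yourdomain.com/productive-habits",
	} {
		fmt.Println(u, shouldCrawl(u))
	}
}
```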

If you set multiple URL rewrite rules, they are applied sequentially: each rule runs on the result of the previous rewrite.
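
A minimal Go sketch of sequential rewriting, assuming two made-up rules: the second one only matches because the first has already turned http into https. The `rewriteRule` type, the `applyRules` helper, and both rules are hypothetical.

```go
package main

import (
	"fmt"
	"regexp"
)

// rewriteRule pairs a pattern with its replacement text.
type rewriteRule struct {
	pattern     *regexp.Regexp
	replacement string
}

// Hypothetical rules, listed in the order they are applied.
var rules = []rewriteRule{
	{regexp.MustCompile(`^http:`), "https:"},
	{regexp.MustCompile(`^https:\/\/ahrefs\.com`), "https://staging.ahrefs.com"},
}

// applyRules runs each rule on the output of the previous one.
func applyRules(url string) string {
	for _, r := range rules {
		url = r.pattern.ReplaceAllString(url, r.replacement)
	}
	return url
}

func main() {
	fmt.Println(applyRules("http://ahrefs.com/blog"))
}
```

If the two rules were swapped, the second pattern would no longer match the original http URL, so the order of the rules matters.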

And remember that all these rules also apply to the seeds. So whenever you set them, make sure that our crawler has something to begin the crawl with.

You can test your regex on this website: https://regex101.com/. Note that you should select “Golang” from the left menu, since Go's regex flavor follows RE2 syntax.
