Ahrefs’ Site Audit tool allows you to crawl a specific set of URLs and exclude any URLs from the crawl. To do that, you must define some patterns using regular expressions (aka regex or regexp).
Keep in mind that these rules also apply to the seeds. So whenever you set a new pattern, make sure that our crawler has something to begin the crawl with.
You can also use regex in the advanced filters inside Site Audit reports.
If you’ve never encountered regular expressions before, they may seem like random gibberish to you :)
I will show you how to use the regex in the advanced settings of Ahrefs’ Site Audit. It’s not hard at all.
The basics
If you put some pattern into the “Include” field of the crawl settings, Site Audit will crawl only the URLs that match this pattern.
The “Exclude” field works similarly. Our crawler will skip the URLs that match the pattern.
If you use both fields and some URL matches both “Include” and “Exclude” patterns, Site Audit will exclude that URL from the crawl.
To create a pattern, you must use Regex.
Let’s move from the simplest examples to more advanced ones.
The following configuration (“blog” in the “Include” field and “product” in the “Exclude” field) includes all URLs that contain the word “blog” and excludes URLs containing the word “product.”
This instructs our bot to crawl:
https://ahrefs.com/blog
https://ahrefs.com/blog/seo-techniques/
https://ahrefs.com/academy/blogging-for-business
And to ignore:
https://ahrefs.com/blog/category/product-blog/
https://ahrefs.com/blog/ecommerce-out-of-stock-products/
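The include/exclude logic can be sketched in a few lines of Python. This is only an illustration of the rules as described in this article, not Site Audit's actual implementation: the helper name `should_crawl` is made up for the example, and treating an empty “Include” list as "allow everything" is an assumption. Python's `re` module is not RE2, but simple patterns like these behave the same way in both.

```python
import re

def should_crawl(url, include, exclude):
    """Return True if a URL passes the include/exclude patterns.

    Hypothetical helper: it mirrors the rule from the article that an
    Exclude match always wins over an Include match.
    """
    if any(re.search(p, url) for p in exclude):
        return False
    # Assumption: with no Include patterns, every URL is allowed.
    return not include or any(re.search(p, url) for p in include)

urls = [
    "https://ahrefs.com/blog",
    "https://ahrefs.com/blog/seo-techniques/",
    "https://ahrefs.com/academy/blogging-for-business",
    "https://ahrefs.com/blog/category/product-blog/",
    "https://ahrefs.com/blog/ecommerce-out-of-stock-products/",
]

# Keep only the URLs that contain "blog" and do not contain "product".
crawled = [u for u in urls if should_crawl(u, include=["blog"], exclude=["product"])]
for u in crawled:
    print(u)
```

Only the first three URLs are printed. The last two contain “blog” as well, but the “product” match excludes them.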
That was easy, right?
But what if you want to include the URLs from the /blog/ subfolder specifically, but not https://ahrefs.com/academy/blogging-for-business?
You can use a slightly more advanced pattern:
\/blog\/
You probably wonder what those symbols before and after “blog” are.
In the regex, you have to “escape” some symbols so that they are not recognized as special characters. To do that, use the backslash \ before the character.
A simple dot . in the regex, for example, stands for any character. But \. works as a full stop symbol. That’s why I escaped the slash character in the example above like this: \/
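Here is a small Python check of the difference between an escaped and an unescaped dot, and of a pattern with escaped slashes around “blog”. Python's `re` module stands in for RE2 here; for patterns this simple the two behave the same. The string "ahrefs-com" is just a made-up test value.

```python
import re

# An unescaped dot matches any single character...
print(bool(re.search(r"ahrefs.com", "ahrefs-com")))   # True
# ...while an escaped dot matches only a literal full stop.
print(bool(re.search(r"ahrefs\.com", "ahrefs-com")))  # False
print(bool(re.search(r"ahrefs\.com", "ahrefs.com")))  # True

# A pattern with escaped slashes matches only the /blog/ subfolder.
pattern = r"\/blog\/"
print(bool(re.search(pattern, "https://ahrefs.com/blog/seo-techniques/")))           # True
print(bool(re.search(pattern, "https://ahrefs.com/academy/blogging-for-business")))  # False
```

Note that https://ahrefs.com/blog without a trailing slash does not match this pattern either. Practical example 2 covers that case.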
Some handy Regex tokens
^ - indicates the beginning of the URL
$ - indicates the end of the URL
. - the dot matches any single character
* - matches the preceding expression 0 or more times
+ - matches the preceding expression 1 or more times
? - matches the preceding expression 0 or 1 times
| - is equivalent to OR
[__] - matches any one of the characters listed inside the brackets (similar to |) and can be used to define ranges
(__) - parentheses group the regex between them
\d - matches one digit
\D - matches one non-digit
\w - matches a word character
\W - matches a non-word character
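To see the tokens in action, the following sketch pairs each one with a string that should match and a string that should not. The sample strings are invented for illustration, and Python's `re` module is used as a stand-in for RE2, which treats these basic tokens the same way.

```python
import re

# (pattern, text that should match, text that should not match)
examples = [
    (r"^https",      "https://ahrefs.com", "http://ahrefs.com"),  # ^ anchors the start
    (r"\.png$",      "/logo.png",          "/logo.png?v=2"),      # $ anchors the end
    (r"b.g",         "bug",                "bg"),                 # . is any one character
    (r"^go*gle$",    "ggle",               "gogl"),               # * is zero or more
    (r"^go+gle$",    "google",             "ggle"),               # + is one or more
    (r"^https?:",    "http:",              "httpss:"),            # ? is zero or one
    (r"^(cat|dog)$", "dog",                "cow"),                # | is OR, () groups
    (r"^[a-c]$",     "b",                  "d"),                  # [] is a set or range
    (r"^\d\D\w\W$",  "1a_!",               "a1!_"),               # character classes
]

for pattern, good, bad in examples:
    assert re.search(pattern, good), (pattern, good)
    assert not re.search(pattern, bad), (pattern, bad)
print("all token examples behave as described")
```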
Different tools and platforms may use different regex libraries. Site Audit uses RE2; its full syntax is described in the RE2 syntax documentation.
Some Practical Examples
1. HTTPS URLs in /wp-content/ subfolder
^https:.*\/wp-content\/
^ indicates the beginning of the URL. This rule will match all URLs that start with “https:” followed by any number of characters (.*) before “/wp-content/”.
URLs matching the pattern:
https://ahrefs.com/blog/wp-content/uploads/2019/03/fb-ranking-1-image-1.png
URLs not matching the pattern:
http://ahrefs.com/blog/wp-content/uploads/2019/03/fb-ranking-1-image-1.png
2. URLs in a subfolder, including the directory URL itself.
\/blog(\/.*)?$
This rule will match all URLs in which “/blog” is followed either by the end of the URL ($) or by a slash and any number of characters (\/.*) up to the end of the URL. The question mark after the parentheses matches the group between zero and one time, making it optional.
URLs matching the pattern:
https://ahrefs.com/blog
https://ahrefs.com/blog/301-redirects/
URLs not matching the pattern:
https://ahrefs.com/blogging
https://ahrefs.com/academy/blogging-for-business
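A quick way to confirm this behaviour is to run the pattern against the four URLs above. The sketch uses Python's `re.search`, on the assumption that Site Audit looks for the pattern anywhere in the URL (as the “blog” example earlier suggests).

```python
import re

pattern = re.compile(r"\/blog(\/.*)?$")

urls = [
    "https://ahrefs.com/blog",
    "https://ahrefs.com/blog/301-redirects/",
    "https://ahrefs.com/blogging",
    "https://ahrefs.com/academy/blogging-for-business",
]

# Assumption: the pattern may match anywhere in the URL,
# which corresponds to re.search rather than re.match.
matches = {url: bool(pattern.search(url)) for url in urls}
for url, matched in matches.items():
    print("match   " if matched else "no match", url)
```

The first two URLs match and the last two do not, because “/blog” in them is followed by “ging” rather than by a slash or the end of the URL.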
3. URLs that contain @ or % symbols
@|% or [@%]
Both | and [__] work as OR.
URLs matching the pattern:
https://ahrefs.com/@timsoulo
https://ahrefs.com/%D1%81%D0%B5%D0%BE
URLs not matching the pattern:
https://ahrefs.com/blog/nofollow-links
4. “Add to Cart” URLs in WooCommerce
\?add-to-cart=
Keep in mind that ? is a special symbol in the regex. To use it as a simple question mark, don’t forget to escape it like this: \?
URLs matching the pattern:
https://yourdomain.com/?add-to-cart=25
URLs not matching the pattern:
https://yourdomain.com/smartphones
5. URLs containing a year (4 digits)
[0-9]{4} or \d{4}
[0-9]{4} will match all URLs containing four ({4}) digits ([0-9]) in a row.
\d{4} does the same, as \d stands for one digit.
URLs matching the pattern:
https://yourdomain.com/best-smartphones-2019
URLs not matching the pattern:
https://yourdomain.com/smartphones
6. All URLs of the subdomain (both http and https)
^https?:\/\/help\.ahrefs\.com
This rule will match all URLs that begin with "http://help.ahrefs.com" or "https://help.ahrefs.com".
The question mark in s? indicates that “s” is optional, so both http and https will match this rule.
URLs matching the pattern:
https://help.ahrefs.com
http://help.ahrefs.com/
http://help.ahrefs.com/site-audit
URLs not matching the pattern:
https://ahrefs.com/site-audit
ftp://help.ahrefs.com
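The sketch below checks the URLs above and also shows why the dots in the hostname are worth escaping. The host "help-ahrefs.com" is a made-up lookalike used only to demonstrate the difference, and Python's `re` module stands in for RE2.

```python
import re

escaped = re.compile(r"^https?:\/\/help\.ahrefs\.com")
unescaped = re.compile(r"^https?:\/\/help.ahrefs.com")

should_match = [
    "https://help.ahrefs.com",
    "http://help.ahrefs.com/",
    "http://help.ahrefs.com/site-audit",
]
should_not_match = [
    "https://ahrefs.com/site-audit",
    "ftp://help.ahrefs.com",
]

assert all(escaped.search(u) for u in should_match)
assert not any(escaped.search(u) for u in should_not_match)

# With unescaped dots, "." matches any character, so a different
# (hypothetical) host slips through.
lookalike = "https://help-ahrefs.com/"
print(bool(unescaped.search(lookalike)))  # True
print(bool(escaped.search(lookalike)))    # False
```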
7. Various file URLs
\.(jpg|gif|bmp|png|css|pdf)$
This rule will match all URLs that end ($) with .jpg OR .gif OR .bmp OR .png OR .css OR .pdf.
You can reduce this expression to target fewer extensions. For example, to match only URLs that end with .jpg OR .png:
.*\.(jpg|png)$
Parentheses (__) group the regex between them, and | stands for OR.
URLs matching the pattern:
https://ahrefs.com/blog/wp-content/uploads/2019/03/fb-ranking-1-image-1.png
http://ahrefs.com/blog/wp-content/uploads/2019/03/fb-ranking-1-image-1.png
URLs not matching the pattern:
https://ahrefs.com/site-audit
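The following sketch runs both extension patterns over a few URLs. The last two URLs are hypothetical additions for illustration. It also checks that the leading .* in the shorter pattern is optional when the pattern is allowed to match anywhere in the URL, which is assumed here to be how Site Audit applies it.

```python
import re

full = re.compile(r"\.(jpg|gif|bmp|png|css|pdf)$")
with_prefix = re.compile(r".*\.(jpg|png)$")
without_prefix = re.compile(r"\.(jpg|png)$")

urls = [
    "https://ahrefs.com/blog/wp-content/uploads/2019/03/fb-ranking-1-image-1.png",
    "https://ahrefs.com/site-audit",
    "https://ahrefs.com/style.css",      # hypothetical URL
    "https://ahrefs.com/image.png?v=2",  # hypothetical URL with a query string
]

for url in urls:
    print(bool(full.search(url)), bool(with_prefix.search(url)), url)
    # The leading ".*" does not change which URLs are selected.
    assert bool(with_prefix.search(url)) == bool(without_prefix.search(url))
```

Note that a URL with a query string after the extension does not end with the extension, so the $ anchor prevents a match.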
8. Using URL rewrite rules to crawl a staging website
You can use URL rewrite rules in Site Audit settings to replace some parts of the URLs with another value.
For example, if your staging website is hosted on a subdomain such as staging.ahrefs.com, you can set up a rule that replaces "ahrefs.com" in every URL with "staging.ahrefs.com".
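The effect of such a rule can be imitated with Python's `re.sub`. The pattern below is an assumption about what the rule would look like, and the function name `rewrite` is made up for the example. It is a sketch of the idea, not of how Site Audit performs the rewrite internally.

```python
import re

def rewrite(url):
    # Hypothetical rewrite rule: match "ahrefs.com" (with the dot escaped)
    # and replace it with "staging.ahrefs.com".
    return re.sub(r"ahrefs\.com", "staging.ahrefs.com", url)

print(rewrite("https://ahrefs.com/blog/seo-techniques/"))
# https://staging.ahrefs.com/blog/seo-techniques/
```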
9. Using URL rewrite rules with numbered capturing groups
With capturing groups, you can replace several parts of the URL with one rule.
Example
Pattern to match:
www\.ahrefs\.com([^\?|#]*)?([#]?[^\?]*)\??(.*)
Replace with:
www.ahrefs.com\1?parameter1=5273&\2?parameter2=7465&\3
([^\?|#]*) is the 1st capturing group
([#]?[^\?]*) is the 2nd capturing group
(.*) is the 3rd capturing group
\1?parameter1=5273& inserts the value matching the 1st capturing group, followed by "?parameter1=5273&"
\2?parameter2=7465& inserts the value matching the 2nd capturing group, followed by "?parameter2=7465&"
\3 inserts the value matching the 3rd capturing group
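To see what each capturing group picks up, you can run the pattern in Python, whose `re.sub` happens to use the same \1, \2, \3 notation for inserting groups. The sample URL is made up, and whether Site Audit expands the replacement exactly the way Python does is an assumption.

```python
import re

pattern = re.compile(r"www\.ahrefs\.com([^\?|#]*)?([#]?[^\?]*)\??(.*)")
replacement = r"www.ahrefs.com\1?parameter1=5273&\2?parameter2=7465&\3"

url = "https://www.ahrefs.com/blog?x=1"  # hypothetical sample URL

# Inspect what each numbered group captured.
groups = pattern.search(url).groups()
print(groups)  # ('/blog', '', 'x=1')

# Apply the replacement: each \N is swapped for the text of group N.
rewritten = pattern.sub(replacement, url)
print(rewritten)
```

Here the 1st group captures the path, the 2nd group is empty because the URL has no fragment, and the 3rd group captures the query string after the question mark.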
Final words
I hope this article helped you learn a few cool tricks you can do with Regular Expressions.
Please note that you can apply multiple patterns to include or exclude URLs in the crawl settings. If no matches are found for one of the rules, the rule will be skipped.
For example, with “blog” and “product” as “Include” patterns and “blogging” and “productive” as “Exclude” patterns, our crawler will crawl the URLs that contain the words “blog” or “product” AND do not contain the words “blogging” or “productive.”
If you set multiple URL rewrite rules, they are executed sequentially: each rule is applied to the result of the previous rewrite.
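Because each rewrite rule works on the output of the one before it, the order of the rules matters. The sketch below illustrates this with two hypothetical rules; the helper `apply_rules` and both rules are invented for the example and only approximate the behaviour described here.

```python
import re

def apply_rules(url, rules):
    """Apply (pattern, replacement) rules in order, each on the previous result."""
    for pattern, replacement in rules:
        url = re.sub(pattern, replacement, url)
    return url

# Hypothetical rules: move URLs to the staging subdomain,
# then switch staging URLs from https to http.
to_staging = (r"ahrefs\.com", "staging.ahrefs.com")
to_http = (r"^https:\/\/staging\.", "http://staging.")

url = "https://ahrefs.com/blog"

in_order = apply_rules(url, [to_staging, to_http])
reversed_order = apply_rules(url, [to_http, to_staging])

print(in_order)        # http://staging.ahrefs.com/blog
print(reversed_order)  # https://staging.ahrefs.com/blog
```

In the reversed order, the second rule has nothing to match when it runs first, so the URL keeps its https scheme.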
And remember that all these rules also apply to the seeds. So whenever you set them, make sure that our crawler has something to begin the crawl with.
You can test your Regex on this website: https://regex101.com/. Note that you should select “Golang” from the left menu.