Regular expressions (also known as REGEX or REGEXP) help you find URLs or text that match a particular pattern.
💡 REGEX is supported in Site Explorer, Site Audit, and certain endpoints in our API.
Learn more about all the places you can use them.
How REGEX works
Let’s move from the simplest examples to more advanced ones.
The following configuration includes all URLs that contain the word “blog” and excludes URLs containing the word “product.”
This instructs our bot to crawl:
https://ahrefs.com/blog
https://ahrefs.com/blog/seo-techniques/
https://ahrefs.com/academy/blogging-for-business
And to ignore:
https://ahrefs.com/blog/category/product-blog/
https://ahrefs.com/blog/ecommerce-out-of-stock-products/
That was easy, right?
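As a rough illustration, the include/exclude logic described above can be sketched in Python. The function name should_crawl and the two pattern constants are made up for this example, and Python's re module stands in for the regex engine the crawler actually uses.

```python
import re

# Hypothetical stand-ins for the include and exclude rules described above.
INCLUDE = re.compile(r"blog")
EXCLUDE = re.compile(r"product")

def should_crawl(url: str) -> bool:
    """True when the URL matches the include rule and not the exclude rule."""
    return bool(INCLUDE.search(url)) and not EXCLUDE.search(url)

urls = [
    "https://ahrefs.com/blog",
    "https://ahrefs.com/blog/seo-techniques/",
    "https://ahrefs.com/academy/blogging-for-business",
    "https://ahrefs.com/blog/category/product-blog/",
    "https://ahrefs.com/blog/ecommerce-out-of-stock-products/",
]

for url in urls:
    print("crawl " if should_crawl(url) else "ignore", url)
```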
But what if you want to include URLs from the /blog/ subfolder specifically, but not https://ahrefs.com/academy/blogging-for-business?
You can use a slightly more advanced pattern:
You probably wonder what those symbols before and after “blog” are.
In regex, you have to “escape” some symbols so that they are not treated as special characters. To do that, put a backslash \ before the character.
A simple dot . in regex, for example, stands for any character, but \. matches a literal full stop. That’s why I escaped the slash character in the example above like this: \/
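To see escaping in action, here is a small Python sketch. The pattern \/blog\/ is an assumption based on the description above (the original configuration is not reproduced in the text), and the example hostname ahrefs-com.example is invented to show how an unescaped dot over-matches.

```python
import re

# Assumed pattern for the /blog/ subfolder: "blog" with an escaped slash on each side.
BLOG_SUBFOLDER = re.compile(r"\/blog\/")

def in_blog_subfolder(url: str) -> bool:
    return BLOG_SUBFOLDER.search(url) is not None

print(in_blog_subfolder("https://ahrefs.com/blog/seo-techniques/"))
print(in_blog_subfolder("https://ahrefs.com/academy/blogging-for-business"))

# An unescaped dot matches any character; an escaped dot matches only ".".
made_up_url = "https://ahrefs-com.example/"
dot_unescaped = re.search(r"ahrefs.com", made_up_url) is not None
dot_escaped = re.search(r"ahrefs\.com", made_up_url) is not None
print(dot_unescaped, dot_escaped)
```

Note that a pattern with a trailing slash like this one would not match https://ahrefs.com/blog itself, since that URL has no slash after “blog”.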
How Ahrefs handles multiple REGEX expressions
Please note that you can apply multiple patterns to include or exclude URLs in the crawl settings in Site Audit. If no matches are found for one of the rules, the rule will be skipped.
The rules above will instruct our crawler to crawl the URLs that contain the words:
“blog” or “product”
AND do not contain the words:
“blogging” or “productive.”
For multiple URL rewrite rules, each new rule is applied sequentially to the result of the previous rewrite.
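The behavior described above can be sketched as follows. This is only an approximation of the logic, not the crawler's implementation: the rule lists, the two rewrite rules, and the function names are all hypothetical.

```python
import re

# Hypothetical rule sets modelled on the description above.
include_rules = [r"blog", r"product"]
exclude_rules = [r"blogging", r"productive"]

def should_crawl(url: str) -> bool:
    """Crawl when any include rule matches and no exclude rule matches."""
    included = any(re.search(rule, url) for rule in include_rules)
    excluded = any(re.search(rule, url) for rule in exclude_rules)
    return included and not excluded

# Hypothetical rewrite rules, applied in order:
# each rule sees the output of the previous one.
rewrite_rules = [
    (r"^http:", "https:"),
    (r"^https:\/\/ahrefs\.com", "https://staging.ahrefs.com"),
]

def rewrite(url: str) -> str:
    for pattern, replacement in rewrite_rules:
        url = re.sub(pattern, replacement, url)
    return url

print(should_crawl("https://ahrefs.com/blog/"))
print(rewrite("http://ahrefs.com/blog"))
```

In this sketch the second rewrite rule only matches because the first one has already changed http: to https:, which is what sequential application means.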
Some handy Regex tokens
^ - indicates the beginning of the URL
$ - indicates the end of the URL
. - the dot matches any single character
* - matches the preceding expression 0 or more times
+ - matches the preceding expression 1 or more times
? - matches the preceding expression 0 or 1 times
| - is equivalent to OR
[__] - matches any one of the characters listed inside the brackets; similar to |, but can also define ranges such as [0-9]
(__) - parentheses group the regex between them
\d - matches one digit
\D - matches one non-digit
\w - matches a word character
\W - matches a non-word character
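A quick way to get a feel for these tokens is to try each one against a short string. The sketch below uses Python's re module and made-up sample texts; the helper name matches is hypothetical.

```python
import re

# (pattern, text, expected result); the texts are made-up illustrations.
examples = [
    (r"^https", "https://ahrefs.com", True),      # ^ anchors to the start
    (r"\/$", "https://ahrefs.com/blog/", True),   # $ anchors to the end
    (r"colou?r", "color", True),                  # ? makes "u" optional
    (r"\/page\/\d+", "/page/12", True),           # \d+ needs one or more digits
    (r"\/page\/\D", "/page/12", False),           # \D rejects a digit
    (r"[a-c]+\d", "xxabc1", True),                # [a-c] is a character range
    (r"\w\W\w", "a-b", True),                     # word, non-word, word
]

def matches(pattern: str, text: str) -> bool:
    return re.search(pattern, text) is not None

for pattern, text, expected in examples:
    print(pattern, repr(text), matches(pattern, text) == expected)
```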
Different tools and platforms may use different regex libraries. Site Audit uses RE2. You can find its full syntax here.
Some Practical Examples
1. HTTPS URLs in the /wp-content/ subfolder
^https:.*\/wp-content\/
^ indicates the beginning of the URL. This rule will match all URLs that start with “https:” followed by any number of characters (.*) before “/wp-content/”.
URLs matching the pattern:
https://ahrefs.com/blog/wp-content/uploads/2019/03/fb-ranking-1-image-1.png
URLs not matching the pattern:
http://ahrefs.com/blog/wp-content/uploads/2019/03/fb-ranking-1-image-1.png
2. URLs in a subfolder, including the directory URL itself.
\/blog(\/.*)?$
This rule will match all URLs that end ($) with “/blog”, optionally followed by a slash and any number of characters: (\/.*)?. The question mark in this pattern matches the expression in the parentheses zero or one time, making it optional.
URLs matching the pattern:
https://ahrefs.com/blog
https://ahrefs.com/blog/301-redirects/
URLs not matching the pattern:
https://ahrefs.com/blogging
https://ahrefs.com/academy/blogging-for-business
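The four URLs above can be checked against the rule with a few lines of Python. This is just a sketch using Python's re module, and in_blog_section is a name invented for the example.

```python
import re

BLOG_SECTION = re.compile(r"\/blog(\/.*)?$")

def in_blog_section(url: str) -> bool:
    return BLOG_SECTION.search(url) is not None

for url in [
    "https://ahrefs.com/blog",
    "https://ahrefs.com/blog/301-redirects/",
    "https://ahrefs.com/blogging",
    "https://ahrefs.com/academy/blogging-for-business",
]:
    print(in_blog_section(url), url)
```

The last two URLs fail because “/blog” is followed by “ging”, which is neither the end of the URL nor a slash.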
3. URLs that contain @ or % symbols
@|%
or [@%]
Both | and [__] work as OR.
URLs matching the pattern:
https://ahrefs.com/@timsoulo
https://ahrefs.com/%D1%81%D0%B5%D0%BE
URLs not matching the pattern:
https://ahrefs.com/blog/nofollow-links
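To confirm that the two spellings behave the same way on these URLs, here is a short Python sketch. The variable and function names are invented for the example.

```python
import re

ALTERNATION = re.compile(r"@|%")
CHAR_CLASS = re.compile(r"[@%]")

def has_at_or_percent(url: str) -> bool:
    by_alternation = ALTERNATION.search(url) is not None
    by_char_class = CHAR_CLASS.search(url) is not None
    # Both spellings should always agree.
    assert by_alternation == by_char_class
    return by_alternation

for url in [
    "https://ahrefs.com/@timsoulo",
    "https://ahrefs.com/%D1%81%D0%B5%D0%BE",
    "https://ahrefs.com/blog/nofollow-links",
]:
    print(has_at_or_percent(url), url)
```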
4. “Add to Cart” URLs in WooCommerce
\?add-to-cart=
Keep in mind that ? is a special symbol in regex. To use it as a literal question mark, don’t forget to escape it like this: \?
URLs matching the pattern:
https://yourdomain.com/?add-to-cart=25
URLs not matching the pattern:
https://yourdomain.com/smartphones
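The Python sketch below checks both URLs and also shows what happens when the backslash is left out. It uses Python's re module for illustration; other engines may word the error differently, and the names are invented for the example.

```python
import re

ADD_TO_CART = re.compile(r"\?add-to-cart=")

def is_add_to_cart(url: str) -> bool:
    return ADD_TO_CART.search(url) is not None

print(is_add_to_cart("https://yourdomain.com/?add-to-cart=25"))
print(is_add_to_cart("https://yourdomain.com/smartphones"))

# Without the backslash, "?" is a quantifier with nothing before it to repeat,
# so Python rejects the pattern outright.
try:
    re.compile(r"?add-to-cart=")
    unescaped_is_valid = True
except re.error:
    unescaped_is_valid = False
print(unescaped_is_valid)
```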
5. URLs containing a year (4 digits)
[0-9]{4}
or \d{4}
[0-9]{4} will match all URLs containing four ({4}) digits ([0-9]) in a row.
\d{4} does the same, as \d stands for one digit.
URLs matching the pattern:
https://yourdomain.com/best-smartphones-2019
URLs not matching the pattern:
https://yourdomain.com/smartphones
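Here is a small Python sketch comparing the two spellings. One assumption to flag: in Python, \d also matches non-ASCII digits unless the re.ASCII flag is set, so the flag is used here to keep \d equivalent to [0-9]. The last URL is made up to show a side effect of the rule.

```python
import re

FOUR_DIGITS_CLASS = re.compile(r"[0-9]{4}")
# re.ASCII restricts \d to 0-9 in Python, so both patterns mean the same thing.
FOUR_DIGITS_SHORT = re.compile(r"\d{4}", re.ASCII)

def contains_four_digits(url: str) -> bool:
    by_class = FOUR_DIGITS_CLASS.search(url) is not None
    by_short = FOUR_DIGITS_SHORT.search(url) is not None
    assert by_class == by_short
    return by_class

print(contains_four_digits("https://yourdomain.com/best-smartphones-2019"))
print(contains_four_digits("https://yourdomain.com/smartphones"))
# Longer digit runs also contain four digits in a row, so they match too.
print(contains_four_digits("https://yourdomain.com/item-12345"))
```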
6. All URLs of the subdomain (both http and https)
^https?:\/\/help\.ahrefs\.com
This rule will match all URLs that begin with "http://help.ahrefs.com" or "https://help.ahrefs.com".
The question mark in s? indicates that “s” is optional, so both http and https will match this rule.
URLs matching the pattern:
https://help.ahrefs.com
http://help.ahrefs.com/
http://help.ahrefs.com/site-audit
URLs not matching the pattern:
https://ahrefs.com/site-audit
ftp://help.ahrefs.com
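The sketch below checks the listed URLs in Python, with the dots written as \. so that they match only literal dots. The hostname help-ahrefs.com is invented to show what an unescaped dot would let through.

```python
import re

HELP_SUBDOMAIN = re.compile(r"^https?:\/\/help\.ahrefs\.com")

def on_help_subdomain(url: str) -> bool:
    return HELP_SUBDOMAIN.search(url) is not None

for url in [
    "https://help.ahrefs.com",
    "http://help.ahrefs.com/",
    "http://help.ahrefs.com/site-audit",
    "https://ahrefs.com/site-audit",
    "ftp://help.ahrefs.com",
]:
    print(on_help_subdomain(url), url)

# With unescaped dots, a made-up lookalike host would also match.
loose_match = re.search(r"^https?:\/\/help.ahrefs.com", "https://help-ahrefs.com") is not None
print(loose_match)
```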
7. Various file URLs
\.(jpg|gif|bmp|png|css|pdf)$
This rule will match all URLs that end ($) with .jpg OR .gif OR .bmp OR .png OR .css OR .pdf.
You can reduce this expression to target fewer extensions. For example, to match only URLs that end with .jpg OR .png:
.*\.(jpg|png)$
Parentheses (__) group the regex between them, and | stands for OR.
URLs matching the pattern:
https://ahrefs.com/blog/wp-content/uploads/2019/03/fb-ranking-1-image-1.png
http://ahrefs.com/blog/wp-content/uploads/2019/03/fb-ranking-1-image-1.png
URLs not matching the pattern:
https://ahrefs.com/site-audit
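A short Python sketch of the file-extension rule follows. The last URL, with a query string, is made up to illustrate one consequence of anchoring with $.

```python
import re

FILE_URL = re.compile(r"\.(jpg|gif|bmp|png|css|pdf)$")

def is_file_url(url: str) -> bool:
    return FILE_URL.search(url) is not None

image = "ahrefs.com/blog/wp-content/uploads/2019/03/fb-ranking-1-image-1.png"
print(is_file_url("https://" + image))
print(is_file_url("http://" + image))
print(is_file_url("https://ahrefs.com/site-audit"))
# Because of $, the extension must be the very last thing in the URL.
print(is_file_url("https://" + image + "?v=2"))
```

If you also need to catch files served with a query string, the pattern would have to allow for it; as written, it deliberately does not.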
8. Using URL rewrite rules to crawl a staging website in Site Audit
You can use URL rewrite rules in Site Audit settings to replace some parts of the URLs with another value.
For example, when your staging website is hosted on a subdomain such as staging.ahrefs.com:
This rule will replace "ahrefs.com" in every URL with "staging.ahrefs.com"
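A rewrite like this can be approximated in Python with re.sub. The pattern and replacement below are assumptions made for illustration, since the exact rule is not reproduced in the text, and to_staging is an invented name. The pattern is anchored to the start of the URL so that a URL already on the staging host is left alone.

```python
import re

# Hypothetical rewrite rule: keep the scheme (group 1), swap the host.
PATTERN = r"^(https?:\/\/)ahrefs\.com"
REPLACEMENT = r"\1staging.ahrefs.com"

def to_staging(url: str) -> str:
    return re.sub(PATTERN, REPLACEMENT, url)

print(to_staging("https://ahrefs.com/blog"))
# Applying the rule a second time changes nothing.
print(to_staging(to_staging("https://ahrefs.com/blog")))
```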
9. Using URL rewrite rules with numbered capturing groups
With capturing groups, you can replace several parts of the URL with one rule.
Pattern to match:
www\.ahrefs\.com([^\?|#]*)?([#]?[^\?]*)\??(.*)
Replace with:
www.ahrefs.com\1?parameter1=5273&\2?parameter2=7465&\3
([^\?|#]*) is the 1st capturing group
([#]?[^\?]*) is the 2nd capturing group
(.*) is the 3rd capturing group
\1?parameter1=5273& inserts the text matched by the 1st capturing group, followed by "?parameter1=5273&"
\2?parameter2=7465& inserts the text matched by the 2nd capturing group, followed by "?parameter2=7465&"
\3 inserts the text matched by the 3rd capturing group
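To see what each group captures, the pattern and replacement above can be run through Python's re.sub, which happens to use the same \1, \2, \3 notation. The input URL is made up for the example, and rewrite_with_groups is an invented name.

```python
import re

PATTERN = r"www\.ahrefs\.com([^\?|#]*)?([#]?[^\?]*)\??(.*)"
REPLACEMENT = r"www.ahrefs.com\1?parameter1=5273&\2?parameter2=7465&\3"

def rewrite_with_groups(url: str) -> str:
    return re.sub(PATTERN, REPLACEMENT, url)

url = "https://www.ahrefs.com/blog#section?x=1"

# Show what each capturing group picked up from the made-up URL.
match = re.search(PATTERN, url)
print(match.groups())

print(rewrite_with_groups(url))
```

For this input, the three groups capture “/blog”, “#section”, and “x=1”, and each is written back with the literal text from the replacement string around it.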
10. Filter for text that is at least 50 characters long
^[\s\S]{50,}$
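The Python sketch below shows where the 50-character boundary falls and why the pattern uses [\s\S] rather than a dot. The sample strings and the function name are made up for the example.

```python
import re

LONG_TEXT = re.compile(r"^[\s\S]{50,}$")

def is_long_text(text: str) -> bool:
    return LONG_TEXT.search(text) is not None

print(is_long_text("a" * 49))  # one character short
print(is_long_text("a" * 50))  # exactly 50 characters

# [\s\S] matches any character including line breaks, while a plain dot
# stops at a line break by default.
two_lines = "a" * 30 + "\n" + "b" * 30
print(is_long_text(two_lines))
dot_version_matches = re.search(r"^.{50,}$", two_lines) is not None
print(dot_version_matches)
```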
Further resources
I hope this article helped you learn a few cool tricks you can do with Regular Expressions.
You can test your Regex on this website: https://regex101.com/. Note that you should select “Golang” from the left menu.