Hunting bad regex with good regex.

In this post I’ll look at how a simple regex flaw I found in a web application led me down a pretty big exploratory hole of trying to search for regex vulnerabilities in applications… using regex; the results have since been useful on many of my engagements.

On a pentest for a web-app a few months ago I saw something quite ridiculous. A regular expression, similar to the one below, was being used in client-side JavaScript to validate the format/appearance of a user-supplied email address:

If the address matched the regex, the validation passed, and the email address was sent to the server via AJAX. If not, the application threw a hissy-fit and asked the user to try again.
This regex is very basic, but does kinda represent a stripped-down version of an email address correctly. Let’s step through its key parts:

  1. It starts off okay: ([a-zA-Z0-9]+\.)* => “match one or more alphanumerics followed by a full-stop, zero to infinite times” (eg. i.have.a.huge.).
  2. Then: [a-zA-Z0-9]+ => “one or more alphanumeric characters” (eg.
  3. @ => A mandatory “@” symbol (eg.
  4. And then the first step repeats again: ([a-zA-Z0-9]+\.)* => (eg.

BUT WAIT ONE MOMENT BATMAN! What the bloody hell is that at the end of this regex? A cheeky (.*) !!!

The unescaped ( . ) followed by a ( * ) means “match any character, zero or more times”. Anything.
I imagine that, at some point in time, the regex originally looked something like this and matched predefined TLDs:

…but a lazy developer got tired of manually adding all the fancy TLDs available these days ( .williamhill is a legit TLD! ) and decided to just slap this dot-star in there instead to match anything. “Future-proofing,” they probably thought. What harm could come of that eh?
A LOT. Now the regex matches all manner of naughty input. E.g:
daniel@company.<script src="foo">
daniel@company.' OR 1=1;
Sure enough, starting the input with a valid-looking email address and then inputting anything under the sun in the TLD portion bypassed the client-side validation check and submitted the form. Even better, our lazy dev must have copy-pasted the ludicrous new regex onto the server-side validation and BOOM! This was a stored XSS attack (who needs to output-encode letters, numbers and full stops, right? ;P). Our lazy dev earned (him|her)self the new moniker of “regex noob” that day.
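To make the bypass concrete, here is a minimal sketch in Python. The lax pattern is an assumption: it is reassembled from the four parts stepped through earlier plus the trailing (.*), so the application's real regex may have differed. The stricter variant with a fixed TLD list, and the names LAX_EMAIL, STRICT_EMAIL and passes, are hypothetical and only there for comparison.

```python
import re

# Assumption: reassembled from the parts described in the post, plus the
# trailing (.*). The application's real pattern may have differed.
LAX_EMAIL = re.compile(r"([a-zA-Z0-9]+\.)*[a-zA-Z0-9]+@([a-zA-Z0-9]+\.)*(.*)")

# Hypothetical stricter variant: the dot-star swapped for a short, fixed TLD list.
STRICT_EMAIL = re.compile(
    r"([a-zA-Z0-9]+\.)*[a-zA-Z0-9]+@([a-zA-Z0-9]+\.)*(com|org|net)"
)


def passes(pattern, value):
    """True when the whole input matches, as an anchored validator would require."""
    return pattern.fullmatch(value) is not None


payloads = [
    'daniel@company.<script src="foo">',
    "daniel@company.' OR 1=1;",
]

for payload in payloads:
    # Print the verdict of each pattern for the same input.
    print(payload, passes(LAX_EMAIL, payload), passes(STRICT_EMAIL, payload))
```

Both payloads get through the dot-star version and are rejected once the tail is restricted, even when the whole string has to match.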
I found this so absurd that I wondered if bad regexes like this were actually more common than I’d have expected, and OH.MY.GOD. When you start actively looking for bad regexes in applications, not just web, they are everywhere!
The problem is though that, today, even basic webpages can often pull in over 10MB of JavaScript, which will mostly be “corner-rounding” rubbish-script. So tracking down the potential .* can be tedious.
So, I took a look at what combinations of regex meta-characters could lead a poorly thought-out regular expression to inadvertently match malicious strings. This is the list I came up with:

The keen-eyed of you will have noticed that the list includes things like ( \W+ ) which is the equivalent of ( [^a-zA-Z0-9_]+ ) or “one or more non-word characters”. If you’re wondering how on earth it’d be possible to execute arbitrary JavaScript code without a word character, check this out:
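The example that originally sat here is not reproduced. As a stand-in (an assumption, not necessarily what was linked), one well-known technique is JSFuck-style JavaScript, which is written using only the characters [ ] ( ) ! and +. The fragment below evaluates to the string "f" in JavaScript, and this small Python sketch checks that it contains no word characters at all, so a ( \W+ ) check would wave it through.

```python
import re

# A well-known JSFuck-style fragment; in JavaScript it evaluates to "f"
# (![]+[] is the string "false", and +[] is 0, so this is "false"[0]).
# It is a stand-in for illustration, not the post's own example.
fragment = "(![]+[])[+[]]"

# The whole fragment is made of non-word characters...
only_non_word = re.fullmatch(r"\W+", fragment) is not None
# ...and not a single letter, digit or underscore appears in it.
has_word_char = re.search(r"\w", fragment) is not None

print(only_non_word, has_word_char)
```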
Anyway, assuming we have some data that may contain regexes, and that these regexes may contain ‘bad’ sequences of meta-characters, the below regex matches all of the scenarios in the list to try and hunt those sequences out.

Simply throwing this regex as a search query against your Burp history/data parsing tool will often find interesting results in input validation areas of applications.
And if you wanted to push the boat out, you could begin the regex with ( ((^|[^\\])(\\\\)*) ); this would ensure that any matches did not have an escaping backslash character in front, like ( \.* ) or ( \\\\\\\\\\\\\\\W+ ).
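As a rough illustration of the idea, here is a much smaller hunting regex sketched in Python. It is not the regex from this post: only ( .* ), ( \W+ ) and the unescaped-prefix trick are taken from the text, while the rest of the short list (the + variants, \S and \D) and the names GREEDY, UNESCAPED, HUNTER and find_suspects are assumptions made for the example.

```python
import re

# Hypothetical, deliberately short list of "match almost anything" sequences.
# Only .* and \W+ are named in the post; the other variants are assumptions.
GREEDY = r"(?:\.|\\[WSD])[*+]"

# Prefix described in the post: start of input or a non-backslash character,
# then an even number of backslashes, so what follows is not escaped.
UNESCAPED = r"(?:^|[^\\])(?:\\\\)*"

HUNTER = re.compile(UNESCAPED + "(" + GREEDY + ")")


def find_suspects(source):
    """Return every suspicious meta-character sequence found in source."""
    return [m.group(1) for m in HUNTER.finditer(source)]


js_line = r"var re = /^([a-zA-Z0-9]+\.)*[a-zA-Z0-9]+@([a-zA-Z0-9]+\.)*(.*)$/;"
print(find_suspects(js_line))      # the trailing dot-star is flagged
print(find_suspects(r"foo\.*bar")) # escaped dot: nothing flagged
```

Sticking to plain groups, character classes and alternation keeps the pattern portable across the regex engines in different tools.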
You can stop reading here, copy that regex, and start trying to see if you find anything interesting with it. Or carry on to see some caveats of this regex, another bad example and how it can be exploited 🙂
Part 2
Some caveats to success using regex to match regex:

  1. Because of the complex nature of regex, you would need a large series of regexes to attempt to capture every possible place where bad sequences of meta-characters could exist in another regex; so things ARE missed using the above.
  2. Because regex implementations differ between JS, Java etc., creating a one-size-fits-all regex that can be used within all tools (ie. a browser or Burp) and scenarios is nigh on impossible. Burp’s Java implementation, for example, hates complicated regexes and has fewer features than, say, Perl’s; meanwhile MySQL regexes don’t even support lookarounds. This limits the effectiveness of any regex you could design, as you need to stick with basic regex constructs and not get fancy.

Imagine a web application using this regex for email appearance validation:

The problem here is, again, that our regex noob has not escaped his ( . ) characters. However, at first glance, it might not appear exploitable as it would seem that we can only insert a single arbitrary character in defined parts of the email address (the places where a literal . should be).
The regex allows the attacker to insert any number of alphanumeric characters, followed by any character, followed by any number of alphanumeric characters again, then has an @, and finally, repeats the pre-@ phase at the tail end. So normal-looking emails work,
as well as, wrongly, inputs like these: daniel$reece@admin| , d!a”n£i$e%l@h^i&b*u(r)
What it does not allow however, is multiple non-alphanumeric characters next to one another, so:
dan<script src=””>, won’t work.
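The behaviour described above can be sketched in Python. The application's actual pattern is not reproduced here, so UNESCAPED_DOT below is a hypothetical regex with the same flaw (an email check whose dots were left unescaped), and the sample addresses are made up for the demonstration.

```python
import re

# Hypothetical pattern with the flaw described: every "." was meant to be a
# literal dot but is unescaped, so it matches any single character.
UNESCAPED_DOT = re.compile(
    r"([a-zA-Z0-9]+.)*[a-zA-Z0-9]+@([a-zA-Z0-9]+.)*[a-zA-Z0-9]+"
)


def accepted(value):
    """True when the whole input satisfies the flawed pattern."""
    return UNESCAPED_DOT.fullmatch(value) is not None


samples = [
    "daniel.reece@example.com",        # ordinary address
    "daniel$reece@example!com",        # lone specials where a dot should be
    'dan<script src="">@example.com',  # specials sitting next to each other
]

for sample in samples:
    print(sample, accepted(sample))
```

The first two samples are accepted and the third is rejected, because each unescaped dot only buys one arbitrary character and must be preceded by at least one alphanumeric.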
How is this exploitable then? Easy:

Most modern browsers will see that the above is an illegal way to reference a src in a script tag, as there are no quotation marks, but will just fix it for you in the DOM. A friend of mine pointed out that, if the application also accepted URL-encoded input, the following would allow those quotes to still be sent:
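Here is a hedged sketch of why URL-encoding helps. It assumes a hypothetical pattern with unescaped dots (UNESCAPED_DOT), a made-up payload, and a server that URL-decodes the value after the regex check has run; none of these are taken from the real application. Each %XX escape is a single special character followed by alphanumerics, which is exactly what the flawed pattern tolerates.

```python
import re
from urllib.parse import unquote

# Hypothetical pattern: an email check whose dots were left unescaped.
UNESCAPED_DOT = re.compile(
    r"([a-zA-Z0-9]+.)*[a-zA-Z0-9]+@([a-zA-Z0-9]+.)*[a-zA-Z0-9]+"
)

# Made-up payload with every special character percent-encoded.
encoded = "dan%3Cscript%20src%3D%22x%22%3E@example.com"

# The encoded form satisfies the flawed validation...
passes_filter = UNESCAPED_DOT.fullmatch(encoded) is not None

# ...and decoding afterwards (assumed server behaviour) restores the quotes.
decoded = unquote(encoded)

print(passes_filter)
print(decoded)
```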

This is a slightly trickier regex issue to find and exploit. We can’t simply match all (.)’s in source-code, as practically everything would be a false positive, and ( .)? ) isn’t likely to be an issue in 99% of cases, so it is purposefully not matched by our regex-hunting regex.
So, what does this mean? Well, you can often automate the process of finding that “needle in a haystack” vulnerability, but you won’t do it every time 🙂 and that’s the reason we all have jobs!