One of my pet peeves is that most regular expressions matching URLs fall somewhat short of what I expect. This pattern from John Gruber is so far the best I’ve found but, like virtually every other implementation, it doesn’t match URLs without a protocol. Nobody expects to have to include “www” in a URL for it to work these days, and in daily conversation it’s rare to enunciate “aitch-tee-tee-pee-colon-slash-slash” when you refer to some website. So why is it so hard to match URLs without using these strings as crutches?

To people (or at least to people with some computer competence) it’s pretty easy to see the difference between a seemingly valid URL and an invalid one. To a computer it’s a lot harder than you might think, if you want it to be both flexible and accurate at the same time.
To clarify: I want to match any of the following URLs:
- http://www.domain.com
- www.domain.com
- domain.com
- subdomain.domain.com
- subdomain.domain.com/some/path?to&stuff
- www.domain.com/
- ftp://martin@domain.com:8080
Naturally, I do not want to match false positives such as:
- http@\/\:www.domain.combo
- www.breakfast.com-org
- bad.punctuation
- http://.com
This has turned out to be a lot more difficult than I thought, and not only because the ActionScript implementation of RegExp is, in my opinion, flaky as hell, but because I found it hard to explain, in rules that a computer understands, the human intuition that tells us “google dot com” is a good URL while “burger dot ham” isn’t.
I made an honest attempt at resolving this problem with the URLValidator class back in February. The class performs pretty well at validating and finding URLs. However, it works by doing some nasty jumping through various string-parsing hoops, and I’m currently rewriting it to be a tad more legible and efficient. I’ll publish the code when it’s done, but I thought I’d share some of the lessons and frustrations I’ve had at this point.
Warning! This stuff is pretty geeky. No eye candy experiments here.
The components of a valid URL
So let’s have a look at what can, should, or may be found in a valid URL. I’ve made this handy chart as an intro:
There are several points to be considered here:
- The protocol: Just matching http:// literally is at best naïve. There are multiple other protocols to consider. These days there seem to be even more of them as apps invent their own protocol prefixes. iTunes podcast prefixes start with itpc://, Spotify links have spotify:// and you can even send commands to TextMate with txmt://. Perhaps you don’t want to support all these crazy protocols, but you need to at least be able to accept ( http | https | ftp ). Matching the protocol is also extremely handy to use as a word boundary. That is: search for something that looks like a protocol, and that’s where we’ll start our matching.
Seeing as I want the protocol to be optional, there’s no such luck here though. In order to handle the protocol if it is present you might do something like this:
- Does the string contain ://?
- If yes, check for valid characters before the match and proceed to validate the domain after the match.
- If no, proceed to validate the domain from the start of the string.
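The optional-protocol check above can be sketched roughly as follows. This is Python rather than ActionScript, and the function name `split_protocol` and the pattern `PROTOCOL_RE` are made up for illustration. The character rule (a-z, 0-9 and “-”) is the one used in this post, not the full rule from the URI specification.

```python
import re

# Assumption: protocol names use the same characters as domain names
# (a-z, 0-9, "-"), as this post describes. Hypothetical helper names.
PROTOCOL_RE = re.compile(r"^[a-z0-9-]+$", re.IGNORECASE)


def split_protocol(candidate):
    """Split a candidate URL into (protocol, rest).

    protocol is None when the string contains no "://".
    Raises ValueError when the text before "://" is not a plausible protocol.
    """
    before, separator, after = candidate.partition("://")
    if not separator:
        # No "://" at all: validate the domain from the start of the string.
        return None, candidate
    if not PROTOCOL_RE.match(before):
        raise ValueError("invalid characters before '://': %r" % before)
    # Protocol looks fine: validate the domain after the match.
    return before.lower(), after
```

For example, `split_protocol("domain.com")` hands back the whole string untouched, while `split_protocol("http://www.domain.com")` separates `"http"` from `"www.domain.com"`.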
- The URL may contain the www. prefix, or it may not. It might include any number of subdomains or none at all. http://a.long.subdomain.bin-debug.net/ is a valid URL, so searching for www. is silly as it would exclude any other valid subdomain. Yes, www. is certainly indicative of a URL, but in my opinion it should be treated like any other subdomain.
- Matching domain + Top Level Domain:
Optional protocols, haphazard subdomains… Is there anything concrete left to match?
Yes. Yes there is. Every valid URL will somewhere in it contain the following structure:
[domain-name][.][TLD][(end-of-string | / | ? | : )]
where:
- [domain-name] is a string of 1 or more characters in the range a-z, 0-9 or “-” and may not start with “-”
- [.] is literally a period and absolutely nothing else.
- [TLD] is one of the 267 legal Top Level Domains.
- [(end-of-string | / | ? | : )] is either the “/” character, the “?” character, the “:” character (usually followed by numbers) or nothing at all. That is to say: if the next character is a space, the URL string is at its end.
- After the domain-dot-TLD-/? sequence you’ll find all sorts of stuff. These are the legal characters you might run into: /~*?.,;:-%=_$s&’”@!)(][+. To be honest, if I manage to confirm a valid URL up to the character following the TLD, I just run a match on those characters until I find white space, although there are probably some combinations of these legal characters that are not valid.
Ok. That's something to hang on to. Knowing that no white space is allowed in URLs, and that we can't rely on the handy protocol to find the beginning of the potential URL, we instead fixate on this structure, which is guaranteed to be present in a valid URL (disregarding IP addresses for the moment). Our method of matching should then, from a mile-high view, be:
- Is there a valid domain-dot-tld-endcharacter structure present in the string?
- If yes, work backwards to find the legal structures of subdomains or protocols or end-of-string.
- Work forward to find end-of-string or illegal character.
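The first of these steps, finding the domain-dot-TLD-endcharacter structure, might look something like the sketch below. It is Python rather than ActionScript, `ANCHOR_RE` and `find_anchor` are hypothetical names, and the TLD is only checked for shape (2 to 6 letters) here, not against the real list.

```python
import re

# Structural check only. Hypothetical names; the TLD is not yet verified.
ANCHOR_RE = re.compile(
    r"(?<![a-z0-9-])"        # start at the beginning of a label, not inside one
    r"([a-z0-9][a-z0-9-]*)"  # domain name: a-z, 0-9 or "-", not starting with "-"
    r"\."                    # literally a period
    r"([a-z]{2,6})"          # something shaped like a TLD
    r"(?=$|[\s/?:])",        # end of string, white space, "/", "?" or ":"
    re.IGNORECASE,
)


def find_anchor(text):
    """Return (domain, tld, start_index) for the first structural match, or None."""
    match = ANCHOR_RE.search(text)
    if match is None:
        return None
    return match.group(1), match.group(2), match.start()
```

Note that a purely structural check like this still accepts `domain.combo` as domain plus TLD, so on its own it cannot reject the first false positive from the list at the top of the post. The TLD has to be verified separately.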
Validating the Domain
Having decided to start with the root domain for your matching this is what you're looking for:
http://www.all.kinds.of.crap.domain-name.com/even/more?crap#here
If you match this unbroken sequence (remember [domain-name][.][TLD][end-of-string or ( ? | / | : )]) you have a starting point. Now you need to see if whatever comes before or after your match is valid.
The end of the TLD string is very important. Consider this: http://sub.combo.co/. In this case we find the “.com” first, but it isn’t followed by a legal end character. Moving on, we see that .co is, and identify the TLD correctly.
Verifying the TLD
At this point regular expressions won’t be able to help you any more. At least not efficiently. The Top Level Domain list just isn’t really mappable in patterns. For example, .cc, .cd and .cf are all valid TLDs, but .ce is not. If you only care that the TLD is structurally sound, not that it actually exists, you would typically match a string of 2-6 characters in the a-z range and leave it at that.
If you do want to make sure that domain.fuck or domain.yo aren’t regarded as positives and that the valid country codes .no and .ly are, there’s nothing for it but to use the actual list and loop through it.
For your convenience here is a code example of a Vector.<String> containing all TLDs and sorted by length.
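As a rough illustration of the list lookup (in Python, not the ActionScript Vector mentioned above), here is a sketch that uses a small hand-picked sample of TLDs. `SAMPLE_TLDS`, `END_CHARS` and `tld_at` are hypothetical names, and the sample is nowhere near the full list.

```python
# Illustration only: a handful of real TLDs, NOT the complete list.
# Sorted longest first, so ".com" is tried before ".co".
SAMPLE_TLDS = sorted(
    ["com", "org", "net", "no", "ly", "co", "cc", "cd", "cf", "tv", "info", "museum"],
    key=len,
    reverse=True,
)

# Characters allowed directly after the TLD (besides end of string or white space).
END_CHARS = "/?:"


def tld_at(text, dot_index):
    """Return the TLD starting right after the dot at dot_index, or None.

    A candidate only counts if it is followed by a legal end character.
    """
    start = dot_index + 1
    lowered = text.lower()
    for tld in SAMPLE_TLDS:
        if not lowered.startswith(tld, start):
            continue
        end = start + len(tld)
        if end == len(text) or text[end] in END_CHARS or text[end].isspace():
            return tld
    return None
```

With `text = "http://sub.combo.co/"`, the first dot yields nothing (“.com” is followed by “b”), while the second dot yields `"co"`, as in the walk-through above.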
Stop. Exhale.
The rest of the string
Once you’ve identified the aforementioned structure you need to look forward and backward in the string. When looking forward you’re typically looking for either an illegal character or whitespace and assuming the end of the URL at the matching point.
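The forward scan can be sketched in a few lines of Python. `LEGAL_TAIL` and `scan_forward` are hypothetical names. The character set is based on the list of legal characters given earlier in this post, with plain ASCII quotes instead of typographic ones, and with “#” added as an assumption because it appears in the example URL.

```python
import string

# Assumption: letters, digits and the punctuation listed in this post,
# plus "#" (seen in the example URL but not in the list).
LEGAL_TAIL = set(string.ascii_letters + string.digits + "/~*?.,;:-%=_$&'\"@!)(][+#")


def scan_forward(text, start):
    """Return the index just past the last legal character at or after start."""
    index = start
    while index < len(text) and text[index] in LEGAL_TAIL:
        index += 1
    return index
```

Given the index of the character that follows the TLD, `text[start:scan_forward(text, start)]` is the path-and-query part of the candidate URL.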
When you look backward there are several different options you need to match. You will be looking for either:
- A subdomain. These are groups of characters in the a-z,0-9 range or “-”, just like domain names. They are separated by dots and there doesn’t seem to be any limit to how many you can make, but usually you won’t find more than one of these.
- A protocol. Protocols seem to follow the same convention as domain names, except for a couple of characteristics.
- They are followed by “://” rather than “.”
- They’ll always be found at the start of the URL string, if present.
- To my knowledge you can only have a single protocol in a URL.
- A user prefix. These are the ftp://martin@domain.com or https://user:password@domain.com variants.
Why can’t it ever be easy…
So a pattern to match whatever comes before the domain-dot-TLD-/? sequence would look something like this:
- Start from the beginning of the validated domain.
- Is there anything at all (except white space) before the validated domain?
- If yes; Does it contain “://”, and only once?
- If yes; is “://” preceded by a string of characters in the range (“a-z”, “0-9”, “-”)?
- If yes; Start from the end of “://”. Is there anything before the validated domain?
- If yes; Does it consist of valid domain strings separated by dots (“a-z”, “0-9”, “-”) and optionally either or both of “@” and “:”?
And so forth.
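A minimal Python sketch of these backward checks, assuming the text between the preceding white space and the validated domain has already been cut out as `prefix`. All names are hypothetical, and the rule for the user prefix (same characters as a domain label, with an optional “:password”) is a simplifying assumption.

```python
import re

# Hypothetical patterns, following the character rules described in this post.
PROTOCOL_RE = re.compile(r"^[a-z0-9-]+$", re.IGNORECASE)
USER_RE = re.compile(r"^[a-z0-9-]+(?::[a-z0-9-]+)?$", re.IGNORECASE)
SUBDOMAINS_RE = re.compile(r"^(?:[a-z0-9][a-z0-9-]*\.)*$", re.IGNORECASE)


def prefix_is_valid(prefix):
    """Check whatever precedes the validated domain, e.g. "http://www."."""
    if prefix == "":
        return True  # nothing before the domain
    if prefix.count("://") > 1:
        return False  # at most one protocol
    protocol, separator, rest = prefix.partition("://")
    if separator:
        if not PROTOCOL_RE.match(protocol):
            return False
    else:
        rest = prefix  # no protocol present
    # Optional user prefix: "martin@" or "user:password@".
    user, at_sign, subdomains = rest.rpartition("@")
    if at_sign and not USER_RE.match(user):
        return False
    # Zero or more subdomain labels, each followed by a dot.
    return bool(SUBDOMAINS_RE.match(subdomains))
```

For `ftp://martin@domain.com:8080` the prefix is `"ftp://martin@"`, which passes, while the prefix `"http://subdo://"` is rejected because it contains “://” twice.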
But wait!
So you’ve finally got it working perfectly. Your pattern gets matches on all the valid examples at the top of this post, and rejects all the false positives. Pat yourself on the back, and then consider this.
What about characters that invalidate your URL, but frequently occur in the wild? Or what about the characters that are allowed in the URL, but almost certainly were meant as punctuation.
For example:
- =”http:// isn’t valid as the beginning of a protocol, and .com” isn’t a valid TLD but you will probably run into this: <a href=”http://domain.com”>
- While “)”, “,” and “.” are legal characters in a URL, they probably weren’t meant to be part of the link in this example:
“W00t! Listening to TWIT (http.twit.tv).”
However, in this case:
“Have you guys heard of this? http://en.wikipedia.org/wiki/WWII_(disambiguation) “
It’s actually meant to be included.
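One common heuristic for telling these two cases apart is to strip trailing punctuation, but keep a closing parenthesis when it balances an opening one inside the URL. The Python sketch below is an assumption about how that could be done, not the method used in URLValidator, and `trim_trailing_punctuation` is a made-up name.

```python
def trim_trailing_punctuation(url):
    """Drop trailing characters that were most likely sentence punctuation."""
    while url:
        last = url[-1]
        if last in ".,;:!?'\"":
            url = url[:-1]
        elif last == ")" and url.count(")") > url.count("("):
            # An unmatched ")" most likely closes a parenthesis in the prose.
            url = url[:-1]
        else:
            break
    return url
```

With this, a candidate such as `twit.tv).` is cut back to `twit.tv`, while the Wikipedia link keeps its final “)” because it has a matching “(”.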
As is plain to see, it’s not easy validating URLs with precision once you remove the protocols. It seems like it should be, but it’s not. I never got it to work, but just the fact that this pattern exists tells you something about just how frustrating an experience it is. The usual compromise is to decide (perhaps sensibly) that it’s not worth the effort to try and catch unlikely cases such as http://subdo://main.com since the chances of it occurring in the wild are slim. Me, I obsess about shit like this.
If any of you have some insights that I may have missed, be sure to leave a comment. I know that this text excludes both local characters (æøå) and IP-addresses, which is something I may or may not rectify.
The question is do you want to validate all given domains.
Or known domains.
You could perhaps do a
^(http|https):\/\/([a-zA-Z0-9\-]+\.)*domainname\.tld\/.+
We can break it down into the following segments:
^(http|https) => starts with either http or https
:\/\/ => must be followed by ://
([a-zA-Z0-9\-]+\.)* => any number of subdomains is accepted
If you want to test for a particular subdomain, use (subdomain\.) instead.
Then we match:
domainname\.tld\/ => the domain name we are looking for
.+ => everything else in the URI.
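For what it’s worth, the known-domain idea can be tried out with a short Python sketch. `known_domain_pattern` is a hypothetical helper and `example.com` below is just a placeholder domain.

```python
import re


def known_domain_pattern(domain):
    """Build a pattern for http(s) URLs on one known domain, subdomains allowed."""
    return re.compile(
        r"^https?://"            # starts with http:// or https://
        r"(?:[a-z0-9-]+\.)*"     # any number of subdomains
        + re.escape(domain)      # the known domain, with dots escaped
        + r"/.+",                # a slash and everything else in the URI
        re.IGNORECASE,
    )
```

`known_domain_pattern("example.com").match("http://www.example.com/page")` returns a match object, while `http://notexample.com/page` is rejected because “notexample” is not a dot-separated subdomain of the known domain.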
Related posts that can be interesting for you:
* http://www.shauninman.com/archive/2006/05/08/validating_domain_names
* http://stackoverflow.com/questions/399932/can-i-improve-this-regex-check-for-valid-domain-names
Perhaps this guy could help you out :)
“I Know Regular Expressions by XKCD” http://xkcd.com/208/
@Cristobal
I remember seeing both these examples when I wrote the previous iteration of this class, but the top comment on the stackoverflow post combined with the fact that I couldn’t get either of these patterns to work in ActionScript made me decide that the solution was to break up the matching between a RegEx and storing the TLDs in a Vector or a Dictionary.
Also, as I mentioned in the post I don’t want protocols to be mandatory. If that was the case the pattern from Gruber would get me almost there.
Oh, and that xkcd is entirely accurate of how I expect the outcome of this. Chicks love regular expressions. ;)
OK, I got hooked on this, so I more or less created a pattern which tests for a valid URL.
What remains is to validate:
1. The protocol, if one is given
2. The domain name against the IANA list http://bit.ly/SGbyo
The pattern is the following:
Which we can split into the following patterns.
1. Does it start with a protocol (optional)
2. Does it have authentication (optional)
3. Does it have a subdomain (optional)
4. Domain.TLD
5. Port (optional).
6. Remaining rest: / or /.+ or nada
If you execute the expression over a URL it will give you an array with 12 elements,
which can contain the following values.
So the only false positives you would have to check for are in the protocol, if one is given, and in the TLD, with a simple match against the list of valid TLDs.
I created a simple AIR app that you can use to test URLs.
@app http://ria.creuna.com/wp-content/uploads/2009/12/UrlMatcher.air
@pastie for logic http://pastie.org/728031
That’s awesome Cris! Would you mind letting me have a look at the source of that AIR app?
Thanks, I think I embedded the source so that it’s readable.
If not it’s available at http://github.com/cristobal/UrlMatcher