Regular Expressions: Don’t Use Google Analytics Without Them

Note: I rescued this post from the now-defunct blueglass.com site and updated it with current screenshots.

I have to admit: I’m a recovering regexaphobe. When I was new to analytics, I remember someone sending me a snippet of regular expressions (AKA regex) to solve a goal setup conniption I was having. It looked like a foreign language to me. I was fascinated by it but repelled at the same time.

Sadly, being intimidated by regex kept me from doing more powerful analysis. I tried everything to avoid it and would copy and paste code from articles I had saved whenever I had to create a custom filter. But eventually I hit a wall I couldn’t scale unless I conquered this beast, so I set out on a quest to learn it. I resolved to learn only enough regex to help me as an analyst. No propeller for me, thankeww.

As nerdy as regex is, I’m writing this post because if you don’t know the basics, you too will be limited in your ability to create segments, goals, and filters in Google Analytics, or whatever Web-based analytics platform you’re using. (Even Omniture’s SiteCatalyst finally started supporting regex just this year. Welcome to the 21st century, guys!)

So I’m going to hit on the main regex characters you’ll need, without the use of geek speak. I will even subject myself to public scorn from my awesome programmer friends by sharing the goofy mnemonic devices I used early on to remember a few characters I just couldn’t seem to get down.

For ease of scanning, I’m also breaking my regex characters up into leagues to signify which ones I use most, occasionally and seldom to never.

Major League

Pipe (|)

The pipe character (|) is the regex equivalent of or. Let’s say you want to find out how many conversions you received from Google, Bing, or Yahoo. You could set up a segment that looks like this:

[Image: pipe character in regex in Google Analytics]


Tip: Remember to change the Condition field to Matching RegExp if you use regex to create a segment.

Another example of when I use the | character is when I’m creating a goal, and a step in the goal funnel or conversion can include more than one page:

[Image: pipe character in regex in Google Analytics]
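If you’d like to try the pipe outside of Google Analytics, here’s a minimal Python sketch. Python’s re module is a different regex engine from the one Google Analytics uses, but a simple pattern like this one should behave the same way. The traffic sources are made up for illustration.

```python
import re

# Hypothetical traffic sources; these names are illustrative only.
sources = ["google", "bing", "yahoo", "facebook", "newsletter"]

# The pipe means "or": match google OR bing OR yahoo.
search_engines = re.compile(r"google|bing|yahoo")

matched = [s for s in sources if search_engines.search(s)]
print(matched)  # ['google', 'bing', 'yahoo']
```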

Dot (.)

The . is a wildcard character. It means match any one character. It can be a number, letter, or special character (even a white space). By itself, it’s not that amazing, but with the help of the next playa, the asterisk (*), it’s all kindsa bad to the bone.

Asterisk (*)

This is the MVP of all regex characters, in my opinion. It says to match 0 or more of the character before it. In other words, it looks at the character before it (most often the . character) and says that character may be missing entirely, may appear once, or may repeat any number of times in a row.
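Here’s a small Python sketch of the dot, the asterisk, and the two together. It uses Python’s re module rather than Google Analytics itself, and every word and path in it is invented for the demo.

```python
import re

# The dot matches exactly one character of any kind.
words = ["bag", "big", "b g", "bg", "brag"]
dot_hits = [w for w in words if re.search(r"b.g", w)]
print(dot_hits)  # ['bag', 'big', 'b g']

# The asterisk means "zero or more of the character before it".
# fullmatch requires the whole string to match the pattern.
spellings = ["ggle", "gogle", "google", "gooooogle", "gaggle"]
star_hits = [w for w in spellings if re.fullmatch(r"go*gle", w)]
print(star_hits)  # ['ggle', 'gogle', 'google', 'gooooogle']

# Together, .* means "any run of characters, including none at all".
paths = ["/products/shoes/reviews", "/products//reviews", "/products/shoes"]
between_hits = [p for p in paths if re.search(r"/products/.*/reviews", p)]
print(between_hits)  # ['/products/shoes/reviews', '/products//reviews']
```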

To be honest, the Advanced Segments area was made so that you could easily go without ever using regex to create segments. It may take you longer — like if you use the Or operator to include all of the different sites that you want to include in your social media segment — but you can get away with it. Between and/or operators and the ability to choose options like Contains or Starts with from the Condition field, you can oftentimes avoid using .*, so I’ll use a more advanced example of how I use these wonder twins.

We have several clients who use subdomains. By default, Google Analytics only shows the URI (the part of the URL after the domain). The problem with that is it clumps all of the site’s pages into one repository, and you can’t easily see which pages are from which subdomains. So I created the following filter that combines the Hostname (domain) and the Request URI (URI) and replaces the standard URI with the full URL. Here, the .* means the Hostname and URI can use any characters.

[Image: regex asterisk]
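The filter screenshot isn’t reproduced here, so the following is only a rough Python sketch of the idea: grab everything in the hostname with (.*), grab everything in the URI with (.*), and glue the two together. The function name and sample values are assumptions for illustration, not the actual filter settings.

```python
import re

def full_url(hostname: str, request_uri: str) -> str:
    """Join a hostname and a URI, capturing each whole field with (.*)."""
    # (.*) captures every character in the field, whatever it contains.
    host_match = re.fullmatch(r"(.*)", hostname)
    uri_match = re.fullmatch(r"(.*)", request_uri)
    return host_match.group(1) + uri_match.group(1)

# Hypothetical subdomains that share the same URI.
print(full_url("blog.example.com", "/regex-basics/"))  # blog.example.com/regex-basics/
print(full_url("shop.example.com", "/regex-basics/"))  # shop.example.com/regex-basics/
```

With the hostname in front, two pages that share a URI on different subdomains no longer collapse into one row.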


Backslash (\)

This character escapes the character that follows it. In plain English, that simply means it says to treat the following character as a regular ol’ character and NOT a regex character. So if I write out index\.aspx\?query=funky\+boots (shout out to Michelle Robbins), I’m saying treat the ., ?, and + signs as characters and don’t interpret them as regex. (You’ll learn about the ? and + characters soon.)
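To see why the backslashes matter, here’s a quick Python sketch that runs the escaped pattern and an unescaped copy of it against the same URL. It uses Python’s re module, not Google Analytics, and the second test string is invented to show what the unescaped version really asks for.

```python
import re

url = "index.aspx?query=funky+boots"

escaped = r"index\.aspx\?query=funky\+boots"
unescaped = r"index.aspx?query=funky+boots"

# Escaped: the ., ?, and + are treated as literal characters.
escaped_hit = bool(re.search(escaped, url))
print(escaped_hit)  # True

# Unescaped: ? makes the x optional and + repeats the y,
# so the pattern no longer describes the real URL.
unescaped_hit = bool(re.search(unescaped, url))
print(unescaped_hit)  # False

# It matches this odd string instead.
odd_hit = bool(re.search(unescaped, "indexZaspquery=funkyyyboots"))
print(odd_hit)  # True
```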

Minor League

Caret (^)

This simply means your selection has to begin with whatever you put after it. I use this both in segments and goals. Let’s say I want to look at just the landing pages in one directory of my website. I would use something like this:

[Image: regex caret in Google Analytics]

I’m only putting this character in the minor leagues because you could choose Starts with from the Condition drop-down menu when creating a segment. But Google doesn’t offer you that option elsewhere.

Dollar Sign ($)

This regex character means that your string ends at that point. For example, health insurance$ matches cheap health insurance but not health insurance rates. You could also attach a $ to the end of a URL to keep versions of that URL with query strings out of your match, or to the end of a directory to analyze only traffic to your category page and not its subpages.

Now here’s a little mnemonic device I, a non-propeller head, came up with when I first started learning regex, but you have to promise not to laugh.

Promise?

Okay, I thought of how you lead someone with a carrot (I know it’s a different spelling; work with me) by putting it out in front and how at the end of the day it’s all about the money. So the ^ goes in front in a regex and the $ at the end. Go ahead and laugh (promise breaker), but I guarantee you’ll remember next time.
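Here’s the carrot and the money in a short Python sketch. Again, this is Python’s re module standing in for Google Analytics, and the page paths are hypothetical; the keywords come from the example above.

```python
import re

# ^ anchors the match to the start of the string.
landing_pages = ["/blog/regex-basics", "/blog/ga-filters", "/shop/blog/sale", "/about"]
blog_pages = [p for p in landing_pages if re.search(r"^/blog/", p)]
print(blog_pages)  # ['/blog/regex-basics', '/blog/ga-filters']

# $ anchors the match to the end of the string.
keywords = ["cheap health insurance", "health insurance rates", "health insurance"]
ends_with = [k for k in keywords if re.search(r"health insurance$", k)]
print(ends_with)  # ['cheap health insurance', 'health insurance']

# Both anchors together: only the category page itself, no subpages or query strings.
shoe_urls = ["/shoes/", "/shoes/boots", "/shoes/?color=red"]
category_only = [u for u in shoe_urls if re.search(r"^/shoes/$", u)]
print(category_only)  # ['/shoes/']
```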

Question Mark (?)

Technically, this character means 0 or 1 of the character before, but I like to think of it as the previous character being optional. Maybe it’s there, maybe it’s not. Who knows, really? Hence the ?. See how easy this is when you’re not learning from a textbook printed on recycled paper with a monospaced font?

Okay, so let’s say you want to see keywords that include dining room, but some of your searchers passed notes all through third grade and never learned that doubling up the consonant before -ing makes the vowel short. So how do you include these misspellings? You could use the ? this way:


It would return keywords that match dining room and dinning room.
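The screenshot with the exact pattern didn’t survive, so treat the pattern in this Python sketch, dinn?ing room, as an assumption: it’s one pattern that fits the description, with the second n made optional. The keyword list is invented for the demo.

```python
import re

# The ? makes the second "n" optional: "dining" or "dinning".
pattern = re.compile(r"dinn?ing room")

# Hypothetical search keywords.
keywords = ["dining room table", "dinning room sets", "dinnning room", "living room"]

hits = [k for k in keywords if pattern.search(k)]
print(hits)  # ['dining room table', 'dinning room sets']
```

Three n’s in a row don’t match, because ? allows only zero or one of the character before it.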

Parentheses ( )

Parentheses are used to form groups — just like you learned in algebra. I really don’t use these often in creating garden-variety segments or goals. I use these more when I’m creating rewrite filters. Why would I do that? Because I’m in desperate need of a hobby. But besides that, I use them for sites that, for whatever reason, can’t (or won’t) rewrite their nasty dynamic URLs. It’s very difficult to interpret landing page reports that consist of dynamic URLs. So I give them prettier, more intuitive names. (Hmm … Sounds like another post for another day.)

For one client’s site, I wanted to create a bucket for all the URLs that were generated when someone searched for a property on their site. Believe it or not, this was the regex I had to write to create a net big enough to scoop up all of those pages:

(^/index\.html\?pclass.*)|(/index\.html\?action=search.*)|(/index\.php\?cur_page=.*)|(/index\.html\?searchtext.*)|(realty/index\.html\?pclass.*)

You’ve already met every regex character in there; each group in parentheses was a different version of the resulting search listings pages, depending on where you initiated your search. Ugly, huh? I mean, the regex I wrote was beautiful; it was the code that necessitated this regex that should be sent to bed without dinner.
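To get a feel for how a net like that scoops up pages, here’s a Python sketch that runs the pattern (with every literal dot escaped) against a handful of URIs. The URIs and the bucket name "Property Search" are hypothetical, and Python’s re module stands in for the Google Analytics filter.

```python
import re

# One group per version of the search results URL, joined with pipes.
search_results = re.compile(
    r"(^/index\.html\?pclass.*)"
    r"|(/index\.html\?action=search.*)"
    r"|(/index\.php\?cur_page=.*)"
    r"|(/index\.html\?searchtext.*)"
    r"|(realty/index\.html\?pclass.*)"
)

# Hypothetical URIs; only the first four are search results pages.
uris = [
    "/index.html?pclass=condo",
    "/index.html?action=search&city=austin",
    "/index.php?cur_page=2",
    "/realty/index.html?pclass=land",
    "/index.html",
    "/contact.html",
]

def bucket(uri: str) -> str:
    """Return the bucket name for search pages, or the URI unchanged."""
    return "Property Search" if search_results.search(uri) else uri

labels = [bucket(u) for u in uris]
print(labels)
```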

Another example: Sep(tember)? would match Sep or September. Or if you wanna get all crazy with it, (S|s)ep(tember)? would match sep, Sep, September, and september. But now I’m just showing off. Sorry.
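Here’s that month pattern as a quick Python sketch. It uses fullmatch so the whole string has to match; keep in mind that a partial-match field would also accept anything that merely contains Sep, such as Sept. The candidate strings are made up.

```python
import re

# (S|s) picks either capitalization; (tember)? makes the whole group optional.
month = re.compile(r"(S|s)ep(tember)?")

candidates = ["Sep", "sep", "September", "september", "Sept", "Oct"]

# fullmatch succeeds only if the pattern covers the entire string.
exact = [c for c in candidates if month.fullmatch(c)]
print(exact)  # ['Sep', 'sep', 'September', 'september']
```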

City League

Square Brackets ([ ])

This means match any one of the characters between the brackets. So, c[aou]p would match cap, cop, and cup. But you can only pick one; that’s the key to the brackets. You can throw in a dash to indicate a range of characters to choose from. For example, [0-5] would mean you could pick any one digit between 0 and 5. I have used these when filtering out IP addresses for larger companies that have a span of IPs. So the IP might look something like this: