Regular Expressions: Don’t Use Google Analytics Without Them


Note: I rescued this post from the now-defunct blueglass.com site and updated it with current screenshots.

I have to admit: I’m a recovering regexaphobe. When I was new to analytics, I remember someone sending me a snippet of regular expressions (AKA regex) to solve a goal setup conniption I was having. It looked like a foreign language to me. I was fascinated by it but repelled at the same time.

Sadly, my fear of regex prevented me from doing more powerful analysis. I tried everything to avoid it and would copy and paste code from articles I had saved whenever I had to create a custom filter. But eventually I hit a wall I couldn’t scale unless I conquered this beast, so I set out on a quest to learn it. I resolved to learn only enough regex to help me as an analyst. No propeller for me, thankeww.

As nerdy as regex is, I’m writing this post because if you don’t know the basics, you too will be limited in your ability to create segments, goals, and filters in Google Analytics, or whatever web-based analytics platform you’re using. (Even Omniture’s SiteCatalyst finally started supporting regex just this year. Welcome to the 21st century, guys!)

So I’m going to hit on the main regex characters you’ll need, without the use of geek speak. I will even subject myself to public scorn from my awesome programmer friends by sharing the goofy mnemonic devices I used early on to remember a few of them I just couldn’t seem to get down.

For ease of scanning, I’m also breaking my regex characters up into leagues to signify which ones I use most, occasionally, and seldom to never.

Major League

Pipe (|)

The pipe character (|) is the regex equivalent of or. So if you want to find out how many conversions you received from Google, Bing, or Yahoo, you could set up a segment that looks like this:

pipe character in regex in Google Analytics


Tip:
 Remember to change the Condition field to Matching RegExp if you use regex to create a segment.
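If you'd like to see the pipe at work outside the Google Analytics interface, here is a minimal Python sketch. It uses Python's re module, which is not the engine Google Analytics runs but should treat this simple pattern the same way. The source names in the list are made up for illustration.

```python
import re

# Hypothetical traffic sources, invented for this example.
sources = ["google", "bing", "yahoo", "facebook.com", "(direct)"]

# The pipe means "or": match google OR bing OR yahoo.
search_engines = re.compile(r"google|bing|yahoo")

# re.search looks for the pattern anywhere in the string.
matched = [s for s in sources if search_engines.search(s)]
print(matched)  # ['google', 'bing', 'yahoo']
```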

Another example of when I use the | character is when I’m creating a goal, and a step in the goal funnel or conversion can include more than one page:

pipe character in regex in Google Analytics

Dot (.)

The . is a wildcard character. It means match any one character. It can be a number, letter, or special character (even a white space). By itself, it’s not that amazing, but with the help of the next playa, the asterisk (*), it’s all kindsa bad to the bone.

Asterisk (*)

This is the MVP of all regex characters, in my opinion. It says to match zero or more of the character before it. In other words, it looks at the character before it (most often the . character) and says that character may be missing entirely or may repeat any number of times.
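Here is a small Python sketch of the dot, the asterisk, and the two of them together. The test strings are invented, and I'm using re.fullmatch so the whole string has to fit the pattern, which makes it easier to see what each character allows.

```python
import re

# A lone dot stands in for exactly one character of any kind.
dot_letter = bool(re.fullmatch(r"b.t", "bat"))   # True
dot_space = bool(re.fullmatch(r"b.t", "b t"))    # True (a space counts)
dot_two = bool(re.fullmatch(r"b.t", "boot"))     # False (two characters, not one)

# The asterisk means zero or more of whatever comes right before it.
star_zero = bool(re.fullmatch(r"go*gle", "ggle"))      # True (zero o's)
star_many = bool(re.fullmatch(r"go*gle", "gooogle"))   # True (three o's)

# Together, .* means "any run of characters, including none at all".
wild_empty = bool(re.fullmatch(r"/blog/.*", "/blog/"))          # True
wild_full = bool(re.fullmatch(r"/blog/.*", "/blog/regex-101"))  # True

print(dot_letter, dot_space, dot_two)
print(star_zero, star_many)
print(wild_empty, wild_full)
```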

To be honest, the Advanced Segments area was made so that you could easily go without ever using regex to create segments. It may take you longer — like if you use the Or operator to include all of the different sites that you want to include in your social media segment — but you can get away with it. Between and/or operators and the ability to choose options like Contains or Starts with from the Condition field, you can oftentimes avoid using .*, so I’ll use a more advanced example of how I use these wonder twins.

We have several clients who use subdomains. By default, Google Analytics only shows the URI (the part of the URL after the domain). The problem with that is it clumps all of the site’s pages into one repository, and you can’t easily see which pages are from which subdomains. So I created the following filter that combines the Hostname (domain) and the Request URI (URI) and replaces the standard URI with the full URL. Here, the .* means the Hostname and URI can use any characters.

regex asterisk
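The idea behind that filter can be sketched in Python. This is only an illustration of the capture-and-combine logic, not the filter's actual settings, and the hostname and URI values are hypothetical.

```python
import re

# Hypothetical hit: these values are made up for illustration.
hostname = "blog.example.com"
request_uri = "/regex-basics/"

# (.*) captures the whole field, whatever characters it contains.
field_a = re.match(r"(.*)", hostname).group(1)
field_b = re.match(r"(.*)", request_uri).group(1)

# Glue the two captures together so reports show the full URL.
full_url = field_a + field_b
print(full_url)  # blog.example.com/regex-basics/
```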


Backslash (\)

This character escapes the character that follows it. In plain English, that simply means: treat the following character as a regular ol’ character and NOT a regex character. So if I write out index\.aspx\?query=funky\+boots (shout out to Michelle Robbins), I’m saying treat the ., ?, and + signs as characters and don’t interpret them as regex. (You’ll learn about the ? and + characters soon.)
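To see why escaping matters, here is a quick Python sketch that compares the escaped pattern with an unescaped one. The second test string is nonsense I invented to show what the unescaped version lets through.

```python
import re

url = "index.aspx?query=funky+boots"

escaped = r"index\.aspx\?query=funky\+boots"
unescaped = r"index.aspx?query=funky+boots"

# Escaped: the ., ?, and + are treated as plain characters.
escaped_hits_url = bool(re.search(escaped, url))      # True

# Unescaped: ? makes the x optional and + repeats the y,
# so the pattern no longer matches the literal URL at all...
unescaped_hits_url = bool(re.search(unescaped, url))  # False

# ...but it happily matches this invented string instead.
unescaped_hits_junk = bool(
    re.search(unescaped, "indexZaspquery=funkyyboots")
)  # True

print(escaped_hits_url, unescaped_hits_url, unescaped_hits_junk)
```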

Minor League

Caret (^)

This simply means your selection has to begin with whatever you put after it. I use this both in segments and goals. Let’s say I want to look at just the landing pages in one directory of my website.  I would use something like this:

regex carat in Google Analytics

I’m only putting this character in the minor leagues because you could choose Starts with from the Condition drop-down menu when creating a segment. But Google doesn’t offer you that option elsewhere.
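A short Python sketch of the caret, using hypothetical landing page paths. Notice how the pattern without the caret also picks up a page that merely contains the directory name further along.

```python
import re

# Hypothetical landing pages, invented for this example.
landing_pages = ["/blog/regex-basics", "/about", "/shop/blog/sale"]

# With the caret, the path has to START with /blog/.
in_blog = [p for p in landing_pages if re.search(r"^/blog/", p)]
print(in_blog)  # ['/blog/regex-basics']

# Without it, /blog/ can appear anywhere in the path.
anywhere = [p for p in landing_pages if re.search(r"/blog/", p)]
print(anywhere)  # ['/blog/regex-basics', '/shop/blog/sale']
```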

Dollar Sign ($)

This regex character means that your string ends at that point. For example, health insurance$ matches cheap health insurance but not health insurance rates. You could also attach a $ to the end of a URL to keep versions of that URL with query strings out of your match, or to the end of a directory to analyze only traffic to your category page and not its subpages.
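Here is the dollar sign in a Python sketch. The keywords come from the example above, and the page paths are hypothetical.

```python
import re

keywords = ["cheap health insurance", "health insurance rates"]

# The $ says the string has to END right after "insurance".
ends_with = [k for k in keywords if re.search(r"health insurance$", k)]
print(ends_with)  # ['cheap health insurance']

# Hypothetical pages: a category page, the same page with a
# query string, and a subpage.
pages = ["/shoes/", "/shoes/?color=red", "/shoes/boots/"]

# ^ and $ together pin down the category page and nothing else.
category_only = [p for p in pages if re.search(r"^/shoes/$", p)]
print(category_only)  # ['/shoes/']
```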

Now here’s a little mnemonic device I, a non-propeller head, came up with when I first started learning regex, but you have to promise not to laugh.

Promise?

Okay, I thought of how you lead someone with a carrot (I know it’s a different spelling — work with me) by putting it out in front and how at the end of the day it’s all about the money.  So the ^ goes in front in a regex expression and the $ at the end. Go ahead and laugh (promise breaker), but I guarantee you’ll remember next time.

Question Mark (?)

Technically, this character means 0 or 1 of the character before, but I like to think of it as the previous character being optional. Maybe it’s there, maybe it’s not — who knows, really? Hence the ?. See how easy this is when you’re not learning from a text book printed on recycled paper with a monospaced font?

Okay, so let’s say you want to see keywords that include dining room, but some of your searchers passed notes all through third grade and never learned that doubling up the consonants before –ing makes the vowel short. So how do you include these misspellings? You could use the ? this way:

regex2

It would return keywords that match dining room and dinning room.
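In Python, that idea might look like the sketch below. The pattern dinn?ing room is my assumption about what goes in the filter (one n that is required, one that is optional), and the keywords are made up.

```python
import re

# Hypothetical keywords, invented for this example.
keywords = ["dining room table", "dinning room chairs", "living room rug"]

# The ? makes the second n optional: "dining" or "dinning".
pattern = re.compile(r"dinn?ing room")

matched = [k for k in keywords if pattern.search(k)]
print(matched)  # ['dining room table', 'dinning room chairs']
```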

Parentheses ( )

Parentheses are used to form groups — just like you learned in algebra. I really don’t use these often in creating garden-variety segments or goals. I use these more when I’m creating rewrite filters. Why would I do that? Because I’m in desperate need of a hobby. But besides that, I use them for sites that, for whatever reason, can’t (or won’t) rewrite their nasty dynamic URLs. It’s very difficult to interpret landing page reports that consist of dynamic URLs. So I give them prettier, more intuitive names. (Hmm … Sounds like another post for another day.)

For one client’s site, I wanted to create a bucket for all the URLs that were generated when someone searched for a property on their site. Believe it or not, this was the regex I had to write to create a net big enough to scoop up all of those pages:

(^/index\.html\?pclass.*)|(/index\.html\?action=search.*)|(/index\.php\?cur_page=.*)|(/index\.html\?searchtext.*)|(realty/index\.html\?pclass.*)

We’ll get to what all of these regex characters mean, but each group in parentheses was a different version of the resulting search listings pages, depending on where you initiated your search. Ugly, huh? I mean, the regex I wrote was beautiful; it was the code that necessitated this regex that should be sent to bed without dinner.

Another example: Sep(tember)? would match Sep or September. Or if you wanna get all crazy with it, (S|s)ep(tember)? would match sep, Sep, September, and september. But now I’m just showing off. Sorry.
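The grouped month pattern is easy to try in Python. In this sketch I use fullmatch so the whole string has to fit. That is stricter than a filter that accepts partial matches, where a value like Sept would still get through because it contains Sep.

```python
import re

# (S|s) picks either capitalization; (tember)? makes the
# rest of the month name optional as a single unit.
month = re.compile(r"(S|s)ep(tember)?")

candidates = ["Sep", "sep", "September", "september", "Sept"]
results = {text: bool(month.fullmatch(text)) for text in candidates}

for text, ok in results.items():
    print(text, ok)
# Sep True
# sep True
# September True
# september True
# Sept False
```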

City League

Square Brackets ([ ])

This means match any one of the characters between the brackets. So, c[aou]p would match cap, cop, and cup. But you can only pick one; that’s the key to the brackets. You can throw in a dash to indicate a range of characters to choose from. For example, [0-5] would mean you could pick any one digit between 0 and 5. I have used these when filtering out IP addresses for larger companies that have a span of IPs. So the IP might look something like this:

regex3
This would cover a range of IPs where the last octet spans from 130 to 138.
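Both bracket examples can be sketched in Python. The IP prefix 203.0.113. is a placeholder from a range reserved for documentation, not any real company's address, and the exact pattern is my assumption about how such a filter could be written.

```python
import re

# Pick exactly one of a, o, or u between the c and the p.
words = ["cap", "cop", "cup", "cep", "coup"]
one_vowel = [w for w in words if re.fullmatch(r"c[aou]p", w)]
print(one_vowel)  # ['cap', 'cop', 'cup']

# Last octet 130 through 138: "13" followed by one digit from 0-8.
# The dots are escaped so they only match literal dots.
office_ips = re.compile(r"^203\.0\.113\.13[0-8]$")

ips = ["203.0.113.129", "203.0.113.130", "203.0.113.138", "203.0.113.139"]
internal = [ip for ip in ips if office_ips.search(ip)]
print(internal)  # ['203.0.113.130', '203.0.113.138']
```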

Plus Sign (+)

To be honest, I never use this character. Actually, I think I used it once just to get the t-shirt. But it means one or more of the previous character. So it’s a lot like the asterisk, except it requires that at least one character matches. It’s a diva.
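A quick Python sketch shows the difference between the asterisk and the plus sign. The page paths are hypothetical.

```python
import re

# Hypothetical pages: a directory page and two pages beneath it.
pages = ["/showrooms/", "/showrooms/IL", "/showrooms/60613"]

# .* allows zero characters after the slash, so the bare
# directory page matches too.
with_star = [p for p in pages if re.search(r"/showrooms/.*", p)]
print(with_star)  # ['/showrooms/', '/showrooms/IL', '/showrooms/60613']

# .+ insists on at least one character after the slash.
with_plus = [p for p in pages if re.search(r"/showrooms/.+", p)]
print(with_plus)  # ['/showrooms/IL', '/showrooms/60613']
```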

Curly Braces ({ })

Again, I rarely use these in Google Analytics — usually only with really tricky URL rewrites. But curly braces indicate how many times you may want a character repeated. For simplicity’s sake, I’ll explain how to use it with an example that you probably wouldn’t use in your analytics but would make more sense. (Life is all about compromises.) Let’s say you want to indicate a number that is a US-based five-digit zip code. You would write it as [0-9]{5} because there are five digits in a US zip code.

You could also express a range with curly braces by using the convention {minimum,maximum}, with no space after the comma. For example, let’s say you have a list of product IDs that start with three lower case letters followed by a hyphen and then three-to-five digits. You could indicate them this way:

[a-z]{3}-[0-9]{3,5}
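Here are both curly brace patterns in a Python sketch. The sample values are invented, and fullmatch is used so that extra digits or letters cause a miss instead of being quietly ignored.

```python
import re

zip_code = re.compile(r"[0-9]{5}")
product_id = re.compile(r"[a-z]{3}-[0-9]{3,5}")

# Exactly five digits, no more and no fewer.
zip_results = {
    z: bool(zip_code.fullmatch(z)) for z in ["90210", "9021", "902101"]
}
print(zip_results)  # {'90210': True, '9021': False, '902101': False}

# Three lowercase letters, a hyphen, then three to five digits.
id_samples = ["abc-123", "abc-12345", "abc-12", "abc-123456", "ABC-123"]
id_results = {s: bool(product_id.fullmatch(s)) for s in id_samples}
print(id_results)
# {'abc-123': True, 'abc-12345': True, 'abc-12': False,
#  'abc-123456': False, 'ABC-123': False}
```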

Testing Your Regex

The best part of Google Analytics is every report comes with a line-item filter. And it has a regex option in it. I tried several different regex testers before discovering this is the best regex testing ground when creating regex specific to Google Analytics.

regex4
Oops. I meant to highlight “Matching RegExp”

How would you leverage it? Just go to the report that contains the items you’re writing the regex for: the Keyword report if you’re trying to concatenate keywords, Traffic Sources if you’re trying to identify specific sources, etc.

So if I’m writing regex to capture a group of pages to concatenate in a segment to analyze, I’ll go to the Top Content report and paste my regex into the filter. If all of my pages are present and accounted for, I’m golden. It’s a real time saver.
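If you want a rough dry run before you open a report, the same check can be imitated in a few lines of Python. This is only a sketch: filter_rows is a helper name I made up, the page paths are hypothetical, and Python's re module is not the engine Google Analytics uses, so the report filter is still the real test.

```python
import re


def filter_rows(pattern, rows):
    """Keep the rows that contain a match for the pattern."""
    regex = re.compile(pattern)
    return [row for row in rows if regex.search(row)]


# Hypothetical page paths, invented for this example.
pages = ["/products/boots", "/products/sandals", "/blog/regex", "/contact"]

kept = filter_rows(r"^/products/(boots|sandals)$", pages)
print(kept)  # ['/products/boots', '/products/sandals']
```

If a page you expected is missing from the result, tweak the pattern and run it again, just as you would in the report filter.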

Caveat About Regex In Excel

A common frustration I had for a long time was that I couldn’t use regex in Excel. I could in Word but not Excel. Go figure. You can use a plugin like the SeoTools plugin, do all your regex in Google Docs and bring it back into Excel, or (my personal fave) use advanced filters in Excel. They actually give you more options than regex and are easier to master.

Learn More

If you want to learn more about using regex, I cut my teeth on LunaMetrics’ Regular Expressions for Google Analytics guide (PDF). And Robbin Steif personally answered questions I had about the quiz at the end. That was impressive.

Anything you want to learn more about with Google Analytics? Let me know here.

Image by blyzz.

Comments

  1. Peter Kirwan says

    Great post. Particularly love the memory aids. Question: under Testing Your Regex did you mean to draw a red rectangle around “containing” or should that rectangle be around “Matching RegExp”?

  2. says

    This is great! So if I have a URL that’s example.com/project/projectname/ would I use /project/*/ in my Google Analytics destination goals?

    • says

      It depends what you’re trying to capture. If there’s only one destination URL, you would just use the URL. If there’s more than one for a goal (like any URL in the /project/ subdirectory), then you might use something like [[ /project/.* ]]. (No brackets.)

  3. Anthony says

    Hi Annie, fantastic post as always. I was hoping you could advise me with setting up a Funnel in GA. I wish to track these pages mywebsite.com/checkout/step1?fiid=077bec1912d8b31e&as=Christmas#special-1 through to a page with step5 in the url. Will this goal url work to track it. /checkout/step1|as=Christmas#special-1|

    • says

      Disqus cut off the first URL and takes me to a redirected page when I try to click on it. Perhaps you could type it out in a Google doc and then link to it from here or email me at annie((at))annielytics.com.

  4. Anita Marie Shelburn Smallwood says

    Using regex for shopping cart,

    I need only this category to be discounted, SS16-50-Gross

    So I tried this code:

    .*SS16-50-Gross.* That code produced no results whatsoever.

    I need to exclude any code that points to quantities of 200. All of the products that have 200 have this in the description, “Item #: VT-Acc:SS16B-VTI-01,02,03,04″, so I tried as the 16B delineates the quantity 200.

    .*SS16.*[^SS16B]
    and
    .*SS16.*[^SS16B.*]

    neither took out the 16s in the 200 category.

    Can you point me in a better direction? Did I miss something
    obvious?

    • says

      Hi Anita,

      I’m sorry. I couldn’t follow you. And since I don’t have data that aligns with your data, I can’t test out the regex in the line item filter in Google Analytics, which is my favorite way to test. So what I would do is open up the Pages report and make sure the page you want to filter comes up. Then tweak your regex until you get exactly what you want.

    • says

      If you don’t want to include the /showrooms page in the regex, you wouldn’t use .* after /showrooms/; you’d use .+. The plus sign says there has to be at least one character after the string you enter.

      The best way to whittle your results is to test your regex using the report filter. Just pull up your All Pages report (under Behavior > Site Content), then enter /showrooms/.+ into the filter. Screenshot: http://screencast.com/t/fY9JAifv0A3 .

      You could even use regex to segment those who enter zip codes vs those who enter a state. How you’d set it up would depend on how the data comes through from your form, but it’s quite easy using the techniques in the post.

      Hope this helps.

  5. Mike Farney says

    Hi Annie,

    I would like to combine the sources Yahoo and r.search.yahoo.com so I can compare YOY traffic source data. Seems like the easiest thing to do is an advanced segment but I am wondering if it would be better if I used a search and replace filter and if I did would I need to use regex? I am open to other reporting suggestions too.

    Thanks,
    Mike

  6. Jeremy Capp says

    Hello Annie,

    I thought I had my regexp set up properly, but I don’t believe it is. Could you help?

    I have a base URL: …/HowToBuy/Showrooms/

    However, what I want to track is the (not quite) unlimited number of URLs that could follow it. Like this:

    …/HowToBuy/Showrooms/IL
    …/HowToBuy/Showrooms/NY
    …/HowToBuy/Showrooms/60613
    …/HowToBuy/Showrooms/90210

    etc etc

    I have [Showrooms/].+ in there now, but it’s not working properly. Can you please help? This is driving me batty :)

    Thanks!!
    Jeremy

    • Jeremy Capp says

      Actually, i think i realized what’s going on. the brackets are messing it up. If I remove the brackets, I believe it will work. :)

      Thanks,
      Jeremy

      • says

        Yep, those brackets are unnecessary. But the best way to test it is to pull up your All Pages report and use the report filter. Play with it until you get what you want. But, yeah, .+ is the winning combo for you here.
