Consider decoding URLs if you’re just storing them for attribution

Level 10 - Community Moderator

We waste a lot of space and bandwidth treating URL-like strings* as if they’re too precious to decode. I’m thinking it’s time for a new approach!


URLs are ASCII-only

Remember (hoping you do!) that URLs can only contain visible ASCII characters — capital A-Z, lowercase a-z, numeric 0-9 and a handful of symbols. (And some of those characters need to be %-encoded, depending on how you use them.)


More important, all characters outside the ASCII range must be %-encoded.


So a common letter like accented é (%C3%A9), everyday symbols like © and ™ (%C2%A9 and %E2%84%A2 respectively), and every non-Latin character can’t appear in human-readable form.
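As a quick check, the built-in encodeURIComponent produces exactly those sequences. This is only an illustrative sketch; the variable names are made up for the example.

```javascript
// Percent-encode a few non-ASCII characters (written as escapes to avoid
// any ambiguity about which code point is meant).
const samples = ['\u00E9', '\u00A9', '\u2122']; // é, ©, ™
const encodedSamples = samples.map((ch) => encodeURIComponent(ch));

console.log(encodedSamples); // [ '%C3%A9', '%C2%A9', '%E2%84%A2' ]
```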


Percent-encoding is wildly wasteful (but usually unavoidable)

While it’s great that there’s an established way to include any character, percent-encoding takes up a crazy amount of space.


Take the trademark ™ above. In its shortest encoded form**, ™ is packed into just 2 bytes, but leaving that aside, in more common UTF-8, it takes 3 bytes. That is, it would only take 3 bytes to send that symbol over the internet... anyplace other than in a URL.


But in the percent-encoded form mandated by URLs, it takes 9 bytes! The percent-encoded sequence %E2%84%A2 is a simple ASCII string. Each character takes only one byte. But there are nine of ’em, creating 200% overhead on the wire.
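To make the arithmetic concrete, here’s a small sketch that counts the bytes in each representation using the standard TextEncoder (which always emits UTF-8). The variable names are illustrative only.

```javascript
// Compare the size of the trademark symbol in different representations.
const tm = '\u2122'; // ™
const utf8 = new TextEncoder();

const utf16Bytes = tm.length * 2;                 // one UTF-16 code unit = 2 bytes
const utf8Bytes = utf8.encode(tm).length;         // raw UTF-8
const percentEncoded = encodeURIComponent(tm);    // '%E2%84%A2'
const encodedBytes = utf8.encode(percentEncoded).length;

// Overhead of the percent-encoded form relative to raw UTF-8
const overheadPct = ((encodedBytes - utf8Bytes) / utf8Bytes) * 100;

console.log({ utf16Bytes, utf8Bytes, encodedBytes, overheadPct });
// { utf16Bytes: 2, utf8Bytes: 3, encodedBytes: 9, overheadPct: 200 }
```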


It’s not just about consuming bandwidth, either. When you store full URLs in databases, they’re taking up that much more space, permanently.


When is a URL no longer a URL?

The Big Question: must we always treat a value that happened to be sent by a browser at some point as if it still needs to be a valid URL? Even if we’re never putting it back on the wire? Can it not revert to a URL-like string?


I was thinking about this code snippet a lot of people use (me included) to add more context to a Marketo form post:

    lastMarketoFormURL : document.location.href,
    lastMarketoFormReferrerURL : document.referrer

This code straightforwardly adds the current page (the page with the form) and the previous page (referrer, as available) to the form payload. It keeps the original percent-encoded URLs.


So say someone browsed your upcoming events and clicked on an upcoming speech by Ai Weiwei. The URL they clicked looked like this:

<a href="">艾未未 (Ai Weiwei)</a>


The browser sent that exact href to the server, but it displayed the friendlier Chinese characters in the Location bar:



What you’d see in Marketo, with the above Forms 2.0 code, is:



But is that most appropriate? Wouldn’t it be at least as informative (and much more informative for a reader of Chinese) to see this:



And if you’re doing a Contains match in a Smart List, wouldn’t it make more sense to paste the Chinese characters? (Note the percent-encoded form doesn’t match the graphical form, nor vice versa. They’re different strings.)
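The “different strings” point is easy to demonstrate with a plain substring test. The URL below is a hypothetical stand-in (example.com and the speaker parameter are assumptions, not the actual event page), and includes is just a rough analogue of a Contains match.

```javascript
// Hypothetical URL, as it might appear in the Location bar
const readableUrl = 'https://www.example.com/events?speaker=艾未未';

// The percent-encoded form, as sent on the wire and stored by default
const encodedUrl = encodeURI(readableUrl);

// Searching the stored value for the Chinese characters
const matchesEncoded = encodedUrl.includes('艾未未');            // false
const matchesDecoded = decodeURI(encodedUrl).includes('艾未未'); // true

console.log({ matchesEncoded, matchesDecoded });
```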


Obviously, I’m thinking Yes: unless you have a compelling reason to the contrary, if you’re only storing a thing that once was a URL, you should be decoding it first. It saves space, is better for performance, and is more readable. (Here, the encoded value is 68 bytes long, decoded only 50 bytes — a 26% savings.)
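If you want to measure the savings for your own URLs, a helper along these lines would do it. It’s a sketch under assumptions: byteSavings is a made-up name, the sample URL is hypothetical (so its totals differ from the figures above), and sizes are counted as UTF-8 bytes.

```javascript
// Report how many UTF-8 bytes decoding saves for a URL-like string.
function byteSavings(encodedUrl) {
  const utf8 = new TextEncoder();
  const before = utf8.encode(encodedUrl).length;
  const after = utf8.encode(decodeURI(encodedUrl)).length;
  const saved = before - after;
  return { before, after, saved, pct: Math.round((saved / before) * 100) };
}

// Three Chinese characters: 27 bytes percent-encoded vs. 9 bytes in UTF-8
const savings = byteSavings(
  'https://www.example.com/events?speaker=%E8%89%BE%E6%9C%AA%E6%9C%AA'
);
console.log(savings.saved); // 18
```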


Decoding URLs in Forms 2.0 JS

JavaScript has a built-in method decodeURI that’s perfect for this:

    lastMarketoFormURL : decodeURI(document.location.href),
    lastMarketoFormReferrerURL : decodeURI(document.referrer)
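One caveat worth knowing: decodeURI throws a URIError if the string contains a malformed escape sequence (a stray % or bytes that aren’t valid UTF-8), which can happen with referrers you don’t control. A defensive wrapper might look like this; safeDecodeURI is a hypothetical helper name, not part of Forms 2.0.

```javascript
// Decode if possible; otherwise keep the original string rather than
// letting a URIError break the form submission.
function safeDecodeURI(value) {
  try {
    return decodeURI(value);
  } catch (err) {
    return value; // malformed percent-escape: store as-is
  }
}

// Valid UTF-8 escape: decoded
const decodedOk = safeDecodeURI('https://www.example.com/caf%C3%A9');
// Invalid UTF-8 escape (%E9 alone): returned unchanged
const leftAlone = safeDecodeURI('https://www.example.com/caf%E9');
```

You’d then pass safeDecodeURI(document.location.href) and safeDecodeURI(document.referrer) instead of calling decodeURI directly.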


Interestingly, in too many years to count, I’ve never had occasion to use decodeURI before!*** (It’s not the same as decodeURIComponent, which I use constantly.)
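The difference between the two is easiest to see side by side. In this sketch (the query string is a made-up example), decodeURI leaves the escaped ampersand alone while decodeURIComponent decodes everything, which changes how the string would be read as params:

```javascript
const rawQuery = 'speaker=%E8%89%BE%E6%9C%AA%E6%9C%AA&event=Q%26A';

// decodeURI: non-ASCII decoded, reserved characters like %26 (&) left encoded
const viaDecodeURI = decodeURI(rawQuery);
// 'speaker=艾未未&event=Q%26A'

// decodeURIComponent: everything decoded, so "Q&A" now looks like a param separator
const viaDecodeURIComponent = decodeURIComponent(rawQuery);
// 'speaker=艾未未&event=Q&A'

console.log(viaDecodeURI);
console.log(viaDecodeURIComponent);
```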


A decoded URL is still a valid IRI

Internationalized Resource Identifier (IRI) is a standardized format that essentially means “URL/URI but with international characters left intact instead of encoded.”


So these are both valid IRIs: 艾未未 and 艾未未&event=Q%26A

Note the second one still has a percent-encoded reserved character, but the international Chinese characters are left intact. You could also choose to encode the Chinese characters and still have a valid IRI. The key is it’s more permissive than URI/URL syntax.
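If you ever do need the wire form again, the standard URL constructor will re-encode the international characters while leaving existing escapes untouched. A minimal sketch, assuming a hypothetical example.com address in place of the real event URL:

```javascript
// A decoded, IRI-style value as it might be stored
const storedIri = 'https://www.example.com/events?speaker=艾未未&event=Q%26A';

// The URL parser percent-encodes the non-ASCII characters as UTF-8
// and keeps the existing %26 as-is
const wireUrl = new URL(storedIri).href;

console.log(wireUrl);
// https://www.example.com/events?speaker=%E8%89%BE%E6%9C%AA%E6%9C%AA&event=Q%26A
```

Note that encodeURI isn’t a good fit for this step, since it would re-encode the % in %26 as %25.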


There are a variety of reasons that IRIs can’t replace URLs in the world at large, but they do exist. So we’re not going too far afield.



* I’m deliberately using “URL-like” and not “URL” because there’s kind of an epistemological question involved: “What is a URL?” Or perhaps “When is a URL?”

** That is, UTF-16.

*** decodeURI ignores reserved ASCII characters, which makes it the wrong choice when you’re trying to decode params and values. But it works here, as our focus is on the non-ASCII characters.