We waste a lot of space and bandwidth treating URL-like strings* as if they’re too precious to decode. I’m thinking it’s time for a new approach!
Remember (hoping you do!) that URLs can only contain visible ASCII characters — capital A-Z, lowercase a-z, numeric 0-9 and a handful of symbols. (And some of those characters need to be %-encoded, depending on how you use them.)
More important, all characters outside the ASCII range must be %-encoded.
So a common letter like accented é (%C3%A9), everyday symbols like © and ™ (%C2%A9 and %E2%84%A2 respectively), and every non-Latin character can’t appear in human-readable form.
While it’s great that there’s an established way to include any character, percent-encoding takes up a crazy amount of space.
Take the trademark ™ above. In its shortest encoded form**, ™ is packed into just 2 bytes, but leaving that aside, in the more common UTF-8, it takes 3 bytes. That is, it would only take 3 bytes to send that symbol over the internet... anyplace other than in a URL.
But in the percent-encoded form mandated by URLs, it takes 9 bytes! The percent-encoded sequence %E2%84%A2 is a simple ASCII string. Each character takes only one byte. But there are nine of ’em, creating 200% overhead on the wire.
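If you want to check those numbers yourself, here is a minimal sketch that counts bytes in each representation. It assumes an environment with the standard TextEncoder global (modern browsers and Node.js); the variable names are just for illustration.

```javascript
// Count the bytes needed for the trademark symbol in each representation.
const tmSymbol = "\u2122"; // the ™ character

// UTF-16: JavaScript strings are sequences of 16-bit code units.
const tmUtf16Bytes = tmSymbol.length * 2;

// UTF-8: TextEncoder always encodes to UTF-8.
const tmUtf8Bytes = new TextEncoder().encode(tmSymbol).length;

// Percent-encoded: each UTF-8 byte becomes "%XX", i.e. three ASCII characters.
const tmPercentEncoded = encodeURIComponent(tmSymbol);
const tmPercentBytes = new TextEncoder().encode(tmPercentEncoded).length;

// Extra bytes on the wire relative to raw UTF-8, as a percentage.
const tmOverheadPct = ((tmPercentBytes - tmUtf8Bytes) / tmUtf8Bytes) * 100;

console.log(tmUtf16Bytes, tmUtf8Bytes, tmPercentEncoded, tmPercentBytes, tmOverheadPct);
```

Every non-ASCII character pays the same penalty: three ASCII characters for each UTF-8 byte.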
It’s not just about consuming bandwidth, either. When you store full URLs in databases, they’re taking up that much more space, permanently.
The Big Question: must we always treat a value that happened to be sent by a browser at some point as if it still needs to be a valid URL? Even if we’re never putting it back on the wire? Can it not revert to a URL-like string?
I was thinking about this code snippet a lot of people use (me included) to add more context to a Marketo form post:
MktoForms2.whenReady(function(mktoForm){
  mktoForm.addHiddenFields({
    lastMarketoFormURL : document.location.href,
    lastMarketoFormReferrerURL : document.referrer
  });
});
This code straightforwardly adds the current page (the page with the form) and the previous page (referrer, as available) to the form payload. It keeps the original percent-encoded URLs.
So say someone browsed your upcoming events and clicked on an upcoming speech by Ai Weiwei. The URL they clicked looked like this:
<a href="https://eventcatalog.example.com/?artist=%E8%89%BE%E6%9C%AA%E6%9C%AA">艾未未 (Ai Weiwei)</a>
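The browser sent that exact href to the server, but it displayed the friendlier Chinese characters in the Location bar:
https://eventcatalog.example.com/?artist=艾未未
What you’d see in Marketo, with the above Forms 2.0 code, is:
https://eventcatalog.example.com/?artist=%E8%89%BE%E6%9C%AA%E6%9C%AA
But is that most appropriate? Wouldn’t it be at least as informative (and much more informative for a reader of Chinese) to see this:
https://eventcatalog.example.com/?artist=艾未未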
And if you’re doing a Contains match in a Smart List, wouldn’t it make more sense to paste the Chinese characters? (Note the percent-encoded form doesn’t match the graphical form, nor vice versa. They’re different strings.)
Obviously, I’m thinking Yes: unless you have a compelling reason to the contrary, if you’re only storing a thing that once was a URL, you should be decoding it first. It saves space, is better for performance, and is more readable. (Here, the encoded value is 68 bytes long, decoded only 50 bytes — a 26% savings.)
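Here is a rough sketch of the size and matching claims, using the example URL from above. It assumes UTF-8 storage and uses String.prototype.includes as a stand-in for a Contains match; how Marketo actually stores and compares strings internally is not something this snippet can show.

```javascript
// The URL exactly as the browser sends it.
const urlEncoded = "https://eventcatalog.example.com/?artist=%E8%89%BE%E6%9C%AA%E6%9C%AA";
// The same URL with the non-ASCII characters restored.
const urlDecoded = decodeURI(urlEncoded);

// Byte length when the string is stored as UTF-8.
const utf8Length = (str) => new TextEncoder().encode(str).length;

const urlEncodedBytes = utf8Length(urlEncoded);
const urlDecodedBytes = utf8Length(urlDecoded);
const urlSavingsPct = Math.round((1 - urlDecodedBytes / urlEncodedBytes) * 100);

// A plain substring check, standing in for a Contains filter.
const searchTerm = "艾未未";
const matchesEncoded = urlEncoded.includes(searchTerm);
const matchesDecoded = urlDecoded.includes(searchTerm);

console.log(urlEncodedBytes, urlDecodedBytes, urlSavingsPct, matchesEncoded, matchesDecoded);
```

Only the decoded value matches the pasted Chinese characters, and it is the smaller of the two.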
JavaScript has a built-in method decodeURI that’s perfect for this:
MktoForms2.whenReady(function(mktoForm){
  mktoForm.addHiddenFields({
    lastMarketoFormURL : decodeURI(document.location.href),
    lastMarketoFormReferrerURL : decodeURI(document.referrer)
  });
});
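One caveat worth sketching: decodeURI throws a URIError when it meets a malformed sequence, such as a stray % that isn't followed by two hex digits. Since you don't control what shows up in a referrer, a defensive wrapper may be worthwhile. The helper name safeDecodeURI below is hypothetical, not part of any Marketo or browser API, and falling back to the raw string is just one possible policy.

```javascript
// Hypothetical helper: decode when possible, otherwise keep the raw value.
function safeDecodeURI(value) {
  try {
    return decodeURI(value);
  } catch (err) {
    // decodeURI signals malformed percent-sequences with URIError.
    if (err instanceof URIError) {
      return value;
    }
    throw err; // anything else is unexpected, so surface it
  }
}

// A well-formed URL decodes as usual.
const decodedOk = safeDecodeURI("https://eventcatalog.example.com/?artist=%E8%89%BE%E6%9C%AA%E6%9C%AA");

// A stray "%" would make decodeURI throw; here the original string survives.
const decodedFallback = safeDecodeURI("https://example.com/?discount=50%");

console.log(decodedOk, decodedFallback);
```

Swapping safeDecodeURI in for decodeURI in the form snippet would keep a single odd URL from stopping the hidden fields from being added.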
Interestingly, in too many years to count, I’ve never had occasion to use decodeURI before!*** (It’s not the same as decodeURIComponent, which I use constantly.)
Internationalized Resource Identifier (IRI) is a standardized format that essentially means “URL/URI but with international characters left intact instead of encoded.”
So these are both valid IRIs:
https://eventcatalog.example.com/?artist=艾未未
https://eventcatalog.example.com/?artist=艾未未&event=Q%26A
Note the second one still has a percent-encoded reserved character, but the international Chinese characters are left intact. You could also choose to encode the Chinese characters and still have a valid IRI. The key is it’s more permissive than URI/URL syntax.
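To make the difference concrete, here is a small sketch of what each built-in decoder does with a URL like the second one. The input string is simply the fully percent-encoded form of that example; the standard URL class is used only to show how the query string would be parsed afterward.

```javascript
// Fully percent-encoded form of the second IRI example.
const fullyEncoded = "https://eventcatalog.example.com/?artist=%E8%89%BE%E6%9C%AA%E6%9C%AA&event=Q%26A";

// decodeURI restores the Chinese characters but leaves %26 (a reserved "&") alone.
const asIri = decodeURI(fullyEncoded);

// decodeURIComponent decodes everything, including the reserved "&".
const overDecoded = decodeURIComponent(fullyEncoded);

// Parse both results to see what happens to the "event" parameter.
const eventFromIri = new URL(asIri).searchParams.get("event");
const eventFromOverDecoded = new URL(overDecoded).searchParams.get("event");

console.log(asIri, overDecoded, eventFromIri, eventFromOverDecoded);
```

With decodeURI the event value is still Q&A once parsed. With decodeURIComponent the ampersand becomes a parameter separator and the value is cut down to Q, which is why decodeURI is the one to reach for when the whole URL is being stored.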
There are a variety of reasons that IRIs can’t replace URLs in the world at large, but they do exist. So we’re not going too far afield.
* I’m deliberately using “URL-like” and not “URL” because there’s a kind of epistemological question involved: “What is a URL?” Or perhaps “When is a URL?”
** That is, UTF-16.
*** decodeURI leaves percent-encoded reserved ASCII characters (like %26 for &) encoded, which makes it the wrong choice when you’re trying to decode individual params and values. But it works here, as our focus is on the non-ASCII characters.