Sure, spaces in an <a href> or <img src> make it invalid, but “invalid” doesn’t mean what you think it means

SanfordWhiteman · 12-07-2021

As I explored in my last Product Blogs post, knowing when you must not encode is just as important for a technical marketer as knowing when you must.

There’s also a third, more nuanced case: when you should have encoded, yet because apps have certain safeguards, the results won’t be fatal. You screwed up (and it was a screwup) but it isn’t a fireable offense, nor is it likely to even be noticed by management. But you should still slap yourself on the wrist for it.😊

The backstory

You’ve surely heard that “URLs with literal space characters are invalid.” And that’s not just a rumor, it’s absolutely true. This is not a valid URL, per W3C/WHATWG and IETF standards:

https://www.example.com/my page with spaces.html

To be a valid URL, each space would have to be replaced by %20.[1]
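
As a quick illustration of that replacement, here is a minimal Python sketch. It uses the standard library’s urllib.parse.quote, which is just one convenient encoder and not something the standards mandate; the filename is taken from the example above and the variable names are made up for the demo.

```python
from urllib.parse import quote

# Filename from the example URL above (illustrative only).
filename = "my page with spaces.html"

# quote() percent-encodes characters that are not URL-safe; each space becomes %20.
encoded_name = quote(filename)
valid_url = "https://www.example.com/" + encoded_name

print(valid_url)  # https://www.example.com/my%20page%20with%20spaces.html
```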

Then that basic truth gets extended to “HTML attributes that represent URLs and have literal space characters are invalid.” That’s also true. Neither of these tags is valid HTML, per the HTML5 standard:

<a href="https://www.example.com/my page with spaces.html">Go</a>
<img src="https://www.example.com/my image with spaces.png">

But then it gets further extended to “Those invalid HTML <a> or <img> tags will not work when clicked or downloaded.” That one’s just not true.

I held this misunderstanding myself for an embarrassingly long time.

We used to tell people not to even use spaces in underlying filenames at all — even though Mac, Windows, and Linux filesystems have all supported spaces forever — for fear that we’d forget to URL-encode when linking to those assets.[2] The idea that spaces always have to be URL-encoded to %20, or else pages and emails would be horribly broken, is still ingrained.

(Of course, the single rule “no spaces anywhere” was silly. Though they are indeed invalid, spaces are no more invalid than other characters that need to be URL-encoded. We didn’t say “no plus signs” or “no ampersands” or “no equals signs”... nor “no percent signs” — which need to be encoded to %25 to not be mistaken for the start of a percent-encoded sequence![3])
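
To see why the percent sign is the sneakiest of the bunch, here is a small Python sketch using the standard library’s urllib.parse. The filename report%20final.pdf is hypothetical (a file whose name really contains the three characters %20), and Python’s decoder stands in for whatever a server would do.

```python
from urllib.parse import quote, unquote

# safe="" tells quote() not to leave any reserved character untouched.
for ch in "+&=%":
    print(ch, "->", quote(ch, safe=""))
# + -> %2B
# & -> %26
# = -> %3D
# % -> %25

# Hypothetical file whose name contains a literal "%20" (three characters, not a space).
literal_name = "report%20final.pdf"

# Left unencoded, the name decodes to something else entirely.
print(unquote(literal_name))   # report final.pdf

# Encoding the percent sign to %25 first preserves the original name.
encoded_name = quote(literal_name)
print(encoded_name)            # report%2520final.pdf
print(unquote(encoded_name))   # report%20final.pdf
```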

The funny thing is... forgetting to encode not-otherwise-special characters like spaces was such a common worry, and so easy to do by accident, that all modern browsers will encode for you if you forget.

Yep, auto-encoding is a thing

This feature is in the spec. HTML defers URL parsing to the WHATWG URL Standard, and when that parser walks the path and query string, it is supposed to note that invalid characters were originally present, then proceed to do the percent-encoding itself.

At first read, it seems like an error is immediately thrown if %NN encoding wasn’t used, since the spec’s term for an unencoded invalid character is “validation error.” But here’s the catch: that validation error is not a fatal error. It’s more like a handled exception.

The spec dictates that processing continues after such an error: a validation error does not mean the parser terminates.

So the validation error is logged “on the side,” if you will — and you’ve still got invalid HTML, as any validator will tell you — but that’s not the end of the story.

That’s why attributes like <a href> or <img src> that we (sort of incorrectly) think of as “a URL” are still functional with spaces. The spaces are encoded before the final URL goes on the wire!
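
If it helps to picture the “note the error, keep going, encode anyway” behavior, here is a minimal Python sketch. It is not the spec’s algorithm: lenient_encode_path and INVALID_ASCII are hypothetical names, and the character set is a simplified assumption rather than the normative list in the URL Standard.

```python
from typing import List, Tuple

# Simplified, assumed subset of ASCII characters a browser encodes inside a path.
# The authoritative list lives in the URL Standard.
INVALID_ASCII = set(' "<>`{}')


def pct(ch: str) -> str:
    """Percent-encode one character as its UTF-8 bytes."""
    return "".join(f"%{b:02X}" for b in ch.encode("utf-8"))


def lenient_encode_path(path: str) -> Tuple[str, List[str]]:
    """Encode a path, logging problems on the side instead of stopping."""
    out: List[str] = []
    errors: List[str] = []
    for ch in path:
        code = ord(ch)
        if ch in INVALID_ASCII or code < 0x20 or code == 0x7F:
            errors.append(f"validation error: unencoded {ch!r}")  # noted, not fatal
            out.append(pct(ch))
        elif code > 0x7E:
            out.append(pct(ch))  # non-ASCII is encoded too, as UTF-8 bytes
        else:
            out.append(ch)
    return "".join(out), errors


encoded, errors = lenient_encode_path("/my page with spaces.html")
print(encoded)      # /my%20page%20with%20spaces.html
print(len(errors))  # 3

print(lenient_encode_path("/café.png")[0])  # /caf%C3%A9.png
```

The point of returning the errors alongside the result is that the two are independent: a validator can complain about the list while the request still goes out with the encoded string.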

Auto-encoding only applies to characters that aren’t otherwise special

The auto-URL-encoding feature only applies to characters that do not have a special meaning at their current position in a URL.

That is, it does apply to spaces and to all non-ASCII characters[4]. It doesn’t apply to ASCII characters like & and = and # when they might be special.

This restriction makes sense when you think about it. The browser can’t possibly know whether you meant the special & that separates query params, or the non-special & meaning “and” (which would need to be %26 to be de-special-ized). It can’t know if you meant the special # that starts the fragment (the hash) or the non-special number sign # (which would be %23). And so on.

So in those cases, the character will be left exactly as-is. If you want the non-special meaning, you must encode it yourself. Better yet: correctly encode all invalid characters and don’t rely on the browser to rescue you. But at the same time, know that browsers offer some cushion and don’t freak out if you forget %20 on occasion.
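
Here is a short Python sketch of what “left exactly as-is” costs you. Python’s urllib.parse is not a browser, but it treats & and # as separators the same way, so it serves as a stand-in; the query strings and URLs are made up for the demo.

```python
from urllib.parse import parse_qs, urlsplit

# Unencoded "&": read as a parameter separator, so the value is cut short.
print(parse_qs("q=salt&pepper"))      # {'q': ['salt']}

# Encoded as %26: read as data, so the full value survives.
print(parse_qs("q=salt%26pepper"))    # {'q': ['salt&pepper']}

# Unencoded "#": everything after it becomes the fragment, not part of the path.
parts = urlsplit("https://www.example.com/item#1.html")
print(parts.path, parts.fragment)     # /item 1.html

# Encoded as %23: the character stays in the path.
parts_encoded = urlsplit("https://www.example.com/item%231.html")
print(parts_encoded.path)             # /item%231.html
```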

NOTES

[1] I will not be taking questions about encoding spaces to + at this time.😛

[2] Or not be able to encode for technical reasons. Like when a value is stored unencoded in a database — as is proper — but the platform doesn’t offer any way to choose an encoding upon display/output.

[3] Note all characters mentioned in this section ( + & = % ) are allowed in filenames at the OS/filesystem level, so they’re easy to use in day-to-day work.

There are a few URL-sensitive characters, like forward slash / on all platforms and question mark ? on Windows, which are (coincidentally!) disallowed in filenames. So you don’t need a rule about those, since they’re impossible to type by accident.

[4] Well, not all non-ASCII characters in all browsers, unfortunately. In IE, auto-encoding of the Latin-1 Supplement (the 128 characters just after the ASCII range) is buggy, though as far as I know, auto-encoding of characters above that range is not. I could get into the reason, but this post is long enough already.