Transforming IRIs to URIs/URLs using Velocity

Level 10 - Community Moderator

In a recent post I described how you can save database space, optimize bandwidth, and increase readability by storing URLs in their decoded form.


A decoded URL is equivalent to an IRI (Internationalized Resource Identifier). That’s the official standard for “A string that’s like a URI, except it allows international characters to be used without percent-encoding.”

Using IRIs whenever you can is super-smart: an IRI can make the difference between fitting in a 255-character “URL” field[1] and losing critical data.
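To put a number on that, here is a minimal Python sketch (Python is only used for illustration here, it is not something you would run inside Marketo) comparing the length of a decoded value with its percent-encoded form. How a particular database field counts characters is an assumption you would need to check for your own system.

```python
from urllib.parse import quote, unquote

decoded = "艾未未"            # the value as it appears in an IRI
encoded = quote(decoded)      # the same value percent-encoded as UTF-8 bytes

# Each character becomes three UTF-8 bytes, and each byte becomes "%XX"
print(len(decoded), decoded)
print(len(encoded), encoded)

# Decoding gets the original value back, so nothing is lost by storing the IRI
print(unquote(encoded) == decoded)
```

Three characters turn into twenty-seven once encoded, so a URL with a handful of international values can run past a short field limit quickly.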

But do think about whether you’re storing the URL for reporting (e.g. a history of page visits, form conversion pages, referrers) vs. whether you intend to use it in an outbound link later on.

If you might link to it in a web page, an IRI is still fine. But IRIs aren’t ready for email links.


IRIs don’t work, as-is, in an <a> tag in an email

Contrary to popular belief, an IRI (a.k.a. URL with international characters) can be used in an href in an HTML5 document. The user-agent does the percent-encoding for you just before putting the value on the wire.

But note the number 5 there. An email is at best an HTML4 environment. And in HTML4, an href is a traditional ASCII-only URI, so international characters must be percent-encoded ahead of time.[2]

To sum up the above, these are both valid HTML5:

<a href="">Go</a>
<a href="艾未未">Go</a>

But only this one is valid HTML4:

<a href="">Go</a>
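Percent-encoding "ahead of time" can be sketched in a few lines of Python. This is only an illustration of what a correct IRI-to-URI conversion should produce, not Marketo or Velocity code. The `iri_to_uri` helper and the `example.com` URL are made up for this example, and the sketch ignores internationalized host names, which need separate handling.

```python
from urllib.parse import quote

# RFC 3986 reserved characters, plus "%" so existing escapes are left alone
SAFE = ":/?#[]@!$&'()*+,;=%"

def iri_to_uri(iri):
    """Percent-encode non-ASCII characters; keep delimiters and existing escapes."""
    return quote(iri, safe=SAFE)

# Hypothetical IRI, used only for illustration
iri = "https://example.com/?artist=艾未未&eventname=Q%26A"
uri = iri_to_uri(iri)
print(uri)
```

Treating `%` as safe is the important design choice: the existing `%26` passes through untouched instead of being double-encoded or decoded.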


Velocity to the rescue, as usual

I swear the previous post was not secretly designed to create a need for a Velocity post.

But I’ve got Velocity-brain 24/7, so I started thinking and experimenting: What if you had stored an IRI for efficiency, then later had a business requirement to link to it, say in a Marketo Alert?

Velocity’s LinkTool is designed to manipulate all parts of a URL, and it was written before IRIs and HTML5 were things. It automatically assumes international characters need to be percent-encoded. So I was expecting to tell you to use $link.uri($lead.iriField) and call it a day — until I ran some tests and realized LinkTool is weirdly broken when it comes to certain ASCII reserved characters! Huh?


LinkTool gets the hard part right, but an easy part wrong

It does the international part right, converting 艾未未 to %E8%89%BE%E6%9C%AA%E6%9C%AA.

But it doesn’t understand that an existing %26 (encoded &) and %3D (encoded =) must stay percent-encoded. They can’t be decoded, because there was a reason they were encoded in the first place, and decoding them changes the meaning of the URL.

Imagine our artist is hosting a Q&A session — an event simply named Q&A.

The & has to be encoded as %26, because the plain ampersand & has a very special meaning in query strings.

So the correct IRI would be: 艾未未&eventname=Q%26A

Unfortunately, when you feed that to $link.uri, it bugs out:

Wrong move. Now the query string has three query params:

  1. artist with the value, correctly encoded, 艾未未
  2. eventname with the value Q
  3. A with an empty value
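You can see the same change of meaning outside Velocity. The Python sketch below (illustration only, using just the query string portion) parses the correct query string and the one where `%26` has been decoded too early.

```python
from urllib.parse import parse_qsl

# Query strings only; the rest of the URL doesn't matter for this comparison
correct = "artist=艾未未&eventname=Q%26A"
broken = "artist=艾未未&eventname=Q&A"   # %26 decoded before parsing

# keep_blank_values=True keeps params that have no value, like the stray "A"
correct_params = parse_qsl(correct, keep_blank_values=True)
broken_params = parse_qsl(broken, keep_blank_values=True)

print(correct_params)
print(broken_params)
```

The first list has two params, with `eventname` holding the full `Q&A`. The second has three, matching the list above.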


Luckily, Velocity can still do the trick, though we need to “massage” the IRI a little before and after passing it to LinkTool so the above bug doesn’t get triggered. I commented the code pretty well, but let me know if anything doesn’t make sense here:

## your original IRI (URL-decoded) value
IRI: ${lead.iriField}

#**
 * Convert IRI to URI using Velocity LinkTool 
 * @author Sanford Whiteman, TEKNKL
 *#
## the reserved characters LinkTool has trouble with
#set( $doubleEncodeNeeded = ["&","="] )
## double-percent-encode troublesome characters
## i.e., "%26" becomes "%2526"
#set( $iriMassaged = $lead.iriField )
#foreach( $escapable in $doubleEncodeNeeded )
#set( $hex = $esc.url($escapable).substring(1) )
#set( $iriMassaged = $iriMassaged.replaceAll("(?i)%(${hex})","%25$1" ) )
#end
## then create the LinkTool instance
#set( $uri = $link.uri($iriMassaged) )
#set( $void = $uri.setXHTML(false) )
## before reserializing, make sure all query param keys and values 
## use literal values for those troublesome reserved characters
## clear initial params and work from clone
#set( $replaceParams = $uri.getParams().clone() )
#set( $void = $uri.setParams({}) )
#foreach( $param in $replaceParams.entrySet() )
#set( $key = $param.getKey() )
#set( $originalValueStack = $convert.toStrings($param.getValue()) )
#set( $newValueStack = [] )
## restore literal reserved characters in the key
#foreach( $escapable in $doubleEncodeNeeded )
#set( $key = $key.replace($esc.url($escapable),$escapable) )
#end
## restore literal reserved characters in each (possibly repeated) value
#foreach( $repeatedValue in $originalValueStack )
#foreach( $escapable in $doubleEncodeNeeded )
#set( $repeatedValue = $repeatedValue.replace($esc.url($escapable),$escapable) )
#end
#set( $void = $newValueStack.add($repeatedValue) )
#end
#set( $void = $uri.setParam($key,$newValueStack,false) )
#end

## now, the international characters will be percent-encoded and everything else the same
URL: ${uri}
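If the double-encoding step looks odd, here is the same idea as a small Python sketch. It only mirrors the "massage" step from the Velocity code (the `protect` helper name is made up for this illustration) and uses a plain decode pass to stand in for a parser that decodes too eagerly. It does not reproduce LinkTool itself.

```python
import re
from urllib.parse import quote, unquote

# The reserved characters whose escapes need protecting
DOUBLE_ENCODE_NEEDED = ["&", "="]

def protect(iri):
    """Turn %26 into %2526 and %3D into %253D (case-insensitive)."""
    for ch in DOUBLE_ENCODE_NEEDED:
        hex_code = quote(ch, safe="")[1:]   # "26" for "&", "3D" for "="
        iri = re.sub(f"(?i)%({hex_code})", r"%25\1", iri)
    return iri

original = "artist=艾未未&eventname=Q%26A"
protected = protect(original)
print(protected)

# One decode pass now yields "%26" again instead of a bare "&"
print(unquote(protected))
```

Because `%25` is just an encoded `%`, one round of decoding lands you back on the `%26` you started with, and the ampersand never shows up as a bare delimiter.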

[1] Of course a field designed for URLs should not be limited to 255 code units (assuming UTF-16) in the first place, since URLs have no actual max length and in practice can easily be several hundred characters long. But if it is, you have to play by its rules.
[2] Many, but not all, HTML4 environments and/or email clients also percent-encode for you as in the HTML5 spec. But that behavior was not standardized until HTML5 and you cannot rely on it.