URL-encoding surrogate pairs and line breaks to avoid data loss, in a few different languages

SanfordWhiteman
Level 10 - Community Moderator

In a recent post you learned that specific characters must be URL-encoded before storing String/Text fields in Marketo. You don’t need to encode the whole value, and other characters should be left alone for readability.

 

Below I show how to do that kind of selective URL-encoding in 4 languages (JavaScript, Java, PHP, and C#) and how to decode in Velocity for correct email output.

 

Decoding in Velocity

I’ll cover decoding first because it’s so easy. Just do:

#set( $decodedStuff = $link.decode($lead.encodedStuff) )

 

$link.decode doesn’t care if the entire input isn’t URL-encoded, as long as the URL-encoded parts are done right.

 

(This is how the URL decoding function works in all languages that I know of. And it makes sense: otherwise, a string like My%20love%20of%20🥨%20is%20twisted could never be decoded, despite it having a single very clear meaning!)
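You can see that tolerance for yourself in JavaScript. This is only a sketch: decodeURIComponent stands in for $link.decode here, on the assumption that the two treat mixed input alike, and the variable names are made up for illustration.

```javascript
// Partially encoded input: the spaces are escaped, the pretzel emoji is not.
const mixed = "My%20love%20of%20🥨%20is%20twisted";

// decodeURIComponent converts the %XX sequences and leaves everything else alone.
const decoded = decodeURIComponent(mixed);

console.log(decoded); // My love of 🥨 is twisted
```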

 

Encoding in a variety of languages

In all cases, we’re replacing these characters with their URL-encoded equivalents:

  • the percent sign
  • the standard line break CR and/or LF characters
  • all characters outside the Unicode BMP, U+10000 through U+10FFFF
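
For reference, here is roughly what one sample from each of those categories turns into. A small JavaScript sketch using encodeURIComponent; the samples array is just an illustration, with 👍 (U+1F44D) standing in for "anything outside the BMP".

```javascript
// One sample from each category in the list above.
const samples = ["%", "\r", "\n", "\u{1F44D}"];

// Encode each sample on its own, the same way the replace callbacks do later on.
const encodedSamples = samples.map((ch) => encodeURIComponent(ch));

console.log(encodedSamples.join(" "));
// %25 %0D %0A %F0%9F%91%8D
```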

So we expect this original value:

Longtime product user.👍
Hoping to get pricing for an enterprise contract.

 

To become this encoded value:

Longtime product user.%F0%9F%91%8D%0AHoping to get pricing for an enterprise contract.

 

(Gotta say it wasn’t so fun to get back into PHP nor into C#, which I’ve never actually used for a full app! But I do it all for you guys and/or your devs.😛)

 

In JavaScript

JS sets the standard for simplicity with String#replace(regex, callback) and encodeURIComponent:

let pattern = /[%\r\n\u{10000}-\u{10FFFF}]/ug;
let replaced = original.replace(pattern, encodeURIComponent);
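
If you want to check that against the expected value from earlier, a self-contained sketch might look like this. The function name encodeForMarketo is hypothetical, and the sample assumes the line break in the original value is a single LF.

```javascript
// Hypothetical helper name; the pattern and callback are the ones shown above.
function encodeForMarketo(original) {
  // Match %, CR, LF, and any code point above the BMP. The u flag makes each
  // astral character match as one unit rather than as two separate surrogates.
  const pattern = /[%\r\n\u{10000}-\u{10FFFF}]/ug;
  // replace() hands the matched text to the callback as its first argument,
  // which is the only argument encodeURIComponent looks at.
  return original.replace(pattern, encodeURIComponent);
}

const original = "Longtime product user.👍\nHoping to get pricing for an enterprise contract.";
const encoded = encodeForMarketo(original);

console.log(encoded);
// Longtime product user.%F0%9F%91%8D%0AHoping to get pricing for an enterprise contract.

// Round trip: decoding should give back exactly what we started with.
console.log(decodeURIComponent(encoded) === original); // true
```

Note the spaces and the period are left as-is, which is the whole point of encoding selectively.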

 

I went for callback-style in the other languages too, so you can easily see the differences.

 

In Java

Here we use Matcher.replaceAll(function) and URLEncoder.encode:

Pattern pattern = Pattern.compile("[%\\r\\n\\x{10000}-\\x{10FFFF}]");
String replaced = pattern.matcher(original).replaceAll( match -> URLEncoder.encode(match.group(), StandardCharsets.UTF_8) );

 

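Note that Java regexes are Unicode-aware by default, but every backslash has to be doubled inside the string literal. The StandardCharsets.UTF_8 argument looks redundant but is required: the one-argument URLEncoder.encode is deprecated and falls back to the platform's default charset. Also note that Matcher.replaceAll(function) needs Java 9 or later, and the URLEncoder.encode overload that takes a Charset needs Java 10 or later.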

 

In PHP

Never going back to PHP professionally, but it’s pretty good here. We use preg_replace_callback(pattern, callable) and urlencode:

$pattern = "/[%\r\n\x{10000}-\x{10FFFF}]/u";
$replaced = preg_replace_callback($pattern, function ($matches) { return urlencode($matches[0]); }, $original);

 

In C#

Deciding to use C# as my 4th example was… questionable. Turns out .NET is one of the few “modern” runtimes whose regexes aren’t fully Unicode-aware yet: they match UTF-16 code units, so a character class can’t contain a code point beyond U+FFFF. Instead, we have to look for 2 \p{Cs} characters (surrogates) in a row, which implicitly means they’re encoding a character beyond U+FFFF. In turn that means we can’t use a single character class [] but need to switch to alternation this|or|that.

 

Then pattern.Replace(string, MatchEvaluator) does the trick:

Regex pattern = new Regex(@"%|\r|\n|\p{Cs}{2}");
MatchEvaluator callback = new MatchEvaluator((Match match) => {
  return WebUtility.UrlEncode(match.Value);
});
string replaced = pattern.Replace(original, callback);

 

Why encode the literal percent %?

You might wonder why % is encoded to %25, since the title only mentions surrogate pairs and line breaks. It’s because we can’t risk breaking on user input that looks URL-encoding-ish. $link.decode, decodeURIComponent, et al. will error out on this string, because the % isn’t followed by two hex digits:

Do you still do 50% off student subscriptions?

 

To pass it safely through a decoder it needs to be:

Do you still do 50%25 off student subscriptions?
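
Here is a quick way to watch that failure happen, sketched in JavaScript. It only demonstrates decodeURIComponent; the variable names are illustrative.

```javascript
const raw = "Do you still do 50% off student subscriptions?";

// "% o" is not a valid escape sequence, so the decoder throws a URIError.
let failed = false;
try {
  decodeURIComponent(raw);
} catch (e) {
  failed = e instanceof URIError;
}
console.log(failed); // true

// With the percent sign escaped as %25, the value decodes cleanly.
const safe = raw.replace(/%/g, "%25");
console.log(safe); // Do you still do 50%25 off student subscriptions?
console.log(decodeURIComponent(safe) === raw); // true
```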

 

A final boring, can’t-stop-myself note

In an earlier version of this post, I had an additional (r/R)eplace("+","%20") in the callbacks for Java and C# and used rawurlencode in PHP instead of urlencode.

 

That’s because none of Java, C#, or PHP encodes the space character as %20 in their “frequently used” functions (URLEncoder.encode, WebUtility.UrlEncode, and urlencode all emit + instead). Only JavaScript’s encodeURIComponent does it correctly! Even though we aren’t replacing spaces in this particular case, it felt better to have the languages aligned. But I decided to clip that out for brevity.

 

Also didn’t show imports for Java and C# (java.net.*, java.util.regex.*, java.nio.charset.* / System.Text.RegularExpressions, System.Net) but you’d probably figure those out, all things considered.
