We have a significant number of email addresses submitted via our non-Marketo forms that have typos in the domain field. For example, instead of .com, they have the following.
At some point, we're going to implement real-time email address validation in our forms using a service like Informatica, but in the meantime, does anyone have a recommendation on how best to do this (e.g. Javascript code, webhook to a service that will let us manipulate strings, etc.)?
Batching your whole DB through a webhook will bog down your instance something fierce (I'd say webhooks are better suited for responding to Interesting Moments such as form fillout, though this might change as Marketo beefs up 'hook performance). If you can prefilter your db to exclude known TLDs, then a 'hook may be feasible. I have an endpoint you can use for this (DM me if interested).
But really you'd get equivalent satisfaction from exporting that same Smart List, touching up the email address using Excel or Google Sheets, and reimporting.
The caveat in both cases is that even though you can know for sure there's a typo if an email doesn't end with a (currently) valid TLD, you can't always extrapolate what TLD they meant to type. For example, did ".con" mean ".com" or ".cn"? Is ".inf" supposed to be ".info," or ".in"?* And so on. Of course you can push people toward the most well-known TLDs for your geographic area (if that's sufficiently limited) and use physical and IP addresses as an additional guide. This complexity is why more intelligent systems usually come at a cost, yet they still aren't 100% accurate.
On the input side, even without paying for Informatica, simply matching the email address against the known TLDs will be a huge help: MktoForms2 :: Force known TLD
Here's an interesting list from DomainTools of possible typos for "com" (I think they're in order of likelihood):
* Note also that from an AI perspective, these pairs of possible typos have the same Levenshtein Distance, so a non-keyboard-aware software program sees them as, er, equally different.
I've resigned myself to using the Force.com Connector in Excel and manipulating the errant TLDs there, but I'm definitely interested in a webhook I can call whenever an email address contains one of these common typos.
The list in my post was ordered by frequency of occurrence. Below are the top 20 typos and their percent of the total.
CON | 42.23% |
CIM | 4.99% |
C | 4.82% |
VOM | 3.15% |
CPM | 2.18% |
COMM | 1.91% |
NER | 1.77% |
NEY | 1.54% |
MET | 1.44% |
XOM | 1.31% |
COOM | 1.21% |
COML | 1.07% |
CCOM | 1.04% |
COME | 0.94% |
OCM | 0.87% |
CLM | 0.60% |
COMI | 0.54% |
NT | 0.54% |
BET | 0.54% |
COMN | 0.50% |
Really appreciate the form script, which I added it to our Marketo LP templates with the following tweaks.
Do you have to periodically manually update the list of valid TLDs by copying them from http://data.iana.org/TLD/tlds-alpha-by-domain.txt or do you have an automated way of maintaining the list in the code? The list of valid TLD must change frequently as over 200 TLDs have been added since I checked it about 7 months ago.
Do you have to periodically manually update the list of valid TLDs by copying them from http://data.iana.org/TLD/tlds-alpha-by-domain.txt or do you have an automated way of maintaining the list in the code?
I'd download the latest list every day and then serve it up via a static textfile, set to expire every day at midnight. The only thing that prevents you from hitting IANA's website directly is it doesn't honor CORS requests. Or you could just proxy the website and set a TTL of an hour -- that'd work, too.
There are a number of (metered) API services that let you do an individual TLD lookup in seeming real-time (though their back-end is probably only refreshed 1x/day, too). But the reality is with a payload this small you might as well download the whole thing. It's not worth even the tiny overhead of a server-side query/authkey/etc.
Another way to do this is via a HTTP-DNS gateway. StatDNS has one. So you can do an NS lookup on the TLD.
http://api.statdns.com/com/ns - returns 200 OK
http://api.statdns.com/cozzz/ns - returns 500 Error
The API does support CORS requests!. So you'd be pretty golden. You could even use the HEAD method because you don't care about the response (only if it's 200 or 500).
Using the StatDNS API for real-time TLD validation: MktoForms2 :: Force known TLD - DNS API
BTW Elliott Lowe that second link also demos the important concept of asynchronous validation. Any time you need to call a remote service for extended validation, you're going to want to (or going to have to) call it asynchronously.
Because built-in validation runs synchronously, you have to elegantly leave and reenter the Forms 2.0 validation path when the remote service responds (even if it's a millisecond later). This might be something I touch on in my MUG talk because there's a lot of hype about real-time validation/enhancement services, but not much knowledge about the options for plugging in a service that isn't Marketo-aware.
You could also search for a list of bounced emails and manually correct them.
Josh Hill , 2-3% of the hundreds of new leads we generate each day have a typo in the email address. And I now need to correct the thousands that have gotten into our database over the years, since there was no form validation of the email address. Not something I have the bandwidth to do manually. I'm just going to use Excel's string manipulation capabilities to correct the existing ones and we're in the process of implementing StrikeIron to validate them in real-time, so we don't have to do this again.
That's a good idea. I wouldn't necessarily say manual is the best.
How is Informatica/Strikeiron? I never tested their reliability, just the system.