Email Verification and Validation

Email verification can be tricky, both to actually get right and for users to deal with, when engineers get over-zealous with their validation requirements.

What Does an Email Address Look Like?

The simple answer that springs to most minds could be a short list of requirements:

  • it has an @ and then there's a . somewhere
  • there's something in front of and after the @ and behind the .

That's almost correct, but even governments struggle with email addresses that don't comply with some made-up requirement that top level domains look something like .co.uk or .de. Top level domains can be pretty long these days, but lots of email validations will reject them.

Also, jonathan@gov.co.uk has two dots in it, so our initial statement would already be only a half-truth.
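To make the naive rules concrete, here is a minimal Go sketch of nothing more than the two bullet points above. The function name naiveLooksLikeEmail and the sample addresses are made up for illustration; this is not how any particular library does it.

```go
package main

import (
	"fmt"
	"strings"
)

// naiveLooksLikeEmail implements only the two bullet points:
// an @ with something in front of it, and after it a "." with
// something on both sides. The name is made up for this sketch.
func naiveLooksLikeEmail(s string) bool {
	at := strings.Index(s, "@")
	if at < 1 { // no @ at all, or nothing in front of it
		return false
	}
	domain := s[at+1:]
	dot := strings.LastIndex(domain, ".")
	// Need at least one character before and after the last dot.
	return dot > 0 && dot < len(domain)-1
}

func main() {
	for _, s := range []string{
		"jonathan@gov.co.uk",
		"hello@example.photography",
		"jonathan@localhost",
		"@example.com",
	} {
		fmt.Printf("%-28s %v\n", s, naiveLooksLikeEmail(s))
	}
}
```

A check like this happily accepts long top level domains, but it also rejects anything without a dot after the @, which is where the spec starts to disagree with it.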

Luckily there are some specifications that describe what an email address CAN look like, like RFC 5322 and extensions to it like RFC 6854. However, many engineers don't have the luxury of looking up specs while implementing a feature and grab the nearest software package available to accommodate their needs, or further clamp down requirements (aka violate the spec) to fulfill some sort of business requirement.

This, a weird talk on YouTube and some reddit comments also taught me that apparently cow@[dead::beef] is a valid, spec compliant email address. 🤯

  1. cow <- this is fine
  2. [] possible part of an IP address notation?
  3. dead::beef valid characters in an IPv6 address (0-9, a-f)
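The third point is easy to check yourself. The sketch below only looks at the bracketed domain part and asks Go's net/netip package whether the content is an IP address; domainLiteralAddr is a hypothetical helper name and the code says nothing about the rest of the address being spec compliant.

```go
package main

import (
	"fmt"
	"net/netip"
	"strings"
)

// domainLiteralAddr returns the IP address inside a bracketed domain
// part such as "[dead::beef]". Hypothetical helper for this sketch.
func domainLiteralAddr(email string) (netip.Addr, bool) {
	at := strings.LastIndex(email, "@")
	if at < 0 {
		return netip.Addr{}, false
	}
	domain := email[at+1:]
	if len(domain) < 2 || domain[0] != '[' || domain[len(domain)-1] != ']' {
		return netip.Addr{}, false
	}
	// Strip the brackets and let the standard library judge the rest.
	addr, err := netip.ParseAddr(domain[1 : len(domain)-1])
	if err != nil {
		return netip.Addr{}, false
	}
	return addr, true
}

func main() {
	addr, ok := domainLiteralAddr("cow@[dead::beef]")
	fmt.Println(ok, addr, addr.Is6())
}
```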

Note: It's up to you, running a business, to disregard specs purposefully, but usually I bet it's done by accident.

How to Verify that a String is an Email Address?

There are plenty of packages for various programming languages to ensure this. An often followed approach is to verify compliance of an email address via regular expressions. There's even a website that has gathered regular expressions for different languages (and even SQL dialects) to figure out if a user input is an email address: emailregex.com, which is helpful.
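As a rough illustration of how a regular expression can clamp down harder than intended, here are two made-up patterns in Go. Neither is taken from emailregex.com or from any library; the only difference between them is the upper bound on the length of the top level domain.

```go
package main

import (
	"fmt"
	"regexp"
)

// strictTLD is a made-up pattern of the kind that gets copied around:
// it only accepts top level domains of two to four letters.
var strictTLD = regexp.MustCompile(`^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,4}$`)

// looseTLD drops the upper bound on the top level domain length.
var looseTLD = regexp.MustCompile(`^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$`)

func main() {
	for _, s := range []string{"jonathan@gov.co.uk", "hello@example.photography"} {
		fmt.Printf("%-28s strict=%v loose=%v\n",
			s, strictTLD.MatchString(s), looseTLD.MatchString(s))
	}
}
```

Both patterns are still far narrower than RFC 5322, so treat them as a demonstration of the problem rather than a recommendation.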

I often work with Node.js and there's a LOT of email validation packages on npm. The disappointing thing is that they barely state which RFC spec they target.

Some of them do other interesting things though, like for example deep-email-validator:

  • Validates email was not generated by disposable email service using disposable-email-domains.
  • Validates MX records are present on DNS.
  • Validates SMTP server is running.
  • Validates mailbox exists on SMTP server.

This was new to me, that fateful night that I stumbled down the email validation rabbit hole. How smart to check DNS for MX records and check in with the SMTP server!

Other packages, like email-existence, focus solely on the external components of DNS and SMTP, while others, like validator.js, focus only on the user provided information. validator.js actually has a lot of great test cases in its GitHub repository.

Is This a Throwaway Email Account?

There are projects, like disposable-email-domains, that track which email service providers generate temporary mailboxes for users who want to remain anonymous or creatively experience a number of consecutive trial accounts. Given how quickly you can register domains and point them at the same service, I would take this list and its accuracy with a grain of salt, since it's prone to go out of date quickly and some of the discarded domains might actually belong to someone being a good netizen.
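Checking against such a list boils down to a set lookup on the domain part. A minimal Go sketch, assuming the list has already been loaded into a map; the two entries here are made-up placeholder domains, not taken from the real list.

```go
package main

import (
	"fmt"
	"strings"
)

// disposable is a tiny stand-in for a real list; both entries are
// made-up placeholder domains.
var disposable = map[string]bool{
	"throwaway.example": true,
	"tempmail.example":  true,
}

// isDisposable reports whether the domain part of email is on the list.
// Domains are compared case-insensitively.
func isDisposable(email string) bool {
	at := strings.LastIndex(email, "@")
	if at < 0 {
		return false
	}
	return disposable[strings.ToLower(email[at+1:])]
}

func main() {
	fmt.Println(isDisposable("someone@Throwaway.example"))
	fmt.Println(isDisposable("public@jonathanmh.com"))
}
```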

Third Party Providers and Email Sending APIs

Companies in the email business also make up creative definitions of when an email address should be considered valid, like Agillic:

VALID_EMAIL = Not false - Checks if the email address has previously bounced (1 Hard bounce or 1.000 soft bounces)

This is basically custom to their infrastructure and how they curate their clients' data sets over time, which they probably have some experience with. The field name, however, might be a little misleading without having their glossary of definitions open on the side.

Some email API providers also publish listicles with questionable reasoning behind their recommendations, like this post on Mailtrap:

with Deep email validator, an MIT-certified dependency package

I'm sure they meant MIT-licensed, but come on, who proofreads these?

Obviously it's a post to pitch their own services, as we can see later in the article:

As you can see, for my email service, I use Mailtrap Email Sending, which offers robust email-sending capabilities.

which brings me closer to my use-case.

So Jonathan, How DID You Get Into This Rabbit Hole?

A Reddit post presenting an email validation project caught my attention. It drew some interesting critique, calling out possible errors in the code and structure of the project. I gave in to curiosity, started reading the spec and started lifting test cases from various sources like Wikipedia, Reddit comments and various Node libraries, which is where I discovered the questionable state of test cases.

I will applaud the creator of the post for putting themself out there to get some feedback (or GitHub stars) and I'm happy they did, otherwise it would probably have taken me a little longer to get a better understanding of the topic myself!

Even worse, implementing spec compliant email validation might produce some very very unexpected results for engineers and users. Lots of email service providers do not allow users to create local parts (the stuff in front of the @) that would theoretically be supported in the spec.

Gmail only allows certain characters in email addresses

My email use cases are usually that something happens and I need to send someone an email. Like they sign up for a web project and need to click a link to give me some reasonable indicator that they're an individual with a mailbox.

If I were targeted in some ridiculous attack, someone might get the idea to just tank my email domain reputation by signing up 100_000 bogus accounts, which makes ME look like a spammer for attempting to send the "welcome" emails, because I didn't check if an email exists in the first place. This would be unfortunate for me and might just fast track any of my future emails to a spam folder or, even worse, get my account blocked from sending.

Honourable Mention: Golang

One of the excellent things I found about Go is that it comes with a reasonably accurate function for parsing email addresses: ParseAddress in the net/mail package. It conveniently lets you parse strings like Jonathan M. Hethey <public@jonathanmh.com> into an Address struct:

type Address struct {
  Name    string // "Jonathan M. Hethey"
  Address string // "public@jonathanmh.com"
}

This is incredibly useful as a baseline for checking user input and you can layer additional methods on top (like DNS and SMTP probes).

Also, there's a Go package which lets you mock the DNS resolver if you want to write some tests for your DNS resolution feature: go-mockdns, another feature I was missing from the published Go project.

I haven't had the chance to look into mocking an SMTP server yet, but that sounds like another rabbit hole for another day.

Summary

  1. If I provide an open source library for validation/verification I will note which god damn spec I'm targeting 🤓
  2. If I provide functionality for talking to different protocols, I will mock them in my tests, so the tests don't break if someone registers the domain I use in them. 🙃
  3. If I have elaborate test cases I will strongly consider putting them in a portable format to make them easier for other libraries to adopt. 📖
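
For the third point, a portable format could be as plain as a JSON file of inputs and expectations that any language can read. A rough Go sketch, where the field names and the cases are made up and net/mail's ParseAddress stands in for whatever validator is under test:

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/mail"
)

// fixtures stands in for a shared JSON file; the field names and the
// cases are made up for this sketch.
const fixtures = `[
  {"input": "public@jonathanmh.com", "valid": true},
  {"input": "Jonathan M. Hethey <public@jonathanmh.com>", "valid": true},
  {"input": "no-at-sign", "valid": false}
]`

type testCase struct {
	Input string `json:"input"`
	Valid bool   `json:"valid"`
}

// failures runs every case through mail.ParseAddress and returns the
// inputs whose result differs from the expectation.
func failures(raw string) ([]string, error) {
	var cases []testCase
	if err := json.Unmarshal([]byte(raw), &cases); err != nil {
		return nil, err
	}
	var failed []string
	for _, c := range cases {
		_, err := mail.ParseAddress(c.Input)
		if (err == nil) != c.Valid {
			failed = append(failed, c.Input)
		}
	}
	return failed, nil
}

func main() {
	failed, err := failures(fixtures)
	fmt.Println(failed, err)
}
```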

Thank you for reading! If you have any comments, additions or questions, please tweet or toot them at me!