Case study – Clean and verify emails using Verteego Data Science Suite

At Verteego we have accumulated a list of several thousands of customer and prospect emails over the years in our CRM. As you might have experienced, after a while the quality of an email database decreases as people change their company or jobs, organizations change their name, mail servers are shut down, etc.

All these things don’t matter if you don’t look to bulk import your email lists into a email marketing system (eg. Mailchimp, Sendinblue, Mailjet,…). Those systems are equipped with powerful prediction algorithms (Omnivore for Mailchimp) able to analyze imported mailing lists and reject them in order to prevent spamming and preserve the trust of their mailing servers. A good thing when it comes to avoiding useless spam in our mailboxes. Not so cool if you just try to inform your customers about a webinar.

So even though if the email database we tried to import was entirely composed of people we know, it was detected as a “bad” list due to bad quality of a part of our contacts.

As deadlines for sending our campaign were coming dangerously closer, our marketing and tech teams joined their forces and built an email cleaning and verification tool powered by the Verteego Data Science Suite.

If you feel the need to do that at home, here is how it works (you can download the scripts and data flow template here).

If you have any questions while reading this let us a comment below or write us.

1. Clean emails

First we exported our raw email spreadsheet from our CRM and uploaded it to the Verteego Data Science Suite cleaning tool to run some common data cleansing jobs :

  • Remove white spaces before and behind email addresses (trim)
  • Remove general emails (contact@…, info@…, hello@…, hr@,… etc.)
  • Remove duplicates

This first step was quite straight forward as our data comes from our CRM and is generally quite well organized.

screenshot-from-2016-11-08-17-47-49

2. Verify emails

The main question to answer during email verification is: “Does this email address have the right syntax and does it really exist?”
In our case we have answered the question of existence by checking if at least one MX record is attached to the domain. We could have gone a step further by establishing an SMTP conversation with the server but time was short and we had to get things done. But we’ve got some ideas for a Part 2 of this case study … 😉

Here is the email verification flow we built.

screenshot-from-2016-11-08-16-04-03

To download this flow as a template and run it directly in Verteego Data Science Suite click here.

 

1. Load CSV file (GetFile processor)

Loads the previously cleaned CSV file as a new flow file (~ 7000 rows).

screenshot-from-2016-11-08-17-13-10

 

2. Split file into lines (SplitText processor)

Splits the initial flow file into one new flow file per line. We don’t care here if the header line of our CSV file is also included as it will be simply rejected in a later processor.

screenshot-from-2016-11-08-16-19-18

 

3. Extract email (ExtractText processor)

Extracts the email address from the CSV line and stores it as an attribute. In this case, the regex to extract a column from the CSV we used was quite straight forward to write as the email was stored in the first column. We also put the whole content of the flow file into an attribute to use it later on for concatenation.

Regex to extract the first column from CSV line: 
^(.+?),

Regex to extract the whole flow file content:
(?s)(^.*$)

screenshot-from-2016-11-08-14-10-05

 

4. Verify email (ExecuteStreamCommand processor)

Sends the email to a bash script (validate-email.sh) that launches a Python script (VerifyEmailAddress.py) that checks the email syntax and verifies if the domain name of the mail is attached to at least one MX record.

The output of the command is stored in the “email_validation_output” attribute.

screenshot-from-2016-11-08-17-20-28

validate-email.sh

python /home/demos/validate-emails/VerifyEmailAddress.py -e "$1"

VerifyEmailAddress.py

import re
import dns.resolver
from optparse import OptionParser

# parse incoming arguments
parser = OptionParser()
parser.add_option("-e", "--email", dest="email", help ="Email address to check")
(options, args) = parser.parse_args()

# Simple Regex for syntax checking
regex = '^[_a-zA-Z0-9-]+(\.[_a-zA-Z0-9-]+)*@[a-zA-Z0-9-]+(\.[a-zA-Z0-9-]+)*(\.[a-zA-Z]{2,4})$'

# Email address to verify
inputAddress = options.email
addressToVerify = str(inputAddress)

# Syntax check
match = re.match(regex, addressToVerify)
if match == None:
	print 'Bad syntax'
    
else:
    # Get domain for DNS lookup
    splitAddress = addressToVerify.split('@')
    domain = str(splitAddress[1])
   
    #try to contact the domain 
    try:
        records = dns.resolver.query(domain, 'MX')
        mxRecord = records[0].exchange
        mxRecord = str(mxRecord)
        print "Good"
    except :
        print "Bad domain name"


5. Add verification status to line (ReplaceText processor)

Concatenates the original line with the verification status attribute and creates a new column in the CSV line.

screenshot-from-2016-11-08-15-13-28

 

6. Merge lines by verification status (MergeContent processor)

Creates a bundle (“bin”) made of all incoming lines depending on the content of the “email_validation_output” attribute. We set the minimum number of entries to 10,000 to wait that all 7,000 emails from our list are dispatched without creating a second file. We set the “Max Bin Age” property to “10 minutes” to be sure that the whole process of checking emails is done (takes about 7 minutes) before generating the output file. This 10-minutes delay is of course purely arbitrary and could certainly be improved in a production environment.
As defined in the Python script, this processor will generate 3 files: one for the “good” emails, one for emails with syntax and a third one for emails with a broken domain.

screenshot-from-2016-11-08-15-22-54

 

7. Update file name (UpdateAttribute processor)

Creates a unique file name for the output file by updating the “filename” attribute using the ${uuid} function.

screenshot-from-2016-11-08-15-24-52

 

8. Write file to output folder (PutFile)

Saves the file to the output folder.

screenshot-from-2016-11-08-17-21-29

Download this example

All files (Python script for MX record check and NiFi flow template) can be downloaded here.