E-mail spam and non-spam filtering using Machine Learning

In this era of rapid technical advancement, electronic mail (e-mail) has gained a large user base for professional, commercial, and personal communication. In 2019, the average person received 130 emails each day, and overall, 296 billion emails were sent that year.

Because of this high demand and huge user base, there has been an upsurge in unwanted emails, also known as spam emails. At times, more than 50% of all emails were spam. Even today, people lose millions of dollars to fraud every day.

However, the figure below shows that the share of such emails has decreased significantly since 2016, thanks to the evolution of software that can detect spam emails and filter them out.

spam email filtering using machine learning image 1

Percentage of emails marked as Spam (Source: Statista)

Several techniques are available to detect spam e-mails. Broadly classified, there are 5 different techniques that algorithms use to decide whether a mail is spam or not.

1. Content-Based Filtering Technique

Algorithms analyze the words, the occurrence of words, and the distribution of words and phrases in the content of e-mails and segregate them into spam and non-spam categories.

2. Case-Based Spam Filtering Method

Algorithms trained on well-annotated spam/non-spam marked emails try to classify the incoming mails into two categories.

3. Heuristic or Rule-Based Spam Filtering Technique

Algorithms use pre-defined rules in the form of regular expressions to give a score to the messages in e-mails. Based on the scores generated, they segregate emails into spam and non-spam categories.

4. The Previous Likeness-Based Spam Filtering Technique

Algorithms extract the features of incoming mails and represent every new instance as a point in a multi-dimensional vector space. Using the KNN algorithm, each new point is assigned to the class, spam or non-spam, of its closest neighbours.
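
As a minimal sketch of the rule-based idea, the snippet below sums the scores of regular-expression rules that match a message and compares the total with a threshold. The rules, scores, threshold, and function names are all hypothetical and chosen only for illustration; real rule sets are far larger.

```python
import re

# Hypothetical rules: (compiled pattern, score). Invented for illustration only.
RULES = [
    (re.compile(r"\bfree\b", re.IGNORECASE), 1.5),
    (re.compile(r"\bwinner\b", re.IGNORECASE), 2.0),
    (re.compile(r"click here", re.IGNORECASE), 2.0),
    (re.compile(r"!{3,}"), 1.0),  # three or more exclamation marks in a row
]
SPAM_THRESHOLD = 3.0  # assumed cut-off, not taken from any real filter


def rule_score(message):
    """Sum the scores of all rules whose pattern occurs in the message."""
    return sum(score for pattern, score in RULES if pattern.search(message))


def is_spam(message):
    """Flag the message as spam when its total score reaches the threshold."""
    return rule_score(message) >= SPAM_THRESHOLD
```

With these assumed rules, a message such as "You are a WINNER, click here!!!" matches three rules and is flagged, while "Meeting moved to 3pm" matches none.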

5. Adaptive Spam Filtering Technique

Algorithms classify the incoming mails into various groups and, based on the comparison scores of every group against a defined set of groups, segregate spam and non-spam emails.

This article will give an idea for implementing content-based filtering using one of the most famous algorithms for spam detection, which is K-Nearest Neighbour (KNN).

k-NN based algorithms are widely used for classification tasks. Let's first look at the overall architecture of this implementation and then explore every step. Executing these 5 steps, one after the other, will help us implement our spam classifier smoothly.

Training Testing Phase

spam email filtering using machine learning image 2

New Email Classification

spam email filtering using machine learning image 3

Step 1: E-mail Data Collection

The dataset contained in a corpus plays a crucial role in assessing the performance of any spam filter. Many open-source datasets are freely available in the public domain. The two datasets mentioned below are widely popular, as they contain a huge number of emails.

Enron corpus dataset (created in 2006, with 55% spam emails)
TREC 2007 dataset (created in 2007, with 67% spam emails)

Train/Test Split: Split the dataset into train and test sets, but make sure that both sets contain a balanced number of ham and spam emails (ham is a fancy name for non-spam emails).
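
One way to keep the ham/spam proportions equal in both sets is a stratified split, sketched below with the standard library only. The function name, the 20% test fraction, the seed, and the toy messages are assumptions for illustration. scikit-learn's train_test_split offers a stratify parameter for the same purpose.

```python
import random
from collections import defaultdict


def stratified_split(emails, labels, test_fraction=0.2, seed=42):
    """Split so that each class keeps roughly the same share in train and test."""
    by_label = defaultdict(list)
    for email, label in zip(emails, labels):
        by_label[label].append(email)

    rng = random.Random(seed)  # fixed seed so the split is repeatable
    train, test = [], []
    for label, items in by_label.items():
        rng.shuffle(items)
        n_test = round(len(items) * test_fraction)
        test += [(email, label) for email in items[:n_test]]
        train += [(email, label) for email in items[n_test:]]
    return train, test


# Toy data: 10 ham and 10 spam messages (placeholders, not real emails).
emails = [f"ham message {i}" for i in range(10)] + [f"spam message {i}" for i in range(10)]
labels = ["ham"] * 10 + ["spam"] * 10
train, test = stratified_split(emails, labels)
```

Here each class contributes the same fraction of its messages to the test set, so the class balance of the full dataset carries over to both splits.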

Enron Corpus Dataset on Kaggle

spam email filtering using machine learning image 4

Step 2: Pre-processing of E-mail content

At this step, we mainly perform tokenization of the mails. Tokenization is a process in which we break the content of an email into words and transform long messages into a sequence of representative symbols termed tokens. These tokens are extracted from the email body, header, subject, and images.

Extracting words from images (for a simple implementation, this can be ignored): These days, senders can attach inline images to mails. Such emails may be categorized as spam based not on their text content but on the content of their images.
Believe me! This was not an easy task until the open-source OCR library Tesseract, whose development Google has sponsored, became widely available. This library extracts words from images automatically with a certain accuracy. Still, Times New Roman text and Captcha words are difficult to read automatically.
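
A minimal tokenization sketch is shown below. It lowercases the text and keeps runs of letters, digits, and apostrophes as tokens. The pattern and the function names are assumptions for illustration, and only the subject and body are handled here.

```python
import re

# Assumed token definition: runs of lowercase letters, digits, and apostrophes.
TOKEN_PATTERN = re.compile(r"[a-z0-9']+")


def tokenize(text):
    """Lowercase the text and return its tokens in order of appearance."""
    return TOKEN_PATTERN.findall(text.lower())


def email_tokens(subject, body):
    """Collect tokens from the subject first, then from the body."""
    return tokenize(subject) + tokenize(body)
```

For example, tokenize("Win a FREE prize, now!") drops the punctuation and yields the lowercase words of the sentence.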

spam email filtering using machine learning image 5

Step 3: Feature Extraction and Selection

After pre-processing, we may be left with a large number of words. We can maintain a database in which each column holds the frequency of a different word. These attributes can be grouped by importance, for example:

Important attributes: Frequency of repeated words, number of semantic discrepancies, an adult-content bag of words, etc.
Additional attributes: Sender account features such as sender country, IP address, email, age of sender, number of replies, number of recipients, and website address.
Less important attributes: Geographical distance between sender and receiver, sender's date of birth, account lifespan, sex of sender, and age of the recipient.

Note: Web addresses are converted to word format. For example, https://www.google.com/ can be converted to "HTTP google." This process is sometimes called normalization.
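
The URL normalization mentioned above could be sketched as follows. The exact rule is an assumption: both http and https are reduced to the word "http", a leading "www" is dropped, and the top-level domain is removed. The function names are hypothetical.

```python
import re
from urllib.parse import urlparse

URL_PATTERN = re.compile(r"https?://\S+")


def url_to_words(url):
    """Turn a URL into plain words, e.g. scheme word plus the domain name."""
    host = urlparse(url).hostname or ""
    labels = [part for part in host.split(".") if part and part != "www"]
    if len(labels) > 1:
        labels = labels[:-1]  # drop the top-level domain, e.g. "com"
    return " ".join(["http"] + labels)


def normalize_urls(text):
    """Replace every URL found in the text with its word form."""
    return URL_PATTERN.sub(lambda match: url_to_words(match.group(0)), text)
```

Under these assumptions, https://www.google.com/ becomes "http google", so links to the same site are counted as the same words.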

Keep in mind that the more attributes there are, the higher the time complexity of the model.
The number of attributes can be huge, so techniques like stemming, noise removal, and stop-word removal are used to reduce it. One of the best-known stemming algorithms is the Porter Stemmer algorithm.
Some general things that we do in stemming are:

Removing suffixes (-ed, -ing, -ful, -ness, etc.)
Removing prefixes (un-, re-, pre-, etc.)
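
The sketch below illustrates stop-word removal and suffix stripping on a token list. It is a deliberately naive stand-in and not the Porter Stemmer algorithm: the stop-word list, the suffix list, and the minimum stem length are assumptions, and prefixes are left untouched. Libraries such as NLTK ship a PorterStemmer class for real use.

```python
# Tiny illustrative stop-word list; real lists are much longer.
STOP_WORDS = {"a", "an", "the", "is", "are", "to", "of", "and", "in", "for"}
SUFFIXES = ("ness", "ful", "ing", "ed")  # checked in this order


def naive_stem(word):
    """Strip one known suffix, keeping at least three characters of stem."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def clean_tokens(tokens):
    """Drop stop words, then stem the remaining tokens."""
    return [naive_stem(token) for token in tokens if token not in STOP_WORDS]
```

With this sketch, "walked" and "walking" both reduce to "walk", so they share one column in the feature table instead of two.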

List of stop words

spam email filtering using machine learning image 6

Example dataset format

spam email filtering using machine learning image 7
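
As a rough illustration of such a dataset format, the sketch below turns tokenized emails into rows of word counts, one column per vocabulary word, with the label kept alongside. The two toy emails and the function names are made up for the example.

```python
from collections import Counter


def build_vocabulary(token_lists):
    """Return the sorted list of distinct tokens; each one becomes a column."""
    return sorted({token for tokens in token_lists for token in tokens})


def to_count_vector(tokens, vocabulary):
    """Count how often each vocabulary word occurs in one email."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]


# Toy tokenized emails with labels (hypothetical data).
emails = [
    (["free", "prize", "free"], "spam"),
    (["meeting", "agenda"], "ham"),
]
vocabulary = build_vocabulary(tokens for tokens, _ in emails)
rows = [(to_count_vector(tokens, vocabulary), label) for tokens, label in emails]
```

Each row is a fixed-length numeric vector, which is the form a distance-based classifier needs as input.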

Step 4: KNN (K-Nearest Neighbour) Implementation

Like the Nearest Neighbour algorithm, the K-Nearest Neighbour algorithm is used for classification. Instead of considering just the single nearest instance, it looks at the K instances closest to the new incoming instance and assigns the class that occurs most often among them. K is a hyperparameter that needs tuning. A common approach is trial and error: try several values of K and check the model's performance for each.

spam email filtering using machine learning image 8

KNN, Credit: Mathworks

To find the nearest instance, one can use the Euclidean distance. One can use the Scikit-learn library to implement the K-NN algorithm for this task.
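
To make the mechanics concrete, here is a small standard-library sketch of k-NN with Euclidean distance and majority voting, followed by a loop that tries a few values of K. The toy vectors, labels, and function names are assumptions for illustration. In scikit-learn, the corresponding estimator is KNeighborsClassifier with its n_neighbors parameter.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(train, query, k):
    """train is a list of (vector, label); return the majority label of the k nearest."""
    nearest = sorted(train, key=lambda item: euclidean(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def accuracy(train, validation, k):
    """Fraction of validation vectors whose predicted label is correct."""
    hits = sum(knn_predict(train, vec, k) == label for vec, label in validation)
    return hits / len(validation)


# Toy word-count vectors: [count of "free", count of "meeting"] (hypothetical data).
train = [([3, 0], "spam"), ([2, 0], "spam"), ([4, 1], "spam"),
         ([0, 2], "ham"), ([0, 3], "ham"), ([1, 2], "ham")]
validation = [([3, 1], "spam"), ([0, 1], "ham")]

# Trial and error over a few odd values of K (odd values avoid two-class ties).
for k in (1, 3, 5):
    print(k, accuracy(train, validation, k))
```

Sorting all training points for every query is fine for a toy example but slow on a large corpus, which is one reason to prefer a library implementation in practice.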

spam email filtering using machine learning image 9

Step 5: Performance Analysis

Now our algorithm is ready, so we must check the performance of the model.

Even a single missed important message may cause a user to reconsider the value of spam filtering.

So we must make sure that our algorithm is as close to 100% accurate as possible. But some researchers feel that considering accuracy alone as the evaluation parameter for spam classification is not enough.

Based on the table below (also known as a confusion matrix), we must evaluate our spam-classification model on 4 different parameters.

spam email filtering using machine learning image 10

Accuracy : (TP + TN)/(TP + FP + FN + TN)
Precision : TP / (TP + FP)
Sensitivity : TP / (TP + FN)
Specificity : TN / (TN + FP)
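
The four formulas above translate directly into code. In the sketch below, the function name and the confusion-matrix counts are hypothetical and serve only as a worked example.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute the four evaluation parameters from confusion-matrix counts."""
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "precision": tp / (tp + fp),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }


# Hypothetical counts for 100 test emails, with spam as the positive class.
metrics = classification_metrics(tp=40, fp=5, fn=10, tn=45)
```

With these made-up counts, accuracy is 85/100 = 0.85, precision is 40/45 (about 0.89), sensitivity is 40/50 = 0.8, and specificity is 45/50 = 0.9. The 5 false positives are ham emails sent to the junk folder, which is exactly the kind of error that accuracy alone can hide.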

More advanced algorithms are available in the market for this classification, but you can easily achieve more than 90% accuracy using k-NN based implementation.

GMAIL, YAHOO and OUTLOOK CASE STUDY

Gmail

Google data centers use thousands of rules to filter spam emails. They assign weights to different parameters and filter mails based on them. Google's spam classifier is said to be a state-of-the-art system that combines techniques such as optical character recognition, linear regression, and various neural networks.

Yahoo

Yahoo Mail is one of the earliest free webmail services and still has more than 320 million active users. It has its own filtering techniques to categorize emails. Yahoo's basic methods are URL filtering, email content analysis, and spam complaints from users. Unlike Gmail, Yahoo filters email messages by domain and not by IP address. Yahoo also provides custom filtering options that let users send mail directly to the junk folder.

Outlook

Outlook is a Microsoft-owned mailing platform widely used among professionals. In 2013, Microsoft renamed Hotmail and Windows Live Mail to Outlook. At present, Outlook has more than 400 million active users. Outlook has its own distinctive features for filtering every incoming mail. According to its official website, it uses the following spam filter lists to send mail to the junk folder:

Safe Senders list
Safe Recipients list
Blocked Senders list
Blocked Top-Level Domains list
Blocked Encodings list

Conclusion

Given the number of spam emails sent daily and the amount of money people lose every day because of spam scams, spam filtering has become a primary need for all email-providing companies. This article discussed the complete process of spam email filtering using machine learning. We also covered one possible way of implementing our own spam classifier using one of the most famous algorithms, k-NN. Finally, we discussed case studies of famous services like Gmail, Outlook, and Yahoo to review how they use ML and AI techniques to filter such spam.

Possible Interview Questions

Question 1: What is the Porter Stemmer algorithm?
Question 2: Why use the k-NN algorithm for this problem?
Question 3: Is this supervised learning or unsupervised learning?
Question 4: What are the different algorithms that can replace k-NN here?
Question 5: What steps can be taken to improve accuracy further?

References

Emmanuel Gbenga Dada, Joseph Stephen Bassi, Machine learning for email spam filtering: review, approaches, and open research problems.
Loredana Firte, Camelia Lemnaru, Spam Detection Filter using KNN Algorithm and Resampling.
Anirudh Harisinghaney, Aman Dixit, Text and Image-Based Spam Email Classification using KNN, Naive Bayes, and Reverse DBSCAN Algorithm.