Spam mail has been an almost-constant companion to the world since the internet's inception. That would make for a sweet sentiment, if not for the fact that absolutely no one likes it. Spam clutters up inboxes with unnecessary offers and scams that most people wouldn't give a second glance if they were pressed. Some emails are so over-the-top in their delivery that British comedian James Veitch made an entire career out of responding to scam/spam emails and recording the ensuing hilarity.
But how is spam filtered in the first place? And why is the process so awkward, often claiming relevant emails with the irrelevant ones? Well, let's take a peek into the figurative chocolate factory and see how things work.
First, the challenges. There is very little on the surface to distinguish spam from normal email. A computer can't recognize what is and isn't relevant just by looking at its screen. There's also the issue of variety: spam doesn't have a single distinguishing characteristic and comes in many different types. Be it unnecessary branding, annoying scams, or attempts at data phishing, there's more than one face to recognize.
Spam-recognizing AI also needs to adapt quickly as the world and its current needs change. The pandemic affected a lot of online communication, and algorithms needed to update accordingly. Ultimately, from a computer's point of view (fictional as that may be), the only common thread linking all spam together is irrelevancy to users. So, our goalposts have been set up, and now it's time to take shots.
There are some simple parameters that AI can rely on to hinder spam. First is the number of recipients. Spam emails are often sent as multiple blind carbon copies (BCC), so as not to alert their recipients. Built-in algorithms for mail websites and apps can identify excessive BCCs and categorize the email as spam. Other factors to recognize are short body texts, often just a couple of sentences, and too much capitalization. These parameters constitute the general rule of thumb that most inbox-streamlining AI tends to follow.
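To make those rules of thumb concrete, here is a minimal sketch of what such heuristic checks could look like. The thresholds (50 BCC recipients, a 300-character body, 30% capital letters) are illustrative assumptions for this example, not values any real provider publishes.

```python
# A rough sketch of the rule-of-thumb checks described above.
# All thresholds are made up for illustration.

def looks_like_spam(bcc_count: int, body: str) -> bool:
    """Flag an email using simple heuristics: excessive BCCs,
    a very short body, or heavy capitalization."""
    too_many_bcc = bcc_count > 50
    too_short = len(body.strip()) < 300
    letters = [c for c in body if c.isalpha()]
    too_shouty = bool(letters) and sum(c.isupper() for c in letters) / len(letters) > 0.3
    # Treat the message as suspicious if at least two heuristics fire.
    return sum([too_many_bcc, too_short, too_shouty]) >= 2

# Example: a short, all-caps pitch blasted to hundreds of hidden recipients.
print(looks_like_spam(bcc_count=400, body="CLAIM YOUR FREE PRIZE NOW!!!"))  # True
```

Real filters weigh many more signals than this, but the idea is the same: cheap, fast checks that catch the most obvious offenders before anything smarter gets involved.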
Let's move on to more technical territory; particularly, machine learning. AI can be taught to recognize when a certain set of words or phrases strung together strongly resembles spam. A popular, currently-used example is naive Bayes, which (to significantly simplify the concept) allows AI to learn predictive behavior when provided a set of predictors. While this method is certainly not infallible (conjunctions, for example, are words that appear in every email regardless of user relevance), it has certainly proven effective, with some work put into defining its predictors.
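As a hedged illustration of how naive Bayes can be applied to spam, here is a small sketch using scikit-learn's CountVectorizer and MultinomialNB. The tiny training set is invented for this example; real filters learn from millions of labelled messages.

```python
# A sketch of naive Bayes spam classification with scikit-learn.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy training data, invented for illustration only.
emails = [
    "WIN a FREE prize, claim your reward now",       # spam
    "Limited offer: cheap pills, no prescription",   # spam
    "Meeting moved to 3pm, see agenda attached",     # ham
    "Can you review my draft before Friday?",        # ham
]
labels = ["spam", "spam", "ham", "ham"]

# CountVectorizer turns each email into word counts (the "predictors");
# MultinomialNB learns how likely each word is in spam vs. normal mail.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(emails, labels)

print(model.predict(["Claim your free reward today"]))  # likely ['spam']
print(model.predict(["Agenda for Friday's meeting"]))   # likely ['ham']
```

The classifier simply multiplies together the per-word probabilities it learned during training, which is why the choice of predictors matters so much.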
However, since AI cannot learn to recognize spam on its own, the process must begin by feeding it examples of spam and normal mail and telling it which is which (a process labelled training). The emails must themselves be processed before training. Breaking them down into smaller, concise pieces helps the algorithm recognize particular phrases to be highlighted, as well as which words are useless at differentiating between spam and relevant emails; the aforementioned conjunctions are an example of this. Most companies even have entire libraries chock-full of samples for AI to gorge on and get better at learning.
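Here is a minimal sketch of that preprocessing step: each email is broken into word tokens, and filler words like conjunctions are dropped because they appear everywhere and carry no signal. The stop-word list below is a small, made-up subset used only for illustration.

```python
# A minimal sketch of tokenizing an email and removing stop words.
import re

# Illustrative subset; real systems use much longer stop-word lists.
STOP_WORDS = {"and", "or", "but", "the", "a", "an", "to", "of", "is"}

def tokenize(email_body: str) -> list[str]:
    """Lowercase the text, split it into words, and discard stop words."""
    words = re.findall(r"[a-z']+", email_body.lower())
    return [w for w in words if w not in STOP_WORDS]

print(tokenize("Claim the FREE prize and win a reward!"))
# ['claim', 'free', 'prize', 'win', 'reward']
```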
As mentioned before, the naive Bayes model is inherently flawed. While highly effective, it ultimately relies on statistics and probabilities to nail down spam. Again, computers don't have the knowledge or consciousness to comprehend language the way we do. Therefore, even normal mail can fall victim to AI that is unable to gather context from the emails. Ultimately, our discussion has been a simplified account of how the filtering works; companies such as Google have even more sophisticated algorithms at work. At the end of the day, this author hopes that readers will still leave this webpage feeling enlightened.