Spam Detection using Count Vectorizer – with source code – easy explanation – 2024

In today’s blog, we will see how to perform spam detection in the simplest way possible with the help of a Count Vectorizer and the Multinomial Naive Bayes algorithm. This is going to be a very fun project. So without further ado, let’s do it…

Check out the video here – https://youtu.be/hc70yiJUsc4

Step 1 – Importing libraries required for Spam detection.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import TfidfVectorizer,CountVectorizer
from sklearn.model_selection import train_test_split
from wordcloud import WordCloud

%matplotlib inline

Step 2 – Reading the SMS data.

sms = pd.read_csv('spam.csv',encoding='ISO-8859-1')
sms.head()
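  • We pass encoding='ISO-8859-1' because the file contains bytes that are not valid UTF-8, so pandas' default UTF-8 decoding would likely fail. ISO-8859-1 (Latin-1) maps every byte to a character, so reading always succeeds. Note that it handles accented and special Latin characters, not emojis.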
input data
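
To see why the encoding argument matters, here is a minimal sketch using only the standard library. The byte string is a made-up example, not a row from spam.csv; it assumes the file contains a byte such as 0xA3 (the pound sign in ISO-8859-1), which is not valid UTF-8 on its own.

```python
# Hypothetical raw bytes, as they might appear in a Latin-1 encoded CSV.
raw = b"Win \xa3100 cash now"

# Decoding as UTF-8 fails because 0xA3 is not a valid UTF-8 start byte.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# ISO-8859-1 maps every byte to a character, so decoding always succeeds.
text = raw.decode("ISO-8859-1")

print(utf8_ok)  # False
print(text)     # Win £100 cash now
```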

Step 3 – Delete the unnecessary columns.

cols_to_drop = ['Unnamed: 2','Unnamed: 3','Unnamed: 4']
sms.drop(cols_to_drop,axis=1,inplace=True)
sms.columns = ['label','message']
sms.head()
cleaned data

Step 4 – Checking the info of our SMS data.

sms.info()

Step 5 – Declaring Count Vectorizer.

cv = CountVectorizer(decode_error='ignore')
X = cv.fit_transform(sms['message'])

X_train, X_test, y_train, y_test = train_test_split(X, sms['label'], test_size=0.3, random_state=101)
  • Count Vectorizer creates a bag of words: every word in the vocabulary becomes a column, every document becomes a row, and each cell holds the number of times that word occurs in that document.
  • Each row represents one message.
  • We then split the data into training (70%) and testing (30%) sets.
sms['message'][0]
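
To make the bag-of-words idea concrete, here is a small pure-Python sketch that builds the same kind of count matrix by hand. The two messages and the helper name bag_of_words are invented for illustration, and the whitespace tokenizer is a simplification: CountVectorizer's default tokenizer also drops punctuation and single-character tokens.

```python
from collections import Counter

def bag_of_words(messages):
    """Return (vocabulary, matrix) with one row per message, one column per word."""
    tokenized = [msg.lower().split() for msg in messages]
    # Sorted vocabulary, so the column order is deterministic.
    vocabulary = sorted({word for tokens in tokenized for word in tokens})
    matrix = []
    for tokens in tokenized:
        counts = Counter(tokens)
        matrix.append([counts[word] for word in vocabulary])
    return vocabulary, matrix

# Toy messages (hypothetical, not taken from spam.csv).
messages = ["free prize free entry", "see you at lunch"]
vocabulary, matrix = bag_of_words(messages)

print(vocabulary)  # ['at', 'entry', 'free', 'lunch', 'prize', 'see', 'you']
for row in matrix:
    print(row)
# [0, 1, 2, 0, 1, 0, 0]
# [1, 0, 0, 1, 0, 1, 1]
```

Each row is one message and each column is one vocabulary word, which is the layout described in the bullets above.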

Step 6 – Creating a Multinomial Naive Bayes model for Spam detection.

mnb = MultinomialNB()
mnb.fit(X_train,y_train)
print('training accuracy is --> ',mnb.score(X_train,y_train)*100)
print('test accuracy is --> ',mnb.score(X_test,y_test)*100)
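  • Here we declare a Multinomial Naive Bayes model and fit it on the training data.
  • Naive Bayes makes the naive assumption that features (here, word counts) are independent of each other given the class. The assumption rarely holds exactly, but it keeps the model simple and fast, and it tends to work well on text data.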
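
To show what the independence assumption means in practice, here is a rough from-scratch sketch of multinomial Naive Bayes with add-one (Laplace) smoothing. It is not how scikit-learn implements MultinomialNB internally; the function names train_nb and predict_nb and the four training messages are made up for illustration.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels, alpha=1.0):
    """Estimate log priors and smoothed log word likelihoods per class."""
    vocab = {w for doc in docs for w in doc.split()}
    word_counts = defaultdict(Counter)   # class -> word -> count
    class_counts = Counter(labels)       # class -> number of documents
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc.split())
    log_prior = {c: math.log(n / len(docs)) for c, n in class_counts.items()}
    log_like = {}
    for c in class_counts:
        total = sum(word_counts[c].values()) + alpha * len(vocab)
        log_like[c] = {w: math.log((word_counts[c][w] + alpha) / total) for w in vocab}
    return log_prior, log_like

def predict_nb(message, log_prior, log_like):
    """Pick the class with the highest log prior plus summed word log likelihoods."""
    scores = {}
    for c in log_prior:
        # Independence assumption: each word contributes its own term.
        # Words outside the training vocabulary are ignored.
        scores[c] = log_prior[c] + sum(
            log_like[c][w] for w in message.split() if w in log_like[c]
        )
    return max(scores, key=scores.get)

# Toy training set (hypothetical messages).
docs = ["free prize now", "claim free cash", "lunch at noon", "see you at home"]
labels = ["spam", "spam", "ham", "ham"]
log_prior, log_like = train_nb(docs, labels)

print(predict_nb("free cash prize", log_prior, log_like))   # spam
print(predict_nb("see you at lunch", log_prior, log_like))  # ham
```

Because every word adds an independent term to the score, training only needs per-class word counts, which is why the model is so cheap to fit on large vocabularies.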

Step 7 – Visualizing the results.

def visualize(label):
    words = ''
    for msg in sms[sms['label']==label]['message']:
        msg = msg.lower()
        words+=msg + ' '
    wordcloud = WordCloud(width=600,height=400).generate(words)
    plt.imshow(wordcloud)
    plt.axis('off')
  • This function creates a wordcloud from all the words in the messages of the given category (spam or ham).
  • The size of a word in the wordcloud reflects its frequency: the bigger the word, the more often it occurs; the smaller the word, the less often it occurs.
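
The frequencies behind the word sizes can be approximated with a plain word count. This is a minimal sketch on three invented spam messages; WordCloud applies its own tokenization and stop-word filtering, which this sketch skips.

```python
from collections import Counter

# Toy spam messages (hypothetical, not taken from spam.csv).
spam_messages = ["FREE entry to win", "Win a free prize", "Free cash waiting"]

# Same preprocessing as visualize(): lowercase and join into one string.
words = " ".join(msg.lower() for msg in spam_messages)

# Count how often each word occurs; more frequent words are drawn bigger.
frequencies = Counter(words.split())

print(frequencies.most_common(2))  # [('free', 3), ('win', 2)]
```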
visualize('spam')
Wordcloud of Spam messages

NOTE – Here “free” is the biggest word, which fits the intuition that messages containing the word “free” are mostly spam.

visualize('ham')
Wordcloud of Ham messages

Step 8 – Live Spam detection.

# just type in your message and run
your_message = 'You are the lucky winner for the lottery price of $6million.'
message_vector = cv.transform([your_message])  # reuse the already fitted vectorizer
predicted_class = mnb.predict(message_vector)
print(f'This is a {predicted_class[0]} message')
  • Just type your message into ‘your_message’ and run this cell, and the model will try its best to classify it as either spam or ham.
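
If you also want to know how confident the model is, one option is predict_proba, which MultinomialNB provides. The sketch below is self-contained and trains on four invented messages; the names toy_cv, toy_mnb and classify are hypothetical, and in the post you would use the fitted cv and mnb instead.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy training data (hypothetical); in the post, cv and mnb are fitted on spam.csv.
train_messages = [
    "win a free lottery prize now",
    "free cash winner claim your prize",
    "are we still meeting for lunch",
    "call me when you get home",
]
train_labels = ["spam", "spam", "ham", "ham"]

toy_cv = CountVectorizer()
toy_mnb = MultinomialNB()
toy_mnb.fit(toy_cv.fit_transform(train_messages), train_labels)

def classify(message):
    """Return the predicted label and the probability the model assigns to it."""
    features = toy_cv.transform([message])
    probabilities = toy_mnb.predict_proba(features)[0]
    best = probabilities.argmax()
    return str(toy_mnb.classes_[best]), float(probabilities[best])

label, confidence = classify("you are the lucky winner of a free prize")
print(label, round(confidence, 2))
```

With so little training data the probability is only illustrative; on the real dataset it gives a rough sense of how borderline a message is.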

Download Source Code for Spam detection …

Do let me know if you have any queries regarding spam detection by contacting me via email or LinkedIn.

So this is all for this blog, folks. Thanks for reading it; I hope you are taking something with you after reading this. Till next time…

Read my previous post: PREDICTING THE TAX OF A HOUSE USING RANDOM FOREST – BOSTON HOUSING DATA

Check out my other machine learning projects, deep learning projects, computer vision projects, NLP projects, and Flask projects at machinelearningprojects.net.
