Spam Detection using Count Vectorizer – with source code – easy explanation – 2022

In today’s blog, we will see that how we can perform Spam detection in the simplest way possible with the help of Count Vectorizer and Multinomial Naive Bayes algorithm. This is going to be a very fun project. So without any further due.

Let’s do it…

Step 1 – Importing libraries required for Spam detection .

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import TfidfVectorizer,CountVectorizer
from sklearn.model_selection import train_test_split
from wordcloud import WordCloud

%matplotlib inline

Step 2 – Reading the SMS data.

sms = pd.read_csv('spam.csv',encoding='ISO-8859-1')
sms.head()
  • We have passed encoding=’ISO-8859-1′ so that it can tackle special characters like emojis also.
Spam Detection
input data

Step 3 – Delete the unnecessary columns.

cols_to_drop = ['Unnamed: 2','Unnamed: 3','Unnamed: 4']
sms.drop(cols_to_drop,axis=1,inplace=True)
sms.columns = ['label','message']
sms.head()
Spam Detection
cleaned data

Step 4 – Checking the info of our SMS data.

sms.info()
Spam Detection

Step 5 – Declaring Count Vectorizer.

cv = CountVectorizer(decode_error='ignore')
X = cv.fit_transform(sms['message'])

X_train, X_test, y_train, y_test = train_test_split(X, sms['label'], test_size=0.3, random_state=101)
  • Count Vectorizer simply creates a Bag of words. In Bag of words, all the words in the vocabulary go along columns and documents go along rows.
  • Each row depicts one message.
  • Then we are simply splitting our data into training and testing data.
sms['message'][0]

Step 6 – Creating Multinomial Naive Bayes model for Spam detection.

mnb = MultinomialNB()
mnb.fit(X_train,y_train)
print('training accuracy is --> ',mnb.score(X_train,y_train)*100)
print('test accuracy is --> ',mnb.score(X_test,y_test)*100)
Spam Detection accuracy
  • Here we are just declaring Multinomial NB.
  • Naive Bayes works best on text data because of its naive assumption that features are independent of each other.

Step 7 – Visualizing the results.

def visualize(label):
    words = ''
    for msg in sms[sms['label']==label]['message']:
        msg = msg.lower()
        words+=msg + ' '
    wordcloud = WordCloud(width=600,height=400).generate(words)
    plt.imshow(wordcloud)
    plt.axis('off')
  • This is just a function which creates a wordcloud for all the words in spam and ham categories.
  • This is just a function that creates a wordcloud for all the words in spam and ham categories.
  • Bigger words in wordcloud depict their high frequency.
  • The bigger the words, the more it occurs.
  • Smaller the word, the lesser it occurs.
visualize('spam')
Spam Detection wordcloud spam
Wordcloud of Spam messages

NOTE – Like here free is the biggest word, and as we know that messages which contain free word are mostly SPAM.

visualize('ham')
Spam Detection wordcloud ham
Wordcloud of Ham messages

Step 8 – Live Spam detection.

your_message = 'You are the lucky winner for the lottery price of $6million.'

your_message = cv.transform([your_message])
class = mnb.predict(your_message)
print(class[0])
  • Just type in your message in ‘your_message’ and run this cell and it will try its best to classify it as either spam or ham.

Download Source Code for Spam detection …

Do let me know if there’s any query regarding Spam detection by contacting me on email or LinkedIn.

So this is all for this blog folks, thanks for reading it and I hope you are taking something with you after reading this and till the next time ?…

Read my previous post: PREDICTING THE TAX OF A HOUSE USING RANDOM FOREST – BOSTON HOUSING DATA

Check out my other machine learning projectsdeep learning projectscomputer vision projectsNLP projectsFlask projects at machinelearningprojects.net.

Leave a Comment

Your email address will not be published.