Spam Detection using Count Vectorizer – with source code – easy explanation – 2023

In today’s blog, we will see how to perform spam detection in the simplest way possible with the help of a Count Vectorizer and the Multinomial Naive Bayes algorithm. This is going to be a very fun project. So without further ado, let’s do it…

Check out the video here – https://youtu.be/hc70yiJUsc4

Step 1 – Importing libraries required for Spam detection.

```import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import TfidfVectorizer,CountVectorizer
from sklearn.model_selection import train_test_split
from wordcloud import WordCloud

%matplotlib inline```

Step 2 – Reading the SMS data.

```sms = pd.read_csv('spam.csv',encoding='ISO-8859-1')```
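• We have passed encoding='ISO-8859-1' so that reading the file does not fail on bytes that are not valid UTF-8. ISO-8859-1 maps every byte to a character, so special characters such as currency symbols are read without errors. Note that it is a single-byte encoding, so it cannot represent emojis.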
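
To see why this encoding avoids read errors, here is a minimal sketch on a made-up byte string (it is not taken from spam.csv). The byte 0xA3 on its own is not valid UTF-8, but ISO-8859-1 decodes it as the pound sign.

```python
# Hypothetical raw bytes for illustration; 0xA3 is the pound sign in ISO-8859-1.
raw = b"Win \xa3100 cash now"

try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    # A lone 0xA3 byte is not a valid UTF-8 sequence.
    utf8_ok = False

# ISO-8859-1 assigns a character to every byte value, so this always succeeds.
latin1_text = raw.decode("ISO-8859-1")

print(utf8_ok)      # False
print(latin1_text)  # Win £100 cash now
```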

Step 3 – Delete the unnecessary columns.

```cols_to_drop = ['Unnamed: 2','Unnamed: 3','Unnamed: 4']
sms.drop(cols_to_drop,axis=1,inplace=True)
sms.columns = ['label','message']```

Step 4 – Checking the info of our SMS data.

`sms.info()`

Step 5 – Declaring Count Vectorizer.

```cv = CountVectorizer(decode_error='ignore')
X = cv.fit_transform(sms['message'])

X_train, X_test, y_train, y_test = train_test_split(X, sms['label'], test_size=0.3, random_state=101)```
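• Count Vectorizer builds a Bag of Words: every word in the vocabulary becomes a column, every document becomes a row, and each cell holds how many times that word occurs in that document.
• Each row represents one message.
• Then we split our data into training and testing data, keeping 30% for testing. Note that the vectorizer is fitted on all messages before the split, so its vocabulary also contains words that only appear in the test messages.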
`sms['message'][0]`
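
To make the Bag of Words idea concrete, here is a minimal sketch on two made-up messages (they are not from the dataset). The names `docs`, `toy_cv` and `toy_X` are only for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two made-up messages, used only to show the shape of the output.
docs = [
    "Free prize, claim your free prize now",
    "Are we meeting for lunch today",
]

toy_cv = CountVectorizer()
toy_X = toy_cv.fit_transform(docs)  # sparse matrix: one row per message

# vocabulary_ maps each word to its column index; sort the words by that index.
columns = sorted(toy_cv.vocabulary_, key=toy_cv.vocabulary_.get)
print(columns)
print(toy_X.toarray())
```

The first row has a 2 in the columns for "free" and "prize" because both words occur twice in the first message. By default the words are lowercased and punctuation is dropped.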

Step 6 – Creating a Multinomial Naive Bayes model for Spam detection.

```mnb = MultinomialNB()
mnb.fit(X_train,y_train)
print('training accuracy is --> ',mnb.score(X_train,y_train)*100)
print('test accuracy is --> ',mnb.score(X_test,y_test)*100)```
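• Here we create a Multinomial Naive Bayes model, fit it on the training data, and print its accuracy on both the training and the test data.
• Naive Bayes tends to work well on text data even though it makes the naive assumption that features (here, word counts) are independent of each other given the class. This assumption keeps the model simple and fast to train on large, sparse vocabularies.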
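
The independence assumption means the score of a class is just the log prior plus a sum of per-word log probabilities. Below is a rough sketch that computes these scores by hand on a made-up count matrix and compares the result with scikit-learn. The data, the three-word vocabulary and the helper `manual_log_scores` are all hypothetical.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Made-up word-count matrix: rows are messages, columns are the
# words ["free", "prize", "lunch"]. The labels are hypothetical too.
X_toy = np.array([
    [2, 1, 0],  # spam
    [1, 2, 0],  # spam
    [0, 0, 2],  # ham
    [1, 0, 1],  # ham
])
y_toy = np.array(["spam", "spam", "ham", "ham"])

model = MultinomialNB(alpha=1.0).fit(X_toy, y_toy)

def manual_log_scores(x):
    """Unnormalized log P(class) + sum of count * log P(word | class)."""
    scores = []
    for label in model.classes_:  # classes_ is sorted: ham, spam
        rows = X_toy[y_toy == label]
        word_totals = rows.sum(axis=0)
        # Laplace smoothing (alpha=1) keeps unseen words from zeroing the product.
        log_p_word = np.log((word_totals + 1.0) / (word_totals.sum() + X_toy.shape[1]))
        log_prior = np.log(len(rows) / len(X_toy))
        scores.append(log_prior + x @ log_p_word)
    return np.array(scores)

new_msg = np.array([1, 1, 0])  # contains "free" and "prize" once each
scores = manual_log_scores(new_msg)
manual_label = model.classes_[scores.argmax()]
print(manual_label, model.predict([new_msg])[0])
```

Both the manual calculation and the fitted model pick spam for this message, because "free" and "prize" are far more frequent in the made-up spam rows.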

Step 7 – Visualizing the results.

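```def visualize(label):
    # join all messages of the given label into one lowercase string
    words = ''
    for msg in sms[sms['label']==label]['message']:
        msg = msg.lower()
        words+=msg + ' '
    # build the word cloud and show it without axes
    wordcloud = WordCloud(width=600,height=400).generate(words)
    plt.imshow(wordcloud)
    plt.axis('off')```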
• This function creates a word cloud from all the words in the messages of the given category (spam or ham).
• The size of a word in the word cloud reflects its frequency: the bigger the word, the more often it occurs, and the smaller the word, the less often it occurs.
`visualize('spam')`

NOTE – For example, 'free' is the biggest word here, which is what we would expect, since messages containing the word 'free' are mostly spam.

`visualize('ham')`
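
The frequencies behind a word cloud can also be checked directly. Here is a minimal sketch with `collections.Counter` on a few made-up spam messages, standing in for `sms[sms['label']=='spam']['message']`. WordCloud may apply its own tokenization and stopword filtering, so its counts will not necessarily match a plain count like this one.

```python
from collections import Counter

# Hypothetical spam messages, used here instead of the real dataset.
spam_messages = [
    "FREE entry to win a prize",
    "Claim your free prize now",
    "Free cash, call now",
]

# Lowercase each message, split on whitespace, and strip basic punctuation.
tokens = [
    word.strip(".,!?")
    for msg in spam_messages
    for word in msg.lower().split()
]
counts = Counter(tokens)

# The most frequent word is the one a word cloud would draw largest.
print(counts.most_common(1))  # [('free', 3)]
```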

Step 8 – Live Spam detection.

```# just type in your message and run
your_message = 'You are the lucky winner for the lottery price of $6million.'
# reuse the already fitted vectorizer so the columns match the training data
message_vector = cv.transform([your_message])
predicted_class = mnb.predict(message_vector)
print(f'This is a {predicted_class[0]} message')```
• Just type your message into 'your_message' and run this cell, and the model will try its best to classify it as either spam or ham.
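
If you want to classify many messages, the vectorizer and the model can be bundled into one object. The sketch below uses scikit-learn's `make_pipeline` on a tiny made-up training set. The helper name `classify` and the example messages are hypothetical, and this is one possible arrangement rather than the setup used above. A side benefit is that the vocabulary is learned only from the messages passed to `fit`.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny made-up training set; with the real data this would be the
# training split of sms['message'] and sms['label'].
train_messages = [
    "free prize claim now",
    "win free cash now",
    "claim your free lottery prize",
    "are we meeting for lunch",
    "see you at home tonight",
    "call me when you are free",
]
train_labels = ["spam", "spam", "spam", "ham", "ham", "ham"]

# The pipeline fits the vectorizer and the classifier on the same
# training messages, so held-out messages never shape the vocabulary.
spam_pipeline = make_pipeline(CountVectorizer(), MultinomialNB())
spam_pipeline.fit(train_messages, train_labels)

def classify(message):
    """Hypothetical helper: return the predicted label for one raw message."""
    return spam_pipeline.predict([message])[0]

print(classify("claim your free prize"))   # spam
print(classify("are we meeting tonight"))  # ham
```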