Site icon Machine Learning Projects

Singular Value Decomposition – with source code – easiest way – 2022

Machine Learning Projects

So guys, in today’s blog we will see that how we can perform Singular Value Decomposition of some book titles we are having in our dataset using TruncatedSVD. This transformer performs linear dimensionality reduction by means of truncated singular value decomposition (SVD).

Contrary to PCA, this estimator does not center the data before computing the singular value decomposition. This means it can work with sparse matrices efficiently. When we perform SVD (Singular Value Decomposition) on text data it is also called LSA (Latent Semantic Analysis).

So without wasting any time, Let’s do it…

Checkout the video here – https://youtu.be/D3cwjRJOmp8

Step 1 – Importing libraries required for Singular Value Decomposition.

import nltk
from nltk.stem import WordNetLemmatizer
import numpy as np
from sklearn.decomposition import TruncatedSVD
import matplotlib.pyplot as plt
import pandas as pd
%matplotlib inline

nltk.download('punkt')
nltk.download('wordnet')

Step 2 – Reading lines from our text file.

titles = [line.strip() for line in open('all_book_titles.txt')]
titles

Step 3 – Creating a Stopwords set.

stopwords = set(word.strip() for word in open('stopwords.txt'))
stopwords = stopwords.union({
    'introduction', 'edition', 'series', 'application',
    'approach', 'card', 'access', 'package', 'plus', 'etext',
    'brief', 'vol', 'fundamental', 'guide', 'essential', 'printed',
    'third', 'second', 'fourth', })

word_lemmatizer = WordNetLemmatizer()

Step 4 – Creating tokenizer function.

def tokenizer(s):
    s = s.lower()
    tokens = nltk.tokenize.word_tokenize(s)
    tokens = [t for t in tokens if len(t)>2]
    tokens = [word_lemmatizer.lemmatize(t) for t in tokens]
    tokens = [t for t in tokens if t not in stopwords]
    tokens = [t for t in tokens if not any(c.isdigit() for c in t)]
    return tokens

Step 5 – Checking tokenizer.

tokenizer('my name is abhishek and i am 19 years old!!')

Step 6 – Creating word_2_int and int_2_word dictionaries.

word_2_int = {}
int_2_words = {}
ind = 0
error_count = 0

for title in titles:
    try:
        title = title.encode('ascii', 'ignore').decode('utf-8') # this will throw exception if bad characters
        tokens = tokenizer(title)
        for token in tokens:
            if token not in word_2_int:
                word_2_int[token] = ind
                int_2_words[ind]=token
                ind += 1
    except Exception as e:
        print(e)
        print(title)
        error_count += 1

Step 7 – Creating tokens_2_vectors function.

def tokens_2_vectors(tokens):
    X = np.zeros(len(word_2_int))
    for t in tokens:
        try:
            index = word_2_int[t]
            X[index]=1
        except:
            pass
    return X

Step 8 – Creating a final matrix and fitting it into our SVD.

final_matrix = np.zeros((len(titles),len(word_2_int)))

for i in range(len(titles)):
    title = titles[i]
    token = tokenizer(title)
    final_matrix[i,:] = tokens_2_vectors(token)

svd = TruncatedSVD()
Z = svd.fit_transform(final_matrix)
Z.shape

Step 9 – Visualize the results.

fig = plt.figure(figsize=(15,9))
plt.scatter(Z[:,0],Z[:,1])
for i in range(len(word_2_int)):
    plt.annotate(int_2_words[i],(Z[i,0],Z[i,1]))

Download Source Code for Singular Value Decomposition…

Do let me know if there’s any query regarding Singular Value Decomposition by contacting me on email or LinkedIn. I have tried my best to explain this code.

So this is all for this blog folks, thanks for reading it and I hope you are taking something with you after reading this and till the next time ?…

Read my previous post: TOPIC MODELING USING LATENT DIRICHLET ALLOCATION

Check out my other machine learning projectsdeep learning projectscomputer vision projectsNLP projectsFlask projects at machinelearningprojects.net.

Exit mobile version