# Topic Modeling using Latent Dirichlet Allocation – easiest way – with source code – 2022

So guys, in today’s blog we will see how to perform topic modeling using Latent Dirichlet Allocation. In topic modeling, we try to group objects (documents in this case) on the basis of the words they share. If 2 documents contain similar words, there is a high chance that they fall under the same category. So without wasting any time.

### Let’s do it…

#### Step 1 – Importing required libraries.

```python
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
```

#### Step 2 – Reading input data.

```python
articles = pd.read_csv('npr.csv')
```

#### Step 3 – Checking info of our data.

`articles.info()`
• Our data has just one column, named Article, with 11992 entries.

#### Step 4 – Creating a Document Term Matrix of our data.

```python
cv = CountVectorizer(max_df=0.95, min_df=2, stop_words='english')
dtm = cv.fit_transform(articles['Article'])
dtm.shape
```
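• Here we use CountVectorizer to convert our documents into arrays of word counts. With max_df=0.95 and min_df=2, words that appear in more than 95% of the documents or in fewer than 2 documents are ignored, and stop_words='english' drops common English stop words.
• The dtm has the shape (11992, 54777), where 11992 is the number of documents in our dataset and 54777 is the number of distinct words in our vocabulary.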

#### Step 5 – Initializing Latent Dirichlet Allocation object.

```python
LDA = LatentDirichletAllocation(n_components=7, random_state=42)
topic_results = LDA.fit_transform(dtm)
LDA.components_.shape
```
• Initialize the LatentDirichletAllocation object with 7 topics (n_components=7); random_state=42 makes the results reproducible.
• Fit this object on the document term matrix we created above.
• Then check the shape of its components.
• The shape of LDA.components_ is (7, 54777), where 7 is the number of topics and 54777 is the size of the vocabulary.
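The same shapes can be checked on a tiny example. This is only a sketch: the count matrix `toy_counts` is invented, and 2 topics are used instead of 7 so the output stays readable.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

# Hypothetical 4-document x 5-word count matrix (not from npr.csv)
toy_counts = np.array([
    [3, 2, 0, 0, 0],
    [2, 3, 1, 0, 0],
    [0, 0, 2, 3, 2],
    [0, 1, 3, 2, 3],
])

toy_lda = LatentDirichletAllocation(n_components=2, random_state=42)
doc_topics = toy_lda.fit_transform(toy_counts)

print(toy_lda.components_.shape)  # (n_components, vocabulary size)
print(doc_topics.shape)           # (number of documents, n_components)
print(doc_topics.sum(axis=1))     # each row is a distribution over topics

# components_ holds unnormalized word weights; dividing each row by its
# sum gives a per-topic probability distribution over words
topic_word = toy_lda.components_ / toy_lda.components_.sum(axis=1, keepdims=True)
print(topic_word.round(3))
```

Note that fit_transform returns one row per document, while components_ has one row per topic, which is why the two shapes differ.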

#### Step 6 – Printing a list of features/words on which clustering will be done.

```python
# get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2
feature_names = cv.get_feature_names_out()

for i, arr in enumerate(LDA.components_):
    print(f'TOP 15 WORDS FOR TOPIC #{i}')
    # argsort() is ascending, so the last 15 indices are the highest-weighted words
    print([feature_names[j] for j in arr.argsort()[-15:]])
    print('\n\n')
```
• arr.argsort() returns the word indices sorted in ascending order of their weight in that topic, so the last 15 indices are the 15 most probable words for that topic.
• cv.get_feature_names_out() returns all the words in our vocabulary, in column order. It replaces cv.get_feature_names(), which was removed in scikit-learn 1.2.
• The top 15 words of topic #0 include companies, money, year, percent, etc., so it looks like a finance topic.
• Topic #1 seems to be a political topic.
• Topic #3 seems to be a health topic.
• Topic #6 looks like an education topic.
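The argsort trick is easier to follow on a few numbers. Below is a small sketch with a made-up vocabulary and made-up topic weights; the helper `top_words` is a hypothetical name, not something from scikit-learn.

```python
import numpy as np

def top_words(weights, vocab, n=3):
    """Return the n highest-weighted words for one topic, best first."""
    # argsort() is ascending, so take the last n indices and reverse them
    top_idx = np.argsort(weights)[-n:][::-1]
    return [vocab[j] for j in top_idx]

# Hypothetical vocabulary and topic-word weights, for illustration only
vocab = ['bank', 'money', 'school', 'student', 'vote']
components = np.array([
    [9.0, 7.5, 0.2, 0.1, 1.0],   # finance-like topic
    [0.3, 0.8, 6.0, 8.0, 0.5],   # education-like topic
])

for topic_id, weights in enumerate(components):
    print(topic_id, top_words(weights, vocab))
```

Unlike the loop in this step, the helper reverses the slice so the strongest word comes first, which can make the topics a bit quicker to label by eye.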

#### Step 7 – Final results.

```python
articles['topic'] = topic_results.argmax(axis=1)
articles
```
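• Finally, each document gets the number of its most probable topic: topic_results holds one row of topic probabilities per document, and argmax(axis=1) picks the index of the largest value in each row.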
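Here is a small sketch of that argmax step on invented probabilities. The array `toy_topic_results` is hypothetical, and the `confidence` value is an optional extra that is not part of the original workflow.

```python
import numpy as np

# Hypothetical document-topic probabilities (each row sums to 1)
toy_topic_results = np.array([
    [0.80, 0.15, 0.05],
    [0.10, 0.30, 0.60],
    [0.34, 0.33, 0.33],
])

topic = toy_topic_results.argmax(axis=1)    # index of the most probable topic
confidence = toy_topic_results.max(axis=1)  # probability of that topic

print(topic)
print(confidence)
```

A low maximum, as in the last row, suggests the document is a mix of topics, so its single topic number should be read with some caution.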