# Wine Quality Prediction – with source code – 2023

In today’s blog, we will see how to build a Wine Quality Prediction model using the Random Forest algorithm. So without further ado, let’s do it…

Check out the video here – https://youtu.be/7_IjRrns_uo

## Step 1 – Importing libraries required for Wine Quality Prediction.

```
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.svm import SVC
import matplotlib.pyplot as plt
from sklearn.linear_model import SGDClassifier
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report,confusion_matrix,accuracy_score

%matplotlib inline
```

## Step 2 – Read input data.

```
wine = pd.read_csv('winequality-red.csv')
```

## Step 3 – Describe the data.

`wine.describe()`

## Step 4 – Take info from the data.

`wine.info()`
• From the output of `wine.info()`, we can infer that there are no NULL values in our data.
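The null check can also be made explicit instead of reading it off the `info()` summary. Here is a minimal sketch on a tiny made-up frame (the values are invented for illustration and one missing entry is planted on purpose; `toy`, `null_counts` and `has_no_nulls` are names I chose, not part of the original notebook):

```python
import numpy as np
import pandas as pd

# Toy frame with invented values; one missing entry is planted on purpose
toy = pd.DataFrame({
    'alcohol': [9.4, np.nan, 10.5],
    'pH': [3.51, 3.20, 3.26],
})

# Number of missing values in each column
null_counts = toy.isnull().sum()
print(null_counts)

# True only when no column contains a missing value
has_no_nulls = not toy.isnull().values.any()
print(has_no_nulls)
```

On the real data you would run the same two checks on `wine`; if every count is zero, no imputation step is needed.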

## Step 5 – Plot out the data.

```
fig = plt.figure(figsize=(15,10))

# One bar plot per feature, each showing the feature against quality
features = ['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar',
            'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'density',
            'pH', 'sulphates', 'alcohol']

for i, feature in enumerate(features, start=1):
    plt.subplot(3,4,i)
    sns.barplot(x='quality',y=feature,data=wine)

plt.tight_layout()
```
• From the plots we can infer:
• Quality is high when volatile acidity is low.
• Quality is high when citric acid is high.
• Quality is high when chlorides are low.
• Quality is high when sulphates are high.
• Quality is high when alcohol is high.
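By default, `sns.barplot` draws the mean of the y column for each x category, so the same trends can be checked numerically with a group-by. A small sketch with invented numbers (this is toy data, not the real dataset):

```python
import pandas as pd

# Toy data with invented values (not the real dataset)
toy = pd.DataFrame({
    'quality': [5, 5, 6, 6, 7, 7],
    'alcohol': [9.0, 10.0, 10.0, 11.0, 11.5, 12.5],
})

# Mean alcohol per quality score: the bar heights sns.barplot would draw
mean_alcohol = toy.groupby('quality')['alcohol'].mean()
print(mean_alcohol)
```

If the means rise with the quality score, as they do in this toy frame, the bars rise from left to right in the plot.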

## Step 6 – Count the number of instances of each class.

`wine['quality'].value_counts()`
• We can see that there are 6 quality classes (3, 4, 5, 6, 7 and 8), but we want a simpler two-class problem.
• So we will mark every rating from 3 to 6 as BAD and ratings of 7 and 8 as GOOD.

## Step 7 – Make just 2 categories: good and bad.

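```
ranges = (2,6.5,8)
# Label names assumed here; the original definition of `groups` was missing
groups = ['bad','good']
wine['quality'] = pd.cut(wine['quality'],bins=ranges,labels=groups)
```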
• Here we use pd.cut() to bin the ratings into 2 categories: above 2 and up to 6.5 as BAD, above 6.5 and up to 8 as GOOD.
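To see exactly where each rating lands, here is a minimal sketch that runs the same bins over the six possible scores. The label names `'bad'` and `'good'` are my assumption for illustration:

```python
import pandas as pd

# The six quality scores that occur in the dataset
scores = pd.Series([3, 4, 5, 6, 7, 8])

# Bins are right-inclusive by default: (2, 6.5] and (6.5, 8]
labels = pd.cut(scores, bins=(2, 6.5, 8), labels=['bad', 'good'])
print(labels.tolist())
```

Using 6.5 as the middle edge avoids any doubt about which side a rating of 6 or 7 falls on.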

## Step 8 – Allotting 0 to bad and 1 to good.

```
le = LabelEncoder()
wine['quality'] = le.fit_transform(wine['quality'])
```
• LabelEncoder numbers the labels in sorted order, so BAD is replaced with 0.
• GOOD is replaced with 1.
• For reference, check the quality column of the encoded data frame.
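If you want to confirm which label got which number, `le.classes_` lists the labels in the order they were numbered. A minimal sketch on a few hand-written labels (the lowercase label names are assumed for illustration):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# A few hand-written labels, deliberately not in sorted order
labels = pd.Series(['good', 'bad', 'bad', 'good'])

le = LabelEncoder()
encoded = le.fit_transform(labels)

# classes_ holds the labels in sorted order; the index is the encoded value
print(list(le.classes_))
print(encoded.tolist())
```

Because the encoder sorts the labels first, the order in which they appear in the data does not affect the mapping.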

## Step 9 – Again check counts.

`wine['quality'].value_counts()`
• Now we have just 2 classes, 0 and 1, or BAD and GOOD.
• But as we can see, the data is highly imbalanced, so we will balance it in the next step.

## Step 10 – Balancing the two classes.

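```
good_quality = wine[wine['quality']==1]
bad_quality = wine[wine['quality']==0]
# Missing lines reconstructed from the description below: shuffle, keep 217, join
bad_quality = bad_quality.sample(frac=1)[:217]
new_df = pd.concat([good_quality,bad_quality])

new_df = new_df.sample(frac=1)
new_df
```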
• In this step, we are simply balancing our dataset.
• We make a new data frame good_quality that holds only the good wines, that is, the rows where quality is 1.
• Similarly, we make bad_quality for the rows where quality is 0.
• Then we shuffle the bad quality data using df.sample(frac=1), which returns 100% of the rows in random order.
• Then we take 217 samples of bad_quality because we have just 217 samples of good_quality.
• Then we join the 217 samples of each class, so our final data frame has 217*2=434 rows.
• Finally, we shuffle the data again.
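The steps above can be wrapped in a small reusable helper. This is only a sketch under my own assumptions: the function name `undersample` and the `seed` parameter are hypothetical, and the toy frame uses invented values rather than the real data:

```python
import pandas as pd

def undersample(df, target, seed=0):
    """Downsample every class to the size of the rarest class, then shuffle."""
    n_min = df[target].value_counts().min()
    # Draw n_min random rows from each class
    parts = [
        group.sample(n=n_min, random_state=seed)
        for _, group in df.groupby(target)
    ]
    # Join the per-class samples and shuffle the combined rows
    return pd.concat(parts).sample(frac=1, random_state=seed).reset_index(drop=True)

# Toy frame with invented values: five "bad" rows and two "good" rows
toy = pd.DataFrame({'alcohol': [9.0, 9.5, 10.0, 10.5, 11.0, 12.0, 12.5],
                    'quality': [0, 0, 0, 0, 0, 1, 1]})
balanced = undersample(toy, 'quality')
print(balanced['quality'].value_counts())
```

Passing a fixed seed makes the sample repeatable, which the unseeded `sample(frac=1)` calls in this step are not.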

## Step 11 – Again check the counts of classes in the new dataframe.

`new_df['quality'].value_counts()`
• Now we can see that both classes have 217 instances, and hence our data is balanced.

## Step 12 – Checking the correlation between columns.

`new_df.corr()['quality'].sort_values(ascending=False)`
• From the output we can infer that quality is strongly correlated with the alcohol content of the wine.
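One thing to keep in mind: sorting by the raw correlation pushes strongly negative features to the bottom, even though they can be just as informative. A small sketch that ranks by absolute correlation instead, on a toy frame with invented values (not the real dataset):

```python
import pandas as pd

# Toy frame with invented values (not the real dataset)
toy = pd.DataFrame({
    'alcohol':          [9.0, 9.5, 10.0, 11.0, 12.0, 12.5],
    'volatile acidity': [0.70, 0.65, 0.60, 0.40, 0.35, 0.30],
    'quality':          [0, 0, 0, 1, 1, 1],
})

# Correlation of every feature with the target, excluding the target itself
corr = toy.corr()['quality'].drop('quality')

# Rank by strength regardless of sign, but keep the signed values
ranked = corr.reindex(corr.abs().sort_values(ascending=False).index)
print(ranked)
```

In this toy frame alcohol correlates positively with quality and volatile acidity negatively, which matches the direction of the trends seen in the bar plots.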

## Step 13 – Splitting the data into train and test.

```
from sklearn.model_selection import train_test_split

X = new_df.drop('quality',axis=1)
y = new_df['quality']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=101)
```

## Step 14 – Finally training our Wine Quality Prediction model.

```
param = {'n_estimators':[100,200,300,400,500,600,700,800,900,1000]}

# 10-fold cross-validated search over the number of trees
grid_rf = GridSearchCV(RandomForestClassifier(),param,scoring='accuracy',cv=10)
grid_rf.fit(X_train, y_train)

print('Best parameters --> ', grid_rf.best_params_)

# Predict wine quality on the held-out test set
pred = grid_rf.predict(X_test)

print(confusion_matrix(y_test,pred))
print('\n')
print(classification_report(y_test,pred))
print('\n')
print(accuracy_score(y_test,pred))
```
• I also tried some other algorithms like SVM and SGD Classifier, but Random Forest gave the best results.
• Here I have used GridSearchCV with Random Forest to find the best value of the ‘n_estimators’ parameter.
• Finally, we ended up with an accuracy of 83.9%, which is very good for such a small dataset. Since the shuffling and the forest are not seeded, your exact number may differ slightly.
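If you want to try the grid search without the CSV at hand, here is a scaled-down sketch on synthetic data. Everything in it is an assumption for illustration: the data comes from `make_classification`, the grid has only two values, and `cv=3` keeps it fast. It also shows how to look at the cross-validated score of every candidate, not just the winner:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the wine data: 200 rows, 11 features, 2 classes
X, y = make_classification(n_samples=200, n_features=11, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=101)

# A deliberately tiny grid so the search finishes quickly
param = {'n_estimators': [10, 50]}
grid = GridSearchCV(RandomForestClassifier(random_state=0), param,
                    scoring='accuracy', cv=3)
grid.fit(X_train, y_train)

print('Best parameters --> ', grid.best_params_)
# Mean cross-validated accuracy for each candidate, in grid order
print('Mean CV accuracy per candidate:', grid.cv_results_['mean_test_score'])
# Accuracy of the refitted best model on the held-out test set
print('Held-out accuracy:', grid.score(X_test, y_test))
```

Setting `random_state` on the classifier makes repeated runs comparable. If the per-candidate scores are nearly identical, a smaller forest is usually the cheaper choice.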