Wine Quality Prediction – with source code – 2024

In today’s blog, we will see how we can build a Wine Quality Prediction model using the Random Forest algorithm. So, without further ado, let’s do it…

Check out the video here – https://youtu.be/7_IjRrns_uo

Step 1 – Importing libraries required for Wine Quality Prediction.

import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.svm import SVC
import matplotlib.pyplot as plt
from sklearn.linear_model import SGDClassifier
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report,confusion_matrix,accuracy_score

# Jupyter/IPython magic: render plots inline (remove this line in a plain .py script)
%matplotlib inline

Step 2 – Read input data.

wine = pd.read_csv('winequality-red.csv')
wine.head()
Figure: Input data

Step 3 – Describe the data.

wine.describe()
Figure: Description of our data

Step 4 – Take info from the data.

wine.info()
  • From the output below, we can infer that there are no NULL values in our data.
Figure: Info of our data
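The no-NULL observation can also be checked directly instead of reading it off the info() output. Here is a minimal sketch of that check; the `toy` data frame and its values are made up for illustration (one missing value is planted on purpose) and are not the wine data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the wine data (not the real dataset)
toy = pd.DataFrame({
    'alcohol': [9.4, 9.8, np.nan, 10.5],  # one missing value planted on purpose
    'pH': [3.51, 3.20, 3.26, 3.16],
})

# Number of missing values in each column
null_counts = toy.isnull().sum()
print(null_counts)

# Single yes/no answer for the whole frame
has_nulls = bool(toy.isnull().values.any())
print(has_nulls)
```

On the real data you would run the same two calls on `wine`; if every count is 0, there is nothing to impute or drop.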

Step 5 – Plot out the data.

fig = plt.figure(figsize=(15,10))

plt.subplot(3,4,1)
sns.barplot(x='quality',y='fixed acidity',data=wine)

plt.subplot(3,4,2)
sns.barplot(x='quality',y='volatile acidity',data=wine)

plt.subplot(3,4,3)
sns.barplot(x='quality',y='citric acid',data=wine)

plt.subplot(3,4,4)
sns.barplot(x='quality',y='residual sugar',data=wine)

plt.subplot(3,4,5)
sns.barplot(x='quality',y='chlorides',data=wine)

plt.subplot(3,4,6)
sns.barplot(x='quality',y='free sulfur dioxide',data=wine)

plt.subplot(3,4,7)
sns.barplot(x='quality',y='total sulfur dioxide',data=wine)

plt.subplot(3,4,8)
sns.barplot(x='quality',y='density',data=wine)

plt.subplot(3,4,9)
sns.barplot(x='quality',y='pH',data=wine)

plt.subplot(3,4,10)
sns.barplot(x='quality',y='sulphates',data=wine)

plt.subplot(3,4,11)
sns.barplot(x='quality',y='alcohol',data=wine)

plt.tight_layout()
  • From the plots below we can infer:
  • Quality tends to be higher when volatile acidity is lower.
  • Quality tends to be higher when citric acid is higher.
  • Quality tends to be higher when chlorides are lower.
  • Quality tends to be higher when sulphates are higher.
  • Quality tends to be higher when alcohol is higher.
Figure: Plots

Step 6 – Count the no. of instances of each class.

wine['quality'].value_counts()
  • We can see that we have 6 classes of quality (3, 4, 5, 6, 7 and 8), but for this model we want a simpler two-class target.
  • So we will mark every rating from 3 to 6 as BAD and ratings of 7 and 8 as GOOD.
Figure: Value counts

Step 7 – Make just 2 categories: good and bad.

ranges = (2,6.5,8) 
groups = ['bad','good']
wine['quality'] = pd.cut(wine['quality'],bins=ranges,labels=groups)
  • Here we use pd.cut() to bin quality into 2 categories: ratings above 2 and up to 6.5 become BAD, and ratings above 6.5 and up to 8 become GOOD.
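To see what the binning does to each rating, here is a small sketch that applies the same bins and labels to the six quality scores on their own. The `scores` series is just an illustration, not the wine data frame.

```python
import pandas as pd

# The six quality scores that occur in the data
scores = pd.Series([3, 4, 5, 6, 7, 8])

# Same bins and labels as in the step above; bins are right-inclusive by default,
# so the intervals are (2, 6.5] and (6.5, 8]
labels = pd.cut(scores, bins=(2, 6.5, 8), labels=['bad', 'good'])

mapping = dict(zip(scores.tolist(), labels.astype(str).tolist()))
print(mapping)
# {3: 'bad', 4: 'bad', 5: 'bad', 6: 'bad', 7: 'good', 8: 'good'}
```

Because the upper edge of each bin is inclusive, a rating of 8 still lands in GOOD, and any rating outside 2 to 8 would become NaN.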

Step 8 – Allotting 0 to bad and 1 to good.

le = LabelEncoder()
wine['quality'] = le.fit_transform(wine['quality'])
wine.head()
  • Replace BAD with 0.
  • Replace GOOD with 1.
  • For reference see the quality column in the image below.
Figure: Labels encoded to 0/1
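One thing worth knowing here: LabelEncoder assigns integers in sorted label order, so 'bad' becomes 0 only because it sorts before 'good'. The sketch below shows this on a hand-written list of labels (made up for illustration) and how to read the mapping back from the encoder.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical labels, in the same form pd.cut() produces
raw = ['bad', 'good', 'good', 'bad']

le = LabelEncoder()
encoded = le.fit_transform(raw)

# classes_ holds the labels in sorted order; the position is the encoded value
print(list(le.classes_))   # ['bad', 'good']
print(list(encoded))       # [0, 1, 1, 0]

# Recover the original labels from the integers
decoded = le.inverse_transform(encoded)
print(list(decoded))       # ['bad', 'good', 'good', 'bad']
```

If the label names were different, say 'poor' and 'good', the sorted order would flip and 'good' would become 0, so it is worth checking `le.classes_` once.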

Step 9 – Again check counts.

wine['quality'].value_counts()
  • Now we have just 2 classes, 0 and 1 or BAD and GOOD.
  • But as we can see, the data is highly unbalanced, so we will balance it in the next step.
Figure: Unequal value counts

Step 10 – Balancing the two classes.

good_quality = wine[wine['quality']==1]
bad_quality = wine[wine['quality']==0]

bad_quality = bad_quality.sample(frac=1)
bad_quality = bad_quality[:217]

new_df = pd.concat([good_quality,bad_quality])
new_df = new_df.sample(frac=1)
new_df
  • In this step, we are simply balancing our dataset.
  • We create a new data frame, good_quality, that holds only the good wines, i.e. the rows where quality is 1.
  • Similarly, we create bad_quality for the rows where quality is 0.
  • Then we shuffle the bad-quality data using df.sample(frac=1), which returns 100% of the rows in random order.
  • Then we take 217 samples of bad_quality because we have just 217 samples of good_quality.
  • Then we join the 217 samples of each class, so our final data frame has 217*2=434 rows.
  • Finally, we shuffle the data again.
Figure: Balanced dataset
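The balancing steps above can also be written without hard-coding 217, so the code keeps working if the class counts change. This is a sketch on a made-up `toy` frame (10 bad rows, 3 good rows), and the `random_state` values are an assumption added only to make the shuffle repeatable.

```python
import pandas as pd

# Hypothetical unbalanced frame: 10 rows of class 0 (bad), 3 rows of class 1 (good)
toy = pd.DataFrame({
    'alcohol': [9.0 + 0.1 * i for i in range(13)],
    'quality': [0] * 10 + [1] * 3,
})

good = toy[toy['quality'] == 1]
bad = toy[toy['quality'] == 0]

# Randomly draw as many bad rows as there are good rows
bad_down = bad.sample(n=len(good), random_state=42)

# Join both classes and shuffle the result
balanced = (
    pd.concat([good, bad_down])
    .sample(frac=1, random_state=42)
    .reset_index(drop=True)
)

counts = balanced['quality'].value_counts()
print(counts)
```

Note that downsampling throws away most of the bad-quality rows, which is the price paid here for equal class sizes.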

Step 11 – Again check the counts of classes in the new dataframe.

new_df['quality'].value_counts()
  • Now we can see that both classes have 217 instances, and hence our data is balanced.
Figure: Equal value counts

Step 12 – Checking the correlation between columns.

new_df.corr()['quality'].sort_values(ascending=False)
  • From the image below we can infer that quality correlates strongly with the alcohol content of the wine.
Figure: Correlations

Step 13 – Splitting the data into train and test.

from sklearn.model_selection import train_test_split

X = new_df.drop('quality',axis=1) 
y = new_df['quality']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=101)
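As an optional variation (not what the code above does), train_test_split accepts a `stratify` argument that keeps the class ratio the same in the train and test parts. Here is a sketch on a made-up balanced `toy` frame of 20 rows; the split above works without it, this only removes the chance of a lopsided test set.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical balanced frame: 10 rows of each class
toy = pd.DataFrame({
    'alcohol': [9.0 + 0.1 * i for i in range(20)],
    'quality': [0, 1] * 10,
})
X = toy.drop('quality', axis=1)
y = toy['quality']

# stratify=y keeps the 50/50 class ratio in both the train and test parts
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=101, stratify=y
)

print(y_train.value_counts())
print(y_test.value_counts())
```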

Step 14 – Finally training our Wine Quality Prediction model.

param = {'n_estimators':[100,200,300,400,500,600,700,800,900,1000]}

grid_rf = GridSearchCV(RandomForestClassifier(),param,scoring='accuracy',cv=10)
grid_rf.fit(X_train, y_train)

print('Best parameters --> ', grid_rf.best_params_)

# Predict wine quality on the held-out test set
pred = grid_rf.predict(X_test)

print(confusion_matrix(y_test,pred))
print('\n')
print(classification_report(y_test,pred))
print('\n')
print(accuracy_score(y_test,pred))
  • I also tried some other algorithms, such as SVM and SGD Classifier, but Random Forest stood out.
  • Here I have used GridSearchCV with Random Forest to find the best value of the ‘n_estimators’ parameter.
  • Finally, we ended up with an accuracy of 83.9%, which is very good for such a small dataset.
Figure: Final metrics
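If you want to see how each candidate value of n_estimators scored, and not just the winner, GridSearchCV keeps the per-candidate results in `cv_results_`. The sketch below uses synthetic data from make_classification as a stand-in (an assumption, it is not the wine data), a much smaller grid and fewer folds so it runs quickly, and a fixed `random_state`, which the code above does not set. Without a fixed seed, the accuracy can differ a little from run to run.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data with 11 features (NOT the wine dataset)
X, y = make_classification(n_samples=120, n_features=11, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=101
)

# Small grid and few folds so the sketch runs quickly
param = {'n_estimators': [10, 30]}
grid = GridSearchCV(
    RandomForestClassifier(random_state=0), param, scoring='accuracy', cv=3
)
grid.fit(X_train, y_train)

# One row per candidate value with its cross-validated accuracy
results = pd.DataFrame(grid.cv_results_)[
    ['param_n_estimators', 'mean_test_score', 'std_test_score']
]
print(results)

print('Best parameters --> ', grid.best_params_)
print('Test accuracy --> ', grid.score(X_test, y_test))
```

If the mean scores of neighbouring candidates are within one standard deviation of each other, the smaller forest is usually the cheaper choice.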

Download Source Code for Wine Quality Prediction…

Do let me know if you have any queries regarding Wine Quality Prediction by contacting me by email or on LinkedIn. I have tried my best to explain this code.

So this is all for this blog, folks. Thanks for reading, and I hope you are taking something away with you. Till next time…

Read my previous post: FLIPKART REVIEWS EXTRACTION AND SENTIMENT ANALYSIS USING FLASK APP

Check out my other machine learning projects, deep learning projects, computer vision projects, NLP projects, and Flask projects at machinelearningprojects.net.

2 Comments

    • Hey, if you keep so much difference between the classes, your model will become biased. If you train on 100 cases of 0 and 2 cases of 1, then no matter what instance you give while testing, it will almost always predict 0, the majority class. That’s why it’s a good practice to balance your classes when the difference is huge.
