Easiest way to Detect Data Drift in your dataset using Evidently in Python – 2023

Machine Learning Projects

Hey guys, in today’s blog we will see how to Detect Data Drift in your dataset using evidently module in Python. Checking Data Drift is a very important preprocessing step while preparing your data.

This is going to be a very interesting and informative blog, so without any further due, Let’s do it…

Snapshot of our Final Report…

detect Data Drift in your dataset

Step 1 – Importing required Packages

  • Importing Pandas to read our CSV dataset.
  • Importing Evidently library to create interactive Data Drift Dashboards.
import pandas as pd
from evidently.dashboard import Dashboard
from evidently.dashboard.tabs import DataDriftTab

Step 2 – Reading the Data

df = pd.read_csv('UCI_Credit_Card.csv')
print(df.columns)
detect Data Drift in your dataset

Step 3 – Creating a Data Drift report

  • Create a Dashboard object and pass DataDriftTab as the parameter.
  • Then calculate the Data Drift using the calculate method which takes two data frames to compare data distributions.
  • Then we are simply just saving the Dashboard in the HTML format.
credit_data_drift_dashboard = Dashboard(tabs=[DataDriftTab(verbose_level=1)])
credit_data_drift_dashboard.calculate(df[:25000], df[25000:], column_mapping=None)
credit_data_drift_dashboard.save('DataDrift.html')
print('Data Drift saved')

Our Final Report

  • The image below shows the final view of our Dashboard.
  • Let’s observe the BILL_AMT_4 column.
  • The first column says that our BILL_AMT_4 is of numeric type.
  • Then the next two columns display both, the reference distribution and the current distribution. We can observe the difference between the two.
  • Then in the last column, we can see the p-value for the similarity test. You can set your own p-value threshold in the code above.
detect Data Drift in your dataset

Let’s open the BILL_AMT_4 field

detect Data Drift in your dataset

Let’s see the full code…

import pandas as pd
from evidently.dashboard.tabs import DataDriftTab
from evidently.dashboard import Dashboard


df = pd.read_csv('UCI_Credit_Card.csv')
print(df.columns)


credit_data_drift_dashboard = Dashboard(tabs=[DataDriftTab(verbose_level=1)])
credit_data_drift_dashboard.calculate(df[:25000], df[25000:], column_mapping=None)
credit_data_drift_dashboard.save('DataDrift.html')
print('Data Drift saved')

Do let me know if thereā€™s any query when you Detect Data Drift in your dataset.

So this is all for this blog folks. Thanks for reading it and I hope you are taking something with you after reading this and till the next time ā€¦

Read my previous post: How to Deploy a Flask app online using Pythonanywhere

Check out my other machine learning projectsdeep learning projectscomputer vision projectsNLP projects, and Flask projects at machinelearningprojects.net.

Leave a Comment

Your email address will not be published. Required fields are marked *