Check Data Drift in Databricks using Evidently and MLflow – 2023

Hey guys, in this blog we will see how we can check data drift in Databricks and log the results in an MLflow experiment. This is going to be a very interesting blog, so without any further ado, let’s do it…

Step 1 – Installing required libraries

## Importing and installing necessary libraries
%pip install evidently  # %pip is the recommended install magic in Databricks notebooks

import json
import mlflow
import numpy as np
import pandas as pd
from datetime import datetime

from evidently.tests import TestNumberOfDriftedColumns
from evidently.test_suite import TestSuite

from databricks.feature_store import FeatureStoreClient

pd.set_option('display.max_rows', 100)

Step 2 – Fetching Data from the Feature Store

## Accessing data from feature store

# Create a FeatureStoreClient
feature_store = FeatureStoreClient()

# Specify the name of the feature table to read from
table_name = "your.table.name"

# Read data from the feature store
feature_df = feature_store.read_table(table_name)

# converting to pandas df
feature_df_pds = feature_df.toPandas()
feature_df_pds.head()
  • After importing all the required libraries, we will fetch our data from the Feature Store.
  • To do that, we first create a FeatureStoreClient().
  • Then we specify the name of the feature table we want to read from.
  • Next we call feature_store.read_table(table_name) to read the table from the feature store. This returns a Spark DataFrame, which we store in feature_df.
  • Finally, for easier manipulation, we convert the Spark DataFrame to a pandas DataFrame using feature_df.toPandas().

Output

[Screenshot: first rows of feature_df_pds]

Step 3 – Creating Reference and Current Data

## creating reference data and current data 
thres = int(0.9*len(feature_df_pds))

ref = feature_df_pds[:thres]
print(ref.shape)

cur = feature_df_pds[thres:]
print(cur.shape)
  • Now we split the data into two sets: a reference dataset and a current dataset.
  • We keep 90% of the data in the reference dataset and the remaining 10% in the current dataset.
  • We name them ref and cur respectively.
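Note that the 90/10 slice above assumes the rows are already in time order, which is worth verifying for feature-store data. Here is a minimal sketch of a time-ordered split, assuming a hypothetical timestamp column named event_ts (your table's column will differ):

```python
import pandas as pd

def time_split(df: pd.DataFrame, ts_col: str, frac: float = 0.9):
    """Split a DataFrame into reference/current sets by time order.

    Sorts on `ts_col` so the reference window always precedes the
    current window, then takes the first `frac` share as reference.
    """
    ordered = df.sort_values(ts_col).reset_index(drop=True)
    cut = int(frac * len(ordered))
    return ordered.iloc[:cut], ordered.iloc[cut:]

# Tiny demo with a hypothetical `event_ts` column
demo = pd.DataFrame({
    "event_ts": pd.date_range("2023-01-01", periods=10, freq="D"),
    "feature_a": range(10),
})
ref_demo, cur_demo = time_split(demo, "event_ts")
print(ref_demo.shape, cur_demo.shape)  # (9, 2) (1, 2)
```

Sorting first guarantees the "current" window really is the most recent data, which is what a drift check is meant to compare against.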

Output

[Screenshot: shapes of ref and cur]

Step 4 – Let’s Check Data Drift in Databricks

## Running Drift Test
tests = TestSuite(tests=[TestNumberOfDriftedColumns()])
tests.run(reference_data=ref, current_data=cur)

## Parsing the JSON results into a dict
res = json.loads(tests.json())

## Creating a Dataframe of it for easy visualization
drift = res['tests'][0]['parameters']['features']
driftdf = pd.DataFrame(drift).transpose()
driftdf
  • In this step, we create a TestSuite object and pass it a list of all the tests we want to run on our data.
  • In our case, we only run the TestNumberOfDriftedColumns() test.
  • We run the suite, passing in our reference and current datasets.
  • tests.json() returns the results as a JSON string, which we parse into a dict and turn into a DataFrame for easier inspection.
  • Then we simply print the DataFrame.
  • You can find the full list of available tests in the Evidently documentation.
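Under the hood, a drifted-columns test compares each column's distribution in cur against its distribution in ref with a statistical test, and Evidently chooses the method per column. As intuition for what such a comparison measures (this is not Evidently's exact method), here is a hand-rolled Population Stability Index sketch in plain NumPy:

```python
import numpy as np

def psi(reference: np.ndarray, current: np.ndarray, bins: int = 10) -> float:
    """Population Stability Index between two numeric samples.

    Bins both samples on the reference quantiles and compares the
    bin shares; a common rule of thumb flags PSI > 0.2 as drift.
    """
    edges = np.quantile(reference, np.linspace(0, 1, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf  # catch out-of-range values
    ref_share = np.histogram(reference, bins=edges)[0] / len(reference)
    cur_share = np.histogram(current, bins=edges)[0] / len(current)
    eps = 1e-6  # avoid log(0) and division by zero in empty bins
    ref_share = np.clip(ref_share, eps, None)
    cur_share = np.clip(cur_share, eps, None)
    return float(np.sum((cur_share - ref_share) * np.log(cur_share / ref_share)))

rng = np.random.default_rng(0)
same = psi(rng.normal(0, 1, 5_000), rng.normal(0, 1, 5_000))
shifted = psi(rng.normal(0, 1, 5_000), rng.normal(1, 1, 5_000))
print(same, shifted)  # the shifted sample scores far higher
```

The same idea generalizes: a per-column score plus a threshold gives a per-column drifted/not-drifted flag, and counting the flags gives exactly what TestNumberOfDriftedColumns reports.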

Output

[Screenshot: per-column drift results DataFrame]

Step 5 – Logging the results in an MLflow experiment

## Logging this drift report in an MLflow experiment
with mlflow.start_run() as run:
    mlflow.log_param('date', datetime.now().strftime('%y-%m-%d-%H:%M:%S'))
    mlflow.log_param('reference_data', 'ref')
    mlflow.log_param('current_data', 'cur')
    mlflow.log_param('n_features', len(driftdf))
    mlflow.log_param('features', list(driftdf.index))
    mlflow.log_param('n_drifted_features', len(driftdf[driftdf['detected']==True]))
    mlflow.log_param('drifted_features', list(driftdf[driftdf['detected']==True].index))
    mlflow.log_param('drifted_features_p_vals', list(driftdf[driftdf['detected']==True]['score']))
  • Finally, we log all this information in an MLflow experiment.
  • We start an MLflow run using with mlflow.start_run() as run and log parameters inside it.
  • Following are the parameters we are logging in this experiment:
    • date
    • reference_data
    • current_data
    • n_features
    • features
    • n_drifted_features
    • drifted_features
    • drifted_features_p_vals
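All of those values come straight out of driftdf, so they can also be assembled into one dict and logged with a single mlflow.log_params call. A small self-contained sketch, using a toy stand-in for driftdf (the detected and score column names are assumed from the Evidently output in Step 4):

```python
import pandas as pd

# Toy stand-in for `driftdf` -- the real one comes from the Evidently
# JSON in Step 4 and has one row per feature with a boolean `detected`
# flag and a drift `score`.
driftdf = pd.DataFrame(
    {"detected": [True, False, True], "score": [0.01, 0.40, 0.03]},
    index=["age", "income", "tenure"],
)

drifted = driftdf[driftdf["detected"]]
params = {
    "n_features": len(driftdf),
    "features": list(driftdf.index),
    "n_drifted_features": len(drifted),
    "drifted_features": list(drifted.index),
    "drifted_features_p_vals": list(drifted["score"]),
}
print(params["n_drifted_features"])  # 2

# Inside a Databricks notebook this dict is logged in one call:
#   with mlflow.start_run():
#       mlflow.log_params(params)
```

Logging everything as one dict keeps the run tidy and makes it easy to add or drop parameters later without another log_param line per field.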

Output

[Screenshot: logged parameters in the MLflow run]

Conclusion

So in this way, you can check data drift in Databricks and log the results in an MLflow experiment. This was all for this blog guys, hope you enjoyed it…

Read my last article – Easiest way to Create a Databricks Feature Store

Check out my other machine learning projects, deep learning projects, computer vision projects, NLP projects, and Flask projects at machinelearningprojects.net

Abhishek Sharma

Started my Data Science journey in my 2nd year of college and have been hooked ever since because of the magical powers of ML, continuously doing projects in almost every domain of AI: ML, DL, CV, and NLP.

