Hey guys, in this blog we will see how we can check data drift in Databricks and log the results in an MLflow experiment. This is going to be a very interesting blog, so without any further ado, let’s do it…
Step 2 – Installing and importing required libraries
```python
## Installing and importing necessary libraries
!pip install evidently

import json
import mlflow
import numpy as np
import pandas as pd
from datetime import datetime

from evidently.tests import *
from evidently.test_suite import TestSuite
from databricks.feature_store import FeatureStoreClient

pd.set_option('display.max_rows', 100)
```
Step 2 – Fetching Data from the Feature Store
```python
## Accessing data from the feature store
# Create a FeatureStoreClient
feature_store = FeatureStoreClient()

# Specify the name of the feature table to read from
table_name = "your.table.name"

# Read data from the feature store (returns a Spark DataFrame)
feature_df = feature_store.read_table(table_name)

# Convert to a pandas DataFrame
feature_df_pds = feature_df.toPandas()
feature_df_pds.head()
```
- After importing all the required libraries, we will fetch our data from the Feature Store.
- To do that, we first need to create a FeatureStoreClient().
- Then we need to specify the name of the table from which we want to fetch the data.
- Next, we run the feature_store.read_table(table_name) command to read the table from the feature store. This command returns a Spark DataFrame, which we store in feature_df.
- Finally, for easier manipulation, we convert this Spark DataFrame to a pandas DataFrame using the feature_df.toPandas() command.
Step 3 – Creating Reference and Current Data
```python
## Creating reference data and current data
thres = int(0.9 * len(feature_df_pds))

ref = feature_df_pds[:thres]
print(ref.shape)

cur = feature_df_pds[thres:]
print(cur.shape)
```
- Now we will split this data into 2 sets: a Reference Dataset and a Current Dataset.
- We will keep 90% of our data in the Reference Dataset and the remaining 10% in the Current Dataset.
- We will name them ref and cur, respectively.
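As a quick illustration, here is the same 90/10 split applied to a small synthetic DataFrame (the column names and values are made up for this example):

```python
import numpy as np
import pandas as pd

# Hypothetical feature table with 100 rows (stand-in for feature_df_pds)
feature_df_pds = pd.DataFrame({
    "feature_a": np.arange(100),
    "feature_b": np.random.rand(100),
})

# The first 90% of rows form the reference set, the last 10% the current set
thres = int(0.9 * len(feature_df_pds))
ref = feature_df_pds[:thres]
cur = feature_df_pds[thres:]

print(ref.shape)  # (90, 2)
print(cur.shape)  # (10, 2)
```

Note that this is a simple positional split; if your feature table has a timestamp column, splitting on time is usually a better way to compare "old" data against "new" data.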
Step 4 – Let’s Check Data Drift in Databricks
```python
## Running the drift test
tests = TestSuite(tests=[TestNumberOfDriftedColumns()])
tests.run(reference_data=ref, current_data=cur)

## Converting the results to JSON
res = json.loads(tests.json())

## Creating a DataFrame out of it for easier visualization
# Note: res['tests'] is a list, so we take the first (and only) test's results
drift = res['tests'][0]['parameters']['features']
driftdf = pd.DataFrame(drift).transpose()
driftdf
```
- Now, in this step, we will create a TestSuite object and pass it a list of all the tests we want to perform on our data.
- In our case, we just want to perform the TestNumberOfDriftedColumns() test.
- We will run the test, passing in our reference and current datasets.
- Then we simply convert the results to JSON so that we can create a DataFrame out of them for easier visualization.
- And then we simply print our DataFrame.
- You can find the full list of available tests in the Evidently documentation.
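To make the JSON structure concrete, here is a minimal sketch that parses a hand-made sample payload into the same kind of DataFrame. The feature names, scores, and thresholds below are invented for illustration; the nesting is modeled on what TestSuite.json() returns for this test:

```python
import pandas as pd

# Hand-made sample of the nested structure produced by tests.json()
# (all values here are invented for illustration)
res = {
    "tests": [
        {
            "name": "Number of Drifted Features",
            "parameters": {
                "features": {
                    "feature_a": {"stattest": "ks", "score": 0.03, "threshold": 0.05, "detected": True},
                    "feature_b": {"stattest": "ks", "score": 0.41, "threshold": 0.05, "detected": False},
                }
            },
        }
    ]
}

# Same parsing as above: one row per feature after transposing
drift = res["tests"][0]["parameters"]["features"]
driftdf = pd.DataFrame(drift).transpose()

# The drift decision for each feature lives in the 'detected' column
print(driftdf[["score", "detected"]])
```

This is why the transpose matters: `pd.DataFrame(drift)` puts features in columns, and transposing gives you one row per feature, which is much easier to filter and read.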
Step 5 – Logging the results in an MLflow experiment
```python
## Logging this drift report in an MLflow experiment
with mlflow.start_run() as run:
    mlflow.log_param('date', datetime.now().strftime('%y-%m-%d-%H:%M:%S'))
    mlflow.log_param('reference_data', 'ref')
    mlflow.log_param('current_data', 'cur')
    mlflow.log_param('n_features', len(driftdf))
    mlflow.log_param('features', list(driftdf.index))
    mlflow.log_param('n_drifted_features', len(driftdf[driftdf['detected'] == True]))
    mlflow.log_param('drifted_features', list(driftdf[driftdf['detected'] == True].index))
    mlflow.log_param('drifted_features_p_vals', driftdf[driftdf['detected'] == True]['score'].values)
```
- Now, finally, we will log all this information in an MLflow experiment.
- We start an MLflow run using with mlflow.start_run() as run and log our parameters inside it.
- These are the parameters we log in this experiment: the current date, the names of the reference and current datasets, the total number of features, the list of feature names, the number of drifted features, the names of the drifted features, and their drift scores (p-values).
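To see what those values look like before they are logged, here is a minimal sketch that computes them from a small hand-made driftdf (the feature names and scores are invented for this example):

```python
import pandas as pd

# Hand-made drift results table with the same columns as driftdf above
# (all values invented for illustration)
driftdf = pd.DataFrame(
    {
        "stattest": ["ks", "ks", "ks"],
        "score": [0.02, 0.30, 0.01],
        "threshold": [0.05, 0.05, 0.05],
        "detected": [True, False, True],
    },
    index=["feature_a", "feature_b", "feature_c"],
)

# Keep only the rows where drift was detected
drifted = driftdf[driftdf["detected"] == True]

n_features = len(driftdf)                  # 3
features = list(driftdf.index)
n_drifted_features = len(drifted)          # 2
drifted_features = list(drifted.index)     # ['feature_a', 'feature_c']
drifted_features_p_vals = drifted["score"].values

print(n_drifted_features, drifted_features)
```

Each of these variables maps one-to-one onto a mlflow.log_param call in the code above, so you can sanity-check the values locally before starting the run.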
So in this way, you can check data drift in Databricks and log the results in an MLflow experiment. That was all for this blog, guys, hope you enjoyed it…
Read my last article – Easiest way to Create a Databricks Feature Store