Easiest way to read Redshift data from SageMaker Notebook – 2023

Hey guys, in this blog we will discuss how to read Redshift data from a SageMaker Notebook using credentials stored in AWS Secrets Manager. So without any further ado, let’s do it…

Step 1 – Creating a Secret in Secrets Manager
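
Create a secret in AWS Secrets Manager that holds your Redshift connection details. The code in Step 4 reads the dbClusterIdentifier and db keys from this secret, so make sure both exist. Below is a minimal sketch of creating such a secret with boto3; the username, password, host, and port fields are assumptions based on the typical Redshift secret layout, and mysecret matches the secret name used later:

import json
import boto3

# Connection details for your Redshift cluster. dbClusterIdentifier and db
# are required by the later steps; the other fields follow the usual
# Redshift secret layout (assumed here).
secret_value = {
    "username": "awsuser",            # replace with your DB user
    "password": "your-password",      # replace with your DB password
    "host": "mycluster.xxxxxx.us-east-1.redshift.amazonaws.com",
    "port": 5439,
    "dbClusterIdentifier": "mycluster",
    "db": "masterdb",
}

sm_client = boto3.client("secretsmanager")
sm_client.create_secret(
    Name="mysecret",                  # the secret_name used in Step 4
    SecretString=json.dumps(secret_value),
)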

Step 2 – Importing required packages

import base64  # needed in Step 4 if the secret is stored as binary
import json

import boto3
import botocore.session as s
import pandas as pd
from botocore.exceptions import ClientError, WaiterError
from botocore.waiter import WaiterModel, create_waiter_with_client

Step 3 – Define a waiter

# Custom waiter for the Redshift Data API: polls DescribeStatement until
# the current SQL statement finishes executing
waiter_name = 'DataAPIExecution'

# Poll every 2 seconds, at most 3 times; increase max_attempts for
# long-running queries
delay = 2
max_attempts = 3

# Configure the waiter settings
waiter_config = {
  'version': 2,
  'waiters': {
    'DataAPIExecution': {
      'operation': 'DescribeStatement',
      'delay': delay,
      'maxAttempts': max_attempts,
      'acceptors': [
        {
          "matcher": "path",
          "expected": "FINISHED",
          "argument": "Status",
          "state": "success"
        },
        {
          "matcher": "pathAny",
          "expected": ["PICKED","STARTED","SUBMITTED"],
          "argument": "Status",
          "state": "retry"
        },
        {
          "matcher": "pathAny",
          "expected": ["FAILED","ABORTED"],
          "argument": "Status",
          "state": "failure"
        }
      ],
    },
  },
}

Step 4 – Retrieve DB details from AWS Secrets Manager

In this step, we will retrieve the connection details from our secret. From AWS Secrets Manager we mainly need the secret ARN, the Redshift cluster identifier (dbClusterIdentifier), and the database name (db).

secret_name = 'mysecret' ## replace the secret name with yours
session = boto3.session.Session()
region = session.region_name

client = session.client(
    service_name='secretsmanager',
    region_name=region
)

try:
    get_secret_value_response = client.get_secret_value(
            SecretId=secret_name
        )
    secret_arn=get_secret_value_response['ARN']

except ClientError as e:
    print("Error retrieving secret. Error: " + e.response['Error']['Message'])
    raise  # stop here -- the rest of the notebook needs this secret
    
else:
    # Depending on whether the secret is a string or binary, one of these fields will be populated.
    if 'SecretString' in get_secret_value_response:
        secret = get_secret_value_response['SecretString']
    else:
        secret = base64.b64decode(get_secret_value_response['SecretBinary'])
            
secret_json = json.loads(secret)
print(secret_json)

cluster_id=secret_json['dbClusterIdentifier']
db=secret_json['db']
print("Cluster_id: " + cluster_id + "\nDB: " + db + "\nSecret ARN: " + secret_arn)
bc_session = s.get_session()

session = boto3.Session(
        botocore_session=bc_session,
        region_name=region,
    )

# Setup the client
client_redshift = session.client("redshift-data")
print("Data API client successfully loaded")

Step 5 – Initiating the Waiter

waiter_model = WaiterModel(waiter_config)
custom_waiter = create_waiter_with_client(waiter_name, waiter_model, client_redshift)

Step 6 – Define the SQL Query

query_str = 'select drugprimaryname,drugnamesynonyms from "masterdb"."raw".raw_drugs' ## replace with your own database, schema, and table
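
As an aside, if your query needs run-time values, the Data API also supports named parameters, which is safer than concatenating strings into the SQL. Here is a minimal sketch; param_query, param_res, and the filter value are hypothetical names for illustration, and you would wait on and fetch this statement exactly as in Step 7 below:

# Hypothetical example: filter on a drug name supplied at run time
param_query = 'select drugprimaryname,drugnamesynonyms from "masterdb"."raw".raw_drugs where drugprimaryname = :name'

param_res = client_redshift.execute_statement(
    Database=db,
    SecretArn=secret_arn,
    Sql=param_query,
    ClusterIdentifier=cluster_id,
    Parameters=[{"name": "name", "value": "aspirin"}],  # hypothetical value
)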

Step 7 – Run the Query

res = client_redshift.execute_statement(
    Database=db,
    SecretArn=secret_arn,
    Sql=query_str,
    ClusterIdentifier=cluster_id
)
id = res["Id"]

try:
    custom_waiter.wait(Id=id)
    print("Done waiting to finish Data API.")
except WaiterError as e:
    print (e)
    
desc=client_redshift.describe_statement(Id=id)
print("Status: " + desc["Status"] + ". Excution time: %d miliseconds" %float(desc["Duration"]/pow(10,6)))

output = client_redshift.get_statement_result(Id=id)
print('Result Extracted Successfully!!!')

Step 8 – Extracting the data and creating a DataFrame from it

Here is some simple logic that creates a DataFrame out of the extracted data. Note that it assumes every value comes back as a string; see the sketch after the code for handling other data types and paginated results.

## Getting column names
cols = []
for i in output['ColumnMetadata']:
    cols.append(i['label'])

## Extracting records (assumes every value comes back as a stringValue)
l = []
for i in range(len(output['Records'])):
    temp = []
    for j in output['Records'][i]:
        temp.append(j['stringValue'])
    l.append(temp)

## Creating the DataFrame
df = pd.DataFrame(l, columns=cols)
df.head(10)
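
The snippet above only reads the first page of results and breaks if a column is not a string. get_statement_result returns large results in pages via NextToken, and each field in a record carries one of several type keys (stringValue, longValue, doubleValue, booleanValue) or an isNull flag. Here is a hedged sketch that handles both; treat it as a starting point rather than the exact method from this post:

def fetch_all_records(client, statement_id):
    """Collect column labels and all records across result pages."""
    records, cols, token = [], None, None
    while True:
        kwargs = {"Id": statement_id}
        if token:
            kwargs["NextToken"] = token
        page = client.get_statement_result(**kwargs)
        if cols is None:
            cols = [c["label"] for c in page["ColumnMetadata"]]
        records.extend(page["Records"])
        token = page.get("NextToken")
        if not token:
            break
    return cols, records

def field_value(field):
    """Unwrap a Data API field dict into a plain Python value."""
    if field.get("isNull"):
        return None
    # each field carries exactly one of these type keys
    for key in ("stringValue", "longValue", "doubleValue", "booleanValue"):
        if key in field:
            return field[key]
    return None

cols, records = fetch_all_records(client_redshift, id)
rows = [[field_value(f) for f in record] for record in records]
df = pd.DataFrame(rows, columns=cols)
df.head(10)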

And here is our dataframe, though I have hidden the content…

Download Notebook

And this is how you can read Redshift data from a SageMaker Notebook.

So this is all for this blog, folks. Thanks for reading, and I hope you are taking something with you after reading this. Till next time…

Read my previous post: Easiest way to schedule a Python Script in AWS Glue as a Job

Check out my other machine learning projects, deep learning projects, computer vision projects, NLP projects, and Flask projects at machinelearningprojects.net.
