Easiest way to read Redshift data from Sagemaker Notebook – 2024

Hey guys in this blog we will discuss how we can read Redshift data from Sagemaker Notebook using credentials stored in the secrets manager. So without any further due, Let’s do it…

Step 1 – Creating a Secret in Secrets Manager

  • Search for Secrets Manager
read Redshift data from Sagemaker Notebook
  • Click on ‘Store a New Secret’.
  • Choose Secret Type as ‘Credentials for Amazon RDS database’.
  • Then add your database username and password.
read Redshift data from Sagemaker Notebook
  • And then finally select the database and click on ‘Next’.
  • Then in the next step, give your secret a name, I will name it ‘mysecret’.
read Redshift data from Sagemaker Notebook
  • Click on Next and you will reach step 3 (Configure rotation – optional).
  • Again click on ‘Next’ and skip Step 3.
  • In Step 4 it will ask to review, and finally, click ‘Store’.
  • Refresh the page and your secret will show up.
read Redshift data from Sagemaker Notebook
  • Click on your secret.
  • Now scroll a bit and you will find ‘Secret Value’.
read Redshift data from Sagemaker Notebook
  • Click on ‘Retrieve secret value’ and it will open all the details present in this secret instance.
read Redshift data from Sagemaker Notebook
  • Now Click on edit and we will add 2 new entries.
  • Click on ‘Add row’ and add the following 2 entries.
  • Replace yourdbname with your database name and yourredshiftclustername with your redshift cluster name.
read Redshift data from Sagemaker Notebook
  • Click on Save and it’s Done 🙂

Step 2 – Importing required packages

  • Here we are simply importing the required python packages that will help us to read Redshift data from Sagemaker Notebook.
import botocore.session as s
from botocore.exceptions import ClientError
import boto3.session
import json
import boto3

from botocore.exceptions import WaiterError
from botocore.waiter import WaiterModel
from botocore.waiter import create_waiter_with_client

import pandas as pd

Step 3 – Define a waiter

  • Here we are simply defining the configuration of a waiter that we will define ahead.
# Create custom waiter for the Redshift Data API to wait for finish execution of current SQL statement
waiter_name = 'DataAPIExecution'

delay=2
max_attempts=3

#Configure the waiter settings
waiter_config = {
  'version': 2,
  'waiters': {
    'DataAPIExecution': {
      'operation': 'DescribeStatement',
      'delay': delay,
      'maxAttempts': max_attempts,
      'acceptors': [
        {
          "matcher": "path",
          "expected": "FINISHED",
          "argument": "Status",
          "state": "success"
        },
        {
          "matcher": "pathAny",
          "expected": ["PICKED","STARTED","SUBMITTED"],
          "argument": "Status",
          "state": "retry"
        },
        {
          "matcher": "pathAny",
          "expected": ["FAILED","ABORTED"],
          "argument": "Status",
          "state": "failure"
        }
      ],
    },
  },
}

Step 4 – Retrieve DB details from AWS Secrets Manager

In this step, we will retrieve details from our secret. Mainly we need to retrieve following items from AWS Secrets Manager-

  • Cluster Identifier
  • Secrets ARN
  • Database name
secret_name = 'mysecret' ## replace the secret name with yours
session = boto3.session.Session()
region = session.region_name

client = session.client(
        service_name='secretsmanager',
        region_name=region
    )

try:
    get_secret_value_response = client.get_secret_value(
            SecretId=secret_name
        )
    secret_arn=get_secret_value_response['ARN']

except ClientError as e:
    print("Error retrieving secret. Error: " + e.response['Error']['Message'])
    
else:
    # Depending on whether the secret is a string or binary, one of these fields will be populated.
    if 'SecretString' in get_secret_value_response:
        secret = get_secret_value_response['SecretString']
    else:
        secret = base64.b64decode(get_secret_value_response['SecretBinary'])
            
secret_json = json.loads(secret)
print(secret_json)

cluster_id=secret_json['dbClusterIdentifier']
db=secret_json['db']
print("Cluster_id: " + cluster_id + "\nDB: " + db + "\nSecret ARN: " + secret_arn)
image 7 1024x113 1
bc_session = s.get_session()

session = boto3.Session(
        botocore_session=bc_session,
        region_name=region,
    )

# Setup the client
client_redshift = session.client("redshift-data")
print("Data API client successfully loaded")
image 8

Step 5 – Initiating Waiter

  • Creating a waiter object that will wait and make retries for data extraction.
waiter_model = WaiterModel(waiter_config)
custom_waiter = create_waiter_with_client(waiter_name, waiter_model, client_redshift)

Step 6 – Define the SQL Query

  • Here we have defined the SQL Query that we need to run on tables present in Redshift.
query_str = 'select drugprimaryname,drugnamesynonyms from "masterdb"."raw".raw_drugs'

Step 7 – Run the Query

  • And finally, we will run the query here in this step.
  • This is the line where we will get our data -> output = client_redshift.get_statement_result(Id=id)
res = client_redshift.execute_statement(Database= db, SecretArn= secret_arn, Sql= query_str, ClusterIdentifier= cluster_id)
id = res["Id"]

try:
    custom_waiter.wait(Id=id)
    print("Done waiting to finish Data API.")
except WaiterError as e:
    print (e)
    
desc=client_redshift.describe_statement(Id=id)
print("Status: " + desc["Status"] + ". Excution time: %d miliseconds" %float(desc["Duration"]/pow(10,6)))

output = client_redshift.get_statement_result(Id=id)
print('Result Extracted Successfully!!!')
image 9

Step 8 – Extracting data and create a DataFrame of it

Here is a simple logic that will simply create a dataframe out of your extracted data.

## Getting column names
cols = []
for i in output['ColumnMetadata']:
    cols.append(i['label'])
    

## Extracting records
l = []
for i in range(len(output['Records'])):
    temp = []
    for j in output['Records'][i]:
        temp.append(j['stringValue'])
    l.append(temp)
    
## Creating Dataframe
df = pd.DataFrame(l,columns=cols)
df.head(10)
image 11 1

And here is our dataframe, though I have hidden the content…

Download Notebook

And this is how you can read Redshift data from Sagemaker Notebook.

So this is all for this blog folks, thanks for reading it and I hope you are taking something with you after reading this and till the next time ?…

Read my previous post: Easiest way to schedule a Python Script in AWS Glue as a Job

Check out my other machine learning projectsdeep learning projectscomputer vision projectsNLP projectsFlask projects at machinelearningprojects.net

Leave a Reply

Your email address will not be published. Required fields are marked *