Hey guys, in this blog we will discuss how to read Redshift data from a SageMaker Notebook using credentials stored in AWS Secrets Manager. So without any further ado, let's do it…
Step 1 – Creating a Secret in Secrets Manager
- Search for Secrets Manager
- Click on ‘Store a New Secret’.
- Choose the secret type 'Credentials for Redshift cluster' (the exact label varies by console version).
- Then add your database username and password.
- And then finally select your Redshift cluster and click on 'Next'.
- Then in the next step, give your secret a name; I will name it 'mysecret'.
- Click on Next and you will reach step 3 (Configure rotation – optional).
- Again click on ‘Next’ and skip Step 3.
- In Step 4 it will ask to review, and finally, click ‘Store’.
- Refresh the page and your secret will show up.
- Click on your secret.
- Now scroll a bit and you will find ‘Secret Value’.
- Click on ‘Retrieve secret value’ and it will open all the details present in this secret instance.
- Now click on 'Edit' and we will make sure the secret contains 2 entries our code needs, adding them via 'Add row' if they are not already present: a key db with value yourdbname, and a key dbClusterIdentifier with value yourredshiftclustername.
- Replace yourdbname with your database name and yourredshiftclustername with your Redshift cluster name (see the example below the list).
- Click on Save and it’s Done 🙂
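After saving, the secret's key/value pairs should look roughly like the sketch below. This is just an illustration: the exact keys auto-created by the console depend on the secret type you chose, and every value here is a placeholder.

{
  "username": "awsuser",
  "password": "your-password",
  "engine": "redshift",
  "host": "yourredshiftclustername.xxxxxxxxxx.us-east-1.redshift.amazonaws.com",
  "port": "5439",
  "dbClusterIdentifier": "yourredshiftclustername",
  "db": "yourdbname"
}

The two keys our code relies on later are db and dbClusterIdentifier.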
Step 2 – Importing required packages
- Here we simply import the required Python packages that will help us read Redshift data from the SageMaker Notebook.
import json
import base64  # needed in Step 4 in case the secret is stored as binary

import boto3
import boto3.session
import botocore.session as s
from botocore.exceptions import ClientError, WaiterError
from botocore.waiter import WaiterModel, create_waiter_with_client

import pandas as pd
Step 3 – Define a waiter
- Here we define the configuration of a custom waiter that we will create in Step 5. It polls DescribeStatement, retries while the statement is still queued or running, and fails on FAILED/ABORTED. Note that with delay=2 and max_attempts=3 it only waits about 6 seconds in total, so raise these values for longer-running queries.
# Create a custom waiter for the Redshift Data API to wait for the
# current SQL statement to finish executing
waiter_name = 'DataAPIExecution'
delay = 2
max_attempts = 3

# Configure the waiter settings
waiter_config = {
    'version': 2,
    'waiters': {
        'DataAPIExecution': {
            'operation': 'DescribeStatement',
            'delay': delay,
            'maxAttempts': max_attempts,
            'acceptors': [
                {
                    "matcher": "path",
                    "expected": "FINISHED",
                    "argument": "Status",
                    "state": "success"
                },
                {
                    "matcher": "pathAny",
                    "expected": ["PICKED", "STARTED", "SUBMITTED"],
                    "argument": "Status",
                    "state": "retry"
                },
                {
                    "matcher": "pathAny",
                    "expected": ["FAILED", "ABORTED"],
                    "argument": "Status",
                    "state": "failure"
                }
            ],
        },
    },
}
Step 4 – Retrieve DB details from AWS Secrets Manager
In this step, we will retrieve details from our secret. Mainly, we need the following items from AWS Secrets Manager:
- Cluster Identifier
- Secrets ARN
- Database name
secret_name = 'mysecret'  ## replace the secret name with yours

session = boto3.session.Session()
region = session.region_name

client = session.client(
    service_name='secretsmanager',
    region_name=region
)

try:
    get_secret_value_response = client.get_secret_value(SecretId=secret_name)
    secret_arn = get_secret_value_response['ARN']
except ClientError as e:
    print("Error retrieving secret. Error: " + e.response['Error']['Message'])
else:
    # Depending on whether the secret is a string or binary,
    # one of these fields will be populated.
    if 'SecretString' in get_secret_value_response:
        secret = get_secret_value_response['SecretString']
    else:
        secret = base64.b64decode(get_secret_value_response['SecretBinary'])

    secret_json = json.loads(secret)
    print(secret_json)

    cluster_id = secret_json['dbClusterIdentifier']
    db = secret_json['db']

    print("Cluster_id: " + cluster_id + "\nDB: " + db + "\nSecret ARN: " + secret_arn)
bc_session = s.get_session()

session = boto3.Session(
    botocore_session=bc_session,
    region_name=region,
)

# Set up the Redshift Data API client
client_redshift = session.client("redshift-data")
print("Data API client successfully loaded")
Step 5 – Initiating Waiter
- Here we create the waiter object from the configuration defined in Step 3; it will poll and retry until the query execution finishes.
waiter_model = WaiterModel(waiter_config)
custom_waiter = create_waiter_with_client(waiter_name, waiter_model, client_redshift)
Step 6 – Define the SQL Query
- Here we define the SQL query that we need to run on a table present in Redshift (a parameterized variant is sketched right after the query).
query_str = 'select drugprimaryname,drugnamesynonyms from "masterdb"."raw".raw_drugs'
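If you need to filter the query at run time, the Redshift Data API also supports named parameters in execute_statement, with placeholders written in the :name syntax. Here is a minimal sketch; the :drug placeholder and the 'aspirin' filter value are illustrative assumptions, and the execute_statement call itself is covered in Step 7:

# A sketch of a parameterized query; the real call happens in Step 7
param_query = 'select drugprimaryname, drugnamesynonyms from "masterdb"."raw".raw_drugs where drugprimaryname = :drug'

res = client_redshift.execute_statement(
    Database=db,
    SecretArn=secret_arn,
    Sql=param_query,
    ClusterIdentifier=cluster_id,
    Parameters=[{'name': 'drug', 'value': 'aspirin'}]  # 'aspirin' is just an example value
)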
Step 7 – Run the Query
- And finally, we will run the query in this step.
- The line that actually fetches our data is output = client_redshift.get_statement_result(Id=id).
res = client_redshift.execute_statement(
    Database=db,
    SecretArn=secret_arn,
    Sql=query_str,
    ClusterIdentifier=cluster_id
)
id = res["Id"]

try:
    custom_waiter.wait(Id=id)
    print("Done waiting to finish Data API.")
except WaiterError as e:
    print(e)

desc = client_redshift.describe_statement(Id=id)
print("Status: " + desc["Status"] + ". Execution time: %d milliseconds" % float(desc["Duration"] / pow(10, 6)))

output = client_redshift.get_statement_result(Id=id)
print('Result Extracted Successfully!!!')
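One thing to keep in mind: get_statement_result returns results in pages. For small result sets the single call above is enough, but if your query returns many rows you may need to follow NextToken. A minimal sketch, reusing the id from above:

# Collect all result pages by following NextToken
records = []
response = client_redshift.get_statement_result(Id=id)
records.extend(response['Records'])
while 'NextToken' in response:
    response = client_redshift.get_statement_result(Id=id, NextToken=response['NextToken'])
    records.extend(response['Records'])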
Step 8 – Extracting data and create a DataFrame of it
Here is some simple logic that creates a DataFrame out of the extracted data. Note that it assumes every value comes back as a string; a more defensive variant is sketched after the code.
## Getting column names
cols = []
for i in output['ColumnMetadata']:
    cols.append(i['label'])

## Extracting records (assumes every field is returned as 'stringValue')
l = []
for i in range(len(output['Records'])):
    temp = []
    for j in output['Records'][i]:
        temp.append(j['stringValue'])
    l.append(temp)

## Creating the DataFrame
df = pd.DataFrame(l, columns=cols)
df.head(10)
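If your table contains non-string columns or NULLs, the j['stringValue'] lookup above will raise a KeyError, because the Data API returns each field under a typed key (stringValue, longValue, doubleValue, booleanValue, or isNull). Here is a hedged, more defensive sketch of the same extraction:

# A more defensive variant: handle NULLs and non-string column types
def field_value(field):
    # Each field dict carries exactly one typed key, or isNull for NULLs
    if field.get('isNull'):
        return None
    for key in ('stringValue', 'longValue', 'doubleValue', 'booleanValue'):
        if key in field:
            return field[key]
    return None

rows = [[field_value(f) for f in record] for record in output['Records']]
df = pd.DataFrame(rows, columns=cols)
df.head(10)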
And here is our dataframe, though I have hidden the content…
Download Notebook
And this is how you can read Redshift data from a SageMaker Notebook.
So this is all for this blog, folks. Thanks for reading, and I hope you are taking something with you. Till the next time…
Read my previous post: Easiest way to schedule a Python Script in AWS Glue as a Job
Check out my other machine learning projects, deep learning projects, computer vision projects, NLP projects, Flask projects at machinelearningprojects.net