SSL For RDS With Glue Python Job and AWS SDK For Pandas

SSL For RDS With Glue Python Job and AWS SDK For Pandas

This blog post is the result of a recent interaction with AWS Support. As always they were very helpful in resolving the issue.

AWS SDK For Pandas

Recently AWS renamed the AWS data wrangler python library to AWS SDK for Pandas. This is an AWS Professional Service open source python initiative that extends the power of Pandas library to AWS connecting DataFrames and AWS data related services.

Built on top of other open-source projects like Pandas, Apache Arrow and Boto3, it offers abstracted functions to execute usual ETL tasks like load/unload data from Data Lakes, Data Warehouses and Databases.

I was looking to use the integration with AWS Glue to use a glue connection within some Python ETL code. The connection in my case was to an Amazon RDS PostgreSQL database.

For example:

import sys
import awswrangler as wr
import pandas as pd

con_postgresql = wr.postgresql.connect(connection="My-RDS-PostgreSQL-Connection")

con_postgresql.close()

The theory was that the connection could be defined in Glue once and used by multiple AWS Glue ETL

Amazon RDS Ready - Encryption Requirements

The purpose of the Amazon Relational Database Service (RDS) Ready Program is to recognise AWS Partner products that support the use of Amazon RDS database as a backend for business applications deployed within a customer’s AWS account or provided as SaaS deployed in APN Partner’s AWS Account.

This program requires that products follow AWS security, availability, reliability, performance and other architecture best practices while integrating with Amazon RDS.

At CDL our software has been accredited as Amazon RDS Ready and we apply these standards when developing new solutions. Specifically on Data encryption the Amazon RDS Ready states:

DBCONN-004 - Data Encryption: For business applications where data encryption is a requirement for security compliance, the product must support encryption of data at rest and in transit for Amazon RDS.

At CDL we ensure that data to RDS is encrypted in transit by setting the rds.force_ssl parameter to 1. See Using SSL with a PostgreSQL DB instance - Amazon Relational Database Service

Attempting an SSL Connection From Glue To RDS

A connection in Glue is created to a RDS database that has rds.force_ssl set.

This is done via the legacy glue connection screen in the console as this allows us to test the connection.

Glue Connection

As you can see running the test works.

Glue Connection Test OK

The Problem

Next we try an use that connection in a AWS Glue Python Job utilising the AWS SDK For Pandas.

import sys
import awswrangler as wr
import pandas as pd

con_postgresql = wr.postgresql.connect(connection="My-RDS-PostgreSQL-Connection")

con_postgresql.close()

Running the job will return errors about SSL. I got a couple of different errors when trying to debug different versions of the code.

SSL Error

SSL Error

After a bit of back and forth with AWS Support trying to debug the issue the service team identified the following.

Currently, awswrangler loads and uses default SSL configuration for creating boto3 session clients.

It was clear from the errors we receive that this default did not include the Amazon RDS Root CA.

To overwrite a default configuration, it’s possible to use the connect() function in awswrangler that allows to pass an SSL context.

We need to download the RDS root certificate and point to it.

import sys
import awswrangler as wr
import pandas as pd
import ssl
import os
import urllib.request

def download_rds_root_ca(filename: str):
    print("Downloading RDS CA root cert…")
    urllib.request.urlretrieve('https://s3.amazonaws.com/rds-downloads/rds-ca-2019-root.pem', filename=filename)
    print("Downloaded RDS CA root cert.")

def create_rds_ssl_context():
    cafile = '/tmp/rds-ca-2019-root.pem'
    download_rds_root_ca(cafile)
    ssl_context = ssl.SSLContext(ssl.PROTOCOL_TLS);
    ssl_context.verify_mode = ssl.CERT_REQUIRED;
    ssl_context.load_verify_locations(cafile=cafile, capath=None, cadata=None)
    return ssl_context

print("Connecting to RDS database…")
rds_ssl_context = create_rds_ssl_context()
con_postgresql = wr.postgresql.connect(connection="My-RDS-PostgreSQL-Connection", ssl_context=rds_ssl_context)
print("Successfully connected to RDS database.")

Run With SSL

Running the job again with the correct SSL certificate in place we get a successful execution.

Job Run Ok

Jon Run Logs