In February 2024, we announced the release of the Data Solutions Framework (DSF), an opinionated open supply framework for constructing information options on AWS. DSF is constructed utilizing the AWS Cloud Development Kit (AWS CDK) to package deal infrastructure elements into L3 AWS CDK constructs on high of AWS companies. L3 constructs are implementations of frequent technical patterns and create a number of sources which can be configured to work with one another.
On this put up, we reveal tips on how to use the AWS CDK and DSF to create a multi-data warehouse platform primarily based on Amazon Redshift Serverless. DSF simplifies the provisioning of Redshift Serverless, initialization and cataloging of knowledge, and data sharing between totally different information warehouse deployments. Utilizing a programmatic strategy with the AWS CDK and DSF means that you can apply GitOps ideas to your analytics workloads and understand the next advantages:
- You’ll be able to deploy utilizing steady integration and supply (CI/CD) pipelines, together with the definitions of Redshift objects (databases, tables, shares, and so forth)
- You’ll be able to roll out modifications persistently throughout a number of environments
- You’ll be able to bootstrap information warehouses (desk creation, ingestion of knowledge, and so forth) utilizing code and use model management to simplify the setup of testing environments
- You’ll be able to check modifications earlier than deployment utilizing AWS CDK built-in testing capabilities
As well as, DSF’s Redshift Serverless L3 constructs present numerous built-in capabilities that may speed up improvement whereas serving to you observe greatest practices. For instance:
- Working extract, remodel, and cargo (ETL) jobs to and from Amazon Redshift is extra simple as a result of an AWS Glue connection useful resource is mechanically created and configured. This implies information engineers don’t should configure this useful resource and may use it immediately with their AWS Glue ETL jobs.
- Equally, with discovery of knowledge inside Amazon Redshift, DSF gives a handy technique to configure an AWS Glue crawler to populate the AWS Glue Information Catalog for ease of discovery in addition to ease of referencing tables when creating ETL jobs. The configured AWS Glue crawler makes use of an AWS Identity and Access Management (IAM) position that follows least privilege.
- Sharing information between Redshift information warehouses is a typical strategy to enhance collaboration between strains of enterprise with out duplicating information. DSF gives handy strategies for the end-to-end circulate for each information producer and client.
Resolution overview
The answer demonstrates a typical sample the place an information warehouse is used as a serving layer for enterprise intelligence (BI) workloads on high of knowledge lake information. The supply information is saved in Amazon Simple Storage Service (Amazon S3) buckets, then ingested right into a Redshift producer information warehouse to create materialized views and combination information, and at last shared with a Redshift client operating BI queries from the end-users. The next diagram illustrates the high-level structure.
Within the put up, we use Python for the instance code. DSF additionally helps TypeScript.
Stipulations
As a result of we’re utilizing the AWS CDK, full the steps in Getting Started with the AWS CDK earlier than you implement the answer.
Initialize the challenge and provision a Redshift Serverless namespace and workgroup
Let’s begin with initializing the challenge and together with DSF as a dependency. You’ll be able to run this code in your native terminal, or you need to use AWS Cloud9:
Open the challenge folder in your IDE and full the next steps:
- Open the
app.py
file. - On this file, ensure that to uncomment the primary
env
This configures the AWS CDK environment relying on the AWS profile used through the deployment. - Add a configuration flag within the
cdk.context.json
file on the root of the challenge (if it doesn’t exist, create the file):
Setting the @data-solutions-framework-on-aws/removeDataOnDestroy
configuration flag to true makes certain sources which have the removal_policy
parameter set to RemovalPolicy.DESTROY
are destroyed when the AWS CDK stack is deleted. It is a guardrail DSF makes use of to stop by accident deleting information.
Now that the challenge is configured, you can begin including sources to the stack.
- Navigate to the
dsf_redshift_blog
folder and open thedsf_redshift_blog_stack.py
file.
That is the place we configure the sources to be deployed.
- To get began constructing the end-to-end demo, add the next import statements on the high of the file, which lets you begin defining the sources from each the AWS CDK core library in addition to DSF:
We use a number of DSF-specific constructs to construct the demo:
- DataLakeStorage – This creates three S3 buckets, named
Bronze
,Silver
, andGold
, to symbolize the totally different information layers. - S3DataCopy – This manages the copying of knowledge from one bucket to a different bucket.
- RedshiftServerlessNamespace – This creates a Redshift Serverless namespace the place database objects and customers are saved.
- RedshiftServerlessWorkgroup – This creates a Redshift Serverless workgroup that comprises compute- and network-related configurations for the information warehouse. That is additionally the entry level for a number of handy functionalities that DSF gives, reminiscent of cataloging of Redshift tables, operating SQL statements as a part of the AWS CDK (reminiscent of creating tables, information ingestion, merging of tables, and extra), and sharing datasets throughout totally different Redshift clusters with out shifting information.
- Now that you’ve got imported the libraries, create a set of S3 buckets following the medallion architecture best practices with bronze, silver, and gold information layers.
The high-level definitions of every layer are as follows:
- Bronze represents uncooked information; that is the place information from numerous supply programs lands. No schema is required.
- Silver is cleaned and doubtlessly augmented information. The schema is enforced on this layer.
- Gold is information that’s additional refined and aggregated to serve a selected enterprise want.
Utilizing the DataLakeStorage assemble, you possibly can create these three S3 buckets with the next greatest practices:
- Encryption at relaxation via AWS Key Management Service (AWS KMS) is turned on
- SSL is enforced
- Using S3 bucket keys is turned on
- There’s a default S3 lifecycle rule outlined to delete incomplete multipart uploads after 1 day
- After you create the S3 buckets, copy over the information utilizing the S3DataCopy For this demo, we land the information within the
Silver
bucket as a result of it’s already cleaned: - To ensure that Amazon Redshift to ingest the information in Amazon S3, it wants an IAM position with the appropriate permissions. This position might be related to the Redshift Serverless namespace that you simply create subsequent.
- To provision Redshift Serverless, configure two sources: a namespace and a workgroup. DSF gives L3 constructs for each:
Each constructs observe security best practices, together with:
- The default digital personal cloud (VPC) makes use of personal subnets (with public entry disabled).
- Information is encrypted at relaxation via AWS KMS with computerized key rotation.
- Admin credentials are saved in AWS Secrets Manager with computerized rotation managed by Amazon Redshift.
- A default AWS Glue connection is mechanically created utilizing personal connectivity. This can be utilized by AWS Glue crawlers in addition to AWS Glue ETL jobs to hook up with Amazon Redshift.
The
RedshiftServerlessWorkgroup
assemble is the primary entry level for different capabilities, reminiscent of integration with the AWS Glue Information Catalog, Redshift Information API, and Information Sharing API.- Within the following instance, use the defaults offered by the assemble and affiliate the IAM position that you simply created earlier to provide Amazon Redshift entry to the information lake for information ingestion:
Create tables and ingest information
To create a desk, you need to use the runCustomSQL
technique within the RedshiftServerlessWorkgroup
assemble. This technique means that you can run arbitrary SQL statements when the useful resource is being created (reminiscent of create desk
or create materialized view
) and when it’s being deleted (reminiscent of drop desk
or drop materialized view
).
Add the next code after the RedshiftServerlessWorkgroup
instantiation:
Given the asynchronous nature of among the useful resource creation, we additionally implement dependencies between some sources; in any other case, the AWS CDK would attempt to create them in parallel to speed up the deployment. The previous dependency statements set up the next:
- Earlier than you load the information, the S3 information copy is full, so the information exists within the supply bucket of the ingestion
- Earlier than you load the information, the goal desk has been created within the Redshift namespace
Bootstrapping instance (materialized views)
The workgroup.run_custom_sql()
technique gives flexibility in how one can bootstrap your Redshift information warehouse utilizing the AWS CDK. For instance, you possibly can create a materialized view to enhance the queries’ efficiency by pre-aggregating information from the Amazon evaluations:
materialized_view = workgroup.run_custom_sql('MvProductAnalysis',
database_name="defaultdb",
sql=f'''CREATE MATERIALIZED VIEW mv_product_analysis AS SELECT review_date, product_title, COUNT(1) AS review_total, SUM(star_rating) AS score FROM amazon_reviews WHERE market="US" GROUP BY 1,2;''',
delete_sql="drop materialized view mv_product_analysis")
materialized_view.node.add_dependency(load_amazon_reviews_data)
Catalog tables in Amazon Redshift
The deployment of RedshiftServerlessWorkgroup
mechanically creates an AWS Glue connection useful resource that can be utilized by AWS Glue crawlers and AWS Glue ETL jobs. That is straight uncovered from the workgroup assemble via the glue_connection property. Utilizing this connection, the workgroup assemble exposes a handy technique to catalog the tables contained in the related Redshift Serverless namespace. The next an instance code:
This single line of code creates a database within the Information Catalog named mv_product_analysis
and the related crawler with the IAM position and community configuration already configured. By default, it crawls all of the tables inside the general public schema within the default database indicated when the Redshift Serverless namespace was created. To override this, the third parameter within the catalogTables
technique means that you can outline a pattern on what to crawl (see the JDBC information retailer within the embody path).
You’ll be able to run the crawler utilizing the AWS Glue console or invoke it utilizing the SDK, AWS Command Line Interface (AWS CLI), or AWS CDK utilizing AwsCustomResource.
Information sharing
DSF helps Redshift information sharing for each side (producers and shoppers) in addition to identical account and cross-account situations. Let’s create one other Redshift Serverless namespace and workgroup to reveal the interplay:
For producers
For producers, full the next steps:
- Create the brand new share and populate the share with the schema or tables:
data_share = workgroup.create_share('DataSharing', 'defaultdb', 'defaultdbshare', 'public', ['mv_product_analysis']) data_share.new_share_custom_resource.node.add_dependency(materialized_view)
- Create entry grants:
-
- To grant to a cluster in the identical account:
share_grant = workgroup.grant_access_to_share("GrantToSameAccount", data_share, namespace2.namespace_id) share_grant.useful resource.node.add_dependency(data_share.new_share_custom_resource) share_grant.useful resource.node.add_dependency(namespace2)
- To grant to a unique account:
- To grant to a cluster in the identical account:
The last parameter within the grant_access_to_share
technique permits to mechanically authorize the cross-account entry on the information share. Omitting this parameter would default to no authorization; which implies a Redshift administrator must authorize the cross-account share both utilizing the AWS CLI, SDK, or Amazon Redshift console.
For shoppers
For a similar account share, to create the database from the share, use the next code:
For cross-account grants, the syntax is comparable, however you want to point out the producer account ID:
To see the complete working instance, observe the instructions within the accompanying GitHub repository.
Deploy the sources utilizing the AWS CDK
To deploy the sources, run the next code:
You’ll be able to overview the sources created, as proven within the following screenshot.
Affirm the modifications for the deployment to start out. Wait a couple of minutes for the challenge to be deployed; you possibly can hold monitor of the deployment utilizing the AWS CLI or the AWS CloudFormation console.
When the deployment is full, it is best to see two Redshift workgroups (one producer and one client).
Utilizing Amazon Redshift Question Editor v2, you possibly can log in to the producer Redshift workgroup utilizing Secrets and techniques Supervisor, as proven within the following screenshot.
After you log in, you possibly can see the tables and views that you simply created utilizing DSF within the defaultdb
database.
Log in to the patron Redshift workgroup to see the shared dataset from the producer Redshift workgroup beneath the advertising database.
Clear up
You’ll be able to run cdk destroy
in your native terminal to delete the stack. Since you marked the constructs with a RemovalPolicy.DESTROY
and configured DSF to take away information on destroy, operating cdk destroy
or deleting the stack from the AWS CloudFormation console will clear up the provisioned sources.
Conclusion
On this put up, we demonstrated tips on how to use the AWS CDK together with the DSF to handle Redshift Serverless as code. Codifying the deployment of sources helps present consistency throughout a number of environments. Other than infrastructure, DSF additionally gives capabilities to bootstrap (desk creation, ingestion of knowledge, and extra) Amazon Redshift and handle objects, all from the AWS CDK. Which means that modifications will be model managed, reviewed, and even unit examined.
Along with Redshift Serverless, DSF supports other AWS services, reminiscent of Amazon Athena, Amazon EMR, and lots of extra. Our roadmap is publicly accessible, and we look ahead to your function requests, contributions, and suggestions.
You may get began utilizing DSF by following our quick start guide.
Concerning the authors
Jan Michael Go Tan is a Principal Options Architect for Amazon Internet Providers. He helps clients design scalable and modern options with the AWS Cloud.