Unlock scalable analytics with a secure connectivity pattern in AWS Glue to read from or write to Snowflake

In today’s data-driven world, the ability to seamlessly integrate and use diverse data sources is key to gaining actionable insights and driving innovation. As organizations increasingly rely on data stored across various platforms, such as Snowflake, Amazon Simple Storage Service (Amazon S3), and various software as a service (SaaS) applications, the challenge of bringing these disparate data sources together has never been more pressing.

AWS Glue is a robust data integration service that facilitates the consolidation of data from different origins, empowering businesses to use the full potential of their data assets. By using AWS Glue to integrate data from Snowflake, Amazon S3, and SaaS applications, organizations can unlock new opportunities in generative artificial intelligence (AI), machine learning (ML), business intelligence (BI), and self-service analytics, or feed data to underlying applications.

In this post, we explore how AWS Glue can serve as the data integration service to bring data from Snowflake into your data integration strategy, enabling you to harness the power of your data ecosystem and drive meaningful outcomes across various use cases.

Use case

Consider a large ecommerce company that relies heavily on data-driven insights to optimize its operations, marketing strategies, and customer experiences. The company stores vast amounts of transactional data, customer information, and product catalogs in Snowflake. However, it also generates and collects data from various other sources, such as web logs stored in Amazon S3, social media platforms, and third-party data providers. To gain a comprehensive understanding of its business and make informed decisions, the company needs to integrate and analyze data from all these sources seamlessly.

One critical business requirement for the ecommerce company is to generate a Pricing Summary Report that provides a detailed analysis of pricing and discounting strategies. This report is essential for understanding revenue streams, identifying opportunities for optimization, and making data-driven decisions regarding pricing and promotions. After the Pricing Summary Report is generated and stored in Amazon S3, the company can use AWS analytics services to generate interactive BI dashboards and run one-time queries on the report. This allows business analysts and decision-makers to gain valuable insights, visualize key metrics, and explore the data in depth, enabling informed decision-making and strategic planning for pricing and promotional strategies.

Solution overview

The following architecture diagram illustrates a secure and efficient solution for integrating Snowflake data with Amazon S3, using the native Snowflake connector in AWS Glue. This setup uses AWS PrivateLink to provide secure connectivity between AWS services across different virtual private clouds (VPCs), eliminating the need to expose data to the public internet, which is a critical requirement for many organizations.

[Architecture diagram]

The following are the key components and steps in the integration process:

  1. Establish a secure, private connection between your AWS account and your Snowflake account using PrivateLink. This involves creating VPC endpoints in both the AWS and Snowflake VPCs, making sure data transfer stays within the AWS network.
  2. Use Amazon Route 53 to create a private hosted zone that resolves the Snowflake endpoint within your VPC. This allows AWS Glue jobs to connect to Snowflake using a private DNS name, maintaining the security and integrity of the data transfer.
  3. Create an AWS Glue job to handle the extract, transform, and load (ETL) process on data from Snowflake to Amazon S3. The AWS Glue job uses the secure connection established by the VPC endpoints to access Snowflake data. Snowflake credentials are securely stored in AWS Secrets Manager. The AWS Glue job retrieves these credentials at runtime to authenticate and connect to Snowflake, providing secure access management. A VPC endpoint lets you communicate with Secrets Manager without traversing the public internet, enhancing security and performance.
  4. Store the extracted and transformed data in Amazon S3. Organize the data into appropriate structures, such as partitioned folders, to optimize query performance and data management. A VPC endpoint is used here as well to communicate with Amazon S3 without traversing the public internet. We also use Amazon S3 to store AWS Glue scripts, logs, and temporary data generated during the ETL process.

This approach offers the following benefits:

  • Enhanced security – By using PrivateLink and VPC endpoints, data transfer between Snowflake and Amazon S3 is secured within the AWS network, reducing exposure to potential security threats.
  • Efficient data integration – AWS Glue simplifies the ETL process, providing a scalable and flexible solution for data integration between Snowflake and Amazon S3.
  • Cost-effectiveness – Using Amazon S3 for data storage, combined with the AWS Glue pay-as-you-go pricing model, helps optimize costs associated with data management and integration.
  • Scalability and flexibility – The architecture supports scalable data transfers and can be extended to integrate additional data sources and destinations as needed.

By following this architecture and using the capabilities of AWS Glue, PrivateLink, and related AWS services, organizations can achieve a robust, secure, and efficient data integration solution, enabling them to harness the full potential of their Snowflake and Amazon S3 data for advanced analytics and BI.

Prerequisites

Complete the following prerequisites before setting up the solution:

  1. Verify that you have access to an AWS account with the required permissions to provision resources in services such as Route 53, Amazon S3, AWS Glue, Secrets Manager, and Amazon Virtual Private Cloud (Amazon VPC) using AWS CloudFormation, which lets you model, provision, and manage AWS and third-party resources by treating infrastructure as code.
  2. Confirm that you have access to Snowflake hosted in AWS with the required permissions to run the steps to configure PrivateLink. Refer to Enabling AWS PrivateLink in the Snowflake documentation to verify the steps, required access level, and service level for setting the configurations. After you enable PrivateLink, save the values of the following parameters provided by Snowflake to use in a later step in this post:
    1. privatelink-vpce-id
    2. privatelink-account-url
    3. privatelink_ocsp-url
    4. regionless-snowsight-privatelink-url
  3. Make sure you have a Snowflake user snowflakeUser and password snowflakePassword with the required permissions to read from and write to Snowflake. The user and password are used in the AWS Glue connection to authenticate with Snowflake.
  4. If your Snowflake user doesn’t have a default warehouse set, you will need a warehouse name. We use snowflakeWarehouse as a placeholder for the warehouse name; replace it with your actual warehouse name.
  5. If you’re new to Snowflake, consider completing the Snowflake in 20 minutes tutorial. By the end of the tutorial, you should know how to create required Snowflake objects, including warehouses, databases, and tables for storing and querying data.

Create resources with AWS CloudFormation

This post includes a CloudFormation template for a quick setup of the base resources. You can review and customize it to suit your needs. The CloudFormation template generates the base resources used in this post, including the networking components (VPC, subnets, security group, Route 53 private hosted zone, and VPC endpoints), the Secrets Manager secret, the S3 bucket, the AWS Glue database, the Athena workgroup, and the IAM role referenced in the following steps.

To create your resources, complete the following steps:

  1. Sign in to the AWS CloudFormation console.
  2. Choose Launch Stack to launch the CloudFormation stack.
  3. Provide the CloudFormation stack parameters:
    1. For PrivateLinkAccountURL, enter the value of the privatelink-account-url parameter obtained in the prerequisites.
    2. For PrivateLinkOcspURL, enter the value of the privatelink_ocsp-url parameter obtained in the prerequisites.
    3. For PrivateLinkVpceId, enter the value of the privatelink-vpce-id parameter obtained in the prerequisites.
    4. For PrivateSubnet1CIDR, enter the IP address range for your private subnet 1.
    5. For PrivateSubnet2CIDR, enter the IP address range for your private subnet 2.
    6. For PrivateSubnet3CIDR, enter the IP address range for your private subnet 3.
    7. For PublicSubnet1CIDR, enter the IP address range for your public subnet 1.
    8. For RegionlessSnowsightPrivateLinkURL, enter the value of the regionless-snowsight-privatelink-url parameter obtained in the prerequisites.
    9. For VpcCIDR, enter the IP address range for your VPC.
  4. Choose Next.
  5. Select I acknowledge that AWS CloudFormation might create IAM resources.
  6. Choose Submit and wait for the stack creation to complete.

After the CloudFormation stack is successfully created, you can see all the resources it created on the Resources tab.

Navigate to the Outputs tab to see the outputs provided by the CloudFormation stack. Save the values of the GlueSecurityGroupId, VpcId, and PrivateSubnet1Id outputs to use in a later step in this post.
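
If you prefer to capture these values programmatically instead of copying them from the console, the following minimal boto3 sketch reads the stack outputs. It assumes the stack is named blog-glue-snowflake (the name used in the Clean up section) and that AWS credentials are configured locally:

import boto3

# Assumed stack name from this post; adjust if you renamed the stack.
STACK_NAME = "blog-glue-snowflake"

cloudformation = boto3.client("cloudformation")
stack = cloudformation.describe_stacks(StackName=STACK_NAME)["Stacks"][0]

# Turn the list of {"OutputKey": ..., "OutputValue": ...} dicts into a lookup table.
outputs = {o["OutputKey"]: o["OutputValue"] for o in stack["Outputs"]}

for key in ("GlueSecurityGroupId", "VpcId", "PrivateSubnet1Id"):
    print(f"{key}: {outputs[key]}")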


Update the Secrets Manager secret with Snowflake credentials for the AWS Glue connection

To update the Secrets Manager secret with the user snowflakeUser, password snowflakePassword, and warehouse snowflakeWarehouse that you will use in the AWS Glue connection to Snowflake, complete the following steps (a scripted alternative is sketched after the steps):

  1. On the Secrets Manager console, choose Secrets in the navigation pane.
  2. Open the secret blog-glue-snowflake-credentials.
  3. Under Secret value, choose Retrieve secret value.


  4. Choose Edit.
  5. Enter the user snowflakeUser, password snowflakePassword, and warehouse snowflakeWarehouse for the keys sfUser, sfPassword, and sfWarehouse, respectively.
  6. Choose Save.
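
As an alternative to the console steps, you can set the secret value with the AWS SDK. The following sketch assumes the secret created by the CloudFormation template is named blog-glue-snowflake-credentials and stores a JSON object with the sfUser, sfPassword, and sfWarehouse keys used by the AWS Glue connection later in this post:

import json
import boto3

secretsmanager = boto3.client("secretsmanager")

# Replace the placeholder values with your actual Snowflake credentials.
secret_value = {
    "sfUser": "snowflakeUser",
    "sfPassword": "snowflakePassword",
    "sfWarehouse": "snowflakeWarehouse",
}

# Writes a new version of the existing secret; the secret itself was created
# by the CloudFormation stack.
secretsmanager.put_secret_value(
    SecretId="blog-glue-snowflake-credentials",
    SecretString=json.dumps(secret_value),
)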

Create the AWS Glue connection for Snowflake

An AWS Glue connection is an AWS Glue Data Catalog object that stores login credentials, URI strings, VPC information, and more for a particular data store. AWS Glue crawlers, jobs, and development endpoints use connections to access certain types of data stores. To create an AWS Glue connection to Snowflake, complete the following steps:

  1. On the AWS Glue console, in the navigation pane, under Data Catalog, choose Connections.
  2. Choose Create connection.
  3. For Data sources, search for and select Snowflake.
  4. Choose Next.


  5. For Snowflake URL, enter https://<privatelink-account-url>.

To obtain the Snowflake PrivateLink account URL, refer to the parameters obtained in the prerequisites.

  6. For AWS Secret, choose the secret blog-glue-snowflake-credentials.
  7. For VPC, choose the VpcId value obtained from the CloudFormation stack output.
  8. For Subnet, choose the PrivateSubnet1Id value obtained from the CloudFormation stack output.
  9. For Security groups, choose the GlueSecurityGroupId value obtained from the CloudFormation stack output.
  10. Choose Next.


  11. In the Connection Properties section, for Name, enter glue-snowflake-connection.
  12. Choose Next.


  13. Choose Create connection.

Create an AWS Glue job

You’re now ready to define the AWS Glue job using the Snowflake connection. To create an AWS Glue job that reads from Snowflake, complete the following steps:

  1. On the AWS Glue console, under ETL jobs in the navigation pane, choose Visual ETL.


  2. Choose the Job details tab.
  3. For Name, enter a name, for example, Pricing Summary Report Job.
  4. For Description, enter a meaningful description for the job.
  5. For IAM Role, choose the role that has access to the target S3 location where the job writes, to the source location from which it loads the Snowflake data, and that can run the AWS Glue job. You can find this role in your CloudFormation stack output, named blog-glue-snowflake-GlueServiceRole-*.
  6. Use the default options for Type, Glue version, Language, Worker type, Number of workers, Number of retries, and Job timeout.
  7. For Job bookmark, choose Disable.
  8. Choose Save to save the job.


  9. On the Visual tab, choose Add nodes.

  10. For Sources, choose Snowflake.

  11. Choose Data source – Snowflake in the AWS Glue Studio canvas.
  12. For Name, enter Snowflake_Pricing_Summary.
  13. For Snowflake connection, choose glue-snowflake-connection.
  14. For Snowflake source, select Enter a custom query.
  15. For Database, enter snowflake_sample_data.
  16. For Snowflake query, add the following Snowflake query:
SELECT l_returnflag
    , l_linestatus
    , Sum(l_quantity) AS sum_qty
    , Sum(l_extendedprice) AS sum_base_price
    , Sum(l_extendedprice * (1 - l_discount)) AS sum_disc_price
    , Sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) AS sum_charge
    , Avg(l_quantity) AS avg_qty
    , Avg(l_extendedprice) AS avg_price
    , Avg(l_discount) AS avg_disc
    , Count(*) AS count_order
FROM tpch_sf1.lineitem
WHERE l_shipdate <= Dateadd(day, - 90, To_date('1998-12-01'))
GROUP BY l_returnflag
    , l_linestatus
ORDER BY l_returnflag
    , l_linestatus;

The Pricing Summary Report provides a summary pricing report for all line items shipped as of a given date. The date is within 60–120 days of the greatest ship date contained in the database. The query lists totals for extended price, discounted extended price, discounted extended price plus tax, average quantity, average extended price, and average discount. These aggregates are grouped by RETURNFLAG and LINESTATUS, and listed in ascending order of RETURNFLAG and LINESTATUS. A count of the number of line items in each group is included.

  17. For Custom Snowflake properties, specify the Key as sfSchema and the Value as tpch_sf1.
  18. Choose Save.


Next, you add the destination as an S3 bucket.

  19. On the Visual tab, choose Add nodes.
  20. For Targets, choose Amazon S3.

  21. Choose Data target – S3 bucket in the AWS Glue Studio canvas.
  22. For Name, enter S3_Pricing_Summary.
  23. For Node parents, select Snowflake_Pricing_Summary.
  24. For Format, select Parquet.
  25. For S3 Target Location, enter s3://<YourBucketName>/pricing_summary_report/ (use the name of your bucket).
  26. For Data Catalog update options, select Create a table in the Data Catalog and on subsequent runs, update the schema and add new partitions.
  27. For Database, choose db_blog_glue_snowflake.
  28. For Table name, enter tb_pricing_summary.
  29. Choose Save.
  30. Choose Run to run the job, and monitor its status on the Runs tab.

You have successfully completed the steps to create an AWS Glue job that reads data from Snowflake and loads the results into an S3 bucket using a secure connectivity pattern. If you want to transform the data before loading it into Amazon S3, you can use the AWS Glue transformations available in AWS Glue Studio. Transformations matter when creating an AWS Glue job because they enable efficient data cleansing, enrichment, and restructuring, making sure the data is in the desired format and quality for downstream processes. Refer to Editing AWS Glue managed data transform nodes for more information.
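
AWS Glue Studio generates a script from the visual job. If you prefer to author the job in code, the following PySpark sketch is roughly equivalent; it is a minimal illustration that assumes AWS Glue 4.0 with the native Snowflake connector, the glue-snowflake-connection connection created earlier, and a placeholder bucket name, and the options in the script Glue Studio actually generates may differ:

import sys
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glueContext = GlueContext(sc)
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Same pricing summary query used in the visual job.
pricing_summary_query = """
SELECT l_returnflag, l_linestatus,
       Sum(l_quantity) AS sum_qty,
       Sum(l_extendedprice) AS sum_base_price,
       Sum(l_extendedprice * (1 - l_discount)) AS sum_disc_price,
       Sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) AS sum_charge,
       Avg(l_quantity) AS avg_qty,
       Avg(l_extendedprice) AS avg_price,
       Avg(l_discount) AS avg_disc,
       Count(*) AS count_order
FROM tpch_sf1.lineitem
WHERE l_shipdate <= Dateadd(day, -90, To_date('1998-12-01'))
GROUP BY l_returnflag, l_linestatus
ORDER BY l_returnflag, l_linestatus
"""

# Read from Snowflake through the Glue connection created earlier.
snowflake_source = glueContext.create_dynamic_frame.from_options(
    connection_type="snowflake",
    connection_options={
        "connectionName": "glue-snowflake-connection",
        "sfDatabase": "snowflake_sample_data",
        "sfSchema": "tpch_sf1",
        "query": pricing_summary_query,
    },
    transformation_ctx="Snowflake_Pricing_Summary",
)

# Write the result to Amazon S3 as Parquet and register it in the Data Catalog.
s3_sink = glueContext.getSink(
    path="s3://<YourBucketName>/pricing_summary_report/",  # use your bucket name
    connection_type="s3",
    updateBehavior="UPDATE_IN_DATABASE",
    partitionKeys=[],
    enableUpdateCatalog=True,
    transformation_ctx="S3_Pricing_Summary",
)
s3_sink.setCatalogInfo(
    catalogDatabase="db_blog_glue_snowflake",
    catalogTableName="tb_pricing_summary",
)
s3_sink.setFormat("glueparquet")
s3_sink.writeFrame(snowflake_source)

job.commit()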

Validate the results

After the job is complete, you can validate the output of the ETL job run in Athena, a serverless interactive analytics service. To validate the output, complete the following steps:

  1. On the Athena console, choose Launch query editor.
  2. For Workgroup, choose blog-workgroup.
  3. If the message “All queries run in the workgroup, blog-workgroup, will use the following settings:” is displayed, choose Acknowledge.
  4. For Database, choose db_blog_glue_snowflake.
  5. For Query, enter the following statement:
SELECT l_returnflag
    , l_linestatus
    , sum_qty
    , sum_base_price
FROM db_blog_glue_snowflake.tb_pricing_summary

  6. Choose Run.

You have successfully validated your data for the AWS Glue job Pricing Summary Report Job.
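
If you want to run the same validation from a script instead of the console, a minimal boto3 sketch such as the following submits the query to the blog-workgroup workgroup and prints the first few rows; it assumes the workgroup already has a query result location configured by the CloudFormation template:

import time
import boto3

athena = boto3.client("athena")

QUERY = """
SELECT l_returnflag, l_linestatus, sum_qty, sum_base_price
FROM db_blog_glue_snowflake.tb_pricing_summary
"""

# Start the query in the workgroup created by the CloudFormation stack.
execution = athena.start_query_execution(
    QueryString=QUERY,
    QueryExecutionContext={"Database": "db_blog_glue_snowflake"},
    WorkGroup="blog-workgroup",
)
query_id = execution["QueryExecutionId"]

# Poll until the query finishes.
while True:
    state = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(2)

if state == "SUCCEEDED":
    rows = athena.get_query_results(QueryExecutionId=query_id, MaxResults=10)["ResultSet"]["Rows"]
    for row in rows:
        print([col.get("VarCharValue") for col in row["Data"]])
else:
    print(f"Query finished with state {state}")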

Clean up

To clean up your resources, complete the following tasks (a scripted sketch follows the list):

  1. Delete the AWS Glue job Pricing Summary Report Job.
  2. Delete the AWS Glue connection glue-snowflake-connection.
  3. Stop any AWS Glue interactive sessions.
  4. Delete the content from the S3 bucket blog-glue-snowflake-*.
  5. Delete the CloudFormation stack blog-glue-snowflake.
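
For reference, the same cleanup can be scripted. The following minimal sketch assumes the job, connection, bucket, and stack names used earlier in this post; deleting the CloudFormation stack removes the remaining resources it created:

import boto3

glue = boto3.client("glue")
s3 = boto3.resource("s3")
cloudformation = boto3.client("cloudformation")

# Delete the AWS Glue job and connection created in this post.
glue.delete_job(JobName="Pricing Summary Report Job")
glue.delete_connection(ConnectionName="glue-snowflake-connection")

# Empty the S3 bucket created by the stack (replace <suffix> with the actual
# suffix of your bucket name, which starts with blog-glue-snowflake-).
s3.Bucket("blog-glue-snowflake-<suffix>").objects.all().delete()

# Delete the CloudFormation stack, which removes the remaining resources.
cloudformation.delete_stack(StackName="blog-glue-snowflake")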

Conclusion

Using the native Snowflake connector in AWS Glue provides an efficient and secure way to integrate data from Snowflake into your data pipelines on AWS. By following the steps outlined in this post, you can establish a private connectivity channel between AWS Glue and your Snowflake account using PrivateLink, Amazon VPC, security groups, and Secrets Manager.

This architecture allows you to read data from and write data to Snowflake tables directly from AWS Glue jobs running on Spark. The secure connectivity pattern prevents data transfers over the public internet, enhancing data privacy and security.

Combining AWS data integration services like AWS Glue with data platforms like Snowflake allows you to build scalable, secure data lakes and pipelines to power analytics, BI, data science, and ML use cases.

In summary, the native Snowflake connector and the private connectivity model outlined here provide a performant, secure way to include Snowflake data in AWS big data workflows. This unlocks scalable analytics while maintaining data governance, compliance, and access control. For more information on AWS Glue, visit AWS Glue.


About the Authors

Caio Sgaraboto Montovani is a Sr. Specialist Solutions Architect, Data Lake and AI/ML within AWS Professional Services, developing scalable solutions according to customer needs. His vast experience has helped customers in different industries, such as life sciences and healthcare, retail, banking, and aviation, build solutions in data analytics, machine learning, and generative AI. He is passionate about rock and roll and cooking, and loves to spend time with his family.

Kartikay Khator is a Solutions Architect within Global Life Sciences at AWS, where he dedicates his efforts to developing innovative and scalable solutions that cater to the evolving needs of customers. His expertise lies in harnessing the capabilities of AWS analytics services. Beyond his professional pursuits, he finds joy and fulfillment in running and hiking. Having already completed two marathons, he is currently preparing for his next marathon challenge.

Navnit Shukla, an AWS Specialist Solutions Architect specializing in Analytics, is passionate about helping clients uncover valuable insights from their data. Leveraging his expertise, he develops inventive solutions that empower businesses to make informed, data-driven decisions. Notably, Navnit is the accomplished author of the book “Data Wrangling on AWS,” showcasing his expertise in the field.

Kamen Sharlandjiev is a Sr. Big Data and ETL Solutions Architect and an Amazon MWAA and AWS Glue ETL expert. He’s on a mission to make life easier for customers who are facing complex data integration and orchestration challenges. His secret weapon? Fully managed AWS services that can get the job done with minimal effort. Follow Kamen on LinkedIn to keep up to date with the latest Amazon MWAA and AWS Glue features and news!

Bosco Albuquerque is a Sr. Partner Solutions Architect at AWS and has over 20 years of experience working with database and analytics products from enterprise database vendors and cloud providers. He has helped technology companies design and implement data analytics solutions and products.
