The rapid adoption of software as a service (SaaS) solutions has led to data silos across various platforms, presenting challenges in consolidating insights from diverse sources. Effective data analytics relies on seamlessly integrating data from disparate systems by identifying, gathering, cleansing, and combining relevant data into a unified format. AWS Glue, a serverless data integration service, has simplified this process by offering scalable, efficient, and cost-effective solutions for integrating data from various sources. With AWS Glue, you can streamline data integration, reduce data silos and complexities, and gain agility in managing data pipelines, ultimately unlocking the true potential of your data assets for analytics, data-driven decision-making, and innovation.
This post explores the new Salesforce connector for AWS Glue and demonstrates how to build a modern extract, transform, and load (ETL) pipeline with AWS Glue ETL scripts.
Introducing the Salesforce connector for AWS Glue
To meet the demands of diverse data integration use cases, AWS Glue now supports SaaS connectivity for Salesforce. This enables users to quickly preview and transfer their customer relationship management (CRM) data, fetch the schema dynamically on request, and query the data. With the AWS Glue Salesforce connector, you can ingest and transform your CRM data to any of the AWS Glue supported destinations, including Amazon Simple Storage Service (Amazon S3), in your preferred format, including Apache Iceberg, Apache Hudi, and Linux Foundation Delta Lake; data warehouses such as Amazon Redshift and Snowflake; and many more. Reverse ETL use cases are also supported, allowing you to write data back to Salesforce.
The following are key benefits of the Salesforce connector for AWS Glue:
- You can use AWS Glue native capabilities
- It's well tested with AWS Glue capabilities and is production ready for any data integration workload
- It works seamlessly on top of AWS Glue and Apache Spark in a distributed fashion for efficient data processing
Solution overview
For our use case, we want to retrieve the full load of a Salesforce account object in a data lake on Amazon S3 and capture the incremental changes. This solution also allows you to update certain fields of the account object in the data lake and push it back to Salesforce. To achieve this, you create two ETL jobs using AWS Glue with the Salesforce connector, and create a transactional data lake on Amazon S3 using Apache Iceberg.
In the first job, you configure AWS Glue to ingest the account object from Salesforce and save it into a transactional data lake on Amazon S3 in Apache Iceberg format. Then you update the account object data that was extracted by the first job in the transactional data lake in Amazon S3. Lastly, you run the second job to send that change back to Salesforce.
Prerequisites
Complete the following prerequisite steps:
- Create an S3 bucket to store the results.
- Sign up for a Salesforce account, if you don't already have one.
- Create an AWS Identity and Access Management (IAM) role for the AWS Glue ETL job to use. The role must grant access to all resources used by the job, including Amazon S3 and AWS Secrets Manager. For this post, we name the role AWSGlueServiceRole-SalesforceConnectorJob. Use the following policies:
  - AWS managed policies:
  - Inline policy:
- Create the AWS Glue connection for Salesforce:
  - The Salesforce connector supports two OAuth2 grant types: JWT_BEARER and AUTHORIZATION_CODE. For this post, we use the AUTHORIZATION_CODE grant type.
  - On the Secrets Manager console, create a new secret. Add two keys, ACCESS_TOKEN and REFRESH_TOKEN, and keep their values blank. These will be populated after you enter your Salesforce credentials.
  - Configure the Salesforce connection in AWS Glue. Use AWSGlueServiceRole-SalesforceConnectorJob while creating the Salesforce connection. For this post, we name the connection Salesforce_Connection.
  - In the Authorization section, choose Authorization Code and the secret you created in the previous step.
  - Provide your Salesforce credentials when prompted. The ACCESS_TOKEN and REFRESH_TOKEN keys will be populated after you enter your Salesforce credentials.
- Create an AWS Glue database. For this post, we name it glue_etl_salesforce_db.
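The secret and the database from the prerequisites can also be created from code instead of the console. The following is a minimal sketch, not part of the original walkthrough. It assumes boto3 is installed and AWS credentials are configured, and the secret name salesforce-glue-oauth is a hypothetical placeholder. The two keys are left blank on purpose, because they are populated when you authorize the connection.

```python
import json


def build_secret_payload() -> str:
    """Return the JSON body for the secret: both OAuth keys present, values blank."""
    return json.dumps({"ACCESS_TOKEN": "", "REFRESH_TOKEN": ""})


def create_salesforce_secret(secret_name: str = "salesforce-glue-oauth") -> str:
    """Create the secret in AWS Secrets Manager and return its ARN.

    The default secret name is a hypothetical placeholder. boto3 is imported
    lazily so the payload helper can be used without AWS access.
    """
    import boto3

    client = boto3.client("secretsmanager")
    response = client.create_secret(
        Name=secret_name,
        SecretString=build_secret_payload(),
    )
    return response["ARN"]


def create_glue_database(db_name: str = "glue_etl_salesforce_db") -> None:
    """Create the AWS Glue database used by both ETL jobs."""
    import boto3

    glue = boto3.client("glue")
    glue.create_database(DatabaseInput={"Name": db_name})


if __name__ == "__main__":
    # Prints the payload only; call the create_* functions to touch AWS.
    print(build_secret_payload())
```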
Create an ETL job to ingest the account object from Salesforce
Complete the following steps to create a new ETL job in AWS Glue Studio to transfer data from Salesforce to Amazon S3:
- On the AWS Glue console, create a new job (with the Script editor option). For this post, we name the job Salesforce_to_S3_Account_Ingestion.
- On the Script tab, enter the Salesforce_to_S3_Account_Ingestion script.
  Make sure that the name you used to create the Salesforce connection is passed as the connectionName parameter value in the script.
  The script fetches records from the Salesforce account object. Then it checks if the account table exists in the transactional data lake. If the table doesn't exist, it creates a new table and inserts the records. If the table exists, it performs an upsert operation.
- On the Job details tab, for IAM role, choose AWSGlueServiceRole-SalesforceConnectorJob.
- Under Advanced properties, for Additional network connection, choose the Salesforce connection.
- Set up the job parameters:
  - --conf: spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions --conf spark.sql.catalog.glue_catalog=org.apache.iceberg.spark.SparkCatalog --conf spark.sql.catalog.glue_catalog.catalog-impl=org.apache.iceberg.aws.glue.GlueCatalog --conf spark.sql.catalog.glue_catalog.io-impl=org.apache.iceberg.aws.s3.S3FileIO --conf spark.sql.catalog.glue_catalog.warehouse=file:///tmp/spark-warehouse
  - --datalake-formats: iceberg
  - --db_name: glue_etl_salesforce_db
  - --s3_bucket_name: your S3 bucket
  - --table_name: account
- Save the job and run it.
  Depending on the size of the data in your account object in Salesforce, the job will take a few minutes to complete. After a successful job run, a new table called account is created and populated with Salesforce account records.
- You can use Amazon Athena to query the account table in the glue_etl_salesforce_db database.
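The ingestion steps above can be sketched in Python as follows. This is an illustrative outline under stated assumptions, not the Salesforce_to_S3_Account_Ingestion script itself. The API version v60.0, the temporary view name account_updates, the Id merge key, and the S3 table location are assumptions. The function ingest_accounts only works inside an AWS Glue job that has the Iceberg job parameters set, while the three helper functions are plain Python.

```python
from typing import Dict

CATALOG = "glue_catalog"  # catalog name used in the --conf job parameter


def build_iceberg_conf(warehouse: str = "file:///tmp/spark-warehouse") -> str:
    """Build the value of the --conf job parameter.

    Job parameters are key-value pairs, so the extra Spark settings are
    chained inside the one value, each introduced by a literal " --conf ".
    """
    settings = [
        "spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
        f"spark.sql.catalog.{CATALOG}=org.apache.iceberg.spark.SparkCatalog",
        f"spark.sql.catalog.{CATALOG}.catalog-impl=org.apache.iceberg.aws.glue.GlueCatalog",
        f"spark.sql.catalog.{CATALOG}.io-impl=org.apache.iceberg.aws.s3.S3FileIO",
        f"spark.sql.catalog.{CATALOG}.warehouse={warehouse}",
    ]
    return " --conf ".join(settings)


def salesforce_read_options(connection_name: str = "Salesforce_Connection") -> Dict[str, str]:
    """Connection options for reading the Account object."""
    return {
        "connectionName": connection_name,  # must match the Glue connection name
        "ENTITY_NAME": "Account",
        "API_VERSION": "v60.0",  # assumption: use the version your org supports
    }


def build_upsert_sql(db_name: str, table_name: str, source_view: str, key: str = "Id") -> str:
    """Build an Iceberg MERGE INTO statement that upserts rows by key."""
    target = f"{CATALOG}.{db_name}.{table_name}"
    return (
        f"MERGE INTO {target} AS t USING {source_view} AS s ON t.{key} = s.{key} "
        "WHEN MATCHED THEN UPDATE SET * "
        "WHEN NOT MATCHED THEN INSERT *"
    )


def ingest_accounts(glue_context, spark, db_name: str, table_name: str, s3_bucket_name: str) -> None:
    """Create the table on the first run, upsert on later runs (Glue job only)."""
    frame = glue_context.create_dynamic_frame.from_options(
        connection_type="salesforce",
        connection_options=salesforce_read_options(),
    )
    accounts = frame.toDF()
    accounts.createOrReplaceTempView("account_updates")

    existing = spark.sql(f"SHOW TABLES IN {CATALOG}.{db_name} LIKE '{table_name}'")
    if existing.count() == 0:
        # First run: create the Iceberg table at an assumed S3 location.
        (
            accounts.writeTo(f"{CATALOG}.{db_name}.{table_name}")
            .using("iceberg")
            .tableProperty("location", f"s3://{s3_bucket_name}/{table_name}")
            .create()
        )
    else:
        spark.sql(build_upsert_sql(db_name, table_name, "account_updates"))
```

Calling build_iceberg_conf() with its default argument reproduces the --conf value listed in the job parameters, which can help avoid copy and paste mistakes in that long string.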
Validate transactional capabilities
You can validate the transactional capabilities supported by Apache Iceberg. For testing, try three operations: insert, update, and delete:
- Create a new account object in Salesforce, rerun the AWS Glue job, then run the query in Athena to validate that the new account is created.
- Delete an account in Salesforce, rerun the AWS Glue job, and validate the deletion using Athena.
- Update an account in Salesforce, rerun the AWS Glue job, and validate the update operation using Athena.
Create an ETL job to send updates back to Salesforce
AWS Glue also allows you to write data back to Salesforce. Complete the following steps to create an ETL job in AWS Glue to get updates from the transactional data lake and write them to Salesforce. In this scenario, you update an account record and push it back to Salesforce.
- On the AWS Glue console, create a new job (with the Script editor option). For this post, we name the job S3_to_Salesforce_Account_Writeback.
- On the Script tab, enter the S3_to_Salesforce_Account_Writeback script.
  Make sure that the name you used to create the Salesforce connection is passed as the connectionName parameter value in the script.
- On the Job details tab, for IAM role, choose AWSGlueServiceRole-SalesforceConnectorJob.
- Configure the job parameters:
  - --conf: spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions --conf spark.sql.catalog.glue_catalog=org.apache.iceberg.spark.SparkCatalog --conf spark.sql.catalog.glue_catalog.catalog-impl=org.apache.iceberg.aws.glue.GlueCatalog --conf spark.sql.catalog.glue_catalog.io-impl=org.apache.iceberg.aws.s3.S3FileIO --conf spark.sql.catalog.glue_catalog.warehouse=file:///tmp/spark-warehouse
  - --datalake-formats: iceberg
  - --db_name: glue_etl_salesforce_db
  - --table_name: account
- Run an update query in Athena to change the value of UpsellOpportunity__c for a Salesforce account to "Yes".
- Run the S3_to_Salesforce_Account_Writeback AWS Glue job.
  Depending on the size of the data in your account object in Salesforce, the job will take a few minutes to complete.
- Validate the object in Salesforce. The value of UpsellOpportunity should change.
You have now successfully validated the Salesforce connector.
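The write-back flow above can be sketched in Python as follows. This is a hedged outline, not the S3_to_Salesforce_Account_Writeback script. The option names WRITE_OPERATION and ID_FIELD_NAMES, the API version v60.0, the lowercase Athena column names, and the filter on records set to "Yes" are assumptions that you should check against the Salesforce connection options reference and your own table. The function write_back_accounts only works inside an AWS Glue job.

```python
from typing import Dict

CATALOG = "glue_catalog"  # catalog name used in the --conf job parameter


def salesforce_write_options(connection_name: str = "Salesforce_Connection") -> Dict[str, str]:
    """Connection options for updating existing Account records."""
    return {
        "connectionName": connection_name,  # must match the Glue connection name
        "ENTITY_NAME": "Account",
        "API_VERSION": "v60.0",       # assumption
        "WRITE_OPERATION": "UPDATE",  # assumption: update rather than insert
        "ID_FIELD_NAMES": "Id",       # assumption: match records on the Salesforce Id
    }


def build_athena_update(db_name: str, table_name: str, account_id: str) -> str:
    """Build the Athena statement that flags one account for upsell."""
    safe_id = account_id.replace("'", "''")  # escape single quotes in the literal
    return (
        f"UPDATE {db_name}.{table_name} "
        f"SET upsellopportunity__c = 'Yes' WHERE id = '{safe_id}'"
    )


def write_back_accounts(glue_context, spark, db_name: str, table_name: str) -> None:
    """Read changed rows from the Iceberg table and send them to Salesforce."""
    from awsglue.dynamicframe import DynamicFrame  # available only in Glue jobs

    changed = spark.sql(
        f"SELECT Id, UpsellOpportunity__c FROM {CATALOG}.{db_name}.{table_name} "
        "WHERE UpsellOpportunity__c = 'Yes'"
    )
    frame = DynamicFrame.fromDF(changed, glue_context, "account_updates")
    glue_context.write_dynamic_frame.from_options(
        frame=frame,
        connection_type="salesforce",
        connection_options=salesforce_write_options(),
    )
```

Selecting only the Id and the changed field keeps the write small and avoids sending read-only Salesforce fields back to the API.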
Considerations
You can set up AWS Glue job triggers to run the ETL jobs on a schedule, so that the data is regularly synchronized between Salesforce and Amazon S3. You can also integrate the ETL jobs with other AWS services, such as AWS Step Functions, Amazon Managed Workflows for Apache Airflow (Amazon MWAA), AWS Lambda, or Amazon EventBridge, to create a more advanced data processing pipeline.
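As one way to schedule the ingestion job, the following sketch builds a request for the AWS Glue CreateTrigger API and submits it with boto3. The trigger name and the daily 02:00 UTC cron expression are hypothetical choices for illustration, and the sketch assumes boto3 is installed with credentials configured.

```python
from typing import Any, Dict


def build_trigger_request(
    job_name: str,
    trigger_name: str = "salesforce-ingestion-daily",  # hypothetical name
    schedule: str = "cron(0 2 * * ? *)",               # every day at 02:00 UTC
) -> Dict[str, Any]:
    """Build the keyword arguments for a scheduled AWS Glue trigger."""
    return {
        "Name": trigger_name,
        "Type": "SCHEDULED",
        "Schedule": schedule,
        "Actions": [{"JobName": job_name}],
        "StartOnCreation": True,  # activate the trigger as soon as it exists
    }


def create_scheduled_trigger(job_name: str) -> str:
    """Create the trigger in AWS Glue and return its name."""
    import boto3  # imported lazily so the request builder works offline

    glue = boto3.client("glue")
    response = glue.create_trigger(**build_trigger_request(job_name))
    return response["Name"]


if __name__ == "__main__":
    # Prints the request only; call create_scheduled_trigger to touch AWS.
    print(build_trigger_request("Salesforce_to_S3_Account_Ingestion"))
```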
By default, the Salesforce connector doesn't import deleted records from Salesforce objects. However, you can set the IMPORT_DELETED_RECORDS option to "true" to import all records, including the deleted ones. Refer to Salesforce connection options for the other options that the connector supports.
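A small sketch of how that option could be added to the read options passed to the connector. The helper name and the API version v60.0 are assumptions for illustration. The option value is the string "true", as described above.

```python
from typing import Dict


def read_options_with_deleted(
    connection_name: str = "Salesforce_Connection",
    include_deleted: bool = True,
) -> Dict[str, str]:
    """Build Salesforce read options, optionally importing deleted records."""
    options = {
        "connectionName": connection_name,
        "ENTITY_NAME": "Account",
        "API_VERSION": "v60.0",  # assumption: use the version your org supports
    }
    if include_deleted:
        # The connector skips deleted records unless this is set to "true".
        options["IMPORT_DELETED_RECORDS"] = "true"
    return options
```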
Clean up
To avoid incurring charges, clean up the resources used in this post from your AWS account, including the AWS Glue jobs, Salesforce connection, Secrets Manager secret, IAM role, and S3 bucket.
Conclusion
The AWS Glue connector for Salesforce simplifies the analytics pipeline, reduces time to insights, and facilitates data-driven decision-making. It empowers organizations to streamline data integration and analytics. The serverless nature of AWS Glue means there is no infrastructure management, and you pay only for the resources consumed while your jobs are running. As organizations increasingly rely on data for decision-making, this Salesforce connector provides an efficient, cost-effective, and agile solution to swiftly meet data analytics needs.
To learn more about the AWS Glue connector for Salesforce, refer to Connecting to Salesforce in AWS Glue Studio. In this user guide, we walk through the entire process, from setting up the connection to running the data transfer flow. For more information on AWS Glue, visit AWS Glue.
About the authors
Ramakant Joshi is an AWS Solutions Architect, specializing in the analytics and serverless domain. He has a background in software development and hybrid architectures, and is passionate about helping customers modernize their cloud architecture.
Kamen Sharlandjiev is a Sr. Big Data and ETL Solutions Architect, Amazon MWAA and AWS Glue ETL expert. He's on a mission to make life easier for customers who are facing complex data integration and orchestration challenges. His secret weapon? Fully managed AWS services that can get the job done with minimal effort. Follow Kamen on LinkedIn to keep up to date with the latest Amazon MWAA and AWS Glue features and news!
Debaprasun Chakraborty is an AWS Solutions Architect, specializing in the analytics domain. He has around 20 years of software development and architecture experience. He is passionate about helping customers in cloud adoption, migration, and strategy.