Amazon MWAA best practices for managing Python dependencies



Customers with data engineers and data scientists are using Amazon Managed Workflows for Apache Airflow (Amazon MWAA) as a central orchestration platform for running data pipelines and machine learning (ML) workloads. To support these pipelines, they often require additional Python packages, such as Apache Airflow Providers. For example, a pipeline may require the Snowflake provider package for interacting with a Snowflake warehouse, or the Kubernetes provider package for provisioning Kubernetes workloads. As a result, they need to manage these Python dependencies efficiently and reliably, ensuring compatibility with each other and with the base Apache Airflow installation.

Python includes the tool pip to handle package installations. To install a package, you add its name to a special file named requirements.txt. The pip install command instructs pip to read the contents of your requirements file, determine dependencies, and install the packages. Amazon MWAA runs the pip install command using this requirements.txt file during initial environment startup and subsequent updates. For more information, see How it works.
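For example, a minimal requirements file and the corresponding command that pip runs (shown here outside of Amazon MWAA purely for illustration, using a provider package that appears later in this post) look like the following:

# requirements.txt
apache-airflow-providers-snowflake==5.2.1

# install every package listed in the file
pip install -r requirements.txt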

Creating a reproducible and stable requirements file is key to reducing pip installation and DAG errors. Additionally, this defined set of requirements provides consistency across nodes in an Amazon MWAA environment. This matters most during worker auto scaling, where additional worker nodes are provisioned and having different dependencies could lead to inconsistencies and task failures. This strategy also promotes consistency across different Amazon MWAA environments, such as dev, qa, and prod.

This post describes best practices for managing the requirements file in your Amazon MWAA environment. It defines the steps needed to determine your required packages and package versions, create and verify your requirements.txt file with package versions, and package your dependencies.

Best practices

The following sections describe the best practices for managing Python dependencies.

Specify package versions in the requirements.txt file

When creating a Python requirements.txt file, you can specify just the package name, or the package name and a specific version. Adding a package without version information instructs the pip installer to download and install the latest available version, subject to compatibility with other installed packages and any constraints. The package versions selected during environment creation may differ from the versions selected during a later auto scaling event. This version change can create package conflicts leading to pip install errors. Even if the updated package installs properly, code changes in the package can affect task behavior, leading to inconsistencies in output. To avoid these risks, it's best practice to add the version number to each package in your requirements.txt file.
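The following requirements.txt entries illustrate the difference; the pinned version shown is the one used later in this post:

# Unpinned: pip resolves the latest compatible version, which can change between installs
apache-airflow-providers-snowflake

# Pinned: every node installs exactly this version
apache-airflow-providers-snowflake==5.2.1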

Use the constraints file for your Apache Airflow version

A constraints file contains the packages, with versions, verified to be compatible with your Apache Airflow version. This file adds an additional validation layer to prevent package conflicts. Because the constraints file plays such an important role in preventing conflicts, beginning with Apache Airflow v2.7.2 on Amazon MWAA, your requirements file must include a --constraint statement. If a --constraint statement is not supplied, Amazon MWAA will specify a compatible constraints file for you.

Constraints files are available for each Airflow version and Python version combination. The URLs have the following form:

https://raw.githubusercontent.com/apache/airflow/constraints-${AIRFLOW_VERSION}/constraints-${PYTHON_VERSION}.txt
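For example, for Apache Airflow v2.8.1 on Python 3.11 (the combination used later in this post), the corresponding entry in requirements.txt is:

--constraint "https://raw.githubusercontent.com/apache/airflow/constraints-2.8.1/constraints-3.11.txt"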

The official Apache Airflow constraints are guidelines, and if your workflows require newer versions of a provider package, you may need to modify your constraints file and include it in your DAG folder. When doing so, the best practices outlined in this post become even more important to guard against package conflicts.

Create a .zip archive of all dependencies

Creating a .zip file containing the packages in your requirements file, and specifying this as the package repository source, makes sure the exact same wheel files are used during your initial environment setup and subsequent node configurations. The pip installer will use these local files for installation rather than connecting to the external PyPI repository.
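This is what the local runner's package-requirements command (used later in this post) automates. As a rough sketch of the idea using plain pip, assuming a requirements.txt file and a locally downloaded constraints file, the steps look like the following:

# Download the wheel files for every requirement into a local directory
pip download -r requirements.txt --constraint constraints-2.8.1-3.11.txt --dest wheels/

# Archive the wheel files so they can be uploaded alongside the requirements file
zip -j plugins.zip wheels/*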

Test the requirements.txt file and dependency .zip file

Testing your requirements file before release to production is key to avoiding installation and DAG errors. Testing both locally, with the MWAA local runner, and in a dev or staging Amazon MWAA environment, are best practices before deploying to production. You can use continuous integration and delivery (CI/CD) deployment strategies to perform the requirements and package installation testing, as described in Automating a DAG deployment with Amazon Managed Workflows for Apache Airflow.

Solution overview

This solution uses the MWAA local runner, an open source utility that replicates an Amazon MWAA environment locally. You use the local runner to build and validate your requirements file, and to package the dependencies. In this example, you install the snowflake and dbt-cloud provider packages. You then use the MWAA local runner and a constraints file to determine the exact version of each package that is compatible with Apache Airflow. With this information, you update the requirements file, pinning each package to a version, and retest the installation. When you have a successful installation, you package your dependencies and test in a non-production Amazon MWAA environment.

We use MWAA local runner v2.8.1 for this walkthrough and walk through the following steps:

  1. Download and build the MWAA local runner.
  2. Create and test a requirements file with package versions.
  3. Package dependencies.
  4. Deploy the requirements file and dependencies to a non-production Amazon MWAA environment.

Prerequisites

For this walkthrough, you should have the following prerequisites:

Set up the MWAA local runner

First, you download the MWAA local runner version matching your target MWAA environment, then you build the image.

Complete the following steps to configure the local runner:

  1. Clone the MWAA local runner repository with the following command:
    git clone git@github.com:aws/aws-mwaa-local-runner.git -b v2.8.1

  2. With Docker running, build the container with the following command:
    cd aws-mwaa-local-runner
    ./mwaa-local-env build-image

Create and test a requirements file with package versions

Building a versioned requirements file makes sure all Amazon MWAA components have the same package versions installed. To determine the appropriate version for each package, you start with a constraints file and an un-versioned requirements file, allowing pip to resolve the dependencies. You then create your versioned requirements file from pip's installation output.

The following diagram illustrates this workflow.

Requirements file testing process

To build an initial requirements file, complete the following steps:

  1. In your MWAA local runner directory, open requirements/requirements.txt in your preferred editor.

The default requirements file will look similar to the following:

--constraint "https://uncooked.githubusercontent.com/apache/airflow/constraints-2.8.1/constraints-3.11.txt"
apache-airflow-providers-snowflake==5.2.1
apache-airflow-providers-mysql==5.5.1

  2. Replace the existing packages with the following package list:
--constraint "https://uncooked.githubusercontent.com/apache/airflow/constraints-2.8.1/constraints-3.11.txt"
apache-airflow-providers-snowflake
apache-airflow-providers-dbt-cloud[http]

  3. Save requirements.txt.
  4. In a terminal, run the following command to generate the pip install output:
./mwaa-local-env test-requirements

test-requirements runs pip install, which handles resolving the appropriate package versions. Using a constraints file makes sure the selected packages are compatible with your Airflow version. The output will look similar to the following:

Successfully installed apache-airflow-providers-dbt-cloud-3.5.1 apache-airflow-providers-snowflake-5.2.1 pyOpenSSL-23.3.0 snowflake-connector-python-3.6.0 snowflake-sqlalchemy-1.5.1 sortedcontainers-2.4.0

The message beginning with Successfully installed is the output of interest. It shows which dependencies, and their specific versions, pip installed. You use this list to create your final versioned requirements file.

Your output will also contain Requirement already satisfied messages for packages already available in the base Amazon MWAA environment. You don't add these packages to your requirements.txt file.

  5. Update the requirements file with the list of versioned packages from the test-requirements command. The updated file will look similar to the following code:
--constraint "https://uncooked.githubusercontent.com/apache/airflow/constraints-2.8.1/constraints-3.11.txt"
apache-airflow-providers-snowflake==5.2.1
apache-airflow-providers-dbt-cloud[http]==3.5.1
pyOpenSSL==23.3.0
snowflake-connector-python==3.6.0
snowflake-sqlalchemy==1.5.1
sortedcontainers==2.4.0

Next, you test the updated requirements file to confirm no conflicts exist.

  6. Rerun the test-requirements command:
./mwaa-local-env test-requirements

A successful test will not produce any errors. If you encounter dependency conflicts, return to the previous step and update the requirements file with additional packages, or package versions, based on pip's output.

Package dependencies

If your Amazon MWAA environment has a private webserver, you must package your dependencies into a .zip file, upload the file to your S3 bucket, and specify the package location in your Amazon MWAA instance configuration. Because a private webserver can't access the PyPI repository over the internet, pip will install the dependencies from the .zip file.

If you're using a public webserver configuration, you also benefit from a static .zip file, which makes sure the package information remains unchanged until it's explicitly rebuilt.

This process uses the versioned requirements file created in the previous section and the package-requirements function in the MWAA local runner.

To package your dependencies, complete the following steps:

  1. In a terminal, navigate to the directory where you installed the local runner.
  2. Download the constraints file for your Python version and your version of Apache Airflow, and place it in the plugins directory. For this post, we use Python 3.11 and Apache Airflow v2.8.1:
curl -o plugins/constraints-2.8.1-3.11.txt https://raw.githubusercontent.com/apache/airflow/constraints-2.8.1/constraints-3.11.txt

  3. In your requirements file, update the constraints URL to the locally downloaded file.

The --constraint statement instructs pip to compare the package versions in your requirements.txt file to the allowed versions in the constraints file. Downloading a specific constraints file to your plugins directory enables you to control the constraints file location and contents.

The updated requirements file will look like the following code:

--constraint "/usr/native/airflow/plugins/constraints-2.8.1-3.11.txt"
apache-airflow-providers-snowflake==5.2.1
apache-airflow-providers-dbt-cloud[http]==3.5.1
pyOpenSSL==23.3.0
snowflake-connector-python==3.6.0
snowflake-sqlalchemy==1.5.1
sortedcontainers==2.4.0

  4. Run the following command to create the .zip file:
./mwaa-local-env package-requirements

package-requirements creates an updated requirements file named packaged_requirements.txt and zips all dependencies into plugins.zip. The updated requirements file looks like the following code:

--find-links /usr/local/airflow/plugins
--no-index
--constraint "/usr/local/airflow/plugins/constraints-2.8.1-3.11.txt"
apache-airflow-providers-snowflake==5.2.1
apache-airflow-providers-dbt-cloud[http]==3.5.1
pyOpenSSL==23.3.0
snowflake-connector-python==3.6.0
snowflake-sqlalchemy==1.5.1
sortedcontainers==2.4.0

Note the reference to the local constraints file and the plugins directory. The --find-links statement instructs pip to install packages from /usr/local/airflow/plugins rather than the public PyPI repository.
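If you want to confirm locally that the archived wheel files alone can satisfy the requirements, one optional check (a sketch that assumes the contents of plugins.zip are extracted into the local plugins/ directory and that you run it from the local runner directory) is to pass the same flags to pip directly:

# Install only from the local wheel files, never from PyPI
pip install --no-index \
    --find-links plugins/ \
    --constraint plugins/constraints-2.8.1-3.11.txt \
    -r requirements/requirements.txt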

Deploy the requirements file

After you achieve an error-free requirements installation and package your dependencies, you're ready to deploy the assets to a non-production Amazon MWAA environment. Even when verifying and testing requirements with the MWAA local runner, it's best practice to deploy and test the changes in a non-prod Amazon MWAA environment before deploying to production. For more information about creating a CI/CD pipeline to test changes, refer to Deploying to Amazon Managed Workflows for Apache Airflow.

To deploy your changes, complete the following steps:

  1. Upload your requirements.txt file and plugins.zip file to your Amazon MWAA environment's S3 bucket.

For instructions on specifying a requirements.txt version, refer to Specifying the requirements.txt version on the Amazon MWAA console. For instructions on specifying a plugins.zip file, refer to Installing custom plugins on your environment. You can also perform the upload and update from the command line, as shown in the example that follows.
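The following commands show one way to do this with the AWS CLI; the bucket and environment names are placeholders, and the exact arguments depend on how your environment is configured:

# Upload the assets to the environment's S3 bucket (placeholder bucket name)
aws s3 cp requirements.txt s3://my-mwaa-bucket/requirements.txt
aws s3 cp plugins.zip s3://my-mwaa-bucket/plugins.zip

# Point the environment at the uploaded files (placeholder environment name);
# with S3 versioning enabled, you can also pin specific object versions
aws mwaa update-environment \
    --name my-mwaa-environment \
    --requirements-s3-path requirements.txt \
    --plugins-s3-path plugins.zip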

The Amazon MWAA environment will update and install the packages in your plugins.zip file.

After the update is complete, verify the provider package installation in the Apache Airflow UI.

  2. Access the Apache Airflow UI in Amazon MWAA.
  3. From the Apache Airflow menu bar, choose Admin, then Providers.

The list of providers, and their versions, is shown in a table. In this example, the page reflects the installation of apache-airflow-providers-dbt-cloud version 3.5.1 and apache-airflow-providers-snowflake version 5.2.1. This list only contains the provider packages installed, not all supporting Python packages. Provider packages that are part of the base Apache Airflow installation will also appear in the list. The following image is an example of the package list; note the apache-airflow-providers-dbt-cloud and apache-airflow-providers-snowflake packages and their versions.

Airflow UI with installed packages

To verify all package installations, view the results in Amazon CloudWatch Logs. Amazon MWAA creates a log stream for the requirements installation, and the stream contains the pip install output. For instructions, refer to Viewing logs for your requirements.txt.

A successful installation results in the following message:

Successfully installed apache-airflow-providers-dbt-cloud-3.5.1 apache-airflow-providers-snowflake-5.2.1 pyOpenSSL-23.3.0 snowflake-connector-python-3.6.0 snowflake-sqlalchemy-1.5.1 sortedcontainers-2.4.0

If you encounter any installation errors, determine the package conflict, update the requirements file, run the local runner test, re-package the plugins, and deploy the updated files.

Clean up

If you created an Amazon MWAA environment specifically for this post, delete the environment and S3 objects to avoid incurring additional charges.

Conclusion

In this post, we discussed several best practices for managing Python dependencies in Amazon MWAA and how to use the MWAA local runner to implement these practices. These best practices reduce DAG and pip installation errors in your Amazon MWAA environment. For additional details and code examples on Amazon MWAA, visit the Amazon MWAA User Guide and the Amazon MWAA examples GitHub repo.

Apache, Apache Airflow, and Airflow are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries.


About the Author


Mike Ellis is a Technical Account Manager at AWS and an Amazon MWAA specialist. In addition to assisting customers with Amazon MWAA, he contributes to the Apache Airflow open source project.
