Airflow is a platform-independent workflow orchestration tool for creating and monitoring batch pipeline processes - even very complex processes can be modelled here without any problems. And it integrates with basically all important platforms and tools that exist in the data engineering world, such as AWS or Google Cloud. In addition to scheduling and organizing your processes, Airflow can also be used for monitoring, so you can keep track of what is happening during processing and troubleshoot problems.
In short: Airflow is currently one of the most popular tools for workflow orchestration. It is needed and used in many companies and is therefore in high demand when it comes to the skills of a data engineer. So, students especially should definitely focus on learning to work with it.
Airflow fundamental concepts
At the beginning, you get to know the basics of Airflow. Here you will learn how to create a process as a DAG (Directed Acyclic Graph) and how such a DAG is structured from its tasks and operators. You will also learn what the Airflow architecture looks like and which components the tool consists of, such as the metadata database or the web interface. Finally, you get a few examples of pipelines in an event-driven process that could be planned and created with Airflow.
The aim of this project is to process weather data from the Internet. As soon as the processing starts, the DAG reads the data from a weather API, transforms it, and stores it in a Postgres database.
Here you will learn how the Docker setup works, how to check whether the Airflow user interfaces are running correctly, and whether the Docker containers are up and healthy. During setup you also configure the weather API and create your Postgres database with the appropriate tables and columns.
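The table creation can be sketched as follows. The table and column names here are assumptions for illustration, not the project's actual schema; `conn` stands for any DB-API connection (e.g. one opened with psycopg2 against the Postgres container):

```python
# Hypothetical DDL for a weather table; adapt names and types to your project.
WEATHER_DDL = """
CREATE TABLE IF NOT EXISTS weather (
    id          SERIAL PRIMARY KEY,
    city        TEXT NOT NULL,
    temp_c      NUMERIC(5, 2),
    humidity    INTEGER,
    fetched_at  TIMESTAMPTZ DEFAULT now()
);
"""


def create_weather_table(conn) -> None:
    """Create the weather table over any DB-API connection (e.g. psycopg2)."""
    with conn.cursor() as cur:
        cur.execute(WEATHER_DDL)
    conn.commit()
```

Running this once during setup is idempotent thanks to `IF NOT EXISTS`, so it is safe to re-execute when the containers are recreated.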
Hands-on part: Learning to create DAGs
First, we'll go over the Airflow web interface again so you understand how it works, what you can see there, and how you monitor DAGs.
Then you create classic Airflow 2.0 DAGs that read the data from the API, transform it, and print it to the log.
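The transform step in such a DAG is plain Python that can be handed to a PythonOperator. A minimal sketch of what it might look like - the response field names below are assumptions for illustration and must be adapted to the actual weather API:

```python
import json


def transform_weather(payload: str) -> dict:
    """Flatten a raw weather-API JSON payload into a simple record.

    Assumes a response shape like {"name": ..., "main": {"temp": ..., "humidity": ...}}
    with the temperature in Kelvin; adjust to the API you actually use.
    """
    raw = json.loads(payload)
    return {
        "city": raw["name"],
        "temp_c": round(raw["main"]["temp"] - 273.15, 1),  # Kelvin -> Celsius
        "humidity": raw["main"]["humidity"],
    }


# Example payload as the API might return it (values made up):
sample = json.dumps({"name": "Berlin", "main": {"temp": 294.65, "humidity": 40}})
record = transform_weather(sample)
print(record)  # {'city': 'Berlin', 'temp_c': 21.5, 'humidity': 40}
```

Printing the resulting record inside the task is enough at this stage: anything a task writes to stdout ends up in the Airflow task log.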
You then create DAGs with Airflow’s TaskFlow API. This is a nice API that offers a new, more concise way to write Airflow DAGs. Here, you first recreate the same processing as before with classic Airflow 2.0 operators. Then you use TaskFlow to fetch data from the API, transform it, and store it in the database. In an example, I also show you how to build a fan-out to run multiple tasks in parallel.