Learn how to use Airflow to construct a large-scale marketing data warehouse
Having your own marketing data warehouse is no longer up for debate. There are typically two approaches to building one: purchasing a connector tool like Supermetrics or Fivetran that pipes data into a warehouse such as BigQuery, or developing your own system.
Purchasing a connector tool offers a ready-made solution that can save you time, but it can be expensive and lacks customization. On the other hand, building your own system provides greater control and flexibility over your data, but it comes with its own set of challenges.
Our focus here is on companies that have chosen the second route: those that have built their own systems for pulling data from marketing platforms across multiple accounts. Agencies are a perfect example.
They typically use Python scripts that loop over a list of accounts, extracting data from sources like Google Ads, Facebook Ads, and Google Analytics. Each data extraction is treated as a task within the script, and these tasks run sequentially for each client. The scripts are then deployed on Cloud Functions, Lambda, or as classic cron jobs.
Consider a scenario where you have an individual Python script for each step of your ETL process, along the lines of the sketch below.
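The account IDs and the my_connectors helpers here are hypothetical; the point is simply a script that loops over accounts for one step of the pipeline:

```python
# extract_google_ads.py - illustrative sketch of a per-account extraction script
import datetime

from my_connectors import fetch_report, write_to_gcs  # hypothetical helper functions

ACCOUNTS = ["123-456-7890", "234-567-8901"]  # hypothetical client account IDs

def run(day: datetime.date) -> None:
    for account_id in ACCOUNTS:                # accounts are processed one after another
        rows = fetch_report(account_id, day)   # pull spend/clicks/impressions for one account
        write_to_gcs(account_id, day, rows)    # stage the raw data before loading it

if __name__ == "__main__":
    run(datetime.date.today() - datetime.timedelta(days=1))
```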
This approach might work fine when you're dealing with a smaller scale or fewer platforms. But as your marketing operations grow, managing these individual scripts can become daunting and comes with several problems:
Manually managing these scripts means executing each one in the correct sequence, ensuring each one finishes successfully before starting the next. If a script fails, you must diagnose the problem, fix it, and then restart the sequence from that point. This is where Apache Airflow comes into the picture.
Apache Airflow is an open-source platform that programmatically orchestrates your workflows, creating pipelines of tasks that execute in a specified order and automatically managing the flow of data between them. Airflow has two core concepts: the DAG, a directed acyclic graph that defines a workflow and the order in which its tasks run, and the task, a single unit of work within a DAG.
In the context of our marketing data warehouse, each Python script becomes a task in the Airflow DAG. You can set dependencies that ensure your data is extracted, cleaned, transformed, and loaded into BigQuery in the correct sequence.
Apache Airflow's key features address the challenges mentioned above: task dependencies guarantee that scripts run in the correct sequence, failed tasks can be retried automatically and re-run without restarting the whole pipeline, the web UI gives you visibility into every run and its logs, and the built-in scheduler replaces ad-hoc cron jobs.
In this article, we will cover how to convert your classic Python ETL scripts into Airflow DAGs. The article is split into three parts:
The simplest way to install Airflow locally on your machine is to use Docker. If you're not familiar with Docker, I'd recommend checking this out first. Just make sure to download and install the Docker Desktop app, and I will guide you through the installation step by step.
Step 1: Create a new project or folder and name it 'airflow'
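From a terminal (assuming a Unix-like shell):

```bash
mkdir airflow && cd airflow
```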
Step 2: Within your project, execute the following command in the terminal to fetch the docker-compose.yaml file:
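This is the fetch command from the official Airflow Docker quick-start; adjust the version number in the URL to the release you want to run:

```bash
curl -LfO 'https://airflow.apache.org/docs/apache-airflow/2.9.3/docker-compose.yaml'
```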
Step 3: Create three folders inside your project, which docker-compose.yaml mounts into the containers: dags (your DAG files), logs (task execution logs), and plugins (any custom plugins).
You have two options to create these folders: either manually create them or use the following command in the terminal:
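Mirroring the official quick-start:

```bash
mkdir -p ./dags ./logs ./plugins
```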
Your project structure should now resemble the following:
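In other words:

```
airflow/
├── dags/
├── logs/
├── plugins/
└── docker-compose.yaml
```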
Step 4: Let's initialize the project by executing the following command:
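This one-time setup runs the database migrations and creates the default user defined in the compose file:

```bash
docker compose up airflow-init
```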
This process may take approximately 5 to 10 minutes while all the container images are pulled and initialized. Once the initialization is complete, you should see a message similar to the following:
Step 5: Now that we're done, let's start Airflow by running the following command:
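From the project folder:

```bash
docker compose up -d   # -d runs the containers in the background
```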
After starting Airflow, you can access the web interface at http://0.0.0.0:8080/ (http://localhost:8080/ also works). You will be prompted for a username and password, both of which are "airflow" by default.
Once you are inside Airflow, you will notice a set of example DAGs displayed.
Since we don’t need any of these examples, let’s clean them up by disabling DAG examples in the docker-compose.yaml file. Set the value of AIRFLOW__CORE__LOAD_EXAMPLES to "false" to prevent the examples from being loaded.
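In the environment block near the top of the official docker-compose.yaml (under x-airflow-common), the setting looks like this:

```yaml
AIRFLOW__CORE__LOAD_EXAMPLES: 'false'
```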
To apply the changes and ensure that the Docker containers are restarted, follow these steps:
Execute the following command to stop the Docker containers:
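From the folder containing docker-compose.yaml:

```bash
docker compose down
```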
Once the containers are stopped, start them again using the following command:
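This is the same command used in Step 5:

```bash
docker compose up -d
```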
After the containers are restarted, you should now observe that there are no DAGs displayed in Airflow, as the examples have been disabled and removed.
To create your first DAG in Airflow, go to the ./dags folder and create a new file. Name it, for example, ga4_dag.py.
There are two syntaxes to create a DAG in Airflow. The first one involves using standard constructors as shown below:
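A minimal sketch of the constructor-based syntax (the dag_id, schedule, and callable are placeholders):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    print("extracting...")

with DAG(
    dag_id="ga4_dag_classic",        # placeholder DAG name
    start_date=datetime(2023, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
```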
The second approach, which is the preferred one and the one used throughout this series, is to use decorators: the @dag decorator defines the DAG and the @task decorator defines its tasks.
Let's create a GA4 DAG with three dummy tasks: extract, transform, and load.
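A minimal sketch, with print statements standing in for real extract, transform, and load logic:

```python
from datetime import datetime

from airflow.decorators import dag, task

@dag(
    dag_id="ga4_dag",
    start_date=datetime(2023, 1, 1),
    schedule="@daily",
    catchup=False,
)
def ga4_dag():

    @task
    def extract():
        print("extracting GA4 data...")

    @task
    def transform():
        print("transforming data...")

    @task
    def load():
        print("loading data into BigQuery...")

    # chain the tasks so they run one after another
    extract() >> transform() >> load()

ga4_dag()
```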
Note the >> between the tasks, which tells Airflow to run them in that order. If you want tasks to run in parallel instead, you can group them in a list, like [extract(), transform(), load()].
Once you access the Airflow web interface, you should be able to see your DAG displayed.
Now, if you click into it and go to the ‘Graph’ tab, you should see the diagram of your tasks:
To run the DAG, click on the play button. If everything goes well, you should observe the DAG completing successfully.
To inspect the logs of each task, you can click on the individual tasks within the DAG. This will allow you to access the full logs and examine any outputs or errors encountered during task execution.
That concludes the first part of this article series. Stay tuned for the next part, where we will dive deeper into how to structure your DAG, how to run backfill, and how to use dynamic task mapping to execute tasks for hundreds of accounts.