Upstream or Downstream: The Battle of Task Dependencies

Task dependencies are a useful and popular feature in Airflow. Simply put, they define the order of task execution: which tasks to run, and in what order. While not required, task dependencies are almost always set. What if a task dependency is not defined? Well, then Airflow takes matters into its own hands […]
Which is Right for You: A Database, Data Lake, or Data Warehouse?

As a data engineer, it’s your responsibility to handle and process data effectively. The type of data you encounter in your work can vary, but you will no doubt encounter databases, data lakes, and data warehouses at some stage along your journey. In this blog post, I briefly highlight the differences […]
Getting Started with Python Exception Handling

The errors that occur during the execution of a Python program are called exceptions. Examples of exceptions include dividing by zero, combining objects of incompatible types, and many others. Some exceptions have specific names, such as ZeroDivisionError and TypeError. If exceptions are not handled properly, they can halt the entire execution of the program. This […]
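
As a minimal sketch of the two named exceptions in the excerpt above, the snippet below catches ZeroDivisionError and TypeError so the program keeps running instead of halting. The helper name safe_divide and the choice to return None on failure are assumptions made for illustration; they are not taken from the post.

```python
def safe_divide(numerator, denominator):
    """Return numerator / denominator, or None if the division cannot be done.

    Hypothetical helper for illustration only.
    """
    try:
        return numerator / denominator
    except ZeroDivisionError:
        # Raised when the denominator is zero.
        return None
    except TypeError:
        # Raised when the operands have incompatible types, e.g. int / str.
        return None


if __name__ == "__main__":
    print(safe_divide(10, 2))
    print(safe_divide(1, 0))
    print(safe_divide(1, "a"))
```

Returning None keeps the sketch short; in a real pipeline you might prefer to log the error or re-raise it with more context.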
The 5 Verbs of REST APIs: A Beginner’s Guide

The data pipelines you build as a data engineer will move data from one location to another, often from various sources such as databases and APIs to places like data warehouses or data lakes. You will no doubt have to deal with REST APIs at some stage, so let’s have a look at what REST […]
The Three Vs of Big Data: A Beginner’s Guide

The three Vs of big data refer to the three characteristics that make managing and analysing large datasets particularly challenging. The three Vs are: Volume: The sheer amount of data that needs to be processed and analysed. Big data sets are often too large to be stored and processed on a single computer, and may […]
Comparing Amazon S3 Storage Options: s3n, s3a, and s3

When I’m building pipelines, I commonly need to access S3 at some point in the process. In some articles and tutorials, S3N or S3A may be mentioned in the connection string for S3. What is the difference? I look into the differences here. In a nutshell, S3N and S3A are storage options provided by […]
Exploring the Ins and Outs of Taming Big Data with Apache Spark and Python: A Review of the Hands-On Udemy Course

I recently completed the course “Taming Big Data with Apache Spark and Python – Hands On!” on Udemy, taught by Frank Kane. I wanted to share my thoughts on the course, including what I enjoyed and what I didn’t like as much. My thoughts: If you are already familiar with Spark […]
Why Every Data Engineer Needs Jupyter Notebooks in Their Toolkit

Jupyter Notebooks are an important part of my daily routine, whether for work or personal projects. They have been incredibly useful to me and I can’t imagine being a data engineer without them. The best thing is, it’s easy to get started. You can use them by downloading JupyterLab or accessing it through your […]
The Key Files for a Smooth-Running Python Project

The three key files in a Python project are the Dockerfile, the Makefile, and the requirements.txt. Makefile: The Makefile is particularly important because it allows you to automate various steps in your project, such as installation, deployment, and linting. Essentially, the Makefile acts like a set of recipes that help you streamline your work […]
Amazon Athena for Apache Spark

Christmas has come early for us and we have the good folks at AWS to thank for it. What is it, I hear you say? A new feature that I believe is going to change the way we use Athena going forward. Well, it is going to change the way I use it going forward, that is […]
