A Data Engineer's switch from AWS to GCP: first thoughts

I’ve always been an AWS kind of guy, not by choice, but just because everywhere I worked had AWS. Sure, there were a few places that had Azure – cue the developers in the room grumbling about Microsoft. Azure was doing something or other, but I just didn’t have any exposure to Azure […]
Useful Git commands every Data Engineer should know

There is no getting around it. If you are working as a Data Engineer, you will be using some form of source control to manage and track your pipeline code changes, handle your deployments (CI/CD), collaborate with your team, push infrastructure changes, and document your pipelines. If you are not using source control, stop reading […]
How to Pin Public Datasets to a Project in Google BigQuery

I recently started spending more time in GCP, so this is a short and sweet post on how to pin the bigquery-public-data dataset in Google BigQuery in three easy steps. I clicked around for ages trying to figure this out, but thanks to a little bit of googling and dumb luck I figured it out. […]
Review of Data Engineer with Python – Datacamp Course

I recently wrapped up the Data Engineer with Python course on Datacamp and thought I would write up my thoughts on the course and whether it was worth the massive amount of time I invested in it. The course promises a lot: “Start your journey to becoming a data engineer and gain the in-demand data […]
Getting MySQL up and running in Docker

Recently, I needed to get an instance of MySQL up and running quickly to play around with different methods of loading a large CSV file (1.6 million rows) into a database. I wasn’t going to install MySQL locally, so I decided to go with a MySQL Docker image and use that to trial different loading […]
AWS Made Easy with Boto3

If you’re starting out as a Data Engineer and using AWS, then life gets a whole lot easier with the use of Boto3, the AWS SDK for Python. Boto3 simplifies integration of your Python applications, libraries, or scripts with AWS services like Amazon S3, EC2, DynamoDB and more. Well, that’s what the documentation says. […]
I asked ChatGPT what Data Engineers do. Here’s what it said.

My problem with ChatGPT is that it never says “Perhaps”, “I think”, “I could be wrong” or “Maybe”. It speaks with the same level of absolute confidence whether it’s right or wrong. It just confidently presents its answers with a level of surety and authority, leaving it up to you to weed out the garbage. AI […]
From ETL to ELT: How the Data Processing Landscape has Changed

The process of ETL (Extract, Transform, Load) has been around for decades. It’s one of the most widely used and well-known methods in data warehousing and business intelligence. However, ELT (Extract, Load, Transform) has become increasingly popular in the realm of cloud computing and big data. It is now often seen as […]
Four types of databases out there

Databases have been around almost as long as computers have. There are many flavours of databases out in the wild, and as a Data Engineer you will run into some of them, if not all of them, throughout your career. In this post I will look at the 4 most common types […]
Query S3 using S3 Select and SQL

S3 Select is a highly valuable and, in my opinion, one of the most underappreciated features within AWS S3. As a Data Engineer, it is a must-have in your toolkit. What is S3 Select? A feature within S3 that allows you, the Data Engineer, to run simple SQL queries on objects in S3 buckets. For […]
