Amazon Athena for Apache Spark

Christmas has come early for us, and we have the good folks at AWS to thank for it. What is it, I hear you say? A new feature that I believe is going to change the way we use Athena going forward. Well, it is certainly going to change the way I use it.

Athena now lets us, “the data people”, easily set up an interactive, fully managed notebook with the Apache Spark engine pre-loaded and ready to go, plus some good old SQL sprinkled on top for good measure, all within Athena. There is no infrastructure to configure and no messy Spark configuration to set up.

I have always found it cumbersome and time-consuming to query S3 (our data lake) through Athena. Personally, I don't find Athena very user-friendly (my years of using SQL Server Management Studio have spoiled me). Now, however, we can use Python and the built-in Spark functionality within a convenient notebook to make things easier.

Testing

Spinning up a notebook is relatively easy, and this blog post by AWS walks you through it very well, but I’ll take you through my checklist of cool features.

First things first: you’ll need to set up a workgroup in Athena.

To create a new notebook, go to the notebook explorer, select your new workgroup and follow the prompts to create a new notebook. It’s as easy as that!

I made a basic notebook that reads in a text file called “marvel-names.txt” from an S3 bucket and displays the data in a dataframe using a pre-defined schema. Although it’s a simple process, the potential for this method is significant and could be a game-changer for Athena workloads.
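For anyone who wants to try something similar, here is a minimal sketch of what such a notebook cell could look like. It is not my exact code: the bucket name, the two-column layout (a numeric ID followed by a quoted name, separated by a space) and the helper name load_names are all assumptions for illustration. It also assumes the notebook already provides a Spark session called spark, which is why the cell only runs the read when that variable exists.

```python
# Hypothetical S3 location; replace the bucket with your own.
S3_PATH = "s3://my-example-bucket/marvel-names.txt"

# Pre-defined schema as a Spark DDL string.
# Assumption: each line holds an integer ID and a quoted name, e.g. 1 "SOME NAME".
NAMES_SCHEMA = "id INT, name STRING"


def load_names(spark, path):
    """Read the space-separated text file into a DataFrame using the fixed schema."""
    return (
        spark.read
        .schema(NAMES_SCHEMA)   # skip schema inference, use our own column types
        .option("sep", " ")     # assumed delimiter between ID and name
        .csv(path)
    )


# Assumption: the notebook exposes a ready-made SparkSession named `spark`.
# Outside such a notebook this part is simply skipped.
if "spark" in globals():
    names = load_names(spark, S3_PATH)
    names.show(5, truncate=False)
```

Passing the schema as a DDL string keeps the cell free of extra imports; a StructType built from pyspark.sql.types would do the same job if you prefer it spelled out.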

One of the benefits of this method is that the notebook is saved, which in my experience is not the case with AWS Glue interactive sessions notebooks. I have used those before and found them cumbersome: if the connection is lost, the notebook is lost with it and there is no way to retrieve it. With Athena, you can open the notebook again if that happens.

Another advantage is that the notebook closes its session after 20 minutes of inactivity, which is great for those of us who forget to shut things down.

One thing to keep in mind is that it is still early days for this feature, and it is not yet available in all regions. At the time of writing, I can see it in Northern Virginia and Ireland.

Tim