Wednesday, January 25, 2017

AWS Athena to read Dynamodb backup data

Our data scientists wanted to run ad-hoc queries against a DynamoDB table. To do that directly, we had to raise the table's read capacity to a high value, which was costly. So we architected a system using Athena where the data scientists could run their queries whenever they wanted.

We used the AWS Data Pipeline template to back up the DynamoDB table, which archived the data into S3 as multiple JSON-formatted files. This is where Athena came into the picture. Athena is a new AWS service that runs Presto under the hood. So we created a table in Athena over the DynamoDB backup location and started exploring the data:

CREATE EXTERNAL TABLE IF NOT EXISTS default.test (
  key string,
  sortkey string,
  details string
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES ('ignore.malformed.json' = 'true')
LOCATION 's3://bucket-name/DynamodbDBBackup/2016-12-23-20-28-20/';

That created the table in Athena. Time to run a query:

select * from default.test limit 3;
Sample Data:


Now we had to extract the data so that we got only the values of key and sortkey, and, from the details attribute, only usage.

As mentioned, Athena runs Presto, so we used Presto's JSON functions to pull values out of the JSON. The json_extract function is described at presto functions:

The json_path expression can use dot notation and takes $ as the root path. More details are available at JsonPath.
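To make the path syntax concrete, here is a small Python sketch (an illustration, not Athena itself) of what json_extract does for simple dot-notation paths; the function name and its limited scope are mine:

```python
import json

def json_extract(document: str, path: str):
    """Minimal sketch of Presto's json_extract for dot-notation paths.

    Supports only simple paths like '$.s' or '$.a.b'; the real
    implementation handles the full JsonPath grammar.
    """
    value = json.loads(document)
    for part in path.lstrip("$").strip(".").split("."):
        value = value[part]
    return value

# DynamoDB backups wrap each attribute in a type descriptor like {"s": "..."},
# which is why the queries below reach for '$.s'.
print(json_extract('{"s": "user-123"}', '$.s'))  # user-123
```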

Using the functions above, we wrote our query:

SELECT cast(json_extract(key, '$.s') as varchar) as key,
       cast(json_extract(sortkey, '$.s') as varchar) as sortkey,
       json_extract(replace(cast(json_extract(details, '$.s') as varchar), '""', '"'), '$.usage') as usage
FROM test
LIMIT 10;
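The replace('""', '"') step is needed because the JSON document inside the details attribute arrives with its quotes doubled after the cast, so it is not valid JSON until they are collapsed. A quick Python sketch of that repair, using a made-up details value:

```python
import json

# Hypothetical value of cast(json_extract(details, '$.s') as varchar):
# the inner document's quotes are doubled, so it is not valid JSON yet.
extracted = '{""usage"": 42, ""plan"": ""pro""}'

repaired = extracted.replace('""', '"')   # collapse doubled quotes
usage = json.loads(repaired)["usage"]
print(usage)  # 42
```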

And that gave us the data as:


It was really fun to use Athena.
