Amazon Athena – The Data Analytics Tool
Analytics plays a key role in gaining clear business insight, and if the data you want to analyze is huge, there are a number of parameters that need to be taken care of: cost, domain expertise, maintenance, regular upgrades, the problem of concurrent users, and so on.
Redshift is a fully managed data warehouse solution by Amazon. It is an efficient way to manage data, but if more disk capacity is required, we need to grow the cluster by adding more nodes. Being horizontally scalable, we can end up managing a gigantic Redshift cluster. There is a simpler approach to this as well.
What is Amazon Athena?
Amazon Athena is an interactive query service that makes it easy to analyze data directly from S3 using standard SQL.
- Serverless – there is no infrastructure to provision, no instances, no administration, and no updates to apply. We talk to Athena through an endpoint or directly from the AWS console.
- Zero spin-up time – as soon as we log in, we can submit queries, analyze results, or create tables.
- All upgrades happen behind the scenes and are transparent; we get their benefit without any downtime.
One can query the data directly from S3; there is no need to load it into an external data warehouse, since we query the data right where it lives. Athena supports a variety of raw formats in S3, such as CSV text, JSON, weblogs, and AWS service logs. For example, if we have a text file that can be parsed with a regular expression, we can define that regular expression as part of the CREATE TABLE query.
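As a sketch of that idea, the snippet below assembles a regex-based DDL statement as a plain string. The table name, bucket, columns, and regular expression are all illustrative assumptions, not from the original article; only the `RegexSerDe` class name is Athena's real Hive serde:

```python
# Hypothetical example: an access log stored in s3://my-bucket/logs/
# is mapped onto columns via a regular expression in the CREATE TABLE DDL.
# The regex uses character classes ([^ ]+) to avoid backslash escaping.
ddl = """
CREATE EXTERNAL TABLE IF NOT EXISTS access_logs (
  client_ip STRING,
  request   STRING,
  status    INT
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  'input.regex' = '([^ ]+) "([^"]*)" ([0-9]{3})'
)
LOCATION 's3://my-bucket/logs/'
""".strip()

print(ddl.splitlines()[0])
```

Once this DDL runs, the underlying log files stay in S3 untouched; Athena only records the table's metadata and applies the regex at query time.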
Since the data is streamed directly from S3, we also benefit from S3's durability and availability.
Cost-effective – we pay only for S3 storage, which is quite cheap, plus a per-query charge.
Per-query cost – 5 dollars per terabyte of data scanned from S3.
Failed queries and DDL queries are free. If we cancel a query while it is running, we are charged for the amount of data scanned up to that point. At the same time, there are many optimization techniques to reduce this cost, such as partitioning the data when creating tables, so that a query on the table scans only part of the S3 data.
We can also use a columnar storage format. If data is stored correctly, queries scan less data, and the cost drops accordingly.
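The effect of partitioning and columnar storage on cost is easy to see with back-of-the-envelope arithmetic at the $5-per-TB rate. The data sizes below are made-up numbers for illustration:

```python
# Athena pricing model: $5 per TB of data scanned.
PRICE_PER_TB = 5.0

def query_cost(bytes_scanned: int) -> float:
    """Dollar cost of a query that scans `bytes_scanned` bytes."""
    return PRICE_PER_TB * bytes_scanned / 2**40  # 2**40 bytes = 1 TB

full_scan   = query_cost(2 * 2**40)   # whole 2 TB dataset, no pruning
partitioned = query_cost(50 * 2**30)  # a single 50 GB date partition
columnar    = query_cost(10 * 2**30)  # two Parquet columns, ~10 GB total

print(f"full scan: ${full_scan:.2f}")
print(f"one partition: ${partitioned:.3f}")
print(f"columnar + partition: ${columnar:.3f}")
```

A full scan of the hypothetical 2 TB dataset costs $10.00, while restricting the same query to one partition or to a couple of Parquet columns brings it down to cents.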
Presto (SQL on anything)
Athena uses Presto as a managed service. Presto is an in-memory distributed SQL engine that came out of Facebook. It reads data from wherever it lives and processes it in place, so it can connect to a variety of sources, including HDFS, S3, MongoDB, MySQL, Postgres, Redshift, and SQL Server. It can handle hundreds of concurrent queries on a single cluster, which means we don't have to maintain complex clusters ourselves.
Presto decouples data from its processing; no data is stored in Presto itself, so it reads from elsewhere, e.g., S3. And since S3 storage is really cheap, it makes a lot of sense to use it as the storage layer for our data lake. So data resides in S3, we create metadata over it (stored in the AWS Glue Data Catalog), all processing happens in memory, and results are streamed directly to the console (or to a configured output bucket); we can also download the results as CSV from the console.
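This flow can be sketched programmatically. The helper below just assembles the parameter dictionary that boto3's Athena `start_query_execution` call expects; the database name, query, and output bucket are hypothetical, and the actual submission (which needs AWS credentials) is shown only as a comment:

```python
def build_athena_request(query: str, database: str, output_s3: str) -> dict:
    """Assemble the parameters for Athena's StartQueryExecution API.

    `database` is resolved against the Glue Data Catalog; `output_s3` is the
    bucket/prefix where Athena writes the CSV result file.
    """
    return {
        "QueryString": query,
        "QueryExecutionContext": {"Database": database},
        "ResultConfiguration": {"OutputLocation": output_s3},
    }

params = build_athena_request(
    "SELECT status, count(*) FROM access_logs GROUP BY status",
    database="weblogs",                    # hypothetical Glue database
    output_s3="s3://my-results/athena/",   # hypothetical output bucket
)

# With credentials configured, this would be submitted as:
#   import boto3
#   athena = boto3.client("athena")
#   response = athena.start_query_execution(**params)
```

The query runs entirely inside the Athena service; our side only supplies SQL, a catalog database, and an S3 location for the results.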
Internally, Presto automatically parallelizes queries. Compared with Hive, which makes brilliant use of Java MapReduce in Hadoop, executing seven Hive commands in parallel took hours, whereas Presto performed exceptionally well. This was one of the reasons Facebook created and switched to Presto, which in turn powers Amazon Athena [ Read more at source link]
The blog was originally published at c-sharpcorner.