Amazon Athena – The Data Analytics Tool

October 18, 2019

Analytics plays a key role to have a clear business insight, and if the data you want to analyze is huge, then there are a number of parameters that need to be taken care of viz: cost, the expertise of the domain, maintenance, regular upgrades, problem of concurrent users, etc.

 Redshift is a fully managed data warehouse solution by Amazon. It is an efficient solution to manage data, but if more disk capacity is required, we need to increase the size of our cluster by adding more nodes. Being horizontally scalable, we can end up to handle a gigantic redshift cluster. There can be a simpler approach to this as well.

 What is Amazon Athena?

Amazon Athena is an interactive query service that makes it easy to analyze data directly from S3 using standard SQL.

  1. Serverless- that means there is no infrastructure, no instances, no administration, no updates required. We talk to Athena using an end-point or directly by AWS console.
  2. Zero Spin-up time — that means as soon as we log in, we are able to submit queries, or analyze results, or create tables.
  3. All the upgrades that happen behind the scenes are transparent; we get the benefit of the upgrades without any downtime.

One can query the data directly from S3; we don’t need loading of data in any external data warehouse. We can query the data directly from where it lives. It supports a variety of data in raw format from S3, which can be a text CSV, JSON, weblogs, AWS service logs. For, eg. If we have a text file that can be parsed using a regular expression, then also we can define that regular expression as a part of creating a table query.

Since the data is streamed directly from S3, so we can also take advantage of S3 durability and availability.

 Cost-effective – We pay only for S3, which is quite cheap, and externally we Pay per query.

 Per query cost – 5 dollars per Terabyte scan for S3.

 Failed queries and DDL queries are free. If we cancel a query, while running, the charge will be on the amount of data scanned. At the same time, there are many optimization techniques to reduce this cost, like partitioning the data while creating tables, so that the query on the table will scan only a part of S3.

Also, we can prefer using columnar storage of data. If data is correctly stored, we need to scan fewer data. Hence cost reduces.

Presto (SQL on anything)

Athena uses Presto as a managed service. Presto is an in-memory distributed SQL engine, which came out of Facebook. It read data from anywhere and actually processes data from where it lives; hence can be connected to a variety of connectors including HDFS, S3, MongoDB, MySQL, Postgres, Redshift, SQL Server. It has the power to handle hundreds of concurrent queries on a single cluster. That means we don’t have to maintain complex clusters.

Presto decouples the data from its processing; No data is stored in Presto, so it reads it from elsewhere. e.g., S3. And since S3 storage is really cheap, it makes a lot of sense to use it as the storage for our Datalake. So, data resides in S3, we create metadata over it, which is stored in AWS Glue Catalog Metastore, all processing is in-memory, and the results are directly streamed to console (or can also be customized to an output bucket), also we can download data in CSV format from the console directly.

Internally, it automatically parallelizes queries, so when compared to Hive, which provides a brilliant use of Java MapAndReduce in Hadoop, executing seven hive commands in parallel took hours, whereas Presto was able to perform exceptionally good. This was one of the reasons Facebook created and switched to Presto, which further powers-up Amazon Athena [ Read more at source link]

The blog is originally published at c-sharpcorner.

Amit Dua

Co-Founder at Signity Solutions
A tech-evangelist, he has an uncanny ability to synergize and build associations, thriving teams, and reputable clients. His vision to grow his decade-old company as per global standards, and his deep analytical skills to foresee market trends, as well as global challenges, has enabled his company to come up with products and services best suited not only for local, but global markets as well.

Get In Touch

Send us your requirements. We will get back to you with a free quote.

Want to See Your Idea as the Next Big Thing?

  • With Signity Solutions, your business is destined to grow at a new level. Get in touch today to create a game changing experience.

Send Us a Message

SSL-Security-SignitySolutions

Our Global workspaces

Contact-India-SignitySolutions
IndiaHQ

Netsmartz House,
3rd Floor, Plot #10, Rajiv Gandhi IT Park,
Chandigarh 160101
+91-172-465-0448

Contact-Australia-SignitySolutions
New Zealand

14049, 35 George
Street, Kingsland
Auckland 1024.
+64 22 111 0002

Contact-United-States-SignitySolutions
USA

762 Green St, Suite 1-R,
Iselin, NJ 08830
+1-619-309-4653

Contact-Netherlands-SignitySolutions
Netherlands

Kanteel-35 1083 DB
Amsterdam

Drop us an e-mail at sales@signitysolutions.com