Blog

What is Amazon Athena? – The New Serverless Data Analytics Tool

Serverless Data Analytics Tool

Introduction To Amazon Athena

Amazon Athena is a serverless, interactive query service that enables users to analyze data in Amazon S3 using standard SQL. Athena is designed to make it easy to analyze large amounts of data stored in S3 without having to set up and manage infrastructure. With Athena, you can quickly analyze data in a cost-effective and secure way, using only the data you need.

Athena uses Presto, an open-source distributed SQL query engine, to execute SQL queries on data stored in S3. This means that you can use standard SQL to query data in S3, without having to learn a new query language or tool. Athena also supports a wide range of data formats, including CSV, JSON, Avro, ORC, and Parquet.

To use Athena, you create a table that points to data stored in S3, and then you can query that data using standard SQL. Athena automatically scales to handle any amount of data, so you can focus on analyzing your data rather than managing infrastructure. You only pay for the queries that you run, so you can keep costs low by analyzing only the data you need.

Overall, Amazon Athena provides a simple and cost-effective way to analyze data stored in Amazon S3 using standard SQL. It’s an ideal solution for data analysts, data scientists, and anyone who needs to analyze large amounts of data without having to set up and manage infrastructure.

This article focuses on the following pointers:

Difference Between Microsoft SQL Server And Amazon Athena

Use Of Amazon Athena
Data Analysts/Developers: Do you offer Storage
Accessing Amazon Athena
Features Of Athena

Difference Between Microsoft SQL Server And Amazon Athena

Microsoft SQL Server and Amazon Athena are both database management systems, but they have several differences in terms of their architecture, functionality, and usage. Here are some of the main differences between the two:

Architecture: Microsoft SQL Server is a traditional, on-premises database management system that requires you to set up and manage your own infrastructure. Amazon Athena, on the other hand, is a serverless, cloud-based service that runs on Amazon Web Services (AWS) and eliminates the need for you to manage any infrastructure.
Query language: Microsoft SQL Server uses the SQL query language, just like Amazon Athena. However, SQL Server also supports T-SQL, a proprietary extension to SQL that provides additional functionality for working with relational data.
Data storage: SQL Server stores data in a variety of formats, including relational tables, views, and indexes. Athena, on the other hand, is designed to work with data stored in Amazon S3, which supports a wide range of file formats, including CSV, JSON, Avro, ORC, and Parquet.
Scalability: SQL Server is designed to scale vertically, meaning you can add more resources (e.g., CPUs, RAM) to a single server to handle increased workload. Athena, on the other hand, is designed to scale horizontally, meaning it can automatically distribute queries across a cluster of nodes to handle large workloads.
Cost: SQL Server requires you to purchase and manage your own infrastructure, which can be expensive. Athena, on the other hand, is a serverless service that allows you to pay only for the queries you run, making it more cost-effective for smaller workloads.

In summary, Microsoft SQL Server and Amazon Athena have different architectures and functionality, but they both support SQL and can be used to manage and analyze data. SQL Server is a traditional database management system that requires you to set up and manage your own infrastructure, while Athena is a serverless service that runs on AWS and eliminates the need for infrastructure management.

MySQL Vs Amazon Athena

Use Of Amazon Athena

If you are a Data Analyst and have experience analyzing data stored on S3, you will relate to this,

Amazon Athena is a serverless, interactive query service that allows you to analyze data in Amazon S3 using SQL. With Athena, you can analyze data stored in Amazon S3 using SQL and other standard data processing frameworks and BI tools. Athena is particularly well-suited to querying large datasets stored in S3 because it uses a distributed query engine that can process data in parallel, reducing the time it takes to run complex queries.

Some common use cases for Athena include:

Analyzing log data stored in S3 for security and compliance purposes
Running ad-hoc queries on data stored in S3 for data exploration and analysis
Querying data stored in S3 as part of a larger ETL (extract, transform, load) pipeline
Running interactive queries on data stored in S3 for business intelligence and reporting purposes

Athena is a fully managed service, which means that you don’t have to worry about setting up and maintaining any infrastructure. You simply pay for the queries that you run, making it a cost-effective solution for querying data stored in S3.

Data Analysts/Developers: Do you offer Storage?

AWS: Yes.

As data analysts or developers, it is not typically within our scope of responsibilities to offer storage for data. Storage is typically provided by the company or organization that we work for or by a third party storage provider. Our role would be to analyze and develop solutions using the data stored in the provided storage location.

Data Analysts/Developers: Do you have tools for Analytics?

AWS: Not sure.”

Amazon worked on this and came up with Amazon Athena. Now, you have a tool to play with your data. Athena helps you analyze unstructured, semi-structured, and structured data that is stored in Amazon S3. Using Athena you can create dynamic queries for your dataset. Athena also works with AWS Glue to give you a better way to store the metadata in S3.

Using AWS CloudFormation and Athena, you can use named queries. Named query allows you to name your query and then call it using the name.

This interactive service from AWS can be used by Data Scientists, and developers to take a sneak peek into the table instead of running the complete query. It is also used to fetch data from S3, and load it to different data stores using Athena JDBC driver, for log storage/analysis and Data Warehousing events.

Now that you know Athena is an interesting tool, let’s find out in this Amazon Athena tutorial how to get your hands on this amazing service from Amazon.

Accessing Amazon Athena

Accessing Athena is very easy and it can be done by either:

These are a few of the ways to access Amazon Athena. By now, you pretty much know everything important about Amazon Athena. Let me walk you through the different features of Athena.

Features Of Athena

Out of the many services provided by Amazon, Athena is one of the services. It has many features that make it suitable for Data Analysis. Let’s take a look at the different features one by one.

Easy Implementation: Athena doesn’t require installation. It can be accessed directly from the AWS Console and also directly by AWS CLI.
Serverless: It is serverless, so the end-user doesn’t need to worry about infrastructure, configuration, scaling, or failure. Athena takes care of everything on its own.
Pay per query: Athena charges you only for the query you run, i.e. the amount of data that is managed per query. You can save a lot if you can compress them and format your dataset accordingly.
Fast: Athena is a very fast analytics tool. It can perform complex queries in less time by breaking the complex queries into simpler ones and running them parallelly, then combining the results to give the desired output.
Secure: With the help of IAM policies and AWS Identity, Athena gives you complete control over the data set. As the data is stored in S3 buckets, IAM policies can help you manage control over users.
Highly available: With the assurance of AWS, Athena is highly available and the user can execute queries around the clock. As AWS is 99.999% available, so is Athena.
Integration: The best feature of Athena is that it can be integrated with AWS Glue. AWS Glue will help the user to create a better-unified data repository. This helps you create better versioning of data, better tables, views, etc.

Great isn’t it? Athena provides many features at the same time, it is cost-efficient.

By now you must be impressed by AWS Athena. Now that you know quite a lot about Athena. Let’s roll our sleeves and understand the working of Athena by performing a small demo. In this Amazon Athena tutorial, we will work on two Demos, let’s find out what are they.

Demo – I (Creating Tables In Athena)

In Amazon Athena, a table is a logical representation of data stored in one or more files in an Amazon S3 bucket. To create a table in Athena, you need to specify the schema for the table, which includes the names and data types of the columns in the table.

There are two primary ways to create a table in Athena:

Using the Athena console: You can use the Athena console to create a table by specifying the schema and the location of the data in S3.
Using the CREATE TABLE SQL statement: You can run a CREATE TABLE SQL statement using the Athena query editor to create a table.

Here’s an example of a CREATE TABLE a statement that creates a table called my_table with two columns

CREATE EXTERNAL TABLE my_table (
  col1 INT,
  col2 STRING
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'WITH SERDEPROPERTIES (
  'serialization.format' = '1'
) LOCATION 's3://my-bucket/data/';

This statement creates a table called my_table with two columns, col1 (an integer) and col2 (a string). The ROW FORMAT SERDE and WITH SERDEPROPERTIES clauses specify the serialization and deserialization (SerDe) library to use for the data and the LOCATION the clause specifies the location of the data in S3.

Once you have created a table, you can query the data using standard SQL statements.

As you know all about Amazon Athena, let’s take a dive into how to query your data stored as a .json file in Amazon S3 using Athena.

Create multiple JSON files containing entries
Store the files in an S3 bucket
Create an external table for the files stored in S3
Write a Query for accessing the data

Let’s understand how to do the above-said tasks one by one.

Create JSON Files. (Create the data without using newline characters)

JSON File- Amazon Athena Tutorial

We will access the S3 bucket using AWS CLI
1. Configure IAM User
3. Create S3 Bucket
5. Copy files to S3 Bucket
Create an External Table in Athena. There are two ways of doing this:
1. Using AWS Glue Crawler
2. Manually
We will create it manually:
1. Create table.

Create a new database if you don’t have one. Give a table name. Give the location of your file.

Select the type of file you will be working with. Select the architecture of the data in your file.
Amazon Athena Console
As the entered data is not that complex, we don’t need a partition. Click on “Create Table”.
Amazon Athena Console – Amazon Athena Tutorial
Athena will auto-generate the Query for creating an External Table and run it.

You have your external table ready.

We write a query to select all data from the table.

select * from testdb;

Click on Run Query and you have all the information in your table.

Demo – II (Comparison Between Amazon Athena And MySQL)

Amazon Athena and MySQL are both relational database management systems (RDBMS), but they have some key differences that you should consider when deciding which one to use.

One of the main differences between the two is that Athena is a serverless service, while MySQL is a traditional software that you install and run on a server or virtual machine. This means that with Athena, you don’t need to worry about provisioning and maintaining infrastructure, as you do with MySQL. Athena is also fully managed, which means that it automatically handles tasks such as backing up your data and updating the software.

Another key difference is the way that the two systems store and retrieve data. Athena is a query service that is designed to work with data stored in Amazon S3, while MySQL is a traditional RDBMS that stores data in tables on a server. This means that with Athena, you don’t need to worry about importing data into the system or creating tables, as you do with MySQL.

There are also some differences in the query languages that the two systems use. Athena uses SQL (Structured Query Language), while MySQL supports both SQL and MySQL’s own proprietary SQL variant.

Overall, Athena is a good choice if you need a flexible, scalable, and cost-effective way to run ad-hoc queries on data stored in S3, while MySQL is a good choice if you need a full-featured RDBMS for storing and managing data on a server.

In this Amazon Athena tutorial, now we will compare MySQL and Athena and understand how even simple queries take less time to execute in Athena.

Loading a CSV file to MySQL took around 1 hour but in Athena, it took just 3 mins to upload the CSV file to S3 and 0.42 seconds to create a table for the same.
Select query. select * from the table.Select query in Athena.4. Select query in MySQL.
Selecting a specific column from the table. Select a specific column in Athena
Select a specific column in MySQL.
Getting the count of a specific column.Count of a specific column in Athena.Count of a specific column in MySQL.
Counting a number of records in the table. Count all records in Athena.
Count all records in MySQL.
Select a query with a specified range. Select a query within the said range in Athena.
Select a query within the said range in MySQL.