Top 30+ AWS Athena Interview Questions & Answers

1. What is AWS Athena?

AWS Athena is an interactive query service for analyzing data in Amazon S3 using standard SQL. You point Athena at data stored in S3 and run standard SQL queries to get results; it is also commonly used for tasks such as database automation, converting files to Parquet, table creation, Snappy compression, and partitioning. Athena executes queries in parallel and scales automatically, providing fast results even with large datasets and complex queries.

2. What are the Features Of Athena?

Features of AWS Athena are:

  • Easy implementation – Athena can be accessed directly from the AWS Console and also via the AWS CLI.
  • Serverless – there is no infrastructure to manage; Athena takes care of provisioning, scaling, and failures on its own.
  • Pay per query – you are charged only for the queries you run, i.e., for the amount of data scanned per query.
  • Fast – Athena performs complex queries in less time by breaking them into simpler ones, running them in parallel, and merging the results.
  • Secure – data is stored in S3 buckets, and IAM policies help manage user access control.
  • Highly available – Athena is highly available and can execute queries around the clock.
  • Integration – Athena integrates with AWS Glue, enabling better versioning of data, better tables, views, etc.

3. How can we create an Athena database and table?

Amazon Athena is a serverless query service that lets you analyze data in Amazon S3 using SQL. You can use Athena to process data stored in S3 and generate reports or dashboards.

To create a database and table in Athena, follow these steps:

  1. Open the Athena console.
  2. In the Query Editor, run the following SQL statement to create a database, replacing database_name with the name of the database you want to create:

CREATE DATABASE database_name;

  3. To create a table in the database, run the following SQL statement. Replace database_name with the name of the database, table_name with the name of the table, and column_name_1 and column_name_2 with the names of the columns in the table. Specify the data type for each column (e.g., INT, VARCHAR, TIMESTAMP):

CREATE TABLE database_name.table_name (
    column_name_1 data_type,
    column_name_2 data_type
);

  4. Run the SQL statement by clicking the “Run” button. This creates the table in the specified database.
  5. To verify that the table was created, run a SHOW TABLES statement to list the tables in the database:

SHOW TABLES IN database_name;

This will display a list of all the tables in the database.
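The same steps can be scripted. The sketch below composes a CREATE TABLE statement from a column mapping; the database, table, bucket, and helper names are illustrative assumptions, not part of the Athena API, and the commented-out boto3 call shows how the DDL would be submitted.

```python
def build_create_table_ddl(database, table, columns):
    """Compose a CREATE TABLE statement from a {column: type} mapping.
    All names used below (salesdb, orders, ...) are illustrative."""
    cols = ",\n  ".join(f"{name} {dtype}" for name, dtype in columns.items())
    return f"CREATE TABLE {database}.{table} (\n  {cols}\n);"

ddl = build_create_table_ddl(
    "salesdb", "orders", {"order_id": "INT", "created_at": "TIMESTAMP"})
print(ddl)

# Submitting the DDL programmatically is a start_query_execution call:
# import boto3
# client = boto3.client("athena")
# client.start_query_execution(
#     QueryString=ddl,
#     ResultConfiguration={"OutputLocation": "s3://my-athena-results/"})
```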

4. Is it possible to create user-defined functions in AWS Athena? If yes, then how?

Yes, it is possible to create user-defined functions in AWS Athena. Athena UDFs are backed by AWS Lambda: you declare the function inline in a query with a USING EXTERNAL FUNCTION clause that specifies the function name, the input and output types, and the name of the Lambda function that implements it, and then call the function in your SELECT.
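As a concrete sketch, the helper below just assembles the query text for a Lambda-backed UDF; the function name, types, and Lambda name are illustrative assumptions.

```python
def udf_query(fn_name, arg_type, ret_type, lambda_name, select_sql):
    """Prefix a SELECT with Athena's USING EXTERNAL FUNCTION clause
    (Lambda-backed UDF; all names here are illustrative)."""
    return (f"USING EXTERNAL FUNCTION {fn_name}(value {arg_type}) "
            f"RETURNS {ret_type} LAMBDA '{lambda_name}'\n{select_sql}")

query = udf_query("redact", "VARCHAR", "VARCHAR", "my-udf-lambda",
                  "SELECT redact(email) FROM users;")
print(query)
```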

5. What happens if you need to execute an ETL process on top of your existing data sets before loading them into AWS Athena?

In this case, you would need to first export your data sets from their current location into Amazon S3, and then use AWS Glue to ETL them into the format required by Athena.

6. What are the differences between AWS Athena and Google BigQuery?

Both AWS Athena and Google BigQuery are cloud-based query services that let users run SQL over large data sets. However, there are some key differences between the two. First, Athena is serverless and bills purely per query (per TB scanned), while BigQuery charges for storage in addition to query usage. Second, Athena uses the Presto query engine, while BigQuery uses its own engine with BigQuery SQL, which can lead to different performance characteristics depending on the workload. Finally, Athena integrates with other AWS services, while BigQuery integrates with other Google Cloud Platform services.

7. How to tune your Performance of Athena?

  • Data partitioning – divides the table into parts and keeps related data together based on column values like date, country, region, etc., so queries scan only the relevant partitions.
  • Bucketing data – groups the data within a partition into a fixed number of buckets based on a column, which helps with joins and sampling.
  • Compress the files – compression increases query speed, provided the files are of optimal size and are splittable.
  • Optimization of file sizes – splittable file formats help with parallelism irrespective of file size; avoid large numbers of very small files.
  • Optimization of the data store generation – columnar formats provide efficient storage of data by using column-wise compression and encoding that is based on the data type.
  • Optimization of ORDER BY – returns the results of the query in sorted order; since sorting is expensive, combine ORDER BY with LIMIT where possible.
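To make the columnar-format point concrete, here is a back-of-the-envelope estimate of scanned bytes. It assumes equally sized columns, which is a deliberate simplification; real savings depend on column widths and compression.

```python
def columnar_scan_estimate(total_bytes, columns_read, total_columns):
    """Rough scanned-bytes estimate for a columnar format such as Parquet:
    only the referenced columns are read (equal column sizes assumed)."""
    return total_bytes * columns_read // total_columns

table_bytes = 100 * 1024**3                               # a hypothetical 100 GB table
scanned = columnar_scan_estimate(table_bytes, 2, 20)      # query touches 2 of 20 columns
```

Under these assumptions, reading 2 of 20 columns scans roughly a tenth of the table, and Athena's per-TB-scanned pricing turns that directly into cost savings.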

8. How does AWS Athena compare with Amazon Redshift?

Both Amazon Redshift and AWS Athena are data warehousing solutions that can be used to analyze data in the cloud. However, there are some key differences between the two. Amazon Redshift is a fully managed data warehouse service, while AWS Athena is an interactive query service that is used to query data stored in Amazon S3. Amazon Redshift is designed for larger data sets and sustained OLAP (online analytical processing) workloads, while AWS Athena is better suited for ad-hoc, interactive queries over data in S3.

9. How does AWS Athena differ from Hadoop Hive?

AWS Athena is a serverless query service that makes it easy to analyze data in Amazon S3 using standard SQL: there is no infrastructure to manage, and you pay only for the queries that you run. Athena is easy to use; simply point to your data in Amazon S3, define the schema, and start querying. Athena uses Presto with full standard SQL support and works with a variety of data formats, including CSV, JSON, ORC, Avro, and Parquet. Hadoop Hive, by contrast, requires a Hadoop cluster that you provision and manage yourself, and Hive queries typically run as batch jobs rather than interactively.

10. Why should I use AWS Athena instead of AWS Glue?

AWS Athena is a serverless interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL. With Athena, there is no need to set up or manage any infrastructure, so you can start analyzing your data immediately. Athena is also highly scalable, so you can run queries on large datasets without having to worry about provisioning or managing any resources, and it is cost-effective, as you only pay for the queries that you run. AWS Glue, by contrast, is primarily an ETL and data catalog service: use Glue to transform and catalog data, and Athena when you simply want to query data in place.

11. What are the Price details of AWS Athena?

As we discussed earlier, Amazon Athena is an interactive query service for querying data in Amazon S3 with standard SQL statements. Athena only reads the data; it does not add to or modify it.

Now let’s look at Amazon Athena pricing and some tips to reduce Athena costs.

According to Amazon Athena’s pricing page, Athena is priced at $5 per TB of data scanned by a query, with a 10 MB minimum per query. If you cancel a query, you are charged for the data scanned up to the cancellation point.

Doing the math for smaller queries: scanning the 10 MB minimum costs 10 / 1,048,576 TB × $5 ≈ $0.0000477. So be careful with those 200 KB queries: you will still be charged for a full 10 MB.
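The arithmetic above can be sketched as a small helper. The $5/TB rate and 10 MB minimum come from the pricing described here; the function itself is just an illustration.

```python
def athena_query_cost(scanned_bytes, price_per_tb=5.0, minimum_mb=10):
    """Estimated cost of one Athena query: $5 per TB scanned,
    with a 10 MB minimum charge per query."""
    billed_mb = max(scanned_bytes / 1024**2, minimum_mb)
    return billed_mb / 1024**2 * price_per_tb  # MB -> TB, then apply the rate

small = athena_query_cost(200 * 1024)   # a 200 KB scan is still billed as 10 MB
full_tb = athena_query_cost(1024**4)    # one full TB scanned
```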

Things That Are Free

DDL-related executions, such as creating databases, tables, and schemas, are free. For example, there is no charge for any of the following statements:

  • CREATE EXTERNAL TABLE
  • MSCK REPAIR TABLE
  • ALTER TABLE

Additional Costs

Athena reads data stored in S3, and standard S3 storage charges apply to that data based on how it’s stored. Athena also stores query history and results in a separate S3 bucket (the query results location), so standard S3 charges apply to that new data as well.

Cost Reduction Techniques

So far we have covered Amazon Athena’s pricing details. Now let’s look at some cost reduction techniques:

  1. Remove historical results using S3 lifecycle rules.
  2. Compress your input data in S3.
  3. Use partitions effectively.
  4. Store your data in a columnar format.

Amazon Athena is an exciting service. Structuring your data and queries to reduce the amount of data scanned keeps costs down, and Athena makes a strong addition to your arsenal for serverless computing.

12. How do you create a DataFrame from AWS Athena query results using the Boto3 get_query_results method?

import boto3
import pandas as pd

client = boto3.client('athena')
# res is the response from an earlier start_query_execution call
response = client.get_query_results(QueryExecutionId=res['QueryExecutionId'])
rows = [[col.get('VarCharValue') for col in row['Data']] for row in response['ResultSet']['Rows']]
df = pd.DataFrame(rows[1:], columns=rows[0])  # the first row holds the column names

13. How to create an Athena database via API?

import boto3

client = boto3.client('athena')

config = {'OutputLocation': 's3://TEST_BUCKET/'}

client.start_query_execution(
    QueryString='create database TEST_DATABASE',
    ResultConfiguration=config
)

14. How to Create a Table In Athena

In this tutorial, we use live resources, so you are charged for the queries you run but not for the datasets you use; if you upload your own data files to Amazon S3, standard S3 charges do apply.

To query S3 file data, you need to have an external table associated with the file structure. We can CREATE EXTERNAL TABLES in two ways:

  • Manually.
  • Using the AWS Glue crawler.

To manually create an EXTERNAL table, write the statement CREATE EXTERNAL TABLE following the correct structure and specify the correct format and accurate location. An example is shown below:

Creating an External table manually

The created EXTERNAL tables are stored in the AWS Glue Data Catalog. The Glue crawler parses the structure of the input file and generates metadata tables, defined in the Glue Data Catalog.

The crawler uses an AWS IAM (Identity and Access Management) role to permit access to the data stored and the Data Catalog. You should have permission to pass the roles to the crawler for accessing Amazon S3 paths that are crawled.

Go to AWS Glue, choose “Add tables” and then select the “Add tables using a crawler” option.

Add tables using the Glue crawler

Give the crawler a name, for example, cars-crawler.

Enter crawler name

Choose the path in Amazon S3 where the file is saved.

If you plan to query only one file, you can choose either an S3 file path or the S3 folder path to query all the files in the folder having the same structure. 


cars.json file is in the S3 location s3://rosyll-niranjana-xavier/data_input/json-files/cars.json. You can also choose s3://rosyll-niranjana-xavier/data_input/json-files/ as the path.

Create an IAM role that has permission to access the S3 object that you aim to query, or choose an existing IAM role with sufficient privileges to access the S3 object.

Choose a database that contains the external tables and optionally choose a prefix to be added to the external table name.

Choose database and prefix for external tables

Click Finish to create the Glue Crawler

Run the crawler

The External table was created under the specified database. Now you can query the S3 object using it.

SELECT data from the external table

Since we placed one file, the “SELECT * FROM json_files;” query returns the single record that was in the file. Let’s now place another file with the same structure in the same S3 folder and query the EXTERNAL TABLE again.

petercars.json file uploaded to S3

If you query the same EXTERNAL table again, you will see two rows returned instead of one. This is because there are now two files in the S3 folder with the desired structure. You can perform several operations on the data. For instance, the following query will UNNEST the array in the result set.

UNNEST arrays in Athena.

15. Can you explain the architecture of AWS Athena?

AWS Athena is a serverless query service that allows you to analyze data stored in Amazon S3 using standard SQL. It is designed to be highly scalable and cost-effective, and it integrates with other AWS services such as Amazon S3, Amazon Glue, and Amazon QuickSight.

16. How much does it cost to use AWS Athena?

AWS Athena is a pay-per-use service, so you only pay for the queries that you run. Prices vary depending on the amount of data scanned per query but start at $5 per TB.

17. Can you tell me more about table metadata as used by AWS Athena?

When you create a table in Athena, you have the option of including table metadata. This metadata can be used to provide information about the table, such as the column names and data types, that can be used by Athena when querying the table. This metadata can be stored in an external file or in the table itself and can be updated as needed.

18. What kinds of queries can be run using AWS Athena?

AWS Athena supports a variety of standard SQL queries, including data manipulation language (DML) statements and data definition language (DDL) statements. Additionally, Athena supports complex queries that can include joins, aggregations, and window functions.

19. What’s your understanding of catalogs in the context of AWS Athena?

A catalog is a collection of databases and tables used to organize your data’s metadata. In AWS Athena, the catalog (by default, the AWS Glue Data Catalog) keeps track of the schemas and S3 locations of the data that you want to query; Athena consults it to resolve the databases and tables referenced in your queries.

20. What is PrestoDB? How does it work in conjunction with AWS Athena?

PrestoDB is an open-source distributed SQL query engine that is used in conjunction with AWS Athena. Athena uses PrestoDB to query data stored in Amazon S3. PrestoDB is designed to be fast and efficient, and it supports a variety of data formats including CSV, JSON, and Avro.

21. Which programming languages can I use when running SQL queries in AWS Athena?

You can use any programming language that is compatible with JDBC or ODBC drivers. This includes languages like Java, Python, Node.js, and more.

22. Can you give me some examples of query aggregates that can be executed in AWS Athena?

Some examples of query aggregates that can be executed in AWS Athena include COUNT, SUM, MIN, MAX, and AVG.

23. Is there a free tier available for AWS Athena?

AWS Athena itself is not included in the AWS Free Tier: you pay $5 per TB of data scanned by each query, with a 10 MB minimum. However, DDL statements (such as CREATE TABLE and ALTER TABLE) are not charged, so small experiments can be run at very low cost.

24. How are datasets in AWS Athena stored?

Datasets queried by AWS Athena are stored in Amazon S3 in open formats such as CSV, JSON, ORC, Avro, and Parquet; Athena does not store the data itself, and columnar formats like Parquet are the most efficient to query.

25. What are the Features of Athena?

Athena is one of the best services offered by Amazon. It has several features making it suitable to analyze data. Let’s have a look at the various features of Athena.

Easy implementation: Athena requires no installation and can be accessed directly from the AWS Console.

Serverless: The end user does not face any problems with configuring, scaling, or failures, as Athena is a serverless service. It takes care of everything on its own.

Pay per query: It charges only for queries you run, i.e. the amount of data that is managed per query. 

Fast: Athena is a high-speed analytics tool and can perform even complex queries in relatively less time by splitting into simpler ones and running them parallelly, and merging them to provide the desired output.

Secure: Using AWS Identity and IAM policies, Athena provides you with complete control over the data set. 

High availability: With AWS, Athena is accessible and the user can run queries around the clock. 

Integration: The best feature of Athena is its integration with AWS Glue. 

26. How can we create the first query using Athena?

select distinct productname
from costdb.cost;

How to get input file name as a column in AWS Athena external tables?

We can do this with the $path pseudo column:

select "$path" from table_name;

27. How can we convert CSV files to Parquet using AWS Athena?

In practice this is done with an AWS Glue ETL job that reads the CSV tables from the Glue Data Catalog and writes them back to S3 as Parquet, which Athena can then query:

import sys
import boto3
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

## @params: [JOB_NAME]
args = getResolvedOptions(sys.argv, ['JOB_NAME'])

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

client = boto3.client('glue', region_name='ap-southeast-2')

databaseName = 'tpc-ds-csv'
print('\ndatabaseName: ' + databaseName)

Tables = client.get_tables(DatabaseName=databaseName)
tableList = Tables['TableList']

for table in tableList:
    tableName = table['Name']
    print('\n-- tableName: ' + tableName)

    # Read the CSV table from the Data Catalog...
    datasource0 = glueContext.create_dynamic_frame.from_catalog(
        database="tpc-ds-csv",
        table_name=tableName,
        transformation_ctx="datasource0"
    )

    # ...and write it back to S3 as Parquet
    datasink4 = glueContext.write_dynamic_frame.from_options(
        frame=datasource0,
        connection_type="s3",
        connection_options={
            "path": "s3://aws-glue-tpcds-parquet/" + tableName + "/"
        },
        format="parquet",
        transformation_ctx="datasink4"
    )

job.commit()

28. How to Access Amazon Athena?

There are distinct options available for accessing Athena quickly. It can be accessed through any of the following tools:

  • AWS Console
  • Athena with your JDBC driver
  • AWS CLI

As you have gained knowledge about Amazon Athena, let us walk through the various features of Athena.

29. Can you give me some examples of real-world applications where I can use AWS Athena?

AWS Athena can be used for a variety of tasks, including data analysis, data warehousing, and log analysis. It is a particularly useful tool for analysts who need to quickly query and analyze data stored in Amazon S3.

30. Do you know what partitioning is and how it works in context with AWS Athena?

Partitioning in AWS Athena is a way of dividing data up into smaller pieces so that queries can run faster and more efficiently. Partitioning can be done on any column in a table, and it is especially useful for columns that have a lot of data or that are frequently queried. Partitioning works by creating separate partitions for each value in the partitioning column and then storing the data in those partitions. When a query is run, only the partitions that are relevant to the query are scanned, which can greatly reduce the amount of time it takes to run the query.
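A minimal sketch of the mechanics, assuming the Hive-style partition layout (key=value path segments) that Athena understands; the bucket, table, and partition key names are made up for illustration.

```python
def partition_location(bucket, prefix, **keys):
    """S3 path for a Hive-style partition: s3://bucket/prefix/k1=v1/k2=v2/"""
    segments = "/".join(f"{k}={v}" for k, v in keys.items())
    return f"s3://{bucket}/{prefix}/{segments}/"

def add_partition_ddl(table, location, **keys):
    """ALTER TABLE ... ADD PARTITION statement registering that location."""
    spec = ", ".join(f"{k} = '{v}'" for k, v in keys.items())
    return f"ALTER TABLE {table} ADD PARTITION ({spec}) LOCATION '{location}';"

loc = partition_location("my-bucket", "events", dt="2024-01-01", region="us-east-1")
ddl = add_partition_ddl("events", loc, dt="2024-01-01", region="us-east-1")
print(ddl)
```

With data laid out this way, a query filtered on dt and region scans only the matching partition directories instead of the whole table.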

31. How can we create the first query using Athena?

select distinct productname
from costdb.cost;

32. What is the difference between Microsoft SQL Server and Amazon Athena?

Amazon Athena is a serverless, interactive tool that analyzes data and processes complex queries in relatively little time. Being a serverless service, you pay only for the queries you execute. Point Athena at your data in S3, define the required schema using standard SQL, and go.

Let’s compare the two data analysis tools, Microsoft SQL Server and Amazon Athena.

  • Definition – Microsoft SQL Server is a database management and analysis system. Amazon Athena is an interactive query service that makes data analysis easy.
  • Usage – SQL Server is used for DCL, DML, DDL, and TCL operations on the database. Athena is used for DML operations.
  • Benefits – SQL Server: reliable and easy to use, high performance, easy to maintain, easy server installation, multiple tools integration possible. Athena: easy to use, high performance, no maintenance required, no server configuration required, multiple tools integration possible.
  • Integration – SQL Server: Sequelize, SQLDep, Presto. Athena: Amazon S3, AWS Glue, Presto.
  • Limitations – SQL Server: limited RDS storage, limited instances, cannot handle recursion. Athena: no DDL supported, works with external tables only, user-defined functions not supported.

33. How does AWS Athena work

Athena works directly with S3 data. It uses a distributed SQL engine, Presto for running queries. It uses Apache Hive to create and alter tables and partitions. 

Let’s have a look at the prerequisites to start working with Athena:

  1. Must have an AWS account
  2. Enable your account to export your cost and usage data into an S3 bucket. 
  3. Prepare buckets for Athena to connect. 
  4. AWS creates manifest files using metadata every time it writes to the bucket. Create a folder inside the technology-aws-billing-data bucket named athena, which contains only the data.
  5. To simplify the setup, we can use one region: the us-west-2 region.
  6. The final step is downloading the credentials for the new IAM user. The credentials map directly to the database connection parameters:

  • Database username – IAM username
  • Database password – Secret Access Key
  • Database name – Access Key ID
  • Database port – 443

Creating an Athena database and table

create database if not exists costdb;
    create external table if not exists cost (
        InvoiceID string,
        PayerAccountId string,
        LinkedAccountId string,
        RecordType string,
        RecordId string,
        ProductName string,
        RateId string,
        SubscriptionId string,
        PricingPlanId string,
        UsageType string,
        Operation string,
        AvailabilityZone string,
        ReservedInstance string,
        ItemDescription string,
        UsageStartDate string,
        UsageEndDate string,
        UsageQuantity string,
        Rate string,
        Cost string,
        ResourceId string
    )
    row format serde 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
    with serdeproperties (
        'separatorChar' = ',',
        'quoteChar' = '"',
        'escapeChar' = ''
    )
    stored as textfile
    location 's3://technology-aws-billing-data/athena'

Create a table that matches the CSV formats and files in the S3 billing bucket.

After a bit of trial and error, here are some non-obvious issues we ran into:

  1. The OpenCSVSerde plugin parses CSV files.
  2. The plugin supports gzip files but not zip files; convert your compression format to gzip or one of its other supported formats.
  3. The plugin claims to support skip.header.line.count for skipping header rows, but it appears to be broken; you have to rewrite the CSV files manually without the header.
  4. You can run the DDL statements to create databases via the AWS web console or programmatically.
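The header and compression issues above can be worked around in one pass: drop the header row and recompress as gzip before uploading. The sketch below uses only the standard library; the sample data is made up.

```python
import csv
import gzip
import io

def strip_header_and_gzip(csv_text):
    """Drop the header row and gzip the remaining rows, producing bytes
    ready to upload to S3 for OpenCSVSerde to read."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(rows[1:])
    return gzip.compress(out.getvalue().encode("utf-8"))

payload = strip_header_and_gzip("id,cost\n1,0.42\n2,1.10\n")
```

The resulting bytes can be uploaded with, for example, boto3's S3 put_object, using a .csv.gz key so the compression is evident.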

Creating the first query using Athena

To ensure that you can query your Athena database, you can run the below query for various AWS services you use:

select distinct productname
from costdb.cost;

Cost by AWS service and operation

In the earlier section, each column was declared with a string data type. To go further, you have to cast the columns to the appropriate data types:

select productname, operation,
 sum (cast(cost as double))
from costdb.cost
group by 1, 2
order by 3 desc 

Amazon’s Cost Explorer

Amazon provides a drag-and-drop tool called Cost Explorer, which comes with a set of prebuilt reports like “Monthly service costs”, “Reserved instance usage”, etc.

If you are curious, try to recreate the query above for cost by service and operation. It doesn’t seem to be possible in Cost Explorer.

You can slice your raw data to your satisfaction, and can also compute growth rates every month, build histograms, compute using z-scores, etc.

Additional considerations:

Athena’s pricing model: Athena is priced at $5 per TB of data scanned from S3, rounded up to the nearest megabyte, with a minimum of 10 MB per query.

Reducing Athena’s cost: The cost trick is reducing the data that is scanned. This is possible in three ways:

Compress your data with gzip or another supported format: with a 2:1 compression ratio, the cost is reduced by 50%.

Use columnar data formats like Apache Parquet: if the query references only two columns, you need not scan the entire row, which results in significant savings.

Partition the data: You can define one or more partition keys. For instance, if your data has a customer_id column and a time-based column, a query with clauses on the date and customer columns scans significantly less data. Athena lets you analyze S3 data using standard SQL without managing any infrastructure, and you can even access it from a BI tool via the JDBC driver.

