
Top 50+ AWS Glue Interview Questions and Answers

1. What is AWS Glue?

AWS Glue is a managed ETL (extract, transform, and load) service that makes it simple and cost-effective to categorize, clean, enrich, and move data reliably between various data stores and data streams. AWS Glue consists of the AWS Glue Data Catalog, an ETL engine that automatically generates Python or Scala code, and a flexible scheduler that handles dependency resolution, job monitoring, and retries. Because AWS Glue is serverless, there is no infrastructure to set up or manage.

It is serverless and event-capable: jobs can run on a schedule, on demand, or in response to events, and AWS Glue automatically provisions and manages the computing resources those jobs need.


In short, AWS Glue is a serverless data integration service that makes it simple to discover, prepare, and combine data for analytics, machine learning, and application development.

2. Describe AWS Glue Architecture

The architecture of an AWS Glue environment is shown in the figure below.

[Figure: Architecture of an AWS Glue environment]

The fundamental workflow for using AWS Glue is to populate your Data Catalog and then define ETL jobs that process your data.

In AWS Glue, you create jobs to carry out the work of extracting, transforming, and loading (ETL) data from a data source to a data target. You usually do the following:

  1. You define a crawler to populate your AWS Glue Data Catalog with metadata table entries. When you point your crawler at a data store, the crawler creates table definitions in the Data Catalog. For streaming sources, you define Data Catalog tables and data stream properties manually.
  2. In addition to table definitions, the AWS Glue Data Catalog contains other metadata that is required to define ETL jobs. You use this metadata when you define a job to transform your data.
  3. AWS Glue can generate a data transformation script. You can also provide your own script through the AWS Glue console or API.
  4. You can run your job on demand, or set it to start when a specified trigger occurs. The trigger can be a time-based schedule or an event.
  5. When your job runs, a script extracts data from your data source, transforms it, and loads it into your data target. The script runs in an Apache Spark environment in AWS Glue. A boto3 sketch of steps 1 and 4 follows.
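As a rough illustration of steps 1 and 4 above, the same actions can be performed programmatically with the boto3 Glue client; the role ARN, bucket, database, and job names below are hypothetical placeholders:

import boto3

glue = boto3.client('glue')

# Step 1: define and run a crawler that populates the Data Catalog from S3
glue.create_crawler(
    Name='sales-crawler',
    Role='arn:aws:iam::123456789012:role/GlueServiceRole',  # hypothetical role
    DatabaseName='sales_db',
    Targets={'S3Targets': [{'Path': 's3://example-bucket/sales/'}]},
)
glue.start_crawler(Name='sales-crawler')

# Steps 4-5: run an existing ETL job on demand
glue.start_job_run(JobName='sales-etl-job')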

3. What are the use cases of AWS Glue?

AWS Glue is a serverless ETL (extract, transform, load) service that can be used for:

  • Extracting data from various data sources, such as databases, data lakes, and cloud storage
  • Transforming data into a desired format or structure
  • Loading data into a data warehouse or other data store
  • Scheduling and automating data pipelines
  • Data cataloging and management
  • Data preparation and cleansing

4. What are the Benefits of AWS Glue?

Some benefits of using AWS Glue include:

  • Serverless: No need to worry about managing servers or infrastructure
  • Scalability: Can handle large amounts of data with ease
  • Integration with other AWS services: Can easily connect to and extract data from a variety of data sources and destinations
  • Low cost: Pay only for the resources you use, with no upfront costs
  • Easy to use: Provides a simple, visual interface for creating and managing ETL jobs and pipelines
  • Automation: Can schedule and automate ETL processes
  • Data cataloging: Provides a central repository for storing and managing metadata about data sources and transformations

5. How can we Automate Data Onboarding?

Data onboarding can be automated by combining AWS Glue crawlers with triggers: a crawler runs on a schedule (or on demand) to discover newly arrived data and populate the Data Catalog with table definitions, and scheduled or event-based triggers then start the ETL jobs that transform the data and load it into the target data store.

6. When to use a Glue Classifier?

A Glue Classifier is used when crawling a data store to generate metadata tables in the AWS Glue Data Catalog. You can configure your crawler with an ordered set of classifiers. When the crawler invokes a classifier, the classifier determines whether or not it recognizes the data. If the first classifier fails to recognize the data or is unsure, the crawler moves on to the next classifier in the list to see if it can.
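As an illustration, here is a minimal boto3 sketch of registering a custom classifier and attaching it to a crawler; the names, grok pattern, role ARN, and S3 path are hypothetical:

import boto3

glue = boto3.client('glue')

# Define a custom grok classifier for application logs
glue.create_classifier(
    GrokClassifier={
        'Name': 'my-log-classifier',
        'Classification': 'app_logs',
        'GrokPattern': '%{TIMESTAMP_ISO8601:ts} %{LOGLEVEL:level} %{GREEDYDATA:msg}',
    }
)

# Crawlers try the listed classifiers in order, then fall back to built-ins
glue.create_crawler(
    Name='log-crawler',
    Role='arn:aws:iam::123456789012:role/GlueServiceRole',  # hypothetical role
    DatabaseName='logs_db',
    Targets={'S3Targets': [{'Path': 's3://example-bucket/logs/'}]},
    Classifiers=['my-log-classifier'],
)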

7. What are the main components of AWS Glue?

AWS Glue’s main components are as follows: 

  • Data Catalog: acts as a central metadata repository.
  • ETL engine: automatically generates Scala or Python code.
  • Flexible scheduler: manages dependency resolution, job monitoring, and retries.
  • AWS Glue DataBrew: lets users clean and normalize data using a visual interface.
  • AWS Glue Elastic Views: lets users combine and replicate data across multiple data stores.

These solutions will allow you to spend more time analyzing your data by automating most of the non-differentiated labor associated with data search, categorization, cleaning, enrichment, and migration.

8. What are the drawbacks of AWS Glue?

  • Limited compatibility – AWS Glue works with a variety of commonly used data sources, but it is designed primarily around services running on AWS.
  • No incremental data sync – Glue is not the best option for real-time or continuously synchronized ETL jobs.
  • Learning curve – Glue is built around Spark and batch ETL concepts, so teams accustomed to querying traditional relational databases face a learning curve.

9. What is AWS Glue Data Catalog?

The AWS Glue Data Catalog is your persistent metadata repository. It is a managed service that lets you store, annotate, and share metadata in the AWS Cloud in the same way an Apache Hive metastore does. Each AWS account has one AWS Glue Data Catalog per region. It provides a uniform repository where disparate systems can store and find metadata to keep track of data in data silos, and use that metadata to query and transform the data. Access to the data sources managed by the AWS Glue Data Catalog can be controlled with AWS Identity and Access Management (IAM) policies.

10. When an AWS Glue job times out, how do we Retry it?

Retrying a job only works if it has failed, not if it has timed out. For this, we need custom logic: for example, an EventBridge rule that listens for Glue job timeout events and invokes a Lambda function to restart the job.
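A minimal sketch of that custom logic, assuming an EventBridge rule filtered on Glue's "Glue Job State Change" events with state TIMEOUT, and a Lambda function like the following (verify the event fields against your actual events):

# Example EventBridge event pattern for the rule:
# {"source": ["aws.glue"],
#  "detail-type": ["Glue Job State Change"],
#  "detail": {"state": ["TIMEOUT"]}}

import boto3

glue = boto3.client('glue')

def handler(event, context):
    detail = event.get('detail', {})
    # Restart the job only when it timed out
    if detail.get('state') == 'TIMEOUT':
        glue.start_job_run(JobName=detail['jobName'])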

11. Which AWS services and open-source projects use AWS Glue Data Catalog?

The AWS Glue Data Catalog is used by the following AWS services and open-source projects:

  • AWS Lake Formation
  • Amazon Athena
  • Amazon Redshift Spectrum
  • Amazon EMR
  • AWS Glue Data Catalog Client for Apache Hive Metastore.

12. What are AWS Glue Crawlers?

An AWS Glue crawler is used to populate the AWS Glue Data Catalog with tables. A crawler can crawl multiple data stores in a single run. When it finishes, it creates or updates one or more tables in the Data Catalog. ETL jobs defined in AWS Glue then use these Data Catalog tables as sources and targets.

13. What is the AWS Glue Schema Registry?

The AWS Glue Schema Registry lets you validate and control the evolution of streaming data using registered Apache Avro schemas, at no additional charge. Through its serializers and deserializers, the Schema Registry integrates with applications built for Apache Kafka, Amazon Managed Streaming for Apache Kafka (MSK), Amazon Kinesis Data Streams, Apache Flink, Amazon Kinesis Data Analytics for Apache Flink, and AWS Lambda.

14. Why should we use AWS Glue Schema Registry?

You can use the AWS Glue Schema Registry to:

  • Validate schemas: Schemas used for data production are checked against schemas in a central registry when data streaming apps are linked with AWS Glue Schema Registry, allowing you to regulate data quality centrally.
  • Safeguard schema evolution: One of eight compatibility modes can be used to specify the rules for how schemas may and may not evolve.
  • Improve data quality: Serializers compare data producers’ schemas to those in the registry, enhancing data quality at the source and avoiding downstream difficulties caused by random schema drift.
  • Save costs: Serializers transform data into a binary format that can be compressed before transferring, lowering data transfer and storage costs.
  • Improve processing efficiency: A data stream often contains records with multiple schemas. The Schema Registry lets applications that read from the stream selectively process each record based on its schema, rather than having to parse its contents, which increases processing performance.
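For example, a registry and a schema with a compatibility mode can be created through the boto3 Glue client; this is a minimal sketch with hypothetical registry and schema names:

import boto3
import json

glue = boto3.client('glue')

# Create a registry, then register an Avro schema with BACKWARD compatibility
glue.create_registry(RegistryName='orders-registry')
glue.create_schema(
    RegistryId={'RegistryName': 'orders-registry'},
    SchemaName='order-event',
    DataFormat='AVRO',
    Compatibility='BACKWARD',
    SchemaDefinition=json.dumps({
        'type': 'record',
        'name': 'OrderEvent',
        'fields': [{'name': 'order_id', 'type': 'string'}],
    }),
)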

15. How can the Schema Registry be integrated?

The Schema Registry integrates through its serializers and deserializers, which you add to your streaming applications: producers serialize records against a schema version registered in the registry, and consumers use the corresponding deserializer to validate and decode those records. In this way it plugs into Apache Kafka, Amazon MSK, Amazon Kinesis Data Streams, Apache Flink, Amazon Kinesis Data Analytics for Apache Flink, and AWS Lambda applications.

16. When should I use AWS Glue Vs. AWS Batch?

AWS Batch enables you to run any batch computing job on AWS with ease and efficiency, regardless of the nature of the work. AWS Batch creates and manages computing resources in your AWS account, giving you complete control over and insight into the resources in use. AWS Glue is a fully managed ETL service that runs your ETL jobs in a serverless Apache Spark environment. We recommend using AWS Glue for your ETL use cases. For other batch-oriented use cases, including some ETL use cases, AWS Batch might be a better fit.


AWS Glue Interview Questions

17. What kinds of evolution rules does AWS Glue Schema Registry support?

Backward, Backward All, Forward, Forward All, Full, Full All, None, and Disabled are the compatibility modes accessible to regulate your schema evolution. 

18. How does AWS Glue Schema Registry maintain high availability for applications?

The Schema Registry storage and control plane are backed by the AWS Glue SLA, and the serializers and deserializers use best-practice caching strategies to maximize schema availability within clients.

19. Is AWS Glue Schema Registry open-source?

No, AWS Glue Schema Registry is not an open-source tool. It is a proprietary service offered by Amazon Web Services (AWS).

20. How does AWS Glue relate to AWS Lake Formation?

AWS Lake Formation is built on the shared infrastructure of AWS Glue, including its console controls, ETL code generation and job monitoring, shared Data Catalog, and serverless architecture. While AWS Glue remains focused on those types of data integration workloads, Lake Formation encompasses AWS Glue features and adds further capabilities for building, securing, and managing data lakes.

21. What are Development Endpoints?

Development endpoints are environments provisioned through the AWS Glue API for interactively developing and testing ETL scripts. At a development endpoint, a developer can debug extract, transform, and load (ETL) scripts before deploying them as jobs.

22. What Data Sources are supported by AWS Glue?

AWS Glue supports a wide range of data sources, including:

  • Relational databases: Amazon Redshift, Amazon Aurora, Oracle, MySQL, Microsoft SQL Server, PostgreSQL, MariaDB
  • Data lakes: Amazon S3, Azure Data Lake Store, Google Cloud Storage
  • Cloud storage: Amazon S3, Azure Blob Storage, Google Cloud Storage
  • Cloud applications: Salesforce, Google Sheets
  • NoSQL databases: Amazon DynamoDB, Cassandra, MongoDB
  • File formats: CSV, JSON, Parquet, ORC
  • Other: Amazon DocumentDB, Apache Hive, Apache Avro, Apache Parquet, Apache ORC

23. What are the Features of AWS Glue?

The key features of AWS Glue are listed below:

  • Automatic Schema Discovery: Crawlers automatically infer schema information and store it in the Data Catalog.
  • Job Scheduler: Several jobs can be started in parallel, and users can specify dependencies between jobs.
  • Developer Endpoints: Aid in creating custom readers, writers, and transformations.
  • Automatic Code Generation (ACG): Generates the ETL code for a job.
  • Integrated Data Catalog: A central metadata repository for data from the various sources in your AWS pipeline.

24. What are AWS Tags in AWS Glue?

A tag is a label you apply to an Amazon Web Services resource. Each tag has a key and an optional value, both of which are defined by you. 

In AWS Glue, you may use tags to organize and identify your resources. Tags can be used to generate cost accounting reports and limit resource access. You can restrict which users in your AWS account have the authority to create, update, or delete tags if you use AWS Identity and Access Management.

The following AWS Glue resources can be tagged:

  • Crawler
  • Job
  • Trigger
  • Workflow
  • Development endpoint
  • Machine learning transform

25. What are the points to remember when using tags with AWS Glue?

  • Each entity can have a maximum of 50 tags.
  • Tags are specified in AWS Glue as a map of key-value pairs in the form {"string": "string" ...}.
  • The tag key is necessary when creating a tag on an item, but the tag value is not.
  • Case matters when it comes to the tag key and value.
  • The prefix aws: cannot be used in a tag key or a tag value; such tags are reserved and cannot be edited or deleted.
  • The maximum tag key length is 128 Unicode characters in UTF-8. A tag key cannot be empty or null.
  • The maximum tag value length is 256 Unicode characters in UTF-8. A tag value may be empty or null.
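As an illustration, tags can be applied with the boto3 Glue client's tag_resource call; the job ARN below is a hypothetical placeholder:

import boto3

glue = boto3.client('glue')

# Apply tags to an existing Glue job
glue.tag_resource(
    ResourceArn='arn:aws:glue:us-east-1:123456789012:job/sales-etl-job',
    TagsToAdd={'team': 'analytics', 'env': 'prod'},
)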

26. What is the AWS Glue database?

The AWS Glue Data Catalog database is a container that holds tables. You use databases to organize your tables. A database is created when you run a crawler or when you add a table manually. All of your databases are listed in the AWS Glue console's database list.
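For example, a database can also be created programmatically with the boto3 Glue client; the name and description here are hypothetical:

import boto3

glue = boto3.client('glue')

# Create a database, which acts as a container for tables in the Data Catalog
glue.create_database(
    DatabaseInput={'Name': 'sales_db', 'Description': 'Tables for sales data'}
)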

27. What programming language is used to write ETL code for AWS Glue?

ETL code for AWS Glue can be written in either Scala or Python.

28. What is the AWS Glue Job system?

AWS Glue Jobs is a managed platform for orchestrating your ETL workflow. In AWS Glue, you may construct jobs to automate the scripts you use to extract, transform, and transport data to various places. Jobs can be scheduled and chained, or events like new data arrival can trigger them.

29. Does AWS Glue use EMR?

The AWS Glue Data Catalog integrates with Amazon EMR, Amazon RDS, Amazon Redshift, Redshift Spectrum, Athena, and any application compatible with the Apache Hive metastore, providing a consistent metadata repository across several data sources and data formats.

30. Does AWS Glue have a no-code interface for visual ETL?

Yes. AWS Glue Studio is a graphical tool for creating Glue jobs that process data. Once you've defined the flow of your data sources, transformations, and targets in the visual interface, AWS Glue Studio generates Apache Spark code on your behalf.

AWS Glue Advanced Interview Questions 

31. How to customize the ETL code generated by AWS Glue?

The AWS Glue ETL script recommendation engine generates Scala or Python code. It uses Glue's ETL framework to manage job execution and simplify access to data sources. You can write ETL code using AWS Glue's library, or write arbitrary Scala or Python code by inline editing in the AWS Glue Console script editor, then download the script and modify it in your own IDE.

32. How to build an end-to-end ETL workflow using multiple jobs in AWS Glue?

In addition to its ETL library and code generation, AWS Glue provides a robust set of orchestration features that let you manage dependencies between multiple jobs to build end-to-end ETL workflows. AWS Glue ETL jobs can run on a schedule or be triggered when another job completes, and several jobs can be started in parallel or sequentially by triggering them on a job-completion event.

33. How does AWS Glue monitor dependencies?

AWS Glue uses triggers to handle dependencies between two or more jobs, or on external events. Triggers can both watch for events and invoke jobs. The three options are a scheduled trigger (which runs jobs at regular intervals), an on-demand trigger, and a job-completion (conditional) trigger, as sketched below.
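Here is a boto3 sketch of a job-completion (conditional) trigger that starts job-b only after job-a succeeds; both job names are hypothetical:

import boto3

glue = boto3.client('glue')

glue.create_trigger(
    Name='run-b-after-a',
    Type='CONDITIONAL',
    StartOnCreation=True,
    # The predicate watches job-a; the action invokes job-b
    Predicate={'Conditions': [{
        'LogicalOperator': 'EQUALS',
        'JobName': 'job-a',
        'State': 'SUCCEEDED',
    }]},
    Actions=[{'JobName': 'job-b'}],
)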

34. How do I query metadata in Athena?

AWS Glue metadata such as databases, tables, partitions, and columns can be queried using Athena. Individual Hive DDL commands can be used to extract metadata information from Athena for specific databases, tables, views, partitions, and columns, but the results are not in tabular form.
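Such statements can also be issued programmatically through the boto3 Athena client; in this sketch the database name and the S3 output location are hypothetical:

import boto3

athena = boto3.client('athena')

# Run a Hive DDL statement against metadata in the Glue Data Catalog
response = athena.start_query_execution(
    QueryString='SHOW TABLES IN sales_db',
    ResultConfiguration={'OutputLocation': 's3://example-bucket/athena-results/'},
)
print(response['QueryExecutionId'])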

35. What is the general workflow for how a Crawler populates the AWS Glue Data Catalog? 

The usual method for populating the AWS Glue Data Catalog via a crawler is as follows:

  • To deduce the format and schema of your data, a crawler runs any custom classifiers you specify. Custom classifiers are programmed by you and run in the order you specify.
  • A schema is created using the first custom classifier that correctly recognizes your data structure. Lower-ranking custom classifiers are ignored.
  • If no custom classifier matches your data's schema, built-in classifiers attempt to recognize it. An example of a built-in classifier is one that recognizes JSON.
  • The crawler accesses the data storage. Connection attributes are required for crawler access to some data repositories.
  • Your data will be given an inferred schema.
  • The crawler writes metadata to the Data Catalog. A table definition contains metadata that describes the data in your data store. The table is written to a database, which is a container of tables in the Data Catalog. The table's classification attribute is the label created by the classifier that inferred the table schema.

36. How does AWS Glue handle ETL errors?

AWS Glue tracks job metrics and errors and pushes all notifications to Amazon CloudWatch. You can configure Amazon CloudWatch to perform various actions in response to AWS Glue notifications; for example, you can trigger an AWS Lambda function when you receive an error or success notification from Glue. Glue also has default retry behavior that retries all failures three times before emitting an error message.

37. Can we run existing ETL jobs with AWS Glue?

Yes. You can run your existing Scala or Python code on AWS Glue. Simply upload the code to Amazon S3 and use it in one or more jobs. You can reuse code across multiple jobs by pointing them at the same code location on Amazon S3.

38. What AWS Glue Schema Registry supports data format, client language, and integrations?

The Schema Registry supports Java client apps and Apache Avro and JSON Schema data formats. We intend to keep adding support for non-Java clients and various data types. The Schema Registry works with Apache Kafka, Amazon Managed Streaming for Apache Kafka (MSK), Amazon Kinesis Data Streams, Apache Flink, Amazon Kinesis Data Analytics for Apache Flink, and AWS Lambda applications.

39. How to get metadata into the AWS Glue Data Catalog?

The AWS Glue Data Catalog can be populated in several ways. Glue crawlers scan various data stores you own to infer schemas and partition structures and populate the Data Catalog with corresponding table definitions and statistics. You can run crawlers on a schedule to keep your metadata current and in sync with the underlying data. Alternatively, you can add and update table details manually using the AWS Glue Console or the API. You can also run Hive DDL statements on an Amazon EMR cluster via the Amazon Athena Console or a Hive client.

40. How to import data from the existing Apache Hive Metastore to the AWS Glue Data Catalog?

Run an ETL job that reads the metadata from your Apache Hive Metastore, exports it to Amazon S3 as an intermediate format, and then imports it into the AWS Glue Data Catalog.

41. Do we need to maintain my Apache Hive Metastore if we store metadata in the AWS Glue Data Catalog?

No. The AWS Glue Data Catalog is Apache Hive Metastore-compatible. You can point your applications at the Glue Data Catalog endpoint and use it as a drop-in replacement for an Apache Hive Metastore, so you no longer need to maintain your own metastore.

42. What is AWS Glue Streaming ETL?

AWS Glue enables ETL operations on streaming data through continuously running jobs. Streaming ETL is built on the Apache Spark Structured Streaming engine and can ingest streams from Amazon Kinesis Data Streams and Apache Kafka (including Amazon Managed Streaming for Apache Kafka). It can clean and transform streaming data and load it into Amazon S3 or JDBC data stores, and it can process event data such as IoT streams, clickstreams, and network logs.
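A streaming job script typically reads the stream as a DataFrame and processes it in micro-batches. The sketch below follows the pattern in AWS's streaming examples; the database, table, and S3 paths are hypothetical:

from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame

glueContext = GlueContext(SparkContext.getOrCreate())

# Read from a Data Catalog table backed by a Kinesis or Kafka stream
stream_df = glueContext.create_data_frame.from_catalog(
    database='streaming_db',
    table_name='clickstream',
    additional_options={'startingPosition': 'TRIM_HORIZON'},
)

def process_batch(batch_df, batch_id):
    # Convert each micro-batch to a DynamicFrame and write it to S3
    dyf = DynamicFrame.fromDF(batch_df, glueContext, 'batch')
    glueContext.write_dynamic_frame.from_options(
        frame=dyf,
        connection_type='s3',
        connection_options={'path': 's3://example-bucket/clickstream/'},
        format='parquet',
    )

glueContext.forEachBatch(
    frame=stream_df,
    batch_function=process_batch,
    options={'windowSize': '60 seconds',
             'checkpointLocation': 's3://example-bucket/checkpoints/'},
)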

43. When should we use AWS Glue Streaming, and when should we use Amazon Kinesis Data Analytics?

AWS Glue Streaming is a feature of AWS Glue that allows real-time data processing and transformation. It is best suited for situations where data needs to be processed and transformed as it is being ingested, such as in near real-time data analytics or data processing for machine learning applications.

On the other hand, Amazon Kinesis Data Analytics is a fully managed service that enables real-time processing and analysis of streaming data. It is best suited for situations where real-time data analytics and processing is the primary focus, such as real-time data dashboarding or anomaly detection.

In summary, if your primary focus is on data transformation and integration, you should use AWS Glue Streaming. If your primary focus is on real-time data analytics and processing, you should use Amazon Kinesis Data Analytics.

44. How to specify join types in AWS Glue?

In a Glue script, you can convert DynamicFrames to Spark DataFrames and specify the join type through the how argument of DataFrame.join(), for example:

import sys
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext

args = getResolvedOptions(sys.argv, ["TempDir"])
glueContext = GlueContext(SparkContext.getOrCreate())

# Load both tables from the Data Catalog as DynamicFrames
cUser0 = glueContext.create_dynamic_frame.from_catalog(database="captains", table_name="cp_txn_winds_karyakarta_users", transformation_ctx="cUser")
cUser0DF = cUser0.toDF()

cKKR = glueContext.create_dynamic_frame.from_catalog(database="captains", table_name="cp_txn_winds_karyakarta_karyakartas", redshift_tmp_dir=args["TempDir"], transformation_ctx="cKKR")
cKKRDF = cKKR.toDF()

# Specify the join type ('left_outer' here) with the how argument
dataSource0 = cUser0DF.join(cKKRDF, cUser0DF.id == cKKRDF.user_id, how='left_outer')

45. What is AWS Glue DataBrew?

AWS Glue DataBrew is a visual data preparation tool that lets data analysts and data scientists prepare data without writing code, using an interactive point-and-click graphical interface. With Glue DataBrew, you can easily visualize, clean, and normalize terabytes, and even petabytes, of data directly from your data lake, data warehouses, and databases, including Amazon S3, Amazon Redshift, Amazon Aurora, and Amazon RDS.

46. Who can use AWS Glue DataBrew?

AWS Glue DataBrew is designed for users who need to clean and standardize data before using it for analytics or machine learning. The most common users are data analysts and data scientists. For data analysts, typical job functions include business intelligence analysts, operations analysts, market intelligence analysts, legal analysts, financial analysts, economists, quants, and accountants. For data scientists, typical job functions include materials scientists, bioanalytical scientists, and scientific researchers.

47. How to list all databases and tables in AWS Glue Catalog?

To list all databases and tables in the AWS Glue Data Catalog, you can use the get_databases() and get_tables() methods of the boto3 Glue client. Here is an example of how to use these methods:

import boto3

# Create a Glue client
client = boto3.client('glue')

# List all databases in the Glue Catalog
databases = client.get_databases()['DatabaseList']
for database in databases:
    print(database['Name'])

    # List all tables in each database
    tables = client.get_tables(DatabaseName=database['Name'])['TableList']
    for table in tables:
        print(f' - {table["Name"]}')

Alternatively, you can use the client's get_database() and get_table() methods to retrieve a specific database or table by name. Note that get_databases() and get_tables() return paginated results, so for large catalogs you may need to iterate using the returned NextToken or a boto3 paginator.


48. What types of transformations are supported in AWS Glue DataBrew?

You can combine, pivot, and transpose data using over 250 built-in transformations without writing code. AWS Glue DataBrew also automatically suggests transformations such as filtering anomalies; correcting invalid, misclassified, or duplicate data; normalizing data to standard date and time values; or generating aggregates for analysis. For complex transformations, such as converting words to a common base or root word, Glue DataBrew provides transformations that use advanced machine learning techniques such as Natural Language Processing (NLP). Multiple transformations can be grouped together, saved as recipes, and applied directly to incoming data.

49. What file formats does AWS Glue DataBrew support?

AWS Glue DataBrew accepts comma-separated values (.csv), JSON and nested JSON, Apache Parquet and nested Apache Parquet, and Excel sheets as input data types. Comma-separated values (.csv), JSON, Apache Parquet, Apache Avro, Apache ORC, and XML are all supported as output data formats in AWS Glue DataBrew.

50. How do you execute AWS Glue scripts using Python 2.7 from a local machine?

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

# Create a GlueContext on top of the local Spark context
glueContext = GlueContext(SparkContext.getOrCreate())

# Load a Data Catalog table as a DynamicFrame
persons = glueContext.create_dynamic_frame.from_catalog(
             database="records",
             table_name="recordsrecords_converted_json")

# Python 2.7 print statement
print "Count: ", persons.count()
persons.printSchema()

51. Do we need to use AWS Glue Data Catalog or AWS Lake Formation to use AWS Glue DataBrew?

No. You can use AWS Glue DataBrew without using the AWS Glue Data Catalog or AWS Lake Formation. If you do use them, however, DataBrew users can select data sets from their centralized data catalog.

52. What are AWS Glue Elastic Views?

AWS Glue Elastic Views makes it simple to build materialized views that combine and replicate data across multiple data stores without writing custom code. Using familiar Structured Query Language (SQL), you can quickly create a virtual table (a materialized view) from multiple source data stores. AWS Glue Elastic Views copies data from each source data store and creates a replica in a target data store. It continuously monitors for changes in your source data stores and automatically updates the materialized views in your target data stores, ensuring that data accessed through the materialized view is always up to date.

53. Why should we use AWS Glue Elastic Views?

Use AWS Glue Elastic Views to combine and continuously replicate data across several data stores in near-real time. This is frequently needed when implementing new application functionality that requires access to data from one or more existing data stores. For example, a company might use a customer relationship management (CRM) application to keep track of customer information and an e-commerce website to handle online transactions; the data would then live in two or more separate data stores.

54. What are the components used by AWS Glue?


AWS Glue consists of:

  • Data Catalog: a central metadata repository.
  • ETL engine: generates Python or Scala code.
  • Flexible scheduler: handles dependency resolution, job monitoring, and retries.
  • AWS Glue DataBrew: normalizes and cleans data with a visual interface.
  • AWS Glue Elastic Views: replicates and combines data across multiple data stores.

55. What is AWS Glue Data Catalog?

Your persistent metadata store is the AWS Glue Data Catalog. It is a managed service that allows you to store, annotate, and share metadata in the AWS Cloud in the same manner that an Apache Hive metastore does. There is one AWS Glue Data Catalog per AWS region in each AWS account.

56. Which AWS services and open-source projects make use of the AWS Glue Data Catalog?

The AWS services and open-source projects that make use of the AWS Glue Data Catalog include:

  1. AWS Lake Formation
  2. Amazon Athena
  3. Amazon Redshift Spectrum
  4. Amazon EMR
  5. AWS Glue Data Catalog Client for Apache Hive Metastore

57. How do join/merge all rows of an RDD in PySpark / AWS Glue into one single long line?

Each RDD row can be mapped to one string per row using map, and the result of the map call can then be aggregated into a single large string:

# Join the fields of each row into one line, then concatenate all lines
result = rdd.map(lambda r: " ".join(r) + "\n")\
    .aggregate("", lambda a, b: a + b, lambda a, b: a + b)

58. What is the purpose of an AWS Glue Job?

In AWS Glue, a job is the business logic that performs the extract, transform, and load (ETL) work. When you start a job, AWS Glue runs a script that extracts data from sources, transforms it, and loads it into targets. You can create jobs in the ETL section of the AWS Glue console.

59. Do the AWS Glue APIs return the partition key fields in the order in which they were specified when the table was created?

Yes, the partition keys would be returned in the same order as they were specified when the table was created.

60. How do you trigger a Glue crawler in AWS Glue?

The create-trigger command below creates a scheduled trigger named TechGeekNextTrigger in the activated state that runs every day at 12:00 pm UTC and starts a crawler named TechGeekNextCrawler:

aws glue create-trigger --name TechGeekNextTrigger --type SCHEDULED --schedule "cron(0 12 * * ? *)" --actions CrawlerName=TechGeekNextCrawler --start-on-creation

61. How to create an AWS Glue job using CLI commands?

We can create an AWS Glue job using the command below. Note that the command Name must be glueetl for Spark ETL jobs, and the ScriptLocation S3 path shown is a placeholder:

aws glue create-job \
--name ${GLUE_JOB_NAME} \
--role ${ROLE_NAME} \
--command "Name=glueetl,ScriptLocation=s3://<bucket>/<script>.py" \
--connections Connections=${GLUE_CONN_NAME} \
--default-arguments file://${DEFAULT_ARGUMENT_FILE}

62. How to get the total number of partitions in AWS Glue for a specific range?

By using the command below, we can get the total number of partitions of a table; a range-filtered variant is shown after the command:

aws glue get-partitions --database-name xx --table-name xx --query 'length(Partitions[])'
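To restrict the count to a specific range, you can pass a partition-predicate expression. Here is a boto3 sketch with hypothetical database, table, and partition-key names:

import boto3

glue = boto3.client('glue')

# Count only the partitions matching a range expression, paginating as needed
paginator = glue.get_paginator('get_partitions')
count = 0
for page in paginator.paginate(
        DatabaseName='sales_db',
        TableName='orders',
        Expression="year = '2023' AND month >= '06'"):
    count += len(page['Partitions'])
print(count)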

63. What are AWS Glue Crawlers?

AWS Glue Crawlers are automated programs that discover and catalog data in data stores. They can crawl data stored in Amazon S3, Amazon RDS, Amazon Redshift, and other data stores to build a metadata catalog of data structures and properties.

AWS Glue Crawlers can be used to:

  • Discover data stored in various data stores
  • Infer schemas and partition structures
  • Extract metadata and create or update table definitions in the Glue Data Catalog

AWS Glue Crawlers can be scheduled to run at specific intervals or triggered to run on demand. Re-running a crawler re-scans the data stores for changes and updates the metadata catalog as needed.
