Top 50+ Apache Hudi Interview Questions

Apache Hudi Interview Questions

1. What is Hudi?

Apache Hudi is a transactional data lake platform that brings database and data warehouse capabilities to the data lake. Hudi reimagines slow old-school batch data processing with a powerful new incremental processing framework for low latency minute-level analytics.

2. When and why is Hudi used?

Hudi (Hadoop Upserts Deletes and Incrementals) is an open-source data management framework designed to simplify incremental data processing in Apache Hadoop. Hudi provides a reliable and efficient way to handle data changes, enabling users to perform updates, deletes, and incremental processing on large datasets stored in Hadoop Distributed File System (HDFS) or cloud storage systems like Amazon S3, Azure Data Lake, and Google Cloud Storage.

Hudi is typically used in scenarios where there is a need to perform updates, deletes, or incremental processing on large datasets, especially those that are continuously changing. Hudi is well-suited for use cases where data needs to be frequently updated, but only the changes need to be processed, rather than the entire dataset.

Some of the common use cases of Hudi include:

  1. Data warehousing: Hudi can be used to build a reliable data pipeline for storing and processing large datasets in a data warehouse environment. Hudi’s support for incremental processing and updates makes it ideal for building data warehouses that require frequent updates.
  2. Streaming data processing: Hudi can be used to process data in real-time by ingesting data from streaming sources and writing to Hudi tables. This allows for real-time analysis and processing of data.
  3. Data Lake management: Hudi can be used to manage and process data stored in a data lake environment. Hudi’s support for incremental processing and updates makes it easy to handle changes to data in a data lake, without having to process the entire dataset each time.

Hudi is used to simplify the processing of large datasets in Hadoop environments, and it’s particularly useful for scenarios where data changes frequently, but only the changes need to be processed.

3. What are some non-goals for Hudi?​

Although Hudi is a powerful data management framework for processing and managing large datasets in Hadoop environments, it has some non-goals or limitations that users should be aware of. Some of the non-goals for Hudi include:

  1. Hudi is not a replacement for traditional databases: Hudi is designed to work with Hadoop Distributed File System (HDFS) and cloud storage systems to process large datasets, but it is not a database replacement. Hudi does not provide all the features of a traditional database, such as full transaction support and complex query capabilities.
  2. Hudi is not designed for low-latency transaction processing: While Hudi provides support for incremental processing and updates, it is not designed for low-latency transaction processing. Hudi is optimized for batch processing of large datasets, which may take longer to complete.
  3. Hudi is not a machine learning framework: Hudi is not designed for machine learning workloads or providing advanced analytics capabilities. Hudi is focused on data management and simplifying data processing in Hadoop environments.
  4. Hudi does not provide data security features: Although Hudi provides some basic data management features, it does not provide advanced security features such as encryption or access control. Users are responsible for implementing their own security measures to protect their data.

4. What is incremental processing in Hudi? ​

Incremental processing means consuming and processing only the records that have changed since a given point in time (a commit, or "instant", on the Hudi timeline), instead of re-reading and recomputing over the entire table. When processing data incrementally, Hudi uses the commit metadata on its timeline to identify the files that contain changes since the last processed instant, and exposes just those changed records so that downstream jobs can update, insert, or delete as necessary.

Incremental processing in Hudi provides several benefits, including:

  1. Faster processing times: Incremental processing enables Hudi to process only the changes made to a dataset since the last time it was processed, which can significantly reduce processing times for large datasets.
  2. Reduced data movement: Downstream jobs only read and carry forward the changed records rather than re-materializing the whole dataset, which reduces I/O and intermediate storage compared to full reprocessing.
  3. More efficient use of resources: Incremental processing enables Hudi to use computing resources more efficiently, by avoiding the need to process the entire dataset each time.
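
As a quick illustration, this is roughly what consuming a table incrementally looks like with the Spark datasource (Scala; the path, the begin instant time, and the spark session are assumed placeholders):

// Read only the records committed after a given instant time on the Hudi timeline
val incrementalDF = spark.read
  .format("hudi")
  .option("hoodie.datasource.query.type", "incremental")
  .option("hoodie.datasource.read.begin.instanttime", "20230101000000") // commits strictly after this instant
  .load("/path/to/hudi_table")

incrementalDF.createOrReplaceTempView("hudi_increments")
spark.sql("SELECT count(*) FROM hudi_increments").show()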

5. What is the difference between copy-on-write (COW) vs merge-on-read (MOR) storage types?​

Copy-On-Write (COW) and Merge-On-Read (MOR) are two different storage types supported by Hudi for managing large datasets stored in Hadoop Distributed File System (HDFS) or cloud storage systems like Amazon S3, Azure Data Lake, and Google Cloud Storage. The main differences between these storage types can be summarized in the following table:

| Aspect | Copy-On-Write (COW) | Merge-On-Read (MOR) |
| --- | --- | --- |
| Data update | Rewrites the affected base (Parquet) files with the changes merged in | Writes changes to row-based log/delta files that are merged with the base files later |
| Write performance | Slower, since base files are rewritten on every update | Faster, since updates are appended to log files and merged later (during compaction or reads) |
| Read performance | Faster, since queries read fully merged base files | Slower for snapshot queries, since log files must be merged with base files at read time (read-optimized queries stay fast but may lag behind) |
| Storage / write amplification | Higher, since updated files are rewritten on each commit | Lower, since only the deltas are written until compaction runs |
| Data consistency / freshness | Each commit produces immutable file versions; queries always see fully merged data | Snapshot queries see merged data; read-optimized queries may serve slightly stale data until compaction catches up |
| Typical use cases | Read-heavy workloads with infrequent or batch-oriented updates | Write-heavy workloads with frequent updates or near-real-time ingestion |

6. Explain how to choose a storage type for a workload?​

Choosing the right storage type for your workload in Hudi depends on several factors, including the size of your data, the type of queries you will be running, and the frequency of updates to your data.

Here are some common storage types used in Hudi and their characteristics:

  1. Copy on Write (COW): Works well for read-heavy use cases where data is appended or updated relatively infrequently. Updated file groups are rewritten on each commit, so queries read fully merged files and stay fast, at the cost of higher write amplification.
  2. Merge on Read (MOR): Works well for use cases where data is frequently updated or ingested in near real time. Updates land in log files and are merged with the base files later (by compaction or at read time), which keeps writes fast.
  3. Query types on MOR: In addition to the table type, MOR tables offer read-optimized queries (read only the compacted base files; fast but possibly slightly stale) and snapshot queries (merge log files at read time for the freshest view).

When choosing a storage type, consider the trade-offs between read and write performance, data freshness, and operational complexity. For example, if you have a large dataset with frequent updates and need low ingest latency, MOR is usually the better choice. Conversely, if your dataset is updated infrequently and is heavily read, COW is typically the simpler and faster option.

Ultimately, the choice of storage type for your Hudi workload will depend on the specific requirements of your use case. It’s important to evaluate your workload carefully and experiment with different storage types to find the optimal solution.
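
As an illustration, the table type is simply a write option on the table; a minimal Spark/Scala sketch (df, the table name, and the path are assumed placeholders):

// COPY_ON_WRITE for read-heavy tables with infrequent updates,
// MERGE_ON_READ for write-heavy, frequently updated tables
df.write.format("hudi")
  .option("hoodie.table.name", "my_hudi_table")
  .option("hoodie.datasource.write.table.type", "MERGE_ON_READ") // or "COPY_ON_WRITE"
  .option("hoodie.datasource.write.recordkey.field", "id")
  .option("hoodie.datasource.write.precombine.field", "ts")
  .mode("append")
  .save("/path/to/hudi_table")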

7. Can you explain whether Hudi is an analytical database or not?​

Apache Hudi is not an analytical database in the traditional sense. Instead, it is a distributed data management framework designed to handle large-scale, near-real-time data processing workloads.

Hudi is designed to enable incremental data processing, which means it can process only the changes to data rather than processing the entire dataset each time a query is executed. This feature makes Hudi well-suited for handling constantly changing data, such as clickstream data, social media feeds, or IoT sensor data.

While Hudi is not an analytical database, it can be used in conjunction with analytical databases like Apache Hive, Apache Spark, or Presto to provide fast and efficient querying capabilities for large datasets. Hudi’s incremental processing capabilities and ability to store large volumes of data in a distributed manner can help improve the performance of analytical databases when processing large datasets.

8. Explain how to model the data stored in Hudi?​

Modeling data in Hudi is similar to modeling data in other distributed data processing systems. However, there are some specific considerations that you should keep in mind when modeling data in Hudi.

  1. Define your schema: Before you start storing data in Hudi, it is important to define a schema for your data. This schema will help you define the structure of your data and ensure that it is stored in a consistent and organized way. Hudi supports multiple data formats, including Parquet, Avro, and ORC, so you can choose the format that works best for your use case.
  2. Choose your table type: As mentioned earlier, Hudi supports two table types, Copy on Write (COW) and Merge on Read (MOR), with MOR additionally supporting read-optimized queries. Choose the type that best fits your update frequency, ingest latency, and read-performance requirements.
  3. Define your partitioning strategy: Hudi supports partitioning data by different columns, such as date, time, or location. Partitioning your data can help you optimize your queries and improve the performance of your data processing workflows.
  4. Define your indexing strategy: Hudi maintains an index that maps record keys to the file groups that contain them. Choosing an index type (for example Bloom, Simple, or HBase, in global or non-global variants) affects how quickly upserts and deletes can locate existing records.
  5. Define your merge behavior: When records with the same key arrive, Hudi uses the configured pre-combine field and payload class (the default, OverwriteWithLatestAvroPayload, keeps the record with the latest pre-combine value) to decide how incoming records are merged with the records already on storage.
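
Putting these modeling choices together, here is a hedged Spark/Scala sketch (column names, table name, path, and the ordersDF DataFrame are illustrative assumptions) showing how the record key, partition column, pre-combine field, and table type are declared when writing:

val hudiOptions = Map(
  "hoodie.table.name" -> "orders",                               // table name
  "hoodie.datasource.write.recordkey.field" -> "order_id",       // record key (primary key)
  "hoodie.datasource.write.partitionpath.field" -> "order_date", // partitioning column
  "hoodie.datasource.write.precombine.field" -> "updated_at",    // picks the latest record among duplicates
  "hoodie.datasource.write.table.type" -> "COPY_ON_WRITE",       // or MERGE_ON_READ
  "hoodie.datasource.write.operation" -> "upsert"
)

ordersDF.write.format("hudi")
  .options(hudiOptions)
  .mode("append")
  .save("/path/to/orders_table")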

9. Why does Hudi require a key field to be configured?​

Hudi requires a key field to be configured because it uses this field to uniquely identify and track changes to data records.

When Hudi writes data to storage, it partitions the data based on the key field. This enables Hudi to store and process large datasets efficiently, by distributing the data across multiple nodes in a cluster.

Additionally, when new data is written to storage, Hudi uses the key field to determine if a record already exists, and if so, whether the new data should overwrite or be merged with the existing record. This is essential for handling incremental data processing workflows, where new data is frequently added to an existing dataset.

By requiring a key field to be configured, Hudi ensures that data is stored and processed efficiently and accurately, even in large-scale, constantly changing data processing workflows.

10. Does Hudi support cloud storage/object stores?​

Yes, Hudi supports cloud storage/object stores such as Amazon S3, Azure Blob Storage, Google Cloud Storage, and more.

In fact, cloud storage and object stores are often the preferred storage option for Hudi users because they provide cost-effective, scalable, and durable storage for large datasets.

Hudi supports cloud storage and object stores through its built-in connectors, which enable seamless integration with these services. These connectors are designed to optimize data reads and writes, minimize network latency, and ensure data consistency and durability.

When using Hudi with cloud storage or object stores, it is important to consider the storage and retrieval costs associated with these services, as well as the performance and latency requirements of your data processing workflows. Hudi provides several configuration options and tuning parameters that can help you optimize your data storage and processing workflows for cloud storage and object stores.

11. What versions of Hive/Spark/Hadoop are supported by Hudi?

Hudi supports a wide range of versions for Hive, Spark, and Hadoop. However, the specific versions that are supported depend on the Hudi release version that you are using.

Here are the supported versions for some of the major Hudi releases:

  • Hudi 0.9.x: Supports Apache Hadoop 2.7.x and later, Apache Spark 2.4.x and later, and Apache Hive 2.x and later.
  • Hudi 0.10.x: Supports Apache Hadoop 2.7.x and later, Apache Spark 3.0.x and later, and Apache Hive 3.x and later.
  • Hudi 0.11.x: Supports Apache Hadoop 2.7.x and later, Apache Spark 3.1.x and later, and Apache Hive 3.x and later.

Note that the exact version compatibility may vary depending on the specific features and configuration options that you are using.

12. How does Hudi actually store data inside a dataset?​

Hudi stores data inside a dataset using a combination of columnar and row-based file formats, along with an indexing mechanism to enable fast data access and processing.

Hudi organizes data into partitions based on a user-specified partition key. Within each partition, Hudi stores data in two types of files:

  1. Parquet files: Hudi uses the Apache Parquet columnar file format to store the bulk of the data. This format provides efficient compression and encoding of data, as well as support for predicate pushdown and column projection, which can significantly improve query performance.
  2. Log (delta) files: For Merge-On-Read tables, Hudi writes row-based log files that track incremental changes. Each log file contains a set of inserts, updates, and deletes against its corresponding base Parquet file, which enables efficient upserts and incremental processing; compaction later folds these changes back into the base files.

In addition to these files, Hudi maintains indexing and timeline metadata to enable fast data access and processing:

  1. Bloom filters: Hudi stores a Bloom filter over the record keys in each base Parquet file's footer. During upserts, these filters let Hudi quickly rule out files that cannot contain an incoming key, greatly reducing the number of files that need to be checked.
  2. Hoodie timeline: Hudi maintains a timeline of all actions performed on the table (commits, delta commits, cleans, compactions) along with the base and log files each one produced. The timeline is what enables querying and processing the data at different points in time, which is essential for incremental processing workflows.

By using this combination of columnar and row-based file formats, along with indexing mechanisms, Hudi is able to provide fast, efficient, and scalable storage and processing of large datasets.
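
For intuition, the on-disk layout of a (Merge-On-Read) Hudi table typically looks roughly like the sketch below; the file names are illustrative, but the ingredients are the ones described above: a .hoodie directory with the timeline metadata, and per-partition base Parquet files with log files holding the deltas.

/path/to/hudi_table/
├── .hoodie/                          (timeline metadata: commits, deltacommits, cleans, ...)
│   ├── 20230101093000.commit
│   ├── 20230101101500.deltacommit
│   └── hoodie.properties
└── order_date=2023-01-01/            (partition directory)
    ├── <fileId>_<writeToken>_20230101093000.parquet   (base columnar file)
    └── .<fileId>_20230101093000.log.1                 (row-based log file with updates/deletes)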

13. How does Hudi handle partition evolution requirements?

Partition evolution refers to the need to add, remove, or modify partition keys in an existing dataset. This can happen when the schema of the underlying data changes, or when the data is reorganized to improve query performance or reduce storage costs.

Hudi provides several mechanisms to handle partition evolution requirements:

  1. Dynamic partitioning: Hudi supports dynamic partitioning, which allows new partitions to be added to a dataset on-the-fly, as data is written. This can be useful when dealing with datasets that have a large number of partitions, or when the partition keys are not known in advance.
  2. Partition key schema evolution: Hudi supports schema evolution for partition keys, which allows the partition key schema to be modified over time without affecting the existing data. This can be useful when the partitioning scheme needs to be changed to improve query performance or reduce storage costs.
  3. Partition path renaming: Hudi provides support for renaming partition paths, which allows existing partitions to be moved or renamed without affecting the data. This can be useful when reorganizing data to improve query performance or to conform to a different partitioning scheme.
  4. Upsert-based partition evolution: Hudi uses an upsert-based approach to handle partition evolution requirements. When a new partition is added, or when the partition key schema changes, Hudi creates new delta files that are associated with the new partition or schema. These delta files contain all the new or modified data, and can be merged with the existing data to create a new version of the dataset.

By providing these mechanisms, Hudi enables users to handle partition evolution requirements in a flexible and efficient way, while ensuring that data consistency and durability are maintained.

Using Hudi​

14. What are some ways to write a Hudi dataset?​

Apache Hudi is a data management framework that enables stream processing and incremental data processing. To write a Hudi dataset, you can follow these steps:

  1. Choose where to store the table: Hudi tables can live on HDFS or on cloud object stores such as Amazon S3 and Azure Blob Storage. Choose the storage that suits your use case, along with a table type (COW or MOR) that fits your workload.
  2. Define your schema: Hudi requires a schema definition for the data it will store. You can define your schema using the Avro schema format, which is a popular choice for Hudi.
  3. Create a Hudi table: Use the Hudi API to create a Hudi table. This involves specifying the storage type, schema, and other table properties such as partitioning.
  4. Write data to the table: You can write data to a Hudi table using one of the available APIs such as Spark or Flink. The API will write data to the table and handle the incremental updates automatically.
  5. Perform incremental updates: Hudi supports incremental updates, which means that you can update existing records in the table without having to rewrite the entire table. You can perform these updates using the available APIs.
  6. Compact the table: Over time, a Hudi table may become fragmented due to the incremental updates. You can use the Hudi API to compact the table and remove any unnecessary data.
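
For example, besides the programmatic APIs, tables can also be created and written through Spark SQL (supported since Hudi 0.9.0). A rough sketch, where the table and column names are placeholders and staging_orders is an assumed source table (exact property syntax can vary slightly across Hudi versions):

// Create a Hudi table with Spark SQL; type can be 'cow' or 'mor'
spark.sql("""
  CREATE TABLE IF NOT EXISTS orders (
    order_id STRING,
    amount DOUBLE,
    updated_at TIMESTAMP,
    order_date STRING
  ) USING hudi
  TBLPROPERTIES (
    type = 'cow',
    primaryKey = 'order_id',
    preCombineField = 'updated_at'
  )
  PARTITIONED BY (order_date)
""")

// Write rows into it with plain SQL
spark.sql("INSERT INTO orders SELECT order_id, amount, updated_at, order_date FROM staging_orders")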

15. How is a Hudi job deployed?​

To deploy a Hudi job, you can follow these steps:

  1. Build the Hudi application: Build your Hudi application using your preferred build tool such as Maven or Gradle.
  2. Package the application: Package your Hudi application into a jar file that can be executed.
  3. Configure your environment: Set up the environment where you want to deploy your Hudi job. This includes setting up the storage system, Hadoop cluster, and any other dependencies required by your application.
  4. Submit the job: Submit the Hudi job to the cluster using one of the available job submission mechanisms such as YARN, Spark, or Flink. The job submission mechanism will launch the Hudi application and execute the job.
  5. Monitor the job: Monitor the Hudi job to ensure it is executing correctly. This involves checking the logs, metrics, and other indicators to ensure the job is making progress and completing its tasks.
  6. Manage the job: Manage the Hudi job as necessary, including scaling it up or down, modifying its configuration, and restarting it if needed.

16. Explain how to query a Hudi dataset that has been written?​

To query a Hudi dataset, you can use one of the available query engines or tools that support the data storage and processing format used by the Hudi dataset. The specific tools and techniques you can use depend on the storage type and query engine you are using. Here are some options:

  1. Hive: If you have stored your Hudi dataset on HDFS or a compatible storage system, you can use Hive to query the data using SQL-like syntax. You can define a Hive table that maps to the Hudi dataset, and then run SQL queries against that table.
  2. Presto: Presto is an open-source distributed SQL query engine that can be used to query data stored in various formats, including Hudi. You can define a Presto table that maps to the Hudi dataset, and then run SQL queries against that table.
  3. Spark: If you used Spark to write your Hudi dataset, you can use Spark SQL to query the data using SQL-like syntax. You can load the data into a Spark DataFrame, and then use Spark SQL to run queries against that DataFrame.
  4. Flink: If you used Flink to write your Hudi dataset, you can use Flink SQL to query the data using SQL-like syntax. You can load the data into a Flink Table, and then use Flink SQL to run queries against that table.
  5. Hudi DeltaStreamer: Hudi DeltaStreamer is a tool that can be used to ingest data into a Hudi dataset from various sources, including Kafka and files on DFS/cloud storage. It can also be used to perform upserts and deletes on existing data. You can use DeltaStreamer to stream data into a Hudi dataset, and then query the dataset using one of the above tools.
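
As a concrete illustration of the Spark option, a snapshot query (the default) reads the latest committed view of the table; the path is a placeholder and a SparkSession is assumed:

// Snapshot query: read the latest committed view of the Hudi table
val snapshotDF = spark.read
  .format("hudi")
  .load("/path/to/hudi_table")

snapshotDF.createOrReplaceTempView("hudi_snapshot")
spark.sql("SELECT order_date, count(*) FROM hudi_snapshot GROUP BY order_date").show()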

17. How does Hudi handle duplicate record keys in an input?​

When Hudi encounters duplicate record keys in an input batch, how they are handled depends on the write operation being performed, which can be either upsert or insert.

  1. Upsert: Records with the same key inside the incoming batch are first de-duplicated (pre-combined) using the configured pre-combine field (hoodie.datasource.write.precombine.field), keeping the record with the largest value. The surviving record is then merged with the record already on storage according to the configured payload class; with the default OverwriteWithLatestAvroPayload, the incoming record simply replaces the stored one.
  2. Insert: The insert operation skips index lookups, so duplicate keys can end up stored in the dataset. This can be acceptable when duplicates represent different versions of the same record, or it can be avoided by enabling batch de-duplication with hoodie.combine.before.insert.

18. Can you implement custom logic for merging input records with records on storage in Hudi?​

Yes, you can implement custom logic for merging input records with records on storage in Hudi. The extension point is the HoodieRecordPayload interface (org.apache.hudi.common.model.HoodieRecordPayload): its preCombine method defines how records with the same key within an incoming batch are combined, and its combineAndGetUpdateValue method defines how an incoming record is merged with the record already on storage. The default implementation, OverwriteWithLatestAvroPayload, keeps the record with the latest pre-combine value and overwrites the stored record.

Once you have implemented your payload class, you configure Hudi to use it through the payload class configuration (for example, hoodie.datasource.write.payload.class when writing with the Spark datasource), or programmatically through the writer configuration in a write-client based job.

For example, here is a sketch of how a custom payload class could be plugged in when writing with the Spark datasource (com.example.MyCustomPayload is a placeholder for your own implementation):

df.write.format("hudi")
  .option("hoodie.table.name", "my_hudi_table")
  .option("hoodie.datasource.write.recordkey.field", "id")
  .option("hoodie.datasource.write.precombine.field", "ts")
  // Fully-qualified name of your HoodieRecordPayload implementation (placeholder)
  .option("hoodie.datasource.write.payload.class", "com.example.MyCustomPayload")
  .mode("append")
  .save("/path/to/hudi/dataset")

In this sketch, com.example.MyCustomPayload is a class you have implemented against the HoodieRecordPayload interface. By pointing hoodie.datasource.write.payload.class at it, Hudi will invoke your preCombine and combineAndGetUpdateValue logic whenever records are merged during upserts (and, for Merge-On-Read tables, during compaction).

19. What is the process for deleting records from a Hudi dataset?​

In Hudi, records can be deleted from a dataset using a delete operation. The delete operation is performed using the HoodieWriteClient API, which provides several methods for deleting records.

The process for deleting records from a Hudi dataset involves the following steps:

  1. Create a HoodieWriteClient instance for the Hudi dataset using the configuration settings.
  2. Prepare the delete records by creating a JavaRDD of HoodieKey objects. The HoodieKey object represents the unique key of the record that needs to be deleted.
  3. Call the delete method on the HoodieWriteClient instance and pass in the JavaRDD of HoodieKey objects to delete the records from the dataset.
  4. The delete call takes an instant (commit) time. You can obtain one by starting a commit on the write client (startCommit), which has Hudi generate the instant time for you.

Here’s an example code snippet that demonstrates how to delete records from a Hudi dataset:

// Create a HoodieWriteConfig instance
HoodieWriteConfig config = HoodieWriteConfig.newBuilder()
    .withPath("/path/to/hudi/dataset")
    .withSchema(schema)
    .build();


// Create a HoodieWriteClient instance using the configuration
HoodieWriteClient client = new HoodieWriteClient(jsc, config);


// Prepare the keys (record key + partition path) of the records to delete
List<HoodieKey> keysToDelete = Arrays.asList(
    new HoodieKey("key1", "partition1"),
    new HoodieKey("key2", "partition2")
);
JavaRDD<HoodieKey> deleteKeys = jsc.parallelize(keysToDelete);


// Start a commit and delete the records from the Hudi dataset
String commitTime = client.startCommit();
JavaRDD<WriteStatus> writeStatuses = client.delete(deleteKeys, commitTime);

In this example, the HoodieWriteClient instance is used to delete records from the Hudi dataset located at /path/to/hudi/dataset. The deleteKeys RDD contains the keys (record key plus partition path) of the records that need to be deleted. A commit is started to obtain an instant time, and the delete method is then called with the keys and that instant time; it returns the write statuses describing the outcome of the delete operation.

20. Do deleted records appear in Hudi’s incremental query results?

By default, deleted records are not included in the incremental query results of a Hudi dataset. When records are deleted using the HoodieWriteClient API, they are marked for deletion by setting a tombstone flag. During query execution, Hudi’s query engine automatically filters out tombstone records from the incremental query results, ensuring that they are not returned in query results.

However, it is important to note that tombstone records are not physically removed from the dataset until a compaction operation is performed. Compaction is a process in which the data files of a Hudi dataset are merged and rewritten in a more compact form, removing any tombstone records that were previously marked for deletion.

If you need to include tombstone records in your query results, you can use the HoodieTimeline API to retrieve a view of the dataset that includes all versions of the data, including tombstone records. The HoodieTimeline API provides methods for querying the timeline of a Hudi dataset and retrieving specific versions of the data, including deleted records. However, this approach can be slower than using the incremental query feature, especially for large datasets, as it requires scanning the entire dataset history.

21. What is the process for migrating data to Hudi?​

The process for migrating data to Hudi involves the following steps:

  1. Prepare the data: The first step in migrating data to Hudi is to prepare the data for ingestion. This involves cleaning, transforming, and validating the data as needed. Hudi supports various data formats such as Parquet, ORC, Avro, and JSON, so the data may need to be converted to one of these formats before ingestion.
  2. Set up the Hudi environment: Next, you need to set up the Hudi environment. This involves installing the necessary dependencies, configuring the Hudi cluster, and creating the Hudi dataset.
  3. Ingest the data: Once the data is prepared and the Hudi environment is set up, you can start ingesting the data into Hudi. Hudi provides several ingestion options, including bulk insert, upsert, and incremental insert. Choose the option that best fits your use case.
  4. Verify the data: After ingesting the data, it is important to verify that the data was ingested correctly. You can use Hudi’s query functionality to query the data and validate that it matches the original data.
  5. Monitor the Hudi dataset: Finally, it is important to monitor the Hudi dataset to ensure that it is performing as expected. This includes monitoring the storage usage, query performance, and any errors or exceptions that may occur.

22. Explain how to pass Hudi configurations to a Spark job?​

There are several ways to pass Hudi configurations to a Spark job:

  1. Set configuration properties in code: You can set configuration properties directly in your Spark application code using the SparkConf object. For example, you can use the set method to set the Hudi table type, table name, and other properties.
  2. Set configuration properties using command line arguments: You can also pass Hudi configuration properties to your Spark application as command line arguments. This can be done using the --conf option followed by the configuration key and value. For example, you can pass the Hudi table type and table name as follows: --conf "hoodie.datasource.write.table.type=MERGE_ON_READ" --conf "hoodie.datasource.write.table.name=my_hudi_table"
  3. Set configuration properties using a configuration file: You can create a configuration file containing Hudi properties and pass it to your Spark application using the --properties-file option. This is useful for managing large sets of configuration properties.
  4. Set configuration properties using environment variables: You can set Hudi configuration properties as environment variables and access them in your Spark application using the System.getenv method. This is useful when you want to keep sensitive configuration properties out of code or configuration files.
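
For instance, a hedged Spark/Scala sketch of option (1) together with per-write Hudi options (inputDF, names, and the path are placeholders; setting spark.serializer to Kryo is the commonly documented recommendation for Hudi's Spark writers):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("hudi-writer")
  // Spark-level configuration
  .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .getOrCreate()

// Hudi-level configuration, passed per write (could equally come from --conf or a properties file)
inputDF.write.format("hudi")
  .option("hoodie.table.name", "my_hudi_table")
  .option("hoodie.datasource.write.table.type", "MERGE_ON_READ")
  .option("hoodie.datasource.write.recordkey.field", "id")
  .option("hoodie.datasource.write.precombine.field", "ts")
  .mode("append")
  .save("/path/to/hudi_table")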

23. How do you create a Hive-style partition folder structure?

To create a Hive-style partition folder structure in Hudi, partition your data by one or more columns, point hoodie.datasource.write.partitionpath.field at those columns, and enable hoodie.datasource.write.hive_style_partitioning so that Hudi writes partition folders as key=value pairs, the way Hive does.

Here’s an example code snippet in Scala:

import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.functions.date_format


val data = spark.read.format("csv")
  .option("header", true)
  .option("inferSchema", true)
  .load("/path/to/mydata.csv")


val partitionedData = data.withColumn("year", date_format(col("date"), "yyyy"))
  .withColumn("month", date_format(col("date"), "MM"))
  .withColumn("day", date_format(col("date"), "dd"))


val hudiOptions = Map(
  "hoodie.table.name" -> "my_hudi_table",
  "hoodie.datasource.write.recordkey.field" -> "id",
  "hoodie.datasource.write.partitionpath.field" -> "year,month,day",
  "hoodie.datasource.write.keygenerator.class" -> "org.apache.hudi.keygen.ComplexKeyGenerator",
  "hoodie.datasource.write.hive_style_partitioning" -> "true",
  "hoodie.datasource.write.operation" -> "upsert",
  "hoodie.datasource.write.table.type" -> "COPY_ON_WRITE"
)


partitionedData.write
  .format("org.apache.hudi")
  .options(hudiOptions)
  .mode(SaveMode.Append)
  .save("/path/to/hudi_table")

In this example, the data is first loaded from a CSV file and enriched with year, month, and day columns. The hoodie.datasource.write.partitionpath.field property is set to “year,month,day” (with the ComplexKeyGenerator, since multiple partition fields are used), and hoodie.datasource.write.hive_style_partitioning is set to true so that Hudi writes Hive-style key=value partition folders (for example year=2023/month=01/day=15). The other Hudi configuration properties are set as needed.

The data is then written to Hudi using the save method with the org.apache.hudi format and the specified Hudi configuration options.

By partitioning your data and enabling hoodie.datasource.write.hive_style_partitioning, Hudi will automatically create a folder structure under the table base path that reflects the partitioning columns, in the same key=value layout used by Hive.

24. Explain how to pass Hudi configurations to Beeline Hive queries?​

To pass Hudi configurations to Beeline Hive queries, you can use Hive variables and set them before executing the query. Here’s an example command to set a Hudi configuration variable and execute a Beeline Hive query:

beeline -u "jdbc:hive2://<hive-server2-hostname>:<port>/<database-name>" -n <username> -p <password> -hivevar hoodie.datasource.write.recordkey.field=id -f /path/to/query.hql

In this example, we’re setting the hoodie.datasource.write.recordkey.field variable to id, which is the name of the field used as the record key when writing data to Hudi. We’re also specifying the path to the Hive query file using the -f option.

You can set any Hudi configuration property using the -hivevar option. For example, to set the hoodie.datasource.write.keygenerator.class property to org.apache.hudi.keygen.SimpleKeyGenerator, you would use the following command:

beeline -u "jdbc:hive2://<hive-server2-hostname>:<port>/<database-name>" -n <username> -p <password> -hivevar hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.SimpleKeyGenerator -f /path/to/query.hql

Once you’ve set the Hudi configuration variables, you can use them in your Hive query as normal Hive variables. For example:

INSERT INTO TABLE my_hudi_table
SELECT id, name, address, date
FROM my_hive_table;

In this example, the record-key variable set earlier with the -hivevar option can be referenced inside the query or in the target table's properties as ${hivevar:hoodie.datasource.write.recordkey.field}, so the same value drives how records are keyed in the Hudi table being written to.

25. Explain whether it is possible to register a Hudi dataset with Apache Hive metastore?​

Yes, it is possible to register a Hudi dataset with the Apache Hive metastore. By registering a Hudi dataset with the metastore, you can easily query the dataset using SQL-like syntax in Hive.

To register a Hudi dataset with the metastore, you can enable Hudi's Hive sync as part of the write (for example, by setting hoodie.datasource.hive_sync.enable=true along with the database and table sync options when using the Spark datasource), or run the standalone HiveSyncTool shipped with Hudi, which creates or updates the corresponding table definitions in the Hive metastore.

26. How does Hudi indexing work?

Hudi maintains an index that maps each record key to the file group (and partition) that contains it. On every upsert or delete, this index is used to tag incoming records as inserts or updates and to route them to the right file groups, which is what makes efficient upserts possible.

Indexes come in non-global and global variants. A non-global index (for example BLOOM or SIMPLE) enforces key uniqueness only within a partition, so lookups are scoped to the incoming record's partition and scale well. A global index (GLOBAL_BLOOM, GLOBAL_SIMPLE, or the HBASE index) enforces uniqueness across the entire table, which is convenient when records can move between partitions but is more expensive for very large tables. The Bloom-based indexes rely on bloom filters (and optionally key ranges) stored in the base file footers to prune the set of files that must be checked during lookups.
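
The index implementation is selected with the hoodie.index.type write option; a brief sketch (df, path, and the chosen value are illustrative):

df.write.format("hudi")
  .option("hoodie.table.name", "my_hudi_table")
  .option("hoodie.datasource.write.recordkey.field", "id")
  .option("hoodie.index.type", "GLOBAL_BLOOM") // e.g. BLOOM, GLOBAL_BLOOM, SIMPLE, GLOBAL_SIMPLE, HBASE
  .mode("append")
  .save("/path/to/hudi_table")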

27. What are the benefits of Hudi indexing work?​

The benefits of using Hudi’s indexing include:

  • Efficient upserts and deletes: By quickly locating the file groups that contain the affected keys, the index lets Hudi rewrite or log only those files instead of scanning the whole table.
  • Faster ingestion: Bloom filters and key ranges prune the files that need to be checked while tagging incoming records, reducing lookup cost and write latency.
  • Better query performance: Keeping records organized into stable file groups, together with file-level statistics, reduces the amount of data that has to be scanned at query time.

28. What does the Hudi cleaner do?​

The Hudi cleaner is a built-in component in Hudi that helps to clean up old and expired data files that are no longer needed. It is responsible for identifying and deleting old data files that have already been compacted or are no longer useful.

When Hudi writes data, it writes to a new file, and over time, these files can accumulate and consume a significant amount of disk space. The cleaner is designed to free up this disk space by identifying old and unnecessary files and deleting them.

The cleaner uses a retention policy to determine which file versions are no longer needed. Hudi ships several policies: KEEP_LATEST_COMMITS (retain all file versions needed by the last N commits, the default), KEEP_LATEST_FILE_VERSIONS (retain the last N versions of each file), and KEEP_LATEST_BY_HOURS (retain file versions written within the last N hours).

The cleaner normally runs automatically after commits (it can also be scheduled as a separate job) and scans the Hudi dataset for file versions that fall outside the retention policy; when it identifies such files, it deletes them to reclaim storage. Keep in mind that the retention window also bounds how far back incremental and time-travel queries can reach.
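
A hedged example of the cleaner-related write options (the values are illustrative; df and the path are assumed):

df.write.format("hudi")
  .option("hoodie.table.name", "my_hudi_table")
  .option("hoodie.datasource.write.recordkey.field", "id")
  .option("hoodie.clean.automatic", "true")               // run cleaning automatically after commits
  .option("hoodie.cleaner.policy", "KEEP_LATEST_COMMITS") // retention policy
  .option("hoodie.cleaner.commits.retained", "10")        // keep file versions needed by the last 10 commits
  .mode("append")
  .save("/path/to/hudi_table")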

29. What’s Hudi’s schema evolution story?​

Hudi has a flexible schema evolution story that allows you to evolve your data schema over time without disrupting the existing data. Hudi’s schema evolution features are designed to handle changes in the schema, such as adding or removing columns, renaming columns, or changing data types.

Hudi’s schema evolution story is based on Apache Avro, which is a data serialization system that allows you to define data schemas in a JSON format. Avro provides a rich set of data types and allows for flexible schema evolution.

Hudi's schema evolution is primarily backward-compatible: you can add new nullable fields or promote a field's data type (for example, int to long) and continue reading older data written with the previous schema. More intrusive changes such as dropping or renaming fields are restricted and, depending on the Hudi version and query engine, may require the dedicated schema-evolution support to be explicitly enabled.

Hudi allows you to specify a schema for your data when you write it, and the schema can be stored with the data. When reading data, Hudi can automatically detect the schema based on the data files, making it easy to evolve the schema over time.

Hudi also supports schema validation to ensure that the data conforms to the expected schema. This validation can be done during data ingestion or during query time. This ensures that the data is always of the expected schema and avoids any potential issues caused by data schema mismatches.

30. What is the process for running compaction on a MOR dataset in Hudi?​

Compaction is an important process in Hudi that helps optimize the storage and performance of Merge on Read (MOR) datasets. The process involves merging the small files in the dataset into larger files, reducing the number of files to be read during query execution.

Here is the process for running compaction on a MOR dataset in Hudi:

  1. Set the configuration properties for compaction in your Hudi job. These properties include the compaction strategy to be used, the target file size for compacted files, and the maximum number of delta commits to be compacted at once.
  2. Start the compaction process by running a Hudi command or scheduling a job. The Hudi CLI provides compaction schedule and compaction run commands to schedule and execute compaction on a specific MOR dataset; if you prefer to automate it, you can schedule the compaction job with Apache Oozie or Apache Airflow.
  3. Monitor the progress of the compaction job. You can inspect pending and completed compactions from the Hudi CLI (for example, compactions show all) and track the job's logs and metrics for any issues that arise during the process.
  4. Verify the results of the compaction process. After the compaction job completes, you can verify that the number of files in the dataset has been reduced and that query performance has improved.
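
For the common inline-compaction setup on a MOR table, the relevant write options look roughly like this (values are illustrative; asynchronous and offline alternatives are covered in the next question):

df.write.format("hudi")
  .option("hoodie.table.name", "my_mor_table")
  .option("hoodie.datasource.write.table.type", "MERGE_ON_READ")
  .option("hoodie.datasource.write.recordkey.field", "id")
  .option("hoodie.compact.inline", "true")                // compact synchronously as part of the write
  .option("hoodie.compact.inline.max.delta.commits", "5") // trigger compaction every 5 delta commits
  .mode("append")
  .save("/path/to/hudi_table")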

31. What are the available options for performing asynchronous or offline compactions on MOR datasets in Hudi?​

In Hudi, there are several options available for performing asynchronous or offline compactions on Merge on Read (MOR) datasets. Here are some of the available options:

  • Schedule: Schedule compaction jobs using a workflow engine such as Apache Airflow or Apache Oozie. You can use these workflow engines to schedule compaction jobs at a specific time or after a certain threshold of data has been accumulated. This approach allows you to perform compaction offline or asynchronously from other jobs.
  • DeltaStreamer:
    • In continuous mode, asynchronous compaction is achieved by default: scheduling is done inline by the ingestion job, and compaction execution happens asynchronously on a separate parallel thread.
    • In non-continuous mode, only inline compaction is possible.
    • Note that in either mode, passing --disable-compaction disables compaction completely.
  • Spark datasource:
    • Async scheduling and async execution can be achieved by periodically running an offline Hudi Compactor Utility or Hudi CLI. However this needs a lock provider to be configured.
    • Alternatively, from 0.11.0 onwards, to avoid depending on a lock provider, scheduling alone can be done inline by the regular writer using the config hoodie.compact.schedule.inline, and compaction execution can then be done offline by periodically triggering the Hudi Compactor Utility or Hudi CLI.
  • Spark structured streaming:
    • Compactions are scheduled and executed asynchronously inside the streaming job. Async Compactions are enabled by default for structured streaming jobs on Merge-On-Read table.
    • Please note it is not possible to disable async compaction for MOR dataset with spark structured streaming.
  • Flink:
    • Async compaction is enabled by default for Merge-On-Read table.
    • Offline compaction can be achieved by setting compaction.async.enabled to false and periodically running Flink offline Compactor. When running the offline compactor, one needs to ensure there are no active writes to the table.
    • A third option (highly recommended over the second one) is to schedule compactions from the regular ingestion job and execute the compaction plans from an offline job. To achieve this, set compaction.async.enabled to false and compaction.schedule.enabled to true, and then periodically run the Flink offline compactor to execute the plans.

32. How to disable all table services in case of multiple writers?​

In Hudi, you can turn off all table services (cleaning, archiving, compaction, clustering, and so on) at once with the umbrella configuration hoodie.table.services.enabled. This is the typical setup when multiple writers write to the same table: the ingestion writers disable table services so they only write data, and the services are run from a single dedicated job instead, avoiding conflicts between writers.

hoodie.table.services.enabled=false

Note that with multiple concurrent writers you also need Hudi's multi-writer support, which is based on optimistic concurrency control: set hoodie.write.concurrency.mode=optimistic_concurrency_control and configure a lock provider so that conflicting commits are detected and resolved instead of silently overwriting each other.

Disabling table services on the writers does not remove the need for them: cleaning, compaction, and archiving still have to be scheduled from somewhere (for example, a separate maintenance job), otherwise old file versions and timeline metadata will accumulate and degrade performance over time.

33. Explain the expected performance and ingest latency when writing to a Hudi dataset?​

When writing to a Hudi dataset, the performance and ingest latency can depend on a variety of factors, such as the size of the dataset, the number of writers, the storage system used, and the type of workload.

In general, Hudi is designed for high throughput and low latency ingestion of large datasets. Hudi supports multiple write modes, including Copy-On-Write (COW) and Merge-On-Read (MOR), which can affect performance and ingest latency.

In COW mode, every update rewrites the affected base files, which increases write amplification and ingest latency for update-heavy workloads; in exchange, reads are simple and fast because the data is always fully merged. COW is therefore a better fit for read-heavy workloads with infrequent or batch updates.

In MOR mode, Hudi appends new versions of records to log files and merges them later (during compaction or at read time), which provides faster writes and lower ingest latency. MOR is more suitable for write-heavy workloads, frequent updates, or workloads with a high ingestion rate, at the cost of somewhat more expensive snapshot reads.

Additionally, Hudi provides various performance optimizations, such as indexing and compaction, that can further improve write performance and reduce ingest latency.

Ultimately, the expected performance and ingest latency when writing to a Hudi dataset will depend on the specific use case and workload. It is recommended to perform performance testing and benchmarking to determine the optimal write mode and configuration settings for your Hudi dataset.

34. Explain the expected performance when reading from a Hudi dataset or querying it?​

  • When reading from a Hudi dataset or querying it, the performance can depend on various factors, such as the size of the dataset, the query complexity, the storage system used, and the type of workload.
  • In general, Hudi is designed for high performance querying of large datasets. Hudi provides several query mechanisms, such as incremental querying and predicate pushdown, that can improve query performance and reduce data access time.
  • Hudi also provides indexing and compaction mechanisms that can further enhance query performance by reducing data scans and improving data locality.
  • The expected query performance for a Hudi dataset will depend on the specific use case and workload. It is recommended to perform performance testing and benchmarking to determine the optimal configuration settings for your Hudi dataset.
  • Overall, Hudi is optimized for read-heavy workloads and can provide high-performance querying of large datasets.

35. Explain how to avoid creating numerous small files in Hudi?​

Creating a large number of small files can negatively impact the performance of a Hudi dataset. To avoid this, there are several strategies that can be employed:

  1. Tune the file-sizing configs: hoodie.parquet.max.file.size controls the target size of base data files, while hoodie.parquet.small.file.limit makes the writer route new inserts into existing files that are still below this size instead of creating new small files (see the sketch after this list).
  2. Use compaction (for MOR tables): compaction merges the accumulating log files into larger base files, keeping the overall file count under control.
  3. Use clustering: Hudi's clustering table service can asynchronously rewrite many small files into fewer, larger (optionally sorted) files without blocking ingestion.
  4. Use sensible partitioning: avoid over-partitioning into many tiny partitions, and group related data together so that each partition receives enough data to produce well-sized files.

By employing these strategies, it is possible to avoid creating numerous small files in Hudi and improve the overall performance of the dataset.
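
As an example of strategy (1), the file-sizing knobs are plain write options (values below are illustrative; the defaults are in the same ballpark):

df.write.format("hudi")
  .option("hoodie.table.name", "my_hudi_table")
  .option("hoodie.datasource.write.recordkey.field", "id")
  .option("hoodie.parquet.max.file.size", "125829120")    // target ~120 MB base files
  .option("hoodie.parquet.small.file.limit", "104857600") // files under ~100 MB keep absorbing new inserts
  .mode("append")
  .save("/path/to/hudi_table")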

36. Why does Hudi retain at least one previous commit even after setting hoodie.cleaner.commits.retained to 1?

Hudi retains at least one previous commit even after setting hoodie.cleaner.commits.retained: 1 because it needs a reference commit to perform upserts on the data. This is required to maintain the data correctness, as Hudi relies on the last known state of the data for performing upserts.

When a new commit is made, Hudi first applies all the changes to the last known state of the data, which is the reference commit, and creates a new version of the data. This new version of the data becomes the new reference commit, and the previous reference commit is retained as per the hoodie.cleaner.commits.retained configuration.

Thus, even if hoodie.cleaner.commits.retained is set to 1, Hudi will still retain at least one commit, which acts as the reference commit for all subsequent commits.

37. How do you write to a non-partitioned Hudi dataset using DeltaStreamer or Spark DataSource API?​

To write to a non-partitioned Hudi dataset using DeltaStreamer or the Spark DataSource API, you create a DataFrame with the appropriate schema and write it with the Hudi format, leaving the partition path field empty and using the NonpartitionedKeyGenerator so that no partition folders are created. Here is an example code snippet for writing to a non-partitioned Hudi dataset using the Spark DataSource API (PySpark):

# Import necessary libraries
from pyspark.sql import SparkSession


# Create SparkSession
spark = SparkSession.builder.appName("HudiExample").getOrCreate()


# Define Hudi dataset path
hudi_dataset_path = "hdfs://path_to_hudi_dataset"

# Define input DataFrame
input_df = spark.read.parquet("hdfs://path_to_input_data")


# Write to the non-partitioned Hudi dataset
input_df.write \
        .format("hudi") \
        .option("hoodie.table.name", "my_hudi_table") \
        .option("hoodie.datasource.write.table.type", "COPY_ON_WRITE") \
        .option("hoodie.datasource.write.recordkey.field", "id") \
        .option("hoodie.datasource.write.partitionpath.field", "") \
        .option("hoodie.datasource.write.keygenerator.class",
                "org.apache.hudi.keygen.NonpartitionedKeyGenerator") \
        .option("hoodie.upsert.shuffle.parallelism", "4") \
        .option("hoodie.insert.shuffle.parallelism", "4") \
        .mode("append") \
        .save(hudi_dataset_path)


# Stop SparkSession
spark.stop()

In the above example, input_df is the input DataFrame to be written to the Hudi dataset. The format function specifies that we are writing with the Hudi datasource, and the option calls set the necessary Hudi configuration, such as the table name, table type, record key field, and the empty partition path combined with the NonpartitionedKeyGenerator that makes the table non-partitioned. The mode function specifies the write mode as “append”, which means the data is appended to the existing Hudi dataset. Finally, the save function writes the data to the specified Hudi dataset path.

Similarly, you can write to a non-partitioned dataset with DeltaStreamer by supplying the same record key and NonpartitionedKeyGenerator settings in its properties file; the table type (COPY_ON_WRITE or MERGE_ON_READ) can be chosen independently of whether the table is partitioned.

38. Why do we have to set 2 different ways of configuring Spark to work with Hudi?​

Configuring Spark to work with Hudi requires setting two different types of configurations:

  1. Hudi specific configurations: These are Hudi-specific properties that control how data is ingested, stored, and queried within Hudi. These configurations are set using the Hudi Configuration API and are used to configure Hudi-specific functionality such as how to handle schema evolution, how to partition data, and how to perform upserts and deletes.
  2. Spark configurations: These are Spark-specific properties that control how Spark processes data. These configurations are set using SparkConf and are used to configure Spark-specific functionality such as the number of Spark executors, how to serialize data, and how to cache data.

Both sets of configurations are necessary because Hudi is built on top of Spark and relies on Spark to perform many of its operations. The Hudi-specific configurations are used to control the behavior of Hudi itself, while the Spark-specific configurations are used to control the behavior of Spark as it processes data for Hudi. By setting both sets of configurations appropriately, you can optimize the performance of both Hudi and Spark for your use case.

39. Explain how to evaluate Hudi using a portion of an existing dataset?

To evaluate Hudi using a portion of an existing dataset, you can follow the below steps:

  1. Identify a representative sample of your existing dataset that you want to use to evaluate Hudi. You can use a tool like head or tail to extract a portion of the dataset.
  2. Create a new Hudi dataset by following the steps mentioned earlier. You can specify the size of the dataset based on the size of your representative sample.
  3. Write the representative sample of your existing dataset into the new Hudi dataset using the Hudi DataSource API.
  4. Perform the operations that you want to evaluate in Hudi on this dataset. This could be any Hudi operation like reading, writing, updating, deleting, compaction, etc.
  5. Measure the performance metrics for the operations that you have performed. This could include metrics like latency, throughput, file size, etc.
  6. Compare the performance metrics of Hudi with that of your existing dataset to evaluate the benefits and drawbacks of using Hudi.

By following these steps, you can evaluate Hudi using a portion of your existing dataset without risking any data loss or affecting the performance of your production system.

40. If file versions are kept at 1 in Hudi, is it possible to roll back to the last commit in case of a write failure?

Yes. Commits happen before cleaning, so the file versions needed for rollback are still available when a write fails. Failed commits do not cause any side effects, and Hudi guarantees snapshot isolation.

41. Does AWS Glue support Hudi?

Yes, AWS Glue supports Hudi. Glue is a fully managed ETL service provided by AWS, and it supports reading and writing data to/from Hudi datasets through its dynamic frames API. Glue can also be used to run transformations on Hudi datasets, and to create and manage Hudi tables in AWS Glue Data Catalog.

42. How to override Hudi jars in EMR?​

To override Hudi jars in Amazon EMR, you can follow these steps:

  1. Create a new S3 bucket and upload the updated Hudi jar file to it.
  2. In the EMR console, create a new cluster or select an existing one.
  3. Under the “Software configuration” tab, click “Edit” to modify the configuration.
  4. Under the “Advanced options” section, click “Add custom Jar”.
  5. Enter the S3 path of the updated Hudi jar file.
  6. Click “Add” to save the changes.
  7. Start the cluster and the updated Hudi jar file will be used instead of the default one.

Alternatively, you can also use the bootstrap action option to override the Hudi jars. You can create a bootstrap script that downloads the updated Hudi jar from an S3 bucket and replaces the default Hudi jar with it. Then, you can specify the bootstrap script while launching the EMR cluster.

43. Why are partition fields also stored in Parquet files in addition to the partition path?

Partition fields in Hudi are also stored in Parquet files in addition to the partition path for efficient filtering and querying. The partition path is the directory structure in which the data is stored, while partition fields are the actual values of the fields that are used to define the partition.

Storing partition fields in the Parquet file allows Hudi to use predicate pushdown optimization to only read the necessary partitions based on the partition fields, which can significantly reduce the amount of data read from storage. Additionally, storing partition fields in the Parquet file also allows for efficient aggregation and sorting of data within a specific partition.

Overall, storing partition fields in the Parquet file allows for efficient data querying and processing, which is crucial for high-performance big data processing.

44. How do you control the number of archived commit files generated in Hudi?

Hudi keeps only a bounded number of completed actions on the active timeline (under .hoodie/); older instants are moved into the archived timeline so that the active timeline stays small. Archival is controlled mainly by two settings: hoodie.keep.min.commits and hoodie.keep.max.commits. Once the number of completed commits on the active timeline exceeds hoodie.keep.max.commits, Hudi archives the oldest instants until only hoodie.keep.min.commits remain.

For example, with hoodie.keep.max.commits=30 and hoodie.keep.min.commits=20, archival kicks in after 30 completed commits and trims the active timeline back to 20.

A related but separate setting is hoodie.cleaner.commits.retained, which controls how many commits' worth of data file versions the cleaner keeps; the archival settings should stay larger than this value so that commits are not archived while their data is still needed. Retaining less history reduces metadata overhead, but it also limits how far back you can roll back or run incremental queries, so weigh the trade-offs before changing these values.
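As a concrete sketch, these settings can be passed as write options through the Spark datasource; the values below are illustrative rather than recommendations, and the table details are hypothetical.

// 'df' is an existing Dataset<Row>; imports and key/precombine options as in the earlier sketch.
df.write().format("hudi")
    .option("hoodie.table.name", "orders")              // hypothetical
    .option("hoodie.keep.min.commits", "20")            // lower bound of the active timeline
    .option("hoodie.keep.max.commits", "30")            // archival trigger
    .option("hoodie.cleaner.commits.retained", "10")    // data file versions kept by the cleaner
    .mode(SaveMode.Append)
    .save("s3://my-bucket/hudi/orders");                // hypothetical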

45. How do you configure Bloom filters in Hudi when using Bloom or Global Bloom index?​

The Bloom index is configured through the hoodie.index.* and hoodie.bloom.index.* write configurations. The most commonly tuned ones are:

  1. hoodie.index.type: Set this to BLOOM for the partition-scoped Bloom index, or to GLOBAL_BLOOM for the global variant that enforces key uniqueness across partitions.
  2. hoodie.index.bloom.num_entries: The number of entries each data file's Bloom filter is sized for. Size it close to the expected number of records per file; an undersized filter inflates the false positive rate.
  3. hoodie.index.bloom.fpp: The target false positive probability. A lower value produces larger Bloom filters but fewer unnecessary file reads during index lookup.
  4. hoodie.bloom.index.filter.type: SIMPLE uses a fixed-size filter based on the configured number of entries, while DYNAMIC_V0 lets the filter grow as more records are written to a file.
  5. hoodie.bloom.index.prune.by.ranges: When record keys have a natural ordering (e.g., timestamp-prefixed keys), pruning by key min/max ranges can skip many files before the Bloom filters are even consulted.
  6. hoodie.bloom.index.parallelism: Controls the parallelism of the index lookup; left at its default, Hudi estimates it from the input workload.

For the Global Bloom index the same filter settings apply; the main additional knob is hoodie.bloom.index.update.partition.path, which decides whether an update arriving with a new partition value moves the record to the new partition (delete plus insert) or updates it in place in its original partition.

These configurations can be set as options on the Spark datasource write, in the Hudi write config, or in a properties file; a minimal example follows below.
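Here is what that can look like as a sketch through the Spark datasource; the values and table details are hypothetical.

// 'df' is an existing Dataset<Row>; record key / precombine options as in the earlier sketches.
df.write().format("hudi")
    .option("hoodie.table.name", "events")               // hypothetical
    .option("hoodie.index.type", "BLOOM")                // or GLOBAL_BLOOM
    .option("hoodie.index.bloom.num_entries", "100000")  // illustrative value
    .option("hoodie.index.bloom.fpp", "0.000000001")
    .option("hoodie.bloom.index.filter.type", "DYNAMIC_V0")
    .mode(SaveMode.Append)
    .save("s3://my-bucket/hudi/events");                 // hypothetical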

46. How to tune shuffle parallelism of Hudi jobs?

Shuffle parallelism determines how many tasks Spark uses when Hudi shuffles data between executors (for example while tagging incoming records against the index and sizing files), so it has a direct impact on ingestion performance. The main knobs are:

  1. Hudi's own parallelism configs: hoodie.insert.shuffle.parallelism, hoodie.upsert.shuffle.parallelism, hoodie.bulkinsert.shuffle.parallelism and hoodie.delete.shuffle.parallelism control the shuffle parallelism of the corresponding write operations. Size them relative to the input data volume and the number of executor cores available; a short example follows below.
  2. Spark defaults: spark.default.parallelism (RDD operations) and spark.sql.shuffle.partitions (DataFrame/SQL shuffles) set the baseline number of shuffle partitions for the rest of the job.
  3. Executor memory: shuffles need enough executor memory to avoid excessive spilling; adjust spark.executor.memory (and memory overhead) accordingly.
  4. Number of executors: spark.executor.instances determines how many executors, and therefore how many concurrent tasks, are available to run the shuffle.
  5. Dynamic allocation: setting spark.dynamicAllocation.enabled to true lets Spark scale the number of executors up and down with the workload.

There is no single correct value; experiment against a representative workload to find the settings that work best for your data sizes and cluster.
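For instance, a hypothetical upsert that sets both the Hudi and Spark knobs might look like this:

// 'spark' is an existing SparkSession and 'df' the batch to upsert; other options as in earlier sketches.
spark.conf().set("spark.sql.shuffle.partitions", "200");

df.write().format("hudi")
    .option("hoodie.table.name", "events")                   // hypothetical
    .option("hoodie.datasource.write.operation", "upsert")
    .option("hoodie.upsert.shuffle.parallelism", "200")      // illustrative value
    .option("hoodie.insert.shuffle.parallelism", "200")
    .mode(SaveMode.Append)
    .save("s3://my-bucket/hudi/events");                     // hypothetical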

47. How to convert an existing COW table to MOR?​

Conceptually, all you need to do is change the hoodie.table.type property in hoodie.properties (located at hudi_table_path/.hoodie/hoodie.properties) from COPY_ON_WRITE to MERGE_ON_READ. However, editing the file by hand breaks its checksum, so the change has to go through hudi-cli:

  1. Copy the existing hoodie.properties to a new location (e.g., new_hoodie.properties).
  2. In the copy, change hoodie.table.type to MERGE_ON_READ.
  3. Launch hudi-cli and run:
    1. connect --path hudi_table_path
    2. repair overwrite-hoodie-props --new-props-file new_hoodie.properties

48. Is it possible to receive notifications when new commits happen in a Hudi table?​

Yes. Hudi supports write commit callbacks, which can notify external systems or services whenever a new commit completes on a Hudi table.

To set this up, implement a Java class that implements the org.apache.hudi.callback.HoodieWriteCommitCallback interface; this class defines how the notification is delivered (for example, posting to an HTTP endpoint or publishing to a message queue).

Then enable the callback by setting hoodie.write.commit.callback.on to true and hoodie.write.commit.callback.class to the fully qualified name of your callback class in the write configuration. Hudi also ships ready-made callbacks (for example an HTTP callback, and a Kafka callback in the utilities bundle) that can be configured the same way.

When a commit completes, the callback is invoked with information about the commit (such as the commit time, table name, and base path), which you can use to trigger downstream processing or notifications.
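A sketch of wiring this up through the Spark datasource; com.example.MyCommitCallback is a hypothetical class standing in for your own implementation of HoodieWriteCommitCallback.

// 'df' is an existing Dataset<Row>; other write options as in the earlier sketches.
df.write().format("hudi")
    .option("hoodie.table.name", "events")                  // hypothetical
    .option("hoodie.write.commit.callback.on", "true")
    // hypothetical class implementing org.apache.hudi.callback.HoodieWriteCommitCallback
    .option("hoodie.write.commit.callback.class", "com.example.MyCommitCallback")
    .mode(SaveMode.Append)
    .save("s3://my-bucket/hudi/events");                    // hypothetical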

49. How do you verify datasource schema reconciliation in Hudi?​

Hudi exposes two related write-side controls here. Schema reconciliation is governed by hoodie.datasource.write.reconcile.schema: when enabled, Hudi reconciles the incoming batch's schema with the table's latest schema (for example, carrying over columns that are missing from the incoming batch) instead of simply adopting the incoming schema as the new table schema. Schema validation is governed by hoodie.avro.schema.validate: when enabled, Hudi checks that the incoming schema is compatible with the table's existing schema and fails the write if it is not.

To verify the behavior, write a batch whose schema differs from the table schema (for example, with one column missing) with and without these options enabled, then inspect the resulting table schema and commit: with reconciliation on, the write should succeed and the table's columns should be preserved; with validation on and an incompatible schema, the write should fail with a schema compatibility error.

In general it is safer to keep validation enabled in production pipelines, since silently accepting incompatible schemas can corrupt downstream assumptions about the table.
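A sketch of passing these options on a write (option names as above, everything else hypothetical):

// 'df' is an existing Dataset<Row>; other write options as in the earlier sketches.
df.write().format("hudi")
    .option("hoodie.table.name", "events")                         // hypothetical
    .option("hoodie.datasource.write.reconcile.schema", "true")    // reconcile incoming schema with table schema
    .option("hoodie.avro.schema.validate", "true")                 // fail the write on incompatible schemas
    .mode(SaveMode.Append)
    .save("s3://my-bucket/hudi/events");                           // hypothetical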

50. Have you ever encountered two different records for the same record key value in Hudi, each record key with a different timestamp format? How is this possible?​

Yes, this can happen when the record key (or part of it) is derived from a timestamp field and different writers serialize that timestamp differently. For example, one ingestion path may generate the key as 2016-12-29 09:54:00 while another produces 2016-12-29 09:54:00.0 for the same underlying value. Because Hudi compares record keys as strings, the two writes produce two different keys and are stored as two different records, which looks like a duplicate of the same logical row.

To avoid this, make sure every writer of the table uses the same key generator and the same timestamp and partition formatting configuration, and keep that configuration stable over the table's lifetime. If the same data is ingested through multiple engines or pipelines, double-check that they format timestamp-typed key fields identically; newer Hudi releases also expose key generator options for consistent handling of logical timestamp types, which are worth enabling when keys are built from timestamp columns.

51. Explain if it’s possible to switch from one index type to another without having to rewrite the entire table?​

Yes, for most index types it is possible to switch without rewriting the table, because indexes like BLOOM and SIMPLE are not stored as separate data structures: the Bloom index relies on Bloom filters and key ranges stored in the data file footers, and the simple index relies on joining incoming keys against the keys already stored in the data files. Since the data files carry everything these indexes need, changing the index type only changes how Hudi tags incoming records during writes.

Hudi supports several index types, including BLOOM, GLOBAL_BLOOM, SIMPLE, GLOBAL_SIMPLE, HBASE and (in newer releases) BUCKET. The index type is selected through the hoodie.index.type write configuration, so switching is usually just a matter of setting a different value for subsequent writes; for example, moving from BLOOM to SIMPLE requires no table rewrite. Switching to or from index types that maintain external or layout-dependent state (such as HBASE or BUCKET) is more involved and may not be a drop-in change.

Keep in mind that the index type affects write performance characteristics (lookup cost, handling of updates across partitions, and so on), so it is recommended to test the change against a non-production copy of the workload before rolling it out.
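For example (table details hypothetical), subsequent writes simply pass the new index type:

// 'df' is an existing Dataset<Row>; other write options as in the earlier sketches.
df.write().format("hudi")
    .option("hoodie.table.name", "events")     // hypothetical
    .option("hoodie.index.type", "SIMPLE")     // previously BLOOM
    .mode(SaveMode.Append)
    .save("s3://my-bucket/hudi/events");       // hypothetical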

52. Explain how to resolve the NoSuchMethodError from HBase when using Hudi with a metadata table on HDFS?​

When using Hudi with the metadata table enabled on HDFS, it is possible to encounter a NoSuchMethodError originating from HBase classes. Hudi uses HBase's HFile format for the metadata table, so this error usually means that the HBase classes found on the classpath are a different version from the ones Hudi was built against.

To resolve the error, first make sure the Hudi bundle and the HBase jars on your classpath are compatible; the documentation and release notes for your Hudi version indicate which HBase version it expects.

If you are using compatible versions and still encountering the error, you may need to update the version of HBase in your environment. You can do this by downloading and installing the compatible version of HBase, and updating the HBase dependencies in your project to use the new version.

Another possible solution is to shade the HBase dependencies in your project to avoid conflicts with the HBase dependencies used by Hudi. You can use a build tool like Maven or Gradle to shade the HBase dependencies in your project.

It’s important to note that resolving the NoSuchMethodError from HBase may require some trial and error, and may involve updating other dependencies in your environment. It’s recommended to test any changes in a non-production environment before deploying them to production.

53. Explain how to resolve the RuntimeException saying “hbase-default.xml file seems to be for an older version of HBase?”​

This RuntimeException comes from HBase's configuration loading: the hbase-default.xml found on the classpath was generated for a different HBase version than the HBase jars actually in use, which typically happens when multiple HBase versions (or a copy bundled inside another jar) end up on the same classpath.

To resolve this issue, you can try one of the following solutions:

  1. Clean up the classpath: find where the stale hbase-default.xml is coming from (it is normally bundled inside the hbase-common jar) and remove the conflicting HBase jar or resource so that only one HBase version remains on the classpath.
  2. Align dependency versions: if you build your own application, pin the HBase dependencies in Maven or Gradle to the version expected by your Hudi release, or shade/relocate them to avoid clashes with cluster-provided jars.
  3. Skip the version check: as a workaround, set hbase.defaults.for.version.skip=true in hbase-site.xml (or on the Hadoop Configuration) so that HBase does not fail on the version mismatch. Only do this once you have confirmed the mismatch is benign.

The right fix depends on your environment and on how the conflicting jars got onto the classpath; test any change in a non-production environment before deploying it to production.
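If you go with the version-check workaround, the property can also be set programmatically on the Hadoop configuration handed to the job. This is only a sketch; whether it takes effect depends on how the HBase configuration is constructed in your particular setup.

// 'spark' is an existing SparkSession.
spark.sparkContext().hadoopConfiguration()
    .set("hbase.defaults.for.version.skip", "true");  // skip HBase's hbase-default.xml version check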

54. Explain how to find the average record size in a commit?​

The write statistics stored in each commit's metadata already contain the total bytes written and the total number of records written, so the average record size for a commit is simply the total bytes written divided by the total records written.

The quickest way to see these numbers is hudi-cli: connect to the table and run commits show, which lists per-commit statistics including the total bytes and records written.

Programmatically, the same information can be read from the commit metadata. Below is a sketch of one way to do this in Java; the base path is a placeholder, and exact class and method names can vary slightly between Hudi versions.

import org.apache.hadoop.conf.Configuration;
import org.apache.hudi.common.model.HoodieCommitMetadata;
import org.apache.hudi.common.table.HoodieTableMetaClient;
import org.apache.hudi.common.table.timeline.HoodieInstant;
import org.apache.hudi.common.table.timeline.HoodieTimeline;

// Build a meta client for the table (the path is a placeholder).
HoodieTableMetaClient metaClient = HoodieTableMetaClient.builder()
    .setConf(new Configuration())
    .setBasePath("/path/to/table")
    .build();

// Pick a completed commit from the timeline (here, the latest one).
HoodieTimeline timeline = metaClient.getActiveTimeline().getCommitsTimeline().filterCompletedInstants();
HoodieInstant commitInstant = timeline.lastInstant().get();

// Deserialize the commit metadata and read the aggregate write statistics.
HoodieCommitMetadata commitMetadata = HoodieCommitMetadata.fromBytes(
    timeline.getInstantDetails(commitInstant).get(), HoodieCommitMetadata.class);
long totalBytes = commitMetadata.fetchTotalBytesWritten();
long totalRecords = commitMetadata.fetchTotalRecordsWritten();
long avgRecordSize = totalRecords == 0 ? 0 : totalBytes / totalRecords;

These statistics aggregate across all partitions and files touched by the commit; if you need per-partition averages, iterate over the partition-level write stats inside the commit metadata instead.

55. Explain how to resolve the IllegalArgumentException saying “Partitions must be in the same table” when attempting to sync to a metastore?​

This IllegalArgumentException is thrown during metastore sync when the partitions being registered do not belong to the metastore table that is being updated. In practice it usually means that two different Hudi tables (or two different base paths) are being synced into the same metastore table, or that the sync configuration does not match the table the partitions were written to.

To resolve this issue, you can try the following:

  1. Check the sync target: verify that hoodie.datasource.hive_sync.database and hoodie.datasource.hive_sync.table point to the metastore table that actually corresponds to this Hudi table's base path, and that no other pipeline syncs a different base path into the same table name.
  2. Check your partitioning scheme: verify that the partition fields and the partition extractor used for sync are consistent with how the data is actually laid out; an inconsistent scheme can produce partitions the metastore cannot attach to the table.
  3. Keep table names unique: if multiple Hudi tables must be synced, give each one its own metastore table (or database) rather than reusing the same name.

It’s important to note that the specific solution to this issue will depend on your specific use case and configuration. It’s recommended to test any changes in a non-production environment before deploying them to production.

56. Explain how to reduce table versions created by Hudi in AWS Glue Data Catalog/metastore?​

Every time Hudi's meta sync updates a table in the AWS Glue Data Catalog (for example to register new partitions or a changed schema), Glue records a new table version. With frequent commits this quickly adds up, cluttering the catalog and potentially hitting Glue's table-version limits. A few ways to keep the number of versions down:

  1. Sync only when something actually changed: set hoodie.datasource.meta_sync.condition.sync to true so that Hudi updates the catalog only when the schema or the set of partitions has changed, rather than on every commit.
  2. Sync less frequently: if the pipeline commits very often (for example, streaming micro-batches), consider decoupling catalog sync from ingestion and running it as a separate periodic job instead of on every write.
  3. Clean up old versions on the Glue side: AWS Glue exposes APIs such as BatchDeleteTableVersion that a scheduled maintenance job can use to delete table versions you no longer need.
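A sketch of the first option through the Spark datasource; the database and table names are hypothetical, and the remaining hive sync options are the usual ones.

// 'df' is an existing Dataset<Row>; other write options as in the earlier sketches.
df.write().format("hudi")
    .option("hoodie.table.name", "orders")                           // hypothetical
    .option("hoodie.datasource.hive_sync.enable", "true")
    .option("hoodie.datasource.hive_sync.database", "analytics")     // hypothetical
    .option("hoodie.datasource.hive_sync.table", "orders")           // hypothetical
    // update the Glue Data Catalog only when the schema or partitions change
    .option("hoodie.datasource.meta_sync.condition.sync", "true")
    .mode(SaveMode.Append)
    .save("s3://my-bucket/hudi/orders");                             // hypothetical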

Note that deleting old Glue table versions only trims the catalog's metadata history; it does not affect the Hudi timeline or your ability to roll back data. As always, test any changes in a non-production environment before deploying them to production.

57. Can you describe whether it’s possible to modify the key generator for a table that already exists?​

No. There is a small set of table properties that cannot be changed once chosen, and the key generator (hoodie.datasource.write.keygenerator.class) is one of them: it determines how record keys and partition paths are derived, so changing it on an existing table would make new keys inconsistent with existing ones. Hudi validates these properties against what is recorded for the table and rejects writes that try to change them, so the practical way to change the key generator is to rewrite the data into a new table.
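The key generator is therefore something to fix at table creation time. For example (the class shown is one of Hudi's built-in key generators; the other values are hypothetical):

// 'df' is an existing Dataset<Row>; this is the initial write that creates the table.
df.write().format("hudi")
    .option("hoodie.table.name", "orders")                            // hypothetical
    .option("hoodie.datasource.write.recordkey.field", "order_id")    // hypothetical
    .option("hoodie.datasource.write.partitionpath.field", "country") // hypothetical
    .option("hoodie.datasource.write.keygenerator.class",
            "org.apache.hudi.keygen.SimpleKeyGenerator")              // cannot be changed later
    .mode(SaveMode.Overwrite)
    .save("s3://my-bucket/hudi/orders");                              // hypothetical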
