Apache Cassandra Interview Questions
1) Explain Cassandra.
Apache Cassandra is a distributed NoSQL database management system designed to handle large amounts of data across many commodity servers, providing high availability with no single point of failure. Cassandra offers robust support for cluster deployment and data distribution, with a wide-column data model in which rows are grouped into partitions and located by key.
Cassandra is highly scalable and can handle a high volume of read and write operations, making it suitable for use cases such as real-time event processing, Internet of Things (IoT) data management, and high-scale data ingest. It is also known for its robust support for data replication and fault tolerance, allowing it to remain available even in the event of hardware or network failures.
Cassandra is designed to be easy to set up and operate, with a simple and intuitive data model that allows developers to quickly start working with it. It also offers advanced features such as tunable consistency, data compression, and support for a wide range of programming languages and developer tools.
2) In which language is Cassandra written?
Apache Cassandra is written in Java and runs on the Java Virtual Machine (JVM). Clients communicate with it over the CQL native binary protocol. Older releases also exposed an Apache Thrift RPC interface, which was deprecated and then removed in Cassandra 4.0.
Drivers are available for a number of languages, including Java, Python, Ruby, C++, C#, and Node.js, which allow developers to interact with Cassandra from their preferred language. This makes it easy to build applications that use Cassandra as a data store, regardless of the language they are written in.
3) Who was the original author of Cassandra?
Apache Cassandra was originally developed at Facebook by a team led by Avinash Lakshman and Prashant Malik. The first version of Cassandra was released as open source software in July 2008.
The project entered the Apache Incubator in March 2009 and graduated to a top-level project of the Apache Software Foundation (ASF) in February 2010. The project has a large and active developer and user community, with contributions from individuals and organizations around the world. The current version of Cassandra is maintained and developed by the Apache Cassandra Project Team at the ASF.
4) Which query language is used in the Cassandra database?
Apache Cassandra uses the Cassandra Query Language (CQL) as its primary interface for interacting with the database. CQL is not a dialect of SQL, but its syntax is deliberately SQL-like so that it feels familiar to developers who have experience with SQL. Unlike SQL, it has no joins, and queries are expected to select rows by primary key.
CQL is used to create, alter, and drop keyspaces, tables, and other database objects, as well as to insert, update, and delete data in tables. It includes statements such as SELECT, UPDATE, DELETE, and INSERT, as well as a range of built-in functions for working with data.
In addition to the cqlsh shell, CQL statements can be sent to Cassandra through drivers and libraries for Java and a number of other programming languages. These interfaces allow developers to interact with Cassandra using their preferred programming language and take advantage of Cassandra’s distributed, scalable, and fault-tolerant design.
5) What are the benefits/advantages of Cassandra?
- Cassandra delivers real-time performance, which simplifies the work of developers, administrators, data analysts, and software engineers.
- It scales horizontally: nodes can be added to or removed from a cluster as requirements change.
- Data can be replicated to several nodes for fault tolerance.
- Because it is a distributed system in which all nodes are peers, there is no single point of failure.
- Every node in a cluster holds a portion of the data and can accept any request, acting as coordinator for data it does not store itself.
6) Where does Cassandra store its data?
In Apache Cassandra, data is stored in a distributed fashion across a cluster of nodes. Each node in the cluster is responsible for storing a portion of the data, and each stored copy of a piece of data is called a “replica.” The data is automatically partitioned and distributed across the nodes in the cluster based on a hash of the partition key (the first part of the primary key) of the data being stored.
In Cassandra, data is organized into “keyspaces,” which are logical containers for tables. Each keyspace has a “replication factor,” which determines how many copies of each piece of data are stored across the cluster, so the data in a keyspace is replicated across multiple nodes for fault tolerance.
Cassandra stores data on disk in a data structure called an “SSTable” (Sorted String Table). Each SSTable is immutable and holds key-value pairs sorted by key, along with index and summary information to allow for efficient querying. Recent writes are held in memory in memtables, and Cassandra also maintains in-memory caches of recently accessed data to improve read performance.
Overall, Cassandra’s distributed data storage model allows it to scale horizontally and handle large amounts of data with high availability and no single point of failure.
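The partitioning and replica placement described above can be sketched in a few lines of Python. This is an illustration only, under stated assumptions: the `TokenRing` class, the node names, and the evenly spaced tokens are hypothetical, MD5 stands in for Cassandra's real partitioner, and replicas are simply the next nodes clockwise on the ring, ignoring racks and data centers.

```python
import bisect
import hashlib


def token_for(partition_key):
    """Hash a partition key to a 64-bit token (MD5 is a stand-in here)."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


class TokenRing:
    """Hypothetical ring: each node owns the tokens up to and including its own."""

    def __init__(self, node_tokens):
        # Keep (token, node) pairs sorted so the ring can be searched with bisect.
        self._ring = sorted((token, node) for node, token in node_tokens.items())
        self._tokens = [token for token, _ in self._ring]

    def replicas_for(self, partition_key, replication_factor):
        """Return the node that owns the key's token plus the next nodes clockwise."""
        start = bisect.bisect_left(self._tokens, token_for(partition_key))
        count = min(replication_factor, len(self._ring))
        # The modulo wraps around the end of the ring back to the first node.
        return [self._ring[(start + i) % len(self._ring)][1] for i in range(count)]


# Four illustrative nodes with evenly spaced tokens in a 64-bit token space.
ring = TokenRing({"node-a": 0, "node-b": 2**62, "node-c": 2**63, "node-d": 3 * 2**62})
replicas = ring.replicas_for("user:42", replication_factor=3)
print(replicas)
```

The same key always hashes to the same token, so every node can work out where a row lives without asking a central directory.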
7) What was the design goal of Cassandra?
The design goals of Apache Cassandra include providing high scalability, high availability, and high performance for applications that need to handle large amounts of data. To achieve these goals, Cassandra was designed as a distributed database management system that can scale horizontally across a large number of commodity servers.
Cassandra was also designed to be fault-tolerant, with built-in data replication and no master node whose loss would stop the cluster. When a node or its hardware fails, requests are served by the remaining replicas, so Cassandra stays available. How consistent the returned data is during such failures depends on the consistency level chosen for each request.
Additionally, Cassandra was designed to be easy to set up and operate, with a simple and intuitive data model that allows developers to quickly start working with it. It also offers advanced features such as tunable consistency, data compression, and support for a wide range of programming languages and developer tools.
8) How many types of NoSQL databases are there? Give some examples.
There are mainly 4 types of NoSQL databases:
- Document stores (MongoDB and CouchDB)
- Key-value stores (Redis and Voldemort)
- Column stores (Cassandra)
- Graph stores (Neo4j and Giraph)
9) What are some important components of the Cassandra data model?
These are some key components of the Cassandra data model:
- Table (a collection of columns)
10) What are the other components of Cassandra?
Some other components of Cassandra are:
- Data Center
- Commit log
- Bloom Filter
11) What is keyspace in Cassandra?
In Apache Cassandra, a keyspace is a logical container for tables and other database objects. It is used to group related tables and other objects together, and is similar in concept to a schema in a traditional relational database management system (RDBMS).
Each keyspace in Cassandra is stored on one or more nodes in the cluster, and is replicated across multiple nodes for fault tolerance. The keyspace also has a “replication factor” which determines how many copies of the data should be stored across the cluster. This allows Cassandra to provide high availability and fault tolerance for the data stored in the keyspace.
Keyspaces are defined using the Cassandra Query Language (CQL), and can be created, altered, and dropped using CQL statements. Keyspace-level options include the replication strategy, the replication factor, and whether durable writes are enabled. The consistency level is not a keyspace option; the client chooses it for each request.
Overall, keyspaces are an important concept in Cassandra and are used to organize and manage data within the database.
12) What are the different composite keys in Cassandra?
In Apache Cassandra, the primary key uniquely identifies each row of data in a table. When the primary key is made up of two or more columns rather than a single column, it is called a composite (or compound) key.
A composite primary key has two kinds of components:
- Clustering keys: Clustering keys determine the order in which data is stored within a partition. They are the columns of the primary key that follow the partition key, and rows within a partition are sorted by the values of the clustering columns.
- Partition keys: Partition keys are used to distribute data across the nodes in a Cassandra cluster. They determine which nodes a particular row of data is stored on, and ensure that rows with the same partition key are stored together. A partition key can itself consist of several columns, in which case it is called a composite partition key.
Together, the partition key and clustering key form the composite primary key of a table in Cassandra. They are used to uniquely identify each row of data in the table and to provide fast access to data based on the values of these key columns.
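As a rough sketch of how the two parts of the key cooperate, the Python below groups rows by a partition key and sorts each group by a clustering key. The sensor table, its column names, and the `build_partitions` helper are all hypothetical; the example only models the grouping and ordering, not how Cassandra lays the data out on disk.

```python
from collections import defaultdict

# Hypothetical rows for a table whose primary key is ((sensor_id), reading_time):
# sensor_id is the partition key and reading_time is the clustering key.
rows = [
    {"sensor_id": "s1", "reading_time": 3, "value": 20.5},
    {"sensor_id": "s2", "reading_time": 1, "value": 18.0},
    {"sensor_id": "s1", "reading_time": 1, "value": 19.9},
    {"sensor_id": "s1", "reading_time": 2, "value": 20.1},
]


def build_partitions(rows, partition_key, clustering_key):
    """Group rows by partition key, then sort each group by clustering key."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[row[partition_key]].append(row)
    for partition_rows in partitions.values():
        partition_rows.sort(key=lambda r: r[clustering_key])
    return dict(partitions)


partitions = build_partitions(rows, "sensor_id", "reading_time")
for sensor_id, partition_rows in partitions.items():
    print(sensor_id, [r["reading_time"] for r in partition_rows])
```

Because all readings for one sensor share a partition and are already ordered by time, a query for "the readings of sensor s1 in time order" touches a single partition and needs no extra sorting.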
13) What is data replication in Cassandra?
Data replication is the copying of data from a database on one computer or server to a database on another, so that the data remains available if one copy is lost. Cassandra stores replicas on multiple nodes to ensure reliability and fault tolerance. The replication strategy decides which nodes hold the replicas, and the replication factor decides how many replicas there are.
14) What is a node in Cassandra?
In Apache Cassandra, a node is a single instance of the Cassandra software that runs on a physical or virtual machine. Nodes are used to store data and participate in the distributed database cluster.
Each node in a Cassandra cluster is responsible for storing a portion of the data, and each stored copy of a piece of data is called a “replica.” The data is automatically partitioned and distributed across the nodes in the cluster based on a hash of the partition key of the data being stored. This allows Cassandra to scale horizontally and handle large amounts of data with high availability and no single point of failure.
Nodes in a Cassandra cluster communicate with each other using a peer-to-peer architecture, and can dynamically join or leave the cluster as needed. This allows Cassandra to be highly resilient to node failures and other disruptions, as the remaining nodes can continue to operate and serve data to clients even if one or more nodes are unavailable.
Overall, nodes are an important component of a Cassandra cluster and are responsible for storing and serving data to clients.
15) What do you mean by data center in Cassandra?
In Apache Cassandra, a data center (DC) is a logical grouping of nodes that are located in the same physical location. Data centers are used to organize nodes within a Cassandra cluster, and can be used to control data placement and replication within the cluster.
Each data center in a Cassandra cluster has a name, and nodes within a data center are assigned to the same data center name. Data centers can be used to control the replication of data within the cluster, as replicas of data can be placed in different data centers to ensure high availability and fault tolerance.
Data centers are also used to define the “local” and “remote” nodes within a cluster. Local nodes are nodes that belong to the same data center as the client making the request, while remote nodes are nodes that belong to a different data center. This allows Cassandra to optimize data access by preferring local nodes when possible.
Overall, data centers are a useful concept in Cassandra and are used to organize and manage nodes within a cluster. They can be used to control data placement and replication, as well as to optimize data access for clients.
16) What do you mean by commit log in Cassandra?
In Apache Cassandra, the commit log is a write-ahead log that is used to ensure data durability and consistency. It is a sequential log of all write operations that are performed on the database, and is used to recover data in the event of a failure or outage.
When a write operation is performed on a Cassandra node, it is first written to the commit log on disk. This ensures that the data is durable and can be recovered in the event of a failure. The data is then written to an in-memory data structure called the memtable, which is used to store recently written data in memory for fast access.
Periodically, the data in the memtable is flushed to disk and stored in an SSTable (Sorted String Table), an immutable on-disk structure that holds key-value pairs sorted by key, along with index and summary information to allow for efficient querying. Once all the data covered by a commit log segment has been flushed, that segment is no longer needed and can be discarded or recycled.
Overall, the commit log is an important component of Cassandra’s data storage architecture, and is used to ensure the durability of writes. Because each write is appended to the log before it is acknowledged, writes that were only in memory at the time of a failure can be replayed from the log on restart.
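The write path described above can be modeled with a toy Python class. This is a simplified sketch, not Cassandra's implementation: the `Node` class, its `memtable_limit` parameter, and the use of in-memory lists to stand in for files on disk are all assumptions made for illustration.

```python
class Node:
    """Toy model of one node's write path: commit log, memtable, SSTables."""

    def __init__(self, memtable_limit=3):
        self.commit_log = []   # stands in for the append-only log on disk
        self.memtable = {}     # recent writes held in memory, keyed by row key
        self.sstables = []     # immutable, sorted snapshots standing in for disk files
        self.memtable_limit = memtable_limit

    def write(self, key, value):
        self.commit_log.append((key, value))  # 1. append to the log first
        self.memtable[key] = value            # 2. then update the memtable
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        """Persist the memtable as a sorted, immutable SSTable."""
        self.sstables.append(tuple(sorted(self.memtable.items())))
        self.memtable.clear()
        self.commit_log.clear()  # flushed writes no longer need the log

    def recover(self):
        """Rebuild the memtable after a crash by replaying the commit log."""
        self.memtable = dict(self.commit_log)


node = Node(memtable_limit=2)
node.write("k1", "a")
node.write("k2", "b")   # reaches the limit, so the memtable is flushed
node.write("k3", "c")   # only in the commit log and memtable so far

node.memtable = {}      # simulate a crash that wipes memory
node.recover()
print(node.memtable)    # {'k3': 'c'}
print(node.sstables)    # [(('k1', 'a'), ('k2', 'b'))]
```

The unflushed write to `k3` survives the simulated crash only because it was appended to the log before the memtable was updated.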
17) What do you mean by column family in Cassandra?
In Apache Cassandra, a column family is a data structure that is used to store rows of data. The term comes from Cassandra’s original Thrift-based data model; in CQL the same structure is called a table, and it is similar in concept to a table in a traditional relational database management system (RDBMS).
A column family in Cassandra consists of a set of rows, each of which is identified by a unique primary key. Each row can have one or more columns, which store the actual data values for the row. Columns can be added, modified, or deleted from a row as needed, and rows can be added, modified, or deleted from the column family.
Column families are defined using the Cassandra Query Language (CQL), and can be created, altered, and dropped using CQL statements such as CREATE TABLE, ALTER TABLE, and DROP TABLE. They are stored within a keyspace, which is a logical container for tables and other database objects in Cassandra.
Overall, column families are an important concept in Cassandra and are used to store and organize data within the database. They are similar to tables in a traditional RDBMS and are used to store rows of data with unique primary keys and one or more columns.
18) What do you mean by consistency in Cassandra?
In Apache Cassandra, consistency refers to the level of agreement between replicas of data within a cluster. Consistency is an important concept in Cassandra, as the database is designed to scale horizontally across a large number of nodes and data is automatically replicated across the cluster for fault tolerance.
Cassandra offers a range of consistency levels that can be set per request, for both reads and writes, to trade off between consistency and performance. The higher the consistency level, the more replicas must respond before an operation is considered successful, which can result in higher latency and lower availability when nodes are down.
The available consistency levels in Cassandra include the following. They are described here for writes; for reads, the same levels (except Any) set how many replicas must respond before a result is returned:
- Any: A write operation is considered successful once at least one node has accepted it, even if no replica for that data is reachable and the coordinator has only stored a hint for later delivery. This provides the lowest level of consistency, but also the highest availability. It applies to writes only.
- One: A write operation is considered successful once it has been written to the commit log and memtable of at least one replica.
- Two: A write operation is considered successful once it has been written to the commit log and memtable of at least two replicas.
- Three: A write operation is considered successful once it has been written to the commit log and memtable of at least three replicas.
- Quorum: A write operation is considered successful once it has been written to the commit log and memtable of a quorum of replicas, where a quorum is floor(N/2) + 1 and N is the replication factor. For N = 3, a quorum is 2 replicas.
- All: A write operation is considered successful once it has been written to the commit log and memtable of all replicas. This provides the highest level of consistency, but also the lowest availability, because a single unreachable replica makes the operation fail.
Overall, consistency is an important concept in Cassandra and is used to control the level of agreement between replicas of data within the cluster. The consistency level can be configured to trade off between consistency and performance, depending on the needs of the application.
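To make the replica counts concrete, here is a small Python sketch that returns how many replica acknowledgements each level requires for a given replication factor. The function name `required_acks` is hypothetical, and Any is left out on purpose because it can be satisfied without any replica acknowledging the write.

```python
def required_acks(level, replication_factor):
    """Return how many replicas must acknowledge an operation at this level."""
    fixed = {"ONE": 1, "TWO": 2, "THREE": 3}
    level = level.upper()
    if level in fixed:
        return fixed[level]
    if level == "QUORUM":
        # Integer division implements floor(N / 2) + 1.
        return replication_factor // 2 + 1
    if level == "ALL":
        return replication_factor
    raise ValueError(f"unsupported consistency level: {level}")


for rf in (3, 4, 5):
    print(rf, required_acks("QUORUM", rf), required_acks("ALL", rf))
# 3 2 3
# 4 3 4
# 5 3 5
```

Note that a replication factor of 4 needs the same quorum size as a replication factor of 5, which is one reason odd replication factors are common.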
19) How many types of tunable consistency are supported in Cassandra?
It supports two kinds of consistency: eventual consistency and strong consistency.
With eventual consistency, if no new updates are made to a given data item, all accesses eventually return the last updated value. A system that has reached this state is said to have achieved replica convergence.
Cassandra provides strong consistency when the following condition holds:
R + W > N
N: Number of replicas
W: Number of nodes that need to agree for a successful write
R: Number of nodes that need to agree for a successful read
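The reason the condition R + W > N gives strong consistency is that any set of R replicas read must then share at least one replica with any set of W replicas written. The Python sketch below checks this by brute force for small clusters; the function name `read_always_sees_write` is made up for the example.

```python
from itertools import combinations


def read_always_sees_write(n, w, r):
    """True if every possible read set of size r overlaps every write set of size w."""
    replicas = range(n)
    return all(
        set(write_set) & set(read_set)
        for write_set in combinations(replicas, w)
        for read_set in combinations(replicas, r)
    )


# With N = 3: quorum writes and quorum reads (W = 2, R = 2) always overlap,
# while W = 1 and R = 1 can miss each other.
print(read_always_sees_write(3, 2, 2))  # True
print(read_always_sees_write(3, 1, 1))  # False
```

With N = 3, writing and reading at Quorum gives R + W = 4 > 3, so every read contacts at least one replica that holds the latest write.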
20) What is tunable consistency in Cassandra?
Tunable consistency is a characteristic of Cassandra that makes it a popular choice. Consistency refers to how up to date and synchronized a row of data is across all of its replicas. With tunable consistency, users can select, per operation, the consistency level best suited to their use case.
21) What is the syntax to create keyspace in Cassandra?
- CREATE KEYSPACE <identifier> WITH <properties>
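As an illustration of the syntax above with the placeholders filled in, the Python helper below builds one concrete statement. The helper name `create_keyspace_cql`, the keyspace name `demo`, and the choice of SimpleStrategy with a replication factor of 3 are assumptions made for the example; the function only produces the statement text and does not connect to a cluster.

```python
def create_keyspace_cql(name, replication_factor, durable_writes=True):
    """Build a CREATE KEYSPACE statement that uses SimpleStrategy replication."""
    if not name.isidentifier():
        raise ValueError(f"not a simple identifier: {name!r}")
    return (
        f"CREATE KEYSPACE IF NOT EXISTS {name} "
        f"WITH replication = {{'class': 'SimpleStrategy', "
        f"'replication_factor': {int(replication_factor)}}} "
        f"AND durable_writes = {str(durable_writes).lower()};"
    )


print(create_keyspace_cql("demo", 3))
# CREATE KEYSPACE IF NOT EXISTS demo WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3} AND durable_writes = true;
```

The printed statement can be pasted into cqlsh. For clusters that span several data centers, NetworkTopologyStrategy is normally used in place of SimpleStrategy.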
22) What is a column family in Cassandra?
In Cassandra, a collection of rows is referred to as a “column family”.
23) How does Cassandra perform a write?
Cassandra performs a write in two steps, followed by a later flush:
- The write is first appended to the commit log on disk and then applied to an in-memory structure known as the memtable.
- When both steps have succeeded, the write is acknowledged.
- When the memtable is flushed, its contents are written to disk as an SSTable (sorted string table).
24) What is memtable?
A memtable is an in-memory, write-back cache that holds content in key and column format. In a memtable, data is sorted by key, and each table (column family) has its own memtable, from which column data is retrieved by key. It accumulates writes until it is full and is then flushed to disk.
25) What is SSTable?
SSTable is short for ‘Sorted String Table’. It is the main on-disk data file format in Cassandra, and memtables are regularly flushed into new SSTables. SSTables are stored on disk and exist for each Cassandra table.
26) How is an SSTable different from a relational table?
SSTables are immutable: no data items can be added or removed once an SSTable has been written. Updates and deletes are written to newer SSTables and merged later by compaction. For each SSTable, Cassandra also creates supporting structures such as a partition index, a partition summary, and a bloom filter.
27) What are the management tools in Cassandra?
DataStax OpsCenter: It is a web-based management and monitoring solution for Cassandra clusters and DataStax. It is free to download, and an additional edition of OpsCenter is also available.
SPM: SPM primarily monitors Cassandra metrics along with various OS and JVM metrics. Besides Cassandra, it also monitors Hadoop, Spark, Solr, Storm, ZooKeeper, and other big data platforms.
28) Mention some important features of SPM in Cassandra?
The main features of SPM are:
- Correlation of events and metrics
- Distributed transaction tracing
- Creating real-time graphs with zooming
- Detection and heartbeat alerting
29) What is cluster in Cassandra?
In Cassandra, the cluster is the outermost container for keyspaces. It arranges the nodes in a ring and assigns data to them. The data on each node is replicated to other nodes, which take over if that node fails.
30) What is the role of ALTER KEYSPACE?
ALTER KEYSPACE is used to change the properties of an existing keyspace, such as its replication settings and the value of DURABLE_WRITES.
31) What do you mean by Cassandra-Cqlsh?
Cqlsh is a Cassandra query language shell used to execute the commands of CQL (Cassandra query language).
32) What are the differences between a node, a cluster, and datacenter in Cassandra?
Node: A node is a single machine running Cassandra.
Cluster: A cluster is a collection of nodes that together store and serve the same set of keyspaces.
Datacenter: A datacenter is a useful component when serving customers in different geographical areas. Different nodes of a cluster can be grouped into different data centers.
33) What is the use of Cassandra CQL collection?
A CQL collection is used to store multiple values in a single column, where every element of a collection has the same data type. CQL provides three collection types:
- SET: A collection of unique elements, returned in sorted order.
- LIST: A collection of elements kept in a defined order, which can contain duplicate values.
- MAP: A collection of key-value pairs in which each key is unique.
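The three collection types behave much like Python's built-in `set`, `list`, and `dict`, which makes for a quick way to remember the differences. The sketch below is an analogy only and the values are made up; it says nothing about how Cassandra stores collections internally.

```python
# SET: duplicates are ignored.
emails = {"a@example.com", "b@example.com"}
emails.add("a@example.com")          # already present, so nothing changes

# LIST: order is kept and duplicates are allowed.
scores = [10, 20]
scores.append(10)                    # a second 10 is stored as well

# MAP: each key appears once; writing to an existing key replaces its value.
phones = {"home": "555-0100"}
phones["home"] = "555-0199"
phones["work"] = "555-0142"

print(sorted(emails))   # ['a@example.com', 'b@example.com']
print(scores)           # [10, 20, 10]
print(phones)           # {'home': '555-0199', 'work': '555-0142'}
```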
34) What is the use of Bloom Filter in Cassandra?
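When data is requested, the Bloom filter is consulted before any disk I/O to check whether the requested partition might exist in an SSTable. A negative answer is definite, so that SSTable can be skipped. A positive answer can be a false positive, so the SSTable still has to be read to find out.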
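A minimal Bloom filter can be sketched in Python as follows. This is a teaching example under assumptions, not Cassandra's implementation: the `BloomFilter` class, the bit-array size, the number of hash functions, and the use of seeded SHA-256 hashes are all choices made for illustration.

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter backed by a Python int used as a bit array."""

    def __init__(self, size_bits=1024, num_hashes=3):
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self.bits = 0

    def _positions(self, key):
        # Derive several bit positions per key by hashing it with different seeds.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        """False means definitely absent; True means possibly present."""
        return all((self.bits >> pos) & 1 for pos in self._positions(key))


sstable_filter = BloomFilter()
for partition_key in ("user:1", "user:2", "user:3"):
    sstable_filter.add(partition_key)

print(sstable_filter.might_contain("user:2"))  # True
```

A key that was added is always reported as possibly present. A key that was never added is usually reported as absent, which lets the read skip the SSTable, but it can occasionally produce a false positive.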
35) How does Cassandra delete data?
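In Cassandra, data is not removed immediately when it is deleted. Instead, a special marker called a tombstone is written for the deleted row or column value. Reads treat data covered by a tombstone as deleted, and the tombstone and the data it covers are physically removed later, during compaction, after a grace period has passed.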
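The idea of deleting by writing a marker can be sketched in Python. The `Table` class and the explicit integer timestamps below are hypothetical, and the sketch keeps only the newest version of each key, whereas Cassandra keeps versions in separate SSTables until compaction merges them.

```python
TOMBSTONE = object()  # special marker meaning "this key was deleted"


class Table:
    """Toy last-write-wins store in which deletes are recorded as tombstones."""

    def __init__(self):
        self.cells = {}  # key -> (timestamp, value or TOMBSTONE)

    def _apply(self, key, timestamp, value):
        current = self.cells.get(key)
        # Keep whichever version carries the newer timestamp.
        if current is None or timestamp >= current[0]:
            self.cells[key] = (timestamp, value)

    def write(self, key, value, timestamp):
        self._apply(key, timestamp, value)

    def delete(self, key, timestamp):
        self._apply(key, timestamp, TOMBSTONE)

    def read(self, key):
        current = self.cells.get(key)
        if current is None or current[1] is TOMBSTONE:
            return None
        return current[1]


table = Table()
table.write("user:1", "alice", timestamp=1)
table.delete("user:1", timestamp=3)
table.write("user:1", "stale", timestamp=2)  # an older write that arrives late

print(table.read("user:1"))  # None
```

Because the tombstone carries a timestamp, the late-arriving older write cannot bring the deleted value back, which is why a marker is needed rather than simply erasing the data.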
36) What is SuperColumn in Cassandra?
In Cassandra, a SuperColumn is a special column that groups a collection of related columns. It is a key-value pair whose value is a map of columns. SuperColumns belong to the legacy Thrift data model and have no direct equivalent in CQL.
37) What is the difference between Column and SuperColumn?
Difference between Column and SuperColumn:
- The value of a Column is a string, while the value of a SuperColumn is a map of Columns, which can have different data types.
- Unlike Columns, SuperColumns do not have the third component, a timestamp.
38) What is Hadoop, HBase, Hive and Cassandra? Specify similarities and differences among them.
Hadoop, HBase, Hive and Cassandra are all Apache products.
Apache Hadoop provides file storage and grid compute processing via MapReduce. Apache Hive is a SQL-like interface on top of Hadoop. Apache HBase uses column family storage modeled on Bigtable. Apache Cassandra also uses column family storage modeled on Bigtable, combined with Dynamo-style topology and consistency.
39) What is the usage of “void close()” method?
In Cassandra, the void close() method is used to close the current session instance.
40) Which command is used to start the cqlsh prompt?
The cqlsh command is used to start the cqlsh prompt.
41) What is the usage of the “cqlsh --version” command?
The “cqlsh --version” command prints the version of cqlsh you are using.
42) Does Cassandra work on Windows?
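Yes, with a caveat. Cassandra releases up to the 3.x line shipped Windows startup scripts alongside the Linux ones and run on both platforms. Those Windows scripts were removed in Cassandra 4.0, so current releases target Linux and other Unix-like systems.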
43) What is Kundera in Cassandra?
In Cassandra, Kundera is an object-mapping library that implements the Java Persistence API (JPA), so entities are mapped to Cassandra tables using Java annotations.
44) What do you mean by Thrift in Cassandra?
Thrift is the RPC framework that legacy clients used to communicate with the Cassandra server. It has been superseded by the CQL native protocol.
45) What is Hector in Cassandra?
Hector was one of the early Cassandra clients. It is an open source project written in Java using the MIT license.
46) What is Cassandra?
Cassandra is defined as an open-source NoSQL data storage system that leverages a distributed architecture to enable high availability, scalability, and reliability, managed by the Apache non-profit organization.
47) Why is JConsole used? What are its different elements?
JConsole is used to monitor and analyze server activity. Once you’ve connected to a server, the default view shows four major categories about your server’s state, which are updated constantly: heap memory usage, threads, classes, and CPU usage.
48) Explain the Nodetool utility.
The Nodetool Utility is a command-line utility that comes out of the box with Cassandra and is a great tool for administration and monitoring. It communicates with JMX to perform operational and monitoring tasks exposed by MBeans.
49) What are Roles in CQLSH?
Roles enable authorization management on a larger scale than per-user security can provide. A role is created and may be granted to other roles, which makes it possible to build hierarchical sets of permissions.
50) What is the Python stress test in Cassandra?
Early Cassandra releases came with a Python utility called py_stress for running a stress test on a Cassandra cluster. It has since been replaced by cassandra-stress, a Java-based stress testing utility for basic benchmarking and load testing of a Cassandra cluster. It is an effective tool for populating a cluster and stress testing CQL tables and queries.
So, I hope these Cassandra Interview Questions helped you to brush up your knowledge of Apache Cassandra.