MongoDB Vs HBase Vs Cassandra
Unlike traditional SQL databases, NoSQL (“non-SQL”) databases do not store their data in tabular relations. Originally designed for web-scale workloads, they have found widespread use in present-day big data and real-time web applications. The most commonly used data models include key-value, wide-column, graph, and document stores.
As NoSQL databases do not adhere to a strict schema, they can handle large volumes of structured, semi-structured, and unstructured data. This allows developers to be more agile and push code changes much more quickly than with relational databases. In this article, we compare three popular open-source NoSQL databases and discuss how their specific use cases and features might be a good fit for your application.
Cassandra
Overview and Features
Cassandra is the most popular wide-column store database system on the market. Initially developed at Facebook to power its Inbox search feature, Cassandra was open-sourced in 2008 and subsequently made a top-level Apache project on February 17, 2010. It is widely favoured for its enterprise features, like scalability and high availability, that allow it to handle large amounts of data and provide near real-time analysis. Written in Java, Cassandra offers synchronous and asynchronous replication for each update. Its durability and fault tolerance make it ideal for always-on applications.
Cassandra’s Advantages and Use Cases
Unlike MongoDB, Cassandra uses a masterless “ring” architecture, which provides several benefits over legacy master-slave architectures. All nodes in a cluster are treated equally, and a quorum can be reached with a majority of the replicas that hold a given piece of data.
Although Cassandra stores data in columns and rows like a traditional RDBMS, it is more flexible: rows can have different columns, and the format of the columns can change. Its query language, Cassandra Query Language (CQL), closely resembles traditional SQL syntax and can therefore be easier for SQL users to learn. This gives it an edge in any comparison of Cassandra vs. HBase.
Cassandra offers advanced repair processes for read, write, and anti-entropy (data consistency), which make its clusters highly available and reliable. Owing to its lack of a single point of failure, it can provide a highly available architecture if a quorum of nodes is maintained and the replication factor is tuned accordingly. This also allows for better fault tolerance compared to document stores like MongoDB, which might take up to 40 seconds to recover.
Some of Cassandra’s most common use cases include messaging systems (for its superior read and write performance), real-time sensor data, and e-commerce websites.
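The relationship between replication factor and quorum can be made concrete with a little arithmetic. The sketch below is plain Python, not Cassandra driver code, and the function names are hypothetical. It assumes the usual majority rule (a quorum is more than half of the replicas of a piece of data) and the common rule of thumb that a read is certain to overlap the latest write when read replicas plus write replicas exceed the replication factor.

```python
def quorum(replication_factor: int) -> int:
    """Smallest number of replicas that forms a majority."""
    if replication_factor < 1:
        raise ValueError("replication factor must be at least 1")
    return replication_factor // 2 + 1


def reads_see_latest_write(read_replicas: int, write_replicas: int,
                           replication_factor: int) -> bool:
    """True when every set of read replicas must overlap every set of write replicas."""
    return read_replicas + write_replicas > replication_factor


if __name__ == "__main__":
    for rf in (1, 3, 5):
        q = quorum(rf)
        # rf - q replicas can be unavailable while a quorum is still reachable
        print(f"RF={rf}: quorum={q}, replicas that may be down={rf - q}")
```

With a replication factor of 3, for example, a quorum is 2 replicas, so quorum reads and writes keep working with one replica down. A replication factor of 1 leaves no room for failure at all.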
Cassandra Disadvantages
Cassandra isn’t without its disadvantages. As the architecture is distributed, replicas can become inconsistent. If a node in a cluster goes down, the coordinator node tries to preserve the writes destined for it in the form of hints. When the failed node is brought back online, the coordinator node hands off the hints to aid in the repair. Storing and replaying hints, however, can overload the coordinator node. As a result, you may see a loss of data replicas and a refusal of writes from the coordinator node.
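The hint mechanism described above can be illustrated with a toy model. This is a deliberately simplified sketch in plain Python, not Cassandra's implementation: the `Node` and `Coordinator` classes and the `max_hints` limit are assumptions made for illustration, and the overload case is modelled simply by dropping hints once the buffer is full.

```python
from collections import defaultdict


class Node:
    def __init__(self, name):
        self.name = name
        self.up = True
        self.data = {}


class Coordinator:
    def __init__(self, max_hints=1000):
        self.max_hints = max_hints       # hypothetical cap on buffered hints
        self.hints = defaultdict(list)   # node name -> [(key, value), ...]
        self.dropped = 0                 # writes that could not be hinted

    def _hint_count(self):
        return sum(len(pending) for pending in self.hints.values())

    def write(self, key, value, replicas):
        for node in replicas:
            if node.up:
                node.data[key] = value
            elif self._hint_count() < self.max_hints:
                # Replica is down: remember the write for later replay.
                self.hints[node.name].append((key, value))
            else:
                # Hint buffer is full: this replica misses the write.
                self.dropped += 1

    def replay(self, node):
        """Hand stored hints to a node that is back online; return how many."""
        if not node.up:
            return 0
        pending = self.hints.pop(node.name, [])
        for key, value in pending:
            node.data[key] = value
        return len(pending)


if __name__ == "__main__":
    a, b, c = Node("a"), Node("b"), Node("c")
    coordinator = Coordinator(max_hints=2)
    c.up = False
    for i in range(3):
        coordinator.write(f"k{i}", i, [a, b, c])
    c.up = True
    print("replayed:", coordinator.replay(c), "dropped:", coordinator.dropped)
```

In this model, any write that arrives after the hint buffer fills up never reaches the recovered node, which is how replicas end up inconsistent until a repair runs.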
When reading data, Cassandra performs well if the primary key is known but suffers severely if it is not, because it then has to scan every node in the cluster, resulting in high read latency.
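A small model shows why the key matters. The sketch below is plain Python with a hypothetical `Cluster` class. It assumes rows are placed on a node by hashing the key, and it uses a simple hash-modulo scheme as a stand-in for Cassandra's token ring, so it illustrates the access pattern only, not the real partitioner.

```python
import hashlib


class Cluster:
    def __init__(self, node_count):
        # Each node is modelled as a dict of key -> row.
        self.nodes = [dict() for _ in range(node_count)]

    def _owner(self, key):
        # Simplification: hash modulo node count instead of token ranges.
        digest = hashlib.md5(key.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % len(self.nodes)

    def put(self, key, row):
        self.nodes[self._owner(key)][key] = row

    def get_by_key(self, key):
        """Return (row, nodes_contacted) for a lookup by key."""
        node = self.nodes[self._owner(key)]
        return node.get(key), 1

    def find_by_column(self, column, value):
        """Return (rows, nodes_contacted) for a filter on a non-key column."""
        matches = []
        for node in self.nodes:  # no key, so every node must be asked
            matches.extend(r for r in node.values() if r.get(column) == value)
        return matches, len(self.nodes)


if __name__ == "__main__":
    cluster = Cluster(node_count=6)
    cluster.put("user:1", {"name": "Ada", "city": "London"})
    cluster.put("user:2", {"name": "Grace", "city": "New York"})
    print(cluster.get_by_key("user:1"))
    print(cluster.find_by_column("city", "London"))
```

The lookup by key touches one node regardless of cluster size, while the filter on `city` touches all six, and that cost grows with every node added.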
Another major drawback is its lack of solid official documentation from the Apache Software Foundation; you must rely on third-party companies like DataStax for this.
Ownership, Support, and Key Customers
Cassandra is currently maintained by the Apache Software Foundation. Third-party companies like DataStax, URimagination, and Impetus provide support based on their database implementations. Cassandra boasts several large organizations among its users, including GitHub, Facebook, Instagram, Netflix, and Reddit.
MongoDB
Overview and Features
MongoDB is the most popular document store on the market and is also one of the leading database management systems. It was created in 2007 by the team behind DoubleClick (currently owned by Google) to address the scalability and agility issues they had encountered in serving Internet ads.
MongoDB offers both a community and an enterprise version of the software. The enterprise version offers additional enterprise features like LDAP, Kerberos, auditing, and on-disk encryption.
MongoDB Advantages and Use Cases
MongoDB is a schema-less database and stores data as JSON-like documents (binary JSON). This provides flexibility and agility in the type of records that can be stored, and even the fields can vary from one document to the other. Also, the embedded documents provide support for faster queries through indexes and vastly reduce the I/O overhead generally associated with database systems. Along with this is support for schema on write and dynamic schema for easily evolving data structures.
MongoDB also provides several enterprise features, like high availability and horizontal scalability. High availability is achieved through replica sets, which provide data redundancy and automatic failover. This ensures that your application keeps serving, even if a node in the cluster goes down.
MongoDB also provides support for several storage engines, ensuring that you can fine-tune your database based on the workload it is serving. Some of the most common use cases of MongoDB include a real-time view of your data, mobile applications, IoT applications, and content management systems. Finally, it includes a nested object structure, indexable array attributes, and incremental operations.
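The document model is easy to picture with ordinary Python dictionaries. The sketch below does not use a MongoDB driver; the sample records and the `get_path` and `find` helpers are hypothetical, and they only mimic the idea of querying a field inside an embedded document with a dotted path.

```python
import json

# Three documents in the same collection, each with a different set of fields.
users = [
    {"_id": 1, "name": "Ada", "address": {"city": "London", "zip": "N1"}},
    {"_id": 2, "name": "Linus", "languages": ["C", "Shell"]},
    {"_id": 3, "name": "Grace", "address": {"city": "New York"}, "rank": "Rear Admiral"},
]


def get_path(document, dotted_path):
    """Follow a dotted path such as 'address.city'; None if any part is missing."""
    current = document
    for part in dotted_path.split("."):
        if not isinstance(current, dict) or part not in current:
            return None
        current = current[part]
    return current


def find(documents, dotted_path, value):
    """Return the documents whose field at dotted_path equals value."""
    return [d for d in documents if get_path(d, dotted_path) == value]


if __name__ == "__main__":
    print(json.dumps(find(users, "address.city", "London"), indent=2))
```

Because the address is embedded in the user document, the query needs no join, and a document without an address is simply skipped rather than treated as an error.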
MongoDB Disadvantages
Unless you opt for one of the DBaaS flavours, management operations like patching are manual and time-consuming. Although they have improved in newer versions, MapReduce implementations remain slow, and MongoDB also suffers from high memory usage as databases start scaling.
Ownership, Support, and Key Customers
MongoDB is owned by MongoDB, Inc., with its headquarters in New York. The community version is free to use and is supported by the MongoDB community of developers and users on its community forum. On the other hand, the enterprise version is supported by MongoDB, Inc. engineers 24/7.
As one of the most popular databases on the market, MongoDB counts names like eBay, Google, Cisco, SAP, and Facebook among its clients.
HBase
HBase Overview and Features
HBase is an open-source, distributed wide-column store that is based on Google’s Bigtable. It was developed in 2008 as part of Apache’s Hadoop project. Built on top of HDFS, it borrows several features from Bigtable, like in-memory operation, compression, and Bloom filters. Written in Java, HBase provides support for external APIs like Thrift, Avro, Scala, Jython, and REST. HBase offers a stand-alone mode, but it is mainly used for development, not in production scenarios.
HBase Advantages and Use Cases
One of the strengths of HBase is its use of HDFS as the distributed file system. This allows the database to store large data sets, even billions of rows, and provide analysis in a short period. Its support for sparse data, along with the fact that it can be distributed across commodity server hardware, makes the solution very cost-effective when the data is scaled to gigabytes or petabytes. That distribution also underpins one of its big advantages: failover support with automatic recovery.
Several concepts from Bigtable, like Bloom filters and block caches, can also be used for query optimization. HBase also has a leg up in any HBase vs. Cassandra comparison when it comes to consistency, as the reads and writes adhere to immediate consistency, compared to the eventual consistency in Cassandra. Its close integration with Hadoop projects and MapReduce makes it an enticing solution for Hadoop distributions.
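A Bloom filter helps query performance by answering "could this key be here?" cheaply, so that storage that definitely does not contain a row can be skipped. The sketch below is a minimal, generic Bloom filter in plain Python, not HBase's implementation. The `BloomFilter` class, its default sizes, and the use of SHA-256 as the hash are assumptions chosen for illustration.

```python
import hashlib


class BloomFilter:
    def __init__(self, size_bits=1024, hash_count=3):
        self.size_bits = size_bits
        self.hash_count = hash_count
        self.bits = bytearray((size_bits + 7) // 8)

    def _positions(self, key):
        # Derive hash_count bit positions by salting the key with a seed.
        for seed in range(self.hash_count):
            digest = hashlib.sha256(f"{seed}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key):
        """False means definitely absent; True means possibly present."""
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(key))


if __name__ == "__main__":
    row_keys = BloomFilter()
    row_keys.add("row-0001")
    print(row_keys.might_contain("row-0001"))
```

The filter never reports a stored key as absent, but it can occasionally report an absent key as present. A larger bit array or a better-tuned number of hashes lowers that false-positive rate at the cost of memory.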
Some of HBase’s common use cases include online log analytics, Hadoop distributions, write-heavy applications, and applications that handle large volumes of data (such as tweets and Facebook posts).
HBase Disadvantages
Although HBase shares several similarities with Cassandra, one major difference is its use of a master-slave architecture. The master can prove to be a single point of failure, as failing over from one HMaster to another takes time, and it can also become a performance bottleneck. If you are looking for an always-available system, then Cassandra might be a better choice.
Unlike Cassandra, HBase does not have a query language. This means that to achieve SQL-like capabilities, one must use the JRuby-based HBase shell and technologies like Apache Hive (which, in turn, is based on MapReduce). The major problem with this approach is the high latency and steep learning curve in employing these technologies.
Although HBase scales well by adding DataNodes to its cluster, it has some high hardware requirements, mainly because of its dependency on HDFS, which would require five DataNodes and one NameNode as a minimum. This, in turn, translates to high running and maintenance costs.
Another important factor to consider when choosing HBase is its dependency on other systems, like HDFS for storage and Apache ZooKeeper for status management and metadata. As a result, the architecture of a solution can become complex, and one must know these technologies well.
Ownership, Support, and Key Customers
HBase is an open-source database developed and maintained by the Apache Software Foundation, and commercial technical support is provided by several Hadoop vendors.
Some of HBase’s prominent customers include Adobe, Netflix, Pinterest, Salesforce, Spotify, and Yahoo.
Conclusion
There is no doubt that MongoDB is one of the most popular open-source NoSQL databases, but wide-column databases like Cassandra may provide better query performance and always-on capabilities.
Another important factor when choosing a NoSQL database is the availability of managed DBaaS services, where you can offload the management and maintenance of the database to the provider, and the developer can simply focus on their application. HBase, in this context, is lacking, while MongoDB has very mature DBaaS offerings, like MongoDB Atlas. On the other hand, HBase can be a very good solution for write-heavy applications and enormous amounts of records.