Debunking BigData Myths: write durability, data integrity and consistency
Several times in the past I have heard a very strange thing about BigData storage systems - specifically Cassandra and Hadoop. People were praising their (relatively) low cost, scalability and open-source nature, yet the same people would say something like “for that price we are ok if some data loss happens from time to time”. Shocking? Or, more importantly, is this really something that BigData adopters have to tolerate in exchange for the other benefits?
Funny enough, one of the recent dialogs involved Oracle as the “more reliable” alternative.
I mean no disrespect - in many cases it was a clear misunderstanding of the difference between data consistency and the durability of writes, and, more generally, of the ability of a storage system to preserve data integrity over time.
First, about software quality. To be fair, it is quite possible that Oracle has spent more time and money testing Oracle database server and weeding out bugs, and Oracle software is used by millions of customers. But open-source software like Apache Cassandra is also used by many thousands of customers, and many of them are just as intolerant of software bugs. Not to mention that many open-source products are backed by commercial vendors who perform additional quality control - DataStax for Apache Cassandra; Cloudera, Hortonworks and others for Hadoop; and so on. It is also worth mentioning that the source code of open-source products is publicly available and thousands of people contribute to it. Bottom line - I am not buying the argument that a product's origin (on average) makes a huge difference for data integrity and durability when comparing commercial and open-source software, assuming the latter is mature enough, of course.
That covers data integrity in general: once successfully written to disk, the data will most likely stay there for as long as needed, and BigData storage systems like Cassandra and Hadoop are at no disadvantage compared to products like Oracle database. Not to mention that both support automatic data replication, which adds another layer of protection to the data you keep - something traditionally achieved with RAID and network storage systems.
At the same time, we all know that software does crash from time to time. The same applies to operating systems and filesystems, and to the hardware. So, regardless of the software's origin, there is always a relatively small possibility of data loss caused by such a failure.
Now, about the fundamental problem of writing data in a client-server environment. It is always possible, regardless of how much money you pay for the server software, that a write request sent by the client will time out. You will not know for sure whether it reached the server: perhaps your data was written but the response never arrived, or perhaps most of your data was lost in transmission and the protocol stack was unable to recover. There may be any number of reasons. In fact, as a client application developer you do not care much about the failure itself; what you really care about is a predictable outcome of the request, i.e. you want to know for sure what to expect when you read this data later. From this perspective, the timeout is the worst-case scenario. It is that cheque the sender claims to have sent you while your mailbox remains empty after 2 weeks :)
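To make the ambiguity concrete, here is a minimal Python sketch of the standard remedy: make the write idempotent and retry it, so the final state is the same whether or not a lost acknowledgement hid a successful write. The `FlakyStore` class and its 50% ack-loss rate are invented for illustration, not taken from any real product:

```python
import random

class FlakyStore:
    """Toy key-value server whose acknowledgements can be lost in transit."""
    def __init__(self):
        self.data = {}

    def write(self, key, value):
        self.data[key] = value               # the write may still land...
        if random.random() < 0.5:
            raise TimeoutError("ack lost")   # ...even though the client times out

def write_with_retry(store, key, value, attempts=5):
    """Idempotent retry: re-sending the same value makes the outcome
    deterministic even when an earlier ack was lost."""
    for _ in range(attempts):
        try:
            store.write(key, value)
            return True
        except TimeoutError:
            continue
    # Last resort: read back to resolve the ambiguity.
    return store.data.get(key) == value
```

The key design point is that the retried operation sets a value rather than, say, incrementing a counter - replaying it any number of times converges to the same state.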
Fortunately, each storage solution offers a way to deal with this situation. With SQL databases, a successful commit guarantees that the data is persisted, i.e. it will stay there until you delete it. So all you need to do is carefully check the result of the commit operation: the writes are durable and the outcome of each write is 100% clear.
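As a small illustration, here is the pattern in Python with the built-in sqlite3 module; any transactional SQL database behaves the same way, only the driver differs:

```python
import sqlite3

# In-memory database for illustration; with a server-backed RDBMS the same
# rule applies: only a successful commit() guarantees the data is persisted.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE payments (id INTEGER PRIMARY KEY, amount REAL)")

try:
    conn.execute("INSERT INTO payments (id, amount) VALUES (1, 99.50)")
    conn.commit()          # raises on failure; success means the row is durable
    written = True
except sqlite3.Error:
    conn.rollback()        # outcome is known either way: all or nothing
    written = False
```

Checking the commit result (rather than the insert alone) is what gives you the unambiguous all-or-nothing answer.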
What about Cassandra? Cassandra actually provides a very high level of write durability. Its commit log is the key feature here: if your client receives an OK from the number of replicas you requested as the target consistency level, the data has been written to each of those replicas' commit logs. And once in the commit log, only a severe filesystem corruption affecting the commit log can destroy this data - regardless of how many times a node crashes after that, the data from the commit log will eventually end up in an SSTable. Even if the coordinator had trouble talking to some of the replicas (again, depending on your requested consistency level), the hinted handoff mechanism will cover you in case of brief replica failures. Write durability is taken very seriously by Cassandra.
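A toy model of that coordination path may help. The class and function names below are mine, not Cassandra's actual code, but the shape is the same: the coordinator counts commit-log acks against the requested consistency level and stores hints for replicas that were down:

```python
class Replica:
    def __init__(self, up=True):
        self.up = up
        self.commit_log = []          # append-only; survives process crashes

    def write(self, key, value):
        if not self.up:
            raise ConnectionError("replica down")
        self.commit_log.append((key, value))

def coordinator_write(replicas, key, value, required_acks):
    """Succeed once `required_acks` replicas have appended the write to
    their commit logs; record down replicas as hinted-handoff targets."""
    acks, hints = 0, []
    for replica in replicas:
        try:
            replica.write(key, value)
            acks += 1
        except ConnectionError:
            hints.append(replica)
    if acks < required_acks:
        raise TimeoutError("requested consistency level not met")
    return hints

def replay_hints(hints, key, value):
    """When a briefly-failed replica comes back, its hint is replayed."""
    for replica in hints:
        if replica.up:
            replica.write(key, value)
```

With three replicas and `required_acks=2` (a quorum), one failed replica does not fail the write - the hint brings it up to date once it recovers.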
HBase? It also has a mechanism to ensure write durability, and it is simple and powerful: the Write-Ahead Log (WAL). Since the WAL is stored in HDFS, you get not only crash protection but also automatic data redundancy at the same time. The WAL is replayed by the region server upon restart, so, again, your successful writes will make it to HFiles and are therefore durable.
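The write-ahead pattern itself fits in a few lines. This is a simplified model of the idea, not HBase's API - log first, apply second, and after a crash rebuild the in-memory state by replaying the log:

```python
class RegionServer:
    """Toy model of WAL-based durability: every mutation is appended to a
    write-ahead log before it is applied to the in-memory store."""
    def __init__(self, wal):
        self.wal = wal          # in real HBase this lives in HDFS, replicated
        self.memstore = {}

    def put(self, key, value):
        self.wal.append((key, value))   # durable first...
        self.memstore[key] = value      # ...then visible

    @classmethod
    def recover(cls, wal):
        """After a crash the memstore is gone; replaying the WAL rebuilds it."""
        server = cls(wal)
        for key, value in wal:
            server.memstore[key] = value
        return server
```

The ordering in `put` is the whole trick: a crash between the two lines loses nothing, because the log entry already exists and will be replayed.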
Finally, consistency. Yes, this is an important topic and, yes, many BigData storage systems intentionally trade consistency for better availability and partition tolerance. However, this is purely an application and data-model concern, not a technical flaw. It is usually much easier to deal with data consistency when working with a relational database like Oracle - full ACID transaction control lets you always have strong consistency.
Cassandra and HBase each have their own, different definition of data consistency. Due to Cassandra's distributed nature, its tunable consistency is the key feature that allows you to reach very high data throughput for writes, reads or even both. But, again, a simpler or different consistency model has nothing to do with “data loss” - assuming the client applications are correctly designed and their data models are created by engineers who understand the architecture of the respective storage products. In the end, these are storage platforms that differ from relational databases, and they need to be approached accordingly.
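The tunable part boils down to simple arithmetic over replica counts. A reader is guaranteed to see the latest acknowledged write exactly when the read and write replica sets must overlap - the well-known R + W > N rule, sketched here as a helper (the function name is mine):

```python
def reads_see_latest_write(n, r, w):
    """With N replicas, a read of R replicas must overlap a write
    acknowledged by W replicas exactly when R + W > N."""
    return r + w > n

# Typical choices for a 3-replica keyspace:
#   ONE/ONE       -> fastest, but a read may miss the latest write
#   QUORUM/QUORUM -> overlapping sets, strong consistency, more latency
```

This is why QUORUM reads plus QUORUM writes (2 + 2 > 3) behave "strongly", while ONE/ONE (1 + 1 ≤ 3) trades that guarantee for speed - a data-model decision, not data loss.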
Conclusion: it is absolutely unfair and technically wrong to claim that modern BigData storage products (such as Cassandra or HBase) carry a higher risk of data loss compared to commercial relational databases like Oracle. Of course, for this to hold, the applications have to comply with the requirements imposed by these products' architecture.