In my last article about Cassandra, I discussed the architecture of the Cassandra distributed database at a high level. We discussed how data is distributed and replicated across the nodes of a Cassandra cluster to guarantee partition tolerance and availability of data.
In this article, we will take a deeper look inside the Cassandra database to understand how it writes and reads data. Each node in a Cassandra cluster is treated the same as any other node and has the same internal architecture. We are going to look at the Cassandra architecture from an individual node's perspective.
As mentioned in the introductory article, Cassandra is a peer-to-peer database where the nodes in the cluster constantly communicate with each other to share and receive information (node status, data ranges and so on). There is no concept of master or slave in a Cassandra cluster. Any node in the cluster can serve a write or read request at any given point in time, even if the data being written or read does not belong to the node serving the request.
The following diagram represents the read/write path within a Cassandra database.
Well, this diagram seems too complicated. Let's simplify it and look into the write path (i.e. how Cassandra writes incoming data) first, followed by the read path (i.e. how data is read from a Cassandra database).
When a client performs a write operation against a Cassandra database, the data is written into different parts of the database. We must be familiar with the following components of the Cassandra database to understand its write path.
Let's briefly go over the above-mentioned components of a Cassandra database before discussing the write path in more detail.
The following diagram represents the write path of a Cassandra database.
When a client issues a write operation, Cassandra writes the data into the commit log as a first step (for durability) and then writes the same data into an in-memory structure called the memtable. Once the data is written to the commit log and the memtable, Cassandra sends an acknowledgement back to the client confirming the write operation (even though the data has not yet been placed on disk permanently).
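The two-step write described above can be sketched as a toy model. The class and method names here are purely illustrative (they are not Cassandra's actual code), but the ordering mirrors the real write path: commit log first, memtable second, acknowledgement last.

```python
import json

class WritePath:
    """Toy model of Cassandra's write path: append to a commit log,
    then update the in-memory memtable, then acknowledge the client."""

    def __init__(self):
        self.commit_log = []   # stand-in for the append-only on-disk commit log
        self.memtable = {}     # partition key -> record, held in memory

    def write(self, key, record):
        # 1. Append to the commit log first, for durability.
        self.commit_log.append(json.dumps({key: record}))
        # 2. Write the same data into the memtable.
        self.memtable[key] = record
        # 3. Only now is the client acknowledged.
        return "ACK"

db = WritePath()
assert db.write("user:1", {"name": "Alice"}) == "ACK"
assert db.memtable["user:1"] == {"name": "Alice"}
assert len(db.commit_log) == 1
```

If the process dies between steps 1 and 2, the commit log still holds the mutation, which is exactly what makes the early acknowledgement safe.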
As more writes come in, Cassandra keeps appending the records to the commit log (in sequential order) and then to the memtable. However, when Cassandra writes data into the memtable, it keeps the records in sorted order of the table's partitioning key.
Now, since the memtable is in memory and can't keep growing infinitely, Cassandra periodically flushes the records from the memtable (memory) to disk, which results in the creation of an SSTable. Each flush of the memtable results in a new SSTable on disk. Moreover, the records stored in the SSTable are sorted by the partitioning key, just as they were in the memtable, which facilitates sequential access of data from the SSTable. Cassandra is specifically designed to perform sequential writes and reads, as random reads and writes are prone to poor performance.
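A minimal sketch of the flush step (again a toy model, not Cassandra's implementation): because the memtable keeps records ordered by partition key, a flush simply emits them in that order, producing an immutable, sorted SSTable.

```python
# Toy flush: whatever order writes arrived in, the memtable is kept
# sorted by partition key, so the flushed SSTable is sorted too.
memtable = {"k3": "v3", "k1": "v1", "k2": "v2"}

def flush(memtable):
    # Emit records in partition-key order, as an immutable list of pairs.
    return [(k, memtable[k]) for k in sorted(memtable)]

sstable = flush(memtable)
assert sstable == [("k1", "v1"), ("k2", "v2"), ("k3", "v3")]
```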
Once the data is flushed from the memtable to an SSTable, we no longer need the commit log (or the memtable) for data durability. This is when the memtable is cleared, the commit log (file) related to those records is removed, and a new commit log is created for incoming records.
In the event of a system failure, if the data was not flushed to disk before the failure, Cassandra will replay the unflushed data from the commit log into the memtable during system startup and then persist it to disk during the next flush operation.
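The recovery step can be sketched as replaying the logged mutations in order to rebuild the memtable. This is a simplified illustration under the assumption that each log entry is a single serialized mutation; real commit log segments are binary and more involved.

```python
import json

# Surviving commit log entries from before the crash (illustrative format).
commit_log = ['{"user:1": {"name": "Alice"}}', '{"user:2": {"name": "Bob"}}']

def replay(commit_log):
    # Rebuild the memtable by re-applying every logged mutation in order.
    memtable = {}
    for entry in commit_log:
        memtable.update(json.loads(entry))
    return memtable

recovered = replay(commit_log)
assert recovered == {"user:1": {"name": "Alice"}, "user:2": {"name": "Bob"}}
```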
Cassandra flushes the memtable to disk (SSTable) on the occurrence of the following events.
As we have discussed, each flush of the memtable results in a new SSTable. This might not be an issue while the number of SSTables is small. However, over a period of time we will end up with a huge number of SSTables, and this could slow down read performance, as we might have to walk through multiple SSTables to read a record. Cassandra periodically performs a background operation known as compaction to merge SSTables and optimize reads. Compaction is a critical and sophisticated process; I will discuss it in more detail in an upcoming article.
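The essence of compaction can be sketched as merging several SSTables into one, keeping only the newest version of each partition. This toy model attaches a write timestamp to each value (how Cassandra actually reconciles versions) but ignores tombstones and the real leveled/size-tiered strategies.

```python
def compact(*sstables):
    """Merge several SSTables into one, keeping the newest version of
    each partition key. Each value carries its write timestamp."""
    merged = {}
    for table in sstables:
        for key, (value, ts) in table.items():
            # Last-write-wins: keep the value with the highest timestamp.
            if key not in merged or ts > merged[key][1]:
                merged[key] = (value, ts)
    # The output is again sorted by partition key, like any SSTable.
    return dict(sorted(merged.items()))

old = {"a": ("1", 100), "b": ("2", 100)}
new = {"b": ("2-updated", 200), "c": ("3", 200)}
assert compact(old, new) == {"a": ("1", 100), "b": ("2-updated", 200), "c": ("3", 200)}
```

After compaction, a read for key "b" touches one SSTable instead of two, which is exactly the optimization the merge buys.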
As we can see from the architecture, Cassandra has one of the simplest and fastest write paths available, which makes it well suited for write-heavy workloads.
The read path in a Cassandra database is a little more complicated (actually, way more complicated) than the write path, as there is a variety of components involved in reading data from a Cassandra database. The following diagram represents the read path in a Cassandra database.
In addition to the components that we have discussed in the 'Write Path' section, we must also be aware of the following components to understand the read path of a Cassandra database.
Let's briefly discuss the above-mentioned components before talking more about the read path in Cassandra.
While reading records from Cassandra, if the record is present in the row cache, the read is served from the row cache without needing to look into any other location, as the row cache stores recently read records. This is the simplest and fastest read path available in Cassandra.
If we miss the row cache, Cassandra consults the Bloom filters, which can tell us in which SSTables the requested record definitely doesn't exist. In this way Cassandra can avoid reading unwanted SSTables. Once Cassandra has consulted the Bloom filters to identify the SSTables to be scanned for the requested record, it reaches out to the (partition) key cache, which maps recently read partition keys to their locations in the corresponding SSTables. If we find the partition key (of the requested record) in the key cache, we can directly fetch the record from the SSTable, as the key cache points to the specific record offset in the SSTable.
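The property Cassandra relies on here is that a Bloom filter can return false positives but never false negatives: a "no" answer means the SSTable can safely be skipped. A minimal sketch (not Cassandra's actual filter implementation, which is far more tuned):

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: may report false positives, never false
    negatives - exactly the property used to skip SSTables on reads."""

    def __init__(self, size=1024, hashes=3):
        self.size, self.hashes = size, hashes
        self.bits = 0  # an int used as a bit array

    def _positions(self, key):
        # Derive several bit positions per key from salted hashes.
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:4], "big") % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        # False => definitely absent; True => possibly present.
        return all(self.bits >> pos & 1 for pos in self._positions(key))

bf = BloomFilter()
bf.add("user:42")
assert bf.might_contain("user:42")  # never a false negative
```

Each SSTable carries its own filter, so a read only touches the tables whose filters answer "maybe".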
However, if we miss the key cache, i.e. if the partition key is not in the key cache, Cassandra will look into the partition summary, which is nothing but an in-memory sampling of the partition index. The partition summary helps us jump to a specific offset in the partition index. Once we are in the partition index, we have the offset of the partition key within the SSTable and can directly fetch the record from that offset.
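The summary-then-index lookup can be sketched as a two-level search: binary-search the in-memory summary to find the right slice of the on-disk index, then scan only that slice for the key's byte offset. The key format, sampling interval and offsets below are illustrative assumptions, not Cassandra's real on-disk layout.

```python
import bisect

# Sorted partition index of one SSTable: (partition key, byte offset in file).
partition_index = [("k%03d" % i, i * 128) for i in range(1000)]

# The partition summary samples every 128th index entry and lives in memory.
SAMPLE_EVERY = 128
summary = partition_index[::SAMPLE_EVERY]

def lookup(key):
    # 1. Binary-search the summary to find which index block to read.
    summary_keys = [k for k, _ in summary]
    block = max(bisect.bisect_right(summary_keys, key) - 1, 0)
    # 2. Scan only that slice of the (on-disk) partition index.
    start = block * SAMPLE_EVERY
    for k, offset in partition_index[start:start + SAMPLE_EVERY]:
        if k == key:
            return offset  # byte offset of the record inside the SSTable
    return None

assert lookup("k500") == 500 * 128
assert lookup("missing") is None
```

The point of the summary is that only a small slice of the full index has to be read per lookup, rather than the whole index.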
One point to mention here is that no matter how Cassandra locates the record in the SSTables, it always needs to consult the compression offsets to be able to read the data from the compressed blocks on disk.
Now, as we discussed earlier, in Cassandra there can be multiple versions of the same record, as Cassandra doesn't perform in-place modification. Cassandra attaches a timestamp to each version of the record (specifically, to each column/field) and uses this timestamp to merge records from different SSTables and the memtable to present the current version of the complete record.
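This per-column, timestamp-based reconciliation can be sketched as follows (a simplified last-write-wins merge; real Cassandra also handles tombstones and TTLs):

```python
def merge_versions(*versions):
    """Merge per-column versions of the same record: for each column,
    the value with the highest write timestamp wins (last-write-wins)."""
    current = {}
    for version in versions:
        for column, (value, ts) in version.items():
            if column not in current or ts > current[column][1]:
                current[column] = (value, ts)
    # Strip the timestamps to present the reconciled record.
    return {column: value for column, (value, ts) in current.items()}

# One version from an older SSTable, one from the memtable: the merged
# record combines the freshest value of every column.
from_sstable = {"name": ("Alice", 100), "email": ("a@old.com", 100)}
from_memtable = {"email": ("a@new.com", 200)}
assert merge_versions(from_sstable, from_memtable) == {"name": "Alice", "email": "a@new.com"}
```

Note that the merge is per column, not per row: a row returned to the client may be stitched together from several SSTables and the memtable.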
As we mentioned earlier, the row cache is an optional component; if it is configured, it will be updated along the way while returning the record to the client.
In this article, we have discussed in detail the internal mechanisms involved in writing and reading data to and from a Cassandra database. Cassandra has one of the simplest write paths of any database, which makes it a primary contender for write-heavy workloads. We have also discussed how the different components work internally to serve a read request. Cassandra performs a lot of internal optimizations to provide efficient read performance.