Scanning the entire Cassandra column family with CQL

Until Cassandra 2.0.2 it was nearly impossible to execute a simple command such as select * from my_cql_table. Even worse, when running such a command the server heap usage could skyrocket. In version 2.0.1 of Cassandra such a command might not return at all if launched from CQLSH.

When using ordered partitioner this is simpler as it would be possible to request the rows in the order of their primary key(s) and using the LIMIT clause implement the paging. However, this is not possible with a random partitioner.

While Cassandra is not really designed for such a use case, especially when the number of rows to be loaded is large, everything is possible. And in some cases it would be probably rare but a real use case.

Fortunately there is an easy way to load this data, thanks to the TOKEN() function available in CQL. But before using it you need to clearly understand one aspect of Cassandra - how the rows are stored. If you are dealing only with CQL this may not be clear. Any number of CQL columns defined as primary key get combined together in one composite key. It does not matter how - as long as the combination produces a unique value. And the hash function is used on this composite value to generate it hash. This hash is, essentially, the row key - the unique token that identifies this row.

In Cassandra all the rows are partitioned across the entire number space of the partitioner you have configured. Typically today MurMur3 partitioner is used more on partitioners, which uses the token range from -2⁶³ to 2⁶³-1. This corresponds to the range of java.lang.Long. Or 64-bit number. In fact, you do not care too much about the actual value of the token, all you need to know that it is so unique that there is only one token for any of the combinations of your primary key columns. And that these tokens are more or less evenly distributed in this 64-bit number space.

Considering all this you can implement a very simple paging to walk through your data. Assuming the sample table like this:

create table test_table(a text, b int, c int, d text, primary key ((a,b), c));

We can issue the following CQL query:

cqlsh:> select token(a,b), c, d from test_table where token(a,b) >= -9223372036854775808 limit 50;


 token(a, b)          | c | d
----------------------+---+----
 -9087493578437596993 | 2 | d1
 ...
 -8987733732583272758 | 9 | x1

(50 rows)

You have got 50 rows that have the token value from Long.MIN_VALUE to whatever 50th value you have. Again, the value of the token itself does not really matter. To continue fetching you can repeat the same request but now you will change it to:

cqlsh:> select token(a,b), c, d from test_table where token(a,b) >= -8987733732583272758 limit 50;

And so on until you get less than 50 rows back, this will be your last batch.

If you understand how this example works, you should also understand that you can, in fact, run these queries in parallel. Which would be probably practical since when you read a range of rows you are reading from the same node or maybe two nodes at most, when your table is relatively large. Splitting then entire token range in chunks and querying them in parallel will probably make your bulk data loading faster.

← Previous Archive Next →

blog comments powered by Disqus

Published

04 November 2013

Scanning the entire Cassandra column family with CQL

Published

Tags