Effectively Use "cbt" command to get Google Cloud BigTable Data
Google Cloud BigTable is a NoSQL wide-column database, often used to store time-series data. Unlike SQL databases such as MySQL, it doesn't have a standard user interface to view data (as of Nov 2020). Instead, it provides a command-line tool.
This blog describes how to query a Google Cloud BigTable instance effectively, viewing and filtering rows with the cbt command-line tool. It expects you to have a fair understanding of:
- Google Cloud BigTable
- Row keys, column families, and column qualifiers in BigTable
- Proper access rights to query BigTable
The Cloud BigTable command-line tool (cbt) provided by Google helps you perform various actions on your BigTable instance. I mostly use it to read rows from my BigTable instance. The cbt command-line tool gives you the following reading options:
1. Read: Read rows between the range of keys.
cbt -instance <BIGTABLE_INSTANCE_ID> -project <GCP_PROJECT_ID> read <TABLE_NAME> start=<ROW_KEY_1> end=<ROW_KEY_2>
2. Read from: Start reading rows from the key specified. cbt read takes its options as key=value pairs, so the key is passed as start=<ROW_KEY>.
cbt -instance <BIGTABLE_INSTANCE_ID> -project <GCP_PROJECT_ID> read <TABLE_NAME> start=<ROW_KEY>
Before reading, keep in mind how BigTable stores data:
- Each table has only one index, the row key
- Rows are sorted lexicographically (https://en.wikipedia.org/wiki/Lexicographic_order) by row key
- Columns are grouped by column family and sorted in lexicographic order within the column family
- Cloud Bigtable tables are sparse
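Lexicographic order can be surprising when row keys contain numbers. As a small illustration, the sketch below sorts three made-up keys byte by byte with the standard sort utility. The device.N keys and the sorted_keys function are hypothetical, and the sort only approximates how BigTable compares row keys as byte strings.

```shell
# Sort hypothetical row keys the way a byte-wise (lexicographic) comparison would.
# LC_ALL=C forces plain byte order instead of locale-aware collation.
sorted_keys() {
  printf '%s\n' 'device.9' 'device.10' 'device.100' | LC_ALL=C sort
}

sorted_keys
# device.10
# device.100
# device.9
```

device.9 lands last because the character 9 sorts after the character 1, so numeric order and row key order do not always agree.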
To start, open a command line. Make sure your machine has proper read access to the Google Cloud BigTable instance.
List the tables in Google Cloud BigTable
First identify the table from which you wish to read your data. To see a list of tables in Google Cloud BigTable, run the following command in your terminal.
>> cbt -instance <BIGTABLE_INSTANCE_ID> -project <GCP_PROJECT_ID> ls
testTable1
testTable2
The output will be the list of tables in your BigTable instance.
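If you already know the table name, a quick existence check saves scrolling through a long list. The sketch below is one possible way to do it. table_in_list is a hypothetical helper name, and it assumes the ls output has one table name per line, as in the sample listing in this section.

```shell
# Hypothetical helper: succeeds if the table name given as $1 appears
# as a whole line in the table list read from standard input.
# -F treats the name literally, -x requires a full-line match, -q stays silent.
table_in_list() {
  grep -Fxq -- "$1"
}

# Demonstration on the sample listing from this post
if printf '%s\n' testTable1 testTable2 | table_in_list testTable2; then
  echo "testTable2 exists"
fi
```

Against a real instance, the input would come from the ls command itself, for example: cbt -instance <BIGTABLE_INSTANCE_ID> -project <GCP_PROJECT_ID> ls | table_in_list <TABLE_NAME>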
How to read Google Cloud BigTable Rows between two Row Keys using cbt?
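Pick any two row keys between which you wish to see your data. Let's call them ROW_KEY_1 and ROW_KEY_2. ROW_KEY_1 must be smaller than ROW_KEY_2 when sorted lexicographically. Reading starts at ROW_KEY_1 and stops before ROW_KEY_2, so the row with key ROW_KEY_2 itself is not returned. Then in your console, enter the following command: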
cbt -instance <BIGTABLE_INSTANCE_ID> -project <GCP_PROJECT_ID> read <TABLE_NAME> start=<ROW_KEY_1> end=<ROW_KEY_2>
Replace the placeholders in the command above with your BigTable instance ID, GCP project ID, table name, and row keys, and then hit the ENTER key.
It will list all the rows between these two row keys, and the output will look like:
656345623452352345.1605170270100
data:sensor1 @ 2020/03/03-07:15:13.994000
"54\n"
data:sensor2 @ 2020/03/03-07:15:13.994000
"234\n"
data:sensor3 @ 2020/03/03-07:15:13.994000
"12\n"
data:sensor4 @ 2020/03/03-07:15:13.994000
"545\n"
data:sensor5 @ 2020/03/03-07:15:13.994000
"321\n"
data:sensor6 @ 2020/03/03-07:15:13.993000
"687\n"
data:sensor7 @ 2020/03/03-07:15:13.994000
"56\n"
data:sensor8 @ 2020/03/03-07:15:13.994000
"787\n"
so on...
Usually, if you record data in BigTable every second or every few milliseconds, you will get a long stream of data in your console, in the format shown above. Reading such a long stream of data on the console is difficult. We will talk about how to deal with this problem later in the blog.
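The ordering requirement on the two row keys can be checked before running the read. The sketch below is one way to do that under some assumptions: range_args is a hypothetical helper name, the keys contain no whitespace or newlines, and byte order from sort with LC_ALL=C stands in for BigTable's lexicographic comparison.

```shell
# Hypothetical helper: prints "start=<key1> end=<key2>" when key1 sorts
# strictly before key2 in byte order, and fails otherwise.
range_args() {
  first=$(printf '%s\n%s\n' "$1" "$2" | LC_ALL=C sort | head -n 1)
  if [ "$1" = "$2" ] || [ "$first" != "$1" ]; then
    echo "start key must sort before end key" >&2
    return 1
  fi
  printf 'start=%s end=%s\n' "$1" "$2"
}

# Demonstration with two keys from this post
range_args 656345623452352345.1605170270100 656345623452352345.1605170270199
```

The printed arguments can then be appended to the read command, for example: cbt -instance <BIGTABLE_INSTANCE_ID> -project <GCP_PROJECT_ID> read <TABLE_NAME> $(range_args <ROW_KEY_1> <ROW_KEY_2>)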
Understand the cbt command output format
656345623452352345.1605170270100
The first line is simply your row key.
data:sensor1 @ 2020/03/03-07:15:13.994000
In this line, 'data:sensor1' tells you the column family and the column. 'data' is the column family and 'sensor1' is the column, also called the column qualifier. The date after the @ on the same line is the timestamp of that cell, that is, when its value was recorded.
"54\n"
This line is the cell value, i.e., "54\n". Similarly, the rest of the lines describe the other columns and their values for the row key mentioned above. You will get a list of such rows, with their columns and values, printed on your console.
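Because the output follows this fixed three-part pattern, it can be flattened into one line per cell, which is easier to scan or to feed into other tools. The sketch below is an illustration only: cbt_to_tsv is a hypothetical helper name, and it assumes that column lines contain " @ ", that cell values start with a double quote, and that row keys do neither.

```shell
# Hypothetical helper: turns cbt read output (on standard input) into
# tab-separated lines of the form: row key, family:qualifier, cell value.
cbt_to_tsv() {
  awk '
    { sub(/^[ \t]+/, "") }        # drop any leading indentation
    /^-*$/ { next }               # skip blank lines and dashed separators
    / @ /  { column = $1; next }  # column line: family:qualifier @ timestamp
    /^"/   { print key "\t" column "\t" $0; next }  # quoted cell value
    { key = $0 }                  # anything else is treated as a row key
  '
}

# Demonstration on the first two cells of the sample output
cbt_to_tsv <<'EOF'
656345623452352345.1605170270100
data:sensor1 @ 2020/03/03-07:15:13.994000
"54\n"
data:sensor2 @ 2020/03/03-07:15:13.994000
"234\n"
EOF
```

For the sample above this prints two lines, one for data:sensor1 and one for data:sensor2, each prefixed with the row key. With a real table the helper would sit at the end of a pipe after the cbt read command.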
Reading Google Cloud BigTable Rows starting from a Row Key
Decide on a row key first (say ROW_KEY_1) from where you wish to start reading data. cbt will return the rows whose row key is greater than or equal to ROW_KEY_1. Open the command line console and enter the following command:
cbt -instance <BIGTABLE_INSTANCE_ID> -project <GCP_PROJECT_ID> read <TABLE_NAME> start=<ROW_KEY_1>
Replace the placeholders in the command with your BigTable instance ID, GCP project ID, table name, and row key, then hit the ENTER key.
It will list all the rows from that row key onward. The output will look like:
656345623452352345.1605170270100
data:sensor1 @ 2020/03/03-07:15:13.994000
"54\n"
data:sensor2 @ 2020/03/03-07:15:13.994000
"234\n"
data:sensor3 @ 2020/03/03-07:15:13.994000
"12\n"
data:sensor4 @ 2020/03/03-07:15:13.994000
"545\n"
data:sensor5 @ 2020/03/03-07:15:13.994000
"321\n"
data:sensor6 @ 2020/03/03-07:15:13.993000
"687\n"
data:sensor7 @ 2020/03/03-07:15:13.994000
"56\n"
data:sensor8 @ 2020/03/03-07:15:13.994000
"787\n"
so on...
Again, the output will be in the format explained above. The problem with this command is that it keeps printing BigTable rows until it reaches the end of the table. Adding count=<N> to the read command limits the output to the first N rows.
How to Effectively read BigTable rows using cbt?
Effective methods:
- Making use of filters and pipes (this works on Linux)
- Using column filters option given by cbt
Making use of filters and pipes (this works on Linux)
Filters and pipes come in handy on a Linux machine. Say you want to see only the row keys; then pipe either of the read commands through grep:
cbt -instance <BIGTABLE_INSTANCE_ID> -project <GCP_PROJECT_ID> read <TABLE_NAME> start=<ROW_KEY> | grep -- '<ROW_KEY_PREFIX>'
ROW_KEY_PREFIX is the string with which my row keys start. For example, my row key '656345623452352345.1605170270100' is a combination of a device ID and a timestamp with a period '.' in between. So if I want to see the rows stored against this device, I use '656345623452352345.' as my ROW_KEY_PREFIX. It lets me see the list of row keys, as shown below:
656345623452352345.1605170270100
656345623452352345.1605170270112
656345623452352345.1605170270144
656345623452352345.1605170270166
656345623452352345.1605170270178
656345623452352345.1605170270182
656345623452352345.1605170270189
656345623452352345.1605170270191
656345623452352345.1605170270195
656345623452352345.1605170270199
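One caveat with plain grep: it treats the prefix as a regular expression, so the period matches any character, and it matches the prefix anywhere in a line rather than only at the start. A stricter variant is sketched below. row_keys_with_prefix is a hypothetical helper name, and the sketch assumes the prefix contains no backslashes, since awk would interpret them in the -v assignment.

```shell
# Hypothetical helper: prints only the lines that literally start with the
# prefix given as $1, ignoring leading indentation.
row_keys_with_prefix() {
  awk -v p="$1" '
    { sub(/^[ \t]+/, "") }   # drop any leading indentation
    index($0, p) == 1        # keep lines where the prefix sits at position 1
  '
}

# Demonstration: only the first line starts with the device prefix
printf '%s\n' \
  '656345623452352345.1605170270100' \
  'data:sensor1 @ 2020/03/03-07:15:13.994000' \
  '"54\n"' \
  '9656345623452352345.1605170270100' |
  row_keys_with_prefix '656345623452352345.'
```

It is used in the same position as grep in the pipe, after the cbt read command.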
When reading a stream of rows starting from a row key, the output keeps scrolling on your console until it reaches the end of the table. In such a case I use less with my cbt command, so that I can see the data page by page.
cbt -instance <BIGTABLE_INSTANCE_ID> -project <GCP_PROJECT_ID> read <TABLE_NAME> start=<ROW_KEY> | less
Using column filters option given by cbt
Using column filters in my cbt command lets me see only the columns I mention in the filter parameters:
cbt -instance <BIGTABLE_INSTANCE_ID> -project <GCP_PROJECT_ID> read <TABLE_NAME> start=<ROW_KEY> columns=<COLUMN_FAMILY>:<COLUMN1>,<COLUMN_FAMILY>:<COLUMN2>
In my case it would be:
cbt -instance <BIGTABLE_INSTANCE_ID> -project <GCP_PROJECT_ID> read <TABLE_NAME> start=<ROW_KEY> columns=data:sensor1,data:sensor2
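When many columns share one column family, typing the family name for each of them gets repetitive. The sketch below builds the columns= argument from a family name and a list of qualifiers. columns_arg is a hypothetical helper name, and it assumes that neither the family nor the qualifiers contain commas or whitespace.

```shell
# Hypothetical helper: columns_arg <family> <qualifier>...
# prints columns=<family>:<q1>,<family>:<q2>,...
columns_arg() {
  family=$1
  shift
  list=""
  for qualifier in "$@"; do
    # Add a comma only when the list already has an entry
    list="${list:+$list,}${family}:${qualifier}"
  done
  printf 'columns=%s\n' "$list"
}

columns_arg data sensor1 sensor2
# columns=data:sensor1,data:sensor2
```

The result can be appended to a read command with command substitution, for example by ending the command with $(columns_arg data sensor1 sensor2).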
Additionally, you can mix and match both, using column filters together with pipes and filters to read the BigTable rows effectively.
Please feel free to ask any questions in the comment below. Hope this blog post has helped you.
Thank you.