BigData World: Cassandra Data Model

My previous posts related to cassandra gives an overview of what is cassandra and how to install cassandra

This posts will describe how to insert and fetch data from cassandra database:

Cassandra Keyspace and Column Family : Cassandra keyspace is sort of like a relational database. It defines one or more column families, which are very roughly analogous to tables in the relational world.it’s enough to think of a column family as a multidimensional ordered map that you don’t have to define further ahead of time. Column families hold columns, and columns are the atomic unit of data storage.

Keyspaces :A cluster is a container for keyspaces—typically a single keyspace. A keyspace is the outermost container for data in Cassandra, corresponding closely to a relational database. Like a relational database, a keyspace has a name and a set of attributes that define keyspace-wide behavior.
To my knowledge, there are currently no naming conventions in Cassandra for such items.

Column families :
In the same way that a relational database is a container for tables, a keyspace is a container for a list of one or more column families. A column family is roughly analagous to a table in the relational model, and is a container for a collection of rows. Each row contains ordered columns. Column families represent the structure of your data. Each keyspace has at least one and often many column families.

Cassandra is considered schema-free because although the column families are defined, the columns are not. You can freely add any column to any column family at any time, depending on your needs.

Cassandra provides two interface to interact with it.

Cassandra-cli
cassandra cql

Cassandra cql provides sql like interface to cassandra tables

Enter in cassandra cli:Run the following command to connect to your local Cassandra instance:
bin/cassandra-cli

You should see the following message, if successful:
Connected to: "Test Cluster" on 127.0.0.1/9160
Welcome to Cassandra CLI version 1.0.7
Type 'help;' or '?' for help.
Type 'quit;' or 'exit;' to quit.
[default@unknown]
You can access to the online help with 'help;' command.

Note:Commands are terminated with a semicolon (';') in the cli.

Some basic commands to be run via cassandra-cli:

To see the name of the current cluster you’re working in, type:

[default@unknown] show cluster name
Test Cluster

To see which keyspaces are available in the cluster, issue this command:
[default@unknown] show keyspaces
system

If you have created any of your own keyspaces, they will be shown as well
The system keyspace is used internally by Cassandra, and isn’t for us to put data into. In this way, it’s similar to the master and temp databases in Microsoft SQL Server. This keyspace contains the schema definitions and is aware of any modifications to the schema made at runtime. It can propagate any changes made in one node to the rest of the cluster based on timestamps.

Create keyspace and column family via cli
create keyspace demo with placement_strategy = 'org.apache.cassandra.locator.SimpleStrategy' and strategy_options = {replication_factor:1};

CREATE COLUMN FAMILY users
WITH comparator = UTF8Type
AND key_validation_class=UTF8Type
AND column_metadata = [
{column_name: full_name, validation_class: UTF8Type}
{column_name: email, validation_class: UTF8Type}
{column_name: state, validation_class: UTF8Type}
{column_name: gender, validation_class: UTF8Type}
{column_name: birth_year, validation_class: LongType}
];

Inserting Data in column family:
[default@demo] SET users['testuser']['full_name']='Sachin';
[default@demo] SET users['testuser']['email']='sachtechie@gmail.com';
[default@demo] SET users['testuser']['state']='TX';
[default@demo] SET users['testuser']['gender']='M';
[default@demo] SET users['testuser']['birth_year']='1995';

Secondary index on column:
The CLI can be used to create secondary indexes (indexes on column values). You can add a secondary index when you create a column family or add it later using the UPDATE COLUMN FAMILY command.
e.g: to add a secondary index to the birth_year column of the users column family:

[default@demo] UPDATE COLUMN FAMILY users
WITH comparator = UTF8Type
AND column_metadata = [{column_name: birth_year, validation_class: LongType, index_type: KEYS}];

Get the record from table:
Because of the secondary index created for the column birth_year, its values can be queried directly for users born in a given year as follows:

[default@demo] GET users WHERE birth_year = 1969;

Delete a row or column:
For example, to delete the state column for the testuser row key in the users column family:
[default@demo] DEL users ['testuser']['state'];
[default@demo] GET users ['testuser'];
Or to delete an entire row:
[default@demo] DEL users ['testuser'];

cassandra cql:

In CQL 3, identifiers, such as keyspace and table names, are case-insensitive unless enclosed in double quotation marks. You can force the case by using double quotation marks.

Enter in cql
./cqlsh --cql3

Create keyspace and column family
CREATE KEYSPACE demo WITH strategy_class = 'SimpleStrategy' AND strategy_options:replication_factor='1';

create table children ( childId varchar, firstName varchar, lastName varchar, country varchar, state varchar, zip varchar, primary key (childId ) ) ;

insert into children (childId, firstName, lastName, country, state, zip) values ('sachin.arora', 'sachin', 'arora', 'India', 'Delhi', 'EI33');
insert into children (childId, firstName, lastName, country, state, zip) values ('owen.oneill', 'Owen', 'O''Neill', 'IRL', 'D', 'EI33');
insert into children (childId, firstName, lastName, country, state, zip) values ('collin.oneill', 'Collin', 'O''Neill', 'IRL', 'D', 'EI33');
insert into children (childId, firstName, lastName, country, state, zip) values ('richie.rich', 'Richie', 'Rich', 'USA', 'CA', '94333');
insert into children (childId, firstName, lastName, country, state, zip) values ('johny.b.good', 'Johny', 'Good', 'USA', 'CA', '94333');
insert into children (childId, firstName, lastName, country, state, zip) values ('bart.simpson', 'Bart', 'Simpson', 'USA', 'CA', '94111');
insert into children (childId, firstName, lastName, country, state, zip) values ('dennis.menace', 'Dennis', 'Menace', 'USA', 'CA', '94222');
insert into children (childId, firstName, lastName, country, state, zip) values ('michael.myers', 'Michael', 'Myers', 'USA', 'PA', '18964');

Misc Queries:
cqlsh:demo> SELECT * FROM children ;
cqlsh:demo> select * FROM children WHERE childid='sachin.arora';
cqlsh:demo> create index country_index on children (country) ;
cqlsh:demo> select * FROM children WHERE childid='sachin.arora' and country='India';
cqlsh:demo> SELECT count(*) from children ;
cqlsh:demo> select * FROM children WHERE childid='sachin.arora' and country='India' and state='Delhi' Allow Filtering;
cqlsh:demo> SELECT * FROM children WHERE childid in('sachin.arora','owen.oneill') Allow filtering;

With this basic set of queries we are good to explore nosql cassandra database.

Separate table directories
Internally cassandra creates separate directories for keyspaces and column families. Casandra stores table to disk using separate table directories within each keyspace directory.
Data files are stored using this directory and file naming format:

/var/lib/cassandra/data/ks1/cf1/ks1-cf1-hc-1-Data.db

The new file name format includes the keyspace name to distinguish which keyspace and table the file contains when streaming or bulk loading data. Cassandra creates a subdirectory for each table, which allows you to symlink a table to a chosen physical drive or data volume.

Cassandra also provides thrift,hector,astyananx and many more APIs to interact with it

BigData World

Thursday, 29 August 2013

Cassandra Data Model

No comments:

Post a Comment