Tuesday 10 December 2013

DataStax Hybrid Cluster SetUp

This post will set up a hybrid DataStax Enterprise (DSE) cluster. Hybrid means one node runs plain Cassandra while another runs Solr (DSE Search).

Here I will show the configuration for one Cassandra node and one Solr node, but you can add the configuration for any number of nodes in the same way.

DSE-3.1.0 Multi node Hybrid cluster setup:

Here, we will set up a two-node cluster in which the first node is a Cassandra node and the second is a Solr node.

Let us say the Cassandra node has IP ip1 and the Solr node has IP ip2:

Cassandra node : ip1
Solr node : ip2

Prerequisites:

DSE-3.1.0 tar
Download the tar:
1. The DSE tar can be downloaded from: http://downloads.datastax.com/enterprise/dse-3.1.0-bin.tar.gz
Or you can use the wget command to download it:
wget http://<user_name>:<password>@downloads.datastax.com/enterprise/dse-3.1.0-bin.tar.gz
For this you must be registered with the DataStax site.

Configuration Steps:
1.      Place the tar in the same location on all nodes in the cluster.
            Location in this cluster: /home/softwares/dse-3.1.0
2.      Extract dse-3.1.0-bin.tar.gz on all nodes.
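For example, on each node (using the location above):

cd /home/softwares
tar -xzf dse-3.1.0-bin.tar.gz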

Configuration matrix:
For Cassandra node ip1

File: cassandra.yaml
Location of file: <DSE installation directory>/resources/cassandra/conf
Property/Value:

num_tokens                       1
initial_token                    -9223372036854775808
data_file_directories            Path/to/dseinstallation/resources/cassandra/tmp/var/lib/cassandra/data
commitlog_directory              Path/to/dseinstallation/resources/cassandra/tmp/var/lib/cassandra/commitlog
saved_caches_directory           Path/to/dseinstallation/resources/cassandra/tmp/var/lib/cassandra/saved_caches
seed_provider                    ip1
listen_address                   ip1
rpc_address                      ip1
read_request_timeout_in_ms       50000
range_request_timeout_in_ms      50000
write_request_timeout_in_ms      50000
request_timeout_in_ms            50000
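Note that in cassandra.yaml, seed_provider is not a flat key/value pair; the ip1 above goes into the seeds parameter of the SimpleSeedProvider block (the same format shown in the single-node setup post below):

seed_provider:
    - class_name: org.apache.cassandra.locator.SimpleSeedProvider
      parameters:
          - seeds: "ip1"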

File: log4j-server.properties
Location of file: <DSE installation directory>/resources/cassandra/conf
Property/Value:

log4j.appender.R.File            Path/to/dseinstallation/resources/cassandra/tmp/var/log/cassandra/system.log
log4j.appender.V.File            Path/to/dseinstallation/resources/cassandra/tmp/var/log/cassandra/solrvalidation.log

For Solr node ip2

File: cassandra.yaml
Location of file: <DSE installation directory>/resources/cassandra/conf
Property/Value:

num_tokens                       1
initial_token                    -6148914691236517206
data_file_directories            Path/to/dseinstallation/resources/cassandra/tmp/var/lib/cassandra/data
commitlog_directory              Path/to/dseinstallation/resources/cassandra/tmp/var/lib/cassandra/commitlog
saved_caches_directory           Path/to/dseinstallation/resources/cassandra/tmp/var/lib/cassandra/saved_caches
seed_provider                    ip1
listen_address                   ip2
rpc_address                      ip2
read_request_timeout_in_ms       50000
range_request_timeout_in_ms      50000
write_request_timeout_in_ms      50000
request_timeout_in_ms            50000

File: log4j-server.properties
Location of file: <DSE installation directory>/resources/cassandra/conf
Property/Value:

log4j.appender.R.File            Path/to/dseinstallation/resources/cassandra/tmp/var/log/cassandra/system.log
log4j.appender.V.File            Path/to/dseinstallation/resources/cassandra/tmp/var/log/cassandra/solrvalidation.log


Note:
·        The paths pointed to by the following properties must pre-exist:
                                    data_file_directories
                                    commitlog_directory
                                    saved_caches_directory

·        It is good to set the log directory explicitly so you know where the logs will be created, as shown in the log4j-server.properties entries above. Those paths must pre-exist as well.

·       Token Generation Utility: to calculate tokens, use the command below:

python -c 'print [str(((2**64 / number_of_tokens) * i) - 2**63) for i in range(number_of_tokens)]'

For example, to generate tokens for 6 nodes:

python -c 'print [str(((2**64 / 6) * i) - 2**63) for i in range(6)]'


['-9223372036854775808', '-6148914691236517206', '-3074457345618258604', '-2',
 '3074457345618258600', '6148914691236517202']

It displays the token for each node.

Now update the generated token value in the initial_token property in cassandra.yaml on each node.
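Note: the snippet above is Python 2 (integer division via /); an equivalent under Python 3 would be:

python3 -c 'print([str(((2**64 // number_of_tokens) * i) - 2**63) for i in range(number_of_tokens)])'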

Start the cluster:
Start the Cassandra node on ip1:       Path/to/dseinstallation/bin/dse cassandra

Start Solr on ip2:                     Path/to/dseinstallation/bin/dse cassandra -s

Check that your cluster is up and running:

            Tarball installs: Path/to/dseinstallation/bin/nodetool status

Now you can access the Solr admin UI at http://ip2:8983/solr/#/
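Once a table has been indexed, you can also hit the Solr HTTP API directly. A minimal smoke test, assuming a core named demo.users (DSE names Solr cores <keyspace>.<table>; demo.users is a hypothetical example):

curl "http://ip2:8983/solr/demo.users/select?q=*:*&wt=json"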

Thursday 29 August 2013

Cassandra Data Model

My previous posts related to Cassandra give an overview of what Cassandra is and how to install it.

This post describes how to insert and fetch data from the Cassandra database.

Cassandra keyspace and column family: A Cassandra keyspace is sort of like a relational database. It defines one or more column families, which are very roughly analogous to tables in the relational world. It is enough to think of a column family as a multidimensional ordered map that you don't have to define further ahead of time. Column families hold columns, and columns are the atomic unit of data storage.

Keyspaces: A cluster is a container for keyspaces—typically a single keyspace. A keyspace is the outermost container for data in Cassandra, corresponding closely to a relational database. Like a relational database, a keyspace has a name and a set of attributes that define keyspace-wide behavior.
To my knowledge, there are currently no naming conventions in Cassandra for such items.

Column families:
In the same way that a relational database is a container for tables, a keyspace is a container for a list of one or more column families. A column family is roughly analogous to a table in the relational model, and is a container for a collection of rows. Each row contains ordered columns. Column families represent the structure of your data. Each keyspace has at least one and often many column families.

Cassandra is considered schema-free because although the column families are defined, the columns are not. You can freely add any column to any column family at any time, depending on your needs.

Cassandra provides two interfaces to interact with it:
  • cassandra-cli
  • Cassandra CQL
Cassandra CQL provides an SQL-like interface to Cassandra tables.

Enter the Cassandra CLI: Run the following command to connect to your local Cassandra instance:
bin/cassandra-cli

You should see the following message, if successful:
Connected to: "Test Cluster" on 127.0.0.1/9160
Welcome to Cassandra CLI version 1.0.7
Type 'help;' or '?' for help.
Type 'quit;' or 'exit;' to quit.
[default@unknown]
You can access the online help with the 'help;' command.

Note: Commands are terminated with a semicolon (';') in the CLI.

Some basic commands to be run via cassandra-cli:

To see the name of the current cluster you're working in, type:

[default@unknown] show cluster name;
Test Cluster

To see which keyspaces are available in the cluster, issue this command:
[default@unknown] show keyspaces;
system

If you have created any keyspaces of your own, they will be shown as well.
The system keyspace is used internally by Cassandra, and isn’t for us to put data into. In this way, it’s similar to the master and temp databases in Microsoft SQL Server. This keyspace contains the schema definitions and is aware of any modifications to the schema made at runtime. It can propagate any changes made in one node to the rest of the cluster based on timestamps.


Create keyspace and column family via cli
create keyspace demo with placement_strategy = 'org.apache.cassandra.locator.SimpleStrategy' and strategy_options = {replication_factor:1};

CREATE COLUMN FAMILY users
WITH comparator = UTF8Type
AND key_validation_class=UTF8Type
AND column_metadata = [
{column_name: full_name, validation_class: UTF8Type},
{column_name: email, validation_class: UTF8Type},
{column_name: state, validation_class: UTF8Type},
{column_name: gender, validation_class: UTF8Type},
{column_name: birth_year, validation_class: LongType}
];

Inserting Data in column family:
[default@demo] SET users['testuser']['full_name']='Sachin';
[default@demo] SET users['testuser']['email']='sachtechie@gmail.com';
[default@demo] SET users['testuser']['state']='TX';
[default@demo] SET users['testuser']['gender']='M';
[default@demo] SET users['testuser']['birth_year']='1995';
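To read the data back, you can page through the rows of the column family with LIST (standard cassandra-cli; output not shown here):

[default@demo] LIST users;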

Secondary index on column:
The CLI can be used to create secondary indexes (indexes on column values). You can add a secondary index when you create a column family or add it later using the UPDATE COLUMN FAMILY command.
e.g., to add a secondary index to the birth_year column of the users column family:

[default@demo] UPDATE COLUMN FAMILY users
WITH comparator = UTF8Type
AND column_metadata = [{column_name: birth_year, validation_class: LongType, index_type: KEYS}];

Get the record from the table:
Because of the secondary index created on the birth_year column, its values can be queried directly for users born in a given year, as follows:

[default@demo] GET users WHERE birth_year = 1995;

Delete a row or column:
For example, to delete the state column for the testuser row key in the users column family:
[default@demo] DEL users ['testuser']['state'];
[default@demo] GET users ['testuser'];
Or to delete an entire row:
[default@demo] DEL users ['testuser'];

Cassandra CQL:

In CQL 3, identifiers, such as keyspace and table names, are case-insensitive unless enclosed in double quotation marks. You can force the case by using double quotation marks.

Enter the CQL shell in CQL 3 mode:
./cqlsh --cql3

Create keyspace and column family
CREATE KEYSPACE demo WITH strategy_class = 'SimpleStrategy' AND strategy_options:replication_factor='1';

create table children ( childId varchar, firstName varchar, lastName varchar, country varchar, state varchar, zip varchar, primary key (childId ) ) ;

insert into children (childId, firstName, lastName, country, state, zip) values ('sachin.arora', 'sachin', 'arora', 'India', 'Delhi', 'EI33'); 
insert into children (childId, firstName, lastName, country, state, zip) values ('owen.oneill', 'Owen', 'O''Neill', 'IRL', 'D', 'EI33');
insert into children (childId, firstName, lastName, country, state, zip) values ('collin.oneill', 'Collin', 'O''Neill', 'IRL', 'D', 'EI33');
insert into children (childId, firstName, lastName, country, state, zip) values ('richie.rich', 'Richie', 'Rich', 'USA', 'CA', '94333');
insert into children (childId, firstName, lastName, country, state, zip) values ('johny.b.good', 'Johny', 'Good', 'USA', 'CA', '94333');
insert into children (childId, firstName, lastName, country, state, zip) values ('bart.simpson', 'Bart', 'Simpson', 'USA', 'CA', '94111');
insert into children (childId, firstName, lastName, country, state, zip) values ('dennis.menace', 'Dennis', 'Menace', 'USA', 'CA', '94222');
insert into children (childId, firstName, lastName, country, state, zip) values ('michael.myers', 'Michael', 'Myers', 'USA', 'PA', '18964'); 

Misc queries:
cqlsh:demo> SELECT * FROM children;
cqlsh:demo> SELECT * FROM children WHERE childid='sachin.arora';
cqlsh:demo> CREATE INDEX country_index ON children (country);
cqlsh:demo> SELECT * FROM children WHERE childid='sachin.arora' AND country='India';
cqlsh:demo> SELECT count(*) FROM children;
cqlsh:demo> SELECT * FROM children WHERE childid='sachin.arora' AND country='India' AND state='Delhi' ALLOW FILTERING;
cqlsh:demo> SELECT * FROM children WHERE childid IN ('sachin.arora','owen.oneill') ALLOW FILTERING;
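Rows can be modified and removed through CQL as well; a couple of illustrative statements (values here are hypothetical, syntax is standard CQL 3):

cqlsh:demo> UPDATE children SET zip='94001' WHERE childid='sachin.arora';
cqlsh:demo> DELETE FROM children WHERE childid='bart.simpson';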


With this basic set of queries we are good to explore the NoSQL Cassandra database.

Separate table directories
Internally, Cassandra creates separate directories for keyspaces and column families. Cassandra stores tables on disk using separate table directories within each keyspace directory.
Data files are stored using this directory and file naming format:

/var/lib/cassandra/data/ks1/cf1/ks1-cf1-hc-1-Data.db

The new file name format includes the keyspace name to distinguish which keyspace and table the file contains when streaming or bulk loading data. Cassandra creates a subdirectory for each table, which allows you to symlink a table to a chosen physical drive or data volume.

Cassandra also provides Thrift, Hector, Astyanax and many more APIs to interact with it, as sketched below.
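As a minimal illustration (not code from this post), here is a sketch of reading back the full_name column written earlier using the Hector client, assuming hector-core and its dependencies are on the classpath:

import me.prettyprint.cassandra.serializers.StringSerializer;
import me.prettyprint.hector.api.Cluster;
import me.prettyprint.hector.api.Keyspace;
import me.prettyprint.hector.api.beans.HColumn;
import me.prettyprint.hector.api.factory.HFactory;
import me.prettyprint.hector.api.query.ColumnQuery;
import me.prettyprint.hector.api.query.QueryResult;

public class HectorRead {
    public static void main(String[] args) {
        // Connect over thrift (default rpc_port 9160)
        Cluster cluster = HFactory.getOrCreateCluster("Test Cluster", "127.0.0.1:9160");
        Keyspace keyspace = HFactory.createKeyspace("demo", cluster);
        StringSerializer ss = StringSerializer.get();

        // Read users['testuser']['full_name'] written via the CLI above
        ColumnQuery<String, String, String> query =
                HFactory.createColumnQuery(keyspace, ss, ss, ss);
        query.setColumnFamily("users").setKey("testuser").setName("full_name");
        QueryResult<HColumn<String, String>> result = query.execute();

        System.out.println(result.get() != null ? result.get().getValue() : "not found");
        HFactory.shutdownCluster(cluster);
    }
}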

Wednesday 28 August 2013

Sonar Set Up


Sonar is an open source web-based application to manage code quality, covering seven axes of code quality:
  • architecture and design
  • comments
  • duplications
  • unit tests
  • complexity
  • potential bugs
  • coding rules
Sonar is developed in Java. It can cover projects in Java, Flex, PHP, PL/SQL, Cobol and Visual Basic 6.
This post covers Sonar installation and how to use it:

Sonar Installation:

1. Download the Sonar server from http://dist.sonar.codehaus.org/. Let us install Sonar version 2.11 (sonar-2.11).

Running the Sonar server with the default Derby database:
Start the server: <path/to/sonar/installation/directory>/bin/linux-x86-32: ./sonar.sh start
e.g.: /home/impadmin/sws/sonar-2.11/bin/linux-x86-32: ./sonar.sh start
By default the server starts on port 9000. Verify the same in the UI: http://localhost:9000

Running the Sonar server using a MySQL database:
At the mysql prompt, run the queries below:
CREATE DATABASE sonar CHARACTER SET utf8 COLLATE utf8_general_ci;
CREATE USER 'sonar' IDENTIFIED BY 'sonar';
GRANT ALL ON sonar.* TO 'sonar'@'%' IDENTIFIED BY 'sonar';
GRANT ALL ON sonar.* TO 'sonar'@'localhost' IDENTIFIED BY 'sonar';
FLUSH PRIVILEGES; 
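Before pointing Sonar at the database, you can verify that the account works (assuming the mysql client is installed):

mysql -u sonar -psonar -e "SHOW DATABASES;"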

Once that is done, edit the sonar.properties file:
Path of file: <path/to/sonar/installation/directory>/conf/sonar.properties
e.g.: /home/impadmin/sws/sonar-2.11/conf/sonar.properties

Comment out the properties related to Derby:
#sonar.jdbc.url: jdbc:derby://localhost:1527/sonar;create=true
#sonar.jdbc.driverClassName: org.apache.derby.jdbc.ClientDriver
#sonar.jdbc.validationQuery: values(1)

Uncomment the properties related to MySQL:
#----- MySQL 5.x/6.x
# Comment the embedded database and uncomment the following properties to use MySQL. The validation query is optional.
sonar.jdbc.url: jdbc:mysql://localhost:3306/sonar?useUnicode=true&characterEncoding=utf8
sonar.jdbc.driverClassName: com.mysql.jdbc.Driver
sonar.jdbc.validationQuery: select 1 

Now stop and start the Sonar server:
Stop the server: <sonar/installation/directory>/bin/linux-x86-32: ./sonar.sh stop
Start the server: <sonar/installation/directory>/bin/linux-x86-32: ./sonar.sh start
By default the server starts on port 9000. Verify the same in the UI: http://localhost:9000
Verify that approximately 43 tables have been created in the sonar database in MySQL.

 
Sample code to show how to use this: Now Sonar is up and running. Let us create a sample application that shows the usage of Sonar.

1. Create a mavenized project, say TestSonar, in Eclipse.
2. Create a package named test.sonar in the project.
3. Copy One.java and OneTest.java into the test.sonar package.
4. Replace the pom.xml content with the pom.xml content below.
5. From a terminal, go to the project path and run the command below:
mvn clean install -Psonar sonar:sonar

One.java:

package test.sonar;

public class One {

    String message = "foo";
    String message2 = "toto";

    public String foo() {
        return message;
    }

    public String toto() {
        return message2;
    }

    public void uncoveredMethod() {
        System.out.println(foo());
    }
}

OneTest.java: Create a junit with name OneTest

package test.sonar;

import static org.junit.Assert.*;

import org.junit.Test;

public class OneTest {

    @Test
      public void testFoo() throws Exception {
        One one = new One();
        assertEquals("foo", one.foo());
      }

      @Test
      public void testBoth() throws Exception {
        One one = new One();
        assertEquals("toto", one.toto());
        assertEquals("foo", one.foo());
      }

}
 
pom.xml:

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
  <modelVersion>4.0.0</modelVersion>

  <groupId>org.codehaus.sonar</groupId>
  <artifactId>example-ut-maven-jacoco-runTests</artifactId>
  <version>1.0-SNAPSHOT</version>

  <!-- <name>UT coverage with Maven and JaCoCo running tests</name>-->
  <name>Code coverage with Maven and Sonar running tests</name>


  <properties>
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
    <sonar.language>java</sonar.language>

    <!-- Tells Sonar to run the unit tests -->
    <sonar.dynamicAnalysis>true</sonar.dynamicAnalysis>
    <!-- Tells Sonar to use JaCoCo as the code coverage tool -->
    <sonar.java.coveragePlugin>jacoco</sonar.java.coveragePlugin>
  </properties>

  <dependencies>
    <dependency>
      <groupId>junit</groupId>
      <artifactId>junit</artifactId>
      <!-- Minimal supported version is 4.7 -->
      <version>4.11</version>
      <scope>test</scope>
    </dependency>
  </dependencies>

  <build>
    <plugins>
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-compiler-plugin</artifactId>
        <configuration>
          <source>1.5</source>
          <target>1.5</target>
        </configuration>
      </plugin>
    </plugins>
  </build>

 
  <!-- BEGIN: Specific to mapping unit tests and covered code -->
  <profiles>
   <profile>
            <id>sonar</id>
            <activation>
                <activeByDefault>true</activeByDefault>
            </activation>
            <properties>
                <!-- SERVER ON A REMOTE HOST -->
               <sonar.jdbc.url>jdbc:mysql://localhost:3306/sonar?useUnicode=true&amp;characterEncoding=utf8</sonar.jdbc.url>
                <sonar.jdbc.driverClassName>com.mysql.jdbc.Driver</sonar.jdbc.driverClassName>
                <sonar.jdbc.username>sonar</sonar.jdbc.username>
                <sonar.jdbc.password>sonar</sonar.jdbc.password>
                <sonar.host.url>http://localhost:9000</sonar.host.url>
            </properties>
 </profile>

    <profile>
      <id>coverage-per-test</id>
      <build>
        <plugins>
          <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-surefire-plugin</artifactId>
            <!-- Minimal supported version is 2.4 -->
            <version>2.13</version>
            <configuration>
              <properties>
                <property>
                  <name>listener</name>
                  <value>org.sonar.java.jacoco.JUnitListener</value>
                </property>
              </properties>
            </configuration>
          </plugin>
        </plugins>
      </build>

      <dependencies>
        <dependency>
          <groupId>org.codehaus.sonar-plugins.java</groupId>
          <artifactId>sonar-jacoco-listeners</artifactId>
          <version>1.2</version>
          <scope>test</scope>
        </dependency>
      </dependencies>
    </profile>
  </profiles>
  <!-- END: Specific to mapping unit tests and covered code -->
 
</project>

Once the build completes successfully, verify that a project named "Code coverage with Maven and Sonar running tests" is displayed in the Sonar UI. Opening the project shows its dashboard with the coverage and test metrics.

OneTest.java has two test cases. You can check the time taken by each test case.
Comment out or add more test cases to see how the metrics change.


Wednesday 21 August 2013

Know About Cassandra

Cassandra:

This post will give a brief introduction to one of the NoSQL databases, Cassandra.

Cassandra in 50 Words or Less
“Apache Cassandra is an open source, distributed, decentralized, elastically scalable, highly available, fault-tolerant, tuneably consistent, column-oriented database that bases its distribution design on Amazon’s Dynamo and its data model on Google’s Bigtable.”

Distributed
Cassandra is distributed, which means that it is capable of running on multiple machines while appearing to users as a unified whole.

Decentralized
Cassandra, however, is decentralized, meaning that every node is identical; no node acts as master or slave, and no Cassandra node performs organizing operations distinct from any other node. Instead, Cassandra features a peer-to-peer protocol and uses gossip to maintain and keep in sync a list of nodes that are alive or dead.

The fact that Cassandra is decentralized means that there is no single point of failure. All of the nodes in a Cassandra cluster function exactly the same. This is sometimes referred to as “server symmetry.”

Elastic Scalability
Scalability is an architectural feature of a system that can continue serving a greater number of requests with little degradation in performance.

There are two types of scaling:
Vertical scaling—simply adding more hardware capacity and memory to your existing machine—is the easiest way to achieve this.

Horizontal scaling means adding more machines that have all or some of the data on them so that no one machine has to bear the entire burden of serving requests.

But then the software itself must have an internal mechanism for keeping its data in sync with the other nodes in the cluster.

Elastic scalability refers to a special property of horizontal scalability. It means that your cluster can seamlessly scale up and scale back down. To do this, the cluster must be able to accept new nodes that can begin participating by getting a copy of some or all of the data and start serving new user requests without major disruption or reconfiguration of the entire cluster. You don’t have to restart your process. You don’t have to change your application queries. You don’t have to manually rebalance the data yourself.
Just add another machine—Cassandra will find it and start sending it work.

Scaling down, of course, means removing some of the processing capacity from your cluster.

High Availability

In general architecture terms, the availability of a system is measured according to its ability to fulfill requests.

Cassandra is highly available. You can replace failed nodes in the cluster with no downtime, and you can replicate data to multiple data centers to offer improved local performance and prevent downtime if one data center experiences a catastrophe such as fire or flood.

The replication factor lets you decide how much you want to pay in performance to gain more consistency. You set the replication factor to the number of nodes in the cluster you want the updates to propagate to (remember that an update means any add, update, or delete operation).

Tuneable Consistency:
Consistency essentially means that a read always returns the most recently written value.
But Cassandra is more accurately termed “tuneably consistent,” which means it allows you to easily decide the level of consistency you require, in balance with the level of availability.
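As a sketch of what tuning looks like in practice (assuming a cqlsh version that supports the CONSISTENCY command; older clients set the level per request through their driver API instead):

cqlsh> CONSISTENCY QUORUM

After this, reads and writes in the session must be acknowledged by a quorum of replicas before they succeed.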

Setup Cassandra Node

Cassandra Installation:

This post will help you set up a Cassandra node.

1. Download Cassandra from http://cassandra.apache.org. I am installing apache-cassandra-1.2.3-bin.tar.gz
2. Unzip this file using gunzip apache-cassandra-1.2.3-bin.tar.gz
3. Untar it using tar -xvf apache-cassandra-1.2.3-bin.tar 
4. Modify the cassandra.yaml file. The path of this file is </path/to/cassandra/installation/conf>
5. In cassandra.yaml you will find the following configuration options:
    initial_token:
    <Generate the token value using the ./token-generator tool, explained at the end of this post>
    data_file_directories (/var/lib/cassandra/data),
    commitlog_directory (/var/lib/cassandra/commitlog), and
    saved_caches_directory (/var/lib/cassandra/saved_caches).
    seed_provider:
    # Addresses of hosts that are deemed contact points.
    # Cassandra nodes use this list of hosts to find each other and learn
    # the topology of the ring.  You must change this if you are running
    # multiple nodes!
    - class_name: org.apache.cassandra.locator.SimpleSeedProvider
      parameters:
          # seeds is actually a comma-delimited list of addresses.
          # Ex: "<ip1>,<ip2>,<ip3>"
          - seeds: "192.168.256.78"
   listen_address: localhost
   rpc_address: localhost

Make sure all data directories pre-exist and are writable; see the example below.
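For example, for the paths used in the sample below (bash brace expansion):

mkdir -p /home/apache-cassandra-1.1.0/tmp/var/lib/cassandra/{data,commitlog,saved_caches}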

e.g: Updated cassandra.yaml file:
initial_token: 0
saved_caches_directory: /home/apache-cassandra-1.1.0/tmp/var/lib/cassandra/saved_caches
data_file_directories:
    - /home/apache-cassandra-1.1.0/tmp/var/lib/cassandra/data
commitlog_directory: /home/apache-cassandra-1.1.0/tmp/var/lib/cassandra/commitlog
seeds: "192.168.256.78"
Note: seeds takes a comma-separated list of node IPs
listen_address: 192.168.256.78
rpc_address: 192.168.256.78

6. By default, Cassandra will write its logs to /var/log/cassandra/.
Make sure this directory exists and is writable, and update the log4j-server.properties file:
log4j.appender.R.File=/var/log/cassandra/system.log
The path to the log4j-server.properties file is </path/to/cassandra/installation/conf>, e.g.:
log4j.appender.R.File=/home/apache-cassandra-1.1.0/tmp/var/log/cassandra/system.log


Token Generation: With the ./token-generator tool you can generate tokens for n nodes in a cluster and then update the generated values in the initial_token property of cassandra.yaml on each node.
The path of ./token-generator is </path/to/cassandra/installation/tools/bin>
eg: Generating token for 2 nodes:
./token-generator 2
  DC #1:
  Node #1:                                        0
  Node #2:   85070591730234615865843651857942052864

Now update initial_token with the generated values on each node in the cluster.

Start Cassandra:
Start the Cassandra daemon using 'bin/cassandra -f'.
The service should start in the foreground and log gratuitously to the console.
If you do not want to see logs on screen, omit the -f option; without it, Cassandra runs in the background.
It starts a CassandraDaemon process, which can be checked using jps, as shown below.
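For example (jps ships with the JDK):

jps | grep CassandraDaemon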

Check the cluster state:
bin/nodetool -h <node_ip> ring

Stop Cassandra:
You can stop the process by killing it, using 'pkill -f CassandraDaemon'