Cassandra in Detail


Nov 16, 2015    Janaki Mahapatra, Cassandra

What is Cassandra?
Cassandra is an open-source, distributed, decentralized, elastically scalable, highly available, column-oriented database.
  • Massively linearly scalable NoSQL database
  • Fully distributed, with no single point of failure
  • Free and open source, with deep developer support
  • Highly performant, with near-linear horizontal scaling in proper use cases
  • No single point of failure, due to horizontal scaling
What are horizontal and vertical scaling?
  • Horizontal scaling: add commodity hardware to a cluster
  • Vertical scaling: add RAM and CPUs to a specialized high-performance box
  Features:
  • Always-on architecture: Cassandra's masterless “ring” architecture provides your application’s end users with always-on access to their data, even in the event of rack, machine, or entire data center failure.
  • Native Multi-Data Center Replication: Cross data center (in multiple geographies) and multi-cloud availability zone support for writes/reads.
  • Fast Linear-Scale Performance: Enables millisecond response times with linear scalability (double your throughput with two nodes, quadruple it with four, and so on) to deliver the response times your customers have come to expect.
  • Flexible Data Model: The Apache Cassandra data model allows new entities or attributes to be added over time, so you are not restricted to a rigid data model that can’t evolve with the needs of the business application, such as the addition of a new complicated data structure that may be unique to your environment, or adding a new column to a column family.
  • Transparent Fault Detection and Recovery: Nodes that fail can easily be restored or replaced.
  • Tunable Data Consistency: Support for strong or eventual data consistency across a widely distributed cluster.
  • OpsCenter Monitoring/Management Tool: A graphical management and monitoring tool for Cassandra that provides a view of the system from a centralized dashboard.
  • Runs on Commodity Hardware: Apache Cassandra is built to run on commodity hardware. Don't waste another dime on disaster recovery, high-end hardware, or revenue loss due to downtime. Focus your resources on building a great application, not on maintaining an expensive backend.
  • Mitigate Risks of Downtime: Apache Cassandra’s architecture is built with no single point of failure. If a node (rack, machine, or entire data center) goes down, another is available to take its place and serve read/write requests without interruption.
  • Improved Customer Experience: Apache Cassandra’s high availability and superior performance give businesses and their mission-critical applications the ability to provide customers with a superior user experience.
  • Faster Time to Market: DataStax goes beyond standard open-source deployments by providing resources that make it easier to deliver Apache Cassandra in a single data center, or across multiple data centers and clouds.
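The "tunable consistency" feature above can be made concrete with a little arithmetic. The sketch below is only an illustration, not Cassandra code: the function names are made up for this post. It relies on the commonly documented rule that reads are strongly consistent when the number of replicas read (R) plus the number of replicas written (W) is greater than the replication factor (RF), and that a quorum is a majority of the replicas.

```python
def quorum(replication_factor: int) -> int:
    """Number of replicas that form a majority (what QUORUM waits for)."""
    return replication_factor // 2 + 1


def is_strongly_consistent(read_replicas: int, write_replicas: int,
                           replication_factor: int) -> bool:
    """True when every read set must overlap every write set (R + W > RF)."""
    return read_replicas + write_replicas > replication_factor


if __name__ == "__main__":
    rf = 3  # hypothetical replication factor for this example
    q = quorum(rf)
    print("QUORUM size for RF=3:", q)
    print("QUORUM reads + QUORUM writes strong?",
          is_strongly_consistent(q, q, rf))
    print("ONE read + ONE write strong?",
          is_strongly_consistent(1, 1, rf))
```

With a replication factor of 3, reading and writing at quorum touches 2 replicas each, so the two sets always share at least one replica. Reading and writing a single replica is faster but only eventually consistent.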
Cassandra is often discussed in terms of the CAP theorem, whose three properties are:
  • Consistency: do you get identical results, regardless of which node is queried?
  • Availability: can the cluster respond to very high write and read volumes?
  • Partition Tolerance: is the cluster still available when part of it goes dark?
In Cassandra:
  • Every node is identical
  • Nodes communicate peer to peer and use the gossip protocol to keep the list of nodes in sync
  • No special host to coordinate activities
  • No single point of failure
  • Easier to operate and maintain because all nodes are the same
  • It was designed from the ground up to take full advantage of multi-processor/multi-core machines and to run across dozens of these machines housed in multiple data centers
  • It scales consistently and seamlessly to hundreds of terabytes
  • Shows exceptional performance under heavy load
  • Consistently shows very fast throughput for writes per second on a basic commodity workstation
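To get a feel for how gossip keeps the node list in sync without a coordinator, here is a toy simulation. It is a simplified sketch and not Cassandra's actual gossip implementation: the `simulate` and `gossip_round` names, the single seed node, and the "merge both node lists" exchange are assumptions made for illustration. Each node starts out knowing only itself and the seed, and in every round it contacts one random peer it already knows.

```python
import random


def gossip_round(views: dict[int, set[int]], rng: random.Random) -> None:
    """Each node contacts one known peer; both end up with the merged list."""
    for node in sorted(views):
        peers = sorted(views[node] - {node})
        if not peers:
            continue  # this node does not know anybody else yet
        peer = rng.choice(peers)
        merged = views[node] | views[peer]
        views[node] = set(merged)
        views[peer] = set(merged)


def simulate(node_count: int, seed_node: int = 0, max_rounds: int = 20,
             rng_seed: int = 42) -> tuple[int, dict[int, set[int]]]:
    """Run gossip rounds until every node knows every other node."""
    rng = random.Random(rng_seed)
    everyone = set(range(node_count))
    # Every node initially knows only itself and the seed node.
    views = {n: {n, seed_node} for n in range(node_count)}
    rounds = 0
    while rounds < max_rounds and any(v != everyone for v in views.values()):
        gossip_round(views, rng)
        rounds += 1
    return rounds, views


if __name__ == "__main__":
    rounds, views = simulate(node_count=5)
    print("rounds needed:", rounds)
    for node, known in sorted(views.items()):
        print(f"node {node} knows {sorted(known)}")
```

No node plays a special role after startup: membership information spreads through pairwise exchanges only.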
Where to use Cassandra: Use if your application has
  1. Big data (billions of records)
  2. Very high velocity random reads and writes
  3. Flexible Sparse / wide column requirements
  4. No need for multiple secondary indexes
  5. Low latency
Use Cases
  1. E-commerce inventory cache
  2. Time series / events
  3. Feed-based activities
Where not to use
  • A traditional RDBMS excels when you need ACID-compliant transactions with rollback (e.g. a bank transfer)
  • Secondary Indexes
  • Relational Data
  • Transactional (Rollback, Commit)
  • Primary and Financial Records
  • Stringent Security and Authorization Needs on Data
  • Dynamic Queries on Columns
  • Searching Column Data
  • Low Latency
What are common Cassandra use cases? Cassandra is particularly useful for
  • Playlists and collections, e.g. Spotify
  • Personalization and recommendation engines, e.g. eBay
  • Messaging, e.g. Instagram
  • Fraud detection, e.g. Barracuda
  • Sensor data, e.g. Zonar
Who is using Cassandra now?
  • Netflix
  • Intuit
  • Twitter
  • eBay
  • Expedia
  • NASA
  • Cisco
What is CCM (Cassandra Cluster Manager)?
CCM (Cassandra Cluster Manager) is a tool written by Sylvain Lebresne that creates multi-node Cassandra clusters on the local machine. It is great for quickly setting up clusters for development and testing, and is the foundation that the Cassandra distributed tests (dtests) are built on.
Installing CCM: Install the following packages in your Linux VM.
  • sudo yum install -y python-pip;
  • sudo pip install cql PyYAML;
  • sudo yum install ant -y;
  • sudo yum install git -y;
Go to the /home directory in your Linux VM.
  • cd /home
  • sudo  git clone https://github.com/pcmanus/ccm.git
  • cd ccm
  • sudo ./setup.py install;
  • cd ..
CCM is now installed. To create a multi-node cluster on a single system, we have to create 2 or 3 IP aliases. To do that:
  • sudo ifconfig lo:2  127.0.0.2
  • sudo ifconfig lo:3  127.0.0.3
  • sudo ifconfig lo:4  127.0.0.4
Remember that these IP aliases are temporary and will not survive a reboot. To make them permanent, we have to add them to the network interface configuration. To do that:
  • cd /etc/sysconfig/network-scripts/
  • cp ifcfg-lo ifcfg-lo:0
  • cp ifcfg-lo ifcfg-lo:1
  • cp ifcfg-lo ifcfg-lo:2
  • cp ifcfg-lo ifcfg-lo:3
  Now open the files one by one with your preferred editor (vi or nano):
    • nano ifcfg-lo:0
    • Change the IP to 127.0.0.2
  Make the corresponding change in the other files, and you are good to go.
Now we are ready to create a multi-node Cassandra cluster in one VM. To create the cluster, type
  • ccm create --version 2.2.3 --nodes 3 --start MyCluster
After a few moments the cluster "MyCluster" will be ready to use. Here are some commands to check it:
  • ccm status
  • ccm node1 start
  • ccm node1 stop
Cassandra commands:
  • ccm node1 nodetool status
  • ccm node2 nodetool status
Cassandra Query Language (CQL Access)
  • ccm node1 cqlsh
To create a keyspace (database) on node1:
  • CREATE KEYSPACE MySampleDB WITH REPLICATION = { 'class' : 'SimpleStrategy', 'replication_factor' : 2 };
  • USE MySampleDB;
  • Now create some tables and experiment with them
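The same steps can also be scripted instead of typed into cqlsh. The sketch below is one possible way to do it with the DataStax Python driver (`pip install cassandra-driver`), which is not part of the setup above. Several things here are assumptions: that the driver version you install still supports the Cassandra version of your cluster, that node1 listens for CQL clients on 127.0.0.1 port 9042, and the `users` table, which is a made-up example.

```python
def create_keyspace_cql(name: str, replication_factor: int) -> str:
    """Build the CREATE KEYSPACE statement used in this post."""
    return (
        f"CREATE KEYSPACE IF NOT EXISTS {name} "
        "WITH REPLICATION = { 'class' : 'SimpleStrategy', "
        f"'replication_factor' : {replication_factor} }}"
    )


# Hypothetical example table; pick columns that fit your own data.
CREATE_TABLE_CQL = (
    "CREATE TABLE IF NOT EXISTS users ("
    "user_id int PRIMARY KEY, name text)"
)


def run_demo(host: str = "127.0.0.1", port: int = 9042) -> list:
    """Create the keyspace and table, insert one row, and read it back."""
    # Imported here so the helpers above work without the driver installed.
    from cassandra.cluster import Cluster

    cluster = Cluster([host], port=port)
    session = cluster.connect()
    try:
        session.execute(create_keyspace_cql("MySampleDB", 2))
        # Unquoted keyspace names are stored in lowercase.
        session.set_keyspace("mysampledb")
        session.execute(CREATE_TABLE_CQL)
        session.execute(
            "INSERT INTO users (user_id, name) VALUES (%s, %s)", (1, "Ada")
        )
        return list(session.execute("SELECT user_id, name FROM users"))
    finally:
        cluster.shutdown()
```

With the cluster running, calling `run_demo()` from a Python shell should return the inserted row. The statement builder is kept separate so it can be checked without a live cluster.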