Java Programmer: Casandra

https://www.tutorialspoint.com/cassandra/cassandra_installation.htm

Cassandra

Brewer CAP theorem: consistency,availability,partition... and if we want Partition and availability always then cassandra .... use mangodb for consistency and partition...

Thus, rows are distributed randomly to the nodes in the cluster based on the hash values of their respective partition key
When Cassandra is installed, its default properties are stored in cassandra.yaml at the path $CASSANDRA_HOME/conf.
That's still true even though is not a big difference like in past. A write in general perform better because it doesn't involve too much the I/O -- a write operation is completed when the data has been both written in the commit log (file) and in memory (memtable). When the memtable reach the max size then all table is flushed in a disk sstable. When a read request for a row comes into a node in Cassandra, the row must be combined from all SSTables on that node that contain columns from the row in question as well as from any unflushed memtables, to produce the requested data. To optimize this piercing-together process, Cassandra uses an in-memory structure called a Bloom filter. Each SSTable has a Bloom filter associated with it that checks if any data for the requested row exists in the SSTable before any disk input/output

Free & Open source distributed wide-column store NOSQL database system
Support for clusters and asynchronous masterless replication…

A wide column store is a type of NoSQL database. It uses tables, rows, and columns, but unlike a relational database, the names and format of the columns can vary from row to row in the same table. A wide column store can be interpreted as a two-dimensional key-value store.[1]

Cassandra is designed to handle big data. Cassandra’s main feature is to store data on multiple nodes with no single point of failure.

All the nodes exchange information with each other using Gossip protocol. Gossip is a protocol in Cassandra by which nodes can communicate with each other.

Avinash and Prashant Malik in FB created it for usage in FB inbox search..

Cassandra is Greek mythology character…

Availability and partition tolerance are generally considered to be more important than consistency in Cassandra

Cassandra introduced the Cassandra Query Language (CQL).

CREATE COLUMNFAMILY MyColumns (id text, Last text, First text, PRIMARY KEY(id));

INSERT INTO MyColumns (id, Last, First) VALUES ('1', 'Doe', 'John');

SELECT * FROM MyColumns;

Cassandra cannot do joins or subqueries. Rather, Cassandra emphasizes denormalization through features like collections.

Group by is supported only on columns of primary keys that too in the same order of the clustering columns declared

Unlike a table in an RDBMS, different rows in the same column family do not have to share the same set of columns, and a column may be added to one or multiple rows at any time.[27]

Each key in Cassandra corresponds to a value which is an object. Each key has values as columns, and columns are grouped together into sets called column families. Thus, each key identifies a row of a variable number of elements. These column families could be considered then as tables. A table in Cassandra is a distributed multi dimensional map indexed by a key. Furthermore, applications can specify the sort order of columns within a Super Column or Simple Column family.

Cassandra is wide column store, and, as such, essentially a hybrid between a key-value and a tabular database management system.

Cassandra is a Java-based system that can be managed and monitored via Java Management Extensions (JMX).

Node
Node is the place where data is stored. It is the basic component of Cassandra.
Data Center
A collection of nodes are called data center. Many nodes are categorized as a data center.
Cluster
The cluster is the collection of many data centers.
Commit Log
Every write operation is written to Commit Log. Commit log is used for crash recovery.
Mem-table
After data written in Commit log, data is written in Mem-table. Data is written in Mem-table temporarily.
SSTable
When Mem-table reaches a certain threshold, data is flushed to an SSTable disk file.

In Cassandra, writes are very cheap
In Cassandra, writes are very cheap

In Cassandra, Keyspace is similar to RDBMS Database. Keyspace holds column families, indexes, user defined types, data center awareness, strategy used in keyspace, replication factor, etc.

There are following limitations in Cassandra query language (CQL).
1. CQL does not support aggregation queries like max, min, avg
2. CQL does not support group by, having queries.
3. CQL does not support joins.
4. CQL does not support OR queries.
5. CQL does not support wildcard queries.
6. CQL does not support Union, Intersection queries.
7. Table columns cannot be filtered without creating the index.
8. Greater than (>) and less than (<) query is only supported on clustering column.Cassandra query language is not suitable for analytics purposes because it has so many limitations.
* Indexes on collections are not supported in Cassandra.
* Without indexing on the column, Cassandra can't filter that column unless it is a primary key.
That's why, for filtering columns in Cassandra, indexes needs to be created.

* Cassandra collection cannot store data more than 64KB.
* Keep a collection small to prevent the overhead of querying collection because entire collection needs to be traversed.
* If you store more than 64 KB data in the collection, only 64 KB will be able to query, it will result in loss of data.

Email set,

A partitioner determines how the data should be distributed on the cluster. Partitioner uses a hash function to distribute data on the cluster. It takes partition key to calculate the hash. That hash is called token. Data is distributed on the basis of this token.

DevCenter is the front end query tool where you can write your query and execute it. DevCenter is provided by the Datastax.

In large organizations, they have thousands of servers. It is very difficult to monitor these servers that which server is running and which server is down. OpsCenter is the tool where you can monitor all the Cassandra nodes by sitting and watching on a single machine.

Besides Cassandra, we have the following NoSQL databases that are quite popular:

Apache HBase - HBase is an open source, non-relational, distributed database modeled after Google’s BigTable and is written in Java. It is developed as a part of Apache Hadoop project and runs on top of HDFS, providing BigTable-like capabilities for Hadoop.

MongoDB - MongoDB is a cross-platform document-oriented database system that avoids using the traditional table-based relational database structure in favor of JSON-like documents with dynamic schemas making the integration of data in certain types of applications easier and faster.

DBMS Cassandra
RDBMS deals with structured data. Cassandra deals with unstructured data.
It has a fixed schema. Cassandra has a flexible schema.
In RDBMS, a table is an array of arrays. (ROW x COLUMN) In Cassandra, a table is a list of “nested key-value pairs”. (ROW x COLUMN key x COLUMN value)
Database is the outermost container that contains data corresponding to an application. Keyspace is the outermost container that contains data corresponding to an application.
Tables are the entities of a database. Tables or column families are the entity of a keyspace.
Row is an individual record in RDBMS. Row is a unit of replication in Cassandra.
Column represents the attributes of a relation. Column is a unit of storage in Cassandra.
RDBMS supports the concepts of foreign keys, joins. Relationships are represented using collections.

Don't use Cassandra
Relational data
transactional(rollback, commit)
Primary financial records
Security and authorisation needs on data
Dynamic queries on columns

By default, Cassandra provides a prompt Cassandra query language shell (cqlsh) that allows users to communicate with it. Using this shell, you can execute Cassandra Query Language (CQL).

TTL (Time To Live)
Sometimes data inserted into a table becomes obsolete after a given duration. Expiring such data is performed using TTL in Cassandra. It is specified in seconds.

TTL Demo
Assume, an OTP (One Time Password) is generated for a customer transaction during a particular session. This OTP is valid only for 5 mins after which the row is automatically deleted from the table as illustrated below.

Creating the table transaction_otp to store the required details

CREATE TABLE transaction_otp_log
(session_id uuid PRIMARY KEY,
cust_id int,
otp int
);

Setting the OTP (say, 81726) to 5 mins (300 seconds) for the given customer transaction using INSERT query

INSERT INTO transaction_otp_log (session_id, cust_id, otp)
VALUES (uuid(), 1214, 81726) USING TTL 300;

Create materialized views from the base table for each query field as its partition key. For example

customer_id as the partition key

CREATE MATERIALIZED VIEW customer_wise_orders
AS
SELECT year, month, day, order_id, customer_id, amount, pay_method,status
FROM customer_orders
WHERE order_id IS NOT NULL AND customer_id IS NOT NULL
PRIMARY KEY (customer_id, order_id);

year as the partition key

CREATE MATERIALIZED VIEW year_wise_orders
AS
SELECT year, month, day, order_id, customer_id, amount, pay_method,status
FROM customer_orders
WHERE order_id IS NOT NULL AND year IS NOT NULL
PRIMARY KEY (year, order_id);

status as the partition key

CREATE MATERIALIZED VIEW status_wise_orders
AS
SELECT year, month, day, order_id, customer_id, amount, pay_method,status
FROM customer_orders
WHERE order_id IS NOT NULL AND status IS NOT NULL
PRIMARY KEY (status, order_id);

Points to remember
Need to include all primary key columns of the base table

Primary key of the materialized view must contain exactly one key column that is not part of the base table’s primary key

All primary key columns need to be filtered against IS NOT NULL

Filtering against other values in WHERE clause is not possible

Materialized views cannot be created on top of counter tables, or on other materialized views or on tables in other keyspaces

Filter rows based on non-primary key (Use ALLOW FILTERING). For e.g., retrieve log data for customer id 1214

SELECT * FROM customer_log
WHERE cust_id=1214 ALLOW FILTERING;

Note: ALLOW FILTERING has a performance hit as it needs to scan the entire table to fetch the required rows

Java Programmer

Wednesday, 1 August 2018

Casandra

No comments:

Blog Archive

Followers