9. NoSQL modeling¶

The ValueA framework includes a layer of abstraction on top of Cassandra.

Initial setup¶

Before we can start using our NoSQL database, we need to configure the client connector. For this we create a configuration file in the config directory of our project, which should in our case contain the following:

config/cassandra.conf¶

[servers]
server1=192.168.56.101

# No auth configured, ignore password credentials
#[authentication]
#username=cassandra
#password=cassandra

Note

Please note that by default no authentication method is configured in Cassandra (see /etc/cassandra/cassandra.yaml). If you wish to add an additional layer of security, you can use a PasswordAuthenticator; for more information, see the DataStax documentation.
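As a quick sanity check outside the framework, the file can be parsed with Python's standard configparser module. This is only a sketch: how the ValueA connector itself reads the file is not shown in this chapter, and read_cassandra_conf is a hypothetical helper introduced for illustration.

```python
import configparser

# Contents of config/cassandra.conf as shown above, inlined so the
# sketch is self-contained.
SAMPLE_CONF = """
[servers]
server1=192.168.56.101

# No auth configured, ignore password credentials
#[authentication]
#username=cassandra
#password=cassandra
"""


def read_cassandra_conf(text):
    """Return (list of server addresses, credentials dict or None)."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    servers = [address for _, address in parser.items("servers")]
    credentials = None
    # The section is commented out in the sample, so this stays None.
    if parser.has_section("authentication"):
        credentials = dict(parser.items("authentication"))
    return servers, credentials


servers, credentials = read_cassandra_conf(SAMPLE_CONF)
print(servers, credentials)
```

Lines starting with # are treated as comments, so the authentication section only becomes visible once it is uncommented.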

Testing our setup¶

Using the lines below we validate our configuration:

check_cassandra_connection.py¶

#!/usr/bin/env python
import valuea_framework.connectors.nosql

session = valuea_framework.connectors.nosql.Cassandra().session()

for record in session.execute('select cluster_name, cql_version from system.local'):
    print (record.cluster_name, record.cql_version)

session.shutdown()

This should return a response like (depending on the database version installed):

(u'ValueA', u'3.4.2')

Note

If the script prints errors (“Exception TypeError”) on close, but the expected output is there as well, it is safe to ignore them. Some versions of the Cassandra library have issues when closing the connection.

Creating your first model¶

NoSQL modeling using Cassandra is quite different from relational modeling, due to the distributed nature of the database.

In every design you have to take into account how data can be distributed over the computing nodes that make up the cluster.

Cassandra uses a hashing algorithm to spread data over the nodes. The hash is calculated from the partition key of the table (the columns marked primary_key in our models), and all rows that share this key form a partition.

Within the partition you can further define your table key using clustering keys, which help to define the uniqueness of your data but don’t steer its physical location.

The actual data in the table can use different formats, which is quite similar to SQL databases.

Below is an example of a table that collects measurements at different intervals.

Here connection and period make up the partition (and thus the physical location of the data), timestamp defines the uniqueness within the partition, and value is the measured value. We could also create the same object without period, but then all data for a connection ends up in a single partition: as the amount of data per connection grows, it can no longer be distributed over the nodes and every query has more data to process.
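To make the role of the partition key concrete, here is a toy sketch of hash-based placement. It is a deliberate simplification: Cassandra places partitions on a token ring (by default using a Murmur3 hash), whereas this sketch uses MD5 modulo the node count. The node names are made up.

```python
import hashlib

NODES = ["node-a", "node-b", "node-c"]  # hypothetical three-node cluster


def node_for(connection, period):
    """Pick a node from a hash of the partition key only."""
    key = "{}:{}".format(connection, period).encode("utf-8")
    token = int(hashlib.md5(key).hexdigest(), 16)
    return NODES[token % len(NODES)]


# Rows with the same (connection, period) land on the same node,
# whatever their timestamp (the clustering key) is.
first = node_for("000000000000000081", 201612)
second = node_for("000000000000000081", 201612)
print(first == second)

# Count how 100 connections within one month spread over the nodes.
spread = {}
for i in range(100):
    node = node_for("%018d" % i, 201612)
    spread[node] = spread.get(node, 0) + 1
print(spread)
```

The timestamp never enters node_for, which mirrors the point that clustering keys do not influence where data is stored.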

Note

When creating models (tables), choose your partitioning level wisely. There is quite some documentation available on the internet on how to size your cluster, but generally speaking you should try to keep a partition below roughly 100,000 records and 100 MB in size.
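A back-of-the-envelope calculation helps when picking a period. The sketch below assumes one row per measurement interval and a 31-day month, and compares the row count per partition with the guideline from the note above. The intervals are illustrative, and the size in MB is not estimated because it depends on the actual row layout.

```python
SECONDS_PER_DAY = 24 * 60 * 60
MAX_ROWS = 100000  # row guideline from the note above


def rows_per_partition(interval_seconds, days_in_period=31):
    """Rows one connection writes into a single period partition."""
    return (SECONDS_PER_DAY // interval_seconds) * days_in_period


for label, interval in [("15 minutes", 900), ("1 minute", 60), ("1 second", 1)]:
    rows = rows_per_partition(interval)
    verdict = "ok" if rows < MAX_ROWS else "too large, use a smaller period"
    print("%-10s %8d rows per month: %s" % (label, rows, verdict))
```

With quarter-hourly or per-minute measurements a monthly period stays well within the guideline; per-second measurements would call for a daily or even hourly period.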

Note

Because the cluster spreads data using a hash of the partition (primary key), you should always use a key to access your data. You shouldn’t try to scan all nodes in a cluster to find a specific record; rather, create an additional model that stores the path to your data. (Storage is cheap, so there’s no real reason to avoid duplicating data.)
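The idea of storing the path to your data can be illustrated with two plain dictionaries standing in for tables. This is a conceptual sketch only, not framework code: meter_serial and the table name meterdata_by_serial are hypothetical.

```python
# Main table: keyed by the partition (connection, period).
meterdata = {
    ("000000000000000081", 201612): [("2016-12-31 23:00:00", 0.946152)],
}

# Lookup table: keyed by what the client actually knows. The value is
# the partition key of the main table, so the data is deliberately
# duplicated instead of searched for.
meterdata_by_serial = {
    "SN-0081": ("000000000000000081", 201612),
}


def readings_for_serial(serial):
    """Two keyed reads instead of one scan over all partitions."""
    partition_key = meterdata_by_serial[serial]
    return meterdata[partition_key]


print(readings_for_serial("SN-0081"))
```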

model/valuea/samplenosql/__init__.py¶

from valuea_framework.connectors.nosql import Model
from valuea_framework.connectors.nosql import Column
from valuea_framework.connectors.nosql.types import String
from valuea_framework.connectors.nosql.types import TimeStamp
from valuea_framework.connectors.nosql.types import Float
from valuea_framework.connectors.nosql.types import Integer


class Meterdata(Model):
    _keyspace = "valuea"
    connection = Column(String(), primary_key=True)
    period = Column(Integer(), primary_key=True)
    timestamp = Column(TimeStamp(), clustering_key=True)
    value = Column(Float(), default_value=0)

Deploying the model works similarly to deploying the relational database models: execute the deploymodel command as shown below.

cd <your project directory>
deploymodel --rdbms --nosql .

deploy @ .
please execute the following as superuser in postgresql:
CREATE EXTENSION IF NOT EXISTS "uuid-ossp"

Note

On some versions of the Cassandra client library, error messages are emitted when a script terminates. If this happens, you can safely ignore them. Example:

Exception TypeError: "'NoneType' object is not callable" in <bound method Cassandra.__del__ of ...

Generate some test data¶

The next step is to put some data into our model and look at the results. To do this, we create a simple service that accepts a date and a number of connections, and creates one record per connection in our new table.

First the service:

services/valuea/samples/generate_nosql_testdata.py¶

import random
import valuea_framework.broker.Service
from valuea_framework.connectors.nosql import Cassandra
from valuea_framework.connectors.nosql.partitioner import TimeSeriePartitioner
from model.valuea.samplenosql import Meterdata


class Service(valuea_framework.broker.BaseService):
    def __init__(self, *args, **kwargs):
        super(Service, self).__init__(*args, **kwargs)

    def execute(self):
        msg = self.get_message(True)
        session = Cassandra()
        partitioner = TimeSeriePartitioner('month')
        total_value = 0.0
        for i in range(msg.connections):
            this_value = random.random()
            session.add(Meterdata(connection='%018d' % i,
                                  period=partitioner.get_partition(msg.date),
                                  timestamp=partitioner.get_value(msg.date),
                                  value=this_value)
                        )
            total_value += this_value

        session.commit()

        return {'status': 'done', 'total': total_value}

services/valuea/samples/generate_nosql_testdata.json¶

{
	"$schema": "http://json-schema.org/draft-04/schema#",
	"description": "generate_nosql_testdata",
	"properties": {
		"date": {
			"type": "string",
			"format": "date-time"
		},
		"connections": {
			"type": "integer"
		}
	},
	"required": [
		"date",
		"connections"
	]
}

The basic structure is the same as always: define a service (in this case of type BaseService), make sure to ship a schema (the json file), and fetch the message that was sent by the client. The actual implementation starts at line 14 of the Python file.

Our example expects two parameters: a date and an integer that defines the number of test records to generate.
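To show what the schema demands of a message, here is a hand-written check of the same two rules using only the standard library. It is a sketch, not the framework's validator and not a complete JSON Schema implementation: validate_message is a hypothetical helper, and datetime.fromisoformat is only an approximation of the date-time format.

```python
from datetime import datetime


def validate_message(message):
    """Return a list of problems; an empty list means the message passes."""
    problems = []
    for field in ("date", "connections"):
        if field not in message:
            problems.append("missing required property: %s" % field)

    if "connections" in message:
        value = message["connections"]
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(value, bool) or not isinstance(value, int):
            problems.append("connections must be an integer")

    if "date" in message:
        try:
            datetime.fromisoformat(message["date"])
        except (TypeError, ValueError):
            problems.append("date must be a date-time string")
    return problems


print(validate_message({"connections": 100, "date": "2017-01-01T00:00:00+01:00"}))
print(validate_message({"connections": "many"}))
```

Returning a list of messages mirrors how the validation output is treated in the runner script further on: an empty result means the service can be executed.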

Starting at line 14, the implementation does the following, step by step:

1. First we need a session, which we can use to query the NoSQL database (Cassandra). This is done in line 14.
2. Next we create a partitioner, which is used to generate periods for our data partitions. This is just a convenient helper to calculate time periods.
3. For our response message we use a variable to store the total value added (the sum of the random numbers generated in the loop).
4. Start a loop ranging from 0 up to (but not including) the number of connections defined in the message:
   - Calculate the value for this record using random.
   - Add a new Meterdata record, using the value calculated and the date provided. The record is partitioned by the connection and the period, which we calculate in the same lines.
     - Note that get_value converts the input date in the same way the partitioner does.
   - Add the record value to the total value, which is only used to report it back to the client.
5. Commit our data, which in the NoSQL case means the data is actually flushed to Cassandra. The transaction context known from relational systems doesn’t exist here, so everything done so far only lives in memory.
6. Return some data about the status and the total value added.
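The add-then-commit behaviour described in the steps above can be pictured with a toy session that buffers records in memory. BufferedSession is a hypothetical stand-in written for this illustration; it says nothing about how the framework's Cassandra class is implemented internally.

```python
class BufferedSession(object):
    """Toy stand-in: collects records in memory, writes them on commit()."""

    def __init__(self, storage):
        self._storage = storage  # plays the role of the database
        self._pending = []       # records added but not yet written

    def add(self, record):
        self._pending.append(record)

    def commit(self):
        self._storage.extend(self._pending)
        self._pending = []


database = []
session = BufferedSession(database)
session.add({"connection": "000000000000000001", "value": 0.5})
print(len(database))  # nothing has been written yet
session.commit()
print(len(database))  # the record has now been flushed
```

If the script stops before commit() is called, the pending records are simply lost, which is why the commit step must not be skipped.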

With our service installed, we can now execute it using the code below:

run_generate_nosql_testdata_local.py¶

#!/usr/bin/env python
import services.valuea.samples.generate_nosql_testdata

# construct a new object from the generate_nosql_testdata service
srv = services.valuea.samples.generate_nosql_testdata.Service()
srv.set_message({'connections': 100, 'date': '2017-01-01T00:00:00+01:00'})

# execute validations, normally processed within the listener
validation_output = srv.validate()
if validation_output:
    # Oo, our validation returned issues, print all
    for message in validation_output:
        print (message)
else:
    # all well, print result
    result = srv.execute()
    print (result)

Now that we have executed the service, we can look at the data that has landed in our Cassandra node.

  $ export CQLSH_NO_BUNDLED=TRUE ; cqlsh 192.168.56.101
  Connected to ValueA at 192.168.56.101:9042.
  [cqlsh 5.0.1 | Cassandra 3.6 | CQL spec 3.4.2 | Native protocol v4]
  Use HELP for help.
  cqlsh> select * from valuea.meterdata limit 5;

   connection         | period | timestamp                       | value
  --------------------+--------+---------------------------------+----------
   000000000000000094 | 201612 | 2016-12-31 23:00:00.000000+0000 | 0.546819
   000000000000000081 | 201612 | 2016-12-31 23:00:00.000000+0000 | 0.946152
   000000000000000089 | 201612 | 2016-12-31 23:00:00.000000+0000 |  0.97025
   000000000000000029 | 201612 | 2016-12-31 23:00:00.000000+0000 | 0.529876
   000000000000000098 | 201612 | 2016-12-31 23:00:00.000000+0000 | 0.459534

  (5 rows)

The commands actually entered are the lines starting with $ and cqlsh> in the text above.

One thing you will notice is that our dates are converted to the UTC timezone before they are stored. This default is defined in the TimeSeriePartitioner object and can easily be changed when constructing the object.
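The effect of the UTC conversion on the period can be reproduced with the standard library. The sketch below is a hypothetical stand-in for a monthly partitioner, not the framework's TimeSeriePartitioner, and it assumes the input date carries a UTC offset.

```python
from datetime import datetime, timezone


def month_partition(iso_date, tz=timezone.utc):
    """Return (period, timestamp) for an ISO 8601 date with a UTC offset."""
    moment = datetime.fromisoformat(iso_date).astimezone(tz)
    period = moment.year * 100 + moment.month
    return period, moment


period, stored = month_partition("2017-01-01T00:00:00+01:00")
print(period)              # 201612
print(stored.isoformat())  # 2016-12-31T23:00:00+00:00
```

Midnight on 1 January in Amsterdam is still 31 December in UTC, which is why the rows in the cqlsh output carry period 201612 rather than 201701.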

Note

The above query should only be executed on small tables: due to the distributed nature of the database, it may have to fetch records from every node. Normally on NoSQL databases you only query records with a known partition key.

Reading results¶

With some data in our (distributed) table, we can try to read something back. So let’s create a simple service to fetch all data within a period for a connection.

services/valuea/samples/read_nosql_testdata.py¶

import pytz
import valuea_framework.broker.Service
from valuea_framework.connectors.nosql import Cassandra
from valuea_framework.connectors.nosql.partitioner import TimeSeriePartitioner
from model.valuea.samplenosql import Meterdata


class Service(valuea_framework.broker.BaseService):
    def __init__(self, *args, **kwargs):
        super(Service, self).__init__(*args, **kwargs)

    def execute(self):
        result = list()
        msg = self.get_message(True)
        session = Cassandra()
        partitioner = TimeSeriePartitioner('month')
        for record in session.Query(Meterdata)\
                .filter_by(connection=msg.connection)\
                .filter_by(period=partitioner.get_partition(msg.date)):
            result.append({'connection': record.connection,
                           'period': record.period,
                           'timestamp': pytz.timezone('UTC').localize(record.timestamp)
                                            .astimezone(pytz.timezone('Europe/Amsterdam')),
                           'value': record.value})

        return result

services/valuea/samples/read_nosql_testdata.json¶

{
	"$schema": "http://json-schema.org/draft-04/schema#",
	"description": "read_nosql_testdata",
	"properties": {
		"date": {
			"type": "string",
			"format": "date-time"
		},
		"connection": {
			"type": "string"
		}
	},
	"required": [
		"date",
		"connection"
	]
}

We won’t describe the JSON schema here, because it is very similar to the ones created before. Like the previous service, this one starts by creating a session and a partitioner. Next we query our model using a connection (number) and a period (determined by the date in the input parameters).

For every record returned, we add a row to our result list. The complete list of results (in our example only one, provided the earlier code in this chapter was executed without modifications) is then returned to the client.

Note

The timestamp stored in our record comes back without timezone information. Because it was stored in UTC, we need to convert it to Amsterdam time upon retrieval.

Tip

If you want to retrieve only the data stored at the requested date, add another filter_by() for the timestamp field, using the same timestamp as the one used to store the record. Never drop the period filter, because it is part of the partition key (which tells the cluster where to look for your data).

We expect output in the form of:

[{'timestamp': datetime.datetime(2017, 1, 1, 0, 0, tzinfo=<DstTzInfo 'Europe/Amsterdam' CET+1:00:00 STD>), 'connection': u'000000000000000081', 'period': 201612, 'value': 0.9461519122123718}]
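The timezone-aware timestamp in this output comes from interpreting the stored naive value as UTC and then converting it. The service does this with pytz; the sketch below shows the same conversion using only the standard library, as an alternative rather than a description of the framework. The fixed-offset fallback is an assumption made so the sketch still runs without a timezone database, and it is only correct for winter dates.

```python
from datetime import datetime, timedelta, timezone

try:
    from zoneinfo import ZoneInfo  # Python 3.9+
    AMSTERDAM = ZoneInfo("Europe/Amsterdam")
except Exception:
    # Fallback when zoneinfo or the tz database is unavailable:
    # fixed CET offset, correct for winter dates only.
    AMSTERDAM = timezone(timedelta(hours=1))


def utc_naive_to_amsterdam(naive):
    """Interpret a naive datetime as UTC and convert it to Amsterdam time."""
    return naive.replace(tzinfo=timezone.utc).astimezone(AMSTERDAM)


local = utc_naive_to_amsterdam(datetime(2016, 12, 31, 23, 0))
print(local.isoformat())  # 2017-01-01T00:00:00+01:00
```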

For the basic constructions, our mapper is quite similar to the relational one from SQLAlchemy. Because you can’t join data in Cassandra, the more complex logic doesn’t exist here.

Next steps¶

You should now be able to store data in your own cluster and read it back using our models. This is a good time to add some more data and experiment with it. You can also combine relational data with data stored in the NoSQL cluster using the RelationalService from the previous chapter.