Cassandra with Python: Simple to Complex

Pt. 6 of a series on the KillrVideo Python project

Jeff Carpenter
7 min readMar 1, 2019

This is the latest installment of a series about building a Python application with Apache Cassandra — specifically a Python implementation of the KillrVideo microservice tier. In previous posts I shared what motivated this project, how I started with infrastructure including GRPC and Etcd, the testing approach, and most recently, how I began implementing data access using Cassandra.

In this post we’ll look at some additional examples of data access using the DataStax Python Driver, ranging from the simple to the complex. (I’ll be making reference some driver / mapper concepts from the previous post so you may want to review that if you haven’t already.)

Keeping it simple

In the last post I shared how easy it was to implement the data access for the User Management Service using Apache Cassandra and the DataStax Python Driver. Using the cqlengine mapper was a big factor in my productivity. Because of that productivity boost, when implementing the other services I started with the mapper and only used other approaches when the complexity of the data access required it.

In fact, a couple additional services were implemented entirely using the mapper: the Statistics Service and the Ratings Service. These demonstrate Cassandra features including counters and batches.

Simple example 1: Counters in the Statistics Service

The Statistics Service (statistics_service.py) stores counts of how many time each video has been viewed. Counting statistics is one of the relatively few use cases for which the Cassandra counter type is a good fit, because and therefore a good example of how to manipulate counters with the mapper. The model class we used to track statistics is shown here:

class VideoPlaybackStatsModel(Model):
"""Model class that maps to the video_playback_stats table"""
__table_name__ = 'video_playback_stats'
video_id = columns.UUID(primary_key=True, db_field='videoid')
views = columns.Counter()

Note the use of the columns.Counter() type for the views column.

Incrementing the count of views in the record_playback_started() operation is very simple:

VideoPlaybackStatsModel(video_id=video_id).update(views=1)

Note that the value views=1 represents a single view of the video.

Simple example 2: Batch Writes in the Ratings Service

The Ratings Service (ratings_service.py) stores user ratings of videos and allows retrieval of the ratings either by user or by video. One interesting aspect of this service is the use of the mapper as part of a Cassandra batch when storing ratings in order to support writes to two denormalized tables supporting the “by user” and “by video” queries mentioned above. The code block below shows the data access code from the rate_video() operation:

now = datetime.utcnow()

# create and execute batch statement to insert into multiple tables
batch_query = BatchQuery(timestamp=now)
VideoRatingsByUserModel.batch(batch_query)\
.create(video_id=video_id, user_id=user_id, rating=rating)
# updating counter columns rating_counter and rating_total
# values are interpreted as amount to increment
VideoRatingsModel(video_id=video_id).\
update(rating_counter=1, rating_total=rating).batch(batch_query)

batch_query.execute()

In this example, we use two different mapper classes to write to two different tables. Note the use of each mapper’s batch() operation to add a statement to the batch_query, which we can then execute. The first statement created is a CQLINSERT into the killrvideo.video_ratings_by_user via the VideoRatingsByUserModel.

If you’re looking closely may you have also noticed something about that second write — we’re using the VideoRatingsModel, which goes to the killrvideo.video_ratings table to do a CQLUPDATE. This table is another interesting use of counters. In this case, two counter columns are used. The rating_counter tracks the number of times a video has been rated, while the rating_total tracks the sum of all ratings for the video. The average rating can then be calculated by the client via simple division (rating_total/rating_counter).

When simple doesn’t cut it

There were a few other services that involved more complex data access where the mapper couldn’t fully address my needs. These included the Video Catalog Service, the Comments Service, and the Search Service.

Complex example 1: Paging in the Comments Service

Although it’s not “YouTube scale”, the KillrVideo application is designed to support a very large number of videos, users, ratings, and other data types, in order to demonstrate best practices for data modeling and driver usage found in real-world applications.

With this in mind, imagine a user being presented with screens of videos and comments in a web browser on a client device. It would be infeasible and unnecessary to return the entire video catalog or the entire comment history for a popular video to the client at once, so of course some form of paging is required. However, paging is a feature that isn’t supported by the cqlengine mapper provided with the DataStax Python Driver. So, we need to fall back to other methods.

Paging in the KillrVideo Comments Service is used by the web application

A good example of the approach is found in the Comments Service (comments_service.py). Since paging only comes into effect on reads, we can use the mapper for writes, and then use regular CQL statements on the reads to get control over paging.

As an example, to access comments by user, we first create prepared statements in the constructor for the class:

# Prepared statements for get_user_comments()
self.userComments_startingPointPrepared = \
session.prepare('SELECT * FROM comments_by_user WHERE userid = ? AND (commentid) <= (?)')
self.userComments_noStartingPointPrepared = \
session.prepare('SELECT * FROM comments_by_user WHERE userid = ?')

Then, in the get_user_comments() operation, we use one of the prepared statements to create a bound statement which we can then execute:

bound_statement = None

if starting_comment_id:
bound_statement = self.userComments_startingPointPrepared
.bind([user_id, starting_comment_id])
else:
bound_statement = self.userComments_noStartingPointPrepared
.bind([user_id])

bound_statement.fetch_size = page_size
result_set = None

if paging_state:
# see below where we encode paging state to hex before returning
result_set = self.session.execute(bound_statement,
paging_state=paging_state.decode('hex'))
else:
result_set = self.session.execute(bound_statement)

Note where the fetch_size of the batch statement is set to the page_size requested by the client. The code after this iterates over the rows in the result_set to build up a list of comments to return to the client. If page_size rows were returned, then the Cassandra paging state is extracted from the result set and returned to the client:

if len(results) == page_size:
# Use hex encoding since paging state is raw bytes that won't encode to UTF-8
next_page_state = result_set.paging_state.encode('hex')

This allows the client to pass back the paging state on a subsequent call to get_user_comments() to retrieve the next page. Note that we encode/decode the paging state to a hex string, which works well over the Protobuf message format we’re using on our service interfaces.

The Video Catalog Service follows a similar paging approach, with an interesting twist. To read about this and some other nuances of paging in KillrVideo, check out my previous post on this topic. Although that post uses examples from the Java implementation of the microservice tier, the basic algorithms described in that post are also used in the Python implementation.

Complex example 2: Full-text Search in the Search Service

The Search Service presents a different sort of problem. If you’re familiar with Cassandra data modeling practices, you’ll be aware that Cassandra doesn’t support arbitrary searches, and the secondary index implementation that comes with Cassandra is known to perform poorly over large data sets.

Instead, the best practice is to design a table per query, with primary keys based on the attributes you will specify in each query. However, your options when you don’t know the exact key you are looking for, with limited support for range queries and no support for text search features like prefix/postfix or fuzzy matching.

However, we do have a couple of cases where we need text search features in the Search Service, in both the get_query_suggestions() operation, which performs a typeahead search for common search terms, and the search_videos() operation, which is used to search for videos containing a search term in the title or description.

Following the pattern set in the other language implementations of the KillrVideo services, the Python implementation of the Search Service uses DataStax Enterprise Search. First, we need a search index on the videos table (extracted from the CQL for the search schema):

CREATE SEARCH INDEX IF NOT EXISTS on videos;

Then, we create a prepared statement in the constructor of the Search Service:

self.search_videos_prepared = \
session.prepare('SELECT * FROM videos WHERE solr_query = ?')

Finally, in the search_videos() operation, we use the solr_query syntax supported by DSE Search to create a bound statement containing our desired CQL query:

solr_query = '{"q":"name:(' + query + ')^4 OR tags:(' + query + \
')^2 OR description:(' + query + ')", "paging":"driver"}'
bound_statement = self.search_videos_prepared.bind([solr_query])

We can then execute the bound statement and iterate over the results (not shown).

The solr_query defined above searches for the provided search term (query) in the name, tags, or description columns. The solr_query places additional weight on the name and description columns so that movies with the query appearing in the name column will appear the highest in search results, followed by those that have the search term in the description column.

And now for something completely different

In the past two posts I’ve taken you on a guided tour of the data access code for most of the KillrVideo Python services. The remaining service we haven’t discussed yet is the Suggested Videos Service. As you may have guessed, this involves building a recommender, which we’re doing using DataStax Enterprise Graph since that is built on top of Cassandra.

In upcoming posts I’ll share about this recommender, starting with how we shared data between services using Kafka in order to populate data into a graph used for recommendations.

Next post in this series: Kafka + Cassandra — like peanut butter and chocolate?

--

--

Jeff Carpenter
Jeff Carpenter

Written by Jeff Carpenter

I’m a software engineer who is passionate about distributed systems, architecture, and helping developers succeed. Opinions are my own.

Responses (1)