The long sprint — building a Graph Recommender

Part 8 of a series on the KillrVideo Python implementation

Jeff Carpenter
May 29, 2019

Over the past couple of months I’ve been blogging about building a Python application with Apache Cassandra — specifically a Python implementation of the KillrVideo microservice tier — and we’re getting close to the end!

In the first few posts I shared why I started this project, how I incorporated gRPC and etcd, and the testing approach. Then we looked at implementing the KillrVideo Python services, including simple and more complex data access using various features of the DataStax Python Driver. In the most recent post I described how I introduced Kafka into the architecture to pass events between services. Now we’re entering the final stage: building a graph recommendation engine using DataStax Graph in order to implement the SuggestedVideosService.

Live Coding on Twitch

For this stage of the project, I decided it would be helpful to get some graph expertise from my colleague David Gilardi (@SonicDMG on Twitter), so we started live streaming the development work on our DataStax Developers channel on Twitch. You can see some of our past streams on our YouTube channel.

Jeff and David live coding on KillrVideo Python

The great thing about the live coding approach was how much I was able to learn from David, drawing on his past experience with DataStax Graph. With the contributions of viewers on the Twitch stream, it was kind of like pair programming++.

However, once I started doing a significant amount of the development work on Twitch streams, and minimal work outside of those streams, my pace of progress slowed considerably. I was now only coding a couple of hours a week on the KillrVideo Python project. That made this final part of the project the longest sprint by far: about three months instead of two weeks.

The graph recommender “Sprint 4” took a bit longer than the other sprints due to live coding and travel

The Approach — Why Graph?

There are a number of approaches we could have taken to implementing a recommender for the KillrVideo Python services, including Python libraries for machine learning tasks such as recommendations. For this implementation, I decided to emulate the approach we took in the other language implementations of KillrVideo, which was to build a recommender using DataStax Enterprise Graph (DSE Graph).

The basic approach of this implementation is as follows:

  • Graph population: the SuggestedVideosService consumes events published to Kafka topics by other services and uses that information to populate a graph stored in DSE Graph. The events consumed include new users, video uploads, and user ratings of videos.
  • Recommendation: when the web application requests suggested videos for a particular user, the service generates recommendations by running a query on the graph that leverages the relationships between users and videos, such as uploads and ratings. We’ll discuss the specifics of the algorithm below.

Creating a Graph Schema

First, we’ll need a schema to describe the structure of our graph. The information we need to capture includes entities such as users and videos, and relationships, such as which users uploaded or rated which videos. The entities become vertices in our graph, and the relationships become edges. This schema is shown graphically in the figure below, which was captured from the schema viewer built into DataStax Studio.

Graph schema used for our KillrVideo recommendation algorithm

Users and videos are modeled as vertices in the graph, with relationships between users and videos to capture user ratings of videos and to identify which user uploaded each video. Note that in addition to users and videos, we’ve also elected to model tags as vertices in the graph. This was done to support navigating relationships in the graph between videos that share a tag. We experimented with various recommendation algorithms, including some that leveraged videos with the same tag. As a starting point, I elected to use the same algorithm used by the Java implementation.

For the purposes of our application, the schema is actually written and communicated to DSE using the Gremlin graph traversal language, which is part of the Apache TinkerPop project and is widely used across graph databases. Here are the key portions of the schema that define the vertex and edge types and the properties used in the graph:

Property, Vertex, and Edge labels used in the KillrVideo recommendation graph

Check out the KillrVideo GitHub for the full schema and the code that loads this schema into DataStax Enterprise.
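To give a flavor of what that looks like, here’s a minimal, hypothetical sketch of how a few of these elements might be declared using the DSE Graph schema API, submitted from Python as Groovy strings via session.execute_graph() (assuming a session connected as in the console example later in this post). The label and property names mirror the figure above, but the actual statements in the repo are more complete and may differ in detail:

# Hypothetical sketch of schema creation via the string-based API;
# the real statements live in the KillrVideo GitHub repo
schema_statements = [
    "schema.propertyKey('userId').Uuid().single().create()",
    "schema.propertyKey('videoId').Uuid().single().create()",
    "schema.propertyKey('name').Text().single().create()",
    "schema.propertyKey('rating').Int().single().create()",
    "schema.vertexLabel('user').properties('userId', 'email', 'added_date').create()",
    "schema.vertexLabel('video').properties('videoId', 'name', 'description').create()",
    "schema.vertexLabel('tag').properties('name', 'tagged_date').create()",
    "schema.edgeLabel('uploaded').connection('user', 'video').create()",
    "schema.edgeLabel('rated').multiple().properties('rating').connection('user', 'video').create()",
    "schema.edgeLabel('taggedWith').connection('video', 'tag').create()",
]
for statement in schema_statements:
    session.execute_graph(statement)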

Adding Data to the Graph

Before we can generate recommendations, we need to write the code that populates the graph with data. In the previous post in this series I introduced the code that processes events from Kafka, which you can find in suggested_videos_events_kafka.py. That event processing code makes calls into the implementation of the Suggested Videos Service (suggested_videos_service.py), specifically the functions handle_user_created(), handle_youtube_video_added(), and handle_user_rated_video().

For example, the code to add a new user to the graph is quite simple:

def handle_user_created(self, user_id, first_name, last_name, email, timestamp):
    # add a 'user' vertex and set its identifying properties
    self.graph.addV('user') \
        .property('userId', user_id).property('email', email) \
        .property('added_date', timestamp).next()

The graph object is what is known as a “traversal source”, to which we add operations (steps) such as addV() to add a vertex and property() to set the properties of the vertex. The next() operation is what is known as a terminal step, which tells the graph engine to execute the traversal. (By the way, the best reference I’ve found for learning these Gremlin operations is the Apache TinkerPop documentation.)
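To make the terminal steps concrete, here’s a quick illustration of the most common ones; graph is a traversal source like self.graph above, and user_id stands in for a real value:

# next() executes the traversal and returns the first result
vertex = graph.V().has('user', 'userId', user_id).next()

# iterate() executes purely for side effects, discarding results
graph.addV('user').property('userId', user_id).iterate()

# toList() executes and collects all results into a Python list
names = graph.V().hasLabel('video').values('name').toList()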

The constructor for the class is where we instantiate the graph traversal source, providing a DseSession object and the name of the graph:

def __init__(self, session):
    self.session = session
    # traversal source bound to the recommendations graph
    self.graph = DseGraph.traversal_source(session=self.session,
                                           graph_name='killrvideo_video_recommendations')
    self.suggested_videos_consumer = SuggestedVideosConsumer(self)

Accessing DataStax Graph in Python

Despite the apparent simplicity of the sample above, I initially struggled with writing this code to insert the data into the graph. To explain why, let me take a step back. It’s helpful to understand that the Python driver supports different API styles for interacting with graphs, as described in the DataStax documentation:

  • The DseSession.execute_graph() operation allows you to provide a Gremlin traversal as a string in the standard Groovy syntax that you might use in the gremlin-console or DataStax Studio. For example, the code to select a user might look like session.execute_graph("g.V().has('user', 'userId', '" + user_id + "')"). This can be handy for simple traversals, but as you can see, building up query strings gets unwieldy for larger traversals that reference multiple variables.
  • The Graph fluent API allows you to create a traversal source that is accessible directly in Python with a syntax that is quite similar to Groovy, but with some occasional differences, for example the code to select a user might look like: g.V().has('user', 'userId', user_id)
  • If you use the fluent API, you also have the option of developing a Domain-Specific Language (DSL) to create more natural-sounding graph traversals that might be more easily understood by those familiar with the application domain. For example, we might have something like g.user('Fred').rateVideo('video_id', 5), where operations like user() and rateVideo() operate on a traversal to add steps.

The Java implementation of the KillrVideo graph recommender uses a DSL, so I considered using the same approach for the Python implementation. However, I felt the approach that would help me to learn DataStax Graph most effectively would be to use the fluent API first, and then consider factoring some of the traversal details into a DSL as a follow-up step.
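To make that follow-up step concrete, here’s a hypothetical sketch of what such a DSL layer could look like in Python. Rather than the full TinkerPop pattern of subclassing the traversal classes, these are plain helper functions that append steps to a traversal; the names user() and rated_videos() are invented for illustration (graph is a traversal source, some_user_id a placeholder):

from gremlin_python.process.traversal import P

def user(g, user_id):
    # start a traversal at the user vertex with the given id
    return g.V().has('user', 'userId', user_id)

def rated_videos(t, min_rating):
    # continue from users to the videos they rated at or above min_rating
    return t.outE('rated').has('rating', P.gte(min_rating)).inV()

# reads closer to the domain than the raw fluent API
highly_rated = rated_videos(user(graph, some_user_id), 4).values('name').toList()

Either style keeps the graph vocabulary (labels like 'rated' and keys like 'userId') in one place instead of scattered throughout the service code.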

Leveraging the Python Console

With that context, I can better explain the challenges I experienced in using the fluent API. First, I started out with a pretty naive development process:

  • Study the Python fluent API syntax to locate the desired operations to build my traversal
  • Implement calls using the fluent API in suggested_videos_service.py
  • Restart the Python services
  • Run generator to create test data
  • Debug
  • Repeat

This approach felt pretty slow. As a fairly new Python developer, I had failed to appreciate the value of testing out syntax in the interactive console. Once I did, I put together a short list of Python commands I could use to initialize the DataStax Python Driver and start executing traversals:

from dse.cluster import Cluster, EXEC_PROFILE_GRAPH_DEFAULT
from dse_graph import DseGraph

# execution profile bound to the graph we want to traverse
ep = DseGraph.create_execution_profile('killrvideo_video_recommendations')
cluster = Cluster(execution_profiles={EXEC_PROFILE_GRAPH_DEFAULT: ep}, contact_points=['10.0.75.1'])
session = cluster.connect("killrvideo")
graph = DseGraph.traversal_source(session=session, graph_name='killrvideo_video_recommendations')

These few lines of code would give me a working graph traversal source that I could use at the console. This allowed me to get a tighter feedback loop on syntax errors in my usage of the fluent API.
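For example, a couple of one-liners are enough to confirm the connection and try out a candidate traversal before it goes anywhere near the service code (some_user_id stands in for a real id):

# quick smoke test at the console: count user vertices
graph.V().hasLabel('user').count().next()

# try a candidate traversal before moving it into the service
graph.V().has('user', 'userId', some_user_id).out('rated').values('name').toList()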

Leveraging DataStax Studio

Using the Python console was definitely helpful, but I soon realized that another reason for my slow progress was that I was trying to tackle two learning curves at once. It wasn’t just that my Python syntax was wrong; my Gremlin traversals had errors as well. So I took another step back and used DataStax Studio to write my traversals in Groovy first, to make sure they were correct. Again, this shortened my development cycle.

I soon arrived at a process I’d recommend, at least for newer graph users:

  • Develop Gremlin traversals in Groovy using DataStax Studio or gremlin-console
  • Translate traversals into the fluent API in Python (or other chosen language), testing them out at the console (where available)
  • Move working traversals into application code

Gremlin Python API Caveats

The step of translating traversals into the Python fluent API has a couple of tricks to it:

  • It’s not immediately clear how to locate the correct package for each operation you might want to perform in a traversal.
  • There are some reserved words in Python, such as not, as, and or. In these cases, the Gremlin step names have an underscore appended (not_, as_, or_, etc.).

Fortunately I was able to find the answers to these questions in the gremlin-python documentation from the Apache TinkerPop project.
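As a concrete illustration of the second point, here’s what the renaming looks like in practice (a made-up traversal, just to show the Groovy-to-Python mapping; it assumes the graph traversal source from earlier and a user_id value):

from gremlin_python.process.graph_traversal import __
from gremlin_python.process.traversal import P

# Groovy:  g.V().has('user','userId',uid).as('u').out('rated').in('rated').where(neq('u'))
# Python:  reserved words pick up a trailing underscore: as_, in_, not_, or_ ...
t = graph.V().has('user', 'userId', user_id).as_('u') \
    .out('rated').in_('rated') \
    .where(P.neq('u'))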

More complex data insertion

My improved development process served me well for adding new videos to the graph. As you may recall from the graph schema shown above, videos are represented as vertices in the graph, with edges to users and tags. Adding a video to the graph actually involves creating multiple vertices and edges:

  1. Creating the vertex for the video
  2. Locating the vertex representing the user and creating an “uploaded” edge to the video vertex
  3. Locating or creating the vertices representing each tag for the video, and creating “taggedWith” edges from the video to each tag

My initial approach for adding a video was to create a separate graph traversal for each stage of the process. However, as I looked more closely at the equivalent code in the Java implementation, I realized that the “Gremlin way” was to do it all as a single traversal. It actually turned out to be easier to develop this way (although I took advantage of the fluent API to build things up a bit at a time, it is still a single traversal). Part of this was due to the ability to locate particular vertices in the graph and use the as_() operator to give them names that can be referenced later:

def handle_youtube_video_added(self, video_id, user_id, name,
                               description, location,
                               preview_image_location,
                               tags, added_date, timestamp):
    # make sure tags are unique (no duplicates)
    unique_tags = set(tags)

    # locate user vertex
    traversal = self.graph.V() \
        .has('user', 'userId', user_id).as_('^user')

    # add video vertex
    traversal = traversal.addV('video') \
        .property('videoId', video_id) \
        .property('added_date', added_date) \
        .property('description', description) \
        .property('name', name) \
        .property('preview_image_location', preview_image_location) \
        .as_('^video')

    # add edge from user to video vertex
    traversal = traversal.addE('uploaded') \
        .from_('^user').to('^video') \
        .property('added_date', added_date)

    # find vertices for tags and add edges from video vertex
    for tag in unique_tags:
        traversal = traversal.addE('taggedWith') \
            .from_('^video').to(__.coalesce(
                __.V().has('tag', 'name', tag),
                __.addV('tag').property('name', tag)
                  .property('tagged_date', added_date)))

    # execute the traversal
    traversal.iterate()

Another very interesting part of this traversal is the coalesce() step, which takes multiple traversals as input and returns the result of the first one that produces anything. In this case, we first attempt to locate the given tag by name (__.V().has('tag', 'name', tag)), and if the tag does not exist, we create it (__.addV('tag')...).

This time we use iterate() as the terminal step, since we only need the side effects of executing the traversal, not its results.

Finally, you may have noticed the unusual-looking __ operator. This is known as an anonymous traversal and is the starting point for the nested traversals we create within a larger traversal.
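Putting the last two points together, the get-or-create idiom for a tag can also be written as a standalone traversal. A minimal sketch with a hard-coded tag name for illustration:

from gremlin_python.process.graph_traversal import __

# get-or-create a 'tag' vertex: coalesce() yields the result of the
# first child traversal that produces anything
graph.V().has('tag', 'name', 'cassandra').fold().coalesce(
    __.unfold(),
    __.addV('tag').property('name', 'cassandra')
).next()

The fold()/unfold() pair keeps a traverser alive even when no tag matches, which is what allows coalesce() to fall through to addV(); in the larger traversal above, the anonymous __.V() child traversal plays that role instead.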

Generating recommendations

Once the code was implemented to insert users, videos, and tags into the graph, it was finally time to generate recommendations from the contents of the graph. The first operation to implement is get_suggested_for_user, which provides video recommendations for a user given their user_id.

Fortunately, I did not have to invent a new recommendation algorithm. In fact, I was able to reuse a DataStax Studio notebook that David gave me containing a couple of different traversals.

DataStax Studio notebook containing various graph-based recommendation algorithms

The basic idea of the algorithm I chose is as follows:

  • locate the user vertex for the provided ID
  • select videos that the user rated highly (in this case, a rating of 4 or 5)
  • select “similar users” that also rated the same videos highly
  • sample other videos that were highly rated by those similar users, sorting by those with the highest average rating

As it turns out, the task of translating the algorithm into Python was fairly simple:

def get_suggested_for_user(self, user_id, page_size, paging_state):
    traversal = self.graph.V().has('user', 'userId', user_id).as_('^user') \
        .map(__.out('rated').dedup().fold()).as_('^watchedVideos') \
        .select('^user') \
        .outE('rated').has('rating', gte(MIN_RATING)).inV() \
        .inE('rated').has('rating', gte(MIN_RATING)) \
        .sample(NUM_RATINGS_TO_SAMPLE).by('rating').outV() \
        .where(neq('^user')) \
        .local(__.outE('rated').has('rating', gte(MIN_RATING)).limit(LOCAL_USER_RATINGS_TO_SAMPLE)) \
        .sack(Operator.assign).by('rating').inV() \
        .filter(__.in_('uploaded').hasLabel('user')) \
        .group().by().by(__.sack().sum()) \
        .order(Scope.local).by(Column.values, Order.decr) \
        .limit(Scope.local, NUM_RECOMMENDATIONS).select(Column.keys).unfold() \
        .project('video_id', 'added_date', 'name', 'preview_image_location', 'user_id') \
        .by('videoId').by('added_date').by('name').by('preview_image_location') \
        .by(__.in_('uploaded').values('userId'))

    results = traversal.toList()

Rather than explaining the entire traversal here, I’ll refer you to the detailed comments in the source code.

There’s a second recommendation operation in the Suggested Videos Service, get_related_videos(). This operation is similar, but the starting point is a specific video. Here I diverged from the Java implementation, which used DSE Search for this operation, and instead chose a graph-based algorithm similar to the one for recommending videos to a user.
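To illustrate the idea, here is a simplified, hypothetical sketch, not the actual code (see suggested_videos_service.py for the real traversal); it reuses the MIN_RATING and NUM_RECOMMENDATIONS constants from the service:

from gremlin_python.process.traversal import P

# Hypothetical sketch: from a video, find users who rated it highly,
# then other videos those users also rated highly
related = graph.V().has('video', 'videoId', video_id).as_('^video') \
    .inE('rated').has('rating', P.gte(MIN_RATING)).outV() \
    .outE('rated').has('rating', P.gte(MIN_RATING)).inV() \
    .where(P.neq('^video')) \
    .dedup().limit(NUM_RECOMMENDATIONS) \
    .values('name').toList()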

Evaluating results and next steps

Finally, I had a working graph recommender in KillrVideo Python!

Output from the graph recommender shows up under “Recommended for You” in the KillrVideo home screen

The results are great, but this algorithm isn’t perfect. One observation is that there needs to be a reasonable amount of data (users, videos, ratings) in the graph before the algorithm generates meaningful results. Thankfully we have a sample data generator that we frequently run as part of our testing, but it takes a few minutes to generate a good amount of data and doesn’t really help us pass the integration tests for the Suggested Videos Service.

Another issue: the algorithm won’t generate any results for a user who hasn’t rated any videos, which is a problem for new users. An improvement I’m considering is a very simple fallback algorithm that always generates results, for example just recommending popular videos that are rated highly by other users.
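As a rough sketch of what such a fallback might look like (again hypothetical, not code from the repo), ranking videos by the number of high ratings they have received, using the same imports and constants as the traversal above:

from gremlin_python.process.graph_traversal import __
from gremlin_python.process.traversal import P, Scope, Column, Order

# Hypothetical fallback: globally popular, highly rated videos
popular = graph.V().hasLabel('video') \
    .group().by().by(__.inE('rated').has('rating', P.gte(MIN_RATING)).count()) \
    .order(Scope.local).by(Column.values, Order.decr) \
    .limit(Scope.local, NUM_RECOMMENDATIONS) \
    .select(Column.keys).unfold() \
    .values('name').toList()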

Lastly, some of the graph experts we’ve shared this with have offered suggestions on how we can improve the performance of the traversal. Given all this, let’s consider this implementation a minimum viable product (MVP)!

Next Time

With the completion of the graph recommender, I’ve now got a full implementation of the basic capabilities of the KillrVideo Python services. There are a couple of small things left, like adding authentication from the services to DataStax Enterprise. In the final post in this series I’ll take a step back and summarize the lessons learned from this effort.

Next post in this series: Lessons Learned on my first Python app

