Introduction to graph databases using Neo4j – Part 1

Over the past years, graph databases have become increasingly popular. Many large companies like Facebook, LinkedIn and Twitter use them heavily for their social media networks. However, when selecting a database solution for a project, this category of NoSQL databases is often overlooked in favor of the traditional relational database or RDBMS for short.

While most of the time an RDBMS will be a good fit for our projects, they do come with downsides that might show over time.

In this article, we’ll set our first foot in the world of graph databases using Neo4j. Neo4j is the most popular graph database solution to date. It’s open source and written in Java. The source code is hosted on Github. All examples in this article are based on the movie recommendation dataset you can find in Neo4j’s online sandbox environment.

Before we go into Neo4j, let’s talk about relational databases for a bit and see where they fall short. This will give us a better understanding when graph databases might be a better fit.

Downsides of the relational model

The relational database model is based on a ledger like table structure. Tables represent our real world entities, each in its own table row. Rows have a primary key, a unique identifier within their table.

We define a relationship between two tables by creating a column in one table in which we store the primary key values of the rows in the other table. These key values are foreign to the table on the owning side of the relationship. We put a foreign key constraint on the column holding these values, to protect the relationships from becoming invalidated.

When writing SQL to query data, we have to join our tables together using these foreign key relations. Depending on the number of relations, queries can grow complex fairly quickly. This is certainly the case in highly normalised databases.

Furthermore, joining a lot of tables can become a performance issue. We can mitigate this by use of indices, but only to some extent. Queries with multiple (nested) joins are not only prone to bad performance, they’re more than often quite difficult to read.

The relational model is also a very rigid structure. Table columns, relationships and constraints have to be defined up front. This may not be a problem today, but our datamodel might prove difficult to change when having to deal with new requirements in the future.

The graph model

In Neo4j, data is stored as a directed graph. A graph consists of nodes and the relationships between these nodes. Nodes are labelled to denounce their type. When compared to an RDBMS, you can think of nodes and labels as records and tables.

Relationships connect nodes. In Neo4j, they are first class citizens and can be queried directly. There is no need to compute them at runtime as an RDBMS will do when joining tables. Each relationship between two nodes has an orientation, hence the term directed graph.

Finally, nodes and relationships both support properties in the form of key value pairs.

Graphs can be easily visualised as shown in the figure below. It represents sample data from the aforementioned dataset. It consists of nodes labelled Actor, Director, Movie, Genre and User. Relationships are drawn as arrows pointing in the direction of the relationship.

 

Note how this is a very natural way of modelling data. It’s just connecting things to other things, like we would do in a whiteboard session when we’re developing our model. Now let’s see how this compares to the RDBMS equivalent:

 

 

Each entity is stored in its own table. As the relationships are many-to-many, we need additional linking tables. Although it is clear and highly structured, it is one step further away from how we think of things in the real world.

In the next section, we’ll see how both models compare when querying data.

Querying a graph with Cypher

Cypher is Neo4j’s graph query language. It is inspired by SQL and should therefore be easy to read if you’re familiar with the latter, even without prior knowledge of Cypher.

Let’s start with an example. Suppose that we want to retrieve a list of all movies that Keanu Reeves acted in. We can do this with the following query:

MATCH (:Actor {name: ‘Keanu Reeves’})-[:ACTED_IN]->(m:Movie)
RETURN m.title 
ORDER BY m.title ASC;

 

With MATCH, we specify the pattern of nodes and relationships we are looking for, in this case the movies Keanu Reeves has acted in. Notice how we use an ASCII-art style of expressing the relationship and its direction. Results are bound to the alias m and we this alias to return the title property for each movie.

Going back to the relational model of the previous paragraph, the equivalent SQL query is already more difficult to read:

SELECT m.title
FROM movies m 
INNER JOIN movie_actors ma ON m.id = ma.movie_id
INNER JOIN actors a ON ma.actor_id = a.id
WHERE a.name = ‘Keanu Reeves’
ORDER BY m.title ASC;

 

The major difference between a Cypher and an SQL query is that in Cypher we describe what to find, rather than how to find it. This will become even more clear in our next example.

Let’s try and answer a more interesting question. We really liked the first Matrix movie and now we want to search for recommendations based on other users that also really liked The Matrix. We quantify ‘really liking a movie’ as rating it at least 4 stars. Furthermore, we only consider those movies that are really liked by at least 100 users that also really liked The Matrix:

MATCH (:Movie {title: 'Matrix, The'})<-[m:RATED]-(u:User)-[r:RATED]->(rec:Movie)
WHERE m.rating >= 4
WITH rec, count(u) as num_users, avg(r.rating) as avg_rating
WHERE avg_rating >= 4 and num_users > 100
RETURN rec.title AS recommendation, num_users, avg_rating
ORDER BY avg_rating DESC;

 

This query is concise and easy to read. We are looking for movies given an average rating of over four stars by at least 100 users that also gave The Matrix at least a four star rating.

To get the same results with SQL, we have to do two separate JOINs between the movies and movie_ratings tables. This results gives a query that is pretty difficult to read:

SELECT rec.title, COUNT(mr2.user_id), AVG(mr2.rating) 
FROM movies m
INNER JOIN movie_ratings mr1 ON m.id = mr1.movie_id
INNER JOIN movie_ratings mr2 ON mr2.user_id = mr1.user_id
INNER JOIN movies rec ON rec.id = mr2.movie_id
WHERE mr1.rating >= 4 AND mr2.movie_id <> mr1.movie_id AND m.title = 'Matrix, The'
GROUP BY rec.id, rec.title
HAVING COUNT(mr2.user_id) > 100 AND AVG(mr2.rating) >= 4
ORDER BY AVG(mr2.rating) DESC;

 

Taking it further

As the examples above have shown, Cypher queries are often more concise and easier to read than their equivalent SQL queries. Let’s take it up a notch and look at some examples of what we can do in Neo4j that would be impossible to do in an RDBMS (or at least give us an ever lasting migraine).

Favorite directors

There is a lot of information hiding in our apparently simple movies database. Among other things, we know who directed the movies our users like, may be we can postulate an answer to the question of who their favorite director is?

First, we have come up with a definition of what a favorite director is:

“A user’s favorite director is the director of whom the user liked most movies more than average.”

Now given this definition we’d like a list of all users and their favorite director. This would be very hard or even impossible to do with SQL. In Cypher I wouldn’t consider this an easy task, but we can certainly produce something that is very readable:

// first we calculate a user’s average movie rating
MATCH (u:User)-[r:RATED]->(m:Movie)
WITH u, avg(r.rating) AS mean

// now we search for directors and score them by the number of movies the user rated above average
MATCH (u)-[r:RATED]->(m:Movie)<-[:DIRECTED]-(d:Director)
WHERE r.rating > mean
WITH u, d, COUNT(*) AS numAboveAverage

// we collect directors and their scores in lists per user
WITH u, collect(d) AS directors, collect(numAboveAverage) AS scores

// by reducing the scores list, we determine the index of the favorite director
WITH u, directors, reduce(x=0, idx IN range(0, size(scores) -1) | CASE WHEN scores[idx] > x THEN idx ELSE x END) AS index

// and finally we return the user along with their favorite director
RETURN u.name, directors[index].name;

 

Creating relations

In the sandbox environment, it took the favorite directors query over 600 ms to come up with a result. Not bad for what we have accomplished, but this is not ideal if we’d like to explore further based on the results.

Based on the results, let’s create a new relationship between users and directors called HAS_FAVORITE. The only thing we have to do is take our original query and add a CREATE clause (we left out the comments and new lines for brevity):

MATCH (u:User)-[r:RATED]->(m:Movie)
WITH u, avg(r.rating) AS mean

MATCH (u)-[r:RATED]->(m:Movie)<-[:DIRECTED]-(d:Director)
WHERE r.rating > mean

WITH u, d, COUNT(*) AS numAboveAverage
WITH u, collect(d) AS directors, collect(numAboveAverage) AS scores
WITH u, directors, reduce(x=0, idx IN range(0, size(scores) -1) | CASE WHEN scores[idx] > x THEN idx ELSE x END) AS index
WITH u, directors[index] AS d

CREATE (u)-[:HAS_FAVORITE]->(d)

RETURN u.name, d.name;

 

This query not only returns each user and their favorite director, but also creates a new relationship between them.

New possibilities

By creating new relationships we create new possibilities. With new relationships we added value and create other questions. We might wonder who the most popular directors are, i.e. the directors that are favorited the most. This query will give us the top 10:

MATCH (d:Director)<-[:HAS_FAVORITE]-()
RETURN d.name, COUNT(*)
ORDER BY COUNT(*) DESC
LIMIT 10;

 

Interestingly enough, executing this query in the sandbox tells us that Steven Spielberg is by far the most popular director, way more popular than for example Christopher Nolan or Quentin Tarantino.

Could our dataset be biased? Does our definition of favorite director do justice to reality? Take the sandbox for a spin and find out!

Further reading

If you want to learn more about Neo4j and its use cases, there is an extensive resource section on the Neo4j website. Also, should you want to learn more about Cypher, checkout out its excellent documentation.

Rens Verhage

Rens Verhage is sinds maart 2020 in dienst als Senior Software Engineer bij Profit4Cloud. Rens heeft ruim 14 jaar ervaring met Java en is OCA / OCP gecertificeerd.