Imagine a bacon-wrapped Ferrari. Still not better than our free technical reports.

Examples where Graph databases shine: Neo4j edition

How to query information with Cypher

The following is the possibly the nerdiest graph data structure I could think of (took me two whole days to research and materialize), which we will use to understand how information extraction is achieved in Cypher. The query.cql file is the cypher script used to create the node and relationship structure.

And the following are the questions we will try to find the answers for, via Cypher.

  1. What are the names of the actors, the names of the characters and the television show in which they portrayed the said role? Arrange the list by both the franchise and the actor names.

This example illustrates the most basic of the Neo4j clauses: MATCH, RETURN, WHERE and ORDER BY and the role they play in extracting the information that is desired.

To solve it, we need to ask ourselves how the actors are related to the characters they played, and the franchises to which those characters belong. The simplest way is to consider the following pretty straightforward scenario:


This makes it easier to understand the relationship chain that we are required to follow: we need to find each actor, the character he played, and the franchise it was a part of, in that order. Remember, there may be better and more efficient ways to approach this problem: this is just one of many. This premise holds for all the queries that we shall discuss. Remember that there is always a better way to solve a problem.

Now, let’s craft the query, shall we? Have a look at it below:

MATCH (people:Actor)-[:PLAYED_THE_ROLE_OF]->(character:Character), 
WHERE character-[:WAS_A_CHARACTER_IN]->(franchise)

Here’s a short explanation of what’s happening in the query:

MATCH (people:Actor)-[:PLAYED_THE_ROLE_OF]->(character:Character), 

The clause MATCH is used to specify the nodes which you would require on your journey from the start to the finish, i.e. you want to start from an ‘actor(type: Actor) node and reach the ‘franchise’ node (type: TelevisionShow), via the character node (type: Character). The notation (n)-[r]->(m) signifies that there is an outgoing relationship r, in this case: PLAYED_THE_ROLE_OF) from a node ‘n’ to a node ‘m’.

The colon before the name of the relationship is used whenever you explicitly state the name of the relationship, instead of using a general notation like ‘r’ to signify it, when you are either not sure of the relationship, or simply do not care which relationship there is between any two nodes.

WHERE character-[:WAS_A_CHARACTER_IN]->(franchise)

The clause WHERE works in much the same as its SQL counterpart, which is used to specify constraints to the process of information retrieval. In this particular case, the where clause signifies a constraint apart from the one specified in the MATCH clause preceding it: the ‘character’ node is related to the ‘franchise’ node by way of an outgoing WAS_A_CHARACTER_IN relationship.


For all the SQL aficionados out there, RETURN is the CQL version of the SQL SELECT clause. Here, it is being mentioned that after matching the nodes according to the relationship chains specified in the MATCH and the WHERE clauses, return the node properties as ACTOR_NAME, as CHARACTER_NAME and as FRANCHISE_NAME.


And the similarities just do not seem to end, do they? The ORDER BY clause makes sure to arrange the results first in the alphabetical order of the franchise names, and then, by the actor names.

Here’s how the output of this query looks in the Neo4j Webadmin Console View on my machine:

Sweet, right?

  1. Find all the actors from the UK, who acted in the Sherlock franchise and were born after 1970.

This example shows the role of builtin functions in Neo4j and the retrieval of information by using a multi-step relationship chain.

The author would like to suggest the reader at this point that developing a CQL query is quite straight-forward, provided you have a clear picture of what information you need to extract, in your mind. Here’s another piece of the aforementioned author’s mind: take the good ol’ pencil and notepad, and draw a diagram, and don’t be afraid if it looks stupid: what matters is what you understand from it and that it clearly defines the situation that you intend to simulate via your query, much like we had a look in the previous question. Now, with that image on your notepad and your mind, let’s proceed to view what the query should look like:

MATCH (actor:Actor)-[*2]->
  (franchise:TelevisionShow {name: 'SHERLOCK'})
WHERE actor.nationality='United Kingdom' AND 
toInt(substring(actor.born, 7)) > 1970

Much like what we saw in the previous example, this query makes use of the MATCH clause to decide which nodes should be encountered as part of the journey and the relationship.

A little difference, though, can be seen here. We have ascertained the node ‘franchise’ by specifying its property: name, which we can, when we are sure of what we are looking for. Any number of properties can be specified, which is good, because the more the information, the faster the lookup.

If I told you that I lived on the outskirts of the Mandora crater on Mars, would you be able to look me up? No. I would need to tell you the address of my bio-dome beforehand. It’s as straightforward as that.

We have made use of the logical operator AND to club the two conditions which are required for us to devise the query: that the actor under consideration should be a national of the United Kingdom, and that they should be born later than 1970. The task of extracting the year of birth has been performed by using the string manipulation function inbuilt into Neo4j, called substring. And to parse that into an integer, so that it can work with mathematical operators, another inbuilt functionality called toInt comes immediately to our rescue.

You would notice that we have, unlike what was discussed earlier, specified the relationship between the nodes differently, by mentioning a 2 after the asterisk. What it implies is this: we don’t care which relationships the two nodes are bound by, as long as it is exactly 2 hops away, outbound, from the ‘actor’ node, signified by an asterisk followed by 2, to the ‘franchise’ node. An asterisk alone would have meant that one node could be any number of jumps away from the other. But if we think about it, it would definitely have taken much more time to process than our version.

The output for this query looks like the on the next image.

  1. Which novels/ novel series were released in the 19th Century?

Ah, the printed word!

This example, as you can see from the description, needs us to find out the novels/ novel series which were released in the 19th century, and we will make use of more inbuilt functions, in addition to the introduction to two new clauses: SET used for variable modification among other things, as described below and WITH for piping the clauses.

Now, there is one very important thing that needs to be mentioned. I have said this once or twice already, and will probably do so again, that just because NoSQL, and hence Neo4j, is new and hip, it does not imply that it is suited to ALL the situations that you encounter. Sometimes, SQL gives us faster results as compared to graphs, and this example happens to be one of many such scenarios.

We need to make sure that we commence from the nodes of the type either Novel, or NovelSeries, and start defining the relationships. But wait! Which node should be the on the other side of the relationship? None, because we only need to look for the properties of standalone nodes in this case. The following are the tasks which should be performed in order to design the query:

Step 1: As mentioned, we require and choose only the Novel and NovelSeries type of nodes, which is achieved by the use of the has function, which returns true in case a particular node has a specific property, i.e. if the node ‘planet’ has a property called ‘orbital_period_around_the_Sun’, then has(planet.orbital_period_around_the_Sun) will return true, otherwise false.

Step 2: Now, we need a temporary variable, which we will call temp, in which we can store the integer referring to the year of publication of the Novel/Novel Series, exactly like we saw in the previous example, which, in Cypher, is achieved by the use of the SET clause.

Step 3: Now since we have defined quite a lot and specified quite a lot, we need to start afresh with the next part of our two-part query. And for that to happen, we need to pipe the node we created, using the WITH clause, which is now bound by the constraints we defined in the WHERE clause and is attached with a new temporary variable.

Step 4: We start with the MATCH again, and now in the WHERE clause, mention the most critical condition of the query, that the Novel/Novel Series should be published in the 19th century.

Step 5: Though it is not necessary, but it is a good practice to remove the temporary variables from the memory after they have served their purpose, because that variable was never actually a property of the node to begin with. Thus, we use the REMOVE clause to do so, but not before we store the value of the doomed variable in some alias for display purposes and pass this value forward to the final RETURN clause, which does the displaying for us.

So, after this much exercise, let’s finally have a look at the query and how the output looks:

MATCH (nOrNS) WHERE has(nOrNS.released) AND has(nOrNS.title)
SET nOrNS.temp = toInt(substring(nOrNS.released, 7))
MATCH (nOrNS) WHERE nOrNS.temp > 1800 AND nOrNS.temp < 1899
WITH nOrNS, nOrNS.temp AS YearOfPublication
RETURN nOrNS.title AS Title, YearOfPublication;

Output that you might see then looks like this.