
A Glance into the Developer’s World of Data: Graphs, RDBMS and NoSQL


How do you feel about graphs? Graphs are all around us. The road network that connects your home to everywhere you want to go is an example. The electric company’s grid network is another. The neurons that work in unison to coordinate the actions of our bodies form yet another: a neural network. The web is an enormous directed graph. The social networking scenario (A is a friend of B, B follows C, A and C follow Tom Hanks) is one more.

So, a few very natural questions come to mind. What exactly are graphs, anyway? Why are they even needed? How do they make data visualization a piece of cake? These questions will be touched on briefly in this article, along with an introduction to databases and their types, and a brief introduction to Neo4j, which I hope to write more about soon.

WHAT ARE GRAPHS?

Before delving into any details, let’s take a moment from our busy lives to ponder over a scenario: the “Friends of friends” situation, to help illustrate the concept of a graph.

But let’s try tweaking it a little bit to include the option of “Following” as well. For those who are alien to the concept of the topic we are about to discuss, let me elaborate a little. If you are a fan of Tom Hanks, then you can follow him on Twitter.

If you wish to remain in touch with your best buddy who lives seven oceans away and you haven’t had time to catch up recently, you can follow him/her as well. This is where things are a bit different from Facebook, where you befriend your friends, and follow your celebrity idols (or rather, like their “Pages”).

Let’s try to understand this situation using graphs. ‘A’, ‘B’, ‘C’ and ‘D’ are Persons, and ‘TH’ is Tom Hanks. A is a friend of B, and C is a friend of B. Both D and B follow TH.

What you are observing in this situation would be handily described in a graph! Let us have a look at the mathematical definition of a graph, courtesy of Wikipedia:

“A graph is a mathematical structure consisting of a set of vertices (also called nodes) and a set of edges. An edge is a pair of vertices and these two vertices are called the edge endpoints.”

Explanation: The Persons A, B, C, D and TH are represented by what are called Nodes. The links joining them (or any two nodes in general), that is, the relationships between nodes, are called Edges, or simply Relationships. The edges may be directed (outgoing or incoming) or undirected, though it is worth mentioning that Directed Graphs can describe a situation more clearly. The arrow drawn from ‘B’ to ‘TH’ implies ‘B follows TH’, not the other way round! There is never any doubt about whether B follows TH or TH follows B. The edges got your back!
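To make the idea of directed edges a little more concrete, here is a minimal Python sketch of the A, B, C, D and TH example. The edge list, the relationship names FRIEND_OF and FOLLOWS, and both helper functions are my own illustration. I am also assuming that friendship is mutual, so each friendship is stored once in each direction, while a follow points one way only.

```python
# Directed edges from the example: (source, relationship, target).
# Assumption: friendship is mutual, so it appears once in each direction.
EDGES = [
    ("A", "FRIEND_OF", "B"),
    ("B", "FRIEND_OF", "A"),
    ("C", "FRIEND_OF", "B"),
    ("B", "FRIEND_OF", "C"),
    ("D", "FOLLOWS", "TH"),
    ("B", "FOLLOWS", "TH"),
]


def targets(source, relationship):
    """Nodes reached by leaving `source` along edges of the given type."""
    return sorted(t for s, r, t in EDGES if s == source and r == relationship)


def sources(target, relationship):
    """Nodes whose edges of the given type arrive at `target`."""
    return sorted(s for s, r, t in EDGES if t == target and r == relationship)


print(sources("TH", "FOLLOWS"))  # everyone who follows Tom Hanks
print(targets("TH", "FOLLOWS"))  # everyone Tom Hanks follows (nobody here)
```

Because every edge has a direction, asking "who follows TH?" and "whom does TH follow?" are two different lookups, which is exactly the clarity the arrows give us.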

Now that we have some idea of what Graphs are, let’s move on and discuss what a database is, what SQL and NoSQL are, and when it makes sense to store information using NoSQL rather than the more famous, prevalent, and tried-and-tested Big Brother of data storage: Relational Database Management Systems using SQL.

A BRIEF LOOK AT DATABASES AND SQL

To store data so that it can be useful in the future, persistence is the key. Imagine a situation in which you would like to see the fixtures of all the football matches which have been played in the Champions League between 1990 and 2005, their results and individual player performances. From where would a person expect to get such information, if it is not stored? Not just stored, but stored in such a format that it is easier to extract and read (Flat File management is also a way to store information, but that requires a lot of parsing effort).

The data is, ultimately, stored in the form of files, but a Relational Database Management System (RDBMS) organizes your data storage in a tabular form. Built on the relational model that Dr. E.F. Codd proposed in 1970, an RDBMS is an organized data collection (using classic rows and columns) that is easy both to understand and to develop against. Almost without exception, an RDBMS uses Structured Query Language (SQL) to manage things.

The following situation might help explain it.

Consider a company, SuperHero Corporation, where there are employees and there are managers as well. Now, the managers are also employees (duh!), and if we are in need of describing this situation to get a better grasp of the scenario, the following is one of the many approaches which might come to our aid. For the sake of simplicity, the number of attributes (columns) has been kept to a minimum.

TABLE_EMPLOYEE_INFO

TABLE_MANAGER_INFO

The table TABLE_EMPLOYEE_INFO contains information about the employees, like the unique IDs designated to them (called Primary keys), their names and the departments in which they work. The other table, TABLE_MANAGER_INFO, contains information as to who is whose manager. E007 (Stan Lee) manages both E001 (Tony Stark) and E003 (Peter Parker), whereas E005 and E002 are managed by E003 and E006 respectively.

As the previous situation shows, collecting related data that can be elegantly placed into rows and columns is one of the distinguishing features of the RDBMS.
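If you would like to play with this, here is a rough sketch of the two tables using Python's built-in sqlite3 module. The column names EMP_ID, EMP_NAME and MANAGER_ID are my assumptions, the department column is left out, and employees whose names are not mentioned in the text get a NULL name rather than an invented one.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE TABLE_EMPLOYEE_INFO (
        EMP_ID   TEXT PRIMARY KEY,   -- unique ID per employee
        EMP_NAME TEXT                -- NULL where the text gives no name
    );
    CREATE TABLE TABLE_MANAGER_INFO (
        EMP_ID     TEXT REFERENCES TABLE_EMPLOYEE_INFO (EMP_ID),
        MANAGER_ID TEXT REFERENCES TABLE_EMPLOYEE_INFO (EMP_ID)
    );
""")
conn.executemany(
    "INSERT INTO TABLE_EMPLOYEE_INFO VALUES (?, ?)",
    [("E001", "Tony Stark"), ("E002", None), ("E003", "Peter Parker"),
     ("E005", None), ("E006", None), ("E007", "Stan Lee")],
)
conn.executemany(
    "INSERT INTO TABLE_MANAGER_INFO VALUES (?, ?)",
    [("E001", "E007"), ("E003", "E007"), ("E005", "E003"), ("E002", "E006")],
)

# Join the manager table to the employee table twice: once to resolve the
# employee's name and once to resolve the manager's name.
rows = conn.execute("""
    SELECT e.EMP_NAME
    FROM TABLE_MANAGER_INFO AS m
    JOIN TABLE_EMPLOYEE_INFO AS e   ON e.EMP_ID = m.EMP_ID
    JOIN TABLE_EMPLOYEE_INFO AS mgr ON mgr.EMP_ID = m.MANAGER_ID
    WHERE mgr.EMP_NAME = 'Stan Lee'
    ORDER BY e.EMP_ID
""").fetchall()
reports = [name for (name,) in rows]
print(reports)  # the employees managed by Stan Lee
```

The join is where the "relational" part earns its name: the manager table holds only IDs, and the names are looked up through the primary key.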

SO HOW DID NoSQL GET HERE?

NoSQL is another type of DB management system designed primarily to target our capacity for handling ever-expanding data and its management.

The name has also been taken at various times to mean “NO SQL!”, but with SQL in use by 94% of respondents in a recent survey, it is nowadays understood to mean “Not only SQL”.

I’ll not get into the reasons why RDBMS proved not to be effective enough to manage this scenario, but this limitation paved the way for NoSQL, which, ever since its inception, has boasted of near-unlimited horizontal scalability and high-performance data processing. But it’s not as if NoSQL is all rainbows and unicorns. The major drawback of NoSQL is that it supports far less functionality than RDBMS. It needs to be cut some slack in this department though, because after all, RDBMS has had almost 40 years to develop, while NoSQL has only risen over the horizon as a serious contender in the past decade or so.

The four broad categorizations of NoSQL based on the employed data models, along with their examples, are:

  • Document oriented (e.g. MongoDB).
  • Key-value store (e.g. Redis).
  • Tabular store (aka Column-family: e.g. HBase).
  • Graph store (e.g. Neo4j).
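To get a feel for how these four data models differ, here is a loose Python sketch of one and the same fact, "Person001 is a friend of Person002", in each shape. These are plain Python dictionaries and not the real storage formats of MongoDB, Redis, HBase or Neo4j, and every key and field name below is hypothetical.

```python
import json

# Document store: one self-contained, nested record per person.
document = {"_id": "Person001", "name": "A", "friends": ["Person002"]}

# Key-value store: a single key maps to an opaque value; here the value is
# a serialized string that the application has to decode for itself.
key_value = {"friends:Person001": json.dumps(["Person002"])}

# Column-family store: row key -> column family -> column -> value.
column_family = {
    "Person001": {
        "profile": {"name": "A"},
        "friends": {"Person002": ""},  # the column name carries the friend ID
    }
}

# Graph store: nodes and relationships are both first-class records.
graph = {
    "nodes": {"Person001": {"name": "A"}, "Person002": {"name": "B"}},
    "relationships": [("Person001", "IS_A_FRIEND_OF", "Person002")],
}

# The same question, "who are Person001's friends?", asked of each shape.
print(document["friends"])
print(json.loads(key_value["friends:Person001"]))
print(list(column_family["Person001"]["friends"]))
print([dst for src, rel, dst in graph["relationships"] if src == "Person001"])
```

Only in the last shape is the relationship itself a record that can be followed from node to node, which is the property the rest of this article leans on.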

Out of these four divisions, the one best suited to managing the relationships between nodes and traversing them efficiently, even as the data-sets keep growing, is the Graph store, or the Graph Database Management System. You create nodes, define their properties, create relationships, define their properties as well, join the nodes using these relationships and voila! You get a graph-structured database model. And this is precisely the model Neo4j is built around.

Consider a classic social networking scenario. If I ask you to list all my friends-of-a-friend-of-a-friend (leaving out anyone who is already my friend), doing so with an RDBMS might cost you many sleepless nights. You might need a table (say, TABLE_PERSON) which has the person's ID as the primary key, and another table (say, TABLE_FRIENDS) which stores IDs from TABLE_PERSON twice in every row: once for the person under consideration (ACCOUNT_HOLDER) and once for a friend she/he has (FRIEND_OF_HOLDER), looking somewhat like the following:

TABLE_FRIENDS

SQL:
SELECT FRIEND_OF_HOLDER
FROM TABLE_FRIENDS
WHERE FRIEND_OF_HOLDER IN (
    SELECT FRIEND_OF_HOLDER FROM TABLE_FRIENDS
    WHERE ACCOUNT_HOLDER IN (
        SELECT FRIEND_OF_HOLDER FROM TABLE_FRIENDS
        WHERE ACCOUNT_HOLDER IN (
            SELECT FRIEND_OF_HOLDER FROM TABLE_FRIENDS
            WHERE ACCOUNT_HOLDER = 'Person001')))
  AND FRIEND_OF_HOLDER NOT IN (
    SELECT FRIEND_OF_HOLDER FROM TABLE_FRIENDS
    WHERE ACCOUNT_HOLDER = 'Person001');
FRIEND_OF_HOLDER
--------------------
Person008
Person005

Similarly, the SQL query can be modified to extract the third level friends of any ACCOUNT_HOLDER.

So, as you can see, for small data-sets it’s not that difficult to visualize the extraction of all the friends-of-a-friend-of-a-friend. But once you increase the data to a large volume, or try to extract the nth level friends of a person where n > 10, the real problem comes into focus: the query needs one more nested sub-query for every extra level, and the time necessary to extract the data grows with it.
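To see how quickly the nesting grows, here is a small sketch that builds the same kind of nested query for any depth and runs it with Python's built-in sqlite3 module. The helper nth_level_query and the five friendship rows are made up for illustration (the real table contents are not reproduced here), and the query is a slight variant of the one above: it selects DISTINCT values straight from the innermost-to-outermost chain of sub-queries.

```python
import sqlite3


def nth_level_query(depth):
    """Build a nested query for the friends exactly `depth` hops away."""
    # Innermost sub-query: the direct friends of the starting person.
    inner = ("SELECT FRIEND_OF_HOLDER FROM TABLE_FRIENDS "
             "WHERE ACCOUNT_HOLDER = :start")
    # Every further hop wraps the previous level in one more sub-query.
    for _ in range(depth - 1):
        inner = ("SELECT FRIEND_OF_HOLDER FROM TABLE_FRIENDS "
                 f"WHERE ACCOUNT_HOLDER IN ({inner})")
    # Finally, drop anyone who is already a direct friend.
    return (f"SELECT DISTINCT FRIEND_OF_HOLDER FROM ({inner}) AS reached "
            "WHERE FRIEND_OF_HOLDER NOT IN ("
            "SELECT FRIEND_OF_HOLDER FROM TABLE_FRIENDS "
            "WHERE ACCOUNT_HOLDER = :start) "
            "ORDER BY FRIEND_OF_HOLDER")


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE TABLE_FRIENDS (ACCOUNT_HOLDER TEXT, FRIEND_OF_HOLDER TEXT)"
)
# Hypothetical sample data, not the table from the article.
conn.executemany(
    "INSERT INTO TABLE_FRIENDS VALUES (?, ?)",
    [("Person001", "Person002"), ("Person002", "Person003"),
     ("Person003", "Person004"), ("Person003", "Person005"),
     ("Person001", "Person005")],
)


def friends_at_level(start, depth):
    rows = conn.execute(nth_level_query(depth), {"start": start}).fetchall()
    return [friend for (friend,) in rows]


print(friends_at_level("Person001", 3))
print(nth_level_query(10).count("SELECT"))  # number of SELECTs at depth 10
```

Each extra level costs one more SELECT wrapped around the previous one, so the query text (and the work behind it) keeps growing with n.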

In contrast, have a look at the Cypher query (Cypher being the query language for Neo4j) required to achieve the above effect, assuming that the nodes Person001 to Person008 and the “IS_A_FRIEND_OF” relationships between them, as defined in the table mentioned above, have already been created (more on the syntax of Cypher later).

MATCH (a:Person {id: "Person001"})-[:IS_A_FRIEND_OF*3]->(b:Person)
WHERE NOT (a)-[:IS_A_FRIEND_OF]-(b)
RETURN DISTINCT b.id;

Now, to extract all the third level friends of ALL the nodes, we just need to drop the property filter on node a. And to find the 10th level friend-list, just change the query as follows:

MATCH (a:Person)-[:IS_A_FRIEND_OF*10]->(b:Person)
WHERE NOT (a)-[:IS_A_FRIEND_OF]-(b)
RETURN DISTINCT b.id;

It’s as straightforward as it looks!
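If you are wondering what the *3 and *10 actually ask for, here is a rough Python emulation of the traversal. It is only a sketch of my understanding of the pattern: follow exactly n outgoing IS_A_FRIEND_OF relationships, never reuse the same relationship within one path, and then drop anyone directly connected to the start node in either direction. It is not how Neo4j works internally, and the function name and sample edges are hypothetical.

```python
def friends_at_depth(edges, start, depth):
    """Nodes reached from `start` by exactly `depth` directed hops."""
    outgoing = {}
    for src, dst in edges:
        outgoing.setdefault(src, []).append(dst)

    reached = set()

    def walk(node, hops_left, used):
        if hops_left == 0:
            reached.add(node)
            return
        for nxt in outgoing.get(node, []):
            edge = (node, nxt)
            if edge not in used:  # a path may not reuse a relationship
                walk(nxt, hops_left - 1, used | {edge})

    walk(start, depth, frozenset())

    # Mirror WHERE NOT (a)-[:IS_A_FRIEND_OF]-(b): exclude direct
    # neighbours regardless of the direction of the relationship.
    direct = {d for s, d in edges if s == start}
    direct |= {s for s, d in edges if d == start}
    return sorted(reached - direct)


# Hypothetical friendships, each pair read as (person, friend).
EDGES = [
    ("Person001", "Person002"), ("Person002", "Person003"),
    ("Person003", "Person004"), ("Person003", "Person005"),
    ("Person001", "Person005"),
]

print(friends_at_depth(EDGES, "Person001", 3))
```

Changing the level is a matter of changing one number, which is the same convenience the Cypher queries above offer.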

WHEN DO I USE WHICH?

  • WHEN TO PREFER NoSQL?
  1. When the data to be stored is immense and the storage requirements keep growing, as with Twitter posts, Facebook followings, server logs etc.
  2. When elaborate functionality such as constraint management and joins is not required.
  3. When the data being dealt with is unstructured. For example, right now I am managing the Twitter posts of an account holder, and the very next moment she/he suddenly decides to upload a high-definition video of a recent trip to the Serengeti. Q: How do I manage that? A: Simple. Use NoSQL.
  • WHEN RDBMS?
  1. When joins (inner, outer) between tables are required for data extraction and visualization.
  2. When there is a need to add constraints to specific columns. For example, if the Registration Number of a student must be the primary key (non-null and unique), you might want to stick to RDBMS.
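The Registration Number example can be tried out in a few lines with Python's built-in sqlite3 module. This is just a sketch: the STUDENT table, its column names, the sample rows and the try_insert helper are all made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE STUDENT (
        REGISTRATION_NUMBER TEXT NOT NULL PRIMARY KEY,  -- non-null, unique
        STUDENT_NAME        TEXT NOT NULL
    )
""")
conn.execute("INSERT INTO STUDENT VALUES ('R001', 'First Student')")


def try_insert(registration_number, name):
    """Attempt an insert and report whether the database accepted it."""
    try:
        conn.execute(
            "INSERT INTO STUDENT VALUES (?, ?)", (registration_number, name)
        )
        return "accepted"
    except sqlite3.IntegrityError as exc:
        return f"rejected: {exc}"


print(try_insert("R002", "Second Student"))  # a new registration number
print(try_insert("R001", "Duplicate"))       # repeats an existing key
print(try_insert(None, "No Number"))         # missing registration number
```

The database itself refuses the duplicate and the missing key, so the application does not have to police the rule on every write.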

But, in all fairness, the author is not trying to imply that one option is better than the other. It just is a matter of choice, really. If you think that your data is better handled using the classic row-column format, then go ahead, there’s nothing forcing you to switch to NoSQL. If you think that the data that you are going to handle is going to be unstructured, ever increasing, then it really is your call, because after all, before the advent of NoSQL, the world and its data was being managed just fine, wasn’t it?

CONCLUSION

In this article, we explored some of the basics of Graphs (both the mathematical and the analytical aspects), looked into what Databases are actually all about, the need to store information in tabular forms (RDBMS), and a new kid on the block: NoSQL. We saw which alternative is a better choice in which scenario. We also had a sneak peek into the workings of Neo4j (the NoSQL Graph store type) and the Cypher query language, topics which will be discussed at length in the next article. As a heads-up, the article will feature a unique graph database scenario. It’s going to be, legen.. wait for it.. dary!

Till then, continue filling me in regarding any criticisms, suggestions and thumbs-ups (if there are any!), so that I may improve in the future. Leave a note for me in the comment section below or find me on Twitter: @SabhyaKaushal
