graph_basics2

In this blog we will try to touch upon the basics of graph that one needs to know. Starting from what are edges or vertices, types of graphs there are – to – what are the data structures available to represent graphs in computer memory. So, let’s get started !


So, what is a graph, it’s nodes and edges ?

Graph is a data structure that allows representation of data as relations between nodes/vertices using edges. A,B,C,D,E refer to nodes ;  AB, AC, AD,DE,EC are the edges. Of course, understanding why would one use a graph is useful. So, some examples what nodes and edges can refer to pertaining to different use-cases are listed below –

a. Nodes can be routers and edges be connections between them with weight representing the delay incurred for sending a packet over that edge. This representation can then be used to find best routing path for a packet – one that takes shortest time

b. Nodes can be people on social network and edges can represent connection between people with weight say representing the number of common interests- this can be used to check match between profiles on social networks , suggesting friends and so on.

c. Nodes can be destinations points (places) and edges can be roads connecting them – used for finding navigation path and so on.

Various types of graphs are also shown in Fig [1] below.

grap_basic
Fig[1]. Graph and its types

[B] shows cyclic and acyclic graphs. Cycle refers to loop in the graph – if you can start and end at the same node (any node)- graph is “cyclic” else “acyclic”.

[C] shows undirected and directed graphs. Edges are bi-directional in undirected (Facebook) but uni-directional in directed graphs (web crawling!), the direction indicated by the arrow tip.

[D] shows disconnected and connected graphs. If there is route or set of edges  to connect every node to every other node- then the graph is connected else it is disconnected.

[E] is a weighted graph – each of the edges have a “weight” or “cost” associated with them as represented by numbers on the edges.

We also have another way of classifying graphs –

Sparse and Dense graphs

If the number of edges in the graph E is of the order of the number of vertices (v)  E=~v, then graph is said to be sparse. If E=~(v^2) , then the graph is said to be dense – ie., a highly connected graph. (Keep this in mind) – We need this further down in this blog when we want to pick one data structure over the other to represent graph.

*Yes! have not included the multi-graphs , a graph that can have multiple edges /relations connecting two nodes.(Let’s keep it at basics ! :)).

Okay, now that we have seen various types of graphs, let’s check how we could represent them in computer memory.


Data structures

We will be looking at the two most popular representation schemes- the adjacency matrix and adjacency lists. These are the data structures that help us work on a graph in our programs.

[1] Adjacency Matrix

Now, who is familiar with matrix – put your hands up ! Easiest way to represent graphs is perhaps the adjacency matrix. So,Yes! all you need is a vertex array of length “v” and a 2-D matrix that is of size v x v , where v refers to the number of vertices in the graph. Shown below in Fig[2] is a graph [A] and its adjacency matrix representation [B]. Row i corresponding to node i (i derived from vertex array) has the information about presence of edges between this node and other nodes (say node j) (where 0<=i<=v-1 ; 0<=j<=v-1). Example : Row zero (i=0) and column j in Fig [2]shows the connections of node 0 (mapped to vertex A in this case) with other nodes (0<=j<=v-1) in the graph.

adj_mat

  Fig [2]. (Undirected, unweighted)Graph and it’s adjacency matrix

Some observations to be noted are :

For undirected graphs (Fig [2]) , the adjacency matrix is symmetric. The value corresponding (row,column) = (i,j) is equal to value at (row,column)=(j,i) – which means that presence of edge E(i,j) indicates presence of edge E(j,i) (example – presence of AB indicated presence of BA). However, for directed graphs, this may not be true.

Marked in yellow blocks as “1”, indicates presence of an edge between the two nodes in unweighted graph (Fig [2]) and “0” indicates no edge is present. In weighted graphs, yellow blocks will contain the weight corresponding to the edge between the nodes instead of “1”.

Good , Bad and the Worst

If you want to check if an edge is present between nodes i and j , we just look up Matrix(i,j). If we store the vertices  and their corresponding index number in a hash table/look-up table, then this operation takes : [O(1) to look up first vertex index i + O(1) to look up second vertex index j+ O(1) to look up Edge-Matrix(i,j) ] time ~= O(1) (great!) . However, if you need to check how many nodes a given node (say node i) is connected to, then we need to go to row (i) in the matrix and traverse all “v” entries – takes O(v) time (not so good for large graphs).

The memory taken is of the order O(v^2) (sucks for larger graphs!) . Also, for sparse (low connected graphs), a  lot of vertices wont be connected and will still need an entry in the matrix (look at the entries marked “zero” in Fig[2]). (sucks again!).

Let’s now see if we can do something about this space and wastage issue. *Tada* !! Time to explore adjacency lists.

[2] Adjacency Lists

As the name indicates we will be using lists (Matrix!), to hold graph connectivity information. All we need to store is a linked list of directly connected neighbors pertaining to each node/vertex (Linked lists are programming 101 – please check here). So, we will have up to (max of v-1) list elements corresponding to each node. Refer Fig [3] below.

adj_lists

  Fig [3]. (Undirected, unweighted)Graph and it’s adjacency list

As seen, since we are storing information related to connected neighbors , we are eliminating the need to store anything for unconnected neighbors (score-1 over Adjacency matrix : think of a sparsely connected graph with large number of nodes to understand this). The number of elements in the list “L” associated with a node can be at most O|v-1|. To check how many nodes a given node is directly connected to will only take O(L), where L is the number of directly connected neighbors (0<=L<=|v|) – not O(v) as in case of matrix! However, to check if an edge exists between two nodes, would take worst case O(L) – (Oops ! that was O(1) in case of adjacency matrix).

Also, comparing Fig [2] and Fig [3], we can identify the memory savings. These savings are comparable when we are working with sparsely connected large graphs. However, for densely connected graphs, lists would get bigger (~O(v) length per node) which implies space requirements will be similar to adjacency matrix. In that case, we would choose adjacency matrix- just because of O(1) edge look up! Come on- O(1) beats everything !


Before we wrap up- here’s some food for thought !

If there are millions of users on Facebook – are we gonna matrix index it – how do we manage millions of connections?

Can we do something better?  Read here

 – Info on the meme : here

*** That’s all folks ***