Digital Garden
Computer Science
Algorithms & Data Structures
Graphs & Networks
Link Prediction

Link Prediction

The idea of link prediction is as the name suggests to predict which link, i.e. edge will most likely be formed in the future for a given graph GG. This is a very important problem in social network analysis and has many applications in the real world such as in recommender systems to recommend friends, products, etc. to users.

Todo

Add visual examples

Neighbourhood-based methods

There are many methods to predict links in a graph. One of the most basic methods is to use the neighbourhood of a node in some way to predict the links.

Common Neighbours

This method is based of the common neighbours metric of two nodes. The idea behind it is that if two nodes have many common neighbours, then they are more likely to be connected in the future. You can imagine this as a friend of a friend is more likely to be your friend than a complete stranger.

The algorithm is very simple. For each pair of nodes (u,v)(u, v), we calculate the number of common neighbours cc and our prediction is the link/edge that corresponds to the pair with the highest cc.

To calculate the number of common neighbours, we can use the following formula:

commonNeighbours(u,v)=∣N[u]∩N[v]∣\text{commonNeighbours}(u, v) = |N[u] \cap N[v]|

where N[u]N[u] is the set of neighbours of node uu.

Jaccard Index

The jaccard index or also commonly known as the jaccard similarity coefficient is a measure of similarity between two sets and is used in many different applications. In computer vision for example, it is used to compare the similarity of two images but is there more commonly known as the intersection over union (IoU) metric, because that is what it is.

jaccardIndex(u,v)=∣N[u]∩N[v]∣∣N[u]βˆͺN[v]∣\text{jaccardIndex}(u, v) = \frac{|N[u] \cap N[v]|}{|N[u] \cup N[v]|}

When using the jaccard index for link prediction the idea is the same as with common neighbours. However, it takes into account the size of the neighbourhoods of the two nodes. For example if two nodes have 100 neighbours and 5 of them are common, then they are less likely to be connected than two nodes with 10 neighbours and 5 of them are common.

5100<510\frac{5}{100} < \frac{5}{10}

Soundarajan-Hopcroft

If the community structure of a graph is known, i.e. which nodes belong to which community, then we can use this to improve the common neighbours method for our link prediction. We do this by just adding the number of common neighbours that are also in the same community to the common neighbours metric. So if two nodes have many common neighbours that are also in the same community, and another pair of nodes have many common neighbours but with less in the same community, then the first pair is more likely to be connected in the future.

soundarajanHopcroft(u,v)=∣N[u]∩N[v]∣+βˆ‘w∈N[u]∩N[v]f(w)\text{soundarajanHopcroft}(u, v) = |N[u] \cap N[v]| + \sum_{w \in N[u] \cap N[v]} f(w)

where f(w)f(w) is a function that returns 1 if uu and vv are in the same community and 0 otherwise.

Resource Allocation Index

Imagine we have a resource for example a cake and there are three nodes, x,yx,y and zz. xx and yy have the common neighbour zz. If we want the cake to be shared between xx and yy, then we can give it to zz and zz will then equally share it with its neighbours. So if zz has 10 neighbours, then xx and yy will each get 110\frac{1}{10} of the cake. If zz has 100 neighbours, then xx and yy will each get 1100\frac{1}{100} of the cake.

The resource allocation index is based on this idea and is defined as follows:

resourceAllocationIndex(u,v)=βˆ‘w∈N[u]∩N[v]1deg(w)=βˆ‘w∈N[u]∩N[v]1∣N[w]∣\text{resourceAllocationIndex}(u, v) = \sum_{w \in N[u] \cap N[v]} \frac{1}{deg(w)} = \sum_{w \in N[u] \cap N[v]} \frac{1}{|N[w]|}

The resource allocation index can then be interpreted as a form of closeness between two nodes. We would then expect that two nodes that are close to each other are more likely to be connected in the future.

Adamic-Adar Index

The Adamic-Adar index is very similar to the resource allocation index. The only difference is that Adamic-Adar index weakens the denominator by taking the natural logarithm of the number of neighbours of a common neighbour.

Info

Why is this done? I have no idea. I have not found any explanation for this. If you know why, please let me know. It just seems to make the metric more complicated for no reason and the results larger.

adamicAdarIndex(u,v)=βˆ‘w∈N[u]∩N[v]1ln⁑(deg(w))=βˆ‘w∈N[u]∩N[v]1ln⁑(∣N[w]∣)\text{adamicAdarIndex}(u, v) = \sum_{w \in N[u] \cap N[v]} \frac{1}{\ln(deg(w))} = \sum_{w \in N[u] \cap N[v]} \frac{1}{\ln(|N[w]|)}

Preferential Attachment

The preferential attachment method is based on the idea that nodes with a high degree are more likely to be effected by the addition of a new link than nodes with a low degree. The preferential attachment is defined as follows:

preferentialAttachment(u,v)=deg(u)β‹…deg(v)=∣N[u]βˆ£β‹…βˆ£N[v]∣\text{preferentialAttachment}(u, v) = deg(u) \cdot deg(v) = |N[u]| \cdot |N[v]|

So if we have two nodes with a high degree, then the preferential attachment will be high and if we have two nodes with a low degree, then the preferential attachment and therefore the likelihood of them being connected in the future will be low.

The Link Prediction Problem

In the research paper The Link Prediction Problem for Social Networks (opens in a new tab) from 2004 an experiment was conducted to test the performance of the different link prediction methods. The experiment used a network containing publications and authors from different research fields.

They had training networks from 1994 - 1996 and test networks from 1997 - 1999.

The goal was to predict which authors would publish together in the future. For this they extract the core, which contained the authors that published at least 3 papers in the timeframe of the training networks and 3 papers in the timeframe of the test networks.

They then used the different link prediction methods to predict which authors would publish together in the future but only kept the highest predictions that connected 2 authors within the core. They then compared the predictions to the actual publications in the test networks core where the common neighbours method was the baseline.