Graph Neural Networks
Message Passing
Intermediate topic before moving to GNNs. Is this GraphSAGE?
Given a network where some nodes have labels, how do we propagate the labels to the other nodes?
Example: a network with scammers (red nodes) and trusted nodes (green nodes). How do we find the other scammers or trusted nodes?
We could just compute node embeddings and train a classifier on top of them.
But here we treat it as a semi-supervised learning problem: the graph is partially labeled, and we want to propagate the labels to the unlabeled nodes.
The key concept is "collective classification": nodes that are connected to each other are more likely to have the same label, i.e. the network structure correlates with the node labels.
"Homophily": nodes with similar characteristics tend to be connected, e.g. football fans are more likely to be friends with other football fans ("birds of a feather flock together"). "Influence": connections can influence node labels, e.g. a football fan can influence his friends to become football fans.
"Guilt by association": if a node is connected to a scammer, it is more likely to be a scammer itself.
The label of a node v depends on the labels of its neighbors, its own features, and the features of its neighbors.
The three approaches below (relational classification, iterative classification, belief propagation) are classic approaches to this problem.
Use a probabilistic model for the probability of a node having a label given the labels of its neighbors, with a first-order Markov assumption: the label of a node depends only on the labels of its immediate neighbors.
This is an iterative process run until convergence, with three parts:
- Local classifier: assigns initial labels to the nodes based on their features. Only applied once, at the beginning.
- Relational classifier: assigns labels to the nodes based on the labels and/or features of their neighbors.
- Collective inference: applies the relational classifier iteratively until convergence (see the sketch below).
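A minimal sketch of this three-part loop, assuming hypothetical `predict_local(features)` and `predict_relational(features_v, neighbor_labels)` callables (both names are placeholders, not from the lecture):

```python
import numpy as np

def collective_classification(adj, features, labels,
                              predict_local, predict_relational, max_iters=100):
    """adj: (N, N) adjacency matrix; features: (N, d) node features;
    labels: length-N array with -1 marking unlabeled nodes."""
    pred = labels.copy()
    unlabeled = np.where(labels == -1)[0]

    # 1) Local classifier: initial labels from node features only (applied once).
    pred[unlabeled] = predict_local(features[unlabeled])

    # 2) + 3) Relational classifier applied iteratively (collective inference).
    for _ in range(max_iters):
        old = pred.copy()
        for v in unlabeled:
            neighbors = np.where(adj[v] > 0)[0]
            # Re-predict v's label from its features and its neighbors' current labels.
            pred[v] = predict_relational(features[v], pred[neighbors])
        if np.array_equal(pred, old):  # converged: no label changed
            break
    return pred
```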
Relational classification
The basic idea is to model the probability of a node having a label given the labels of its neighbors.
Each node's class probability is the weighted average of the class probabilities of its neighbors. We can initialize labeled nodes with their ground-truth labels and, for two classes, set unlabeled nodes to 0.5. The update is

$$P(Y_v = c) = \frac{1}{\sum_{(v,u) \in E} A_{v,u}} \sum_{(v,u) \in E} A_{v,u} \, P(Y_u = c)$$

where $A_{v,u}$ is the weight of the edge between $v$ and $u$ and $P(Y_u = c)$ is the probability of node $u$ having label $c$.
Problems: convergence is not guaranteed, and the node features are not used.
The order in which labels are updated matters and can be chosen in different ways, e.g. by node id, randomly, or by degree. Would it not be better to update them all at the same time?
If a label doesn't change, it has converged and is not updated anymore, so we can stop the process when no label changes anymore.
This process is based only on the network structure and the node labels (a numpy sketch follows).
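A minimal sketch of this probabilistic relational classifier for two classes, assuming a weighted adjacency matrix `A` without self-loops and a numpy `labels` array using 1, 0 and -1 (unlabeled); the names are my own:

```python
import numpy as np

def relational_classification(A, labels, max_iters=100, tol=1e-4):
    n = len(labels)
    p = np.full(n, 0.5, dtype=float)   # P(Y_v = 1); unlabeled nodes start at 0.5
    labeled = labels != -1
    p[labeled] = labels[labeled]        # ground-truth labels stay fixed

    for _ in range(max_iters):
        p_old = p.copy()
        # Asynchronous update in node-id order (other orders are possible).
        for v in np.where(~labeled)[0]:
            w = A[v]                    # edge weights from v to all other nodes
            if w.sum() > 0:
                # Weighted average of the neighbors' current class probabilities.
                p[v] = (w @ p) / w.sum()
        if np.max(np.abs(p - p_old)) < tol:  # stop once the probabilities settle
            break
    return p
```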
Iterative classification
This approach also uses the features of the nodes, not just the labels.
Each node $v$ has a feature vector $f_v$ and a label $Y_v$. We train two classifiers: $\phi_1(f_v)$ predicts the label of a node from its features alone, and $\phi_2(f_v, z_v)$ predicts the label from its features plus $z_v$, a summary vector of the labels of its neighbors. $z_v$ could for example be a histogram of the neighbor labels; there are lots of possibilities.
In phase 1, both classifiers are trained on a labeled training set. In phase 2, we use $\phi_1$ to predict initial labels for the unlabeled nodes, compute the summary vectors $z_v$, and then use $\phi_2$ to re-predict the labels from the features and the neighbor-label summaries. We update the labels with the predictions of $\phi_2$, recompute the $z_v$, and repeat until convergence (sketched below).
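A minimal sketch of phase 2, assuming two already-trained classifiers `phi1` and `phi2` with a scikit-learn-style `predict()` method, and using a neighbor-label histogram as the summary $z_v$ (one possible choice):

```python
import numpy as np

def neighbor_label_histogram(adj, pred, v, num_classes):
    """Summary vector z_v: counts of each label among v's neighbors."""
    neighbors = np.where(adj[v] > 0)[0]
    return np.bincount(pred[neighbors], minlength=num_classes)

def iterative_classification(adj, features, phi1, phi2, num_classes, max_iters=30):
    # Step 1: initial labels from node features only (classifier phi1).
    pred = phi1.predict(features)
    for _ in range(max_iters):
        old = pred.copy()
        # Recompute the summaries z_v with the current label assignment.
        Z = np.stack([neighbor_label_histogram(adj, pred, v, num_classes)
                      for v in range(len(pred))])
        # Re-predict every label from [f_v, z_v] (classifier phi2).
        pred = phi2.predict(np.hstack([features, Z]))
        if np.array_equal(pred, old):  # converged: no label changed
            break
    return pred
```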
Belief propagation
This is a message passing algorithm: nodes send messages to their neighbors and update their beliefs based on the messages they receive.
It is called "loopy belief propagation" because the graph can have cycles. Passing messages, updating beliefs, passing messages, updating beliefs, and so on.
A first example is a path graph where we want to count the number of nodes in the graph. Each node passes a message of the form "my neighbor's count + 1" along the path; at the sink node we have the total count.
We can also do this for a tree, starting at the leaves and moving up to the root (see the sketch below).
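A tiny sketch of this counting-by-message-passing idea on a tree, assuming the tree is given as a children dictionary (my own representation):

```python
def count_message(children, u):
    """Message sent upward from node u: size of the subtree rooted at u,
    i.e. 1 plus the sum of the messages received from u's children."""
    return 1 + sum(count_message(children, c) for c in children.get(u, []))

# Path graph 0-1-2-3, viewed as a tree rooted at node 0.
children = {0: [1], 1: [2], 2: [3]}
print(count_message(children, 0))  # -> 4 nodes
```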
A node listens to the incoming messages from its neighbors, updates its belief, and then sends messages to its neighbors.
We have a label-label potential matrix $\psi$, where $\psi(Y_u, Y_v)$ is proportional to the probability of node $v$ having label $Y_v$ given that its neighbor $u$ has label $Y_u$.
The "prior belief" $\phi_u(Y_u)$ is the probability of node $u$ being in class $Y_u$.
This looks very weird at first and I don't fully understand it, but it is the formal version of the intuition above (the message update is written out below).
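For reference, the standard form of the loopy-BP message update and the resulting belief (the usual textbook formulation, not copied from the slides) is

$$m_{u \to v}(Y_v) = \sum_{Y_u \in \mathcal{L}} \psi(Y_u, Y_v)\, \phi_u(Y_u) \prod_{w \in N(u) \setminus \{v\}} m_{w \to u}(Y_u), \qquad b_v(Y_v) \propto \phi_v(Y_v) \prod_{u \in N(v)} m_{u \to v}(Y_v)$$

i.e. node $u$ tells $v$ how likely each of $v$'s labels is, combining $u$'s own prior with the messages $u$ received from its other neighbors.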
If the graph has cycles, convergence is not guaranteed, but the algorithm can still be run; messages just keep being passed around the cycles.
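A minimal sketch of loopy BP for node classification, assuming an edge list, a $(C \times C)$ potential matrix `psi` and per-node priors `phi` of shape $(N, C)$; the variable names are my own:

```python
import numpy as np

def loopy_bp(edges, psi, phi, num_iters=20):
    n, C = phi.shape
    neighbors = {v: set() for v in range(n)}
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)
    # One message per directed edge, initialized uniformly.
    msgs = {(u, v): np.ones(C) / C for u in range(n) for v in neighbors[u]}

    for _ in range(num_iters):
        new_msgs = {}
        for (u, v) in msgs:
            # Combine u's prior with the messages from u's other neighbors...
            incoming = phi[u].copy()
            for w in neighbors[u] - {v}:
                incoming *= msgs[(w, u)]
            # ...and push it through the label-label potential toward v.
            m = psi.T @ incoming            # m[y_v] = sum_{y_u} psi[y_u, y_v] * incoming[y_u]
            new_msgs[(u, v)] = m / m.sum()  # normalize for numerical stability
        msgs = new_msgs

    # Belief of each node: its prior times all incoming messages, normalized.
    beliefs = phi.copy()
    for (u, v), m in msgs.items():
        beliefs[v] *= m
    return beliefs / beliefs.sum(axis=1, keepdims=True)
```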
Deep Learning on Graphs
Problems with simple node embeddings:
- The number of parameters grows with the number of nodes; no parameters are shared between nodes.
- Transductive: they only work on the graph they were trained on and cannot generalize to new nodes.
- They do not incorporate node features.
We use deep graph encoders to solve these problems, i.e. deeper models than just skip-gram.
If a node does not have features, we can use a one-hot encoding of the node id, or simply a vector of ones or constants.
An initial idea could be to take a row (or column) of the adjacency matrix as the node embedding and concatenate it with the node features. But this gives a huge number of input features for the first layer, the same network cannot be used for graphs of different sizes, and the adjacency matrix also depends on the node ordering.
Graph Convolutional Networks
GCN (Graph Convolutional Networks): instead of image patches as in CNNs, we use "patches" of the graph, i.e. the neighborhood of a node, and try to generalize the convolution idea.
A sliding window is tricky to define because neighborhoods can have different sizes.
The center node "aggregates" information from its neighbors to create a new message/representation.
We need to learn the transformations of the messages and the aggregation function.
Each node defines a different computation graph because it has different neighbors, so we need to share parameters between nodes.
Model depth = number of layers = number of hops, K. The input features of the first layer are just the node features; in the end, each node has an embedding/feature vector.
The above seems like a very simple but dumb idea.
Because the node order is arbitrary, the aggregation function needs to be symmetric, i.e. the order of the neighbors does not matter (permutation/order invariance).
A simple choice is to sum or average the messages and then pass the result through a feed-forward neural network.
$$h_v^{(k+1)} = \sigma\left(W_k \sum_{u \in N(v)} \frac{h_u^{(k)}}{|N(v)|} + B_k h_v^{(k)}\right), \qquad h_v^{(0)} = x_v$$

where $h_v^{(k)}$ is the embedding of node $v$ at layer $k$, $W_k$ is the weight matrix of layer $k$, $B_k$ is the bias matrix of layer $k$, and $\sigma$ is the activation function.
Why do we add the node's own embedding $B_k h_v^{(k)}$ to the sum? Presumably as a residual-style connection / self-transformation. The update can also be written in matrix form (see below).
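A minimal numpy sketch of one such layer with mean aggregation. In matrix form the neighbor average is $D^{-1} A H^{(k)}$ (with $D$ the diagonal degree matrix), so one way to write the layer is $H^{(k+1)} = \sigma\big(D^{-1} A H^{(k)} W_k^\top + H^{(k)} B_k^\top\big)$; the ReLU and the toy data below are just illustrative choices:

```python
import numpy as np

def gcn_layer(A, H, W, B):
    """One mean-aggregation GCN layer.
    A: (N, N) adjacency matrix, H: (N, d_in) embeddings at layer k,
    W, B: (d_out, d_in) weight matrices. Returns (N, d_out) embeddings."""
    deg = A.sum(axis=1, keepdims=True)            # node degrees |N(v)|
    neighbor_mean = (A @ H) / np.maximum(deg, 1)  # average of neighbor embeddings
    return np.maximum(0, neighbor_mean @ W.T + H @ B.T)  # ReLU activation

# Stacking K layers = aggregating over K hops; the layer-0 input is the node features X.
rng = np.random.default_rng(0)
A = np.array([[0, 1, 1], [1, 0, 0], [1, 0, 0]], dtype=float)  # toy 3-node graph
X = rng.normal(size=(3, 4))                                   # node features
W1, B1 = rng.normal(size=(8, 4)), rng.normal(size=(8, 4))
H1 = gcn_layer(A, X, W1, B1)  # embeddings after one hop of aggregation
```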
We can train the above in a supervised way to predict node labels, or in an unsupervised way to learn node embeddings.
Because the parameters are shared between nodes, we can apply the same trained network to different graphs or to unseen nodes.
How is this related to GraphSAGE? (GraphSAGE builds on this scheme but generalizes the aggregation, e.g. mean/pool/LSTM aggregators, and concatenates the node's own embedding instead of adding it.)