Node Embeddings
DeepWalk
Think of random walks as being like sentences and nodes as being like words. We can then use the SkipGram model to learn embeddings for nodes: given a node on a random walk, we want to predict the nodes that appear near it on the walk (its context).
The goal is to have nodes that are close in the graph be close in the embedding space.
Is it really SkipGram and not CBOW? Yes: SkipGram predicts the surrounding context nodes from the center node, while CBOW would predict the center node from its context.
The probability formulas didn't make much sense to me at first; written out (see below), the model is just a softmax over dot products of embeddings.
Use random walks to get the context of a node without having to look at the entire graph. The random walk in DeepWalk is unbiased, i.e. it is not biased towards any particular node: at each step it chooses the next node uniformly at random among the neighbors of the current node.
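A minimal sketch of this pipeline, assuming gensim (4.x) is available for the SkipGram part; the toy graph, walk length and hyperparameters are made up:

```python
import random
from gensim.models import Word2Vec

# Toy graph as an adjacency list (hypothetical example).
graph = {
    "A": ["B", "C"], "B": ["A", "C", "D"],
    "C": ["A", "B"], "D": ["B"],
}

def random_walk(graph, start, length):
    """Unbiased walk: each step picks a neighbor uniformly at random."""
    walk = [start]
    for _ in range(length - 1):
        walk.append(random.choice(graph[walk[-1]]))
    return walk

# Treat walks as "sentences" of node "words".
walks = [random_walk(graph, node, length=10)
         for node in graph for _ in range(20)]

# sg=1 selects the SkipGram objective (predict context nodes from the center node).
model = Word2Vec(sentences=walks, vector_size=16, window=4, min_count=0, sg=1)
print(model.wv["A"])  # learned embedding of node A
```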
similar context => similar nodes, just like similar sentences => similar words
Why use dot product instead of cosine similarity? (Cosine similarity is just the dot product of length-normalized vectors; the plain dot product additionally takes embedding magnitude into account.) https://developers.google.com/machine-learning/clustering/similarity/measuring-similarity
Encoder: node => embedding. Decoder: pair of embeddings => similarity. Why so simple? In shallow embedding methods the decoder really is just the dot product z_u · z_v.
The encoder can be just a lookup table, i.e. the embedding matrix: every node gets one row (or column, depending on convention) of the matrix, and we learn the matrix directly. Not very scalable though if we have a lot of nodes, because the matrix is V by D, where V is the number of nodes and D is the embedding dimension.
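A sketch of the lookup-table encoder and dot-product decoder; the matrix here is random, standing in for the learned one:

```python
import numpy as np

V, D = 5, 8                      # number of nodes, embedding dimension
Z = np.random.randn(V, D) * 0.1  # embedding matrix, one row per node (learned during training)

def encode(u):
    """Encoder: just a lookup into the embedding matrix."""
    return Z[u]

def decode(z_u, z_v):
    """Decoder: similarity of two nodes = dot product of their embeddings."""
    return float(np.dot(z_u, z_v))

print(decode(encode(0), encode(3)))
```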
How to define node similarity? Candidates: the nodes are linked, they share neighbors, or they have similar surroundings, i.e. similar context => random walks.
=> we only learn the graph structure, not the node features. But it is unsupervised, so we don't need labels, which is good. Hence the embeddings are also task-independent.
P(v | z_u) is the predicted probability of visiting node v on a random walk starting at node u. This is the model's prediction, parameterized with a softmax over dot products: P(v | z_u) = exp(z_u · z_v) / Σ_{n ∈ V} exp(z_u · z_n).
=> softmax turns a vector of k real numbers into a vector of k non-negative numbers that sum to 1, i.e. a probability distribution. It is a soft version of the max function: the largest entry gets most of the probability mass.
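The prediction written out as a sketch (again with a random matrix standing in for the learned embeddings):

```python
import numpy as np

V, D = 5, 8
Z = np.random.randn(V, D)

def p_visit(u, v, Z):
    """P(v | z_u): softmax over the dot products of node u with every node."""
    scores = Z @ Z[u]                 # z_u . z_n for all nodes n
    scores -= scores.max()            # numerical stability
    probs = np.exp(scores) / np.exp(scores).sum()
    return probs[v]

print(p_visit(0, 3, Z))  # the probabilities over all v sum to 1
```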
Don't forget: in a plain random walk you can also step back to the node you just came from.
Using random walks is much more efficient than using the entire graph: the objective only sums over node pairs that co-occur on walks instead of over all pairs of nodes. The lookup table is still V by D, so the number of parameters doesn't shrink; the savings are in the training objective (and in practice the softmax denominator is approximated too, e.g. with negative sampling).
N_R(u) is the neighborhood of node u obtained with strategy R: the multiset of nodes visited on random walks (of length, say, 10) starting from u. A multiset, because the same node can be visited more than once.
Objective: maximize the sum of log-likelihoods of the walks, i.e. maximize Σ_u Σ_{v ∈ N_R(u)} log P(v | z_u) over the embeddings.
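The objective as a sketch; a real implementation would optimize Z with SGD and approximate the softmax denominator, e.g. with negative sampling:

```python
import numpy as np

def log_likelihood(Z, neighborhoods):
    """Sum over u of sum over v in N_R(u) of log P(v | z_u).

    neighborhoods: dict u -> multiset (list) of nodes visited on walks from u.
    """
    total = 0.0
    for u, walk_nodes in neighborhoods.items():
        scores = Z @ Z[u]
        # log-softmax, computed in a numerically stable way
        log_probs = scores - scores.max() - np.log(np.exp(scores - scores.max()).sum())
        total += sum(log_probs[v] for v in walk_nodes)
    return total  # training maximizes this (equivalently, minimizes its negative)

Z = np.random.randn(4, 8)
print(log_likelihood(Z, {0: [1, 2, 1], 1: [0, 3]}))
```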
Node2Vec
Same idea, but instead of an unbiased random walk we use a biased random walk.
Develop a biased 2nd-order random walk that can trade off local and global exploration, i.e. interpolate between breadth-first search (BFS) and depth-first search (DFS).
There are two parameters, p and q. p is the return parameter: how likely we are to go back to the node we just came from (unnormalized weight 1/p). q is the in-out parameter, i.e. the ratio between BFS-like and DFS-like behaviour: a step to a node at the same distance from the previous node gets weight 1, a step to a node further away gets weight 1/q. So depending on q we either walk outward (DFS) or stay around where we came from (BFS); see the sketch below.
It is a 2nd-order walk because the next step depends on where we came from; a 1st-order walk (as in DeepWalk) only depends on the current node.
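A sketch of a single biased 2nd-order step; the toy graph and the p, q values are made up, and the 1/p, 1, 1/q weights follow the scheme described above:

```python
import random

graph = {
    "A": ["B", "C"], "B": ["A", "C", "D"],
    "C": ["A", "B"], "D": ["B"],
}

def biased_step(prev, curr, p, q):
    """Pick the next node given the previous and current node of the walk."""
    neighbors = graph[curr]
    weights = []
    for nxt in neighbors:
        if nxt == prev:                  # return to where we came from
            weights.append(1.0 / p)
        elif nxt in graph[prev]:         # same distance from prev (BFS-like)
            weights.append(1.0)
        else:                            # further away from prev (DFS-like)
            weights.append(1.0 / q)
    return random.choices(neighbors, weights=weights, k=1)[0]

# Low q favors outward (DFS-like) moves; low p favors returning.
print(biased_step(prev="A", curr="B", p=2.0, q=0.5))
```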
LINE
Embedding an entire graph
First idea: just sum up or average the embeddings of the nodes. This was used in 2016 for molecule classification.
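As a sketch (the node-embedding matrix is just a random stand-in for embeddings learned as above):

```python
import numpy as np

Z = np.random.randn(5, 8)               # node embeddings of a 5-node graph (stand-in)

graph_embedding_sum = Z.sum(axis=0)     # idea 1a: sum of node embeddings
graph_embedding_avg = Z.mean(axis=0)    # idea 1b: average of node embeddings
print(graph_embedding_avg.shape)        # (8,)
```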
Other idea: introduce a virtual node to represent the graph or a subgraph. The virtual node is connected to all the nodes we want to embed, and we then embed the virtual node with a standard node-embedding technique.
3rd idea: anonymous walks. Instead of recording which node we visit, we record the index of each node's first occurrence in the walk, so we don't know which node it is, we just enumerate them. This makes different walks comparable: A -> B -> A and A -> C -> A are the same anonymous walk, 1 -> 2 -> 1, agnostic to node identity.
We can calculate the number of possible anonymous walks for a given walk length / number of visited nodes.
We could build a bag of walks, i.e. count how often each anonymous walk appears in the graph (by sampling walks). With walks of length 3 there are 5 possible anonymous walks, so the graph is represented by a 5-dimensional vector; normalized, it can be seen as a probability distribution over the walks. See the sketch below.
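A sketch of anonymizing walks and building the bag-of-walks histogram; the sampled walks are hard-coded here but would come from random walks on the graph:

```python
from collections import Counter

def anonymize(walk):
    """Replace each node by the index of its first occurrence in the walk."""
    first_seen = {}
    return tuple(first_seen.setdefault(node, len(first_seen) + 1) for node in walk)

assert anonymize(["A", "B", "A"]) == anonymize(["A", "C", "A"]) == (1, 2, 1)

# Bag of anonymous walks: count how often each pattern shows up among sampled walks.
sampled_walks = [["A", "B", "A"], ["A", "C", "A"], ["B", "C", "D"]]
counts = Counter(anonymize(w) for w in sampled_walks)
total = sum(counts.values())
distribution = {walk: c / total for walk, c in counts.items()}
print(distribution)  # e.g. {(1, 2, 1): 0.67, (1, 2, 3): 0.33}
```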
To know how many walks to sample there is a bound in terms of ε and δ: sampling m = ⌈(2/ε²)(log(2^η − 2) − log δ)⌉ walks, where η is the number of distinct anonymous walks of the chosen length, makes the estimated distribution differ from the true one by more than ε with probability at most δ. We can also learn an embedding for each anonymous walk instead of just counting them.
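The bound as code, a sketch with illustrative ε and δ values and with η = 5 (the number of anonymous walks of length 3 from above):

```python
import math

def num_walks_needed(eta, eps, delta):
    """m such that the empirical anonymous-walk distribution is off by more than
    eps with probability at most delta (bound used for anonymous walk sampling)."""
    return math.ceil(2 / eps**2 * (math.log(2**eta - 2) - math.log(delta)))

print(num_walks_needed(eta=5, eps=0.1, delta=0.01))
```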