Network Embedding Paper Overview


Reposted from: http://blog.csdn.net/Dark_Scope/article/details/74279582 — thanks to the original author for sharing!

Ever since Word2vec appeared, it seems everything is being embedded. Today we focus on network embedding: given a graph, project its nodes or edges into a low-dimensional vector space for subsequent machine learning or data mining tasks. This is a relatively new approach for complex networks and has already produced some promising results.
This article surveys some of the popular methods and papers from recent years, drawn mainly from the thunlp/nrlpapers list, with a few other papers mixed in. It is a quick summary after a rough read-through; I hope it is helpful, and corrections are welcome wherever the treatment is not rigorous.
Setting aside traditional manifold learning methods, the outline below is roughly how the papers are organized (the distinctions are not strict):

DeepWalk (Online Learning of Social Representations)

DeepWalk is a KDD 2014 paper. At the time, Word2vec's success on text had set off a wave of vectorizing everything: based on word co-occurrence, Word2vec maps words to low-dimensional vectors while preserving rich information from the corpus. The idea behind DeepWalk is actually very simple: starting from nodes in the graph, use random walks to generate text-like sequences, then treat node IDs as "words" and train skip-gram on them to obtain "word vectors".
Though the idea is simple, there is real substance behind it: follow-up work proves that this procedure is equivalent to factorizing a particular matrix. DeepWalk itself has also inspired a series of subsequent work.
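As a rough illustration (not the authors' code), here is a minimal sketch of the DeepWalk pipeline, assuming networkx for the graph and gensim's Word2Vec for skip-gram training; the toy graph and hyperparameters are placeholders:

```python
import random
import networkx as nx
from gensim.models import Word2Vec

def random_walk(graph, start, walk_length):
    """One unweighted random walk starting from `start`."""
    walk = [start]
    for _ in range(walk_length - 1):
        neighbors = list(graph.neighbors(walk[-1]))
        if not neighbors:
            break
        walk.append(random.choice(neighbors))
    return walk

# Toy graph as a stand-in for a real network.
G = nx.karate_club_graph()

# A few walks per node; node IDs become "words".
walks = [[str(n) for n in random_walk(G, node, walk_length=40)]
         for _ in range(10) for node in G.nodes()]

# Train skip-gram exactly as for text (sg=1 selects skip-gram).
model = Word2Vec(walks, vector_size=64, window=5, sg=1, min_count=0)
vec = model.wv["0"]  # embedding of node 0
```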

node2vec (Scalable Feature Learning for Networks)

node2vec builds on DeepWalk by defining a biased random walk strategy to generate the sequences, still trained with skip-gram.
The paper analyzes BFS-like and DFS-like walks, which capture different kinds of structural information.
DeepWalk's random walk follows edge weights directly, whereas node2vec adds a weight-adjustment factor α: with t the previous node, v the current node and x a candidate next node, α depends on d(t, x), the shortest-path hop count from t to x.
Different settings of the parameters p and q preserve different kinds of information; when p and q are both 1.0, node2vec reduces to DeepWalk.
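For reference, a small sketch of the bias term under its usual definition (unnormalized; the full transition probability also multiplies in the edge weight, and the walk code itself is omitted):

```python
import networkx as nx

def node2vec_bias(graph, t, v, x, p, q):
    """Unnormalized bias alpha_pq(t, x): t is the previous node, v the current
    node, x a candidate next step.  The full transition probability is
    proportional to this bias times the weight of edge (v, x)."""
    if x == t:                    # d(t, x) == 0: step back to the previous node
        return 1.0 / p
    if graph.has_edge(t, x):      # d(t, x) == 1: stay near t (BFS-like)
        return 1.0
    return 1.0 / q                # d(t, x) == 2: move away from t (DFS-like)
```

With p = q = 1.0 every candidate gets the same bias, which is exactly the DeepWalk behaviour mentioned above.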

MMDW (Max-Margin DeepWalk: Discriminative Learning of Network Representation)

DeepWalk itself is unsupervised; if label data can be brought in, the resulting vectors perform better on classification tasks.
As mentioned above, there is a proof that DeepWalk is equivalent to factorizing a particular matrix M.
This paper combines DeepWalk with a max-margin classifier (SVM); viewed from the loss function there are two components, optimized alternately:
1. With X and Y fixed, optimize W and ξ; this is in fact a multi-class SVM.
2. With W and ξ fixed, optimizing X and Y is slightly special: a biased gradient has to be added, because the loss function couples X with W.
Training thus optimizes the discriminative part and the representation part simultaneously, achieving good results.

TADW (Network Representation Learning with Rich Text Information)

The paper gives a simple proof that DeepWalk is equivalent to factorizing a matrix M. In practice some nodes also carry text, so within the matrix factorization framework the text is folded in directly as an additional factor matrix, which makes the learned vectors carry richer information.
The text feature matrix is obtained by SVD dimensionality reduction of the TF-IDF matrix.
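As a rough sketch of how such a text factor might be built (using scikit-learn; the toy corpus and dimensions are placeholders, and the joint factorization itself is not shown):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

# One text document per node (placeholder corpus).
node_texts = ["graph embedding survey",
              "random walk skip gram",
              "matrix factorization of networks"]

tfidf = TfidfVectorizer().fit_transform(node_texts)    # |V| x vocab TF-IDF matrix
T = TruncatedSVD(n_components=2).fit_transform(tfidf)  # |V| x k text factor T
# TADW then factorizes M roughly as W^T * H * T, with T held fixed.
```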

GraRep (Learning Graph Representations with Global Structural Information)

Continuing the matrix factorization idea: the information captured by the k-step transition matrices (k being the number of random-walk steps) differs for different k.

So each k-step matrix can be factorized separately, and the vectors obtained from each step are concatenated to form the final representation. The paper contains a complete derivation, which is not repeated here.
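To convey the skeleton of the approach, here is a rough numpy sketch that factorizes each k-step transition matrix and concatenates the pieces; note the actual paper factorizes a shifted log-probability matrix rather than the raw transition matrix:

```python
import numpy as np

def grarep_sketch(A, K=3, dim=16):
    """Factorize the 1..K step transition matrices and concatenate the pieces.
    A: weighted adjacency matrix with no isolated nodes."""
    P = A / A.sum(axis=1, keepdims=True)      # row-normalized transition matrix
    reps, P_k = [], np.eye(len(A))
    for _ in range(K):
        P_k = P_k @ P                         # k-step transition probabilities
        U, S, _ = np.linalg.svd(P_k)          # (the paper factorizes a shifted log)
        reps.append(U[:, :dim] * np.sqrt(S[:dim]))
    return np.hstack(reps)                    # final vector = concatenation over k
```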

LINE (Large-scale Information Network Embedding)

LINE analyzes first-order and second-order proximity. First-order proximity means two nodes are directly connected, and the larger the edge weight the more similar they are (e.g. nodes 6 and 7 in the paper's figure); second-order proximity means two nodes share many neighbors, so their similarity is high even without a direct edge (e.g. nodes 5 and 6).
The paper constructs objective functions in a very simple way that preserves both kinds of information. For first-order proximity, the empirical probability that nodes i and j are connected is the normalized edge weight, p̂_1(i, j) = w_ij / W, while the probability computed from the vectors is p_1(i, j) = 1 / (1 + exp(−u_i · u_j)). The objective is to minimize the distance between these two distributions; choosing KL divergence as the distance measure yields the loss function O1.
There is also an optimization trick, the edge-sampling algorithm: because edge weights vary greatly, applying SGD directly works poorly, so edges are sampled with probability proportional to their weights, and each sampled edge is then treated as binary (unit weight).
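A small numpy sketch of the first-order model probability and the weight-proportional edge sampling (placeholder edge list; negative sampling and the actual SGD updates are omitted):

```python
import numpy as np

def p1(u_i, u_j):
    """LINE first-order model probability for nodes i and j."""
    return 1.0 / (1.0 + np.exp(-np.dot(u_i, u_j)))

# Edge sampling: draw edges with probability proportional to their weight,
# then treat each drawn edge as having unit weight in the SGD update.
edges = [(0, 1), (1, 2), (2, 3)]                 # placeholder edge list
weights = np.array([5.0, 1.0, 0.5])
sampled = np.random.choice(len(edges), size=10, p=weights / weights.sum())
```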

NEU (Fast Network Embedding Enhancement via High Order Proximity Approximation)

This is a recent IJCAI paper that, to be honest, takes a rather clever shortcut. It first analyzes several embedding methods that can be viewed as matrix factorization.

The conclusion is that if the factorization f(A) = R·C captures higher-order information more accurately, the resulting embeddings perform better, but the computational complexity of the algorithm also rises.
So the paper uses a very clever trick: it updates the low-order factorization results to approximate a higher-order factorization, making the final vectors better, and the update can be layered on top of several existing algorithms.
The paper proves a bound to justify this update.
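Roughly, the enhancement looks like the sketch below, where Â is a normalized adjacency matrix and R the existing low-order embedding matrix; the exact form and coefficients should be checked against the paper:

```python
import numpy as np

def neu_enhance(R, A_hat, lam1=0.5, lam2=0.25):
    """Cheap enhancement of an existing embedding matrix R (one row per node)
    by mixing in higher-order neighbourhood information via A_hat."""
    AR = A_hat @ R
    return R + lam1 * AR + lam2 * (A_hat @ AR)
```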

Extra Info


Almost everything above considers only the network structure, but real-world nodes and edges tend to carry rich information. In a Quora-like setting, for example, each user has labels and text of their own, and in some settings even the edges carry labels. This information is very important when building the network; we already saw TADW fold node text into training. Below is a list of papers in this direction.

CANE (Context-Aware Network Embedding for Relation Modeling)

CANE considers the context attached to each node, mainly text, and learns to output a text vector v_t and a structure vector v_s for every node.
In the context-free variant, v_t is fixed and produced by a CNN (the left part of the paper's figure): the word vectors of a text form a matrix, a convolution with window length l and d kernels is applied, and the result is max-pooled by row to obtain the final text vector.
The context-aware variant introduces a mutual attention mechanism: for an edge e = (u, v) it considers both text matrices T_u and T_v, produces attention weights through the process shown at the bottom right of the figure, and then applies a similar pooling step. As a result, a node gets different text vectors when paired with different neighbors.
A is a trainable parameter matrix; its physical meaning can be thought of as a transformation into a shared space of the target dimension.
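A rough numpy sketch of the mutual-attention step as I read it (the CNN that produces the word-feature matrices is omitted, and the mean-pooling choice is an assumption):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def mutual_attention(P, Q, A):
    """P: d x m word features of u's text, Q: d x n features of v's text,
    A: d x d trainable transform.  Returns edge-dependent text vectors."""
    F = np.tanh(P.T @ A @ Q)        # m x n correlation between the two texts
    a_p = softmax(F.mean(axis=1))   # attention over u's words
    a_q = softmax(F.mean(axis=0))   # attention over v's words
    return P @ a_p, Q @ a_q         # context-aware text vectors for u and v
```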

CENE (A General Framework for Content-Enhanced Network Representation Learning)

This paper turns text content into special nodes, so there are two kinds of edges, (node, document) and (node, node). Both kinds of edges are modeled jointly, and the loss function includes L_nn and L_nc. The text is further split into finer-grained sentences, and the paper lists three ways to embed a sentence.
As in many methods, the negative part of the loss is handled by negative sampling.

TransNet (Translation-Based Network Representation Learning for Social Relation Extraction)

This paper was also published at IJCAI 2017. It introduces a translation mechanism: the labels on an edge (forming a binary vector) are encoded with an autoencoder, and nodes and edges are mapped into the same space so that they can be added and subtracted. In this space we expect u + l = v' (each node has two vector representations, marking its role as the "head" or "tail" of an edge, distinguished with a prime).
At prediction time, simply computing v' − u gives l, which the decoder part of the autoencoder then maps back to a binary label set, yielding the predicted labels.
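A tiny sketch of that prediction step (`decoder` is a hypothetical stand-in for the trained autoencoder's decoder):

```python
import numpy as np

def predict_edge_labels(u, v_prime, decoder, threshold=0.5):
    """Translation mechanism: recover the edge vector as v' - u, then decode
    it back to a binary label set."""
    l = v_prime - u                    # u + l ≈ v'  =>  l ≈ v' - u
    scores = decoder(l)                # hypothetical trained decoder callable
    return scores > threshold          # predicted binary label set
```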

Deep Learning

Deep learning has been in full swing in recent years. Strictly speaking, Word2vec-style models are shallow, but more complex deep models can also be used to obtain embeddings; the methods above largely serialize the network and borrow NLP techniques for training.

We know that a CNN on an image convolves over neighboring pixels, so what if we run a convolution directly on the graph? That is the idea of GCN. This post does not introduce GCN in detail, but the SSC-GCN section below should be enough to see how it works.
Some of the work listed above already applies deep learning to network embedding: CANE and TransNet, for example, have such structures, and TransNet's autoencoder in particular is a neural network.
There is also a lot of deep learning work that operates on whole graphs, such as predicting graph-level properties, but this post mainly covers methods for embedding individual nodes.

SSC-GCN (Semi-Supervised Classification with Graph Convolutional Networks)

https://github.com/tkipf/gcn
http://tkipf.github.io/graph-convolutional-networks/

This is a completely different idea from DeepWalk: it introduces a spectral convolution operation. For now the convolution is performed over the whole graph and mini-batching is not yet supported; the end goal is per-node classification and representation learning.
In some earlier work, neural networks for graphs operated at the graph level, e.g. classifying whole (sub)graphs, whereas this work essentially operates on single nodes.

The operation works like this: each row of X holds one node's input features. Left-multiplying by A mixes each node's row with its neighbors' rows, and right-multiplying by W changes the number of columns, which is really a "fully connected" operation, except that A may be a sparse propagation matrix. The whole thing can therefore be viewed as a convolution: at each layer, the weighted information of a node's neighbors is folded into that node's output row.
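Put differently, one propagation layer is roughly the following; this is a minimal numpy sketch using the normalized adjacency with self-loops from the GCN paper:

```python
import numpy as np

def normalize_adjacency(A):
    """A_hat = D^{-1/2} (A + I) D^{-1/2}, the GCN propagation matrix."""
    A_tilde = A + np.eye(len(A))
    d = A_tilde.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    return D_inv_sqrt @ A_tilde @ D_inv_sqrt

def gcn_layer(A_hat, H, W):
    """One layer: mix each node's row of H with its neighbours' rows via
    A_hat, then apply the shared linear map W and a ReLU."""
    return np.maximum(A_hat @ H @ W, 0.0)
```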

Finally, a semi-supervised objective is defined: the labels available on a portion of the nodes contribute to the loss, so the overall loss takes the form L = L0 + λ·L_reg.

Here L0 is the supervised part, while the trailing L_reg term actually carries the edge information, with A being the adjacency matrix (or some function of it) that describes all the edges.
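The edge term can be read as the classic graph-Laplacian smoothness regularizer; a small illustrative sketch of that reading (λ would weight it against the supervised part):

```python
import numpy as np

def laplacian_reg(A, F):
    """Sum over node pairs of A_ij * ||f(x_i) - f(x_j)||^2: connected nodes
    are pushed towards similar outputs."""
    n = len(A)
    return sum(A[i, j] * np.sum((F[i] - F[j]) ** 2)
               for i in range(n) for j in range(n))

# total_loss = supervised_loss_on_labelled_nodes + lam * laplacian_reg(A, F)
```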
Interestingly, even when the network is randomly initialized and not trained at all, the output distribution is already fairly clear (nodes from the same community end up mapped close together). The paper explains this by noting that the computation itself resembles the Weisfeiler-Lehman algorithm.

SDNE (Structural Deep Network Embedding)


Part of the logic here is similar to TransNet: a feature vector describing each node (e.g. the node's "adjacency vector") is encoded with an autoencoder, with a heavier reconstruction penalty on the non-zero entries (the absence of an edge does not necessarily mean no relation, it may simply not have been observed yet, so the penalty is balanced this way). The autoencoder's middle layer is taken as the node's vector representation, which captures second-order proximity: nodes that share many neighbors have similar "adjacency vectors", so their mapped vectors y end up closer.
First-order proximity is handled by penalizing the vector distance between nodes joined by an edge.
Both parts enter the final loss function, in which L_reg is a regularization term.
One limitation: for networks with a particularly large number of nodes, feeding full "adjacency vectors" through this kind of computation becomes a burden.
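Putting the pieces together, a rough numpy sketch of the two proximity terms as described above (the autoencoder weights and the regularizer are omitted; β > 1 is the extra penalty on non-zero entries):

```python
import numpy as np

def sdne_loss(S, S_hat, Y, alpha=1.0, beta=5.0):
    """S: adjacency rows ("neighbour vectors"), S_hat: autoencoder output,
    Y: middle-layer embeddings."""
    B = np.where(S != 0, beta, 1.0)                  # heavier penalty on non-zeros
    second_order = np.sum(((S_hat - S) * B) ** 2)    # 2nd-order: reconstruction
    first_order = sum(S[i, j] * np.sum((Y[i] - Y[j]) ** 2)   # 1st-order: edges
                      for i in range(len(S)) for j in range(len(S)))
    return second_order + alpha * first_order
```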

Heterogeneous

Real-world networks are undoubtedly heterogeneous: in a trading network, for example, the nodes include people, goods, shops and so on; more generally, a knowledge graph contains different types of nodes and edges. Most of the work described above targets homogeneous networks, so understanding how to embed heterogeneous networks is helpful for real-world applications.

PTE (Predictive Text Embedding through Large-Scale Heterogeneous Text Networks)


The main intent of this paper is to bring predictive (label) information into the final embedding without plugging in a complex predictive model such as a CNN/RNN. It therefore defines three networks — word-word, word-document and word-label — each essentially a bipartite graph, and simply sums their loss functions (each of similar form, the KL distance between an empirical probability and the model probability). Simple and blunt, but effective.

HINES (Heterogeneous Information Network Embedding for Meta Path Based Proximity)


This paper embeds heterogeneous networks (e.g. knowledge graphs) that contain different types of nodes and different types of edges. It introduces the concept of a meta path, i.e. a connection pattern between nodes defined by certain meta-information: a meta path such as a1 (Author) - p1 (Paper) - a2 (Author) represents the information that a1 and a2 may have collaborated on paper p1. The concept extends naturally to many scenarios.
Ordinarily proximity is computed at first order, but once meta paths are introduced, two nodes A and B sitting at the two ends of a meta path should have higher proximity; how much higher depends, of course, on how informative the meta path itself is.
The paper selects all meta paths of length at most l, since longer paths generally carry less information.
The final loss function is again a distance between distributions.

Summary

There has been a great deal of network embedding work in recent years; only a small part is covered here. Although NE borrows many ideas from NLP, there are important differences: a. If node IDs are used as "words", real networks can be very sparse and the number of nodes can be enormous, hundreds of millions being perfectly normal, which limits the applicability of some of the methods above (imagine how much memory such a large embedding table would need). b. Nodes and edges often carry a wealth of information, so how to train better on heterogeneous networks, and how to make the resulting vectors reflect that information, remain challenging and interesting questions.

This article does not go into every paper in detail, only recording the main ideas; interested readers should consult the original papers.

References

"0" above each sub-title in parentheses of the contents of the corresponding paper title.
"1" thunlp_nrlpapers
"2" "representation Learning with Networks" by Jure leskovec.slide_01

