gSpan frequent sub-graph mining algorithm

Last Update:2015-02-24 Source: Internet

Author: User


References:
http://www.cs.ucsb.edu/~xyan/papers/gSpan.pdf
http://www.cs.ucsb.edu/~xyan/papers/gSpan-short.pdf
http://www.jos.org.cn/1000-9825/18/2469.pdf
http://blog.csdn.net/coolypf/article/details/8263176

More mining algorithms: https://github.com/linyiqun/DataMiningAlgorithm

Introduction

gSpan is an algorithm from the field of graph mining. As a frequent sub-graph mining algorithm it also serves as the basis of other graph mining algorithms, which makes it an important one to know. To mine frequent sub-graphs, gSpan follows the same principle as FP-growth: patterns are grown step by step (pattern growth), and a minimum support count is used as the filter condition. Graph algorithms are more abstract in code than most other algorithms and demand more spatial imagination to implement. The core task of gSpan is this: given n graphs, mine the sub-graphs that occur frequently among them.

Algorithm principle

To be honest, gSpan is one of the harder algorithms I have studied recently. To implement it you have to understand its principles, and it takes a good deal of time to grasp the definitions it relies on, such as DFS codes and the rightmost path. So let us first look at the overall structure of the algorithm.

1. Traverse all the graphs and count the frequency of every edge label and vertex label.

2. Compare the frequencies with the minimum support count and remove the infrequent edges and vertices.

3. Sort the remaining vertex and edge labels by frequency and relabel them with their rank numbers.

4. Count the frequency of each edge again. Then, for every frequent edge, initialize a code that contains only this edge and start the subMining() process from it.
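Steps 1 to 3 can be sketched in isolation as follows. This is a simplified illustration, not the implementation shown later in this article: the class `LabelSupportSketch` and its methods are hypothetical names, each graph is reduced to a plain array of its labels, and infrequent labels are simply mapped to -1.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper class, used only for illustration.
class LabelSupportSketch {
    /** Counts, for each label, the number of graphs that contain it at least once. */
    static int[] countSupport(List<int[]> labelsPerGraph, int labelMax) {
        int[] support = new int[labelMax];
        for (int[] labels : labelsPerGraph) {
            Set<Integer> seen = new HashSet<>();
            for (int label : labels) {
                // A label is counted only once per graph
                if (seen.add(label)) {
                    support[label]++;
                }
            }
        }
        return support;
    }

    /** Maps each frequent label to its rank (0 = most frequent); infrequent labels map to -1. */
    static int[] rankLabels(int[] support, int minSupportCount) {
        List<Integer> frequent = new ArrayList<>();
        for (int label = 0; label < support.length; label++) {
            if (support[label] >= minSupportCount) {
                frequent.add(label);
            }
        }
        // Descending support; ties are broken by the label value so the result is deterministic
        frequent.sort((a, b) -> support[a] != support[b]
                ? Integer.compare(support[b], support[a])
                : Integer.compare(a, b));
        int[] labelToRank = new int[support.length];
        Arrays.fill(labelToRank, -1);
        for (int rank = 0; rank < frequent.size(); rank++) {
            labelToRank[frequent.get(rank)] = rank;
        }
        return labelToRank;
    }

    public static void main(String[] args) {
        // Three graphs, each given only by the labels of its vertices
        List<int[]> graphs = Arrays.asList(
                new int[] {0, 1, 1}, new int[] {1, 2}, new int[] {1, 2, 2});
        int[] support = countSupport(graphs, 3);
        System.out.println(Arrays.toString(support));
        System.out.println(Arrays.toString(rankLabels(support, 2)));
    }
}
```

Counting a label once per graph matters here: support is the number of graphs that contain the label, not the number of times it occurs.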

The process of subMining

1. Restore the current sub-graph from its GraphCode.

2. Check whether the current code is the minimum DFS code. If it is, add the sub-graph to the result set, then try to add each possible edge to it and continue mining.

3. If it is not the minimum code, the mining process for this sub-graph ends.

DFS encoding

gSpan encodes each edge of a graph as a five-tuple e(v0, v1, A, B, a). Here v0 and v1 are identifiers that can be seen as vertex IDs, A and B are the labels of the two vertices, and a is the label of the edge between them. A graph is made up of such edges, G = {e1, e2, e3, ...}, and DFS codes are compared on the five elements of these tuples. The rule I use here is to compare the elements from left to right: whichever tuple is the first to have a smaller element is the smaller one. Whole graphs are compared in the same way, edge by edge; the exact rules can be seen in the comments of my code. Note that this rule is not universal: the papers I read describe the ordering somewhat differently.
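The left-to-right rule described above can be sketched as a pair of comparison functions. This is only an illustration of the simplified rule in this article, not the DFS lexicographic order defined in the gSpan paper, which treats forward and backward edges differently. The class `DfsCodeCompareSketch` is a hypothetical name, and each edge is assumed to be an `int[5]` in the field order used by the `Edge` constructor in the code of this article (ix, iy, x, a, y).

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper class, used only for illustration.
class DfsCodeCompareSketch {
    /** Compares two five-tuples element by element, from left to right. */
    static int compareEdges(int[] e1, int[] e2) {
        for (int i = 0; i < 5; i++) {
            if (e1[i] != e2[i]) {
                return Integer.compare(e1[i], e2[i]);
            }
        }
        return 0;
    }

    /** Compares two codes edge by edge; if one code is a prefix of the other, the shorter one is smaller. */
    static int compareCodes(List<int[]> c1, List<int[]> c2) {
        int n = Math.min(c1.size(), c2.size());
        for (int i = 0; i < n; i++) {
            int c = compareEdges(c1.get(i), c2.get(i));
            if (c != 0) {
                return c;
            }
        }
        return Integer.compare(c1.size(), c2.size());
    }

    public static void main(String[] args) {
        int[] e1 = {0, 1, 0, 0, 1};
        int[] e2 = {0, 1, 0, 1, 0};
        // The tuples first differ at the fourth element (0 < 1), so e1 is the smaller one
        System.out.println(compareEdges(e1, e2) < 0);
        System.out.println(compareCodes(Arrays.asList(e1), Arrays.asList(e1, e2)) < 0);
    }
}
```

A code is the minimum DFS code of its graph when no other DFS traversal of the same graph yields a code that compares smaller under the chosen order.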

Generate Subgraph

Generating the next sub-graphs is another difficult part of gSpan. First you traverse the original graph to find the part whose code matches the sub-graph being mined, then you look for candidate edges along the rightmost path of that match. Extensions on the rightmost path come in two kinds: one extends from the rightmost vertex, the other extends from one of the other vertices on the rightmost path. Each of the two cases needs its own checks.
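To make the rightmost path concrete, here is a minimal sketch that recovers it from a DFS code. It assumes the usual gSpan convention that an edge (from, to) is a forward edge when from < to, and it keeps only the two vertex IDs of each edge. The class `RightmostPathSketch` is a hypothetical name and is not part of the implementation in this article.

```java
import java.util.Arrays;
import java.util.LinkedList;
import java.util.List;

// Hypothetical helper class, used only for illustration.
class RightmostPathSketch {
    /**
     * Returns the vertex IDs on the rightmost path, from the root to the
     * rightmost vertex. Each edge of the code is given as {from, to}.
     */
    static List<Integer> rightmostPath(List<int[]> code) {
        LinkedList<Integer> path = new LinkedList<>();
        int current = -1; // -1 means the rightmost vertex has not been found yet
        // Walk the code backwards and follow forward edges up to the root
        for (int i = code.size() - 1; i >= 0; i--) {
            int from = code.get(i)[0];
            int to = code.get(i)[1];
            if (from >= to) {
                continue; // backward edge: not part of the DFS tree
            }
            if (current == -1) {
                // The last forward edge discovers the rightmost vertex
                path.addFirst(to);
                path.addFirst(from);
                current = from;
            } else if (to == current) {
                path.addFirst(from);
                current = from;
            }
        }
        return path;
    }

    public static void main(String[] args) {
        List<int[]> code = Arrays.asList(
                new int[] {0, 1}, new int[] {1, 2}, new int[] {2, 0}, new int[] {1, 3});
        System.out.println(rightmostPath(code)); // prints [0, 1, 3]
    }
}
```

In this example vertex 3 is the rightmost vertex. In gSpan, backward extensions start from the rightmost vertex and connect it to another vertex on the path, while forward extensions attach a new vertex (here it would get ID 4) to any vertex on the path.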

Techniques of Algorithm

The implementation uses quite a few tricks, and some of them are hard to understand. For example, while building DFS codes or searching for extension edges, the vertex IDs of the original graph are mapped to the vertex IDs used in the five-tuples, which is not something you think of at first. Another is how to describe a graph with a suitable data structure.

Implementation of the algorithm

This implementation is based on other versions found on the Internet. After understanding that code, I modified some parts of it myself. Because there is a lot of code, only the core code is shown here.

GSpanTool.java:

package DataMining_GSpan;

import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.text.MessageFormat;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Map;

/**
 * gSpan frequent sub-graph mining algorithm tool class
 *
 * @author lyq
 */
public class GSpanTool {
    // Record types in the data file (lowercase, as in the usual gSpan text format)
    public final String INPUT_NEW_GRAPH = "t";
    public final String INPUT_VERTICE = "v";
    public final String INPUT_EDGE = "e";
    // Maximum number of labels, covering both vertex labels and edge labels
    public final int LABEL_MAX = 100;

    // Path of the test data file
    private String filePath;
    // Minimum support rate
    private double minSupportRate;
    // Minimum support count: the number of graphs multiplied by the minimum support rate
    private int minSupportCount;
    // Raw data of all graphs
    private ArrayList<GraphData> totalGraphDatas;
    // Graph structures of all graphs
    private ArrayList<Graph> totalGraphs;
    // Frequent sub-graphs that have been mined
    private ArrayList<Graph> resultGraphs;
    // Edge frequency statistics
    private EdgeFrequency ef;
    // Vertex label frequencies
    private int[] freqNodeLabel;
    // Edge label frequencies
    private int[] freqEdgeLabel;
    // Number of vertex labels after relabeling
    private int newNodeLabelNum = 0;
    // Number of edge labels after relabeling
    private int newEdgeLabelNum = 0;

    public GSpanTool(String filePath, double minSupportRate) {
        this.filePath = filePath;
        this.minSupportRate = minSupportRate;
        readDataFile();
    }

    /**
     * Read data from the file
     */
    private void readDataFile() {
        File file = new File(filePath);
        ArrayList<String[]> dataArray = new ArrayList<String[]>();

        // try-with-resources closes the reader even if reading fails
        try (BufferedReader in = new BufferedReader(new FileReader(file))) {
            String str;
            String[] tempArray;
            while ((str = in.readLine()) != null) {
                tempArray = str.split(" ");
                dataArray.add(tempArray);
            }
        } catch (IOException e) {
            e.printStackTrace();
        }

        calFrequentAndRemove(dataArray);
    }

    /**
     * Count the frequency of edge and vertex labels and remove the
     * infrequent vertices and edges; the labels serve as the counting keys
     *
     * @param dataArray
     *            raw data
     */
    private void calFrequentAndRemove(ArrayList<String[]> dataArray) {
        // Java initializes int arrays to 0, so every label starts with frequency 0
        freqNodeLabel = new int[LABEL_MAX];
        freqEdgeLabel = new int[LABEL_MAX];

        GraphData gd = null;
        totalGraphDatas = new ArrayList<>();
        for (String[] array : dataArray) {
            if (array[0].equals(INPUT_NEW_GRAPH)) {
                if (gd != null) {
                    totalGraphDatas.add(gd);
                }
                // Start a new graph
                gd = new GraphData();
            } else if (array[0].equals(INPUT_VERTICE)) {
                int label = Integer.parseInt(array[2]);
                // Each label is counted only once per graph
                if (!gd.getNodeLabels().contains(label)) {
                    freqNodeLabel[label]++;
                }
                gd.getNodeLabels().add(label);
                gd.getNodeVisibles().add(true);
            } else if (array[0].equals(INPUT_EDGE)) {
                int label = Integer.parseInt(array[3]);
                // Each label is counted only once per graph
                if (!gd.getEdgeLabels().contains(label)) {
                    freqEdgeLabel[label]++;
                }

                int i = Integer.parseInt(array[1]);
                int j = Integer.parseInt(array[2]);

                gd.getEdgeLabels().add(label);
                gd.getEdgeX().add(i);
                gd.getEdgeY().add(j);
                gd.getEdgeVisibles().add(true);
            }
        }
        // Add the data of the last graph
        if (gd != null) {
            totalGraphDatas.add(gd);
        }
        minSupportCount = (int) (minSupportRate * totalGraphDatas.size());

        for (GraphData g : totalGraphDatas) {
            g.removeInFreqNodeAndEdge(freqNodeLabel, freqEdgeLabel,
                    minSupportCount);
        }
    }

    /**
     * Sort the labels by frequency and relabel them
     */
    private void sortAndRelabel() {
        int label1 = 0;
        int label2 = 0;
        int temp = 0;
        // rankNodeLabels[i] is the vertex label that holds rank i
        int[] rankNodeLabels = new int[LABEL_MAX];
        // rankEdgeLabels[i] is the edge label that holds rank i
        int[] rankEdgeLabels = new int[LABEL_MAX];
        // Inverse mappings: label to rank
        int[] nodeLabel2Rank = new int[LABEL_MAX];
        int[] edgeLabel2Rank = new int[LABEL_MAX];

        for (int i = 0; i < LABEL_MAX; i++) {
            // Initially, rank i is held by label i
            rankNodeLabels[i] = i;
            rankEdgeLabels[i] = i;
        }

        // Selection sort of the vertex labels by descending frequency
        for (int i = 0; i < freqNodeLabel.length - 1; i++) {
            int k = 0;
            label1 = rankNodeLabels[i];
            temp = label1;
            for (int j = i + 1; j < freqNodeLabel.length; j++) {
                label2 = rankNodeLabels[j];
                if (freqNodeLabel[temp] < freqNodeLabel[label2]) {
                    // Remember the more frequent label and its position
                    temp = label2;
                    k = j;
                }
            }

            if (temp != label1) {
                // Swap the labels at ranks i and k
                temp = rankNodeLabels[k];
                rankNodeLabels[k] = rankNodeLabels[i];
                rankNodeLabels[i] = temp;
            }
        }

        // Sort the edge labels in the same way
        for (int i = 0; i < freqEdgeLabel.length - 1; i++) {
            int k = 0;
            label1 = rankEdgeLabels[i];
            temp = label1;
            for (int j = i + 1; j < freqEdgeLabel.length; j++) {
                label2 = rankEdgeLabels[j];
                if (freqEdgeLabel[temp] < freqEdgeLabel[label2]) {
                    // Remember the more frequent label and its position
                    temp = label2;
                    k = j;
                }
            }

            if (temp != label1) {
                // Swap the labels at ranks i and k
                temp = rankEdgeLabels[k];
                rankEdgeLabels[k] = rankEdgeLabels[i];
                rankEdgeLabels[i] = temp;
            }
        }

        // Convert rank-to-label into label-to-rank
        for (int i = 0; i < rankNodeLabels.length; i++) {
            nodeLabel2Rank[rankNodeLabels[i]] = i;
        }

        for (int i = 0; i < rankEdgeLabels.length; i++) {
            edgeLabel2Rank[rankEdgeLabels[i]] = i;
        }

        for (GraphData gd : totalGraphDatas) {
            gd.reLabelByRank(nodeLabel2Rank, edgeLabel2Rank);
        }

        // Find the largest rank whose label still meets the support count.
        // >= is used so that the threshold matches the one applied to edges
        // in freqGraphMining() and subMining().
        for (int i = 0; i < rankNodeLabels.length; i++) {
            if (freqNodeLabel[rankNodeLabels[i]] >= minSupportCount) {
                newNodeLabelNum = i;
            }
        }
        for (int i = 0; i < rankEdgeLabels.length; i++) {
            if (freqEdgeLabel[rankEdgeLabels[i]] >= minSupportCount) {
                newEdgeLabelNum = i;
            }
        }
        // A rank is 1 less than the number of labels, so add 1 back
        newNodeLabelNum++;
        newEdgeLabelNum++;
    }

    /**
     * Mine the frequent sub-graphs
     */
    public void freqGraphMining() {
        long startTime = System.currentTimeMillis();
        long endTime = 0;
        Graph g;
        sortAndRelabel();

        resultGraphs = new ArrayList<>();
        totalGraphs = new ArrayList<>();
        // Build the graph structures from the graph data
        for (GraphData gd : totalGraphDatas) {
            g = new Graph();
            g = g.constructGraph(gd);
            totalGraphs.add(g);
        }

        // Initialize the edge frequency object from the new numbers of
        // vertex labels and edge labels
        ef = new EdgeFrequency(newNodeLabelNum, newEdgeLabelNum);
        for (int i = 0; i < newNodeLabelNum; i++) {
            for (int j = 0; j < newEdgeLabelNum; j++) {
                for (int k = 0; k < newNodeLabelNum; k++) {
                    for (Graph tempG : totalGraphs) {
                        if (tempG.hasEdge(i, j, k)) {
                            ef.edgeFreqCount[i][j][k]++;
                        }
                    }
                }
            }
        }

        Edge edge;
        GraphCode gc;
        for (int i = 0; i < newNodeLabelNum; i++) {
            for (int j = 0; j < newEdgeLabelNum; j++) {
                for (int k = 0; k < newNodeLabelNum; k++) {
                    if (ef.edgeFreqCount[i][j][k] >= minSupportCount) {
                        gc = new GraphCode();
                        edge = new Edge(0, 1, i, j, k);
                        gc.getEdgeSeq().add(edge);

                        // Record the IDs of the graphs that contain this edge
                        for (int y = 0; y < totalGraphs.size(); y++) {
                            if (totalGraphs.get(y).hasEdge(i, j, k)) {
                                gc.getGs().add(y);
                            }
                        }
                        // Start mining from an edge that meets the threshold
                        subMining(gc, 2);
                    }
                }
            }
        }

        endTime = System.currentTimeMillis();
        System.out.println("Algorithm execution time: "
                + (endTime - startTime) + "ms");
        printResultGraphInfo();
    }

    /**
     * Mine frequent sub-graphs
     *
     * @param gc
     *            graph code
     * @param next
     *            number of vertices in the sub-graph, which is also the ID
     *            that a newly added vertex would get
     */
    public void subMining(GraphCode gc, int next) {
        Edge e;
        Graph graph = new Graph();
        int id1;
        int id2;

        for (int i = 0; i < next; i++) {
            graph.nodeLabels.add(-1);
            graph.edgeLabels.add(new ArrayList<Integer>());
            graph.edgeNexts.add(new ArrayList<Integer>());
        }

        // First, build the graph from the edge five-tuples of the code
        for (int i = 0; i < gc.getEdgeSeq().size(); i++) {
            e = gc.getEdgeSeq().get(i);
            id1 = e.ix;
            id2 = e.iy;

            graph.nodeLabels.set(id1, e.x);
            graph.nodeLabels.set(id2, e.y);
            graph.edgeLabels.get(id1).add(e.a);
            graph.edgeLabels.get(id2).add(e.a);
            graph.edgeNexts.get(id1).add(id2);
            graph.edgeNexts.get(id2).add(id1);
        }

        DFSCodeTraveler dTraveler = new DFSCodeTraveler(gc.getEdgeSeq(), graph);
        dTraveler.traveler();
        if (!dTraveler.isMin) {
            return;
        }

        // The current code is the minimum code, so add this graph to the result set
        resultGraphs.add(graph);
        Edge e1;
        ArrayList<Integer> gIds;
        SubChildTraveler sct;
        ArrayList<Edge> edgeArray;
        // Potential child edges, each mapped to the IDs of the graphs it belongs to
        // (this relies on Edge implementing equals() and hashCode())
        HashMap<Edge, ArrayList<Integer>> edge2GId = new HashMap<>();
        for (int i = 0; i < gc.gs.size(); i++) {
            int id = gc.gs.get(i);

            // Given the current structure, look in this graph for edges
            // that can be added to the sub-graph to continue mining
            sct = new SubChildTraveler(gc.edgeSeq, totalGraphs.get(id));
            sct.traveler();
            edgeArray = sct.getResultChildEdge();

            // Update the graph IDs of each edge
            for (Edge e2 : edgeArray) {
                if (!edge2GId.containsKey(e2)) {
                    gIds = new ArrayList<>();
                } else {
                    gIds = edge2GId.get(e2);
                }

                gIds.add(id);
                edge2GId.put(e2, gIds);
            }
        }

        for (Map.Entry<Edge, ArrayList<Integer>> entry : edge2GId.entrySet()) {
            e1 = entry.getKey();
            gIds = entry.getValue();

            // Only continue mining if the frequency of this edge reaches
            // the minimum support count
            if (gIds.size() < minSupportCount) {
                continue;
            }

            GraphCode nGc = new GraphCode();
            nGc.edgeSeq.addAll(gc.edgeSeq);
            // Add the new edge to the current code, forming a new sub-graph to mine
            nGc.edgeSeq.add(e1);
            nGc.gs.addAll(gIds);

            if (e1.iy == next) {
                // The target ID of the edge is a new vertex, so the vertex count grows by one
                subMining(nGc, next + 1);
            } else {
                // The vertex already exists, so next stays unchanged
                subMining(nGc, next);
            }
        }
    }

    /**
     * Print information about the resulting frequent sub-graphs
     */
    public void printResultGraphInfo() {
        System.out.println(MessageFormat.format(
                "Number of frequent sub-graphs mined: {0}", resultGraphs.size()));
    }
}

While implementing it, I gradually found that this algorithm is far harder than I had expected, not only because of its abstraction but also because of how complicated it is to test. The test data has to be made up: real data sets are too large, and data I construct myself is not very precise. In the end I just made up the data for one graph and walked through the mining of one of its edges, which roughly covers the whole process. The code is not complete and is for learning only.

Disadvantages of the algorithm

After finishing the implementation and tracing a small run, I found that the two depth-first traversals still have a problem. Both the check for the minimum DFS code and the search for the matching code in the original graph assume that consecutive edges are connected end to end. If they are not, the check gives a wrong result: adding an edge on the rightmost path can mean that an earlier vertex is extended a second time, and then the edge sequence is no longer contiguous. The code above cannot handle such cases. My own idea for a fix is to use a stack and push the vertices onto it during the traversal.
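One possible reading of the stack idea is sketched below. It is an assumption about what such a fix could look like, not code from this implementation: the class `StackDfsSketch` is a hypothetical name, and the graph is reduced to plain adjacency lists without labels. Because visited vertices stay on the stack, the traversal can return to an earlier vertex and extend it again after a branch is exhausted.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

// Hypothetical helper class, used only for illustration.
class StackDfsSketch {
    /** Returns the DFS tree edges {from, to} in discovery order, using an explicit stack. */
    static List<int[]> dfsEdges(List<List<Integer>> adj, int start) {
        boolean[] visited = new boolean[adj.size()];
        int[] nextIdx = new int[adj.size()]; // next neighbor to try for each vertex
        List<int[]> edges = new ArrayList<>();
        Deque<Integer> stack = new ArrayDeque<>();
        visited[start] = true;
        stack.push(start);
        while (!stack.isEmpty()) {
            int v = stack.peek();
            if (nextIdx[v] < adj.get(v).size()) {
                int w = adj.get(v).get(nextIdx[v]++);
                if (!visited[w]) {
                    visited[w] = true;
                    edges.add(new int[] {v, w});
                    stack.push(w);
                }
            } else {
                // All neighbors of v have been tried: backtrack to the earlier vertex
                stack.pop();
            }
        }
        return edges;
    }

    public static void main(String[] args) {
        // Vertex 0 is connected to 1 and 3, vertex 1 is connected to 2
        List<List<Integer>> adj = Arrays.asList(
                Arrays.asList(1, 3), Arrays.asList(0, 2), Arrays.asList(1), Arrays.asList(0));
        for (int[] e : dfsEdges(adj, 0)) {
            System.out.println(e[0] + " -> " + e[1]);
        }
    }
}
```

In the example the third edge, 0 -> 3, starts from vertex 0 rather than from the end of the previous edge 1 -> 2. This is the non-contiguous case described above.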

Experience with the algorithm

I spent a lot of time on this algorithm. Just understanding it was not easy: I often had to picture the graphs and the different traversal cases in my head, which was quite a challenge for me.

Features of the algorithm

This algorithm is similar to the FP-tree (FP-growth) algorithm: no candidate sets are generated during mining, and the frequent sub-graphs are mined directly in a depth-first manner. gSpan can be used, for example, to mine the structures of chemical molecules.


This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only.
