1. Summary
Because graph databases have complex schemas and describe the same information in many different ways, it is extremely difficult for non-expert users to query them. A good graph query engine should support conversions between synonyms, acronyms, abbreviations, and ontology terms, and should rank search results well.
To address this problem, the paper proposes a new query framework that makes querying easier and spares users the struggle of constructing a query graph.
2. Background
2.1 Application
Graph databases are a popular way to store data. Application data such as knowledge graphs, information networks, and social networks are stored in graph databases. Because graph data either has no schema or has a schema that is too complex, and because the same information can be described in multiple ways, querying graph data is very difficult, and even more frustrating for general users.
Figure 2-1 (a) shows part of a graph database. Suppose the query asks for someone about 30 years old who is related to "University of California, Berkeley" and "Mission: Impossible". In Figure 2-1, the green and yellow parts are relatively good results. Figure 2-1 (b) is a query graph that expresses these query semantics, but existing graph database query methods can find only the green part, or only one result. The reason is that the node information does not match exactly, and existing query methods either do not support semantic conversion or support only one kind of conversion.
Figure 2-1 Graph database G
The problem solved in the paper can be described as follows: given a query Q and a graph database G, find all subgraphs of G that Q can be transformed into by the conversion functions.
2.2 Abstract definition
Given a query Q, a graph database G, and a set of conversion functions L, find the k subgraphs of G that best match Q. Here, L includes all the conversions in Table 2-1.
Table 2-1 Conversion functions supported in the paper
Note: other conversion functions can easily be added to the method in the paper to meet different needs.
3. Existing methods
With Spark, you only need to enter keywords; you do not need to specify complex relationships between graph nodes to obtain query results. However, it only supports string similarity matching. It can be modified to support other conversions.
It supports graph structure matching and string similarity matching (Jaccard).
4. Method
4.1 Offline operations
4.1.1 Measurement functions
In the following formulas, "v" denotes a node, "e" denotes an edge, and "φ" denotes the matching. For example, φ(v) is the node in graph database G that matches the query node v. If v can be converted to φ(v) by the i-th conversion function, then f_i(v, φ(v)) = 1; otherwise f_i(v, φ(v)) = 0.
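As a rough illustration of the indicator functions f_i, here is a minimal Python sketch. The two conversions shown (an acronym check and a prefix-abbreviation check), their names, and their matching rules are assumptions made for illustration only; the conversions actually supported are those listed in Table 2-1.

```python
def is_acronym(query_label: str, data_label: str) -> int:
    """Hypothetical f_i: 1 if query_label is the acronym of data_label, else 0.

    Only words that start with an uppercase letter contribute an initial,
    so short lowercase words such as "of" are skipped.
    """
    initials = "".join(word[0] for word in data_label.split() if word[0].isupper())
    return int(query_label.upper() == initials.upper())


def is_prefix_abbreviation(query_label: str, data_label: str) -> int:
    """Hypothetical f_i: 1 if query_label (without a trailing '.') is a
    proper prefix of data_label, else 0."""
    stem = query_label.rstrip(".").lower()
    is_prefix = data_label.lower().startswith(stem)
    return int(bool(stem) and is_prefix and len(stem) < len(data_label))


# The order of this list fixes the index i of each indicator function.
CONVERSIONS = [is_acronym, is_prefix_abbreviation]


def indicator_vector(query_label: str, data_label: str) -> list:
    """Evaluate every f_i for one (query node, matched node) pair."""
    return [f(query_label, data_label) for f in CONVERSIONS]


acronym_case = indicator_vector("UCB", "University of California Berkeley")
abbreviation_case = indicator_vector("Calif.", "California")
```

Each entry of the resulting vector is 0 or 1, which is the form the matching cost formulas expect as input.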
Node matching cost:
Edge matching cost:
Graph matching function:
It is easy to see that the smaller the value P of the graph matching function, the higher the quality of the subgraph matched to Q. That is, the query result should be the k subgraphs with the smallest P values.
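Selecting the k subgraphs with the smallest P values can be sketched as follows. This assumes the P value of every candidate match has already been computed; the function name `top_k_matches` and the toy values are hypothetical.

```python
import heapq


def top_k_matches(scored_matches, k):
    """Return the k (P value, match id) pairs with the smallest P values,
    in ascending order of P."""
    return heapq.nsmallest(k, scored_matches, key=lambda pair: pair[0])


# Toy candidates: (P value, identifier of the matched subgraph).
candidates = [(2.7, "m1"), (1.2, "m2"), (5.7, "m3"), (2.2, "m4")]
best_two = top_k_matches(candidates, 2)
```

Using a heap-based selection avoids sorting all candidates when k is much smaller than the number of matches.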
4.1.2 Parameter confirmation
Let W = {α1, α2, ...; β1, β2, ...}. Then
where T denotes the training set.
4.1.3 Cold start
The purpose of the cold start is to generate a good training set of queries, from which good parameters for the matching function can be obtained. The cold start steps are:
(1) Randomly select some subgraphs from the graph database as query templates Q';
(2) Convert some nodes and edges of each query template with the conversion functions to obtain a query Q;
(3) Extract the subgraphs Qe that match Q exactly;
(4) The pairs (Q, Q') and (Q, Qe) form the training set.
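Steps (1), (2), and part of (4) can be sketched in Python as below. This is a simplified illustration under several assumptions: node labels double as node identifiers, only nodes (not edges) are converted, the only conversion is a hypothetical acronym rule, and step (3), which requires an exact subgraph matcher, is left out. The graph contents and all function names are toy placeholders.

```python
import random


def sample_template(graph, size, rng):
    """Step (1): grow a connected set of `size` nodes by random expansion."""
    nodes = [rng.choice(sorted(graph))]
    while len(nodes) < size:
        frontier = sorted({n for u in nodes for n in graph[u]} - set(nodes))
        if not frontier:
            break  # the connected component is smaller than `size`
        nodes.append(rng.choice(frontier))
    chosen = set(nodes)
    edges = {(u, v) for u in chosen for v in graph[u] if v in chosen and u < v}
    return chosen, edges


def to_acronym(label):
    """Hypothetical conversion: keep the initials of capitalized words."""
    return "".join(word[0] for word in label.split() if word[0].isupper())


def perturb(template_nodes, template_edges, convert, fraction, rng):
    """Step (2): rewrite a fraction of the node labels to obtain the query Q."""
    nodes = sorted(template_nodes)
    picked = set(rng.sample(nodes, max(1, int(fraction * len(nodes)))))
    rename = {n: (convert(n) if n in picked else n) for n in nodes}
    query_edges = {(rename[u], rename[v]) for u, v in template_edges}
    return set(rename.values()), query_edges


# Toy graph database as an undirected adjacency map (a simple path).
graph = {
    "Film Studio": {"Mission: Impossible"},
    "Mission: Impossible": {"Film Studio", "Alice Smith"},
    "Alice Smith": {"Mission: Impossible", "University of California Berkeley"},
    "University of California Berkeley": {"Alice Smith"},
}

rng = random.Random(0)
template_nodes, template_edges = sample_template(graph, 3, rng)
query_nodes, query_edges = perturb(template_nodes, template_edges, to_acronym, 0.5, rng)

# Step (4), first half: the pair (Q, Q') becomes one training example.
training_pair = ((query_nodes, query_edges), (template_nodes, template_edges))
```

Because the query is derived from a known template, the correct answer for each generated query is available without manual labeling.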
4.2 Online query
In general, this graph query problem is NP-hard: the subgraph isomorphism problem can be reduced to it, which proves the hardness. Therefore, two heuristic methods are designed to solve the problem.
4.2.1 Heuristic 1
The graph matching cost is accumulated onto individual nodes, so that the matching score of each node represents the cost of the graph match that includes that node.
Formula for computing the score of each node:
Here, m_ji^(t)(u_i) denotes the contribution of node u_j, in iteration t, to the matching of node u_i. For the intuition behind the formula, see Figure 4-1. In both (a) and (b), the left side shows the database and the right side shows the query graph.
Figure 4-1 Intuitive meaning of heuristic 1
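The node-level scoring of heuristic 1 can be sketched as a message-passing loop. The formula itself is not reproduced here, so the sketch assumes a min-sum update in which each message is the cheapest way for a neighboring query node to support a candidate. The function name `min_sum_scores`, the toy costs, and the missing-edge penalty of 2.0 are illustrative assumptions, not the paper's definitions.

```python
def min_sum_scores(query_edges, node_cost, edge_cost, iterations):
    """Assumed min-sum message passing over the query graph.

    node_cost: {query node: {candidate db node: cost}}
    edge_cost(vj, uj, vi, ui): cost of matching query edge (vj, vi)
        to the database node pair (uj, ui)
    Returns {query node: {candidate db node: accumulated score}}.
    """
    neighbors = {v: set() for v in node_cost}
    for a, b in query_edges:
        neighbors[a].add(b)
        neighbors[b].add(a)

    # messages[(j, i)][ui]: contribution of query node j to matching i with ui.
    messages = {(j, i): {ui: 0.0 for ui in node_cost[i]}
                for i in neighbors for j in neighbors[i]}

    for _ in range(iterations):
        updated = {}
        for (j, i) in messages:
            updated[(j, i)] = {
                ui: min(
                    node_cost[j][uj]
                    + edge_cost(j, uj, i, ui)
                    + sum(messages[(k, j)][uj] for k in neighbors[j] if k != i)
                    for uj in node_cost[j]
                )
                for ui in node_cost[i]
            }
        messages = updated

    return {
        i: {ui: node_cost[i][ui] + sum(messages[(j, i)][ui] for j in neighbors[i])
            for ui in node_cost[i]}
        for i in neighbors
    }


# Toy query: a path a - b - c, with candidate database nodes for each query node.
node_cost = {
    "a": {"a1": 0.0, "a2": 1.0},
    "b": {"b1": 0.5, "b2": 0.0},
    "c": {"c1": 0.2},
}
db_edges = {("a1", "b1"), ("a2", "b2"), ("b2", "c1")}


def edge_cost(vj, uj, vi, ui):
    """Toy edge cost: 0 if the database edge exists, else a penalty of 2.0."""
    return 0.0 if (uj, ui) in db_edges or (ui, uj) in db_edges else 2.0


scores = min_sum_scores([("a", "b"), ("b", "c")], node_cost, edge_cost, iterations=2)
```

On this tree-shaped toy query, two iterations are enough for each score to equal the cost of the cheapest full match containing that candidate, for example 1.2 for a2, b2, and c1.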
4.2.2 Heuristic 2
When heuristic 1 is used, scores must be computed for a large number of nodes. From the node matching cost formula, it follows that for any query node v, all matches obtained through the same conversion function have the same cost. Based on this, the database nodes obtained from the same query node through the same conversion are merged into a single node, which effectively reduces the number of node scores to compute. The summary graph consists of these merged nodes. If an edge exists between two nodes in the query graph, the corresponding nodes in the summary graph are connected, regardless of the graph database. The matching cost of such an edge is the upper bound of the matching costs of all the corresponding edges in the graph database.
The steps are as follows:
(1) Construct the summary graph;
(2) Run heuristic 1 on the summary graph;
(3) Use the result computed on the summary graph to calculate the scores of the corresponding subgraphs in the original graph.
Repeat these steps until k results are found.
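Step (1), constructing the summary graph, can be sketched as follows. This is a minimal illustration that assumes each candidate is already tagged with the conversion that produced it and that an edge cost function is available; the name `build_summary_graph`, the conversion names, and the toy costs are hypothetical. The maximum over member pairs is used as the upper bound described above.

```python
from collections import defaultdict


def build_summary_graph(query_edges, candidates, edge_cost):
    """Merge candidates that share a query node and a conversion.

    candidates: {query node: [(db node, conversion name), ...]}
    edge_cost(ua, ub): matching cost for the database node pair (ua, ub)
    Returns (summary_nodes, summary_edges) where
      summary_nodes: {(query node, conversion name): set of db nodes}
      summary_edges: {(summary node, summary node): upper-bound edge cost}
    """
    summary_nodes = defaultdict(set)
    for v, matches in candidates.items():
        for db_node, conversion in matches:
            summary_nodes[(v, conversion)].add(db_node)

    summary_edges = {}
    for a, b in query_edges:
        for sa, members_a in summary_nodes.items():
            if sa[0] != a:
                continue
            for sb, members_b in summary_nodes.items():
                if sb[0] != b:
                    continue
                # Connected because (a, b) is a query edge; the cost is the
                # upper bound over all member pairs.
                summary_edges[(sa, sb)] = max(
                    edge_cost(ua, ub) for ua in members_a for ub in members_b
                )
    return dict(summary_nodes), summary_edges


# Toy input: query edge a - b with three candidates for a and one for b.
candidates = {
    "a": [("a1", "acronym"), ("a2", "acronym"), ("a3", "synonym")],
    "b": [("b1", "identical")],
}
pair_costs = {("a1", "b1"): 0.0, ("a2", "b1"): 1.0, ("a3", "b1"): 0.5}


def edge_cost(ua, ub):
    """Toy edge cost lookup with a default penalty of 2.0."""
    return pair_costs.get((ua, ub), 2.0)


summary_nodes, summary_edges = build_summary_graph([("a", "b")], candidates, edge_cost)
```

Here the three candidates for query node a collapse into two summary nodes, so heuristic 1 has fewer scores to compute on the summary graph than on the original candidates.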
The above is my personal understanding of the paper "Schemaless and Structureless Graph Querying" (VLDB 2014). It only covers the main content of the paper; for a detailed explanation, please see the explanatory PPT for the paper at http://download.csdn.net/detail/woniu317/7.