Abstract:
Matching elements of two data schemas or two data instances is important in data warehousing, e-commerce, and other fields. In this paper, we propose a matching algorithm based on a fixed point computation that is suitable for different scenarios. The algorithm takes two graphs as input and produces as output a mapping between corresponding nodes of the graphs. Depending on the matching goal, a subset of the mapping is selected using a filter. After the algorithm has run, we expect the user to check the result and correct it where necessary. In fact, we evaluate the accuracy of the algorithm by the number of corrections required. We use this accuracy metric in an example to estimate how much work a user can save by using our algorithm to obtain an initial match. Finally, we discuss how the algorithm is deployed as a high-level operator in an implemented testbed for managing information models and mappings.
Keywords: matching, model management, heterogeneous databases, semistructured data
1. Introduction
2. Method Overview
3. SF Algorithm
Similarity propagation graph
Fixed Point Calculation
4. Filter
Restrictions
Selection metrics
5. Examples of algorithm features
Semi-structured data
XML schemas
Two different graph-based representations are compared: OEM/Lore and the XML/DOM standard. In the OEM representation, element tags are used as edge labels, whereas the DOM representation expresses the relationship between elements with a single, specific edge label, "child".
First, the algorithm produces similar results for both representations. Second, the example shows that a wider spectrum of edge labels makes the iterative computation faster. Although the two representations are similar in size, the similarity propagation graph for OEM is about half the size of the one for DOM, and the fixed point computation converges faster.
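The effect of the edge-label spectrum can be illustrated with a small sketch. It assumes (this is not stated in these notes) that the similarity propagation graph receives one edge for every pair of edges, one from each model, that carry the same label. The toy documents and the helper `count_propagation_edges` are hypothetical, and the DOM-style encoding is simplified: a real DOM graph also contains additional nodes and edges.

```python
from collections import Counter

def count_propagation_edges(edges_a, edges_b):
    """Count pairs (edge in A, edge in B) that share a label.

    Assumption: each such pair contributes one edge to the propagation graph.
    Edges are (source, label, target) triples.
    """
    labels_a = Counter(label for _, label, _ in edges_a)
    labels_b = Counter(label for _, label, _ in edges_b)
    return sum(n * labels_b[label] for label, n in labels_a.items())

# OEM-style encoding: the element tag is the edge label.
oem_a = [("b1", "title", "t1"), ("b1", "author", "a1"), ("a1", "name", "n1")]
oem_b = [("b2", "title", "t2"), ("b2", "author", "a2"), ("a2", "name", "n2")]

# Simplified DOM-style encoding: every parent/child edge is labelled "child".
dom_a = [(s, "child", t) for s, _, t in oem_a]
dom_b = [(s, "child", t) for s, _, t in oem_b]

oem_count = count_propagation_edges(oem_a, oem_b)  # 3: only equal tags pair up
dom_count = count_propagation_edges(dom_a, dom_b)  # 9: every edge pairs with every edge
```

With distinct labels only like-labelled edges are paired, so the graph to iterate over stays small; with a single label every edge of one model is paired with every edge of the other.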
Match the XML schema with instance data
Search for related items
6. Matching Quality Evaluation
Matching accuracy
Intended match result
7. Algorithm and filter Evaluation
8. Architecture and implementation
9. Open issues and limitations
1. The algorithm works only for directed labeled graphs. It degrades when the edge labeling is uniform or the edges are undirected, or when nodes are hard to distinguish from one another.
2. Only models of the same type can be matched.
3. An important assumption is that adjacency contributes to similarity propagation. If adjacency information is not preserved, the algorithm will therefore not work properly.
4. The algorithm gives superstructures a higher similarity.
5. The algorithm does not take order and aggregation into account. Doing so would be helpful for matching XML.
6. The standalone version of the algorithm is not as effective as matchers developed for a specific domain.
10. Related work
11. Conclusion
References
Appendix A: Internal Data Model
Let U be the Unicode alphabet and U* the set of strings over U. The set of entities E and the set of statements V are defined recursively as follows:
1. U* × U* ⊆ E (every pair of strings is an entity; the first string is the type or namespace of the entity, and the second string is its name)
2. E × E × E ⊆ V (every triple of entities is a statement)
3. V ⊆ E (every statement is an entity)
4. V and E are the smallest sets with the above properties.
A subset of V is called a model. The above definitions and terminology are based on the RDF standard. Because V and E are defined recursively, statements can be nested (a statement can be used as an element of another statement). In our internal data structure, nested statements are used to represent order relations and aggregation. The matching algorithm described in this paper does not currently use these features, so we do not discuss nested statements further. We can therefore make a simplifying assumption: E = U* × U* and V = E³. A model is then a subset of E³.
In the graph representation, entities are nodes and statements are edges. Any statement (S; P; O) (the middle element P is called the predicate) is drawn as an edge from S to O labeled with P. Statements with a common predicate define a binary relation between entities.
In the OIM diagram, the rectangular nodes are called literals and belong to the set of entities L = {"literal"} × U*. There is no essential difference between literals and other entities. We distinguish literals from other entities in the diagrams mainly for better readability.
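The simplified data model (E = U* × U*, V = E³) can be sketched in a few lines. This is only an illustration under stated assumptions: the namespace string, the entity names, and the helpers `entity`, `literal`, and `relations` are hypothetical and not part of the original data model.

```python
from collections import defaultdict

def entity(namespace, name):
    """An entity is a pair of strings: (type or namespace, name)."""
    return (namespace, name)

def literal(value):
    """A literal is an entity whose first component is the string "literal"."""
    return ("literal", value)

def relations(model):
    """Group statements by predicate: each predicate yields a binary relation."""
    grouped = defaultdict(set)
    for subject, predicate, obj in model:
        grouped[predicate].add((subject, obj))
    return dict(grouped)

ns = "http://example.org/schema#"  # hypothetical namespace
table = entity(ns, "Personnel")
column = entity(ns, "Pno")
has_column = entity(ns, "column")
has_name = entity(ns, "name")

# A model is a set of statements, i.e. a set of entity triples.
model = {
    (table, has_column, column),
    (table, has_name, literal("Personnel")),
    (column, has_name, literal("Pno")),
}

by_predicate = relations(model)
```

Using plain tuples keeps entities and statements hashable, so a model can be an ordinary Python set.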
Conceptually, a mapping between models M1 and M2 can be regarded as a set of tuples (N1; N2; σ), where N1 belongs to M1, N2 belongs to M2, and σ is a real number representing the similarity. When M1 and M2 do not share elements, a mapping can be defined as an undirected weighted bipartite graph. To treat a mapping as a model, it is represented as a set of statements. For each tuple t = (N1; N2; σ), we create four statements:
1. (node (t); type; mapentry)
2. (node (t); SRC; N1)
3. (node (t); DEST; N2)
4. (node (t); similarity; σ)
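The four-statement encoding above can be sketched as follows. The namespace `MAP_NS`, the way node(t) is named, and the choice to store the similarity value as a literal are assumptions made for this illustration; the sample mapping entries are made up.

```python
MAP_NS = "map"  # hypothetical namespace for the mapping vocabulary

def lit(value):
    """Wrap a value as a literal entity (assumed encoding for numbers)."""
    return ("literal", str(value))

def mapping_to_statements(mapping):
    """Encode each tuple t = (n1, n2, sim) as four statements about node(t)."""
    statements = set()
    for i, (n1, n2, sim) in enumerate(mapping):
        node = (MAP_NS, f"entry{i}")  # plays the role of node(t)
        statements |= {
            (node, (MAP_NS, "type"), (MAP_NS, "mapentry")),
            (node, (MAP_NS, "SRC"), n1),
            (node, (MAP_NS, "DEST"), n2),
            (node, (MAP_NS, "similarity"), lit(sim)),
        }
    return statements

def statements_to_mapping(statements):
    """Recover the (n1, n2, sim) tuples from the four-statement encoding."""
    entries = {}
    for subject, (_, predicate), obj in statements:
        entries.setdefault(subject, {})[predicate] = obj
    return {
        (e["SRC"], e["DEST"], float(e["similarity"][1]))
        for e in entries.values()
        if e.get("type") == (MAP_NS, "mapentry")
    }

mapping = [
    (("m1", "Personnel"), ("m2", "Employee"), 1.0),
    (("m1", "Pno"), ("m2", "EmpNo"), 0.5),
]
encoded = mapping_to_statements(mapping)
decoded = statements_to_mapping(encoded)
```

Because the encoded mapping is itself a set of statements, it can be stored and processed like any other model.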
Appendix B: Generalized version of the algorithm
Appendix C: Propagation coefficients
Appendix D: Convergence and complexity of the algorithm
The fixed point computation of SF can be expressed as the following eigenvector computation. Let T be a square matrix that corresponds to the similarity propagation graph G obtained from models A and B. If an edge of G leads from map pair j = (x; y) to map pair i = (x'; y') with propagation coefficient c, the matrix entry t_ij = c. All other entries are set to 0. Note that the propagation coefficients in G correspond to transition probabilities if T is viewed as a transition matrix.
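As a minimal sketch of this construction, the snippet below builds T from a list of weighted propagation-graph edges. It assumes that an edge is read as going from map pair j to map pair i, so that its coefficient lands in row i, column j; the toy graph, its coefficients, and the helper `build_matrix` are hypothetical.

```python
def build_matrix(pairs, edges):
    """Build T for a fixed ordering of map pairs.

    pairs: ordered list of map pairs (the nodes of the propagation graph)
    edges: (source_pair, target_pair, coefficient) triples
    """
    index = {pair: k for k, pair in enumerate(pairs)}
    n = len(pairs)
    T = [[0.0] * n for _ in range(n)]
    for j_pair, i_pair, c in edges:
        T[index[i_pair]][index[j_pair]] = c  # t_ij = c for an edge j -> i
    return T

pairs = [("a", "b"), ("a1", "b1"), ("a2", "b1")]
edges = [
    (("a", "b"), ("a1", "b1"), 0.5),
    (("a", "b"), ("a2", "b1"), 0.5),
    (("a1", "b1"), ("a", "b"), 1.0),
    (("a2", "b1"), ("a", "b"), 1.0),
]
T = build_matrix(pairs, edges)

# With this orientation, the coefficients leaving node j fill column j.
n = len(pairs)
column_sums = [sum(T[i][j] for i in range(n)) for j in range(n)]
```

In this toy graph the coefficients leaving each node sum to 1, so every column of T sums to 1 and T can be read as a transition matrix.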
The fixed point computation converges when T is an aperiodic, irreducible matrix (ergodic theorem). The matrix T is irreducible if and only if the associated graph G is strongly connected (every node can be reached from every other node). To ensure these properties, we can introduce self-loops in G by including the initial values σ^0 in the fixed point equation, for example σ^(i+1) = normalize(σ^0 + φ(σ^i)). This method is called dampening in the literature. If σ^0 assigns a non-zero value to every map pair in A × B, adding σ^0 is equivalent to changing G into G', in which all nodes are connected to each other with a specific propagation coefficient. Let T' be the matrix associated with G'.
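The dampened iteration can be sketched as below: the initial values are added to the propagated values and the result is normalized. Two assumptions are made that the text does not fix: the propagation step is a plain matrix-vector product with T, and `normalize` divides by the largest entry. The toy matrix and the names `propagate` and `fixpoint` are hypothetical.

```python
def normalize(values):
    """Scale the vector so that its largest entry becomes 1 (assumed choice)."""
    top = max(values)
    return [v / top for v in values]

def propagate(T, sigma):
    """One propagation step, taken here to be the product T x sigma."""
    return [sum(t * s for t, s in zip(row, sigma)) for row in T]

def fixpoint(T, sigma0, eps=1e-9, max_iter=1000):
    """Iterate sigma <- normalize(sigma0 + propagate(T, sigma)) until stable."""
    sigma = list(sigma0)
    for step in range(1, max_iter + 1):
        flowed = propagate(T, sigma)
        nxt = normalize([a + b for a, b in zip(sigma0, flowed)])
        delta = max(abs(a - b) for a, b in zip(nxt, sigma))
        sigma = nxt
        if delta < eps:
            return sigma, step
    return sigma, max_iter

# Toy matrix of a periodic graph: node 0 feeds nodes 1 and 2, which feed back.
T = [
    [0.0, 1.0, 1.0],
    [0.5, 0.0, 0.0],
    [0.5, 0.0, 0.0],
]
sigma, steps = fixpoint(T, [1.0, 1.0, 1.0])
```

Without the added initial values, repeated multiplication and normalization on this toy matrix alternates between two vectors; with them, the iteration settles on a single vector.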
The eigenvector computation can then be stated as follows. Let s be a vector over the map pairs: for a fixed ordering of the map pairs, each position holds the corresponding similarity value from σ. One iteration of our fixed point computation corresponds to the matrix-vector multiplication T' × s. Repeated multiplication yields the dominant eigenvector s* of the matrix T', that is, T' × s* = λ s*, where λ is the dominant eigenvalue of T'. The normalization in the fixed point equation corresponds to dividing T' × s* by λ.
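The correspondence with the dominant eigenvector can be checked numerically with plain power iteration. The sketch below assumes a non-negative matrix and normalizes by the largest entry, so the normalization factor approaches the dominant eigenvalue; the 2 × 2 matrix is an arbitrary toy example, not derived from real models, and `power_iteration` is a hypothetical helper.

```python
def mat_vec(T, s):
    """Matrix-vector product T x s."""
    return [sum(t * x for t, x in zip(row, s)) for row in T]

def power_iteration(T, s, eps=1e-12, max_iter=10_000):
    """Return (s_star, lam) with max(s_star) == 1 and T x s_star ~ lam * s_star."""
    lam = 0.0
    for _ in range(max_iter):
        product = mat_vec(T, s)
        lam = max(product)                # normalization factor of this step
        nxt = [x / lam for x in product]  # divide by it to normalize
        if max(abs(a - b) for a, b in zip(nxt, s)) < eps:
            return nxt, lam
        s = nxt
    return s, lam

T_prime = [
    [2.0, 1.0],
    [1.0, 2.0],
]
s_star, lam = power_iteration(T_prime, [1.0, 0.0])

# Residual of the eigenvector equation T' x s* = lam * s*.
residual = max(
    abs(a - lam * b) for a, b in zip(mat_vec(T_prime, s_star), s_star)
)
```

For this matrix the eigenvalues are 3 and 1, so the iteration approaches s* = (1, 1) with λ = 3, and the error shrinks by roughly a factor of three per step.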
The fixed point computation corresponds to a Markov chain over T. This fact offers an interesting perspective on the algorithm. Because T corresponds to a transition matrix over G, the resulting similarity measure can be viewed as the stationary probability distribution over map pairs induced by a random walk from pair to pair. This random walk resembles the process a human designer follows when matching A and B by hand. Starting from a given map pair, the designer infers the similarity of another map pair from the structural properties of A and B. Assume that A and B are relational models. If the designer concludes that table T1 in A matches table T2 in B, there is a certain probability that the next step is to match the columns of T1 and T2.
The convergence rate of the fixed point computation depends on the ratio of the dominant eigenvalue of the matrix to its second eigenvalue, which is determined by the structural properties of G. Higher dampening values lead to faster convergence.
Complexity: the number of operations in each iteration is proportional to the number of edges in the propagation graph G and hence, in the worst case, to the product of the number of edges in model A and the number of edges in model B.