Welcome to my blog at http://pelhans.com/, where all articles are published first!
This section introduces knowledge fusion: first what knowledge fusion is, then the technical workflow of knowledge fusion, and finally a brief overview of common knowledge fusion tools.
Introduction to knowledge fusion

Knowledge fusion means merging two knowledge graphs (ontologies). The basic problem is how to combine descriptions of the same entity or concept coming from multiple sources. The correspondences to be determined include equivalent classes/subclasses and equivalent properties/subproperties.
As shown in the illustration above, the circles of different colors represent knowledge graphs from different sources. Roma in dbpedia.org and in geoname.org refer to the same entity and are connected by two sameAs links. Entity alignment between different knowledge graphs is the main work of knowledge graph fusion.
In addition to entity alignment, knowledge fusion also includes work such as fusion at the concept layer and cross-language knowledge fusion.
It is worth noting that knowledge fusion goes by different names in the literature, such as ontology alignment, ontology matching, record linkage, entity resolution, and entity alignment, but the essential work is the same.
Knowledge fusion faces two main technical challenges:

Data quality: ambiguous names, data entry errors, missing data, inconsistent data formats, abbreviations, etc.

Data scale: large data volumes (requiring parallel computation), diverse data types, matching that goes beyond names alone to multiple relations and ever more links, etc.
The basic technical workflow of knowledge fusion

Knowledge fusion is generally divided into two steps, ontology alignment and entity matching, whose basic workflows are similar:
Data preprocessing
In the data preprocessing stage, the quality of the raw data directly affects the final linkage results. Different datasets often describe the same entity differently, and normalizing these data is an important step toward improving the accuracy of the subsequent linkage.
Common data preprocessing steps include:

Syntax normalization:
syntax matching, e.g., different representations of a phone number;
composite attributes, e.g., different ways of writing a home address.

Data normalization:
removing spaces and symbols such as quotation marks and dashes;
correcting input errors (e.g., typos) and topological errors;
replacing nicknames and abbreviations with official names, etc.
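The normalization steps above can be sketched in a few lines of Python. The symbol set and the nickname dictionary below are illustrative assumptions, not part of the original post:

```python
import re

# Hypothetical mapping from nicknames/abbreviations to official names.
NICKNAMES = {"Bill": "William", "IBM": "International Business Machines"}

def normalize(value: str) -> str:
    """Normalize one attribute value before linking records."""
    # Collapse whitespace and strip symbols such as quotes and dashes.
    value = re.sub(r'[\s"\'\-]+', ' ', value).strip()
    # Replace nicknames and abbreviations with official names.
    return ' '.join(NICKNAMES.get(token, token) for token in value.split())
```

For example, `normalize('  Bill -- Gates ')` yields `'William Gates'`.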
Record linkage

Assume two entity records $x$ and $y$ whose values on the $i$-th attribute are $x_i$ and $y_i$. Record linkage then proceeds in two steps:

Attribute similarity: combine the similarities of the individual attributes into an attribute similarity vector:

$$[\,sim(x_1, y_1),\ sim(x_2, y_2),\ \ldots,\ sim(x_n, y_n)\,]$$

Entity similarity: compute the similarity of the entities from the attribute similarity vector.
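A minimal sketch of these two steps, using a character-trigram Jaccard coefficient (one of the set-based measures listed below) as the per-attribute similarity and a weighted average as the aggregation. The field names and weights are assumptions for illustration:

```python
def trigrams(s: str) -> set:
    """Character trigrams of a lowercased string."""
    s = s.lower()
    return {s[i:i + 3] for i in range(len(s) - 2)}

def jaccard(a: str, b: str) -> float:
    """Jaccard coefficient |A & B| / |A | B| over trigram sets."""
    A, B = trigrams(a), trigrams(b)
    return len(A & B) / len(A | B) if A | B else 1.0

def entity_similarity(x: dict, y: dict, weights: dict) -> float:
    """Aggregate the attribute similarity vector [sim(x_1, y_1), ...]
    into a single entity similarity via a weighted average."""
    vector = {k: jaccard(x[k], y[k]) for k in weights}
    return sum(weights[k] * vector[k] for k in weights) / sum(weights.values())

x = {"name": "Levenshtein", "city": "Rome"}
y = {"name": "Levenstein", "city": "Roma"}
score = entity_similarity(x, y, {"name": 0.6, "city": 0.4})
```

In practice the per-attribute measure and the aggregation function are chosen per attribute; a weighted average is only the simplest option.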
Calculating attribute similarity

There are many methods for computing attribute similarity, including edit distance, set-based similarity, and vector-based similarity:

Edit distance: Levenshtein distance, Wagner and Fischer, edit distance with affine gaps
Set similarity: Jaccard coefficient, Dice coefficient
Vector-based similarity: cosine similarity, TF-IDF similarity
Using edit distance to compute attribute similarity

Levenshtein distance

The Levenshtein distance, or minimum edit distance, is the minimum number of edit operations needed to transform one string into another. For example, to compute the edit distance between Lvensshtain and Levenshtein:
Lvensshtain → insert "e" → Levensshtain

Levensshtain → delete "s" → Levenshtain
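A standard dynamic-programming implementation of the Levenshtein distance (a sketch, not code from the original post):

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions, and substitutions
    needed to transform string a into string b."""
    # prev[j] = distance between the first i-1 chars of a and first j of b.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

print(levenshtein("Lvensshtain", "Levenshtein"))  # -> 3
```

For the example above, `levenshtein("Lvensshtain", "Levenshtein")` returns 3: insert "e", delete "s", and substitute "a" with "e".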