First, data mining
Data mining is the process of using computer and information technology to extract useful, implicit knowledge from large, incomplete data sets. Web data mining is an outgrowth of data mining: the application of data mining technology to the Web. It is a comprehensive technology for improving the efficiency of Web use by extracting information from resources on the Internet, that is, by discovering patterns hidden in Web document structure, content, and usage collections.
Data mining involves many fields and methods, and can be classified in several ways.
(1) According to the mining object: relational databases, object-oriented databases, spatial databases, time-series databases, DNA databases, multimedia databases, heterogeneous databases, legacy databases, Web databases, and so on.
(2) According to the mining method: machine learning methods, statistical methods, neural network methods, and database methods.
A. Machine learning methods can be subdivided into inductive learning (decision trees, rule induction, etc.), case-based learning, genetic algorithms, and so on.
B. Statistical methods can be subdivided into regression analysis (multiple regression, autoregression, etc.), discriminant analysis (Bayesian discriminant, Fisher discriminant, nonparametric discriminant, etc.), cluster analysis (hierarchical clustering, dynamic clustering, etc.), and exploratory analysis (principal component analysis, correlation analysis, etc.).
C. Neural network methods can be subdivided into feedforward neural networks (the BP algorithm, etc.) and self-organizing neural networks (self-organizing feature maps, competitive learning, etc.).
(3) According to the mining task: association rule mining, classification, clustering, time-series prediction model discovery, and sequential pattern discovery.
A. Association rules: the classic association rule discovery algorithm is the Apriori algorithm, also called the breadth-first algorithm, proposed by R. Agrawal and R. Srikant in 1994. It improves on the earlier AIS algorithm and the SQL-oriented SETM algorithm. Its basic idea is that if an itemset is not a frequent set, then none of its supersets is a frequent set either; this greatly reduces the number of itemsets that need to be validated, and in practice Apriori is clearly superior to AIS.
The Apriori algorithm is one of the most influential algorithms in association rule mining. Association rules are interesting, frequently occurring patterns, associations, and dependencies among the itemsets of large amounts of data in transaction databases, relational databases, and other data stores. Association rule mining can be divided into two steps:
1. Find all frequent itemsets. This part is mainly solved by the Apriori algorithm.
2. Generate association rules from the frequent itemsets. These rules must satisfy the minimum support and minimum confidence thresholds.
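To make step 1 concrete, the following is a minimal Python sketch of Apriori-style frequent itemset discovery; the transactions and the support threshold are toy values invented for illustration:

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Find all frequent itemsets via breadth-first candidate generation.
    Prunes any candidate having an infrequent subset (the Apriori property)."""
    n = len(transactions)
    counts = {}
    for t in transactions:                      # count 1-itemsets
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s: c / n for s, c in counts.items() if c / n >= min_support}
    result = dict(frequent)
    k = 2
    while frequent:
        prev = list(frequent)
        # join step: merge (k-1)-itemsets into k-item candidates
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # prune step: every (k-1)-subset must itself be frequent
        candidates = {c for c in candidates
                      if all(frozenset(s) in frequent
                             for s in combinations(c, k - 1))}
        counts = {c: sum(1 for t in transactions if c <= set(t))
                  for c in candidates}
        frequent = {c: v / n for c, v in counts.items() if v / n >= min_support}
        result.update(frequent)
        k += 1
    return result

transactions = [{"milk", "bread"}, {"milk", "beer"},
                {"milk", "bread", "beer"}, {"bread"}]
print(apriori(transactions, min_support=0.5))
```

The prune step is where the Apriori property pays off: a k-itemset is counted against the database only if every one of its (k-1)-subsets was frequent at the previous level.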
B. Classification rules: one of the important tasks of data mining is classifying massive data. Data classification assigns data to classes based on the values of certain attributes. There are many classification methods, including decision trees, statistical methods, neural networks, and nearest-neighbor methods. Compared with the others, decision tree classification is fast, produces rules that are simple and easy to understand, converts readily into database query languages, is friendly to use, and tends to achieve higher accuracy.
C. Data clustering: the basic idea is to consider the "distance" between data points during analysis, emphasizing what groups of data have in common. Data clustering groups data according to the principle of maximizing intra-group similarity and minimizing inter-group similarity.
D. Sequential patterns: the following example illustrates one: a customer rents the film "Star Wars", later rents "The Empire Strikes Back", and then rents "Return of the Jedi"; note that these rental events need not be adjacent. Recurring sequences of events like this form sequential (timing) patterns.
E. Similar patterns: computers hold large amounts of temporal and spatial data, including stock price indexes in financial databases, medical databases, multimedia databases, and so on. The purpose of searching for similar patterns in temporal or spatio-temporal databases is to identify and predict the risks, causal relationships, and trends associated with specific patterns.
Second, web mining
Data on the Web has its own characteristics, which can be summed up as follows:
1. the data volume is huge and highly dynamic; 2. the database environment is heterogeneous; 3. the data is semi-structured.
Web data mining can be divided into three categories: Web content mining, Web structure mining, and Web usage mining.
Web content mining is the process of extracting useful information from the contents of documents or their descriptions. It has two strategies: mining the contents of documents directly, or building on the results returned by other tools such as search engines. The first strategy is represented by WebLog, a query language for the Web, and by Ahoy, which uses heuristic rules to locate personal homepage information. The second strategy mainly post-processes search engine query results to obtain more accurate and useful information; WebSQL belongs to this class, as do clustering techniques applied to search engine results. According to the type of data processed, Web content mining can be divided into text mining and multimedia mining.
Web structure mining derives knowledge from the Web's organization and link relationships. Mining page structure and Web structure can guide the classification and clustering of pages and identify authoritative pages and hub pages, thereby improving search performance; it can also guide page collection and improve collection efficiency. Web structure mining can be divided into mining the internal structure of Web documents and mining the hyperlink structure between documents. Representative work includes PageRank and CLEVER, and the link structure of pages is also leveraged in the multi-level Web data warehouse (MLDB).
Web usage mining discovers interesting patterns from server-side user access logs or from users' browsing information. It helps to understand the behavior patterns users hide in the data, supports predictive analysis, and can be used to improve site structure or provide personalized service to users.
WEB Mining Related technologies:
Data mining methods fall into two types: one builds statistical models, using techniques such as decision trees, classification, clustering, and association rules; the other builds artificial intelligence models based on machine learning, using neural networks and natural computation methods.
Web Content Mining:
1. Web Text Mining
Web text mining can summarize, classify, cluster, and analyze associations among the contents of large numbers of documents on the Web, and can use Web documents for trend forecasting. Text data on the Internet is typically a set of HTML documents. To mine them, these documents are converted into a structured representation that reflects their content characteristics, usually a document feature vector. A drawback of current document representations is that the feature vector has a very high dimensionality, which makes feature subset selection an essential step in Internet text mining. After the dimensionality of the document feature vectors has been reduced, various data mining methods such as classification, clustering, and association analysis can be used to extract application-oriented knowledge models. Finally the mining results are evaluated: if they meet the requirements they are output; otherwise the process returns to an earlier stage and, after analysis and improvement, a new round of mining begins. Association rule patterns are descriptive patterns, and association rule discovery algorithms are unsupervised learning methods. Discovering association rules usually takes three steps: ① connect to the data and prepare it; ② given the minimum support and minimum confidence, find association rules with the algorithm provided by the data mining tool; ③ display the rules visually, and interpret and evaluate them.
At present, research on Web content mining focuses on retrieval based on text content, refinement of information filtering, data de-duplication, data pattern extraction, intermediate form representation, heterogeneous integration, text classification and clustering, document summarization and structure extraction, data warehousing and OLAP, and so on, especially research on these topics based on XML.
For classification mining, the preprocessing phase converts the text of the Web page collection into a two-dimensional database table in which each column is a feature and each row is the feature set of one Web page. A common method in text learning is the term-frequency vector representation, a bag-of-words representation: all words are extracted from the document, disregarding word order and text structure. The two-dimensional table is constructed so that each column is a word, the column set (feature set) being the words of the dictionary, so the full column set may contain hundreds of thousands of columns. Each row stores the word information of one page; the words of the page are mapped onto the columns (the feature set). For each column (word), if the word does not appear in the page the value is 0; if it appears k times the value is k. The table thus represents the frequency statistics of the words in the Web page collection, and methods such as naive Bayes or k-nearest neighbor can then be applied to it.
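As an illustration, here is a minimal sketch of building such a term-frequency table in Python; the two "pages" are toy word lists standing in for parsed Web documents:

```python
def build_term_matrix(pages):
    """Build the two-dimensional table: one row per page, one column per
    dictionary word; each cell holds the word's frequency in that page."""
    vocab = sorted({w for words in pages for w in words})
    index = {w: j for j, w in enumerate(vocab)}
    matrix = []
    for words in pages:
        row = [0] * len(vocab)
        for w in words:
            row[index[w]] += 1
        matrix.append(row)
    return vocab, matrix

pages = [["web", "mining", "web"], ["text", "mining"]]
vocab, matrix = build_term_matrix(pages)
print(vocab)   # ['mining', 'text', 'web']
print(matrix)  # [[1, 0, 2], [1, 1, 0]]
```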
WebSQL is a query language for Web page restructuring; it exploits a graph representation of Web documents to obtain information from online document sites and guides. Ahoy uses Internet services such as search engines to obtain candidate documents about a person, and uses heuristics to identify the syntactic features that mark a document as that person's homepage.
Word segmentation
At present there are many word segmentation algorithms, such as forward maximum matching (MM), reverse maximum matching (RMM), word-by-word traversal matching, segmentation-mark methods, forward best matching, and reverse best matching. In recent years many new methods have been proposed to improve segmentation accuracy and speed, for example: generate-and-test methods that improve accuracy through interaction between a lexical ATN and a semantic ATN; improved MM algorithms that adopt incremental forward maximum matching and jump matching, combined with final semantic checks and a rightmost principle, to eliminate ambiguity; neural-network-based segmentation, which tries to use neural networks for disambiguation but introduces the problem of training sample selection, which, given the complexity of natural language, requires in-depth study; methods combining a direct matching algorithm, a suffix algorithm, and a thesaurus structure supporting first-character hashing, which improve speed locally but cannot perform a standard binary search; and the nearest-neighbor matching algorithm with first-character hashing, which uses maximum character matching supported by first-character hashing and standard binary search to improve segmentation speed.
The basic word segmentation algorithms are: (1) Dictionary- and rule-based matching. These methods segment text using dictionary matching and Chinese lexical or other linguistic knowledge; they are simple and efficient, but place high demands on the completeness of the dictionary and the consistency of the rules. Matching strategies include maximum matching, minimum matching, reverse matching, word-addition/subtraction matching, and bidirectional scanning. (2) Marker-based methods, such as the segmentation-mark method and statistical indexing. (3) Frequency statistics methods, which segment based on statistics of word and character co-occurrence; their completeness is poor. (4) Semantics-based methods, such as the suffix word method. At present dictionary-based segmentation is used most. Because Chinese segmentation can produce ambiguity, for example "计算机" can be segmented as "计算/机" ("compute/machine") or kept whole as "计算机" ("computer"), it must be combined with other methods, such as segmentation based on grammatical rules or on naive Bayes. During segmentation, word variants can also be merged: synonyms and near-synonyms can be combined, for instance "Internet" and "World Wide Web" can be treated as one entry.
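As a sketch of the dictionary-matching family described above, the following implements forward maximum matching (MM) in Python; the dictionary is a toy lexicon invented for illustration, and real systems add ambiguity resolution on top:

```python
def mm_segment(text, dictionary, max_len=4):
    """Forward maximum matching (MM): at each position take the longest
    dictionary word that matches; fall back to a single character."""
    words, i = [], 0
    while i < len(text):
        for L in range(min(max_len, len(text) - i), 0, -1):
            if text[i:i + L] in dictionary or L == 1:
                words.append(text[i:i + L])
                i += L
                break
    return words

# toy dictionary; note how the longest match wins over "计算" + "机"
dictionary = {"计算机", "计算", "机器", "学习"}
print(mm_segment("计算机学习", dictionary))  # ['计算机', '学习']
```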
The Semantic Web is the next generation of Web technology; it gives the Web semantic information that computers can understand.
Ontology plays an important role in Semantic Web technology. An ontology is a shared understanding of domain knowledge and a formal, structured description of the domain. Addressing the existing problems of the Semantic Web, work in this area builds on key technologies such as Web technology, information integration, and information management, from the following aspects.
(1) Semantic information integration. This studies semantic annotation and ontology integration: useful information is extracted from heterogeneous resources using ontology-based semantic annotation and ontology mapping, and the information of the various sources is integrated through mappings.
(2) Semantic query. Query methods for semantic information include visual navigation queries over the ontology, queries over concepts/instances/attributes, queries based on full-text search, and semantic relation queries.
(3) Semantic information mining. Mining of semantic information is still at a shallow stage, and most current research remains at traditional text mining. Research here mainly covers ontology instance clustering, ontology classification, ontology association rule mining, and keyword extraction from ontologies. These technologies are the basis of Semantic Web applications; they can be used to analyze trends in semantic information, process semantic data automatically, and so on.
(4) Semantic Web services. Web services are described with a system-defined software ontology so that Web service evaluation and composition can be realized.
(5) Semantic information management based on peer-to-peer. The core idea is to integrate existing peer-to-peer frameworks so that the semantic mining platform can be applied in a peer-to-peer environment.
(6) Algorithm explanation. The defined basic data structures are used to log the execution of the above algorithms, so that user-algorithm and developer-algorithm interaction can be realized easily, providing a friendlier interface to the algorithms themselves.
2. Web Multimedia Mining
Web multimedia mining differs from Web text mining in the features that need to be extracted. For Web multimedia mining, the features include the file name, URL, type, key-value table, and the color vectors of images or video. These features can then be mined; for example, association analysis may find a rule such as "if the image is 'large' and is related to the keyword 'grassland', then the probability that it is green is 0.8". Multimedia data can of course also be classified, clustered, and so on. The main methods of multimedia data mining are: similarity search in multimedia data, which includes two kinds of multimedia indexing and retrieval techniques, description-based retrieval systems and content-based retrieval systems; multi-dimensional analysis of multimedia data, where a multimedia data cube can be designed and constructed following the traditional method of building data cubes from relational data; classification and predictive analysis, applied mainly in astronomy, seismology, and geographic research, where decision tree classification is the most common method; and association rule mining for multimedia data, covering three kinds of rules: associations between image content and non-image content, associations among image content unrelated to spatial relationships, and associations among image content related to spatial relationships.
3. Feature extraction
The classical text representation model is the vector space model (VSM), proposed by Salton and others in the late 1960s and successfully applied in the famous SMART text retrieval system. The vector space model simplifies the representation of text: features are assumed to be independent of one another and their dependencies are ignored, and a document's content is represented by the feature words it contains: d = (t1, t2, ..., tn), where tk is the k-th feature word of document d, 1 ≤ k ≤ n. The similarity sim(d1, d2) between the contents of two documents d1 and d2 is measured by the similarity between their vectors; the most common similarity measure is the cosine distance.
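For concreteness, a minimal Python sketch of the cosine measure over two such feature-weight vectors (the weights are invented):

```python
import math

def cosine_sim(d1, d2):
    """Cosine similarity between two term-weight vectors of equal length."""
    dot = sum(a * b for a, b in zip(d1, d2))
    norm = (math.sqrt(sum(a * a for a in d1)) *
            math.sqrt(sum(b * b for b in d2)))
    return dot / norm if norm else 0.0

# toy weights for features (t1, t2, t3) of two documents
print(cosine_sim([0.5, 0.8, 0.0], [0.4, 0.2, 0.1]))
```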
In addition to the vector space model, the probabilistic models proposed by Stephen Robertson and Karen Sparck Jones have been widely accepted. This family of models jointly considers factors such as term frequency, document frequency, and document length, and relates documents to the user's interest (the query) through probabilistic relationships, yielding the famous Okapi formula. It has been very successful in information retrieval.
Dimensionality reduction automatically extracts some features from the original feature space, usually in one of two ways: deleting features that carry no information, according to statistics of the sample set, or synthesizing several low-level features into a new feature. There are many feature extraction methods, such as document frequency (DF), information gain (IG), mutual information (MI), the χ² statistic (CHI), and term strength (TS). DF is the number of documents that contain a feature. TS estimates the importance of a feature from its frequency within sets of similar documents. In practice, however, some features with low DF or TS values still carry information and cannot safely be deleted from the feature space, so these two methods are unreliable in some cases; MI's weakness is that it is strongly affected by the marginal probabilities of features; CHI and IG generally give good results. Commonly used evaluation functions include the odds ratio, information gain, expected cross entropy, mutual information, and word frequency.
(1) IG (Information Gain). The IG value reflects how a feature is distributed over the training set and is computed from the number of occurrences of the feature in each category, as follows:

IG(t) = -Σ_{i=1..m} Pr(ci) log Pr(ci) + Pr(t) Σ_{i=1..m} Pr(ci|t) log Pr(ci|t) + Pr(¬t) Σ_{i=1..m} Pr(ci|¬t) log Pr(ci|¬t)

where t denotes the feature, ci the i-th category, m the number of categories, Pr(ci) the probability of category ci, Pr(ci|t) the probability of category ci given that the feature t is present, Pr(ci|¬t) the probability of category ci given that t is absent, Pr(t) the probability that t appears, and Pr(¬t) the probability that it does not. A higher IG value indicates that the feature is more concentrated on particular categories of the training set. The IG method extracts features with high IG values; its basic idea is that features with a more concentrated distribution are more important.
(2) MI (Mutual Information). The mutual information value measures the correlation between feature t and class c:

MI(t, c) = log [ Pr(t ∧ c) / (Pr(t) · Pr(c)) ]

which, to ease computation, is estimated as:

MI(t, c) ≈ log [ (A × N) / ((A + C) × (A + B)) ]

where N is the total number of texts in the training set, A the number of texts in which t and c co-occur, B the number containing t but not belonging to c, and C the number belonging to c but not containing t. This gives the mutual information between the feature and each category. To obtain an overall evaluation of the feature over the data set, two aggregations are used:

MI_avg(t) = Σ_{i=1..m} Pr(ci) · MI(t, ci)        MI_max(t) = max_{i=1..m} MI(t, ci)

The former is the average mutual information of the feature over the classes, while the latter takes the maximum of the feature's mutual information with each category. The MI method extracts features with higher mutual information values; its basic idea is that features more strongly correlated with a class are more important.
(3) CHI. The idea is basically similar to MI: it also measures the degree of dependence between feature t and class c, but the computational details differ; CHI considers more factors, and one view holds that CHI is a "normalized" MI. The formula is:

χ²(t, c) = N (AD - CB)² / [ (A + C)(B + D)(A + B)(C + D) ]

where N is the total number of texts in the training set, A the number of texts containing t and belonging to c, B the number containing t but not belonging to c, C the number belonging to c but not containing t, and D the number in which neither occurs. As with MI, CHI uses either an average or a maximum to obtain the overall evaluation of the feature:

CHI_avg(t) = Σ_{i=1..m} Pr(ci) · χ²(t, ci)        CHI_max(t) = max_{i=1..m} χ²(t, ci)

The basic idea of the CHI method, likewise, is that the more closely a feature is related to a category, the more important it is.
(4) DF (Document Frequency). The document frequency is the number of texts in the training set that contain the feature; a text "contains" the feature if the feature appears in it at all, regardless of how many times. The DF method keeps features with high DF values; its purpose is to remove features that appear too few times in the training set and retain those that appear often enough to have some influence. Of all the feature extraction methods, DF is the simplest to compute.
(5) WEE (Weight of Evidence). The weight of evidence for text is given by:

WE(t) = Pr(t) · Σ_{i=1..m} Pr(ci) · | log [ Pr(ci|t)(1 - Pr(ci)) / (Pr(ci)(1 - Pr(ci|t))) ] |

where t is the feature, m the number of categories, ci the i-th category, Pr(ci) the probability of category ci, Pr(ci|t) the probability of category ci given that t is present, and Pr(t) the probability that t appears.
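The following minimal Python sketch computes DF and IG (in its equivalent entropy form) for one term over a small labelled document set; the documents and labels are toy data invented for illustration:

```python
import math

def doc_frequency(docs, term):
    """DF: number of documents containing the term at least once."""
    return sum(1 for words, _ in docs if term in words)

def information_gain(docs, term):
    """IG of a term over a labelled set: entropy of the class distribution
    minus the weighted entropies after splitting on term presence."""
    n = len(docs)
    classes = {label for _, label in docs}
    def entropy(subset):
        if not subset:
            return 0.0
        h = 0.0
        for c in classes:
            p = sum(1 for _, label in subset if label == c) / len(subset)
            if p:
                h -= p * math.log2(p)
        return h
    with_t = [d for d in docs if term in d[0]]
    without_t = [d for d in docs if term not in d[0]]
    return (entropy(docs)
            - len(with_t) / n * entropy(with_t)
            - len(without_t) / n * entropy(without_t))

docs = [({"ball", "goal"}, "sport"), ({"goal", "win"}, "sport"),
        ({"stock", "price"}, "finance"), ({"price", "ball"}, "finance")]
print(doc_frequency(docs, "goal"), information_gain(docs, "goal"))
```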
4. Classification
There are many text classification methods, such as multivariate regression models, k-nearest neighbor, neural networks, Bayesian methods, decision trees, and support vector machines. These can be divided into two types: statistical classification methods and machine-learning-based classification methods. The support vector machine (SVM) is a relatively new development from statistical learning theory, still evolving, but its applications in many fields have been very promising.
Automatic Web page classification is one of the main research topics of Web content mining. The main technique used is text classification, because text is the main body of Web content and processing text is easier than processing audio and video. To classify text, features must first be extracted. A feature is a word or phrase. At present most English classification algorithms use words as features, with spaces and punctuation marks as separators during tokenization, thereby extracting all features that appear in the document; the set of all extracted features is called the full feature set. After feature extraction, feature selection is generally needed: extracting a subset from the full feature set, called a feature subset. According to John Pierre, the features used to represent text should ideally (1) be as few as possible, (2) have moderate frequency, (3) carry little redundancy, (4) carry little noise, (5) be related to the category, and (6) have meanings that are as clear as possible. When extracting a feature subset from the full feature set, features are usually ranked by weights such as information gain or mutual information. The feature subset can then be used to represent the text, and classifiers can be built with different classification methods. Common classification models are (1) the k-nearest neighbor model, (2) the Rocchio model, (3) the Bayesian model, (4) the neural network model, and (5) the decision tree model. Researchers have put forward many text categorization methods, such as the vector space method (VSM), regression models, k-nearest neighbor, Bayesian probabilistic methods, decision trees, neural networks, online learning, and support vector machines.
After feature extraction is complete, these features are used to represent each text. The specific representation varies with the classification method: each classification model represents a text in its own way and incorporates that representation into its own system. All classification models work in two steps, training and classification. In general, more training examples give better assurance of classification accuracy, but more is not always better.
(1) Rocchio algorithm based on TFIDF
The Rocchio algorithm derives from vector space model theory, whose basic idea is to represent a text as a vector so that processing becomes vector operations in a space. TFIDF-based Rocchio is one implementation of this idea: a text is represented by an n-dimensional vector, where the dimension n is the number of features and each component is a feature's weight, computed by the TFIDF method. The steps are as follows:
The TFIDF method first represents each text in the training set as a vector and then generates a class feature vector, i.e. a vector that can represent a category, as the average of all the text vectors in the class. The training process of the Rocchio algorithm is thus the process of building the class feature vectors. At classification time, given an unknown text, its vector is computed, the similarity between that vector and each class feature vector is evaluated, and the text is assigned to the most similar category. Two common ways to measure the similarity of vectors (x, y denote vectors, xi, yi their components) are the inner product and the cosine:

sim(x, y) = Σ_i xi · yi        sim(x, y) = Σ_i xi · yi / ( sqrt(Σ_i xi²) · sqrt(Σ_i yi²) )
Overall, the Rocchio algorithm is simple and easy to run, and classification in particular is fast.
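A minimal sketch of the training and classification steps just described, with toy term-frequency vectors in place of real TFIDF weights:

```python
import math

def centroid(vectors):
    """Class feature vector: componentwise average of the class's vectors."""
    n = len(vectors)
    return [sum(v[j] for v in vectors) / n for j in range(len(vectors[0]))]

def cosine(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return dot / (nx * ny) if nx and ny else 0.0

def rocchio_classify(doc_vec, class_vectors):
    """Assign the document to the class whose centroid is most similar."""
    centroids = {c: centroid(vs) for c, vs in class_vectors.items()}
    return max(centroids, key=lambda c: cosine(doc_vec, centroids[c]))

training = {"sport": [[1, 0, 2], [2, 0, 1]],
            "finance": [[0, 3, 0], [0, 2, 1]]}
print(rocchio_classify([1, 0, 1], training))  # 'sport'
```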
(2) Naive Bayesian model
Bayesian classification is a statistical classification method based on Bayes' theorem. It can be used to predict the likelihood of class membership, giving the probability that a text belongs to each category; the sample is then assigned to the category with the highest probability. Suppose there are m classes c1, c2, ..., cm. Given an unknown text X, Bayesian classification assigns it the category with the highest posterior probability given X, that is, it maximizes P(ci|X), which by Bayes' theorem is:

P(ci|X) = P(X|ci) P(ci) / P(X)

Clearly P(X) is a constant for all classes, so it suffices to maximize P(X|ci)P(ci). P(ci) can be computed from the category distribution of the training set: P(ci) = |Di| / |D|, where |Di| is the number of texts in category ci and |D| is the total number of texts in the training set. When there are many attributes, computing P(X|ci) is expensive; to reduce this cost a simplifying assumption called class-conditional independence is made: each attribute of a document is assumed to affect the classification independently of the other attributes, i.e. the attributes are mutually independent. This is the origin of the name naive Bayes. It allows P(X|ci) to be computed simply as a product of per-attribute probabilities on category ci, usually estimated with the Laplace estimator (Laplacean prior). Depending on implementation details there are two naive Bayes models: the multivariate Bernoulli model considers only whether a feature appears in the text (recorded as 1 if it appears and 0 otherwise), while the multinomial model takes into account the number of occurrences of the feature in the text:

Bernoulli:   P(X|ci) = Π_k [ Bk P(tk|ci) + (1 - Bk)(1 - P(tk|ci)) ],  Bk ∈ {0, 1}
Multinomial: P(X|ci) ∝ Π_k P(tk|ci)^(nk),  where nk is the count of term tk in X
Training a naive Bayes classifier amounts to tallying the statistics of each feature within each class. Theoretically, Bayesian classification has the minimum error rate; in tests, naive Bayes shows rare speed and accuracy on large data sets.
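A minimal sketch of a multinomial naive Bayes classifier with Laplace smoothing, on toy word lists invented for illustration:

```python
import math
from collections import Counter, defaultdict

def train_multinomial_nb(docs):
    """docs: list of (word_list, label). Returns priors and Laplace-smoothed
    per-class word probabilities."""
    class_docs = defaultdict(int)
    class_words = defaultdict(Counter)
    vocab = set()
    for words, label in docs:
        class_docs[label] += 1
        class_words[label].update(words)
        vocab.update(words)
    n = len(docs)
    priors = {c: class_docs[c] / n for c in class_docs}
    cond = {}
    for c, counter in class_words.items():
        total = sum(counter.values())
        cond[c] = {w: (counter[w] + 1) / (total + len(vocab)) for w in vocab}
    return priors, cond

def classify_nb(words, priors, cond):
    """argmax_c log P(c) + sum_k log P(t_k|c); unseen words are skipped."""
    best, best_score = None, -math.inf
    for c in priors:
        score = math.log(priors[c]) + sum(
            math.log(cond[c][w]) for w in words if w in cond[c])
        if score > best_score:
            best, best_score = c, score
    return best

docs = [(["goal", "ball"], "sport"), (["goal", "win"], "sport"),
        (["stock", "price"], "finance")]
priors, cond = train_multinomial_nb(docs)
print(classify_nb(["goal", "goal", "price"], priors, cond))  # 'sport'
```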
(3) Decision tree
A decision tree is a flowchart-like tree structure in which each internal node represents a test on an attribute, each branch represents a test outcome, and each leaf node represents a category. Decision trees are easily rewritten as if-then classification rules and are easy to understand. The core algorithm is a greedy algorithm that constructs the tree top-down from the training set; an unknown text is then tested against the attributes down the tree, and the path from root to leaf determines its category. Decision tree algorithms include C4.5 (developed from ID3), CART, CHAID, and others; they differ in the details of tree construction and pruning. Decision trees resist noise well. Their biggest drawback is poor suitability for large data sets, where tree construction becomes inefficient.
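A short usage sketch, assuming scikit-learn is available; the word-frequency rows and feature names are invented, and export_text shows how the learned tree reads as if-then rules:

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# rows are documents, columns are word frequencies for ("goal", "stock")
X = [[2, 0], [1, 0], [0, 3], [0, 1]]
y = ["sport", "sport", "finance", "finance"]

tree = DecisionTreeClassifier(criterion="entropy").fit(X, y)
print(export_text(tree, feature_names=["goal", "stock"]))  # if-then view
print(tree.predict([[1, 1]]))  # classify an unknown document
```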
(4) Neural network
A neural network learns an objective function, and classification is based on the function's output; the inputs are the values of the text's components on each feature. A neural network is a set of connected input/output units, each connection carrying a weight. Training on the training set is the process of adjusting these weights so that the network predicts categories correctly. Because training proceeds example by example, new training examples can be added at any time and the network adjusted without retraining from scratch. Experimental results show, however, that classification accuracy is low when training examples are too few. Because training can assign appropriate weights to the features, neural networks withstand noise well.
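The weight-adjustment idea can be seen in miniature in a single-layer perceptron; this sketch uses invented two-dimensional feature vectors with labels +1/-1:

```python
def train_perceptron(examples, epochs=20, lr=0.1):
    """Single-layer network: nudge the weights whenever the predicted
    class differs from the true class (the weight tuning described above)."""
    dim = len(examples[0][0])
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        for x, y in examples:          # y is +1 or -1
            pred = 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else -1
            if pred != y:              # misclassified: adjust the weights
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
    return w, b

# toy feature vectors (e.g. term weights) with labels
examples = [([1.0, 0.0], 1), ([0.9, 0.2], 1),
            ([0.0, 1.0], -1), ([0.1, 0.8], -1)]
print(train_perceptron(examples))
```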
(5) k nearest Neighbor
The idea of k-nearest neighbor (kNN) classification also derives from the vector space model and likewise converts texts into vectors. kNN is a classification method based on analogy. During training, kNN generates and stores the feature vectors of all training examples. Given an unknown text, it first generates the text's feature vector, then searches all training examples, finds the k closest ones by vector similarity, and assigns the unknown text to the category most common among those k neighbors. Similarity can be measured by Euclidean distance or by the angle between vectors. Empirically, k is often set to around 45. kNN is a lazy method: it has no learning phase but simply stores all training examples, deferring work until an unknown text must be classified. Training is therefore fast, and training examples can be added or updated at any time; but classification can be expensive, since it requires considerable space to keep the training examples and classification efficiency is poor. One view holds that kNN performs excellently on small data sets.
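A minimal kNN sketch over toy term-frequency vectors, using cosine similarity and majority vote (k = 3 here purely to fit the toy data):

```python
import math
from collections import Counter

def knn_classify(query, training, k=3):
    """training: list of (vector, label). Rank stored examples by cosine
    similarity and vote among the k nearest neighbours."""
    def cosine(x, y):
        dot = sum(a * b for a, b in zip(x, y))
        nx = math.sqrt(sum(a * a for a in x))
        ny = math.sqrt(sum(b * b for b in y))
        return dot / (nx * ny) if nx and ny else 0.0
    neighbours = sorted(training, key=lambda ex: cosine(query, ex[0]),
                        reverse=True)[:k]
    return Counter(label for _, label in neighbours).most_common(1)[0][0]

training = [([1, 0, 2], "sport"), ([2, 1, 1], "sport"),
            ([0, 3, 0], "finance"), ([0, 2, 1], "finance")]
print(knn_classify([1, 0, 1], training, k=3))  # 'sport'
```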
(6) SVM method
The SVM method is based on the VC dimension theory of statistical learning theory and the structural risk minimization principle. Given limited sample information, it seeks the best trade-off between model complexity (learning accuracy on the given training samples) and learning ability (the ability to classify unseen samples without error), aiming at good generalization. SVM is designed specifically for finite samples: its goal is the optimal solution under the available information, not the asymptotic optimum as the sample size tends to infinity (on which kNN and naive Bayes rely), and in theory it obtains a globally optimal solution, thereby avoiding the local extremum problem that neural network methods cannot escape. In addition, SVM maps the real problem into a high-dimensional feature space through a nonlinear transformation and constructs a linear discriminant function there, realizing a nonlinear discriminant in the original space. This property guarantees good generalization ability and neatly sidesteps the dimensionality problem: the algorithm's complexity is independent of the sample dimension.
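A short usage sketch of a linear SVM, assuming scikit-learn is available; the training vectors and labels are invented stand-ins for document feature vectors:

```python
from sklearn.svm import LinearSVC

X = [[1.0, 0.0], [0.9, 0.2], [0.0, 1.0], [0.1, 0.8]]  # training vectors
y = ["sport", "sport", "finance", "finance"]           # their classes

clf = LinearSVC()                 # linear maximum-margin classifier
clf.fit(X, y)
print(clf.predict([[0.8, 0.1]]))  # ['sport']
```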
5. Web Page Classification method
In general, the part of a Web page that contributes most to classification is the core text, the textual part describing the page's content; next come structural information and hyperlink information, and then multimedia information. Recognizing multimedia information involves image retrieval, speech recognition, and related technologies, which do not yet give good results, so it is seldom considered. The basic idea of our Web page classification is:
(1) Use a self-developed Web page parser to isolate the core text of the target Web page.
(2) Use the self-developed classification system TCS to segment the core text of the target page, extract features, and so on, producing the target page's original feature vector.
(3) Classify the target pages according to their feature vectors.
Five criteria are commonly used to evaluate a classifier from different angles: (1) precision; (2) recall; (3) the F measure, which combines precision and recall and gives them equal importance:

F = 2PR / (P + R)

where R denotes recall and P denotes precision. These three criteria evaluate the classifier's accuracy on a single category only. (4) the macro-averaged score; (5) the micro-averaged score.
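A minimal sketch of these criteria; the per-category confusion counts are invented, and the macro average averages per-category F1 while the micro average pools the counts first:

```python
def prf(tp, fp, fn):
    """Precision, recall and F1 for one category."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

# per-category confusion counts: category -> (tp, fp, fn); numbers invented
counts = {"sport": (8, 2, 1), "finance": (5, 1, 4)}

# macro average: average the per-category scores
macro_f1 = sum(prf(*c)[2] for c in counts.values()) / len(counts)

# micro average: pool the counts, then compute the score once
tp = sum(c[0] for c in counts.values())
fp = sum(c[1] for c in counts.values())
fn = sum(c[2] for c in counts.values())
micro_f1 = prf(tp, fp, fn)[2]
print(macro_f1, micro_f1)
```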
WEB Structure Mining:
Across the Web, useful knowledge lies not only in page content but also in the hyperlink structure between pages and in page structure itself. The purpose of mining the Web structure is to discover the structure of pages and of the link graph, on which pages can be categorized and clustered and authoritative pages found, thereby improving search engines.
With hundreds of millions of pages stored in search engines, their link structure is easy to obtain; what is needed is a good way to use that structure to evaluate page importance. The basic ideas of PageRank are: a page that is referenced often is probably important; a page may be important, even if not referenced many times, if it is referenced by an important page; and a page's importance is divided evenly and passed on to the pages it references. In the PageRank method, let u be a Web page, Fu the set of pages u points to, Bu the set of pages pointing to u, Nu = |Fu| the number of links out of u, and c (c < 1) a normalization factor (so that the total PageRank over all pages is constant). The PageRank of page u is then defined (simplified version) as:

PR(u) = c · Σ_{v ∈ Bu} PR(v) / Nv

A page's PageRank is distributed evenly among all the pages it points to; each page sums the PageRank brought in by all the links pointing to it to obtain its new PageRank. The formula is recursive: it can be computed starting from any assignment and iterated until it converges. For a search engine's keyword search results, PageRank is a good ranking method: query results can be ordered by PageRank from largest to smallest.
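A minimal power-iteration sketch of this computation; the link graph is invented, and the (1 - c)/N term is the usual damping addition so that the iteration behaves sensibly on pages with no in-links:

```python
def pagerank(links, c=0.85, iterations=50):
    """links: page -> list of pages it points to. Repeatedly pass each
    page's rank, divided evenly, to the pages it links to."""
    pages = set(links) | {v for vs in links.values() for v in vs}
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new = {p: (1 - c) / len(pages) for p in pages}  # damping term
        for u, outs in links.items():
            if outs:
                share = rank[u] / len(outs)   # PR(u) split over N_u links
                for v in outs:
                    new[v] += c * share
        rank = new
    return rank

links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
print(sorted(pagerank(links).items(), key=lambda kv: -kv[1]))
```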
Judging from the present state of Web structure mining, research on pure link-structure mining is rare; most of it is combined with other forms of Web mining. The main research focuses on generating virtual views of the network, network navigation, reorganizing information classification and index structures, text classification, and determining text importance.
Hub page/authority page (hub/authority) method
The hyperlink relationships of pages are complex: some links exist only for navigation, so a hyperlink cannot simply be taken as a reference relationship, and, for business reasons, few pages link to their competitors' pages. Because of these defects in the hyperlink structure, the hub/authority method appeared. Its idea is that the Web contains important hub pages: a hub page is not necessarily linked to by many pages, but it contains links to the most important sites in a particular field. Such a hub page implicitly indicates the importance of other Web pages. An authority page should be linked to by multiple hub pages, and a hub page should contain links to many authority pages. Computing this mutually reinforcing relationship between hub pages and authority pages is the main idea of the hub/authority (HITS) method.
HITS and PageRank, as well as improved HITS algorithms that add Web content information to the link structure analysis, are mainly used to model the topology of the Web, computing the degree of Web pages and the correlations between them; typical systems are CLEVER and Google.
Web Usage Mining:
Web usage mining, also called Web usage record mining, discovers users' page access patterns by mining Web log records. By analyzing the regularities in Web logs, potential customers of e-commerce sites can be identified; an extended tree model can be used to recognize user browsing patterns, which can in turn be applied to Web log mining; and association rules about user interests can be mined from records of user access to the Web as a basis for predicting user behavior, so that pages can be prefetched and user access sped up. The Web log mining process generally has three stages: preprocessing, mining algorithm execution, and pattern analysis. A Web server log records information about users' visits to the site, including the IP address, request time, method, URL of the requested file, return code, number of bytes transmitted, URL of the referring page, and user agent. Some of this information is useless for Web mining, so the raw data must first be preprocessed. Preprocessing includes data cleaning, user identification, transaction identification, and other steps. After the Web log has been preprocessed, an access pattern discovery technique is chosen according to the specific analysis requirements, such as path analysis, association analysis, time-series pattern recognition, or classification and clustering. After patterns are mined, they must be analyzed so they can be put to good use.
There are two common ways to discover user usage information. One is to analyze the log files, which can be done in two ways: first, preprocess the logs, map the log data into relational tables, and apply data mining techniques such as association rules or clustering to the access data; second, analyze the log data directly to obtain user navigation information. The other is to discover user navigation behavior by collecting and analyzing user click events. From the perspective of research goals, existing work based on Web server log data falls into three categories: ① analyzing system performance; ② improving system design; ③ understanding user intent. Since the goals differ, so do the main techniques. Mining of user usage records typically takes three steps. ① Data preprocessing, the most critical phase, including preprocessing of usage records and of content and structure. ② Pattern recognition, using methods from statistics, machine learning, and pattern recognition, with algorithms such as statistical analysis, clustering, classification, association rules, and sequential pattern recognition. ③ Pattern analysis, which filters out uninteresting and irrelevant data and patterns from those collected in the previous stage; the concrete method depends on the Web mining technique used. Usually there are two approaches: using SQL queries for analysis, or loading the data into a multidimensional data cube and analyzing it with OLAP tools, providing visualized output. Early research on mining user records used statistical methods: as users visit a site through their browsers, a statistical model performs simple statistics on access patterns, such as frequently accessed pages, visits per unit time, and the distribution of accesses over time. Early work also used a statistical model based on a breadth-first algorithm, the heuristic HPG (hypertext probabilistic grammar) model, for discovering user navigation behavior; since the HPG model is quite similar to a k-order Markov model, some researchers have recently proposed using Markov models to mine user records.
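As a sketch of the preprocessing stage, the following parses a Common Log Format line and groups requests into sessions by IP address with a 30-minute timeout; the log line is a standard textbook example and the field layout is assumed:

```python
import re
from datetime import datetime, timedelta

LOG_RE = re.compile(r'(\S+) \S+ \S+ \[([^\]]+)\] "(\S+) (\S+) [^"]*" (\d+) (\S+)')

def parse_line(line):
    """Extract (ip, time, url) from a Common Log Format record."""
    m = LOG_RE.match(line)
    if not m:
        return None
    ip, ts, method, url, status, size = m.groups()
    t = datetime.strptime(ts.split()[0], "%d/%b/%Y:%H:%M:%S")
    return ip, t, url

def sessionize(records, timeout=timedelta(minutes=30)):
    """Group page requests into sessions: same IP, gaps below the timeout."""
    sessions, last_seen = {}, {}
    for ip, t, url in sorted(records, key=lambda r: r[1]):
        if ip not in last_seen or t - last_seen[ip] > timeout:
            sessions.setdefault(ip, []).append([])   # start a new session
        sessions[ip][-1].append(url)
        last_seen[ip] = t
    return sessions

line = '1.2.3.4 - - [10/Oct/2000:13:55:36 -0700] "GET /index.html HTTP/1.0" 200 2326'
print(sessionize([parse_line(line)]))
```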
Web log mining methods can be divided into (1) the data-cube-based approach represented by Jiawei Han: the Web logs are stored in a data cube, on which data mining and OLAP operations are performed; and (2) the approach represented by Ming-Syan Chen: they first proposed the concept of the maximal forward reference sequence (MFR), used MFR to divide a user session into a series of transactions, and then mined frequent browsing paths in a manner similar to association rule mining.
Web behavior mining is widely used in e-commerce. After transactions have been delineated, a technique can be chosen according to the specific analysis requirements (path analysis, association rule mining, sequential patterns, clustering, and classification).
Pattern analysis in Web usage mining mainly looks for interesting patterns among those found by the pattern discovery algorithms. Web analysis techniques and tools can help analysts understand and fully exploit the patterns obtained by the various mining methods. For example, the WebViz (Pitkow) system can visualize WWW access patterns; WebMiner uses an SQL-like knowledge query mechanism; one can also store Web usage data in a data warehouse and use OLAP methods to discover specific patterns in the data.
6. The four steps of Web data mining:
1. Resource finding: retrieve the data from the target Web documents. 2. Information selection and preprocessing: remove useless information from the retrieved Web resources and organize what remains. 3. Pattern discovery: automatic pattern discovery, within a single site or across multiple sites. 4. Pattern analysis: validate and interpret the patterns found in the previous step.
7. Web mining has a wide range of applications on the Internet; common ones are:
(1) Helping users find news or other information of interest, providing personalized service within a Web site, and attracting more users.
(2) Automatically classifying documents for search engines, reducing the human effort search engines spend organizing Internet documents, and ranking Web pages to improve the search engine.
(3) Web log mining has broad application prospects in e-commerce, such as discovering customers' purchasing habits and browsing interests, adjusting the sales model accordingly, and increasing business volume.
8. Web mining can usually be divided into three subtasks: resource discovery, information extraction, and generalization.
• Resource discovery: searching the Web for available information;
• Information extraction: extracting useful information from the discovered resources; for text, both the content and the structure of the text should be considered;
• Generalization: learning from the Web information and extracting general rules from it.
In general, Web mining has two data sources: search engine result sets and online information on the Web. Each has its own merits, depending on the application. Several resource discovery models are widely used on the Internet: catalog/browse models (WAIS and Gopher), retrieval models (Archie and AltaVista), and hypercube models (Yahoo and Excite). Many resource discovery tools employ a robot retrieval model that scans all documents on the Web and builds indexes, but the indexes then also include irrelevant and outdated information.
9. Development directions of Web mining:
At present, research on Web mining at home and abroad is at an early stage; it is a frontier research field. Promising future research directions include:
(1) Research on the intrinsic mechanisms of Web data mining;
(2) Dynamic maintenance and updating of the Web knowledge base (pattern base), integration and promotion of various kinds of knowledge and patterns, and comprehensive methods of knowledge evaluation;
(3) Efficient mining algorithms for semi-structured and unstructured text data, graphic and image data, and multimedia data;
(4) The adaptability and timeliness of Web data mining algorithms on massive data;
(5) Intelligent search engines based on Web mining;
(6) Personalization and performance optimization of intelligent site services;
(7) Association rules and sequential patterns for constructing self-organizing sites;
(8) Classification methods for intelligent extraction in e-commerce markets.
10. Research significance and directions:
Path pattern mining
On the Web, documents are navigated through hyperlinks, and users often jump from page to page along chains of links in search of information. Capturing users' browsing paths is called path analysis. Understanding browsing paths helps improve system design and supports better market decisions, such as placing advertisements on the right pages.
Smart queries in the Web
The library of the digital age is not a well-organized information warehouse; it is more like a disorderly one. Intelligent querying on the Web includes three aspects: 1) resource discovery, with emphasis on automatically generating searchable indexes; 2) information extraction, the task that follows once resources are discovered; 3) information generalization, using classification to organize and manage data automatically and to find patterns of interest to users.
Web Intelligence Tools
We need software systems to extract, locate, and manage Web documents in order to keep pace with the speed of information change; such a system is called a Web tool. Existing Web tools lack the ability to identify and use deep semantics, and their query language descriptions are limited. A new generation of intelligent Web tools uses intelligent agents to help users discover information: they automatically acquire the user's topics of interest, discover the user's browsing patterns and the modification patterns of information resources, use network resources more efficiently, cluster the query demands of multiple users to reduce the number of queries, and save extracted documents with full-text indexes in a database to discover useful patterns.
Improve network response speed
Traditional solutions to slow network response are generally client-based: optimizing transmission, reducing congestion, and prefetching some pages according to predictions. Mining association rules on the server side can both improve network response speed and efficiently schedule the caches of network proxies: when a user browses a page, the proxy can prefetch the pages associated with it by the association rules, i.e. the pages the user is likely to visit next, thereby increasing response speed. Since the association rules are based on statistics, they reflect the interests of most users.
11. The development of personalization technology based on Web mining
(1) Combination with artificial intelligence technology
Many problems in personalized systems can be framed as machine learning or knowledge discovery problems, and agent and multi-agent technologies are typically applied to user modeling. Combining artificial intelligence with Web mining will therefore promote the rapid development of Web personalization systems.
(2) Combination with interactive multimedia Web technology
With the rapid development and application of next-generation Internet technology, the future Web will be a multimedia world. Interactive personalized multimedia Web systems, combining Web personalization technology with Web multimedia systems, will appear. Content mining that supports massive multimedia data streams will become one of the basic functions of Web mining technology. Because content-based interactive personalized multimedia Web systems can satisfy user needs well, they will become one of the development directions of Web personalization systems.
(3) Combination with database technology
12. Development directions of data mining and knowledge discovery:
1. Efficiency and scalability of mining algorithms. Today's databases are large and high-dimensional, which enlarges the search space of data mining and raises the risk of blind discovery. Key points for future development are making full use of domain knowledge, eliminating data irrelevant to the discovery task, effectively reducing the dimensionality of the problem, and designing efficient knowledge discovery algorithms.
2. Temporality of data. Application databases are constantly updated; over time, previously discovered knowledge may no longer hold, so discovered patterns must be revised incrementally to guide new rounds of discovery.
3. Integration with other systems. A knowledge discovery system should integrate databases, knowledge bases, expert systems, decision support systems, visualization tools, networks, and other technologies.
4. Interactivity. Bayesian methods can be used to model the probabilities and distributions of the data so as to exploit prior knowledge, and deductive databases can be used to discover knowledge and guide the discovery process.
5. Refinement of discovered patterns. Domain knowledge can be used to further refine discovered patterns and extract the useful knowledge in them.
6. Knowledge discovery on the Internet. The WWW is ever more popular and much new knowledge can be discovered from it. Some resource discovery tools find texts containing given keywords, but research on knowledge discovery on the WWW is still limited. Han and others in Canada proposed a multi-level database that generalizes the raw data layer by layer using multi-level structuring; for example, a high-level database can store descriptions of images on the WWW rather than the images themselves. Open questions include how to extract useful information from complex data (such as multimedia data), how to maintain multi-level databases, and how to handle the heterogeneity and autonomy of the data.
13, text mining is facing many new research topics:
(1) Scalability of text mining algorithms. The growth of the Internet, the rise and wide application of e-commerce and digital libraries, and the falling price of permanent storage devices have pushed the volume of text stored by individual organizations to an unprecedented scale. Handling such large text collections requires fast and efficient text mining algorithms.
(2) Text representation. Text mining operates on natural-language text, which is unstructured or semi-structured data lacking machine-understandable meaning. Before mining, the text must be preprocessed and features extracted so that it is expressed in a computer-readable intermediate form. Although research on natural language processing has made great progress, no intermediate form yet fully represents the semantics of text. Different mining purposes require intermediate representations of different complexity: fine-grained, domain-specific knowledge discovery tasks need semantic analysis to obtain representations rich enough to capture the relationships between objects or concepts in the text. Semantic analysis remains a hard problem, however, and making it fast and scalable for large text collections is a challenge.
(3) Cross-language problems. Natural languages are diverse and each has its own characteristics, so text mining that is effective in one language may not carry over to others, especially between Indo-European languages and Chinese. Moreover, with economic globalization, the text collections to be processed may contain documents written in many languages, so text mining should take semantic transformation between multiple languages into account.
(4) Choice of algorithm. Faced with a wide variety of text mining algorithms, each with its own characteristics, selecting a suitable one is an open research question, since ordinary users find it difficult to understand the principles and requirements of each algorithm.
(5) Setting algorithm parameters. Many algorithms require the user to set parameters, and the meaning of some parameters is hard to understand and therefore hard to set correctly. Enabling an algorithm to choose reasonably good parameter values automatically, and to adjust them as it runs, is key to making many algorithms widely usable.
(6) Pattern understanding and visual display. Text mining algorithms discover knowledge in many forms, and improving the comprehensibility of these patterns is a problem researchers must face. Common remedies include displaying results graphically, presenting a relatively small number of rules, generating natural language, and exploiting visualization techniques. Moreover, current text mining systems mostly target experienced experts and are difficult for ordinary users.
(7) Integration of domain knowledge. Most current text mining systems make no use of domain knowledge, yet domain knowledge can improve the efficiency of text analysis and enable more compact representations, so integrating it into text mining systems is worth considering.
(8) Chinese word segmentation. In Indo-European languages, words are delimited by spaces, so segmentation is easy. Chinese has no delimiter between words: a sentence is a continuous run of Chinese characters, words vary in length, the same character can appear in many different words, and many words consist of a single character. All of this makes correct segmentation of Chinese text considerably more challenging, as the sketch below illustrates.
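To make the segmentation problem concrete, here is a minimal sketch of forward maximum matching, a common baseline for Chinese word segmentation; the dictionary and input are toy examples, not taken from any system discussed here.

# Forward maximum matching: greedily take the longest dictionary word
# at each position; the dictionary is a toy example.
def fmm_segment(text, dictionary, max_word_len=4):
    words = []
    i = 0
    while i < len(text):
        matched = None
        # Try the longest candidate first, shrinking until a word matches.
        for length in range(min(max_word_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if candidate in dictionary:
                matched = candidate
                break
        if matched is None:
            matched = text[i]  # fall back to a single character
        words.append(matched)
        i += len(matched)
    return words

toy_dict = {"数据", "挖掘", "数据挖掘", "技术"}
print(fmm_segment("数据挖掘技术", toy_dict))  # ['数据挖掘', '技术']

The greedy longest-match strategy is fast, but it is precisely where the ambiguity problems described above arise, which motivates the statistical disambiguation methods discussed later in this document.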
Although many problems in text mining remain unsolved, many software vendors have already launched text mining products. Applications include Web site management, information flow and filtering, market management, quality management, and customer relationship management, as well as using knowledge discovered by text mining to guide investment decisions and forecast stock prices. These successful cases have brought considerable economic profit to many organizations.
14, Search results processing
Mining the results returned by search engines can give users more accurate query results. For example, the WebSQL system queries search engines to obtain documents and extracts from the document collection each document's URL, title, content type, content length, modification date, links, and other information; its SQL-like declarative language makes it possible to retrieve relevant documents from the search results.
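As an illustration of the kind of per-document metadata such a system gathers, here is a minimal Python sketch using the requests and BeautifulSoup libraries; the URL is a placeholder, and this is not the WebSQL implementation itself.

# Collect basic metadata for one result URL (illustrative only).
import requests
from bs4 import BeautifulSoup

def describe_document(url):
    resp = requests.get(url, timeout=10)
    soup = BeautifulSoup(resp.text, "html.parser")
    return {
        "url": url,
        "title": soup.title.string if soup.title else "",
        "content_type": resp.headers.get("Content-Type", ""),
        "content_length": len(resp.content),
        "last_modified": resp.headers.get("Last-Modified", ""),
        "links": [a["href"] for a in soup.find_all("a", href=True)],
    }

print(describe_document("https://example.com"))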
Web search result mining based on weighted statistics has been used to realize intelligent meta-search engines and result ranking.
By information recommendation technique, personalized service systems can be divided into two kinds: rule-based systems and information filtering systems; information filtering systems in turn divide into content-based filtering systems and collaborative filtering systems. Rule-based systems let the system administrator write rules over static and dynamic user attributes; a rule is essentially an if-then statement that determines which service to provide in which situation. Such systems are simple and direct, but the quality of the rules is hard to guarantee, the rules cannot be updated dynamically, and as their number grows the system becomes ever harder to manage. Content-based filtering systems filter information by the similarity between a resource and the user's interests. They are simple and effective, but they have difficulty distinguishing the quality and style of resource content, and they can only find resources similar to the user's existing interests, never new and interesting ones. Collaborative filtering systems filter information by the similarity between users. Their advantage is that they can discover new information of interest to the user; their drawbacks are two hard problems. One is sparsity: early in a system's life, resources have received too few ratings for the system to use them to find similar users. The other is scalability: as users and resources grow in number, the performance of the system degrades. Some personalized services therefore combine content-based and collaborative filtering so that each technique offsets the other's weaknesses. To overcome the sparsity problem of collaborative filtering, the content of the resources a user has browsed can be used to predict the user's ratings of other resources, increasing the density of ratings; with these ratings, collaborative filtering performs better.
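As a concrete illustration of collaborative filtering, the following is a minimal sketch of user-based filtering with cosine similarity over a toy ratings matrix; the data and the masking scheme are illustrative only.

# User-based collaborative filtering on a toy ratings matrix
# (rows: users, columns: items; 0 means "not rated").
import numpy as np

ratings = np.array([
    [5, 4, 0, 0],
    [4, 5, 1, 0],
    [0, 1, 5, 4],
], dtype=float)

def cosine(u, v):
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return u @ v / denom if denom else 0.0

def recommend(user, k=1):
    # Score unrated items by similarity-weighted ratings of other users.
    sims = np.array([cosine(ratings[user], ratings[other])
                     if other != user else 0.0
                     for other in range(len(ratings))])
    scores = sims @ ratings          # weighted sum of everyone's ratings
    scores[ratings[user] > 0] = -1   # mask items the user already rated
    return np.argsort(scores)[::-1][:k]

print(recommend(0))  # item 2, favoured by the similar user 1

The sparsity problem discussed above shows up directly here: with few nonzero ratings, the similarity estimates become unreliable.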
Web page recommendation algorithm
Let the page set be I = {p1, p2, ..., pn} and the current sliding window be W = {p1, p2, ..., pm}, with |W| = m. The association rule set mined from the Web log is R = {X => Y | X, Y ⊆ I and |Y| = 1}. Suppose the sequence of pages the customer is visiting is ...
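A minimal sketch of the recommendation step this setup implies: rules whose antecedent X is contained in the sliding window W contribute their consequent as a candidate, ranked by confidence. The rules and confidence values below are invented for illustration.

# Match mined rules against the sliding window of recently visited pages.
rules = [
    (frozenset({"p1", "p2"}), "p5", 0.80),  # (antecedent X, consequent Y, confidence)
    (frozenset({"p2"}),       "p4", 0.60),
    (frozenset({"p3"}),       "p6", 0.90),
]

def recommend(window, rules, top_n=2):
    # Recommend consequents of rules whose antecedent fits the window.
    candidates = {}
    for X, y, conf in rules:
        if X <= window and y not in window:
            candidates[y] = max(candidates.get(y, 0.0), conf)
    return sorted(candidates, key=candidates.get, reverse=True)[:top_n]

print(recommend({"p1", "p2"}, rules))  # ['p5', 'p4']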
Third, related application papers
Web Mining and its application in competitive intelligence system
This paper introduces the classification, characteristics, and implementation technology of Web mining, and expounds its application in competitive intelligence systems.
Research on application of WEB mining technology in e-commerce
Based on the latest research results, this paper studies Web mining technology as applied in e-commerce. Because discovering user behavior characteristics in a personalized e-commerce website is difficult, a customer-group clustering algorithm based on Web logs and a Web page clustering algorithm are given. These Web mining techniques can effectively mine users' personal characteristics and thus guide the organization and distribution of e-commerce website resources. Clustering algorithms used on Web logs in e-commerce include: a fuzzy clustering algorithm for customer groups, the k-paths clustering method, a Hamming-distance algorithm for customer-group clustering, neural network methods, a Web page clustering algorithm based on fuzzy theory, and a Hamming-distance algorithm for Web page clustering (see the sketch below).
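To illustrate one of the listed techniques, here is a minimal sketch of Hamming-distance clustering of customers over binary page-visit vectors, using a simple leader-style assignment; the sessions and threshold are toy values, not the paper's algorithm.

# Group customers whose binary visit vectors (1 = page visited)
# lie within a Hamming-distance threshold of a cluster leader.
sessions = {
    "c1": [1, 1, 0, 0, 1],
    "c2": [1, 1, 0, 0, 0],
    "c3": [0, 0, 1, 1, 0],
}

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def leader_cluster(sessions, threshold=1):
    clusters = []  # list of (leader_vector, [member ids])
    for cid, vec in sessions.items():
        for leader, members in clusters:
            if hamming(vec, leader) <= threshold:
                members.append(cid)
                break
        else:
            clusters.append((vec, [cid]))
    return [members for _, members in clusters]

print(leader_cluster(sessions))  # [['c1', 'c2'], ['c3']]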
Application of Web mining technology in search engine
For search engines, Web mining technology can improve precision and recall, improve the organization of search results, and strengthen research on models of the searching user, thereby improving retrieval efficiency.
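For reference, precision and recall are straightforward to compute; the following toy example assumes a known set of relevant documents.

# Precision: fraction of retrieved documents that are relevant.
# Recall: fraction of relevant documents that were retrieved.
retrieved = ["d1", "d2", "d3", "d4"]
relevant = {"d1", "d3", "d7"}

hits = sum(1 for d in retrieved if d in relevant)
precision = hits / len(retrieved)   # 2/4 = 0.50
recall = hits / len(relevant)       # 2/3 ≈ 0.67
print(precision, recall)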
Design and implementation of Web mining system
This paper introduces Web mining theory in three aspects: the definition of Web mining, the tasks of Web mining, and the classification of Web mining. It then briefly introduces several key technologies of the Web text mining system WTMiner (Web Text Miner): word segmentation, feature extraction, and classifier design. For word segmentation, hashing on the first character combined with binary search is used to speed up segmentation; for classifier design, considering the drawbacks of the SVM training algorithm, the nearest-neighbor method is used to reduce the number of training samples, greatly speeding up the algorithm.
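A minimal sketch of the lookup structure described (first-character hashing plus binary search within each bucket); the lexicon is a toy example, not WTMiner's dictionary.

# Bucket words by first character (the hash step) and keep each
# bucket sorted so membership tests can use binary search.
from bisect import bisect_left
from collections import defaultdict

def build_index(words):
    index = defaultdict(list)
    for w in words:
        index[w[0]].append(w)
    for bucket in index.values():
        bucket.sort()
    return index

def contains(index, word):
    bucket = index.get(word[0], [])
    pos = bisect_left(bucket, word)   # binary search within the bucket
    return pos < len(bucket) and bucket[pos] == word

idx = build_index(["数据", "数据库", "挖掘", "文本"])
print(contains(idx, "数据库"), contains(idx, "数据流"))  # True False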
Research on application of web mining in Network Marketing
This paper expounds the characteristics of network marketing and the concept of Web mining, probes into how to apply Web mining technology to network marketing, and introduces a fuzzy clustering algorithm for customer groups and Web pages.
The key technology of WEB text data mining and its application in network retrieval
On the basis of analyzing the characteristics of Web text information, this paper describes key technologies of Web text data mining such as feature extraction from target samples, word segmentation, and Web text classification, and discusses their application to network information retrieval, taking Google as an example.
Research on Web Mining system in e-commerce public service platform
Aiming at the development of electronic commerce in China, this paper applies data mining technology to improve the service quality of e-commerce public service platforms: it designs a Web mining system for such a platform and puts forward a system evaluation index system, providing new thinking and methods for the development of e-commerce public service platforms and of Chinese e-commerce. The paper studies click-stream preprocessing in the Web mining system of an e-commerce public service platform, and the use of XML to integrate heterogeneous data sources in e-commerce.
A survey of multi-relational data mining
Multi-relational data mining is one of the most rapidly developing data mining fields of recent years. Traditional data mining methods can only discover patterns within a single relation, whereas multi-relational data mining can find complex patterns involving multiple relations in complex structured data. This paper reviews the state of research in multi-relational data mining: it first analyzes the causes and background of the field, then summarizes its general methods, then introduces and analyzes the most representative multi-relational data mining algorithms, and finally summarizes the problems and challenges the field must solve as it develops.
Research on word segmentation technology and its application in Web text mining
This paper expounds the application of Chinese automatic word segmentation technology in Chinese Web text mining, discusses the underlying theory, and discusses the structure and technology of a Web text mining system. The work of this paper focuses on the following points:
(1) The focus of the study is the extraction of key Chinese information, whose difficulty lies in Chinese automatic segmentation. The paper centers on a best-matching segmentation algorithm over an automatically built word library, while using an improved Markov N-gram language model for statistical processing to resolve ambiguities in segmentation and thus improve accuracy (see the sketch after this list).
(2) Based on the specific segmentation system, a corresponding segmentation dictionary is designed that supports a fast first-character search algorithm, and it is applied in a Web mining system. Analysis shows that this method greatly improves both processing speed and ambiguity handling.
(3) For recognizing out-of-vocabulary words, the decision tree method is introduced, which improves the recognition of unknown words.
(4) For segmentation, a strategy based on the N shortest paths is adopted: in the early phase of segmentation, the N best results are recalled as a candidate set so as to cover as many ambiguous fields as possible, and the final result is selected by optimizing over these N most promising candidates once recognition is complete.
(5) To address the system resources consumed by other algorithms, the data structures used in the improved segmentation algorithm are refined and the dictionary files simplified. Most notably, index files are created for the various data files the program needs at run time, greatly reducing the memory the program requires and greatly improving segmentation speed.
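The sketch referred to in point (1): choosing among candidate segmentations with a bigram (first-order Markov) model. The probabilities are invented for illustration; a real system would estimate them from a corpus.

# Score each candidate segmentation by summing bigram log-probabilities,
# backing off to a small floor for unseen pairs.
import math

bigram_logp = {  # log P(next_word | word), toy values
    ("<s>", "发展"): math.log(0.4),
    ("发展", "中国家"): math.log(0.05),
    ("<s>", "发展中"): math.log(0.3),
    ("发展中", "国家"): math.log(0.5),
}

def score(segmentation, floor=math.log(1e-6)):
    words = ["<s>"] + segmentation
    return sum(bigram_logp.get((a, b), floor)
               for a, b in zip(words, words[1:]))

candidates = [["发展", "中国家"], ["发展中", "国家"]]
print(max(candidates, key=score))  # ['发展中', '国家'] under these toy values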
Personalized service system based on Web usage mining
A personalized service system is a user-oriented Web personalization system built from several Web usage mining technologies. The system analyzes user access patterns with data mining techniques such as transaction clustering, clustering, and association rules, and combines them with the user's current access behavior to provide real-time personalized service. Experimental results show that the personalized service system performs well.
Research of Intelligent portal search engine based on web mining
Search engines are among the most important tools for finding information quickly on the Internet, but owing to the characteristics of Chinese, the accuracy and relevance of search results are not very high. Applying Web mining technology to the search engine field to produce intelligent search engines will give users an efficient and accurate Web retrieval tool. This paper first introduces the working principles and related concepts of search engines, then introduces the definition, classification, and applications of Web mining, and finally discusses in detail the important applications of Web mining technology in intelligent search engines.
Design and implementation of information retrieval system based on web mining technology
The design and implementation of an information retrieval system based on Web text mining technology is introduced in detail. The retrieval technology integrates the ideas of text mining with the traditional information retrieval methods of single resource discovery or single information extraction, thereby discovering resources on the WWW and extracting information from them for processing.
XML-based Web data mining technology
Under economic globalization, making full use of Web resources and mining information of decision-making significance from them is of inestimable value to the independent development of enterprises. After analyzing the difficulties of Web data mining, and in light of the development trends of Internet technology, this paper introduces XML-based Web data mining technology and puts forward an implementation framework for an XML-based evaluation-information data mining system.
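As a small illustration of XML-based extraction, the following sketch parses an XML-wrapped record set with Python's standard library; the data is invented.

# Extract structured records from XML-wrapped Web data.
import xml.etree.ElementTree as ET

xml_data = """
<products>
  <product id="1"><name>Widget</name><price>9.90</price></product>
  <product id="2"><name>Gadget</name><price>19.50</price></product>
</products>
"""

root = ET.fromstring(xml_data)
records = [
    {
        "id": p.get("id"),
        "name": p.findtext("name"),
        "price": float(p.findtext("price")),
    }
    for p in root.iter("product")
]
print(records)

Wrapping semi-structured Web data in XML in this way is what makes the heterogeneous-source integration mentioned above tractable.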
Research on personalized Web content mining based on XML
XML-based Web content mining has gradually become an important research topic in Web data mining. This paper defines the user model and establishes it in three ways. Applying XML and personalization technology to Web content mining, an XML-based personalized Web content mining system (PWCMS) is designed, and the key technology and implementation of PWCMS are discussed. Practice shows that applying XML and personalization technology to Web content mining is effective.
Web personalized Information recommendation system based on data mining
Web personalized information recommendation based on data mining is increasingly becoming an important research topic. This paper designs a Web personalized information recommendation system (WBIRS) based on data mining and proposes a recommendation strategy in which different recommendation algorithms are considered for different types of users: WBIRS adopts two recommendation algorithms according to whether the user has new information needs.
Discovery of knowledge based on search engine
Data mining is typically applied to large, highly structured databases to discover the knowledge they contain. With the growth of online text, ever richer knowledge is available there, but it is difficult to analyze and use; studying an effective scheme for discovering the knowledge contained in text is therefore very important, and it is an active research topic. This paper uses the Google search engine to obtain relevant Web pages, filters and cleans them to obtain relevant text, then performs text clustering, episode-based event identification and information extraction, data integration, and data mining, thereby realizing knowledge discovery. Finally, a prototype system is given; knowledge discovery was tested in practice with good results.
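A minimal sketch of the text clustering step of such a pipeline, using TF-IDF vectors and k-means from scikit-learn; the documents are toy stand-ins for cleaned search-result text, not the paper's data.

# Vectorize cleaned result texts with TF-IDF and cluster with k-means.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

docs = [
    "stock market prices rise on trade news",
    "shares fall as market reacts to earnings",
    "new neural network improves image recognition",
    "deep learning model beats benchmark on vision task",
]

X = TfidfVectorizer(stop_words="english").fit_transform(docs)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels)  # finance docs in one cluster, vision docs in the other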
Application of data extraction and semantic analysis in Web data mining
A complex network site is treated as multiple business data sources, and data warehouse and data mining technology is adopted to extract and purify data into a mining database, so that data extraction and semantic analysis can be applied to Web data mining. On this basis, the idea of using data extraction to transform the data structure, and of applying semantic analysis technology to data extraction, is put forward, making data extraction more accurate.
Analysis of human ergonomics in China using self-organizing feature mapping algorithm in text mining
Text mining is the process of extracting valid, novel, useful, and understandable knowledge distributed across text files, and of using this knowledge to organize information better. Using the self-organizing feature map (SOM) algorithm from text mining, this paper cluster-analyzes a large number of documents from the journal database of ergonomics in China, obtains the main research categories and trends in the current domestic ergonomics research field, and compares the clustering results with the research areas published by the International Ergonomics Association (IEA).
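A minimal sketch of a SOM of the kind used for such document clustering, run on random toy vectors in place of real document features; the map size, learning rate, and decay schedule are illustrative choices, not the paper's settings.

# A tiny self-organizing map: each neuron's weight vector is pulled
# toward the inputs it (and its grid neighbours) best match.
import numpy as np

rng = np.random.default_rng(0)
docs = rng.random((20, 5))            # 20 toy document vectors, 5 features
weights = rng.random((4, 5))          # 2x2 map flattened to 4 neurons
grid = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])  # neuron coordinates

for epoch in range(50):
    lr = 0.5 * (1 - epoch / 50)       # decaying learning rate
    radius = 1.0 * (1 - epoch / 50)   # decaying neighbourhood radius
    for x in docs:
        bmu = np.argmin(((weights - x) ** 2).sum(axis=1))  # best match
        dist = np.abs(grid - grid[bmu]).sum(axis=1)        # grid distance
        influence = np.exp(-dist / (radius + 1e-9))
        weights += lr * influence[:, None] * (x - weights)

# Each document maps to its best-matching neuron, i.e. its cluster.
clusters = [int(np.argmin(((weights - x) ** 2).sum(axis=1))) for x in docs]
print(clusters)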
Research on personalized web mining in modern distance education
Discovering useful knowledge or patterns from heterogeneous, unstructured data on the Web is an important part of current data mining research. Web mining extracts interesting, potentially useful patterns and hidden information from Web documents and Web activity. This paper introduces the basics of Web mining, analyzes and studies Web-based text mining, and gives a structural model diagram of Web-based text mining. It focuses on a Web page clustering algorithm that meets the requirement of individualized learning in distance teaching, and presents an intelligent, individualized structural model of a modern distance education system based on Web mining.
A web mining model based on natural language understanding
How to find useful knowledge in the huge volume of information on the Internet and meet users' needs is an urgent research subject. Existing methods have difficulty extracting the large amount of unstructured information on the Web into databases, and general search engines simply match keywords as the basis of a query, giving a low hit rate. This paper combines natural language understanding technology with Web data mining to tailor a personalized Web data mining model to the user's needs. Preliminary test results show that the scheme is feasible and can meet users' needs, and that the model is versatile and adaptable.