Terms related to Web Data Mining

Source: Internet
Author: User

 

Web data mining applies data mining algorithms to large volumes of network data. For a specific application model, it performs data extraction, data filtering, data conversion, data mining, and pattern analysis, and then uses inductive reasoning to predict customers' personalized behaviors and usage habits. The results support decision-making and management and help reduce decision risk.

Web data mining draws on multiple fields. In addition to data mining itself, it involves computer networks, databases and data warehousing, artificial intelligence, information retrieval, visualization, natural language understanding, and other technologies.

1) Web Data Mining Classification

Web data mining can be divided into four types: Web content mining, Web structure mining, Web usage record mining, and Web user nature mining. The first three already existed in the Web 1.0 era, while Web user nature mining emerged together with Web 2.0.

  1. Web content mining (WCM). Web content mining is the process of obtaining potential and valuable knowledge or patterns from the content and descriptions of files on the web. The mining objects are text documents or multimedia documents, so it can be divided into text mining and multimedia mining.
  2. Web structure mining (WSM). The basic idea of Web structure mining is to regard the Web as a directed graph whose vertices are web pages and whose edges are the hyperlinks between pages. Graph theory is then used to analyze the Web topology.
  3. Web usage record mining (WUM). Web usage record mining is also called Web log mining or Web access information mining. It mines web log records to discover users' page access patterns. By analyzing regularities in the log records, it can identify users' preferences and satisfaction, discover potential users, and enhance the competitiveness of the site's services.
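
The directed-graph view used in Web structure mining can be illustrated with a minimal sketch. The link data and the function name below are hypothetical, and counting incoming links (in-degree) is only one simple graph measure, chosen here for illustration rather than taken from the source.

```python
from collections import defaultdict

# Hypothetical link data: each page maps to the pages it links to.
links = {
    "/home": ["/products", "/about"],
    "/products": ["/home", "/cart"],
    "/about": ["/home"],
    "/cart": ["/home", "/products"],
}

def in_degrees(graph):
    """Count incoming hyperlinks (edges) for every page (vertex)."""
    counts = defaultdict(int)
    for source, targets in graph.items():
        counts[source] += 0  # keep pages that nobody links to
        for target in targets:
            counts[target] += 1
    return dict(counts)

degrees = in_degrees(links)
# Pages sorted from most to least linked-to.
ranking = sorted(degrees, key=degrees.get, reverse=True)
```

A page with many incoming links is a candidate "hub" of the site topology. More elaborate link-analysis measures build on the same graph representation.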

In addition to server log records, the data for Web usage record mining includes proxy server logs, browser logs, registration information, user session information, transaction information, cookie information, user queries, and any other records of interaction between users and sites. Two methods are used to mine Web usage records:

  1. Use the web server log files as raw data and apply specific preprocessing methods before mining.
  2. Convert the web server log files into charts or tables, and then perform further data mining.

Once the raw data has been preprocessed, traditional data mining methods can be applied to it.
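
As a rough illustration of turning raw log lines into records that can be mined, the sketch below parses one line into a dictionary. It assumes the server writes the widely used Common Log Format; the source does not name a log format, and the sample line and function name are made up for the example.

```python
import re
from datetime import datetime

# Assumption: logs follow the Common Log Format, e.g.
# host ident user [time] "METHOD url PROTOCOL" status size
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) \S+" (?P<status>\d{3}) (?P<size>\S+)'
)

def parse_line(line):
    """Return a dict for one log line, or None if the line is malformed."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    record["time"] = datetime.strptime(record["time"], "%d/%b/%Y:%H:%M:%S %z")
    record["status"] = int(record["status"])
    return record

# Made-up sample line using a documentation-only IP address.
sample = '192.0.2.1 - - [10/Oct/2023:13:55:36 +0000] "GET /index.html HTTP/1.1" 200 2326'
record = parse_line(sample)
```

Lines that do not match are returned as None so that a later cleaning step can count or discard them.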

Web user nature mining

If Web usage record mining traces the footprints that visitors leave on websites, Web user nature mining goes to the users' own home ground. In the Web 2.0 era, the network is thoroughly personal: it lets users build their own Internet according to their own methods, preferences, and personalized custom services. On the one hand, this gives Internet users the greatest freedom; on the other hand, it offers high-value information for interested sellers to explore. Statistical analysis of the customer information found in users' self-built RSS feeds, blogs, and other Web 2.0 functional modules can help operators obtain accurate information about customer interests, personalized requirements, and new business trends at low cost. Data mining under Web 2.0 is still being researched.

2) Web data features

  1. Heterogeneous database environment. Each site on the Web is a data source, and each data source is heterogeneous. Therefore, the information and organization of each site are different, which constitutes a huge heterogeneous database.
  2. Distributed Data Source. Web pages are distributed on Web servers around the world, forming a distributed data source.
  3. Semi-structured. Semi-structure is the most prominent feature of Web data. Data on the Web is very complex and has no specific model that describes it. Because it is not fully structured, it is called semi-structured data.
  4. Dynamic. The Web is a dynamic and powerful source of information. The information is constantly updated, and the link information and access records of various sites are updated frequently.
  5. Diverse complexity. The web contains various types of information and resources, including text data, hypertext data, charts, images, audio data, and video data.

3) Typical Web Mining Process

  1. Search for resources: based on the mining purpose, extract relevant data from web resources to form the target dataset, which is what the later mining steps operate on. The task is to obtain data from the logs of the target website and from network databases.
  2. Data preprocessing: filter out "impure" data before mining. For example, eliminate inconsistencies and merge data from multiple sources into a single data store. The quality of preprocessing directly affects the rules and patterns that the mining algorithms produce. Data preprocessing mainly includes site identification, data selection, data cleaning, user identification, and session identification.
  3. Pattern discovery: use mining algorithms to find valid, novel, potentially useful, and ultimately understandable information and knowledge. Common pattern discovery technologies include path analysis, association rule mining, sequential pattern discovery, clustering, and classification.
  4. Pattern analysis: analyze, interpret, and visualize the mined patterns with appropriate tools and technologies, converting the discovered patterns into knowledge.
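
The user identification and session identification steps of preprocessing can be sketched as follows. This is a minimal example under two assumptions that are not in the source: a user is identified by client address alone, and a gap of more than 30 minutes between two requests starts a new session. The request data is hypothetical.

```python
from collections import defaultdict

SESSION_TIMEOUT = 30 * 60  # seconds; an assumed inactivity threshold

def build_sessions(requests, timeout=SESSION_TIMEOUT):
    """Group (user, timestamp, url) tuples into per-user sessions."""
    by_user = defaultdict(list)
    for user, timestamp, url in requests:
        by_user[user].append((timestamp, url))

    sessions = []
    for user, hits in by_user.items():
        hits.sort()  # order each user's requests by time
        current = [hits[0][1]]
        for (prev_time, _), (time, url) in zip(hits, hits[1:]):
            if time - prev_time > timeout:
                # Long silence: close the current session, start a new one.
                sessions.append((user, current))
                current = []
            current.append(url)
        sessions.append((user, current))
    return sessions

# Hypothetical requests: (client address, seconds since start, url).
requests = [
    ("10.0.0.1", 0, "/home"),
    ("10.0.0.1", 60, "/products"),
    ("10.0.0.2", 30, "/home"),
    ("10.0.0.1", 4000, "/home"),
]
sessions = build_sessions(requests)
```

Identifying users by address alone is crude, since proxies and shared machines blur it. Cookies or login information, where available, would give a more reliable key.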

4) Common Web Mining Technologies

  1. Path analysis technology. Path analysis can be used to determine the most frequently visited paths in a site, along with other information about those paths. This information can be used to improve the structural design of the site.
  2. Association rule technology. Association rule mining is mainly used to mine rules from the items in a database of user access sequences, that is, to discover relationships among the page files that a user requests from the server within one session, even when those pages do not reference each other directly. Association rules can drive many related information or product services. For example, if information A and B are browsed together by many users, A and B may be related, and the more users who view both, the stronger the relationship. The system can use this idea to recommend related information or products to users. For example, the Dangdang online bookstore uses this model to recommend related books.
  3. Sequential pattern mining technology. In a set of transactions ordered by timestamp, sequential pattern discovery looks for patterns across transactions such as "some items follow another item". Discovering sequential patterns makes it easier to predict users' access patterns and to provide targeted services.
  4. Classification and clustering technology. Discovered classification rules describe the common attributes that identify a particular group, and this description can be used to classify users. Clustering analysis can group users with similar characteristics based on web access data. Clustering user information or data items in web transaction logs helps with the development and design of future service models and service groups.
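
The "A and B are browsed together" idea behind association rules can be sketched with support and confidence for page pairs. This is a simplified illustration limited to pairs, not a full association rule miner; the session data, the function name, and the 0.5 support threshold are assumptions made for the example.

```python
from collections import Counter
from itertools import combinations

def pair_rules(sessions, min_support=0.5):
    """Return {(a, b): (support, confidence)} for rules 'a implies b'."""
    total = len(sessions)
    page_counts = Counter()
    pair_counts = Counter()
    for pages in sessions:
        unique = sorted(set(pages))  # count each page once per session
        page_counts.update(unique)
        pair_counts.update(combinations(unique, 2))

    rules = {}
    for (a, b), count in pair_counts.items():
        support = count / total  # share of sessions containing both pages
        if support < min_support:
            continue
        # Confidence: share of sessions with the left page that also
        # contain the right page.
        rules[(a, b)] = (support, count / page_counts[a])
        rules[(b, a)] = (support, count / page_counts[b])
    return rules

# Hypothetical sessions, each a list of viewed items.
sessions = [
    ["A", "B", "C"],
    ["A", "B"],
    ["A", "C"],
    ["B", "D"],
]
rules = pair_rules(sessions)
```

Here every session that contains C also contains A, so the rule from C to A has confidence 1.0, which would make A a natural recommendation for visitors viewing C.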

5) Application of Web Log Mining in Customer Relationship Management (CRM)

  1. Customer acquisition. In most business areas, the ability to acquire new customers is one of the main indicators of business development. Marketing staff can develop new customers through traditional methods such as advertising campaigns, or they can segment target customer groups and then run direct marketing campaigns. However, as the number of customers grows and the amount of detail held on each customer increases, it becomes difficult to choose screening criteria over the relevant demographic attributes. Data mining can help screen potential customers.
  2. Customer retention. As competition in the industry grows fiercer and the cost of acquiring a new customer rises, retaining existing customers becomes more and more valuable. When implementing CRM, an enterprise can predict which customers it may lose and analyze the main factors that lead them to leave. On this basis, it can direct retention efforts at the customers who are likely to leave.
  3. Customer segmentation. Segmentation means dividing a large consumer population into groups. Consumers in the same group are similar to each other, and consumers in different groups are considered different. By segmenting customer groups as part of CRM, the enterprise can keep improving its products and services according to each group's requirements and so keep raising customer satisfaction.

6) Application of Web Log Mining in e-commerce websites

An e-commerce website operator needs to know not only which products users care about, but also how anonymous users become registered users and at what conversion rate, whether anonymous users reach the site directly or through search engine links, what their purchasing behavior looks like, and how they convert. For email marketing, the operator analyzes how long silent users have been inactive and judges the marketing effect from the outgoing volume, returned volume, and transaction volume. The effect of advertising promotion is reflected in the impression volume, click volume, and transaction volume.
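
The ratios implied by these volumes can be written out as a small sketch. The function names and all figures are hypothetical, and the definitions used (clicks per impression, transactions per click, registrations per anonymous visitor) are common conventions assumed here rather than definitions given in the source.

```python
def rate(numerator, denominator):
    """Return numerator / denominator, or 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0

def ad_campaign_metrics(impressions, clicks, transactions):
    """Summarize an advertising campaign from its three volumes."""
    return {
        "click_through_rate": rate(clicks, impressions),
        "conversion_rate": rate(transactions, clicks),
    }

def registration_rate(anonymous_visitors, new_registrations):
    """Share of anonymous visitors who became registered users."""
    return rate(new_registrations, anonymous_visitors)

# Made-up figures for illustration only.
ad = ad_campaign_metrics(impressions=10000, clicks=250, transactions=10)
signup = registration_rate(anonymous_visitors=2000, new_registrations=100)
```

Comparing these ratios across traffic sources, such as direct visits versus search engine links, is one way to see which channel converts better.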

  1. Summary statistics. The summary statistics of the website include the period covered by the analysis, the total number of page views, the number of sessions, the number of unique visitors, and result sets such as average traffic, peak traffic, last week's traffic, and yesterday's traffic.
  2. Content access analysis. Content access analysis covers the most visited pages, the most frequently followed paths, the most read news items, and the longest visit durations.
  3. Customer information analysis. Customer information analysis includes statistics on the visitor's province of origin, the browser and operating system used, the referring page or website, the IP address, and the search engine used by the visitor.
  4. Visitor activity cycle analysis. Behavior analysis of the visitor activity cycle covers access behavior across the 7 days of the week and the 24 hours of the day, the most visited days of the week, and the most visited periods of the day.
  5. Main access error analysis. Analysis of major access errors covers server errors and page-not-found errors.
  6. Website column analysis. Website column analysis covers customized channels and column settings, and is used to collect and analyze access statistics for each column.
  7. Business website extension analysis. Business website extension analysis is access analysis for special topics, multimedia files, downloads, and other content.
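
Several of the statistics listed above can be computed directly from parsed log records. The following sketch assumes each hit has already been reduced to a (visitor, hour, url, status) tuple; that record layout, the function name, and the sample data are assumptions made for the example.

```python
from collections import Counter

def summarize(hits):
    """Summarize (visitor, hour, url, status) tuples from a parsed log."""
    page_views = Counter(url for _, _, url, _ in hits)
    by_hour = Counter(hour for _, hour, _, _ in hits)
    # Treat HTTP status codes of 400 and above as access errors.
    errors = Counter(status for _, _, _, status in hits if status >= 400)
    return {
        "total_hits": len(hits),
        "unique_visitors": len({visitor for visitor, _, _, _ in hits}),
        "top_pages": page_views.most_common(3),
        "busiest_hour": by_hour.most_common(1)[0][0] if by_hour else None,
        "errors": dict(errors),
    }

# Hypothetical hits: (visitor address, hour of day, url, HTTP status).
hits = [
    ("10.0.0.1", 9, "/home", 200),
    ("10.0.0.1", 9, "/news", 200),
    ("10.0.0.2", 14, "/home", 200),
    ("10.0.0.3", 9, "/missing", 404),
    ("10.0.0.2", 14, "/home", 500),
]
report = summarize(hits)
```

The same counters extend naturally to per-weekday, per-browser, or per-referrer breakdowns once those fields are extracted from the log.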

7) Web Business Intelligence (BI)

  1. Abnormal access analysis. A normal user accesses a website by sending URL requests through a browser. "Abnormal access" refers to high-speed, mechanical, continuous URL requests issued by a program rather than a browser. This includes attacks by malicious programs and crawling by search engine spiders. The abnormal access module mainly includes five functions: abnormal access analysis, search engine access analysis, error analysis, abnormal URL analysis, and time-period access analysis. Through abnormal access analysis, users can discover abnormal access behavior and its regularities. By examining time-series trends in URL request frequency, server processing time, request traffic, and similar measures, you can identify the points a hacker is attacking, troubleshoot software errors, diagnose server processing capacity, and locate the bandwidth bottleneck of the website's Internet connection.
  2. Channel association analysis. Channel association analysis is aimed at content managers. At the content service level, a website is abstracted as "channel, subchannel, content", which forms a website structure tree. The purpose of association analysis is to discover the associations between the elements of a transaction. The discovered associations can guide how links are set up, steering traffic in the direction that the managers intend.
  3. Specific association analysis. Channel association analysis works at the level of the site's internal logic. Association analysis for advertisements and for the pages that users care about most provides the data that website administrators want to have. Which pages contribute to advertising? Which pages do advertising customers view most? How is special content related to the other URLs of the website, and how strong is the association? The Web-DM "specific association analysis" provides in-depth analysis results and presents them to users in a simple and intuitive way.
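
The request-frequency idea behind abnormal access analysis can be sketched with a sliding time window. This is a minimal illustration, not the method of any particular product: the thresholds (more than 5 requests within 10 seconds), the function name, and the sample data are all assumptions.

```python
from collections import defaultdict

def flag_fast_clients(requests, max_per_window=5, window=10):
    """Return the clients that send more than max_per_window requests
    within any span of `window` seconds.

    requests: iterable of (client, timestamp_in_seconds) pairs.
    """
    by_client = defaultdict(list)
    for client, timestamp in requests:
        by_client[client].append(timestamp)

    flagged = set()
    for client, times in by_client.items():
        times.sort()
        start = 0
        for end, current in enumerate(times):
            # Advance the left edge until the window spans <= `window` seconds.
            while current - times[start] > window:
                start += 1
            if end - start + 1 > max_per_window:
                flagged.add(client)
                break
    return flagged

# Hypothetical traffic: "bot" requests once per second, "human" every 30 s.
requests = [("bot", t) for t in range(8)] + [("human", t * 30) for t in range(8)]
flagged = flag_fast_clients(requests)
```

A high request rate alone does not distinguish a hostile program from a search engine spider, so a flagged client would still need to be checked against the other analyses described above.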
