The Application Architecture of Hadoop Technology in Telecom Operators' Internet Log Processing
Fang Jianguo
First, the status quo of Internet log processing at telecom operators
Today, with the popularization of the mobile Internet, enormous volumes of Internet access logs are generated every day. Because of the huge data volume and limited storage space, these logs can typically be retained for only about three days before being discarded. At present, telecom operators analyze customer behavior mainly on the basis of CDRs (call detail records), so a great deal of useful customer behavior information is missing from the analysis. For example, two people with similar calling patterns may be completely different types of customers; if they are treated the same, customers will inevitably respond poorly, a great deal of resources will be wasted, and good results cannot be achieved. Because CDR analysis cannot see the content of a call, two people with similar behavior patterns (such as long late-night calls) but completely different calling purposes and lifestyles (one chatting with friends at night, the other making work calls required by overtime) are often mistakenly grouped into one class, leading to a serious misunderstanding of the customers and poor marketing.
A user's Internet access behavior contains a wealth of customer characteristics and customer demand information. This information is crucial, but traditional CDR analysis cannot provide it. Therefore, users' Internet log records must be saved, and the data must be analyzed and mined; the user's behavior habits can then be characterized according to the processing results, providing an important marketing basis for telecom operators to realize refined operation.
With the advent of the Hadoop technology platform, it has become possible both to store the Internet logs and to provide data mining and processing functions on top of them.
Second, the Internet log data processing flow
The Internet log data is processed as follows:
1. Extract the URL addresses from the Internet log data.
2. Classify known URLs according to the standard URL classification criteria.
3. For unknown URLs, first crawl the webpage data, then classify the pages with the webpage classification model, and continuously optimize the model to improve the accuracy of webpage classification.
4. Based on the URLs each person visits and the corresponding website categories, use a model to calculate the personal preferences of each phone number, providing a basis for precision marketing.
The specific process is shown in the figure below.
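As a concrete illustration of steps 1 and 2, the following minimal sketch extracts the host name from each logged URL and looks it up in a category table. The tab-separated log layout, the `CATEGORY_TABLE` contents, and the function names are illustrative assumptions, not the operator's actual schema.

```python
from urllib.parse import urlsplit

# Hypothetical category table mapping a site's host name to a standard
# category; in a real system this would be loaded from the URL library.
CATEGORY_TABLE = {
    "weibo.com": "social-weibo",
    "taobao.com": "shopping",
}

def extract_host(url):
    """Step 1: pull the host name out of a URL found in the log."""
    host = urlsplit(url).hostname or ""
    # Strip a leading "www." so "www.weibo.com" and "weibo.com" match.
    return host[4:] if host.startswith("www.") else host

def classify_url(url):
    """Step 2: classify a known URL; unknown hosts go to the crawler."""
    return CATEGORY_TABLE.get(extract_host(url), "unknown")

# Assumed log format: phone number, timestamp, URL, tab-separated.
line = "13800000000\t2016-01-01 20:15:02\thttp://www.weibo.com/u/123"
phone, ts, url = line.rstrip("\n").split("\t")
print(phone, classify_url(url))  # -> 13800000000 social-weibo
```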
Third, the technical architecture of the Internet log system
Based on the above processing flow, the logical architecture of the Internet log processing system is shown in the figure below.
The specific functions of each part are described below.
Data source
Basic user information and Internet log information are regularly obtained from the telecom operator's system servers and loaded into the cluster's HDFS file system and HBase database, as sketched below.
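A minimal sketch of the loading step, assuming the standard `hdfs dfs` command-line client is installed on the collection server; the directory layout and file paths are hypothetical.

```python
import subprocess
from datetime import date

# Hypothetical HDFS landing directory, partitioned by day.
target = f"/data/raw/netlog/{date.today():%Y%m%d}"

# Create the partition directory, then upload the day's log file;
# "hdfs dfs -put" copies a local file into the HDFS file system.
subprocess.run(["hdfs", "dfs", "-mkdir", "-p", target], check=True)
subprocess.run(
    ["hdfs", "dfs", "-put", "/var/spool/netlog/part-0000.log", target],
    check=True,
)
```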
Interface layer
The interface layer is responsible for exchanging data with external systems, including the collection of user data and Internet log data, the crawling of Internet web content, and access for peripheral systems.
The data interface implements data exchange with relational databases such as Oracle and DB2, including the collection and loading processes; it also supports file-type data, which can be collected by FTP or similar means (see the sketch below). The system provides a unified external access interface that is open and high-performance and offers monitoring, management, and security features.
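For the file-type interface, the following is a minimal sketch of FTP collection using Python's standard `ftplib`; the host name, account, and remote directory are placeholders, not real systems.

```python
import ftplib

# Placeholder connection details for the upstream log server.
with ftplib.FTP("ftp.example.internal") as ftp:
    ftp.login("collector", "secret")   # hypothetical account
    ftp.cwd("/export/netlog")          # hypothetical remote path
    for name in ftp.nlst():            # list the files to collect
        with open(name, "wb") as local:
            # RETR downloads one file; the callback writes each block.
            ftp.retrbinary("RETR " + name, local.write)
```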
Data layer
The data layer is a distributed big data processing platform that extracts the required data from the data sources, cleans it, and finally loads it into distributed storage according to a predefined data model. The cleaning, conversion, validation, and loading processes are implemented on the distributed computing framework, as in the mapper sketch below.
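A minimal sketch of the cleaning step written as a Hadoop Streaming mapper. It assumes tab-separated raw records containing a phone number, a timestamp, and a URL; malformed or invalid records are dropped and valid ones are re-emitted for loading.

```python
#!/usr/bin/env python3
# Hadoop Streaming mapper: reads raw log lines on stdin and writes
# cleaned records to stdout. The three-field layout is an assumption.
import sys

for line in sys.stdin:
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 3:
        continue                      # drop malformed records
    phone, ts, url = fields
    if not (phone.isdigit() and url.startswith("http")):
        continue                      # drop records that fail validation
    # Re-emit the validated record for the load step.
    print("\t".join((phone, ts, url)))
```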
Resource layer
The distributed architecture makes hardware deployment more complex, so the physical resource layer and the system platform layer need to be further abstracted to provide automated deployment and flexible operation and maintenance capabilities. The resource layer therefore enables the automatic deployment of physical resources, dynamic expansion, and flexible deployment of the different roles in a distributed cluster.
Functional layer
The functional layer modularizes the data processing flow, provides cluster access control, and is responsible for Hadoop cluster operation management and for system alarm and log management. Data processing can be scheduled as any serial-parallel flow; node priority, timeout, and retry counts can be controlled; and routing decisions are supported, so that different branches of the flow are executed under multi-branch conditions. An asynchronous scheduling strategy supports scheduling at large volume.
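A toy sketch of serial-parallel flow scheduling with per-node retry counts, using Python's `concurrent.futures`; the stage layout and task names are illustrative only, not the platform's actual scheduler.

```python
from concurrent.futures import ThreadPoolExecutor

def run_with_retry(task, retries=2):
    """Run one flow node, retrying up to `retries` times on failure."""
    for attempt in range(retries + 1):
        try:
            return task()
        except Exception as exc:
            if attempt == retries:
                raise
            print(f"{task.__name__} failed ({exc}), retrying")

def run_stage(tasks):
    """Run the nodes of one stage in parallel and wait for all of them."""
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(run_with_retry, t) for t in tasks]
        return [f.result() for f in futures]

# Hypothetical flow nodes standing in for real processing steps.
def extract(): return "extracted"
def classify_known(): return "classified"
def crawl_unknown(): return "crawled"

# Stage 1 runs alone; stage 2 runs its two branches in parallel.
run_stage([extract])
run_stage([classify_known, crawl_unknown])
```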
Application layer
The application layer is responsible for implementing the specific algorithms. It implements webpage classification indexing: registered Internet URLs and their categories are crawled, the data is managed uniformly and placed in persistent storage, and the categories are reorganized into corresponding hierarchies, such as (social - community) or (social - weibo), so that they can be indexed. It implements classification thesaurus management: popular Internet terms and common words are crawled, and a thesaurus is built according to their categories; the thesaurus is updated regularly and continuously improved. It implements unified analysis of user behavior: based on customers' access behavior, their preference characteristics are identified, customers are segmented according to content preference characteristics, and extraction of target customer groups is supported for marketing activities (a minimal sketch follows). It also implements the unified management of URL addresses.
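A minimal sketch of the preference calculation, assuming each cleaned record has already been reduced to a (phone number, category) pair; visit counts per category are normalized into a simple preference profile. A production model would also weight factors such as dwell time and recency.

```python
from collections import Counter, defaultdict

# (phone number, category) pairs produced by the classification step.
visits = [
    ("13800000000", "social-weibo"),
    ("13800000000", "social-weibo"),
    ("13800000000", "shopping"),
]

# Count category visits per phone number.
profiles = defaultdict(Counter)
for phone, category in visits:
    profiles[phone][category] += 1

# Normalize counts into preference scores that sum to 1 per user.
for phone, counts in profiles.items():
    total = sum(counts.values())
    prefs = {cat: round(n / total, 2) for cat, n in counts.items()}
    print(phone, prefs)  # -> 13800000000 {'social-weibo': 0.67, 'shopping': 0.33}
```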
Display layer
The display layer is responsible for presenting the results of the application functions through Web pages. It provides interactive pages, making the various application functions easy to use, and displays the results dynamically.
Web crawler
The web crawler is responsible for crawling the specific content of web pages from the Internet. A crawler is a program or script that automatically fetches World Wide Web information according to certain rules. The program extracts the URL addresses from the log files and then filters and deduplicates them: the filtering operation removes URLs for images, video, software, and similar content, while the deduplication operation removes URLs that have already been crawled or already been classified (see the sketch below). The remaining URLs enter the crawler's address library; the crawler generates the URLs to be crawled according to certain rules, crawls the generated URLs concurrently by means of MapReduce, and finally stores the crawled content in the HDFS file system.
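A minimal sketch of the filtering and deduplication step that feeds the crawler's address library; the extension blacklist and the already-crawled set are illustrative assumptions.

```python
from urllib.parse import urlsplit

# Extensions the crawler should skip (images, video, software, etc.).
SKIP_EXTENSIONS = (".jpg", ".png", ".gif", ".mp4", ".exe", ".apk", ".zip")

def filter_urls(candidate_urls, already_done):
    """Yield URLs worth crawling: no media/software, not seen before."""
    seen = set(already_done)
    for url in candidate_urls:
        path = urlsplit(url).path.lower()
        if path.endswith(SKIP_EXTENSIONS):
            continue          # filtering: drop non-page content
        if url in seen:
            continue          # deduplication: already crawled or classified
        seen.add(url)
        yield url             # goes into the crawler's address library

urls = ["http://a.com/x.jpg", "http://a.com/page", "http://a.com/page"]
print(list(filter_urls(urls, already_done=[])))  # -> ['http://a.com/page']
```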
Because the crawling process requires access to Internet resources while the Hadoop cluster that processes the data is interconnected with the telecom operator's internal network, and because the Hadoop cluster's own security measures are not comprehensive, access between the two must be strictly controlled to ensure the security of the network deployment.
In the physical architecture design, two complete internal cluster networks need to be designed, with a firewall performing access control between them. The physical network deployment topology of the Internet log processing system is shown in the figure below.
Fourth, the advantages of the Internet log system solution
The advantages of the Hadoop-based solution for the Internet log system are as follows.
The Hadoop solution is based on a distributed infrastructure and takes full advantage of the two core distributed technologies, the distributed file system and the distributed computing framework, to build a complete distributed storage and distributed computing system.
The distributed system is highly fault tolerant and is designed to run on ordinary x86 PC servers, which greatly reduces server and storage costs as well as database license costs, effectively easing the cost pressure of system expansion.
The Hadoop solution uses a software framework for the parallel processing of large data sets. When handling big data, it decomposes tasks through distributed computing and processes them on multiple nodes. If an error occurs on one server in the cluster, the overall computation does not stop: the distributed system guarantees data redundancy across the cluster in the event of a failure. This design allows the crawler, the webpage classification, and the online behavior models to run quickly and efficiently.
At present in China, many well-known enterprises, such as Tianyun Big Data, Huawei, and AsiaInfo, have proposed complete solutions for Internet log processing systems along the lines of the above architecture. In particular, Tianyun Big Data Co., Ltd., relying on its self-developed BDP platform software (a complete solution that includes the Hadoop platform) and its strong algorithm team, has successfully deployed an Internet log processing system on the operating platform of a provincial carrier, providing strong support for that company's precision marketing.
About the author: Fang Jianguo is a three-time Microsoft MVP and the senior director of information technology at a well-known real estate company. He has participated in the deployment and maintenance of several large production environments and has conducted in-depth research on server, storage, and virtualization architectures and solutions (server consolidation and virtual desktop infrastructure), especially Windows-based virtualization solutions.