Distributed data processing system based on Hadoop
The above is Internet record data and retained log data, which are constantly growing. We build a data warehouse, lightly summarize the raw data into data for statistical analysis, and use that data to build portraits of Internet users. We have also established an Internet user identity library. If you use WeChat, we know that you are not only a Unicom user but also a WeChat customer. If you use Weibo, we can collect your Weibo ID, so we know both your phone number and your Weibo ID. The same applies to QQ: if we did not previously know your QQ number, this data supplements the original data.
China Unicom will have more and more channels through which to contact you: not only your mobile phone number but also WeChat. Unicom has built many systems and is also building a data distribution and open platform. We hope the data can be opened up: after some processing, including privacy processing, the data can be distributed to China Unicom's own business systems and also to third-party business systems, which can use it for their own data analysis and mining.
Unicom mainly uses Hadoop and HDFS for statistical analysis and mining work
At present the platform has three NameNode nodes, cluster monitoring nodes, and data loading service nodes, as well as seven ZooKeeper nodes. We also provide web nodes for the query service, and we have built the data center network.
Our Internet record data includes the user's number and the network currently carrying the traffic: whether the user is on the 2G network, came up through GPRS, or came up through WCDMA. It also includes where the user went online: we know the base station number, and for international roaming we know the base station as well, so we can tell by SDSD whether you are in Thailand, Malaysia, or Singapore. It also covers the access method and the type of service; we identify every service you use. The information captured and saved includes the service type, the traffic of the previous page and of the next page, the start and end times, the server-side IP address, the type of terminal, and the type of application on the terminal.
The volume of Internet record data is currently 2 trillion records per month. The red histogram shows average daily traffic from January to October: traffic on the Unicom mobile network was 550 TB in January and has now risen to 1 PB. In January the daily record count was 32 billion; by October the daily average was 75 billion. In November it was often more than 80 billion records per day, with a peak of 87.8 billion. Overall, volume is growing at a rate of about 10%.
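As a rough check on the growth figures above, the following sketch computes the compound growth rate implied by the January and October daily record counts. Treating the growth as compounding monthly over nine intervals is an assumption made for illustration; the talk does not say how the 10% figure was measured.

```python
# Daily record counts quoted in the text (records per day).
JAN_DAILY_RECORDS = 32e9
OCT_DAILY_RECORDS = 75e9
MONTHS_ELAPSED = 9  # January to October; assumes monthly compounding


def implied_monthly_growth(start: float, end: float, months: int) -> float:
    """Return the constant monthly growth rate that takes `start` to `end`."""
    return (end / start) ** (1 / months) - 1


rate = implied_monthly_growth(JAN_DAILY_RECORDS, OCT_DAILY_RECORDS, MONTHS_ELAPSED)
print(f"Implied monthly growth: {rate:.1%}")
```

Under that assumption the result comes out a little under 10% per month, which is consistent with the growth rate quoted in the talk.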
Looking at the number of records per day: October 1 is modest, because during the holiday users do not go online much; perhaps everyone is traveling or gathering with friends and family. After the 7th the record count grows rapidly until the 23rd or 24th, when users have either exceeded their data allowance or are conserving what little is left, and growth slows down. Around the 28th and 29th there is a slight rebound, as users who still have some allowance left feel free to use it. Even a simple look at record counts can, to some extent, reflect the behavior of a user group.
Data loading into the warehouse is at its lowest around five o'clock every morning, when most people are asleep and few Internet records are generated. After seven o'clock there is a significant increase, and noon is a peak. Six or seven o'clock in the evening is a small trough, and nine o'clock at night is the peak period for mobile Internet use, which is also the peak period for Tencent's WeChat. At more than 70 billion records per day, the peak loading rate is 1.2 million records per second.
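To put the peak loading rate in context, this small sketch compares it with the average rate implied by the daily volume. It uses 70 billion records per day as a round lower bound taken from the text; the variable names are illustrative only.

```python
SECONDS_PER_DAY = 24 * 60 * 60

daily_records = 70e9      # "more than 70 billion" records per day, from the text
peak_per_second = 1.2e6   # peak loading rate, from the text

# Average rate if the daily volume were spread evenly over the day.
average_per_second = daily_records / SECONDS_PER_DAY
peak_to_average = peak_per_second / average_per_second

print(f"Average loading rate: {average_per_second:,.0f} records/s")
print(f"Peak-to-average ratio: {peak_to_average:.2f}")
```

Because the text says "more than" 70 billion, the true average is somewhat higher and the ratio somewhat lower. Either way, the loading path has to be sized for the evening peak rather than the daily mean.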
As for traffic distribution by province, the first goal was to handle traffic complaints, so the query service was opened to the 10010 hotline and front-end customer service. It is now also open to end users: they can download the mobile business hall app on their phone and query their Internet records for high-traffic usage. It is now fully open.
At present, our collection covers all ports of Unicom's mobile network. A record is guaranteed to be queryable within 30 minutes of the traffic occurring; in actual operation it takes about 10 minutes, so Internet records from 10 minutes ago can already be traced. We currently hold four months of data, are upgrading and expanding the system, and want to retain data for longer. Statistical analysis data is retained for no less than five years. At present, with 2 trillion records in a single table, we can ensure that a front-end query takes no more than 2 seconds: even if a user has tens of thousands of Internet records, we display them in the customer service interface within 2 seconds. At this speed, when a user calls 10010 with a traffic complaint, the agent can, with the user's permission, query the user's Internet records and give an answer.
We can see the URL accessed in each record, the client used, and the terminal, all of which are available for query. Mobile self-service query is also available and can provide query services for high-traffic usage. The 10010 system currently handles about 15,000 queries per day, and queries from mobile phones now also hold at 40,000 to 50,000 per day. Under this load the system still maintains overall service quality.
Monitoring and planning optimization of mobile networks
In the past, planning was based on forecasts derived from network traffic volume, local traffic volume, local economic development, and GDP growth. The three major operators spend hundreds of millions on their networks every year, yet network resources are in overall surplus: overall utilization is below 50% and the network is lightly loaded. Even so, there are many network complaints, such as calls that cannot get through in some areas and slow Internet access. Base stations have not been built where they are most needed. For example, we want good 3G coverage at a 5A-rated scenic spot, but in practice users there do not spend their time on their phones, and basic coverage is mostly what they need. If we build deep coverage there to absorb traffic, the base stations end up lightly loaded and much of the operator's investment is wasted.
Monitoring tools are also lacking. We build a lot of indoor coverage with many base stations, but if an indoor base station is not being used, the operator can hardly detect it. If an indoor base station fails, outdoor coverage takes over: users can still make phone calls and use mobile Internet services, only with a degraded experience, so how do we find out? We believe the planning and construction of mobile base stations must match the actual distribution of user traffic. With users' Internet record data we can clearly understand the current distribution of mobile Internet traffic, and planning according to that distribution can effectively improve the precision of network construction and the effectiveness of investment.
With Internet record data, we can tell when a base station appears to have had no traffic for two days. For a station in an office building that may be expected over a weekend, but if the two days are a Thursday and a Friday we should raise an early warning, because there may be a problem.
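The early-warning idea described here can be sketched as follows. This is a minimal illustration, not the platform's actual logic: the function name, the input layout (station ID mapped to per-day traffic), and the rule of ignoring Saturdays and Sundays are all assumptions.

```python
from datetime import date


def silent_weekday_stations(traffic, min_days=2):
    """Return IDs of stations with zero traffic on `min_days` consecutive weekdays.

    `traffic` maps a station ID to a dict of {date: bytes carried that day}.
    Weekend days are skipped, so a quiet weekend neither raises a warning
    nor interrupts a run of silent weekdays (hypothetical rule).
    """
    flagged = []
    for station_id, by_day in traffic.items():
        run = 0
        for day in sorted(by_day):
            if day.weekday() >= 5:  # 5 = Saturday, 6 = Sunday
                continue
            run = run + 1 if by_day[day] == 0 else 0
            if run >= min_days:
                flagged.append(station_id)
                break
    return flagged


# Illustrative data: 2024-01-04 is a Thursday, 2024-01-06 is a Saturday.
example = {
    "office_A": {date(2024, 1, 4): 900, date(2024, 1, 5): 850,
                 date(2024, 1, 6): 0, date(2024, 1, 7): 0},
    "office_B": {date(2024, 1, 4): 0, date(2024, 1, 5): 0,
                 date(2024, 1, 6): 0, date(2024, 1, 7): 0},
}
print(silent_weekday_stations(example))
```

In this example only `office_B` is flagged, since `office_A` is silent only over the weekend. A real deployment would also need to treat days with no records at all as zero traffic, which this sketch leaves to the caller.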
Data is centralized
We ran a pilot in one area, using analysis of current base station traffic to guide the next phase of base station construction, and found that it did achieve the goals of being accurate, effective, and satisfactory. For example, when the analysis shows a very large data volume on a 2G base station, it may mean that 3G base stations do not effectively cover that location: users have demand, but all fall back to the 2G base station. Building a 3G base station there ensures that the investment is accurate and effective.
We also do statistical analysis and data mining. For example, we can see the traffic distribution of each identified service, such as QQ traffic: the lowest point is at five o'clock in the morning, and the peak is between 21:00 and 22:00.
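One way to picture the 2G fallback analysis is the sketch below, which ranks areas where 2G data volume is both large and dominant. The function name, the input layout, and both thresholds are hypothetical; the talk does not describe how the pilot actually selected sites.

```python
def candidate_3g_sites(area_traffic, min_2g_volume, min_2g_share=0.5):
    """Return (area, 2G volume, 2G share) tuples, largest 2G volume first.

    `area_traffic` maps an area ID to a (2G volume, 3G volume) pair in any
    consistent unit. An area qualifies when its 2G volume is at least
    `min_2g_volume` and 2G carries at least `min_2g_share` of its data.
    """
    candidates = []
    for area, (vol_2g, vol_3g) in area_traffic.items():
        total = vol_2g + vol_3g
        if total == 0:
            continue  # no data traffic at all, nothing to infer
        share = vol_2g / total
        if vol_2g >= min_2g_volume and share >= min_2g_share:
            candidates.append((area, vol_2g, share))
    return sorted(candidates, key=lambda c: c[1], reverse=True)


# Illustrative volumes only.
areas = {"A": (900, 100), "B": (50, 950), "C": (400, 0)}
for area, volume, share in candidate_3g_sites(areas, min_2g_volume=300):
    print(area, volume, f"{share:.0%}")
```

Here areas `A` and `C` are reported and `B` is not, because most of its data already flows over 3G. High 2G volume is only a hint of missing 3G coverage, so any such list would still need checking against terminal capabilities and existing site plans.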
Looking at the value of applying big data:
First, it can raise the level of service to users. Internet records support customer service by making accurate query services possible, and they also support business marketing. Big data can be used for precision marketing, as well as decision support, evaluation of business status, development of the operator's overall strategy, and network optimization and management.
Second, with the collected Internet data, events can be better reconstructed afterwards. For example, when a user cannot get online, we can accurately reconstruct the failed connection process, identify which step went wrong, and pinpoint the specific network equipment at the source.
Open Web Data
The first is to provide open service interfaces. For example, people currently receive a lot of spam SMS. Why is it spam? Because it is sent without targeting. Using users' behavior and portraits, we can send with precise targeting. Sending a message may currently cost two cents each; the service we provide is also two cents per message, but with guaranteed quality: messages go to the people who should most receive them.
The pros and cons of big data for telecom operators versus Internet companies. Operators have users' real identity information: to open service with China Unicom, China Mobile, or China Telecom you must present an ID card. We also have actual payment information: whether you are on the 220 or the 386 package this month is very accurate data, and it includes your level of spending. Internet companies find it hard to obtain accurate user identity data. We also have full-dimension information on user behavior: whether you are accessing Weibo or Taobao, it all flows through the operator's network. An Internet company such as Taobao sees only its own data, and Baidu likewise sees only its own data, whereas we have a more comprehensive view. We see the data of the process but do not know what was finally purchased; what was viewed along the way, such as the names of the products browsed, is very clear to us. We feel the two are complementary.
Big data mining applications enable intelligent operations
We can do churn early warning by analyzing users who have left the network. Based on those users' consumption behavior in the months before they left, we build a suitable model that can warn, a day in advance, which users are likely to leave the network.
In addition, we can offer differentiated services such as personalized recommendations. Some recommendations may be real-time and some not. We combine the results of data mining with real-time data updates from the front end, so that we know the user's context: Where are you now? What time is it? What position are you in? What type of user are you? Combining these aspects makes personalized recommendation more accurate.
For intelligent advertising, the questions are what the goal is, who to target, and through which channel to deliver. Previously the operator's only channel to reach a user was the mobile phone number. Through network behavior data we may also know the user's WeChat ID, which gives us more open channels. From the operator's point of view, these channels may also be opened to third-party applications, which can reach users through them by invoking services.
Our work on driving traffic, analysis of our proprietary services and packages, analysis of the gap between our Wo Store and 91 Assistant, refined operations, and support for LTE decision-making have all achieved results. These results are still preliminary; the bigger prospects lie ahead.
Summary
China Unicom relies on open source Hadoop technology to build its data platform for Internet records. This platform is currently the first in the global communications industry to achieve network-wide collection and centralized storage of Internet records, and also the first to provide users with real-time query services. So far no second operator has done this.
Relying on the big data platform, we have achieved customer service innovation and, in a sense, solved the problem of transparent consumption, so that users can spend with peace of mind. The platform has also been applied to the operator's network planning and construction. We have preliminarily built a big data analysis and mining platform; the next step is to further build a big data distribution and open platform and share the data with partners, on the premise of protecting user privacy. We are also actively expanding data sources: current collection is limited, fixed-network broadband data should be collected by this time next year, and other data collection will follow.
The experience of using Hadoop
First, do not underestimate the growth in data volume. Second, optimize continuously: build a full-time team to carry out system optimization. Taobao's optimization methods may differ from Unicom's, because each business has its own characteristics. It is also very important to pay close attention to the internal network that interconnects the Hadoop cluster, which matters greatly for stable and efficient data mining across the whole cluster. We previously had network problems that caused the whole cluster to run more and more slowly over a period of time until it crashed and had to be restarted, and this repeated cyclically; many problems are caused by the network. Statistical analysis should also be properly separated from query. Our system provides real-time data loading and query, and running a large amount of statistical analysis affects loading, which may get squeezed, while the impact on queries is small. For that reason we run only some regular tasks on this cluster, and the results of those tasks are built on another cluster. Finally, the database structure should be designed carefully in advance.