The 2014 Zhongguancun Big Data Day was held on December 11, 2014 in Zhongguancun under the theme "Aggregate data assets, promote industrial innovation," exploring key issues such as data asset management and monetization, deep big data technology, industry data application innovation, and ecosystem construction. The conference also examined the needs and practices of government departments, financial institutions, telecom operators, and others in realizing transformation and industry innovation through the management and operation of data assets.
In the afternoon Finance@Big Data Forum, Yang Jin, a product manager of AsiaInfo's big data platform, gave a keynote speech on AsiaInfo's technology applications and experience.
Yang Jin: Good afternoon. I'm the last speaker. The previous experts and leaders shared big data applications in the financial industry, including internet finance, credit, and so on. Let me talk about AsiaInfo's technical applications and share our experience.
AsiaInfo is mainly focused on the carrier industry. In systems built for the three major operators (China Mobile, China Telecom, and China Unicom), our company has ranked first for many years, and we have also opened up many overseas markets. We were the architects of the internet era, and now we want to be a leader of the industrial internet, so we are moving beyond traditional carriers into other industries, including today's topic, the financial industry.
Some time ago, while communicating with a bank on a big data research program, we learned that the bank was using minicomputers for data processing, running more than 8,000 tasks a day. The core tables and models involved number more than 3,000, and each day involves about 1 TB of business data. The data business is very complex and the volume is large; some indicators only become available at T+2, meaning today's trading behavior may not be visible to leadership and business staff as analysis indicators until the day after tomorrow. In the internet era, the big data era, this latency is intolerable, so the bank needs to reach T+1. One option is to expand the traditional minicomputer architecture; the other is to use big data technology, building an x86 cluster for big data storage and adding servers to the cluster as data volume and business complexity grow. This approach also saves substantial cost.
We believe the enterprise platform goes through four stages. The first is the introduction period, when the technology is used to meet a specific scenario's needs; for example, operators use it for traffic business, small-loan queries, and so on. The second stage is the platform opening period, when the big data platform has been built and stores more and more data. We repeatedly stress that data is an important asset, but this does not mean that data sitting on hard disks has value by itself; only through continuous analysis and mining can data truly be monetized. A single vendor may not achieve this goal alone, so more vendors may need to be introduced, with different departments doing targeted development on the same big data platform to monetize the data. This phase involves effective management and allocation of the platform's resources, including subdividing permissions.
The third phase is the expansion period; the large internet companies are at this stage. They perform data mining and analysis with complex algorithms on big data platforms, and they pay more attention to platform stability and lowering investment costs. The fourth stage is maturity, when we believe Hadoop will be the core of the underlying infrastructure.
The platform also needs several capabilities. The first is efficiency: we hope the big data platform can process data efficiently in the standard SQL way. The second is resource management: as mentioned, opening the platform means introducing different departments and vendors to do data development on the same platform, so each vendor or department must be allocated a share of resources, with effective resource management and permission division. The third is platform security. Our platform achieves efficient processing through Spark technology, which is a complete ecosystem supporting batch processing, stream processing, and various other application scenarios. For offline processing, we use it for model rollups. An operator's data volume is large and its business logic complex, so data is processed in layers: a raw data layer, a data preparation layer, a summary layer, and a presentation layer. Banks also layer their data; the specific layer names and meanings differ from the operator's, but the framework is roughly similar. After the massive raw data is standardized, processed quickly, and preliminarily summarized, the volume shrinks; at that point Spark's in-memory processing can be applied, greatly improving efficiency.
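The layered processing described above can be sketched as a simple pipeline. The layer contents, field names, and aggregation rule below are illustrative assumptions for demonstration, not the actual bank or operator model:

```python
# Illustrative sketch of layered data processing: raw -> preparation -> summary -> presentation.
# All records and rules are invented for the example.

raw_layer = [
    {"user": "A", "amount": "100", "valid": True},
    {"user": "A", "amount": "50",  "valid": True},
    {"user": "B", "amount": "bad", "valid": False},  # dirty record, dropped in preparation
    {"user": "B", "amount": "70",  "valid": True},
]

# Preparation layer: clean and standardize the raw records.
prep_layer = [
    {"user": r["user"], "amount": int(r["amount"])}
    for r in raw_layer if r["valid"]
]

# Summary layer: aggregate per user; once summarized, the volume is small
# enough for fast in-memory processing.
summary_layer = {}
for r in prep_layer:
    summary_layer[r["user"]] = summary_layer.get(r["user"], 0) + r["amount"]

# Presentation layer: indicators shaped for business users.
presentation = sorted(summary_layer.items())
print(presentation)  # -> [('A', 150), ('B', 70)]
```

The point of the layering is that each stage shrinks and regularizes the data, so the expensive in-memory work happens on the small summarized set rather than the raw feed.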
Take customer queries: whether for an operator, on the internet, or in a bank, there are concepts of customer tags and customer classification. In this scenario there are 30 million users, each with more than 2,000 tags. These include natural attributes, such as age group, gender, and home address, and social attributes, such as whether someone is an IT employee or a manual worker, prefers sports, prefers staying home watching movies, or likes shopping. For these 30 million users, each with 2,000 tags, we built a big data platform implemented with Spark technology, tripling efficiency compared with the original minicomputer and saving more than a million in investment. This technique can also be used in real-time processing scenarios.
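A minimal sketch of the tag-based customer selection, assuming a simple flat tag model. In production this ran on Spark over 30 million users with about 2,000 tags each; here plain Python shows the idea, and the users and tag names are invented:

```python
# Illustrative customer-tag filtering: select users whose tags match a set of criteria.

users = {
    "13800000001": {"gender": "F", "age_band": "youth", "occupation": "IT", "hobby": "shopping"},
    "13800000002": {"gender": "M", "age_band": "youth", "occupation": "worker", "hobby": "sports"},
    "13800000003": {"gender": "F", "age_band": "senior", "occupation": "IT", "hobby": "movies"},
}

def select(users, **criteria):
    """Return phone numbers whose tags match every given criterion."""
    return [
        phone for phone, tags in users.items()
        if all(tags.get(k) == v for k, v in criteria.items())
    ]

# e.g. young IT employees, a typical marketing segment
segment = select(users, age_band="youth", occupation="IT")
print(segment)  # -> ['13800000001']
```

On Spark the same predicate runs as a distributed filter over the full user base, which is where the 3x speedup over the minicomputer came from.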
In the past, our real-time processing relied on an earlier stream-processing technology that only provided a basic streaming framework: much of the application development fell to us, and memory management during real-time processing was also our responsibility. That brought complexity and a huge increase in workload. Large internet companies such as Tencent use Spark well because they have a lot of people and resources behind it. Now that we have Spark, we can slice the stream by time window, load each window into memory, and process it there with Spark, achieving very good efficiency. We sacrifice some timeliness, since we no longer process record by record, but we gain a significant increase in throughput.
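The time-window slicing can be sketched in a few lines. The 30-second window mirrors the scenario described later in the talk; the events and timestamps are synthetic:

```python
# Minimal micro-batching sketch: instead of handling records one by one,
# events are sliced into fixed time windows and each window is processed
# as one in-memory batch, trading a little latency for higher throughput.

WINDOW = 30  # seconds per window

events = [  # (timestamp in seconds, record)
    (0, "a"), (5, "b"), (29, "c"),   # fall into window 0
    (31, "d"), (45, "e"),            # fall into window 1
    (95, "f"),                       # falls into window 3
]

windows = {}
for ts, rec in events:
    windows.setdefault(ts // WINDOW, []).append(rec)

# Each window's batch is then processed in memory in one shot.
for w in sorted(windows):
    batch = windows[w]
    print(f"window {w}: processing {len(batch)} records as one batch")
```

This is the essence of the micro-batch model: per-record latency rises to at most one window length, but each batch amortizes scheduling and I/O across many records.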
This is a project we did earlier in one province: the data and business logic analyzed in the traditional warehouse were moved onto the big data platform. We found that some scripts whose data volumes and models were modest actually ran much less efficiently on the big data platform. When the logic in a script is very complex, the big data platform splits it into many different jobs and the total processing time becomes particularly long, because the MapReduce processing framework involves a great deal of data landing to disk and serialization. After introducing Spark, efficiency improved more than 5x compared with the original MapReduce mechanism. A further big advantage is that standard SQL can be taken directly from the warehouse and run on Spark without much rewriting. This is also an important Spark component, which supports both Hive queries and standard SQL and can serve as a common solution on the platform.
This is a real-time marketing platform first built in one province, which processes user signaling data. Every time a mobile phone powers on or off, or moves to a specific location, it produces signaling data, including the phone number, time, and current location; this information is very useful. Signaling data arrives at nearly 50,000 records per second, and the province has 80 million users. The business need is to analyze this data to determine each user's current location and track of location changes, apply marketing rules to those changes, filter out the users to target through the marketing platform, and send them to the marketing system. For example, as often happens, shortly after you walk into a shopping mall you receive a message reminding you that the mall is running a promotion; that is a typical application of this platform. We implemented it with Spark Streaming, using a 30-second time window: the data is loaded into memory and matched against user information. Our output is an enhancement of the signaling data: the user's previously computed tags and related user information are joined onto the original signaling records, and business development is done in the enterprise-standard SQL way.
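The match step inside each 30-second batch can be sketched as follows. The tag values, locations, rules, and messages are all invented for illustration; only the overall flow (enrich signaling records with tags, then match against location-based rules) comes from the talk:

```python
# Sketch of the real-time marketing match step: each batch of signaling
# records is enriched with precomputed user tags, then matched against
# location-based marketing rules to produce targeted messages.

user_tags = {
    "13800000001": {"segment": "shopper"},
    "13800000002": {"segment": "sports_fan"},
}

rules = [
    # (location, required segment, campaign message)
    ("mall_01", "shopper", "Mall promotion: 20% off today!"),
    ("stadium", "sports_fan", "Match tickets on sale now!"),
]

def match_batch(signaling_batch):
    """Enrich signaling records with tags and emit (phone, message) targets."""
    targets = []
    for rec in signaling_batch:
        tags = user_tags.get(rec["phone"], {})
        enriched = {**rec, **tags}  # signaling enhancement: attach the user's tags
        for loc, segment, message in rules:
            if enriched.get("location") == loc and enriched.get("segment") == segment:
                targets.append((enriched["phone"], message))
    return targets

batch = [
    {"phone": "13800000001", "location": "mall_01", "time": "2014-12-11T14:00:00"},
    {"phone": "13800000002", "location": "mall_01", "time": "2014-12-11T14:00:05"},
]
print(match_batch(batch))  # -> [('13800000001', 'Mall promotion: 20% off today!')]
```

In the real system the targets list is what gets handed off to the downstream marketing system for message delivery.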
The system's first benefit is large-scale data processing: each time window processes 3 million records in under 30 seconds. Business logic can also be developed rapidly in standard SQL. For example, during the seven days of the October 1 Golden Week we run marketing recommendations for travel products, while for Double 11 the relevant products might only be recommended in the day or two around November 11. Tomorrow is Double 12; I don't know whether you have received related promotional messages, but if you have, the sender is making good use of a real-time marketing platform and doing the data processing well.
Spark technology has become particularly hot from last year to this year, and in AsiaInfo's big data platform department we have been tracking and researching Spark since early 2013. We have trained seven Spark specialists, and our next goal is to develop more top Spark experts who can play a major role in Spark's development and make Spark perform even better in our products and platforms.
Having talked about efficient data processing, let's talk about resource allocation. I mentioned Hadoop. In the Hadoop 2.0 era there is the YARN component, which manages the Hadoop framework and a mixed architecture of different frameworks such as Spark on the same cluster, while making effective use of resources. Previously, resources could only be allocated through a coarse abstraction; with YARN, fine-grained resource allocation and management can be realized in Hadoop. When opening the big data platform we introduce different vendors and departments, and we treat each vendor or department as a tenant. The platform allocates resources, CPU and memory, to each tenant with limits, including minimum and maximum values. This guarantees that when different vendors submit tasks to the platform, each tenant's guaranteed resources are available to it; and when the minimum cannot meet a tenant's needs, it can obtain more by seizing idle resources to handle peaks.
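One common way to express per-tenant minimum and maximum shares in YARN is the Capacity Scheduler configuration. The queue names and percentages below are illustrative assumptions, not the actual deployment; the property names are standard Capacity Scheduler settings:

```xml
<!-- capacity-scheduler.xml: illustrative two-tenant layout, not the real deployment -->
<configuration>
  <property>
    <name>yarn.scheduler.capacity.root.queues</name>
    <value>vendorA,vendorB</value>
  </property>
  <!-- guaranteed minimum share for tenant vendorA -->
  <property>
    <name>yarn.scheduler.capacity.root.vendorA.capacity</name>
    <value>40</value>
  </property>
  <!-- vendorA may seize idle resources up to this ceiling at peak -->
  <property>
    <name>yarn.scheduler.capacity.root.vendorA.maximum-capacity</name>
    <value>70</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.vendorB.capacity</name>
    <value>60</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.vendorB.maximum-capacity</name>
    <value>90</value>
  </property>
</configuration>
```

The `capacity` values are the guaranteed minimums (they sum to 100 across sibling queues), while `maximum-capacity` is the elastic ceiling a queue may grow to by borrowing idle resources, matching the minimum/maximum behavior described above.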
Here we use network security protocols to secure the service interaction among the big data platform's own servers. The platform also has business and data interactions with many external systems, and these expose computing and storage resources through interfaces. We have additionally extended the security components to implement access control, now providing comprehensive control over data reads and writes to satisfy various application scenarios. Different kinds of people may use our big data platform: business staff, for example, are only given access to the core data model, while testers are only assigned read permissions to check data quality. Our fine-grained security management can meet these needs well.
Finally, let me take this opportunity to advertise our products: AsiaInfo's own big data platform, a data analysis platform for customers built on Spark and Hadoop. It includes two products. One is our Hadoop product, integrated and differentiated on top of the open source community version, meeting batch processing, streaming, and other application scenarios. The other is the OCDC data analysis product, which provides process orchestration, user management, and so on. At the same time we adhere to the principle of openness and sharing: technology comes from the community, and we contribute back to the community. We also provide professional services for all products, including our own, covering deployment, optimization, upgrades, and so on. That is all of my introduction today. Thank you.
(Responsible editor: Mengyishan)