Any new technology goes through a process that runs from first catching the public's attention to eventual widespread adoption. Big data, as a new data processing technology, has only just begun to be applied across industries after nearly a decade of development. In the eyes of the media and the public, however, big data still carries an air of mystery and seems to have a magical power to mine wealth and predict the future. Widely circulated examples include Target inferring from a girl's shopping history that she was pregnant, and credit card companies predicting a customer's next purchase from the times and places of past purchases. Big data also paints appealing pictures of the "smart city", "intelligent transportation", and "intelligent healthcare". These descriptions give us high hopes and expectations for big data technology.
I have summed up two important phenomena, or application trends, from the big data applications of 2014. The first is that big data technology is being applied first to SQL processing of structured data, to address the shortage of processing power caused by growing data volumes. This is the opposite of what many people advertise, namely that big data technologies are best suited to unstructured data rather than structured data. We find that enterprises face two challenges. On the one hand, accumulated data keeps growing, from gigabytes to terabytes (some enterprise customers have petabytes, but only a few). On the other hand, as applications multiply and grow more complex, computing capacity is increasingly unable to keep up. Over the years, most companies have built their applications on traditional relational databases such as DB2 or Oracle according to business requirements. As data and applications have grown rapidly, these databases take longer and longer to run the applications. Even with only 1 TB of data, the complexity of the business logic can mean that a statistics job on a traditional relational database that used to produce daily reports can now produce only weekly reports. Such delays severely limit an enterprise's productivity. As IT systems increasingly become the business itself, an inefficient IT system seriously damages competitiveness. The data waiting to be processed is the enterprise's structured business data, and the existing applications are built on SQL. This is the objective reason for the development of distributed SQL-on-Hadoop technology, and it is the practical requirement that drives Star Ring Technology to improve both the performance and the completeness of its SQL support.
The second phenomenon is the increasingly strong demand for real-time processing of time-series data. With the spread of electronic instruments such as sensors and monitoring equipment, enterprises have more and more real-time data. The traditional approach is to load the data generated by these instruments into a database and analyze it afterwards in batches. As devices and data multiply, the latency of this approach keeps rising. Processing the data with stream processing technology as it arrives can greatly improve an enterprise's response speed and work efficiency. In 2014, Star Ring Technology deployed a large number of stream processing clusters to handle real-time data ranging from user data to sensor data.
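The contrast with the store-then-analyze approach can be sketched as an aggregate that is updated as each reading arrives. This is only a minimal single-process illustration, not the deployed system: the class name `SlidingWindowMean`, the 10-second window, and the sample readings are all assumptions made for the example.

```python
from collections import deque


class SlidingWindowMean:
    """Keeps the mean of readings seen in the last `window_seconds` (hypothetical helper)."""

    def __init__(self, window_seconds):
        self.window = window_seconds
        self.items = deque()  # (timestamp, value) pairs, oldest first
        self.total = 0.0

    def add(self, ts, value):
        """Ingest one reading and return the up-to-date window mean."""
        self.items.append((ts, value))
        self.total += value
        # Evict readings that have fallen out of the time window.
        while self.items[0][0] <= ts - self.window:
            _, old = self.items.popleft()
            self.total -= old
        return self.total / len(self.items)


# Made-up sensor readings: (timestamp in seconds, measured value).
readings = [(0, 10), (5, 20), (12, 30)]
window = SlidingWindowMean(window_seconds=10)
means = [window.add(ts, value) for ts, value in readings]
```

Each call to `add` does a small, bounded amount of work, so a result is available as soon as a reading arrives instead of after a batch load and a full query.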
I expect these two trends to become even more pronounced in 2015. What follows is a brief summary of the past year's big data applications in telecom operators, finance, logistics, industry and commerce, transportation, energy, radio and television, and other fields.
Telecom operators
Operators face many new challenges in the mobile internet era. WeChat and other mobile messaging apps have eroded operators' voice and SMS revenue, which makes data traffic services more important. At the same time, wireless network service is the operators' core competitive strength. In recent years operators have invested heavily in building networks and developing 4G. When 4G coverage or quality is poor and connections fall back from 4G to 3G or 2G, customer satisfaction drops sharply.
After a year or two of exploration, operators have settled on two directions for their big data platforms: using big data technology to improve operational efficiency, and exploring new business models and data-driven operations. Over the past year, the gains in operational efficiency have been validated, while new business models are still being explored. At Guangdong Mobile, applying Star Ring in-memory computing technology to business data analysis cut the time to calculate more than 800 indicators from 30 hours on Oracle to 4 hours. At Shanghai Mobile, we migrated the data traffic management system from DB2 to Star Ring TDH, where operating efficiency is 5 times higher than on the original cluster. Our complete SQL support is what makes such application migrations possible; partners had previously tried to migrate the applications to a well-known Hadoop distribution without success. We are also taking part in 4G network optimization projects for one provincial telecom operator and one municipal mobile operator. In these projects, our partners use the higher-performance Star Ring TDH in place of a traditional MPP database to build network optimization models and run them at high speed. On the one hand, this surfaces network problems such as signal drops and helps operators quickly locate problem areas. On the other hand, the complete SQL support, statistics, and machine learning algorithms provided by TDH are used to find the best models and parameters, so that the network can be tuned at fine granularity to improve coverage and signal quality.
Finance
Between 2013 and 2014, the state-owned banks and some joint-stock banks explored big data technology to varying degrees, but the early applications were limited to simple historical transaction queries and the storage and retrieval of unstructured data, and had no impact on banks' core business. The widely publicized promise of big data in banking is that, by jointly processing a bank's own structured transaction data and external internet and government data, a bank can manage customers at a finer level and reduce credit risk. That vision did not materialize in 2014, and 2015 is expected to be a year of application exploration. We did, however, put some practical applications into production at banks in 2014. In these applications, TDH serves as a supplement to the data warehouse to make data analysis more efficient. Thanks again to our complete SQL support, one joint-stock bank began moving some complex loan risk-control logic onto the TDH Hadoop platform. The customer had previously tried these risk-control models on several MPP databases and Hadoop distributions, none of which met its performance or functionality requirements. From a technical standpoint, the analyses involve only a few terabytes of data, but the business logic is extremely complex: it involves nearly a hundred fact tables and dimension tables, and some tables are tens of thousands of bytes wide. This case shows that traditional relational databases and MPP databases find it increasingly difficult to handle the computation in big data scenarios, and that banks need a more efficient data processing tool.
Express delivery
The data volumes and load generated by the express delivery industry's IT systems received little attention in the past. In recent years, the industry has expanded rapidly along with the fast growth of e-commerce. Huge market demand has brought unprecedented challenges: every year, the "Double 11" shopping festival puts far more pressure on express companies' processing capacity than ordinary days do. How to relieve warehouse overflow during "Double 11" and keep express delivery from slowing down is therefore a problem that every express company faces.
How to improve and optimize the express delivery process through big data analysis is a question worth studying, and an important way to make the industry more competitive. Every step of express operations generates large amounts of data. By monitoring this data and adjusting the receiving and dispatch capacity of processing centers nationwide, as well as shift and delivery plans, in real time, a company can reduce costs. By analyzing the data to predict business trends, a company can prepare for surges in demand. However, the data from express operations is large in volume, highly concurrent, and varied in type, and the applications built on it have demanding real-time requirements. Traditional databases are stretched thin under these conditions.
Together with Hua Sheng Tiancheng, we deployed a big data platform for the express delivery arm of China Post EMS. It processes data from collection, processing, and distribution centers across the country (including received, retained, under, not under, posted, undelivered, receiver, address, sealed, shipped, not shipped, and so on). The platform dynamically loads data from the ESB (enterprise production bus) into a stream processing cluster and a real-time database, performs real-time statistics and indicator monitoring, and supports real-time data queries. The deployment gives the customer easy-to-use tools for monitoring every part of the business in real time, so that even amid a huge volume of express shipments it can quickly and accurately spot problems such as backlogged, lost, or damaged items and improve service quality. The platform stably withstood the data processing load of "Double 11" in 2014. In the future, it can also use the latest operational data to help express companies adjust and optimize delivery plans and reduce costs.
Industry and commerce
In building the national "Economic Register" database, the industry and commerce authorities have accumulated a large amount of data on market entities, annual inspections, law enforcement, 12315 complaints, and more. Statistical analysis of this data helps the authorities understand the market and the state of the economy.
One simple application of big data technology is data quality management and statistical analysis. Because the data is entered by hand, errors are inevitable, even if the error rate is not high. At the same time, the basic information about enterprises and individuals is scattered across dozens of relational tables, and the information in those tables overlaps and cross-references to some extent. Large-scale cross-comparison and statistics over the data can uncover hidden errors so that they can be corrected promptly. This application uses Star Ring in-memory computing technology, which completes verification and statistics over the full data set in 10 minutes and greatly improves efficiency.
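The cross-comparison idea can be sketched in a few lines: the same attribute of the same entity is collected from several tables, and any entity whose values disagree is flagged for review. This is an illustrative sketch only; the table names, the `reg_no` and `legal_rep` fields, and the sample rows are hypothetical and are not taken from the actual system.

```python
from collections import defaultdict


def find_conflicts(tables, key="reg_no", field="legal_rep"):
    """Return {key: sorted distinct values} for keys whose `field` disagrees across tables."""
    seen = defaultdict(set)
    for rows in tables:
        for row in rows:
            value = row.get(field)
            if value is not None:
                seen[row[key]].add(value.strip())
    return {k: sorted(v) for k, v in seen.items() if len(v) > 1}


# Hypothetical rows from two tables that both record an enterprise's legal representative.
registration = [
    {"reg_no": "A001", "legal_rep": "Li Wei"},
    {"reg_no": "A002", "legal_rep": "Zhang Min"},
]
annual_inspection = [
    {"reg_no": "A001", "legal_rep": "Li Wei "},   # trailing space only, not a real conflict
    {"reg_no": "A002", "legal_rep": "Zhang Ming"},  # likely a manual entry error
]

conflicts = find_conflicts([registration, annual_inspection])
```

On a distributed SQL engine the same check would more naturally be written as a join or a grouped count of distinct values; the in-memory version here only shows the logic.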
Big data technology is also used in the market entity information query system, which can handle concurrent queries from hundreds of millions of users and return query or search results within a few hundred milliseconds. The enterprise history snapshot query lets users track changes to an enterprise's records and understand how enterprises change over their life cycle. Beyond solving the storage and query problems, we also help customers quickly discover associations among enterprises and related people using a graph computing engine. By scanning the entire database, we identify relationships between enterprises based on equity and service, and build an enterprise association database.
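One simple way to picture association discovery is as finding connected groups in a graph whose nodes are enterprises and people and whose edges are equity or service links. The sketch below uses a plain breadth-first search in memory; it is an assumption-laden illustration, not the graph computing engine itself, and the node names and links are invented.

```python
from collections import defaultdict, deque


def association_groups(links):
    """Group nodes that are connected, directly or indirectly, by any link."""
    graph = defaultdict(set)
    for a, b in links:
        graph[a].add(b)
        graph[b].add(a)

    seen, groups = set(), []
    for start in sorted(graph):
        if start in seen:
            continue
        seen.add(start)
        queue, group = deque([start]), []
        while queue:
            node = queue.popleft()
            group.append(node)
            for neighbour in graph[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    queue.append(neighbour)
        groups.append(sorted(group))
    return groups


# Hypothetical equity and service links between people and companies.
links = [
    ("person:Wang", "co:Alpha"),  # Wang holds shares in Alpha
    ("person:Wang", "co:Beta"),   # Wang is a director of Beta
    ("co:Beta", "co:Gamma"),      # Beta holds shares in Gamma
    ("person:Zhao", "co:Delta"),
]
groups = association_groups(links)
```

Here Alpha, Beta, and Gamma end up in one group because they are all reachable through Wang, even though Alpha and Gamma share no direct link.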
Power
With the rapid build-out of information systems in electric power enterprises and the full deployment of smart grids, power data will grow far faster than power enterprises expect. On the generation side, for example, as power production becomes more automated, indicators such as pressure, flow, and temperature are monitored with greater precision and at higher frequency, which places higher demands on the collection and processing of massive data. On the consumption side, each increase in data collection frequency leads to exponential growth in data volume. The growth of power data has far exceeded the processing capacity of the relational databases used in the power sector.
In 2014, we mainly helped the power sector process consumption-side data. We were surprised to find that statistical analysis of power data involves very complex SQL. Technically, it makes heavy use of Oracle's PL/SQL extensions, including stored procedures, control flow, exception handling, data modification such as deletes, and transaction processing. From the application point of view, this SQL logic is mainly used for historical statistics on electricity consumption, analysis of consumption trends, and the calculation of line losses. Using machine learning methods, we helped customers find that electricity consumption correlates to some degree with macroeconomic trends and climate, and is also closely tied to conditions in each industry and each enterprise. Comparing an enterprise's electricity consumption with the level for its industry reveals how energy-efficient the enterprise is, and analyzing its historical consumption reveals changes in production activity or the effect of energy-saving measures. One power supply bureau in southern China uses statistics from the TDH platform to identify energy-saving, environmentally friendly enterprises as well as large consumers, and subsidizes the energy-saving enterprises. The aim is to promote energy saving and emission reduction across society and to steer industry from high energy consumption toward low energy consumption and high efficiency.
We have also deployed a pilot fault handling system for a power department. Together with our partners we built a unified power supply topology model: a graph database stores the entire supply topology from the user to the substation, a stream processing system raises alarms in real time, and real-time queries against the network topology quickly determine where an outage is and how far its impact extends. On this basis, repair teams can be notified of the outage and power can be restored promptly. Users can also be notified proactively, which improves interaction with them, and staff get a comprehensive and intuitive view of how outages are distributed across the network.
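Determining the extent of an outage from the topology amounts to walking the supply graph downstream from the failed element. The following sketch assumes a simple in-memory adjacency map in place of a graph database; the node naming scheme (`substation:`, `feeder:`, `transformer:`, `user:`) and the sample network are hypothetical.

```python
from collections import deque


def affected_users(feeds, failed):
    """Return all user nodes supplied, directly or indirectly, by the failed node.

    `feeds` maps each node to the list of nodes it supplies.
    """
    queue, seen, users = deque([failed]), {failed}, []
    while queue:
        node = queue.popleft()
        if node.startswith("user:"):
            users.append(node)
        for child in feeds.get(node, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return sorted(users)


# Hypothetical supply topology: substation -> feeders -> transformers -> users.
feeds = {
    "substation:S1": ["feeder:F1", "feeder:F2"],
    "feeder:F1": ["transformer:T1", "transformer:T2"],
    "feeder:F2": ["transformer:T3"],
    "transformer:T1": ["user:1001", "user:1002"],
    "transformer:T2": ["user:1003"],
    "transformer:T3": ["user:1004"],
}

impacted = affected_users(feeds, "feeder:F1")
```

The resulting list is what a notification step would consume, both for dispatching repair teams and for informing the affected users.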
Transportation
With rapid economic growth and an ever-growing number of motor vehicles, traffic congestion is becoming more and more serious across the country. How to use information technology to improve traffic management and ensure road safety has become an important topic.
The common approach today is to deploy digital surveillance equipment at road checkpoints. The equipment captures image and video data around the clock, and a single province or municipality recognizes tens of millions of vehicle records per day. This data is mainly used to give traffic management real-time information on road conditions, which can later be released to the public as a reference for travel. It also supports traffic management work through real-time analysis applications such as monitoring of key commercial vehicles, identification and tracking of violating vehicles, interval speed measurement, and license plate analysis. Together with partners, we deployed a province-wide traffic monitoring system for the traffic management department of a provincial public security bureau. It uses a distributed queue to collect vehicle information from every traffic checkpoint in real time and a stream computing cluster to compute real-time statistics and monitor passing-vehicle records, and it implements many of the real-time analysis applications listed above. The end-to-end delay of processing information in the system is less than 2 seconds, which makes traffic management more efficient.
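Interval speed measurement, one of the applications mentioned above, reduces to dividing the known distance between two checkpoints by the time a vehicle took to travel between them. The sketch below shows that calculation under assumed values; the plate numbers, the 12 km section length, and the 100 km/h limit are made up for illustration.

```python
from datetime import datetime


def interval_speed_kmh(t_enter, t_exit, distance_km):
    """Average speed between two checkpoints, in km/h."""
    seconds = (t_exit - t_enter).total_seconds()
    if seconds <= 0:
        raise ValueError("exit time must be later than entry time")
    return distance_km * 3600 / seconds


def speeding_plates(passes, distance_km, limit_kmh):
    """Return plates whose average speed over the section exceeds the limit.

    `passes` is an iterable of (plate, entry time, exit time) tuples.
    """
    return [
        plate
        for plate, t_in, t_out in passes
        if interval_speed_kmh(t_in, t_out, distance_km) > limit_kmh
    ]


# Hypothetical matched checkpoint records for a 12 km section.
passes = [
    ("A12345", datetime(2014, 11, 11, 8, 0, 0), datetime(2014, 11, 11, 8, 6, 0)),
    ("B67890", datetime(2014, 11, 11, 8, 1, 0), datetime(2014, 11, 11, 8, 10, 0)),
]
flagged = speeding_plates(passes, distance_km=12, limit_kmh=100)
```

In a streaming deployment the matching of entry and exit records per plate would happen continuously as records arrive; here the pairs are assumed to be matched already.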
Of course, big data applications in the transportation industry are still in their infancy: centralized collection of big data has only just begun or is only nearing completion. With the powerful analysis and mining capabilities of big data technology, the real-time transparency of traffic information can be improved significantly, traffic and congestion management can be raised to a higher level, accident rates can be reduced, and urban planning can be better informed.
Radio and television
In China, radio and television systems are feeling the impact of the digital wave, and network-based film and television distribution poses a great challenge to traditional operators. Against this background, Chinese media companies are keenly aware that, to survive and gain a competitive advantage in future networked media, they must turn toward the user and become operators of "precisely targeted" content and distribution. The data infrastructure that Chinese digital media requires must meet the storage and management needs of large-volume, multi-source, and diverse data, support linear scaling of the platform hardware, and deliver fast, real-time analysis results that can be put to use in the business quickly. A Chinese media company chose us to deploy a big data platform, on which a digital TV analysis system has been developed. The system provides real-time rankings based on the full data set. Along dimensions such as time (hour, day, or week) and user, it offers ranking analysis, period-over-period analysis, and trend analysis of on-demand programs, live programs, program categories, and search keywords. Along dimensions such as time, channel, film type, and drama series, it can also analyze user preferences according to number of views, number of new views, number of completed views, and completion rate. In addition, by collecting and analyzing user behavior data, the company can build accurate customer profiles. Using an intelligent recommendation engine, the system can anticipate what viewers want before they know it themselves, predict which programs will become popular, and tailor recommendations to each user, improving product reach and strengthening user loyalty. By labelling metadata such as actors, plot, tone, and genre, the system can also understand audience preferences and support analysis in preparation for follow-up film and television production and other content development.
Thanks to the digital TV analysis system built on the big data platform, Chinese media is making a "graceful transformation" from content distribution to content production.
E-commerce
In e-commerce, big data has arguably become a key business support technology, playing an important role in marketing, customer care, and many other areas. Working with Jinjiang E-commerce, we used a big data platform to build a product recommendation system. We first built a customer tagging system on the platform. Drawing on the e-commerce site's large number of members and visitors, we mine customer behavior data in depth and, based on the RFM model and customer information, derive customer tags such as consumption preferences, age, family status, and even star sign and Chinese zodiac sign, along with purchase frequency, spending amount, and mode of travel. We then run cluster analysis on the customer tags to form customer segments. This lets us reach customer groups accurately and carry out precision marketing. At the same time, we help the customer build a product tagging system. Product tags are constructed and mined according to the characteristics of hotels, travel, and other product types. A machine learning process then matches customer tags to product tags, weighting the various tag types, to build the intelligent recommendation system.
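The RFM model mentioned above describes each customer by recency (how recently they bought), frequency (how often), and monetary value (how much they spent). The sketch below derives simple tags from those three measures. It is only an illustration of the idea: the thresholds, tag names, and order records are assumptions, and a real tagging system would combine many more signals.

```python
from datetime import date


def rfm_labels(orders, today, r_days=30, f_min=3, m_min=1000):
    """Derive RFM-style tags per customer from (customer, order date, amount) records."""
    stats = {}  # customer -> (last order date, order count, total spend)
    for customer, day, amount in orders:
        last, freq, total = stats.get(customer, (day, 0, 0))
        stats[customer] = (max(last, day), freq + 1, total + amount)

    labels = {}
    for customer, (last, freq, total) in stats.items():
        tags = []
        if (today - last).days <= r_days:
            tags.append("recent")       # R: bought within the last r_days
        if freq >= f_min:
            tags.append("frequent")     # F: at least f_min orders
        if total >= m_min:
            tags.append("high_value")   # M: total spend of at least m_min
        labels[customer] = tags
    return labels


# Hypothetical order history.
orders = [
    ("c1", date(2014, 12, 20), 600),
    ("c1", date(2014, 11, 2), 300),
    ("c1", date(2014, 10, 5), 400),
    ("c2", date(2014, 6, 1), 2000),
]
labels = rfm_labels(orders, today=date(2014, 12, 31))
```

Tags like these can then be fed into cluster analysis alongside the other customer attributes to form customer segments.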
This recommendation system recommends products intelligently and is gradually becoming an important part of the customer care and precision service systems.
Summary and outlook
Looking back at the industry applications of Hadoop big data in 2014, some are simple applications that may not have been anticipated, while others are complex data analysis and mining applications. Big data is a new data processing and analysis technology, with computing power beyond existing technologies and deep data mining capabilities. But the value of the technology has to be demonstrated through the applications built on it, so how to apply these capabilities to real problems is something every industry is exploring. A large number of innovative applications based on big data technologies are expected to emerge in 2015.
Over the past year, big data technology has proven that it can significantly improve operational efficiency. We expect that in the coming year, using SQL-on-Hadoop technology to cope with exploding data volumes will become a common application trend. As SQL support and performance continue to improve, enterprises that use big data technology for structured data processing will see immediate gains in operational efficiency and productivity.
2014 was the year big data technology began to be put into practice, and we saw huge demand for big data technologies and products. We are very optimistic about the development of big data in 2015 and beyond. Its rapid development will continue for a long time: there is still a great deal of value waiting to be mined from data, and more and more enterprises, government agencies, and public organizations will need big data solutions. Let us work together to bring excellent big data products to more users and help the public solve their data processing problems!
(Responsible editor: Mengyishan)