Introduction: Can the Internet and big data play a positive role in public emergencies, especially man-made ones such as the recent stampede in Shanghai, and help prevent such tragedies from recurring? For this session of the IT Hall of Fame we sat down with Mr. Sun Yuanhao, founder of Star Ring Technology, for an exclusive interview at the 2015 China Hadoop Technology Summit.
Sun Yuanhao believes that new technical means can detect changes in crowd flow on the Bund and give the public security and transportation departments actionable guidance. For example, camera footage can serve as a data source for early warning. Subway fare-card data and rail transit data can be used to gauge passenger flow; when anomalies appear in the subway data, the public security department can coordinate directly with the transport department to divert people. Second, operator base-station signals can be added as a data source: they carry the approximate positions of users' phones, from which population density and its trend can be judged quickly. Since phones move with their owners, the direction of movement across base stations can predict how the dense area will shift. Combining these sources yields all-round monitoring from the rail network underground to the ground and the air, and the results can be fed back quickly to the police to guide their response. In addition, traffic data collection matters as well: motor vehicles passing the Bund, and indeed anywhere in the city, leave records, so we can quickly determine which vehicles have stayed rather than left and infer that congestion may be building there. That can be fed back immediately to the transport department, which can bar all commercial vehicles from passing the Bund and so relieve the pressure. Taken together, these measures could help prevent such incidents.
Del Piero: In the age of big data, data is a much-debated topic. Many people say that data is dead and people are alive, and that the world of data mining is both a minefield and a gold mine. What, in the end, can big data bring us? How do we dig valuable data out of a huge mass of it and put it to use?
In the interview, Mr. Sun summed up three typical scenarios for big data, whose uses range from individuals and families all the way up to entire countries. Today the main applications of Hadoop are concentrated on basic data processing, but some applications have begun leaning toward machine learning. Star Ring Technology and its partners have also started to attempt something more sophisticated: using Hadoop to run advanced analytics and mine valuable insights out of big data.
The first typical scenario is using big data for real-time marketing: for example, collecting a user's mobile phone location in real time and pushing nearby Wi-Fi hotspots, or analyzing the user's shopping history and credit card records to push personalized offers such as movie tickets or goods the user is interested in.
The second typical scenario is using big data to forecast electricity consumption; Mr. Sun walked us through a real customer case in power-usage analysis. Some provinces have already deployed smart meters at scale, covering as many as tens of millions of households, with each meter read two to three times a day. From the grid's sensor data one can analyze the relationship between electricity consumption and climate, help the power company make a preliminary forecast of future demand, and even uncover the relationship between power consumption and GDP growth.
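The consumption-versus-climate relationship described above can be sketched with a simple curve fit. The readings below are invented for illustration (real inputs would come from smart-meter and weather feeds), and the quadratic form is an assumption: demand tends to rise at both temperature extremes because of heating and cooling.

```python
import numpy as np

# Hypothetical daily averages: temperature (°C) and household
# consumption (kWh). These numbers are made up for illustration.
temps = np.array([5, 10, 15, 20, 25, 30, 35], dtype=float)
kwh   = np.array([18, 14, 11, 10, 12, 16, 21], dtype=float)

# Fit a quadratic: consumption is high at both temperature extremes
# (heating in winter, cooling in summer), so a line would fit poorly.
coeffs = np.polyfit(temps, kwh, deg=2)
model = np.poly1d(coeffs)

# Use the fitted curve for a rough demand forecast on a new day.
forecast = model(28.0)
print(f"predicted consumption at 28°C: {forecast:.1f} kWh")
```

In a real deployment the same idea would be applied per region or per customer segment, with far richer features than temperature alone.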
The third typical scenario is big data in the medical field, where some companies use big data analysis to compare DNA. In the past, prenatal screening for older expectant mothers relied on invasive procedures that carried surgical risk. Now, using big data techniques, fetal DNA sequences can be collected and compared, and measures can be taken as soon as an abnormality is found. Compared with surgery, this method is more accurate and carries no risk, and such applications of big data are spreading quickly.
Phi: Some 60% of Hadoop applications are in the field of SQL statistics. Hadoop was first used for ETL, from data extraction through transformation to final loading, and now we see companies like Facebook using Hadoop as their data warehouse. So what is the relationship between Hadoop and the data warehouse?
Mr. Sun acknowledged that Internet companies have been using Hadoop as a data warehouse from day one, so Hadoop is their first choice for building a data platform; for an Internet company, Hadoop effectively is the data warehouse. For traditional enterprises the IT architecture is quite different. In industries such as telecom operators, banking, logistics, and aviation, Hadoop has served as a complement to the data warehouse, but bringing Hadoop into these enterprises runs into an obvious problem: their existing IT architectures already carry large applications, many of them built on SQL, and the variety and complexity of those applications actually exceed what Internet companies run. So Hadoop faced real limitations entering this field and early on was used only for ETL. As the technology has matured, some companies at home and abroad, ours included, can now provide much more complete SQL support, which lets Hadoop go further and replace part of the enterprise data warehouse.
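The SQL workloads in question are classic warehouse aggregations. Here is a minimal sketch of one such query, run against an in-memory SQLite table with made-up rows purely for illustration; on Hadoop, the same statement would be handed to a SQL engine such as Hive over data stored in HDFS.

```python
import sqlite3

# A hypothetical transactions table standing in for warehouse data.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE transactions (
        branch  TEXT,
        amount  REAL,
        tx_date TEXT
    )
""")
conn.executemany(
    "INSERT INTO transactions VALUES (?, ?, ?)",
    [("Shanghai", 120.0, "2015-01-05"),
     ("Shanghai",  80.0, "2015-01-06"),
     ("Beijing",  200.0, "2015-01-05")],
)

# A typical warehouse query: total and average spend per branch.
rows = conn.execute("""
    SELECT branch, SUM(amount), AVG(amount)
    FROM transactions
    GROUP BY branch
    ORDER BY branch
""").fetchall()
for branch, total, avg in rows:
    print(branch, total, avg)
```

The appeal of SQL-on-Hadoop is precisely that queries like this need no rewriting when the underlying storage moves from a traditional warehouse to HDFS.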
A traditional data warehouse at a large state-owned bank can cost billions to build, with maintenance and expansion costing billions more; it is very expensive. Hadoop offers a far more cost-effective alternative, and that is an important factor in enterprises' choices.
Beyond cost, Hadoop can also process unstructured data. Banks hold data such as video footage and bill records whose current value to them is not high, but which still needs a storage mechanism. As Hadoop's algorithms mature and its data-mining tools grow richer, enterprises can use the technology to find additional value in this data.
Mr. Sun predicted that traditional enterprise IT architectures will migrate gradually to Hadoop: over the next two or three years they will slowly be replaced, Hadoop will become the center of the enterprise data warehouse, and in the future it will serve as the data warehouse for every industry.
Del Piero: Big data is often described by the three Vs: Volume (large), Velocity (high speed), and Variety (diverse). In the Internet of Things especially, areas like meteorology and traffic produce real-time data in huge volumes at high concurrency. So how does IoT data differ from Internet data, and what challenges does it pose to the enterprise's underlying technical architecture?
Mr. Sun said that the Internet is really a network connecting people, so the data it collects is mostly human behavior: transaction records, browsing histories, and so on. The Internet of Things mostly collects data from machines. Comparing the two sources, the data volumes differ by an order of magnitude: the world's population is perhaps six billion people, but devices number in the billions, and once those devices all report data, the volume will exceed that of the Internet. This poses a major new challenge for future data architectures.
The second feature is that IoT data arrives with very high concurrency and must be processed as soon as it is produced. Mr. Sun cited a real customer that currently runs ten million sensors delivering data on the order of ten million points per second, which may already exceed what Internet companies handle; the concurrency demands on the underlying architecture are very high.
The third difference is purpose: Internet data is human behavior data used mainly for analysis and marketing, whereas IoT data is used more to discover laws of nature, which involves not only heavy computation but also a large number of complex physical and mathematical methods.
Del Piero: As the big data wave sweeps the world, Spark has caught fire much as Hadoop did. Abroad, companies such as Intel, Amazon, and Cloudera have taken the lead in applying and promoting Spark; at home, pioneers include Alibaba, Baidu, Taobao, Tencent, NetEase, and Star Ring, and Spark adoption in the IT industry is spreading like a prairie fire. Could Spark replace Hadoop in the future?
Mr. Sun expressed the hope that Spark can replace MapReduce: judging from the development of the overall ecosystem, Spark will gradually take MapReduce's place, and in Star Ring Technology's products Spark has in fact already replaced MapReduce. In the video interview Mr. Sun also went into the Hadoop distributed computing framework in depth; for the substantive details, please click through to the video.
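The shift Mr. Sun describes is from MapReduce's rigid map/shuffle/reduce phases to Spark's chained operators. Below is a pure-Python sketch of the word-count pattern written out in the MapReduce style; the documents are made up, and the Spark one-liner in the closing comment assumes a running SparkContext `sc`.

```python
from collections import defaultdict

docs = ["big data on hadoop", "spark on hadoop"]

# Map phase: emit a (word, 1) pair for every word.
mapped = [(word, 1) for doc in docs for word in doc.split()]

# Shuffle phase: group the emitted values by key.
groups = defaultdict(list)
for word, one in mapped:
    groups[word].append(one)

# Reduce phase: sum each group to get per-word counts.
counts = {word: sum(ones) for word, ones in groups.items()}
print(sorted(counts.items()))

# The equivalent Spark RDD pipeline collapses the phases into one chain:
#   sc.parallelize(docs).flatMap(str.split) \
#     .map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)
```

Beyond the terser API, Spark keeps intermediate results in memory across stages, which is where most of its speed advantage over disk-bound MapReduce comes from.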
Del Piero: I noticed that at the start of 2015 your company completed another financing round worth tens of millions. I also learned that you have formed a strong alliance with Inspur and successfully built a big data platform based on Hadoop. From the perspective of partners, can you briefly talk about the Hadoop ecosystem?
Mr. Sun said he hopes to promote the development of the whole Hadoop ecosystem, and described three types of partner. The first are industry solution providers, such as our partners in the transport sector, who work with us in depth to process banking or traffic data efficiently. The second are certified service providers, whom we train and who help with installation, deployment, and other service work. The third are vendors whose products complement ours, typically hardware vendors such as Inspur.
Del Piero: One last question. IDC forecasts that data will grow by 40% to 50% per year, which means the world's total data will reach 40 ZB by 2020. The main sources of unstructured data are our daily emails, forums, blogs, and social networks, along with POS systems and machine-generated data. Facing unstructured data, what Hadoop solutions do you offer, and what new versions will be released in the future?
Sun Yuanhao believes that many future computing frameworks will be integrated with Hadoop, and that by Hadoop 3.0 security and performance will be greatly improved, with a large gain in resource-management efficiency as well.
Mr. Sun revealed that Star Ring Technology expects to release two new products in 2015. The first targets the data generated by the large numbers of sensors deployed for the Internet of Things, focusing on time-series data, and will enter the new-energy industry first. It can efficiently process the large volumes of data the sensors produce, keeping the data in memory or moving it between SSD and memory, and analyze all of the time series.
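A toy sketch of the in-memory time-series handling described above: keep the most recent readings per sensor in a bounded buffer and answer simple windowed queries, here a moving average. The class, the sensor name, and the window size are all illustrative assumptions, not the product's actual design.

```python
from collections import deque

class SensorBuffer:
    """Bounded in-memory store of recent readings per sensor."""

    def __init__(self, window=3):
        self.window = window
        self.readings = {}  # sensor id -> deque of (timestamp, value)

    def append(self, sensor, ts, value):
        buf = self.readings.setdefault(sensor, deque(maxlen=self.window))
        buf.append((ts, value))  # oldest point is evicted automatically

    def moving_average(self, sensor):
        buf = self.readings.get(sensor)
        if not buf:
            return None
        return sum(v for _, v in buf) / len(buf)

buf = SensorBuffer(window=3)
for ts, v in enumerate([10.0, 12.0, 14.0, 20.0]):
    buf.append("turbine-1", ts, v)
print(buf.moving_average("turbine-1"))  # average over the last 3 readings
```

A production system would additionally spill cold windows to SSD and shard sensors across nodes; the point here is only the windowed, append-mostly access pattern that makes time-series workloads so amenable to in-memory processing.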
The second product, expected in the second half of 2015, uses containers and Docker to run Hadoop, helping enterprises simplify deployment: a 100-node cluster may take only two or three seconds to start, can scale out automatically, and can even migrate workloads automatically when a machine fails. This greatly reduces the cost of managing and maintaining Hadoop, and also delivers very efficient resource isolation, since container technology can isolate CPU, memory, network, and disk better than before. As a result, several departments can effectively share one unified data platform for their analytics.
(Responsible editor: Mengyishan)