How fast is the tide of big data rising? IDC estimated that the amount of data produced worldwide in 2006 was 0.18 ZB (1 ZB = one trillion gigabytes, or 10^21 bytes), and this year the estimate has been revised upward to 1.8 ZB, which corresponds to more than 100 GB of hard-drive space for every person in the world. The growth is still accelerating and is expected to reach nearly 8 ZB by 2015. The storage capacity of IT systems can hardly keep up, let alone support deep mining and analysis of all this data.
In this article, four industry experts share their insights and experience in coping with the challenges of massive data: Baidu Chief Scientist William Zhang, Teradata Chief Customer Officer Zhou Junling, Yahoo! Beijing Global Software R&D Center architect Han Yiping, and SAP China senior consultant for enterprise information management Du Yu.
Panelists: Zhou Junling, Chief Customer Officer, Teradata; William Zhang, Chief Scientist, Baidu; Han Yiping, architect, Yahoo! Beijing Global Software R&D Center; Du Yu, Senior Consultant, Enterprise Information Management, SAP China
What is the size of your enterprise's data volume now?
William Zhang: This question is easy to answer. Baidu is not a single product; besides the search engine, it also runs many community and media products, so the total is in the hundreds of petabytes, with tens of petabytes processed daily. I joined Baidu about four and a half years ago, so I remember the scale at that time quite clearly. Compared with then, today's data volume has grown astonishingly, by roughly 500 to 1,000 times.
The data volume itself is not what is frightening; the problem is processing the data in real time, because any delay makes the service lose some of its advantage and hurts the economics of the business. Our strategy is built around real time, and today's Internet users demand ever more real-time services, such as microblogging, group buying, and flash sales.
Zhou Junling: The IDC statistics show that data is growing very fast. Rather than any specific volume, Teradata is more concerned with the trends in how data develops, and we are investing heavily around them, including the changes in BI and models of data growth. Studying these models, such as how many transactions occur per minute or per second, is very valuable to us; our data scientists research and discuss them so the techniques can be applied to production systems and to enterprises.
Han Yiping: Yahoo!'s main cloud computing platform, Hadoop, now spans 34 clusters totaling more than 30,000 machines; the largest cluster has about 4,000 nodes, and total storage exceeds 100 PB. This order of magnitude is actually not that large, mainly because we have recently put a lot of effort into user privacy and data security. Under EU regulations, Yahoo! cannot retain data for more than a year, so our response is: do not keep the raw data, but mine it deeply and save only the genuinely valuable information.
Du Yu: As an enterprise application provider, SAP pays more attention to the data volumes of its customers. Many of our customers run data-intensive businesses, such as telecom, finance, government, and retail, with data ranging from several terabytes to hundreds of terabytes. SAP's data center at its headquarters in Germany has 30,000 servers and about 15 PB of data, providing services to customers. We are helping customers migrate their internal applications to our data center service platform, which means more and more customer data will be hosted with us.
How do you process and analyze big data?
Du Yu: On one hand, we use standard virtualization and distributed storage in the data center; on the other, we have introduced in-memory technology to meet the challenges of data applications and analysis. Traditional architectures have a major bottleneck: disk reads take milliseconds, while memory reads take nanoseconds. Therefore, the computation and analysis that used to happen at the application layer, such as predictive analysis or heavy number crunching, is moved into memory, which improves performance and helps users make full use of their data.
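To make the bottleneck concrete, here is a minimal Python sketch (not SAP's implementation; the file name, record format, and aggregation are invented) contrasting repeated disk scans with one in-memory load:

```python
# Illustrative sketch: repeated disk scans vs. one in-memory load.
# The file name, record format, and aggregation are invented examples.
import time

# Write a synthetic dataset of 100,000 "region,amount" records to disk.
with open("sales.csv", "w") as f:
    for i in range(100_000):
        f.write(f"{i % 100},{i % 7}\n")

def total_from_disk():
    # Re-reads and re-parses the file on every call (millisecond I/O).
    with open("sales.csv") as f:
        return sum(int(line.split(",")[1]) for line in f)

# Load once into memory, then analyze repeatedly (nanosecond access).
with open("sales.csv") as f:
    amounts = [int(line.split(",")[1]) for line in f]

def total_from_memory():
    return sum(amounts)

for fn in (total_from_disk, total_from_memory):
    start = time.perf_counter()
    for _ in range(10):          # ten "queries" against the same data
        fn()
    print(fn.__name__, f"{time.perf_counter() - start:.4f}s")
```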
Han Yiping: For Yahoo!, I would like to explain this in three parts: data acquisition, data storage, and data processing.
In data acquisition, we have built a real-time collection system inside Yahoo! that spans several data centers and hundreds of thousands of machines. It is characterized by a main road responsible for filtering, cleaning, and consolidating data and loading it onto the Hadoop platform with high reliability. The accuracy is high and the results are good, but it is somewhat slow. To meet the kind of real-time requirements William Zhang mentioned, there is also a bypass system that can deliver data at second-level latency, ahead of the main road. That is the data collection part.
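A minimal sketch of this two-path design (not Yahoo!'s actual system; the events and batch size are invented):

```python
# A minimal sketch (not Yahoo!'s actual system) of the two-path design:
# a "main road" that filters, cleans, and batch-loads events with high
# reliability, and a bypass that forwards events within seconds.
import queue, threading, time

events = queue.Queue()

def main_road():
    batch = []
    while True:
        event = events.get()
        if event.get("valid"):        # filtering / cleaning step
            batch.append(event)
        if len(batch) >= 3:           # consolidate, then bulk-load
            print("main road: loaded batch of", len(batch))
            batch.clear()

def bypass(event):
    # second-level path: forward immediately, no cleaning or batching
    print("bypass: forwarded event", event["id"])

threading.Thread(target=main_road, daemon=True).start()
for i in range(6):
    event = {"id": i, "valid": i % 2 == 0}
    bypass(event)        # low-latency copy for real-time consumers
    events.put(event)    # reliable, cleaned copy for the main road
time.sleep(0.1)          # let the main road drain before exiting
```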
For data storage, we basically take HDFS as the core. For data processing, the main technologies are Hadoop's MapReduce and Pig, which we developed ourselves; at present, more than half of our data processing is done with Pig.
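For illustration, here is the classic MapReduce word count, simulated as plain Python functions; with Hadoop Streaming the same mapper and reducer logic would run distributed across a cluster:

```python
# The classic MapReduce word count, simulated as plain Python functions;
# with Hadoop Streaming the same mapper/reducer logic runs on a cluster.
from itertools import groupby
from operator import itemgetter

def mapper(lines):
    for line in lines:
        for word in line.split():
            yield (word, 1)

def reducer(pairs):
    # Hadoop's shuffle delivers pairs grouped by key; sorted() stands in.
    for word, group in groupby(sorted(pairs), key=itemgetter(0)):
        yield (word, sum(count for _, count in group))

lines = ["big data is big", "data is everything"]
print(dict(reducer(mapper(lines))))
# {'big': 2, 'data': 2, 'everything': 1, 'is': 2}
```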
Zhou Junling: Teradata has been continuously innovating its traditional enterprise-class data warehouse product line while adapting to the big data era, extending the traditional BI domain and enhancing data processing capability so that big data is easier to manage. For example, we use data access frequency to determine data temperature and compress data accordingly, meeting the requirements of big data analysis and making data management easier.
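As an illustration of the data-temperature idea (not Teradata's implementation; the access-count thresholds are invented), access frequency can drive how aggressively a partition is compressed:

```python
# Illustration of the data-temperature idea, not Teradata's code;
# the access-count thresholds are invented for the example.
from dataclasses import dataclass

@dataclass
class Partition:
    name: str
    accesses_last_30d: int

def temperature(p: Partition) -> str:
    if p.accesses_last_30d > 1000:
        return "hot"     # keep uncompressed on fast storage
    if p.accesses_last_30d > 10:
        return "warm"    # light compression
    return "cold"        # heavy compression, cheapest storage

for p in (Partition("orders_2011", 5000), Partition("orders_2006", 3)):
    print(p.name, "->", temperature(p))
```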
For ultra-high data capacity requirements we have a hardware platform product, the Teradata 1000, which can compress 35 PB of data. It is particularly useful for analyzing structured and unstructured data, and it comes with a rich set of software packages for data statistics and analysis, including integration of the Hadoop architecture into the Teradata data warehouse through the existing Teradata enterprise data warehouse interfaces.
We also provide a cloud-based architecture that can use Amazon EC2 to offer customers secure storage products for data kept in the cloud outside the corporate firewall. And we have just acquired Aster Data, which has some very good tools for Hadoop and MapReduce applications.
William Zhang: Internet companies apply cloud computing technology in similar ways; Baidu, for example, also uses Hadoop. Let me mention a few of our more distinctive aspects.
The first is large-scale search: not only crawling web pages and building an extremely large index, but also optimizing so that the data is updated in near real time or faster, for example distributing it between southern and northern data centers according to geography and importance, mainly following our data application strategy. We also make use of data streaming technology.
The second is machine learning algorithms. In academia, machine learning has mostly meant highly complex computation over data held in a server's memory, possibly running for a long time. At Baidu, machine learning is applied everywhere, for instance judging user intent, learning from user behavior feedback what content we should recommend and what kind of advertising to match, all with very high timeliness requirements. This can be called incremental, large-scale machine learning.
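A minimal sketch of incremental (online) learning in this spirit: each feedback event updates a logistic model in place, so there is no full retraining. The features and model are illustrative assumptions, not Baidu's system:

```python
# A minimal sketch of incremental (online) learning: every feedback
# event updates the model in place, so there is no full retraining.
# The features and logistic model are assumptions, not Baidu's system.
import math

weights = {}             # feature -> weight, grows as features appear
LEARNING_RATE = 0.05

def predict(features):
    z = sum(weights.get(f, 0.0) for f in features)
    return 1.0 / (1.0 + math.exp(-z))    # estimated click probability

def update(features, clicked):
    # one stochastic gradient step per event: cheap enough for streams
    error = clicked - predict(features)
    for f in features:
        weights[f] = weights.get(f, 0.0) + LEARNING_RATE * error

stream = [({"query:camera", "ad:dslr"}, 1),
          ({"query:camera", "ad:phone"}, 0)]
for features, clicked in stream:
    update(features, clicked)
print(predict({"query:camera", "ad:dslr"}))   # drifts above 0.5
```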
Furthermore, to keep developing Internet applications, the key is to find more valuable data; that is, regardless of where data comes from, we should decide how to handle it according to its value.
What do you think of the endless stream of NoSQL technologies?
Du Yu: I have always believed that whatever exists has a reason; the emergence and evolution of NoSQL are driven by real application requirements. Today's applications place very high demands on relational databases for massive concurrency and efficient reading and writing of massive data, and NoSQL has unique value and advantages in this respect.
Of course, this is not to say that the appearance of NoSQL spells the end of relational databases. Some applications, especially enterprise-class applications, have high requirements for transactional consistency and real-time reads and writes, and relational databases have accumulated their own advantages over years of development.
Therefore, I very much agree that NoSQL means "not only SQL"; I believe that in the future, relational databases and NoSQL will coexist and even converge.
Han Yiping: NoSQL is a very broad concept. At Yahoo!, although we don't talk about NoSQL much, we use many NoSQL tools; our key-value databases and other systems all fall under the NoSQL umbrella. As for the relationship between NoSQL and SQL: many situations still need ACID guarantees, while others need NoSQL. As I often say, "God is fair": when a new requirement arises, something else has to be given up. For many of our requirements, such as huge data volumes and high distribution, consistency becomes the new bottleneck once those requirements must be met. In fact, for us in the Internet industry, many applications do not need consistency; when that requirement is relaxed, the other needs can naturally be satisfied.
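A toy sketch of the trade-off Han Yiping describes: a key-value store that acknowledges a write after one replica and replicates lazily, gaining fast writes at the cost of possibly stale reads:

```python
# A toy sketch of the trade-off: a key-value store that acknowledges a
# write after one replica and replicates lazily. Writes are fast, but a
# read from another replica may be stale until sync() runs.
from collections import deque

class EventuallyConsistentKV:
    def __init__(self, replicas=2):
        self.replicas = [dict() for _ in range(replicas)]
        self.pending = deque()                # replication backlog

    def put(self, key, value):
        self.replicas[0][key] = value         # ack immediately
        self.pending.append((key, value))     # replicate later

    def get(self, key, replica=1):
        return self.replicas[replica].get(key)   # possibly stale

    def sync(self):
        while self.pending:
            key, value = self.pending.popleft()
            for r in self.replicas[1:]:
                r[key] = value

kv = EventuallyConsistentKV()
kv.put("user:42", "logged_in")
print(kv.get("user:42"))   # None: replica 1 has not caught up yet
kv.sync()
print(kv.get("user:42"))   # logged_in
```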
How do you mine the value of data?
William Zhang: Let me give an intuitive example: ad matching. It involves two kinds of data. One is the ad library, that is, the ad content and advertiser information; this kind of information suits a traditional database. The other is everything users do after seeing the ads, which accumulates over time into perhaps trillions of user behavior records. Combining these two kinds of data with machine learning algorithms is what produces value. Obviously, the second kind of information is more important, because it tells us what users actually want: when someone searches for a term, the collective intelligence and behavior of all the users before him can determine which information is most important and of the best quality, and which is likely spam. Through a feedback mechanism, the best content is delivered to the user, along with recommended related searches and queries. In short, data is the lifeblood of any business, and for cloud computing, data processing is the very reason a cloud data center exists.
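A simplified sketch of combining the two kinds of data (all names and numbers invented): a small ad library of the sort that suits a traditional database, plus aggregated behavior counts that let observed click-through rates rank the ads:

```python
# A simplified sketch of the two data sources: a small ad library (the
# kind of data that suits a traditional database) and aggregated user
# behavior, combined to rank ads by observed click-through rate.
# All names and numbers are invented.
ad_library = {
    "ad1": {"advertiser": "CameraCo", "keywords": {"camera"}},
    "ad2": {"advertiser": "PhoneCo",  "keywords": {"camera", "phone"}},
}
behavior = {   # accumulated from a very large stream of user events
    "ad1": {"impressions": 1000, "clicks": 50},
    "ad2": {"impressions": 1000, "clicks": 10},
}

def rank_ads(query_terms):
    candidates = [a for a, ad in ad_library.items()
                  if ad["keywords"] & query_terms]
    # the feedback loop: ads users actually click rise to the top
    return sorted(candidates, reverse=True,
                  key=lambda a: behavior[a]["clicks"]
                              / behavior[a]["impressions"])

print(rank_ads({"camera"}))   # ['ad1', 'ad2']
```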
Han Yiping: We often joke after work that what can be dug out of data is not necessarily money; more important is the user experience. For an Internet company, data is everything.
Yahoo! is not just a search engine; in the United States it has many sites ranked first in their fields. Much of our work, such as the information on our news site, is recommended based on the relevance of the news and each person's interests. We want to recommend based on each user's own interests, even the user's interest at this very moment. The Yahoo! news recommendation system brings together all the data Yahoo! has collected: every action a user takes on Yahoo! is gathered, deeply mined, and personalized to analyze and recommend for each individual user. Without such data we could not deliver that experience; data is everything to us.
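A toy sketch of interest-based recommendation in this spirit (the profiles, articles, and weights are invented, not Yahoo!'s system):

```python
# A toy sketch of interest-based recommendation: score each article
# against a user's interest profile and surface the best match.
# Profiles, articles, and weights are invented.
user_interests = {"technology": 0.8, "sports": 0.1, "finance": 0.4}

articles = [
    {"title": "New chip announced", "topics": {"technology"}},
    {"title": "Cup final recap",    "topics": {"sports"}},
    {"title": "Markets rally",      "topics": {"finance", "technology"}},
]

def score(article):
    return sum(user_interests.get(t, 0.0) for t in article["topics"])

print("recommend:", max(articles, key=score)["title"])   # Markets rally
```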
Du Yu: Since you have been looking at the value of data from the Internet point of view, let me share from the enterprise perspective.
Smart grids are being rolled out in Europe down to the terminals, the so-called smart meters. In Germany, to encourage the use of solar energy, households can install solar panels; besides selling you electricity, the utility will also buy back the surplus your panels generate. Collecting readings from the grid every five or ten minutes makes it possible to predict customers' electricity usage habits and to infer how much electricity the entire grid will need over the next two to three months. With this forecast, a certain amount of electricity can be bought in advance from generation or supply companies. Electricity is a bit like futures: buying in advance is relatively cheap, while buying on the spot market is expensive. With such forecasts, procurement costs can be reduced.
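As a minimal illustration of the idea (a real grid forecast would model seasonality, weather, and months of history; the readings below are invented), a naive moving-average forecast over recent meter readings might look like this:

```python
# A naive illustration of demand forecasting from smart-meter readings;
# a real grid forecast would model seasonality, weather, and months of
# history. The readings below are invented.
def forecast_next(readings, window=12):
    # average of the last `window` readings (one hour at 5-minute
    # intervals) as a simple prediction of the next interval's load
    recent = readings[-window:]
    return sum(recent) / len(recent)

readings_kw = [3.2, 3.1, 3.4, 3.8, 4.0, 4.1,
               4.3, 4.2, 4.4, 4.6, 4.5, 4.7]
print(f"next-interval forecast: {forecast_next(readings_kw):.2f} kW")
```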
Another example is more of a personal interest. Dan Brown's The Lost Symbol suggests that if enough people focus their minds on one point, objects can be moved. We cannot verify that, of course, but when we analyze Internet searches for keywords and sensitive terms, we can gauge public attitudes toward a given matter. Some new business models can emerge from this, such as a company that evaluates online advertising, using this technology to measure the effectiveness of Internet ads. I think this may be where future business value is generated.
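A toy sketch of keyword-based attitude measurement (the word lists and posts are invented for illustration):

```python
# A toy sketch of keyword-based attitude measurement: count positive
# and negative terms in collected posts to estimate public sentiment.
# Word lists and posts are invented for illustration.
POSITIVE = {"great", "love", "effective"}
NEGATIVE = {"bad", "hate", "useless"}

def sentiment(posts):
    words = [w for p in posts for w in p.lower().split()]
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    return (pos - neg) / max(pos + neg, 1)   # -1 negative .. +1 positive

print(sentiment(["I love this ad", "useless campaign", "great product"]))
# 0.33...: mildly positive overall
```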
What challenges does the era of massive data pose to enterprises and technical people?
Han Yiping: We used to say we are software engineers, and our industry is often called the software industry, but I think we are really the information industry. For most people, the most important thing now is a change of mindset: from thinking in terms of code and programs to thinking in terms of data. In any design and development work, put data first.
Du Yu: Massive data keeps growing, but we should try to control it; the future trend should be shrinking massive data rather than letting it expand. In addition, the era of massive data is an opportunity for China to lead the world's IT industry.
Zhou Junling: In the cloud computing era, business data is tightly combined with the cloud to power business development, and we have learned many new things; some things are no longer stored and developed in-house but live in the cloud. The way technology products are marketed has also changed greatly compared with the past. Such a cloud environment poses many technical challenges for database providers, such as how to store data securely, including preserving the integrity of identity. This relates to where data is stored: data in transit may now be anywhere in the world rather than in a particular country, which raises the issue of data sovereignty, since some countries and governments may not allow data to be placed in certain locations. These are challenges that demand technical solutions to security problems.
William Zhang: Let me make two observations.
First, data management is an important DBA skill, yet university computer science education emphasizes training programmers without any special focus on data, and produces no data administrators. Second, MapReduce is not a new concept; functional programming languages expressed it as early as 30 to 40 years ago, when computers were far less capable. Yet to this day, universities offer hardly any courses on MapReduce or similar data processing, and almost no one there has heard of it.
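The functional roots are easy to see in Python's own primitives, which have the same split-apply-combine shape that Hadoop distributes across thousands of machines:

```python
# MapReduce's functional roots, visible in Python's own primitives:
# the same split-apply-combine shape that Hadoop distributes across
# thousands of machines.
from functools import reduce

lines = ["big data is big", "data is everything"]
word_counts = map(lambda line: len(line.split()), lines)   # map phase
total_words = reduce(lambda a, b: a + b, word_counts)      # reduce phase
print(total_words)   # 7
```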
In the future, all of life's experience data will be in the cloud. This is achievable, but if data security problems are not solved well, the final realization will remain very far off. I expect cloud computing to become cloud knowledge and cloud intelligence, not just a computational tool. Establishing data integration and sharing is a necessary and sufficient condition for the success of cloud computing.