What is big data? How much data qualifies as big data? The answer has changed over time. In the early 1980s, big data meant data large enough to require tens of millions of tape cartridges; in the 1990s, it meant data that exceeded the storage capacity of a single desktop machine; today, big data refers to data that is difficult to store in relational databases and cannot be processed with single-machine statistical and analysis tools, and must instead be stored on massively parallel systems built from large numbers of machines. Big data appears in every area of daily life and scientific research, and its continued growth has forced people to rethink how data is stored and managed.
With the rise of social computing, people have become accustomed to sharing and exchanging information online. The social networking site Facebook, for example, has an enormous and still-growing user base. The daily logs and shared content of these users are vast, with total data volumes reaching the petabyte level, and traditional solutions could not handle them well. Facebook first developed its own Cassandra system and now uses HBase; such systems for managing massive data can serve users well while remaining scalable and fault-tolerant, which is exactly what solving big data problems requires. The microblogging service Twitter faces a similar challenge: hundreds of millions of messages are posted per day and queries number in the billions per day, so its storage system must not only hold large-scale data but also sustain high-throughput reads and writes. Twitter originally used the MySQL database and, as its user base surged, migrated its data to NoSQL systems; although NoSQL systems are not yet mature, they are a more effective solution for massive data. Other Internet companies face similar problems: the Google search engine must process large-scale web page information, YouTube must store and serve the videos its users share, and Wikipedia must store and manage the knowledge its users contribute.
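To make the storage model concrete, the following minimal sketch (the table name, row keys, and host are hypothetical, not from the original text) shows how a wide-column store such as HBase might hold user log events. It assumes an HBase cluster with its Thrift gateway enabled and uses the third-party happybase Python client; rows are keyed by user ID plus timestamp, so one user's recent activity can be read back with a single prefix scan.

import happybase

# Hypothetical Thrift gateway host and table name, for illustration only.
connection = happybase.Connection('hbase-gateway.example.com')
logs = connection.table('user_logs')

# Row keys combine user ID and timestamp, so one user's events sort together.
logs.put(b'user42#20130101120000',
         {b'log:action': b'share', b'log:object': b'photo_77'})
logs.put(b'user42#20130101120500',
         {b'log:action': b'comment', b'log:object': b'post_13'})

# A prefix scan retrieves all events for one user in timestamp order.
for key, data in logs.scan(row_prefix=b'user42#'):
    print(key, data[b'log:action'])

Because writes go to whichever cluster node owns the row-key range, this model scales horizontally simply by adding machines, which is what gives such systems their high write throughput.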
With the development of e-commerce, more and more people shop online. Commercial websites must store large amounts of product information and user transaction records, which involves large-scale data. At the same time, a site must respond to requests promptly to improve the user experience and attract customers, and it must also process and analyze this massive data in order to recommend products to users more precisely. Massive data has thus become a key factor in system design and business success. The Chinese e-commerce site Taobao uses HBase to store data while continuing to explore its own solutions, and has developed OceanBase, a database system that supports big data, to run some of its online applications. eBay, the world's largest online auction and shopping site, is also actively pursuing solutions for massive data: it built Athena, its own Hadoop-based cluster system, to handle large-scale data, and developed the open-source platform project Turmeric to better build and manage its services. Meanwhile, major retailers, whether online or physical, collect customers' purchasing information in order to provide targeted services and product recommendations, all of which involves the application of large-scale data.
Scientific research in many fields also faces the challenge of massive data, from genomics to astronomy and meteorology, from physics experiments to clinical medicine. Thanks to advances in measurement techniques and instruments, experiments and practice generate enormous amounts of data, and people must process and analyze this data to mine valuable information, which is no easy task. With the development of next-generation gene sequencing technology, the information contained in genes is gradually being uncovered and researchers obtain ever more accurate genetic data; how to align this genetic data and how to extract the required information from it are new challenges for bioinformatics. In environmental and meteorological research, scientists have collected observational data spanning decades or even centuries, and analyzing climate change in such data requires the support of massive data processing technology. Medical and pharmaceutical research collects large volumes of patient physiological data and drug trial data, from which useful information must be extracted. In the humanities and social sciences, sociologists have begun to study interpersonal and social relationships on online social networks; the data involved is also very large, and finding what sociologists are interested in within such massive data is challenging. In artificial intelligence research, people hope computers can acquire human-like learning and logical reasoning abilities, which requires machines to store large amounts of empirical and knowledge data, and to quickly retrieve and analyze the relevant content from that data in order to make correct and effective judgments.
Nowadays, the wide use of sensors makes data collection ever more convenient; sensors in systems such as real-time monitoring and network traffic monitoring generate data continuously. Beyond sensors, many fields involve streaming data: stock prices and transaction records in economics and finance, transaction data in retail, and traffic data in telecommunications are all streams. The most striking feature of such data is its volume, since it is produced continuously at every moment; but unlike other massive data, stream data arrives in a continuous order, changes rapidly, and demands fast responses from processing and analysis, so the processing and mining of streams often use different methods, as the sketch below illustrates. Every aspect of economics and finance produces massive data, such as changes in securities prices, the data flows generated by stock trading, and the data produced by the economic activity of enterprises and individuals. The modern economy has entered the era of massive data, which can bring innovation and productivity growth and may give rise to new business models. Exploiting the massive data generated by economic life can play an important economic role, benefiting not only the business activities of enterprises but also the national economy and the competitiveness of the country. Facing large-scale economic data, people need to improve their ability to acquire, store, and analyze data while ensuring data security and privacy, which remains a huge challenge.
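As a minimal illustration of why stream processing differs from batch processing (this example is our own, not from any specific system above), the one-pass sliding-window mean below keeps only the last window values in memory, so it can run indefinitely over an unbounded stream, something a load-then-analyze batch tool cannot do.

from collections import deque
import random

def sliding_mean(stream, window=100):
    """Yield the mean of the last `window` items using O(window) memory."""
    buf = deque(maxlen=window)
    total = 0.0
    for x in stream:
        if len(buf) == buf.maxlen:
            total -= buf[0]  # value about to be evicted by append()
        buf.append(x)
        total += x
        yield total / len(buf)

# Example: smooth a simulated, unbounded feed of stock prices.
prices = (100 + random.gauss(0, 1) for _ in iter(int, 1))  # infinite generator
smoothed = sliding_mean(prices, window=50)
for _ in range(5):
    print(next(smoothed))

Each arriving value is processed exactly once and then discarded, which is the defining constraint of stream mining: the data cannot be stored in full and revisited.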
Traditional relational databases cannot handle the problems posed by massive data well, and single-machine statistics and visualization tools have become inadequate. New data management systems, such as parallel databases, grid databases, distributed databases, cloud platforms, and scalable databases, have emerged, offering a variety of options for handling massive amounts of data.