What is big data?
Big data refers to the phenomenon of extremely rapid growth of digital data in the internet era. Data by itself is only a matter of quantity; "digitization" is the qualitative change, because digital data can be processed at high speed by computers. The digital camera replaced the film camera because a computer chip lets it process data in real time to produce photos and images. That transformation was epoch-making: it changed an entire industry.
Beyond the sheer amount of digital data (volume), the speed at which it accumulates (velocity) is even more striking. It accumulates not in the batches of the past but as a continuous stream. Data also comes from many sources and in many forms (variety): acquisition channels include cell phones, sensors, computers, scanners, and monitors, and the data takes the form of numbers, text, sound, photographs, film, and video. This real-time data generated from everyday life contains a great deal of noise (veracity). It includes both traditional structured data and large amounts of semi-structured and unstructured data.
Why is big data important?
Whether you admit it or not, big data has become a fact of modern life and is changing the way we think and behave. When you buy a book on the internet, the website recommends a few other books. This is the result of a comparative analysis of thousands of other people who bought the same book: the recommended titles are the ones most often bought by that group.
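As a rough illustration of the idea, and not any particular retailer's actual algorithm, a minimal co-purchase recommender can simply count which other titles appear most often in orders that contain the book in question. The order data below is invented for the example.

```python
from collections import Counter

# Hypothetical order history: each order is the set of books bought together.
orders = [
    {"Big Data", "Statistics 101", "Python Basics"},
    {"Big Data", "Statistics 101"},
    {"Big Data", "Machine Learning"},
    {"Cooking", "Gardening"},
]

def recommend(book, orders, top_n=2):
    """Recommend the titles most often co-purchased with `book`."""
    counts = Counter()
    for order in orders:
        if book in order:
            counts.update(order - {book})
    return [title for title, _ in counts.most_common(top_n)]

print(recommend("Big Data", orders))  # e.g. ['Statistics 101', 'Machine Learning']
```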
Big data processing technology lets us do things we never thought possible. A traditional Chinese scholarly ideal is to "read ten thousand books and travel ten thousand miles." Today, with computers "reading" digitized books for us, getting through ten thousand volumes is a trivial matter.
The ability of big data to process multimedia, multi-channel digital data in real time is changing our lives. Emerging telemedicine is one example. A monitor installed in the patient's home collects medical data, and the results of self-examinations done with a smartphone can be transmitted to the hospital in real time. Medical staff can diagnose, treat, and provide care based on real-time electronic medical records. These accumulated electronic records can then be mined to generate new information and intelligence, driving continuous improvement in the quality of health care.
Big data challenges traditional statistics and also creates opportunities. The idea of probability-based sampling statistics is to infer the state and behavior of a whole population from as small a random sample as possible. We must first design a questionnaire, then collect, clean, and analyze the data to produce a report. The process is long, data acquisition is expensive, and the questions that can be answered are limited by the questionnaire. The result is the chronic disease of sampling: slow, sparse, and expensive. Even in administrative statistics, structured data are collected using prefabricated forms. The idea of big data is to process, mine, and analyze the semi-structured, unstructured, and digitized data that modern life continuously produces through many channels and means. It is real-time or near-real-time processing that looks for correlation rather than causality and answers questions not limited by questionnaires or forms.
If sampling statistics and administrative statistics are Statistics 1.0, we can call statistical practice that also includes big data Statistics 2.0. Statistics 1.0 works on structured samples and administrative data; Statistics 2.0 also draws on unstructured or semi-structured digital data obtained from many sources in real life. In this sense, big data offers a new possibility for curing the chronic disease of traditional statistics: slow, sparse, and expensive.
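A toy illustration of the contrast, with entirely made-up numbers: sampling statistics estimates a population mean from a small random sample, while the big data approach simply computes over the full digitized record because it is already available.

```python
import random

random.seed(0)

# Hypothetical population: daily spending of 100,000 customers.
population = [random.gammavariate(2.0, 50.0) for _ in range(100_000)]

# Sampling statistics: infer the population mean from a small random sample.
sample = random.sample(population, 1_000)
sample_mean = sum(sample) / len(sample)

# "Big data" style: the full digital record is at hand, so just compute it.
true_mean = sum(population) / len(population)

print(f"sample estimate: {sample_mean:.2f}, full-data value: {true_mean:.2f}")
```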
Big data and modern information technology
In recent years, the rapid development of information technology has ushered in the age of big data.
First, high-speed processing of massive data must overcome a bottleneck of traditional data processing: databases are queried one request at a time, in a queue. Searching a single database can be done almost instantly, but searching 50 or hundreds of databases takes a long time even on a fast computer. Online search giants such as Yahoo and Google were among the first to develop and apply distributed data processing technology, represented by Hadoop. The core idea is to spread the task of searching 50 databases across 50 servers at the same time, then combine all the partial results and return them to the user, which greatly accelerates data processing.
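This is not Hadoop itself, but the divide-and-combine idea can be sketched in a few lines of Python; the data shards and the search term below are invented for the example.

```python
from multiprocessing import Pool

# Hypothetical data shards, standing in for separate databases on separate servers.
shards = [
    ["big data", "statistics", "hadoop"],
    ["cloud computing", "big data"],
    ["visualization", "xml", "big data"],
]

def search_shard(args):
    """Search one shard for a term (the 'map' step, run in parallel)."""
    term, shard = args
    return [record for record in shard if term in record]

if __name__ == "__main__":
    term = "big data"
    with Pool(processes=len(shards)) as pool:
        partial_results = pool.map(search_shard, [(term, s) for s in shards])
    # Combine the partial results (the 'reduce' step) and return them to the user.
    results = [r for part in partial_results for r in part]
    print(results)
```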
Second, because big data processing technology such as Hadoop requires heavy hardware investment and professional maintenance, most companies and government departments cannot afford it on their own. Cloud computing makes resource sharing possible, greatly reducing the cost of applying big data technology and accelerating its development.
Third, big data processing must combine data from multiple sources. Many databases (Oracle, SQL, etc.) differ in data structures and definitions. In recent years, the development and maturing of XML-based data integration technology has made it possible to integrate real-time multi-source data in the field, removing another obstacle on the road to the big data age.
Finally, the results of massive data processing and analysis are often scattered and messy. Turning analytic results into simple, understandable information is the key to realizing the value of big data. In recent years, innovation in data visualization technology has made data-driven, real-time, intelligent decision support possible, giving the big data age a green light.
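As a rough sketch of the XML-based integration idea, two sources that describe the same kind of record with different structures can be mapped into one common form. The record layouts and field names here are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Two hypothetical sources describing similar records with different tags.
source_a = "<patients><patient><name>Lee</name><hr>72</hr></patient></patients>"
source_b = "<records><rec person='Kim' heart_rate='80'/></records>"

def from_source_a(xml_text):
    """Map source A's nested-element layout into a common record format."""
    for p in ET.fromstring(xml_text).iter("patient"):
        yield {"name": p.findtext("name"), "heart_rate": int(p.findtext("hr"))}

def from_source_b(xml_text):
    """Map source B's attribute-based layout into the same common format."""
    for r in ET.fromstring(xml_text).iter("rec"):
        yield {"name": r.get("person"), "heart_rate": int(r.get("heart_rate"))}

# Integrate both sources into a single combined dataset.
combined = list(from_source_a(source_a)) + list(from_source_b(source_b))
print(combined)
```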
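A minimal sketch of that last point, using the widely available matplotlib library and made-up monitoring numbers: a simple chart condenses a stream of values into a picture a decision-maker can read at a glance.

```python
import matplotlib.pyplot as plt

# Hypothetical hourly readings from a monitoring stream.
hours = list(range(24))
readings = [50 + (h % 6) * 4 for h in hours]

plt.plot(hours, readings, marker="o")
plt.xlabel("Hour of day")
plt.ylabel("Reading")
plt.title("Condensing a data stream into one readable chart")
plt.show()
```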
The application of big data in the United States
Google, Amazon, Wal-Mart, and other large companies have successfully turned big data to commercial advantage. What these companies have in common is that they own the data and started early. Riding the big data boom, a number of new companies developing tools for big data processing have also had some success. Companies positioning themselves as big data analysis services have appeared in large numbers, but few have succeeded. The U.S. government has made great strides in opening data since Mr Obama took office, but most of it is administrative data, and success stories of big data applications are still rare.
High-end big data analysis talent (data scientists) is in short supply. Many universities in the United States have started to offer courses and degrees for data scientists. Traditional statisticians have difficulty accepting the new concepts of big data and mastering the new analytical tools and skills.
Because successful analysis models are lacking, massive data cannot quickly produce commercial value, and investment in the big data industry is cooling rapidly.
Protecting personal privacy and rights is still a big puzzle because it involves ethics, law, and information technology.
There are many potential problems in the application of big data. Big data can reveal many correlations, but it is hard to make sense of them all. Once a calculation method is exposed, data forgery becomes easy. The echo effect of network data is very strong, so erroneous information can be amplified and spread quickly. Reducing a complex question to a single number through data analysis often leads to misleading conclusions (for example, counting the occurrences of a particular word to judge the historical status of a person or event). Data consolidation poses even greater challenges. The marketplace is like a battlefield: enterprises must protect their own data, so data sharing is almost impossible. Government departments likewise hoard their data because of conflicts of responsibility, power, and interest. This is not a technical issue but a matter of interests, politics, policy, law, and ethics.
Conclusion
Big data represents a new trend in the development of the information age, and its importance and influence should not be underestimated. This new tool gives us opportunities and means to recognize and solve problems. But it is not a panacea and cannot replace traditional statistics and data analysis. For example, big data can process massive amounts of data in real time and find many correlations, but it cannot tell us which correlations are the most meaningful. In scientific research and intelligent decision-making, a definite causal relationship is more important and meaningful than many uncertain correlations. If sampling and administrative statistics (small data) are a sniper's rifle, big data is machine-gun fire. A wise commander knows the properties and characteristics of each weapon and can deploy firepower effectively to win battles.