The Development of Big Data in Detail: From Small Samples to Big Data, From Theory to Value

Source: Internet
Author: User
Keywords: big data, massive samples, screening

From Small Samples to Big Data: Concepts and Myths

The data generated and recorded in the last two years accounts for 90% of all the data produced since human civilization began. We record valuable information continuously, and the ever-changing data of the world and everything in it forms an "automatically growing" gold mine. Data mining technology is what digs the gold out of that mine.

The term "big data" began as a commercial concept promoted early on by IBM and EMC, so it has carried a commercial gene from birth. Once you understand this, you will not get overly entangled in questions such as "what exactly is big data" or "how much data counts as big data." The concept covers the philosophical puzzles, technical dilemmas, solutions and resulting business opportunities that we encounter when facing massive data.

Before discussing the big data problem, we first review another classic problem in the data world: the small sample problem. On the surface, "small" refers to the small number of data samples. In essence, it means that the available samples are not sufficient to characterize the feature space.

Overfitting was one of the core issues of the small data era, and it led to Vapnik's statistical learning theory and the SVM algorithm. The explicit feature of big data is that it is "large": its scale exceeds the capacity of general algorithms or general hardware. A second, accompanying characteristic is that big data contains "excess" samples, more than are needed to describe the feature space. The first feature drives the development of parallel and cloud computing hardware and software, while the second drives industry change at the level of business models and data analysis methodology.
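The small sample problem described above can be illustrated with a short sketch. It is not taken from the original article: the five data points, the assumed true relation y = x, and the helper names `lagrange_predict` and `fit_line` are all made up for illustration. A flexible model (a polynomial forced through every sample) is compared with a simple one (a least-squares line) at a point the samples do not cover.

```python
def lagrange_predict(xs, ys, x):
    """Evaluate the unique polynomial that passes through every (xs, ys) point at x."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        term = yi
        for j, xj in enumerate(xs):
            if j != i:
                term *= (x - xj) / (xi - xj)
        total += term
    return total


def fit_line(xs, ys):
    """Ordinary least-squares line; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Five noisy samples of the assumed true relation y = x (noise of +/-0.5, made up).
train_x = [0.0, 1.0, 2.0, 3.0, 4.0]
train_y = [0.5, 0.5, 2.5, 2.5, 4.5]

slope, intercept = fit_line(train_x, train_y)

# A new point outside the region covered by the small sample.
x_new, y_true = 5.0, 5.0
flexible_error = abs(lagrange_predict(train_x, train_y, x_new) - y_true)
simple_error = abs(slope * x_new + intercept - y_true)
print(flexible_error, simple_error)
```

The polynomial reproduces all five training points exactly, yet at the new point it is off by about 15.5, while the line is off by about 0.1. The flexible model has fitted the noise in a sample that is too small to characterize the feature space.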

How do we understand the value that these "excess samples" bring us? Obviously, we do not need them in order to capture the global characteristics of objects through data. Debates such as "the bigger the data, the better" and "does big data need sampling" were concerns from before the big data era, and it can be said that those who are still entangled in them have yet to reach the core value of big data. To summarize: before the big data era, we used small samples or modestly sampled data for knowledge discovery (KDD) about groups. In the big data era, we take the rules extracted from small samples, or commonly known rules of thumb, and realize business value by searching large sample data to find the target individuals.
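The two-stage idea in the paragraph above (extract a rule from a small sample, then search a large dataset for the individuals it matches) might be sketched as follows. Everything here is a hypothetical illustration: the midpoint-threshold rule, the function names `learn_threshold` and `screen`, and the synthetic records are assumptions, not a method described in the article.

```python
def learn_threshold(labeled_sample):
    """Learn a one-feature rule from a small labeled sample.

    labeled_sample: list of (value, is_target) pairs.
    Returns the midpoint between the mean value of each class.
    """
    targets = [value for value, is_target in labeled_sample if is_target]
    others = [value for value, is_target in labeled_sample if not is_target]
    return (sum(targets) / len(targets) + sum(others) / len(others)) / 2


def screen(records, threshold):
    """Lazily yield the ids of records whose value exceeds the threshold."""
    for record_id, value in records:
        if value > threshold:
            yield record_id


# Stage 1: a small labeled sample (made-up values) yields the rule.
small_sample = [(10, False), (20, False), (80, True), (90, True)]
threshold = learn_threshold(small_sample)

# Stage 2: apply the rule to a stand-in for a large dataset,
# here one million synthetic (id, value) records produced by a generator.
big_data = ((i, i % 100) for i in range(1_000_000))
targets = list(screen(big_data, threshold))
print(threshold, len(targets))
```

Because `screen` is a generator and the records are produced lazily, the large dataset is never held in memory at once. The rule itself stays cheap, and the cost lies in the search over individuals.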

From Theory to Value: Examples of Government Applications

Where is big data found? The owners of these rich mines include industry, finance, communications, research institutes, Internet companies and more. In addition, there is a super mine owner: the government. In the United States, for example, the public government website Data.gov holds more than 400,000 raw data files covering nearly 50 categories such as agriculture, finance and employment. U.S. officials say the goal is to "make it easier for the public to access federal government data and use it creatively by encouraging innovation to break through the government's walls." At the same time, big data from various industries can greatly improve government decision-making.

In recent years, applications of big data at the national and government level have started to emerge:

1. Emotional measurement and happiness index

In 2008, French President Nicolas Sarkozy formed a panel of more than 20 world-renowned experts, including Nobel Prize laureates Joseph Stiglitz and Amartya Sen, to carry out a study entitled "Happiness and Measuring Economic Progress." The study incorporates the subjective well-being of the nation into measures of economic performance, and measures economic development on the basis of subjective well-being, quality of life and income distribution.

The Hedonometer project of the University of Vermont computing lab

(1) 2011: Happiness grows with travel distance

Christopher Danforth of the University of Vermont led a study of the relationship between happiness and geographic location. The study was based on tweets filtered from Twitter in 2011: 37 million tweets posted by more than 180,000 users worldwide. About 1% of tweets carry latitude and longitude information.

The study found that people usually have two places they visit most, and that these two places are not far apart; they are presumably home and the workplace. To measure users' happiness, the University of Vermont team developed a "hedonometer," a tool that detects words in a text that express positive, happy emotions (e.g., "fresh," "excellent," "coffee," and "lunch") and words that express negative emotions (e.g., "no," "nasty," "damn," and "boring"). The hedonometer uses these words as the basis for scoring the happiness of each tweet. The research team found that the farther people are from home, the more happy words their tweets contain.
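The word-based scoring described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the project's actual implementation: the lexicon contains just the example words quoted in the text, the +1/-1 scores are invented, and the function name `happiness_score` is hypothetical.

```python
import re

# Hypothetical lexicon built from the example words in the text.
# The scores are invented for illustration: +1 for positive, -1 for negative.
LEXICON = {
    "fresh": 1, "excellent": 1, "coffee": 1, "lunch": 1,
    "no": -1, "nasty": -1, "damn": -1, "boring": -1,
}


def happiness_score(text):
    """Return the mean lexicon score of the scored words in text.

    Words that are not in the lexicon are ignored. Returns None when
    the text contains no scored word at all.
    """
    words = re.findall(r"[a-z']+", text.lower())
    scores = [LEXICON[word] for word in words if word in LEXICON]
    if not scores:
        return None
    return sum(scores) / len(scores)


print(happiness_score("Excellent coffee and lunch!"))
print(happiness_score("Damn, boring meeting, but fresh coffee"))
print(happiness_score("hello world"))
```

Averaging only over scored words keeps long neutral texts from being pulled toward zero. A text with no scored words returns `None` rather than a misleading 0.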

(2) 2011: People are less happy than before

On December 21, 2011, it was reported that scholars at the University of Vermont had analyzed the words used on Twitter and concluded that "people are less happy than before." According to the research, the general sense of well-being had been on a downward trend since April 2009. Peter Dodds, an applied mathematician at the University of Vermont and the lead author of the study, said, "People are feeling less well-being." The Dodds team reached this conclusion by analyzing 46 billion words from the tweets of 63 million Twitter users.

(3) 2013: Happiness peaks on Saturday

The Hedonometer project team at the University of Vermont computing lab released a report on its sentiment analysis of Twitter. Using natural language processing, the project analyzed the millions of tweets posted every day over the previous five years, identified key words that reflect positive or negative emotion, and recorded the results. Each year, the happiest day is Christmas, December 25; other high-happiness days include New Year's Day, Thanksgiving Day and Valentine's Day. Over the week, the day with the highest average happiness is Saturday and the day with the lowest is Tuesday.
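The weekly comparison above amounts to grouping daily happiness scores by day of the week and averaging each group. A minimal sketch follows, assuming one score per day is already available. The scores below are made up for illustration and are not the project's data, and `mean_score_by_weekday` is a hypothetical helper name.

```python
from collections import defaultdict
from datetime import date

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]


def mean_score_by_weekday(daily_scores):
    """Average (date, score) pairs per weekday; returns {weekday name: mean}."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for day, score in daily_scores:
        name = WEEKDAYS[day.weekday()]  # date.weekday(): Monday is 0
        sums[name] += score
        counts[name] += 1
    return {name: sums[name] / counts[name] for name in sums}


# Made-up daily scores, for illustration only.
daily = [
    (date(2013, 1, 1), 5.5),   # Tuesday
    (date(2013, 1, 5), 6.0),   # Saturday
    (date(2013, 1, 8), 6.0),   # Tuesday
    (date(2013, 1, 12), 6.5),  # Saturday
]
means = mean_score_by_weekday(daily)
happiest = max(means, key=means.get)
print(means, happiest)
```

The same grouping works for finding the happiest calendar date across years if the key is changed from the weekday name to the (month, day) pair.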

2. United Nations Global Pulse project

Big Data for Development: Challenges and Opportunities white paper project

As big data for development became a global priority, the Executive Office of the UN Secretary-General officially launched the Global Pulse initiative in 2009 to promote innovation in the rapid collection and analysis of digital data. One result of this project was the report "Big Data for Development: Challenges and Opportunities," led by Emmanuel Letouzé, senior development economist at Global Pulse, and released in May 2012. The report comprehensively analyzes the historic opportunities and challenges that all countries, especially developing countries, face in using big data for social development, and it systematically offers suggestions on how to use big data properly.

In line with the United Nations' assessment of the value of big data, the London think tank Policy Exchange stated that big data could save the British government £33 billion a year. The UN report explains how big data helps governments respond better to changes in social and economic indicators such as income, unemployment and food prices. The United Nations points out that the era of big data has arrived, and that the vast wealth of data available today, both old and new, can be used to carry out unprecedented real-time analysis of the population.
