US magazine: Is big data really that amazing?

Source: Internet
Author: User

The website of the US magazine Foreign Policy recently published an article titled "Rethinking Big Data: Why the Rise of Machines Is Not Exactly What People Boast About" by Kate Crawford, a principal researcher at Microsoft Research and a visiting professor at the MIT Center for Civic Media.

"Big data" is the current buzzword, the article says, the technology community's all-purpose answer to the world's most intractable problems. The term generally describes the art and science of analyzing massive amounts of information to find patterns, glean insights, and predict answers to complex questions. It may sound dull, but for big data advocates no problem is too large, from stopping terrorists to eradicating poverty to saving the planet.

In their modestly titled book "Big Data: A Revolution That Will Transform How We Live, Work, and Think", Viktor Mayer-Schönberger and Kenneth Cukier cheer: "The benefits to society will be endless, as big data helps to address pressing global problems such as tackling climate change, eradicating disease, and promoting good governance and economic development."

As long as there is enough data to work with, whether it is data on your iPhone, grocery purchases, personal profiles on online dating sites, or anonymized health records for an entire country, the computing power to decode this raw data will yield countless insights, the article says. Even the Obama administration has caught on to the trend, releasing on May 9 "unprecedented" amounts of data that were "previously difficult to obtain or manage" to entrepreneurs, researchers and the public.

But is big data really all that people claim? Can we believe that so many ones and zeros will reveal the hidden world of human behavior?

"With enough data, the numbers speak for themselves." Not a chance.

The author points out that big data advocates want to believe that behind the lines of code and vast databases lie objective, universal insights into patterns of human behavior, whether consumer spending, criminal or terrorist acts, health habits, or employee productivity. But many big data evangelists are unwilling to face up to its shortcomings. Numbers cannot speak for themselves, and datasets, whatever their size, are still products of human design. Big data tools such as the Apache Hadoop software framework do not free people from distortions, gaps, and false stereotypes. These factors matter all the more when big data tries to reflect the social world people live in, yet people often foolishly believe that its results are always more objective than human opinion. Biases and blind spots exist in big data just as they do in individual perceptions and experiences. Yet there is a dubious belief that bigger data is always better and that correlation is equivalent to causation.

For example, social media is a popular source for big data analysis, and there is no doubt much information to be mined there. People are told that Twitter data show that people are happier the farther they are from home and most dejected on Thursday nights. But there are many reasons to question what this data means. First, according to the Pew Research Center, only 16% of adults in the United States use Twitter, so they are by no means a representative sample: compared with the overall population, they skew younger and more urban. In addition, many Twitter accounts are known to be automated programs called "bots", fake accounts, or "cyborgs" (that is, human-controlled accounts assisted by bots). Recent estimates suggest there may be as many as 20 million fake accounts. So before even stepping into the methodological minefield of how to assess the sentiment of Twitter users, one should ask whether those emotions come from real people or from automated algorithmic systems.

"Big data will make our cities smarter and more efficient." To some extent, yes.

The article says big data can offer valuable insights for improving cities, but it can only take people so far. Because data is not generated or collected equally everywhere, big datasets have "signal problems": some people and communities are overlooked or underrepresented, leaving what are called data dark zones or shadows. The usefulness of big data in urban planning therefore depends heavily on how well municipal officials understand the data and its limits.

Boston's Street Bump app, for example, is a clever way to collect information at low cost. The program gathers data from the smartphones of drivers as they pass over potholes. More apps like it are emerging. But if cities come to rely on information that comes only from smartphone users, those citizens are a self-selected sample, which inevitably means missing data from communities with fewer smartphone owners, typically older and less affluent residents. While Boston's Office of New Urban Mechanics has made a number of efforts to fill these potential data gaps, less conscientious public officials may skip such remedies and end up with lopsided data that further entrenches existing social inequities. A look back at 2012, when Google Flu Trends overestimated the annual flu rate, shows what relying on flawed big data can mean for public services and public policy.
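
The effect of a self-selected sample can be illustrated with a small sketch. It does not describe how Street Bump or the city of Boston actually processes reports; the neighborhood labels, report counts and smartphone shares are invented for illustration, and the simplifying assumption is that each pothole is reported with a probability equal to the share of drivers running the app.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Neighborhood:
    name: str                # hypothetical neighborhood label
    reports: int             # pothole reports received from the app (invented)
    smartphone_share: float  # assumed fraction of drivers running the app


def naive_shares(areas: List[Neighborhood]) -> Dict[str, float]:
    """Split attention in proportion to raw report counts."""
    total = sum(a.reports for a in areas)
    return {a.name: a.reports / total for a in areas}


def reweighted_shares(areas: List[Neighborhood]) -> Dict[str, float]:
    """Scale each area's reports by 1 / smartphone_share before splitting.

    This is simple inverse-probability weighting. It only helps when the
    share is known and greater than zero.
    """
    estimated = {a.name: a.reports / a.smartphone_share for a in areas}
    total = sum(estimated.values())
    return {name: value / total for name, value in estimated.items()}


# Two invented areas with the same underlying number of potholes (100 each),
# but very different rates of app use.
AREAS = [
    Neighborhood("affluent", reports=80, smartphone_share=0.8),
    Neighborhood("low_income", reports=20, smartphone_share=0.2),
]

if __name__ == "__main__":
    print("naive:     ", naive_shares(AREAS))
    print("reweighted:", reweighted_shares(AREAS))
```

Under these assumed numbers, the raw counts would send most repair crews to the area with more smartphones even though both areas have the same road damage. Reweighting is only a partial remedy: a community with no app users at all produces no reports to reweight and stays invisible.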

The same is true of "open government" programs that publish government data online, such as the Data.gov website and the White House Open Government Initiative. More data may not improve any function of government, including transparency and accountability, unless there are mechanisms that keep the public and public institutions engaged with each other, not to mention the government's ability to interpret the data and respond with adequate resources. None of this is easy. In fact, there are not many skilled data scientists around. Universities are now scrambling to define the field, develop curricula, and meet market demand.

"Big data doesn't discriminate against social groups." Hardly.

The article points out that another hope pinned on big data's claimed objectivity is that discrimination against minorities will diminish: raw data is supposedly always free of social bias, so analysis can be carried out at scale and group-based discrimination avoided. However, because big data can make claims about how groups behave differently, it is often used for exactly that purpose, namely sorting individuals into groups. For example, a recent paper points out that scientists let their own racial biases influence big data studies of the genome.

Big data could be used for price discrimination, raising serious civil rights concerns. The practice has historically been known as "redlining". Recently, a big data study by the University of Cambridge of 58,000 Facebook "likes" was used to predict users' highly sensitive personal information, such as sexual orientation, race, religious and political views, personality traits, intelligence, happiness, use of addictive substances, parents' marital status, age and gender. Journalist Tom Foremski commented on the study: "Such highly sensitive information is likely to be used by employers, landlords, government departments, educational institutions and private organizations to discriminate against and punish individuals. And people have no means of fighting back."

Finally, consider the implications for law enforcement. From Washington to New Castle County, Delaware, police are turning to big data "predictive policing" models in the hope of finding leads in unsolved cases and even helping to prevent future crimes. However, having police concentrate on the specific "hot spots" that big data identifies risks reinforcing police suspicion of already stigmatized social groups and institutionalizing differential policing. As one police chief wrote, although predictive policing algorithms do not take factors such as race and gender into account, using such systems without considering their disparate impact may in practice "lead to a deterioration in the relationship between the police and the community, create a sense that due process is lacking, prompt allegations of racial discrimination, and threaten the legitimacy of the police."

"Big data is anonymous, so it won't violate our privacy." Wrong.

The article says that although many big data providers try to strip individual identities from datasets about people, the risk of re-identification remains high. Mobile phone data may look fairly anonymous, but a recent study of a dataset covering 1.5 million European mobile subscribers suggests that only four reference points are enough to identify 95% of them individually. The researchers point out that the paths people take through a city are unique, and that a great deal can be inferred by combining large public datasets, which makes personal privacy a "growing concern".
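
The idea behind such re-identification results can be sketched in a few lines. This is not the method or the data of the study mentioned above; the three traces are invented, each "reference point" is assumed to be a (place, hour) pair, and the function names are hypothetical. The sketch asks how many people in a dataset are the only match for a handful of points known about them.

```python
import random
from typing import Dict, Set, Tuple

Point = Tuple[str, int]         # assumed form of a reference point: (place, hour)
Traces = Dict[str, Set[Point]]  # pseudonymous user id -> points in their trace


def matching_users(known: Set[Point], traces: Traces) -> Set[str]:
    """Return every user whose trace contains all of the known points."""
    return {user for user, trace in traces.items() if known <= trace}


def fraction_unique(traces: Traces, k: int, seed: int = 0) -> float:
    """Fraction of users singled out by k points drawn from their own trace.

    Assumes every trace holds at least k points. The seed only makes the
    random choice of points repeatable.
    """
    rng = random.Random(seed)
    unique = 0
    for user, trace in traces.items():
        known = set(rng.sample(sorted(trace), k))
        if matching_users(known, traces) == {user}:
            unique += 1
    return unique / len(traces)


# Invented traces: users "a" and "b" share an office and a gym.
TRACES: Traces = {
    "a": {("home_A", 8), ("office", 9), ("gym", 18)},
    "b": {("home_B", 8), ("office", 9), ("gym", 18)},
    "c": {("home_C", 8), ("cafe", 12), ("park", 17)},
}

if __name__ == "__main__":
    for k in (1, 2, 3):
        print(k, "points ->", fraction_unique(TRACES, k))
```

In this toy data, knowing only that someone was at the office at 9 leaves two candidates, but adding one home location narrows it to a single person. Removing names from the records does nothing to prevent this, because the trace itself acts as the identifier.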

But the privacy problems of big data go far beyond conventional identification risks. Medical data currently being sold to analytics companies could be used to track down individuals' identities. There is much talk of personalized medicine, the hope being that drugs and other therapies will one day be tailored to the individual, as if made from that person's own DNA. This is a wonderful prospect for improving the efficacy of medicine, but it inherently relies on identifying individuals at the molecular and genetic level, which poses significant risks if the information is misused or leaked. Although personal health data collection apps such as RunKeeper and Nike+ have grown rapidly, using big data to improve medical services is in practice still an aspiration, not a reality.

Highly personal big datasets will become prime targets for hackers and leakers. WikiLeaks has been at the centre of several of the most serious data leaks in recent years. As the massive data leak from Britain's offshore financial industry showed, the personal information of the world's richest 1% is as vulnerable to exposure as everyone else's.

"Big data is the future of science." Partly right, but it still has some growing up to do.

The article points out that big data offers science new ways forward. One need only look at the discovery of the Higgs boson, the product of the largest grid computing project in history. In that project, CERN used the Hadoop Distributed File System to manage all the data. But unless people recognize and address some of the inherent weaknesses of big data when it reflects human lives, major public policy and business decisions may be made on the basis of mistaken assumptions.

To address this problem, data scientists are beginning to collaborate with social scientists. Over time, this will mean finding new ways to combine big data strategies with small data studies. This will go far beyond practices used in advertising or marketing, such as focus groups or A/B testing (that is, showing users two versions of a design or result to determine which works better). Rather, the new hybrid approach will ask people why they do things, not just how often things happen. That means drawing on sociological analysis and deep ethnographic insight in addition to information retrieval and machine learning.
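
For readers unfamiliar with A/B testing, the following is a minimal sketch of one common way to compare two versions, a pooled two-proportion z-test. The conversion counts are invented and the function name is hypothetical; real experiments involve further choices (sample size planning, stopping rules) that are not shown here.

```python
import math
from typing import Tuple


def two_proportion_z(conv_a: int, n_a: int, conv_b: int, n_b: int) -> Tuple[float, float]:
    """Return (z, two-sided p-value) for the difference in conversion rates.

    conv_a / n_a are conversions and visitors for version A, likewise for B.
    Uses the pooled standard error and a normal approximation, so it assumes
    reasonably large samples and at least one conversion overall.
    """
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / std_err
    # Two-sided tail probability of a standard normal beyond |z|.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


if __name__ == "__main__":
    # Invented numbers: 100 of 1000 visitors convert on A, 150 of 1000 on B.
    z, p = two_proportion_z(100, 1000, 150, 1000)
    print(f"z = {z:.2f}, p = {p:.4f}")
```

A result like this says that version B did better than version A in this sample. It is silent on why users preferred it, which is the gap the hybrid approach described above is meant to fill.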

Technology companies have long known that social scientists can give them deeper insight into how people relate to their products; Xerox's research center, for example, hired the pioneering anthropologist Lucy Suchman. The next stage will be richer collaboration among computer scientists, statisticians and social scientists of many kinds, not only to test one another's findings but also to ask different kinds of questions more rigorously.

Given how much information about people is collected every day, including Facebook clicks, Global Positioning System (GPS) data, medical prescriptions, and Netflix queues, people will sooner or later have to decide whom to entrust with such information and what it should be used for. There is no avoiding the fact that data is by no means neutral and that it is hard to keep anonymous. But people can draw on expertise across different fields to better identify biases, gaps and assumptions, and to face up to new challenges to privacy and fairness. (Compiled by Cao Wei, Reference News Network)

(Responsible editor: The good of the Legacy)
