Big data in the cloud: Data velocity, volume, variety, veracity


This article describes how to perform big data analysis using the R language and similar tools and how to scale big data services in the cloud. It analyzes in detail one simple big data service, digital photo management, and shows how search, analytics, and machine learning are applied to unstructured data.

This article focuses on applications that use big data and explains the basic concepts behind big data analytics, including how those concepts combine with business intelligence (BI) applications and parallel technologies such as the computer vision (CV) and machine learning methods described in Part 3 of this series.

What distinguishes big data analytics from video analytics is the breadth of data types processed and the interactive analysis and search tools provided, which can be more sophisticated than data mining or MapReduce methods and far faster. For example, Google BigQuery uses columnar search to compress and accelerate interactive searches over massive unstructured data. In fact, in "An Inside Look at Google BigQuery," Google explains that BigQuery can perform a regular-expression text match on a large log table of about 35 billion rows and 20 TB in just tens of seconds. Equivalent MapReduce jobs run much longer but provide sophisticated data reduction.
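
As a hedged illustration of the interactive query style described above, here is a minimal sketch using the google-cloud-bigquery Python client. The project, table, and column names are hypothetical, and valid Google Cloud credentials are assumed:

```python
# Sketch: regular-expression text matching over a large log table in BigQuery.
# Assumes: pip install google-cloud-bigquery, application default credentials,
# and a hypothetical table `my_project.logs.requests` with a `message` column.
from google.cloud import bigquery

client = bigquery.Client()

sql = r"""
SELECT message
FROM `my_project.logs.requests`
WHERE REGEXP_CONTAINS(message, r'ERROR\s+\d{3}')  -- e.g., matches 'ERROR 503'
LIMIT 100
"""

for row in client.query(sql).result():  # runs interactively, not as a batch job
    print(row.message)
```

The point of the sketch is the shape of the workload: a single declarative query with a regular expression, returned interactively, rather than a long-running batch data-reduction job.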

Anyone with a Google account can use BigQuery. Alternatively, to get a quick feel for a typical big data search, download one of my photos and upload it to Google Images. You should get back all the pages that contain my image (from IBM, the University of Colorado Boulder, and so on), including at least one false positive. I use this example mostly to verify that downloaded images of me carry proper photo attribution and have reuse permission granted. Like the other examples analyzed here, it captures the essence of big data: finding a useful piece of information in an unstructured heap of data, or, more realistically, in millions of such heaps.

Defining big data

Big data is broadly defined as the capture, management, and analysis of data that goes beyond typical structured data, which can be queried with relational database management systems. It often consists of unstructured files, digital video, images, sensor data, log files, and generally any data not stored in records with distinct, searchable fields. In a sense, unstructured data is the interesting data, but it is difficult to synthesize or draw conclusions from it unless it can be correlated with structured data.

Big data also has new sources, such as machine generation (log files and sensor networks), mobile devices (video, photographs, and text messages), and machine-to-machine transfers, in which, for example, Internet of Things devices report status for fleet or aircraft maintenance planning or for general telemetric monitoring. One way to characterize this data is by its volume: IBM estimates that 2.5 quintillion (2,500,000,000,000,000,000) bytes of data are now created every day. The second characteristic is velocity: data rates are rising along with network bandwidth, now typically at gigabit rates (GigE, 10G, 40G, 100G) rather than megabit rates. The third is variety, which now includes more unstructured types such as digital video streams, sensor data, and log files. The last is the veracity of the data, or how far data collected at high speed and in huge volume can be trusted when critical decisions depend on it. It is not easy to establish that data is not counterfeit, is not corrupted, and comes from the intended source when, for example, it might come from one of thousands of security cameras, each producing thousands of frames of video per hour. So let's survey the key aspects of big data, its applications, and the systems behind it to understand them better.
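
To make the volume and velocity figures concrete, a quick back-of-the-envelope calculation helps. The daily volume below is IBM's estimate cited above, the 20 TB figure is the BigQuery example, and the link rates are nominal, ignoring protocol overhead:

```python
# Rough arithmetic: how long does it take to move big data over fast links?
DAILY_BYTES = 2.5e18            # ~2.5 quintillion bytes created per day (IBM estimate)
TWENTY_TB = 20e12               # the BigQuery log-table example above

for name, bits_per_s in [("GigE", 1e9), ("10G", 10e9), ("40G", 40e9), ("100G", 100e9)]:
    seconds = TWENTY_TB * 8 / bits_per_s
    print(f"20 TB over {name}: {seconds / 3600:.1f} hours")

# Per-person share of the daily data volume (~7 billion people)
print(f"Per person per day: {DAILY_BYTES / 7e9 / 1e6:.0f} MB")
```

Even at 100G, simply moving the 20 TB table takes about half an hour, which is why in-place columnar search in tens of seconds is remarkable; and the per-person figure (roughly 357 MB per day) grounds the observation in the next paragraph.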

Perhaps the best way to understand big data is to review its history, as Forbes magazine has done. Certainly, the volume that counts as big keeps growing and now exceeds 2.5 EB per day. Interestingly, most of this data will never be looked at by a human (against a world population of 7 billion per the U.S. Census, every person on Earth would have to review more than 300 MB of information each day). Given this challenge, the only logical way to use so much data is machine-to-machine automation or intelligent queries against big data. Furthermore, if so much data is kept for long periods, how does anyone know whether some of it has become corrupted? We can store data digests (such as an MD5 checksum) and use redundant arrays of independent disks (RAID, which uses mirroring, XOR parity, or erasure codes to detect and recover corrupted data), but there is growing concern that some data may suffer silent corruption anyway.
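
A minimal sketch of the digest idea mentioned above, using Python's standard hashlib. The file name is hypothetical, and MD5 serves here only as a corruption check, not as a security measure:

```python
import hashlib

def md5_digest(path: str, chunk_size: int = 1 << 20) -> str:
    # Stream the file in chunks so arbitrarily large files fit in memory
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Store the digest at ingest time; recompute and compare at read time.
stored = md5_digest("frame_000001.raw")   # hypothetical archived file
assert md5_digest("frame_000001.raw") == stored, "silent corruption detected"
```

The catch, of course, is that the digest is only checked when the data is read, which is exactly the silent-corruption concern the next paragraph takes up.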

The Internet Archive, a digital data curator, has investigated this concern. Overall, the veracity of big data is a significant challenge, but erasure codes and advanced data-digest methods show great promise. Traditional methods such as XOR RAID and simple mirroring are being replaced by RAID-6 and more advanced erasure codes; the traditional methods protect against data loss from only a single storage device failure and do not address undetected corruption caused by software defects, data center operator error, or media failures. The concept of data durability has become very important for big data, and I have worked with Intel and Amplidata to study the topic using mathematical models. With so much data, having humans verify its integrity is completely unworkable, and data may sit untouched for a long time before it is finally queried or accessed, by which point it may already have been lost.
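
As a toy sketch of the single-fault XOR protection that stronger erasure codes are replacing, consider three data blocks and one parity block: losing any one data block is recoverable, but losing two at once is not:

```python
def xor_blocks(*blocks: bytes) -> bytes:
    # Byte-wise XOR of equal-length blocks
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, b in enumerate(block):
            out[i] ^= b
    return bytes(out)

d1, d2, d3 = b"AAAA", b"BBBB", b"CCCC"   # data blocks (toy payload)
parity = xor_blocks(d1, d2, d3)          # stored on a separate device

# The device holding d2 fails: rebuild it from the survivors plus the parity
recovered = xor_blocks(d1, d3, parity)
assert recovered == d2
# A second simultaneous loss (say d1 and d2 together) is unrecoverable with
# XOR alone, which is why RAID-6 and erasure codes add further redundancy.
```

RAID-6 and general erasure codes extend this idea with additional independent checks, so that two or more simultaneous failures remain recoverable.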

Designing big data systems

The architecture for protecting data at scale should include defenses against loss, silent corruption, malware, and malicious changes made by cyber criminals or through cyber warfare. Data is an asset on which governments and businesses increasingly base important decisions, but if the data is not authentic, its value is reduced or lost entirely, or worse, bad decisions are made from it. A full treatment of this topic is beyond the scope of this article, but defenses against data loss and against undetected change or corruption are clearly indispensable.
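
One reasoning step worth making explicit: a plain checksum such as MD5 detects accidental corruption but not deliberate tampering, because an attacker who alters the data can simply recompute the checksum. A keyed digest is one common defense. Here is a minimal sketch with Python's standard hmac module, with deliberately simplified and hypothetical key handling:

```python
import hashlib
import hmac
import secrets

key = secrets.token_bytes(32)            # in practice, kept in a key store

def sign(data: bytes) -> bytes:
    # Keyed digest: cannot be recomputed without the key
    return hmac.new(key, data, hashlib.sha256).digest()

record = b"sensor-endpoint=camera-0042; frames=3600"
tag = sign(record)                       # stored alongside the record

tampered = b"sensor-endpoint=camera-9999; frames=3600"
print(hmac.compare_digest(sign(record), tag))    # True: record is intact
print(hmac.compare_digest(sign(tampered), tag))  # False: change detected
```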

One way to understand big data better is to study some of the cloud sites that make large data sets (often petabytes) available, along with query tools that applications can use (usually over terabytes). Most of us run Google queries every day, but Google also provides BigQuery, a tool that uses more sophisticated columnar storage and search (used as an example earlier in this article). Other notable examples include Facebook (social networking), Wikipedia (general knowledge), the Internet Archive (digital data curation), DigitalGlobe (geographic information systems [GIS]), Microsoft® Virtual Earth (GIS), Google Earth (GIS), and many new big data service providers.

Companies also have big data of their own internally and on private cloud systems. Many big data systems are read-only for user queries, with content collected from machine-generated sources, but systems that allow databases or unstructured data to be updated are likely to require strong authentication: passphrases, confirmation codes sent by text message to the user's mobile phone, graphical challenges to verify human data entry, and perhaps, in the future, more biometric authentication.
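
As a hypothetical sketch of the text-message confirmation-code step, the fragment below shows only code generation and constant-time verification using Python's standard library; the SMS delivery mechanism is assumed to exist:

```python
import hmac
import secrets

def generate_confirmation_code() -> str:
    # 6-digit one-time code drawn from a cryptographically secure source
    return f"{secrets.randbelow(10**6):06d}"

def verify_confirmation_code(expected: str, submitted: str) -> bool:
    # Constant-time comparison avoids leaking information via timing
    return hmac.compare_digest(expected, submitted)

code = generate_confirmation_code()
print("Code sent via SMS (simulated):", code)
print("Verified:", verify_confirmation_code(code, code))
```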

Big data applications

People think up killer applications for CV and video analytics every day, but some of them may take years to realize because of computing requirements or implementation cost constraints. Still, here is a list of interesting applications:

  • Stock market sentiment analysis using Google Trends has been shown to correlate well with historical falls and rises of the indexes. This may not be novel, but it is significant as a big data application. The article "Quantifying Trading Behavior in Financial Markets Using Google Trends" provides evidence that using sentiment analysis to make long and short stock trading decisions may outperform simple buy-and-hold strategies and index-fund investing. The study certainly needs more analysis, but it is intriguing. An interesting question, however, is what happens when such machine-based trading systems coexist with existing program trading.

  • Picasa photo organization from Google is a useful tool that lets users sort, query, and automatically recognize faces using CV technology and machine learning. It is a good way to experience the value of big data services and applications first-hand, and it clearly shows that big data analytics will require advanced analysis methods such as CV and machine vision.

  • Recommender systems such as Pandora (music), Netflix (movies), and Amazon (books and products) use customer data and multiple agents in a method known as collaborative filtering (see the sketch after this list). These big data services have become an advanced research topic in machine learning and data mining. Clearly, the ability to make good recommendations increases sales and customer satisfaction.

  • Customer sentiment analysis associates social network data (such as Facebook and Twitter) with BI collected from traditional customer transactions and derives customer sentiment from that text data. Sentiment analysis lets companies know what customers think of their products, their interest in the company or its competitors, and what they like and dislike.

  • Machine-generated data comes from sensor networks (such as sensors embedded in large systems such as urban traffic, traffic lights, and general infrastructure); from machine-to-machine transfers, in which sensor or log data from one machine (usually in the field) is ingested by another; and from log files, most often used by IT to debug problems and manage system exceptions (ignoring them when recovery and continued operation require no human intervention).

  • Travel reservation systems are improved by merging customer preferences, logistics, and previous history to provide helpful suggestions for the difficult task of trip planning.

  • Entertainment social networking is replacing the water-cooler discussion of the radio and television era: video-on-demand digital media now lets anyone view content almost anytime, yet viewers still share the experience through social networks. This not only makes content consumption more enjoyable, it lets content creators, scriptwriters, and artists understand their audiences better than ever before.

  • Medical diagnostics have long included rule-based expert decision support systems (DSS), but with big data there is evidence that such systems may finally move beyond research and become mainstream medical assistants. For example, a new DSS that assists in the objective psychological assessment of a patient's suicide risk has shown promise in studies. One way to prove the value of these systems is to compare them with historical data: they do not replace human decisions, but they are likely to improve decisions when used as support tools.
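
To make the collaborative-filtering idea above concrete, here is a tiny user-based sketch over a hypothetical user-item rating matrix: cosine similarity between users, then a similarity-weighted vote over unseen items. Real recommender systems are far more elaborate:

```python
import math

# Hypothetical ratings: user -> {item: rating}
ratings = {
    "alice": {"movie_a": 5, "movie_b": 3, "movie_c": 4},
    "bob":   {"movie_a": 4, "movie_b": 3, "movie_d": 5},
    "carol": {"movie_b": 2, "movie_c": 5, "movie_d": 4},
}

def cosine(u: dict, v: dict) -> float:
    # Similarity over the items both users have rated
    shared = set(u) & set(v)
    if not shared:
        return 0.0
    dot = sum(u[i] * v[i] for i in shared)
    return dot / (math.sqrt(sum(x * x for x in u.values()))
                  * math.sqrt(sum(x * x for x in v.values())))

def recommend(user: str) -> list[tuple[str, float]]:
    scores: dict[str, float] = {}
    for other, their in ratings.items():
        if other == user:
            continue
        sim = cosine(ratings[user], their)
        for item, rating in their.items():
            if item not in ratings[user]:      # only score unseen items
                scores[item] = scores.get(item, 0.0) + sim * rating
    return sorted(scores.items(), key=lambda kv: -kv[1])

print(recommend("alice"))   # movie_d is recommended via bob and carol
```

The design choice worth noting is that no content features are used at all: the recommendation emerges purely from the overlap in other customers' behavior, which is what lets the same machinery serve music, movies, and products alike.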

The applications of columnar queries, unstructured data analytics, MapReduce, and big data visualization and inference described here are just the entry level.
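
Finally, since MapReduce comes up repeatedly above, here is a minimal in-process sketch of the map and reduce phases, using a word count as the canonical example. Real frameworks such as Hadoop distribute exactly these two functions across a cluster, which is why the jobs run longer but scale to sophisticated data reduction:

```python
from collections import defaultdict

documents = ["big data in the cloud", "data velocity and data volume"]

# Map phase: emit (key, 1) pairs for every word
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle phase: group values by key
groups: dict[str, list[int]] = defaultdict(list)
for word, count in mapped:
    groups[word].append(count)

# Reduce phase: combine each key's values
counts = {word: sum(vals) for word, vals in groups.items()}
print(counts)   # {'big': 1, 'data': 3, ...}
```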
