InfoSphere Streams: A big data platform for analyzing data in motion
Source: Internet
Author: User
Keywords: big data, InfoSphere Streams, analysis, data in motion
Information from multiple sources is growing at an incredible rate. The number of Internet users reached 2.27 billion in 2012. Every day, Twitter generates terabytes of tweets, Facebook generates terabytes of log data, and the New York Stock Exchange collects 1 TB of trading information. Approximately 30 billion radio frequency identification (RFID) tags are created every day. In addition, hundreds of millions of GPS devices are sold each year, and more than 30 million networked sensors are currently in use (a number growing by more than 30% per year), all generating data. The amount of data is expected to double every 2 years over the next 10 years.
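To make the last claim concrete, the short sketch below works out what "doubling every 2 years for 10 years" implies. The function name `growth_factor` is purely illustrative, and the calculation assumes the doubling period stays constant over the whole decade.

```python
def growth_factor(years: float, doubling_period_years: float = 2.0) -> float:
    """Return how many times larger the data volume is after `years`,
    assuming it doubles once every `doubling_period_years`."""
    return 2 ** (years / doubling_period_years)


# Print the multiplier at each doubling step across the 10-year horizon.
for year in range(0, 11, 2):
    print(f"after {year:2d} years: {growth_factor(year):.0f}x the starting volume")
```

Under that assumption, five doublings fit into ten years, so the volume at the end of the decade is 2^5 = 32 times the starting volume.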
A company can generate petabytes of information in a year: Web pages, blogs, click streams, search indexes, social media forums, instant messages, text messages, e-mail, documents, user demographics, sensor data from active and passive systems, and more. Many estimate that up to 80% of this data is semi-structured or unstructured. Companies are looking for more agile and more innovative ways to run their data analysis and decision-making processes, and they realize that time lost in these processes can lead to missed business opportunities. At the heart of the big data challenge is the ability of companies to analyze and understand Internet-scale information as easily as they can now analyze and understand smaller volumes of structured information.
Figure 1 illustrates the big data challenge: extracting insight, in context, from data of very large volume, wide variety, and high velocity, in ways that were not previously possible.
Figure 1. Big Data Challenge
In Figure 1, Volume refers to the size of the data, from terabytes up to zettabytes. Variety refers to the complexity of data in many different structures, from relational data to logs to raw text. Velocity reflects streaming data and large-scale data movement.
IBM is helping companies deal with big data challenges by providing tools to integrate and manage massive, high-speed data, apply analytics to data in its native format, visualize available data for ad hoc analysis, and more. This article describes InfoSphere Streams, which enables you to analyze many data types at the same time and perform complex computations in real time. You will learn how InfoSphere Streams works, what it does, and how to use it together with another IBM big data product, IBM InfoSphere BigInsights, to perform highly complex analysis.
InfoSphere BigInsights: Overview
Understanding InfoSphere BigInsights will help you appreciate more fully the purpose and value of InfoSphere Streams. (If you are already familiar with BigInsights, you can skip to the next section.)
BigInsights is an analytics platform that helps companies turn complex, Internet-scale information sets into insights. It packages a distribution of Apache Hadoop (with a highly streamlined installation process) together with associated tools for application development, data movement, and cluster management. Thanks to its simplicity and scalability, Hadoop (an open source implementation of the MapReduce framework) has achieved tremendous success in both industry and academia. In addition to Hadoop, the other open source technologies in BigInsights (all of them Apache Software Foundation projects except Jaql) include:
Pig: A platform that provides a high-level language for expressing programs that analyze large data sets. Pig includes a compiler that converts Pig programs into sequences of MapReduce jobs, which the Hadoop framework then executes.
Hive: A data warehouse solution built on top of the Hadoop environment. It brings familiar relational database concepts, such as tables, columns, and partitions, along with a subset of SQL (HiveQL), to the unstructured world of Hadoop. Hive queries are compiled into MapReduce jobs that are executed by Hadoop.
Jaql: A query language developed by IBM specifically for JSON (JavaScript Object Notation) that provides a SQL-like interface. Jaql handles nested data gracefully, is highly function-oriented, and is very flexible. It is well suited to loosely structured data and serves as an interface to HBase column storage and to text analytics.
HBase: A column-oriented NoSQL data storage environment designed to support large, sparsely populated tables in Hadoop.
Flume: A distributed, reliable, and available service for efficiently moving large amounts of data as it is generated. Flume is ideal for collecting logs from multiple systems and inserting them into the Hadoop Distributed File System (HDFS).
Lucene: A search engine library that provides high-performance, full-featured text search.
Avro: A data serialization technology that uses JSON to define data types and protocols, and serializes data in a compact binary format.
ZooKeeper: A centralized service for maintaining configuration information and naming, and for providing distributed synchronization and group services.
Oozie: A workflow scheduler system for managing and orchestrating the execution of Apache Hadoop jobs.
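Since Pig programs and Hive queries both end up as MapReduce jobs, it may help to see the MapReduce model itself in miniature. The following is a toy, single-process sketch in plain Python, not the Hadoop API: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the sample input are assumptions made for illustration only.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Map step: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle step: group all emitted values by key.
    A real framework does this across machines; here it is an in-memory dict."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce step: collapse all values for one key into a single count."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle, and reduce in sequence over the input lines."""
    mapped = (pair for line in lines for pair in map_phase(line))
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Hypothetical sample input, chosen only for this illustration.
sample = ["big data in motion", "big data at rest"]
counts = word_count(sample)
print(counts)
```

In a real Hadoop cluster the map and reduce functions run in parallel on many nodes and the shuffle moves data over the network; the sketch only shows how the three steps fit together.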
In addition, the BigInsights release includes the following IBM-specific technologies:
BigSheets: A browser-based, spreadsheet-like query and discovery interface that lets business users easily collect and analyze data while leveraging the power of Hadoop. It provides built-in readers for data in a variety of common formats, including JSON, comma-separated values (CSV), and tab-separated values (TSV).
Text analytics: A pre-built library of text annotators for common business entities. It provides a rich language and tooling for building custom annotators.
Adaptive MapReduce: An IBM solution that accelerates the execution of small MapReduce jobs by changing the way MapReduce tasks are handled.
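To illustrate what reading "a variety of common formats" involves, the sketch below loads the same record from JSON, CSV, and TSV text into one common shape using only the Python standard library. It is not BigSheets code and does not reflect how BigSheets readers are implemented; the `read_records` helper and the sample data are hypothetical.

```python
import csv
import io
import json
from typing import Dict, List


def read_records(text: str, fmt: str) -> List[Dict[str, str]]:
    """Parse `text` in the given format into a list of string-valued records."""
    if fmt == "json":
        # Expect a JSON array of flat objects; stringify values so that
        # all three formats produce the same record shape.
        return [{key: str(value) for key, value in row.items()}
                for row in json.loads(text)]
    if fmt in ("csv", "tsv"):
        delimiter = "," if fmt == "csv" else "\t"
        # DictReader uses the first row as the field names.
        return [dict(row) for row in csv.DictReader(io.StringIO(text), delimiter=delimiter)]
    raise ValueError(f"unsupported format: {fmt}")


# Hypothetical sample data: one click-stream record in three encodings.
json_text = '[{"user": "ann", "clicks": 3}]'
csv_text = "user,clicks\nann,3\n"
tsv_text = "user\tclicks\nann\t3\n"

from_json = read_records(json_text, "json")
from_csv = read_records(csv_text, "csv")
from_tsv = read_records(tsv_text, "tsv")
print(from_json == from_csv == from_tsv)
```

Normalizing every source into one record shape up front is what lets a spreadsheet-style tool apply the same filters and formulas regardless of where the data came from.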
Generally speaking, BigInsights is not designed to replace a traditional relational database management system (DBMS) or a traditional data warehouse. Specifically, it is not optimized for interactive queries over tabular data structures, online analytical processing (OLAP), or online transaction processing (OLTP) applications. However, as part of IBM's big data platform, BigInsights provides potential integration points with other components of the platform, including data warehouses, data integration and governance engines, and third-party data analysis tools. As you'll see later in this article, it can also be integrated with InfoSphere Streams.