Developing a big data application for data exploration and discovery


Exploring big data alongside traditional enterprise data is a common requirement for many organizations. In this article, we outline methods and guidelines for indexing big data that is managed through a Hadoop-based platform, so that this data can be used in data discovery solutions. Specifically, we explain how to push data from IBM's InfoSphere BigInsights (a Hadoop-based platform) to InfoSphere Data Explorer, a sophisticated tool that enables business users to explore and combine data from multiple enterprise and external data sources.

Introduction

If you've followed many of the early case studies around big data, you might believe that you don't know what you don't know. Indeed, big data applications often focus on gleaning business insights from data that would otherwise be discarded or ignored for a variety of reasons. Increasingly, companies want a comprehensive information management strategy that involves more than simply exploring or analyzing big data. Specifically, they want to integrate big data with their existing data systems (including relational DBMSs, enterprise content management systems, data warehouses, and so on) as part of that overall strategy.

This article examines one aspect of this challenge. It outlines a framework and methodology for indexing big data and traditional data sources, and for providing web-based interfaces through which users can discover new insights across these different sources. Specifically, it describes how Data Explorer (a data discovery platform) indexes data managed by InfoSphere BigInsights and supports combining big data, held in its various persistence formats, with existing enterprise data. Both Data Explorer and BigInsights are important components of IBM's big data platform, so we first outline this platform and these two products.

IBM's big data platform overview

IBM's big data platform is designed to help organizations explore, analyze, and manage a wide variety of data, including streaming data, traditional business data, and "non-traditional" or secondary data that has previously been difficult to incorporate into an enterprise's business intelligence and analytics platform. Let's start with a brief look at this platform and then focus on two important components: InfoSphere Data Explorer and InfoSphere BigInsights.

Figure 1 depicts the architecture of IBM's big data platform, which stands apart from other commercial offerings in its breadth of functionality. From top to bottom, you'll see that the platform contains a wealth of features and technologies for visualizing and discovering insights from a variety of data sources, developing analytics applications, and managing your environment. Data Explorer provides key visualization and discovery capabilities for the platform, so we discuss that component in more detail later. The accelerators shown in Figure 1 are IBM toolkits that include dozens of pre-built software artifacts to help companies quickly deploy solutions that analyze social media data and machine data, such as logs. Three data-processing engines enable organizations to respond effectively to the variety, volume, and velocity inherent in big data. These engines are a Hadoop-based system (BigInsights, which we explore in detail later), a stream computing platform (InfoSphere Streams), and a data warehouse platform (such as PureData System for Analytics or DB2). Finally, IBM's big data platform also includes connectors to other popular enterprise software, including relational DBMSs, extract/transform/load (ETL) platforms, business intelligence tools, content management systems, and more.

Figure 1. IBM's big data platform architecture

InfoSphere BigInsights overview

InfoSphere BigInsights is IBM's platform for persisting and analyzing big data in many forms. Based on the open source Apache Hadoop project, BigInsights is designed to help companies discover and analyze business insights hidden in massive amounts of data, data that is often ignored or discarded because processing it with traditional methods is impractical or difficult. Examples of such data include logs, clickstreams, social media data, news feeds, e-mail, electronic sensor output, and even some transactional data.

To help businesses efficiently derive value from these types of data, BigInsights Enterprise Edition contains a number of open source projects from the Hadoop ecosystem, as well as IBM-developed technologies that enhance and extend the value of this open source software. As shown in Figure 2, these technologies range from application accelerators to analysis tools, development tools, platform improvements, and enterprise software integration. For example, BigInsights customers can use sophisticated text analytics to extract content and context from documents, e-mails, and messages. Application developers can use Eclipse-based wizards to accelerate the development of custom Java MapReduce, Jaql, Hive, Pig, and text analytics applications. Administrators can manage and monitor their BigInsights environments through an integrated web console, which also lets business users launch IBM-provided or custom-developed applications from a web-based catalog.
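To make the MapReduce programming model mentioned above more concrete, the following is a minimal sketch in plain Python. It is not BigInsights or Hadoop code and uses no Hadoop API; it only imitates the map, shuffle, and reduce phases in a single process. The log format (timestamp, path, status code) and all function names are assumptions made up for this illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

# Hypothetical log lines in an assumed "timestamp path status" format.
SAMPLE_LOG = [
    "2013-01-01T00:00:01 /index.html 200",
    "2013-01-01T00:00:02 /missing 404",
    "2013-01-01T00:00:03 /index.html 200",
    "malformed",
]


def map_phase(lines: Iterable[str]) -> Iterator[Tuple[str, int]]:
    """Emit a (status_code, 1) pair for every well-formed log line."""
    for line in lines:
        parts = line.split()
        if len(parts) >= 3:  # skip lines that do not match the assumed format
            yield parts[2], 1


def shuffle_phase(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group emitted values by key, as the framework would between phases."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)


def reduce_phase(grouped: Dict[str, List[int]]) -> Dict[str, int]:
    """Sum the values collected for each key."""
    return {key: sum(values) for key, values in grouped.items()}


def count_status_codes(lines: Iterable[str]) -> Dict[str, int]:
    """Run the three phases in sequence over the given lines."""
    return reduce_phase(shuffle_phase(map_phase(lines)))


if __name__ == "__main__":
    print(count_status_codes(SAMPLE_LOG))
```

In a real Hadoop cluster the map and reduce steps would run in parallel across many nodes and the framework would perform the shuffle; the sketch only shows how the work is divided.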

In this article, we focus on a subset of the BigInsights features, such as text analytics and the application lifecycle tools.

Figure 2. InfoSphere BigInsights architecture
