Methods and guidelines for indexing big data managed by a Hadoop platform

Source: Internet
Author: User
Keywords: big data, Hadoop, indexing

This article discusses how to push data from IBM InfoSphere® BigInsights™ (a Hadoop-based platform) to InfoSphere Data Explorer. InfoSphere Data Explorer is a sophisticated tool that enables business users to explore and combine data from multiple enterprise and external data sources.

If you've followed many of the early case studies around big data, you might conclude that you don't know what you don't know. Indeed, big data applications often focus on extracting business insights from data that might otherwise be discarded or ignored for a variety of reasons. Companies are increasingly looking to develop a comprehensive information management strategy that involves more than simply exploring or analyzing big data. Specifically, they want to integrate big data with existing data systems (including relational DBMSs, enterprise content management systems, data warehouses, and so on) as part of that overall strategy.

This article examines one aspect of this challenge: it outlines an architecture and approach for indexing big data alongside traditional data sources, and for providing web-based interfaces that let users discover new insights across these different sources. Specifically, it describes how Data Explorer (a data discovery platform) can index data managed by InfoSphere BigInsights, and how it supports integrating big data, in the formats in which it is persisted, with existing enterprise data. Both Data Explorer and BigInsights are important components of IBM's big data platform, so we first outline this platform and these two products.

IBM's Big Data Platform Overview

IBM's big data platform is designed to help organizations explore, analyze, and manage a wide variety of data, including streaming data, traditional business data, and "non-traditional" or secondary data that has previously been difficult to incorporate into an enterprise's business intelligence and analytics platform. Let's start with a brief look at this platform and then focus on two important components: InfoSphere Data Explorer and InfoSphere BigInsights.

Figure 1 depicts the architecture of IBM's big data platform, which differs from other commercial offerings in the breadth of its functionality. From top to bottom, you'll see that IBM's platform contains a wealth of features and technologies to visualize and discover insights from a variety of data sources, develop analytics applications, and manage your environment. Data Explorer provides important visualization and discovery capabilities for the platform, so we'll discuss that component in more detail later. The accelerators shown in Figure 1 are IBM toolkits that include dozens of pre-built software artifacts to help companies quickly deploy solutions that analyze social media and machine data, such as logs. Three data-processing engines enable organizations to respond effectively to the variety, volume, and velocity inherent in big data. These engines are a Hadoop-based system (BigInsights, which we'll explore in detail later), a stream computing platform (InfoSphere Streams), and a data warehouse platform (such as PureData™ for Analytics or DB2®). Finally, IBM's big data platform also includes connectors to other popular enterprise software, including relational DBMSs, extract/transform/load (ETL) platforms, business intelligence tools, content management systems, and more.

Figure 1. IBM's big data platform architecture

InfoSphere BigInsights Overview

InfoSphere BigInsights is IBM's platform for persisting and analyzing big data in many forms. Based on the open source Apache Hadoop project, BigInsights is designed to help companies discover and analyze business insights hidden in massive amounts of data that would ordinarily be ignored or discarded because processing it with traditional methods is impractical or difficult. Examples of such data include logs, clickstreams, social media data, news feeds, e-mail, electronic sensor output, and even some transactional data.

To help businesses efficiently derive value from these types of data, BigInsights Enterprise Edition contains several open source projects from the Hadoop ecosystem, as well as technologies developed by IBM that enhance and extend the value of this open source software. As shown in Figure 2, these technologies range from application accelerators to analysis tools, development tools, platform improvements, and enterprise software integration. For example, BigInsights customers can use sophisticated text analytics to extract content and context from documents, e-mails, and messages. Application developers can use Eclipse-based wizards to accelerate the development of custom Java™ MapReduce, Jaql, Hive, Pig, and text analytics applications. An integrated web console allows administrators to manage and monitor their BigInsights environments, and lets business users launch IBM-provided or custom-developed applications through a web-based catalog.

In this article, we will focus on a subset of the BigInsights features, such as text analysis and application lifecycle tools.

Figure 2. InfoSphere BigInsights architecture

InfoSphere Data Explorer Overview

InfoSphere Data Explorer lets you index large volumes of structured, unstructured, and semi-structured data from different data sources. It also provides the ability to build big data discovery applications and 360-degree information applications. Drawing on large collections of data stored in different internal and external repositories, InfoSphere Data Explorer allows users to create views of information about different entities (such as customers, products, events, partners, and so on) without moving the data.

An important challenge for today's businesses is that users cannot quickly find the information they need to solve a business problem or complete a task. Typically, data is dispersed across different systems, each supporting specific applications and managed by a different organization. In addition, new data sources such as social media, Twitter, and feeds from mobile devices are becoming critical resources that people may need to consider in their day-to-day work and when making important decisions.

For example, customer information such as contact details, purchased products, service notes, and warranty information is often stored in different business applications, such as CRM systems, support ticket systems, market portals, and so on. Imagine a salesperson who wants to contact a customer about an additional sale. The salesperson must first log into 10 applications to piece together the customer's information, or talk to 5 people to gather it all.

Data Explorer addresses this important challenge. Information is stored in many different systems and silos, yet users need to see all the data in a consistent way and navigate quickly to the information that is most relevant to them. The goal is to deliver this information at the point where employees most need it to make decisions.
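To make the idea of a unified view over data silos concrete, here is a minimal Python sketch based on the salesperson scenario. It is an illustration only and not how Data Explorer is implemented: the source names, record fields, sample data, and helper functions (tokenize, build_index, search, customer_view) are all hypothetical. The index stores only pointers (source name and record ID), so the records themselves stay in their original systems and are read from there at query time.

```python
from collections import defaultdict

# Hypothetical records held by three separate systems (illustrative data only).
SOURCES = {
    "crm": [
        {"id": "c-1", "customer": "Acme Corp", "text": "Contact: Dana Lee, purchasing"},
        {"id": "c-2", "customer": "Globex", "text": "Contact: Sam Roy, operations"},
    ],
    "support_tickets": [
        {"id": "t-7", "customer": "Acme Corp", "text": "Router reboots under load"},
    ],
    "warranty": [
        {"id": "w-3", "customer": "Acme Corp", "text": "Router warranty valid for two years"},
    ],
}

PUNCTUATION = ".,:;"


def tokenize(text):
    """Lowercase the text, split on whitespace, and trim simple punctuation."""
    tokens = (word.strip(PUNCTUATION) for word in text.lower().split())
    return [token for token in tokens if token]


def build_index(sources):
    """Map each token to the set of (source name, record id) pointers."""
    index = defaultdict(set)
    for source_name, records in sources.items():
        for record in records:
            for token in tokenize(record["customer"] + " " + record["text"]):
                index[token].add((source_name, record["id"]))
    return index


def search(index, query):
    """Return pointers to records that contain every token in the query."""
    tokens = tokenize(query)
    if not tokens:
        return set()
    return set.intersection(*(index.get(token, set()) for token in tokens))


def customer_view(sources, index, query):
    """Group matching records by source, reading each one from its own system."""
    view = defaultdict(list)
    for source_name, record_id in sorted(search(index, query)):
        record = next(r for r in sources[source_name] if r["id"] == record_id)
        view[source_name].append(record)
    return dict(view)


index = build_index(SOURCES)
view = customer_view(SOURCES, index, "Acme Corp")
for source_name, records in view.items():
    print(source_name, [record["id"] for record in records])
```

A single query for "Acme Corp" returns the matching CRM, support ticket, and warranty records grouped by source, which is the kind of consolidated view the salesperson would otherwise assemble by hand. A real deployment would also need connectors, security filtering, and relevance ranking, none of which this sketch attempts.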

Figure 3. InfoSphere Data Explorer architecture
