Why commercial Hadoop implementations are best suited for enterprise deployments


Analysis is at the core of every enterprise data deployment. Relational databases remain the best technology for running transactional applications (which are certainly critical for most businesses), but when it comes to data analysis, relational databases strain. An enterprise's adoption of Apache Hadoop (or a Hadoop-like big data system) reflects a focus on performing analysis, rather than simply on storing transactions.

To successfully implement a Hadoop or Hadoop-like system with analysis capabilities, an enterprise must address preparation issues in the following four categories:

- Security: prevention of data theft and access control
- Support: documentation and consulting
- Analytics: the minimum analysis features an enterprise requires
- Integration: integration with legacy or third-party products for data migration or data exchange

Using these four categories as a basis for comparison, this article examines why enterprises use commercial Hadoop products (such as InfoSphere BigInsights) rather than plain open source Hadoop installations.

Preventing data theft and controlling access

Security issues are a common problem in Hadoop deployments. By design, Hadoop stores and processes unstructured data from multiple sources, which can lead to access control, data authorization, and ownership issues. IT managers need to control access to data entering and leaving the system. The fact that a Hadoop (or Hadoop-like) environment contains data with varying levels of confidentiality and sensitivity can exacerbate access control issues. The ultimate risks are data theft, improper data access, and data disclosure.

Data theft is a prominent issue at the enterprise level; enterprise IT systems are frequently under attack. These problems have been solved in traditional relational systems, but implementing solutions for big data systems is different because new technologies are involved. By default, most big data systems do not encrypt data at rest, and this problem must be resolved first. Again, relational systems have overcome similar problems. Moreover, because Hadoop-like systems do not yet have mature cluster management tools, there may be unwanted direct access to data files or DataNode processes.

In addition, merging multiple datasets for analysis creates a new dataset that may require independent access control. You must now define roles that apply to the combination of data sources. You must draw clear boundaries for roles on either a technical or a functional basis, and neither option is perfect. Building roles on a functional basis can invite snooping on data, but it is easier for administrators to implement when datasets are merged. The technical basis protects the original data nodes, but creates access problems after nodes are merged. The access control and security features built into the Hadoop Distributed File System (HDFS) cannot resolve this dilemma. Some companies that use Hadoop are building new environments to store merged datasets, or are protecting access to merged data through custom firewalls.
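The access control dilemma around merged datasets can be sketched in a few lines. This is a minimal illustration in plain Python, not a Hadoop or HDFS API: the policy table, role names, and dataset names are all hypothetical. It uses a conservative rule in which a role may read merged data only if it may read every source, which shows why merging usually forces new roles to be defined.

```python
# Hypothetical per-dataset access policies: dataset -> roles allowed to read it.
# (Illustrative only -- not an HDFS or Hadoop security API.)
POLICIES = {
    "sales_db": {"sales_analyst", "admin"},
    "hr_db": {"hr_analyst", "admin"},
}

def roles_for_merged(*datasets):
    """Roles allowed to read a dataset merged from several sources.

    Conservative rule: a role may read the merged data only if it may
    read every individual source (set intersection of the policies).
    """
    allowed = None
    for ds in datasets:
        roles = POLICIES[ds]
        allowed = set(roles) if allowed is None else allowed & roles
    return allowed or set()

# Only "admin" may read both sources, so only "admin" may read the merge --
# neither analyst role carries over, and a new role must be defined for them.
print(sorted(roles_for_merged("sales_db", "hr_db")))  # ['admin']
```

The intersection rule is safe but restrictive; a union rule would be convenient but would let each analyst see the other department's data through the merge, which is exactly the snooping risk described above.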

Products such as InfoSphere Guardium® Data Security are available to help secure data in Hadoop-based systems. InfoSphere Guardium Data Security automates the entire compliance audit process in heterogeneous environments, with features such as automatic discovery of sensitive data, automated compliance reporting, and dataset access control.

Documentation and Consulting

Lack of documentation is another common enterprise problem. Roles and specifications change constantly, and consultants and employees leave. Unless roles and specifications are clearly documented, much of the work must start from scratch when changes occur. This is a major problem with open source Apache Hadoop. In contrast, structured Hadoop-based products designed for enterprises (such as IBM InfoSphere BigInsights) address this issue and provide structured documentation and enterprise-class support. In fact, every advance in the open source Hadoop version applies to BigInsights, because BigInsights is built on Apache Hadoop and adds these advantages on top of it.

By deploying products such as InfoSphere BigInsights, organizations gain the benefits of external support. For business reasons, large enterprises typically retain only one support team for core IT capabilities. Constrained by their level of technical experience, such teams find complex deployments nearly impossible to accomplish. Some small companies specialize in helping large companies perform complex Hadoop deployments, but enterprises cannot rely on small companies for long-term support, because those companies may not be around for long.

The structured advice and support provided by reputable vendors solve these problems. A standard version of Hadoop can be deployed, tracked, and supported to meet business needs and expectations. External consultants can also fill the roles of full-time employees, but with the right skill set, and they bring experience and best practices from many industries. This is a particularly important advantage given that big data remains a new field short on professional experience. Big data consulting can also meet internal training needs and enrich employees' skill sets, and consultant support can be used for extended projects and general maintenance.

Creating business value through analysis

Big data deployment is all about maximizing information gain. Apache Hadoop provides the technical power and infrastructure for processing data along three dimensions: volume, variety, and velocity. However, accumulating and processing all that data is meaningless unless the data is available for analysis. Data may come from multiple sources: flat files, databases, packaged applications, enterprise resource planning (ERP) or customer relationship management (CRM) systems, or data streams. The first task is to manage and store the data, and Hadoop is good at that. But data management and storage per se provide no business value; business value comes from analyzing the data. (This is where relational databases are weak: they can store large amounts of data, but they cannot process it efficiently in real time.)
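The analysis step Hadoop enables follows the MapReduce model. The following is a minimal local simulation of that model in plain Python (no cluster, no Hadoop APIs; the record data is invented for illustration), showing how a map phase emits key-value pairs and a reduce phase aggregates them into an answer with business meaning, here, total transaction value per customer.

```python
from itertools import groupby
from operator import itemgetter

def map_phase(records):
    # Mapper: emit a (key, value) pair per input record,
    # as a Hadoop mapper would.
    for customer, amount in records:
        yield customer, amount

def reduce_phase(pairs):
    # Shuffle/sort by key, then aggregate each key's group,
    # as the Hadoop framework and a reducer would.
    for key, group in groupby(sorted(pairs, key=itemgetter(0)),
                              key=itemgetter(0)):
        yield key, sum(amount for _, amount in group)

# Toy transaction records: (customer, amount).
records = [("alice", 30), ("bob", 20), ("alice", 50)]
totals = dict(reduce_phase(map_phase(records)))
print(totals)  # {'alice': 80, 'bob': 20}
```

On a real cluster the same two functions would run in parallel across many nodes over far larger inputs; the programming model is what stays the same.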

To analyze the data stored in Hadoop, applications designed for that purpose must be built on Hadoop. They may be statistical, data visualization, or analysis tools. If they are not built from scratch, software such as IBM SPSS, SAS, or R must be linked to Hadoop via an API. Even Google, which invented MapReduce, now uses it only to collect and collate data; for analytics, Google uses Dremel, a scalable query system for analyzing read-only nested data.

Enterprises (even those that are not huge Internet companies handling petabytes of data) still have many analytics use cases, including:

- Risk analysis in financial services
- Fraud detection in programmatic, real-time transactions
- Understanding customer behavior for insurance purposes
- Understanding customer behavior to improve credit risk management
- Analyzing supplier performance in high-speed service businesses, or to optimize related services
- Manufacturing and monitoring intelligent products, such as products embedded with radio frequency identification (RFID) tags (for example, in courier services or inventory systems)
- Cost management
- Sensor data analysis
- Customer transaction analysis for marketing purposes (for example, in the telecommunications industry, where businesses often build call and data service packages around popular customer trends)
- Marketing campaigns through social media

Traditional data analysis and business intelligence tools cannot analyze massive amounts of data for these purposes. The software you use must not only perform large-scale analysis, but must also be able to drill down into specific details to determine the actions required to achieve the business purpose of the analysis. This ability to extract useful nuggets of information is a necessary analytical skill, and it is a weakness of most big data analysis. There is a trade-off: the larger the scale of the analysis, the weaker the ability to drill down into details, and vice versa.

InfoSphere BigInsights supports both large-scale analysis and access to in-depth insights. Through its included Hadoop implementation, InfoSphere BigInsights is designed for exploratory analysis of large volumes of data, structuring data insights that were previously out of reach. It supports built-in data compression and features such as Jaql, a JSON query language that makes it easy to manipulate and analyze semi-structured JSON data. On top of this, it provides MapReduce-based text and machine learning analytics. This matters because it is often impossible to know exactly what to look for when trying to gain insights from large-scale data. Machine learning is useful for discovering and predicting patterns and trends, and for extracting statistical models (where they exist) from unstructured data.
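The kind of filter-and-transform pipeline that Jaql runs over semi-structured JSON can be sketched in plain Python as a rough analogy (this is not Jaql syntax, and the records and field names are invented): parse JSON, filter on a condition, and project the fields of interest, tolerating records where optional fields are missing.

```python
import json

# Semi-structured records of the kind a JSON query language operates on.
# Note the "country" field is optional -- that is what "semi-structured" implies.
raw = '''[
  {"user": "a", "clicks": 12, "country": "US"},
  {"user": "b", "clicks": 3},
  {"user": "c", "clicks": 25, "country": "DE"}
]'''

records = json.loads(raw)

# Pipeline in the filter -> transform style: keep active users,
# project just the fields the analysis needs.
result = [
    {"user": r["user"], "clicks": r["clicks"]}
    for r in records
    if r["clicks"] > 10
]
print(result)  # [{'user': 'a', 'clicks': 12}, {'user': 'c', 'clicks': 25}]
```

In BigInsights the analogous pipeline would be expressed in Jaql and compiled down to MapReduce jobs over data in HDFS rather than run in-process.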

Integration with legacy and third-party systems

For practical reasons, advanced applications such as ERP software are not currently built on Hadoop. Instead, data from third-party systems must be seamlessly integrated with the Hadoop-like system. The most common way to bring in web-based data is through SOAP. For other applications, you need specialized connectors, built primarily with Java™, .NET, or C++. You can develop these custom integrations yourself or use products such as IBM Netezza. In addition to providing a large set of parallel advanced and predictive algorithms, Netezza lets you create custom analytics in a number of programming languages (including C, C++, Java, Perl, Python, and R). It supports integration with SPSS® and with analysis software from companies such as SAS, Revolution Analytics (for Enterprise R), Fuzzy Logix, and Zementis. Its programmatic interface also supports integration with almost all ERP systems through C and Java connectors, such as SAP's JCo Java connector.
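To make the SOAP path concrete, here is a hedged sketch of building a minimal SOAP 1.1 request envelope with the Python standard library. The operation name (`GetCustomerOrders`), the `http://example.com/erp` namespace, and the customer ID are all hypothetical; a real integration would derive them from the target system's WSDL and then POST the payload over HTTP.

```python
import xml.etree.ElementTree as ET

SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/"
APP_NS = "http://example.com/erp"  # hypothetical ERP service namespace

# Give the SOAP namespace its conventional prefix in the serialized XML.
ET.register_namespace("soap", SOAP_NS)

# Envelope > Body > operation element, per SOAP 1.1 structure.
envelope = ET.Element(f"{{{SOAP_NS}}}Envelope")
body = ET.SubElement(envelope, f"{{{SOAP_NS}}}Body")
request = ET.SubElement(body, f"{{{APP_NS}}}GetCustomerOrders")
ET.SubElement(request, f"{{{APP_NS}}}CustomerId").text = "C-1001"

payload = ET.tostring(envelope, encoding="unicode")
print(payload)
```

The resulting XML string would be sent as the body of an HTTP POST (typically with a `SOAPAction` header), and the response parsed the same way before the data is landed in the Hadoop-side store.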

InfoSphere BigInsights goes a step further in the third-party integration category, supporting not only IBM's Hadoop distribution but also Cloudera's. Cloudera support is important because Cloudera has a huge customer base, and those customers can now easily use the BigInsights tools.

For data flowing in from multiple sources, BigInsights can connect directly to DB2®, Netezza, and PureData™. It also comes with BigIndex, a MapReduce tool that builds indexes for search-based analytics applications.
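Conceptually, an index-building job of this kind produces an inverted index: map each document into (term, document ID) pairs, then reduce the pairs into a postings list per term. The following toy version in plain Python (not the BigIndex tool itself; the documents are invented) shows the idea.

```python
from collections import defaultdict

# Toy document collection: doc_id -> text.
docs = {
    1: "hadoop stores big data",
    2: "analytics turns data into value",
}

def build_inverted_index(docs):
    index = defaultdict(set)
    for doc_id, text in docs.items():   # "map": emit (term, doc_id) pairs
        for term in text.split():
            index[term].add(doc_id)
    # "reduce": collect the postings list for each term
    return {term: sorted(ids) for term, ids in index.items()}

index = build_inverted_index(docs)
print(index["data"])  # [1, 2] -- both documents contain "data"
```

A search-based analytics application then answers term queries by looking up postings lists instead of scanning every document, which is what makes the index worth building at Hadoop scale.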

Conclusion

Hadoop with integrated analysis capabilities is ideal for business purposes. Plain Hadoop cannot easily take advantage of analytic applications, and by itself it provides no business value. Developing analysis features from scratch, and supplementing plain Hadoop with cross-application features and support, is a daunting, time-consuming, and potentially expensive task. Enterprise Hadoop products such as InfoSphere BigInsights address the technical issues of deployment, make consulting easy and sustainable, and integrate seamlessly with both legacy and modern systems. Enterprise Hadoop includes cutting-edge analysis tools for gaining insights from the data itself, and for combining those insights with Internet and sensor data to uncover hidden, useful nuggets of information.
