Why are commercial Hadoop implementations better suited for enterprise deployment?
MapReduce is the preferred technology for enterprises that want to analyze ever-growing volumes of data. Companies can choose a plain open source MapReduce implementation (most notably Apache Hadoop) or a commercial implementation. This article argues that commercial Hadoop-based products (such as InfoSphere BigInsights) meet enterprise requirements better than plain Hadoop does.
Analytics is at the core of every enterprise big data deployment. Relational databases remain the best technology for running transactional applications (which are certainly critical for most businesses), but when it comes to large-scale data analysis they can struggle. An enterprise's adoption of Apache Hadoop (or a Hadoop-like big data system) reflects a focus on performing analysis rather than simply storing transactions.
To successfully implement a Hadoop or Hadoop-like system with analytical capabilities, an enterprise must address readiness issues in the following four categories:
Security: preventing data theft and controlling access
Support: documentation and consulting
Analytics: the minimum analytical features the enterprise requires
Integration: integration with legacy or third-party products for data migration or data exchange
Using these four categories as a basis for comparison, this article examines why businesses choose commercial Hadoop products (such as InfoSphere BigInsights) over plain open source Hadoop installations.
InfoSphere BigInsights
InfoSphere BigInsights is IBM's distribution of Hadoop. It includes core Hadoop features (the Hadoop Distributed File System and MapReduce) along with other services from the Hadoop ecosystem, such as Apache Pig, Hive, and ZooKeeper. On top of these, it adds operational features such as compression optimized for big data, workload management and scheduling, and an application development and deployment ecosystem.
Preventing data theft and controlling access
Security is a common concern in Hadoop deployments. By design, Hadoop stores and processes unstructured data from multiple sources, which can lead to access control, data authorization, and ownership issues. IT managers need to control access to data entering and leaving the system. The fact that a Hadoop (or Hadoop-like) environment contains data at various levels of confidentiality and sensitivity can exacerbate these access control issues, ultimately raising the risk of data theft, improper data access, or data disclosure.
Data theft is a prominent issue at the enterprise level, and enterprise IT systems are frequently under attack. These problems have largely been solved in traditional relational systems, but implementing solutions for big data systems is different because new technologies are involved. By default, most big data systems do not encrypt data at rest, and that gap must be addressed first; again, relational systems overcame similar problems long ago. Moreover, because Hadoop-like systems still lack mature cluster management tooling, there may be unnecessary direct access to data files or to DataNode processes.
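As a rough illustration of one way teams fill the encryption gap, the sketch below encrypts a record on the client side with the standard Java crypto APIs before writing it to HDFS, so the bytes stored on the cluster are ciphertext. This is a minimal sketch, not a product feature described in this article; the path is hypothetical, and a real deployment would obtain keys from an enterprise key-management service rather than generating a throwaway key.

```java
import javax.crypto.Cipher;
import javax.crypto.CipherOutputStream;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class EncryptedHdfsWrite {
    public static void main(String[] args) throws Exception {
        // Illustration only: generate a throwaway AES key. A real deployment
        // would fetch keys from an enterprise key-management service.
        KeyGenerator keyGen = KeyGenerator.getInstance("AES");
        keyGen.init(128);
        SecretKey key = keyGen.generateKey();

        Cipher cipher = Cipher.getInstance("AES");
        cipher.init(Cipher.ENCRYPT_MODE, key);

        // Wrap the HDFS output stream so the data at rest is ciphertext.
        FileSystem fs = FileSystem.get(new Configuration());
        Path target = new Path("/secure/records.enc"); // hypothetical path
        try (CipherOutputStream encrypted =
                 new CipherOutputStream(fs.create(target), cipher)) {
            encrypted.write("sensitive customer record".getBytes("UTF-8"));
        }
    }
}
```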
In addition, merging multiple data sources for analysis creates a new dataset that may require its own access controls. You must now define the roles that apply to each source within the combined dataset, drawing clear role boundaries on either a technical or a functional basis. Neither option is perfect: roles defined on a functional basis are easier for administrators to apply once a dataset is merged, but they can invite snooping on data; roles defined on a technical basis protect the original data nodes, but create access problems after the data is merged. The access control and security features built into the Hadoop Distributed File System (HDFS) cannot resolve this dilemma. Some companies that use Hadoop build a separate environment to store merged datasets, or protect access to merged data with a custom firewall.
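For context on what HDFS's built-in controls actually offer, here is a minimal sketch using the Hadoop Java client: it assigns a merged dataset to a dedicated owner and group and tightens its POSIX-style permission bits. The path, user, and group names are hypothetical. Because these controls apply per path rather than per role, the example also illustrates why a merged dataset is hard to govern with HDFS permissions alone.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.permission.FsAction;
import org.apache.hadoop.fs.permission.FsPermission;

public class MergedDatasetPermissions {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        // Hypothetical location of the merged analytics dataset.
        Path merged = new Path("/analytics/merged_customers");

        // Hand ownership to a dedicated service user and analytics group
        // (changing ownership requires HDFS superuser privileges)...
        fs.setOwner(merged, "etl_service", "analytics");

        // ...and restrict access: owner rwx, group r-x, no access for others.
        fs.setPermission(merged, new FsPermission(
                FsAction.ALL,
                FsAction.READ_EXECUTE,
                FsAction.NONE));
    }
}
```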
Products such as InfoSphere Guardium Data Security are available to help secure data in a Hadoop-based system. InfoSphere Guardium Data Security automates the entire compliance auditing process in a heterogeneous environment, with features such as automatic discovery of sensitive data, automated compliance reporting, and dataset-level access control.
Documentation and consulting
A lack of documentation is another common enterprise problem. Roles and specifications change constantly, and consultants and employees come and go. Unless roles and specifications are clearly documented, much of the work must start from scratch whenever a change occurs. This is a major problem with open source Apache Hadoop. By contrast, Hadoop-based products designed for enterprises (such as IBM InfoSphere BigInsights) address this issue with structured documentation and enterprise-class support. In fact, every improvement to the open source Hadoop release also applies to BigInsights, because BigInsights is built on Apache Hadoop and adds these advantages on top of it.
By deploying products such as InfoSphere BigInsights, organizations also gain the benefits of external support. For business reasons, large enterprises typically retain only one support team for core IT capabilities, and limited technical experience can make complex deployments nearly impossible for these teams to accomplish on their own. Some small firms specialize in helping large companies carry out complex Hadoop deployments, but an enterprise cannot rely on them for long-term support, because they may not be around for long.
The structured advice and support provided by reputable vendors solve these problems. A standard Hadoop distribution can be deployed, tracked, and supported to meet business needs and expectations. External consultants can also take on the role of full-time employees, but with the right skill set, and they can apply experience and best practices from across industries. This is a particularly important advantage given that big data is still a new field where professional experience is scarce. Big data consulting can also cover internal training needs and help develop staff skills, and consultant support can be extended to cover projects and general maintenance.