Java Future Trends: Java Facilitates Big Data Development

Source: Internet
Author: User
Tags cassandra zookeeper hadoop mapreduce sqoop

Without Java there would arguably be no big data as we know it: Hadoop itself is written in Java. When you need to roll out new features on a server cluster running MapReduce, you need to deploy them dynamically, and that is what Java is good at.


Mainstream open source tools in the big data field that support Java:

1. HDFS

HDFS is the primary distributed storage system used by Hadoop applications. An HDFS cluster contains a NameNode (the master node), which manages the metadata of the whole file system, and DataNodes (data nodes, of which there can be many), which store the actual data. HDFS is designed for massive amounts of data: where traditional file systems are optimized for large numbers of small files, HDFS optimizes access and storage for a smaller number of very large files.
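The split between metadata and data described above can be illustrated with a minimal sketch in plain Java. This is a toy model, not the HDFS API: the class name ToyHdfs, its methods, and the 4-byte block size are all assumptions made for illustration (real HDFS blocks are far larger and are replicated across DataNodes).

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy model of the NameNode/DataNode split; hypothetical, not the HDFS API. */
class ToyHdfs {
    // Hypothetical block size in bytes, tiny so the example is easy to follow.
    static final int BLOCK_SIZE = 4;

    // "NameNode" side: file name -> ordered list of block IDs (metadata only).
    private final Map<String, List<Integer>> nameNode = new HashMap<>();
    // "DataNode" side: block ID -> the bytes of that block (the actual data).
    private final Map<Integer, byte[]> dataNodeBlocks = new HashMap<>();
    private int nextBlockId = 0;

    /** Splits the content into fixed-size blocks and records where they went. */
    void write(String fileName, byte[] content) {
        List<Integer> blockIds = new ArrayList<>();
        for (int start = 0; start < content.length; start += BLOCK_SIZE) {
            int end = Math.min(start + BLOCK_SIZE, content.length);
            int id = nextBlockId++;
            dataNodeBlocks.put(id, Arrays.copyOfRange(content, start, end));
            blockIds.add(id);
        }
        nameNode.put(fileName, blockIds);
    }

    /** Looks up the block list in the metadata, then reassembles the data. */
    byte[] read(String fileName) {
        List<Integer> blockIds = nameNode.get(fileName);
        if (blockIds == null) {
            throw new IllegalArgumentException("no such file: " + fileName);
        }
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int id : blockIds) {
            byte[] block = dataNodeBlocks.get(id);
            out.write(block, 0, block.length);
        }
        return out.toByteArray();
    }

    /** Number of blocks the metadata records for a file. */
    int blockCount(String fileName) {
        return nameNode.get(fileName).size();
    }

    public static void main(String[] args) {
        ToyHdfs fs = new ToyHdfs();
        fs.write("/demo.txt", "hello hdfs".getBytes(StandardCharsets.UTF_8));
        System.out.println("blocks: " + fs.blockCount("/demo.txt"));
        System.out.println(new String(fs.read("/demo.txt"), StandardCharsets.UTF_8));
    }
}
```

A reader only needs the small metadata map to locate a file's blocks, which is why one master node can serve a cluster with many data nodes.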

2. MapReduce

Hadoop MapReduce is a software framework that makes it easy to write applications that process massive amounts of data (multiple terabytes) in parallel on large clusters of thousands of nodes built from commodity hardware, in a reliable and fault-tolerant manner.
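To make the programming model concrete, here is a minimal single-process sketch of the map, shuffle, and reduce phases for a word count. It uses only the Java standard library and does not use the Hadoop MapReduce API; the class ToyWordCount and its method names are hypothetical. In a real cluster each phase would run in parallel across many nodes.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Single-process illustration of map/shuffle/reduce; not the Hadoop API. */
class ToyWordCount {
    /** Map phase: emit a (word, 1) pair for every word in one input line. */
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) {
                pairs.add(new AbstractMap.SimpleEntry<>(word, 1));
            }
        }
        return pairs;
    }

    /** Reduce phase: sum all the counts emitted for one word. */
    static int reduce(String word, List<Integer> counts) {
        int sum = 0;
        for (int c : counts) {
            sum += c;
        }
        return sum;
    }

    /** Runs map on every line, groups the pairs by key, then reduces each group. */
    static Map<String, Integer> run(List<String> lines) {
        // Shuffle: collect the values emitted for each key into one list.
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : lines) {
            for (Map.Entry<String, Integer> pair : map(line)) {
                grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                       .add(pair.getValue());
            }
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            result.put(e.getKey(), reduce(e.getKey(), e.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(run(Arrays.asList("big data", "Big Java data")));
    }
}
```

Because map works on one line at a time and reduce works on one key at a time, neither function shares state with other calls, which is what lets a framework spread them over many machines.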

3. HBase

Apache HBase is the Hadoop database, a distributed, scalable big data store. It provides random, real-time read/write access to large data sets and is optimized for hosting very large tables, with tens of billions of rows and millions of columns, on clusters of commodity servers. At its core it is an open source implementation of the distributed column-oriented store described in the Google Bigtable paper. Just as Bigtable builds on the distributed data storage provided by GFS (Google File System), HBase provides Bigtable-like capabilities on top of Apache Hadoop and HDFS.
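The Bigtable-style data model can be pictured as a sparse, sorted map from row key to column to value. The sketch below models that idea with nested TreeMaps in plain Java; it is not the HBase client API, and the class ToyTable, its methods, and the half-open scan range are assumptions chosen for illustration.

```java
import java.util.NavigableMap;
import java.util.TreeMap;

/** In-memory illustration of a sorted row/column map; not the HBase API. */
class ToyTable {
    // row key -> (column name -> value); both levels are kept sorted by key.
    private final NavigableMap<String, NavigableMap<String, String>> rows = new TreeMap<>();

    /** Stores one cell; rows may hold different sets of columns (sparse). */
    void put(String row, String column, String value) {
        rows.computeIfAbsent(row, r -> new TreeMap<>()).put(column, value);
    }

    /** Random read of a single cell, or null if the cell does not exist. */
    String get(String row, String column) {
        NavigableMap<String, String> columns = rows.get(row);
        return columns == null ? null : columns.get(column);
    }

    /** Range scan over rows in [startRow, stopRow), in sorted key order. */
    NavigableMap<String, NavigableMap<String, String>> scan(String startRow, String stopRow) {
        return rows.subMap(startRow, true, stopRow, false);
    }

    public static void main(String[] args) {
        ToyTable table = new ToyTable();
        table.put("user1", "info:name", "Ann");
        table.put("user1", "info:email", "ann@example.com");
        table.put("user2", "info:name", "Bob");
        table.put("user3", "info:name", "Cy");
        System.out.println(table.get("user2", "info:name"));
        System.out.println(table.scan("user1", "user3").keySet());
    }
}
```

Keeping rows sorted by key is what makes both single-row lookups and contiguous range scans cheap in this model.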

4. Cassandra

Apache Cassandra is a high-performance, linearly scalable, highly available database that runs on commodity hardware or cloud infrastructure, making it a strong platform for mission-critical data.

Cassandra's support for replication across multiple data centers is best-in-class, giving users lower latency and more reliable disaster recovery. Its data model offers strong support for log-structured updates, denormalization and materialized views, and powerful built-in caching, and it provides convenient secondary indexes (column indexes).

5. Hive

Apache Hive is a data warehouse system for Hadoop. It facilitates data summarization (by mapping structured data files onto database tables), ad hoc queries, and the analysis of large data sets stored in Hadoop-compatible file systems. Hive provides an SQL-like query language called HiveQL. When expressing a piece of logic in HiveQL would be inefficient or cumbersome, HiveQL also lets traditional map/reduce programmers plug in their own custom mappers and reducers.

6. Pig

Apache Pig is a platform for analyzing large data sets. It consists of a high-level language for expressing data analysis programs and an infrastructure for evaluating those programs. The salient property of Pig programs is that their structure is amenable to substantial parallelization, which in turn enables them to handle very large data sets. Pig's infrastructure layer contains a compiler that produces MapReduce jobs. Pig's language layer currently consists of a textual language called Pig Latin, which was designed to be easy to program and to scale.


7. Chukwa

Apache Chukwa is an open source data collection system for monitoring large distributed systems. It is built on top of HDFS and the MapReduce framework, and it inherits Hadoop's scalability and robustness. Chukwa also includes a flexible and powerful toolkit for displaying, monitoring, and analyzing results, so that the collected data can be put to the best possible use.

8. Ambari

Apache Ambari is a web-based tool for configuring, managing, and monitoring Apache Hadoop clusters, with support for Hadoop HDFS, Hadoop MapReduce, Hive, HCatalog, HBase, ZooKeeper, Oozie, Pig, and Sqoop. Ambari also provides a dashboard for viewing cluster health, including heatmaps, and lets you inspect MapReduce, Pig, and Hive applications in a user-friendly interface to diagnose their performance characteristics.

9. ZooKeeper

Apache ZooKeeper is a reliable coordination service for large distributed systems. Its features include configuration maintenance, naming services, distributed synchronization, group services, and more.

The goal of ZooKeeper is to encapsulate these complex and error-prone services and to offer users an easy-to-use interface on top of an efficient, robust system.

10. Sqoop

Sqoop is a tool for transferring data between Hadoop and relational databases. It can import data from a relational database into HDFS in Hadoop, and it can export data from HDFS into a relational database.

11. Oozie

Apache Oozie is a scalable, reliable, and extensible workflow scheduling system for managing Hadoop jobs. An Oozie Workflow job is a Directed Acyclic Graph (DAG) of actions. Oozie Coordinator jobs are recurring Oozie Workflow jobs, typically triggered by time (frequency) and by the availability of data. Oozie is integrated with the rest of the Hadoop stack and supports several types of Hadoop jobs out of the box (such as Java MapReduce, streaming MapReduce, Pig, Hive, Sqoop, and DistCp), as well as other system jobs (such as Java programs and shell scripts).
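Why a workflow must be acyclic is easiest to see when deriving an execution order from it. The following sketch orders the actions of a small workflow with a standard topological sort (Kahn's algorithm) in plain Java. It does not use Oozie's workflow definition format or API; the class ToyWorkflow and the action names are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Orders workflow actions by their dependencies; not the Oozie API. */
class ToyWorkflow {
    // action -> actions that may only start after it has finished
    private final Map<String, List<String>> next = new LinkedHashMap<>();

    void addAction(String name) {
        next.computeIfAbsent(name, k -> new ArrayList<>());
    }

    /** Declares that 'after' must wait for 'before'. */
    void addDependency(String before, String after) {
        addAction(before);
        addAction(after);
        next.get(before).add(after);
    }

    /** Returns a valid run order, or throws if the graph contains a cycle. */
    List<String> executionOrder() {
        // Count how many unfinished prerequisites each action still has.
        Map<String, Integer> pending = new LinkedHashMap<>();
        for (String action : next.keySet()) {
            pending.put(action, 0);
        }
        for (List<String> successors : next.values()) {
            for (String s : successors) {
                pending.merge(s, 1, Integer::sum);
            }
        }
        // Actions with no prerequisites are ready to run immediately.
        Deque<String> ready = new ArrayDeque<>();
        for (Map.Entry<String, Integer> e : pending.entrySet()) {
            if (e.getValue() == 0) {
                ready.add(e.getKey());
            }
        }
        List<String> order = new ArrayList<>();
        while (!ready.isEmpty()) {
            String action = ready.remove();
            order.add(action);
            // Finishing an action may unblock its successors.
            for (String s : next.get(action)) {
                if (pending.merge(s, -1, Integer::sum) == 0) {
                    ready.add(s);
                }
            }
        }
        if (order.size() != next.size()) {
            throw new IllegalStateException("workflow contains a cycle");
        }
        return order;
    }

    public static void main(String[] args) {
        ToyWorkflow wf = new ToyWorkflow();
        wf.addDependency("sqoop-import", "pig-clean");
        wf.addDependency("pig-clean", "hive-report");
        wf.addDependency("sqoop-import", "hive-report");
        System.out.println(wf.executionOrder());
    }
}
```

If two actions depended on each other, neither would ever become ready, which is the situation the cycle check reports.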

12. Mahout

Apache Mahout is a scalable machine learning and data mining library. Mahout currently supports four main use cases:

Recommendation mining: collects user actions and uses them to recommend items that users might like.

Clustering: collects documents and groups related documents together.

Classification: learns from existing categorized documents what documents of a given category look like, and assigns unlabeled documents to the correct category.

Frequent itemset mining: takes sets of grouped items and identifies which individual items often appear together.
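As an illustration of the last use case, the sketch below counts how often each pair of items appears together across a set of baskets and keeps the pairs that meet a minimum support. It is a brute-force count in plain Java, not Mahout's algorithm or API; the class ToyFrequentPairs, the "+" pair key, and the sample baskets are assumptions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

/** Brute-force frequent-pair counting; an illustration, not Mahout's API. */
class ToyFrequentPairs {
    /** Returns every item pair that occurs in at least minSupport baskets. */
    static Map<String, Integer> frequentPairs(List<List<String>> baskets, int minSupport) {
        Map<String, Integer> counts = new TreeMap<>();
        for (List<String> basket : baskets) {
            // Sort and de-duplicate so each pair has one canonical form per basket.
            List<String> items = new ArrayList<>(new TreeSet<>(basket));
            for (int i = 0; i < items.size(); i++) {
                for (int j = i + 1; j < items.size(); j++) {
                    counts.merge(items.get(i) + "+" + items.get(j), 1, Integer::sum);
                }
            }
        }
        // Drop the pairs that do not reach the support threshold.
        counts.values().removeIf(c -> c < minSupport);
        return counts;
    }

    public static void main(String[] args) {
        List<List<String>> baskets = Arrays.asList(
                Arrays.asList("bread", "milk"),
                Arrays.asList("bread", "milk", "eggs"),
                Arrays.asList("milk", "eggs"),
                Arrays.asList("bread", "milk"));
        System.out.println(frequentPairs(baskets, 2));
    }
}
```

Counting every pair grows quadratically with basket size, so a library built for large data sets would need a more scalable approach than this sketch.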

13. HCatalog

Apache HCatalog is a table and storage management service for data created using Hadoop. It:

Provides a shared schema and data type mechanism.

Provides a table abstraction so that users do not need to be concerned with where or how their data is stored.

Provides interoperability across data processing tools such as Pig, MapReduce, and Hive.
