13 Java open-source big data tools


Big data has become one of the hottest trends across the commercial world. But what exactly is big data? The question matters as much as the hype surrounding it.

In fact, big data is a very simple term: as the name suggests, it is a very large dataset. How large, exactly? The honest answer is "as large as you can imagine"!

Why are such large datasets being generated? Because data is now everywhere, and collecting it brings huge returns: RFID sensors collecting communication data, sensors collecting weather information, GPRS packets sent from mobile devices, pictures and videos posted to social networking sites, transaction records generated by online shopping, and much more. Big data is a huge dataset containing information generated by any data source, provided that information is of interest to us.

However, the meaning of big data is not only about volume. Big data can also be used to uncover new insights and to create new data and content, and the insights, data, and content extracted from it can make a business more agile and answer questions previously considered far beyond reach. This is why big data is characterized along four dimensions: volume, variety, velocity, and veracity, the "4 Vs" of big data. Each characteristic, and the challenges it brings, is described below:

1. Volume

Volume refers to the amount of data that a business must capture, store, and access; 90% of all the data in the world today was produced within the past two years. Organizations are thoroughly overwhelmed by this volume, easily generating terabytes or even petabytes of data of varying types, some of which must be organized, protected against theft, and analyzed.

2. Variety

80% of the data produced in the world is semi-structured: sensors, smart devices, and social media generate it through web pages, web log files, social media forums, audio, video, click streams, email, documents, sensor systems, and more. Traditional analytics solutions tend to suit only structured data, such as data stored in a relational database with a complete structural model. The diversity of data types also means that, to support decision making and real-time knowledge processing, we need fundamental changes in how data is stored and analyzed. Variety represents the data types that cannot easily be captured and managed in traditional relational databases but can be stored and analyzed readily with big data technology.

3. Velocity

Velocity demands near-real-time analysis of data, or as it is sometimes put, "sometimes two minutes is too late!". Gaining a competitive advantage means identifying a new trend or opportunity within minutes or even seconds, and doing so faster than your competitors. Another example is the processing of time-sensitive data, such as catching criminals, where data must be collected and analyzed while it still yields its maximum value. Time-sensitive data usually has a very short shelf life, which requires organizations to analyze it in near real time.

4. Veracity

By analyzing data, we can find out how to seize opportunities and capture value. The importance of data lies in its support for decision making: when you face a decision that may significantly affect your enterprise, you want as much information relevant to the use case as possible. Volume alone does not determine whether data helps a decision; the veracity and quality of the data are the most important factors in gaining real insight, and thus form the most solid foundation for successful decisions.

Existing business intelligence and data warehouse technologies, however, do not fully address these 4 Vs. Big data solutions were developed to meet exactly these challenges.

The following are mainstream open-source tools in the big data field that support Java:

1. HDFS

HDFS is the primary distributed storage system used by Hadoop applications. An HDFS cluster consists of a NameNode, which manages the metadata of the entire file system, and DataNodes, which store the actual data (there can be many of them). HDFS is designed for massive data volumes: whereas traditional file systems are optimized for large numbers of small files, HDFS is optimized for storing and accessing a small number of very large files.
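
To make this concrete, here is a minimal sketch of reading a file through the HDFS Java API; the NameNode address and file path are placeholders for illustration:

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsReadExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Placeholder address: point this at your own NameNode.
            conf.set("fs.defaultFS", "hdfs://namenode:8020");

            try (FileSystem fs = FileSystem.get(conf);
                 BufferedReader reader = new BufferedReader(
                         new InputStreamReader(fs.open(new Path("/data/example.txt"))))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    System.out.println(line);
                }
            }
        }
    }

Note that the client only talks to the NameNode for metadata; the file contents stream directly from the DataNodes that hold the blocks.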

2. MapReduce

Hadoop MapReduce is a software framework for easily writing parallel applications that process massive (terabyte-scale) datasets, running reliably and fault-tolerantly across large clusters of thousands of commodity-hardware nodes.
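
The canonical example is word counting. Below is a sketch of the classic WordCount mapper and reducer, a standard illustration of the framework rather than production code:

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class WordCount {
        // Map phase: emits (word, 1) for each token in a line of input.
        public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            public void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer it = new StringTokenizer(value.toString());
                while (it.hasMoreTokens()) {
                    word.set(it.nextToken());
                    context.write(word, ONE);
                }
            }
        }

        // Reduce phase: sums the counts emitted for each word.
        public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            public void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) {
                    sum += v.get();
                }
                context.write(key, new IntWritable(sum));
            }
        }
    }

The framework handles splitting the input, scheduling map and reduce tasks across the cluster, shuffling intermediate pairs, and re-running tasks on node failure.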

3. HBase

Apache HBase is the Hadoop database, providing distributed, scalable big data storage. It offers random, real-time read/write access to large datasets and is optimized for hosting very large tables on clusters of commodity servers: billions of rows by millions of columns. At its core, HBase is an open-source implementation of Google Bigtable, that is, distributed column-oriented storage. Just as Bigtable builds on the distributed data storage provided by GFS (the Google File System), HBase is the Bigtable-like system that Apache Hadoop provides on top of HDFS.
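
Here is a minimal sketch using the HBase Java client API; the table name, column family, and row key are hypothetical:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBaseExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();

            // "users" and its "info" column family are placeholder names.
            try (Connection connection = ConnectionFactory.createConnection(conf);
                 Table table = connection.getTable(TableName.valueOf("users"))) {

                // Write one cell: row "row1", column family "info", qualifier "name".
                Put put = new Put(Bytes.toBytes("row1"));
                put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Alice"));
                table.put(put);

                // Random, real-time read of the same row.
                Result result = table.get(new Get(Bytes.toBytes("row1")));
                byte[] value = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
                System.out.println(Bytes.toString(value));
            }
        }
    }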

4. Cassandra

Apache Cassandra is a high-performance, linearly scalable, and highly available database that can run on commodity hardware or cloud infrastructure, making it an excellent platform for mission-critical data. With cross-datacenter replication, Cassandra is best in class at providing users with lower latency and more reliable disaster recovery. Its data model offers convenient secondary (column) indexes, along with strong support for log-structured updates, denormalization, materialized views, and powerful built-in caching.
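
A minimal sketch using the DataStax Java driver for Cassandra (the 3.x-era API); the contact point, keyspace, and table are placeholders:

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.ResultSet;
    import com.datastax.driver.core.Row;
    import com.datastax.driver.core.Session;

    public class CassandraExample {
        public static void main(String[] args) {
            // 127.0.0.1 and the "demo" keyspace are placeholders for illustration.
            try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
                 Session session = cluster.connect()) {
                session.execute("CREATE KEYSPACE IF NOT EXISTS demo WITH replication = "
                        + "{'class': 'SimpleStrategy', 'replication_factor': 1}");
                session.execute("CREATE TABLE IF NOT EXISTS demo.users (id int PRIMARY KEY, name text)");
                session.execute("INSERT INTO demo.users (id, name) VALUES (1, 'Alice')");

                ResultSet rs = session.execute("SELECT name FROM demo.users WHERE id = 1");
                for (Row row : rs) {
                    System.out.println(row.getString("name"));
                }
            }
        }
    }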

5. Hive

Apache Hive is a data warehouse system for Hadoop that facilitates data summarization (mapping structured data files onto database tables), ad-hoc queries, and the analysis of large datasets stored in Hadoop-compatible systems. Hive provides a complete SQL-style query language, HiveQL; when expressing a piece of logic in HiveQL would be inefficient or cumbersome, it also allows traditional map/reduce programmers to plug in their own custom mappers and reducers.
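
Because HiveServer2 exposes HiveQL over JDBC, a Java program can query Hive like any database. A minimal sketch, assuming a local HiveServer2 endpoint and a hypothetical products table:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveJdbcExample {
        public static void main(String[] args) throws Exception {
            // Requires the hive-jdbc driver on the classpath.
            Class.forName("org.apache.hive.jdbc.HiveDriver");

            // Host, port, and table are placeholders; HiveServer2 listens on 10000 by default.
            String url = "jdbc:hive2://localhost:10000/default";
            try (Connection conn = DriverManager.getConnection(url, "hive", "");
                 Statement stmt = conn.createStatement();
                 ResultSet rs = stmt.executeQuery(
                         "SELECT category, COUNT(*) FROM products GROUP BY category")) {
                while (rs.next()) {
                    System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
                }
            }
        }
    }

Behind the scenes, Hive compiles a query like this into one or more MapReduce jobs over the underlying HDFS data.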

6. Pig

Apache Pig is a platform for analyzing large datasets. It consists of a high-level language for writing data analysis programs and an infrastructure for evaluating those programs. The salient property of Pig programs is that their structure is amenable to substantial parallelization, which is what lets them handle very large datasets. Pig's infrastructure layer contains a compiler that generates MapReduce jobs; its language layer currently consists of a textual language called Pig Latin, designed for ease of programming and scalability.
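
Pig Latin scripts can also be driven from Java through the PigServer API. A minimal sketch, assuming a hypothetical access.log file of space-separated fields:

    import org.apache.pig.ExecType;
    import org.apache.pig.PigServer;

    public class PigExample {
        public static void main(String[] args) throws Exception {
            // LOCAL mode runs on one machine; MAPREDUCE mode targets a cluster.
            PigServer pig = new PigServer(ExecType.LOCAL);

            // Count page hits per IP address; file and field names are placeholders.
            pig.registerQuery("logs = LOAD 'access.log' USING PigStorage(' ') "
                    + "AS (ip:chararray, url:chararray);");
            pig.registerQuery("by_ip = GROUP logs BY ip;");
            pig.registerQuery("counts = FOREACH by_ip GENERATE group AS ip, COUNT(logs) AS hits;");

            // Materialize the result; Pig compiles the pipeline into MapReduce jobs.
            pig.store("counts", "hits_by_ip");
        }
    }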

7. Chukwa

Apache Chukwa is an open-source data collection system for monitoring large distributed systems. Built on HDFS and the MapReduce framework, it inherits Hadoop's scalability and robustness. Chukwa also includes a flexible and powerful toolkit for displaying, monitoring, and analyzing results, so the collected data can be put to the best possible use.

8. Ambari

Apache Ambari is a web-based tool for provisioning, managing, and monitoring Apache Hadoop clusters. It supports Hadoop HDFS, Hadoop MapReduce, Hive, HCatalog, HBase, ZooKeeper, Oozie, Pig, and Sqoop. Ambari also provides a cluster-status dashboard, with features such as heatmaps and the ability to view MapReduce, Pig, and Hive applications and diagnose their performance characteristics through a user-friendly interface.

9. ZooKeeper

Apache ZooKeeper is a reliable coordination system for large distributed systems, providing configuration maintenance, naming services, distributed synchronization, and group services. ZooKeeper's goal is to encapsulate these complex, error-prone key services and expose them to users through an easy-to-use interface in a system that is both high-performance and stable.
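
A minimal sketch of the ZooKeeper Java client storing one piece of shared configuration under a znode; the connection string, path, and payload are placeholders:

    import java.util.concurrent.CountDownLatch;
    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.Watcher;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    public class ZooKeeperExample {
        public static void main(String[] args) throws Exception {
            // Block until the session is actually established.
            CountDownLatch connected = new CountDownLatch(1);
            ZooKeeper zk = new ZooKeeper("localhost:2181", 5000, event -> {
                if (event.getState() == Watcher.Event.KeeperState.SyncConnected) {
                    connected.countDown();
                }
            });
            connected.await();

            // Store shared configuration under a persistent znode
            // (this sketch assumes the znode does not already exist).
            String path = zk.create("/app-config", "max_conns=100".getBytes(),
                    ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);

            byte[] data = zk.getData(path, false, null);
            System.out.println(new String(data));
            zk.close();
        }
    }

Any other process connected to the same ensemble now reads the same value, which is exactly the configuration-maintenance use case described above.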

10. Sqoop

Sqoop is a tool for transferring data between Hadoop and relational databases. It can import data from a relational database into Hadoop's HDFS, or export data from HDFS back into a relational database.
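
Sqoop is driven from the command line. A sketch of both directions, where the JDBC URL, credentials, tables, and HDFS paths are all placeholders:

    # Import a MySQL table into HDFS
    sqoop import \
      --connect jdbc:mysql://dbhost/shop \
      --username dbuser --password dbpass \
      --table orders \
      --target-dir /data/orders

    # Export HDFS data back into a relational table
    sqoop export \
      --connect jdbc:mysql://dbhost/shop \
      --username dbuser --password dbpass \
      --table order_summary \
      --export-dir /data/order_summary

Under the hood, Sqoop generates a MapReduce job that reads or writes the table in parallel slices.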

11. Oozie

Apache Oozie is a scalable, reliable, and extensible workflow scheduler system for managing Hadoop jobs. Oozie workflow jobs are directed acyclic graphs (DAGs) of actions. Oozie coordinator jobs trigger recurring workflow jobs, with the recurrence generally determined by time (frequency) and data availability. Oozie integrates with the rest of the Hadoop stack, supporting several types of Hadoop jobs out of the box (such as Java MapReduce, streaming MapReduce, Pig, Hive, Sqoop, and DistCp) as well as system-specific jobs (such as Java programs and shell scripts).
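
Workflows themselves are defined in XML and uploaded to HDFS; a Java program can then submit them through Oozie's client API. A minimal sketch, assuming hypothetical host names and an application path already on HDFS:

    import java.util.Properties;
    import org.apache.oozie.client.OozieClient;

    public class OozieSubmitExample {
        public static void main(String[] args) throws Exception {
            // Oozie server URL is a placeholder; 11000 is the usual default port.
            OozieClient client = new OozieClient("http://oozie-host:11000/oozie");

            Properties conf = client.createConfiguration();
            // Points at a directory on HDFS containing a workflow.xml definition.
            conf.setProperty(OozieClient.APP_PATH, "hdfs://namenode:8020/user/demo/my-workflow");
            conf.setProperty("nameNode", "hdfs://namenode:8020");
            conf.setProperty("jobTracker", "jobtracker-host:8032");

            // Submit and start the workflow job.
            String jobId = client.run(conf);
            System.out.println("Submitted workflow job: " + jobId);
        }
    }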

12. Mahout

Apache Mahout is a scalable machine learning and data mining library. Mahout currently supports the following four main use cases (a minimal recommender sketch follows the list):

  • Recommendation mining: analyzes users' behavior and recommends items they might like.
  • Clustering: collects related documents and groups them together, for example by topic.
  • Classification: learns from existing categorized documents what members of a category look like, and assigns unlabeled documents to the correct category.
  • Frequent itemset mining: analyzes a group of items and identifies which items frequently appear together.
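
As a small illustration of the first use case, here is a sketch of a user-based recommender built on Mahout's Taste API; the ratings.csv file and its userID,itemID,preference format are assumptions for illustration:

    import java.io.File;
    import java.util.List;
    import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
    import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
    import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
    import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
    import org.apache.mahout.cf.taste.model.DataModel;
    import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
    import org.apache.mahout.cf.taste.recommender.RecommendedItem;
    import org.apache.mahout.cf.taste.recommender.Recommender;
    import org.apache.mahout.cf.taste.similarity.UserSimilarity;

    public class RecommenderExample {
        public static void main(String[] args) throws Exception {
            // Placeholder data file: lines of userID,itemID,preference.
            DataModel model = new FileDataModel(new File("ratings.csv"));

            // Similar users are found by comparing their rating vectors.
            UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
            UserNeighborhood neighborhood = new NearestNUserNeighborhood(10, similarity, model);
            Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);

            // Top 3 recommendations for user 1.
            List<RecommendedItem> items = recommender.recommend(1, 3);
            for (RecommendedItem item : items) {
                System.out.println(item.getItemID() + " -> " + item.getValue());
            }
        }
    }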

13. HCatalog

Apache HCatalog is a table and storage management service for Hadoop data, which provides:

  • A shared schema and data type mechanism.
  • A table abstraction, so that users need not care about where or how their data is stored.
  • Interoperability across data processing tools such as Pig, MapReduce, and Hive.