13 Open Source Java Big Data Tools: From Theory to Practice




Big data has become one of the most talked-about trends in virtually every business sector. But what is big data? Is it a gimmick, a bubble, or really as important as the hype suggests?



In fact, big data is a very simple term: just as it says, it is a very large data set. So how large is "very large"? The honest answer is: "as big as you can imagine"!



So why do such large data sets exist? Because data today is ubiquitous and hugely valuable: RFID sensors collecting logistics data, sensors collecting weather information, GPRS packets sent by mobile devices, posts to social networking sites, pictures and videos, records of online transactions, everything! Big data is a huge data set containing information generated by any of these sources, provided, of course, that the information is of interest to us.



However, the meaning of big data was never only about volume. Big data can also be used to discover new knowledge and to create new data and content; the facts extracted from big data can make a business more agile and answer questions that were previously considered far out of reach. This is why big data is characterized along four dimensions: Volume, Variety, Velocity, and Veracity, the "4 Vs" of big data. The following outlines each characteristic and the challenges it poses:



1. Volume



Volume refers to the amount of data a business has to capture, store, and access; 90% of all the data in the world was produced in just the last two years. Today's organizations are thoroughly overwhelmed by data volume, easily producing terabytes or even petabytes of data, much of which needs to be organized, protected against theft, and analyzed.



2. Variety



80% of the world's data is semi-structured: sensors, smart devices, and social media generate it in the form of web pages, log files, social media forums, audio, video, clickstreams, email, documents, sensing systems, and more. Traditional analysis solutions usually handle only structured data, for example data stored in a relational database under a complete structural model. The diversity of data types also means that, to support today's decision making and knowledge processing, fundamental changes are needed in how data is stored and analyzed. Variety represents data types that cannot easily be captured and managed in a traditional relational database but can readily be stored and analyzed with big data technologies.



3. Velocity



Velocity demands near-real-time analysis of data, captured in the saying "sometimes two minutes is too late!". Gaining a competitive advantage means identifying a new trend or opportunity within minutes, even seconds, and doing so faster than your competitors. Another example is the processing of time-sensitive data, such as catching criminals, where the data must be collected and analyzed while it can still deliver maximum value. Time-sensitive data often has a very short shelf life, which obliges organizations and agencies to analyze it in near real time.



4. Veracity



Analyzing data tells us how to seize opportunities and reap value; the importance of data lies in supporting decisions. When you face a decision that may have a significant impact on your business, you want as much information about the use case as possible. Data volume alone does not determine whether data helps a decision; the authenticity and quality of the data are the most important factors in acquiring knowledge and insight, and thus the most solid foundation for making successful decisions.



However, current business intelligence and data warehousing technologies do not fully address these four Vs; big data solutions were developed to meet these challenges.



The main open source Java tools in the big data area are described below:






1. HDFS



HDFS is the primary distributed storage system used by Hadoop applications. An HDFS cluster consists of a NameNode (master node), which manages the metadata of the entire file system, and DataNodes (data nodes, of which there can be many), which store the actual data. HDFS is designed for massive amounts of data: where traditional file systems are optimized for large numbers of small files, HDFS is optimized for storing and accessing a small number of very large files.
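
As a minimal sketch of how an application talks to HDFS through the Java FileSystem API (the NameNode address and file path below are illustrative assumptions, not values from this article):

```java
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Assumed NameNode address; adjust for your cluster.
        conf.set("fs.defaultFS", "hdfs://namenode.example.com:8020");

        FileSystem fs = FileSystem.get(conf);
        Path path = new Path("/tmp/hello.txt");

        // Write a file: the client asks the NameNode for metadata,
        // then streams the data blocks to DataNodes.
        try (FSDataOutputStream out = fs.create(path, true)) {
            out.write("Hello, HDFS!".getBytes(StandardCharsets.UTF_8));
        }

        // Read it back.
        try (FSDataInputStream in = fs.open(path)) {
            byte[] buf = new byte[(int) fs.getFileStatus(path).getLen()];
            in.readFully(buf);
            System.out.println(new String(buf, StandardCharsets.UTF_8));
        }
        fs.close();
    }
}
```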






2. MapReduce



Hadoop MapReduce is a software framework for easily writing applications that process massive amounts of data (multiple terabytes) in parallel, running reliably and fault-tolerantly on large clusters of commodity hardware with thousands of nodes.
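
The canonical illustration is WordCount, shown here essentially as it appears in the Hadoop tutorials: the map function emits a (word, 1) pair per token and the reduce function sums the counts per word. Input and output paths come from the command line.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE); // emit (word, 1) per token
            }
        }
    }

    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            context.write(key, new IntWritable(sum)); // total count per word
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```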






3. HBase



Apache HBase is the Hadoop database: a distributed, scalable big data store. It provides random, real-time read/write access to large data sets and is optimized for very large tables hosted on clusters of commodity servers: tens of billions of rows by millions of columns. At its core, it is an open source implementation of Google's BigTable paper, a distributed column-oriented store. Just as BigTable builds on the distributed storage provided by GFS (the Google File System), HBase provides BigTable-like capabilities on top of Apache Hadoop's HDFS.
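
A minimal sketch of the random read/write path using the HBase Java client; it assumes a "users" table with an "info" column family already exists and a reachable cluster configured via the classpath:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        // Assumes an existing 'users' table with column family 'info'.
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("users"))) {

            // Random write: one cell addressed by (row key, family, qualifier).
            Put put = new Put(Bytes.toBytes("user-42"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Ada"));
            table.put(put);

            // Random read of the same cell.
            Result result = table.get(new Get(Bytes.toBytes("user-42")));
            byte[] name = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
            System.out.println(Bytes.toString(name));
        }
    }
}
```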






4. Cassandra



Apache Cassandra is a high-performance, linearly scalable, highly available database that can run on commodity hardware or cloud infrastructure, making it a strong platform for mission-critical data. Cassandra's replication across data centers is best in class, providing users with lower latency and reliable disaster recovery. It offers strong support for log-structured updates, denormalization and materialized views, and powerful built-in caching, and its data model provides convenient secondary indexes (column indexes).
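
A brief sketch using the DataStax Java driver (3.x API); the contact point, keyspace, and schema here are illustrative assumptions:

```java
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;

public class CassandraExample {
    public static void main(String[] args) {
        // Any reachable node can act as the contact point and coordinate requests.
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect()) {

            session.execute("CREATE KEYSPACE IF NOT EXISTS demo "
                    + "WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}");
            session.execute("CREATE TABLE IF NOT EXISTS demo.users (id int PRIMARY KEY, name text)");

            session.execute("INSERT INTO demo.users (id, name) VALUES (1, 'Ada')");

            ResultSet rs = session.execute("SELECT name FROM demo.users WHERE id = 1");
            Row row = rs.one();
            System.out.println(row.getString("name"));
        }
    }
}
```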






5. Hive



Apache Hive is a data warehouse system for Hadoop that facilitates data summarization (mapping structured data files onto database tables), ad hoc queries, and the analysis of large data sets stored in Hadoop-compatible file systems. Hive provides an SQL-like query language called HiveQL; where expressing a piece of logic in HiveQL would be inefficient or cumbersome, it also allows traditional map/reduce programmers to plug in their own custom mappers and reducers.
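
From Java, Hive is typically queried over JDBC against HiveServer2. A minimal sketch (the server address, credentials, and table are illustrative assumptions):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcExample {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection conn = DriverManager.getConnection(
                     "jdbc:hive2://hive.example.com:10000/default", "hadoop", "");
             Statement stmt = conn.createStatement();
             // A HiveQL aggregation; Hive compiles it into batch jobs on the cluster.
             ResultSet rs = stmt.executeQuery(
                     "SELECT category, COUNT(*) FROM products GROUP BY category")) {
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
            }
        }
    }
}
```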






6. Pig



Apache Pig is a platform for analyzing large data sets. It consists of a high-level language for expressing data analysis programs and an infrastructure for evaluating those programs. The salient property of Pig programs is that their structure is amenable to substantial parallelization, which is what lets them handle very large data sets. Pig's infrastructure layer contains a compiler that produces sequences of Map-Reduce jobs; its language layer currently consists of a textual language called Pig Latin, designed for ease of programming and scalability.
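
Pig Latin scripts can also be embedded in a Java program via PigServer. A small sketch in local mode (the log file and its fields are illustrative assumptions; use ExecType.MAPREDUCE against a cluster):

```java
import java.util.Iterator;

import org.apache.pig.ExecType;
import org.apache.pig.PigServer;
import org.apache.pig.data.Tuple;

public class PigEmbeddedExample {
    public static void main(String[] args) throws Exception {
        // Local mode for illustration only.
        PigServer pig = new PigServer(ExecType.LOCAL);

        // Pig Latin statements registered from Java; counts hits per IP.
        pig.registerQuery("logs = LOAD 'access.log' USING PigStorage(' ') "
                + "AS (ip:chararray, url:chararray);");
        pig.registerQuery("grouped = GROUP logs BY ip;");
        pig.registerQuery("hits = FOREACH grouped GENERATE group, COUNT(logs);");

        Iterator<Tuple> it = pig.openIterator("hits");
        while (it.hasNext()) {
            System.out.println(it.next());
        }
    }
}
```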






7. Chukwa



Apache Chukwa is an open source data collection system for monitoring large distributed systems. Built on the HDFS and Map/Reduce frameworks, it inherits Hadoop's scalability and robustness. Chukwa also includes a flexible and powerful toolkit for displaying, monitoring, and analyzing the collected results, to ensure the data is put to the best possible use.






8. Ambari



Apache Ambari is a web-based tool for provisioning, managing, and monitoring Apache Hadoop clusters; it supports Hadoop HDFS, Hadoop MapReduce, Hive, HCatalog, HBase, ZooKeeper, Oozie, Pig, and Sqoop. Ambari also provides dashboards for viewing cluster health, such as heatmaps, and the ability to inspect MapReduce, Pig, and Hive applications visually, diagnosing their performance characteristics through a user-friendly interface.
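
Besides the web UI, Ambari exposes a REST API; /api/v1/clusters is its documented entry point for listing managed clusters. A sketch of calling it from Java (host, port, and credentials are illustrative assumptions):

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.Base64;

public class AmbariRestExample {
    public static void main(String[] args) throws Exception {
        // Assumed Ambari server and default credentials; change for your setup.
        String auth = Base64.getEncoder().encodeToString("admin:admin".getBytes());
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://ambari.example.com:8080/api/v1/clusters"))
                .header("Authorization", "Basic " + auth)
                .header("X-Requested-By", "ambari") // required by Ambari for modifying calls; harmless on GET
                .GET()
                .build();
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body()); // JSON listing of managed clusters
    }
}
```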






9. ZooKeeper



Apache ZooKeeper is a reliable coordination system for large distributed systems, providing services that include configuration maintenance, naming, distributed synchronization, and group membership. ZooKeeper's goal is to encapsulate these complex, error-prone services behind an easy-to-use interface backed by an efficient, robust implementation.
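
A minimal sketch of the Java client storing and reading a configuration value in a znode (the ensemble address and znode path are illustrative assumptions):

```java
import java.nio.charset.StandardCharsets;
import java.util.concurrent.CountDownLatch;

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZooKeeperExample {
    public static void main(String[] args) throws Exception {
        CountDownLatch connected = new CountDownLatch(1);
        // Wait until the session with the ensemble is established.
        ZooKeeper zk = new ZooKeeper("zk.example.com:2181", 3000, event -> {
            if (event.getState() == Watcher.Event.KeeperState.SyncConnected) {
                connected.countDown();
            }
        });
        connected.await();

        // Store a small piece of configuration in a znode, then read it back.
        String path = "/demo-config";
        if (zk.exists(path, false) == null) {
            zk.create(path, "v1".getBytes(StandardCharsets.UTF_8),
                    ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        }
        byte[] data = zk.getData(path, false, null);
        System.out.println(new String(data, StandardCharsets.UTF_8));
        zk.close();
    }
}
```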






10. Sqoop



Sqoop is a tool for transferring data between Hadoop and relational databases: it can import data from a relational database into HDFS, or export data from HDFS back into a relational database.
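
Sqoop is normally driven from the command line, but Sqoop 1 also exposes a Java entry point. A sketch equivalent to a "sqoop import" command (the JDBC URL, credentials, table, and target directory are illustrative assumptions):

```java
import org.apache.sqoop.Sqoop;

public class SqoopImportExample {
    public static void main(String[] args) {
        // Same arguments the 'sqoop' launcher script would pass through.
        String[] sqoopArgs = {
                "import",
                "--connect", "jdbc:mysql://db.example.com/sales",
                "--username", "etl",
                "--table", "orders",
                "--target-dir", "/user/hadoop/orders"
        };
        int exitCode = Sqoop.runTool(sqoopArgs);
        System.exit(exitCode);
    }
}
```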






11. Oozie



Apache Oozie is a scalable, reliable, and extensible workflow scheduling system for managing Hadoop jobs. An Oozie workflow job is a Directed Acyclic Graph (DAG) of actions. Oozie coordinator jobs trigger recurring workflow jobs, typically based on time (frequency) and data availability. Oozie integrates with the rest of the Hadoop stack and supports several types of Hadoop jobs out of the box (such as Java MapReduce, streaming MapReduce, Pig, Hive, Sqoop, and DistCp) as well as system-specific jobs (such as Java programs and shell scripts).
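
Workflows are defined in a workflow.xml stored on HDFS and can be submitted through the Oozie Java client. A sketch (the server URL, application path, and cluster addresses are illustrative assumptions):

```java
import java.util.Properties;

import org.apache.oozie.client.OozieClient;
import org.apache.oozie.client.WorkflowJob;

public class OozieClientExample {
    public static void main(String[] args) throws Exception {
        OozieClient client = new OozieClient("http://oozie.example.com:11000/oozie");

        Properties conf = client.createConfiguration();
        // APP_PATH points at the HDFS directory holding workflow.xml.
        conf.setProperty(OozieClient.APP_PATH, "hdfs://namenode.example.com:8020/apps/my-workflow");
        conf.setProperty("nameNode", "hdfs://namenode.example.com:8020");
        conf.setProperty("jobTracker", "resourcemanager.example.com:8032");

        // Submit and start the workflow DAG.
        String jobId = client.run(conf);
        System.out.println("Workflow job submitted: " + jobId);

        WorkflowJob job = client.getJobInfo(jobId);
        System.out.println("Status: " + job.getStatus());
    }
}
```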






12. Mahout



Apache Mahout is a scalable machine learning and data mining library. Mahout currently supports four main use cases (a small recommender example follows the list):



Recommendation mining: takes users' behavior and from it finds items those users might like.
Clustering: takes documents and groups them into collections of topically related documents.
Classification: learns from existing categorized documents what documents of a given category look like, and assigns unlabeled documents to the correct category.
Frequent itemset mining: takes a set of item groups and identifies which individual items usually appear together.
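
As a sketch of the first use case, here is a user-based collaborative filtering recommender built with Mahout's Taste API (the ratings.csv file of "userID,itemID,preference" lines is an illustrative assumption):

```java
import java.io.File;
import java.util.List;

import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class MahoutRecommenderExample {
    public static void main(String[] args) throws Exception {
        // Each line of ratings.csv: userID,itemID,preference
        DataModel model = new FileDataModel(new File("ratings.csv"));
        UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
        UserNeighborhood neighborhood = new NearestNUserNeighborhood(10, similarity, model);
        Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);

        // Top 3 recommendations for user 1, based on similar users' preferences.
        List<RecommendedItem> items = recommender.recommend(1, 3);
        for (RecommendedItem item : items) {
            System.out.println(item.getItemID() + " -> " + item.getValue());
        }
    }
}
```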




13. HCatalog



Apache HCatalog is a table and storage management service for data created with Apache Hadoop. It provides the following (a sketch of using it from MapReduce follows the list):



A mechanism for sharing schemas and data types.
A table abstraction, so that users need not be concerned with where or how their data is stored.
Interoperability across data processing tools such as Pig, MapReduce, and Hive.
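
For example, a MapReduce job can read an HCatalog-managed table by name, without knowing its HDFS location or storage format. A partial sketch of the job setup (the database and table names are illustrative assumptions):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hive.hcatalog.mapreduce.HCatInputFormat;

public class HCatalogJobSetup {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "read via hcatalog");

        // Point the job at a table by name only; HCatalog resolves the
        // storage location and format, so the job needs no HDFS paths.
        HCatInputFormat.setInput(job, "default", "web_logs");
        job.setInputFormatClass(HCatInputFormat.class);

        // ... set mapper/reducer classes as in any MapReduce job, then submit.
    }
}
```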





