Big data has become the latest buzzword across almost every business area, but what exactly is big data? Is it a gimmick, a bubble, or is it really as important as the hype suggests?
In fact, big data is a very simple term: just as it says, a very large dataset. How large? The honest answer is "as large as you can imagine"!
Why do such large datasets exist? Because data is now everywhere and there are huge rewards for collecting it: RFID sensors that capture communications data, sensors that gather weather information, GPRS packets sent from mobile devices, posts to social networking sites, pictures and videos, and transactions generated by online shopping. Big data is a huge dataset containing information produced by any data source, provided that information is of interest to us.
However, big data is by no means only about volume. Big data can also be used to discover new insights and to create new data and content, and we can use the insights, data, and content extracted from it to make businesses more agile and to answer questions that were previously considered far out of reach. This is why big data is characterized along the following four dimensions: Volume, Variety, Velocity, and Veracity, commonly known as the 4Vs of big data. Each characteristic and the challenges it brings are described below:
1. Volume
Volume refers to the amount of data a business must capture, store, and access; 90% of all the data in the world was produced in the past two years alone. Today's organizations are being overwhelmed by data volume, easily generating terabytes and even petabytes of data of different types, much of which needs to be organized, protected against theft, and analyzed.
2. Variety
80% of the world's data is semi-structured; sensors, smart devices, and social media generate it through web pages, log files, social media forums, audio, video, click streams, e-mail, documents, and sensor systems. Traditional analysis solutions usually handle only structured data, for example data stored in a relational database with a complete schema. The diversity of data types means we need fundamental changes in how data is stored and analyzed in order to support today's decision making and knowledge processing. Variety represents the data types that cannot easily be captured and managed in a traditional relational database but can readily be stored and analyzed with big data technologies.
3. Velocity
Velocity requires analyzing data in near real time, sometimes phrased as "two minutes is too late!". Gaining a competitive advantage means identifying a new trend or opportunity in minutes or even seconds, before your competitors do. Another example is the processing of time-sensitive data, such as tracking down criminals, where the data must be collected and analyzed while it can still deliver maximum value. Time-sensitive data often has a very short shelf life, which forces organizations and agencies to analyze it in near real time.
4. Veracity
By analyzing data we seize opportunities and reap value, and the importance of data lies in its support for decisions; when facing a decision that may have a significant impact on your business, you want as much information relevant to the use case as possible. Data volume alone does not determine whether it helps decision making; the veracity and quality of the data are the most important factors in gaining insight and ideas, and thus the most solid foundation for making successful decisions.
However, current business intelligence and data warehousing technology does not fully support the 4Vs; big data solutions are being developed precisely to address these challenges.
The following are the main open source, Java-friendly tools in the big data space:
1. HDFS
HDFS is the primary distributed storage system used by Hadoop applications. An HDFS cluster consists of a NameNode (the master node), which manages the metadata of the file system, and DataNodes, which store the actual data. HDFS is designed for massive data: whereas traditional file systems are optimized for large numbers of small files, HDFS is optimized for storing and accessing a small number of very large files.
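As a small illustration, the following is a minimal sketch of writing and reading a file through the standard Hadoop FileSystem Java API; the cluster address and file path are placeholders, not details from any particular deployment.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
    public static void main(String[] args) throws Exception {
        // Cluster address is a placeholder; in practice it usually comes from core-site.xml.
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://localhost:9000");
        FileSystem fs = FileSystem.get(conf);

        // Write a small file to HDFS.
        Path file = new Path("/tmp/hello.txt");
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
        }

        // Read it back.
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(fs.open(file), StandardCharsets.UTF_8))) {
            System.out.println(in.readLine());
        }
        fs.close();
    }
}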
2. MapReduce
Hadoop MapReduce is a software framework for easily writing applications that process massive amounts of data (multi-terabyte datasets) in parallel on large clusters of thousands of nodes (commodity hardware) in a reliable, fault-tolerant manner.
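The canonical example is word count: the mapper emits a (word, 1) pair for every word it sees, and the reducer sums the counts per word. The sketch below follows that standard pattern; the input and output paths are supplied on the command line.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: emit (word, 1) for every word in the input split.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce phase: sum the counts for each word.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}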
3. HBase
Apache HBase is the Hadoop database: a distributed, scalable big data store. It provides random, real-time read/write access to large datasets and is optimized for very large tables, up to tens of billions of rows, on clusters of commodity servers. At its core it is an open source implementation of the distributed storage model described in Google's BigTable paper. Just as BigTable builds on the distributed storage provided by GFS (Google File System), HBase is a BigTable-like store that Apache Hadoop provides on top of HDFS.
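A minimal sketch of random writes and reads through the HBase Java client follows. It assumes a table named "users" with a column family "info" already exists, and the ZooKeeper quorum address is a placeholder for a local test cluster.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseExample {
    public static void main(String[] args) throws Exception {
        // ZooKeeper quorum address is a placeholder.
        Configuration conf = HBaseConfiguration.create();
        conf.set("hbase.zookeeper.quorum", "localhost");

        try (Connection connection = ConnectionFactory.createConnection(conf);
             // Assumes a table 'users' with column family 'info' already exists.
             Table table = connection.getTable(TableName.valueOf("users"))) {

            // Random write: one row keyed by user id.
            Put put = new Put(Bytes.toBytes("user-1001"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Alice"));
            table.put(put);

            // Random read of the same row.
            Result result = table.get(new Get(Bytes.toBytes("user-1001")));
            byte[] name = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
            System.out.println("name = " + Bytes.toString(name));
        }
    }
}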
4. Cassandra
Apache Cassandra is a high-performance, linearly scalable, highly available database suited to building a mission-critical data platform on commodity hardware or cloud infrastructure. Cassandra's replication across data centers is best in class, giving users lower latency and more reliable disaster recovery. Its data model offers convenient secondary (column) indexes, along with strong support for log-structured updates, denormalization, materialized views, and powerful built-in caching.
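The sketch below assumes the DataStax Java driver and a single local node; the contact point, keyspace, and table names are illustrative only.

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;

public class CassandraExample {
    public static void main(String[] args) {
        // Connect to a local node; the contact point is a placeholder.
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect()) {

            // Keyspace and table are illustrative.
            session.execute("CREATE KEYSPACE IF NOT EXISTS demo "
                    + "WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}");
            session.execute("CREATE TABLE IF NOT EXISTS demo.users "
                    + "(id int PRIMARY KEY, name text)");

            session.execute("INSERT INTO demo.users (id, name) VALUES (1, 'Alice')");

            ResultSet rs = session.execute("SELECT id, name FROM demo.users");
            for (Row row : rs) {
                System.out.println(row.getInt("id") + " -> " + row.getString("name"));
            }
        }
    }
}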
5. Hive
Apache Hive is a data warehouse system for Hadoop that facilitates data summarization (mapping structured data files onto database tables), ad hoc queries, and the analysis of large datasets stored in Hadoop-compatible file systems. Hive provides a complete SQL-like query language called HiveQL; when it is inefficient or cumbersome to express some logic in HiveQL, it also lets traditional MapReduce programmers plug in their own custom mappers and reducers.
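For example, HiveQL can be issued from Java through the Hive JDBC driver (connecting to HiveServer2). In the sketch below the connection URL, credentials, and table are placeholders.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcExample {
    public static void main(String[] args) throws Exception {
        // HiveServer2 JDBC driver; URL, user, and table names are placeholders.
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection conn = DriverManager.getConnection(
                     "jdbc:hive2://localhost:10000/default", "hive", "");
             Statement stmt = conn.createStatement()) {

            // Map delimited text files onto a table, then query it with HiveQL.
            stmt.execute("CREATE TABLE IF NOT EXISTS page_views "
                    + "(user_id STRING, url STRING) "
                    + "ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t'");

            try (ResultSet rs = stmt.executeQuery(
                    "SELECT url, COUNT(*) AS hits FROM page_views GROUP BY url")) {
                while (rs.next()) {
                    System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
                }
            }
        }
    }
}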
6. Pig
Apache Pig is a platform for analyzing large datasets. It consists of a high-level language for expressing data analysis programs and an infrastructure for evaluating those programs. The salient property of Pig programs is that their structure is amenable to substantial parallelization, which is what enables them to handle very large datasets. Pig's infrastructure layer includes a compiler that produces sequences of MapReduce jobs, and its language layer currently consists of a textual language called Pig Latin, designed to be easy to program with and to be extensible.
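Pig Latin can also be run from Java through the PigServer class. The rough sketch below does so in local mode; the input and output paths and field names are placeholders.

import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class PigEmbeddedExample {
    public static void main(String[] args) throws Exception {
        // Run Pig Latin from Java in local mode; file paths are placeholders.
        PigServer pig = new PigServer(ExecType.LOCAL);

        // Load tab-separated log lines, group them by URL, and count hits.
        pig.registerQuery("logs = LOAD 'input/page_views.tsv' "
                + "AS (user_id:chararray, url:chararray);");
        pig.registerQuery("grouped = GROUP logs BY url;");
        pig.registerQuery("hits = FOREACH grouped GENERATE group AS url, COUNT(logs) AS n;");

        // Materialize the result; in MapReduce mode Pig would compile this
        // pipeline into a sequence of MapReduce jobs.
        pig.store("hits", "output/url_hits");
        pig.shutdown();
    }
}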
7. Chukwa
Apache Chukwa is an open source data collection system for monitoring large distributed systems. Built on top of HDFS and the MapReduce framework, it inherits Hadoop's scalability and robustness. Chukwa also includes a flexible and powerful toolkit for displaying, monitoring, and analyzing results, so that the collected data can be put to the best possible use.
8. Ambari
Apache Ambari is a web-based tool for provisioning, managing, and monitoring Apache Hadoop clusters, with support for Hadoop HDFS, Hadoop MapReduce, Hive, HCatalog, HBase, ZooKeeper, Oozie, Pig, and Sqoop. Ambari also provides a dashboard for viewing cluster health, including heatmaps, and the ability to view MapReduce, Pig, and Hive applications and diagnose their performance characteristics in a user-friendly interface.
9. ZooKeeper
Apache ZooKeeper is a reliable coordination service for large distributed systems, providing features such as configuration maintenance, naming, distributed synchronization, and group services. ZooKeeper's goal is to encapsulate these complex and error-prone core services, exposing an easy-to-use interface and a high-performance, functionally stable system to users.
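A minimal sketch using the ZooKeeper Java client follows: it stores a small piece of shared configuration under a znode and reads it back. The connection string, znode path, and configuration value are placeholders.

import java.nio.charset.StandardCharsets;
import java.util.concurrent.CountDownLatch;

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZooKeeperExample {
    public static void main(String[] args) throws Exception {
        // Connection string is a placeholder for a local ensemble.
        CountDownLatch connected = new CountDownLatch(1);
        ZooKeeper zk = new ZooKeeper("localhost:2181", 3000, event -> {
            if (event.getState() == Watcher.Event.KeeperState.SyncConnected) {
                connected.countDown();
            }
        });
        connected.await();

        // Store a small piece of shared configuration under a znode.
        String path = "/demo-config";
        if (zk.exists(path, false) == null) {
            zk.create(path, "batch.size=100".getBytes(StandardCharsets.UTF_8),
                    ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        }

        // Read it back; any client connected to the ensemble sees the same value.
        byte[] data = zk.getData(path, false, null);
        System.out.println(new String(data, StandardCharsets.UTF_8));
        zk.close();
    }
}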
10. Sqoop
Apache Sqoop is a tool designed to transfer data between Hadoop and relational databases: it can import data from a relational database into Hadoop's HDFS, or export data from HDFS back into a relational database.
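Sqoop is normally driven from the command line, but a Sqoop 1 import can also be launched from Java via Sqoop.runTool, as in the rough sketch below; the JDBC URL, credentials, table name, and target directory are all placeholders.

import org.apache.sqoop.Sqoop;

public class SqoopImportExample {
    public static void main(String[] args) {
        // Equivalent to running `sqoop import ...` on the command line.
        // The JDBC URL, credentials, table, and target directory are placeholders.
        String[] importArgs = {
            "import",
            "--connect", "jdbc:mysql://localhost:3306/shop",
            "--username", "report",
            "--password", "secret",
            "--table", "orders",
            "--target-dir", "/user/hadoop/orders",
            "-m", "1"   // a single mapper, enough for a small demo import
        };
        int exitCode = Sqoop.runTool(importArgs);
        System.exit(exitCode);
    }
}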
11. Oozie
Apache Oozie is an extensible, reliable, and scalable workflow scheduler system for managing Hadoop jobs. An Oozie workflow job is a Directed Acyclic Graph (DAG) of actions. Oozie Coordinator jobs are recurrent Oozie workflow jobs triggered by time (frequency) and data availability. Oozie is integrated with the rest of the Hadoop stack and supports several types of Hadoop jobs out of the box (such as Java MapReduce, streaming MapReduce, Pig, Hive, Sqoop, and DistCp) as well as system-specific jobs (such as Java programs and shell scripts).
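Workflows can also be submitted programmatically through the Oozie Java client API. The sketch below assumes a workflow.xml describing the DAG has already been uploaded to HDFS; the server URL, paths, and property values are placeholders.

import java.util.Properties;

import org.apache.oozie.client.OozieClient;
import org.apache.oozie.client.WorkflowJob;

public class OozieSubmitExample {
    public static void main(String[] args) throws Exception {
        // Oozie server URL and HDFS paths are placeholders.
        OozieClient client = new OozieClient("http://localhost:11000/oozie");

        // Point at a workflow.xml (a DAG of actions) already uploaded to HDFS.
        Properties conf = client.createConfiguration();
        conf.setProperty(OozieClient.APP_PATH,
                "hdfs://localhost:9000/user/hadoop/apps/wordcount-wf");
        conf.setProperty("nameNode", "hdfs://localhost:9000");
        conf.setProperty("jobTracker", "localhost:8032");

        // Submit and start the workflow, then check its status.
        String jobId = client.run(conf);
        WorkflowJob job = client.getJobInfo(jobId);
        System.out.println("Workflow " + jobId + " is " + job.getStatus());
    }
}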
12. Mahout
Apache Mahout is a scalable machine learning and data mining library. The current Mahout supports four main use cases:
Recommendation mining: collects user actions and uses them to recommend things the user might like.
Clustering: collects documents and groups related documents together.
Classification: learns from existing categorized documents what the distinguishing features of each category are, and assigns unlabeled documents to the correct category.
Frequent itemset mining: takes groups of items and identifies which individual items usually appear together.
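As an example of the recommendation use case, the following sketch uses Mahout's Taste collaborative filtering API. The ratings file, neighborhood size, and user ID are illustrative assumptions.

import java.io.File;
import java.util.List;

import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;

public class MahoutRecommenderExample {
    public static void main(String[] args) throws Exception {
        // ratings.csv (userID,itemID,preference per line) is a placeholder dataset.
        DataModel model = new FileDataModel(new File("ratings.csv"));

        // Compare users by the similarity of their ratings, keep the 10 nearest
        // neighbors, and recommend items those neighbors liked.
        PearsonCorrelationSimilarity similarity = new PearsonCorrelationSimilarity(model);
        NearestNUserNeighborhood neighborhood = new NearestNUserNeighborhood(10, similarity, model);
        Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);

        // Top 3 recommendations for user 1.
        List<RecommendedItem> items = recommender.recommend(1, 3);
        for (RecommendedItem item : items) {
            System.out.println("item " + item.getItemID() + " score " + item.getValue());
        }
    }
}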
13. HCatalog
Apache HCatalog is a table and storage management service for data created with Hadoop. It provides:
A mechanism for sharing schemas and data types.
A table abstraction so that users do not need to care about how or where their data is stored.
Interoperability across data processing tools such as Pig, MapReduce, and Hive.