Technologies used in big data

Source: Internet
Author: User
Tags: cassandra, zookeeper, sqoop

Transferred from: http://www.jdon.com/bigdata/whatisbigdata.html

----------

You may be asking: what is big data, the latest trend in almost every business area? Is it just hype?

In fact "big data" is a very simple term-it just says-a very large data set. How big is it? The exact answer is "as big as you can imagine"!

Why is this data set so large? Because the data can come from ubiquitous sources, around the clock: RFID sensors, traffic data, weather sensors, mobile phone GPRS packets, social media posts, digital photos and videos, online purchase transactions, you name it! Big data is a huge data set that combines data from every source of information we are interested in.

Big data is characterized by four main aspects: volume, variety, velocity, and veracity (value), known as the "four V's" of big data.

Volume

Volume refers to the amount of business data that can be captured, stored, and accessed. 90% of the world's data was generated in the past two years alone. Most organizations are already overwhelmed by data that has accumulated into terabytes or even petabytes, some of which needs to be organized, preserved, and analyzed.

Variety

80% of the world's data is semi-structured or unstructured. Sensors, smart devices, and social media all generate such data: web logs, social media forums, audio, video, clickstreams, email, documents, sensor readings, and so on. Traditional analytic solutions work well with structured data, such as data held in relational databases with a well-formed schema. Today, however, the need to store and analyze every kind of data is growing: a big data solution must represent all data types, not just the data traditionally managed in relational databases, and big data technologies make such storage and analysis straightforward to implement.

Velocity

Velocity refers to the need for real-time data analysis: "sometimes a two-minute delay is already too late!" Gaining a competitive advantage may mean identifying a trend or opportunity minutes, or even seconds, before your competitor does. Another example is time-sensitive processing, such as catching fraud: information flows into your business continuously and must be analyzed in real time. Time-sensitive data has a short shelf life; some well-known companies analyze it in near real time.

Veracity (value)

We create opportunities and derive value from data. Data underpins every decision, so when you face decisions that can have a significant impact on your business, you want as much supporting information as possible. Volume alone, however, does not create trust: the veracity and quality of the data are paramount. Building decisions on a trustworthy big data foundation is therefore both the biggest challenge and the solid basis of successful decision-making.

Here are the Java-based products that support big data:

Hadoop

Hadoop consists of HDFS and MapReduce. HDFS is Hadoop's primary distributed storage. An HDFS cluster consists mainly of a NameNode, which manages the file system metadata, and DataNodes, which store the actual data. HDFS is designed specifically for storing very large amounts of data, with access patterns optimized accordingly.
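
As a minimal sketch of how an application reaches HDFS through its Java API (the file path here is invented for illustration, and a reachable cluster configuration on the classpath is assumed):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsWriteExample {
        public static void main(String[] args) throws Exception {
            // Picks up fs.defaultFS from core-site.xml on the classpath.
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            // Write a small file; in practice HDFS is tuned for large files.
            Path path = new Path("/tmp/hello.txt"); // hypothetical path
            try (FSDataOutputStream out = fs.create(path, true)) {
                out.writeUTF("hello, HDFS");
            }

            // The NameNode resolves the metadata; DataNodes serve the bytes.
            System.out.println("exists: " + fs.exists(path));
        }
    }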

Hadoop's MapReduce is a software framework for easily writing applications that process large amounts of data (multi-terabyte data sets) in parallel, in a reliable, fault-tolerant way, on large clusters of commodity server hardware with thousands of nodes.
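
The canonical example is word count; the compact sketch below uses the org.apache.hadoop.mapreduce API (the input and output paths come from the command line and are assumptions of the example):

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {
        // Map phase: emit (word, 1) for every token in the input split.
        public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();
            @Override
            protected void map(Object key, Text value, Context ctx)
                    throws IOException, InterruptedException {
                for (String token : value.toString().split("\\s+")) {
                    if (!token.isEmpty()) { word.set(token); ctx.write(word, ONE); }
                }
            }
        }

        // Reduce phase: sum the counts for each word.
        public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) sum += v.get();
                ctx.write(key, new IntWritable(sum));
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setCombinerClass(IntSumReducer.class);
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }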

See also: Hadoop Big Data Batch Architecture

Apache HBase

Apache HBase is the Hadoop database: a distributed, scalable data store. It provides random, real-time read/write access to big data, and is optimized to host very large tables (billions of rows by millions of columns) on clusters of commodity hardware. At its core, Apache HBase is a distributed, column-oriented database in the mold of Google's Bigtable: it provides Bigtable-like capabilities on top of Hadoop and HDFS.
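
A minimal sketch of random read/write access through the HBase Java client (the table, column family, and row key are invented for illustration; an HBase quorum reachable via the classpath configuration is assumed):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.*;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBaseExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            try (Connection conn = ConnectionFactory.createConnection(conf);
                 Table table = conn.getTable(TableName.valueOf("users"))) { // hypothetical table
                // Random write: one row in column family "info".
                Put put = new Put(Bytes.toBytes("row-1"));
                put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("alice"));
                table.put(put);

                // Random, real-time read of the same cell.
                Result result = table.get(new Get(Bytes.toBytes("row-1")));
                System.out.println(Bytes.toString(
                    result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"))));
            }
        }
    }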

See also: NoSQL Tour --- HBase

Apache Cassandra

Apache Cassandra is a high-performance, linearly scalable, highly available database that can run on commodity servers or cloud infrastructure, providing the perfect platform for mission-critical data. Cassandra's support for replication across multiple data centers is best in class, giving users lower latency and the confidence to survive even a regional outage. The Cassandra data model offers convenient column indexes, the performance of log-structured updates, and powerful built-in caching.
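
A minimal sketch using the DataStax Java driver (3.x-style API; the contact point, keyspace, and table are assumptions of the example):

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.ResultSet;
    import com.datastax.driver.core.Row;
    import com.datastax.driver.core.Session;

    public class CassandraExample {
        public static void main(String[] args) {
            try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
                 Session session = cluster.connect()) {
                // SimpleStrategy with RF=1 is only suitable for a local demo.
                session.execute("CREATE KEYSPACE IF NOT EXISTS demo WITH replication = "
                    + "{'class': 'SimpleStrategy', 'replication_factor': 1}");
                session.execute("CREATE TABLE IF NOT EXISTS demo.users (id int PRIMARY KEY, name text)");
                session.execute("INSERT INTO demo.users (id, name) VALUES (1, 'alice')");

                ResultSet rs = session.execute("SELECT name FROM demo.users WHERE id = 1");
                for (Row row : rs) {
                    System.out.println(row.getString("name"));
                }
            }
        }
    }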

See also: Cassandra special topic

Apache Hive

Apache Hive is a data warehouse system for Hadoop that facilitates easy data summarization and the querying and analysis of large datasets stored in Hadoop-compatible file systems. Hive provides an SQL-like language for querying data, called HiveQL. At the same time, the language lets traditional Map/Reduce programmers plug in their own custom mappers and reducers.
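
A minimal sketch of running HiveQL over JDBC against HiveServer2 (the connection URL, credentials, and the web_logs table are assumptions of the example):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveJdbcExample {
        public static void main(String[] args) throws Exception {
            // HiveServer2 JDBC driver; default port 10000, "default" database.
            Class.forName("org.apache.hive.jdbc.HiveDriver");
            try (Connection conn = DriverManager.getConnection(
                     "jdbc:hive2://localhost:10000/default", "hive", "");
                 Statement stmt = conn.createStatement()) {
                // HiveQL looks like SQL but is compiled into MapReduce jobs.
                ResultSet rs = stmt.executeQuery(
                    "SELECT page, COUNT(*) AS hits FROM web_logs GROUP BY page"); // hypothetical table
                while (rs.next()) {
                    System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
                }
            }
        }
    }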

See also: Hive architecture

Apache Pig

Apache Pig is a platform for analyzing large datasets. It provides a high-level scripting language for writing data analysis programs. The notable property of Pig programs is that they are amenable to substantial parallelization, which in turn lets them handle very large datasets. Pig's infrastructure layer consists of a compiler that produces sequences of MapReduce programs. Pig's language, called Pig Latin, is easy to develop and program in, and is designed for extensibility and ease of use.
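
Pig Latin can also be driven from Java through the PigServer API; a minimal sketch in local mode (the users.txt input of tab-separated records is a made-up file):

    import org.apache.pig.ExecType;
    import org.apache.pig.PigServer;

    public class PigExample {
        public static void main(String[] args) throws Exception {
            // Local mode for illustration; use ExecType.MAPREDUCE on a cluster.
            PigServer pig = new PigServer(ExecType.LOCAL);
            // Pig Latin: load, filter, store; compiled into MapReduce stages.
            pig.registerQuery("users = LOAD 'users.txt' AS (name:chararray, age:int);");
            pig.registerQuery("adults = FILTER users BY age >= 18;");
            pig.store("adults", "adults_out"); // writes results to the adults_out directory
        }
    }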

Apache Chukwa

Apache Chukwa is an open source data collection and monitoring system for large distributed systems. It is built on the Hadoop Distributed File System (HDFS) and the MapReduce framework, and inherits Hadoop's scalability and robustness. Chukwa also includes a flexible and powerful toolkit for displaying, monitoring, and analyzing results, so that the collected data can be put to the best possible use.

Apache Ambari

Apache Ambari is a web-based tool for provisioning, managing, and monitoring Apache Hadoop clusters, with support for HBase, Hadoop MapReduce, HDFS, Hive, HCatalog, ZooKeeper, Oozie, Pig, and Sqoop. Ambari also provides dashboards for viewing cluster health, such as heatmaps, and for visualizing MapReduce, Pig, and Hive applications in a user-friendly way, making it easy to diagnose their performance characteristics.

Apache ZooKeeper

Apache ZooKeeper is a centralized service that maintains configuration information, provides naming, and offers distributed synchronization and group services. ZooKeeper coordinates the distributed applications running on a Hadoop cluster.
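
A minimal sketch with the ZooKeeper Java client: creating a znode holding shared configuration that other processes can read or watch (the connection string and znode path are assumptions; re-running it will fail because the node already exists):

    import java.util.concurrent.CountDownLatch;
    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.WatchedEvent;
    import org.apache.zookeeper.Watcher;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    public class ZkExample {
        public static void main(String[] args) throws Exception {
            CountDownLatch connected = new CountDownLatch(1);
            ZooKeeper zk = new ZooKeeper("localhost:2181", 3000, (WatchedEvent e) -> {
                if (e.getState() == Watcher.Event.KeeperState.SyncConnected) {
                    connected.countDown();
                }
            });
            connected.await(); // wait for the session to be established

            // A persistent znode holding a piece of shared configuration.
            zk.create("/demo-config", "v1".getBytes(),
                      ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
            System.out.println(new String(zk.getData("/demo-config", false, null)));
            zk.close();
        }
    }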

See also: Applying ZooKeeper to service discovery

Apache Sqoop

Apache Sqoop is a tool designed for Apache Hadoop that efficiently transfers bulk data between Hadoop and structured data stores such as relational databases.
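
Sqoop is normally driven from the command line; the sketch below shows the equivalent invocation from Java through Sqoop's tool entry point (the JDBC URL, credentials, table, and directories are all assumptions of the example):

    import org.apache.sqoop.Sqoop;

    public class SqoopImportExample {
        public static void main(String[] args) {
            // Equivalent to: sqoop import --connect ... --table orders ...
            int exitCode = Sqoop.runTool(new String[] {
                "import",
                "--connect", "jdbc:mysql://db-host/shop",  // hypothetical source database
                "--username", "etl",
                "--password-file", "/user/etl/.pw",        // keeps credentials off the command line
                "--table", "orders",                       // relational table to import
                "--target-dir", "/data/orders",            // HDFS destination directory
                "--num-mappers", "4"                       // parallel import tasks
            });
            System.exit(exitCode);
        }
    }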

Apache Oozie

Apache Oozie is a scalable, reliable, and extensible workflow scheduler system that manages Apache Hadoop jobs. Oozie Workflow jobs are DAGs (Directed Acyclic Graphs) of actions. Oozie Coordinator jobs are recurrent Oozie Workflow jobs, typically triggered by time and by the arrival of data. Out of the box, Oozie supports several types of Hadoop jobs (Java MapReduce, streaming MapReduce, Pig, Hive, Sqoop, and DistCp) as well as system-specific jobs (such as shell scripts).
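
Workflows themselves are defined in XML; below is a minimal sketch of submitting one from Java through OozieClient (the server URL and the HDFS application path are assumptions of the example):

    import java.util.Properties;
    import org.apache.oozie.client.OozieClient;
    import org.apache.oozie.client.WorkflowJob;

    public class OozieSubmitExample {
        public static void main(String[] args) throws Exception {
            OozieClient client = new OozieClient("http://localhost:11000/oozie");

            Properties conf = client.createConfiguration();
            // Points at a workflow.xml already deployed in HDFS (hypothetical path).
            conf.setProperty(OozieClient.APP_PATH, "hdfs://namenode:8020/apps/wordcount-wf");
            conf.setProperty("queueName", "default");

            String jobId = client.run(conf);   // submit and start the workflow DAG
            Thread.sleep(10_000);              // give the workflow a moment to progress
            WorkflowJob job = client.getJobInfo(jobId);
            System.out.println(jobId + " -> " + job.getStatus());
        }
    }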

Apache Mahout

Apache Mahout is a scalable machine learning and data mining library. Mahout currently supports four main use cases (a small recommender sketch follows the list):
Recommendation mining: takes users' behavior and from that tries to find items those users might like.
Clustering: takes documents, such as text files, and groups them into clusters of topically related documents.
Classification: learns from existing categorized documents what a given category looks like, and assigns new documents to the correct category.
Frequent itemset mining: takes groups of items (for example, the contents of a shopping cart in one session) and identifies which items usually appear together.
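
As a sketch of the first use case, here is a user-based recommender built with Mahout's Taste classes (0.x-era API; the ratings.csv file of userID,itemID,preference lines is a made-up input):

    import java.io.File;
    import java.util.List;
    import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
    import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
    import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
    import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
    import org.apache.mahout.cf.taste.model.DataModel;
    import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
    import org.apache.mahout.cf.taste.recommender.RecommendedItem;
    import org.apache.mahout.cf.taste.recommender.Recommender;
    import org.apache.mahout.cf.taste.similarity.UserSimilarity;

    public class RecommenderExample {
        public static void main(String[] args) throws Exception {
            // Each line of ratings.csv: userID,itemID,preference
            DataModel model = new FileDataModel(new File("ratings.csv"));
            UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
            UserNeighborhood neighborhood = new NearestNUserNeighborhood(10, similarity, model);
            Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);

            // Top 3 items for user 1, based on the behavior of similar users.
            List<RecommendedItem> items = recommender.recommend(1, 3);
            for (RecommendedItem item : items) {
                System.out.println(item.getItemID() + " (" + item.getValue() + ")");
            }
        }
    }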

See also: Using Mahout for natural language processing

Apache Hcatalog

Apache HCatalog is a table and storage management service for data created with Apache Hadoop. It provides:

    • A mechanism for sharing schemas and data types.
    • A table abstraction, so that users need not care about where or how their data is stored.
    • Interoperability across data processing tools such as Pig, MapReduce, and Hive.
