Big Data Buzzwords Report

It's fair to say that big data is one of the hottest trends in the IT industry today, and it has spawned a whole new batch of technologies for dealing with it. Those new technologies have in turn brought the latest crop of buzzwords: acronyms, technical terms, and product names. Even the phrase "big data" itself is confusing: many people hear it and assume it simply means "a lot of data", but big data involves much more than just the amount of data.

Here are the buzzwords we think you should be familiar with, listed in alphabetical order.

ACID

ACID stands for atomicity, consistency, isolation, and durability, which is really a set of requirements or properties: if all four are adhered to, the integrity of a database transaction's data is guaranteed during processing. Although ACID has been around for some time, the rapid growth of transaction data has focused fresh attention on the need to meet ACID requirements when dealing with big data.
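
SQLite, a small ACID-compliant database bundled with Python's standard library, offers a quick way to see atomicity in action. A minimal sketch, using a hypothetical two-account transfer (the table and amounts are invented for illustration):

```python
import sqlite3

conn = sqlite3.connect("accounts.db")
conn.execute("CREATE TABLE IF NOT EXISTS accounts "
             "(name TEXT PRIMARY KEY, balance REAL CHECK (balance >= 0))")
conn.execute("INSERT OR REPLACE INTO accounts VALUES ('alice', 100), ('bob', 50)")
conn.commit()

try:
    with conn:  # one transaction: commits on success, rolls back on any error
        conn.execute("UPDATE accounts SET balance = balance - 500 WHERE name = 'alice'")
        conn.execute("UPDATE accounts SET balance = balance + 500 WHERE name = 'bob'")
except sqlite3.IntegrityError:
    pass  # the CHECK constraint fired, so the whole transfer rolled back

# Atomicity: neither balance changed, even though the first UPDATE was attempted.
print(conn.execute("SELECT * FROM accounts ORDER BY name").fetchall())
```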

The three elements of big data

Today's IT systems generate data that is "big" in volume, velocity, and variety.

Volume: IDC estimates that the total amount of information in the world will reach 2.7 zettabytes (the equivalent of 2.7 billion terabytes) this year, and that it doubles every two years.

Velocity: It isn't just the sheer volume of data that gives IT managers headaches, but the ever-faster flow of data from financial systems, retail systems, websites, sensors, radio-frequency identification (RFID) chips, and social networks such as Facebook and Twitter.

Variety: Go back five or maybe ten years, and IT staff mostly dealt with alphanumeric data that was easy to store in the neat rows and columns of relational databases. That is no longer the case. Today, unstructured data such as tweets and Facebook posts, documents of all kinds, and Web content are all part of the big data mix.

Columnar (or column-oriented) databases

Some next-generation databases, such as the open source Cassandra and HP's Vertica, are designed to store data by column rather than by row, as traditional SQL databases do. This design provides faster disk access and better performance when working with big data. Columnar databases are especially popular for data-intensive business analytics applications.
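
A toy sketch of the difference between the two layouts (not modeled on any particular product's storage engine): scanning a single column touches far less data when storage is column-oriented.

```python
# Row-oriented: each record is stored together; a single-column scan
# still has to read every field of every record.
rows = [
    {"id": 1, "name": "alice", "amount": 120.0},
    {"id": 2, "name": "bob",   "amount": 75.5},
]

# Column-oriented: each column is stored contiguously; a scan touches
# only the column it needs.
columns = {
    "id":     [1, 2],
    "name":   ["alice", "bob"],
    "amount": [120.0, 75.5],
}

# Summing 'amount' in the columnar layout never touches 'id' or 'name'.
total = sum(columns["amount"])
print(total)  # 195.5
```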

Data Warehouse

The concept of the data warehouse has been around for about 25 years: specifically, copying data from multiple operational IT systems into a secondary, offline database for business analytics applications.

But as data volumes have grown dramatically, data warehouse systems are changing rapidly. They need to store more data, and more kinds of data, so warehouse management is a major challenge. Ten or twenty years ago, data might be copied into the warehouse weekly or monthly; today, warehouses are typically refreshed much more frequently, sometimes even in real time.

ETL

Extract, transform, and load (ETL) software is used when moving data from one database, such as the one supporting a bank's transaction application system, to another, such as a data warehouse system used for business analytics. Data often has to be reformatted and cleansed as it moves from one database to another.

Because data volumes have grown so quickly and processing has to happen much faster, the performance demands on ETL tools have risen sharply.
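
A minimal sketch of the three ETL stages in plain Python; the file names and field layout are hypothetical:

```python
import csv

def extract(path):
    """Extract: read raw transaction records from a source CSV file."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def transform(records):
    """Transform: clean and reformat, dropping rows with a missing amount."""
    cleaned = []
    for r in records:
        if not r.get("amount"):
            continue
        cleaned.append({
            "account": r["account"].strip().lower(),
            "amount": round(float(r["amount"]), 2),
        })
    return cleaned

def load(records, path):
    """Load: write the cleaned records to a warehouse staging file."""
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["account", "amount"])
        writer.writeheader()
        writer.writerows(records)

load(transform(extract("transactions.csv")), "warehouse_staging.csv")
```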

Flume

Flume is one of the technologies in the broader Apache Hadoop family (others include HBase, Hive, Oozie, Pig, and Whirr) and is used to populate Hadoop with data. It uses software agents spread across application servers, Web servers, mobile devices, and other systems to collect data and transfer it into the Hadoop system.

For example, a company could use Apache Flume agents running on its Web servers to collect Twitter posts for analysis.

Geo-spatial analysis

One trend driving big data is the growing volume of geospatial data that today's IT systems generate and collect. As the saying goes, a picture is worth a thousand words, so it is no wonder that the growing body of maps, charts, photographs, and other location-based content is a major driver of today's big data explosion.

Geospatial analysis is a special form of data visualization (see "Visualization" below) that overlays data on geographical maps to help users understand the results of big data analysis more clearly.

Hadoop

Hadoop is an open source platform for developing distributed, data-intensive applications. It is controlled by the Apache Software Foundation.

Hadoop was created by Doug Cutting, then a developer at Yahoo, who built it on the MapReduce concept from Google Labs and named it after his son's toy elephant.

In addition, HBase is a non-relational database developed as part of the Hadoop project. The Hadoop Distributed File System (HDFS) is a key component of Hadoop, and Hive is a Hadoop-based data warehouse system.

In-memory databases

When a computer processes a transaction or runs a query, it typically fetches data from a disk drive. For IT systems handling big data, that process can be too slow.

In-memory database systems store frequently used data in the computer's main memory, which dramatically reduces processing time. In-memory database products include SAP HANA and the Oracle TimesTen In-Memory Database.
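
Python's built-in sqlite3 module can hold an entire database in main memory, which makes for a quick, if simplistic, illustration of the idea:

```python
import sqlite3

# ":memory:" keeps the whole database in RAM; nothing ever touches disk.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE metrics (name TEXT, value REAL)")
conn.executemany(
    "INSERT INTO metrics VALUES (?, ?)",
    [("latency_ms", 4.2), ("throughput", 1250.0)],
)

for name, value in conn.execute("SELECT name, value FROM metrics"):
    print(name, value)
```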

Java

Java is a programming language developed by Sun Microsystems (now part of Oracle) and released in 1995. Hadoop and many other big data technologies were built with Java, and it remains a dominant development technology in the big data field.

Kafka

Kafka is a high-throughput distributed messaging system originally developed at LinkedIn to manage the activity streams of the service's website (data about site usage) and its operational data processing pipeline (data about the performance of server components).

Kafka is very effective at handling large volumes of streaming data, and streaming data is a key problem in many big data computing environments. Storm, developed at Twitter, is another popular stream processing technology.

The Apache Software Foundation has taken Kafka on as an open source project, so don't assume it is immature software.
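
A minimal producer/consumer sketch using the third-party kafka-python client; the broker address and topic name are assumptions:

```python
from kafka import KafkaConsumer, KafkaProducer

# Producer: publish site-activity events to a topic.
producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("page-views", b'{"user": "alice", "page": "/home"}')
producer.flush()

# Consumer: read events back from the same topic.
consumer = KafkaConsumer(
    "page-views",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
)
for message in consumer:
    print(message.value)
    break  # stop after the first event in this sketch
```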

Latency

Latency is the delay in moving data from one point to another, or the lag before one system (an application, for example) responds to another.

While latency is not a new term, you hear it more often now as data volumes grow and IT systems struggle to keep pace. Simply put, "low latency" is good and "high latency" is bad.
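
Measuring latency is straightforward with the standard library; a sketch that times a stand-in request function:

```python
import time

def handle_request():
    """Stand-in for a real operation whose latency we want to measure."""
    time.sleep(0.05)

start = time.perf_counter()
handle_request()
latency_ms = (time.perf_counter() - start) * 1000
print(f"latency: {latency_ms:.1f} ms")
```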

Map/Reduce

The map/reduce approach involves splitting a complex problem into smaller pieces, distributing those pieces across many computers, and finally reassembling the partial results into a single answer.

Google's search system uses the map/reduce concept, and the company's framework carries the brand name MapReduce.

A white paper published by Google in 2004 described its use of map/reduce. Doug Cutting, the father of Hadoop, fully recognized its potential and developed the first version of Hadoop, which also borrowed the map/reduce concept.
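
A single-machine word-count sketch of the idea in plain Python (a real MapReduce framework distributes these phases across a cluster):

```python
from collections import defaultdict

documents = ["big data is big", "data beats opinions"]

# Map phase: each document is turned into (word, 1) pairs independently,
# so this step can run on many machines in parallel.
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle phase: group the pairs by key.
groups = defaultdict(list)
for word, count in mapped:
    groups[word].append(count)

# Reduce phase: combine each group into a single result.
counts = {word: sum(vals) for word, vals in groups.items()}
print(counts)  # {'big': 2, 'data': 2, 'is': 1, 'beats': 1, 'opinions': 1}
```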

NoSQL Database

Most mainstream databases, such as Oracle Database and Microsoft SQL Server, are based on a relational architecture and use Structured Query Language (SQL) for development and data management.

But a new generation of database systems called "NoSQL" (which some now read as "not only SQL") is based on architectures that proponents believe are better suited to handling big data.

Some NoSQL databases are designed for scalability and flexibility, while others are particularly effective at handling documents and other unstructured data. Typical NoSQL databases include Hadoop/HBase, Cassandra, MongoDB, and CouchDB, and prominent vendors such as Oracle have launched NoSQL products of their own.
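
A document-store sketch using the third-party pymongo client; the server address and the database and collection names are assumptions:

```python
from pymongo import MongoClient

client = MongoClient("localhost", 27017)
posts = client.demo_db.posts  # database and collection are created lazily

# Documents are schemaless: each record can carry different fields.
posts.insert_one({"author": "alice", "text": "hello", "tags": ["intro"]})
posts.insert_one({"author": "bob", "text": "big data!", "likes": 3})

for doc in posts.find({"author": "alice"}):
    print(doc)
```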

Oozie

Apache Oozie is an open source workflow engine that helps manage the processing of Hadoop jobs. With Oozie, a series of jobs can be defined in multiple languages, such as Pig and MapReduce, and then linked to one another. For example, once a job that collects data from an operational application has finished, a programmer can kick off a data analysis query against the results.

Pig

Pig is another Apache Software Foundation project, a platform for analyzing very large data sets. At its core, Pig is a programming language for developing parallel queries that run on Hadoop.

Quantitative data analysis

Quantitative data analysis refers to the use of sophisticated mathematical or statistical models to explain financial and business behavior, or even to predict future behavior.

Quantitative data analysis has become more complicated as the amount of data collected has surged. But more data also promises more analytical opportunity: companies that know how to exploit huge volumes of data can gain better visibility, deeper insight into the business, and a clearer view of market trends.
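
A toy quantitative sketch with NumPy, fitting a linear trend to invented monthly sales figures and extrapolating one month ahead:

```python
import numpy as np

months = np.arange(6)                                    # months 0..5
sales = np.array([10.2, 11.1, 12.3, 12.9, 14.0, 15.1])  # invented figures

# Fit a degree-1 polynomial (a straight line) by least squares.
slope, intercept = np.polyfit(months, sales, 1)
forecast = slope * 6 + intercept                         # extrapolate to month 6
print(f"trend: {slope:.2f}/month, month-6 forecast: {forecast:.1f}")
```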

One problem is that people with these analytical skills are in seriously short supply. McKinsey, the well-known consultancy, says the United States needs 1.5 million more analysts and managers with big data analysis skills.

Relational database

Relational database management systems (RDBMSes) are the most widely used databases today and include IBM's DB2, Microsoft SQL Server, and Oracle Database. Most enterprise transaction systems run on an RDBMS, from banking application systems and retail point-of-sale systems to inventory management software.

But some argue that relational databases may not be able to keep up with today's explosive growth in data volume and variety. RDBMSes, for example, were originally designed to handle alphanumeric data and are not as effective with unstructured data.

Sharding

As a database grows ever larger, it becomes increasingly difficult to handle. Sharding is a database partitioning technique that divides a database into smaller, more manageable pieces. Specifically, the database is partitioned horizontally so that different rows in the database tables are managed separately.

This approach lets the fragments of a huge database be spread across multiple servers, which improves the database's overall speed and performance.
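
A minimal sketch of hash-based sharding, one common way to decide which server stores which rows; the shard count and routing scheme here are illustrative:

```python
import hashlib

SHARDS = 4  # hypothetical number of database servers

def shard_for(key: str) -> int:
    """Route a row to a shard by hashing its key; the same key always
    lands on the same shard, so lookups know where to go."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % SHARDS

for user in ["alice", "bob", "carol", "dave"]:
    print(user, "-> shard", shard_for(user))
```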

Sqoop

Sqoop is an open source tool for moving data from non-Hadoop sources, such as relational databases, into the Hadoop environment.

Text analysis

One factor contributing to the big data problem is the ever-growing volume of text that has to be collected and analyzed: from social media sites such as Twitter and Facebook, from external news sources, and even from inside the company. Because text is unstructured (unlike the structured data typically stored in relational databases), mainstream business analytics tools are often overwhelmed by it.

Text analysis uses a range of methods (keyword search, statistical analysis, and linguistic analysis) to extract insight from text-based data.
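
A tiny text analysis sketch that counts keyword frequency across two invented posts; real tools layer tokenization, stemming, and statistical models on top of this basic idea:

```python
import re
from collections import Counter

posts = [
    "Loving the new phone, battery life is great",
    "New phone arrived today. Battery seems weak though",
]

STOPWORDS = {"the", "is", "a", "new", "though", "today", "seems"}

words = []
for post in posts:
    words += [w for w in re.findall(r"[a-z']+", post.lower()) if w not in STOPWORDS]

print(Counter(words).most_common(3))  # e.g. [('phone', 2), ('battery', 2), ...]
```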

Unstructured data

Not long ago, most data was structured: alphanumeric information, such as financial data from sales transactions, that was easy to store in relational databases and analyze with business intelligence tools.

But now a large share of the world's 2.7 zettabytes of stored data is unstructured: text-based documents, Twitter messages, photos posted on Flickr, videos posted on YouTube, and so on. (Interestingly, 35 hours of video content are uploaded to YouTube every minute.) Processing, storing, and analyzing all of this messy unstructured data is frequently a challenge for today's IT systems.

Visualization

As data volumes grow, it becomes ever harder to understand data by looking at static charts and graphs. That has led to a new generation of data visualization and analysis tools that present data in novel ways, helping people make sense of huge amounts of information.

These tools include color-coded heat maps, three-dimensional graphics, animated visualizations that show change over time, and geospatial renderings that overlay data on geographical maps. Today's advanced data visualization tools are also more interactive, allowing users, for example, to zoom in on a subset of the data for closer scrutiny.
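
A minimal heat-map sketch with Matplotlib, using invented page-view data:

```python
import numpy as np
import matplotlib.pyplot as plt

# Invented data: page views per hour of day (columns) and weekday (rows).
views = np.random.default_rng(0).integers(0, 500, size=(7, 24))

fig, ax = plt.subplots()
im = ax.imshow(views, cmap="hot", aspect="auto")  # color-coded heat map
ax.set_xlabel("hour of day")
ax.set_ylabel("day of week")
fig.colorbar(im, label="page views")
plt.show()
```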

Whirr

Apache Whirr is a set of Java class libraries for running big data cloud services. More precisely, it speeds up the process of spinning up Hadoop clusters on virtual infrastructure such as Amazon Elastic Compute Cloud (EC2) and Rackspace.

XML

Extensible Markup Language (XML) is used to transmit and store data (not to be confused with HTML, which is used to display data). With XML, programmers can create common data formats and share both the information and the format over the Internet.

Because XML documents can be very large and complex, they are often considered a source of big data challenges for IT departments.
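
Parsing XML is built into Python's standard library; a sketch over a made-up document:

```python
import xml.etree.ElementTree as ET

doc = """
<orders>
  <order id="1"><item>disk</item><qty>4</qty></order>
  <order id="2"><item>ram</item><qty>8</qty></order>
</orders>
"""

root = ET.fromstring(doc)
for order in root.findall("order"):
    print(order.get("id"), order.find("item").text, order.find("qty").text)
```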

Yottabyte

A yottabyte is a data storage metric equivalent to 1,000 zettabytes. According to IDC, the well-known research firm, the total amount of data stored globally this year is expected to reach 2.7 zettabytes, a 48% increase over 2011. So we still have a long way to go before we reach the yottabyte mark, but at the current growth rate of big data, that day may arrive faster than we think.

By the way, a zettabyte is 10^21 bytes of data; it is the equivalent of 1,000 exabytes (EB), 1 million petabytes (PB), or 1 billion terabytes (TB).
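
A quick sanity check of those unit conversions in Python:

```python
ZB = 10 ** 21  # bytes in a zettabyte
EB, PB, TB = 10 ** 18, 10 ** 15, 10 ** 12

assert ZB == 1_000 * EB == 1_000_000 * PB == 1_000_000_000 * TB
print(f"2.7 ZB is {2.7 * ZB / TB:,.0f} terabytes")  # 2,700,000,000
```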

ZooKeeper

ZooKeeper is a service created by the Apache Software Foundation to help Hadoop users manage and coordinate Hadoop nodes across distributed networks.

ZooKeeper is tightly integrated with HBase, the database associated with Hadoop. It is a centralized service for maintaining configuration information, naming, distributed synchronization, and other group services. IT managers use it to implement reliable messaging, synchronized process execution, and redundant services.
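
A small sketch using the third-party kazoo client; the server address and the configuration path are assumptions:

```python
from kazoo.client import KazooClient

zk = KazooClient(hosts="127.0.0.1:2181")
zk.start()

# Store a piece of shared configuration at a znode path.
zk.ensure_path("/app/config")
zk.set("/app/config", b"max_workers=8")

# Any node in the cluster can read the same value back.
data, stat = zk.get("/app/config")
print(data.decode(), "version:", stat.version)

zk.stop()
```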
