Big Data Buzzwords Report


Big data is arguably one of the hottest trends in the IT industry today, and it has spawned a new batch of technologies for handling it. Those new technologies have brought their own wave of buzzwords: acronyms, jargon, and product names. Even the phrase "big data" itself is confusing. Many people hear "big data" and assume it simply means "a lot of data", but big data is about more than just the amount of data.

Here are a few buzzwords we think you should be familiar with, sorted alphabetically.

ACID

ACID stands for atomicity, consistency, isolation, and durability, which is really a set of requirements or properties: if all four hold, the integrity of a database transaction is guaranteed during processing. Although ACID has been around for some time, the rapid growth of transaction data has drawn renewed attention to the need to satisfy ACID requirements when working with big data.
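
As an illustration, here is a minimal JDBC sketch of an atomic transfer between two accounts. The connection string, table, and column names are hypothetical; the point is the commit/rollback pattern that gives a transaction its ACID guarantees.

    import java.math.BigDecimal;
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.SQLException;

    public class TransferExample {
        public static void main(String[] args) throws SQLException {
            // Hypothetical connection string and schema.
            try (Connection conn = DriverManager.getConnection(
                    "jdbc:postgresql://localhost/bank", "user", "pass")) {
                conn.setAutoCommit(false); // group both updates into one transaction
                try (PreparedStatement debit = conn.prepareStatement(
                         "UPDATE accounts SET balance = balance - ? WHERE id = ?");
                     PreparedStatement credit = conn.prepareStatement(
                         "UPDATE accounts SET balance = balance + ? WHERE id = ?")) {
                    debit.setBigDecimal(1, new BigDecimal("100.00"));
                    debit.setInt(2, 1);
                    debit.executeUpdate();

                    credit.setBigDecimal(1, new BigDecimal("100.00"));
                    credit.setInt(2, 2);
                    credit.executeUpdate();

                    conn.commit();   // atomic: both updates become visible together
                } catch (SQLException e) {
                    conn.rollback(); // atomicity: on failure, neither update persists
                    throw e;
                }
            }
        }
    }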

The three Vs of big data

Today's IT systems produce data that is "big" in volume, velocity, and variety.

Volume: IDC estimates that the world's total volume of information will reach 2.7 zettabytes (that is, 2.7 billion terabytes) this year, doubling every two years.

Velocity: It is not just the volume of data that gives IT managers headaches, but the speed at which data streams in from financial systems, retail systems, websites, sensors, radio frequency identification (RFID) chips, and social networks such as Facebook and Twitter.

Variety: If you go back 5 or maybe 10 years, IT staff worked mainly with alphanumeric data that could easily be stored in the neat rows and columns of relational databases. That is no longer the case. Today, unstructured data such as tweets and Facebook posts, assorted documents, and web content is all part of the big data mix.

Columnar (or column-oriented) databases

Some next-generation databases, such as the open source Cassandra and HP's Vertica, are designed to store data by column rather than by row as traditional SQL databases do. This design provides faster disk access and better performance when working with big data. Columnar databases are especially popular for data-intensive business analytics applications.
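
Here is a minimal sketch in plain Java of why the column layout helps analytics: summing one field over a columnar array touches only that column, while the row layout drags every other field through memory as well. The data is made up.

    // Row-oriented: all fields of a record are stored together.
    record Trade(long id, double price, int quantity) {}

    public class ColumnarSketch {
        public static void main(String[] args) {
            Trade[] rows = {
                new Trade(1, 10.50, 100),
                new Trade(2, 11.00, 250),
                new Trade(3, 9.75, 80)
            };
            double rowSum = 0;
            for (Trade t : rows) rowSum += t.price(); // loads id and quantity too

            // Column-oriented: each field lives in its own contiguous array.
            double[] prices = {10.50, 11.00, 9.75};
            double colSum = 0;
            for (double p : prices) colSum += p;      // touches only the price column

            System.out.println(rowSum + " == " + colSum);
        }
    }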

Data Warehouse

The concept of data warehousing has been around for about 25 years. Specifically, it involves copying data from multiple operational IT systems into a secondary, offline database for use by business analysis applications.

But as data volumes grow dramatically, data warehouse systems are changing rapidly. They need to store more data, and more kinds of data, so data warehouse management has become a major challenge. Ten or twenty years ago, data might be copied into the data warehouse weekly or monthly; today, warehouses are refreshed much more frequently, sometimes even in real time.

ETL

Extract, transform, and load (ETL) software is used when moving data from one database, such as the database behind a bank's transaction processing application, to another, such as a data warehouse used for business analysis. As the data moves from one database to the other, it often needs to be reformatted and cleansed.

Due to the rapid growth of data and the accelerating speed at which it must be processed, the performance demands on ETL tools have risen sharply.
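
To make the three steps concrete, here is a toy ETL pass in Java over JDBC. Both connection strings, the table names, and the cleansing rules are assumptions for the sketch.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.SQLException;
    import java.sql.Statement;

    public class MiniEtl {
        public static void main(String[] args) throws SQLException {
            try (Connection src = DriverManager.getConnection(
                     "jdbc:postgresql://oltp-host/bank", "user", "pass");      // source
                 Connection dwh = DriverManager.getConnection(
                     "jdbc:postgresql://dwh-host/warehouse", "user", "pass")) { // target
                dwh.setAutoCommit(false);
                try (Statement extract = src.createStatement();
                     ResultSet rs = extract.executeQuery(
                         "SELECT customer_name, amount FROM transactions");     // Extract
                     PreparedStatement load = dwh.prepareStatement(
                         "INSERT INTO fact_transactions (customer_name, amount_cents) " +
                         "VALUES (?, ?)")) {
                    while (rs.next()) {
                        // Transform: trim stray whitespace, convert dollars to cents.
                        String name = rs.getString("customer_name").trim();
                        long cents = Math.round(rs.getDouble("amount") * 100);
                        load.setString(1, name);
                        load.setLong(2, cents);
                        load.addBatch();
                    }
                    load.executeBatch();                                        // Load
                    dwh.commit();
                }
            }
        }
    }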

Flume

Flume is one of the technologies in the Apache Hadoop family (others include HBase, Hive, Oozie, Pig, and Whirr) and is used to feed Hadoop with data. It uses software agents distributed across application servers, web servers, mobile devices, and other systems to collect data and deliver it to a Hadoop system.

For example, a company could use Apache Flume running on a web server to collect data from Twitter posts for analysis.

Geo-spatial analysis

One trend driving the growth of big data is the rising volume of geospatial data generated and collected by today's IT systems. As the saying goes, a picture is worth a thousand words, so it is no wonder that the growing number of maps, charts, photos, and other location-based content is a major driver of today's big data explosion.

Geo-spatial analysis is a special form of data visualization (see "Visualization" below) that overlays data on geographical maps to help users understand the results of big data analysis more clearly.

Hadoop

Hadoop is an open source platform for developing distributed, data-intensive applications. It is managed by the Apache Software Foundation.

Hadoop was created by Doug Cutting, then a developer at Yahoo, who built it on the MapReduce concept from Google's labs and named it after his son's toy elephant.

In addition, HBase is a non-relational database developed as part of the Hadoop project. The Hadoop Distributed File System (HDFS) is a key component of Hadoop, and Hive is a data warehouse system built on top of Hadoop.
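
As a small illustration of HDFS in code, here is a sketch that reads the first lines of a file through the Hadoop Java client. The file path is hypothetical, and the cluster settings are assumed to come from the usual configuration files.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.nio.charset.StandardCharsets;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsRead {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration(); // reads core-site.xml and friends
            try (FileSystem fs = FileSystem.get(conf);
                 BufferedReader reader = new BufferedReader(new InputStreamReader(
                         fs.open(new Path("/data/events/2012-07-01.log")), // hypothetical
                         StandardCharsets.UTF_8))) {
                reader.lines().limit(10).forEach(System.out::println); // peek at the data
            }
        }
    }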

In-memory databases

When a computer processes a transaction or executes a query, it typically fetches data from a disk drive. But when IT systems handle big data, that process can be too slow.

In-memory database systems store frequently used data in the computer's main memory, greatly reducing processing time. In-memory database products include SAP HANA and the Oracle TimesTen In-Memory Database.
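
The core idea can be sketched in plain Java: keep hot data in a main-memory map so reads never touch the disk. This is a toy illustration of the concept, not how HANA or TimesTen are actually implemented.

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    public class InMemoryStore {
        // Hot rows held entirely in RAM; lookups avoid disk I/O altogether.
        private final Map<Long, String> table = new ConcurrentHashMap<>();

        public void put(long key, String row) { table.put(key, row); }
        public String get(long key)           { return table.get(key); }

        public static void main(String[] args) {
            InMemoryStore store = new InMemoryStore();
            store.put(42L, "customer=Alice,balance=100.00"); // hypothetical record
            System.out.println(store.get(42L));              // memory-speed read
        }
    }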

Java

Java is a programming language released in 1995 by Sun Microsystems, now part of Oracle. Hadoop and many other big data technologies are written in Java, and it remains a dominant development language in the big data field.

Kafka

Kafka is a high-throughput distributed messaging system originally developed at LinkedIn to manage the service's activity streams (data about how the website is used) and its operational data pipeline (data about the performance of server components).

Kafka is very effective at handling large volumes of streaming data, and streaming data is a key challenge in many big data computing environments. Storm, developed by Twitter, is another popular stream processing technology.

The Apache Software Foundation has adopted Kafka as an open source project, so despite the namesake, don't assume the software is flawed.
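
For a taste of the API, here is a minimal Kafka producer in Java that publishes one activity event. The broker address and topic name are hypothetical, and the snippet reflects the current Java client rather than the original LinkedIn-era API.

    import java.util.Properties;

    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class ActivityProducer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092"); // hypothetical broker
            props.put("key.serializer",
                      "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer",
                      "org.apache.kafka.common.serialization.StringSerializer");

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                // One page-view event on a hypothetical "activity" topic.
                producer.send(new ProducerRecord<>("activity", "user-42", "page=/home"));
            } // close() flushes any buffered records
        }
    }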

Latency

Latency is the delay in moving data from one point to another, or the lag before a system, such as an application, responds to another system.

While latency is not a new term, you hear it more often now as data volumes grow and IT systems struggle to keep pace. Simply put, "low latency" is good and "high latency" is bad.
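
One simple way to see latency in code is to time a network round trip with System.nanoTime. The endpoint below is hypothetical.

    import java.net.InetSocketAddress;
    import java.net.Socket;

    public class LatencyProbe {
        public static void main(String[] args) throws Exception {
            long start = System.nanoTime();
            try (Socket socket = new Socket()) {
                // One TCP connect to a hypothetical service endpoint.
                socket.connect(new InetSocketAddress("example.com", 80), 5000);
            }
            long micros = (System.nanoTime() - start) / 1_000;
            System.out.println("Connection latency: " + micros + " µs");
        }
    }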

Map/Reduce

The map/reduce approach involves splitting a complex problem into smaller pieces, distributing those pieces across many computers, and then reassembling the partial results into a single answer.

Google's search system uses the map/reduce approach, and the company has trademarked the name MapReduce for its framework.

A white paper published by Google in 2004 described its use of map/reduce. Doug Cutting, the father of Hadoop, recognized its potential and developed the first version of Hadoop, which borrowed the map/reduce concept.
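
The canonical example is counting words: the map step emits (word, 1) for every word it sees, and the reduce step sums the counts per word. Here is a condensed word count against the Hadoop Java API, with input and output paths taken from the command line.

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {
        // Map: split each input line into words, emit (word, 1).
        public static class TokenMapper
                extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();
            @Override
            protected void map(LongWritable key, Text value, Context ctx)
                    throws IOException, InterruptedException {
                for (String token : value.toString().split("\\s+")) {
                    if (token.isEmpty()) continue;
                    word.set(token);
                    ctx.write(word, ONE);
                }
            }
        }

        // Reduce: sum the counts emitted for each word.
        public static class SumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) sum += v.get();
                ctx.write(key, new IntWritable(sum));
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenMapper.class);
            job.setReducerClass(SumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }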

NoSQL Database

Most mainstream databases, such as Oracle Database and Microsoft SQL Server, are based on a relational architecture and use Structured Query Language (SQL) for development and data management.

But a new generation of database systems known as "NoSQL" (some now read it as "not only SQL") is built on architectures that proponents argue are better suited to handling big data.

Some NoSQL databases are designed for scalability and flexibility, while others are particularly effective at handling documents and other unstructured data. Typical NoSQL databases include HBase (part of the Hadoop family), Cassandra, MongoDB, and CouchDB, and some prominent vendors such as Oracle have launched NoSQL products of their own.
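
A short sketch with the MongoDB Java driver shows the document model in action: records are schemaless documents rather than rows. The connection string, database, and field names are hypothetical.

    import org.bson.Document;

    import com.mongodb.client.MongoClient;
    import com.mongodb.client.MongoClients;
    import com.mongodb.client.MongoCollection;

    public class NoSqlExample {
        public static void main(String[] args) {
            try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
                MongoCollection<Document> posts =
                        client.getDatabase("social").getCollection("posts");

                // Schemaless insert: documents in one collection may differ in shape.
                posts.insertOne(new Document("user", "alice")
                        .append("text", "Trying out NoSQL")
                        .append("tags", java.util.List.of("bigdata", "nosql")));

                // Query by field; no table schema required.
                Document first = posts.find(new Document("user", "alice")).first();
                System.out.println(first.toJson());
            }
        }
    }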

Oozie

Apache Oozie is an open source workflow engine that helps manage processing jobs on Hadoop. With Oozie, a series of jobs can be defined in multiple languages, such as Pig and MapReduce, and then linked to one another. For example, once a job that collects data from an operational application has finished, a programmer can kick off a data analysis query.

Pig

Pig is another Apache Software Foundation project, a platform for analyzing very large data sets. At heart, Pig is a programming language that can be used to develop parallel queries that run on Hadoop.

Quantitative data analysis

Quantitative data analysis refers to the use of sophisticated mathematical or statistical models to explain financial and business behavior, or even to predict future behavior.

Quantitative data analysis has become more challenging as the amount of data collected has grown dramatically. But if companies can figure out how to exploit those massive data sets to gain better visibility into their business and insight into market trends, more data also means more opportunity.
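
As a toy example of the genre, here is an ordinary least squares fit of a straight line to monthly sales, used to project the next month. The figures are made up.

    public class TrendFit {
        public static void main(String[] args) {
            // Hypothetical monthly sales; x is the month index 0..5.
            double[] y = {120, 135, 129, 150, 161, 172};
            int n = y.length;

            // Ordinary least squares for y = a + b * x.
            double sumX = 0, sumY = 0, sumXY = 0, sumXX = 0;
            for (int x = 0; x < n; x++) {
                sumX += x;
                sumY += y[x];
                sumXY += x * y[x];
                sumXX += (double) x * x;
            }
            double b = (n * sumXY - sumX * sumY) / (n * sumXX - sumX * sumX);
            double a = (sumY - b * sumX) / n;

            System.out.printf("trend: sales ≈ %.1f + %.1f * month%n", a, b);
            System.out.printf("forecast for month %d: %.1f%n", n, a + b * n);
        }
    }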

One problem: people with these analytical skills are in short supply. McKinsey, the well-known consultancy, estimates that the United States needs 1.5 million analysts and managers with big data analysis skills.
