Humanity Help explores big data technology


In the first two installments, the editor introduced the definition of big data and how big data is applied in the Humanity Help backend. Today, without keeping everyone in suspense, the editor will serve up an exploration of big data platform technology.

Big data technology, in the editor's view, can be divided into two broad layers: big data platform technology and big data application technology. To use big data, you must first have computing capability. Big data platform technology covers the underlying technology needed for data collection, storage, flow, and processing, such as the Hadoop ecosystem and Alibaba Cloud's Shujia (DTplus) ecosystem.

Big data application technology refers to the techniques for processing data and translating it into commercial value, such as algorithms and the models, engines, interfaces, and products derived from them. These underlying processing platforms, including platform-layer tools as well as the algorithms running on them, can also be distilled into a big data ecosystem marketplace, avoiding duplicated research and development and greatly improving the efficiency of big data processing.

Big data first requires data, so the problems of data acquisition and storage must be solved. Data collection and storage technology has been evolving continuously alongside the explosion of data and the rapid growth of the big data business.

In the early days of big data, and in the early days of many enterprises, only relational databases were used to store core business data; even data warehouses were centralized OLAP relational databases. For example, many companies, including early Taobao, used Oracle as the data warehouse. Taobao eventually built what was then the largest Oracle RAC in Asia as its data warehouse, which at the time could handle data volumes below roughly 10 TB.

Once there is an independent data warehouse, ETL is involved: data extraction, data cleaning, data validation, data import, and even data masking for security. If the data source is just a business database, the ETL is not complicated. If there are multiple data sources, such as log data, app data, crawler data, purchased data, and consolidated data, the ETL becomes very complex, and the tasks of data cleaning and validation become critical.
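
To make the cleaning and validation step concrete, here is a minimal sketch in Java. The record fields, the 11-digit phone rule, and the masking format are hypothetical examples for illustration, not Humanity Help's actual schema or data standards:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Optional;

    // A minimal ETL cleaning/validation sketch. The UserRecord fields and
    // validation rules below are hypothetical examples, not a real schema.
    public class EtlCleanStep {

        record UserRecord(String userId, String phone, String city) {}

        // Validate one raw CSV line; return empty if it fails any check.
        static Optional<UserRecord> cleanAndValidate(String rawLine) {
            String[] f = rawLine.split(",", -1);
            if (f.length != 3) return Optional.empty();            // structural check
            String userId = f[0].trim();
            String phone  = f[1].trim();
            String city   = f[2].trim();
            if (userId.isEmpty()) return Optional.empty();         // required field
            if (!phone.matches("\\d{11}")) return Optional.empty(); // format check
            // Desensitize: mask the middle of the phone number before loading.
            String masked = phone.substring(0, 3) + "****" + phone.substring(7);
            return Optional.of(new UserRecord(userId, masked, city));
        }

        public static void main(String[] args) {
            List<String> raw = List.of(
                "u1001,13812345678,Hangzhou",
                "u1002,not-a-phone,Beijing",   // rejected: bad phone format
                ",13900000000,Shanghai");      // rejected: missing userId
            List<UserRecord> clean = new ArrayList<>();
            for (String line : raw) cleanAndValidate(line).ifPresent(clean::add);
            clean.forEach(System.out::println); // would be loaded into the warehouse
        }
    }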

At this point the ETL must be implemented against data standards. Without data standards, ETL may load inaccurate data into the warehouse, and bad big data leads the upper layers of data applications and data products to produce wrong results. Wrong big data conclusions are worse than having no big data at all. This is why data standards, and the cleaning and validation stages of ETL, are so important.

Finally, as data sources and data consumers multiply, the whole big data flow becomes a very complex network topology. Everyone is importing and cleaning data while everyone is also using data; but because nobody trusts the data that others have imported and cleaned, duplicated data keeps growing, data tasks keep multiplying, and the dependencies between tasks become ever more tangled. Solving such problems requires introducing data governance, that is, management of the big data itself: metadata standards, a public data service layer (a trusted data layer), disclosure of data usage information, and so on.

As data volumes kept growing, a centralized relational OLAP warehouse could no longer solve enterprises' problems. At that point, specialized MPP-based data warehouse software such as Greenplum appeared. Greenplum processes data with the MPP approach, so it can handle more data, faster, but it is still essentially database technology. Greenplum scales to roughly 100 machines and can handle PB-level data volumes. Greenplum is built on the popular PostgreSQL; almost all PostgreSQL client tools and PostgreSQL applications can run on the Greenplum platform, and the internet has a wealth of PostgreSQL resources for users to draw on.
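
Because Greenplum speaks the PostgreSQL wire protocol, the stock PostgreSQL JDBC driver is enough to query it. A minimal sketch; the host, credentials, and table here are placeholders:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    // Querying Greenplum with the standard PostgreSQL JDBC driver
    // (org.postgresql:postgresql). Host, credentials, and table are placeholders.
    public class GreenplumQuery {
        public static void main(String[] args) throws Exception {
            String url = "jdbc:postgresql://gp-master.example.com:5432/warehouse";
            try (Connection conn = DriverManager.getConnection(url, "etl_user", "secret");
                 Statement st = conn.createStatement();
                 ResultSet rs = st.executeQuery(
                     "SELECT city, count(*) FROM user_events GROUP BY city")) {
                while (rs.next()) {
                    System.out.println(rs.getString(1) + " -> " + rs.getLong(2));
                }
            }
        }
    }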

As data volumes grow further still, for example Alibaba needs to process more than 100 PB of data per day across more than 1 million big data tasks per day, the above solutions no longer suffice, and larger MapReduce-style distributed solutions appear: Hadoop, Spark, and Storm in the big data technology ecosystem. These are currently the three most important distributed computing systems. Hadoop is often used for offline, complex big data processing; Spark for offline, fast big data processing; and Storm for real-time, large-scale online data processing. On the Alibaba Cloud DTplus side introduced earlier, the counterparts include the big data computing service MaxCompute (formerly ODPS), the analytic database ADS (similar to Impala), and the Java-based Storm-like system JStorm (formerly Galaxy).

Let's look at the different solutions in the big data technology ecosystem and compare them with the Alibaba Cloud solutions; I'll introduce the DTplus products separately.

1. The big data ecosystem

Hadoop is a distributed system infrastructure developed under the Apache Foundation. The core of Hadoop's design is HDFS and MapReduce: HDFS provides storage for massive amounts of data, and MapReduce provides computation over it. As a base framework, Hadoop can also carry many other components. For example, Hive lets people who know SQL, but do not want to develop MapReduce programs in a programming language, do offline data processing and analysis. Another example is HBase, a distributed, column-oriented open-source database that runs on top of HDFS; HDFS lacks random read and write operations, and HBase exists precisely to fill that gap.
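
The canonical MapReduce example is word count. Here is a condensed sketch of the standard Hadoop 2.x job: the mapper emits a (word, 1) pair for every token, and the reducer sums the counts per word:

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    // Classic word count: mapper emits (word, 1), reducer sums per word.
    public class WordCount {
        public static class TokenizerMapper
                extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();
            @Override
            protected void map(LongWritable key, Text value, Context ctx)
                    throws IOException, InterruptedException {
                for (String token : value.toString().split("\\s+")) {
                    if (!token.isEmpty()) { word.set(token); ctx.write(word, ONE); }
                }
            }
        }

        public static class SumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) sum += v.get();
                ctx.write(key, new IntWritable(sum));
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setCombinerClass(SumReducer.class);
            job.setReducerClass(SumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

In Hive, this whole job collapses to a single SELECT word, COUNT(*) ... GROUP BY word query, which is exactly why SQL users prefer it.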

Spark is also an Apache Foundation open-source project, developed at the University of California, Berkeley, and is another important distributed computing system. The biggest difference between Spark and Hadoop is that Hadoop uses hard disks to store intermediate data while Spark uses memory, so Spark can run up to 100 times faster than Hadoop. Spark can run on YARN (Yet Another Resource Negotiator) inside a Hadoop cluster, but Spark is now building out its own ecosystem, hoping to cover the whole upstream and downstream with a single technology stack. For example, Spark's Shark was positioned against Hadoop's Hive, and Spark Streaming against Storm.
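
For comparison, the same word count in Spark's Java API (Spark 2.x) fits in a few lines; cache() is what pins the dataset in memory, which is where the speedup over disk-based MapReduce comes from:

    import java.util.Arrays;
    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import scala.Tuple2;

    // Word count on Spark's Java API. cache() keeps the RDD in memory, so
    // any further action over it skips re-reading from disk.
    public class SparkWordCount {
        public static void main(String[] args) {
            SparkConf conf = new SparkConf().setAppName("word-count");
            try (JavaSparkContext sc = new JavaSparkContext(conf)) {
                JavaRDD<String> lines = sc.textFile(args[0]).cache();
                JavaPairRDD<String, Integer> counts = lines
                    .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator())
                    .mapToPair(word -> new Tuple2<>(word, 1))
                    .reduceByKey(Integer::sum);
                counts.saveAsTextFile(args[1]);
            }
        }
    }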

Storm is a distributed computing system promoted by Twitter, developed by the BackType team, and was an Apache Foundation incubator project. It adds real-time processing on top of what Hadoop offers and can process large data streams as they arrive. Unlike Hadoop and Spark, Storm does not collect or store data; it accepts data directly over the network, processes it in real time, and returns results directly over the network. Storm excels at handling real-time streams. For example, website shopping click-stream logs arrive as a steady, ordered, unending flow; the data is fed in through message queues such as Kafka, and Storm starts processing on the other side, emitting results as it goes.
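
A minimal sketch of such a topology, assuming Storm 1.x with the storm-kafka-client spout: click events from a Kafka topic flow into a bolt that keeps per-page counts in memory. The broker address and topic name are placeholders:

    import java.util.HashMap;
    import java.util.Map;
    import org.apache.storm.Config;
    import org.apache.storm.LocalCluster;
    import org.apache.storm.kafka.spout.KafkaSpout;
    import org.apache.storm.kafka.spout.KafkaSpoutConfig;
    import org.apache.storm.topology.BasicOutputCollector;
    import org.apache.storm.topology.OutputFieldsDeclarer;
    import org.apache.storm.topology.TopologyBuilder;
    import org.apache.storm.topology.base.BaseBasicBolt;
    import org.apache.storm.tuple.Tuple;

    // Click-counting topology: a Kafka spout reads the "clicks" topic, a
    // bolt keeps per-page counts. Broker and topic names are placeholders.
    public class ClickCountTopology {

        public static class CountBolt extends BaseBasicBolt {
            private final Map<String, Long> counts = new HashMap<>();
            @Override
            public void execute(Tuple input, BasicOutputCollector collector) {
                String page = input.getStringByField("value"); // Kafka record value
                counts.merge(page, 1L, Long::sum);
                System.out.println(page + " -> " + counts.get(page));
            }
            @Override
            public void declareOutputFields(OutputFieldsDeclarer declarer) {
                // terminal bolt: emits nothing downstream
            }
        }

        public static void main(String[] args) throws Exception {
            KafkaSpoutConfig<String, String> spoutConf =
                KafkaSpoutConfig.builder("kafka.example.com:9092", "clicks").build();
            TopologyBuilder builder = new TopologyBuilder();
            builder.setSpout("clicks", new KafkaSpout<>(spoutConf), 1);
            builder.setBolt("count", new CountBolt(), 2).shuffleGrouping("clicks");
            LocalCluster cluster = new LocalCluster();
            cluster.submitTopology("click-count", new Config(), builder.createTopology());
            Thread.sleep(60_000); // let the local topology run for a minute
            cluster.shutdown();
        }
    }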

The components above are general-purpose frameworks for large-scale distributed computing, usually described as compute engines.

Beyond the compute engine, building data-processing applications also requires platform tools: a development IDE, a job scheduling system, data synchronization tools, BI modules, data governance, monitoring and alerting, and so on. Together with the compute engine, they form the big data base platform.

On this platform, we can process and apply data, and develop data application products.

It is like a restaurant: to make Chinese, Western, Japanese, or Spanish cuisine, it must have ingredients (data) and different kitchenware (the big data compute engines), combined with different seasonings (processing tools), to produce different kinds of dishes. To host a large number of guests, it needs a bigger kitchen, stronger equipment, and more cooks (distribution). And whether the dishes taste good depends on the chef's skill (big data processing and application capability).

2. Alibaba's big data systems

Let's take a look at Alibaba's three compute engines.

Alibaba Cloud first used the Hadoop solution and successfully scaled a single Hadoop cluster to 5,000 machines. Starting in 2010, Alibaba Cloud began independently developing a Hadoop-like distributed computing platform, MaxCompute (formerly ODPS, https://www.aliyun.com/product/odps). A single cluster now exceeds 10,000 machines, multiple clusters can compute jointly, and 100 PB of data, roughly the equivalent of 100 million HD movies, can be processed within 6 hours.
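
MaxCompute is normally driven through SQL. Here is a rough sketch of submitting a query with the Java SDK (odps-sdk-core); the access keys, endpoint, project, and table are placeholders, and the exact SDK surface may vary by version:

    import com.aliyun.odps.Instance;
    import com.aliyun.odps.Odps;
    import com.aliyun.odps.account.Account;
    import com.aliyun.odps.account.AliyunAccount;
    import com.aliyun.odps.data.Record;
    import com.aliyun.odps.task.SQLTask;

    // Submitting a SQL job to MaxCompute through the Java SDK (odps-sdk-core).
    // Access keys, endpoint, project, and table are placeholders.
    public class MaxComputeQuery {
        public static void main(String[] args) throws Exception {
            Account account = new AliyunAccount("<accessId>", "<accessKey>");
            Odps odps = new Odps(account);
            odps.setEndpoint("http://service.odps.aliyun.com/api");
            odps.setDefaultProject("my_project");

            Instance instance = SQLTask.run(odps,
                "SELECT city, COUNT(*) FROM user_events GROUP BY city;");
            instance.waitForSuccess();               // block until the job finishes
            for (Record r : SQLTask.getResult(instance)) {
                System.out.println(r.get(0) + " -> " + r.get(1));
            }
        }
    }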

The Analytic Database Service ADS (AnalyticDB) is an RT-OLAP (realtime OLAP) system. For its storage model it uses a free and flexible relational model, so you can compute and analyze freely with SQL, without modeling in advance. Using distributed computing technology, ADS can even surpass the processing performance of MOLAP systems on hundreds of terabytes of data or more, genuinely achieving millisecond-level computation over tens of billions of rows. ADS combines search and database technology in a highly pre-partitioned MPP architecture: the up-front cost is relatively high, but queries are very fast and concurrency is high. The comparable product Impala uses Dremel-style data structures in a lightly pre-partitioned MPP architecture: the up-front cost is lower, but concurrency and response speed are correspondingly slower.
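
ADS exposes a MySQL-compatible protocol, so an application can issue those millisecond-level aggregations through an ordinary MySQL JDBC connection. A sketch with a placeholder endpoint, credentials, and table:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    // Ad-hoc aggregation against ADS (AnalyticDB) over its MySQL-compatible
    // protocol, using the standard MySQL JDBC driver. Endpoint, credentials,
    // and table are placeholders.
    public class AdsQuery {
        public static void main(String[] args) throws Exception {
            String url = "jdbc:mysql://ads-endpoint.example.com:3306/analytics_db";
            try (Connection conn = DriverManager.getConnection(url, "user", "password");
                 Statement st = conn.createStatement();
                 ResultSet rs = st.executeQuery(
                     "SELECT page, COUNT(*) AS pv FROM click_log " +
                     "WHERE dt = '2017-09-29' GROUP BY page ORDER BY pv DESC LIMIT 10")) {
                while (rs.next()) {
                    System.out.println(rs.getString("page") + " -> " + rs.getLong("pv"));
                }
            }
        }
    }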

The stream computing product JStorm (formerly Galaxy) can analyze large-scale, continuously changing data in motion, in real time. It is Alibaba's open-source distributed real-time stream computing framework, rewritten in Java on the basis of Storm; the comparable products are Storm and Spark Streaming. Recently Alibaba Cloud has begun testing StreamSQL, which performs real-time stream computation through SQL and lowers the barrier to using stream computing.

Having said all this, perhaps everyone is getting bored. What does it have to do with our Humanity Help? It is all specialized terminology, and to people who don't know big data it reads like arcane scripture.

In fact, going forward, Humanity Help will take Alibaba's approach to applying big data in business as its reference standard.

Data storage is only one part. On September 29, 2017, Humanity Help's registered users reached the 100,000 level. The journey from 100,000 to 1 million users will be very short, estimated at around half a year. Ensuring the stability of the backend data has become a problem Humanity Help must face, and homework it must do. Everything from user-published data to user behavior data to log data is a valuable asset for us; an enterprise that does not use big data analysis is, as the saying goes, "sitting on a mountain of gold while gnawing on plain steamed buns." So which data engines to use is the topic our technical department cares about most.

A startup team is small and short on money, yet may face an explosion of users at any time, so the initial architecture design is very important. Humanity Help's app architecture is built on Alibaba Cloud, growing from a single cloud server at the beginning to nearly 10 servers today. First, the system is designed as a cluster with no single point of failure, and it supports both vertical and horizontal scaling. At the same time, the system is divided into modules, and data storage is persistent.


Through a load-balanced solution, …



