Alibaba Cloud ODPS: Vision, Technology, and Challenges


In January 2014, Aliyun opened its ODPS service to public beta. In April 2014, all contestants in the Alibaba big data contest will deploy and test their algorithms on the ODPS platform. In the same month, ODPS will also move more advanced features into public beta.

InfoQ China recently interviewed Xu Changliang, technical leader of the ODPS platform, about topics such as the vision, technical implementation, and implementation difficulties of ODPS.

InfoQ: Let's talk about the current state of ODPS. What can this product do?

Xu Changliang: ODPS formally started in 2011. Its full name is Open Data Processing Service; in simple terms, it is a data processing service. It is positioned on top of the Apsara (Feitian) distributed platform, providing data warehousing, data mining, and other data applications.

When we first tried to offer ODPS in 2011, there were some small pilot projects, but we later found that conditions were not yet mature: neither the outside world's understanding of the cloud nor our own internal expectations for ODPS's future were very clear. Through 2013 it developed at a relatively slow pace. Around June and July of last year things changed, because Apsara reached its 5K milestone and concerns about its technical capacity diminished considerably. Apsara is a distributed operating system: it provides the most basic capabilities of storage, CPU scheduling, memory usage, and networking. It packages and integrates the most basic resources, making a cluster equivalent to a single computer, and we develop applications on top of it. ODPS is equivalent to a distributed data warehouse, letting users do basic ETL processing, SQL queries, and data import and export, along with some MATLAB-like statistical functions.

In addition to these basic functions, we also provide a complete set of data mining algorithms, Xlib (see http://102.alibaba.com/competition/addDiscovery/faq.htm), that lets users build models and do advanced data analysis. We may also provide programming frameworks so that users can write their own data processing programs: beyond stand-alone Python and Java, we provide a MapReduce programming framework, and a Graph programming framework for graph problems (also known as BSP, similar to Google's Pregel model). We will gradually add more: whatever involves data processing tools and programming frameworks, we will find ways to add, so that developers and users can operate on data in every way.
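The Graph framework mentioned here follows the BSP model. As a hedged illustration (plain Python, not the ODPS SDK), a Pregel-style computation can be sketched as a loop of supersteps with messages exchanged between them; here the example problem is single-source shortest paths. The function name and message-passing details are assumptions for illustration only.

```python
# Minimal sketch of a Pregel-style BSP computation (hypothetical, not the ODPS API):
# each vertex runs per superstep on the messages it received, and all messages
# emitted in one superstep are delivered at the start of the next.

INF = float("inf")

def bsp_shortest_paths(edges, source):
    """edges: {vertex: [(neighbor, weight), ...]}; returns {vertex: distance}."""
    dist = {v: INF for v in edges}
    dist[source] = 0
    inbox = {source: [0]}          # messages delivered at superstep 0
    while inbox:                   # halt when no vertex receives messages
        outbox = {}
        for v, msgs in inbox.items():
            best = min(msgs)
            if best <= dist[v]:    # vertex is active only when it can improve
                dist[v] = best
                for u, w in edges.get(v, []):
                    outbox.setdefault(u, []).append(best + w)
        # barrier: keep only messages that would improve a vertex next round
        inbox = {v: m for v, m in outbox.items() if min(m) < dist[v]}
    return dist
```

Each iteration of the `while` loop is one superstep; the barrier between supersteps is where a real distributed system would shuffle messages across the network.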

All in all, ODPS is a set of services, built on a distributed system, offering a set of tools and frameworks for data processing.

InfoQ: So within AWS terms, it's the equivalent of RedShift plus EMR, right?

Xu Changliang: You could say so. Purely from a functional point of view, we provide features like EMR and RedShift. But we don't stop there: we also have libraries for modeling and machine learning in addition to MapReduce, plus a richer set of programming frameworks, including an iterative framework. For the time being we may do more, though of course AWS is also gradually providing more features.

Another big difference is that ODPS provides these services as an organic whole, with different functions serving different aspects of the same service, rather than being sold separately. For example, within one system, once an SQL query over my data warehouse is handled, I can easily correlate its result with a MapReduce job, because both the physically stored data and the metadata describing it live inside the same system. RedShift and EMR are different: when one finishes processing, it exports the data for another system to handle. Their metadata descriptions are not shared, so a third party has to do the mapping, deciding, say, how a given RedShift table structure should be laid out on the EMR side. ODPS wants to keep things together, so that the objects to be processed and their metadata sit in one ODPS system. On top of this, whether you are doing processing or management and maintenance, it is the same interface: the user does different kinds of processing in one system, and feels like they are inside one machine, just in different folders. On AWS, users perceive this as two computers.

Another difference is that ODPS wants to be a service: an open data processing service. We want users to pour their data in, as in the public cloud model, so that all the data sits in the same system. If in the future users want something to happen between their data sets, it can be done very easily, simply through some mutual authorization. AWS's RedShift and EMR, by contrast, are essentially private clouds: each handles its own things in its own space, and if they need to interact with the outside world they may have to resort to external means such as S3. Of course, that may simply have been the original design goal; it doesn't make one good and the other bad, the goals are just different.

Because users' data is all on one platform in our system, we could actually open an application market the way Apple does. A user could publish a data mining algorithm or a cleanup process as an application, and others who want to use it could buy it. Of course this brings a series of follow-on problems, such as how to split the revenue, but the platform is there; if the commercial product side is willing to think it through, this issue solves itself naturally. This all serves Alibaba's vision: Alibaba wants to be the number one platform for data sharing. That requires a place with this much storage, this much computing power, the ability for users to handle big data, and security. Security is in fact what makes us most nervous in the contest: we have to allow users to run code, while also protecting users' data. That is a very big challenge.

InfoQ: How is your team organized?

Xu Changliang: There are basically three directions: data warehouse scenarios, data mining scenarios, and programming framework scenarios. The programming framework direction is not only an SDK; there will be some redefinition, and some new frameworks will be introduced. For example, Hadoop has Hive for SQL, and Yahoo's Pig, which is an entirely different language; now Spark is popular, and although it is based on Scala, its data processing layer is abstracted out, providing operators such as groupby and filter. We will provide something similar, or allow users to define their own frameworks based on the underlying programming capabilities we expose. You could even say that in the future we may create a new distributed data processing language, or allow users to create one.

SQL covers the data warehousing scenario; for data mining there is Xlab, which lets users model as in R, MADlib, or MATLAB. These are packages of basic algorithms that are all user-visible. There are also many modules users do not necessarily see, such as how the SQL execution engine works or how data storage is done. Last year we did one fairly big thing, which I think is comparable to Apsara's 5K milestone: Apsara 5K means a single cluster of 5,000 machines, but today 5,000 is not enough; of course you need more than that. ODPS has a system that can manage multiple clusters while letting users feel they face only one. A number of strategies are involved: which cluster your computation runs on, which cluster the data lives on, whether it is stored on more than one cluster, how to balance replication across clusters, and so on. There is a lot managed in there, and externally we hope to keep it transparent.

InfoQ: What is your personal main direction?

Xu Changliang: I focus on computation. For the SQL data warehouse, I look at everything from the execution plan to the execution engine and storage; this piece is my direct responsibility. I also keep an eye on the other programming frameworks; the two leads there report directly to me. In addition, I participate in the overall ODPS architecture and in how the multi-cluster control layer is done.

InfoQ: Of all the things you've done, what do you think is most worth sharing with us?

Xu Changliang: The more difficult part is probably how to be open. Look at the Hadoop community today: because it is open source and everyone understands everything about it, people can create many new things on top of its architecture. These new things are all traceable; they don't suddenly pop up. Many new things on Hadoop are based on theory from the stand-alone era, such as database theory, which people have applied to the distributed environment to produce something new.

ODPS is an open data processing service, but open does not necessarily mean open source. Currently there are no plans to open the source code of Apsara or ODPS. Even if we opened it up now, you could not use it, because it depends on a lot of supporting infrastructure. Therefore, being open without open-sourcing requires striking a good balance.

Open means giving users the freedom and convenience to use our computing power and fully tap the value of their data. For ODPS, fully stimulating users' imagination depends on how well we can design the programming framework, so the programming framework is important. SQL and the algorithm libraries are mostly for BI-oriented people, who can get things done with relatively ready-made tools; an open data processing service leans more toward developers. We expose our engine through interfaces so that developers can build on it, without exposing the underlying modes of operation. The goal is to give users a high degree of customization without violating our security principles, and to make a careful trade-off between what stays distributed and what stays stand-alone.

MapReduce is a great approach. Someone gave us this model, and we think it is much easier than reasoning about multi-threading and locks: it feels like writing a stand-alone program, just in different steps. So we provide MapReduce, replicating what already exists. But in the future many things may not be expressible in two steps, and we want to use a DAG, a directed acyclic graph, the kind of concept that YARN (MapReduce 2.0) supports through Hortonworks' Tez framework. As long as there are no loops, you can describe the dependencies as a directed acyclic graph, and we can factor the stages out so the user can customize each one. We will offer this to users. Of course, for developers a DAG is a bit more complicated than MapReduce, but it gives more processing power and freedom.
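As a hedged sketch of the DAG idea (illustrative Python, not the ODPS or Tez API): once stage dependencies are declared and acyclic, a topological order gives an execution schedule in which each stage consumes the outputs of its upstream stages. The `run_dag` helper and the stage names in the usage example are assumptions for illustration only.

```python
# Sketch: executing user-defined stages as a directed acyclic graph,
# generalizing MapReduce's fixed two-step pipeline.
from graphlib import TopologicalSorter  # Python 3.9+

def run_dag(stages, deps):
    """stages: {name: fn(inputs_dict) -> output}; deps: {name: [upstream names]}."""
    order = TopologicalSorter(deps).static_order()  # raises CycleError on a loop
    results = {}
    for name in order:
        inputs = {d: results[d] for d in deps.get(name, [])}
        results[name] = stages[name](inputs)
    return results

# Usage: a three-stage pipeline, read -> split -> count.
stages = {
    "read": lambda _: "a b a",
    "split": lambda i: i["read"].split(),
    "count": lambda i: {w: i["split"].count(w) for w in set(i["split"])},
}
deps = {"split": ["read"], "count": ["split"]}
```

A cycle in `deps` is rejected up front, which is exactly the "as long as you do not loop" constraint described above.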

We are still thinking about things we can package up to help users. Take writing a wordcount: writing it as MapReduce may be simple enough, but in SQL a single select with a groupby finishes the job; one statement covers the whole wordcount. So can we re-package some semantics for users, providing a groupby operator that users can call directly? Although SQL is also called a programming language, it is unlike general-purpose languages where you can write for loops, if statements, and so on; with SQL you can only express what you want done, and the details behind it are hard to control, so developers feel limited. Providing SQL-like basic operators, the idea of groupby and filter, has a similar implementation in Spark, and we might do something similar. We will consider whether some things can sink deeper and others can be lifted higher, and make our design trade-offs accordingly.
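To make the comparison concrete, here is the same wordcount written twice in plain Python (illustrative only, not the ODPS SDK): once as explicit map/shuffle/reduce phases, and once with the kind of packaged groupby-style semantics described above, mirroring the SQL `SELECT word, COUNT(*) ... GROUP BY word` formulation.

```python
# The same wordcount, expressed at two levels of abstraction.
from itertools import groupby

def wordcount_mapreduce(lines):
    # map: emit (word, 1) pairs
    pairs = [(w, 1) for line in lines for w in line.split()]
    # shuffle: bring identical keys together
    pairs.sort(key=lambda kv: kv[0])
    grouped = groupby(pairs, key=lambda kv: kv[0])
    # reduce: sum the counts per key
    return {word: sum(c for _, c in group) for word, group in grouped}

def wordcount_groupby(lines):
    # the "re-packaged semantics": one high-level operation instead of two phases
    words = [w for line in lines for w in line.split()]
    return {w: words.count(w) for w in set(words)}
```

Both produce identical results; the first exposes every phase to the developer, the second hides the shuffle the way a groupby operator would.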

Of course there may be many ideas like this; I have only made a few points: MapReduce, DAG, and high-level features built from SQL-style operators, so that users can write programs in the same way. Writing a SQL statement is not quite writing a program; a program has variable assignment, relations, and more. I don't know what else will come in the future, but in these few areas we are going to invest a great deal of energy. We hope to keep the whole system secure while providing key features such as internal iteration and broadcast, which MapReduce does not provide, and to route all of it through the programming framework, so outsiders get better control over the capabilities of the distributed system. If we can do that, I think we can do a good job of being open.

InfoQ: So in a way, Hadoop has spawned so many subprojects partly because of the limitations of MapReduce?

Xu Changliang: In some ways, yes. Look at Hadoop 2.0, or the YARN scheduler: a big reason for it is that the Hadoop 1.0 JobTracker only supports two simple scheduling models, MapReduce and map-only. On YARN you can do things like MPI or iteration, Spark can run on YARN, and everything becomes relatively easy. For ODPS, because it is based on Apsara and Apsara's scheduler, Fuxi, it has supported from day one the models YARN supports today. From this you can see that in Apsara's development, many ideas were better than Hadoop's from the beginning.

InfoQ: If you want to provide these rich patterns, can you copy existing off-the-shelf subprojects directly?

Xu Changliang: That is one practice. Take SQL: because there is a standard definition, we can easily replicate it. As long as you write standard SQL, my parser will resolve it the way you intend; there are no other tricks to invent. The existing theories and systems in this area are relatively mature. But something like Spark, although I think it is very good, is after all not as settled as database theory, nor fully self-contained; it has some shortcomings.

We take a pragmatic, borrow-what-works approach: where something is good, we take it. Spark is based on Scala, which is still unfamiliar to many of our colleagues; if we ported the ideas to Java or Python, those two languages have bigger communities and it might be easier. In fact, Spark can already run on today's ODPS, though when we run it the underlying implementation may be completely different. This is also a future direction of the open programming framework; in the future, bringing up something like Pig is also possible. Spark has a dozen or twenty operators, and we have almost caught up with them now.

Whether for Apsara or ODPS, doing independent R&D does not mean we work behind closed doors. We look at good things from outside, integrate them into our own scenarios, and add some micro-innovation of our own.

InfoQ: For ODPS, what are the current business requirements?

Xu Changliang: Some business needs are clear. For example, a business department doing data analysis may say: I want it faster, or I need to handle larger data; we used to support the TB level, and now it may be the PB level. Requirements like these are very explicit, and we find ways to solve them. In a distributed system, linear scaling with data volume is a problem that must be solved, or the distributed system is meaningless. On processing speed we are also doing our own exploration: for example, as I mentioned, we made iteration easy, some data never touches disk, and in real-time processing our internal SQL today runs very fast, faster than Hive. In the future, if there is interest, we may publish some benchmark data.

In other areas, such as programming interfaces, the users are developers, and their tastes differ. That is why we want to wrap the bottom layer and expose it, so that developers can customize for themselves; then everyone can be happy. Of course, today perhaps only some people will be happy; after all, put Java, C, PHP, and Python people together and opinions will certainly differ. We still hope to define the underlying operators and expose them, leaving the ability to customize for later. If we tried to meet everyone's needs one by one, I think it would be impossible.

InfoQ: What do you think is the most challenging thing to accomplish?

Xu Changliang: I think the academic, theoretical problems are actually solved, and we have seen others build good products, so there is no particular problem there. Along the way, I think most of the problems have been engineering problems. For example, in a distributed system, events that used to have a small probability become the norm, and because of constant interaction and amplification, these small-probability events become quite hard to solve; the problems are often beyond what you guarded against. Locating and solving them is very challenging.

The other side is the early stage: whether for Apsara or ODPS, staffing was tight and the pressure from workload and schedule was heavy, so there were some engineering and project management issues. Of course, that is not a technical challenge. The challenges are there, but they will certainly be solved.

The most common small-probability event is broken hardware. We have heard plenty about broken hard drives, and network cards break too. Although in theory the Pangu team handles broken hard disks, in the early days the scheduling and storage teams sat together, so we worked on it together; besides, we had real scenarios with large volumes of data, which expose many problems.

We once hit a network card problem: a machine has roughly a one-in-a-thousand chance of having a faulty card, and a faulty card is not completely dead; it has about a five-in-ten-thousand chance of a data error, where a bit flips, say a 1 into a 0. In short, there is a tiny, nearly one-in-a-hundred-million chance of an error. But because the volume of data being exchanged is so large, it gets hit.

How did we find this problem? I just mentioned some characteristics of ODPS, but there is an important one I left out: correctness. Our demand for correctness is very high, because our first official commercial customer was Alibaba's micro-loan service. Their business involves money; it directly determines whether they can lend money out, so bad debt would be on us. At that level our accuracy requirements are high, so before each release we do full-batch verification: we compare the data produced by the old and new versions to make sure they match. Because we had data to compare, we found the problem. A user might never find it: he might run for a while, find a figure he cannot explain, run it again and see it come out fine, and let the matter pass, because a one-in-a-hundred-million chance will almost certainly not land on the same person's head twice.

After finding the problem we talked to the Apsara team. Their first reaction was that our upper-layer logic might be wrong and causing the randomness, so we had to find ways to prove that our upper-layer logic was right. Later, we built end-to-end data verification with checksums. Before that, like HDFS, we checksummed stored data but not network transmission, because of the extra overhead; after this incident it had to be done. So we run a very rigorous regression on each of our releases and do not let anything through with errors. This is also one of our major features.
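The end-to-end verification idea can be sketched as follows (hypothetical Python helpers, not the Apsara/ODPS implementation): checksum the payload before it leaves the sender, carry the checksum with the data, and re-verify on receipt, so that a bit flipped by a faulty network card fails loudly instead of silently corrupting results.

```python
# Sketch of end-to-end checksumming across a network hop.
import zlib

def send(payload: bytes):
    # the data travels together with its checksum
    return payload, zlib.crc32(payload)

def receive(payload: bytes, checksum: int) -> bytes:
    # re-verify after transport; a single flipped bit changes the CRC
    if zlib.crc32(payload) != checksum:
        raise IOError("end-to-end checksum mismatch: data corrupted in transit")
    return payload

def flip_bit(payload: bytes, i: int) -> bytes:
    # simulate a faulty NIC flipping bit i
    b = bytearray(payload)
    b[i // 8] ^= 1 << (i % 8)
    return bytes(b)
```

CRC32 is chosen here only for brevity; the point is the placement of the check at both ends of the transfer, not the particular checksum algorithm.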

InfoQ: Finally, what are your expectations for the Tianchi algorithm contest?

Xu Changliang: ODPS started commercial operation on January 24 this year and began inviting some users in. We hope that by the start of the contest, that is, the end of April, the whole of ODPS will be officially commercial. At that point I think there will be a lot of feedback from outside users, and through the Tianchi Competition we can also look at our contestants' feedback on ODPS.

First of all, since we have been doing platform back-ends, the user experience may not be polished. We invest heavily in the platform, but for interactive use the API may not be well designed. If users give feedback in this area, it is a great help to us; after commercialization we have to invest more in everything facing the outside world. So I hope to use the competition to make the appropriate improvements.

In addition, we are giving users more this time. In January we had only SQL; in April we will open the Xlib machine learning algorithm platform to help users build models, which we think is very powerful. In last year's internal competition, similar to this one, the top prize winners basically all used this super weapon, which similar products today basically do not have. We hope this competition helps establish that reputation, and of course we will watch users' feedback, not only on raw power but also on making the whole modeling process smoother.

We will also expose some things users can program against. Of course, we don't want to open too much at once: for now, the MapReduce framework and UDFs combined with SQL, allowing users to customize some functions. We also want to see the user experience here. This part will not be commercial in April, but will be opened for testing, probably mainly to contest users.

Finally, we are also exploring what to do with this "data sharing platform." Today Tmall shares its data out for everyone to model on; if someone achieves a very good recommendation, Alibaba also benefits greatly. There are thousands of teams, all with different ideas and perhaps new things. In my view, the point of doing data sharing is to let everyone see the value of the data. That depends on everyone's imagination.

Guest introduction

Xu Changliang holds a double bachelor's degree from Peking University (majoring in chemistry; he entered the IT industry purely out of interest) and a PhD from Princeton University (computational chemistry). He served as search team leader in the New York Times digital department, developing and maintaining a self-built search engine, and was among the earliest users of Amazon EC2, S3, and Hadoop. He joined Aliyun in 2009 and was responsible for basic infrastructure maintenance of the Alibaba Cloud distributed platform, Apsara (Feitian). He is now mainly responsible for the architecture and development of the ODPS platform, whose products serve scenarios such as data warehousing, distributed programming frameworks, and data exchange.
