Python + Big Data Computing Platform: Building the PyODPS Architecture


Big data platforms are mostly built on the Hadoop ecosystem, which is essentially a Java environment. Many people prefer Python and R for data analysis, but these tools usually handle only small or local datasets. How can the two worlds be combined to get more value out of both? Hadoop already has an established ecosystem, and Python has its own, as shown in the figure.

MaxCompute

MaxCompute is a big data platform for offline computing. It provides TB/PB-scale data processing, multi-tenancy, out-of-the-box usability, and isolation mechanisms to ensure security. The main analysis tool on MaxCompute is SQL, which is simple, easy to learn, and declarative. Tunnel provides a data upload and download channel that does not need to be scheduled by the SQL engine.

Pandas

Pandas is a NumPy-based data analysis tool whose most important structure is the DataFrame. It provides a series of plotting APIs backed by Matplotlib, and it interacts easily with third-party Python libraries.

PyODPS Architecture

PyODPS enables big data analysis in Python, as shown in the architecture diagram. At the bottom is the low-level API, which can be used to manipulate tables, functions, and resources on MaxCompute. Above it sits the DataFrame framework, which consists of two parts. The front end defines a set of expressions for operations: the code a user writes is converted into an expression tree, just as in an ordinary programming language. Users can define custom functions, visualize results, and interact with third-party libraries. At the bottom of the backend is the optimizer, whose job is to optimize the expression tree. Both the ODPS SQL and pandas backends submit work to the engine through a compiler and an analyzer.

Background

Why build the DataFrame framework?

For any big data analysis tool, there are three dimensions to the problem: expressiveness (are the API, syntax, and programming language simple and intuitive?), data (can storage and metadata be compressed and used effectively?), and the engine (is its performance sufficient?). This leaves two choices: pandas or SQL.

As shown, pandas has excellent expressiveness, but its data must fit in memory and its engine runs on a single machine, limited by that machine's performance. SQL is more limited in expressiveness, but it can handle large amounts of data: with small data the engine has little advantage, while with large data it becomes very advantageous. The goal of PyODPS is to combine the strengths of both.

Pyodps DataFrame

PyODPS DataFrame is written in Python, so it can use Python variables, conditionals, and loops. It defines its own front end with pandas-like syntax, giving better expressiveness. The backend can choose the concrete execution engine based on the data source; it uses the visitor design pattern, which makes it extensible. Execution is entirely deferred: unless the user invokes a method that executes immediately, nothing is run directly.
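The combination of deferred execution and a visitor-pattern backend can be illustrated with a minimal stdlib-only sketch. All class and method names below are invented for illustration; they are not the actual PyODPS internals:

```python
# Minimal sketch of deferred expression trees with a visitor-pattern backend.
# Names are illustrative, not real PyODPS internals.

class Expr:
    """Base node: building expressions records operations, never runs them."""
    def filter(self, pred):
        return Filter(self, pred)

    def select(self, *cols):
        return Project(self, cols)

    def execute(self, backend):
        # Execution is deferred until the user explicitly asks for it.
        return backend.visit(self)

class Source(Expr):
    def __init__(self, rows):
        self.rows = rows

class Filter(Expr):
    def __init__(self, child, pred):
        self.child, self.pred = child, pred

class Project(Expr):
    def __init__(self, child, cols):
        self.child, self.cols = child, cols

class LocalBackend:
    """A local engine, chosen by data source; dispatches per node type."""
    def visit(self, node):
        if isinstance(node, Source):
            return list(node.rows)
        if isinstance(node, Filter):
            return [r for r in self.visit(node.child) if node.pred(r)]
        if isinstance(node, Project):
            return [{c: r[c] for c in node.cols} for r in self.visit(node.child)]
        raise TypeError(node)

rows = [{"name": "iris", "petal_length": 1.4}, {"name": "rose", "petal_length": 2.0}]
expr = Source(rows).filter(lambda r: r["petal_length"] > 1.5).select("name")
# Nothing has run yet; `expr` is just an expression tree.
result = expr.execute(LocalBackend())
```

A different backend (say, one that emits SQL) could visit the same tree without changing the front end, which is the point of the visitor design.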

As you can see, the syntax is very similar to pandas.

Expressions and Abstract Syntax Trees

As can be seen, the user performs a groupby operation on the original collection and then a column selection on top of it; at the bottom of the tree is the source collection. Two fields are selected: the species field is taken out directly, while the shortest field is an additional derived operation, produced by aggregating petal_length to take an aggregate value.
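The shape of such a tree can be sketched with a toy node class (labels here are illustrative placeholders for the real expression nodes):

```python
# Toy rendering of the expression tree described above: a column selection
# on top of a groupby on top of the source collection. Labels are illustrative.

class Node:
    def __init__(self, label, *children):
        self.label, self.children = label, children

    def render(self, depth=0):
        # Indented, top-down dump of the tree.
        lines = ["  " * depth + self.label]
        for c in self.children:
            lines.extend(c.render(depth + 1))
        return lines

source = Node("Collection(iris)")
grouped = Node("GroupBy(species)", source)
species = Node("Column(species)", grouped)          # taken out directly
shortest = Node("Min(petal_length) -> shortest", grouped)  # derived aggregation
tree = Node("Projection[species, shortest]", species, shortest)

print("\n".join(tree.render()))
```

Note that the grouped node is shared by both output columns; the dump prints it twice only because a tree rendering cannot show sharing.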

Optimizer (Operation merge)

The backend first uses the optimizer to optimize the expression tree. The groupby is performed first, with the column selection above it; through operation merging, the separate petal_length aggregation can be removed and folded into the groupby, eventually forming a single GroupBy collection.
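A minimal sketch of such a merge pass, on tuples standing in for real expression nodes (the node encoding is invented for illustration):

```python
# Sketch of the operation-merge pass: a projection holding an aggregation
# directly above a groupby is folded into a single combined node.

def merge(node):
    # node encoding (illustrative): ("project", outputs, ("groupby", keys, child))
    kind = node[0]
    if kind == "project" and node[2][0] == "groupby":
        _, outputs, (_, keys, child) = node
        # Fold the projected columns/aggregations into the groupby itself.
        return ("groupby_agg", keys, outputs, child)
    return node

tree = ("project", ["species", ("min", "petal_length", "shortest")],
        ("groupby", ["species"], ("source", "iris")))
optimized = merge(tree)
```

After the pass, the tree is one level shallower: the aggregation runs inside the groupby instead of as a separate operation above it.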

Optimizer (column pruning)

When a user joins two DataFrames and then takes two columns from the result, executing this directly in a big data environment is very inefficient, because not every column is actually used. So we need to prune columns below the join. For example, if only one field of DataFrame 1 is used, we only need to project that field to form a new collection; DataFrame 2 is handled similarly. Projecting both sides before the join in this way can greatly reduce the amount of data flowing through it.
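The core of column pruning is computing, for each side of the join, which columns are actually needed. A stdlib sketch (the function and its encoding are illustrative, not PyODPS internals):

```python
# Sketch of column pruning: before a join, keep only the columns each side
# contributes to the final output (plus its join key).

def prune(output_cols, join_key, left_cols, right_cols):
    """Return the pruned column lists to project below the join."""
    keep_left = [c for c in left_cols if c in output_cols or c == join_key]
    keep_right = [c for c in right_cols if c in output_cols or c == join_key]
    return keep_left, keep_right

left = ["id", "species", "petal_length", "sepal_width"]
right = ["id", "region", "price"]
# Final query only uses species and price; id is the join key.
keep_left, keep_right = prune({"species", "price"}, "id", left, right)
```

Here four left columns shrink to two and three right columns shrink to two before any rows are shuffled for the join.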

Optimizer (predicate push-down)

If two DataFrames are joined and then filtered, the filters should be pushed down below the join, reducing the amount of input the join has to process.
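A toy version of the push-down rewrite, again on tuples standing in for expression nodes (encoding invented for illustration):

```python
# Sketch of predicate push-down: a filter sitting above a join is moved
# below it onto the side whose columns the predicate touches.

def push_down(tree):
    # tree encoding (illustrative): ("filter", side, pred, ("join", left, right))
    if tree[0] == "filter" and tree[3][0] == "join":
        _, side, pred, (_, left, right) = tree
        if side == "left":
            return ("join", ("filter", side, pred, left), right)
        return ("join", left, ("filter", side, pred, right))
    return tree

tree = ("filter", "left", "petal_length > 1.5",
        ("join", ("source", "A"), ("source", "B")))
optimized = push_down(tree)
```

The join now sits at the root and receives pre-filtered input from side A, so fewer rows participate in the join itself.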

Visualization

A visualization API is provided to make it easy for users to plot results. As the example on the right shows, the ODPS SQL backend compiles the expression into SQL for execution.

Backend

As you can see, the compute backend is very flexible. Users can even join a pandas DataFrame with a MaxCompute table.

Analyzer

The role of the analyzer is to adapt certain operations to specific backends. For example:

Some operations, such as value_counts, are supported natively by pandas, so the pandas backend needs no processing; the ODPS SQL backend has no operation that executes it directly, so the analyzer rewrites it into a groupby + sort operation;
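The rewrite is easy to see in plain Python: value_counts is just a group-count followed by a descending sort, which is exactly the shape a SQL backend can execute. A stdlib sketch:

```python
# Sketch of the analyzer rewrite: value_counts expressed as a group-count
# followed by a descending sort, the way a SQL backend would run it.

def value_counts(values):
    # Step 1: groupby + count.
    counts = {}
    for v in values:
        counts[v] = counts.get(v, 0) + 1
    # Step 2: sort by count, descending.
    return sorted(counts.items(), key=lambda kv: kv[1], reverse=True)

result = value_counts(["setosa", "virginica", "setosa", "setosa", "virginica"])
```

In SQL terms this corresponds to `SELECT value, COUNT(*) ... GROUP BY value ORDER BY COUNT(*) DESC`.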

There are also operators that, when compiled to ODPS SQL, cannot be expressed with built-in functions; these are rewritten as custom functions.

ODPS SQL Backend

How does the ODPS SQL backend compile to SQL? The compiler traverses the expression tree from top to bottom to find joins or unions, and recursively compiles each sub-procedure. When the engine executes, it rewrites the expression tree using the analyzer, compiles the sub-processes top-down, assembles the SQL clauses bottom-up into the complete SQL statement, and finally submits the SQL and returns the task.
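The bottom-up assembly of clauses can be sketched with a toy compiler (the node encoding and the nested-subquery style are illustrative; the real backend produces much better SQL):

```python
# Sketch of bottom-up SQL compilation: each node contributes its clause,
# and the full statement is assembled from the leaves upward.

def compile_sql(node):
    kind = node[0]
    if kind == "source":
        return f"SELECT * FROM {node[1]}"
    if kind == "filter":
        inner = compile_sql(node[2])        # compile the child first (bottom-up)
        return f"SELECT * FROM ({inner}) t WHERE {node[1]}"
    if kind == "project":
        inner = compile_sql(node[2])
        cols = ", ".join(node[1])
        return f"SELECT {cols} FROM ({inner}) t"
    raise ValueError(kind)

tree = ("project", ["species"],
        ("filter", "petal_length > 1.5", ("source", "iris")))
sql = compile_sql(tree)
```

Each recursive call returns the SQL for a subtree, and the parent wraps it in its own clause, so the statement for the root is complete once the traversal finishes.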

Pandas back end

The expression tree is visited first, and each expression tree node is mapped to a pandas operation; after the entire tree has been traversed, a DAG is formed. The engine executes in DAG topological order, applying the pandas operations one after another to produce the result. In a big data environment, the pandas backend is used for local debugging; when the amount of data is very small, we can use pandas to run the computation.
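The execution model, operations as a DAG run in topological order, can be sketched with the standard library's `graphlib` (the operation names and data are invented for illustration):

```python
# Sketch of the pandas-backend execution model: expression nodes become a
# DAG of small operations, executed in topological order.
from graphlib import TopologicalSorter

data = [1.4, 2.0, 1.3, 4.7]

# Each op reads its dependencies' results from a shared dict.
ops = {
    "source": lambda deps: data,
    "filter": lambda deps: [x for x in deps["source"] if x > 1.5],
    "double": lambda deps: [x * 2 for x in deps["filter"]],
}
edges = {"filter": {"source"}, "double": {"filter"}}  # node -> dependencies

results = {}
for name in TopologicalSorter(edges).static_order():
    results[name] = ops[name](results)
final = results["double"]
```

Because dependencies always appear earlier in the topological order, each operation finds its inputs already computed, mirroring how the real backend applies one pandas operation per node.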

Difficulties and Pitfalls

Backend compilation errors are prone to losing context: after multiple optimize and analyze passes, it becomes difficult to trace which earlier visited node caused the problem. Solution: keep each module independent, with complete tests;

Bytecode compatibility issues: MaxCompute only supports executing custom functions written for Python 2.7;

The execution order of SQL statements.

ML machine learning

Machine learning simply takes a DataFrame as input and produces a DataFrame as output. For example, given an iris DataFrame, first use the name field as the label field, then call the split method to divide the data into 60% training data and 40% test data. Next, initialize a RandomForests model with one decision tree, call the train method to train on the training data, call the predict method to produce predictions, and call segments[0] to see the visualized result.
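The split step described above can be sketched with the standard library alone (this mimics only the 60/40 partition, not the actual PyODPS ML API, which requires a MaxCompute connection):

```python
# Stdlib sketch of a 60/40 train/test split with a seeded shuffle.
# Not the real PyODPS ML split; just the partitioning idea.
import random

def split(rows, frac=0.6, seed=42):
    rows = list(rows)
    random.Random(seed).shuffle(rows)   # seeded for reproducibility
    cut = int(len(rows) * frac)
    return rows[:cut], rows[cut:]

train, test = split(range(100))
```

Every row lands in exactly one partition, and the seed makes the split reproducible across runs.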

Future plans

Distributed NumPy, and a DataFrame backend based on distributed NumPy;

Memory computing to enhance the interactive experience;

TensorFlow.
