Presto: Distributed SQL query engine that can handle petabytes of data

Source: Internet
Author: User

In the fall of 2012, Facebook started the Presto project, which aims to perform near-real-time analysis on hundreds of petabytes of data. After ruling out several external projects, Facebook decided to develop its own distributed query engine. Presto's syntax is based on ANSI SQL. Most other distributed query engines require users to learn a new syntax; some of these are similar to SQL, but none is as familiar or as thoroughly documented as standard SQL. Facebook hopes this decision will make it easier and faster to train new users. Relying on ANSI SQL also allows Presto to take advantage of existing third-party tools.

Internally, Presto is based on pipelining. Once a query has been parsed and its tasks have been assigned to the appropriate nodes, the client pulls data from the output stage, which in turn pulls data from the stages below it. This execution model is fundamentally different from that of Hive/MapReduce. Hive translates a query into multiple stages of MapReduce tasks that execute one after another; each task reads its input from disk and writes its intermediate results back to disk. By contrast, Presto does not use MapReduce. It uses a custom query and execution engine whose operators are designed to support SQL semantics. In addition to improved scheduling, all processing happens in memory and is pipelined across the network between stages, which avoids unnecessary I/O and the latency it causes. The pipelined execution model runs multiple stages at the same time and streams data from one stage to the next as soon as it becomes available. For many types of queries, this significantly reduces end-to-end latency.
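The pull-based pipelining described above can be illustrated with a minimal Python sketch. It is not Presto code: the stage names (scan, filter_stage, output_stage), the row layout, and the run_pipeline helper are all hypothetical, and everything runs in one process rather than across a network. It only shows the idea that the client pulls from the output stage, each stage pulls from the one below it, and rows flow onward as soon as they are available.

```python
from typing import Callable, Dict, Iterable, Iterator, List, Tuple

Row = Dict[str, int]
Log = List[Tuple[str, int]]


def scan(rows: Iterable[Row], log: Log) -> Iterator[Row]:
    """Lowest stage (hypothetical): yields rows one at a time as they are read."""
    for row in rows:
        log.append(("scan", row["id"]))
        yield row


def filter_stage(upstream: Iterator[Row],
                 predicate: Callable[[Row], bool],
                 log: Log) -> Iterator[Row]:
    """Middle stage: pulls from the stage below and passes on matching rows."""
    for row in upstream:
        if predicate(row):
            log.append(("filter", row["id"]))
            yield row


def output_stage(upstream: Iterator[Row], log: Log) -> Iterator[Row]:
    """Top stage: the client pulls results from here."""
    for row in upstream:
        log.append(("output", row["id"]))
        yield row


def run_pipeline(rows: Iterable[Row]) -> Tuple[List[Row], Log]:
    """Wire the stages together and let the 'client' pull every result."""
    log: Log = []
    pipeline = output_stage(
        filter_stage(scan(rows, log), lambda r: r["value"] > 10, log),
        log,
    )
    # Nothing has run yet; pulling from the top stage drives all lower stages.
    results = list(pipeline)
    return results, log


if __name__ == "__main__":
    data = [{"id": 1, "value": 5}, {"id": 2, "value": 20}, {"id": 3, "value": 30}]
    results, log = run_pipeline(data)
    print(results)
    print(log)
```

In the recorded log, row 2 reaches the output stage before row 3 is even scanned. No stage waits for the previous one to finish and no intermediate result is written out in full, which is the contrast with the stage-by-stage MapReduce model.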

Presto is written in Java and has pluggable back ends. Each data source, such as Hive, HBase, or Scribe, requires a connector. The connector provides Presto with metadata, with information about which nodes hold the data, and with a way to stream the data.
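The three responsibilities of a connector listed above (metadata, data location, and streaming) might be sketched as an interface like the one below. This is an illustration in Python, not Presto's actual Java connector API: the names Connector, Split, get_columns, get_splits, read_split, InMemoryConnector, and scan_table are all assumptions made for this example.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Dict, Iterator, List, Tuple


@dataclass(frozen=True)
class Split:
    """A chunk of a table together with the node that holds it (hypothetical)."""
    table: str
    node: str


class Connector(ABC):
    """Hypothetical connector interface covering the three duties in the text."""

    @abstractmethod
    def get_columns(self, table: str) -> List[str]:
        """Metadata: the column names of a table."""

    @abstractmethod
    def get_splits(self, table: str) -> List[Split]:
        """Location: which nodes hold which parts of the table."""

    @abstractmethod
    def read_split(self, split: Split) -> Iterator[Tuple]:
        """Streaming: yield the rows of one split, one at a time."""


class InMemoryConnector(Connector):
    """Toy data source used only to exercise the interface."""

    def __init__(self,
                 columns: Dict[str, List[str]],
                 data: Dict[str, Dict[str, List[Tuple]]]) -> None:
        self._columns = columns  # table -> column names
        self._data = data        # table -> node -> rows stored on that node

    def get_columns(self, table: str) -> List[str]:
        return list(self._columns[table])

    def get_splits(self, table: str) -> List[Split]:
        return [Split(table, node) for node in sorted(self._data[table])]

    def read_split(self, split: Split) -> Iterator[Tuple]:
        yield from self._data[split.table][split.node]


def scan_table(connector: Connector, table: str) -> Iterator[Tuple[str, Tuple]]:
    """Stream every row of a table, tagged with the node it came from."""
    for split in connector.get_splits(table):
        for row in connector.read_split(split):
            yield split.node, row
```

Because the engine in this sketch talks only to the abstract Connector, a new data source can be added by writing another subclass, without changing the code that plans or executes queries.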

For most of Facebook's query scenarios, Presto is more than 10 times better than Hive/MapReduce in both execution time and CPU efficiency. Facebook still plans to improve performance further. One plan is to design a new data format that reduces the amount of data conversion needed when passing data from one stage to the next. Facebook also plans to remove some current design limitations. The main ones are limits on the size of the tables in a join and on the cardinality of unique keys and groups. At present, the system also cannot write query results back to a table; results are returned to the client.
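One plausible way to see why an all-in-memory engine would limit table size in joins is a hash join whose build side must fit in memory. The sketch below is an assumption made for illustration, not a description of Presto's implementation: the source states only that the limit exists, and the hash_join function, the max_build_rows parameter, and the BuildSideTooLarge exception are hypothetical.

```python
from typing import Dict, Hashable, Iterable, Iterator, List


class BuildSideTooLarge(Exception):
    """Raised when the in-memory side of the join exceeds its row budget."""


def hash_join(build_rows: Iterable[dict],
              probe_rows: Iterable[dict],
              key: str,
              max_build_rows: int) -> Iterator[dict]:
    """Inner-join two row streams on `key`, holding the build side in memory."""
    table: Dict[Hashable, List[dict]] = {}
    count = 0
    # Build phase: the whole build side is loaded into a hash table.
    for row in build_rows:
        count += 1
        if count > max_build_rows:
            raise BuildSideTooLarge(
                f"build side has more than {max_build_rows} rows"
            )
        table.setdefault(row[key], []).append(row)
    # Probe phase: the other side is streamed and never stored in full.
    for probe in probe_rows:
        for match in table.get(probe[key], []):
            yield {**match, **probe}


if __name__ == "__main__":
    users = [{"uid": 1, "name": "a"}, {"uid": 2, "name": "b"}]
    orders = [{"uid": 1, "amt": 10}, {"uid": 1, "amt": 5}, {"uid": 3, "amt": 7}]
    print(list(hash_join(users, orders, "uid", max_build_rows=100)))
```

The probe side can be arbitrarily large because it is only streamed, but the build side is bounded by available memory. A grouped aggregation that keeps one in-memory entry per distinct key runs into the same kind of bound, which matches the cardinality limit mentioned above.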

Meituan currently uses Presto at large scale; see: http://tech.meituan.com/presto.html

Presto is open source under the Apache License 2.0. Its Git repository is at: https://github.com/prestodb/presto

Official documentation: https://prestodb.io/docs/current/overview/use-cases.html
