SequoiaDB x Spark: a new mainstream architecture leading enterprise-class applications

In June, Spark Summit 2017 brought together the elite of today's big data world. Spark, the hottest big data technology framework in the world, showcased its latest technical results, its ecosystem, and its future development plans.

As a leading distributed database vendor and one of Spark's 14 global distributors, SequoiaDB was invited to present "Distributed database + Spark architecture and applications" at the conference. SequoiaDB co-founder, CTO, and chief architect Wang Tao also shares his takeaways from the summit and the development and application of this architecture.

Spark evolves across the board and expands its ecosystem to support AI

With the release of Spark 2.2, Spark's performance has improved considerably. In stream processing, the latest version handles more than 60 million records per second, over five times the throughput of common stream processing frameworks (such as Apache Flink and Kafka Streams) under the same conditions. In testing, Spark's end-to-end response time on critical workloads reached the sub-millisecond level, making it truly real-time.

Streaming performance comparison chart released by Spark

Beyond the performance gains, Spark's Structured Streaming is now largely production-ready. With performance and stability assured, Structured Streaming supports more big data architectures, providing real-time streaming support at the highest performance level for workloads from graph processing to deep learning.

Spark is also moving to fully support the hot field of AI. With the Spark 2.2 release, Spark gained a complete deep learning pipeline, so that it can serve as a data source for deep learning and provide comprehensive data support.

Wang Tao believes that "Data is the new oil!" describes the relationship between big data and AI very accurately. Artificial intelligence is a new engine, and big data is the fuel that engine needs. Data is the foundation of deep learning; only when both are in place can artificial intelligence truly achieve "self-learning and self-evolution." As one of the most popular high-performance analytics and streaming frameworks in big data, Spark needs to fully support AI and deep learning.

With Spark's latest deep learning pipeline suite, users can call deep learning libraries from an existing Spark machine learning workflow, reuse already-trained models, and use Spark's distributed computing engine to process complex data with AI. Matei Zaharia, chief technologist at Databricks, also said that the official release of this suite is an important step toward making AI development widely accessible: it helps more users get access to AI and deep learning, and it greatly strengthens Spark's importance in future technology.

There is no doubt that Spark's product development is accelerating and that its technology ecosystem keeps expanding.

Distributed database + Spark architecture goes mainstream; SequoiaDB x Spark rounds out the big data ecosystem

In recent years, as Spark adoption has grown, the "distributed database + Spark" architecture has become one of the mainstream architectures. A distributed database provides storage and management for massive data along with high-concurrency, real-time interactive queries. This complements Spark's batch processing and makes the database an indispensable foundation of a Spark application architecture.

SequoiaDB is one of the earliest practitioners of the "distributed database + Spark" architecture, and its real-time access, high performance, and elastic scalability make it a solid data foundation for this architecture. Since 2015, the deeply integrated architecture of the SequoiaDB distributed database and Spark has matured, and many large enterprises, banks among them, have applied it in production systems for data processing and interactive access.

Technically, the deep integration of the SequoiaDB distributed database with Spark is achieved through SequoiaDB's own connector, which couples the database closely to the Spark architecture:

· The connector supports two integration modes, by data block and by data node. It also supports pushing query conditions down to the database, where they are matched against SequoiaDB's own indexes to improve query efficiency.

· When generating a query's access plan, the SequoiaDB for Spark connector can determine where the queried data and the Spark compute workers are located. By default it prefers local data, which reduces the overhead of moving data over the network.

· The connector can run concurrently at the data block level, making full use of the distributed nodes to raise the cluster's overall I/O throughput.
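
The locality preference described above can be illustrated with a small sketch. This is not the connector's code: the `Block` type, the `assign_blocks` function, and the host and worker names are all hypothetical, and the fallback rule (pick the least-loaded worker) is an assumption made for the example.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Block:
    """A hypothetical unit of stored data and the host that holds it."""
    block_id: str
    host: str


def assign_blocks(blocks, workers):
    """Assign each block to a worker, preferring one on the block's host.

    `workers` maps worker name -> host. A block with no co-located worker
    falls back to the least-loaded worker, which implies a network read.
    """
    by_host = defaultdict(list)
    for name, host in workers.items():
        by_host[host].append(name)

    load = {name: 0 for name in workers}
    plan = {}
    for block in blocks:
        # Use co-located workers when there are any, otherwise all workers.
        candidates = by_host.get(block.host) or list(workers)
        chosen = min(candidates, key=lambda name: (load[name], name))
        load[chosen] += 1
        plan[block.block_id] = (chosen, workers[chosen] == block.host)
    return plan


blocks = [Block("b1", "node-a"), Block("b2", "node-b"),
          Block("b3", "node-a"), Block("b4", "node-c")]
workers = {"w1": "node-a", "w2": "node-b"}
plan = assign_blocks(blocks, workers)
for block_id, (worker, is_local) in plan.items():
    print(block_id, worker, "local" if is_local else "remote")
```

In this toy setup three of the four blocks are read locally, and each block is an independent task, which is also what allows block-level concurrency across nodes.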

Distributed database + Spark technical architecture diagram

By default, Spark reads from sources such as text files and HDFS files, and it also supports third-party products as data sources for the computation tasks of the Spark framework. Besides distributed storage of massive data, a distributed database gives users multi-index support, which enables high-performance real-time data access in high-concurrency scenarios.

The main usage scenarios for distributed database + Spark are retrieving records from a large data set by condition, and running statistical analysis over a specific range of records within a large data set, for example the records of the past month. Queries and analyses with explicit query conditions like these are an ideal fit for Spark plus a distributed database. The distributed database + Spark architecture can cover the full range of functions, from high-concurrency real-time interactive queries, to high-performance data computing, to real-time stream processing.
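
To make the benefit of explicit query conditions concrete, here is a minimal sketch in plain Python rather than Spark. The `IndexedStore` class and its methods are hypothetical stand-ins for a database collection with a date index, and the records are made-up sample data. It compares shipping every record to the compute layer with letting the store apply a date-range condition first.

```python
from bisect import bisect_left, bisect_right
from datetime import date


class IndexedStore:
    """Hypothetical stand-in for a database collection with a date index."""

    def __init__(self, records):
        # Keep records sorted by date so a range lookup can use binary
        # search, which plays the role of an index scan here.
        self._records = sorted(records, key=lambda r: r["date"])
        self._dates = [r["date"] for r in self._records]

    def scan_all(self):
        """Full scan: every record is shipped to the compute layer."""
        return list(self._records)

    def scan_range(self, start, end):
        """Pushed-down range filter: only matching records leave the store."""
        lo = bisect_left(self._dates, start)
        hi = bisect_right(self._dates, end)
        return self._records[lo:hi]


def total_by_account(rows):
    """Statistical step done by the compute layer: sum amounts per account."""
    totals = {}
    for r in rows:
        totals[r["account"]] = totals.get(r["account"], 0) + r["amount"]
    return totals


records = [
    {"account": "A", "date": date(2017, 4, 28), "amount": 120},
    {"account": "B", "date": date(2017, 5, 20), "amount": 75},
    {"account": "A", "date": date(2017, 6, 2), "amount": 40},
    {"account": "A", "date": date(2017, 6, 10), "amount": 60},
]
store = IndexedStore(records)
start, end = date(2017, 5, 15), date(2017, 6, 15)

# Without pushdown: transfer everything, then filter in the compute layer.
shipped_all = store.scan_all()
filtered = [r for r in shipped_all if start <= r["date"] <= end]

# With pushdown: the store applies the condition using its index.
shipped_pushed = store.scan_range(start, end)

print(len(shipped_all), len(shipped_pushed))
print(total_by_account(shipped_pushed))
```

Both paths produce the same rows for the analysis step, but the pushed-down path moves fewer records out of the store. On a large data set with a selective condition, that difference is what makes this type of query a good match for the architecture.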

In practice, a joint-stock bank uses SequoiaDB to build its near-line data platform. In the SequoiaDB + Spark architecture, SequoiaDB stores the full volume of near-line data, keeps it online in real time, and provides real-time query access to all of it, while Spark provides conditional retrieval and statistical analysis.

On the one hand, the full volume of users' historical data is kept online, so bank customers can look up all transactions on their own accounts through multiple channels such as teller applications, mobile banking, and online banking. On the other hand, the platform gives the bank's internal staff free-form report analysis and supports many other services, such as historical data queries for public security authorities.

In addition, another bank uses a SequoiaDB + Spark underlying data platform for its "real-time position" system. This removes the limitation of the original reporting system, which could only deliver "T+1" results, and gives the system high-performance real-time data analysis, query, and display. Spark's high performance improves the efficiency of statistical analysis, while SequoiaDB's real-time data access ensures that the data is genuinely real-time.

On the next steps in collaboration with Spark, Wang said that as the Spark ecosystem keeps growing richer and its components support more and more technologies, the Spark ecosystem will be the strongest technical force in the future of big data.

As one of Spark's global distributors, SequoiaDB will further strengthen its cooperation with Spark/Databricks, invest more in the SequoiaDB + Spark solution, and pursue deeper integration with the Spark framework. The goal is full coverage from high-concurrency real-time interactive queries, to high-performance data computing, to real-time stream processing, giving enterprise users the highest-performing and most comprehensive big data platform.
