Using Akka to Optimize a Near-Real-Time Spark + ElasticSearch System

Source: Internet
Author: User

Consider a system that receives a large number of events every second, each carrying many parameters. Besides handling the data in near real time, the system must periodically check whether each combination of event and parameter values has exceeded a configured threshold. What kind of solution fits this scenario? André Camilo, a software architect at Premium Minds, recently published an article describing how his team used Akka to solve this thorny problem.

In this article, André Camilo first introduces their application scenarios:

Our system receives up to several hundred events per second. Some events have 8 parameters, and some have more than 240,000 combinations of parameter values. (For example, take a PhoneCall(phoneNumber, countryCode, geoZone) event with three parameters, where phoneNumber has 4,000 values, countryCode has 5, and geoZone has 10; the number of possible combinations is (4000 + 1) × (5 + 1) × (10 + 1) = 264,066, or about 264K.) We need not only to judge in real time whether these combinations of events and parameter values exceed the thresholds set by the system, but also to keep the last 30 minutes of data so that we can determine whether their frequency exceeded the threshold during that period.
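The arithmetic in the quote can be checked with a short sketch. It assumes that the "+ 1" in each factor stands for a wildcard ("any value") choice for that parameter, which is an interpretation rather than something the article states; the `WILDCARD` marker, the function names, and the sample values are all hypothetical.

```python
from itertools import product
from math import prod

WILDCARD = "*"  # assumed marker meaning "any value of this parameter"


def count_combinations(cardinalities):
    """Number of combinations when each parameter may also be a wildcard."""
    return prod(n + 1 for n in cardinalities)


def expand_event(event):
    """Yield every combination that a single event contributes to.

    Each parameter is either kept at its concrete value or replaced by the
    wildcard, so an event with k parameters matches 2**k combinations.
    """
    names = sorted(event)
    choices = [(event[name], WILDCARD) for name in names]
    for values in product(*choices):
        yield tuple(zip(names, values))


# phoneNumber: 4,000 values, countryCode: 5, geoZone: 10 (from the quote)
total = count_combinations([4000, 5, 10])

# Made-up sample event, for illustration only
call = {"phoneNumber": "5550100", "countryCode": "PT", "geoZone": "north"}
combos = list(expand_event(call))

print(total)        # 264066
print(len(combos))  # 8
```

Under the same assumption, an event with 8 parameters would be split into 2**8 = 256 combinations, which is why the number of counters to maintain grows so quickly.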

The simplest approach might be to store all the data and compute the frequency of each combination every second. In practice this is not feasible: it would mean more than 240,000 queries per second, which the system could not sustain. André Camilo's first solution used Spark and ElasticSearch:

We created a Spark Streaming pipeline that first reads messages from a JMS queue and converts them into PhoneCall events. Each event is then split into multiple events according to its parameter values, the countByWindow function computes the frequency of each event combination, and finally we check whether the average frequency of each combination exceeds its threshold. countByWindow is applied over a 30-minute window that slides every second, and its output is divided by 1,800 seconds to obtain the average frequency for each window. The final results are stored in the ElasticSearch cluster.
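To make the windowed average concrete, here is a minimal plain-Python sketch of what one evaluation of such a window computes. It is not Spark code and does not use the Spark API; the function names, the `(timestamp, key)` event shape, and the sample thresholds are assumptions made for illustration.

```python
from collections import Counter

WINDOW_SECONDS = 1800  # 30 minutes, as in the text


def average_frequencies(events, now, window=WINDOW_SECONDS):
    """Average events per second for each key over the last `window` seconds.

    `events` is an iterable of (timestamp_in_seconds, key) pairs. Only events
    with now - window < timestamp <= now are counted.
    """
    counts = Counter(key for ts, key in events if now - window < ts <= now)
    return {key: n / window for key, n in counts.items()}


def exceeded(frequencies, thresholds):
    """Keys whose average frequency is above their threshold."""
    no_limit = float("inf")
    return sorted(k for k, f in frequencies.items()
                  if f > thresholds.get(k, no_limit))


# Synthetic data: 3,600 "PT" events and 900 "ES" events inside the window,
# plus one "PT" event at t=0 that is too old to be counted.
events = [(t % 1800 + 1, "PT") for t in range(3600)]
events += [(t, "ES") for t in range(1, 1801, 2)]
events.append((0, "PT"))

freqs = average_frequencies(events, now=1800)
alerts = exceeded(freqs, {"PT": 1.0, "ES": 1.0})
print(freqs)   # {'PT': 2.0, 'ES': 0.5}
print(alerts)  # ['PT']
```

Recomputing this from the retained events for every key once per second is the expensive part: each combination needs its own 30-minute window of state.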

[Figure: processing flow of the Spark Streaming solution]

Although this solution is workable, it did not solve André Camilo's problem. The reason is not that Spark is inadequate: Spark Streaming can process large volumes of real-time data, but it cannot handle a large number of windows. In André Camilo's experiments the solution worked well with fewer than 1,000 combinations, but beyond that number it ran out of memory.

André Camilo's second solution uses Akka:

  • Create a combination Actor for each combination of parameter values.
  • Create an Actor responsible for receiving all events. This Actor splits each event into multiple events according to its parameter values and sends each split event to the matching combination Actor created in the first step.
  • Each combination Actor keeps the per-second event counts for the last 30 minutes in a ring buffer. Every second the buffer advances one position, and the Actor computes the event frequency, checks whether it has exceeded the threshold set by the system, and sends the result to the ElasticSearch Actor.
  • The ElasticSearch Actor is simply an ActorPublisher that streams the data to ElasticSearch.
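The per-combination bookkeeping described in the list can be sketched in plain Python. This is not Akka code: the class below is only a stand-in for a combination Actor, the `dispatch` helper is a stand-in for the receiving Actor, and all names and parameters are hypothetical. The running `total` is an implementation choice of this sketch, not something the article specifies.

```python
class CombinationCounter:
    """Stand-in for one per-combination actor (plain Python, not Akka)."""

    def __init__(self, threshold, window=1800):
        self.threshold = threshold   # allowed average events per second
        self.slots = [0] * window    # one slot per second of the window
        self.head = 0                # index of the slot for the current second
        self.total = 0               # running sum of all slots

    def record(self, n=1):
        """Count n events in the current second."""
        self.slots[self.head] += n
        self.total += n

    def tick(self):
        """Call once per second: report the frequency, then rotate the buffer.

        Returns (average_frequency, threshold_exceeded). In the described
        design this result would be sent on to the ElasticSearch actor.
        """
        frequency = self.total / len(self.slots)
        result = (frequency, frequency > self.threshold)
        # Advance to the next slot and discard the oldest second it held.
        self.head = (self.head + 1) % len(self.slots)
        self.total -= self.slots[self.head]
        self.slots[self.head] = 0
        return result


def dispatch(counters, combinations, threshold, window=1800):
    """Stand-in for the receiving actor: route one event to its counters."""
    for combo in combinations:
        if combo not in counters:
            counters[combo] = CombinationCounter(threshold, window)
        counters[combo].record()


# Tiny 3-second window so the rotation is easy to follow.
counter = CombinationCounter(threshold=0.5, window=3)
counter.record(3)
results = [counter.tick() for _ in range(4)]
print(results)  # [(1.0, True), (1.0, True), (1.0, True), (0.0, False)]
```

Each counter uses a fixed amount of memory (one integer per second of the window) and does constant work per tick, regardless of how many events arrived.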

[Figure: processing flow of the Akka solution]

[Figure: structure of the circular buffer]

You may ask whether creating an Actor for each combination leads to too many Actors. André Camilo says this is not a problem for Akka, a lightweight event-driven framework. With this solution he easily handled the splitting of 800 events on an i7 notebook with 4 GB of memory. More importantly, Akka supports horizontal scaling: if the system has more parameter values or needs more throughput, you only need to add more machines.

Finally, André Camilo concludes that Spark has excellent features and yields a simpler, more intuitive solution, but it is not suited to this scenario. Akka is well suited to CPU-intensive problems, and the Actor model is a better fit for highly concurrent workloads.
