This article is a translation of "Building an Analytics Engine Using Akka, Kafka & Elasticsearch", published with the permission of the original author, Satendra Kumar, and the website.
In this article, I'll share my experience building a large, distributed, fault-tolerant, extensible analytics engine with Scala, Akka, Play, Kafka, and Elasticsearch.
Our analytics engine is mainly used for text analysis. It takes structured, unstructured, and semi-structured data as input, and we use the engine to do a great deal of processing on that data. As shown in the first-generation architecture, the analytics engine can be accessed by REST clients or by web clients built into the engine.
Here is a brief description of the technologies used:
- The Play framework serves the REST API and web apps. Play is a lightweight, stateless, web-friendly MVC framework.
- An Akka cluster serves as the processing engine. Akka is a toolkit for writing highly concurrent, distributed, and resilient message-driven applications on the JVM.
- ClusterClient is used to communicate with the Akka cluster. It runs on the REST server and sends tasks to the cluster (a sketch follows this list). Using ClusterClient turned out to be a very bad decision: it does not maintain a durable connection to the Akka cluster, so it frequently reported connection errors, and re-establishing the connection meant restarting the JVM the client was running in.
- Elasticsearch is used as the query engine and data store, holding both raw data and analysis results.
- Kibana is used as the visualization platform. Kibana is Elastic's analytics and visualization tool.
- Akka actors implement the data import/export service for Elasticsearch. Its performance is very good and the service has never failed.
- S3 is used as a centralized file store.
- Elastic Load Balancing distributes load across the nodes.
- MySQL is used for metadata storage.
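Below is a minimal sketch of how the REST side can forward tasks to the Akka cluster through a ClusterClient. It uses the akka.cluster.client API from later Akka releases (in 2.2.x the class lived in the contrib module as akka.contrib.pattern.ClusterClient and was constructed slightly differently); the system names, node addresses, and the "analysisService" actor path are illustrative, not taken from the original system.

```scala
import akka.actor.{ActorPath, ActorSystem}
import akka.cluster.client.{ClusterClient, ClusterClientSettings}

object RestSideClient {
  val system = ActorSystem("rest-server")

  // Hypothetical addresses of cluster nodes running a ClusterReceptionist.
  val initialContacts = Set(
    ActorPath.fromString("akka.tcp://analytics@10.0.0.1:2552/system/receptionist"),
    ActorPath.fromString("akka.tcp://analytics@10.0.0.2:2552/system/receptionist"))

  // The ClusterClient keeps track of the cluster's receptionists and
  // tunnels messages from outside the cluster to actors inside it.
  val client = system.actorOf(
    ClusterClient.props(ClusterClientSettings(system).withInitialContacts(initialContacts)),
    "clusterClient")

  // Forward a task to a service actor registered with the receptionist
  // on the cluster side under the (hypothetical) name "analysisService".
  def submit(task: String): Unit =
    client ! ClusterClient.Send("/user/analysisService", task, localAffinity = false)
}
```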
We started with Akka 2.2.x and ran into a number of serious problems, mainly:
- Disconnects between ClusterClient and the Akka cluster: when CPU usage was high, ClusterClient would often inexplicably disconnect from the Akka cluster. Because it is a third-party library, all we could do was restart the JVM to get it working again, and sometimes that meant getting up in the middle of the night to deal with it.
- Resource utilization: we found that CPU usage on the REST servers was only 2-5%, which is a waste of resources, and Amazon EC2 servers are not cheap.
- Latency: the REST servers ran on separate machines from the Akka cluster. This introduced latency, because each client request had to be deserialized and then serialized again before being sent to the Akka cluster, and the response from the cluster was in turn deserialized before being returned to the requester. This serialization and deserialization frequently caused timeouts. On top of that, we were only using Play as a REST backend rather than as a full web framework; I admit this was a design mistake on our part.
To solve these problems we designed the second-generation architecture. The main changes were:
- Remove the Akka ClusterClient.
- Replace Play with Spray, because using Play purely as a REST service was not the right choice. Spray is a lightweight HTTP server.
- To reduce end-to-end latency, run the REST service in the same JVM as each Akka cluster node rather than on separate nodes (see the sketch below).
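As an illustration of that last change, here is a minimal sketch of a Spray REST endpoint (spray-routing 1.3.x style) started inside the same ActorSystem that the cluster node uses, so a request never crosses a process boundary before reaching the engine. The system name, port, and /analyze route are assumptions for the example, not details from the original system.

```scala
import akka.actor.ActorSystem
import spray.routing.SimpleRoutingApp

object AnalyticsNode extends App with SimpleRoutingApp {
  // One ActorSystem hosts both the cluster node's actors and the REST endpoint.
  implicit val system = ActorSystem("analytics")

  startServer(interface = "0.0.0.0", port = 8080) {
    path("analyze") {
      post {
        // In the real engine the request body would be handed straight
        // to local actors here instead of being acknowledged immediately.
        complete("accepted")
      }
    }
  }
}
```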
The new architecture looks like this:
This system worked very well. Life became beautiful again and the team received a lot of praise.
Three months later, a new requirement came in: add DataSift as a data source, supplying both streaming and historical data. The requirement itself was easy to satisfy: just add a new service that pulls data from DataSift and sends it to the analytics cluster.
Adding the new service was simple, but it led to new problems:
- The architecture above is essentially a push model, and the cluster could not keep up when large volumes of streaming or historical data were pushed to it.
- We decided to expand the cluster from 4 nodes to 8. That copes with peak load, but under normal conditions most of the nodes sit nearly idle. We were using Amazon EC2 4xlarge instances, which are very expensive, so this drove up infrastructure costs.
- We tried Amazon's auto-scaling service. It does scale out automatically when cluster load increases, but it did not scale back in when the load dropped; it simply did not handle our workload well.
- Another problem was that communication between Akka cluster nodes often broke down when CPU usage exceeded 90%, probably because we did not have enough experience tuning Akka clusters, or perhaps because Akka Cluster was not as mature then as it is now.
- If a node crashed, the whole process stopped.
While we were still looking for a solution to these problems, we were asked to add yet another data source!
After a lot of brainstorming and a hard look at the existing architecture, we came up with a simple, scalable, fault-tolerant third-generation architecture:
In this new architecture we removed the Akka cluster and rewrote the analytics engine. It is based entirely on Akka actors, with the REST service running in the same JVM. The REST service simply receives requests from clients, performs authentication and authorization, and then publishes a task message to a Kafka topic (a minimal producer sketch follows). Each analytics-engine node pulls a batch of messages from Kafka at a time, so it can never be overwhelmed.
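A minimal sketch of the REST side's hand-off to Kafka, using the modern org.apache.kafka.clients producer API. The article does not say which client version, topic names, or brokers were used; "analysis-tasks" and the addresses below are made up for the example.

```scala
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

object TaskPublisher {
  // Hypothetical topic name; the article does not specify it.
  val Topic = "analysis-tasks"

  private val props = new Properties()
  props.put("bootstrap.servers", "kafka-1:9092,kafka-2:9092")
  props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
  props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

  private val producer = new KafkaProducer[String, String](props)

  // Called by the REST layer after authentication and authorization:
  // the request becomes a task message keyed by its id.
  def publish(taskId: String, payload: String): Unit =
    producer.send(new ProducerRecord[String, String](Topic, taskId, payload))
}
```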
Thanks to Kafka's built-in consumer rebalancing, no matter which node dies, its messages are automatically delivered to another healthy node, so no message is lost.
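On the worker side, the pull model and the rebalancing described above look roughly like this, again assuming a recent kafka-clients version and the hypothetical "analysis-tasks" topic; in the real engine each record would be handed to local Akka actors rather than printed.

```scala
import java.time.Duration
import java.util.{Collections, Properties}
import scala.collection.JavaConverters._
import org.apache.kafka.clients.consumer.KafkaConsumer

object AnalysisWorker extends App {
  val props = new Properties()
  props.put("bootstrap.servers", "kafka-1:9092")
  props.put("group.id", "analysis-engine") // all engine nodes share one consumer group
  props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
  props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")

  val consumer = new KafkaConsumer[String, String](props)
  consumer.subscribe(Collections.singletonList("analysis-tasks"))

  while (true) {
    // Each node pulls only as much as it can handle; if a node dies,
    // its partitions are rebalanced to the remaining consumers.
    val records = consumer.poll(Duration.ofMillis(500))
    for (record <- records.asScala)
      println(s"processing task ${record.key}: ${record.value}")
  }
}
```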
With this architecture we no longer need to rent the previous Amazon EC2 4xlarge instances; EC2 2xlarge instances can handle any load, which saves a lot of money. (Applause here, please. :))
This is an architecture based entirely on the pull model. All requests and traffic spikes are absorbed by the Kafka cluster, and the engine is never overwhelmed, because every operation is pull-based. The whole system is deployed on 26 EC2 nodes, and in almost two years of production it has not failed.
We also use Kafka to collect the logs of our various services in order to analyze performance, security, and user behavior. Kafka producers send the logs to the Kafka brokers, and since we already have the Elasticsearch import/export service, we reuse it to push the logs into Elasticsearch, where we can easily visualize user behavior with Kibana.
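For completeness, here is a rough sketch of what an actor-based Elasticsearch import service can look like. To avoid assuming any particular client library, it posts documents over Elasticsearch's plain REST API (the /<index>/_doc endpoint of recent versions); the message type, index name, and URL are invented for the example.

```scala
import java.net.{HttpURLConnection, URL}
import akka.actor.{Actor, ActorSystem, Props}

// Hypothetical message type; the article does not show the real data model.
final case class IndexDoc(index: String, json: String)

// Posts each document to Elasticsearch over its REST API.
class EsImportActor(esBaseUrl: String) extends Actor {
  def receive: Receive = {
    case IndexDoc(index, json) =>
      val conn = new URL(s"$esBaseUrl/$index/_doc")
        .openConnection().asInstanceOf[HttpURLConnection]
      conn.setRequestMethod("POST")
      conn.setDoOutput(true)
      conn.setRequestProperty("Content-Type", "application/json")
      val out = conn.getOutputStream
      out.write(json.getBytes("UTF-8"))
      out.close()
      conn.getResponseCode // force the request and read the status
      conn.disconnect()
  }
}

object LogPipeline extends App {
  val system = ActorSystem("log-pipeline")
  val importer = system.actorOf(Props(new EsImportActor("http://localhost:9200")), "esImporter")
  importer ! IndexDoc("service-logs", """{"service":"rest","level":"INFO","msg":"started"}""")
}
```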
Conclusion
- Akka actors are ideal for building highly concurrent, distributed, resilient applications.
- Spray is ideal as a lightweight HTTP server. It has since been renamed Akka HTTP.
- The Play framework is ideal for building highly concurrent, scalable web applications; it is built on top of Akka.
- Elasticsearch is a very good search engine. It is built on Lucene and provides full-text search. Although we also use it as a data store, data persistence is not its strong point (compared with, say, Cassandra).
- Kafka is ideal for stream processing and log aggregation. Its architecture is designed to be scalable, distributed, and fault-tolerant.
Please bear with me while I work on the fourth generation of the architecture; I'll update this article then... Happy programming, and keep innovating!
http://www.infoq.com/cn/articles/use-akka-kafka--build-analysis-engine?utm_campaign=rightbar_v2&utm_source=infoq&utm_medium=articles_link&utm_content=link_text