Ele.me's big data computing engine: Grace, implemented with Spark Streaming

Source: Internet
Author: User
Tags: big data, big data cluster, data architecture, Ele.me, Spark

Grace architecture

The data used in the examples above was collected by Grace, an application developed by Ele.me's big data team. It is mainly used to monitor and analyze the runtime data of online MR/Spark tasks, to monitor running queues and task details, and to summarize that data.

Grace is implemented with Spark Streaming. It consumes from Kafka the jhist file paths of completed MR tasks and the event log paths of Spark tasks, reads the corresponding task history files from HDFS, and parses them into detailed MR/Spark task data. On top of this data it performs aggregation analysis to produce summaries at the task, Job, and Stage levels. Finally, a customized Dr-Elephant system runs heuristic algorithms over the detailed task data and gives users intuitive optimization tips.
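For illustration, here is a minimal Spark Streaming sketch of such a collection pipeline, not Grace's actual code. The broker address, the topic name, and the parseHistoryFile helper are assumptions:

```scala
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.{ConsumerStrategies, KafkaUtils, LocationStrategies}

object GraceCollectorSketch {

  // Stand-in for the real jhist / event-log parser (an assumption): it would
  // open `path` on HDFS and emit detailed task, Job, and Stage records.
  def parseHistoryFile(path: String): Unit =
    println(s"parsing $path")

  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(new SparkConf().setAppName("grace-collector"), Seconds(30))

    val kafkaParams = Map[String, Object](
      "bootstrap.servers"  -> "kafka:9092",              // assumed broker address
      "key.deserializer"   -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id"           -> "grace-collector"
    )

    // Each Kafka message carries the HDFS path of a finished MR jhist file or
    // a Spark event-log file ("task-history-paths" is an assumed topic name).
    val paths = KafkaUtils.createDirectStream[String, String](
      ssc,
      LocationStrategies.PreferConsistent,
      ConsumerStrategies.Subscribe[String, String](Seq("task-history-paths"), kafkaParams)
    ).map(_.value)

    // Fetch and parse each history file; aggregation into task/Job/Stage
    // summaries would happen downstream of this step.
    paths.foreachRDD(_.foreachPartition(_.foreach(parseHistoryFile)))

    ssc.start()
    ssc.awaitTermination()
  }
}
```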

We also made custom changes to Dr-Elephant and packaged it as a component of the Grace system: the deployment mode changed from a single-machine service to distributed real-time parsing; its data source was switched to the detailed task data that Grace parses out; an ActionId was added to each task to trace link information; the Spark task parsing logic was optimized; and new heuristic algorithms and new monitoring indicators were added.
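As an illustration of the kind of heuristic that can be added, the sketch below flags stages with heavy disk spill. The StageSummary shape and the thresholds are assumptions for illustration, not Dr-Elephant's actual API:

```scala
// StageSummary and the thresholds below are assumptions; Dr-Elephant's
// real heuristics are structured differently.
case class StageSummary(stageId: Int, shuffleWriteBytes: Long, spillBytes: Long)

// Flag stages whose disk spill dwarfs their shuffle output: a common sign
// that executor memory or the partition count should be raised.
def spillHeuristic(stages: Seq[StageSummary]): Seq[String] =
  stages.collect {
    case s if s.spillBytes > 2 * s.shuffleWriteBytes && s.spillBytes > (1L << 30) =>
      s"Stage ${s.stageId}: heavy disk spill (${s.spillBytes >> 20} MB); " +
        "consider more executor memory or more partitions"
  }
```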


Conclusion

As the big data ecosystem matures, more and more users from different backgrounds will join it. How to lower the barrier to entry and make it quick and convenient for users to consume big data resources is also a problem worth considering.

Most tasks running in a big data cluster are business-related, but as cluster and task sizes grow, the data generated by the cluster itself can no longer be ignored. This data truly reflects the details of how the cluster is used, so we need to consider how to collect and use it to measure and observe our clusters and tasks from a data perspective.

It is not enough to focus only on the overall deployment, performance, and stability of the cluster. How to improve the user experience, fully exploit the cluster's own data, and use data to drive the construction of big data clusters is the theme of this talk.


Q & A

Q: Can you briefly introduce the scheduling system? Managing tens of thousands of tasks is not easy.

A: The scheduling system is quite complicated to describe in full, so I'll just mention a few key points: dependencies between tasks, lineage, tasks versus instances, as well as cluster backpressure, distributed scheduling, and the underlying environment.

Lineage is essential, because when your cluster is large, users cannot completely add all the dependencies when configuring tasks.

By parsing tasks through the lineage system, pre-dependencies are automatically recommended when a user configures a new task, ensuring tasks run in the correct order.
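A toy sketch of this kind of lineage-based recommendation, assuming each task's input and output tables have already been extracted (for example, by parsing the SQL it runs); all names here are illustrative:

```scala
// Assumed model: each task's input/output tables were already extracted.
case class TaskLineage(taskId: String, reads: Set[String], writes: Set[String])

// Any existing task that writes a table the new task reads is recommended
// as a pre-dependency, so the new task only runs after its inputs are fresh.
def recommendUpstream(newTask: TaskLineage, existing: Seq[TaskLineage]): Seq[String] =
  existing.collect {
    case t if (t.writes intersect newTask.reads).nonEmpty => t.taskId
  }

// Example (illustrative table names):
// recommendUpstream(TaskLineage("t_new", Set("dw.orders"), Set("ads.report")),
//                   Seq(TaskLineage("t_etl", Set("ods.orders"), Set("dw.orders"))))
// => Seq("t_etl")
```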


Q: How do we get the daily read and write volume of the cluster? Does Hadoop have an interface for this?

A: The cluster's read and write volume is collected by Grace, introduced earlier, because we analyze the HDFS read and write volume of each MR or Spark task. This also includes spill-to-disk data, shuffle write and shuffle read volumes, and GB-hour information for each task.

In fact, you can see this data in the YARN or Spark web UI; all you need to do is parse and collect it in real time. This is what was meant earlier about operating and maintaining the cluster from the perspective of data.

In addition to business data, the data generated by the cluster itself is also valuable.
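As a rough illustration of the GB-hour accounting mentioned above: memory footprint multiplied by runtime, summed over a task's executors. The ExecutorRun shape is an assumption, not Grace's actual schema:

```scala
// Assumed shape: memory footprint (GB) and wall-clock runtime (seconds)
// of one executor of a task.
case class ExecutorRun(memoryGB: Double, seconds: Long)

// GB-hours = memory * runtime, summed over a task's executors.
def gbHours(runs: Seq[ExecutorRun]): Double =
  runs.map(r => r.memoryGB * r.seconds / 3600.0).sum

// Example: one 8 GB executor running for two hours costs 16 GB-hours.
// gbHours(Seq(ExecutorRun(8.0, 7200))) == 16.0
```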


Q: So this uses data from the big data platform itself to refine the operation and maintenance of the cluster?

A: Yes. If you also work in data architecture, think back on your daily work: we are simply turning manual analysis into automation and adding some real-time capability.
