How to build a Java thread pool management and distributed Hadoop scheduling framework


In everyday development, threads are indispensable. For example, Tomcat handles each servlet request on a thread; without threads, how could it serve multiple users at once? Yet many developers who are new to threads have been burned by them. Building a simple thread development model and framework that lets everyone move quickly from single-threaded to multi-threaded development is genuinely a difficult project.

What is a thread? First, let's look at what a process is. A process is a program executing in the system; it can use resources such as memory, the processor, and the file system. For example, QQ, Eclipse, and Tomcat are each executable programs, and a running instance of one is a process. Why multithreading? If each process could handle only one thing at a time and could not work on several tasks concurrently, we could chat with only one person in QQ, could not compile code while editing in Eclipse, and Tomcat could serve only one user request at a time; we would still be living in primitive times. The purpose of multithreading is to let one process handle multiple tasks or requests at the same time: QQ can chat with several people at once, Eclipse can compile while we edit, and Tomcat can serve many user requests concurrently.

Given all these advantages, how do we turn a single-threaded program into a multi-threaded one? Different languages do it differently. In Java there are two basic ways to implement multithreading: extend the java.lang.Thread class, or implement the java.lang.Runnable interface; a minimal sketch of both follows.
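For reference, here is a minimal sketch of the two approaches; the class names HelloThreads, MyThread, and MyTask are made up purely for illustration.

public class HelloThreads {

    // Approach 1: extend java.lang.Thread and override run()
    static class MyThread extends Thread {
        @Override
        public void run() {
            System.out.println("running in a Thread subclass");
        }
    }

    // Approach 2: implement java.lang.Runnable and hand it to a Thread
    static class MyTask implements Runnable {
        @Override
        public void run() {
            System.out.println("running a Runnable");
        }
    }

    public static void main(String[] args) {
        new MyThread().start();           // starts a new thread
        new Thread(new MyTask()).start(); // wraps the Runnable in a Thread
    }
}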

Let's look at an example. Suppose there are 100 pieces of data to be processed. First, the processing speed of a single thread:

package thread;

import java.util.Vector;

public class OneMain {
    public static void main(String[] args) throws InterruptedException {
        Vector<Integer> list = new Vector<Integer>(100);

        for (int i = 0; i < 100; i++) {
            list.add(i);
        }

        long start = System.currentTimeMillis();
        while (list.size() > 0) {
            int val = list.remove(0);
            Thread.sleep(100); // simulate processing
            System.out.println(val);
        }
        long end = System.currentTimeMillis();

        System.out.println("consumed " + (end - start) + " ms");
    }

    // consumed about 10063 ms
}

Now let's look at the processing speed with multiple threads, using 10 threads to work on the same data:

package thread;

import java.util.Vector;
import java.util.concurrent.CountDownLatch;

public class MultiThread extends Thread {
    static Vector<Integer> list = new Vector<Integer>(100);
    static CountDownLatch count = new CountDownLatch(10);

    public void run() {
        while (list.size() > 0) {
            try {
                int val = list.remove(0);
                System.out.println(val);
                Thread.sleep(100); // simulate processing
            } catch (Exception e) {
                // another thread may have emptied the list already (index out of
                // bounds); this demo simply ignores the error to stay short
            }
        }

        count.countDown(); // this worker is finished, count down by one
    }

    public static void main(String[] args) throws InterruptedException {
        for (int i = 0; i < 100; i++) {
            list.add(i);
        }

        long start = System.currentTimeMillis();

        for (int i = 0; i < 10; i++) {
            new MultiThread().start();
        }

        count.await();
        long end = System.currentTimeMillis();
        System.out.println("consumed " + (end - start) + " ms");
    }

    // consumed about 1001 ms
}

Now we see the advantage of threads: the single-threaded version takes about 10 s, while 10 threads finish in about 1 s, making full use of system resources for parallel computation. A common misunderstanding is that more threads always means higher performance; that is wrong. The number of threads should be appropriate, and overdoing it hurts. Some computer hardware background helps here: the CPU runs threads in time slices, and every thread needs CPU time to execute, but in this simplified picture the CPU is a single shared resource that all threads compete for. That competition is usually arbitrated with the following scheduling algorithms:

1. First-come, first-served: tasks queue up in arrival order and are executed one after another, whatever they are.

2. Round-robin time slicing, one of the oldest CPU scheduling algorithms: a time slice is defined, and no task may occupy the CPU longer than that slice. When the slice expires, the task is paused, its state is saved, and it is placed at the end of the queue to wait for its next turn.

3. Priority scheduling: each task is assigned a priority; higher-priority tasks are executed first, while lower-priority tasks wait.

These three algorithms each have advantages and disadvantages, and a real operating system combines several of them, so that high-priority tasks are handled first without monopolizing the CPU forever. On the hardware side, multi-core and hyper-threaded CPUs also raise efficiency. Even so, it should be clear that adding threads increases the CPU's scheduling load: threads must be created and destroyed, and the scheduler must constantly check which threads should be switched off the CPU and which should be given CPU time, all of which consumes system resources.

We therefore need a mechanism to manage this pile of thread resources uniformly. The thread pool concept removes the cost of frequently creating and destroying threads: a pool of threads of a certain size is created in advance and waits for the user's tasks, ready to process work at any time, instead of creating a thread only when one is needed. This matters especially in Java, where reducing the frequent creation and destruction of objects minimizes garbage collection overhead.

In the past everyone implemented their own thread pool, but since the java.util.concurrent framework shipped with the JDK (JDK 5 and later), most of that repetitive thread pool work is already done for us. The Executors factory class creates thread pools; a rough list (a usage sketch follows):

newCachedThreadPool creates new threads as needed and reuses idle threads created earlier

newFixedThreadPool creates a pool with a fixed number of threads

newScheduledThreadPool creates a pool whose threads can run tasks on a schedule
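For example, assuming the same 100-item workload as above, a fixed pool of 10 threads could be used roughly like this (the class name ExecutorDemo is made up for this sketch):

package thread;

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ExecutorDemo {
    public static void main(String[] args) throws InterruptedException {
        // a pool of 10 pre-created worker threads
        ExecutorService pool = Executors.newFixedThreadPool(10);

        long start = System.currentTimeMillis();
        for (int i = 0; i < 100; i++) {
            final int val = i;
            pool.submit(new Runnable() {
                public void run() {
                    try {
                        Thread.sleep(100); // simulate processing
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                    System.out.println(val);
                }
            });
        }

        pool.shutdown();                            // accept no new tasks
        pool.awaitTermination(1, TimeUnit.MINUTES); // wait for queued tasks to finish
        System.out.println("consumed " + (System.currentTimeMillis() - start) + " ms");
    }
}

Compared with the MultiThread class above, the pool owns thread creation and reuse, which is exactly the cost the thread pool is meant to remove.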

With the thread pool in place, you still need to consider the following issues:

1. How to manage threads, for example how to create a task thread.

2. How to stop and start a thread.

3. Whether a thread can be started at a precise time rather than only at a fixed interval, for example exactly at 1 o'clock (a sketch follows this list).

4. How to monitor threads: if a thread dies during execution, how do we find out about the abnormal termination?
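For point 3, here is a rough sketch of one way to start a task at an exact time using only java.util.concurrent: compute the delay until the next 1 o'clock and pass it as the initial delay. The class name FixedTimeStart is made up; the Quartz component discussed later handles this more cleanly.

import java.util.Calendar;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class FixedTimeStart {
    public static void main(String[] args) {
        ScheduledExecutorService scheduler = Executors.newScheduledThreadPool(1);

        // compute the delay from now until the next 01:00
        Calendar next = Calendar.getInstance();
        next.set(Calendar.HOUR_OF_DAY, 1);
        next.set(Calendar.MINUTE, 0);
        next.set(Calendar.SECOND, 0);
        next.set(Calendar.MILLISECOND, 0);
        if (next.getTimeInMillis() <= System.currentTimeMillis()) {
            next.add(Calendar.DAY_OF_MONTH, 1); // already past 01:00 today
        }
        long initialDelay = next.getTimeInMillis() - System.currentTimeMillis();

        // run the task at 01:00 and then every 24 hours
        scheduler.scheduleAtFixedRate(new Runnable() {
            public void run() {
                System.out.println("task started at " + new java.util.Date());
            }
        }, initialDelay, TimeUnit.DAYS.toMillis(1), TimeUnit.MILLISECONDS);
    }
}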

With these requirements in mind, we need centralized thread management, which java.util.concurrent alone does not give us. We need to do the following:

1. Separate threads from the business logic, and move the business configuration into its own table.

2. Build a thread scheduling framework on top of java.util.concurrent, including thread state management, a thread stop interface, a thread liveness heartbeat mechanism, and a thread exception logging module (a rough sketch follows this list).

3. Build a flexible timer component and integrate the Quartz scheduling component to get precise timing.

4. Build a thread pool task scheduling system driven by the business configuration, supporting configuration management, adding thread tasks, monitoring, timing, and administration.
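A very rough sketch of what item 2 might look like; the names ManagedTask and TaskStatus are invented for this illustration, and only the skeleton of state management, stop handling, heartbeat, and exception capture is shown.

import java.util.concurrent.atomic.AtomicLong;

// Hypothetical wrapper around a business Runnable; the framework would submit
// it to the pool and read its status and heartbeat for monitoring.
public class ManagedTask implements Runnable {

    public enum TaskStatus { NEW, RUNNING, STOPPED, FAILED, FINISHED }

    private final Runnable business;                           // the real work
    private volatile TaskStatus status = TaskStatus.NEW;       // state management
    private volatile boolean stopRequested = false;            // stop interface
    private final AtomicLong lastHeartbeat = new AtomicLong(); // liveness heartbeat

    public ManagedTask(Runnable business) {
        this.business = business;
    }

    public void requestStop()        { stopRequested = true; }
    public TaskStatus getStatus()    { return status; }
    public long getLastHeartbeat()   { return lastHeartbeat.get(); }

    public void run() {
        status = TaskStatus.RUNNING;
        lastHeartbeat.set(System.currentTimeMillis());
        try {
            if (!stopRequested) {
                business.run();                     // do the business work
                lastHeartbeat.set(System.currentTimeMillis());
            }
            status = stopRequested ? TaskStatus.STOPPED : TaskStatus.FINISHED;
        } catch (RuntimeException e) {
            status = TaskStatus.FAILED;             // exception logging hook
            e.printStackTrace();                    // a real framework would log this
        }
    }
}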

Component diagram: [figure not reproduced]

Can a thread scheduling framework on a single machine meet the needs of large-scale computation? The answer is no. A single machine's resources are limited; as mentioned above, the CPU hands out time slices, so more tasks simply means longer queues, and even if CPUs are added, one machine can only hold so many. Therefore, the thread pool framework must become a distributed task scheduling framework that can scale horizontally: when one machine hits its resource bottleneck, we add another machine, deploy the scheduling framework and the business on it, and gain computing capacity. So how do we build it? As shown in the following figure:


Based on jeeframework, we encapsulate Spring, iBATIS, database access, and other operations, so the framework can call business methods to complete the business processing. Main components:

1. Tasks are stored centrally on the database server.

2. The control center manages node status and task distribution in the cluster.

3. The thread pool scheduling cluster executes the tasks distributed by the control center.

4. The web server provides visual task assignment, management, and monitoring.

Generally, this architecture can meet common distributed processing requirements. However, as developers and business models multiply, the single-threaded programming model becomes harder to work with. For example, if you split a million records across threads, just querying the database takes a long time. Someone will say: can't we scatter the million records across different machines, compute in parallel, and merge the results? For that one specific case, writing a dedicated program is no problem, but how do we handle the next massive requirement, for example ranking users by how often their posts from the past three years were forwarded? Writing yet another one-off program is too troublesome. A distributed computing architecture solves exactly this, reducing development complexity while improving performance. Does a recently very popular framework come to mind? Hadoop is exactly that: it decomposes a large computing task, computes the pieces, and merges the results, which is what we need. But anyone who has played with it knows that Hadoop is not a single process we can embed; it is a whole collection of processes. So how do we combine it with our scheduling framework? The figure shows the idea:


The preceding distributed scheduling framework components remain unchanged. The following components and functions are added:

1. Transform the distributed scheduling framework so that it converts its own thread tasks into MapReduce tasks and submits them to the Hadoop cluster (a minimal job sketch follows this list).

2. The Hadoop cluster can call the Spring and iBATIS layers of the business interfaces to process business logic and access the database.

3. Data required by Hadoop jobs can be queried through Hive.

4. Hadoop jobs can read and write HDFS/HBase.

5. Business data should be loaded into the Hive warehouse promptly.

6. Hive handles offline data, HBase handles frequently updated data, and HDFS is the underlying storage for both Hive and HBase and can also store ordinary files.
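As a hedged sketch of item 1 only, assuming the standard Hadoop 2.x MapReduce API, a minimal job that the framework could build and submit might look like this; the class names SumJob, SumMapper, SumReducer and the input/output paths are made up.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SumJob {

    // Mapper: emits (line, 1) for every input line
    public static class SumMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        protected void map(LongWritable offset, Text line, Context ctx)
                throws IOException, InterruptedException {
            ctx.write(line, ONE);
        }
    }

    // Reducer: sums the counts for each key
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            ctx.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "sum-job");  // job name is illustrative
        job.setJarByClass(SumJob.class);
        job.setMapperClass(SumMapper.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path("/input/tasks"));    // made-up path
        FileOutputFormat.setOutputPath(job, new Path("/output/tasks")); // made-up path
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

In the actual framework, the input would presumably come from the task data exported to HDFS or Hive, and the job configuration would be generated from the business configuration table.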

With that, the transformation is basically complete. One point must be emphasized: architecture design should reduce the complexity of application development. Although Hadoop is introduced here, it stays hidden behind the framework: a business processing class can run either in standalone mode or on Hadoop, and can still call Spring and iBATIS. This lowers the learning cost for developers, who can pick up the new skill gradually in practice.

Screenshots: [not reproduced]

Original article; when reprinting, please credit LANCEYAN.COM.
