Managing Java Thread Pools and Building a Distributed Hadoop Scheduling Framework

Source: Internet
Author: User

Working with threads is a routine part of development. Tomcat, for example, runs servlets in threads; without threads, how could it serve multiple users at once? Yet many developers who are new to threads struggle with them. Building a simple threading framework that lets everyone move from single-threaded to multithreaded development is a genuinely difficult project.

What exactly is a thread? First, consider what a process is: a process is a program executing on the system, and it can use memory, the processor, the file system, and other resources. QQ, Eclipse, and Tomcat are all executable programs; once started, each runs as a process. Why do we need multiple threads? If each process could handle only one thing at a time, QQ would let us chat with only one person, Eclipse could not compile code while we edit it, and Tomcat could serve only one user request at a time. We might as well be living in primitive society. The purpose of multithreading is to let one process handle several tasks or requests at the same time. That is why we can chat with many people in QQ at once, why Eclipse can compile while we write code, and why Tomcat can serve many user requests simultaneously.

With so many benefits, how do we turn a single-threaded program into a multithreaded one? Different languages do it differently. Java offers two ways to implement multithreading: extend the java.lang.Thread class, or implement the java.lang.Runnable interface.
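The two approaches can be put side by side in a minimal sketch. The class names TwoWaysDemo, GreeterThread, and GreeterTask are made up for illustration and do not come from the article.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

class TwoWaysDemo {

    // Thread-safe log so both threads can record that they ran.
    static final List<String> LOG = new CopyOnWriteArrayList<>();

    // Way 1: extend java.lang.Thread and override run().
    static class GreeterThread extends Thread {
        @Override
        public void run() {
            LOG.add("from Thread subclass");
        }
    }

    // Way 2: implement java.lang.Runnable and hand the instance to a Thread.
    static class GreeterTask implements Runnable {
        @Override
        public void run() {
            LOG.add("from Runnable");
        }
    }

    // Starts one thread of each kind, waits for both, and returns what they logged.
    static List<String> runBoth() {
        LOG.clear();
        Thread t1 = new GreeterThread();
        Thread t2 = new Thread(new GreeterTask());
        t1.start(); // start() spawns a new thread; calling run() directly would not
        t2.start();
        try {
            t1.join(); // wait for both threads to finish
            t2.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return new ArrayList<>(LOG);
    }

    public static void main(String[] args) {
        System.out.println(runBoth());
    }
}
```

Implementing Runnable is often the more flexible choice, because the class remains free to extend something else and the same task object can later be handed to a thread pool.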

Let's look at an example. Suppose 100 data items need to be processed. First, the processing speed of a single thread:

package thread;

import java.util.Vector;

public class OneMain {

    public static void main(String[] args) throws InterruptedException {
        // Fill the list with 100 items to process.
        Vector<Integer> list = new Vector<>(100);
        for (int i = 0; i < 100; i++) {
            list.add(i);
        }
        long start = System.currentTimeMillis();
        // Take the items one at a time on the main thread.
        while (list.size() > 0) {
            int val = list.remove(0);
            Thread.sleep(100); // simulate processing
            System.out.println(val);
        }
        long end = System.currentTimeMillis();
        System.out.println("Consumption " + (end - start) + " ms");
    }
    // Output: Consumption 10063 ms
}

Now look at the processing speed with multithreading, using 10 threads to process the same data:

package thread;

import java.util.Vector;
import java.util.concurrent.CountDownLatch;

public class Multithread extends Thread {

    // Shared by all worker threads.
    static Vector<Integer> list = new Vector<>(100);
    static CountDownLatch count = new CountDownLatch(10);

    @Override
    public void run() {
        while (list.size() > 0) {
            try {
                int val = list.remove(0);
                System.out.println(val);
                Thread.sleep(100); // simulate processing
            } catch (Exception e) {
                // Another thread may empty the list between the size() check and
                // remove(0), which raises an out-of-bounds exception. This example
                // only illustrates the speedup, so the error is ignored.
            }
        }
        count.countDown(); // this thread has finished: decrement the latch
    }

    public static void main(String[] args) throws InterruptedException {
        for (int i = 0; i < 100; i++) {
            list.add(i);
        }
        long start = System.currentTimeMillis();
        for (int i = 0; i < 10; i++) {
            new Multithread().start();
        }
        count.await(); // wait until all 10 threads have finished
        long end = System.currentTimeMillis();
        System.out.println("Consumption " + (end - start) + " ms");
    }
    // Output: Consumption 1001 ms
}
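The 10-thread example deliberately ignores the gap between checking the list size and removing an element. One way to close that gap, shown here as a sketch rather than as the article's own code, is to let each worker call poll() on a java.util.concurrent.ConcurrentLinkedQueue, which removes the head atomically or returns null when the queue is empty. The class name SafeMultiThread and the process method are hypothetical, and the simulated work is shortened to 1 ms.

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicInteger;

class SafeMultiThread {

    // Drains 'items' values using 'threads' workers and returns how many were processed.
    static int process(int items, int threads) {
        Queue<Integer> queue = new ConcurrentLinkedQueue<>();
        for (int i = 0; i < items; i++) {
            queue.add(i);
        }
        CountDownLatch done = new CountDownLatch(threads);
        AtomicInteger processed = new AtomicInteger();

        for (int t = 0; t < threads; t++) {
            new Thread(() -> {
                try {
                    // poll() removes the head atomically or returns null when empty,
                    // so there is no window between "check" and "remove".
                    while (queue.poll() != null) {
                        Thread.sleep(1); // simulate processing
                        processed.incrementAndGet();
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                } finally {
                    done.countDown(); // always signal completion, even on failure
                }
            }).start();
        }
        try {
            done.await(); // wait for every worker to finish
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return processed.get();
    }

    public static void main(String[] args) {
        System.out.println("processed " + process(100, 10));
    }
}
```

Counting down in a finally block also means the main thread cannot wait forever if one worker fails.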

You can see the benefit of threads: a single thread needs about 10 s, while 10 threads need only about 1 s. System resources are used to compute in parallel. A common misconception is that more threads always mean higher performance. That is wrong: the number should be appropriate, and too many is as bad as too few. Understanding why requires a little knowledge of computer hardware. The CPU is the unit that performs computation, and a thread needs it in order to run. But there is only one of this resource, so every thread competes for it. The following algorithms are commonly used to schedule the CPU:

First come, first served: a queue. Whatever the task, it waits its turn in arrival order.

Time-slice rotation (round robin): one of the oldest CPU scheduling algorithms. A time slice is set, and no task may hold the CPU longer than that in one turn. When the slice expires, the task is paused and placed at the end of the queue to wait for its next turn.

Priority scheduling: tasks are assigned priorities; higher-priority tasks run first, and lower-priority tasks wait.
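Time-slice rotation is easy to simulate. The sketch below is a toy model rather than an operating system scheduler: each task has a total amount of CPU time it needs, receives at most one slice per turn, and goes to the back of the queue if it is not finished. The names RoundRobinDemo and finishOrder are invented for this illustration.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

class RoundRobinDemo {

    // Returns the order in which tasks finish under round-robin scheduling.
    // burst[i] is the total CPU time task i needs; slice is the time slice length.
    static List<Integer> finishOrder(int[] burst, int slice) {
        int[] remaining = burst.clone();
        Deque<Integer> ready = new ArrayDeque<>();
        for (int i = 0; i < burst.length; i++) {
            ready.addLast(i);
        }
        List<Integer> finished = new ArrayList<>();
        while (!ready.isEmpty()) {
            int task = ready.pollFirst();
            // Run the task for one slice, or less if it needs less.
            remaining[task] -= Math.min(slice, remaining[task]);
            if (remaining[task] > 0) {
                ready.addLast(task); // not done: back to the end of the queue
            } else {
                finished.add(task);
            }
        }
        return finished;
    }

    public static void main(String[] args) {
        // Tasks 0, 1, 2 need 5, 2, and 8 time units; the slice is 3 units.
        System.out.println(finishOrder(new int[] {5, 2, 8}, 3)); // prints [1, 0, 2]
    }
}
```

With first come, first served, the same three tasks would finish in the order 0, 1, 2; with round robin the short task 1 finishes first because no task can hold the CPU for more than one slice at a time.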

Each of these three algorithms has advantages and disadvantages. Real operating systems combine several of them so that high-priority tasks are handled first without being the only tasks that ever run. Hardware has also improved efficiency with multi-core CPUs, CPUs that run multiple hardware threads, and similar solutions. Even so, more threads mean more scheduling load: the system must create threads, destroy threads, and decide which threads to swap off the CPU and which to give CPU time. All of this consumes system resources, so we need a mechanism to manage these threads in a unified way. Thread pooling removes the cost of frequently creating and destroying threads. A thread pool is a set of threads of predetermined size, created in advance and waiting to process tasks, so nothing has to wait for a thread to be created. In Java in particular, reducing the frequent creation and destruction of objects also keeps garbage collection overhead down.

In the past we all wrote our own thread pools. JDK 1.5 introduced the java.util.concurrent concurrency framework, which eliminates most of that duplicated work. You can create thread pools with the Executors class; its main factory methods are:

newCachedThreadPool creates a thread pool that caches and reuses idle threads.

newFixedThreadPool creates a thread pool with a fixed number of threads.

newScheduledThreadPool creates a thread pool that can run tasks after a delay or periodically.
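As a small illustration of the factory methods listed above, the following sketch submits tasks to a pool created with Executors.newFixedThreadPool and collects the results through Future objects. The class FixedPoolDemo and its sumOfSquares method are invented for this example; the work done per task (squaring a number) is a stand-in for real processing.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class FixedPoolDemo {

    // Squares the numbers 0..items-1 on a fixed-size pool and returns the sum.
    static int sumOfSquares(int items, int poolSize) {
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        try {
            List<Future<Integer>> results = new ArrayList<>();
            for (int i = 0; i < items; i++) {
                final int n = i;
                // The pool reuses its threads; no thread is created per task.
                results.add(pool.submit(() -> n * n));
            }
            int total = 0;
            for (Future<Integer> f : results) {
                total += f.get(); // blocks until that task has finished
            }
            return total;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        } finally {
            pool.shutdown(); // let the worker threads exit once the queue is empty
        }
    }

    public static void main(String[] args) {
        System.out.println(sumOfSquares(100, 10)); // prints 328350
    }
}
```

Compared with starting threads by hand, the pool owns thread creation and teardown, and the caller only describes the tasks.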

Once you have a thread pool, there are several issues to consider:

How threads are managed, for example how a thread for a new task is created.

How threads are stopped and started.

Whether a thread can start at an exact time, such as 1 o'clock, in addition to running at fixed intervals.

How threads are monitored: if a thread dies or terminates abnormally during execution, how do we find out?
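The exact-time question can be approximated with the JDK alone by computing the delay until the next occurrence of the target time and passing it to a ScheduledExecutorService. This is a substitute sketch and not the Quartz-based component the article goes on to propose; the names DailyAtDemo, delayUntilNext, and scheduleDaily are hypothetical.

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.LocalTime;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

class DailyAtDemo {

    // Time from 'now' until the next occurrence of the wall-clock time 'target'.
    static Duration delayUntilNext(LocalDateTime now, LocalTime target) {
        LocalDateTime next = now.toLocalDate().atTime(target);
        if (!next.isAfter(now)) {
            next = next.plusDays(1); // already passed today, so use tomorrow
        }
        return Duration.between(now, next);
    }

    // Runs 'task' at the next occurrence of 'target' and every 24 hours after that.
    // The caller is responsible for shutting the returned scheduler down.
    static ScheduledExecutorService scheduleDaily(Runnable task, LocalTime target) {
        ScheduledExecutorService scheduler = Executors.newScheduledThreadPool(1);
        long initialDelayMs = delayUntilNext(LocalDateTime.now(), target).toMillis();
        scheduler.scheduleAtFixedRate(task, initialDelayMs,
                TimeUnit.DAYS.toMillis(1), TimeUnit.MILLISECONDS);
        return scheduler;
    }

    public static void main(String[] args) {
        LocalDateTime now = LocalDateTime.of(2024, 1, 1, 0, 30);
        System.out.println(delayUntilNext(now, LocalTime.of(1, 0))); // prints PT30M
    }
}
```

A fixed 24-hour period drifts when the clock changes, for example at daylight saving transitions, which is one reason a calendar-aware scheduler is attractive for this job.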

Given these points, threads need to be managed centrally, and java.util.concurrent alone cannot do that. The following needs to be done:

Separate threads from business logic, and keep the business configuration in its own table.

Build a thread scheduling framework on top of java.util.concurrent, including thread state control, an interface for stopping threads, a thread liveness heartbeat mechanism, and a thread exception logging module.

Build a flexible timer component by adding Quartz, to achieve exact-time scheduling.

Combine all of this with the business configuration to build a thread pool task scheduling system, so that thread tasks can be added, monitored, scheduled, and managed through configuration.
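The article does not show the framework's code, so the following is only a guess at what the smallest useful building block might look like: a wrapper that gives a repeating unit of work a state, a stop request, a heartbeat timestamp, and a record of the exception that killed it. ManagedTask and all of its members are hypothetical names.

```java
import java.util.concurrent.atomic.AtomicLong;

class ManagedTask implements Runnable {

    enum State { NEW, RUNNING, STOPPED, FAILED }

    private final Runnable unitOfWork;           // the business logic, run repeatedly
    private volatile State state = State.NEW;
    private volatile boolean stopRequested = false;
    private volatile Throwable lastError;        // what killed the task, if anything
    private final AtomicLong lastHeartbeat = new AtomicLong();

    ManagedTask(Runnable unitOfWork) {
        this.unitOfWork = unitOfWork;
    }

    @Override
    public void run() {
        state = State.RUNNING;
        lastHeartbeat.set(System.currentTimeMillis());
        try {
            while (!stopRequested && !Thread.currentThread().isInterrupted()) {
                unitOfWork.run();
                // Each completed iteration counts as a sign of life.
                lastHeartbeat.set(System.currentTimeMillis());
            }
            state = State.STOPPED;
        } catch (Throwable t) {
            // Record the failure so a monitor can report it instead of the
            // thread dying silently. A real framework would also log it.
            lastError = t;
            state = State.FAILED;
        }
    }

    // Asks the task to stop after the current iteration.
    void requestStop() {
        stopRequested = true;
    }

    State state() {
        return state;
    }

    Throwable lastError() {
        return lastError;
    }

    // True if the task is running and has reported a heartbeat recently enough.
    boolean isAlive(long maxSilenceMs) {
        return state == State.RUNNING
                && System.currentTimeMillis() - lastHeartbeat.get() <= maxSilenceMs;
    }
}
```

A monitor thread could poll state() and isAlive() for every registered task, and a management interface could call requestStop(); submitting the ManagedTask to an ExecutorService keeps thread creation inside the pool.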

The component diagram is:

  

Does a good thread scheduling framework let us cope with any amount of computation? The answer is no. The resources of a single machine are limited. As mentioned above, CPU time is shared out in slices, so with many tasks the queue grows long, and even if you add CPUs, one machine can host only so many. The whole thread pool framework therefore has to become a distributed task scheduling framework that scales horizontally: when one machine hits a resource bottleneck, deploy the scheduling framework and the business code on another machine and computing capacity increases immediately. So how do we build it?

  

On top of jeeframework we wrap Spring, iBATIS, and database operations, so that tasks can call business methods to carry out business processing. The main components are:

A database server, which stores the task set.

A control center, which manages the state of the nodes in the cluster and distributes tasks.

A thread pool scheduling cluster, which executes the tasks distributed by the control center.

A web server, which provides visual operations for assigning, managing, and monitoring tasks.
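To make the control center's role concrete, here is a minimal sketch under stated assumptions: nodes report heartbeats, a node counts as alive if its last heartbeat is recent enough, and tasks are handed to alive nodes in rotation. The article does not describe its distribution policy, so the rotation rule, the class ControlCenter, and its method names are all assumptions.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class ControlCenter {

    private final long timeoutMs; // a node is considered dead after this much silence
    private final Map<String, Long> lastHeartbeat = new LinkedHashMap<>();
    private int next = 0;         // rotation counter for task assignment

    ControlCenter(long timeoutMs) {
        this.timeoutMs = timeoutMs;
    }

    // Called by a node (or on its behalf) to report that it is still running.
    synchronized void heartbeat(String nodeId, long nowMs) {
        lastHeartbeat.put(nodeId, nowMs);
    }

    // Nodes whose last heartbeat is within the timeout, in registration order.
    synchronized List<String> aliveNodes(long nowMs) {
        List<String> alive = new ArrayList<>();
        for (Map.Entry<String, Long> e : lastHeartbeat.entrySet()) {
            if (nowMs - e.getValue() <= timeoutMs) {
                alive.add(e.getKey());
            }
        }
        return alive;
    }

    // Chooses the node that should execute the task, rotating over alive nodes.
    synchronized String assign(String taskId, long nowMs) {
        List<String> alive = aliveNodes(nowMs);
        if (alive.isEmpty()) {
            throw new IllegalStateException("no alive node for task " + taskId);
        }
        String node = alive.get(next % alive.size());
        next++;
        return node;
    }
}
```

Time is passed in as a parameter so the behavior is easy to check; a real control center would read the clock itself and persist node state and assignments in the database server.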

This architecture can usually handle common distributed processing requirements. Its drawback is that, as the number of developers and the business models grow, the single-thread programming model becomes complicated. Take splitting up 10 million records: if that runs in one thread, querying the database alone takes a long time, before any computation. Some will say: split the 10 million records across machines, compute, and merge the results. For this one case, writing a dedicated program is no problem. But what about the next heavy requirement? For example, from all the posts published by all users over the past three years, find the users who posted the most and whose posts were forwarded most by followers, broken down by hour. Writing yet another dedicated program each time is too much trouble. A distributed computing architecture should solve these problems: reduce development complexity and still perform well. A framework that has been very popular recently comes to mind: Hadoop. Hadoop breaks a large computation into pieces, computes them, and merges the results, which is exactly what we want. But anyone who has used it knows that it runs as a separate process. In fact, it is a whole group of processes. How do we combine it with our scheduling framework? See the diagram:

  

The components of the earlier distributed scheduling framework are essentially unchanged. The following components and functions are added:

The distributed scheduling framework is modified so that a thread task can be turned into a MapReduce task and submitted to the Hadoop cluster.

The Hadoop cluster can call the Spring and iBATIS code behind the business interfaces, so business logic can still access the database.

The data Hadoop needs can be queried through Hive.

Hadoop can read from and write to HDFS and HBase.

Business data should be loaded into the Hive warehouse promptly.

Hive handles data that rarely changes, HBase handles frequently updated data, and HDFS is the underlying storage for both Hive and HBase and can also store regular files.
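The decompose, compute, and merge pattern that makes Hadoop attractive here can be illustrated in plain Java on one machine. The sketch below does not use the Hadoop API; it splits a list of post authors into chunks, counts each chunk on a thread pool (the map step), and merges the partial counts (the reduce step). The class SplitComputeMerge, the method countByUser, and the sample data are all invented for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class SplitComputeMerge {

    // Counts posts per author by splitting the input, counting chunks in parallel,
    // and merging the partial results.
    static Map<String, Integer> countByUser(List<String> postAuthors, int chunkSize, int workers) {
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        try {
            // Split and map: one task per chunk produces a partial count.
            List<Future<Map<String, Integer>>> partials = new ArrayList<>();
            for (int from = 0; from < postAuthors.size(); from += chunkSize) {
                int to = Math.min(from + chunkSize, postAuthors.size());
                List<String> chunk = postAuthors.subList(from, to);
                partials.add(pool.submit(() -> {
                    Map<String, Integer> counts = new HashMap<>();
                    for (String author : chunk) {
                        counts.merge(author, 1, Integer::sum);
                    }
                    return counts;
                }));
            }
            // Reduce: merge the partial counts into one result.
            Map<String, Integer> total = new HashMap<>();
            for (Future<Map<String, Integer>> partial : partials) {
                partial.get().forEach((author, n) -> total.merge(author, n, Integer::sum));
            }
            return total;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        List<String> authors = Arrays.asList("ann", "bob", "ann", "cat", "ann", "bob");
        System.out.println(countByUser(authors, 2, 3));
    }
}
```

In a MapReduce job the chunks would be input splits stored on HDFS and the map and reduce steps would run on different machines, but the shape of the computation is the same, which is what lets a framework hide the difference from business code.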

With that, the transformation is essentially complete. Note, however, that architectural design must reduce the complexity of development. Although the Hadoop model is introduced here, it stays hidden from developers inside the framework. A business processing class can run either in standalone mode or on Hadoop, and in both cases it can call Spring and iBATIS. This lowers the learning cost, and developers can pick up the new skill gradually through practice.

