If the current runqueue has no ready process, load balancing is started to move processes over from other CPUs, and the selection is made again (see "Scheduler-related load balancing");
If there is still no ready process, the idle process of this CPU becomes the candidate.
Once next has been selected, if it is being scheduled for the first time since waking from TASK_INTERRUPTIBLE sleep (activated > 0), the scheduler readjusts its priority according to how long it waited on the ready queue, and stores it at its new position in the ready queue (see "Process average wait time sleep_avg").
Besides sleep_avg and prio, next's timestamp is also updated to the current time; it is used at the next switch to compute how long the process ran.
4) External environment
The external environment here means the scheduler's effect on anything other than the processes being scheduled and the ready queue, mainly the handling of the switch count and the update of the CPU state (qsctr).
9. Scheduler support for kernel preemption
In a 2.4 system, a process running in kernel mode gives the scheduler a chance to pick another process only when it voluntarily gives up control by calling schedule(), which is why the Linux 2.4 kernel is said to be non-preemptible. Without preemption, the kernel cannot guarantee a timely response to real-time tasks, so it cannot meet the requirements of a real-time system (not even a soft real-time one).
The 2.6 kernel implements preemption: any code segment not protected by a lock may be interrupted. As far as scheduling is concerned, the implementation mainly adds more points at which the scheduler can run. We know that in the 2.4 kernel the scheduler is started in two ways, actively and passively, and that a passive start can happen only when control returns from kernel mode to user mode; hence the kernel's non-preemptible nature. In 2.6 the scheduler is still started either actively or passively, but the conditions for a passive start are much more relaxed. The change is mainly in entry.S:
......
ret_from_exception:                 # entry point for returning from an exception
    preempt_stop                    # expands to cli: interrupts off, so no preemption while returning from an exception
ret_from_intr:                      # entry point for returning from an interrupt
    GET_THREAD_INFO(%ebp)           # fetch the thread_info of the current task_struct
    movl EFLAGS(%esp), %eax
    movb CS(%esp), %al
    testl $(VM_MASK | 3), %eax
    jz resume_kernel                # fork between "return to user mode" and "return to kernel mode"
ENTRY(resume_userspace)
    cli
    movl TI_FLAGS(%ebp), %ecx
    andl $_TIF_WORK_MASK, %ecx      # (_TIF_NOTIFY_RESUME | _TIF_SIGPENDING
                                    #  | _TIF_NEED_RESCHED)
    jne work_pending
    jmp restore_all
ENTRY(resume_kernel)
    cmpl $0, TI_PRE_COUNT(%ebp)
    jnz restore_all                 # preempt_count is non-zero: preemption is not allowed
need_resched:
    movl TI_FLAGS(%ebp), %ecx
    testb $_TIF_NEED_RESCHED, %cl
    jz restore_all                  # NEED_RESCHED is not set: no need to schedule
    testl $IF_MASK, EFLAGS(%esp)
    jz restore_all                  # interrupts are off: scheduling is not allowed
    movl $PREEMPT_ACTIVE, TI_PRE_COUNT(%ebp)
                                    # preempt_count is set to PREEMPT_ACTIVE to tell the
                                    # scheduler that this call is a preemptive reschedule
    sti
    call schedule
    movl $0, TI_PRE_COUNT(%ebp)     # clear preempt_count
    cli
    jmp need_resched
......
work_pending:                       # also the resched entry when returning from a system call
    testb $_TIF_NEED_RESCHED, %cl
    jz work_notifysig               # no reschedule needed, so we entered work_pending because a signal must be handled
work_resched:
    call schedule
    cli
    movl TI_FLAGS(%ebp), %ecx
    andl $_TIF_WORK_MASK, %ecx
    jz restore_all                  # no work left, so no need to resched
    testb $_TIF_NEED_RESCHED, %cl
    jnz work_resched                # either a reschedule is needed, or a signal is waiting
work_notifysig:
......
Now need_resched may be checked on the way back to either user mode or kernel mode. On a return to kernel mode, a preempt_count of 0 means that the current process allows preemption, and schedule() is then called or not depending on the need_resched state. Because at least the clock interrupt keeps firing in the kernel, once some process sets the current process's need_resched flag, the current process can be preempted right away, whether or not it is willing to give up the CPU, and even if it is a kernel process.
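The three checks made on the return-to-kernel path can be restated in C. This is only a sketch of the decision in the listing above, not kernel code: the struct kret_state type and the function name are made up for illustration.

```c
#include <stdbool.h>

/* Hypothetical snapshot of the state that resume_kernel tests. */
struct kret_state {
    int  preempt_count;  /* value at TI_PRE_COUNT in thread_info */
    bool need_resched;   /* _TIF_NEED_RESCHED set in the thread flags */
    bool irqs_enabled;   /* IF bit set in the saved EFLAGS */
};

/* Returns true when the return-to-kernel path would call schedule(). */
bool kernel_return_preempts(const struct kret_state *s)
{
    if (s->preempt_count != 0)  /* preemption is currently disabled */
        return false;
    if (!s->need_resched)       /* nobody asked for a reschedule */
        return false;
    if (!s->irqs_enabled)       /* interrupted code had interrupts off */
        return false;
    return true;
}
```

All three conditions must hold together; failing any one of them corresponds to a jump to restore_all.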
Timing of the scheduler's work
Besides explicit calls to the scheduler from kernel code, the kernel also starts the scheduler at the following three points, without the application being fully aware of it:
On return from an interrupt or a system call;
When a process re-enables preemption (preempt_enable() calls preempt_schedule());
When a process goes to sleep voluntarily (for example through the wait_event_interruptible() interface).
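The second case, rescheduling when preemption is re-enabled, can be pictured with a nesting counter. The sketch below is a user-space toy under stated assumptions: struct preempt_demo and the demo_* functions are invented names, and a counter stands in for the real preempt_schedule() call.

```c
#include <stdbool.h>

/* Toy per-task state; not the kernel's thread_info. */
struct preempt_demo {
    int  preempt_count;   /* nesting depth of "preemption disabled" */
    bool need_resched;    /* a reschedule has been requested */
    int  schedule_calls;  /* counts simulated preempt_schedule() calls */
};

void demo_preempt_disable(struct preempt_demo *t)
{
    t->preempt_count++;
}

void demo_preempt_enable(struct preempt_demo *t)
{
    t->preempt_count--;
    /* Only the outermost enable may reschedule, and only if asked to. */
    if (t->preempt_count == 0 && t->need_resched) {
        t->schedule_calls++;      /* stands in for preempt_schedule() */
        t->need_resched = false;  /* the request has been served */
    }
}
```

With two nested disables, a reschedule request that arrives in between is served only by the second, outermost enable.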
10. Scheduler-related load balancing
In the 2.4 kernel, when process p is switched out, if some CPU is idle, or is running a process of lower priority than p, then p is dispatched to run on that CPU; this is how the kernel achieves load balance.
Simplicity is the biggest advantage of this approach, but its drawbacks are just as obvious: processes migrate rather frequently, and interactive (or high-priority) processes may keep "jumping" between CPUs. Even in an SMP environment, migrating a process has a cost, and experience with 2.4 systems shows that this way of balancing load does more harm than good. Solving this "SMP affinity" problem was one of the design goals of 2.6.
The 2.6 scheduling system uses a relatively centralized load-balancing scheme, split into two kinds of operation, "push" and "pull":
1) "Pull"
When one CPU is too lightly loaded and another is heavily loaded, the system pulls processes over from the overloaded CPU. This "pull" operation is implemented in the load_balance() function.
load_balance() is called in two ways, depending on whether the current CPU is busy or idle; we call them "busy balance" and "idle balance":
a) Busy balance
Whether the current CPU is busy or idle, the clock interrupt (the rebalance_tick() function) calls load_balance() once every BUSY_REBALANCE_TICK to balance the load; this kind of balancing is called "busy balance".
Linux 2.6 prefers not to balance load whenever it can avoid it, so the decision to "pull" is subject to many restrictions:
Balancing takes place only when the load of the busiest CPU in the system exceeds the load of the current CPU by more than 25%;
The load of the current CPU is taken as the larger of its current real load and its load at the last balancing, which smooths out load troughs;
The load of every other CPU is taken as the smaller of its current real load and its load at the last balancing, which smooths out load peaks;
After the source and destination ready queues have both been locked, the source queue's load is checked again to confirm that it has not decreased; otherwise the balancing is cancelled;
The following three kinds of process in the source ready queue count toward the load but are not actually migrated:
Processes that are currently running;
Processes that are not allowed to migrate to this CPU (according to the cpus_allowed attribute);
Processes for which the difference between the time of the last scheduling event on their CPU (runqueue::timestamp_last_tick, sampled in the clock interrupt) and the time they were switched out (task_struct::timestamp) is below a threshold (the nanosecond value of cache_decay_ticks): such a process is still "active", and its data in the cache has not cooled down enough.
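The three exclusion rules can be written as one predicate. This is a minimal sketch with invented names (struct migr_task, demo_can_migrate); it assumes fewer than 32 CPUs so that a 32-bit mask is enough, and that a task's timestamp is never later than the source queue's last tick.

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy task descriptor; the fields mirror the attributes named in the text. */
struct migr_task {
    bool     running;       /* currently executing on the source CPU */
    uint32_t cpus_allowed;  /* bit n set: the task may run on CPU n */
    uint64_t timestamp;     /* time the task was last switched out */
};

/*
 * Returns true if the task may be pulled to this_cpu.
 * src_last_tick plays the role of the source runqueue's timestamp_last_tick,
 * cache_decay the role of the cache_decay_ticks threshold (same time unit).
 */
bool demo_can_migrate(const struct migr_task *p, int this_cpu,
                      uint64_t src_last_tick, uint64_t cache_decay)
{
    if (p->running)
        return false;                               /* rule 1: running */
    if (!(p->cpus_allowed & (UINT32_C(1) << this_cpu)))
        return false;                               /* rule 2: not allowed here */
    if (src_last_tick - p->timestamp < cache_decay)
        return false;                               /* rule 3: still cache-hot */
    return true;
}
```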
Load history
To avoid contention, the scheduler saves the load (the number of ready processes) of every CPU in the system at each balancing in the corresponding element of the prev_cpu_load array in this CPU's ready queue; this history is consulted when computing the current load.
Once the busiest CPU (the source CPU) has been found, the number of processes to migrate is set to half the difference between the source CPU's load and this CPU's load (both smoothed with the history above). Migration then proceeds from the expired queue to the active queue, and from low-priority processes to high-priority ones. In practice fewer processes are usually migrated than planned, because the three kinds of "not actually migrated" process above take no part in the migration.
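The smoothing and the "half the difference" rule can be put into numbers. The function below is a sketch of the rules as stated in this section, not the kernel's own arithmetic; its name and parameters are invented for illustration.

```c
/*
 * Number of tasks a busy balance would plan to pull, following the rules above.
 * this_now / this_prev: current and last-balance load of the local CPU.
 * src_now  / src_prev:  current and last-balance load of the candidate source CPU.
 */
int demo_busy_pull_count(int this_now, int this_prev, int src_now, int src_prev)
{
    /* Local load: the larger value, which smooths out troughs. */
    int this_load = this_now > this_prev ? this_now : this_prev;
    /* Remote load: the smaller value, which smooths out peaks. */
    int src_load = src_now < src_prev ? src_now : src_prev;

    /* Balance only if the source exceeds the local load by more than 25%. */
    if (src_load * 100 <= this_load * 125)
        return 0;

    /* Plan to move half of the difference. */
    return (src_load - this_load) / 2;
}
```

For example, a local load of 1 against a steady remote load of 8 plans 3 migrations, while 4 against 5 is within the 25% margin and plans none.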
b) Idle balance
Load balancing in the idle state is invoked at two points:
In the scheduler, when the ready queue of this CPU is empty;
In the clock interrupt, when the ready queue of this CPU is empty and the current absolute time (the jiffies value) is a multiple of IDLE_REBALANCE_TICK (that is, once every IDLE_REBALANCE_TICK).
Here load_balance() does something simpler: it looks for the CPU with the largest current real load (the largest runqueue::nr_running) and migrates the "most suitable" (see below) ready process from it to the current CPU.
The criteria for choosing the candidate process in "idle balance" are similar to those of "busy balance", but because idle balance only "pulls" one process over, the action is much smaller and it runs much more often (idle balance runs 200 times as often as busy balance), so neither load history nor load difference is considered, and the cache activity of the candidate process is not considered either.
A problem in the algorithm for finding the busiest CPU
In fact, a CPU can become the source of a balance only if its load is at least larger than the load of the current CPU, so if max_load in the find_busiest_queue() function were initialized to nr_running (while making sure that it is at least 1), slightly less computation would be needed.
c) pull_task()
The actual pulling of a process is done in this function. After the process has moved from the source ready queue to the destination ready queue, pull_task() updates its timestamp attribute so that it continues to describe, relative to this CPU, the time at which the process was switched out. If the pulled process has a higher priority than the process running on this CPU, the current process's need_resched bit is set so that it waits to be rescheduled.
2) "Push"
a) migration_thread()
Corresponding to "pull", the 2.6 load-balancing system also has a "push" operation, carried out by a kernel process called migration_thread(). This process is loaded automatically at system start-up (one per CPU) and sets itself to be a SCHED_FIFO real-time process; it then checks whether runqueue::migration_queue holds any request to process, and if not, sleeps in TASK_INTERRUPTIBLE until it is woken and checks again.
Requests are added to migration_queue only in set_cpus_allowed(). When a process calls set_cpus_allowed() (for example when a CPU is shut down through APM) to change the set of CPUs a process may currently use, so that the process can no longer run on its current CPU, it builds a migration request structure migration_req_t, inserts it into the migration_queue of the ready queue of the CPU the process is on, and then wakes that ready queue's migration daemon (recorded in the runqueue::migration_thread attribute), which migrates the process to a suitable CPU (see "New data structure runqueue").
In the current implementation the destination CPU is chosen without regard to load; it is simply "any_online_cpu(req->task->cpus_allowed)", that is, the first allowed CPU in CPU-number order. So, unlike load_balance(), which is closely tied to the scheduler and the load-balancing policy, migration_thread() should be regarded merely as an interface for functions such as CPU binding and CPU power management.
b) move_task_away()
The actual migration is done in the move_task_away() function. After the process has entered the destination ready queue, its timestamp is updated to the timestamp_last_tick of the destination CPU's ready queue, indicating that the process is only now starting to wait (on the destination CPU). Because a "push" reads locally and writes remotely (the opposite of pull_task()), starting a reschedule on the remote CPU requires synchronizing with it, and may require notifying the destination CPU through an IPI (inter-processor interrupt); all of this is implemented in the resched_task() function.
Locking two runqueues
Migrating a process involves the ready queues of two CPUs, and both usually have to be locked before the operation. To avoid deadlock, the kernel provides the double_rq_lock()/double_rq_unlock() functions, which guarantee a fixed locking order. These functions do not touch IRQs, so disabling and enabling interrupts is left to the caller.
move_task_away() uses this pair of functions, while pull_task() uses double_lock_balance(), which works on the same principle as double_rq_lock()/double_rq_unlock().
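A fixed locking order is what rules out the deadlock in which two CPUs each hold one queue and wait for the other. One common way to get such an order, assumed here for illustration, is to always take the lock at the lower address first. The sketch below uses an invented struct demo_rq with a plain flag instead of a real spinlock, and records the acquisition order so it can be inspected.

```c
#include <stddef.h>
#include <stdint.h>

/* Toy runqueue: a flag stands in for the spinlock. */
struct demo_rq {
    int locked;
};

/*
 * Lock two runqueues in address order and report the order used.
 * order[0] is locked first; order[1] is NULL when both arguments are the same queue.
 */
void demo_double_rq_lock(struct demo_rq *rq1, struct demo_rq *rq2,
                         struct demo_rq *order[2])
{
    if (rq1 == rq2) {
        order[0] = rq1;          /* same queue: lock it once */
        order[1] = NULL;
    } else if ((uintptr_t)rq1 < (uintptr_t)rq2) {
        order[0] = rq1;
        order[1] = rq2;
    } else {
        order[0] = rq2;          /* swap so the lower address comes first */
        order[1] = rq1;
    }
    order[0]->locked = 1;
    if (order[1] != NULL)
        order[1]->locked = 1;
}
```

Whichever way round the caller passes the two queues, the same queue is locked first, so two CPUs balancing against each other cannot each end up holding the lock the other needs.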
11. Scheduling under NUMA
From the Linux scheduler's point of view, the main difference between NUMA and SMP is that under NUMA the CPUs are organized into nodes. The number of CPUs per node differs between architectures; on the 2.6 i386 platform, for example, each node of the NUMAQ architecture can be configured with 16 CPUs, and each node of the Summit architecture with 32 CPUs. NUMA was formally supported in the Linux kernel starting with 2.6; before that, Linux supported NUMA through the existing "discontiguous memory" (CONFIG_DISCONTIGMEM) architecture. Apart from special handling in memory allocation, earlier kernels treated NUMA like SMP as far as scheduling was concerned. The 2.6 scheduler considers not only the load of individual CPUs but also the load of each node under NUMA.
NUMA receives two kinds of special handling in the new kernel: one balances the NUMA nodes during load balancing, and the other picks the CPU to run on from the most lightly loaded node when the system executes a new program (do_execve()):
1) balance_node()
Balancing between nodes runs as part of the rebalance_tick() function, before load_balance() (whose working set at that point is the CPUs within the node). In other words, NUMA does not simply balance the load of all CPUs in the system; it first balances the load between nodes and then balances the load within a node. It too is split into a "busy balance" and an "idle balance" step, run at intervals of IDLE_NODE_REBALANCE_TICK (5 times IDLE_REBALANCE_TICK in the current implementation) and BUSY_NODE_REBALANCE_TICK (twice BUSY_REBALANCE_TICK in the current implementation).
balance_node() first calls find_busiest_node() to find the busiest node in the system, and then runs load_balance() on the CPU set made up of that node and this CPU.
The algorithm for finding the busiest node involves several data structures:
node_nr_running[MAX_NUMNODES]: indexed by node number, it records the number of ready processes on each node, that is, the real-time load of that node. This array is a global data structure and has to be accessed through the atomic family of functions.
runqueue::prev_node_load[MAX_NUMNODES]: recorded in the ready-queue data structure, it holds the load of each node of the system at the last load-balancing operation, corrected according to the following formula:
Current load = last load / 2 + 10 * current real-time load / number of CPUs in the node
This both smooths load peaks and allows for nodes having different numbers of CPUs.
NODE_THRESHOLD: the load weight, defined as 125. The load of the busiest node chosen must exceed 125/100 of the current node's load, that is, the load difference must exceed 25%.
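The formula and the threshold can be tried out with small numbers. The following is a sketch built only from the description above; the function names and the array-based interface are assumptions, and integer division is used throughout.

```c
#define DEMO_NODE_THRESHOLD 125  /* busiest node must exceed 125/100 of the local load */

/* Smoothed node load: last load / 2 + 10 * running tasks / CPUs in the node. */
int demo_node_load(int prev_load, int nr_running, int nr_cpus)
{
    return prev_load / 2 + 10 * nr_running / nr_cpus;
}

/*
 * Returns the index of the busiest node whose load exceeds the threshold
 * relative to this_node, or -1 if no node qualifies.
 * load[] holds the smoothed load of each of the nr_nodes nodes.
 */
int demo_find_busiest_node(int this_node, const int *load, int nr_nodes)
{
    int best = -1;
    int best_load = 0;
    int i;

    for (i = 0; i < nr_nodes; i++) {
        if (i == this_node)
            continue;
        if (load[i] * 100 > load[this_node] * DEMO_NODE_THRESHOLD &&
            load[i] > best_load) {
            best = i;
            best_load = load[i];
        }
    }
    return best;
}
```

With a previous load of 20 and 8 ready processes on a 4-CPU node, the smoothed load is 20/2 + 80/4 = 30.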
2) sched_balance_exec()
When the execve() system call loads another program and starts running it, the kernel looks for the most lightly loaded CPU in the most lightly loaded node of the system (sched_best_cpu()), and then calls sched_migrate_task() to migrate the process to the chosen CPU. This is done by do_execve() calling sched_balance_exec().
sched_best_cpu() chooses as follows:
If the current CPU has no more than 2 ready processes, no migration is done;
Node load is computed as (10 * current real-time load / number of CPUs in the node), without regard to load history;
Within a node, the actual number of ready processes on each CPU is used as the load indicator, again without regard to load history.
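The three selection rules can be sketched over flat arrays. Everything here is illustrative: demo_best_cpu, DEMO_MAX_NODES and the array layout are invented, and keeping the process where it is when loads tie is an assumption of this sketch rather than something the text states.

```c
#define DEMO_MAX_NODES 8  /* assumed upper bound on nr_nodes for this sketch */

/*
 * Pick a CPU for a freshly exec'ed process.
 * nr_running[cpu]: ready processes on each CPU; node_of[cpu]: node of each CPU.
 * Requires nr_nodes <= DEMO_MAX_NODES.
 */
int demo_best_cpu(int this_cpu, const int *nr_running, const int *node_of,
                  int nr_cpus, int nr_nodes)
{
    int sum[DEMO_MAX_NODES] = {0};  /* ready processes per node */
    int cnt[DEMO_MAX_NODES] = {0};  /* CPUs per node */
    int cpu, node, best_node, best_cpu;

    /* Rule 1: a lightly loaded current CPU keeps the process. */
    if (nr_running[this_cpu] <= 2)
        return this_cpu;

    for (cpu = 0; cpu < nr_cpus; cpu++) {
        sum[node_of[cpu]] += nr_running[cpu];
        cnt[node_of[cpu]]++;
    }

    /* Rule 2: lightest node by 10 * load / CPUs; ties stay on the current node. */
    best_node = node_of[this_cpu];
    for (node = 0; node < nr_nodes; node++) {
        if (cnt[node] == 0)
            continue;
        if (10 * sum[node] / cnt[node] < 10 * sum[best_node] / cnt[best_node])
            best_node = node;
    }

    /* Rule 3: lightest CPU in that node by plain nr_running. */
    best_cpu = (best_node == node_of[this_cpu]) ? this_cpu : -1;
    for (cpu = 0; cpu < nr_cpus; cpu++) {
        if (node_of[cpu] != best_node)
            continue;
        if (best_cpu < 0 || nr_running[cpu] < nr_running[best_cpu])
            best_cpu = cpu;
    }
    return best_cpu;
}
```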
Just as "busy balance" and "idle balance" use different load criteria, sched_balance_exec() uses a different (simpler) criterion from balance_node().
sched_migrate_task() borrows the migration_thread service process to carry out the migration: it temporarily sets the process's cpus_allowed so that the process may run only on the destination CPU, wakes migration_thread to migrate the process to the destination CPU, and then restores the cpus_allowed attribute.
12. Real-time performance of the scheduler
1) Enhancements in 2.6 for real-time applications
The 2.6 kernel scheduling system has two new features that matter to real-time applications: kernel preemption and O(1) scheduling, both of which ensure that a real-time process gets a response within a predictable time. This "bounded-time response" meets the requirements of soft real-time, but is still some distance from hard real-time, which needs an "immediate response". In addition, the 2.6 scheduling system still provides no preemption for resources other than the CPU, so its real-time behaviour has not fundamentally improved.
2) Priority of real-time processes
In 2.4, the priority of a real-time process is given by the rt_priority attribute, which is separate from that of non-real-time processes. 2.6 introduces a dynamic priority attribute alongside the static priority and uses it to represent the priority of both real-time and non-real-time processes.
From the analysis above, the static priority of a process is the basis for computing its initial time slice, while the dynamic priority determines its actual scheduling order. For real-time and non-real-time processes alike, static priority is set and changed through set_user_nice(), and defaults to MAX_PRIO-20; that is, the time slice of a real-time process is of the same order of magnitude as that of a non-real-time process.
A real-time process can be told from a non-real-time one in two places: the scheduling policy, policy (SCHED_RR or SCHED_FIFO), and the dynamic priority, prio (less than MAX_USER_RT_PRIO); in practice the latter is used as the test. The dynamic priority of a real-time process is set in setscheduler() (it corresponds to rt_priority) and does not change as the process runs, so real-time processes are always ordered strictly by their configured priority, which is where their dynamic priority differs in meaning from that of non-real-time processes. One can say that the static priority of a real-time process is used only to compute its time slice, while its dynamic priority plays the role of a static priority.
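The relationship between the priority values can be made concrete. The constants below (MAX_USER_RT_PRIO of 100, MAX_PRIO of 140) and the mapping from rt_priority to prio are assumptions about the 2.6 sources rather than values given in this text, and the demo_* names are invented; treat this as a sketch to check against the kernel headers.

```c
/* Assumed values; verify against the kernel headers before relying on them. */
#define MAX_USER_RT_PRIO 100
#define MAX_RT_PRIO      MAX_USER_RT_PRIO
#define MAX_PRIO         (MAX_RT_PRIO + 40)

/* The test the text describes: a process is real-time if prio is below the RT bound. */
int demo_rt_task(int prio)
{
    return prio < MAX_USER_RT_PRIO;
}

/* Default static priority, shared by real-time and non-real-time processes. */
int demo_default_static_prio(void)
{
    return MAX_PRIO - 20;
}

/*
 * Assumed mapping from rt_priority to the fixed dynamic priority of a
 * real-time process: a larger rt_priority gives a smaller (stronger) prio.
 */
int demo_rt_dynamic_prio(int rt_priority)
{
    return MAX_USER_RT_PRIO - 1 - rt_priority;
}
```

Under these assumptions the default static priority is 120, and every rt_priority from 1 to 99 maps to a prio below 100, so the prio test alone is enough to recognise a real-time process.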
3) Real-time scheduling
The two real-time scheduling policies of 2.4, SCHED_RR and SCHED_FIFO, are unchanged in 2.6, and both kinds of real-time process stay in the active ready queue. The only difference is that because the 2.6 kernel is preemptible, real-time processes (especially kernel-level ones) can respond more quickly to changes in the environment, such as the appearance of a higher-priority process.
13. Postscript: Linux development as seen from the scheduler
In recent years Linux has shown growing interest and competitiveness in desktop systems, low-end servers, high-end servers and embedded systems. For an operating system still developed in the "bazaar" style, that is something of a miracle.
Yet the implementation of the scheduling system leaves me with the feeling that Linux's strength still lies in desktop systems. It keeps the "self-centred" character of its early days: free-software developers are driven, to a large extent, by the wish to change what they themselves find "not good enough" in the existing system. For all sorts of motives and incentives, Linux has shown a strong desire to compete with commercial operating systems such as Windows, but from the developers' side this desire is at odds with the nature of free-software development.
Ingo Molnar said in an interview that the O(1) scheduling algorithm he designed was basically his own invention, without reference to existing scheduling algorithms from the market or from research. The design of the scheduler shows that the 2.6 scheduling system attends to many details but has no clear main thread, and its performance under the O(1) model cannot be (or was not meant to be) analysed theoretically. The 2.6 development history also shows that the various scheduling-related weights were fine-tuned from version to version, so the performance of the 2.6 scheduling system can be regarded as having been tuned mainly by measurement.
This is the typical Linux development model: plenty of passion, little planning.
In the Linux market the most urgent and most active demand comes from embedded systems, yet, judging at least by the scheduling system, 2.6 made little effort in that direction; perhaps the developers themselves have little feeling for it or interest in it. What is certain is that although Linux is hot in the market, its development is still not market-driven. That may hurt Linux's competitiveness, but perhaps it is also what keeps Linux development alive.
On March 1 of this year (2004), the well-known open-source network security project FreeS/WAN announced the end of its development, mainly because the developers' intentions did not match the users' needs. Users of a network security system care more about the system being complete and powerful than about its forward-looking features, so Opportunistic Encryption (OE), the headline feature of the new FreeS/WAN releases, did not attract enough users to test it. Because of this, the investors stopped funding the FreeS/WAN project, and this system, which provided strong network security support for open-source systems, may once again go underground.
So far, Linux development has not been heard to depend on any commercial funding, so by comparison it is freer and more arbitrary. The people who promote Linux and the people who develop it are largely independent of each other, and those who profit from Linux are not closely tied to its developers. For Linux this may be a blessing rather than a curse.
Resources
[1] [Linus Torvalds, 2004]
Linux kernel source v2.6.2, www.kernel.org
[2] [PUBB@163.NET, 2004]
Linux 2.4 scheduling system analysis, IBM developerWorks
[3] [Ingo Molnar, 2002]
Goals, Design and Implementation of the new ultra-scalable O(1) scheduler, Linux documentation, sched-design.txt
[4] [Anand K Santhanam, 2003]
Toward Linux 2.6, IBM developerWorks
[5] [Robert Love, 2003]
Linux Kernel Development, SAMS
[6] [ , 2003]
2.5.62 SMP notes, www.linux-forum.net, kernel technology board
[7] [Vinayak Hegde, 2003]
The Linux Kernel, Linux Gazette, issue 89, April 2003
[8] [Rick Fujiyama, 2003]
Analyzing the Linux scheduler's tunables, kerneltrap.org