Migrate the Linux Kernel from 2.4 to 2.6 on iSeries and pSeries


Matt Davis, Linux consultant, IBM
Chakarat Skawratanond, technical consultant, IBM
Nikolay Yevik, Linux consultant, IBM

In this article, we focus on the differences between the Linux 2.4 and 2.6 kernels on POWER.
Differences between Linux kernel 2.4 and 2.6 on POWER

Module Subsystem, Unified Device Model, and PnP support

The module subsystem has undergone major changes.

Improved Stability

The process of loading and unloading kernel modules has been improved: the kernel now either prevents unloading a module that is in use outright, or at least greatly reduces the chance of a module being used while it is being loaded or removed (which sometimes caused system crashes).

Unified Device Model

Creating a unified device model is one of the most important changes in the 2.6 kernel. It promotes standardization of driver interfaces and aims at better control and management of devices, for example:

More accurate identification of system devices.
Power Management and device power status.
Improved system bus structure management.

Plug-and-play (PnP) Support

The changes described in the two preceding sections make Linux running the 2.6 kernel a true plug-and-play OS. For example, 2.6 provides PnP support for ISA PnP extensions, the legacy MCA and EISA buses, and hot-plug devices.

Kernel infrastructure changes

To distinguish kernel modules from regular object files (which keep the .o extension), kernel modules now use the .ko extension; a minimal module sketch follows this list.
A new sysfs file system has been added; it describes the device tree as the kernel discovers devices.
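
As a minimal illustrative sketch (not code from the original article), the following 2.6-style module builds to hello.ko with the kbuild system (a one-line Makefile containing obj-m := hello.o) and shows the module_init/module_exit entry points:

    /* hello.c - a minimal 2.6-style kernel module that builds to hello.ko.
     * Illustrative sketch only; not from the original article. */
    #include <linux/init.h>
    #include <linux/module.h>

    MODULE_LICENSE("GPL");

    static int __init hello_init(void)
    {
        printk(KERN_INFO "hello: module loaded\n");
        return 0;               /* a nonzero return aborts loading */
    }

    static void __exit hello_exit(void)
    {
        printk(KERN_INFO "hello: module unloaded\n");
    }

    module_init(hello_init);
    module_exit(hello_exit);

After insmod hello.ko, the module appears in sysfs under /sys/module/hello, and the improved reference counting described above refuses an rmmod while the module is in use.
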
Memory support, NUMA support

Support for more RAM

The 2.6 kernel supports more RAM: up to 64 GB on 32-bit systems in paged (PAE) mode.

NUMA

Support for Non-Uniform Memory Access (NUMA) systems, where memory access cost depends on which processor touches which memory, is a new feature in kernel 2.6.
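
As an illustrative sketch (assuming the libnuma development library from the numactl package, which the article does not mention, is installed), a program can place its memory on a chosen node:

    /* numa_demo.c - allocate memory on a specific NUMA node via libnuma.
     * Illustrative sketch; libnuma (numactl) is an assumed dependency.
     * Build: gcc numa_demo.c -lnuma */
    #include <stdio.h>
    #include <numa.h>

    int main(void)
    {
        if (numa_available() < 0) {
            fprintf(stderr, "NUMA is not available on this system\n");
            return 1;
        }
        int node = 0;                      /* place memory on node 0 */
        size_t size = 4 * 1024 * 1024;     /* 4 MB */
        void *buf = numa_alloc_onnode(size, node);
        if (buf == NULL) {
            perror("numa_alloc_onnode");
            return 1;
        }
        printf("allocated %zu bytes on node %d (max node: %d)\n",
               size, node, numa_max_node());
        numa_free(buf, size);
        return 0;
    }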

Thread model, NPTL

NPTL (Native POSIX Threading Library) is newly introduced in version 2.6, replacing the LinuxThreads of v2.4. NPTL brings enterprise-class threading to Linux, with far better performance than LinuxThreads. It is based on a 1:1 ratio between user threads and kernel threads.

In October 2003, NPTL support was merged into the GNU C library, glibc, and Red Hat was the first to ship NPTL, on a customized v2.4 kernel, in Red Hat Linux 9 and Red Hat Enterprise Linux.
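
As a quick check (an illustrative sketch; the confstr name requires glibc 2.3.2 or later), a program can report which threading implementation the running glibc provides:

    /* threads_check.c - report whether the C library uses LinuxThreads
     * or NPTL.  Illustrative sketch, not from the article.
     * Build: gcc threads_check.c */
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        char buf[128];
        /* _CS_GNU_LIBPTHREAD_VERSION is provided by glibc 2.3.2+ */
        size_t n = confstr(_CS_GNU_LIBPTHREAD_VERSION, buf, sizeof(buf));
        if (n == 0) {
            fprintf(stderr, "confstr failed; glibc may be too old\n");
            return 1;
        }
        printf("threading implementation: %s\n", buf);  /* e.g. "NPTL 2.3.3" */
        return 0;
    }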

Performance Improvement

New scheduler algorithm

The 2.6 Linux kernel introduces a new O(1) scheduler algorithm that performs very well under high load. The new scheduler keeps per-CPU run queues and time-slice accounting, which eliminates global synchronization and reallocation cycles and thus improves performance.
Kernel preemption
The new 2.6 kernel is preemptible: a higher-priority task can interrupt kernel code. This significantly improves responsiveness for interactive and multimedia applications.

I/O performance improvement

The Linux I/O subsystem has also undergone major changes. The reworked I/O scheduler ensures that no process waits too long in the queue for input/output, so I/O responses are faster.
Fast user-space mutexes (futexes)
Futexes serialize threads, avoiding race conditions, while improving response time. A futex is only partially implemented in kernel space: the uncontended case is handled entirely in user space, and the kernel path allows contention-based improvements such as honoring the priority of waiting tasks.

Scalability Improvement

More processors
Linux kernel 2.6 supports up to 64 CPUs.
Support for larger memory
Thanks to PAE (Physical Address Extension), the memory supported in paged mode on 32-bit systems increases to 64 GB.
Users and groups
The number of unique users and groups grows from about 65,000 to more than 4 billion, that is, from 16-bit to 32-bit IDs.
Number of PIDs
The maximum number of PIDs grows from 32,000 to 1 billion.
Number of open file descriptors
The limit on open file descriptors is unchanged, but it no longer has to be set in advance; it adjusts itself (the sketch below reads both of these limits).
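
As a hedged illustration (these are standard interfaces, not mentioned in the article), the following reads the per-process file-descriptor limit and the system-wide PID ceiling:

    /* limits_demo.c - inspect two of the scalability limits above.
     * Illustrative sketch, not from the article.  Build: gcc limits_demo.c */
    #include <stdio.h>
    #include <sys/resource.h>

    int main(void)
    {
        struct rlimit rl;
        FILE *f;
        long pid_max = 0;

        /* per-process open-file limit; raise it with setrlimit() */
        if (getrlimit(RLIMIT_NOFILE, &rl) == 0)
            printf("open fds: soft=%llu hard=%llu\n",
                   (unsigned long long)rl.rlim_cur,
                   (unsigned long long)rl.rlim_max);

        /* system-wide PID ceiling, tunable at run time on 2.6 */
        f = fopen("/proc/sys/kernel/pid_max", "r");
        if (f != NULL) {
            if (fscanf(f, "%ld", &pid_max) == 1)
                printf("pid_max: %ld\n", pid_max);
            fclose(f);
        }
        return 0;
    }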

Supports more devices

Before Linux kernel 2.6, the kernel imposed limits that hurt large systems, such as 256 devices per chain. The 2.6 kernel breaks through these restrictions, supporting not only more types of devices but also more devices of each type: Linux 2.6 allows 4,095 major device numbers, each of which can have more than one million minor (sub-)devices.
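
For illustration (the helper macros are standard C library interfaces, not named in the article), major(), minor(), and makedev() work with the widened device numbers:

    /* devnum_demo.c - 2.6 widens device numbers to a 12-bit major and
     * 20-bit minor; major()/minor()/makedev() decode and encode them.
     * Illustrative sketch, not from the article.  Build: gcc devnum_demo.c */
    #include <stdio.h>
    #include <sys/types.h>
    #include <sys/stat.h>
    #include <sys/sysmacros.h>

    int main(void)
    {
        struct stat st;
        if (stat("/dev/null", &st) != 0) {
            perror("stat");
            return 1;
        }
        printf("/dev/null is major %u, minor %u\n",
               major(st.st_rdev), minor(st.st_rdev));

        /* a large minor number that only fits in the 2.6 encoding */
        dev_t big = makedev(4095, 1048575);
        printf("encoded dev_t: major %u minor %u\n", major(big), minor(big));
        return 0;
    }
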
File system size
The maximum size of an addressable file system in Linux kernel 2.6 is 16 TB.

File System

Traditional Linux file systems such as ext2, ext3, and ReiserFS have been significantly improved. The most noteworthy improvement is the introduction of extended attributes (file metadata). Most important of all is the implementation of POSIX ACLs, an extension of the ordinary UNIX permissions that supports much finer-grained user access control.
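
As a sketch of what POSIX ACLs look like from C (assuming the libacl library and an ACL-enabled file system, neither of which the article names), the following prints a file's access ACL:

    /* acl_demo.c - read the POSIX ACL of a file via libacl.
     * Illustrative sketch; libacl and a file system mounted with ACL
     * support are assumed.  Build: gcc acl_demo.c -lacl */
    #include <stdio.h>
    #include <sys/acl.h>

    int main(int argc, char **argv)
    {
        const char *path = (argc > 1) ? argv[1] : "/etc/passwd";
        acl_t acl = acl_get_file(path, ACL_TYPE_ACCESS);
        if (acl == NULL) {
            perror("acl_get_file");
            return 1;
        }
        char *text = acl_to_text(acl, NULL);  /* "user::rw-\ngroup::r--..." */
        if (text != NULL) {
            printf("%s", text);
            acl_free(text);
        }
        acl_free(acl);
        return 0;
    }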

In addition to the improved support for the traditional Linux File System, the new kernel fully supports the relatively new XFS File System in Linux.

The Linux 2.6 kernel also introduces improved support for the NTFS file system; NTFS file systems can now be mounted read/write.

New features for POWER5 Linux releases

The Linux distributions that run on POWER5-based systems are SUSE LINUX Enterprise Server 9 (SLES 9) and Red Hat Enterprise Linux Advanced Server 3 with the third update applied (RHEL AS 3 Update 3). Both will be available in 2004. SLES 9 is based on Linux kernel 2.6; RHEL AS 3 Update 3 is based on Linux kernel 2.4. Both SLES 9 and RHEL AS 3 Update 3 can also run on POWER4 hardware. The following table highlights the two distributions' support for POWER5 features.

Feature                                        SLES 9          RHEL AS 3 Update 3
Dynamic LPAR
-- processors                                  Supported       Not supported
-- memory                                      Not supported   Not supported
-- I/O                                         Supported       Not supported
Maximum of 254 partitions                      Supported       Supported
Sub-processor partitions (0.1 granularity)     Supported       Supported
Capped and uncapped partitions                 Supported       Supported
Simultaneous multithreading                    Supported       Supported
Storage options
-- virtual SCSI server                         Not supported   Not supported
-- virtual SCSI client (native i5/OS on i5,
   installed AIX 5.3 on p5)                    Supported       Supported
Communication options
-- virtual LAN                                 Supported       Supported
Large page support                             Supported       Not supported
PCI hot plug                                   Supported       Not supported
SUE machine check handling                     Supported       Supported

The following describes the features in detail.

Dynamic logical partition (Dynamic LPAR)

Logical partitions allow multiple operating systems to coexist on the same hardware platform. System resources can be divided so that the partitions do not affect one another. You use the Hardware Management Console (HMC) to manage the LPARs in the system. With dynamic LPAR, resources can be added to and removed from a partition without restarting it; when additional resources are needed, the administrator reconfigures the system to recognize them. The maximum number of logical partitions supported depends on the number of processors in the server model; the system limit is 254. Whether dynamic LPAR can be used ultimately depends on the Linux distributor and its adoption of the 2.6 kernel. SLES 9 supports dynamic movement of processors and I/O. RHEL AS 3 Update 3 does not support dynamic LPAR.

Sub-processor Partition

The minimum unit that can be configured for any partition using a shared processor is 0.10 processing units. A group of physical processors that can be shared by multiple logical partitions is called a shared processing pool. The shared processor feature allows you to allocate a portion of the processor to a logical partition.

Figure 1 shows an example of a shared processor pool environment. It depicts a hypothetical 4-way machine running i5/OS or AIX plus three other logical partitions. Assume the second partition is a transaction server that processes financial transactions, and that the transaction-processing application stores and retrieves its information in a database on AIX or i5/OS. The partition labeled "Report" runs the transaction server's sister application, which generates financial reports. The company separates the transaction and report partitions for load balancing: the transaction server is time- and response-sensitive, while report generation can wait for off-peak hours. The last partition is the company's development and test partition, the workspace for its engineers. Note how processing capacity is allocated among the four partitions according to workload.


Figure 1. Example of a shared processor





Partitions in the shared processing pool run in either capped or uncapped mode. A capped partition cannot exceed the processing capacity assigned to it; unused capacity in the shared processing pool is available only to uncapped partitions. You specify capped or uncapped when defining a partition's attributes, and you can also set minimum and maximum processor values, expressed as whole numbers or fractions. This fits the example described above: Figure 2 shows an evolution of the previous example, now with minimum and maximum values specified.


Figure 2. Dynamically moving processor capacity based on workload





In this example, processors can be moved dynamically as needed, and the benefits are obvious. The workloads of the transaction and report partitions make clear why dynamic processor allocation is necessary. The transaction server is assigned half a processor as its minimum; as needed, it can also use the remaining half of a second virtual processor and all of a third. If reports are generated during off-peak hours, when the system has spare capacity, the report partition and its applications can use up to two virtual processors, and never less than its 0.75-processor minimum. The same holds for the test partition: when engineers need to compile their applications and idle capacity is available, the test partition can use up to three virtual processors so that compilation finishes faster.

These processor-sharing examples assume the system has idle capacity that other partitions can use. There will, however, be times when multiple partitions request more processing capacity at once; for example, a surge in workload may make both the report and transaction partitions ask for more processor resources. Because timely response from the transaction server is the most critical to the business, you would rather give it more of the spare capacity than the report server. This is where weights for processing capacity become important.

The uncapped weight is a number from 0 to 255 that you can set for each uncapped partition in the shared processing pool. With 255 as the highest weight, available idle capacity is distributed to competing logical partitions in proportion to their uncapped weights. The default weight is 128.


Figure 3. Weights determine the distribution of idle processors





When load surges on both the transaction and report servers, the weights you set determine how processors are allocated. In Figure 3, the transaction server's weight is 2 and the report server's is 1; therefore, during peak periods, whenever three processing units become available, two go to the transaction server and one to the report server.

Simultaneous multithreading

One of the features of the POWER5 architecture is simultaneous multithreading (SMT). The POWER4 microprocessor fetches a group of instructions (up to five) each clock cycle and completes a group each cycle. The POWER5 microprocessor doubles that throughput: it fetches two groups of instructions (up to five each) per clock cycle and completes two groups per cycle. Both SLES 9 and RHEL AS 3 Update 3 support this technology.

Storage Options

For storage and I/O, Linux can use a variety of real and virtual devices. This flexibility makes it possible to configure Linux partitions efficiently. For disks, Linux logical partitions support three storage options.

Internal storage dedicated to the partition, using SCSI adapters and drives.
External storage dedicated to the partition, using SAN adapters.
Virtual storage served from another partition, using the virtual SCSI adapter.

Virtual Disk

Virtual disks let multiple partitions in a POWER5-based system share storage. One partition, the I/O server partition, owns the physical adapter and the storage (which may be internal or external). Virtual adapters allow other partitions, the I/O client partitions, to use the I/O server partition's storage. The I/O server can be AIX or i5/OS. Both SLES 9 and RHEL AS 3 Update 3 support this.


Figure 4. AIX or i5/OS can provide virtual disks for Linux partitions





Figure 4 illustrates how a host partition provides a virtual disk to a Linux partition. The advantage of virtual disks is not only saving the cost of disk drives: adding disks and controllers to a smaller machine can be a problem, possibly requiring the purchase of an expansion unit. In addition, the host system can manage, back up, and quickly copy virtual disks.

CD-ROM, tape, and DVD-ROM

You can also share the SCSI drives of AIX 5.3 or i5/OS with Linux partitions, much like the virtual disk function. If AIX or i5/OS has a CD-ROM, tape, or DVD-RAM drive, Linux can use the device as if it were physically attached to the Linux partition, as long as the host partition is not using it. The benefits of virtual SCSI devices are much the same as those of virtual disks: lower hardware cost and no need to configure devices for every partition.

Communication options

Linux on a POWER5 system can establish TCP/IP connections through a directly attached network interface or through a virtual Ethernet interface. Virtual Ethernet provides roughly the same function as a 1 Gigabit Ethernet adapter, and partitions in POWER5-based servers can use it to communicate with one another over TCP/IP.

You can define up to 4,094 separate virtual Ethernet LANs (VLANs). Each partition can have up to 65,534 virtual Ethernet adapters connected to a virtual switch, and each adapter can connect to 21 VLANs. No special hardware or software is required to set up a virtual Ethernet. After you enable a virtual Ethernet for a partition, a network device named ethXX is created in that partition; you can then set up TCP/IP on it to communicate with other partitions.

Consider again the example from the sub-processor partition section. In that scenario, the transaction server stores and retrieves information in a database in the AIX or i5/OS partition. This is a very typical use of virtual Ethernet, because communication is fast and requires no additional hardware. Figure 5 shows the example extended with two virtual Ethernet LANs.


Figure 5. Virtual Ethernet LAN is a fast and efficient way for partitions to communicate with each other





In most cases, partitions attached to a virtual Ethernet also need to communicate with the physical network. This requires at least one partition to have both a physical Ethernet adapter and a virtual Ethernet adapter connected to the other partitions; a partition with both adapters can route traffic between the physical and virtual Ethernets.




Figure 6. A firewall partition connects the virtual LAN to the physical network

A common way to connect partitions to the physical network is to run a firewall in one partition. The firewall partition has a network interface directly attached to the physical network, as shown in Figure 6; the other partitions then reach the physical network through the virtual LAN and the firewall partition.

At the time of writing, both SLES 9 and RHEL AS 3 Update 3 can use virtual LANs.

Large Page Support

The 2.6 kernel supports two virtual page sizes: the traditional 4 KB page and a 16 MB large page. Large pages are mainly used to improve performance for applications that access memory heavily. With large page support, application text and data can be backed by 16 MB pages at run time without modifying application code. The performance gain comes from fewer translation lookaside buffer (TLB) misses, because each TLB entry maps a much larger virtual memory range. Large pages also improve memory prefetch performance, since prefetching no longer has to restart at 4 KB boundaries. Large pages are supported in SLES 9 but not in RHEL AS 3 Update 3.
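
As an illustrative sketch of the 2.6-era interface (the mount point /mnt/huge and the file name are assumptions; hugetlbfs must be mounted and huge pages reserved through /proc/sys/vm/nr_hugepages), a program can map large-page memory like this:

    /* hugepage_demo.c - map a 16 MB large page through hugetlbfs.
     * Illustrative sketch: assumes hugetlbfs is mounted at /mnt/huge
     * (mount -t hugetlbfs none /mnt/huge) and huge pages are reserved
     * via /proc/sys/vm/nr_hugepages.  Build: gcc hugepage_demo.c */
    #include <stdio.h>
    #include <fcntl.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/mman.h>

    #define HUGE_FILE "/mnt/huge/demo"      /* hypothetical path */
    #define LENGTH    (16 * 1024 * 1024)    /* one 16 MB POWER page */

    int main(void)
    {
        int fd = open(HUGE_FILE, O_CREAT | O_RDWR, 0600);
        if (fd < 0) { perror("open"); return 1; }

        void *addr = mmap(NULL, LENGTH, PROT_READ | PROT_WRITE,
                          MAP_SHARED, fd, 0);
        if (addr == MAP_FAILED) { perror("mmap"); close(fd); return 1; }

        memset(addr, 0, LENGTH);    /* touch the mapping */
        printf("mapped %d bytes of huge-page memory at %p\n", LENGTH, addr);

        munmap(addr, LENGTH);
        close(fd);
        unlink(HUGE_FILE);
        return 0;
    }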

PCI hot swapping (hot plug)

With this capability, you can insert a new PCI hot-plug adapter into an available PCI slot while the operating system is running. The new adapter can be the same type as one already installed or a different type of PCI adapter. The operating system and applications can use the new resources without a restart, and you never need to shut the system down. You can likewise replace a failing PCI hot-plug adapter with another adapter of the same type: the existing device driver supports the replacement because it is the same type, and the device configuration and the configuration information of devices under the adapter are preserved for the replacement. PCI hot plug is supported in SLES 9 but not in RHEL AS 3 Update 3.

SUE machine check and processing

This capability allows the system to flag Special Uncorrectable Errors (SUEs) and terminate all processes that reference the affected resource, so the system can keep running and recover from the error without a restart. Both SLES 9 and RHEL AS 3 Update 3 support this feature.


Development tool chain changes

With the many innovations in the Linux 2.6 kernel, libraries and user-space development tools have had to change accordingly. This section describes the changes in the GNU tool chain, including glibc, binutils, as, ld, and gcc. Although this section is far from complete, it should be a good reference; you can supplement it by reading the freely available change logs for each package.

Glibc

To support the new and improved features in the 2.6 kernel, as well as new and extended features of the POWER architecture, glibc, the GNU C library, has been updated to version 2.3. At the core of the change is the Native POSIX Threading model for Linux. Internationalization, network interface addressing, and regular expression handling have also changed.

Internationalization
Internationalization has been improved by allowing iconv to use the system's locales. In addition, a thread-safe interface to locale.h has been implemented. These changes are not covered individually here; details are in the freely available source documentation.
Network interfaces
Network interface addressing is improved through a BSD-compatible implementation.
Regular expressions
After a rewrite for POSIX compatibility, regular expressions now run much faster.
fexecve
fexecve(), which executes a program referred to by a file descriptor, is now enabled on Linux.
malloc
For speed and compatibility, malloc is now based on Doug Lea's malloc 2.7.0.
Thread-local storage
Thread-local storage gives each thread its own copy of a variable, with fast access now handled by the compiler (see the sketch after this list). For more information, refer to the white paper by Ulrich Drepper at http://people.redhat.com/drepper/tls.pdf.
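
As a minimal sketch of compiler-managed thread-local storage (illustrative only; it uses GCC's __thread keyword, which NPTL and glibc 2.3 make possible):

    /* tls_demo.c - each thread gets a private copy of `counter`.
     * Illustrative sketch, not from the article.
     * Build: gcc tls_demo.c -lpthread */
    #include <stdio.h>
    #include <pthread.h>

    static __thread int counter = 0;   /* one copy per thread */

    static void *worker(void *arg)
    {
        int i;
        for (i = 0; i < 3; i++)
            counter++;                 /* no locking needed: private data */
        printf("thread %ld: counter = %d\n", (long)arg, counter);
        return NULL;
    }

    int main(void)
    {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, worker, (void *)1L);
        pthread_create(&t2, NULL, worker, (void *)2L);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        printf("main: counter = %d\n", counter);   /* still 0 here */
        return 0;
    }
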
GNU binutils

GNU binutils includes ld, as, and other small utilities such as objcopy and readelf. In this release, the small utilities changed only slightly for the 2.6 kernel and the POWER architecture. For example, readelf can now display information about files inside an archive. Complete change records are available on the binutils Web site, http://sources.redhat.com/binutils.

AS and LD

as and ld have changed in several ways for the POWER architecture, though because POWER is used with multiple operating systems, not all of the changes are Linux-specific. Changes that do affect Linux on POWER include support for new POWER opcodes and for the VMX extensions (available on PPC970-based Linux systems). In addition, optimization options are available for the POWER4 and PPC970 chips: -maltivec enables the PPC970's VMX extensions, and -mpower4 selects POWER4 optimization, as the sketch below illustrates.
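
As an illustrative sketch of what -maltivec enables (the code is an assumption, not from the article; it requires a VMX-capable chip such as the PPC970 and a GCC with AltiVec support):

    /* vmx_demo.c - a small VMX (AltiVec) vector addition.
     * Illustrative sketch, not from the article; requires a VMX-capable
     * chip such as the PPC970.
     * Build: gcc -maltivec vmx_demo.c */
    #include <stdio.h>
    #include <altivec.h>

    int main(void)
    {
        vector float a = (vector float){1.0f, 2.0f, 3.0f, 4.0f};
        vector float b = (vector float){10.0f, 20.0f, 30.0f, 40.0f};
        vector float c = vec_add(a, b);   /* four adds in one instruction */

        float out[4] __attribute__((aligned(16)));
        vec_st(c, 0, out);                /* store the vector to memory */

        printf("%.1f %.1f %.1f %.1f\n", out[0], out[1], out[2], out[3]);
        return 0;
    }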

GCC

GCC has undergone significant changes to support NPTL, but the other changes should not be overlooked: many improvements were added for POWER scheduling, optimization, and compatibility.

DFA scheduler
The DFA instruction scheduler is supported in gcc 3.3.3. You can learn more about the project at http://www.gnu.org/software/gcc/news/dfa.html.
Directives
Directives can now be used within C macros.
Includes
A -I include path is now ignored if the directory is already on the standard include path, which avoids unexpected header-ordering problems.
New support for POWER4 processors
New support for POWER4-specific optimization, namely -mpower4.
Improvements for the VMX extensions
Several functions have been improved for the VMX extensions in the PPC970 chip.
Better ISO C99 compatibility
For the current status, see the table at http://www.gnu.org/software/gcc/gcc-3.3/c99status.html (a short C99 example follows this list).
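
For illustration (a sketch, not from the article), here are a few C99 features that gcc 3.3 supports according to the status table cited above:

    /* c99_demo.c - a few C99 features supported by gcc 3.3.
     * Illustrative sketch, not from the article.
     * Build: gcc -std=c99 c99_demo.c */
    #include <stdio.h>
    #include <stdbool.h>   /* C99: bool */

    struct point { int x, y; };

    int main(void)
    {
        struct point p = { .y = 4, .x = 3 };   /* designated initializers */
        bool ok = true;                        /* _Bool via stdbool.h */

        for (int i = 0; i < 2; i++)            /* declaration in for loop */
            printf("pass %d: (%d, %d) ok=%d\n", i, p.x, p.y, ok);

        long long big = 1LL << 40;             /* long long is standard */
        printf("big = %lld\n", big);
        return 0;
    }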

Author Profile



Matt Davis is a Linux technical consultant on the IBM eServer Solutions Enablement team. As a veteran of the Linux on POWER project, he studies and tests emerging Linux on POWER technology and writes papers summarizing his findings, including papers on journaling file systems in Linux on POWER, parallel grid computing using Linux on POWER, open-source alternatives to commercial software for Linux on POWER, and the Linux solution catalog. He came to IBM as an intern while a student at the University of Texas at Austin, where he earned two degrees. You can contact him at mattdavis@us.ibm.com.




Chakarat Skawratanond is a technical consultant in the IBM eServer Solutions Enablement organization, where he helps independent software vendors enable their applications for AIX 5L and Linux on the IBM pSeries platform. You can contact him at chakarat@us.ibm.com.




Nikolay Yevik is a Linux technical consultant on the IBM eServer Solutions Enablement team. He has more than five years of C/C++ and Java software development experience, as well as AIX and Linux system administration experience. He holds master's degrees in petroleum engineering and computer science. You can contact him at yevik@us.ibm.com.
