Linux Performance Monitoring and Tuning: Introduction to Key Parameters (translation of the main content) (classic)

Last updated: 2018-12-04 Source: Internet

Uploader: User


Reposted from: http://hi.baidu.com/springwu/blog/item/267ec345cd9d4628879473ce.html

1. CPU

2. Network

3. I/O

4. Memory

1. CPU

You should understand the main CPU performance parameters: context switches, the run queue, CPU utilization, and load average.

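(1) Context switches:

1) When the CPU switches from executing one process (or thread) to another, this is called a context switch.

2) When a process switch happens, the kernel saves the CPU's current state (for that process or thread) in memory.

3) The kernel also retrieves the previously saved state of the next process (or thread) from memory and loads it into the CPU.

4) Context switching is essential for CPU multitasking.

5) However, a very high number of context switches can cause performance problems.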
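A minimal Python sketch of how the context-switch count can be observed, assuming a Linux system: the kernel keeps a cumulative counter on the ctxt line of /proc/stat, so sampling it twice and dividing the difference by the interval gives a rate (to my understanding, this is also the source of vmstat's cs column). The function names are hypothetical, chosen only for illustration.

```python
def read_ctxt(stat_text: str) -> int:
    """Return the cumulative context-switch count from the text of /proc/stat."""
    for line in stat_text.splitlines():
        fields = line.split()
        if len(fields) == 2 and fields[0] == "ctxt":
            return int(fields[1])
    raise ValueError("no 'ctxt' line found")


def context_switch_rate(before: str, after: str, interval_s: float) -> float:
    """Context switches per second between two /proc/stat snapshots."""
    return (read_ctxt(after) - read_ctxt(before)) / interval_s
```

To use it, read /proc/stat twice, about a second apart, and pass both snapshots along with the measured interval. There is no universal threshold for "too many", so compare the rate with the same system's normal baseline.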

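(2) Run queue

1) The run queue is the total number of active processes currently queued for the CPU.

2) When the CPU is ready to execute a process, it picks one from the run queue based on process priority.

3) Note that processes that are sleeping, or in the I/O wait state, are not in the run queue.

4) Note: a high number of processes in the run queue can cause performance problems.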
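As a rough illustration, assuming Linux: /proc/stat also reports procs_running (tasks currently runnable) and procs_blocked (tasks blocked waiting for I/O to complete), which matches the point above that I/O-waiting processes are counted separately from the run queue. Comparing the runnable count with the number of CPUs is a common rule of thumb, not a hard limit, and the helper names are made up for this sketch.

```python
def run_queue_stats(stat_text: str) -> tuple[int, int]:
    """Return (procs_running, procs_blocked) from the text of /proc/stat."""
    values = {}
    for line in stat_text.splitlines():
        fields = line.split()
        if len(fields) == 2 and fields[0] in ("procs_running", "procs_blocked"):
            values[fields[0]] = int(fields[1])
    return values["procs_running"], values["procs_blocked"]


def runnable_per_cpu(procs_running: int, cpu_count: int) -> float:
    """Runnable tasks per CPU; a value that stays above ~1 hints at CPU saturation (rule of thumb)."""
    return procs_running / cpu_count
```

procs_running is a point-in-time value, so it is worth sampling it several times (for example, passing os.cpu_count() as the CPU count) before drawing conclusions.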

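(3) CPU utilization

1) This indicates how much of the CPU is currently in use.

2) This is fairly straightforward: you can view CPU utilization directly with the top command.

3) 100% CPU utilization means the system is fully loaded.

4) So a high percentage of CPU utilization will cause performance problems.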
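A minimal sketch of how a tool like top can derive CPU utilization, assuming Linux: the aggregate cpu line of /proc/stat holds cumulative time counters (user, nice, system, idle, iowait, irq, softirq, steal, plus guest counters on newer kernels), and utilization is the non-idle share of the time that passed between two samples. Treating iowait as idle and skipping the guest columns (already included in user and nice) are choices made for this sketch; the function names are hypothetical.

```python
def cpu_times(stat_text: str) -> list[int]:
    """Return user, nice, system, idle, iowait, irq, softirq, steal from the aggregate cpu line."""
    for line in stat_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "cpu":
            # guest and guest_nice (columns 9 and 10) are already counted in user and nice
            return [int(x) for x in fields[1:9]]
    raise ValueError("no aggregate 'cpu' line found")


def cpu_utilization(before: str, after: str) -> float:
    """Percentage of CPU time that was busy (not idle, not iowait) between two snapshots."""
    t0, t1 = cpu_times(before), cpu_times(after)
    total = sum(t1) - sum(t0)
    idle = (t1[3] + t1[4]) - (t0[3] + t0[4])  # idle + iowait
    return 100.0 * (total - idle) / total if total else 0.0
```

Because the counters are cumulative since boot, a single snapshot only gives the average since boot; the two-sample difference is what reflects current load.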

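(4) Load average

1) This is the average CPU load over a specific period of time.

2) On Linux, the load average is shown for the last 1, 5, and 15 minutes. This is very useful for seeing whether the overall system load is going up or down.

3) For example, a load average of "0.75 1.70 2.10" indicates that the load is decreasing: 0.75 is the load average over the last minute, 1.70 over the last 5 minutes, and 2.10 over the last 15 minutes.

4) Note that the load average is calculated by combining the total number of processes in the run queue with the total number of processes in the uninterruptible task state.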
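A small sketch of reading and interpreting the three load averages, assuming Linux's /proc/loadavg format (the three averages, then running/total tasks and the most recent PID). The trend check simply compares the short-term figure with the longer-term ones, as in the example above; the helper names are invented for illustration.

```python
def parse_loadavg(text: str) -> tuple[float, float, float]:
    """Return the 1-, 5- and 15-minute load averages from /proc/loadavg text."""
    one, five, fifteen = (float(x) for x in text.split()[:3])
    return one, five, fifteen


def load_trend(one: float, five: float, fifteen: float) -> str:
    """Rough direction of the load: compare the short-term average with the long-term ones."""
    if one < five < fifteen:
        return "decreasing"
    if one > five > fifteen:
        return "increasing"
    return "mixed/steady"
```

On Unix-like systems, Python's os.getloadavg() returns the same three values, so load_trend(*os.getloadavg()) works as well.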

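2. Network

(1) A good understanding of TCP/IP concepts is very helpful when analyzing any network issue; this will be discussed further in future articles.

(2) For network interfaces, you should monitor the total number of packets (sent, received, dropped, and so on).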
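To show where these packet counters come from, here is a sketch that parses /proc/net/dev, assuming the standard Linux layout: two header lines, then one line per interface with eight receive columns followed by eight transmit columns. The counters are cumulative since boot, so in practice you would sample twice and subtract; the function name is hypothetical.

```python
def parse_net_dev(text: str) -> dict[str, dict[str, int]]:
    """Parse /proc/net/dev into per-interface byte, packet, error and drop counters."""
    stats = {}
    for line in text.splitlines()[2:]:  # the first two lines are column headers
        if ":" not in line:
            continue
        name, data = line.split(":", 1)
        v = [int(x) for x in data.split()]
        stats[name.strip()] = {
            # receive columns: bytes packets errs drop fifo frame compressed multicast
            "rx_bytes": v[0], "rx_packets": v[1], "rx_errs": v[2], "rx_drop": v[3],
            # transmit columns: bytes packets errs drop fifo colls carrier compressed
            "tx_bytes": v[8], "tx_packets": v[9], "tx_errs": v[10], "tx_drop": v[11],
        }
    return stats
```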

3. I/O

(1) I/O wait is the time the CPU spends waiting for I/O operations. If you consistently see high I/O wait on the system, it points to a problem with the disk subsystem.
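A minimal sketch for measuring I/O wait, assuming Linux: the fifth counter on the cpu line of /proc/stat is cumulative iowait time, so its share of the total time elapsed between two samples is the iowait percentage (roughly what top and vmstat show as wa). The function name is hypothetical.

```python
def iowait_percent(before: str, after: str) -> float:
    """Share of CPU time spent in iowait between two /proc/stat snapshots, in percent."""
    def cpu_times(text: str) -> list[int]:
        for line in text.splitlines():
            fields = line.split()
            if fields and fields[0] == "cpu":
                # user nice system idle iowait irq softirq steal (guest columns skipped)
                return [int(x) for x in fields[1:9]]
        raise ValueError("no aggregate 'cpu' line found")

    t0, t1 = cpu_times(before), cpu_times(after)
    total = sum(t1) - sum(t0)
    return 100.0 * (t1[4] - t0[4]) / total if total else 0.0
```

A single high reading is not conclusive; it is the consistently high value described above that points at the disk subsystem.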

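(2) You should also monitor reads per second and writes per second. These are measured in blocks (the number of blocks read or written per second) and are also referred to as bi (blocks in) and bo (blocks out).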
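A sketch of computing bi and bo, assuming Linux: /proc/vmstat exposes cumulative pgpgin and pgpgout counters in kilobytes, and, to my understanding, vmstat's bi and bo columns are the per-second deltas of these counters expressed in 1024-byte blocks. The function names are hypothetical.

```python
def read_vmstat(text: str) -> dict[str, int]:
    """Parse /proc/vmstat ("name value" per line) into a dict."""
    out = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            out[parts[0]] = int(parts[1])
    return out


def block_io_rates(before: str, after: str, interval_s: float) -> tuple[float, float]:
    """Approximate vmstat-style (bi, bo) in 1 KiB blocks per second."""
    b, a = read_vmstat(before), read_vmstat(after)
    bi = (a["pgpgin"] - b["pgpgin"]) / interval_s
    bo = (a["pgpgout"] - b["pgpgout"]) / interval_s
    return bi, bo
```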

(3) TPS (transactions per second) is the sum of rtps (read transactions per second) and wtps (write transactions per second).
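A rough sketch of that sum, assuming Linux's /proc/diskstats layout (major, minor, device name, then reads completed in the fourth column and writes completed in the eighth). It counts each completed read or write request as one transaction, which approximates what sar and iostat report rather than reproducing them exactly; the function names are hypothetical.

```python
def disk_ios(text: str, device: str) -> tuple[int, int]:
    """Return (reads completed, writes completed) for one device from /proc/diskstats."""
    for line in text.splitlines():
        f = line.split()
        if len(f) > 7 and f[2] == device:
            return int(f[3]), int(f[7])
    raise ValueError(f"device {device!r} not found")


def disk_tps(before: str, after: str, device: str, interval_s: float) -> tuple[float, float, float]:
    """Return (rtps, wtps, tps) for one device, where tps = rtps + wtps."""
    r0, w0 = disk_ios(before, device)
    r1, w1 = disk_ios(after, device)
    rtps = (r1 - r0) / interval_s
    wtps = (w1 - w0) / interval_s
    return rtps, wtps, rtps + wtps
```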

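4. Memory

(1) As you know, RAM is your physical memory. If you have 4GB of RAM installed on your system, you have 4GB of physical memory. (Translator's note: I take this to mean that with one 4GB memory module installed (1 × 4 = 4GB), the machine reports 4GB; a 32-bit system without PAE can address only about 4GB, even if 8GB is installed.)

(2) Virtual memory = swap space available on disk + physical memory size. Virtual memory covers both user space and kernel space.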
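The formula above can be checked against /proc/meminfo with a sketch like the following, assuming Linux: MemTotal is usable physical RAM (slightly less than what is installed, because the kernel reserves some) and SwapTotal is the configured swap space, both in kB. The function names are made up for illustration.

```python
def meminfo_kib(text: str) -> dict[str, int]:
    """Parse /proc/meminfo lines such as 'MemTotal:  16318516 kB' into {name: value}."""
    out = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        parts = rest.split()
        if parts:
            out[name.strip()] = int(parts[0])
    return out


def virtual_memory_kib(text: str) -> int:
    """Virtual memory as defined above: physical RAM plus swap space, in kB."""
    m = meminfo_kib(text)
    return m["MemTotal"] + m["SwapTotal"]
```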

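(3) Whether the system is 32-bit or 64-bit makes a big difference in how much memory a single process can use.

(4) Memory that is otherwise unused is used by the kernel as file system cache.

(5) Linux uses swap when it needs more memory than is physically available. It then moves the least-used memory pages from physical memory out to the swap space on disk.

(6) Excessive swapping causes performance problems, because disk is much slower than physical memory and moving memory pages between RAM and disk takes time.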
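To watch for the excessive swapping described above, a sketch like this one, assuming Linux, samples the cumulative pswpin and pswpout counters in /proc/vmstat (pages swapped in from and out to disk). Sustained non-zero rates, especially for swap-out, are the warning sign. The function name is hypothetical; multiplying by the page size (for example os.sysconf("SC_PAGE_SIZE")) converts pages to bytes.

```python
def swap_rates(before: str, after: str, interval_s: float) -> tuple[float, float]:
    """Pages swapped in and out per second, from two /proc/vmstat snapshots."""
    def counters(text: str) -> dict[str, int]:
        d = {}
        for line in text.splitlines():
            parts = line.split()
            if len(parts) == 2 and parts[0] in ("pswpin", "pswpout"):
                d[parts[0]] = int(parts[1])
        return d

    b, a = counters(before), counters(after)
    return ((a["pswpin"] - b["pswpin"]) / interval_s,
            (a["pswpout"] - b["pswpout"]) / interval_s)
```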

(I translated part of the content above based on my own understanding; I didn't bother translating the rest below.)

All four of the subsystems above are interrelated. Seeing high reads/second, writes/second, or I/O wait doesn't necessarily mean the problem lies in the I/O subsystem; it also depends on what the application is doing. In most cases, the performance problem might be caused by the application running on the Linux system.

Remember the 80/20 rule: 80% of the performance improvement comes from tuning the application, and the remaining 20% comes from tuning the infrastructure components.

Here are some Linux performance monitoring tools: top, free, ps, iostat, vmstat, mpstat, sar, tcpdump, netstat, iozone.

We'll discuss these tools, and how to use them, in more detail in upcoming articles in this series.

Performance troubleshooting approach:

Step 1 – Understand (and reproduce) the problem: Half of the problem is solved when you clearly understand what the problem is. Before trying to solve a performance issue, first work on defining the problem clearly. The more time you spend understanding and defining the problem, the more detail you will have for looking for answers in the right place. If possible, try to reproduce the problem, or at least simulate a situation that you think closely resembles it. This will later help you validate the solution you come up with to fix the performance issue.

Step 2 – Monitor and collect data: After defining the problem clearly, monitor the system and try to collect as much data as possible on the various subsystems. Based on this data, come up with a list of potential issues.

Step 3 – Eliminate and narrow down issues: Once you have a list of potential issues, dive into each one and eliminate any non-issues. Narrow it down further to see whether it is an application issue or an infrastructure issue, then drill down to a specific component. For example, if it is an infrastructure issue, identify the subsystem that is causing it. If it is an I/O subsystem issue, narrow it down to a specific partition, RAID group, LUN, or disk. Basically, keep drilling down until you put your finger on the root cause of the issue.

Step 4 – One change at a time: Once you've narrowed things down to a small list of potential issues, don't make multiple changes at once. If you do, you won't know which change fixed the original issue. Multiple simultaneous changes might also cause new issues, which you'll end up chasing instead of fixing the original one. So make one change at a time, and check whether it fixes the original problem.

In upcoming articles in this performance series, we'll discuss in more detail how to monitor and address performance issues in the CPU, memory, I/O and network subsystems using various Linux performance monitoring tools.

This article was originally written in Chinese and published on aliyun.com.
