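Anyone who has worked with Swift knows that the Ring is one of its core components: it determines how data is distributed across the cluster. Swift derives the number of partitions in the cluster from the configured partition_power (2 to the power of partition_power), uses a consistent-hashing scheme to assign those partitions to different nodes, and maps each piece of data to its partition.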
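To make the relationship between partition_power and the partition count concrete, here is a minimal sketch. It assumes that an object name is hashed with MD5 and that the top 32 bits of the digest, shifted right by (32 - partition_power), select the partition. The function names are hypothetical, and any cluster-specific salt that the real code may mix into the hash is left out.

```python
import hashlib
import struct


def partition_count(part_power: int) -> int:
    """Total number of partitions: 2 to the power of part_power."""
    return 2 ** part_power


def partition_for(name: str, part_power: int) -> int:
    """Map a name to a partition number (illustrative, unsalted hash)."""
    digest = hashlib.md5(name.encode("utf-8")).digest()
    # Interpret the first 4 bytes of the digest as a big-endian unsigned int.
    top32 = struct.unpack_from(">I", digest)[0]
    # Keep only the top part_power bits, giving a value in [0, 2**part_power).
    return top32 >> (32 - part_power)


if __name__ == "__main__":
    power = 10
    print("partitions:", partition_count(power))
    print("partition:", partition_for("/account/container/object", power))
```

Because the partition depends only on the name and on partition_power, the same object always lands in the same partition; only the partition-to-device assignment changes when the ring is rebuilt.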
Building the Ring is therefore a step that every Swift deployment has to go through at initialization. In brief:
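- ring-builder uses each device's weight to compute how many partitions that device should hold. (2 to the power of partition_power gives the total number of partitions, which are then apportioned according to the weights and the number of devices.)
- ring-builder assigns each replica of every partition to a device.
- Building a new ring from an old ring works as follows:
  - recompute the number of partitions each device should hold;
  - gather the partitions that need to be reassigned:
    - 1) add all partitions on removed devices to the gathered list;
    - 2) add the partitions that must be handed off because new devices were added to the gathered list;
    - 3) add the partitions that any device holds in excess of its recomputed share to the gathered list.
  - assign the partitions in the gathered list to devices, using the same procedure as for a brand-new ring (the first two steps above).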
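The weight-based calculation and the gathering step above can be sketched as follows. This is a simplified illustration, not the RingBuilder code: it assumes that each device's share is proportional to its weight and that every replica of a partition counts as one assignment. The function names and the sample devices are hypothetical.

```python
def parts_wanted(part_power, replicas, weights):
    """Return how many partition replicas each device should hold.

    weights maps a device name to its weight; each share is proportional
    to weight / total weight.
    """
    total_assignments = (2 ** part_power) * replicas
    total_weight = float(sum(weights.values()))
    return {dev: total_assignments * w / total_weight
            for dev, w in weights.items()}


def surplus(current, wanted):
    """Return how many assignments each device must give up.

    A device missing from wanted (that is, a removed device) gives up
    everything it currently holds.
    """
    return {dev: max(0, count - int(round(wanted.get(dev, 0))))
            for dev, count in current.items()}


if __name__ == "__main__":
    # Old ring: two equal devices sharing 16 partitions x 3 replicas.
    current = {"sda": 24, "sdb": 24}
    # New ring: a third device with twice the weight is added.
    wanted = parts_wanted(4, 3, {"sda": 100, "sdb": 100, "sdc": 200})
    print(wanted)                    # desired share per device
    print(surplus(current, wanted))  # size of the gathered list per device
```

In this example the two old devices each hand 12 assignments to the gathered list, which is exactly the share the new device is waiting for.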
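So how does the swift-ring-builder command itself work? This article only aims to introduce the command. The source code shows that almost everything swift-ring-builder does is implemented by methods of a RingBuilder instance, so the underlying principles and details will be summarized in a later post, after reading the RingBuilder source. So please don't accuse me of false advertising ^_~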
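1. What does swift-ring-builder do?
Rings are built manually with the swift-ring-builder tool. swift-ring-builder associates partitions with devices and writes that data into an optimized Python data structure, which is compressed, serialized, and written to disk so that the ring data can be loaded by the servers. The mechanism for updating rings is simple: a server checks the last-modified time of the ring file to tell whether the file is newer than the version it holds in memory, and reloads the ring data if it is. The "Python data structure" mentioned here is a dict like the one returned by the following method: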
    def to_dict(self):
        """
        Returns a dict that can be used later with copy_from to
        restore a RingBuilder. swift-ring-builder uses this to
        pickle.dump the dict to a file and later load that dict into
        copy_from.
        """
        return {'part_power': self.part_power,
                'replicas': self.replicas,
                'min_part_hours': self.min_part_hours,
                'parts': self.parts,
                'devs': self.devs,
                'devs_changed': self.devs_changed,
                'version': self.version,
                '_replica2part2dev': self._replica2part2dev,
                '_last_part_moves_epoch': self._last_part_moves_epoch,
                '_last_part_moves': self._last_part_moves,
                '_last_part_gather_start': self._last_part_gather_start,
                '_remove_devs': self._remove_devs}
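The docstring says the dict is written with pickle.dump and loaded back later. The round trip can be sketched with a toy dict as below. The helper names and the sample values are hypothetical, and only a few of the keys from to_dict are included; this is not the swift-ring-builder code itself.

```python
import os
import pickle
import tempfile


def save_state(state, path):
    """Pickle a builder-style dict to a file."""
    with open(path, "wb") as f:
        pickle.dump(state, f, protocol=2)


def load_state(path):
    """Load a builder-style dict back from a file."""
    with open(path, "rb") as f:
        return pickle.load(f)


# A toy stand-in for the dict returned by to_dict (values are made up).
toy_state = {
    "part_power": 4,
    "replicas": 3,
    "min_part_hours": 1,
    "parts": 16,
    "devs": [{"id": 0, "zone": 1, "ip": "1.2.3.4", "port": 5678,
              "device": "sdb1", "weight": 100.0, "meta": ""}],
    "version": 1,
}

if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "toy.builder")
        save_state(toy_state, path)
        print(load_state(path) == toy_state)
```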
The basic form of the swift-ring-builder command is:
swift-ring-builder <builder_file> <action> [params]
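swift-ring-builder performs the operation named by <action>, saves the resulting builder file to the path given by <builder_file>, and generates the ring file xxx.ring.gz that the servers load. Before doing so, it backs up the existing <builder_file> and xxx.ring.gz to the backups folder.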
Figure 1: the builder file and ring.gz created by swift-ring-builder
Figure 2: the builder file and ring.gz backed up by swift-ring-builder
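The backup step can be pictured with the sketch below. It assumes the backup is a plain copy into a backups directory next to the original file, with a Unix timestamp prefixed to the file name; that naming scheme and the helper name backup_file are assumptions made for illustration.

```python
import os
import shutil
import tempfile
import time


def backup_file(path, now=None):
    """Copy path into a sibling 'backups' directory and return the copy's path.

    The '<timestamp>.<name>' naming is an assumption for this sketch.
    """
    backup_dir = os.path.join(os.path.dirname(os.path.abspath(path)), "backups")
    os.makedirs(backup_dir, exist_ok=True)
    stamp = int(time.time() if now is None else now)
    target = os.path.join(backup_dir, "%d.%s" % (stamp, os.path.basename(path)))
    shutil.copy2(path, target)
    return target


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        builder = os.path.join(tmp, "object.builder")
        with open(builder, "wb") as f:
            f.write(b"placeholder builder contents")
        print(backup_file(builder))
```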
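Keeping <builder_file> safe is very important, so you should keep multiple copies of the builder file. If the builder file is lost completely, the ring has to be rebuilt from scratch. Nearly every partition would then be assigned to a different device, so the data replicas would also have to move to new locations, causing a massive data migration and leaving the system unavailable for a period of time.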
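2. swift-ring-builder commands
swift-ring-builder provides the following commands:
add, create, list_parts, rebalance, remove, search, set_info, set_min_part_hours, set_weight, set_replicas, validate, write_ring
Each command is listed below with an explanation. The original English help text can be obtained by running "swift-ring-builder" on its own.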
swift-ring-builder <builder_file>
Shows information about the ring and the devices within. Note that swift-1.8.0 added a region attribute to devices.

swift-ring-builder <builder_file> add z<zone>-<ip>:<port>/<device_name>_<meta> <weight> [z<zone>-<ip>:<port>/<device_name>_<meta> <weight>] ...
Adds devices to the ring with the given information. No partitions will be assigned to the new device until after running 'rebalance'. This is so you can make multiple device changes and rebalance them all just once.
swift-ring-builder <builder_file> create <part_power> <replicas> <min_part_hours>
Creates <builder_file> with 2^<part_power> partitions and <replicas> replicas. <min_part_hours> is the minimum number of hours that must pass before a given partition may be moved again.

swift-ring-builder <builder_file> list_parts <search-value> [<search-value>] ..
Returns a 2 column list of all the partitions that are assigned to any of the devices matching the search values given. The first column is the assigned partition number and the second column is the number of device matches for that partition (a count, not a device id). The list is ordered from most number of matches to least. If there are a lot of devices to match against, this command could take a while to run.
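The two-column output of list_parts can be reproduced with a small sketch. It assumes a table shaped like the _replica2part2dev field shown earlier, that is, one row per replica in which entry i is the device id holding that replica of partition i. The function name and the sample table are hypothetical.

```python
from collections import Counter


def list_parts(replica2part2dev, matching_dev_ids):
    """Return (partition, match_count) pairs, most matches first."""
    matches = Counter()
    for part2dev in replica2part2dev:
        for part, dev_id in enumerate(part2dev):
            if dev_id in matching_dev_ids:
                matches[part] += 1
    # Sort by match count descending, then by partition number ascending.
    return sorted(matches.items(), key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    # 3 replicas x 4 partitions; entries are device ids.
    table = [
        [0, 1, 2, 3],
        [1, 2, 3, 0],
        [2, 3, 0, 1],
    ]
    for part, count in list_parts(table, {0, 1}):
        print(part, count)
```

Partitions 0 and 3 each have two replicas on the matching devices, so they are listed before partitions 1 and 2, which have one each.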
swift-ring-builder <builder_file> rebalance
Attempts to rebalance the ring by reassigning partitions that haven't been recently reassigned.
swift-ring-builder <builder_file> remove <search-value> [search-value ...]
Removes the device(s) from the ring. This should normally just be used for a device that has failed. For a device you wish to decommission, it's best to set its weight to 0, wait for it to drain all its data, then use this remove command. This will not take effect until after running 'rebalance'. This is so you can make multiple device changes (additions and removals) and rebalance them all just once.
swift-ring-builder <builder_file> search <search-value>
Shows information about matching devices.

swift-ring-builder <builder_file> set_info <search-value> <ip>:<port>/<device_name>_<meta> [<search-value> <ip>:<port>/<device_name>_<meta>] ...
For each search-value, resets the matched device's information. This information isn't used to assign partitions, so you can use 'write_ring' afterward to rewrite the current ring with the newer device information. Any of the parts are optional in the final <ip>:<port>/<device_name>_<meta> parameter; just give what you want to change. For instance, set_info d74 _"snet: 5.6.7.8" would just update the meta data for device id 74 to "snet: 5.6.7.8".
swift-ring-builder <builder_file> set_min_part_hours <hours>
Changes the <min_part_hours> to the given <hours>. This should be set to at least as long as a full replication/update cycle takes. We're working on a way to determine this more easily than scanning logs.
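The effect of min_part_hours on a rebalance can be illustrated with the sketch below. It assumes a per-partition list of "hours since this partition was last moved", loosely modelled on the _last_part_moves field name shown earlier; the exact meaning of that field and the function name are assumptions.

```python
def movable_parts(hours_since_last_move, min_part_hours):
    """Return the partitions that are allowed to move again.

    hours_since_last_move[i] is the number of hours since partition i
    was last reassigned (an assumed representation for this sketch).
    """
    return [part for part, hours in enumerate(hours_since_last_move)
            if hours >= min_part_hours]


if __name__ == "__main__":
    # Partitions 0 and 1 moved recently; 2 and 3 did not.
    print(movable_parts([0, 5, 24, 30], min_part_hours=24))
```

Setting min_part_hours to at least one full replication cycle means that a partition which has just moved has had time to replicate before any of its replicas can be moved again.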
swift-ring-builder <builder_file> set_weight <search-value> <weight> [<search-value> <weight>] ...
Resets the devices' weights. No partitions will be reassigned to or from the device until after running 'rebalance'. This is so you can make multiple device changes and rebalance them all just once.
swift-ring-builder <builder_file> set_replicas <replicas>
Changes the replica count to the given <replicas>. <replicas> may be a floating-point value, in which case some partitions will have floor(<replicas>) replicas and some will have ceiling(<replicas>) in the correct proportions. A rebalance is needed to make the change take effect. This command was added in swift-1.8.0.
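The "correct proportions" for a fractional replica count can be worked out with a short sketch. It assumes that the fractional part of <replicas> is the share of partitions that receive the extra replica; the function name is hypothetical.

```python
import math


def replica_split(parts, replicas):
    """Return {replica_count: number_of_partitions} for a given setting."""
    low = math.floor(replicas)
    high = math.ceil(replicas)
    # The fractional part decides how many partitions get the extra replica.
    n_high = int(round(parts * (replicas - low)))
    counts = {low: parts - n_high}
    if n_high:
        counts[high] = n_high
    return counts


if __name__ == "__main__":
    print(replica_split(8, 2.5))      # half the partitions get a third replica
    print(replica_split(1024, 2.25))  # a quarter get a third replica
    print(replica_split(8, 3))        # whole numbers leave nothing to split
```

With 8 partitions and 2.5 replicas, 4 partitions have 2 replicas and 4 have 3, which averages out to 2.5.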
swift-ring-builder <builder_file> validate
Just runs the validation routines on the ring. It only calls the builder's validate method to check the ring; it does not put a ring into effect.
swift-ring-builder <builder_file> write_ring
Just rewrites the distributable ring file. This is done automatically after a successful rebalance, so really this is only useful after one or more 'set_info' calls when no rebalance is needed but you want to send out the new device information.
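Putting the commands above together, a typical first build runs create, then one add per device, then a single rebalance. The sketch below only assembles the command lines from the syntax documented above and prints them; it does not run swift-ring-builder. The helper name and the sample devices are hypothetical.

```python
import shlex


def build_commands(builder_file, part_power, replicas, min_part_hours, devices):
    """Return argv lists for create, one add per device, and one rebalance.

    devices is a list of (zone, ip, port, device_name, weight) tuples.
    """
    base = ["swift-ring-builder", builder_file]
    commands = [base + ["create", str(part_power), str(replicas),
                        str(min_part_hours)]]
    for zone, ip, port, name, weight in devices:
        spec = "z%d-%s:%d/%s" % (zone, ip, port, name)
        commands.append(base + ["add", spec, str(weight)])
    # One rebalance at the end assigns partitions for all added devices.
    commands.append(base + ["rebalance"])
    return commands


if __name__ == "__main__":
    devices = [
        (1, "1.2.3.4", 5678, "sdb1", 100),
        (2, "1.2.3.5", 5678, "sdb1", 100),
        (3, "1.2.3.6", 5678, "sdb1", 100),
    ]
    for argv in build_commands("object.builder", 18, 3, 1, devices):
        print(" ".join(shlex.quote(arg) for arg in argv))
```

Running rebalance once after all the add calls is the batching that the help text recommends: device changes are recorded first, and partitions move only once.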
3. Parameter format
When searching for devices, <search-value> has the following format:
d<device_id>z<zone>-<ip>:<port>/<device_name>_<meta>
Every part of this format is optional, for example:
z1                  Matches devices in zone 1
z1-1.2.3.4          Matches devices in zone 1 with the ip 1.2.3.4
1.2.3.4             Matches devices in any zone with the ip 1.2.3.4
z1:5678             Matches devices in zone 1 using port 5678
:5678               Matches devices that use port 5678
/sdb1               Matches devices with the device name sdb1
_shiny              Matches devices with shiny in the meta data
_"snet: 5.6.7.8"    Matches devices with snet: 5.6.7.8 in the meta data
[::1]               Matches devices in any zone with the ip ::1
z1-[::1]:5678       Matches devices in zone 1 with ip ::1 and port 5678
The most specific form looks like this:
d74z1-1.2.3.4:5678/sdb1_"snet: 5.6.7.8"
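The search-value format can be taken apart with a regular expression, as in the sketch below. This parser is written for illustration and is not the one swift-ring-builder uses; the pattern, the function name, and the returned key names are assumptions. It expects the value as the program would receive it, that is, after the shell has removed the double quotes around the meta data.

```python
import re

# Each group is optional, mirroring d<id>z<zone>-<ip>:<port>/<device>_<meta>.
SEARCH_VALUE_RE = re.compile(
    r"^(?:d(?P<id>\d+))?"
    r"(?:z(?P<zone>\d+))?"
    r"(?:-?(?P<ip>\[[0-9a-fA-F:]+\]|[0-9.]+))?"
    r"(?::(?P<port>\d+))?"
    r"(?:/(?P<device>[^_]+))?"
    r"(?:_(?P<meta>.*))?\Z"
)


def parse_search_value(value):
    """Return a dict holding only the parts present in the search value."""
    match = SEARCH_VALUE_RE.match(value)
    if match is None:
        raise ValueError("invalid search value: %r" % value)
    return {key: part for key, part in match.groupdict().items()
            if part is not None}


if __name__ == "__main__":
    for value in ["z1", "z1-1.2.3.4", ":5678", "/sdb1", "_shiny",
                  "z1-[::1]:5678", "d74z1-1.2.3.4:5678/sdb1_snet: 5.6.7.8"]:
        print(value, "->", parse_search_value(value))
```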
4. Exit codes
0 = operation successful
1 = operation completed with warnings
2 = error
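A wrapper script can turn these exit codes into messages, as in the sketch below. The mapping is taken from the list above; the helper names are hypothetical, and the commented usage assumes that swift-ring-builder is installed and on the PATH.

```python
EXIT_MEANINGS = {
    0: "operation successful",
    1: "operation completed with warnings",
    2: "error",
}


def describe_exit(code):
    """Return the documented meaning of a swift-ring-builder exit code."""
    return EXIT_MEANINGS.get(code, "unknown exit code %d" % code)


def is_error(code):
    """Treat anything other than success (0) or warnings (1) as a failure."""
    return code not in (0, 1)


if __name__ == "__main__":
    # Example with a real invocation (assumes the tool is installed):
    #   import subprocess
    #   code = subprocess.call(["swift-ring-builder", "object.builder", "rebalance"])
    for code in (0, 1, 2):
        print(code, describe_exit(code), "error" if is_error(code) else "ok")
```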