First, placement group states
1. Creating
When you create a storage pool, Ceph creates the number of placement groups you specified. Ceph displays creating while it is creating one or more placement groups. Once they are created, the OSDs in each placement group's acting set peer with one another. When peering is complete, the placement group state should become active+clean, meaning that a Ceph client can begin writing data to the placement group.
2. Peering
When Ceph is peering a placement group, it brings the OSDs that store the replicas of that placement group into agreement about the state of its objects and metadata. When peering completes, the OSDs that store the placement group agree on its current state. However, completion of peering does not mean that every replica holds the latest version of the data.
3. Active
After Ceph completes peering, a placement group may become active. The active state generally means that the data in the placement group is available for reads and writes on both the primary and the replicas.
4. Clean
When a placement group is in the clean state, the primary OSD and the replica OSDs have peered successfully and there are no stray replicas of the placement group. Ceph has replicated every object in the placement group the specified number of times.
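A combined state such as active+clean is just the individual state names joined with a plus sign. As a minimal sketch, a state string can be split into flags and checked as follows. The helper names are hypothetical and are not part of any Ceph library, and the list of "trouble" flags is limited to states discussed in this article.

```python
def parse_pg_state(state):
    """Split a combined state string such as 'active+clean' into a set of flags."""
    return {flag for flag in state.strip().split("+") if flag}


def is_healthy(state):
    """Treat a placement group as healthy only when it is both active and clean
    and carries none of the flags that signal pending or ongoing repair work."""
    flags = parse_pg_state(state)
    # Assumption: this set only covers the states described in the article.
    trouble = {"creating", "peering", "degraded", "recovering",
               "backfilling", "remapped", "stale"}
    return {"active", "clean"} <= flags and not (flags & trouble)
```

For example, is_healthy("active+clean") is true, while is_healthy("active+degraded") is false because the degraded flag is present.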
5. Degraded
When a client writes an object to the primary OSD, the primary OSD is responsible for writing the replicas to the replica OSDs. After the primary OSD writes the object to storage, the placement group remains in the degraded state until the primary OSD has received acknowledgement from the replica OSDs that the replicas were created successfully.
A placement group can be active+degraded because an OSD can be active even though it does not yet hold all of the objects. If an OSD goes down, Ceph marks each placement group assigned to that OSD as degraded, and once the OSD comes back up, the OSDs must peer again. However, as long as the placement group is still active, a client can write new objects to it even while it is degraded.
If an OSD is down and the degraded state persists, Ceph may mark the down OSD as out of the cluster and remap the data from the down OSD to other OSDs. The time between being marked down and being marked out is controlled by mon osd down out interval, which is 300 seconds by default here (the default differs between Ceph releases, so check your version).
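The down-to-out timing rule can be illustrated with a small sketch. This only mirrors the rule as described in the text; it is not how the monitor implements it, and the function name and parameters are hypothetical.

```python
def should_mark_out(down_since, now, down_out_interval=300.0):
    """Return True once an OSD has been down for at least the configured interval.

    down_since and now are timestamps in seconds. The default interval mirrors
    the 300-second value quoted in the text (an assumption for your release).
    """
    return (now - down_since) >= down_out_interval
```

With the default interval, an OSD that went down at t=0 is still only down at t=299 and becomes a candidate for out at t=300.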
A placement group can also be degraded because Ceph cannot find one or more objects that should be in it. You cannot read or write the unfound objects, but you can still access all the other objects in the degraded placement group.
6. Recovering
Ceph is designed for fault tolerance at a scale where hardware and software problems are ongoing. When an OSD goes down, its contents may fall behind the other replicas in its placement groups. When it comes back up, the contents of those placement groups must be updated to reflect the current state. During this time, the OSD may show the recovering state.
Recovery is not always trivial, because a hardware failure can affect multiple OSDs. For example, if a network switch for a rack or cabinet fails, the OSDs on several hosts can fall behind the current state of the cluster, and each of those OSDs must recover once the fault is resolved.
Ceph provides a number of settings to balance the resource contention between new service requests and the need to recover data objects and restore placement groups to the current state. The osd recovery delay start setting allows an OSD to restart, re-peer, and even process some replay requests before starting recovery. The osd recovery threads setting limits the number of threads used for recovery (1 thread by default). The osd recovery thread timeout setting sets a thread timeout, because multiple OSDs may fail, restart, and re-peer at staggered rates. The osd recovery max active setting limits the number of recovery requests an OSD handles at the same time, so that it is not too loaded to serve clients. The osd recovery max chunk setting limits the size of recovered data chunks to prevent network congestion.
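To show how the recovery settings named above might be written down, here is a small sketch that renders them as an [osd] section in INI syntax using Python's configparser. Apart from the single recovery thread mentioned in the text, every value is an illustrative placeholder, not a recommendation and not necessarily a default; which options exist also varies by Ceph release.

```python
import configparser
import io

# Placeholder values for illustration only (assumptions, not tuned settings).
RECOVERY_SETTINGS = {
    "osd recovery delay start": "15",
    "osd recovery threads": "1",
    "osd recovery thread timeout": "30",
    "osd recovery max active": "5",
    "osd recovery max chunk": "1048576",
}


def render_osd_section(settings):
    """Render the given options as an [osd] section in INI syntax."""
    parser = configparser.ConfigParser()
    parser["osd"] = settings
    buffer = io.StringIO()
    parser.write(buffer)
    return buffer.getvalue()


print(render_osd_section(RECOVERY_SETTINGS))
```

The printed text could be pasted into a test configuration and adjusted; check the option names against the documentation for your release first.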
7. Backfilling
When a new OSD joins the cluster, CRUSH reassigns placement groups from existing OSDs to it. Forcing the new OSD to accept the reassigned placement groups immediately can overload it, so backfilling lets this process run in the background. Once backfilling is complete, the new OSD begins serving requests when it is ready.
8. Remapped
When the acting set that serves a placement group changes, the data migrates from the old acting set to the new one. It may take some time before the new primary OSD can serve requests, so it may ask the old primary to keep serving until the placement group migration is complete. Once the data has migrated, the mapping uses the primary OSD of the new acting set.
9. Stale
Although Ceph uses heartbeats to ensure that hosts and daemons are running, ceph-osd daemons can still get stuck in a state where they do not report statistics on time (for example, during a temporary network fault). By default, OSD daemons report their placement group, up thru, boot, and failure statistics every half second (0.5), which is more frequent than the heartbeat thresholds. If the primary OSD of a placement group's acting set fails to report to the monitor, or if other OSDs have reported the primary OSD as down, the monitors mark the placement group as stale. When you start a cluster, it is common to see the stale state until peering completes. If placement groups are in the stale state after the cluster has been running for a while, the primary OSD for those placement groups is down or is not reporting statistics to the monitor.
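The reporting rule above can be sketched as a simple timeout check. This is an illustration under stated assumptions, not the monitor's actual logic: the function name, the dictionary layout, and the timeout parameter are all hypothetical.

```python
def find_stale_pgs(last_report, now, timeout):
    """Return the IDs of placement groups with no report within `timeout` seconds.

    last_report maps a placement group ID to the timestamp (in seconds) of the
    most recent statistics report received from its primary OSD.
    """
    return sorted(pg for pg, seen in last_report.items() if now - seen > timeout)
```

For instance, with reports at t=100, t=990, and t=50 for three placement groups, a check at t=1000 with a 500-second timeout flags the first and third as stale.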
Second, identifying troubled placement groups
In general, when placement groups get stuck, Ceph's ability to repair itself may not be working. The stuck states are:
1. Unclean
Unclean: some objects in the placement group do not have the desired number of replicas. These placement groups should be recovering.
2. Inactive
Inactive: placement groups cannot process reads or writes because they are waiting for an OSD that holds the most up-to-date data to come back up.
3. Stale
Stale: placement groups are in an unknown state because the OSDs that host them have not reported to the monitor for a while (configured by mon osd report timeout).
To find stuck placement groups, run:
    ceph pg dump_stuck [unclean|inactive|stale]
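If you want to call this command from a script, a thin wrapper can validate the state name first. This is a minimal sketch: the function names are hypothetical, and actually running it assumes the ceph CLI is installed and can reach the cluster.

```python
import subprocess

# The three stuck states accepted by the command shown above.
STUCK_STATES = ("unclean", "inactive", "stale")


def dump_stuck_command(state):
    """Build the argument list for `ceph pg dump_stuck <state>`."""
    if state not in STUCK_STATES:
        raise ValueError(f"unknown stuck state: {state!r}")
    return ["ceph", "pg", "dump_stuck", state]


def dump_stuck(state):
    """Run the command and return its text output.

    Assumes the ceph CLI is available; raises CalledProcessError on failure.
    """
    result = subprocess.run(
        dump_stuck_command(state), capture_output=True, text=True, check=True
    )
    return result.stdout
```

Separating the command builder from the call keeps the validation testable on a machine without a cluster.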
Third, locating objects
The method is already covered in the Ceph techniques write-up and is not repeated here.
Ceph Placement Group Status summary