Ceph Source Code Analysis: Scrub Fault Detection

Tags: crc32, checksum

Reprinted from: http://www.cnblogs.com/chenxianpao/p/5878159.html (please credit the original source when reprinting).

This article only outlines the general process; some details are not yet fully understood and will be added later when time permits. Corrections are welcome.

One of Ceph's main features is strong consistency, which here mainly means end-to-end consistency. As is well known, the traditional end-to-end solution is based on per-block checksums, because errors can occur at every layer of the conventional storage path: the application layer, the kernel file system, the generic block layer, the SCSI layer, and finally the HBA and the disk controller. In Ceph, the addition of Ceph's own clients and network, storage logic, and data migration inevitably raises the probability of error further.

Because Ceph, as an application-layer path, stores data through POSIX interfaces and supports Parity Read/Write, encapsulating data into fixed-size blocks and appending checksum data to them would cause serious performance problems. Ceph therefore simply introduces the Scrub mechanism (read verify) to ensure data correctness.

Simply put, Ceph's OSDs periodically start a Scrub thread to scan a subset of objects and compare them with the other replicas to check consistency; if inconsistencies are found, Ceph raises the exception to the user to resolve.

First, Scrub Core Process

/*
 * Chunky scrub scrubs objects one chunk at a time with writes blocked for that
 * chunk.
 *
 * The object store is partitioned into chunks which end on hash boundaries. For
 * each chunk, the following logic is performed:
 *
 *  (1) Block writes on the chunk                        // block writes
 *  (2) Request maps from replicas                       // get ScrubMaps from the replicas
 *  (3) Wait for pushes to be applied (after recovery)   // wait for the pushes to take effect
 *  (4) Wait for writes to flush on the chunk            // wait for writes on the chunk to flush
 *  (5) Wait for maps from replicas                      // wait for the replica maps to arrive
 *  (6) Compare / repair all scrub maps                  // compare and repair
 *
 * The primary determines the last update from the subset by walking the log. If
 * it sees a log entry pertaining to a file in the chunk, it tells the replicas
 * to wait until this update is applied before building a scrub map. Both the
 * primary and replicas will wait for any active pushes to be applied.
 *
 * In contrast to classic_scrub, chunky_scrub is entirely handled by scrub_wq.
 */
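
To make the six steps concrete, here is a minimal C++ sketch of the per-chunk control flow. It is not the actual Ceph implementation; the Chunk type and all helper names (block_writes, request_replica_maps, compare_and_repair, and so on) are invented placeholders for the work described in the comment above.

// Simplified sketch of the per-chunk flow; all types and helpers are
// hypothetical placeholders, not Ceph's real API.
#include <string>
#include <vector>

struct Chunk { std::string start, end; };        // a hash-bounded object range

struct ChunkyScrubber {
    void scrub_pg(const std::vector<Chunk>& chunks) {
        for (const Chunk& c : chunks) {
            block_writes(c);             // (1) block writes on the chunk
            request_replica_maps(c);     // (2) ask replicas for their ScrubMaps
            wait_for_pushes();           // (3) wait for recovery pushes to be applied
            wait_for_writes_flushed(c);  // (4) wait for in-flight writes to flush
            wait_for_replica_maps();     // (5) wait for the replica ScrubMaps
            compare_and_repair(c);       // (6) compare maps, record inconsistencies
            unblock_writes(c);           // let client IO to this chunk resume
        }
    }

    // Placeholders for the real work done by the scrub state machine below.
    void block_writes(const Chunk&) {}
    void request_replica_maps(const Chunk&) {}
    void wait_for_pushes() {}
    void wait_for_writes_flushed(const Chunk&) {}
    void wait_for_replica_maps() {}
    void compare_and_repair(const Chunk&) {}
    void unblock_writes(const Chunk&) {}
};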

Second, Scrub Function Flow: starting from the PG::chunky_scrub state machine

In the NEW_CHUNK state, the _request_scrub_map function sends a new MOSDRepScrub message to the replicas; each replica builds its own ScrubMap and returns it to the primary.

The primary's processing flow after receiving the reply (an MOSDSubOp message):

writes the received replica ScrubMap into scrubber.received_maps;

once ScrubMaps have been returned by all replicas, calls PG::requeue_scrub() to re-queue the scrub operation.
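
On the primary side, the bookkeeping amounts to collecting one ScrubMap per replica and re-queuing the scrub once all of them have arrived. Below is a minimal sketch of that idea, with placeholder types standing in for Ceph's pg_shard_t, ScrubMap and the requeue callback.

#include <cstddef>
#include <functional>
#include <map>
#include <utility>

using ShardId = int;                    // stand-in for pg_shard_t
struct ScrubMap {};                     // stand-in: per-object sizes, attrs, digests

struct ScrubMapCollector {
    std::map<ShardId, ScrubMap> received_maps;  // like scrubber.received_maps
    std::size_t expected = 0;                   // replicas that were sent MOSDRepScrub
    std::function<void()> requeue_scrub;        // re-enters the scrub state machine

    // Called when a reply carrying a replica's ScrubMap arrives.
    void handle_replica_map(ShardId from, ScrubMap m) {
        received_maps[from] = std::move(m);
        if (received_maps.size() == expected && requeue_scrub)
            requeue_scrub();            // every replica answered: resume the scrub
    }
};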

Third, Scrub State Machine

INACTIVE

Start the scrub: update the OSD's scrub statistics, set the scrubber's state information (start, state, epoch_start, and so on), and set the state to NEW_CHUNK.

NEW_CHUNK

Initialize scrubber.primary_scrubmap and received_maps; select a chunk of objects whose size is bounded by osd_scrub_chunk_min and osd_scrub_chunk_max (how exactly are they selected?); send a new MOSDRepScrub() message to every replica in acting_backfill to request their ScrubMaps; set the state to WAIT_PUSHES.

WAIT_PUSHES

Wait for pushes (push what?) to complete; once active_pushes is 0, set the state to WAIT_LAST_UPDATE.

WAIT_LAST_UPDATE

Wait for updates (update what?) to be applied; once last_update_applied has caught up with scrubber.subset_last_update, set the state to BUILD_MAP.

BUILD_MAP

Build a ScrubMap: read each object's size and attribute information. In deep mode, the CRC32 checksum is computed differently for the EC and replicated pool types: the replicated type computes the CRC of the omap_header, while for the EC type the CRC value is the hash of the object. The ScrubMap contains object sizes, attrs and omap attrs, and historical version information. Set the state to WAIT_REPLICAS. (A simplified sketch of the deep-scrub digest computation and the later map comparison follows the state list.)

WAIT_REPLICAS

Wait for the ScrubMaps returned by the replicas, then set the state to COMPARE_MAPS.

COMPARE_MAPS

Build the authoritative map (authmap) from scrubber.primary_scrubmap, build a master set containing all objects, traverse the master set and, selecting an OSD without anomalies as the reference, compare file sizes, attribute information and digests (digest and omap_digest); record error objects and missing objects for later repair/recovery, and set the state to WAIT_DIGEST_UPDATES (see the sketch after the state list).

WAIT_DIGEST_UPDATES

Wait for ReplicatedPG::_scrub() to complete (what exactly does it do?). If there are still objects left to check, go back to NEW_CHUNK and repeat the steps above; otherwise move to FINISH.

FINISH

Scrub ends.
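
The sketch below illustrates, in simplified C++, what BUILD_MAP does in deep mode for a replicated pool and how COMPARE_MAPS checks a replica against the authoritative map. It is only an illustration: Ceph actually uses crc32c (ceph_crc32c) and a much richer ScrubMap::object structure; here zlib's crc32 and minimal structs stand in.

// Simplified stand-in for deep-scrub digest computation and ScrubMap comparison.
#include <zlib.h>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

struct ObjectInfo {
    uint64_t size = 0;
    uint32_t data_digest = 0;
    uint32_t omap_digest = 0;
};
using ScrubMap = std::map<std::string, ObjectInfo>;   // object name -> info

uint32_t digest(const std::vector<uint8_t>& buf) {
    return crc32(0L, buf.data(), buf.size());         // deep-scrub checksum (crc32c in Ceph)
}

// BUILD_MAP (deep mode): record size plus data/omap digests for one object.
ObjectInfo build_entry(const std::vector<uint8_t>& data,
                       const std::vector<uint8_t>& omap_header) {
    return ObjectInfo{ data.size(), digest(data), digest(omap_header) };
}

// COMPARE_MAPS: check one replica's map against the authoritative map and
// return the names of objects that are missing or inconsistent.
std::vector<std::string> compare(const ScrubMap& authoritative,
                                 const ScrubMap& replica) {
    std::vector<std::string> bad;
    for (const auto& [name, auth] : authoritative) {
        auto it = replica.find(name);
        if (it == replica.end()) { bad.push_back(name); continue; }   // missing object
        const ObjectInfo& o = it->second;
        if (o.size != auth.size ||
            o.data_digest != auth.data_digest ||
            o.omap_digest != auth.omap_digest)
            bad.push_back(name);                                      // inconsistent object
    }
    return bad;
}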

Note:

1. The OSD triggers the Scrub process at PG granularity; the trigger frequency can be set via configuration options, and a PG's Scrub is initiated by the OSD holding the primary (master) role for that PG.

2. A PG in a normal environment contains anywhere from thousands to hundreds of thousands of objects. Because the Scrub process must extract each object's checksum and compare it with the checksums of the other replicas, and the data of the objects being verified must not be modified during that time, each run of a PG's Scrub process only checks a small subset of objects. Ceph takes part of the hash of each object's name as the extraction factor; each run picks the objects matching the current hash value and compares them. This is why Ceph calls it "chunky" Scrub. (A toy sketch of this hash-based selection follows these notes.)

3. After locating the set of objects to verify, the initiator needs to ask the other replicas to lock this set of objects. Because the writes of an object on the master and replica nodes actually reach the underlying storage engine at slightly different times, the initiator sends the version of the object set along with the request, and the replica nodes wait until they have synchronized to the same version as the master node.

4. After confirming that the object set is at the same version on all nodes, the initiator asks every node to compute the verification information for this set of objects and send it back.

5. This verification information includes each object's metadata, such as its size, all the keys and historical version information of its extended attributes, and so on; in Ceph it is called the ScrubMap.

6. The initiator compares the ScrubMaps and discovers inconsistent objects; the inconsistent objects are collected and finally reported to the monitor, and the user can then learn the Scrub results from the monitor.

Users can start the repair process for inconsistent objects with "ceph pg repair [pg_id]". The current repair simply copies the object from the master node to the replica nodes, so for now the user must manually confirm that the master node's object is the "correct copy". In addition, Ceph provides a deep Scrub mode that fully compares object data in the hope of finding problems in Ceph itself or in the file system; this usually imposes a heavy IO burden, making it hard to achieve the desired effect in a real production environment.
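
The toy sketch below illustrates the hash-bounded chunk selection described in note 2; std::hash and the cursor logic are simple stand-ins for Ceph's real object-name hashing and its handling of osd_scrub_chunk_min/osd_scrub_chunk_max.

// Toy illustration of picking a "chunk" of objects by hash range.
#include <algorithm>
#include <cstdint>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Pick the next chunk of roughly [chunk_min, chunk_max] objects, ordered by
// name hash, starting from hash value `cursor`. Returns the selected names and
// sets `cursor` to the next hash boundary so a later call continues from there.
std::vector<std::string> next_chunk(const std::vector<std::string>& all_objects,
                                    uint32_t& cursor,
                                    size_t chunk_min, size_t chunk_max) {
    std::vector<std::pair<uint32_t, std::string>> hashed;
    for (const auto& name : all_objects) {
        uint32_t h = (uint32_t)std::hash<std::string>{}(name);
        if (h >= cursor)
            hashed.emplace_back(h, name);      // only objects at or after the cursor
    }
    std::sort(hashed.begin(), hashed.end());   // order by hash, as chunks are hash-bounded

    size_t n = std::min(chunk_max, hashed.size());
    if (n < chunk_min) n = hashed.size();      // tail of the PG: take whatever is left
    std::vector<std::string> chunk;
    for (size_t i = 0; i < n; ++i)
        chunk.push_back(hashed[i].second);
    cursor = n < hashed.size() ? hashed[n].first : UINT32_MAX;  // next boundary
    return chunk;
}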

Fourth, Scrub Tests

Lab 1: Verifying the scrubbing and repair mechanisms

Scrubbing is Ceph's mechanism for maintaining data integrity, similar to fsck in a file system; it finds existing inconsistencies in the data. Scrubbing affects cluster performance. It comes in two kinds:

· One kind runs daily by default and is called light scrubbing; its period is determined by the configuration options osd scrub min interval (default 24 hours) and osd scrub max interval (default 7 days). It detects minor data inconsistencies by checking object sizes and attributes.

· The other runs weekly by default and is called deep scrubbing; its period is determined by the configuration option osd deep scrub interval (default one week). It detects deep data inconsistencies by reading the data and computing checksums.

The following are the default OSD scrub configuration options:

# ceph --admin-daemon /var/run/ceph/ceph-osd.5.asok config show | grep scrub
"osd_scrub_thread_timeout": "60",
"osd_scrub_thread_suicide_timeout": "60",
"osd_scrub_finalize_thread_timeout": "600",
"osd_scrub_finalize_thread_suicide_timeout": "6000",
"osd_scrub_invalid_stats": "true",
"osd_max_scrubs": "1",
"osd_scrub_load_threshold": "0.5",
"osd_scrub_min_interval": "86400",
"osd_scrub_max_interval": "604800",
"osd_scrub_chunk_min": "5",
"osd_scrub_chunk_max": "25",
"osd_scrub_sleep": "0",
"osd_deep_scrub_interval": "604800",
"osd_deep_scrub_stride": "524288",

Experimental process:

0: Find the object's PG acting set

osdmap e334 pool 'pool1' (9) object 'Evernote_5.8.6.7519.exe' -> pg 9.6094a41e (9.1e) -> up ([5,3,0], p5) acting ([5,3,0], p5)
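
The lookup above matches the output format of the ceph osd map command; it was presumably obtained with something like:

# ceph osd map pool1 Evernote_5.8.6.7519.exe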

1: Delete the object's file

Based on the PG id, OSD id and object name, locate the file on osd.5 at /var/lib/ceph/osd/ceph-5/current/9.1e_head/Evernote\u5.8.6.7519.exe__head_6094A41E__9 and delete it.

2: Shorten the light scrub cycle

To avoid waiting for up to a day, set both osd_scrub_min_interval and osd_scrub_max_interval to 4 minutes:

# ceph --admin-daemon ceph-osd.5.asok config set osd_scrub_max_interval 240
{ "success": "osd_scrub_max_interval = '240' "}
# ceph --admin-daemon ceph-osd.5.asok config get osd_scrub_max_interval
{ "osd_scrub_max_interval": "240"}
# ceph --admin-daemon ceph-osd.5.asok config set osd_scrub_min_interval 240
{ "success": "osd_scrub_min_interval = '240' "}
# ceph --admin-daemon ceph-osd.5.asok config get osd_scrub_min_interval
{ "osd_scrub_min_interval": "240"}

3: Try light scrub to find the problem

We can see that the light scrub ran as scheduled and found the problem with PG 9.1e, namely that the file is missing:

2016-06-06 18:15:49.798236 osd.5 [INF] 9.1d scrub ok
2016-06-06 18:15:50.799835 osd.5 [ERR] 9.1e shard 5 missing 6094a41e/Evernote_5.8.6.7519.exe/head//9
2016-06-06 18:15:50.799863 osd.5 [ERR] 9.1e scrub 1 missing, 0 inconsistent objects
2016-06-06 18:15:50.799866 osd.5 [ERR] 9.1e scrub 1 errors
2016-06-06 18:15:52.804444 osd.5 [INF] 9.20 scrub ok

The pgmap shows the inconsistent state:

2016-06-06 18:15:58.439927 mon.0 [INF] pgmap v5752: 64 pgs: 63 active+clean, 1 active+clean+inconsistent; 97071 kB data, 2268 MB used, 18167 MB / 20435 MB avail

At this point the cluster is in the HEALTH_ERR state:

health HEALTH_ERR 1 pgs inconsistent; 1 scrub errors

In addition to the scheduled scrubs, the administrator can also start a scrub manually with a command:

# ceph pg scrub 9.1e
instructing pg 9.1e on osd.5 to scrub

As the output shows, scrubbing is initiated by the PG's primary OSD.

4: Try deep scrub; the result is the same

Manually run ceph pg deep-scrub 9.1e to start a deep scrub; the result is the same.

5: Try pg repair; it succeeds

Run ceph pg repair 9.1e to start the PG repair process. The repair succeeds (1 errors, 1 fixed), the deleted file comes back, and the cluster returns to the OK state.

Conclusions:

· PG scrubbing can detect file loss.

· PG scrubbing has a negative impact on cluster performance; its priority and the resources allocated to it can be reduced appropriately.

Note: PG repair currently has quite a few problems; according to that article, it copies the data on the primary OSD to the other OSDs, which may cause correct data to be overwritten by erroneous data, so use it with caution. The following experiment examines this issue.

Lab 2: Verify whether scrubbing can detect content inconsistencies, and how PG repair behaves

0: Create an object from a text file whose content is the string 1111; its PG maps to OSDs [2,0,3]. (A command sketch for steps 0, 1 and 8 follows this list.)

1: Modify the file's content on osd.2 to 1122.

2: Start a light scrub: 9.2e scrub ok, so it does not find the problem. Isn't that strange? Light scrub is supposed to check file attributes and size; the attributes should include the modification time, so this ought to be detectable.

3: Start a deep scrub: 9.2e deep-scrub ok, so it does not find the problem either. Isn't that strange? Deep scrub is supposed to check the object data, and the data has changed, so this ought to be detectable...

4: Start pg repair: the content is unchanged. It seems repair does nothing for a PG that is not in the inconsistent state.

5: Modify the file on osd.2 again, adding content so that its size changes.

6: Start a light scrub. This time it finds the problem: shard 2: soid b2e6cb6e/text/head//9 size 931 != known size 5, 1 inconsistent objects, and the cluster state changes to HEALTH_ERR.

7: Start a deep scrub; it also finds the problem now.

8: Run rados get; the original file is retrieved correctly. This means that even when the cluster is in HEALTH_ERR, IO to a PG in the active+clean+inconsistent state still works.

9: Start pg repair: the file on osd.2 is overwritten and returns to its original content. This shows that PG repair does not simply copy from the primary OSD to the other OSDs.
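
For reference, steps 0, 1 and 8 can be reproduced roughly as follows. The pool name pool1 is taken from Lab 1; the FileStore path on osd.2 is only illustrative (the real path and hash suffix must be looked up under /var/lib/ceph/osd/ceph-2/current/).

# step 0: create the object from a small text file
echo 1111 > text
rados -p pool1 put text ./text

# step 1: overwrite the copy on osd.2 directly in its FileStore directory
# (illustrative path; this keeps the size unchanged, as in the experiment)
echo 1122 > /var/lib/ceph/osd/ceph-2/current/9.2e_head/text__head_B2E6CB6E__9

# step 8: the object can still be read back correctly through RADOS
rados -p pool1 get text /tmp/text.out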

Conclusions:

· The two kinds of scrubbing simply use different methods to detect different classes of data-consistency problems, and repair is a separate matter; not every problem can be detected. As verified above, a size change is detected, but a content change that leaves the size unchanged is not.

· PG repair also does not behave the way Re: [ceph-users] Scrub Error / How does ceph pg repair work? describes, i.e. simply copying the data on the primary OSD to the other OSDs, which could corrupt the correct data. That post was written in 2015 and my Ceph version is 0.8.11, so there may be discrepancies.

Fifth, Scrub Problems

As the process above shows, the current Scrub mechanism has the following issues:

1. After inconsistent objects are found, there is no policy for correcting the error automatically, for example assimilating the minority replicas to the majority when most replicas agree.

2. The Scrub mechanism does not solve the storage system's end-to-end correctness problem in a timely way; the upper-layer application may already have read erroneous data.

Regarding the first problem, Ceph already has a blueprint to enhance Scrub's repair capability: when the user starts a repair, a policy based on majority agreement among the replicas would be used to choose the authoritative copy, replacing the current policy of synchronizing from the primary replica.
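
Purely as an illustration of what such a majority-consistent policy could look like (this is not the blueprint's actual design), the authoritative copy of an object could be chosen by voting on the per-replica digests:

// Toy majority vote over per-replica digests of one object: the digest held by
// the most replicas wins, and replicas holding anything else would be repaired
// from a winner. Illustrative only, not Ceph's implementation.
#include <cstdint>
#include <map>
#include <optional>
#include <vector>

std::optional<uint32_t> majority_digest(const std::vector<uint32_t>& replica_digests) {
    std::map<uint32_t, int> votes;
    for (uint32_t d : replica_digests)
        ++votes[d];

    uint32_t best = 0;
    int best_count = 0;
    for (const auto& [dig, count] : votes)
        if (count > best_count) { best = dig; best_count = count; }

    // Require a strict majority; otherwise leave the decision to the operator.
    if (best_count * 2 > static_cast<int>(replica_digests.size()))
        return best;
    return std::nullopt;
}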

Regarding the second problem, traditional end-to-end solutions achieve "end-to-end verification" by attaching checksum data to fixed-size blocks. Ceph, however, does not manage and allocate storage device space itself; it relies on the file system for space management, and adding per-object checksums on the write path would severely hurt performance. Verification from the file system down to the device is therefore left to the file system, while object correctness checking within Ceph (covering both client and server side) can only rely on the read-verify mechanism, together with synchronously comparing object information across replicas during data migration to ensure correctness. The current asynchronous mode leaves open the possibility that erroneous data is returned in the meantime.

Reference documentation:

The tests in this article are based on those in shimin (Sammy Liu)'s post: http://www.cnblogs.com/sammyliu/p/5568989.html; code-level tests will be added when time permits.

The conceptual descriptions are based on: https://www.ustack.com/blog/ceph-internal-scrub/
