Ceph Source Code Analysis: PG Peering


Device failures in the cluster (abnormal OSDs being added or removed) can leave the replicas of a PG inconsistent with one another, so data recovery is needed to bring all replicas back to a consistent state.

First, OSD failures and how they are handled:

1. Types of OSD failure:

Failure A: a previously healthy OSD stops working because of a device problem, and after the configured timeout it is marked out of the cluster.

Failure B: a previously healthy OSD stops working because of a device problem, but comes back within the configured timeout and rejoins the cluster.

2. How OSD failures are handled:

Failure A: every PG stored on that OSD gets its lost replica re-created on another OSD. Since a PG can contain an arbitrary number of objects and all of them have to be copied, this can mean a very large amount of data movement.

Failure B: when the OSD comes back, each of its PGs must decide whether incremental recovery is possible; if so, incremental recovery is used, otherwise a full recovery is performed. (Incremental recovery: only the objects that changed in the PG while the OSD was away are recovered. Full recovery: all objects in the PG are recovered, handled the same way as failure A.)

An operation that needs full recovery is called a backfill operation; an operation that only needs incremental recovery is called a recovery operation.
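The distinction can be pictured with a minimal, self-contained sketch (hypothetical simplified types, not Ceph code): incremental recovery is only possible while the returning replica's log still overlaps the authoritative log; once the replica has fallen behind the log tail, only backfill remains. The same condition shows up later, in calc_replicated_acting, as cur.last_update < auth.log_tail.

// A minimal sketch (hypothetical types) of the recovery-vs-backfill decision.
#include <cstdint>
#include <iostream>

struct simple_log_range {
    uint64_t tail;         // oldest entry still kept in the pg_log
    uint64_t last_update;  // newest entry recorded in the pg_log
};

// true  -> recovery (incremental): the logs still overlap
// false -> backfill (full copy): the replica fell behind the log tail
bool can_recover_from_log(const simple_log_range &auth,
                          const simple_log_range &replica) {
    return replica.last_update >= auth.tail;
}

int main() {
    simple_log_range auth    = {100, 250};
    simple_log_range replica = {80,  150};  // came back before the log rolled past it
    std::cout << (can_recover_from_log(auth, replica)
                  ? "recovery (incremental)\n"
                  : "backfill (full copy)\n");
    return 0;
}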

Second, key concepts:

1. osdmap: the map of all OSDs in the cluster, including each OSD's address and state (up or down).

2. acting set & up set: every PG has these two sets. The acting set is the set of OSDs that hold the PG's replicas; e.g. acting = [0,1,2] means the replicas are stored on osd.0, osd.1 and osd.2, and the first entry, osd.0, is the PG's primary. Normally the up set is identical to the acting set; to understand when they differ you first need the notion of pg_temp (concept 8 below).

3. epoch: the version number of the osdmap; it increases monotonically, by 1 for every osdmap change.

4. current_interval & past_interval: an interval is a sequence of epochs during which the PG's acting set did not change; current_interval is the interval the PG is in now, and past_intervals are the earlier ones.

last_epoch_started: the osdmap epoch at which the last peering completed.

last_epoch_clean: the osdmap epoch at which the last recovery or backfill completed.

(Note: when peering finishes, data recovery is only just beginning, so last_epoch_started and last_epoch_clean may differ.)

For example:

Suppose the cluster's current epoch is 20, and pg1.0's acting set and up set are both [0,1,2].

    • osd.3 fails; the osdmap changes and the epoch becomes 21

    • osd.5 fails; the osdmap changes and the epoch becomes 22

    • osd.6 fails; the osdmap changes and the epoch becomes 23

None of these three epoch changes alters pg1.0's acting set or up set.

    • osd.2 fails; the osdmap changes and the epoch becomes 24

This changes pg1.0's acting set and up set to [0,1,8]; if peering completes successfully at this point, last_epoch_started becomes 24.

    • osd.12 fails; the osdmap changes and the epoch becomes 25

If pg1.0 has finished recovery by now and is in the clean state, last_epoch_clean is 25.

    • osd.13 fails; the osdmap changes and the epoch becomes 26

The epoch sequence 21, 22, 23 is a past interval of pg1.0.

The epoch sequence 24, 25, 26 is pg1.0's current interval.

5. authoritative history: the complete, authoritative sequence of PG log operations.

6. last_epoch_started: the epoch at which the last peering completed.

7. up_thru: within a past interval, the epoch at which peering was first completed.

8. pg_temp: suppose a PG is short of replicas, with up/acting = [1,2]/[1,2], and osd.3 is now added as a new replica for this PG. CRUSH calculates that osd.3 should become the PG's primary, but osd.3 does not yet hold any of the PG's data and therefore cannot serve as primary. The PG therefore requests a pg_temp entry that keeps osd.1 as the acting primary: the CRUSH result becomes the up set while the temporary mapping is used as the acting set, so the PG's sets become [3,1,2]/[1,2,3] (up/acting). Once all the data has been recovered onto osd.3, the pg_temp mapping is removed and the sets become [3,1,2]/[3,1,2].

9. pg_log: the pg_log is the key structure for recovering data. Each PG has its own log, and every operation on an object in the PG is recorded in it. The main fields of a log entry are listed below; a simplified struct sketch follows the list.

    • __s32 op; the type of the operation

    • hobject_t soid; the object the operation applies to

    • eversion_t version, prior_version, reverting_to; the versions associated with the operation
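As referenced above, here is a simplified, self-contained sketch of what a single log entry carries. hobject_t and eversion_t are reduced to toy stand-ins; the real Ceph structures hold considerably more state.

// A simplified sketch (hypothetical stand-in types) of one pg_log entry.
#include <cstdint>
#include <string>

struct eversion_t_sketch {        // an "epoch.version" pair used to order operations
    uint32_t epoch = 0;
    uint64_t version = 0;
};

struct hobject_t_sketch {         // identifies the object the operation touched
    std::string oid;
};

struct pg_log_entry_sketch {
    int32_t           op;             // type of the operation (modify, delete, ...)
    hobject_t_sketch  soid;           // object the operation applied to
    eversion_t_sketch version;        // version written by this operation
    eversion_t_sketch prior_version;  // version the object had before this operation
    eversion_t_sketch reverting_to;   // used when an object is rolled back
};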

Third, the peering process in detail

Algorithm flowchart: (the original figure is not reproduced here)

Peering: the process by which the replicas of a PG (three here; the number is the pool's configured replica count, usually 3) reach agreement on the PG's metadata. The official explanation is as follows:

The process of bringing all of the OSDs that store a Placement Group (PG) into agreement about the state of all of the objects (and their metadata) in that PG. Note that agreeing on the state does not mean that they all have the latest contents.

Primary PG and replica PG: of a PG's three copies, one is the master and the other two are subordinate; the master is called the primary PG and the other two are called replica PGs.

1. Impact of the peering process

After a failed OSD restarts, the primary PG and the replica PGs go through different processing paths. The primary PG enters the peering state first; while a PG is in this state it suspends the processing of I/O requests, which shows up in production as part of the cluster's I/O not responding, and some cloud hosts may even find their applications unable to make progress because they are stuck waiting on I/O. The main steps of the peering process are analysed below together with the source code.

2. Peering Process Analysis

A PG is a state machine implemented with boost::statechart. Peering goes through the following main stages:

1. GetInfo:

1.1. Choose the epoch range to examine, compute the acting set, acting primary, up set and up primary for every epoch in that range, and group consecutive epochs with the same result into one interval:

pg->generate_past_intervals();

The generate_past_intervals() function builds the past_intervals sequence. It first determines the start epoch (history.last_epoch_clean, the epoch of the last completed data recovery) and the end epoch (history.same_interval_since, the first epoch of the most recent interval). With the start and end fixed, it loops over every osdmap between those two versions and splits the range into intervals wherever the PG's membership changed.
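The interval-splitting idea can be illustrated with a small standalone sketch (a hypothetical helper, not generate_past_intervals() itself): walk the epochs of the chosen range and open a new interval whenever the PG's acting set changes. The sample data reuses the epoch 20-26 example from the concepts section.

// A standalone sketch (hypothetical types) of splitting an epoch range into intervals.
#include <cstdint>
#include <iostream>
#include <map>
#include <vector>

struct interval_sketch {
    uint32_t first, last;          // epoch range covered by this interval
    std::vector<int> acting;       // acting set during the interval
};

std::vector<interval_sketch> split_intervals(
    uint32_t start_epoch, uint32_t end_epoch,
    const std::map<uint32_t, std::vector<int>> &acting_by_epoch)
{
    std::vector<interval_sketch> out;
    for (uint32_t e = start_epoch; e <= end_epoch; ++e) {
        const std::vector<int> &acting = acting_by_epoch.at(e);
        if (out.empty() || out.back().acting != acting)
            out.push_back({e, e, acting});   // acting changed: start a new interval
        else
            out.back().last = e;             // unchanged: extend the current interval
    }
    return out;
}

int main() {
    std::map<uint32_t, std::vector<int>> acting_by_epoch = {
        {20, {0, 1, 2}}, {21, {0, 1, 2}}, {22, {0, 1, 2}}, {23, {0, 1, 2}},
        {24, {0, 1, 8}}, {25, {0, 1, 8}}, {26, {0, 1, 8}},
    };
    for (const interval_sketch &iv : split_intervals(20, 26, acting_by_epoch))
        std::cout << "interval [" << iv.first << "," << iv.last << "]\n";
    return 0;   // prints [20,23] and [24,26]
}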

1.2. Examine each interval; add the OSDs from it that are still up to the prior set, and also add the current acting set and up set to the prior set:

pg->build_prior(prior_set);

The prior set is generated from past_intervals. Besides the members of the current acting and up sets, build_prior() loops over each past interval and considers it only if interval.last >= info.history.last_epoch_started, its acting set is not empty (!interval.acting.empty()) and it may have served writes (interval.maybe_went_rw); for such an interval, every OSD that was in its acting set and is still up in the cluster is added to the prior set.
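A minimal sketch (hypothetical simplified types) of the prior-set construction described above: start from the current acting and up members, then, for every qualifying past interval, add its acting members that are still up in the cluster.

// A standalone sketch (hypothetical types) of the build_prior() idea.
#include <set>
#include <vector>

struct past_interval_sketch {
    std::vector<int> acting;   // acting set during that interval
    bool maybe_went_rw;        // the interval may have served reads/writes
};

std::set<int> build_prior_sketch(
    const std::vector<int> &cur_acting,
    const std::vector<int> &cur_up,
    const std::vector<past_interval_sketch> &past_intervals,
    const std::set<int> &up_osds)              // OSDs currently up
{
    std::set<int> prior(cur_acting.begin(), cur_acting.end());
    prior.insert(cur_up.begin(), cur_up.end());
    for (const past_interval_sketch &iv : past_intervals) {
        if (iv.acting.empty() || !iv.maybe_went_rw)
            continue;                           // could not have taken writes: skip
        for (int osd : iv.acting)
            if (up_osds.count(osd))
                prior.insert(osd);              // may hold data we must query
    }
    return prior;
}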

1.3. Send a pg_query_t::INFO request to every up OSD in prior_set, wait for the replies, and store each received info in peer_info:

context< RecoveryMachine >().send_query(
    peer, pg_query_t(pg_query_t::INFO,
                     it->shard, pg->pg_whoami.shard,
                     pg->info.history,
                     pg->get_osdmap()->get_epoch()));

Based on the prior_set, PG::RecoveryState::GetInfo::get_infos() sends an info request to every OSD in the set and then waits for the replies.

1.4. After the last reply is received, the state machine posts the GotInfo event; if one of the queried OSDs goes down before replying, the PG stays in this state and waits until that OSD comes back:

boost::statechart::result PG::RecoveryState::GetInfo::react(const MNotifyRec &infoevt)

This is the reply handler. It mainly calls pg->proc_replica_info() to do two things: 1. store the received info in the peer_info array; 2. merge the history records. The PG waits here until every replica has replied with its info, then moves on to the next state, GetLog.
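A minimal sketch (hypothetical simplified types, not the Ceph function) of the two things proc_replica_info() is described as doing with each reply: remember the peer's info in the peer_info map and merge its history into the local history.

// A standalone sketch (hypothetical types) of handling one info reply.
#include <cstdint>
#include <map>

struct pg_history_sketch {
    uint32_t last_epoch_started = 0;
    uint32_t last_epoch_clean = 0;
    bool merge(const pg_history_sketch &o) {      // keep the newer values
        bool changed = false;
        if (o.last_epoch_started > last_epoch_started) {
            last_epoch_started = o.last_epoch_started;
            changed = true;
        }
        if (o.last_epoch_clean > last_epoch_clean) {
            last_epoch_clean = o.last_epoch_clean;
            changed = true;
        }
        return changed;
    }
};

struct pg_info_sketch {
    uint64_t last_update = 0;
    pg_history_sketch history;
};

void proc_replica_info_sketch(int from, const pg_info_sketch &info,
                              std::map<int, pg_info_sketch> &peer_info,
                              pg_history_sketch &local_history,
                              bool &dirty_info) {
    peer_info[from] = info;                 // 1. store the peer's info
    if (local_history.merge(info.history))  // 2. merge the history records
        dirty_info = true;
}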

2. GetLog:

2.1. Traverse peer_info and find the best info, whose log becomes the authoritative log; store the PGs in the acting set/peer_info that are in the complete state, plus all PGs in the up set, into acting_backfill:

pg->choose_acting(auth_log_shard,
                  &context< Peering >().history_les_bound)

The acting set and the authoritative OSD (auth_osd) are selected through pg->choose_acting(auth_log_shard).

choose_acting() does the following main work:

    • find_best_info finds the optimal OSD. Three criteria are applied, in priority order: the largest last_update, the smallest log_tail, and finally a preference for the current primary. (A sketch of this ordering follows this list.)

      map<pg_shard_t, pg_info_t>::const_iterator auth_log_shard =
          find_best_info(all_info, history_les_bound);

    • calc_replicated_acting selects the set of OSDs that will take part in peering and recovery:

      • Members of the up set: all of them are added to acting_backfill; a member that is in the incomplete state, or whose log cannot be joined to the authoritative log (cur.last_update < auth.log_tail), is added to backfill; otherwise it is also added to want.

      • Members of the acting set: these cannot be added to backfill, so it is only necessary to check that their state is complete and their log can be joined; if so, they are added to want and acting_backfill.

      • OSD members from the other priors are handled in the same way as acting members.

      • After this step we have: the members of acting_backfill (which recover data from the log, or help others recover), the members of backfill (which can only be recovered by copying the full PG data from other OSDs), and the members of want (also part of acting_backfill, but distinct from the backfill members).

    • calc_ec_acting. Ceph has two kinds of pools: replicated pools and erasure-coded pools (similar to RAID). The concrete implementation is left for a follow-up; the author did not get to that code today.
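As mentioned in the find_best_info bullet above, the selection can be sketched as a simple ordered comparison (hypothetical simplified types, not the Ceph implementation): the largest last_update wins, ties are broken by the smallest log_tail (the longer log), and the current primary is preferred as the final tie-breaker.

// A standalone sketch (hypothetical types) of the find_best_info ordering.
#include <cstddef>
#include <cstdint>
#include <vector>

struct candidate_info {
    int      osd;
    uint64_t last_update;         // newest log entry this OSD has
    uint64_t log_tail;            // oldest log entry this OSD still has
    bool     is_current_primary;
};

int find_best_info_sketch(const std::vector<candidate_info> &all_info) {
    int best = -1;
    for (std::size_t i = 0; i < all_info.size(); ++i) {
        if (best < 0) { best = static_cast<int>(i); continue; }
        const candidate_info &b = all_info[best];
        const candidate_info &c = all_info[i];
        if (c.last_update != b.last_update) {
            if (c.last_update > b.last_update)
                best = static_cast<int>(i);           // newest log wins
        } else if (c.log_tail != b.log_tail) {
            if (c.log_tail < b.log_tail)
                best = static_cast<int>(i);           // longer log wins
        } else if (c.is_current_primary && !b.is_current_primary) {
            best = static_cast<int>(i);               // keep the primary stable
        }
    }
    return best < 0 ? -1 : all_info[best].osd;        // -1 if there are no candidates
}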

2.2. If the PG that holds the computed authoritative log is this PG itself, post the GotLog event directly; otherwise send a log query to the OSD where the authoritative log lives:

context< RecoveryMachine >().send_query(
    auth_log_shard,
    pg_query_t(
        pg_query_t::LOG,
        auth_log_shard.shard, pg->pg_whoami.shard,
        request_log_from, pg->info.history,
        pg->get_osdmap()->get_epoch()));

2.3. Receive the reply from the queried OSD, merge the received log into the local one, and move the state machine on to GetMissing; if no reply is received, the PG keeps waiting in this state:

boost::statechart::result PG::RecoveryState::GetLog::react(const GotLog &)
{
  dout(10) << "leaving GetLog" << dendl;
  PG *pg = context< RecoveryMachine >().pg;
  if (msg)
  {
    dout(10) << "processing master log" << dendl;
    pg->proc_master_log(*context<RecoveryMachine>().get_cur_transaction(),
                        msg->info, msg->log, msg->missing,
                        auth_log_shard);  // log processing function
  }
  pg->start_flush(
    context< RecoveryMachine >().get_cur_transaction(),
    context< RecoveryMachine >().get_on_applied_context_list(),
    context< RecoveryMachine >().get_on_safe_context_list());
  return transit< GetMissing >();  // jump to GetMissing
}

void PG::proc_master_log(
  ObjectStore::Transaction &t, pg_info_t &oinfo,
  pg_log_t &olog, pg_missing_t &omissing, pg_shard_t from)
{
  dout(10) << "proc_master_log for osd." << from << ": "
           << olog << " " << omissing << dendl;
  assert(!is_peered() && is_primary());

  // merge log into our own log to build master log.  no need to
  // make any adjustments to their missing map; we are taking their
  // log to be authoritative (i.e., their entries are by definition
  // non-divergent).
  merge_log(t, oinfo, olog, from);
  // merge_log() merges the received log into the local one to build the complete,
  // authoritative log.  This includes stitching the log at both ends and, most
  // importantly, recording via missing.add_next_event(ne) every object of the
  // local replica that needs to be recovered: this is where the missing
  // structure starts being accumulated.
  peer_info[from] = oinfo;  // save the oinfo that came with the best log into the local peer_info array
  dout(10) << " peer osd." << from << " now " << oinfo << " " << omissing << dendl;
  might_have_unfound.insert(from);

  // See doc/dev/osd_internals/last_epoch_started
  if (oinfo.last_epoch_started > info.last_epoch_started)
  {
    info.last_epoch_started = oinfo.last_epoch_started;
    dirty_info = true;
  }
  if (info.history.merge(oinfo.history))  // merge the history information
    dirty_info = true;
  assert(cct->_conf->osd_find_best_info_ignore_history_les ||
         info.last_epoch_started >= info.history.last_epoch_started);

  peer_missing[from].swap(omissing);  // record the received missing into the local peer_missing structure
}

    • auth_log: one result of the merge is auth_log, the largest and most authoritative log; data recovery is based on it (see the sketch after this list).

    • missing: the second result is the set of objects that the local replica needs to recover, collected while merging the log.

    • omissing: the set of objects that auth_osd itself needs to recover.
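A minimal standalone sketch (hypothetical simplified types) of what merge_log() is described as producing: authoritative entries newer than the local head extend the local log, and each object they touch is recorded in missing so it can be recovered afterwards.

// A standalone sketch (hypothetical types) of merging the authoritative log
// and accumulating the missing set.
#include <cstdint>
#include <map>
#include <string>
#include <vector>

struct log_entry_sketch {
    std::string oid;       // object the entry refers to
    uint64_t    version;   // version written by the entry
};

void merge_log_sketch(std::vector<log_entry_sketch> &local_log,
                      const std::vector<log_entry_sketch> &auth_log,
                      std::map<std::string, uint64_t> &missing) {
    uint64_t local_head = local_log.empty() ? 0 : local_log.back().version;
    for (const log_entry_sketch &e : auth_log) {
        if (e.version <= local_head)
            continue;                  // the local replica already has this entry
        local_log.push_back(e);        // extend the local log up to the auth head
        missing[e.oid] = e.version;    // this object must be recovered to e.version
    }
}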

3. GetMissing:

3.1. Traverse acting_backfill and send a log query to each OSD whose PG log intersects the primary PG's log; the remaining PGs, whose logs do not intersect, go straight into peer_missing, producing the missing set used by the later recovery:

context< RecoveryMachine >().send_query(
    *i,
    pg_query_t(
        pg_query_t::LOG,
        i->shard, pg->pg_whoami.shard,
        since, pg->info.history,
        pg->get_osdmap()->get_epoch()));

3.2. Merge each received reply into the local state; if one of the queried OSDs is down, the PG keeps waiting. Once all replies have arrived, the PG's state machine enters the active state and the peering process is finished:

boost::statechart::result PG::RecoveryState::GetMissing::react(const MLogRec &logevt)
{
  PG *pg = context< RecoveryMachine >().pg;

  peer_missing_requested.erase(logevt.from);
  pg->proc_replica_log(*context<RecoveryMachine>().get_cur_transaction(),
                       logevt.msg->info, logevt.msg->log,
                       logevt.msg->missing, logevt.from);
  // Handles the log message sent back by the other OSD.  proc_replica_log() trims the
  // peer's log, discarding the part that cannot be used, stores the received oinfo
  // into peer_info and the omissing into peer_missing, and then the state machine
  // heads towards the active state.

  if (peer_missing_requested.empty())
  {
    if (pg->need_up_thru)
    {
      dout(10) << " still need up_thru update before going active" << dendl;
      post_event(NeedUpThru());
    }
    else
    {
      dout(10) << "Got last missing, don't need missing "
               << "posting Activate" << dendl;
      post_event(Activate(pg->get_osdmap()->get_epoch()));
    }
  }
  return discard_event();
}

3. Summary

From the analysis above, the whole peering process breaks down into three stages: GetInfo -> GetLog -> GetMissing. First, PG info is requested from every OSD in the prior set, acting set and up set, and the PG holding the authoritative log is selected; then the authoritative log is requested from the OSD that holds it; finally the missing set needed by the subsequent recovery process is obtained.

How long peering takes is not bounded; it depends mainly on whether the queried OSDs respond promptly. If an OSD goes down during this stage, some PGs are likely to stay stuck in the peering state, and all I/O directed at those PGs will block.

To be continued: it is getting late today; the details will be filled in over the next couple of days.

References:

A little: http://my.oschina.net/u/2460844/blog/596895

Wang Songpo: https://www.ustack.com/blog/ceph%EF%BC%8Dpg-peering/

Shimin (Sammy Liu): http://www.cnblogs.com/sammyliu/p/4836014.html

Incom: http://blog.csdn.net/changtao381/article/details/49125817

Thanks to the above authors for their selfless sharing.
