Yarn Source Analysis (iv) ----- JournalNode


Preface

Recently, while troubleshooting a performance problem on our company's Hadoop cluster, I found that the whole cluster had slowed down dramatically: jobs that normally finish in a few tens of minutes were suddenly taking hours. I initially suspected the network, and that did turn out to be part of the cause, but a few days later the problem reappeared and this time was much harder to pin down. Analysis of the HDFS request logs and the Ganglia monitoring metrics showed that the NameNode's backlog of pending requests stayed consistently high, meaning the NameNode was processing requests abnormally slowly. Further digging traced this to slow writes of the edit log to the JournalNodes, and finally to the root cause: one JournalNode's edit log directory had never been created, so every attempt to write the edit log on that node threw a FileNotFoundException. Let this be a reminder to pay attention to the small supporting roles such as the JournalNode. During the troubleshooting I also studied the JournalNode-related parts of the source code; the notes below are what I learned. There may be mistakes in places, so please bear with me.


JournalNode

Perhaps some readers have never heard of the JournalNode and only know Hadoop's DataNode and NameNode, because it is a new concept introduced in Hadoop 2 (the YARN/MR2 era). The JournalNode's role is to store the edit log. In MR1 the edit log was kept together with the fsimage, and the SecondaryNameNode periodically merged the two; with the HA architecture in Hadoop 2 the SecondaryNameNode is no longer needed. Below is the current architecture diagram; focus on the role of the JournalNode in it.


The green area in the diagram between the active NameNode and the standby NameNode is the JournalNode layer (of course there is usually more than one node); it plays the same role an NFS shared filesystem would. The active NameNode writes edit log data to it, and the standby NameNode reads that data back to stay in sync.


QJM

The following analyzes the JournalNode mechanism from the source code. Since the configuration can define multiple JournalNodes, there has to be a manager-like role coordinating them, and that manager is the QJM, short for QuorumJournalManager. Here are the variable definitions of the QJM:

/**
 * A JournalManager that writes to a set of remote JournalNodes,
 * requiring a quorum of nodes to ack each write.
 * (That is, a JournalManager that can write edit log data to multiple
 * remote JournalNode nodes.)
 */
@InterfaceAudience.Private
public class QuorumJournalManager implements JournalManager {
  static final Log LOG = LogFactory.getLog(QuorumJournalManager.class);

  // Timeouts for which the QJM will wait for each of the following actions.
  private final int startSegmentTimeoutMs;
  private final int prepareRecoveryTimeoutMs;
  private final int acceptRecoveryTimeoutMs;
  private final int finalizeSegmentTimeoutMs;
  private final int selectInputStreamsTimeoutMs;
  private final int getJournalStateTimeoutMs;
  private final int newEpochTimeoutMs;
  private final int writeTxnsTimeoutMs;

  // Since these don't occur during normal operation, we can
  // use rather lengthy timeouts, and don't need to make them
  // configurable.
  private static final int FORMAT_TIMEOUT_MS            = 60000;
  private static final int HASDATA_TIMEOUT_MS           = 60000;
  private static final int CAN_ROLL_BACK_TIMEOUT_MS     = 60000;
  private static final int FINALIZE_TIMEOUT_MS          = 60000;
  private static final int PRE_UPGRADE_TIMEOUT_MS       = 60000;
  private static final int ROLL_BACK_TIMEOUT_MS         = 60000;
  private static final int UPGRADE_TIMEOUT_MS           = 60000;
  private static final int GET_JOURNAL_CTIME_TIMEOUT_MS = 60000;
  private static final int DISCARD_SEGMENTS_TIMEOUT_MS  = 60000;

  private final Configuration conf;
  private final URI uri;
  private final NamespaceInfo nsInfo;
  private boolean isActiveWriter;

  // The remote nodes live in the AsyncLoggerSet collection
  private final AsyncLoggerSet loggers;

  private int outputBufferCapacity = 512 * 1024;
  private final URLConnectionFactory connectionFactory;
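These per-operation timeout fields are populated from dfs.qjournal.* configuration keys (the key name below is taken from DFSConfigKeys in Hadoop 2.x and should be verified against your version; the default for the write timeout is 20000 ms). As a hedged example, the write timeout could be raised in hdfs-site.xml like this:

<!-- Key name assumed from Hadoop 2.x DFSConfigKeys; verify against your version -->
<property>
  <name>dfs.qjournal.write-txns.timeout.ms</name>
  <value>40000</value>
</property>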
The class defines this many operation timeouts because every one of these operations goes over RPC. The proxies for all of the JournalNode clients are held in the AsyncLoggerSet object, which contains a list of AsyncLogger objects. Each logger object manages one individual JournalNode. Below is how the QJM dynamically creates the logger objects from the configuration:

static List<AsyncLogger> createLoggers(Configuration conf,
    URI uri, NamespaceInfo nsInfo, AsyncLogger.Factory factory)
        throws IOException {
  List<AsyncLogger> ret = Lists.newArrayList();
  List<InetSocketAddress> addrs = getLoggerAddresses(uri);
  String jid = parseJournalId(uri);
  for (InetSocketAddress addr : addrs) {
    ret.add(factory.createLogger(conf, nsInfo, jid, addr));
  }
  return ret;
}
The loggers are then placed into the AsyncLoggerSet collection class:

QuorumJournalManager(Configuration conf,
    URI uri, NamespaceInfo nsInfo,
    AsyncLogger.Factory loggerFactory) throws IOException {
  Preconditions.checkArgument(conf != null, "must be configured");
  this.conf = conf;
  this.uri = uri;
  this.nsInfo = nsInfo;
  this.loggers = new AsyncLoggerSet(createLoggers(loggerFactory));
  ...
The definition of the AsyncLoggerSet collection class is simple; it is just a wrapper around the list of logger objects.

/**
 * Wrapper around a set of loggers, taking care of fanning out
 * calls to the underlying loggers and constructing corresponding
 * {@link QuorumCall} instances.
 */
class AsyncLoggerSet {
  static final Log LOG = LogFactory.getLog(AsyncLoggerSet.class);

  private final List<AsyncLogger> loggers;

  private static final long INVALID_EPOCH = -1;
  private long myEpoch = INVALID_EPOCH;

  public AsyncLoggerSet(List<AsyncLogger> loggers) {
    this.loggers = ImmutableList.copyOf(loggers);
  }
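To make the fan-out-and-quorum idea concrete, here is a minimal self-contained sketch of the pattern. This is not Hadoop's actual QuorumCall implementation; the class name SimpleQuorumCall and its shape are invented for illustration only:

import java.io.IOException;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.CompletionService;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Minimal sketch (NOT Hadoop's QuorumCall): fan a call out to every logger
// and block until a majority of them acknowledge it or the timeout expires.
class SimpleQuorumCall {
  static void waitForWriteQuorum(List<Callable<Void>> calls,
      ExecutorService executor, long timeoutMs)
      throws IOException, InterruptedException, TimeoutException {
    int majority = calls.size() / 2 + 1;
    CompletionService<Void> cs = new ExecutorCompletionService<Void>(executor);
    for (Callable<Void> call : calls) {
      cs.submit(call);                 // one asynchronous RPC per JournalNode
    }
    long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMs);
    int successes = 0;
    int failures = 0;
    while (successes < majority) {
      Future<Void> done = cs.poll(deadline - System.nanoTime(),
          TimeUnit.NANOSECONDS);
      if (done == null) {
        throw new TimeoutException("quorum not reached within " + timeoutMs + " ms");
      }
      try {
        done.get();
        successes++;                   // this JournalNode acked the write
      } catch (ExecutionException e) {
        failures++;                    // this JournalNode failed
        if (failures > calls.size() - majority) {
          throw new IOException("too many loggers failed, no quorum possible", e);
        }
      }
    }
  }
}

The real QuorumCall also tracks per-logger results and exceptions and is reused for every RPC type, but the majority-ack idea is the same.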
Back to the logger objects: AsyncLogger itself is only an abstraction; the class that actually does the work is the channel class below.

/**
 * Channel to a remote JournalNode using Hadoop IPC.
 * All of the calls are run on a separate thread, and return
 * {@link ListenableFuture} instances to wait for their result.
 * This allows calls to be bound together using the {@link QuorumCall}
 * class.
 */
@InterfaceAudience.Private
public class IPCLoggerChannel implements AsyncLogger {

  private final Configuration conf;
  // Address of the corresponding JournalNode
  protected final InetSocketAddress addr;
  private QJournalProtocol proxy;

  /**
   * Executes tasks submitted to it serially, on a single thread, in FIFO order
   * (generally used for write tasks that should not be reordered).
   */
  private final ListeningExecutorService singleThreadExecutor;
  /**
   * Executes tasks submitted to it in parallel with each other and with those
   * submitted to singleThreadExecutor (generally used for read tasks, which
   * can be safely reordered and interleaved with writes).
   */
  private final ListeningExecutorService parallelExecutor;

  private long ipcSerial = 0;
  private long epoch = -1;
  private long committedTxId = HdfsConstants.INVALID_TXID;

  private final String journalId;
  private final NamespaceInfo nsInfo;
  private URL httpServerURL;

  // Metric statistics for this JournalNode channel
  private final IPCLoggerChannelMetrics metrics;
As its name suggests, this class is the IPC channel between the client and the server-side executing class; note that it is not itself the class that executes the operations. The channel class also defines many useful monitoring variables, and the journal metrics displayed in Ganglia are taken from here:

/**
 * The number of bytes of edits data still in the queue.
 */
private int queuedEditsSizeBytes = 0;

/**
 * The highest txid that has been successfully logged on the remote JN.
 */
private long highestAckedTxId = 0;

/**
 * Nanotime of the last time we successfully journaled some edits
 * to the remote node.
 */
private long lastAckNanos = 0;

/**
 * Nanotime of the last time that committedTxId was updated. Used
 * to calculate the lag in terms of time, rather than just a number
 * of txns.
 */
private long lastCommitNanos = 0;

/**
 * The maximum number of bytes that can be pending in the queue.
 * This keeps the writer from hitting OOME if one of the loggers
 * starts responding really slowly. Eventually, the queue
 * overflows and it starts to treat the logger as having errored.
 */
private final int queueSizeLimitBytes;

/**
 * If this logger misses some edits, or restarts in the middle of
 * a segment, the writer won't be able to write any more edits until
 * the beginning of the next segment. Upon detecting this situation,
 * the writer sets this flag to true to avoid sending useless RPCs.
 * (An out-of-sync flag: used to decide whether this JournalNode has
 * fallen behind.)
 */
private boolean outOfSync = false;
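The queueSizeLimitBytes field above is enforced whenever edits are queued for sending to this JournalNode. The check in IPCLoggerChannel looks roughly like the following (lightly abridged from 2.7.1; verify against your source tree):

private synchronized void reserveQueueSpace(int size)
    throws LoggerTooFarBehindException {
  Preconditions.checkArgument(size >= 0);
  // If this logger's queue would overflow, treat the logger as having
  // errored rather than letting the writer run out of memory.
  if (queuedEditsSizeBytes + size > queueSizeLimitBytes &&
      queuedEditsSizeBytes > 0) {
    throw new LoggerTooFarBehindException();
  }
  queuedEditsSizeBytes += size;
}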
Because the channel class implements the same protocol as the real server-side executor, its method signatures are identical to the server's. Several common methods are listed below (an excerpt of the shared protocol follows the two examples).

Starting a new log segment for writing:

@Override
public ListenableFuture<Void> startLogSegment(final long txid,
    final int layoutVersion) {
  return singleThreadExecutor.submit(new Callable<Void>() {
    @Override
    public Void call() throws IOException {
      getProxy().startLogSegment(createReqInfo(), txid, layoutVersion);
      synchronized (IPCLoggerChannel.this) {
        if (outOfSync) {
          outOfSync = false;
          QuorumJournalManager.LOG.info(
              "Restarting previously-stopped writes to " +
              IPCLoggerChannel.this + " in segment starting at txid " +
              txid);
        }
      }
      return null;
    }
  });
}
After writing completes, a finalize operation confirms the segment:

@Override
public ListenableFuture<Void> finalizeLogSegment(
    final long startTxId, final long endTxId) {
  return singleThreadExecutor.submit(new Callable<Void>() {
    @Override
    public Void call() throws IOException {
      throwIfOutOfSync();

      getProxy().finalizeLogSegment(createReqInfo(), startTxId, endTxId);
      return null;
    }
  });
}
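Both of the methods above end up invoking the shared QJournalProtocol RPC interface that the JournalNode server implements. For reference, an excerpt of the relevant declarations (abridged from 2.7.1; verify against your version):

public void startLogSegment(RequestInfo reqInfo, long txid, int layoutVersion)
    throws IOException;

public void finalizeLogSegment(RequestInfo reqInfo, long startTxId, long endTxId)
    throws IOException;

// The core write call: send a batch of serialized edits for a segment
public void journal(RequestInfo reqInfo, long segmentTxId, long firstTxnId,
    int numTxns, byte[] records) throws IOException;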
The singleThreadExecutor pool generally executes the write-related operations, while the parallel pool handles reads; all of these operations run asynchronously to keep throughput high. Once the calls have been dispatched to the loggers, the QJM immediately obtains a QuorumCall and waits for the replies; the call succeeds as soon as a majority of the JournalNodes acknowledge it, which is why a cluster of N JournalNodes tolerates (N-1)/2 failures. For example, the QJM-level finalize:

@Override
public void finalizeLogSegment(long firstTxId, long lastTxId)
    throws IOException {
  QuorumCall<AsyncLogger, Void> q = loggers.finalizeLogSegment(
      firstTxId, lastTxId);
  loggers.waitForWriteQuorum(q, finalizeSegmentTimeoutMs,
      String.format("finalizeLogSegment(%s-%s)", firstTxId, lastTxId));
}


JournalNode and Journal

The class that handles operations on each journal is JournalNode; it is the server corresponding to the client described above.

/**
 * The JournalNode is a daemon which allows namenodes using
 * the QuorumJournalManager to log and retrieve edits stored
 * remotely. It is a thin wrapper around a local edit log
 * directory with the addition of facilities to participate
 * in the quorum protocol.
 */
@InterfaceAudience.Private
public class JournalNode implements Tool, Configurable, JournalNodeMXBean {
  public static final Log LOG = LogFactory.getLog(JournalNode.class);
  private Configuration conf;
  private JournalNodeRpcServer rpcServer;
  private JournalNodeHttpServer httpServer;
  private final Map<String, Journal> journalsById = Maps.newHashMap();
  private ObjectName journalNodeInfoBeanName;
  private String httpServerURI;
  private File localDir;

  static {
    HdfsConfiguration.init();
  }

  /**
   * When stopped, the daemon will exit with this code.
   */
  private int resultCode = 0;
It defines the server-side log operation methods corresponding to the client calls shown earlier:

...
public void discardSegments(String journalId, long startTxId)
    throws IOException {
  getOrCreateJournal(journalId).discardSegments(startTxId);
}

public void doPreUpgrade(String journalId) throws IOException {
  getOrCreateJournal(journalId).doPreUpgrade();
}

public void doUpgrade(String journalId, StorageInfo sInfo) throws IOException {
  getOrCreateJournal(journalId).doUpgrade(sInfo);
}

public void doFinalize(String journalId) throws IOException {
  getOrCreateJournal(journalId).doFinalize();
}
...
All of these methods delegate indirectly to a Journal object, and each invariably takes a journalId parameter. Does journalId refer to the identity of the JournalNode node? At first I thought so, but that turned out to be wrong.

File[] journalDirs = localDir.listFiles(new FileFilter() {
  @Override
  public boolean accept(File file) {
    return file.isDirectory();
  }
});
for (File journalDir : journalDirs) {
  String jid = journalDir.getName();
  if (!status.containsKey(jid)) {
    Map<String, String> jMap = new HashMap<String, String>();
    jMap.put("Formatted", "true");
    status.put(jid, jMap);
  }
}
The answer is that journalId actually names the target directory, as the test code in hadoop-hdfs-project also shows:

/**
 * Set up the given Configuration object to point to the set of JournalNodes
 * in this cluster.
 */
public URI getQuorumJournalURI(String jid) {
  List<String> addrs = Lists.newArrayList();
  for (JNInfo info : nodes) {
    addrs.add("127.0.0.1:" + info.ipcAddr.getPort());
  }
  String addrsVal = Joiner.on(";").join(addrs);
  LOG.debug("Setting logger addresses to: " + addrsVal);
  try {
    return new URI("qjournal://" + addrsVal + "/" + jid);
  } catch (URISyntaxException e) {
    throw new AssertionError(e);
  }
}
The journal URI therefore has the format qjournal://host1:port;host2:port;.../journalId, for example:

<property>
  <name>dfs.namenode.shared.edits.dir</name>
  <value>qjournal://had1:8485;had2:8485;had3:8485/mycluster</value>
</property>
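How a jid is mapped to an on-disk directory can be seen in JournalNode#getOrCreateJournal, which (lightly abridged from 2.7.1; verify against your source tree) simply resolves the jid to a subdirectory of the local edits directory. This is also exactly the place where the missing directory from the preface bites:

synchronized Journal getOrCreateJournal(String jid, StartupOption startOpt)
    throws IOException {
  QuorumJournalManager.checkJournalId(jid);
  Journal journal = journalsById.get(jid);
  if (journal == null) {
    // Resolves to <dfs.journalnode.edits.dir>/<jid>
    File logDir = getLogDir(jid);
    LOG.info("Initializing journal in directory " + logDir);
    journal = new Journal(conf, logDir, jid, startOpt, new ErrorReporter());
    journalsById.put(jid, journal);
  }
  return journal;
}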
The journalsById map kept in the JournalNode is what allows different nameservices to write to different edit log directories. The Journal object is the final executor of the operations, and it holds the EditLogOutputStream that directly operates on the edit log output file. Here is one of its methods:

/**
 * Start a new segment at the given txid. The previous segment
 * must have already been finalized.
 */
public synchronized void startLogSegment(RequestInfo reqInfo, long txid,
    int layoutVersion) throws IOException {
  assert fjm != null;
  checkFormatted();
  checkRequest(reqInfo);

  if (curSegment != null) {
    LOG.warn("Client is requesting a new log segment " + txid +
        " though we are already writing " + curSegment + ". " +
        "Aborting the current segment in order to begin the new one.");
    // The writer may have lost a connection to us, and is now
    // re-connecting after the connection came back.
    // We should abort our own old segment.
    abortCurSegment();
  }

  // Paranoid sanity check: we should never overwrite a finalized log file.
  // Additionally, if it's in-progress, it should have at most 1 transaction.
  // This can happen if the writer crashes exactly at the start of a segment.
  EditLogFile existing = fjm.getLogFile(txid);
  if (existing != null) {
    if (!existing.isInProgress()) {
      throw new IllegalStateException("Already have a finalized segment " +
          existing + " beginning at " + txid);
    }
...
Readers can examine the detailed write logic in the source themselves; this article only walks through the overall JournalNode write flow. Below is a simple architecture diagram I drew to help you understand it.


For the complete code analysis, see https://github.com/linyiqun/hadoop-yarn; follow-up posts will continue with analysis of other aspects of YARN.

Reference Source Code

apache-hadoop-2.7.1 (hadoop-hdfs-project)


Copyright notice: this is an original article by the author; do not reproduce without the author's permission.

