The heartbeat_check in oslo_messaging

Recently, during a high-availability test of the OpenStack control plane (three control nodes), nova service-list reported all Nova services as down after one of the control nodes was powered off. The nova-compute log was full of error messages like these:
2016-11-08 03:46:23.887 127895 INFO oslo.messaging._drivers.impl_rabbit [-] A recoverable connection/channel error occurred, trying to reconnect: [Errno 32] Broken pipe
2016-11-08 03:46:27.275 127895 INFO oslo.messaging._drivers.impl_rabbit [-] A recoverable connection/channel error occurred, trying to reconnect: [Errno 32] Broken pipe
2016-11-08 03:46:27.276 127895 INFO oslo.messaging._drivers.impl_rabbit [-] A recoverable connection/channel error occurred, trying to reconnect: [Errno 32] Broken pipe
2016-11-08 03:46:27.276 127895 INFO oslo.messaging._drivers.impl_rabbit [-] A recoverable connection/channel error occurred, trying to reconnect: [Errno 32] Broken pipe
2016-11-08 03:46:27.277 127895 INFO oslo.messaging._drivers.impl_rabbit [-] A recoverable connection/channel error occurred, trying to reconnect: [Errno 32] Broken pipe
2016-11-08 03:46:27.277 127895 INFO oslo.messaging._drivers.impl_rabbit [-] A recoverable connection/channel error occurred, trying to reconnect: [Errno 32] Broken pipe
2016-11-08 03:46:27.278 127895 INFO oslo.messaging._drivers.impl_rabbit [-] A recoverable connection/channel error occurred, trying to reconnect: [Errno 32] Broken pipe
2016-11-08 03:46:27.278 127895 INFO oslo.messaging._drivers.impl_rabbit [-] A recoverable connection/channel error occurred, trying to reconnect: [Errno 32] Broken pipe
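[Errno 32] is EPIPE, i.e. a write on a socket whose remote end has gone away, which is exactly what happens to the AMQP connections pointing at the RabbitMQ instance on the powered-off controller. A quick check with plain Python (nothing OpenStack-specific) confirms the mapping:

import errno
import os

# Errno 32 is EPIPE ("broken pipe"): writing to a connection whose
# peer has already closed or disappeared.
print(errno.EPIPE)               # 32
print(os.strerror(errno.EPIPE))  # "Broken pipe"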
The exception above is thrown from oslo_messaging/_drivers/impl_rabbit.py:
def _heartbeat_thread_job(self):
    """Thread that maintains inactive connections
    """
    while not self._heartbeat_exit_event.is_set():
        with self._connection_lock.for_heartbeat():

            recoverable_errors = (
                self.connection.recoverable_channel_errors +
                self.connection.recoverable_connection_errors)

            try:
                try:
                    self._heartbeat_check()
                    # note(sileht): we need to drain event to receive
                    # heartbeat from the broker but don't hold the
                    # connection too much times. In amqpdriver a connection
                    # is used exclusivly for read or for write, so we have
                    # to do this for connection used for write drain_events
                    # already do that for other connection
                    try:
                        self.connection.drain_events(timeout=0.001)
                    except socket.timeout:
                        pass
                except recoverable_errors as exc:
                    LOG.info(_LI("A recoverable connection/channel error "
                                 "occurred, trying to reconnect: %s"), exc)
                    self.ensure_connection()
            except Exception:
                LOG.warning(_LW("Unexpected error during heartbeart "
                                "thread processing, retrying..."))
                LOG.debug('Exception', exc_info=True)

        self._heartbeat_exit_event.wait(
            timeout=self._heartbeat_wait_timeout)
    self._heartbeat_exit_event.clear()
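For context, the interval the thread sleeps between iterations, self._heartbeat_wait_timeout, is derived from the heartbeat options in the [oslo_messaging_rabbit] section. The sketch below shows the relationship as I understand it for this era of oslo.messaging; treat the option names and defaults as assumptions if your version differs:

# Assumed [oslo_messaging_rabbit] settings (typical defaults of that era):
heartbeat_timeout_threshold = 60.0  # seconds of silence before a connection is considered dead
heartbeat_rate = 2.0                # how many checks to run per threshold window

# The heartbeat thread wakes up roughly this often to run _heartbeat_check():
heartbeat_wait_timeout = heartbeat_timeout_threshold / heartbeat_rate / 2.0
print(heartbeat_wait_timeout)       # 15.0 seconds with the defaults above

So with default settings the loop body runs about every 15 seconds, which matches the steady trickle of "trying to reconnect" messages in the log above.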
The heartbeat check exists to detect whether the connection between a component service and the RabbitMQ server is still alive; the heartbeat_check task in oslo_messaging runs in the background from the moment the service starts. When a control node is shut down, one of the RabbitMQ server nodes is shut down with it. The heartbeat thread then keeps looping, repeatedly raising the exceptions caught by recoverable_errors, and it only leaves the while loop once self._heartbeat_exit_event.is_set() returns True. Presumably a timeout should be added so that the thread does not stay stuck in this loop and the services can recover after a few minutes, as sketched below.
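To make the idea concrete, here is a minimal sketch of what "add a timeout" could look like: bound the reconnect attempts with a deadline and a back-off instead of retrying indefinitely. This is not the actual oslo.messaging fix; reconnect_with_deadline, MAX_RECONNECT_SECONDS and RETRY_INTERVAL are hypothetical names used only for illustration.

import time

MAX_RECONNECT_SECONDS = 300  # hypothetical: give up after ~5 minutes
RETRY_INTERVAL = 5           # hypothetical: seconds to wait between attempts

def reconnect_with_deadline(connection):
    """Try to re-establish the AMQP connection, but stop after a deadline."""
    deadline = time.time() + MAX_RECONNECT_SECONDS
    while time.time() < deadline:
        try:
            connection.ensure_connection()  # same call the heartbeat thread uses
            return True
        except Exception:
            time.sleep(RETRY_INTERVAL)      # back off instead of spinning
    return False                            # let the caller decide how to surface the failure

With a bound like this, the heartbeat thread would stop hammering the dead RabbitMQ node and could fail over to a surviving cluster member, rather than logging "broken pipe" until the exit event is set.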
This article comes from the "the-way-to-cloud" blog; please keep this source when reposting: http://iceyao.blog.51cto.com/9426658/1870593