Investigating Frequent Real-Time Data Interruptions

    • Issue:

The customer reported that the real-time data feed was interrupted frequently.

    • Root Cause:

Some Storm workers spent too much time in full GC pauses, which caused Nimbus to reset those workers, so in-flight data was lost.
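
For context, here is a minimal sketch (ours, not from the original report) of the three Storm timeout windows that a long stop-the-world GC pause can exceed; the values shown are the Storm 1.x defaults. The supervisor and nimbus settings normally live in storm.yaml on the cluster; they are shown via the Config API here only to name the keys.

    import org.apache.storm.Config;

    public class GcTimeoutContext {
        public static Config defaults() {
            Config conf = new Config();
            // Supervisor restarts a worker whose heartbeats stop for this long.
            conf.put(Config.SUPERVISOR_WORKER_TIMEOUT_SECS, 30);
            // Nimbus reschedules tasks/executors it has not heard from.
            conf.put(Config.NIMBUS_TASK_TIMEOUT_SECS, 30);
            // Tuples not fully acked within this window are failed and replayed.
            conf.setMessageTimeoutSecs(30);
            return conf;
        }
    }

A worker frozen in GC longer than any of these windows looks dead to the rest of the cluster, even though it will eventually resume.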

    • Steps to investigate this issue

1. Checked the Storm UI and found that all workers looked normal, but some tasks had failed.

2. Checked the logs of the Storm workers and found exceptions such as the following:

2018-09-26 07:35:55.081 flowkafkareadspout getDataThread-8 [INFO] Partition:8, offset:676160000, key:2018-09-26 07:35:53_15067, value length:1755
2018-09-26 07:36:00.490 o.a.s.m.n.StormServerHandler Netty-server-localhost-6700-worker-1 [ERROR] server errors in handling the request
java.io.IOException: Connection reset by peer
    at sun.nio.ch.FileDispatcherImpl.read0(Native Method) ~[?:1.7.0_80]
    at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39) ~[?:1.7.0_80]
    at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223) ~[?:1.7.0_80]
    at sun.nio.ch.IOUtil.read(IOUtil.java:192) ~[?:1.7.0_80]
    at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:384) ~[?:1.7.0_80]
    at org.apache.storm.shade.org.jboss.netty.channel.socket.nio.NioWorker.read(NioWorker.java:...) [storm-core-1.1.1.jar:1.1.1]
    at org.apache.storm.shade.org.jboss.netty.channel.socket.nio.AbstractNioWorker.process(AbstractNioWorker.java:108) [storm-core-1.1.1.jar:1.1.1]
    at org.apache.storm.shade.org.jboss.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:318) [storm-core-1.1.1.jar:1.1.1]
    at org.apache.storm.shade.org.jboss.netty.channel.socket.nio.AbstractNioWorker.run(AbstractNioWorker.java:...) [storm-core-1.1.1.jar:1.1.1]
    at org.apache.storm.shade.org.jboss.netty.channel.socket.nio.NioWorker.run(NioWorker.java:178) [storm-core-1.1.1.jar:1.1.1]
    at org.apache.storm.shade.org.jboss.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108) [storm-core-1.1.1.jar:1.1.1]
    at org.apache.storm.shade.org.jboss.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42) [storm-core-1.1.1.jar:1.1.1]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [?:1.7.0_80]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [?:1.7.0_80]
    at java.lang.Thread.run(Thread.java:745) [?:1.7.0_80]
2018-09-26 07:36:00.490 o.a.s.m.n.StormClientHandler client-worker-1 [INFO] Connection to ip-10-9-248-74.us-west-2.compute.internal/10.9.248.74:6700 failed: java.io.IOException: Connection reset by peer
    at sun.nio.ch.FileDispatcherImpl.read0(Native Method) ~[?:1.7.0_80]
    at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39) ~[?:1.7.0_80]
    at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223) ~[?:1.7.0_80]
    at sun.nio.ch.IOUtil.read(IOUtil.java:192) ~[?:1.7.0_80]
    at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:384) ~[?:1.7.0_80]

It seems that some workers could not connect to each other, and similar "Connection reset by peer" exceptions also appeared against the ZooKeeper cluster and the Kafka cluster. We then checked the port usage and found:

tcp6       0      0 10.9.248.61:38050       10.9.248.70:9092        TIME_WAIT
tcp6       0      0 10.9.248.61:38066       10.9.248.70:9092        TIME_WAIT
tcp6       0      0 10.9.248.61:39160       10.9.248.97:2181        TIME_WAIT
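
Note on reading this output: TIME_WAIT is entered by the side that actively closed the connection, so a pile-up of TIME_WAIT sockets from the worker host (10.9.248.61) toward the Kafka brokers (port 9092) and ZooKeeper (port 2181) means the worker itself was repeatedly closing and reopening these connections instead of holding stable sessions.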

3. Restarted all Storm workers, the Kafka cluster, and the ZooKeeper cluster, but this did not fix the issue.

4. Checked the Storm workers again and found that other workers also showed problems such as the following:

2018-09-26 06:34:34.834 STDERR Thread-2 [INFO] 740.606: [Full GC [PSYoungGen: 1298054K->370618K(6126592K)] [ParOldGen: 5032811K->5122788K(6748672K)] 6330866K->5493406K(12875264K) [PSPermGen: 55526K->55525K(524288K)], 6.4880090 secs] [Times: user=100.76 sys=0.00, real=6.49 secs]
2018-09-26 06:34:34.834 o.a.s.s.o.a.z.ClientCnxn Thread-23-$mastercoord-bg1-executor[2 2]-SendThread(ip-10-9-248-97.us-west-2.compute.internal:2181) [INFO] Client session timed out, have not heard from server in 9008ms for sessionid 0xb65a69ab5380782, closing socket connection and attempting reconnect
2018-09-26 06:34:34.840 o.a.z.ClientCnxn Thread-279-spout-dataKafkaSpout1537942916091-executor[824 824]-SendThread(ip-10-9-248-98.us-west-2.compute.internal:2181) [INFO] Client session timed out, have not heard from server in 10147ms for sessionid 0xa5beb7fcf46ff88, closing socket connection and attempting reconnect
2018-09-26 06:34:34.8 o.a.s.s.o.a.z.ClientCnxn main-SendThread(ip-10-9-248-98.us-west-2.compute.internal:2181) [INFO] Client session timed out, have not heard from server in 9398ms for sessionid 0xa5beb7fcf46ff82, closing socket connection and attempting reconnect
2018-09-26 06:34:34.935 o.a.s.s.o.a.c.f.s.ConnectionStateManager Thread-23-$mastercoord-bg1-executor[2 2]-EventThread [INFO] State change: SUSPENDED
2018-09-26 06:34:34.941 o.a.c.f.s.ConnectionStateManager Thread-279-spout-dataKafkaSpout1537942916091-executor[824 824]-EventThread [INFO] State change: SUSPENDED
2018-09-26 06:34:34.942 o.a.s.s.o.a.c.f.s.ConnectionStateManager main-EventThread [INFO] State change: SUSPENDED
2018-09-26 06:34:34.943 o.a.s.u.StormBoundedExponentialBackoffRetry executor-heartbeat-timer [WARN] WILL SLEEP FOR 1001ms (NOT MAX)
2018-09-26 06:34:34.943 o.a.s.u.StormBoundedExponentialBackoffRetry refresh-active-timer [WARN] WILL SLEEP FOR 1001ms (NOT MAX)
2018-09-26 06:34:34.943 o.a.s.u.StormBoundedExponentialBackoffRetry refresh-connections-timer [WARN] WILL SLEEP FOR 1001ms (NOT MAX)
2018-09-26 06:34:34.943 o.a.s.c.zookeeper-state-factory main-EventThread [WARN] Received event :disconnected::none: with disconnected Writer Zookeeper.
2018-09-26 06:34:35.182 o.a.s.s.o.a.z.ClientCnxn Thread-23-$mastercoord-bg1-executor[2 2]-SendThread(ip-10-9-248-98.us-west-2.compute.internal:2181) [INFO] Opening socket connection to server ip-10-9-248-98.us-west-2.compute.internal/10.9.248.98:2181. Will not attempt to authenticate using SASL (unknown error)
2018-09-26 06:34:35.183 o.a.s.s.o.a.z.ClientCnxn Thread-23-$mastercoord-bg1-executor[2 2]-SendThread(ip-10-9-248-98.us-west-2.compute.internal:2181) [INFO] Socket connection established to ip-10-9-248-98.us-west-2.compute.internal/10.9.248.98:2181, initiating session
2018-09-26 06:34:35.183 o.a.s.s.o.a.z.ClientCnxn Thread-23-$mastercoord-bg1-executor[2 2]-SendThread(ip-10-9-248-98.us-west-2.compute.internal:2181) [INFO] Session establishment complete on server ip-10-9-248-98.us-west-2.compute.internal/10.9.248.98:2181, sessionid = 0xb65a69ab53807, negotiated timeout = 10000
2018-09-26 06:34:35.183 o.a.s.s.o.a.c.f.s.ConnectionStateManager Thread-23-$mastercoord-bg1-executor[2 2]-EventThread [INFO] State change: RECONNECTED
2018-09-26 06:34:35.787 o.a.s.s.o.a.z.ClientCnxn main-SendThread(ip-10-9-248-97.us-west-2.compute.internal:2181) [INFO] Opening socket connection to server ip-10-9-248-97.us-west-2.compute.internal/10.9.248.97:2181. Will not attempt to authenticate using SASL (unknown error)
2018-09-26 06:34:35.787 o.a.s.s.o.a.z.ClientCnxn main-SendThread(ip-10-9-248-97.us-west-2.compute.internal:2181) [INFO] Socket connection established to ip-10-9-248-97.us-west-2.compute.internal/10.9.248.97:2181, initiating session
2018-09-26 06:34:35.789 o.a.s.s.o.a.z.ClientCnxn main-SendThread(ip-10-9-248-97.us-west-2.compute.internal:2181) [INFO] Unable to reconnect to ZooKeeper service, session 0xa5beb7fcf46ff82 has expired, closing socket connection
2018-09-26 06:34:35.789 o.a.s.s.o.a.c.f.s.ConnectionStateManager main-EventThread [INFO] State change: LOST
2018-09-26 06:34:35.789 o.a.s.c.zookeeper-state-factory main-EventThread [WARN] Received event :expired::none: with disconnected Writer Zookeeper.
2018-09-26 06:34:35.789 o.a.s.s.o.a.c.ConnectionState main-EventThread [WARN] Session expired event received
2018-09-26 06:34:35.789 o.a.s.s.o.a.z.ZooKeeper main-EventThread [INFO] Initiating client connection, connectString=ec2-52-27-163-101.us-west-2.compute.amazonaws.com:2181,ec2-52-27-236-22.us-west-2.compute.amazonaws.com:2181,ec2-52-24-149-36.us-west-2.compute.amazonaws.com:2181/storm111 sessionTimeout=90000 watcher=...@383fbe82
2018-09-26 06:34:35.802 o.a.s.s.o.a.z.ClientCnxn main-EventThread [INFO] EventThread shut down
2018-09-26 06:34:35.802 o.a.s.s.o.a.z.ClientCnxn main-SendThread(ip-10-9-248-97.us-west-2.compute.internal:2181) [INFO] Opening socket connection to server ip-10-9-248-97.us-west-2.compute.internal/10.9.248.97:2181. Will not attempt to authenticate using SASL (unknown error)
2018-09-26 06:34:35.802 o.a.s.s.o.a.z.ClientCnxn main-SendThread(ip-10-9-248-97.us-west-2.compute.internal:2181) [INFO] Socket connection established to ip-10-9-248-97.us-west-2.compute.internal/10.9.248.97:2181, initiating session
2018-09-26 06:34:35.804 o.a.s.s.o.a.z.ClientCnxn main-SendThread(ip-10-9-248-97.us-west-2.compute.internal:2181) [INFO] Session establishment complete on server ip-10-9-248-97.us-west-2.compute.internal/10.9.248.97:2181, sessionid = 0xb65a69ab538078c, negotiated timeout = 10000
2018-09-26 06:34:35.805 o.a.s.s.o.a.c.f.s.ConnectionStateManager main-EventThread [INFO] State change: RECONNECTED
2018-09-26 06:34:35.935 o.a.z.ClientCnxn Thread-279-spout-dataKafkaSpout1537942916091-executor[824 824]-SendThread(ip-10-9-248-97.us-west-2.compute.internal:2181) [INFO] Opening socket connection to server ip-10-9-248-97.us-west-2.compute.internal/10.9.248.97:2181. Will not attempt to authenticate using SASL (unknown error)
2018-09-26 06:34:35.935 o.a.z.ClientCnxn Thread-279-spout-dataKafkaSpout1537942916091-executor[824 824]-SendThread(ip-10-9-248-97.us-west-2.compute.internal:2181) [INFO] Socket connection established to ip-10-9-248-97.us-west-2.compute.internal/10.9.248.97:2181, initiating session
2018-09-26 06:34:35.937 o.a.z.ClientCnxn Thread-279-spout-dataKafkaSpout1537942916091-executor[824 824]-SendThread(ip-10-9-248-97.us-west-2.compute.internal:2181) [INFO] Unable to reconnect to ZooKeeper service, session 0xa5beb7fcf46ff88 has expired, closing socket connection
2018-09-26 06:34:35.937 o.a.c.f.s.ConnectionStateManager Thread-279-spout-dataKafkaSpout1537942916091-executor[824 824]-EventThread [INFO] State change: LOST
2018-09-26 06:34:35.937 o.a.c.ConnectionState Thread-279-spout-dataKafkaSpout1537942916091-executor[824 824]-EventThread [WARN] Session expired event received
2018-09-26 06:34:35.938 o.a.z.ZooKeeper Thread-279-spout-dataKafkaSpout1537942916091-executor[824 824]-EventThread [INFO] Initiating client connection, connectString=zookeeper-prod-1.compass-calix.com:2181,zookeeper-prod-2.compass-calix.com:2181,zookeeper-prod-3.compass-calix.com:2181 sessionTimeout=60000 watcher=...
2018-09-26 06:34:36.177 o.a.z.ClientCnxn Thread-279-spout-dataKafkaSpout1537942916091-executor[824 824]-SendThread(ip-10-9-248-97.us-west-2.compute.internal:2181) [INFO] Opening socket connection to server ip-10-9-248-97.us-west-2.compute.internal/10.9.248.97:2181. Will not attempt to authenticate using SASL (unknown error)
2018-09-26 06:34:36.177 o.a.z.ClientCnxn Thread-279-spout-dataKafkaSpout1537942916091-executor[824 824]-EventThread [INFO] EventThread shut down
2018-09-26 06:34:36.177 o.a.z.ClientCnxn Thread-279-spout-dataKafkaSpout1537942916091-executor[824 824]-SendThread(ip-10-9-248-97.us-west-2.compute.internal:2181) [INFO] Socket connection established to ip-10-9-248-97.us-west-2.compute.internal/10.9.248.97:2181, initiating session
2018-09-26 06:34:36.179 o.a.z.ClientCnxn Thread-279-spout-dataKafkaSpout1537942916091-executor[824 824]-SendThread(ip-10-9-248-97.us-west-2.compute.internal:2181) [INFO] Session establishment complete on server ip-10-9-248-97.us-west-2.compute.internal/10.9.248.97:2181, sessionid = 0xb65a69ab538078d, negotiated timeout = 10000
2018-09-26 06:34:36.180 o.a.c.f.s.ConnectionStateManager Thread-279-spout-dataKafkaSpout1537942916091-executor[824 824]-EventThread [INFO] State change: RECONNECTED
2018-09-26 06:34:36.217 c.c.s.r.z.ZnodeTreeListener Curator-TreeCache-0 [INFO] Listen: add path: /realtime/subscriptions/location/1127582/1033, timestamp is:
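
To decode the Full GC line at the top (HotSpot format: occupancy-before -> occupancy-after (capacity)):

    PSYoungGen: 1298054K -> 370618K  (6126592K)    young generation mostly emptied
    ParOldGen:  5032811K -> 5122788K (6748672K)    old generation ~76% full even after collection
    Heap total: 6330866K -> 5493406K (12875264K)
    Pause:      6.49 s real time (stop-the-world); user=100.76 s summed across GC threads

The old generation actually grows during this "Full GC" (promotions outpace what is reclaimed), so full collections keep recurring. Pauses in this range, especially back to back, line up with the "Client session timed out, have not heard from server in 9008ms" messages that immediately follow: with a negotiated ZooKeeper session timeout of only 10000 ms, a frozen JVM quickly misses enough heartbeats to expire its session.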

From these logs we found that a worker would sometimes spend more than 30 seconds in a full GC, while the topology was configured with topology.message.timeout.secs=30. So whenever a full GC ran longer than 30 seconds, the other workers could not get a response from this worker, and Nimbus would disconnect it, which caused the interruptions the customer reported.
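
As a remediation sketch (our suggestion; the original report stops at the diagnosis): give topology.message.timeout.secs headroom over the worst observed GC pause, and make the worker JVM's GC behavior visible so regressions are caught early. Both can be set from topology code; the heap size and flags below are illustrative and must match the memory actually available on the supervisor hosts.

    import org.apache.storm.Config;

    public class TimeoutTuning {
        public static Config tuned() {
            Config conf = new Config();
            // Raise the message timeout well above the longest observed
            // full-GC pause so tuples are not failed and replayed mid-pause.
            conf.setMessageTimeoutSecs(120);
            // Hypothetical worker JVM options: a larger heap to make full GCs
            // rarer, plus GC logging (JDK 7 flags, matching the 1.7.0_80 in
            // the stack traces) so pauses show up in the worker logs.
            conf.put(Config.TOPOLOGY_WORKER_CHILDOPTS,
                    "-Xmx12g -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps");
            return conf;
        }
    }

Raising timeouts only buys tolerance, though; the GC log above shows the old generation staying nearly full after collection, so the durable fix is to cut retained heap (or enlarge it) until full GCs stop recurring.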
