A customer reported that their real-time processing was frequently interrupted, as described below:
Some Storm workers were spending too much time in full GC, which caused Nimbus to reset those workers, so data was lost.
- Steps to investigate this issue
1. Checked the Storm UI and found that all workers looked normal, but some tasks had failed.
2. Checked the Storm worker logs and found exceptions like the following:
2018-09-26 07:35:55.081 flowkafkareadspout getDataThread-8 [INFO] Partition:8, offset:676160000, key:2018-09-26 07:35:53_15067, valuelength:1755
2018-09-26 07:36:00.490 o.a.s.m.n.StormServerHandler Netty-server-localhost-6700-worker-1 [ERROR] server errors in handling the request
java.io.IOException: Connection reset by peer
        at sun.nio.ch.FileDispatcherImpl.read0(Native Method) ~[?:1.7.0_80]
        at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39) ~[?:1.7.0_80]
        at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223) ~[?:1.7.0_80]
        at sun.nio.ch.IOUtil.read(IOUtil.java:192) ~[?:1.7.0_80]
        at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:384) ~[?:1.7.0_80]
        at org.apache.storm.shade.org.jboss.netty.channel.socket.nio.NioWorker.read(NioWorker.java:...) [storm-core-1.1.1.jar:1.1.1]
        at org.apache.storm.shade.org.jboss.netty.channel.socket.nio.AbstractNioWorker.process(AbstractNioWorker.java:108) [storm-core-1.1.1.jar:1.1.1]
        at org.apache.storm.shade.org.jboss.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:318) [storm-core-1.1.1.jar:1.1.1]
        at org.apache.storm.shade.org.jboss.netty.channel.socket.nio.AbstractNioWorker.run(AbstractNioWorker.java:...) [storm-core-1.1.1.jar:1.1.1]
        at org.apache.storm.shade.org.jboss.netty.channel.socket.nio.NioWorker.run(NioWorker.java:178) [storm-core-1.1.1.jar:1.1.1]
        at org.apache.storm.shade.org.jboss.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108) [storm-core-1.1.1.jar:1.1.1]
        at org.apache.storm.shade.org.jboss.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42) [storm-core-1.1.1.jar:1.1.1]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [?:1.7.0_80]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [?:1.7.0_80]
        at java.lang.Thread.run(Thread.java:745) [?:1.7.0_80]
2018-09-26 07:36:00.490 o.a.s.m.n.StormClientHandler client-worker-1 [INFO] Connection to ip-10-9-248-74.us-west-2.compute.internal/10.9.248.74:6700 failed: java.io.IOException: Connection reset by peer
        at sun.nio.ch.FileDispatcherImpl.read0(Native Method) ~[?:1.7.0_80]
        at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39) ~[?:1.7.0_80]
        at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223) ~[?:1.7.0_80]
        at sun.nio.ch.IOUtil.read(IOUtil.java:192) ~[?:1.7.0_80]
        at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:384) ~[?:1.7.0_80]
3. It seemed that some workers could not connect to each other, and we also found other "Connection reset by peer" exceptions against the ZooKeeper cluster and the Kafka cluster. Checking the port usage showed:
tcp6       0      0 10.9.248.61:38050      10.9.248.70:9092       TIME_WAIT
tcp6       0      0 10.9.248.61:38066      10.9.248.70:9092       TIME_WAIT
tcp6       0      0 10.9.248.61:39160      10.9.248.97:2181       TIME_WAIT
4. We restarted all Storm workers, the Kafka cluster, and the ZooKeeper cluster, but the issue was still not fixed.
5. We checked the Storm workers again and found that other workers also had issues like the following:
2018-09-26 06:34:34.834 STDERR Thread-2 [INFO] 740.606: [Full GC [PSYoungGen: 1298054K->370618K(6126592K)] [ParOldGen: 5032811K->5122788K(6748672K)] 6330866K->5493406K(12875264K) [PSPermGen: 55526K->55525K(524288K)], 6.4880090 secs] [Times: user=100.76 sys=0.00, real=6.49 secs]
2018-09-26 06:34:34.834 o.a.s.s.o.a.z.ClientCnxn thread-23-$mastercoord-bg1-executor[2 2]-SendThread(ip-10-9-248-97.us-west-2.compute.internal:2181) [INFO] Client session timed out, have not heard from server in 9008ms for sessionid 0xb65a69ab5380782, closing socket connection and attempting reconnect
2018-09-26 06:34:34.840 o.a.z.ClientCnxn thread-279-spout-datakafkaspout1537942916091-executor[824 824]-SendThread(ip-10-9-248-98.us-west-2.compute.internal:2181) [INFO] Client session timed out, have not heard from server in 10147ms for sessionid 0xa5beb7fcf46ff88, closing socket connection and attempting reconnect
2018-09-26 06:34:34.8 o.a.s.s.o.a.z.ClientCnxn main-SendThread(ip-10-9-248-98.us-west-2.compute.internal:2181) [INFO] Client session timed out, have not heard from server in 9398ms for sessionid 0xa5beb7fcf46ff82, closing socket connection and attempting reconnect
2018-09-26 06:34:34.935 o.a.s.s.o.a.c.f.s.ConnectionStateManager thread-23-$mastercoord-bg1-executor[2 2]-EventThread [INFO] State change: SUSPENDED
2018-09-26 06:34:34.941 o.a.c.f.s.ConnectionStateManager thread-279-spout-datakafkaspout1537942916091-executor[824 824]-EventThread [INFO] State change: SUSPENDED
2018-09-26 06:34:34.942 o.a.s.s.o.a.c.f.s.ConnectionStateManager main-EventThread [INFO] State change: SUSPENDED
2018-09-26 06:34:34.943 o.a.s.u.StormBoundedExponentialBackoffRetry executor-heartbeat-timer [WARN] WILL SLEEP FOR 1001ms (NOT MAX)
2018-09-26 06:34:34.943 o.a.s.u.StormBoundedExponentialBackoffRetry refresh-active-timer [WARN] WILL SLEEP FOR 1001ms (NOT MAX)
2018-09-26 06:34:34.943 o.a.s.u.StormBoundedExponentialBackoffRetry refresh-connections-timer [WARN] WILL SLEEP FOR 1001ms (NOT MAX)
2018-09-26 06:34:34.943 o.a.s.c.zookeeper-state-factory main-EventThread [WARN] Received event :disconnected::none: with disconnected Writer Zookeeper.
2018-09-26 06:34:35.182 o.a.s.s.o.a.z.ClientCnxn thread-23-$mastercoord-bg1-executor[2 2]-SendThread(ip-10-9-248-98.us-west-2.compute.internal:2181) [INFO] Opening socket connection to server ip-10-9-248-98.us-west-2.compute.internal/10.9.248.98:2181. Will not attempt to authenticate using SASL (unknown error)
2018-09-26 06:34:35.183 o.a.s.s.o.a.z.ClientCnxn thread-23-$mastercoord-bg1-executor[2 2]-SendThread(ip-10-9-248-98.us-west-2.compute.internal:2181) [INFO] Socket connection established to ip-10-9-248-98.us-west-2.compute.internal/10.9.248.98:2181, initiating session
2018-09-26 06:34:35.183 o.a.s.s.o.a.z.ClientCnxn thread-23-$mastercoord-bg1-executor[2 2]-SendThread(ip-10-9-248-98.us-west-2.compute.internal:2181) [INFO] Session establishment complete on server ip-10-9-248-98.us-west-2.compute.internal/10.9.248.98:2181, sessionid = 0xb65a69ab53807..., negotiated timeout = 10000
2018-09-26 06:34:35.183 o.a.s.s.o.a.c.f.s.ConnectionStateManager thread-23-$mastercoord-bg1-executor[2 2]-EventThread [INFO] State change: RECONNECTED
2018-09-26 06:34:35.787 o.a.s.s.o.a.z.ClientCnxn main-SendThread(ip-10-9-248-97.us-west-2.compute.internal:2181) [INFO] Opening socket connection to server ip-10-9-248-97.us-west-2.compute.internal/10.9.248.97:2181. Will not attempt to authenticate using SASL (unknown error)
2018-09-26 06:34:35.787 o.a.s.s.o.a.z.ClientCnxn main-SendThread(ip-10-9-248-97.us-west-2.compute.internal:2181) [INFO] Socket connection established to ip-10-9-248-97.us-west-2.compute.internal/10.9.248.97:2181, initiating session
2018-09-26 06:34:35.789 o.a.s.s.o.a.z.ClientCnxn main-SendThread(ip-10-9-248-97.us-west-2.compute.internal:2181) [INFO] Unable to reconnect to ZooKeeper service, session 0xa5beb7fcf46ff82 has expired, closing socket connection
2018-09-26 06:34:35.789 o.a.s.s.o.a.c.f.s.ConnectionStateManager main-EventThread [INFO] State change: LOST
2018-09-26 06:34:35.789 o.a.s.c.zookeeper-state-factory main-EventThread [WARN] Received event :expired::none: with disconnected Writer Zookeeper.
2018-09-26 06:34:35.789 o.a.s.s.o.a.c.ConnectionState main-EventThread [WARN] Session expired event received
2018-09-26 06:34:35.789 o.a.s.s.o.a.z.ZooKeeper main-EventThread [INFO] Initiating client connection, connectString=ec2-52-27-163-101.us-west-2.compute.amazonaws.com:2181,ec2-52-27-236-22.us-west-2.compute.amazonaws.com:2181,ec2-52-24-149-36.us-west-2.compute.amazonaws.com:2181/storm111 sessionTimeout=90000 watcher=...@383fbe82
2018-09-26 06:34:35.802 o.a.s.s.o.a.z.ClientCnxn main-EventThread [INFO] EventThread shut down
2018-09-26 06:34:35.802 o.a.s.s.o.a.z.ClientCnxn main-SendThread(ip-10-9-248-97.us-west-2.compute.internal:2181) [INFO] Opening socket connection to server ip-10-9-248-97.us-west-2.compute.internal/10.9.248.97:2181. Will not attempt to authenticate using SASL (unknown error)
2018-09-26 06:34:35.802 o.a.s.s.o.a.z.ClientCnxn main-SendThread(ip-10-9-248-97.us-west-2.compute.internal:2181) [INFO] Socket connection established to ip-10-9-248-97.us-west-2.compute.internal/10.9.248.97:2181, initiating session
2018-09-26 06:34:35.804 o.a.s.s.o.a.z.ClientCnxn main-SendThread(ip-10-9-248-97.us-west-2.compute.internal:2181) [INFO] Session establishment complete on server ip-10-9-248-97.us-west-2.compute.internal/10.9.248.97:2181, sessionid = 0xb65a69ab538078c, negotiated timeout = 10000
2018-09-26 06:34:35.805 o.a.s.s.o.a.c.f.s.ConnectionStateManager main-EventThread [INFO] State change: RECONNECTED
2018-09-26 06:34:35.935 o.a.z.ClientCnxn thread-279-spout-datakafkaspout1537942916091-executor[824 824]-SendThread(ip-10-9-248-97.us-west-2.compute.internal:2181) [INFO] Opening socket connection to server ip-10-9-248-97.us-west-2.compute.internal/10.9.248.97:2181. Will not attempt to authenticate using SASL (unknown error)
2018-09-26 06:34:35.935 o.a.z.ClientCnxn thread-279-spout-datakafkaspout1537942916091-executor[824 824]-SendThread(ip-10-9-248-97.us-west-2.compute.internal:2181) [INFO] Socket connection established to ip-10-9-248-97.us-west-2.compute.internal/10.9.248.97:2181, initiating session
2018-09-26 06:34:35.937 o.a.z.ClientCnxn thread-279-spout-datakafkaspout1537942916091-executor[824 824]-SendThread(ip-10-9-248-97.us-west-2.compute.internal:2181) [INFO] Unable to reconnect to ZooKeeper service, session 0xa5beb7fcf46ff88 has expired, closing socket connection
2018-09-26 06:34:35.937 o.a.c.f.s.ConnectionStateManager thread-279-spout-datakafkaspout1537942916091-executor[824 824]-EventThread [INFO] State change: LOST
2018-09-26 06:34:35.937 o.a.c.ConnectionState thread-279-spout-datakafkaspout1537942916091-executor[824 824]-EventThread [WARN] Session expired event received
2018-09-26 06:34:35.938 o.a.z.ZooKeeper thread-279-spout-datakafkaspout1537942916091-executor[824 824]-EventThread [INFO] Initiating client connection, connectString=zookeeper-prod-1.compass-calix.com:2181,zookeeper-prod-2.compass-calix.com:2181,zookeeper-prod-3.compass-calix.com:2181 sessionTimeout=60000 watcher=...
2018-09-26 06:34:36.177 o.a.z.ClientCnxn thread-279-spout-datakafkaspout1537942916091-executor[824 824]-SendThread(ip-10-9-248-97.us-west-2.compute.internal:2181) [INFO] Opening socket connection to server ip-10-9-248-97.us-west-2.compute.internal/10.9.248.97:2181. Will not attempt to authenticate using SASL (unknown error)
2018-09-26 06:34:36.177 o.a.z.ClientCnxn thread-279-spout-datakafkaspout1537942916091-executor[824 824]-EventThread [INFO] EventThread shut down
2018-09-26 06:34:36.177 o.a.z.ClientCnxn thread-279-spout-datakafkaspout1537942916091-executor[824 824]-SendThread(ip-10-9-248-97.us-west-2.compute.internal:2181) [INFO] Socket connection established to ip-10-9-248-97.us-west-2.compute.internal/10.9.248.97:2181, initiating session
2018-09-26 06:34:36.179 o.a.z.ClientCnxn thread-279-spout-datakafkaspout1537942916091-executor[824 824]-SendThread(ip-10-9-248-97.us-west-2.compute.internal:2181) [INFO] Session establishment complete on server ip-10-9-248-97.us-west-2.compute.internal/10.9.248.97:2181, sessionid = 0xb65a69ab538078d, negotiated timeout = 10000
2018-09-26 06:34:36.180 o.a.c.f.s.ConnectionStateManager thread-279-spout-datakafkaspout1537942916091-executor[824 824]-EventThread [INFO] State change: RECONNECTED
2018-09-26 06:34:36.217 c.c.s.r.z.ZnodeTreeListener curator-treecache-0 [INFO] Listen: add path: /realtime/subscriptions/location/1127582/1033, timestamp is:
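The GC line above is the key evidence: even after the full collection the old generation is still about 76% full (5122788K of 6748672K) and has actually grown from 5032811K, so full GCs will keep recurring, and the real stop-the-world pause of 6.49 s is a large part of why the clients went 9-10 s without hearing from ZooKeeper against a negotiated session timeout of only 10000 ms. One common way to work on this is to give the workers a larger heap while keeping GC logging on so pause times stay visible. The snippet below is only a minimal sketch of how such worker JVM options can be set through the topology configuration; the heap size and flags are illustrative assumptions, not the customer's actual settings, and this is not presented as the fix that was applied.

    import org.apache.storm.Config;

    public class WorkerJvmTuningSketch {
        // Builds a topology Config that sets the worker JVM options
        // (Config.TOPOLOGY_WORKER_CHILDOPTS is the topology.worker.childopts key).
        public static Config workerJvmOptions() {
            Config conf = new Config();
            // Java 7-era GC flags, matching the 1.7.0_80 runtime seen in the stack traces;
            // heap size is an example value only.
            conf.put(Config.TOPOLOGY_WORKER_CHILDOPTS,
                    "-Xms8g -Xmx8g -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps");
            return conf;
        }
    }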
From the logs we found that a worker would sometimes spend more than 30 seconds in a full GC, while the topology was configured with topology.message.timeout.secs=30. Whenever a full GC ran longer than 30 seconds, the other workers could not get a response from that worker, and Nimbus would disconnect and reset it.
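For reference, topology.message.timeout.secs is a per-topology setting, so the relationship described above can be expressed directly in the topology code. This is a hedged sketch of the knob involved; the 120-second value is purely illustrative and not a recommendation that was applied in this case.

    import org.apache.storm.Config;

    public class MessageTimeoutSketch {
        // Builds a topology Config whose message timeout comfortably exceeds the
        // worst-case stop-the-world GC pause, so in-flight tuples are not failed
        // and replayed just because the worker was paused.
        public static Config messageTimeout() {
            Config conf = new Config();
            conf.setMessageTimeoutSecs(120); // illustrative; this topology was running with 30
            return conf;
        }
    }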