I have been running a five-machine MongoDB cluster (192.168.40.80 ~ 84) with 5 shards, each split across 3 machines. It had been running normally, but recently the service became very unstable: show dbs kept reporting an error on shard 4, and sometimes a machine's load climbed so high that it went down.
Today I happened to look at the MongoDB logs and found that several machines involved in shard 4 were reporting the same error:
[rsHealthPoll] couldn't connect to 192.168.40.83:29022: couldn't connect to server 192.168.40.83:29022
The shard 4 log on 40.83 showed an error as well:
[rsHealthPoll] replSet info 192.168.40.80:29022 thinks that we are down
This was strange: the network was fine and the service for each shard was running normally, so why could nothing connect from outside? The other machines could ping 40.83, but telnet to port 29022 did not succeed. It turned out that iptables on 40.83 had been switched on when the machine was restarted during system maintenance, and by default it blocked the MongoDB services on that host (in fact, several other shards had problems too). The likely result was that the remaining shard 4 members were hit with too many requests. With only two machines on duty for this shard, one more failure would make the whole data set unavailable. The crash of one machine (not 40.83) was probably related to this as well.
[root@mongodb04 ~]# iptables -L
Chain INPUT (policy ACCEPT)
target     prot opt source               destination
ACCEPT     all  --  anywhere             anywhere            state RELATED,ESTABLISHED
ACCEPT     icmp --  anywhere             anywhere
ACCEPT     all  --  anywhere             anywhere
ACCEPT     tcp  --  anywhere             anywhere            state NEW tcp dpt:ssh
REJECT     all  --  anywhere             anywhere            reject-with icmp-host-prohibited
Chain FORWARD (policy ACCEPT)
target     prot opt source               destination
REJECT     all  --  anywhere             anywhere            reject-with icmp-host-prohibited
Chain OUTPUT (policy ACCEPT)
target     prot opt source               destination
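The symptom that gave the problem away, ping succeeding while TCP port 29022 is refused, can be checked from any other cluster member with a small script. The following is only a sketch: the function names port_open and check_members are made up for illustration, and it assumes bash and the coreutils timeout command are available on the machine running the check.

```shell
#!/usr/bin/env bash
# Sketch: test whether a TCP port accepts connections, using bash's /dev/tcp.
# port_open and check_members are illustrative names, not MongoDB tools.

port_open() {
    local host=$1 port=$2
    # timeout guards against firewalls that silently DROP packets;
    # a REJECT rule (as in the listing above) fails immediately instead.
    timeout 3 bash -c "exec 3<>/dev/tcp/$host/$port" 2>/dev/null
}

check_members() {
    local port=$1
    shift
    local host
    for host in "$@"; do
        if port_open "$host" "$port"; then
            echo "$host:$port open"
        else
            echo "$host:$port BLOCKED or down"
        fi
    done
}
```

For example, running check_members 29022 192.168.40.80 192.168.40.83 from another node prints one line per host. A host that answers ping but shows up as "BLOCKED or down" while its mongod is running locally points at the firewall.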
The workaround is to stop iptables and disable it at boot:
service iptables stop
chkconfig --level 2345 iptables off
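Turning the firewall off entirely is the blunt fix, and it is what was done here. If the firewall has to stay on, a narrower option is to insert an ACCEPT rule for the MongoDB port ahead of the final REJECT rule. The sketch below only prints the commands so they can be reviewed first. The function name print_mongo_rules is hypothetical, port 29022 is the shard 4 port from the logs above, and service iptables save assumes the same RHEL/CentOS-style init scripts that chkconfig implies.

```shell
#!/usr/bin/env bash
# Sketch: print (do not run) iptables commands that open the given TCP ports.
# print_mongo_rules is an illustrative name; pass the ports your mongod
# instances actually listen on.

print_mongo_rules() {
    local port
    for port in "$@"; do
        # -I INPUT inserts at the top of the chain, ahead of the catch-all REJECT
        echo "iptables -I INPUT -p tcp --dport $port -j ACCEPT"
    done
    # Persist the running rules across reboots (RHEL/CentOS init script)
    echo "service iptables save"
}
```

After reviewing the output of print_mongo_rules 29022, the commands can be run as root, for example by piping them to sh.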
PS. To sum up all of the above: MongoDB shard services can be blocked by the iptables firewall. If network and service problems have been ruled out, the firewall settings are the most likely cause and the next thing to check.