Recently, the company purchased two Windows Server R2 servers to provide web services. Machine A runs an IHS + DM + WAS 8.5 cluster, and machine B runs Oracle 11g R2 for data storage. Both machines can reach the Internet.
The services were deployed overnight and tested without any problems. In the morning, users called to report that the site could not be reached. After logging in remotely, I found that IHS and DM were running normally but the cluster had not started: Task Manager showed no nodeagent process and no processes for the servers in the cluster. I started nodeagent manually and then started the cluster; both servers came up and the service returned to normal.
At the time I suspected a problem with the application rather than a server restart, and with other work at hand I did not follow up. The same thing happened again the next day: I woke up in the morning to find the service inaccessible. This time it could not be ignored. I first restored service by manually starting nodeagent and the cluster, then collected the relevant logs and started troubleshooting.
1. Check the WebSphere server logs
Checking the SystemOut.log of the servers in the cluster showed that the following entry suddenly appeared at 3:15:
[ --3- A 3: the: -:482CST] 0000004e Peer I ODCF8534I: Removed neighbor ip=192.168.1.8 udp=11011 tcp=11012 ID=A0AFD7F939EF4C971FE6825780126B1741B2F9FF version=0; cellname=win-ru03cb21qgacell01;bridgedcells=[];structuredgateway=false;properties={inodc=1, epoch=1458522523691, member_startup_time=1458522519269, Membername=win-ru03cb21qgacell01\win-ru03cb21qganode01\appsrv02, member_version=4}, the neighbor set is now 2 nodes
0 ip=192.168.1.8 udp=11008 tcp=11007 Id=f271d5e15b5f3696eb6b30d9ef41532f9c5a81e8 version=0; cellname=win-ru03cb21qgacell01;bridgedcells=[];structuredgateway=true;properties={inodc=1, epoch=1458522483936, member_startup_time=1458522480920, Membername=win-ru03cb21qgacell01\win-ru03cb21qganode01\nodeagent, member_version=4}
1 ip=192.168.1.8 udp=11005 tcp=11006 ID=63A7EFDDBD567D67083EFB4FC6A7727DD79C4C32 version=0; cellname=win-ru03cb21qgacell01;bridgedcells=[];structuredgateway=true;properties={inodc=1, member_version=4, epoch=1458503412906, odc_publisher_only=false, member_startup_time=1458503408859, membername=win-ru03cb21qgacell01\win-Ru03cb21qgacellmanager01\dmgr}.
The remaining few lines contain nothing relevant and are omitted here.
2. Check the WebSphere DM log
Checking DM's SystemOut.log showed that, at around 3:15 at night, DM logged a service stop followed by a service start, but the log gives no reason for either.
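Narrowing a large SystemOut.log down to the minutes around the incident can be sketched as follows. This is a minimal sketch, not an IBM tool: it assumes each entry starts with a bracketed timestamp whose time of day has the form H:mm:ss:SSS followed by a zone name (for example `[3/22/16 3:15:42:482 CST]`), and it skips the date part because its layout depends on the server locale. The function name `lines_in_window` is hypothetical.

```python
import re
from datetime import time

# Time-of-day part of a bracketed timestamp such as "[3/22/16 3:15:42:482 CST]".
# The date part is skipped on purpose: its layout varies with locale (assumption).
STAMP = re.compile(r"^\[[^\]]*?(\d{1,2}):(\d{2}):(\d{2}):(\d{3}) \w+\]")


def lines_in_window(lines, start, end):
    """Yield log lines whose time of day lies within [start, end], inclusive.

    Lines without a leading timestamp (stack traces, wrapped text) are skipped.
    """
    for line in lines:
        match = STAMP.match(line)
        if not match:
            continue
        hour, minute, second, millis = map(int, match.groups())
        if start <= time(hour, minute, second, millis * 1000) <= end:
            yield line
```

A typical call would open the log file and pass the file object as `lines`, for example `lines_in_window(open(path, errors="replace"), time(3, 10), time(3, 20))`, where `path` points at the SystemOut.log being examined.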
3. Check the WebSphere FFDC log
Sorting the log files in Dmgr's FFDC directory by date turned up two log files from March 22:
Dmgr_exception.log.1458587814531.txt
Dmgr_25be7f2a_16.03.22_03.16.54.5782445606813376690951.txt
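The 13-digit number in the first file name looks like a Unix timestamp in milliseconds. Decoding it is a quick cross-check that both files belong to the same incident. The sketch below assumes that interpretation and assumes "CST" here means China Standard Time (UTC+8); the helper names are hypothetical.

```python
import re
from datetime import datetime, timedelta, timezone

# Assumption: the server's "CST" is China Standard Time, UTC+8.
CST = timezone(timedelta(hours=8), "CST")


def epoch_ms_from_name(filename):
    """Return the 13-digit millisecond epoch embedded in a file name, or None."""
    match = re.search(r"\.(\d{13})\.", filename)
    return int(match.group(1)) if match else None


def to_local(epoch_ms, tz=CST):
    """Convert a millisecond Unix epoch to an aware datetime in tz."""
    whole_seconds, millis = divmod(epoch_ms, 1000)
    return datetime.fromtimestamp(whole_seconds, tz).replace(microsecond=millis * 1000)


stamp = to_local(epoch_ms_from_name("Dmgr_exception.log.1458587814531.txt"))
print(stamp.isoformat(timespec="milliseconds"))  # 2016-03-22T03:16:54.531+08:00
```

The decoded time, 03:16:54 on March 22, agrees with the `16.03.22_03.16.54` part of the second file name.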
The FFDC files contain the following output:
[3-3: From:578 CST] FFDC Exception:java.io.IOException SourceId:com.ibm.ws.management.discovery.DiscoveryService.sendQuery ProbeId:189 Reporter:[email protected]
java.io.IOException: ADMD0004E: Unable to open TCP socket: WIN-ru03cb21qga:7272. Check to see if the remote process has opened the port.
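The FFDC entry reports that a TCP socket to port 7272 on the host could not be opened. Whether a process is currently listening on such a port can be tested with a plain TCP connect. This is a minimal sketch using only the Python standard library; the function name `can_connect` is hypothetical, and a successful connect only shows that something accepts connections on the port, not which process it is.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures.
        return False
```

For the case in this article the call would be `can_connect("WIN-ru03cb21qga", 7272)`, run on the server itself or on a machine that can resolve that host name.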
"Unable to open TCP sockets" is not a network problem, then what is the network problem? is the network not allowed to restart the service? is the operating system itself doing what? Then look at the operating system log according to the time point.
4. Check the logs in Windows Event Viewer
Click Start -> Administrative Tools -> Event Viewer, click "System" under the Windows Logs node, and filter the event list on the right by the time of the incident (3:15). This finally revealed the problem.
It turned out that the operating system supplied by the cloud service provider was configured to install system updates at three o'clock in the morning and to restart automatically once the updates were installed.
On Windows, IHS and DM are installed as Windows services by default and start with the operating system. Nodeagent is not a service and does not start with the operating system, so after the automatic restart the cluster was left stopped and the site could not be reached.
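The same search can be done from a script instead of the Event Viewer GUI. The sketch below shells out to the Windows `wevtutil` command and asks the System log for its newest events with ID 1074, the ID Windows normally records when a process initiates a restart or shutdown. That this ID is the one seen in this incident is an assumption, since the article does not name it. The function names are hypothetical and the query only runs on Windows.

```python
import subprocess
import sys


def build_query(event_id=1074, count=5):
    """Build a wevtutil command listing the newest System-log events with event_id."""
    return [
        "wevtutil", "qe", "System",
        "/q:*[System[(EventID={})]]".format(event_id),  # XPath filter on the event ID
        "/c:{}".format(count),                          # maximum number of events
        "/rd:true",                                     # newest events first
        "/f:text",                                      # plain-text output
    ]


def recent_events(event_id=1074, count=5):
    """Run the query and return wevtutil's text output (Windows only)."""
    result = subprocess.run(build_query(event_id, count),
                            capture_output=True, text=True, check=True)
    return result.stdout


if __name__ == "__main__" and sys.platform == "win32":
    print(recent_events())
```

Each returned event carries its date, so a restart around 3:15 is easy to spot in the output.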
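One possible remedy, not described in the original write-up, is to register nodeagent as a Windows service with the `WASService.exe` tool that ships in the WebSphere `bin` directory. The sketch below only assembles such a command line so it can be reviewed before anyone runs it. The install path, profile path and service name are hypothetical placeholders, and the options should be checked against the documentation for the installed WebSphere version.

```python
def wasservice_add_command(was_home, profile_path, service_name="nodeagent_service"):
    """Build a WASService.exe command that registers nodeagent as an auto-start service.

    was_home and profile_path are placeholders supplied by the caller; nothing is
    executed here.
    """
    return [
        was_home + r"\bin\WASService.exe",
        "-add", service_name,            # name of the Windows service to create
        "-serverName", "nodeagent",      # the WebSphere process to manage
        "-profilePath", profile_path,    # profile that contains the node agent
        "-wasHome", was_home,
        "-restart", "true",              # restart the process if it fails
        "-startType", "automatic",       # start together with the operating system
    ]


# Hypothetical paths, shown only to illustrate the resulting command line.
cmd = wasservice_add_command(
    r"C:\IBM\WebSphere\AppServer",
    r"C:\IBM\WebSphere\AppServer\profiles\AppSrv01",
)
print(" ".join(cmd))
```

This covers nodeagent only. Whether the cluster members come up after nodeagent starts depends on how they are configured, so the cluster should be checked after the next restart.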
A case of WebSphere service failure caused by automatic update and restart of Windows Server