Fuel 6.1 Automated Deployment of a 3-Controller High-Availability CentOS 6.5 Juno Environment: Troubleshooting (I)

Source: Internet
Author: User
Tags: haproxy

To view the fuel log:

# less /var/log/docker-logs/remote/node-1.domain.tld/puppet-apply.log

2015-12-25T17:26:22.134118+00:00 debug: Waiting 600 seconds for service 'vip__public' to start

Wait for "Vip__public" to start for more than 600 seconds. "Vip__public" is the VIP resource of the cluster:

# crm configure show

primitive vip__public ocf:fuel:ns_IPaddr2 \
        op stop timeout=30 interval=0 \
        op monitor timeout=30 interval=3 \
        op start timeout=30 interval=0 \
        params ns_iptables_start_rules=false ip=168.1.22.80 ns_iptables_stop_rules=false iptables_stop_rules=false bridge=br-ex ns=haproxy nic=br-ex iptables_start_rules=false other_networks=false gateway=168.1.22.1 base_veth=br-ex-hapr gateway_metric=10 ns_veth=hapr-p iflabel=ka iptables_comment=false cidr_netmask=24 \
        meta migration-threshold=3 failure-timeout=60 resource-stickiness=1 target-role=started

For some reason, the "vip__public" resource could not be started:

# crm resource status vip__public

resource vip__public is not running

# ping 168.1.22.80

PING 168.1.22.80 (168.1.22.80) bytes of data.
From 168.1.22.82 icmp_seq=2 Destination Host Unreachable
From 168.1.22.82 icmp_seq=3 Destination Host Unreachable

Attempting to start the resource manually still fails:

# crm resource start vip__public

# crm resource status vip__public

resource vip__public is not running

Because /var/log/corosync.log contains no error message explaining why the resource cannot start, another approach to troubleshooting is needed.
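
Before digging deeper, two quick checks are often worth running; a minimal sketch using standard pacemaker tools (nothing specific to this deployment). The first shows one-shot cluster status including per-resource fail counts, the second searches the corosync log for any mention of the resource:

# crm_mon -1 -f

# grep -i vip__public /var/log/corosync.log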

Because the resource type of vip__public is ocf:fuel:ns_IPaddr2, the following resource agent script is called whenever the VIP is created:

/usr/lib/ocf/resource.d/fuel/ns_IPaddr2

Create the ocft test case /usr/share/resource-agents/ocft/configs/ns_ipaddr2_start with the following content:

--------------------------------------------------------------------------

# ns_IPaddr2

CONFIG
    Agent ns_IPaddr2
    AgentRoot /usr/lib/ocf/resource.d/fuel
    HangTimeout 20

CASE-BLOCK set_ocf_reskey_env
    Env OCF_RESKEY_iptables_start_rules=false
    Env OCF_RESKEY_iptables_stop_rules=false
    Env OCF_RESKEY_gateway_metric=10
    Env OCF_RESKEY_ns_veth=hapr-p
    Env OCF_RESKEY_base_veth=br-ex-hapr
    Env OCF_RESKEY_gateway=168.1.22.1
    Env OCF_RESKEY_bridge=br-ex
    Env OCF_RESKEY_ip=168.1.22.80
    Env OCF_RESKEY_iflabel=ka
    Env OCF_RESKEY_other_networks=false
    Env OCF_RESKEY_cidr_netmask=24
    Env OCF_RESKEY_ns=haproxy
    Env OCF_RESKEY_ns_iptables_start_rules=false
    Env OCF_RESKEY_iptables_comment=false
    Env OCF_RESKEY_nic=br-ex
    Env OCF_RESKEY_ns_iptables_stop_rules=false

CASE "::: Begin Test Start :::"
    Include set_ocf_reskey_env
    AgentRun start OCF_SUCCESS

-----------------------------------------------------------------

To compile the test script:

# ocft make ns_ipaddr2_start

Execute the test script:

# ocft test ns_ipaddr2_start

The test script was executed without an error and the VIP was created successfully:

# ping 168.1.22.80

PING 168.1.22.80 (168.1.22.80) bytes of data.
64 bytes from 168.1.22.80: icmp_seq=1 ttl=64 time=0.046 ms
64 bytes from 168.1.22.80: icmp_seq=2 ttl=64 time=0.059 ms

It seems that, for some reason, the cluster never calls this script for vip__public. So back up and edit the resource agent script:

# cp -a /usr/lib/ocf/resource.d/fuel/ns_IPaddr2 /usr/lib/ocf/resource.d/fuel/ns_IPaddr2.bak

# vi /usr/lib/ocf/resource.d/fuel/ns_IPaddr2

Add a line at line 669:

668 ## Main
669 echo $__OCF_ACTION >> /root/ns_ipaddr2.log
670 rc=$OCF_SUCCESS
671 case $__OCF_ACTION in

With this modification, whenever the resource agent script is called, the requested action is appended to /root/ns_ipaddr2.log.

Try to start the resource manually and see if the resource agent script is called:

# crm resource start vip__public

# less /root/ns_ipaddr2.log

The log is empty, indicating that the resource agent script was not called.

For further verification, operate on another resource that is working normally:

# crm resource restart vip__management

# less /root/ns_ipaddr2.log

This time the log has content, showing that a normally working resource does call the resource agent script.

So: corosync.log shows nothing, and the ns_IPaddr2 resource agent script is never invoked for vip__public. Most likely there is a logic error in the vip__public resource definition.

Cluster data such as resource definitions is stored on every node in the XML file /var/lib/pacemaker/cib/cib.xml, where constraints can be defined for each resource. A constraint states a condition; depending on whether the condition is met, the resource is allowed to start, or prevented from starting, on a node.
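
The live CIB can be inspected directly with the same tools used later in this article; a minimal sketch:

# cibadmin --query > /tmp/cib.xml

# cibadmin --query --scope=constraints

# crm configure show

The first command dumps the whole CIB as XML, the second dumps only the constraints section, and the third shows the configuration, including location constraints, in crm shell syntax.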

The most likely culprit was therefore a constraint definition error. But with no understanding of cib.xml at the time, no logic error could be spotted in it directly.

In this situation it helps to know exactly what puppet is doing at the moment the problem appears during the fuel push. Since the problem is with the VIP resource, the symptom is that ping to 168.1.22.80 stops working; and since 168.1.22.80 is not reachable at all before the deployment starts, the exact moment at which it goes from reachable back to unreachable reveals which puppet action caused the problem.

On a Linux server in the 168.1.22.0/24 network segment, write a ping script that timestamps each reply, as follows:

----------------------------------------------------

#!/bin/sh
ping 168.1.22.80 | while read pong; do echo "$(date '+%Y-%m-%d %H:%M:%S') $pong"; done >> /root/ping.log

----------------------------------------------------

Push again from fuel and run the script in the background on that server:

# nohup /root/ping.sh &

According to the log, vip__public failed at approximately 23:12:00:

2015-12-25 23:11:58 64 bytes from 168.1.22.80: icmp_seq=2844 ttl=64 time=6.70 ms
2015-12-25 23:11:59 64 bytes from 168.1.22.80: icmp_seq=2845 ttl=64 time=1.15 ms
2015-12-25 23:12:00 64 bytes from 168.1.22.80: icmp_seq=2846 ttl=64 time=0.892 ms
2015-12-25 23:12:30 From 168.1.22.205 icmp_seq=2873 Destination Host Unreachable
2015-12-25 23:12:30 From 168.1.22.205 icmp_seq=2874 Destination Host Unreachable
2015-12-25 23:12:30 From 168.1.22.205 icmp_seq=2875 Destination Host Unreachable

Look at what happened at 23:12:00 in the puppet log:

2015-12-25T23:12:00.766102+00:00 debug: (Cs_rsc_location[loc_ping_vip__public](provider=crm)) {}
2015-12-25T23:12:00.768631+00:00 notice: (/Stage[main]/Main/Cluster::Virtual_ip_ping[vip__public]/Cs_rsc_location[loc_ping_vip__public]/ensure) created
2015-12-25T23:12:00.775187+00:00 debug: executing '/usr/sbin/pcs property show dc-version'
2015-12-25T23:12:03.157766+00:00 debug: (Cs_rsc_location[loc_ping_vip__public](provider=crm)) Evaluating {:cib=>"ping_vip__public", :node_score=>nil, :primitive=>"vip__public", :name=>"loc_ping_vip__public", :rules=>[{:expressions=>[{:value=>"or", :operation=>"pingd", :attribute=>"not_defined"}, {:value=>"0", :operation=>"lte", :attribute=>"pingd"}], :boolean=>"", :score=>"-inf"}], :node_name=>nil, :ensure=>:present}
2015-12-25T23:12:03.157951+00:00 debug: (Cs_rsc_location[loc_ping_vip__public](provider=crm)) Creating location with command
2015-12-25T23:12:03.157951+00:00 debug: (Cs_rsc_location[loc_ping_vip__public](provider=crm)) location loc_ping_vip__public vip__public rule -inf: not_defined pingd or pingd lte 0
2015-12-25T23:12:03.157951+00:00 debug: (Cs_rsc_location[loc_ping_vip__public](provider=crm)) trying to delete old shadow if exists
2015-12-25T23:12:03.157951+00:00 debug: executing '/usr/sbin/crm_shadow -b -f -D location_loc_ping_vip__public'
2015-12-25T23:12:03.206974+00:00 debug: (Cs_rsc_location[loc_ping_vip__public](provider=crm)) Delete failed but proceeding anyway
2015-12-25T23:12:03.207226+00:00 debug: executing '/usr/sbin/crm_shadow -b -c location_loc_ping_vip__public'
2015-12-25T23:12:04.355222+00:00 debug: (Cs_rsc_location[loc_ping_vip__public](provider=crm)) No difference - nothing to apply
2015-12-25T23:12:04.355979+00:00 debug: (/Stage[main]/Main/Cluster::Virtual_ip_ping[vip__public]/Cs_rsc_location[loc_ping_vip__public]) The container Cluster::Virtual_ip_ping[vip__public] will propagate my refresh event
2015-12-25T23:12:04.356660+00:00 info: (/Stage[main]/Main/Cluster::Virtual_ip_ping[vip__public]/Cs_rsc_location[loc_ping_vip__public]) evaluated in 6.21 seconds
2015-12-25T23:12:04.357718+00:00 info: (/Stage[main]/Main/Cluster::Virtual_ip_ping[vip__public]/Service[ping_vip__public]) Starting to evaluate the resource
2015-12-25T23:12:04.358951+00:00 debug: Waiting seconds for Pacemaker to become online

Because my understanding of CRM was limited at that point, I could not fully work out what these log entries were doing; I could only tell that puppet performs some operation on the CIB, so it seemed necessary to keep tracking how the CIB changes.

Push again with fuel, and on controller1 add a script that continuously exports the contents of cib.xml:

------------------------------------------------------------------------

#!/bin/sh
# Periodically dump the CIB; keep a snapshot only when its md5 changes.

str_time=""
str_path=""
str_md5_c=""
str_md5=""

mkdir -p /root/cib
while :
do
    str_time=$(date +%Y%m%d%H%M%S)
    str_path=/root/cib/cib_$str_time.xml
    cibadmin --query > $str_path
    str_md5_c=$(md5sum $str_path | awk '{print $1}')
    if [ "$str_md5" != "$str_md5_c" ]; then
        # CIB changed: keep the snapshot, suffixed with the last 4 md5 characters
        str_md5=$str_md5_c
        mv $str_path /root/cib/cib_$str_time.${str_md5:0-4}
    else
        rm -f $str_path
    fi
    sleep 1s
done

--------------------------------------------------------------------------

Execute the script in the background on controller1:

# nohup /root/cib.sh &

The ping script shows the problem occurred at about 23:12:00. Look at the /root/cib directory on controller1:

Cib_20151225231150.1fbc
Cib_20151225231201.b4f1
Cib_20151225231203.09b2

cib_20151225231150.1fbc:

<cib crm_feature_set="3.0.9" validate-with="pacemaker-2.0" epoch="119" num_updates="3" admin_epoch="0" cib-last-written="Fri Dec 25 23:11:48 2015" have-quorum="1" ...>

cib_20151225231201.b4f1:

<cib crm_feature_set="3.0.9" validate-with="pacemaker-2.0" epoch="..." num_updates="0" admin_epoch="0" cib-last-written="Fri Dec 25 23:12:00 2015" have-quorum="1" ...>

Comparing the two files with Beyond Compare shows that the second file has just one addition:

399       <rsc_location id="loc_ping_vip__public" rsc="vip__public">
400         <rule score="-INFINITY" id="loc_ping_vip__public-rule" boolean-op="or">
401           <expression operation="not_defined" id="loc_ping_vip__public-rule-expression" attribute="pingd"/>
402           <expression value="0" operation="lte" id="loc_ping_vip__public-rule-expression-0" attribute="pingd"/>
403         </rule>
404       </rsc_location>

loc_ping_vip__public is part of the constraints section of cib.xml, which matches the conjecture above.
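
The same comparison can also be made directly on the controller with diff, without copying the files anywhere; a minimal sketch using the two snapshots above:

# diff -u /root/cib/cib_20151225231150.1fbc /root/cib/cib_20151225231201.b4f1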

Export the constraint section after the error occurred:

# cibadmin --query --scope=constraints > /root/constraints.xml

Edit constraints.xml, delete the definition of loc_ping_vip__public, and save the file to your own computer.
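
As a side note: on a running cluster the constraint could also be removed directly, assuming the crm shell is available, with a single command such as the one below. However, the fuel push recreates the constraint during deployment, which is why the approach here is instead to wait for it to appear and then replace the whole constraints section (see the script below).

# crm configure delete loc_ping_vip__public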

Re-push with fuel; on controller1, add a script that waits for "loc_ping_vip__public" to appear in cib.xml and then restores the saved constraints section:

---------------------------------------------------------------------------

#!/bin/sh
# Wait until the loc_ping_vip__public constraint appears, then replace the
# constraints section with the saved copy that does not contain it.
r=1
while [ "$r" = 1 ]; do
    cibadmin --query --scope=constraints > /root/t.xml
    grep loc_ping_vip__public /root/t.xml
    r=$?
    sleep 1s
done
sleep 29s
cibadmin --replace --scope=constraints --xml-file /root/constraints.xml

---------------------------------------------------------------------------
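
Run the script in the background on controller1 at the start of the push (the script file name here is arbitrary):

# nohup /root/fix_constraints.sh &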

The push succeeds.

This success, however, was obtained by manually modifying what puppet pushed out. Nobody else online seems to have run into the same problem, and since it can be resolved by adjusting puppet, it probably is not a bug. So it is worth understanding what the loc_ping_vip__public definition really means.

This can be learned from the following links:

http://clusterlabs.org/doc/en-US/Pacemaker/1.1/html-single/Pacemaker_Explained/#_scores

http://clusterlabs.org/doc/en-US/Pacemaker/1.1/html-single/Pacemaker_Explained/#_rule_properties

http://clusterlabs.org/doc/en-US/Pacemaker/1.1/html/Pacemaker_Explained/_moving_resources_due_to_connectivity_changes.html

http://clusterlabs.org/doc/en-US/Pacemaker/1.1/html/Pacemaker_Explained/_tell_pacemaker_how_to_interpret_the_connectivity_data.html

With that background, the loc_ping_vip__public definition can be read as follows:

399       <rsc_location id="loc_ping_vip__public" rsc="vip__public">
400         <rule score="-INFINITY" id="loc_ping_vip__public-rule" boolean-op="or">
401           <expression operation="not_defined" id="loc_ping_vip__public-rule-expression" attribute="pingd"/>
402           <expression value="0" operation="lte" id="loc_ping_vip__public-rule-expression-0" attribute="pingd"/>
403         </rule>
404       </rsc_location>

rsc= "Vip__public" means the object of constraint is vip__public resource;

Boolean-op= "or" means the Boolean value of each expression (<expression .../>) for the "or" operation;

Rule score= "-infinity" means the final result if it is "True" (or the result of the operation here), the constrained object cannot be started;

Expression 1, <expression operation="not_defined" .../>, checks whether the node attribute "pingd" is not defined at all. Here the result is false, because the cluster does have a resource that maintains the "pingd" attribute:

<clone id="clone_ping_vip__public">
  <primitive class="ocf" type="ping" id="ping_vip__public" provider="pacemaker">
    <operations>
      <op timeout="" name="monitor" interval="20" id="ping_vip__public-monitor-20"/>
    </operations>
    <instance_attributes id="ping_vip__public-instance_attributes">
      <nvpair value="3s" name="timeout" id="ping_vip__public-instance_attributes-timeout"/>
      <nvpair value="30s" name="dampen" id="ping_vip__public-instance_attributes-dampen"/>
      <nvpair value="" name="multiplier" id="ping_vip__public-instance_attributes-multiplier"/>
      <nvpair value="168.1.22.1" name="host_list" id="ping_vip__public-instance_attributes-host_list"/>
    </instance_attributes>
  </primitive>
</clone>

Expression 2, <expression value="0" operation="lte" .../>, checks the value of the "pingd" attribute. The resource that maintains "pingd" pings the hosts in its host_list and stores the result in that attribute; when a host cannot be pinged, the value is 0, so "pingd lte 0" evaluates to true. From the ping_vip__public definition above, the host it pings is 168.1.22.1, the gateway of my environment. My environment actually has no working gateway, so the ping fails, this expression is true, the OR of the rule is therefore true, and the constrained resource is not allowed to start.
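
For reference, the same constraint in crm shell syntax, matching the command visible in the puppet log above:

location loc_ping_vip__public vip__public \
    rule -inf: not_defined pingd or pingd lte 0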

To sum up, the constraint loc_ping_vip__public tests whether the gateway is reachable: vip__public may only start on a node that can ping the gateway. No wonder the push succeeds once loc_ping_vip__public is removed, and no wonder nobody else online has hit this problem: in their environments the gateway is reachable!

Later, a discussion of the same problem was found online:

http://irclog.perlgeek.de/fuel-dev/2015-01-28

PS: To explain once more what "a resource with the pingd attribute" means: it is a resource of type "ocf:pacemaker:ping". View it with the command "crm configure show"; a cluster of this kind generally has at least one resource of this type.
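
To check whether the pingd attribute is actually defined on a node, and with what value, the node attributes can be shown in a one-shot status display; a minimal sketch:

# crm_mon -1 -A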

It took several re-pushes with fuel to troubleshoot this, and in the end the problem turned out to be simply that the gateway could not be pinged...
