Monday, 23 February 2015

Problems with NX-OS configuration synchronization (switch-profiles)

Now that configuration synchronization (through a switch-profile in configure sync mode) has been set up, the relevant parts of the configuration across both upstream Nexus 5ks should be kept in step (in particular, the ports on the FEXs).

However, a number of issues came up during testing.  The feature looks very useful but needs a lot of care and attention: when these problems arise, some fiddling about is often needed to recover things.  So far, though, none of the recovery procedures appears to result in a service outage.

Running configuration getting out of step

Sometimes, the running configuration can get out of step with the local configuration and synchronised switch-profile configuration - everything looks correct when you look at the files, but you get an error when you try to commit changes.

To detect this situation, compare the relevant portions of show running-config ... and show running-config switch-profile and see if they reconcile.  If they don't, you can compare these with what the internal configuration of the switch says with two internal commands:

n5k-top# show run int e1/3
...
interface Ethernet1/3

n5k-top# show system internal csm info global-db cmd-tbl | sec Ethernet1/3
parent_seq_no= 0,  seq_no= 34,  clone_seq_no= 0, cmd= 'interface Ethernet1/3'
    parent_seq_no= 34,  seq_no= 35,  clone_seq_no= 0, cmd= 'description b2'

n5k-top# show system internal csm info switch-profile cfgd-db cmd-tbl | sec Ethernet1/3
parent_seq_no= 0,  seq_no= 146,  clone_seq_no= 0, cmd= 'interface Ethernet1/3'
    parent_seq_no= 146,  seq_no= 150,  clone_seq_no= 0, cmd= 'channel-group 102'
    parent_seq_no= 146,  seq_no= 147,  clone_seq_no= 0, cmd= 'description b2'
    parent_seq_no= 146,  seq_no= 149,  clone_seq_no= 0, cmd= 'fex associate 102'
    parent_seq_no= 146,  seq_no= 148,  clone_seq_no= 0, cmd= 'switchport mode fex-fabric'

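The first shows the running configuration of Ethernet1/3, the second the local configuration database and the third the switch-profile database (here filtered to the block for Ethernet1/3).  Clearly these don't tally up: the running configuration has no subcommands at all, the local database holds only the description, and the switch-profile holds the description plus the FEX fabric commands.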
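The comparison of the two internal databases can also be done mechanically.  The following is a minimal Python sketch, not anything NX-OS provides: it assumes the cmd-tbl output has exactly the row format shown in the transcripts above, and the names parse_cmd_tbl and in_both are made up for illustration.  It only handles one level of nesting (a stanza and its direct subcommands).

```python
import re

# One row of "show system internal csm info ... cmd-tbl" output, assuming
# the format seen in the transcripts (parent_seq_no, seq_no, clone_seq_no, cmd).
ROW = re.compile(
    r"parent_seq_no=\s*(\d+),\s*seq_no=\s*(\d+),\s*clone_seq_no=\s*\d+,\s*cmd=\s*'(.*)'"
)


def parse_cmd_tbl(text):
    """Return {stanza: set of direct subcommands} for one cmd-tbl dump."""
    rows = []
    for line in text.splitlines():
        match = ROW.search(line)
        if match:
            rows.append((int(match.group(1)), int(match.group(2)), match.group(3)))
    # Top-level stanzas have parent_seq_no 0; index them by their own seq_no.
    stanzas = {seq: cmd for parent, seq, cmd in rows if parent == 0}
    table = {cmd: set() for cmd in stanzas.values()}
    for parent, _seq, cmd in rows:
        if parent != 0 and parent in stanzas:
            table[stanzas[parent]].add(cmd)
    return table


def in_both(local_text, profile_text):
    """Return {stanza: sorted subcommands} for stanzas present in both dumps."""
    local = parse_cmd_tbl(local_text)
    profile = parse_cmd_tbl(profile_text)
    return {
        stanza: sorted(local[stanza] & profile[stanza])
        for stanza in sorted(local.keys() & profile.keys())
    }
```

Fed the global-db and switch-profile cfgd-db output for Ethernet1/3 shown above, in_both reports the interface stanza with 'description b2' as the subcommand held in both databases.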

Somehow things have gone awry, but it's easy to fix by resyncing the database which, in this case, caused the local configuration to clear itself (but sometimes it can have other effects):

n5k-top# conf sync
Enter configuration commands, one per line.  End with CNTL/Z.
n5k-top(config-sync)# resync-database
Re-synchronization of switch-profile db takes a few minutes...
Re-synchronize switch-profile db completed successfully.
n5k-top(config-sync)# end

n5k-top# show system internal csm info global-db cmd-tbl | sec Ethernet1/3

Over the next few seconds, things will sync up and sort themselves out again and you'll have your interface configuration back:

n5k-top# show run int e1/3
...
interface Ethernet1/3
  description b2
  switchport mode fex-fabric
  fex associate 102
  channel-group 102

I found an explanation of this here: http://nexp.com.ua/technologies/nx-os/troubleshooting-nx-os-config-sync/ - that's a good place to start before trying the second (more drastic) method.

Stub configuration stanza removal

It can be impossible to remove a stub configuration stanza (the opening line of a block, starting a subcontext in the configuration file - e.g. interface ... or fex ...) from either configuration if it has been entered in both by mistake.

For example, if you add a FEX to the switch profile:

n5k-bottom# conf sync
n5k-bottom(config-sync)# switch-profile wcdc-b
Switch-Profile started, Profile ID is 1
n5k-bottom(config-sync-sp)# fex 103
n5k-bottom(config-sync-sp-fex)# commit
Verification successful...
Proceeding to apply configuration. This might take a while depending on amount of configuration in buffer.
Please avoid other configuration changes during this time.
Commit Successful

If you then mistakenly go to configure the FEX in local (configure terminal) mode and realise your mistake, you cannot remove the fex stanza, even if you entered no subcommands, because removing it would clash (= fail the mutual-exclusion check, in the parlance of configuration synchronization) with the declaration in the switch profile:

n5k-bottom# conf t
n5k-bottom(config)# fex 103
n5k-bottom(config-fex)# no fex 103
Error: Command is not mutually exclusive

You also cannot remove the declaration in the switch profile as that would clash with the one in the local configuration!

n5k-bottom(config)# conf sync
n5k-bottom(config-sync)# switch-profile wcdc-b
Switch-Profile started, Profile ID is 1
n5k-bottom(config-sync-sp)# no fex 103
n5k-bottom(config-sync-sp)# commit
Failed: Verify Failed
n5k-bottom(config-sync-sp)# show switch-profile status
...
Local information:
----------------
Status: Verify Failure
Error(s):
Following commands failed mutual-exclusion checks:
no fex 103

You can see these two commands with this handy internal command:

n5k-bottom# show system internal csm info global-db cmd-tbl | sec "fex 103"
parent_seq_no= 0,  seq_no= 238,  clone_seq_no= 0, cmd= 'fex 103'
n5k-bottom# show system internal csm info switch-profile cfgd-db cmd-tbl | sec "fex 103"
parent_seq_no= 0,  seq_no= 237,  clone_seq_no= 0, cmd= 'fex 103'

The only solution to this problem that I'm aware of is to do the following, step by step, on both switches (except for the first command, which acts on both at the same time):
  1. Use the no switch-profile ... profile-only all command in configure sync mode to remove the profile.  This will break the synchronisation and move all the commands in the profile to the local configuration on both switches.
  2. Remove the unwanted parts in configure terminal mode.
  3. Create a new switch-profile.
  4. Use import running-config to read in the current configuration into the profile.
  5. commit the new profile, which will move the commands from the local configuration into the switch-profile.
  6. Re-establish the synchronisation peer.
This appears to be non-destructive but is a little worrying!

With an example - first break the synchronisation and copy the commands in the switch-profile to the local configuration:

n5k-bottom# conf sync
Enter configuration commands, one per line.  End with CNTL/Z.
n5k-bottom(config-sync)# no switch-profile wcdc-b profile-only all

WARNING: Deleting switch-profile will not remove all commands configured under switch-profile. This will only remove the switch profile. Are you sure you want to delete the switch-profile from the system ?
Are you sure? (y/n)  [n] y
Verification successful...
Proceeding to delete switch-profile. This might take a while depending on amount of configuration under a switch-profile.
Please avoid other configuration changes during this time.
Delete Successful

At this point, the commands you don't want will still be there, but only in the local configuration, where they can be removed directly on both switches, without issue:

n5k-bottom# conf t
Enter configuration commands, one per line.  End with CNTL/Z.
n5k-bottom(config)# no fex 103

Now re-create the switch-profile and import the (now clean) running configuration, then commit it - I've added some show system ... commands to show the commands moving from the local configuration to the switch-profile when the commit is issued:

n5k-bottom# conf sync
Enter configuration commands, one per line.  End with CNTL/Z.
n5k-bottom(config-sync)# switch-profile wcdc-b
Switch-Profile started, Profile ID is 1
n5k-bottom(config-sync-sp)# import running-config
n5k-bottom(config-sync-sp-import)# show system internal csm info global-db cmd-tbl | sec "fex 103"
parent_seq_no= 0,  seq_no= 8,  clone_seq_no= 0, cmd= 'fex 103'
n5k-bottom(config-sync-sp-import)# commit
Verification successful...
Proceeding to apply configuration. This might take a while depending on amount of configuration in buffer.
Please avoid other configuration changes during this time.
Commit Successful
n5k-bottom(config-sync)# show system internal csm info global-db cmd-tbl | sec "fex 103"
n5k-bottom(config-sync)# show system internal csm info switch-profile cfgd-db cmd-tbl | sec "fex 103"
parent_seq_no= 0,  seq_no= 8,  clone_seq_no= 0, cmd= 'fex 103'

Once this has been done on both switches, the synchronisation peering can be re-established and hopefully everything will be OK:

n5k-bottom# conf sync
Enter configuration commands, one per line.  End with CNTL/Z.
n5k-bottom(config-sync)# switch-profile wcdc-b
Switch-Profile started, Profile ID is 1
n5k-bottom(config-sync-sp)# sync-peers dest 192.168.200.2
n5k-bottom(config-sync-sp)# end

Wait a few seconds:

n5k-bottom# show switch-profile status
...
Local information:
----------------
Status: Commit Success
Error(s):

Peer information:
----------------
IP-address: 192.168.200.2
Sync-status: In sync
Status: Commit Success
Error(s):

All rather messy!  What NX-OS needs is some sort of "delete fex ..." command that removes the block from the configuration you're editing without trying to remove it from the running-config itself.
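The "wait a few seconds, then check show switch-profile status" step can be checked mechanically once the output has been captured as text.  This is only a sketch in Python under assumptions: it relies on the "Local information:" / "Peer information:" layout and the "Key: value" lines seen in the transcripts in this post, and the function names parse_status and in_sync are invented here, not part of any Cisco tooling.

```python
def parse_status(text):
    """Split captured 'show switch-profile status' text into local/peer fields."""
    result = {}
    section = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.endswith("information:"):
            # "Local information:" -> "local", "Peer information:" -> "peer"
            section = line.split()[0].lower()
            result[section] = {}
        elif section and ":" in line:
            key, _, value = line.partition(":")
            result[section][key.strip()] = value.strip()
    return result


def in_sync(status):
    """True only if both sides report Commit Success and the peer is In sync."""
    local = status.get("local", {})
    peer = status.get("peer", {})
    return (
        local.get("Status") == "Commit Success"
        and peer.get("Status") == "Commit Success"
        and peer.get("Sync-status") == "In sync"
    )
```

The strings "Commit Success" and "In sync" are taken from the output above; other NX-OS releases may word them differently, so treat the comparison values as assumptions to adjust.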

Config-revision getting out of step

I'm not sure exactly how I got into this state, but I was testing FEX preprovisioning: seeing what happened when the configured model didn't match the connected one, then fixing it.  Things went wrong when I tried to change the model - the command verified OK but failed on the peer when I tried committing, claiming there was a mismatch.

After trying to undo this, I got in a weird situation where the peer had lost the configuration of some port channel interfaces in the switch profile.  On closer inspection, in show switch-profile the peer had a Config-revision listed one less than the switch I committed the change on.

I couldn't find a way to force the two switches to synchronise (making a change and trying to commit it didn't help).  Breaking the peering (by removing the sync-peers destination ... entry on one switch and re-adding it) didn't fix it either, with the switch grumbling that the commands mismatched:

n5k-bottom# show switch-profile status

switch-profile  : wcdc-b
----------------------------------------------------------

Start-time: 221438 usecs after Mon Feb 12 02:59:18 2001
End-time: 885937 usecs after Mon Feb 12 02:59:19 2001

Profile-Revision: 11
Session-type: Import-Verify
Session-subtype: -
Peer-triggered: No
Profile-status: Verify Failed

Local information:
----------------
Status: Verify Success
Error(s):

Peer information:
----------------
IP-address: 192.168.200.2
Sync-status: Not yet merged
Merge Flags: pending_merge:1 rcv_merge:1 pending_validate:0
Status: Verify Failure
Error(s):
Validation Failed: Config validation failed as found changes on both sides. rcvd_rev: 0, expected_rev: 0
interface Ethernet1/3
        switchport mode fex-fabric
interface Ethernet1/3
        fex associate 102
interface Ethernet1/3
        channel-group 102
interface Ethernet1/4
        switchport mode fex-fabric
interface Ethernet1/4
        fex associate 102
interface Ethernet1/4
        channel-group 102

Indeed, those commands listed were the ones missing on the peer but present on the one I was making the change on.

Fixing this was fairly straightforward but tedious - the missing commands just needed adding back into the switch profile and committing:

n5k-top# conf sync
n5k-top(config-sync)# switch-profile wcdc-b
Switch-Profile started, Profile ID is 1
n5k-top(config-sync-sp)# int e1/3
n5k-top(config-sync-sp-if)# switchport mode fex-fabric
n5k-top(config-sync-sp-if)# fex associate 102
n5k-top(config-sync-sp-if)# channel-group 102
n5k-top(config-sync-sp-if)# int e1/4
n5k-top(config-sync-sp-if)# switchport mode fex-fabric
n5k-top(config-sync-sp-if)# fex associate 102
n5k-top(config-sync-sp-if)# channel-group 102
n5k-top(config-sync-sp-if)# verify
Verification Successful
n5k-top(config-sync-sp)# commit
Verification successful...
Proceeding to apply configuration. This might take a while depending on amount of configuration in buffer.
Please avoid other configuration changes during this time.
Commit Successful

A few seconds after this was done, the two switches automatically recovered and synced up:

n5k-top# show switch-profile status
...
Peer information:
----------------
IP-address: 192.168.200.1
Sync-status: In sync
Status: Commit Success
Error(s):

Invented/implicit and non-canonical commands

You can get in a muddle if you rely on NX-OS to insert implicit commands, or use the non-canonical forms of some commands.  What do I mean by this?  Well - consider the following configuration for a vPC link port channel interface:

rk-wcdc-b-b(config)# interface Port-Channel12
rk-wcdc-b-b(config-if)# switchport mode trunk
rk-wcdc-b-b(config-if)# switchport trunk allowed vlan remove 1
rk-wcdc-b-b(config-if)# vpc peer-link

Entering the last of these commands will display the following warning:

Please note that spanning tree port type is changed to "network" port type on vPC peer-link.
This will enable spanning tree Bridge Assurance on vPC peer-link provided the STP Bridge Assurance
(which is enabled by default) is not disabled.

When we look at the configuration, we can see two changes from what we entered (ignoring the speed command, which seems not to matter):

rk-wcdc-b-b# show run int po12
...
interface port-channel12
  description vpc-peer
  switchport mode trunk
  switchport trunk allowed vlan 2-4094
  spanning-tree port type network
  speed 40000
  vpc peer-link

The allowed vlan list just shows what's included (not what we removed), thus storing a different command from what we entered.  Also, as per the warning above, the spanning-tree port type network command has been automatically inserted.
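To see why "allowed vlan remove 1" turns into "allowed vlan 2-4094", it helps to think of the switch as storing the resulting set of VLANs and printing it as ranges.  The Python sketch below illustrates that idea only; it is not how NX-OS is implemented.  It assumes the trunk starts out allowing VLANs 1-4094, which is consistent with the output above but is an assumption here, and the function names are invented.

```python
def _fmt(start, end):
    """Format one run of consecutive VLANs as '5' or '5-9'."""
    return str(start) if start == end else f"{start}-{end}"


def to_ranges(vlans):
    """Collapse a collection of VLAN IDs into a compact range string."""
    parts = []
    run_start = prev = None
    for vlan in sorted(set(vlans)):
        if run_start is None:
            run_start = prev = vlan
        elif vlan == prev + 1:
            prev = vlan
        else:
            parts.append(_fmt(run_start, prev))
            run_start = prev = vlan
    if run_start is not None:
        parts.append(_fmt(run_start, prev))
    return ",".join(parts)


def allowed_after_remove(removed, first=1, last=4094):
    """Range string left after removing VLANs from an assumed first-last default."""
    gone = set(removed)
    return to_ranges(v for v in range(first, last + 1) if v not in gone)


print(allowed_after_remove([1]))  # prints 2-4094
```

The stored form is what remains, not the edit you typed, which is why the imported command differs from the one entered in configure terminal mode.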

This is all good stuff but, when you try to create a switch-profile and import the running-config and verify it, it fails on those two lines:

rk-wcdc-b-b# conf sync
rk-wcdc-b-b(config-sync)# switch-profile wcdc-b
Switch-Profile started, Profile ID is 1
rk-wcdc-b-b(config-sync-sp)# import running-config
rk-wcdc-b-b(config-sync-sp-import)# verify
Failed: Verify Failed
rk-wcdc-b-b(config-sync-sp-import)# show switch-profile status
...
Local information:
----------------
Status: Verify Failure
Error(s):
Following commands are not configured under config-terminal mode and hence cannot be imported:
interface port-channel12
        switchport trunk allowed vlan 2-4094
interface port-channel12
        spanning-tree port type network

This is because these commands don't exactly match what was entered: one was "invented" (inserted automatically, as with spanning-tree port type network, the last line) and the other was stored in a different form from the one typed (as with the allowed vlan list, the first line).

In this case, the issue can be fixed with the conf sync / resync-database command, but I've seen situations - usually after the switch-profile has been created and the pair of switches synced up - where things get messy and it's necessary to break the synchronisation to fix it.

It is best to avoid relying on commands being implicitly inserted: enter them explicitly, and use the canonical form that show running-config displays.

2 comments:

  1. Just two words:
    Awesome post!

  2. "Somehow things have gone awry, but it's easy to fix by resyncing the database which, in this case, caused the local configuration to clear itself (but sometimes it can have other effects)"
    Could you explain, or give a hint about, what the other effects are?
