Monday, 23 February 2015

Problems with NX-OS configuration synchronization (switch-profiles)

Now that configuration synchronization (through a switch-profile in configure sync mode) has been set up, the relevant parts of the configuration across both upstream Nexus 5ks should be kept in step (in particular, the ports on the FEXs).

However, during testing, a number of issues were encountered, making this feature look very useful but need a lot of care and attention, as if they arise it is often necessary to do some fiddling about to recover things.  However, so far none of these processes would appear to result in a service outage.

Running configuration getting out step

Sometimes, the running configuration can get out of step with the local configuration and synchronised switch-profile configuration - everything looks correct when you look at the files, but you get an error when you try to commit changes.

To detect this situation, compare the relevant portions of show running-config ... and show running-config switch-profile and see if they reconcile.  If they don't, you can compare these with what the internal configuration of the switch says with two internal commands:

n5k-top# show run int e1/3
...
interface Ethernet1/3

n5k-top# show system internal csm info global-db cmd-tbl | sec Ethernet1/3
parent_seq_no= 0,  seq_no= 34,  clone_seq_no= 0, cmd= 'interface Ethernet1/3'
    parent_seq_no= 34,  seq_no= 35,  clone_seq_no= 0, cmd= 'description b2'

n5k-top# show system internal csm info switch-profile cfgd-db cmd-tbl | sec Ethernet1/3
parent_seq_no= 0,  seq_no= 146,  clone_seq_no= 0, cmd= 'interface Ethernet1/3'
    parent_seq_no= 146,  seq_no= 150,  clone_seq_no= 0, cmd= 'channel-group 102'
    parent_seq_no= 146,  seq_no= 147,  clone_seq_no= 0, cmd= 'description b2'
    parent_seq_no= 146,  seq_no= 149,  clone_seq_no= 0, cmd= 'fex associate 102'
    parent_seq_no= 146,  seq_no= 148,  clone_seq_no= 0, cmd= 'switchport mode fex-fabric'

The first shows what is in the running configuration of Ethernet1/3, the second shows the local configuration and the third the one from the switch-profile (here for a block pertaining to Ethernet1/3).  Clearly these don't tally up.

Somehow things have gone awry, but it's easy to fix by resyncing the database which, in this case, caused the local configuration to clear itself (but sometimes it can have other effects):

n5k-top# conf sync
Enter configuration commands, one per line.  End with CNTL/Z.
n5k-top(config-sync)# resync-database
Re-synchronization of switch-profile db takes a few minutes...
Re-synchronize switch-profile db completed successfully.
n5k-top(config-sync)# end

n5k-top# show system internal csm info global-db cmd-tbl | sec Ethernet1/3

Over the next few seconds, things will sync up and sort themselves out again and you'll have your interface configuration back:

n5k-top# show run int e1/3
...
interface Ethernet1/3
  description b2
  switchport mode fex-fabric
  fex associate 102
  channel-group 102

I found an explanation of this here: http://nexp.com.ua/technologies/nx-os/troubleshooting-nx-os-config-sync/ - that's a good place to start before trying the second (more drastic) method.

Stub configuration stanza removal

If it is impossible to remove stub configuration stanzas (the opening line of a block, starting a subcontext in the configuration file - e.g. interface ... or fex ...), from either configuration, if it has been entered by mistake.

For example, if you add a FEX to the switch profile:

n5k-bottom# conf sync
n5k-bottom(config-sync)# switch-profile wcdc-b
Switch-Profile started, Profile ID is 1
n5k-bottom(config-sync-sp)# fex 103
n5k-bottom(config-sync-sp-fex)# commit
Verification successful...
Proceeding to apply configuration. This might take a while depending on amount of configuration in buffer.
Please avoid other configuration changes during this time.
Commit Successful

Then you mistakenly go to configure the FEX in local (configure terminal) mode but realise your mistake, even enter no subcommands, you cannot remove the fex stanza because it would clash (= be mutually exclusive, in the parlance of configuration synchronization) with the declaration in the switch profile:

n5k-bottom# conf t
n5k-bottom(config)# fex 103
n5k-bottom(config-fex)# no fex 103
Error: Command is not mutually exclusive

You also cannot remove the declaration in the switch profile as that would clash with the one in the local configuration!

n5k-bottom(config)# conf sync
n5k-bottom(config-sync)# switch-profile wcdc-b
Switch-Profile started, Profile ID is 1
n5k-bottom(config-sync-sp)# no fex 103
n5k-bottom(config-sync-sp)# commit
Failed: Verify Failed
n5k-bottom(config-sync-sp)# show switch-profile status
...
Local information:
----------------
Status: Verify Failure
Error(s):
Following commands failed mutual-exclusion checks:
no fex 103

You can see these two commands with this handy internal command:

n5k-bottom# show system internal csm info global-db cmd-tbl | sec "fex 103"
parent_seq_no= 0,  seq_no= 238,  clone_seq_no= 0, cmd= 'fex 103'
n5k-bottom# show system internal csm info switch-profile cfgd-db cmd-tbl | sec "fex 103"
parent_seq_no= 0,  seq_no= 237,  clone_seq_no= 0, cmd= 'fex 103'

The only solution to this problem that I'm aware of is to, step-by-step on both switches (except for the first command, which will do both at the same time):
  1. Use the no switch-profile ... profile-only all in configure sync mode to remove the profile.  This will break the synchronisation and move all the commands in the profile to the local configuration on both switches.
  2. Remove the unwanted parts in configure terminal mode.
  3. Create a new switch-profile.
  4. Use import running-config to read in the current configuration into the profile.
  5. commit the new profile, which will move the commands from the local configuration into the switch-profile.
  6. Re-establish the synchronisation peer.
This appears to be non-destructive but is a little worrying!

With an example - first break the synchronisation and copy the commands in the switch-profile to the local configuration:

n5k-bottom# conf sync
Enter configuration commands, one per line.  End with CNTL/Z.
n5k-bottom(config-sync)# no switch-profile wcdc-b profile-only all

WARNING: Deleting switch-profile will not remove all commands configured under switch-profile. This will only remove the switch profile. Are you sure you want to delete the switch-profile from the system ?
Are you sure? (y/n)  [n] y
Verification successful...
Proceeding to delete switch-profile. This might take a while depending on amount of configuration under a switch-profile.
Please avoid other configuration changes during this time.
Delete Successful

At this point, the commands you don't want will still be there, but only in the local configuration, where they can be removed directly on both switches, without issue:

n5k-bottom# conf t
Enter configuration commands, one per line.  End with CNTL/Z.
n5k-bottom(config)# no fex 103

Now re-create the switch-profile and import the (now clean) running configuration, then commit it - I've added some show system ... commands to show the commands moving from the local configuration to the switch-profile when the commit is issued:

n5k-bottom# conf sync
Enter configuration commands, one per line.  End with CNTL/Z.
n5k-bottom(config-sync)# switch-profile wcdc-b
Switch-Profile started, Profile ID is 1
n5k-bottom(config-sync-sp)# import running-config
n5k-bottom(config-sync-sp-import)# show system internal csm info global-db cmd-tbl | sec "fex 103"
parent_seq_no= 0,  seq_no= 8,  clone_seq_no= 0, cmd= 'fex 103'
n5k-bottom(config-sync-sp-import)# commit
Verification successful...
Proceeding to apply configuration. This might take a while depending on amount of configuration in buffer.
Please avoid other configuration changes during this time.
Commit Successful
n5k-bottom(config-sync)# show system internal csm info global-db cmd-tbl | sec "fex 103"
n5k-bottom(config-sync)# show system internal csm info switch-profile cfgd-db cmd-tbl | sec "fex 103"
parent_seq_no= 0,  seq_no= 8,  clone_seq_no= 0, cmd= 'fex 103'

Once this has been done on both switches, the synchronisation peering can be re-established and hopefully everything will be OK:

n5k-bottom# conf sync
Enter configuration commands, one per line.  End with CNTL/Z.
n5k-bottom(config-sync)# switch-profile wcdc-b
Switch-Profile started, Profile ID is 1
n5k-bottom(config-sync-sp)# sync-peers dest 192.168.200.2
n5k-bottom(config-sync-sp)# end

Wait a few seconds:

n5k-bottom# show switch-profile status
...
Local information:
----------------
Status: Commit Success
Error(s):

Peer information:
----------------
IP-address: 192.168.200.2
Sync-status: In sync
Status: Commit Success
Error(s):

All rather messy!  What NX-OS needs is some sort of "delete fex ..." command to remove the block from the configuration you're editing, but not try and remove the block from the running-config itself.

Config-revision getting out of step

I'm not sure how I got in this state, but I was testing FEX preprovisioning and what happened when the configured model mismatches the connected one, then fixing this.  This messed things up when I tried to change the model - the command verified OK but failed on the peer when I tried committing, claiming there was a mismatch.

After trying to undo this, I got in a weird situation where the peer had lost the configuration of some port channel interfaces in the switch profile.  On closer inspection, in show switch-profile the peer had a Config-revision listed one less than the switch I committed the change on.

I couldn't find a command to force the two switches to synchronise (including making a change and trying to commit it).  Breaking the peering (by removing the sync-peers destination ... entry and re-adding it on one switch) didn't fix it, with the switch grumbling the commands mismatched:

n5k-bottom# show switch-profile status

switch-profile  : wcdc-b
----------------------------------------------------------

Start-time: 221438 usecs after Mon Feb 12 02:59:18 2001
End-time: 885937 usecs after Mon Feb 12 02:59:19 2001

Profile-Revision: 11
Session-type: Import-Verify
Session-subtype: -
Peer-triggered: No
Profile-status: Verify Failed

Local information:
----------------
Status: Verify Success
Error(s):

Peer information:
----------------
IP-address: 192.168.200.2
Sync-status: Not yet merged
Merge Flags: pending_merge:1 rcv_merge:1 pending_validate:0
Status: Verify Failure
Error(s):
Validation Failed: Config validation failed as found changes on both sides. rcvd
_rev: 0, expected_rev: 0
interface Ethernet1/3
        switchport mode fex-fabric
interface Ethernet1/3
        fex associate 102
interface Ethernet1/3
        channel-group 102
interface Ethernet1/4
        switchport mode fex-fabric
interface Ethernet1/4
        fex associate 102
interface Ethernet1/4
        channel-group 102

Indeed, those commands listed were the ones missing on the peer but present on the one I was making the change on.

Fixing this was fairly straightforward but tedious - the missing commands just needed adding back into the switch profile and committing:

n5k-top# conf sync
n5k-top(config-sync)# switch-profile wcdc-b
Switch-Profile started, Profile ID is 1
n5k-top(config-sync-sp)# int e1/3
n5k-top(config-sync-sp-if)# switchport mode fex-fabric
n5k-top(config-sync-sp-if)# fex associate 102
n5k-top(config-sync-sp-if)# channel-group 102
n5k-top(config-sync-sp-if)# int e1/4
n5k-top(config-sync-sp-if)# switchport mode fex-fabric
n5k-top(config-sync-sp-if)# fex associate 102
n5k-top(config-sync-sp-if)# channel-group 102
n5k-top(config-sync-sp-if)# verify
Verification Successful
n5k-top(config-sync-sp)# commit
Verification successful...
Proceeding to apply configuration. This might take a while depending on amount of configuration in buffer.
Please avoid other configuration changes during this time.
Commit Successful

A few seconds after this was done, the two switches automatically recovered and synced up:

n5k-top# show switch-profile status
...
Peer information:
----------------
IP-address: 192.168.200.1
Sync-status: In sync
Status: Commit Success
Error(s):

Invented/implicit and non-canonical commands

You can get in a muddle if you rely on NX-OS to insert implicit commands, or use the non-canonical forms of some commands.  What do I mean by this?  Well - consider the following configuration for a vPC link port channel interface:

rk-wcdc-b-b(config)# interface Port-Channel1
rk-wcdc-b-b(config-if)# switchport mode trunk
rk-wcdc-b-b(config-if)# switchport trunk allowed vlan remove 1
rk-wcdc-b-b(config-if)# vpc peer-link

Entering the last of these commands will display the following warning:

Please note that spanning tree port type is changed to "network" port type on vPC peer-link.
This will enable spanning tree Bridge Assurance on vPC peer-link provided the STP Bridge Assurance

(which is enabled by default) is not disabled.

When we look at the configuration, we can see two changes from what we entered (ignoring the speed command, which seems not to matter):

rk-wcdc-b-b# show run int po12
...
interface port-channel12
  description vpc-peer
  switchport mode trunk
  switchport trunk allowed vlan 2-4094
  spanning-tree port type network
  speed 40000
  vpc peer-link

The allowed vlan list just shows what's included (not what we removed), thus storing a different command from what we entered.  Also, as per the warning above, the spanning-tree port type network command has been automatically inserted.

This is all good stuff but, when you try to create a switch-profile and import the running-config and verify it, it fails on those two lines:

rk-wcdc-b-b# conf sync
rk-wcdc-b-b(config-sync)# switch-profile wcdc-b
Switch-Profile started, Profile ID is 1
rk-wcdc-b-b(config-sync-sp)# import running-config
rk-wcdc-b-b(config-sync-sp-import)# verify
Failed: Verify Failed
rk-wcdc-b-b(config-sync-sp-import)# show switch-profile status
...
Local information:
----------------
Status: Verify Failure
Error(s):
Following commands are not configured under config-terminal mode and hence cannot be imported:
interface port-channel12
        switchport trunk allowed vlan 2-4094
interface port-channel12
        spanning-tree port type network

This is because these commands didn't match what was exactly entered, but something that was either "invented" (inserted automatically, as per the last line) or different (as per the first line).

In this case, the issue can be fixed with the conf sync / resync-database command, but I've seen situations - usually after the switch-profile has been created and the pair of switches synced up - where things get messy and it's necessary to break the synchronisation to fix it.

It is best to try and avoid commands being implicitly inserted and enter them explicitly.

NX-OS configuration synchronisation (switch-profiles)

When using vPC with dual-attached FEXs, a large chunk of the configuration across the two parent Nexus 5k (or 7k) switches must be kept in step.  For example:
  • VLANs must be created and deleted on both switches.
  • Fabric interfaces must have the same configuration with regards channel group assignments, FEX associations, etc.
  • Edge (host) ports must have the same switchport configuration - mode and access/trunk/native VLANs.
This can be done manually but is a bit tedious and error-prone to maintain.  To help, NX-OS has a mechanism whereby parts of the configuration can be synchronised across the switches.

However, this facility is little confusing and can be problematic to set up - it's one of those things where the Cisco documentation makes sense after you understand the basics!

Key concepts

  • Each switch maintains a local configuration which contains things which are NOT synchronised across the switches.  This is configured in the normal way, using configure terminal and has all the usual commands in it and will obviously hold things which differ between the switches.
  • In addition, there is a separate configuration which is synchronised between switches and holds the common elements.  This is configured in a special mode, entered using configure sync in a block called a switch-profile.
  • The running configuration is the result of merging the two configurations.  When using show running-config (including its options to look at specific parts of the configuration), the merged result is shown.
  • Changes to the switch profile are made on one of the switches, verifyed and commited as a single transaction (rather like a database) on both switches automatically; if they fail, things should rollback to how they were before they started.
  • Commands can be imported from the running configuration into the switch profile: this will remove them from the running configuration.
  • Synchronisation can only work across the out-of-band management VRF (which, on a Nexus 5k, limits you to the copper Management0 port on the rear of the unit); it cannot be done in-band.

Setting up synchronisation

There are three main steps to this:
  1. Set up management interface communication and enable Cisco Fabric Services (CFS) over it
  2. Set up a switch profile and import the running configuration into it
  3. Establish synchronisation between the peers
You have to take care to do these in the correct order otherwise things can get in a muddle.  Cisco describe this rather briefly on this page.

Set up management interface communication

Once the profile The first thing to do is allow communication between switches using the management interface and enable Cisco Fabric Services (CFS) over IPv4.  CFS is used to communicate the synchronisation and only operates over the management VRF:

n5k-bottom# conf t
n5k-bottom(config)# int mgmt0
n5k-bottom(config-if)# ip addr 192.168.200.1/24
n5k-bottom(config-if)# exit
n5k-top(config)# cfs ipv4 distribute
n5k-top(config)# end

Set up the switch profile and import the running configuration

The switch profile is created in the special configure sync mode.

The profile must be given a name - this is used to identify the configuration which must be synchronised and must match across the synchronisation peers - I recommend using something to identify the area the switches and their FEXs will serve (e.g. the room and rack row).

Once this is done, the running configuration can be imported, verified and committed (the verification stage can be omitted, if required, but it's a good idea to check things first).  This will move the synchronisable elements of the configuration from the running configuration into the switch profile.

n5k-bottom# conf sync
Enter configuration commands, one per line.  End with CNTL/Z.
n5k-bottom(config-sync)# switch-profile wcdc-b
Switch-Profile started, Profile ID is 1


n5k-bottom(config-sync-sp)# import running-config

n5k-bottom(config-sync-sp-import)# verify

Verification Successful
n5k-bottom(config-sync-sp-import)# commit
Verification successful...
Proceeding to apply configuration. This might take a while depending on amount of configuration in buffer.
Please avoid other configuration changes during this time.
Commit Successful

All this must be done on BOTH switches independently.

Establish synchronisation between the peers

The synchronisation can now be enabled between the peers in the switch profile:

n5k-bottom# conf sync
Enter configuration commands, one per line.  End with CNTL/Z.
n5k-bottom(config-sync)# switch-profile wcdc-b
Switch-Profile started, Profile ID is 1

n5k-bottom(config-sync-sp)# sync-peers destination 192.168.200.2

Once this is entered on both switches, they should find each other, exchange information and sync up, comparing the imported running configuration with each other and ensuring they agree (or, if they differ, they can be merged without conflict).  This will typically take 10-20 seconds and can be checked with show switch-profile status; the Peer information / Status will change from Peer not reachable to Verify Success to Commit Success and then finally the Sync-status will say In Sync, when this is complete:

n5k-bottom# show switch-profile status

switch-profile  : wcdc-b
----------------------------------------------------------

Start-time:   6070 usecs after Sun Feb 11 23:42:16 2001
End-time: 571814 usecs after Sun Feb 11 23:42:17 2001

Profile-Revision: 1
Session-type: Initial-Exchange
Session-subtype: Init-Exchange-All
Peer-triggered: Yes
Profile-status: Sync Success

Local information:
----------------
Status: Commit Success
Error(s):

Peer information:
----------------
IP-address: 192.168.200.1
Sync-status: In sync
Status: Commit Success
Error(s):

Once this has completed, and the two switches are in sync, it's ready for use.

Using synchronisation

From this point onwards, any changes to made across both switches should be made in configure sync mode, in the switch-profile then comited to take effect, rather than in configure terminal mode.  For example, to set up a new FEX, you might do:

n5k-bottom# conf sync
Enter configuration commands, one per line.  End with CNTL/Z.
n5k-bottom(config-sync)# switch-profile wcdc-b
Switch-Profile started, Profile ID is 1
n5k-bottom(config-sync-sp)# int e1/3-4
n5k-bottom(config-sync-sp-if-range)# desc b2
n5k-bottom(config-sync-sp-if-range)# channel-group 102
n5k-bottom(config-sync-sp-if-range)# exit
n5k-bottom(config-sync-sp)# int po102
n5k-bottom(config-sync-sp-if)# desc b2
n5k-bottom(config-sync-sp-if)# switchport mode fex-fabric
n5k-bottom(config-sync-sp-if)# fex associate 102
n5k-bottom(config-sync-sp-if)# vpc 102
n5k-bottom(config-sync-sp-if)# verify
Verification Successful
n5k-bottom(config-sync-sp)# commit
Verification successful...
Proceeding to apply configuration. This might take a while depending on amount of configuration in buffer.
Please avoid other configuration changes during this time.
Commit Successful

If something is accidentally configured in configure terminal mode, sometimes it can just be removed (with some no ... commands), but it may take some work if mutual exclusion errors are encountered - I'll cover this on a future entry.