L2 network head scratcher, losing pings to Management VLAN

crustachio

@Kelly
Thanks. Traceroute functions as expected; when testing from the 5406 core it simply times out, no hops (which there shouldn't be since it's directly connected). It never attempts to route via another path.

Dashrender

@crustachio said in L2 network head scratcher, losing pings to Management VLAN:

y VLANs that we're still in the process of retiring. However the management VLAN doesn't live there. The management VLAN is

Is the VLAN ID the same for the old and the new Management VLAN?

thecreaitvone91

@crustachio said in L2 network head scratcher, losing pings to Management VLAN:

The last little piece of the puzzle... Our old "core" switch (Cisco 3750) is still in use to an extent. It's largely used to route legacy VLANs that we're still in the process of retiring. However the management VLAN doesn't live there. The management VLAN is defined/directly connected on the 5406R core switch. The old 3750 is set to route any management VLAN traffic to the "new" 5406R core. That said, the existing remote wireless links are served off the old 3750 core. So I'm wondering if there's some kind of situation that is causing traffic destined for the same remote MAC to be unsure of which direction to go (old core/new core).

Are you sure you don't have asymmetric routing going on? You can make asymmetric routing work if there is a valid reason for it, but things like firewalls and routers have to be set up for it, otherwise the packets will be dropped.

notverypunny

A lot of good things to consider here so far. Keep spanning tree in mind as soon as you're dealing with topology changes and intermittent issues. It can come up and bite you in the a$$ if you've got a static config somewhere or a new vlan that isn't part of the config.

crustachio

@Dashrender said in L2 network head scratcher, losing pings to Management VLAN:

@crustachio said in L2 network head scratcher, losing pings to Management VLAN:

y VLANs that we're still in the process of retiring. However the management VLAN doesn't live there. The management VLAN is

Is the VLAN ID the same for the old and the new Management VLAN?

There was no "old" management VLAN (I know right). Mgmt was done in the default VLAN (1) on the 3750. Hence the creation of a dedicated mgmt VLAN on the 5406 when we started migrating off the 3750.

Dashrender

So you made a new VLAN specifically for management, alright. and what are you pinging and from where on this new management VLAN?

i.e. are you pinging the switch connected to the fiber on the far side? are you pinging from the switch connected to that same fiber on your side? or your PC?

crustachio

I need to clarify something I said erroneously:

"The old 3750 is set to route any management VLAN traffic to the "new" 5406R core. That said, the existing remote wireless links are served off the old 3750 core. So I'm wondering if there's some kind of situation that is causing traffic destined for the same remote MAC to be unsure of which direction to go (old core/new core)."

The sentence "That said, the existing remote wireless links are served off the old 3750 core" is actually untrue. I don't know why I was thinking the wireless link was still served off the 3750. We moved it to the 5406 a while back and it has been working fine.

So to clarify, the "working" link (wireless bridge) actually terminates in an L2 access switch on the roof, which trunks back to the 5406 core. The "new" fiber link terminates directly on the 5406. The mgmt VLAN only lives on the 5406. There should be no way any traffic is trying to go out to the 3750. Traceroute confirms this -- when the fiber link is working (intermittently), traceroute shows a hop from my PC to the 5406, then to the remote switch. When the fiber link is down, traceroute hops to the 5406 then dies.

crustachio

@Dashrender said in L2 network head scratcher, losing pings to Management VLAN:

So you made a new VLAN specifically for management, alright. and what are you pinging and from where on this new management VLAN?

i.e. are you pinging the switch connected to the fiber on the far side? are you pinging from the switch connected to that same fiber on your side? or your PC?

Pinging fails FROM any host in the mgmt VLAN on the local side TO any host in the mgmt VLAN on the far side. That includes the remote switch, a UPS, and a WAP.

On the local side, I've tried pinging from my PC (which is not ACL restricted from talking to the mgmt VLAN or anything), the core switch itself, and other switches in the mgmt VLAN. And of course our NMS.

I need to go back onsite and console into the remote switch to see if pings work the other way.

Dashrender

@crustachio said in L2 network head scratcher, losing pings to Management VLAN:

I need to go back onsite and console into the remote switch to see if pings work the other way.

If you have a PC at that remote site - since you said normal data VLANs are working, you could remote into one of them and then access a switch and see if pinging from that side is working.

Dashrender

@crustachio said in L2 network head scratcher, losing pings to Management VLAN:

...we recently installed a new direct buried fiber circuit to each building

this is fiber you own, it doesn't go through a carrier like AT&T/Cox/Comcast/etc?

crustachio

@Dashrender said in L2 network head scratcher, losing pings to Management VLAN:

@crustachio said in L2 network head scratcher, losing pings to Management VLAN:

I need to go back onsite and console into the remote switch to see if pings work the other way.

If you have a PC at that remote site - since you said normal data VLANs are working, you could remote into one of them and then access a switch and see if pinging from that side is working.

Nice suggestion, but the remote PC VLAN is not authorized to SSH into the management VLAN of the switch.

@Dashrender said in L2 network head scratcher, losing pings to Management VLAN:

@crustachio said in L2 network head scratcher, losing pings to Management VLAN:

...we recently installed a new direct buried fiber circuit to each building

this is fiber you own, it doesn't go through a carrier like AT&T/Cox/Comcast/etc?

We own it, it's a simple PTP SMF span.

crustachio

@notverypunny said in L2 network head scratcher, losing pings to Management VLAN:

A lot of good things to consider here so far. Keep spanning tree in mind as soon as you're dealing with topology changes and intermittent issues. It can come up and bite you in the a$$ if you've got a static config somewhere or a new vlan that isn't part of the config.

I think you are on to something. I had discarded STP from being in the mix at first because we're really not doing any complicated STP -- no PVST or anything. I checked right away to confirm the 5406 was the root, and the remote switch is an appropriately low priority, and everything looked normal. But digging into the STP topology change history logs on the switches does in fact show numerous topo change requests happening, and in the last 15 minutes I've correlated intermittent responsiveness on the remote switch to topo change requests coming from a completely different L2 access switch on the LAN.

That switch is generating "CIST starved for a BPDU Rx on port 1 (uplink port)" errors and is therefore self-promoting to root, forcing topo changes across the tree.

If I manually set the STP priority on that switch and let STP reconverge, things go back to normal for a short while and the fiber "problem" switch comes back. Until the "CIST starved for a BPDU Rx" error reoccurs on the other switch, then things go haywire again.

OK, now we're getting somewhere. Not sure why that port is no longer receiving BPDU packets... filtering is not enabled, there's no root-guard in place. I'll keep digging, but now I'm on the trail.
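
For anyone wanting to do the same kind of correlation less manually, here's a minimal Python sketch. It assumes you have already pulled the topology change timestamps out of the switch logs and the failed-ping timestamps out of your NMS or a ping log. The function name, the 30-second window, and the sample timestamps are all made up for illustration, not taken from my actual logs.

```python
from datetime import datetime, timedelta


def failures_near_changes(topo_changes, ping_failures, window_seconds=30):
    """Return (failure, change) pairs where a ping failure happened within
    window_seconds after a topology change.

    Both inputs are lists of datetime objects. The window is an arbitrary
    illustrative default, not a value from any switch documentation.
    """
    window = timedelta(seconds=window_seconds)
    matches = []
    for failure in sorted(ping_failures):
        for change in sorted(topo_changes):
            # Only count failures that closely follow a change in time.
            if timedelta(0) <= failure - change <= window:
                matches.append((failure, change))
                break  # one matching change per failure is enough
    return matches


# Hypothetical sample data, hand-written for the example.
changes = [
    datetime(2020, 1, 1, 10, 0, 0),
    datetime(2020, 1, 1, 10, 12, 0),
]
failures = [
    datetime(2020, 1, 1, 10, 0, 5),    # 5 s after the first change
    datetime(2020, 1, 1, 10, 6, 0),    # not near any change
    datetime(2020, 1, 1, 10, 12, 20),  # 20 s after the second change
]

matched = failures_near_changes(changes, failures)
print(f"{len(matched)} of {len(failures)} failures follow a topology change")
```

If most failures land inside the window, that points at topology changes as the trigger. If they don't, the window may simply be too tight, or the cause may be something else entirely.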

THANKS

crustachio

OK, still not sure why that "other" access switch on the LAN is getting starved for BPDU packets, but as a band-aid I enabled "tcn-guard" on its upstream port, to prevent its topology change notifications from flooding the network and goofing the remote fiber switch. So far, so good.

I wonder if this is some odd interop issue from the fact that our old 3750 is still on the LAN running its default flavor of PVST. Our Aruba is doing MSTP and has been interop'ing fine alongside the 3750 until now. The plot thickens!

If nothing else this will motivate me to finish pulling the plug on that old 3750. Got some work to do yet...

crustachio

Welp, got it figured out, and it had nothing to do with any of my theories.

The "other" access switch that was generating all the BPDU starvation errors was also a remote switch at a completely different site (unrelated to this fiber replacement), connected via PTP Ubiquiti NanoBeam radio. The head-end radio, even though it was set for simple bridge mode, had STP toggled on for some [mistaken] reason. Of course Ubiquiti NanoBeams don't speak HPE MSTP, so it was borking the BPDUs to that remote switch. Since that switch was getting starved for BPDUs, it was self-promoting to root bridge. Of course on the upstream switch I had root-guard enabled to prevent the remote switch from actually becoming root, but the TCNs still propagated out and somehow kept crippling the original problem switch on the new fiber. I'm not sure why it was only causing problems on these remote switches on the new fiber, and no other switches/links, but hey.
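
For anyone following along who hasn't run into this failure mode, the self-promotion works roughly like the sketch below. Spanning tree picks the bridge with the lowest bridge ID as root (priority first, MAC address as the tie-breaker), and a switch only knows about a better root through the BPDUs it receives. When those stop arriving and the stored information ages out, the switch is left comparing itself against nothing and claims root. This is a simplified Python illustration, not how any switch firmware is actually written; the class name, priorities, and MAC addresses are invented.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True, order=True)
class BridgeID:
    # Field order matters: comparison is priority first, then MAC.
    priority: int
    mac: str


def believed_root(own: BridgeID, heard: Optional[BridgeID]) -> BridgeID:
    """Return the bridge this switch believes is root.

    `heard` is the best root advertised in received BPDUs, or None when
    no BPDUs have arrived and the old information has aged out.
    """
    if heard is None:
        return own  # starved for BPDUs: the switch promotes itself
    return min(own, heard)  # lowest bridge ID wins


# Invented values for illustration only.
core = BridgeID(priority=0, mac="00:00:00:00:00:01")
access = BridgeID(priority=32768, mac="00:00:00:00:00:02")

print(believed_root(access, core) == core)    # BPDUs arriving normally
print(believed_root(access, None) == access)  # BPDUs lost on the uplink
```

Root-guard on the upstream port stops the rest of the network from accepting that claim, but as I found out, it doesn't stop the topology change notifications that come along with it.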

Final solution: Disable STP on the Ubiquiti radio. BPDU starvation resolved immediately, and the remote fiber switches' management VLAN connectivity was restored as well. Problem solved.

Thanks very much to all for being a sounding board and the great suggestions. Special thanks to @notverypunny for pointing me in the right direction with STP. Teaches me to step back and look at the patterns.

crustachio

Post Script:

Immediately following my last "solution" update, I drove over to the remote site to button things up. En route I noticed a work crew standing around a concrete bridge over a small canal, which our fiber conduit happens to run alongside. The bridge had just collapsed (nobody injured, thankfully). The conduit is torn apart pretty good but the fiber is still intact. Not sure it will stay that way; I can't see how they'll get the bridge removed without disturbing or removing that conduit entirely. There's also a gas line that runs alongside, which complicates things further.

There's never a good time for something like that, but this was just plain uncanny.

Dashrender

@crustachio said in L2 network head scratcher, losing pings to Management VLAN:

Post Script:

Immediately following my last "solution" update, I drove over to the remote site to button things up. En route I noticed a work crew standing around a concrete bridge over a small canal, which our fiber conduit happens to run alongside. The bridge had just collapsed (nobody injured, thankfully). The conduit is torn apart pretty good but the fiber is still intact. Not sure it will stay that way; I can't see how they'll get the bridge removed without disturbing or removing that conduit entirely. There's also a gas line that runs alongside, which complicates things further.

There's never a good time for something like that, but this was just plain uncanny.

oh man - at least you still have the wifi beam connection option.