oxidecomputer/maghemite
routes unexpectedly still present in swadm when going from 6k to 2k /24 BGP routes #475

Open
elaine-oxide opened this issue Apr 19, 2025 · 1 comment

Comments

@elaine-oxide
elaine-oxide commented Apr 19, 2025

On madrid, which was running omicron commit a7ab9d82c601da5631528eec58c54c5975a1a8e7 with the asilomar pop configuration, I used a tool I created that does the following (a sketch of the frr-side steps appears after this list):

  • on asilomar-edge, makes changes to frr:
    • uploads an frr configuration file with a temporary filename containing a pre-configured set of /24 BGP routes (could be 2k, 4k, or 6k routes, see below)
    • using mv, renames the current /etc/frr/frr.conf to have the current unix timestamp as a suffix
    • using mv, renames the aforementioned frr configuration file with the temporary filename to /etc/frr/frr.conf
    • reloads frr, so that it reads the new /etc/frr/frr.conf
  • on madrid switch0 and switch1, performs checks:
    • checks the output of `mgadm bgp status selected 47` for the expected number of routes
    • checks the output of `swadm route ls` for the expected number of routes
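
For reference, here is a minimal sketch of the frr-side swap the tool performs, assuming root ssh access to asilomar-edge and that frr is reloaded via systemd; the actual tool's upload and reload mechanics may differ:

#!/bin/bash
# Sketch: upload a pre-built frr config, swap it in with mv, reload frr.
set -euo pipefail
conf="$1"                          # e.g. asilomar-edge-frr.conf.2k
tmp="/etc/frr/frr.conf.upload.$$"  # temporary filename for the upload
scp "$conf" "root@asilomar-edge:$tmp"
ssh root@asilomar-edge "mv /etc/frr/frr.conf /etc/frr/frr.conf.\$(date +%s%3N) \
  && mv '$tmp' /etc/frr/frr.conf \
  && systemctl reload frr"
# Then, in the switch zones on madrid switch0 and switch1, one way to count:
#   mgadm bgp status selected 47 | wc -l
#   swadm route ls | wc -l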

Here are the {2k, 4k, 6k} frr configuration files that were uploaded:

$ ls -la asilomar-edge-frr.conf.??
-rw-r-----   1 elaine   staff      49847 Apr 18 02:37 asilomar-edge-frr.conf.2k
-rw-r-----   1 elaine   staff     100969 Apr 18 02:37 asilomar-edge-frr.conf.4k
-rw-r-----   1 elaine   staff     152091 Apr 18 02:37 asilomar-edge-frr.conf.6k

I ran my tool 7 times, which caused frr on asilomar-edge to go through the following changes, as can be seen from the timestamps:

root@asilomar-edge:/etc/frr# ls -lart frr.conf*
-rw-r----- 1 frr  frr   49816 Oct 17  2024 frr.conf.2k.orig
-rw-r----- 1 root root 100969 Apr  9 13:54 frr.conf.4k
-rw-r----- 1 root root 152091 Apr  9 14:28 frr.conf.6k
-rw-r----- 1 root root 152091 Apr  9 14:29 frr.conf.1744942520626 // Initial state: Original configuration with 6k routes before running my tool
-rw-r----- 1 root root  49816 Apr 17 19:29 frr.conf.1744943950017 // Run 1: Went from above 6k initial state to old 2k variation as seen above in frr.conf.2k.orig (routes not advertised by frr because it is missing `no bgp network import-check`)
-rw-r----- 1 root root  49847 Apr 17 19:39 frr.conf.1744944114834 // Run 2: Went from above old 2k variation to the intended newer 2k variation with `no bgp network import-check`
-rw-r----- 1 root root  49847 Apr 17 19:41 frr.conf.1744944228452 // Run 3: Reloaded newer 2k variation
-rw-r----- 1 root root  49847 Apr 17 19:43 frr.conf.1744944274771 // Run 4: Reloaded newer 2k variation
-rw-r----- 1 root root 100969 Apr 17 19:44 frr.conf.1744944435795 // Run 5: Went from above newer 2k variation to 4k routes
-rw-r----- 1 root root 152091 Apr 17 19:47 frr.conf.1744944573114 // Run 6: Went from above 4k routes to 6k routes
-rw-r----- 1 root root  49847 Apr 17 19:49 frr.conf // Run 7: Went from above 6k routes to newer 2k variation
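
For reference, the `no bgp network import-check` directive mentioned above disables frr's requirement that a network statement's prefix already be present in the local RIB before it is advertised; without it, the static /24 network statements are never announced. It lives under the router bgp stanza; a minimal excerpt (the ASN and prefix here are illustrative, not taken from the real config):

router bgp 47
 ! advertise network statements even when the prefix is not in the local RIB
 no bgp network import-check
 address-family ipv4 unicast
  network 2.0.1.0/24
 exit-address-family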

On each of Runs 1-7, I used my automated checks (described above) to verify whether mgadm and swadm reported the expected number of routes.

  • For Run 1, since the configuration file was missing `no bgp network import-check` (see the excerpt above), frr did not advertise the 2k routes, and mgadm and swadm reported just the single route 0.0.0.0/0. Note that 0.0.0.0/0 is not included in my counts of {2k, 4k, 6k} in the discussion here.
  • For Runs 2-6, mgadm and swadm reported the expected number of routes after waiting a sufficient length of time (see: mg-lower: consider bulk bestpath synchronization to forwarding plane #348).
  • For Run 7, upon going from 6k routes to 2k routes, swadm on switch1 unexpectedly reported 11 routes from the 6k configuration that are not in the 2k configuration (see the output below), even though mgadm on switch1 reported the expected number of routes. Both mgadm and swadm on switch0 reported the expected number of routes. The results were the same the next day, so a sufficient length of time had definitely passed.
root@oxz_switch1:~# swadm route ls
Subnet                   Port    Link  Gateway                   Vlan
0.0.0.0/0                qsfp15  0     198.51.103.5              200
...
2.7.208.0/24             qsfp15  0     198.51.103.5              
22.7.93.0/24             qsfp15  0     198.51.103.5              
22.7.94.0/24             qsfp15  0     198.51.103.5              
22.7.95.0/24             qsfp15  0     198.51.103.5              
22.7.96.0/24             qsfp15  0     198.51.103.5              
22.7.97.0/24             qsfp15  0     198.51.103.5              
22.7.98.0/24             qsfp15  0     198.51.103.5              
22.7.99.0/24             qsfp15  0     198.51.103.5              
22.7.100.0/24            qsfp15  0     198.51.103.5              
22.7.101.0/24            qsfp15  0     198.51.103.5              
22.7.102.0/24            qsfp15  0     198.51.103.5              
22.7.103.0/24            qsfp15  0     198.51.103.5 
...

Notes:

  • IP addresses with the pattern 2.N.N.N are in the {2k, 4k, 6k} configurations.
  • IP addresses with the pattern 12.N.N.N are in the {4k, 6k} configurations.
  • IP addresses with the pattern 22.N.N.N are in the {6k} configuration only.

So, in the above output of `swadm route ls`, since the system is in the 2k configuration, it is unexpected to see the 11 routes with the pattern 22.N.N.N.
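
Given those patterns, one quick way to count the leftover 6k-only routes in the swadm output (a sketch; in a clean 2k state this should print 0, whereas here it would print 11):

swadm route ls | awk '$1 ~ /^22\./ { n++ } END { print n+0 }'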

I copied the switch1 mgd and dendrite logs encompassing the time range of interest (at least including the transition from 6k to 2k routes) to /staff/elaine/2025-04-17-asilomar-switch1-unexpected-leftover-routes.

Here are the original timestamps of those logs in UTC. Note that asilomar-edge is in PDT and UTC is 7 hours ahead of PDT; keep this in mind when comparing the UTC timestamps in these logs to the PDT timestamps of the frr.conf versions.

BRM42220007 # ls -l /pool/ext/2c668386-3940-4d45-aea2-32bbd4b16e5a/crypt/debug/oxz_switch/oxide-mgd:default.log.1744949721
-rw-r--r--   1 root     root     21688820 Apr 18 04:20 /pool/ext/2c668386-3940-4d45-aea2-32bbd4b16e5a/crypt/debug/oxz_switch/oxide-mgd:default.log.1744949721
BRM42220007 # ls -l /pool/ext/2c668386-3940-4d45-aea2-32bbd4b16e5a/crypt/debug/oxz_switch/oxide-dendrite:default.log.1744945217
-rw-r--r--   1 root     root     2362237465 Apr 18 03:02 /pool/ext/2c668386-3940-4d45-aea2-32bbd4b16e5a/crypt/debug/oxz_switch/oxide-dendrite:default.log.1744945217
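
For cross-checking, the epoch suffixes in these filenames (seconds for the log names, milliseconds for the frr.conf copies) can be converted with GNU date, for example:

$ date -u -d @1744949721 '+%F %T UTC'                      # mgd log suffix
2025-04-18 04:15:21 UTC
$ TZ=America/Los_Angeles date -d @1744943950 '+%F %T %Z'   # frr.conf.1744943950017, ms dropped
2025-04-17 19:39:10 PDT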
@rcgoodfellow
Collaborator

Possible repro:

  1. Fling a bunch of routes at mgd
  2. Withdraw some of those routes whilst mgd is still syncing to the asic
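
A rough frr-side sketch of that repro, assuming vtysh access on the peer, `no bgp network import-check` already configured, and an illustrative ASN of 47 (the prefixes and counts here are made up):

#!/bin/bash
# Advertise a batch of /24s, then withdraw half of them immediately,
# before mgd has finished syncing the first batch to the ASIC.
set -euo pipefail
add=(); del=()
for i in $(seq 0 255); do add+=(-c "network 22.7.$i.0/24"); done
for i in $(seq 0 127); do del+=(-c "no network 22.7.$i.0/24"); done
vtysh -c 'configure terminal' -c 'router bgp 47' \
      -c 'address-family ipv4 unicast' "${add[@]}"
# deliberately no wait here: withdraw while the sync is still in flight
vtysh -c 'configure terminal' -c 'router bgp 47' \
      -c 'address-family ipv4 unicast' "${del[@]}"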
