8000 [Smartswitch][Pmon]Changes for Post startup and pre shutdown by gpunathilell · Pull Request #1980 · sonic-net/SONiC · GitHub
[go: up one dir, main page]
More Web Proxy on the site http://driver.im/
Skip to content

[Smartswitch][Pmon]Changes for Post startup and pre shutdown #1980

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged

Conversation

gpunathilell
Copy link
Contributor

These changes are for handling platform independent pcie removal and addition along with sensord restarts for smartswitch platform. These changes are needed to detect PCIE removal and addition and sensor removal and additon on smartswitch at the time of admin state changes and reboots for DPUs

Repo PR Status
sonic-platform-common [Smartswitch] Add module specific pcie attach/detach functions for smartswitch platforms PR 557

@mssonicbld
Copy link
Collaborator

/azp run

Copy link
No pipelines are associated with this pull request.

Signed-off-by: gpunathilell <gpunathilell@nvidia.com>
@mssonicbld
Copy link
Collaborator

/azp run

Copy link
No pipelines are associated with this pull request.

Signed-off-by: gpunathilell <gpunathilell@nvidia.com>
@mssonicbld
Copy link
Collaborator

/azp run

Copy link
No pipelines are associated with this pull request.

@gpunathilell
Copy link
Contributor Author

@vvolam @rameshraghupathy please review the HLD changes

@gpunathilell
Copy link
Contributor Author

Changes to be done:
Change path from dpu_ignore_conf to module_ignore_conf

@mssonicbld
Copy link
Collaborator

/azp run

Copy link
No pipelines are associated with this pull request.

* Once SONiC is up, the state progression is updated for every state transition on the DPU_STATE table in the chassisStateDB

#### Post-startup Procedures
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we rename like "DPU post-startup handling"?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are these operations "handle_pci_rescan(), pci_reattach()"applicable to the power on sequence or only to gNOI based reboot requence? Give some context on the two platform supported models and and talk about the two sets of APIs and how they work. Then give some context on why suddenly sensors come into this block.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You need to call out that the platforms implement these functions will only follow this and for platforms that don't implement this it is a NO-OP

* The switch PMON gets the admin up notification from the configDB
* The switch PMON invokes the platform API to power on the DPU
* DPU boots up and attaches itself to the midplane.
* If there is ignore configuration relevant to the DPU then we remove the file and restart sensord
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add a line to give some context on "ignore config", which file and sensord ? And also some information on why PCIe rescan is performed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

* Once SONiC is up, the state progression is updated for every state transition on the DPU_STATE table in the chassisStateDB

#### Post-startup Procedures
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are these operations "handle_pci_rescan(), pci_reattach()"applicable to the power on sequence or only to gNOI based reboot requence? Give some context on the two platform supported models and and talk about the two sets of APIs and how they work. Then give some context on why suddenly sensors come into this block.

### DPU startup sequence diagram
<p align="center"><img src="./images/dpu-startup-seq.svg"></p>
<p align="center"><img src="./images/dpu-startup-seq.png"></p>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the sequence diagram show the module.,py platform API vertical and if PCIe is not of significance please remove it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you show what triggers the post startup sequence int he seq.diag? Also, can you check the order of deleting sensors vs pci-rescan?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

order during shutdown:
sensors removal(by adding ignore configuration), then pcie device is removed
order during startup:
pcie rescan is performed, then we re-add the sensors (by removing the ignore configuration)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gpunathilell Please show that this is applicable only to platforms that need a full PCI rescan and should implement module.handle_pci_rescan() and self.handle_sensor_addition() # No action taken if it is not implemented. Rest all LGTM

#### Pre-shutdown Procedure

The switch has to prepare for the DPUs being powered off.
* The PCIe devices associated with the DPU are removed
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you move these two points below the line "When a DPU module's admin state is set to "down", the following pre-shutdown procedures are executed:"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

* Once SONiC is up, the state progression is updated for every state transition on the DPU_STATE table in the chassisStateDB

#### Post-startup Procedures
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You need to call out that the platforms implement these functions will only follow this and for platforms that don't implement this it is a NO-OP

"DPU0": {
"bus_info": "XXXX:XX:XX.X"
},
"DPU1": {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this intentional and what does it signify?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed, as this configuration is not used from Platform.json

def set_admin_state(self, up):
if up:
module.set_admin_state(up)
module.handle_pci_rescan()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should call out that these functions are NO-OPs for platforms don't need them.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added as comment in the set admin state call

* NPU-DPU (GNOI) soft reboot workflow is captured in [reboot-hld.md](https://github.com/sonic-net/SONiC/blob/26f3f4e282f3d2bd4a5c684608897850354f5c30/doc/smart-switch/reboot/reboot-hld.md)
* In the first option the "admin_status: down" configDB status change event will send a message to the switch PMON.
* The switch PMON will invoke the module class API "set_admin_state(self, up):" with the state being "down" and the platform in turn will call its API to gracefully shutdown the DPU.
* The PCIE device is added to `PCIE_DETACH_INFO` table and we remove the pcie device.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You need to say this is applicable only to platforms that don't implement their own APIs

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@mssonicbld
Copy link
Collaborator

/azp run

Copy link
No pipelines are associated with this pull request.

Copy link
Contributor
@vvolam vvolam left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gpunathilell doc/smart-switch/pmon/images/dpu-startup-seq.png - Arrows in the picture for step 1 and 3 seems to be wrong. could you fix it?

@mssonicbld
Copy link
Collaborator

/azp run

Copy link
No pipelines are associated with this pull request.

@liat-grozovik
Copy link
Collaborator

@rameshraghupathy are you ok with the changes following your feedback? anything missing? if not please approve.
@gpunathilell please be reminded to add the code PRs to the description

@mssonicbld
Copy link
Collaborator

/azp run

Copy link
No pipelines are associated with this pull request.

@mssonicbld
Copy link
Collaborator

/azp run

Copy link
No pipelines are associated with this pull request.

@gpunathilell
< F438 path d="M8 9a1.5 1.5 0 1 0 0-3 1.5 1.5 0 0 0 0 3ZM1.5 9a1.5 1.5 0 1 0 0-3 1.5 1.5 0 0 0 0 3Zm13 0a1.5 1.5 0 1 0 0-3 1.5 1.5 0 0 0 0 3Z"> Copy link
Contributor Author

@rameshraghupathy Could you please review

Copy link
Contributor
@vvolam vvolam left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

* Once SONiC is up, the state progression is updated for every state transition on the DPU_STATE table in the chassisStateDB

#### DPU post-startup handling

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please callout in the beginning just like Line: #284 to #290 that this is applicable only to platforms that need a full PCI rescan and should implement module.handle_pci_rescan() and self.handle_sensor_addition() # No action taken if it is not implemented. Can you also say the same in the sequence diagram? Rest all LGTM

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed with @rameshraghupathy, please mention that if it is not handled by any dedicated platform specific server and needs to be done via sonic APIs, then one should follow this otherwise it would be confusing as In our we are not needed these at all. In future, other platform might also have there own methods and it would be confusing/trickier to follow these standard including myself.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@vvolam, can you please comment?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rameshraghupathy It is mentioned in line 96 If pci_reattach() is not implemented in the specific platform, then no operations are performed in this function and in line 100
if such file does not exist, the sensord restart for this module is skipped
do you want to have it present seperately here?

@vvolam vvolam merged commit 8ecb624 into sonic-net:master May 30, 2025
1 check passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants
0