-
Notifications
You must be signed in to change notification settings - Fork 1.2k
[Smartswitch][Pmon]Changes for Post startup and pre shutdown #1980
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Smartswitch][Pmon]Changes for Post startup and pre shutdown #1980
Conversation
/azp run |
No pipelines are associated with this pull request. |
Signed-off-by: gpunathilell <gpunathilell@nvidia.com>
/azp run |
No pipelines are associated with this pull request. |
Signed-off-by: gpunathilell <gpunathilell@nvidia.com>
/azp run |
No pipelines are associated with this pull request. |
@vvolam @rameshraghupathy please review the HLD changes |
Changes to be done: |
/azp run |
No pipelines are associated with this pull request. |
* Once SONiC is up, the state progression is updated for every state transition on the DPU_STATE table in the chassisStateDB | ||
|
||
#### Post-startup Procedures |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should we rename like "DPU post-startup handling"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are these operations "handle_pci_rescan(), pci_reattach()"applicable to the power on sequence or only to gNOI based reboot requence? Give some context on the two platform supported models and and talk about the two sets of APIs and how they work. Then give some context on why suddenly sensors come into this block.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You need to call out that the platforms implement these functions will only follow this and for platforms that don't implement this it is a NO-OP
* The switch PMON gets the admin up notification from the configDB | ||
* The switch PMON invokes the platform API to power on the DPU | ||
* DPU boots up and attaches itself to the midplane. | ||
* If there is ignore configuration relevant to the DPU then we remove the file and restart sensord |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we add a line to give some context on "ignore config", which file and sensord ? And also some information on why PCIe rescan is performed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
* Once SONiC is up, the state progression is updated for every state transition on the DPU_STATE table in the chassisStateDB | ||
|
||
#### Post-startup Procedures |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are these operations "handle_pci_rescan(), pci_reattach()"applicable to the power on sequence or only to gNOI based reboot requence? Give some context on the two platform supported models and and talk about the two sets of APIs and how they work. Then give some context on why suddenly sensors come into this block.
### DPU startup sequence diagram | ||
<p align="center"><img src="./images/dpu-startup-seq.svg"></p> | ||
<p align="center"><img src="./images/dpu-startup-seq.png"></p> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the sequence diagram show the module.,py platform API vertical and if PCIe is not of significance please remove it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you show what triggers the post startup sequence int he seq.diag? Also, can you check the order of deleting sensors vs pci-rescan?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
order during shutdown:
sensors removal(by adding ignore configuration), then pcie device is removed
order during startup:
pcie rescan is performed, then we re-add the sensors (by removing the ignore configuration)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@gpunathilell Please show that this is applicable only to platforms that need a full PCI rescan and should implement module.handle_pci_rescan() and self.handle_sensor_addition() # No action taken if it is not implemented. Rest all LGTM
#### Pre-shutdown Procedure | ||
|
||
The switch has to prepare for the DPUs being powered off. | ||
* The PCIe devices associated with the DPU are removed |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you move these two points below the line "When a DPU module's admin state is set to "down", the following pre-shutdown procedures are executed:"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
* Once SONiC is up, the state progression is updated for every state transition on the DPU_STATE table in the chassisStateDB | ||
|
||
#### Post-startup Procedures |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You need to call out that the platforms implement these functions will only follow this and for platforms that don't implement this it is a NO-OP
"DPU0": { | ||
"bus_info": "XXXX:XX:XX.X" | ||
}, | ||
"DPU1": { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this intentional and what does it signify?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed, as this configuration is not used from Platform.json
def set_admin_state(self, up): | ||
if up: | ||
module.set_admin_state(up) | ||
module.handle_pci_rescan() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You should call out that these functions are NO-OPs for platforms don't need them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added as comment in the set admin state call
* NPU-DPU (GNOI) soft reboot workflow is captured in [reboot-hld.md](https://github.com/sonic-net/SONiC/blob/26f3f4e282f3d2bd4a5c684608897850354f5c30/doc/smart-switch/reboot/reboot-hld.md) | ||
* In the first option the "admin_status: down" configDB status change event will send a message to the switch PMON. | ||
* The switch PMON will invoke the module class API "set_admin_state(self, up):" with the state being "down" and the platform in turn will call its API to gracefully shutdown the DPU. | ||
* The PCIE device is added to `PCIE_DETACH_INFO` table and we remove the pcie device. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You need to say this is applicable only to platforms that don't implement their own APIs
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
/azp run |
No pipelines are associated with this pull request. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@gpunathilell doc/smart-switch/pmon/images/dpu-startup-seq.png - Arrows in the picture for step 1 and 3 seems to be wrong. could you fix it?
/azp run |
No pipelines are associated with this pull request. |
@rameshraghupathy are you ok with the changes following your feedback? anything missing? if not please approve. |
/azp run |
No pipelines are associated with this pull request. |
/azp run |
No pipelines are associated with this pull request. |
@rameshraghupathy Could you please review |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
* Once SONiC is up, the state progression is updated for every state transition on the DPU_STATE table in the chassisStateDB | ||
|
||
#### DPU post-startup handling | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please callout in the beginning just like Line: #284 to #290 that this is applicable only to platforms that need a full PCI rescan and should implement module.handle_pci_rescan() and self.handle_sensor_addition() # No action taken if it is not implemented. Can you also say the same in the sequence diagram? Rest all LGTM
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed with @rameshraghupathy, please mention that if it is not handled by any dedicated platform specific server and needs to be done via sonic APIs, then one should follow this otherwise it would be confusing as In our we are not needed these at all. In future, other platform might also have there own methods and it would be confusing/trickier to follow these standard including myself.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@vvolam, can you please comment?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@rameshraghupathy It is mentioned in line 96 If pci_reattach() is not implemented in the specific platform, then no operations are performed in this function
and in line 100
if such file does not exist, the sensord restart for this module is skipped
do you want to have it present seperately here?
These changes are for handling platform independent pcie removal and addition along with sensord restarts for smartswitch platform. These changes are needed to detect PCIE removal and addition and sensor removal and additon on smartswitch at the time of admin state changes and reboots for DPUs