User Details
- User Since: Jan 5 2016, 9:54 PM (465 w, 1 d)
- Availability: Available
- LDAP User: Unknown
- MediaWiki User: LToscano (WMF)
Yesterday
I think I was wrong, this bit inside geoshape's index.js should work:
@Jclark-ctr I fixed the provisioning of ms-be1086; for some reason, if the BMC doesn't have IPv6 enabled, the settings that errored out are read-only (in fact provisioning was failing during the BMC network settings rollout). The workaround is to connect to the WebUI, Configuration -> Network, and enable IPv6. Then you can re-run the cookbook and it works.
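For reference, a quick way to double-check the BMC NIC's IPv6 state is a Redfish query like the sketch below; the host, interface path and credentials are placeholders and vary by vendor/model:
# Sketch only: inspect the BMC NIC's IPv6 settings via Redfish.
# <bmc-host>, the interface ID and the credentials are placeholders.
curl -sk -u root:REDACTED https://<bmc-host>/redfish/v1/Managers/1/EthernetInterfaces/1 \
  | jq '{InterfaceEnabled, DHCPv6, IPv6Addresses, IPv6StaticAddresses}'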
Tue, Dec 3
Better to open a new one for traceability!
Mon, Dec 2
Everything reimaged, we are good :)
Deployed to Tegola production, I think everything looks good. I didn't find any metrics worth adding to the Grafana dashboard, but we'll see in the future. The setup is now safer in my opinion, and we can re-use it for kartotherian.
This task needs T380487 to be completed first, otherwise the analytics-privatedata-users settings will not be useful (since the user will not be able to log in to Superset).
@thcipriani Hi! Could you review this request? Lemme know if we can move forward :)
Setting it to declined for the moment, please re-open when a final decision has been made!
@klausman Hi! Could ML take care of this request?
Fri, Nov 29
Reimaged 208[2-5] too (2084 was left unconfigured for some reason, I have probably missed it, good that I rechecked :D).
Before the test I checked the envoy logs and found that we already had a slow replica that was ejected and re-added back:
After a chat with Janis, this may be a good test:
Thu, Nov 28
ms-be2081 reimage done!
Tried megactl (packaged by Moritz) on ms-be2082, this is the result:
Re-ran provision on all of those, we are good, no changes registered. Next up are the reimages, I'll kick off some.
To keep archives happy, we are going to use the envoy TCP proxy already implemented for Tegola with some tweaks. More info in T322647#10365816
TIL, already done thanks!
Tested staging with T344324#9826584 (used previously for other Tegola work) and it seems to work nicely.
Deployed in tegola staging, I got this from the envoy's logs:
Wed, Nov 27
elukey@mwmaint1002:~$ sudo ldapsearch -x cn=wmf | grep iacke
member: uid=iackerman,ou=people,dc=wikimedia,dc=org
Hi! I am looping in @KFrancis since afaics we need to sign an NDA before proceeding.
@NCreasy the wmf group should be enough, you are free to play with DataHub, all perms should be set. If you find any issue please re-open this task!
elukey@mwmaint1002:~$ sudo ldapsearch -x cn=wmf | grep jly
member: uid=jly,ou=people,dc=wikimedia,dc=org
Closing for the moment, please re-open if needed!
@VRiley-WMF @Jclark-ctr Hi! We are ready to start provisioning these nodes, but the procedure is a little bit more convoluted than usual since we need to force UEFI and there are still some Supermicro bugs that upstream is working on.
@Jhancock.wm hi! We have done a lot of weird tests with these nodes; I think that we should re-run provision on all of them to check that nothing weird from the tests is still in place, and possibly reimage all of them too.
I think we could easily try to swap perccli with storcli for the hosts with the SAS3908 onboard, but I am struggling to download the binary from the website (it doesn't show up in the search).
I tried to download and install perccli == 007.2616.0000.0000 on ms-be2081 but no luck, same issue.
The K8s SIG reviewed this proposal and for the moment it was decided not to proceed with anything that could harm the consistency of the Registry. We'll probably form a group of people interested in maintaining the registry, and after that a decision about how to proceed will be made.
Tue, Nov 26
@MatthewVernon it works for me, I think you are missing a start:
elukey@mwmaint1002:~$ sudo ldapsearch -x cn=wmf | grep sspalding
member: uid=sspalding,ou=people,dc=wikimedia,dc=org
Confirmed it is legit after a chat on Slack :)
Merged! Puppet needs to run on various hosts to propagate the permissions, but in ~1 hour we should be good. Closing, please re-open if I missed something!
Merged! The new access permissions will be deployed during the next hour by puppet on all the stat nodes.
Reached out to Joanna to confirm adding the user to the group, but it LGTM.
Reached out on Slack to verify the ssh key.
I am removing the SRE tag on this, Data Platform SREs are the right target for the last requests :)
I followed up with @Kgraessle and the analytics-privatedata-users group seems to be the best one to access the MariaDB prod replicas (according to this).
@Jly Hi! You are currently using the same SSH key for both production (this request) and WMCS, so I'd ask you to create a new one and update the task :)
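For reference, a fresh production-only key can be generated with something like the command below; the file name and comment are just examples:
# Example only: generate a new ed25519 key dedicated to production access.
ssh-keygen -t ed25519 -f ~/.ssh/id_ed25519_wmf_prod -C "jly-wmf-prod"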
@thcipriani Hi! I'd need your review to grant access to the Deployment group, lemme know your thoughts :)
My understanding is that, by default, the envoy LB configuration will not do any active probing of the configured TCP proxy endpoints. I am wondering if we should expand the mesh's tcp proxy config with health checks as described in:
Mon, Nov 25
@jijiki I am looking into the same problem in task T378944, and I have a doubt - what happens if one of the maps nodes goes down due to hw failure or maintenance? Is the envoy tcp load balancer going to remove it from rotation, or will it keep erroring out periodically until it is finally depooled? I was wondering if creating something like maps-db.discovery.wmnet:5432 with LVS could be a more long-term solution (adding only the read replicas to the pool). What do you think?
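To illustrate the active-probing idea (this is not the actual mesh config, just a sketch with made-up cluster/file names and thresholds), the envoy cluster fronting the postgres read replicas could get a TCP health check stanza like:
# Sketch only: append an active TCP health check to the envoy cluster config
# fronting the postgres read replicas (file name and values are placeholders).
cat <<'EOF' >> postgres-tcp-proxy-cluster.yaml
health_checks:
  - timeout: 2s
    interval: 5s
    unhealthy_threshold: 3
    healthy_threshold: 2
    tcp_health_check: {}
EOF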
For tegola we do the following:
Today I found out that Kartotherian seems to be contacting the local postgres read replica to fetch geoshapes/osmdb data:
Fri, Nov 22
The host is fully in service now and I had a chat with Matthew to put it in production, resolving!
Thu, Nov 21
+1 from my side!
Merged all the patches, and we finally have http://docker-registry.wikimedia.org/wikimedia/mediawiki-services-kartotherian:2024-11-21-145831-production \o/
@Jclark-ctr all configured, the host has been reimaged and all the disks show up.
Wed, Nov 20
Quick note about the reimage step - due to a bug in Supermicro's BMC firmware (at least, this is what we suspect), the first reimage run will likely trigger two consecutive Debian installs, causing the reimage to fail/stall. A subsequent reimage should be enough to fix it and get the node into its final state.
@Jclark-ctr the host is provisioned, the next step is number 2 in T370453#10326159, lemme know if you want me to do it or not!
Mon, Nov 18
@Jclark-ctr I updated the firmware to the correct one, but I'd need the BMC label password in pvt when you are in the DC (it is needed for the factory reset that has to happen after the firmware upgrade, sigh). Thanks for your patience!
Ok, I found the issue: last week I asked Jenn to turn off IPv6 on the BMC network to test if that was the cause, but that was before upgrading the firmware. The BMC reset preserves the network settings, so the old test setting caused the last hiccup when running provision.
My bad, I'd forgotten that we already got the firmware for config J from Supermicro (somehow I thought it was for the ganeti nodes, too many firmwares floating around :D). I uploaded it to thanos-be2005, followed by a factory reset. The issue is the same one that happened on backup1012: T371416#10216617
Fri, Nov 15
I was able to upload the firmware via the Web UI, but the issue still seems present (new version: 01.04.08). I need to investigate more what the problem is, and/or ping Supermicro to give us the same firmware that they deployed to the ms-be nodes.
@Jclark-ctr Hello! For this host, we have to follow a new workflow:
@Papaul @Jhancock.wm we'd need to upgrade the firmware on this node; I think we could directly use this one instead of the custom one. I tried to connect to the BMC web UI in various ways but failed, since the BMC network config is exactly the part that fails while provisioning. I also tried to do it by hand via DEL/Setup at boot, but for some reason I cannot modify any value (or my client prevents me from doing it remotely, not sure why).
While provisioning I see the following error for the BMC NIC config:
Thu, Nov 14
Tried to manually set the continuous flag on sretest2001 and rebooted, but I didn't see the boot options changing like on ms-be2088. So at this point it may not be relevant, but I can't explain the above differences. Maybe we just need to reimage all of them one more time and they will get the same conf?
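Assuming the flag in question is the Redfish BootSourceOverrideEnabled property, setting it by hand could look roughly like the sketch below (host, system path and credentials are placeholders):
# Sketch only: set the boot source override to "Continuous" via Redfish.
# <bmc-host>, the system ID and the credentials are placeholders.
curl -sk -u root:REDACTED -X PATCH -H 'Content-Type: application/json' \
  -d '{"Boot": {"BootSourceOverrideEnabled": "Continuous", "BootSourceOverrideTarget": "Pxe"}}' \
  https://<bmc-host>/redfish/v1/Systems/1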
@jhathaway something interesting that I found on Redfish related to BIOS boot options:
Wed, Nov 13
@jhathaway another episode of the saga, ms-be2088 :D
Tue, Nov 12
I think I found the issue; this is what I see from a new pod (available only for the duration of the deployment, then helmfile/helm rolls it back):
All action items done, now the next step is to wait for the k8s service to be deployed on Wikikube :)
Mon, Nov 11
The new kartotherian.discovery.wmnet:6543 endpoint is available.
Sun, Nov 10
I ran optimize table archive (11M records, seemed safe enough) after stopping the slave, and it seems to have recovered. I wasn't confident enough to declare it "ready for prod" so we decided not to repool, leaving the decision to data persistence :)
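For the record, a minimal sketch of that kind of sequence (the database name is a placeholder, and restarting replication afterwards is assumed):
# Sketch only: defragment the archive table on a depooled replica.
sudo mysql -e "STOP SLAVE;"
sudo mysql -e "OPTIMIZE TABLE <wiki_db>.archive;"  # rebuilds the table and its indexes
sudo mysql -e "START SLAVE;"  # assumed; replication restarted once the rebuild completed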
Fri, Nov 8
I also tried not configuring any special JBOD config for ms-be2087 after provision and kicking off a reimage, to see if the double d-i issue appeared (to rule out special SAS controller features/settings), but no luck: still a double d-i on the first try.
So far I provisioned up to ms-be2087, and ms-be2088 was left untouched. The ADMIN/root password should already be set to the one on pwstore, so if you want to go ahead and test with 2088 please do it :)
Another test, leading to weird results. I tried to do the following:
This is the boot order right after provisioning:
Very interesting - I watched the sol1 console of ms-be2086 while provisioning, and right after the second round of reboots (for BIOS updates) I noticed an attempt to PXE boot over HTTP, which failed and ended up in:
I tried with ms-be2085, doing the following:
@jhathaway thanks a ton for the tests, it was exactly what I had in mind to do today :)