
elukey (Luca Toscano)
Site Reliability Engineer - Machine Learning

User Details

User Since
Jan 5 2016, 9:54 PM (465 w, 1 d)
Availability
Available
LDAP User
Unknown
MediaWiki User
LToscano (WMF) [ Global Accounts ]

Recent Activity

Yesterday

elukey added a comment to T381257: Allow Kartotherian to use a local HTTP proxy.

I think I was wrong, this bit inside geoshape's index.js should work:

Wed, Dec 4, 12:22 PM · Content-Transform-Team, serviceops, WMDE-TechWish-Maintenance, Epic, Maps (Kartotherian)
elukey added a comment to T371389: Q1:rack/setup/install ms-be10{83-91}.

@Jclark-ctr I fixed the provisioning of ms-be1086. For some reason, if the BMC doesn't have IPv6 enabled, the settings that errored out are read-only (in fact, provisioning was failing during the BMC network settings rollout). The workaround is to connect to the WebUI, go to Configuration -> Network, and enable IPv6. Then you can re-run the cookbook and it works.

Wed, Dec 4, 10:25 AM · SRE, SRE-swift-storage, Data-Persistence, ops-eqiad, DC-Ops

Tue, Dec 3

elukey created P71502 (An Untitled Masterwork).
Tue, Dec 3, 4:53 PM
elukey added a comment to T381123: Requesting access to releasers-mediawiki group for ABreault (WMF).

Better to open a new one for traceability!

Tue, Dec 3, 3:42 PM · MediaWiki-Engineering, SRE, SRE-Access-Requests
elukey added a comment to T380487: Grant Access to ldap/wmde, ldap/nda for SuzanneWood-WMDE.

Hi all, please send me Suzanne Wood's email address and I will process the NDA. Thanks!

Tue, Dec 3, 8:32 AM · SRE, LDAP-Access-Requests

Mon, Dec 2

elukey created P71474 (An Untitled Masterwork).
Mon, Dec 2, 2:49 PM
elukey closed T371400: Q1:rack/setup/install ms-be208[1-8] as Resolved.

Everything reimaged, we are good :)

Mon, Dec 2, 2:47 PM · SRE, SRE-swift-storage, Data-Persistence, ops-codfw, DC-Ops
elukey created T381257: Allow Kartotherian to use a local HTTP proxy.
Mon, Dec 2, 11:04 AM · Content-Transform-Team, serviceops, WMDE-TechWish-Maintenance, Epic, Maps (Kartotherian)
elukey added a comment to T322647: Create a dedicated postgresql+postgis cluster for maps.

Deployed to Tegola production, and I think everything looks good. I didn't find metrics to add to the Grafana dashboard, but we'll see in the future. The setup is now safer in my opinion, and we can re-use it for kartotherian.

Mon, Dec 2, 10:25 AM · Platform Engineering, WMDE-GeoInfo-FocusArea, Epic, Maps (Kartotherian)
elukey added a comment to T380994: Requesting access to analytics-privatedata-users for Suzanne Wood (WMDE).

This task needs T380487 to be completed first, otherwise the analytics-privatedata-users settings will not be useful (since the user will not be able to log in to Superset).

Mon, Dec 2, 10:02 AM · SRE, SRE-Access-Requests
elukey moved T380994: Requesting access to analytics-privatedata-users for Suzanne Wood (WMDE) from Untriaged to Manager/NDA Approval/Confirmation on the SRE-Access-Requests board.
Mon, Dec 2, 10:02 AM · SRE, SRE-Access-Requests
elukey added a comment to T381123: Requesting access to releasers-mediawiki group for ABreault (WMF).

@thcipriani Hi! Could you review this request? Lemme know if we can move forward :)

Mon, Dec 2, 9:58 AM · MediaWiki-Engineering, SRE, SRE-Access-Requests
elukey closed T379159: Add permissions for Komla to run WMCS cookbooks as Declined.

Setting it to declined for the moment, please re-open when a final decision has been made!

Mon, Dec 2, 9:57 AM · SRE, SRE-Access-Requests, Patch-For-Review, cloud-services-team (FY2024/2025-Q1-Q2)
elukey moved T380525: Requesting access to deployment & stats private data access for jly from Untriaged to Manager/NDA Approval/Confirmation on the SRE-Access-Requests board.
Mon, Dec 2, 9:54 AM · SRE, SRE-Access-Requests
elukey moved T381108: Access to deploy recommendation API ML service for Stephane from Awaiting User Input to Manager/NDA Approval/Confirmation on the SRE-Access-Requests board.
Mon, Dec 2, 9:54 AM · SRE, Machine-Learning-Team, Recommendation-API, SRE-Access-Requests
elukey moved T381108: Access to deploy recommendation API ML service for Stephane from Untriaged to Awaiting User Input on the SRE-Access-Requests board.
Mon, Dec 2, 9:54 AM · SRE, Machine-Learning-Team, Recommendation-API, SRE-Access-Requests
elukey moved T381123: Requesting access to releasers-mediawiki group for ABreault (WMF) from Untriaged to Manager/NDA Approval/Confirmation on the SRE-Access-Requests board.
Mon, Dec 2, 9:54 AM · MediaWiki-Engineering, SRE, SRE-Access-Requests
elukey updated subscribers of T381108: Access to deploy recommendation API ML service for Stephane.

@klausman Hi! Could ML take care of this request?

Mon, Dec 2, 9:49 AM · SRE, Machine-Learning-Team, Recommendation-API, SRE-Access-Requests

Fri, Nov 29

elukey added a comment to T371400: Q1:rack/setup/install ms-be208[1-8].

Reimaged 208[2-5] too (2084 was left unconfigured for some reason, I have probably missed it, good that I rechecked :D).

Fri, Nov 29, 11:39 AM · SRE, SRE-swift-storage, Data-Persistence, ops-codfw, DC-Ops
elukey added a comment to T322647: Create a dedicated postgresql+postgis cluster for maps.

Before the test I checked the envoy logs and found that we already had a slow replica that was ejected and then re-added:

Fri, Nov 29, 8:59 AM · Platform Engineering, WMDE-GeoInfo-FocusArea, Epic, Maps (Kartotherian)
elukey added a comment to T322647: Create a dedicated postgresql+postgis cluster for maps.

After a chat with Janis, this may be a good test:

Fri, Nov 29, 8:43 AM · Platform Engineering, WMDE-GeoInfo-FocusArea, Epic, Maps (Kartotherian)
elukey added a comment to T378944: Strategy to slowly move Kartotherian's traffic from bare metal to k8s.

I think we have made this migration slightly more complicated than it should be for the following two reasons:

Ingress
While kartotherian is a useful service, it is not a critical one, and its unavailability does not affect the stability of our websites. Thus, my suggestion is to simply use for kartotherian-k8s, and flip the switch on ATS. For a gradual rollout we can follow our usual path of stopping puppet on cp* hosts and re-enabling it in batches. To my knowledge, no other internal service talks to maps.

In the above scenario, our only concern would be whether the estimated capacity the service has on k8s matches what it needs in reality.

Fri, Nov 29, 8:38 AM · serviceops-radar, Maps (Kartotherian)

Thu, Nov 28

elukey added a comment to T377853: RAID monitoring on new hardware spec requires new or updated user space cli tool.

Megactl is correct that the battery is missing, but obviously on nodes where we expect that, it shouldn't flag as an error...

Thu, Nov 28, 4:57 PM · SRE-swift-storage, DC-Ops, SRE-tools, observability, Puppet, Infrastructure-Foundations
elukey added a comment to T371400: Q1:rack/setup/install ms-be208[1-8].

ms-be2081 has been reimaged!

Thu, Nov 28, 4:55 PM · SRE, SRE-swift-storage, Data-Persistence, ops-codfw, DC-Ops
elukey added a comment to T377853: RAID monitoring on new hardware spec requires new or updated user space cli tool.

Tried megactl (packaged by Moritz) on ms-be2082, this is the result:

Thu, Nov 28, 4:43 PM · SRE-swift-storage, DC-Ops, SRE-tools, observability, Puppet, Infrastructure-Foundations
elukey added a comment to T371400: Q1:rack/setup/install ms-be208[1-8].

Re-ran provision on all those, we are good, no changes registered. Now it is the turn of reimages, I'll kick off some.

Thu, Nov 28, 4:24 PM · SRE, SRE-swift-storage, Data-Persistence, ops-codfw, DC-Ops
elukey added a member for WMF-NDA: IAckerman-WMF.
Thu, Nov 28, 4:16 PM
elukey added a comment to T378944: Strategy to slowly move Kartotherian's traffic from bare metal to k8s.

To keep archives happy, we are going to use the envoy TCP proxy already implemented for Tegola with some tweaks. More info in T322647#10365816

Thu, Nov 28, 3:33 PM · serviceops-radar, Maps (Kartotherian)
elukey closed T380523: Grant Access to wmf for JLy-WMF as Resolved.

TIL, already done thanks!

Thu, Nov 28, 3:33 PM · SRE, LDAP-Access-Requests
elukey closed T380523: Grant Access to wmf for JLy-WMF, a subtask of T380014: Onboard Jimmy Ly to the Security Team, as Resolved.
Thu, Nov 28, 3:32 PM · SecTeam-Processed, Security Team AppSec, Security-Team
elukey added a member for WMF-NDA: Jly.
Thu, Nov 28, 3:32 PM
elukey added a comment to T322647: Create a dedicated postgresql+postgis cluster for maps.

Tested staging with T344324#9826584 (used previously for other tegola work) and it seems to be working nicely.

Thu, Nov 28, 3:31 PM · Platform Engineering, WMDE-GeoInfo-FocusArea, Epic, Maps (Kartotherian)
elukey added a comment to T322647: Create a dedicated postgresql+postgis cluster for maps.

Deployed in tegola staging, I got this from the envoy's logs:

Thu, Nov 28, 3:23 PM · Platform Engineering, WMDE-GeoInfo-FocusArea, Epic, Maps (Kartotherian)
elukey added a comment to T322647: Create a dedicated postgresql+postgis cluster for maps.

@awight, on Tegola (which is running on k8s), we already have envoy doing load-balancing there, details can be found at tegola-vector-tiles. It makes sense to use the same solution for kartotherian I reckon, do you agree?

Thu, Nov 28, 10:35 AM · Platform Engineering, WMDE-GeoInfo-FocusArea, Epic, Maps (Kartotherian)

Wed, Nov 27

elukey closed T380091: Access to Data Hub - IAckerman-WMF as Resolved.
elukey@mwmaint1002:~$ sudo ldapsearch -x cn=wmf | grep iacke
member: uid=iackerman,ou=people,dc=wikimedia,dc=org
Wed, Nov 27, 3:30 PM · SRE, LDAP-Access-Requests
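
A note on the check above: piping ldapsearch output through grep matches substrings, so a short uid could also match a longer, unrelated one. A minimal sketch of an exact-match check follows, assuming the output contains plain `member: uid=...` lines like the one shown; the function names `ldap_group_uids` and `is_member` are hypothetical, and base64-encoded (`member::`) or folded LDIF lines are not handled.

```python
def ldap_group_uids(ldif_text):
    """Collect uid values from plain `member: uid=...` lines of ldapsearch output.

    Assumption: each member is printed on one unfolded line; `member::`
    (base64) entries are skipped rather than decoded.
    """
    uids = set()
    for line in ldif_text.splitlines():
        line = line.strip()
        if not line.lower().startswith("member:"):
            continue
        # Everything after the first colon is the member's DN.
        dn = line.split(":", 1)[1].strip()
        # The first RDN looks like "uid=iackerman".
        first_rdn = dn.split(",", 1)[0]
        key, _, value = first_rdn.partition("=")
        if key.strip().lower() == "uid" and value.strip():
            uids.add(value.strip())
    return uids


def is_member(ldif_text, uid):
    """Return True only if `uid` matches a member uid exactly."""
    return uid in ldap_group_uids(ldif_text)
```

The text passed in would be the captured output of the ldapsearch command; the sketch only parses it and does not talk to LDAP itself.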
elukey updated subscribers of T380487: Grant Access to ldap/wmde, ldap/nda for SuzanneWood-WMDE.

Hi! I am looping in @KFrancis since afaics we need to sign an NDA before proceeding.

Wed, Nov 27, 3:13 PM · SRE, LDAP-Access-Requests
elukey closed T380097: Grant Access to NDA-users for ncreasy as Resolved.

@NCreasy the wmf group should be enough, you are free to play with DataHub, all perms should be set. If you find any issue please re-open this task!

Wed, Nov 27, 3:09 PM · SRE, LDAP-Access-Requests
elukey closed T380523: Grant Access to wmf for JLy-WMF, a subtask of T380014: Onboard Jimmy Ly to the Security Team, as Resolved.
Wed, Nov 27, 2:56 PM · SecTeam-Processed, Security Team AppSec, Security-Team
elukey closed T380523: Grant Access to wmf for JLy-WMF as Resolved.
elukey@mwmaint1002:~$ sudo ldapsearch -x cn=wmf | grep jly
member: uid=jly,ou=people,dc=wikimedia,dc=org
Wed, Nov 27, 2:56 PM · SRE, LDAP-Access-Requests
elukey closed T380820: Grant Access to wmf for Sspalding as Resolved.

Closing for the moment, please re-open if needed!

Wed, Nov 27, 2:52 PM · SRE, LDAP-Access-Requests
elukey updated the task description for T380525: Requesting access to deployment & stats private data access for jly.
Wed, Nov 27, 2:52 PM · SRE, SRE-Access-Requests
elukey added a comment to T380525: Requesting access to deployment & stats private data access for jly.

@elukey Got it, I have updated the key now, please see

Wed, Nov 27, 2:51 PM · SRE, SRE-Access-Requests
elukey updated subscribers of T371389: Q1:rack/setup/install ms-be10{83-91}.

@VRiley-WMF @Jclark-ctr Hi! We are ready to start provisioning these nodes, but the procedure is a little more convoluted than usual, since we need to force UEFI and there are still some Supermicro bugs that upstream is working on.

Wed, Nov 27, 9:30 AM · SRE, SRE-swift-storage, Data-Persistence, ops-eqiad, DC-Ops
elukey reopened T371400: Q1:rack/setup/install ms-be208[1-8] as "Open".

@Jhancock.wm hi! We have done a lot of weird tests with these nodes, I think that we should re-run provision for all of them to check that nothing weird that was tested is still in place, and possibly reimage all of them too.

Wed, Nov 27, 9:22 AM · SRE, SRE-swift-storage, Data-Persistence, ops-codfw, DC-Ops
elukey added a comment to T377853: RAID monitoring on new hardware spec requires new or updated user space cli tool.

I think we could easily try to swap perccli with storcli for the hosts with SAS3908 onboard, but I am struggling to download the binary from the website (it doesn't show up in the search).

Wed, Nov 27, 9:08 AM · SRE-swift-storage, DC-Ops, SRE-tools, observability, Puppet, Infrastructure-Foundations
elukey added a comment to T377853: RAID monitoring on new hardware spec requires new or updated user space cli tool.

I tried to download and install perccli == 007.2616.0000.0000 on ms-be2081 but no luck, same issue.

Wed, Nov 27, 8:59 AM · SRE-swift-storage, DC-Ops, SRE-tools, observability, Puppet, Infrastructure-Foundations
elukey closed T375645: Clean up the Docker Registry catalog and Swift storage from old images as Declined.

The K8s SIG reviewed this proposal and for the moment it was decided not to proceed with anything that could harm the consistency of the Registry. We'll probably form a group of people interested in maintaining the registry, and after that a decision about how to proceed will be made.

Wed, Nov 27, 8:51 AM · User-Elukey, Infrastructure-Foundations, SRE, serviceops
elukey closed T375645: Clean up the Docker Registry catalog and Swift storage from old images, a subtask of T242604: Remove obsoleted docker images, as Declined.
Wed, Nov 27, 8:48 AM · Release-Engineering-Team (Radar), Upstream, User-brennen, SRE, Release Pipeline, serviceops

Tue, Nov 26

elukey added a comment to T380883: thanos-be1005 and thanos-be2005 serial console not available over ssh.

@MatthewVernon it works for me, I think you are missing a start:

Tue, Nov 26, 4:37 PM · SRE-swift-storage, DC-Ops, SRE, ops-eqiad, ops-codfw
elukey added a comment to T322647: Create a dedicated postgresql+postgis cluster for maps.

From https://www.envoyproxy.io/docs/envoy/v1.23.12/api-v3/config/cluster/v3/cluster.proto:

Tue, Nov 26, 4:34 PM · Platform Engineering, WMDE-GeoInfo-FocusArea, Epic, Maps (Kartotherian)
elukey added a comment to T380820: Grant Access to wmf for Sspalding.
elukey@mwmaint1002:~$ sudo ldapsearch -x cn=wmf | grep sspalding
member: uid=sspalding,ou=people,dc=wikimedia,dc=org
Tue, Nov 26, 4:15 PM · SRE, LDAP-Access-Requests
elukey added a comment to T380820: Grant Access to wmf for Sspalding.

Confirmed it is legit after a chat on Slack :)

Tue, Nov 26, 4:07 PM · SRE, LDAP-Access-Requests
elukey closed T379678: Requesting access to deployment for dbrant as Resolved.

Merged! Puppet needs to run in various hosts to propagate the permissions but in ~1hour we should be good. Closing, please re-open if I missed something!

Tue, Nov 26, 3:58 PM · SRE, SRE-Access-Requests
elukey updated the task description for T379173: Requesting access to deployment shell access for Kgraessle.
Tue, Nov 26, 3:56 PM · SRE, SRE-Access-Requests
elukey closed T379173: Requesting access to deployment shell access for Kgraessle as Resolved.

Merged! The new access permissions will be deployed during the next hour by puppet on all the stat nodes.

Tue, Nov 26, 3:56 PM · SRE, SRE-Access-Requests
elukey updated the task description for T379678: Requesting access to deployment for dbrant.
Tue, Nov 26, 1:56 PM · SRE, SRE-Access-Requests
elukey added a comment to T379678: Requesting access to deployment for dbrant.

Reached out on Slack to verify the ssh key.

Tue, Nov 26, 1:56 PM · SRE, SRE-Access-Requests
elukey added a comment to T379159: Add permissions for Komla to run WMCS cookbooks.

Reached out to Joanna to confirm adding the user to the group, but it LGTM.

Tue, Nov 26, 1:52 PM · SRE, SRE-Access-Requests, Patch-For-Review, cloud-services-team (FY2024/2025-Q1-Q2)
elukey added a comment to T379678: Requesting access to deployment for dbrant.

Reached out on Slack to verify the ssh key.

Tue, Nov 26, 1:50 PM · SRE, SRE-Access-Requests
elukey removed projects from T379303: Requesting access to analytics-privatedata-users group, sql_lab role, Kerberos Principal for Khantstop: SRE, SRE-Access-Requests.

I am removing the SRE tag on this, Data Platform SREs are the right target for the last requests :)

Tue, Nov 26, 1:49 PM · Data-Platform-SRE (2024.11.09 - 2024.11.29)
elukey added a comment to T379173: Requesting access to deployment shell access for Kgraessle.

I followed up with @Kgraessle and the analytics-privatedata-users group seems to be the best one to access the Mariadb prod replicas (according to this).

Tue, Nov 26, 1:28 PM · SRE, SRE-Access-Requests
elukey added a comment to T380525: Requesting access to deployment & stats private data access for jly.

@Jly Hi! You are currently using the same SSH key for both production (this request) and WMCS, so I'd ask you to create a new one and update the task :)

Tue, Nov 26, 1:13 PM · SRE, SRE-Access-Requests
elukey updated the task description for T379173: Requesting access to deployment shell access for Kgraessle.
Tue, Nov 26, 1:09 PM · SRE, SRE-Access-Requests
elukey updated subscribers of T380525: Requesting access to deployment & stats private data access for jly.

@thcipriani Hi! I'd need your review to grant access to the Deployment group, lemme know your thoughts :)

Tue, Nov 26, 1:08 PM · SRE, SRE-Access-Requests
elukey updated the task description for T380525: Requesting access to deployment & stats private data access for jly.
Tue, Nov 26, 1:06 PM · SRE, SRE-Access-Requests
elukey added a comment to T322647: Create a dedicated postgresql+postgis cluster for maps.

My understanding is that, by default, the envoy LB configuration will not do any active probing of the TCP proxy endpoints set. I am wondering if we should expand the mesh's tcp proxy config with health checks like described in:

Tue, Nov 26, 11:49 AM · Platform Engineering, WMDE-GeoInfo-FocusArea, Epic, Maps (Kartotherian)
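
To illustrate what active probing of TCP endpoints means in the comment above, here is a minimal sketch in Python. It is not envoy's implementation or configuration; it only shows the idea of periodically attempting a TCP connect and keeping just the endpoints that answer. The function names and the endpoint list are hypothetical.

```python
import socket


def tcp_health_check(host, port, timeout=1.0):
    """Return True if a TCP connection to (host, port) succeeds within `timeout`."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and name resolution failures.
        return False


def healthy_endpoints(endpoints, timeout=1.0):
    """Filter a list of (host, port) pairs down to those accepting connections.

    A load balancer doing active health checks would run something like this
    on an interval and only route traffic to the returned endpoints.
    """
    return [(host, port) for host, port in endpoints
            if tcp_health_check(host, port, timeout)]
```

A caller would pass the read replicas as (host, port) pairs and re-run the filter periodically. A successful connect only proves the port is open, not that PostgreSQL is answering queries, so a real check may need to go further.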

Mon, Nov 25

elukey added a comment to T322647: Create a dedicated postgresql+postgis cluster for maps.

@jijiki I am looking into the same problem in task T378944, and I have a doubt: what happens if one of the maps nodes goes down for hw failure or maintenance? Is the envoy tcp load balancer going to remove it from rotation, or will it keep erroring out periodically until it is finally depooled? I was wondering if creating something like maps-db.discovery.wmnet:5432 with LVS could be a more long-term solution (adding only the read replicas to the pool). What do you think?

Mon, Nov 25, 2:48 PM · Platform Engineering, WMDE-GeoInfo-FocusArea, Epic, Maps (Kartotherian)
elukey added a comment to T378944: Strategy to slowly move Kartotherian's traffic from bare metal to k8s.

For tegola we do the following:

Mon, Nov 25, 12:06 PM · serviceops-radar, Maps (Kartotherian)
elukey added a comment to T378944: Strategy to slowly move Kartotherian's traffic from bare metal to k8s.

Today I found out that Kartotherian seems to be contacting the local postgres read replica to fetch geoshapes/osmdb data:

Mon, Nov 25, 11:01 AM · serviceops-radar, Maps (Kartotherian)

Fri, Nov 22

elukey closed T370453: Q1:rack/setup/install thanos-be1005 as Resolved.

The host is fully in service now and I had a chat with Matthew to put it in production, resolving!

Fri, Nov 22, 9:48 AM · SRE, SRE-swift-storage, Data-Persistence, ops-eqiad, DC-Ops
elukey updated the task description for T370453: Q1:rack/setup/install thanos-be1005.
Fri, Nov 22, 9:45 AM · SRE, SRE-swift-storage, Data-Persistence, ops-eqiad, DC-Ops

Thu, Nov 21

elukey added a comment to T380373: Allow TLS authenticated client to write on new topics.

+1 from my side!

Thu, Nov 21, 5:16 PM · Data-Platform-SRE, Traffic
elukey added a comment to T327396: Migrate Kartotherian to node-mapnik v4.2.1 and unfork.

Merged all the patches, and we finally have http://docker-registry.wikimedia.org/wikimedia/mediawiki-services-kartotherian:2024-11-21-145831-production \o/

Thu, Nov 21, 4:04 PM · WMDE-TechWish-Sprint-2024-10-16, Patch-For-Review, Essential-Work, Content-Transform-Team-WIP, WMDE-GeoInfo-FocusArea, Maps (Kartotherian)
elukey updated subscribers of T370453: Q1:rack/setup/install thanos-be1005.

@Jclark-ctr all configured, the host has been reimaged and all the disks show up.

Thu, Nov 21, 11:06 AM · SRE, SRE-swift-storage, Data-Persistence, ops-eqiad, DC-Ops

Wed, Nov 20

elukey added a comment to T370453: Q1:rack/setup/install thanos-be1005.

Quick note about the reimage step: due to a bug in Supermicro's BMC firmware (at least, this is what we suspect), the first reimage run will likely result in two consecutive debian installs, which causes the reimage to fail/stall. A subsequent reimage should be enough to fix it, getting the node into the final state.

Wed, Nov 20, 5:22 PM · SRE, SRE-swift-storage, Data-Persistence, ops-eqiad, DC-Ops
elukey added a comment to T370453: Q1:rack/setup/install thanos-be1005.

@Jclark-ctr the host is provisioned, next step is the number 2 in T370453#10326159, lemme know if you want me to do it or not!

Wed, Nov 20, 2:59 PM · SRE, SRE-swift-storage, Data-Persistence, ops-eqiad, DC-Ops

Mon, Nov 18

elukey added a comment to T370453: Q1:rack/setup/install thanos-be1005.

@Jclark-ctr I updated the firmware to the correct one, but I'd need the BMC label password in pvt when you are in the DC (it is needed for the factory reset that needs to happen post-firmware upgrade, sigh). Thanks for the patience!

Mon, Nov 18, 12:05 PM · SRE, SRE-swift-storage, Data-Persistence, ops-eqiad, DC-Ops
elukey added a comment to T370452: Q1:rack/setup/install thanos-be2005.

Ok I found the issue, I asked Jenn to turn off IPv6 last week for the BMC network to test if that was the issue, but it was before upgrading the firmware. With the BMC reset the network settings are preserved, so the old test/setting caused the last hiccup in running provision.

Mon, Nov 18, 12:04 PM · SRE, SRE-swift-storage, Data-Persistence, ops-codfw, DC-Ops
elukey added a comment to T370452: Q1:rack/setup/install thanos-be2005.

My bad, I misremembered that we got the firmware for config J from Supermicro already (somehow I thought it was for the ganeti nodes, too many firmware floating around :D) and I uploaded it to thanos-be2005, followed by a factory reset. The issue is the same as happened on backup1012: T371416#10216617

Mon, Nov 18, 10:56 AM · SRE, SRE-swift-storage, Data-Persistence, ops-codfw, DC-Ops

Fri, Nov 15

elukey added a comment to T370452: Q1:rack/setup/install thanos-be2005.

I was able to upload the firmware via Web UI, but the issue still seems present (new version, 01.04.08). I need to investigate further what the problem is, and/or ping supermicro to give us the same firmware that they deployed to the ms-be nodes.

Fri, Nov 15, 6:18 PM · SRE, SRE-swift-storage, Data-Persistence, ops-codfw, DC-Ops
elukey updated subscribers of T370453: Q1:rack/setup/install thanos-be1005.

@Jclark-ctr Hello! For this host, we have to follow a new workflow:

Fri, Nov 15, 11:38 AM · SRE, SRE-swift-storage, Data-Persistence, ops-eqiad, DC-Ops
elukey updated subscribers of T370452: Q1:rack/setup/install thanos-be2005.

@Papaul @Jhancock.wm we'd need to upgrade the firmware on this node, I think that we could directly use this instead of the custom one. I tried to connect to the BMC web ui in various ways but I failed, since the BMC network config is the one that fails while provisioning. I also tried to do it by hand via DEL/Setup at boot, but for some reason I cannot modify any value (or my client prevents me from doing it remotely, not sure why).

Fri, Nov 15, 11:20 AM · SRE, SRE-swift-storage, Data-Persistence, ops-codfw, DC-Ops
elukey added a comment to T370452: Q1:rack/setup/install thanos-be2005.

While provisioning I see the following error for the BMC NIC config:

Fri, Nov 15, 10:50 AM · SRE, SRE-swift-storage, Data-Persistence, ops-codfw, DC-Ops

Thu, Nov 14

elukey added a comment to T371400: Q1:rack/setup/install ms-be208[1-8].

Tried to manually set the continuous flag on sretest2001 and rebooted, but I didn't see the boot options changing like on ms-be2088. So at this point it may not be relevant, but I can't explain the above differences. Maybe we just need to reimage all of them another time and they will get the same conf?

Thu, Nov 14, 8:55 AM · SRE, SRE-swift-storage, Data-Persistence, ops-codfw, DC-Ops
elukey added a comment to T371400: Q1:rack/setup/install ms-be208[1-8].

@jhathaway something interesting that I found on Redfish related to BIOS boot options:

Thu, Nov 14, 8:43 AM · SRE, SRE-swift-storage, Data-Persistence, ops-codfw, DC-Ops

Wed, Nov 13

elukey added a comment to T371400: Q1:rack/setup/install ms-be208[1-8].

@jhathaway another episode of the saga, ms-be2088 :D

Wed, Nov 13, 8:30 AM · SRE, SRE-swift-storage, Data-Persistence, ops-codfw, DC-Ops

Tue, Nov 12

elukey added a comment to T379592: Unable to deploy new version of recommendation-api to production due to connectivity issues.

I think I found the issue, this is what I see from a new pod (available only for the time of the deployment, then helmfile/helm rolls it back):

Tue, Nov 12, 11:55 AM · Unplanned-Sprint-Work, LPL Essential (LPL Essential 2024 Nov-Dec), Recommendation-API
elukey added a comment to T378944: Strategy to slowly move Kartotherian's traffic from bare metal to k8s.

All action items done, now the next step is to wait for the k8s service to be deployed on Wikikube :)

Tue, Nov 12, 10:17 AM · serviceops-radar, Maps (Kartotherian)

Mon, Nov 11

elukey added a comment to T378944: Strategy to slowly move Kartotherian's traffic from bare metal to k8s.

The new kartotherian.discovery.wmnet:6543 endpoint is available.

Mon, Nov 11, 4:23 PM · serviceops-radar, Maps (Kartotherian)
elukey added a comment to T371400: Q1:rack/setup/install ms-be208[1-8].

@elukey I was able to reproduce the issue, by wiping the files from the efi partition, before kicking off another re-image. I think the problem is actually in the debian-installer, rather than on the supermicro side, which is why we don't see this issue on sretest2001.codfw.wmnet. I think the debian-installer is failing to install grub properly and create the efi boot entry, which is part of the grub install process. I think the issue is related to setting grub-installer/bootdev which is done by autoinstall/scripts/partman_early_command.sh on the ms-be boxes. On ms-be2082 this evaluated to grub-installer/bootdev /dev/sdj /dev/sdk which seems correct, but perhaps /dev/sdk needs to be first? I also tried setting grub-installer/only_debian boolean false, which we set in the raid1-2dev-efi.cfg, but that didn't seem to have any effect, so I don't think we are still hitting, "#this workarounds LP #1012629 / Debian #666974", but I'm also not sure. I am off Monday, but happy to investigate more on Tuesday.

Mon, Nov 11, 2:47 PM · SRE, SRE-swift-storage, Data-Persistence, ops-codfw, DC-Ops
elukey edited P70998 rec-api-ng deployment log errors.
Mon, Nov 11, 11:23 AM
elukey created P70998 rec-api-ng deployment log errors.
Mon, Nov 11, 11:23 AM

Sun, Nov 10

elukey added a project to T379491: PROBLEM - MariaDB Replica SQL: s6 on db2217 is CRITICAL: CRITICAL: Data-Persistence-Automations.

I ran optimize table archive (11M records, seemed safe enough) after stopping the slave, and it seems to have recovered. I wasn't confident enough to declare it "ready for prod" so we decided not to repool, leaving the decision to data persistence :)

Sun, Nov 10, 12:39 PM · Data-Persistence-Automations, DBA

Fri, Nov 8

elukey added a comment to T371400: Q1:rack/setup/install ms-be208[1-8].

I also tried to not configure any special JBOD config for ms-be2087 after provision, and kick off reimage to see if the double d-i issue appeared (to rule out special SAS controller features/settings) but no luck, still double d-i at first try.

Fri, Nov 8, 12:05 PM · SRE, SRE-swift-storage, Data-Persistence, ops-codfw, DC-Ops
elukey added a comment to T371400: Q1:rack/setup/install ms-be208[1-8].

So far I provisioned up to ms-be2087, and ms-be2088 was left untouched. The ADMIN/root password should already be set to the one on pwstore, so if you want to go ahead and test with 2088 please do it :)

Fri, Nov 8, 11:40 AM · SRE, SRE-swift-storage, Data-Persistence, ops-codfw, DC-Ops
elukey added a comment to T371400: Q1:rack/setup/install ms-be208[1-8].

Another test, leading to weird results. I tried to do the following:

Fri, Nov 8, 11:38 AM · SRE, SRE-swift-storage, Data-Persistence, ops-codfw, DC-Ops
elukey added a comment to T371400: Q1:rack/setup/install ms-be208[1-8].

This is the boot order right after provisioning:

Fri, Nov 8, 10:04 AM · SRE, SRE-swift-storage, Data-Persistence, ops-codfw, DC-Ops
elukey added a comment to T371400: Q1:rack/setup/install ms-be208[1-8].

Very interesting - I watched the sol1 console of ms-be2086 during provisioning, and right after the second round of reboots (for BIOS updates) I noticed an attempt to PXE boot over HTTP, which failed and ended up in:

Fri, Nov 8, 10:01 AM · SRE, SRE-swift-storage, Data-Persistence, ops-codfw, DC-Ops
elukey added a comment to T371400: Q1:rack/setup/install ms-be208[1-8].

I tried with ms-be2085, doing the following:

Fri, Nov 8, 9:22 AM · SRE, SRE-swift-storage, Data-Persistence, ops-codfw, DC-Ops
elukey added a comment to T371400: Q1:rack/setup/install ms-be208[1-8].

@jhathaway thanks a ton for the tests, it was exactly what I had in mind to do today :)

Fri, Nov 8, 8:30 AM · SRE, SRE-swift-storage, Data-Persistence, ops-codfw, DC-Ops