User Details
- User Since
- Jan 7 2019, 1:06 PM (308 w, 3 d)
- Availability
- Available
- IRC Nick
- jbond
- LDAP User
- Jbond
- MediaWiki User
- Unknown
Jun 17 2024
Hi all, I wanted to say that the sso project exists so that users have an SSO testing infrastructure to use in cloud services. Originally it was also used to provide SSO to production-like services in cloud services, however that latter functionality has since been moved.
May 13 2024
Puppet 7 has some new ownership constraints which means that we can no longer investigate these repos as root, for example:
FYI this is an artifact of the new version of git, not puppet. You will need something like the following on the cloud standalone puppet masters:
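Roughly along these lines; the repo path is only an example, adjust it to wherever the puppet clone lives on the standalone master:

```
# Tell git the puppet repo is safe to operate on as root even though it is
# owned by a different user (the new ownership check in recent git versions).
git config --system --add safe.directory /var/lib/git/operations/puppet

# Or, on a trusted standalone puppet master, disable the check for all repos:
git config --system --add safe.directory '*'
```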
Jan 5 2024
This is likely the issue: something, somewhere, is probably still using Puppet_internal_ca.pem, /var/lib/puppet/ssl/ca/ca.pem or $facts['puppet_config']['localcacert'] directly.
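A quick, purely illustrative way to hunt for such direct references on an affected host (the search roots are just a starting point):

```
# Look for anything still pointing at the internal puppet CA paths directly.
sudo grep -rIl \
    -e 'Puppet_internal_ca.pem' \
    -e '/var/lib/puppet/ssl/ca/ca.pem' \
    -e 'localcacert' \
    /etc /srv 2>/dev/null
```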
Nov 29 2023
Nov 28 2023
Nov 27 2023
I'll leave this to @SLyngshede-WMF as I'm guessing they have been experimenting with migrating netbox to OIDC: T308002: Move Netbox authentication to python-social-auth
Nov 24 2023
@hashar how do we get the updated commit-message-validator into the CI images used by puppet? i.e. something similar to https://gerrit.wikimedia.org/r/c/integration/config/+/971546/2/dockerfiles/commit-message-validator/Dockerfile.template . I have a test commit which should pass once upgraded.
Maybe it is related to the puppet 7 upgrade in some way?
Although the timing could suggest this is caused by puppet 7, I would suggest caution for the following reasons:
- only ~50% of servers are running puppet7
- As the migration is active we can't be sure these servers were running puppet7 when the timeouts happened
- Most importantly: this script calls raid.rb directly, so puppet is not involved. The only library used is facter (and even then only a very limited amount of it), which has not been upgraded as part of the puppet migration (see the timing sketch after this list)
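As a rough check of that last point you can time the fact on its own, with puppet completely out of the loop. This is only a sketch: it assumes the custom raid fact is loadable via facter's puppet plugin support (facter -p), which may not exactly match how the script requires raid.rb.

```
# Time the raid fact by itself; puppet is not involved at all here.
# facter -p loads custom facts (such as raid.rb) from the puppet libdir.
time facter -p raid
```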
Nov 22 2023
I have refreshed the patches for T347565: Switch rsyslog to use the new PKI infrastructure, which among other things updates central auth to use a pki.discovery.wmnet-issued cert and updates blackbox::check::tcp to use the same certs. I'm running pcc now and will fix up any issues.
Nov 21 2023
@fgiunchedi I have created a CR to use pki.discovery.wmnet to request a puppet agent certificate instead of using expose_puppet_certs. This should work around the issue.
All systems have now been migrated to ossl.
@fgiunchedi Everything is using openssl now, do you still see the errors?
Looking at puppetboard we are still having issues when we do a puppet-merge. The following are times in UTC where we had a puppet-merge occurring; at each of these times we have 8-10 puppet failures.
I have rolled out a new wmf-certificates package which I believe has fixed this error. All swift services on thanos-fe1001 are now started. Tentatively closing, but please reopen if I missed something.
Nov 20 2023
@MatthewVernon this is almost certainly something using the puppet CA directly instead of /etc/ssl/certs/wmf-ca-certificates.crt. I need to investigate a bit more why openssl specifically is failing.
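One way to narrow this down is to compare verification against the bundle with verification against the puppet CA; the hostname below is a placeholder and the puppet CA path may differ on a given host:

```
# Should verify OK against the bundled WMF CA file...
openssl s_client -connect swift.example.wmnet:443 \
    -CAfile /etc/ssl/certs/wmf-ca-certificates.crt </dev/null | grep 'Verify return code'

# ...and will likely fail if the client is pinned to the puppet CA only.
openssl s_client -connect swift.example.wmnet:443 \
    -CAfile /var/lib/puppet/ssl/certs/ca.pem </dev/null | grep 'Verify return code'
```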
this has been fixed upstream we should get the benefit when we upgrade to puppet7
@RobH closing this as we now have the upgrade-firmware cookbook but please reopen if needed
this has since been fixed
not enough information
Reading the task it seems like the last blocker was to "wait out buster" (T324623#8449852). However, as we have now deployed this to buster (T324623#9334403), it seems like we can move ahead. Are there any concerns about making this change? It seems fairly simple.
Timer has now been deployed.
@fgiunchedi what is the probing software? We do have a bit of a workaround for this which may work here as well. Also, if it is the issue you mention, I'm not sure that switching to ossl will help; however I suspect T347565: Switch rsyslog to use the new PKI infrastructure would.
Nov 17 2023
@bking in order for me to investigate further I need either a broken host to investigate or a way to replicate the issue.
output:

```
Agent                                             | CA Server
==================================================|======================================
compile1001.puppet-dev.eqiad1.wikimedia.cloud     | pm7.puppet-dev.eqiad1.wikimedia.cloud
db7.puppet-dev.eqiad1.wikimedia.cloud             | pm7.puppet-dev.eqiad1.wikimedia.cloud
puppetboard1001.puppet-dev.eqiad1.wikimedia.cloud | pm7.puppet-dev.eqiad1.wikimedia.cloud
puppetdb1002.puppet-dev.eqiad1.wikimedia.cloud    | pm7.puppet-dev.eqiad1.wikimedia.cloud
acme-chief1001.puppet-dev.eqiad1.wikimedia.cloud  | puppetmaster.cloudinfra.wmflabs.org
project-pm.puppet-dev.eqiad1.wikimedia.cloud      | puppetmaster.cloudinfra.wmflabs.org
pupptserver7.puppet-dev.eqiad1.wikimedia.cloud    | puppetmaster.cloudinfra.wmflabs.org
agent7.puppet-dev.eqiad1.wikimedia.cloud          | puppetserver1001.eqiad.wmnet
```
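For reference, the mapping above can be pulled from each host with something like the following (the exact section may vary between agents):

```
# Show which server / CA a given agent is configured to talk to.
sudo puppet config print server ca_server --section agent
```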
@bking I took a look at cloudelastic1010 as I had thought this was in some broken state from the reimage cookbook. However, from the puppet certs I can see it has been around since Nov 9 07:30:40 2023 GMT and has had puppet disabled for the last 36 hours.
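The cert age comes straight from the host's agent certificate, roughly like this (the ssldir path shown is the usual non-AIO one and may differ):

```
# Print notBefore/notAfter of this host's puppet agent certificate.
sudo openssl x509 -noout -dates \
    -in /var/lib/puppet/ssl/certs/"$(hostname -f)".pem
```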
Nov 16 2023
I logged into logging-logstash-02.logging.eqiad1.wikimedia.cloud and ran systemctl restart logstash.service; hopefully that has fixed this.
Nov 15 2023
Looking at the OCSP file using the following command suggests that something with ocsprefresh is not working correctly, as the response is from the kafka CA.
The following command should be able to be used to check:
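Presumably something along these lines; the path to the cached OCSP response is a placeholder:

```
# Dump the OCSP response; the Responder Id / signer certificate show which CA
# actually produced it (here apparently the kafka CA rather than the expected one).
openssl ocsp -respin /var/cache/ocsp/example.der -text -noverify
```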
I have rolled out a change so that buster machines use openssl, which seems to have fixed the issue. Please reopen if you see other problems.
I have tested using openssl and that works, so I'll prepare a patch to switch all buster hosts to openssl.
Well, I have updated apt1001 to 8.2102.0-2~deb10u1 and I still see the problem, so that would suggest it's not an issue with rsyslog :/. Perhaps a different option would be to push forward with T347565; however I fear we may hit the same issue.
Nov 14 2023
This is actually live now, so I'll assume taavi was right and see if the error comes back.
Going to close this as I think it's resolved, but please reopen if not.
@jhathaway I suspect you have already fixed these with your dcl work; are you able to confirm/update?
set priority to high as this is causing issues
volatile is now synced to all puppetservers, and agents using puppet7 can fetch data correctly.
However, the IPv6 rule should not be there, right now it's incorrectly allowing v6 traffic to all addresses on port 3306.
It seems from some of the test cases that this may have been intentional; however I agree with you that the current behaviour seems undesirable. I created a CR, let's see what @MoritzMuehlenhoff says.
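For anyone double-checking a host, something like this shows what the live v6 ruleset actually allows (illustrative only):

```
# List the active IPv6 rules that mention the mariadb port.
sudo ip6tables -S | grep 3306
```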
@jhathaway thanks for investigating. By the sounds of it we could probably have a bit of a win if we do the following (rough sketch after the list):
- set environment_timeout = unlimited
- update puppet-merge to do a systemctl reload puppetserver after g10k
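A sketch of what that would look like; the config section and service unit names are assumed to match our puppetserver setup, and in practice both would be managed via puppet rather than set by hand:

```
# 1. Cache compiled environments indefinitely instead of re-parsing them.
#    (Normally this lands in puppet.conf via puppet itself; shown here by hand.)
sudo puppet config set environment_timeout unlimited --section server

# 2. Have puppet-merge flush the environment cache once g10k has deployed
#    the new code, e.g. by reloading the puppetserver service.
sudo systemctl reload puppetserver
```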
edit: or possibly this one https://github.com/rsyslog/rsyslog/issues/4035
OK, I don't think it's this, as we still have SSL_set_verify_depth(pThis->ssl, 4); in the buster packages.
Feels like this could be related to https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=887637 (https://github.com/rsyslog/rsyslog/issues/2762). That issue is about the server not sending the intermediate, but I wonder if the same issue means the client doesn't read the sent intermediate.
From a very simple test this appears to only affect buster.
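Roughly the kind of check that shows this, run from a buster host (the hostname below is a placeholder):

```
# openssl accepts the chain against the WMF bundle...
openssl s_client -connect centrallog.example.wmnet:6514 \
    -CAfile /etc/ssl/certs/wmf-ca-certificates.crt </dev/null | grep 'Verify return code'

# ...while gnutls on buster fails to verify the same connection.
gnutls-cli --x509cafile=/etc/ssl/certs/wmf-ca-certificates.crt \
    --port=6514 centrallog.example.wmnet </dev/null
```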
Some additional information
Nov 13 2023
This is in place now; use the hiera key during migration.
this is complete
These issues are all resolved.