User Details
- User Since: Jun 25 2020, 6:43 PM (232 w, 3 d)
- Availability: Available
- IRC Nick: chrisalbon
- LDAP User: Calbon
- MediaWiki User: CAlbon (WMF) [ Global Accounts ]
Oct 24 2024
I checked again this morning. Cursor and VS Code work fine. Reading the VS Code forum, I think there was an update in 0.41 that fixed the issue.
Approved
I tested this with VS Code and Cursor today. Both seem to work. I wonder if the install time was just really long.
Oct 12 2024
Oct 11 2024
Oct 10 2024
Sep 24 2024
Update: Right now we don't have the resources to prioritize this. I'm moving it to the backlog.
Aug 27 2024
- Recommendation API is live and in production
- Recently we have been supporting the Structured Content team in using the logo detection model on Lift Wing in production.
- Updated the readability model
- Pre-saved context for revert risk: https://phabricator.wikimedia.org/T356102, https://phabricator.wikimedia.org/T364705
- Slow revscoring: started logging queries on the pod side, but those logs are lost when the pod is killed.
- Need to answer: "Is there a reason we are not logging the queries into Logstash?" (see the sketch after this list)
- Machines are racked but not yet set up. Will set up one first to figure out the disk layout, then the other, and then release them to the Research team.
- GPU hosts are racked but not yet set up.
- The software side is moving more slowly.
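A minimal sketch of the logging direction discussed above, assuming the isvc emits structured JSON query records to stdout so the cluster's log shipping (and therefore Logstash) retains them after the pod is killed. The field names and slow-query threshold are illustrative, not an actual schema.

```
import json
import logging
import sys
import time

# Log to stdout so the Kubernetes log pipeline (and Logstash) keeps the
# records even after the pod is killed; field names here are illustrative.
handler = logging.StreamHandler(sys.stdout)
logger = logging.getLogger("revscoring-queries")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

def log_query(rev_id: int, wiki: str, duration_s: float, slow_threshold_s: float = 5.0) -> None:
    """Emit one structured record per scored revision, flagging slow ones."""
    record = {
        "event": "revscoring_query",
        "rev_id": rev_id,
        "wiki": wiki,
        "duration_s": round(duration_s, 3),
        "slow": duration_s > slow_threshold_s,
    }
    logger.info(json.dumps(record))

# Example usage around a scoring call:
start = time.monotonic()
# score = model.score(rev_id)  # placeholder for the actual scoring call
log_query(rev_id=123456789, wiki="enwiki", duration_s=time.monotonic() - start)
```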
Aug 13 2024
Update
- Modernized recommendation API has been deployed to production
- API gateway setup underway
- Article quality LA model: ready on staging and we want to bring it into production. Should we group models into common namespaces? Suggestion: create namespaces per area where the model is used: articles, revisions, images, etc.
Update:
- Waiting for ml-lab machines to be delivered to the eqiad data center.
Infra
- Setting up the puppet roles
- Can't commit puppet roles until the machines are there
- Reached out to vendor
Jul 31 2024
Jul 30 2024
Jun 18 2024
May 21 2024
People can now pip install and use models. Right now we only have a few models; the number should increase over time.
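A hedged sketch of what that workflow looks like. The package name and loader API below are hypothetical stand-ins (stubbed locally so the example runs end to end), not the real published interface.

```
# Hypothetical "pip install a model and use it" workflow.
# In practice: pip install <model package>, import its loader, and call predict().

class _ExampleModel:
    """Stand-in for a packaged model object."""
    def predict(self, rev_id: int, wiki: str) -> dict:
        return {"rev_id": rev_id, "wiki": wiki, "score": 0.42}  # dummy score

def load_model(name: str) -> _ExampleModel:
    """Stand-in for the package's loader, e.g. wmf_example_models.load_model."""
    return _ExampleModel()

model = load_model("revert-risk")  # one of the few currently available models
print(model.predict(rev_id=123456789, wiki="enwiki"))
```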
- Calico improvements make the whole workflow more streamlined.
- Improve our incident response procedure
- Investigate CPU spikes
- Still can't use the GPU with ROCm, but we figured out what the bug is: if the control version is upgraded to Bookworm, it will be fixed.
- Next step is to upgrade ml-staging to Bookworm then test.
- Working on upgrading HF to newer versions with ROCm 6.0. Tested them, they work, and a patch will be posted.
- The goal is to utilize the GPU so we can deploy models from Hugging Face (see the sketch after this list).
- Trying to fix up a Calico networking issue in Kubernetes
- Once credentials are in place, will send the patched revert risk server to ml-staging.
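As a rough illustration of the Hugging Face on GPU goal above: ROCm builds of PyTorch expose the AMD GPU through the torch.cuda API, so a generic checkpoint can be loaded and run as below. The model name is just an example, not what Lift Wing serves.

```
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# ROCm builds of PyTorch expose the AMD GPU through the torch.cuda API.
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")

# Example checkpoint only; any Hugging Face model follows the same pattern.
name = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name).to(device)

inputs = tokenizer("Lift Wing can serve this on the GPU.", return_tensors="pt").to(device)
with torch.no_grad():
    logits = model(**inputs).logits
print(logits.argmax(dim=-1))
```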
May 7 2024
- Narrowed down the cause of the CPU usage spikes to feature extraction in the revscoring isvc. It might be triggered by some specific revids.
- Waiting for the vendor (Supermicro) to finalize the 2x order for ml-staging.
- Chris's guess is that ml-staging will be installed at the end of the quarter.
- Working on plumbing on staging; should be done within a week.
- Feeling good about it
Apr 30 2024
Logging queries and logging when things are slow is the short-term goal. Knowing WHY a query takes a long time is a future question.
We have a theory that the ROCm drivers in the Debian package are not required.
Decision point: Do we upgrade ROCm drivers?
Update: No update
- Rebased code after prototype.
- Waiting for the Istio change needed to create a new service, which is imminent.
- Need to add a new virtual service that is TCP.
Apr 25 2024
Apr 23 2024
- The order for the first 2x GPU chassis is close to complete. There are some supply issues with the chassis, so the question is whether we want to use an upgraded chassis for the ml-staging server.
- Merged the Puppet machinery that generates network policies for the assorted clusters, so we can automatically generate the network policy without the 60 lines of Istio config (see the sketch after this list).
- Will merge a change to the network policy to allow Istio to talk to Cassandra.
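A rough illustration of the generate-the-policy idea above, using a generic Kubernetes NetworkPolicy for concreteness. The real setup is Puppet-generated Calico/Istio configuration that is not reproduced here; the namespace, CIDR, and Cassandra port below are assumptions for the example.

```
import yaml  # PyYAML

# Generate a NetworkPolicy manifest instead of hand-writing the config.
def egress_policy(name: str, namespace: str, dest_cidrs: list[str], ports: list[int]) -> dict:
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "podSelector": {},  # apply to all pods in the namespace
            "policyTypes": ["Egress"],
            "egress": [{
                "to": [{"ipBlock": {"cidr": cidr}} for cidr in dest_cidrs],
                "ports": [{"protocol": "TCP", "port": p} for p in ports],
            }],
        },
    }

# e.g. allow sidecar traffic to reach Cassandra's native protocol port (9042).
print(yaml.safe_dump(egress_policy("allow-cassandra", "recommendation-api",
                                   ["10.0.0.0/8"], [9042])))
```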
Apr 16 2024
Mar 26 2024
At risk because we don't have a GPU in the data centers yet.