OSError Too many open files: /tmp/tmphv67gzd0wandb-media #2825
Hi @sarmientoj24! This is a known bug that we are currently working on fixing. We will let you know when we have a fix.
I got this error with:

```python
def optuna_objective(trial):
    wandb.init()
    ...
    ...
    wandb.finish()
    return loss
```

Error Traceback
Environment
I fixed the problem temporarily by setting:
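The specific setting is missing above; a minimal sketch of one common approach, raising the per-process open-file limit via the standard-library resource module (an assumption, not necessarily the fix used here):

```python
import resource

# Read the current soft/hard limits on open file descriptors (Unix-only)
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"open-file limit: soft={soft}, hard={hard}")

# Raise the soft limit to the hard limit (roughly `ulimit -n <hard>` in a shell);
# do this before wandb.init() so child processes inherit the higher limit.
resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))
```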
Thanks @Antsypc, we're aware of the issue and are actively working on a fix. It should be released in the next version of our client library, due out in a week or so.
This issue is stale because it has been open 60 days with no activity.
I have the same problem at this moment. I'm running HPO on a SLURM cluster and do not have permission to change the values mentioned above. Are there any other options for resolving this issue?
@maxzw, thank you for writing in. This bug is still being addressed. We will update the community here once a fix has been implemented.
Any updates on this?
Any updates?
So, actually my problem was related to multiprocessing on a remote server (OS: Rocky Linux), which could be fixed by setting
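The setting mentioned in the comment above is also missing; a hypothetical sketch of the kind of change that often addresses fork-related multiprocessing problems on Linux, switching the start method to spawn (an assumption, not necessarily the commenter's fix):

```python
import multiprocessing as mp

if __name__ == "__main__":
    # Use "spawn" instead of the Linux default "fork"; set this once,
    # before any processes, pools, or wandb runs are created.
    mp.set_start_method("spawn", force=True)
```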
Just to confirm, I am still seeing this error. It seems to be related to starting too many jobs (my HPC uses Slurm) too rapidly in succession, and it is generally avoided by allowing some time (say 60s) between jobs.
Hopefully fixed with PR #4617. If it is still an issue, feel free to re-open.
Experiencing the same; it causes wandb to stop syncing my run logs to the app.
I am still experiencing this :( This has to be opened again.
@guyfreund Based on the traceback, it looks like an issue that stems from artifacts. A few requests to help us solve this.
@abhinav-kashyap-asus same request regarding additional information to help us debug this issue. One of my colleagues will ticket this issue and tag the relevant engineer.
Happened to me too. It killed a long-running training run; huge inconvenience.
Is there any workaround? I am running jobs on Slurm, and at some point wandb just stops logging because of this. I stopped the run via wandb, hoping it would sync after canceling, but it's just stuck in "stopping" forever. I really rely on this for my thesis.
@jonasjuerss This workaround worked for me (so far):
@jxmorris12 Unfortunately, that didn't help. Thanks anyway for posting; maybe it works for others.
For me, I figured out this issue arose when I was logging a large number of artifacts (in my case, even just a single wandb.Table logged at each step).

If logging many artifacts is the main reason for this issue, then my guess is that the wandb team made a naive assumption about how users would use their tools: they built software that opens (and keeps open) at least one file for every artifact logged, and this did not generalize to real-world use cases.

Especially with the new NLP tools, I hope this issue gets more attention, as it's really useful to be able to log a bunch of tabular data at each step.
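A small diagnostic sketch for checking whether file descriptors accumulate while tables are logged; it assumes Linux, where /proc/self/fd lists the process's open descriptors, and the project name is illustrative:

```python
import os

import wandb


def open_fd_count() -> int:
    # Number of file descriptors currently open in this process (Linux-specific)
    return len(os.listdir("/proc/self/fd"))


wandb.init(project="fd-growth-check")  # illustrative project name
for i in range(200):
    wandb.log({"t": wandb.Table(columns=["text"], data=[[f"row {i}"]])})
    if i % 50 == 0:
        print(f"step {i}: {open_fd_count()} open file descriptors")
```

Note that wandb also runs an internal service process, so descriptor growth may happen there rather than in the script's own process; this sketch only observes the latter.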
I solved it by not logging the artifacts.
How do you disable logging the artefacts?
In your code, do not log artifacts if you can avoid it (e.g. don't log a wandb.Table at every step).

I hope some wandb people are looking at this. This seems pretty bad and renders wandb quite unusable for researchers using shared compute clusters who need to log many artifacts (especially for NLP or CV).
@mukobi, @Jiuzhouh, @jxmorris12, @abhinav-kashyap-asus: thank you for the continued follow-up on this, and our apologies that it got buried. We are going to revisit this, but would appreciate your assistance in understanding the issue and reproducing it. Could you provide the following, either here in a comment or by emailing your responses to mohammad.bakir@wandb.com with the subject line:
We'll get this fixed. If anyone has a script to reproduce it, that would help.
I had this issue during a sweep. My hacky fix is to restart the sweep agent after every run and bump the open-file limit.
@MBakirWB @shawnlewis Here's some simple code to repro this:

```python
"""Reproduce https://github.com/wandb/wandb/issues/2825"""
import os

import wandb

# Limit the file limit to 64 with ulimit -n 64
os.system("ulimit -n 64")

wandb.init()
for i in range(1000):
    # Log a very simple artifact at each iteration
    data = f"Step {i}"
    wandb.log({"my_table": wandb.Table(columns=["text"], data=[[data]])})
    print(f'Logged "{data}"')
```

Around step 186 I start to get the OSError. When I check the web UI, the logs and the view of the table seem to have stopped after Step 21, not even getting to the CLI-visible first error at step 186. Run on Ubuntu via Windows Subsystem for Linux with the following environment:
Also having this issue; following.
I have also hit this issue; lost 12 long-running experiments, brutal!
I'm also encountering this issue while using wandb.Table to log text predictions. We've already attempted various workarounds to log data less frequently, but the issue remains a massive annoyance for NLP tasks. Due to this bug, our models have crashed multiple times, many hours into training.

I don't know if logging the table only at the end of a long run would fix the issue; however, we can't always guarantee that we'll reach the end of the loop. And with StreamTables (a core feature, one would assume) waiting to be implemented over the past years, we need to resort to all sorts of ugly workarounds. See: #2981 (comment)

One workaround we tried (but failed) was just adding the data to the wandb Table and logging a copy of the accumulated table less frequently.
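For reference, a minimal sketch of the accumulate-then-log-periodically pattern described above (which the commenter reports did not resolve the problem); the project name and logging frequency are illustrative assumptions:

```python
import wandb

run = wandb.init(project="table-logging-demo")  # illustrative project name
table = wandb.Table(columns=["step", "text"])
log_every = 100  # illustrative frequency

for step in range(1000):
    table.add_data(step, f"prediction at step {step}")
    if (step + 1) % log_every == 0:
        # Log a copy so the original table keeps accumulating rows locally
        run.log({"predictions": wandb.Table(columns=table.columns, data=table.data)})

run.finish()
```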
Still experiencing this issue. Following.
Also experiencing this issue. The code I have been using is an adaptation of the PPO multi-adapter from TRL, with a logging frequency of every step. Each step sends a table with the query, response, and reward. After just under 1000 steps, I saw this error and nothing more was reported. My training still ran to completion and the model checkpoints were saved, but I have no data on loss, rewards, etc.
Also experiencing this issue when logging many tables.
This still exists. Somehow. After 2 years.
Also facing the same issue with
The problem still exists |
Hi all, we released a new version of the SDK, 0.16.3, that hopefully should mitigate this problem for some of you; see the release notes (PR #6891). We are slowly working on a full solution, but hopefully this is a step in the right direction. Please give it a try and let us know if it helped.
In my case, this error happens while using the sweeping function. I notice that it usually stops around the 24th sweep in a grid search.

```
  File "/home/van-tien.pham/anaconda3/lib/python3.9/site-packages/git/cmd.py", line 1315, in _call_process
    return self.execute(call, **exec_kwargs)
  File "/home/van-tien.pham/anaconda3/lib/python3.9/site-packages/git/cmd.py", line 985, in execute
    proc = Popen(
  File "/home/van-tien.pham/anaconda3/lib/python3.9/subprocess.py", line 951, in __init__
    self._execute_child(args, executable, preexec_fn, close_fds,
  File "/home/van-tien.pham/anaconda3/lib/python3.9/subprocess.py", line 1720, in _execute_child
    errpipe_read, errpipe_write = os.pipe()
OSError: [Errno 24] Too many open files
wandb: ERROR Abnormal program exit
Run 2pz72mmg errored: Exception('problem')
wandb: ERROR Run 2pz72mmg errored: Exception('problem')
wandb: Sweep Agent: Waiting for job.
^Cwandb: Ctrl + C detected. Stopping sweep.
```
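A hypothetical minimal sweep reproduction along the lines of the report above; the parameter grid, project name, and per-trial table logging are illustrative assumptions, not taken from the original code:

```python
"""Hypothetical repro: many grid-search trials driven by one long-lived agent process."""
import wandb


def train():
    with wandb.init() as run:
        # Log a small table per trial, similar to the per-step table logging reported above
        run.log({"preds": wandb.Table(columns=["x"], data=[[run.config.x]])})


sweep_config = {
    "method": "grid",
    "parameters": {"x": {"values": list(range(50))}},  # 50 trials, illustrative
}

sweep_id = wandb.sweep(sweep_config, project="sweep-fd-repro")  # illustrative project
# A single agent process runs all trials; watch whether open files accumulate across trials
wandb.agent(sweep_id, function=train)
```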
@pvtien96 do you think you will be able to provide a small repro example to help us further debug this?
Hi all, we released a new version of the SDK, 0.18.3, with a new backend, which hopefully should mitigate this problem for you. Please give it a try and let us know if it helped. If not, could you please share a small repro (a code snippet, etc.) that can help us investigate this issue further?
The issue still persists in 0.18.5.
The issue is still there in 0.19.5 |
I have been using yolov5's wandb integration and it is giving me this error when doing a Genetic Algorithm for hyperparameter evolution. Any idea why wandb is doing this?