
OSError Too many open files: /tmp/tmphv67gzd0wandb-media #2825


Open
sarmientoj24 opened this issue Oct 23, 2021 · 42 comments · Fixed by #4617
Labels
a:sdk (Area: sdk related issues) · c:artifacts (Candidate for artifact branch) · c:sdk:media (Component: Relating to media)

Comments

@sarmientoj24

I have been using YOLOv5's wandb integration and it is giving me this error:

File "/opt/conda/lib/python3.8/shutil.py", line 712, in rmtree
OSError: [Errno 24] Too many open files: '/tmp/tmphv67gzd0wandb-media'
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/weakref.py", line 642, in _exitfunc
  File "/opt/conda/lib/python3.8/weakref.py", line 566, in __call__
  File "/opt/conda/lib/python3.8/tempfile.py", line 817, in _cleanup
  File "/opt/conda/lib/python3.8/tempfile.py", line 813, in _rmtree
  File "/opt/conda/lib/python3.8/shutil.py", line 714, in rmtree
  File "/opt/conda/lib/python3.8/shutil.py", line 712, in rmtree
OSError: [Errno 24] Too many open files: '/tmp/tmphv67gzd0wandb-media'
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/weakref.py", line 642, in _exitfunc
  File "/opt/conda/lib/python3.8/weakref.py", line 566, in __call__
  File "/opt/conda/lib/python3.8/tempfile.py", line 817, in _cleanup
  File "/opt/conda/lib/python3.8/tempfile.py", line 813, in _rmtree
  File "/opt/conda/lib/python3.8/shutil.py", line 714, in rmtree
  File "/opt/conda/lib/python3.8/shutil.py", line 712, in rmtree
OSError: [Errno 24] Too many open files: '/tmp/tmphv67gzd0wandb-media'

when doing genetic-algorithm hyperparameter evolution. Any idea why wandb is doing this?

@sarmientoj24 sarmientoj24 added the a:sdk Area: sdk related issues label Oct 23, 2021
@ramit-wandb
Contributor

Hi @sarmientoj24!

This is a known bug that we are currently working on fixing. We will let you know when we have a fix.

@Antsypc

I got OSError: Too many open files too. I run wandb in an Optuna subprocess.

def optuna_objective(trial):
    wandb.init()
    ...
    ...
    wandb.finish()
    return loss

Error Traceback

Traceback (most recent call last):
  File "/anaconda3/envs/tf2.6/lib/python3.8/site-packages/wandb/sdk/wandb_init.py", line 870, in init
  File "/anaconda3/envs/tf2.6/lib/python3.8/site-packages/wandb/sdk/wandb_init.py", line 600, in init
  File "/anaconda3/envs/tf2.6/lib/python3.8/site-packages/wandb/sdk/wandb_run.py", line 1738, in _on_start
  File "/anaconda3/envs/tf2.6/lib/python3.8/site-packages/wandb/sdk/wandb_run.py", line 1707, in _console_start
OSError: [Errno 24] Too many open files: '/workspace/wandb/offline-run-20211026_081159-2i9isn2w/files/output.log'

Environment

macOS Catalina 10.15.7
Python 3.8.12
wandb 0.12.5

I temporarily fixed the problem by setting ulimit -n 2048 before every Python run.
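
For reference, roughly the same effect can be had from inside the script with the standard-library resource module; this is just a minimal sketch (the value 2048 mirrors the ulimit above and is otherwise arbitrary):

import resource

# Raise this process's soft open-files limit toward the hard limit,
# roughly what `ulimit -n 2048` does in the launching shell.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
target = 2048 if hard == resource.RLIM_INFINITY else min(2048, hard)
resource.setrlimit(resource.RLIMIT_NOFILE, (max(soft, target), hard))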

@vanpelt
Contributor
vanpelt commented Oct 27, 2021

Thanks @Antsypc, we're aware of the issue and are actively working on a fix. It should be released in the next version of our client library due out in a week or so.

@github-actions
Contributor

This issue is stale because it has been open 60 days with no activity.

@github-actions github-actions bot added the stale label Dec 27, 2021
@sydholl sydholl closed this as completed May 9, 2022
@maxzw
maxzw commented Jun 22, 2022

I got OSError: Too many open files too. I run wandb in an Optuna subprocess.

def optuna_objective(trial):
    wandb.init()
    ...
    ...
    wandb.finish()
    return loss

Error Traceback

Traceback (most recent call last):
  File "/anaconda3/envs/tf2.6/lib/python3.8/site-packages/wandb/sdk/wandb_init.py", line 870, in init
  File "/anaconda3/envs/tf2.6/lib/python3.8/site-packages/wandb/sdk/wandb_init.py", line 600, in init
  File "/anaconda3/envs/tf2.6/lib/python3.8/site-packages/wandb/sdk/wandb_run.py", line 1738, in _on_start
  File "/anaconda3/envs/tf2.6/lib/python3.8/site-packages/wandb/sdk/wandb_run.py", line 1707, in _console_start
OSError: [Errno 24] Too many open files: '/workspace/wandb/offline-run-20211026_081159-2i9isn2w/files/output.log'

Environment

macOS Catalina 10.15.7
Python 3.8.12
wandb 0.12.5

I temporarily fixed the problem by setting ulimit -n 2048 before every Python run.

I have the same problem at this moment. I'm running HPO on a SLURM cluster and do not have permission to change the values mentioned above. Are there any other options for resolving this issue?

@exalate-issue-sync exalate-issue-sync bot reopened this Jun 22, 2022
@MBakirWB

@maxzw, thank you for writing in. This bug is still being addressed. We will update the community here once a fix has been implemented.
Regards

@github-actions github-actions bot removed the stale label Jul 25, 2022
@AIchberger

Any updates on this?

@da2r-20
da2r-20 commented Oct 18, 2022

Any updates?
We are also experiencing it

@AIchberger
AIchberger commented Oct 18, 2022

So, actually my problem was related to multiprocessing on a remote server (OS: Rocky Linux), which could be fixed by setting ulimit -n unlimited (or any high number) together with:

import torch.multiprocessing

torch.multiprocessing.set_sharing_strategy('file_system')  # share tensors via the filesystem instead of file descriptors
torch.multiprocessing.set_start_method('spawn')

@LucHayward

Just to confirm, I am still seeing this error. It seems to be related to starting too many jobs too rapidly in succession (my HPC uses SLURM) and is generally avoided by allowing some time (say 60 s) between jobs.

@kptkin
Collaborator
kptkin commented Mar 3, 2023

Hopefully fixed with PR #4617. If it is still an issue, feel free to re-open.

@kptkin kptkin closed this as completed Mar 3, 2023
@guyfreund
guyfreund commented Apr 15, 2023

Experiencing the same issue - it causes wandb to stop syncing my run logs to the app (screenshot attached).
@kptkin please reopen

@abhinav-kashyap-asus

I am still experiencing this :( This has to be opened again.

@kptkin kptkin reopened this May 3, 2023
@kptkin kptkin added the c:artifacts Candidate for artifact branch label May 3, 2023
@kptkin
Collaborator
kptkin commented May 3, 2023

@guyfreund Based on the traceback it looks like an issue that stems from artifacts. A few requests to help us solve this:

  • Could you provide the debug logs (you can find these logs in your run directory under run)?
  • If possible, a repro script or an explanation of your execution setup
  • Are you using the latest version, and did upgrading help with your issue?

@abhinav-kashyap-asus same request regarding additional information to help us debug this issue.

One of my colleagues will ticket this issue and tag the relevant engineer.

@jxmorris12

Happened to me too. Killed a long-running training run, huge inconvenience

@jonasjuerss

Is there any workaround? I am running jobs on SLURM and at some point wandb just stops logging because of this. I stopped the run via wandb, hoping it would sync after canceling, but it's just stuck in "stopping" forever. I really rely on this for my thesis.

@jxmorris12

@jonasjuerss This workaround worked for me (so far):

import resource
resource.setrlimit(
    resource.RLIMIT_CORE, (resource.RLIM_INFINITY, resource.RLIM_INFINITY)
)

@jonasjuerss

@jxmorris12 Unfortunately, that didn't help. Thanks anyway for posting, maybe it works for others.

@mukobi
mukobi commented May 19, 2023

For me, I figured out this issue arose when I was logging a large number of artifacts (in my case, even just a single wandb.Table, but at every time step for a few thousand steps) and using a cluster with a low ulimit -n (4096 in my case). I fixed it by changing my code to not log artifacts, meaning I lost a significant amount of the value of wandb in the first place, but at least I was able to run my experiments and have them finish without stalling.

If logging many artifacts is the main reason for this issue, then my guess is the wandb team made a naive assumption about how users would use their tools: they built software that opens, and keeps open, at least one file for every artifact logged, and that design did not generalize to real-world use cases. Especially with the new NLP tools, I hope this issue gets more attention, as it's really useful to be able to get a bunch of tabular data at each step.

@abhinav-kashyap-asus

I solved it by not logging the artefacts.
This is sad. There should be some way to do this more efficiently.
I wasted a lot of time with runs crashing or the logging stopping entirely.

@Jiuzhouh
Jiuzhouh commented Jun 5, 2023

I solved it by not logging the artefacts.
This is sad. There should be some way to do this more efficiently.
I wasted a lot of time with runs crashing or the logging stopping entirely.

How to disable logging the artefacts?

@mukobi
mukobi commented Jun 15, 2023

I solved it by not logging the artefacts.
This is sad. There should be some way to do this more efficiently.
I wasted a lot of time with runs crashing or the logging stopping entirely.

How to disable logging the artefacts?

In your code, do not log artifacts if you can avoid it (e.g. don't do wandb.log({... wandb.Table(...)}) or other things that create artifacts).
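
A minimal sketch of what that looks like in practice (the project name, metric name, and values here are illustrative, not from this thread): keep wandb.log calls to plain scalars, which do not create per-step media or artifact files.

import wandb

run = wandb.init(project="no-artifact-logging")  # hypothetical project name

for step in range(100):
    loss = 1.0 / (step + 1)  # placeholder metric
    # Plain scalar logging; no wandb.Table / wandb.Image, so no per-step
    # media files are created.
    run.log({"loss": loss})

run.finish()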

I hope some wandb people are looking at this. This seems pretty bad and renders wandb quite unusable for researchers using shared compute clusters who need to log many artifacts (especially for NLP or CV).

@MBakirWB
MBakirWB commented Jun 20, 2023

@mukobi, @Jiuzhouh, @jxmorris12, @abhinav-kashyap-asus: thank you for the continued follow-up on this, and our apologies that it got buried. We are going to revisit it, but would appreciate your assistance in understanding the issue and reproducing it. Could you provide the following? Either leave it here in a comment or email your responses to mohammad.bakir@wandb.com with the subject line wandb-#2825.

  1. For the run(s) that crash, the complete stack trace, plus the debug.log and debug-internal.log files located in the working directory of your project under wandb/<run-folder>/<logs>
  2. A brief description of your experiment setup (single/multiprocessing, multi-node, use of wandb integrations, etc.), the type of environment you are running in, the size of artifacts and number of files logged, and the wandb version
  3. A code example for reproduction

@shawnlewis
Contributor

We'll get this fixed. If anyone has a script to reproduce that would help.

@iwishiwasaneagle

I had this issue during a sweep. My hacky fix is to restart the sweep agent after every run and bump ulimit -n to 4096 from 1024. Hopefully this will work... this is a super annoying bug.
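
A rough sketch of the restart-per-run part of this workaround (the sweep id is a placeholder; this assumes the wandb agent CLI's --count flag, which limits how many runs a single agent executes):

import subprocess

sweep_id = "entity/project/abc123"  # hypothetical sweep id

for _ in range(50):  # total number of sweep runs to attempt
    # Launch a fresh agent process for each run so that any leaked file
    # descriptors are released when the process exits.
    subprocess.run(["wandb", "agent", "--count", "1", sweep_id], check=False)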

@mukobi
mukobi commented Jun 30, 2023

@MBakirWB @shawnlewis Here's some simple code to repro this; it artificially sets the process's open-files limit to 64 for testing, though from experience I expect this to also happen with 1024 or even 4096.

"""Reproduce https://github.com/wandb/wandb/issues/2825"""

import resource

import wandb

# Lower this process's soft open-files limit to 64 for the test; calling
# os.system("ulimit -n 64") would only affect a throwaway subshell.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
resource.setrlimit(resource.RLIMIT_NOFILE, (64, hard))

wandb.init()

for i in range(1000):
    # Log a very simple artifact at each iteration
    data = f"Step {i}"
    wandb.log({"my_table": wandb.Table(columns=["text"], data=[[data]])})
    print(f'Logged "{data}"')

Around step 186 I start to get the Too many open files error; then in the early 200s I start to see weirder tracebacks and errors. Finally, it hangs after Logged "Step 999" rather than finishing the run and exiting.

(terminal screenshot attached)

When I check the web UI, the logs and the view of the table seem to have stopped after Step 21, not even reaching the first CLI-visible error at step 186 (screenshot attached).

Run on Ubuntu via Windows Subsystem for Linux with the following uname -a (though first encountered on a Linux SLURM cluster for my university).

Linux DESKTOP-7VO7NFL 5.15.90.1-microsoft-standard-WSL2 #1 SMP Fri Jan 27 02:56:13 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

@ahalev
Contributor
ahalev commented Sep 1, 2023

Also having this issue; following.

@UmaisZahid
UmaisZahid commented Sep 10, 2023

I have also hit this issue - lost 12 long-running experiments, brutal!

@MaxWolf-01
MaxWolf-01 commented Oct 14, 2023

I'm also encountering this issue while using wandb.Table to log text predictions. We've already attempted various workarounds to log data less frequently, but the issue remains a massive annoyance for NLP tasks. Due to this bug, our models have crashed multiple times, many hours into training.

I don't know if logging the table only at the end of a long run would fix the issue. However, we can't always guarantee that we'll reach the end of the loop. And with StreamTables (a core feature, one would assume) waiting to be implemented over the past years, we need to resort to all sorts of ugly workarounds. See: #2981 (comment)

One workaround we tried (without success) was just adding the data to the wandb Table and logging a copy of the accumulated table less frequently, roughly the pattern sketched below.
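
For context, a sketch of that attempted workaround (column names, values, and the project name are illustrative; this is not presented as a fix): accumulate rows in a plain Python list and periodically log a fresh wandb.Table built from it.

import wandb

columns = ["step", "prediction"]
rows = []  # accumulate plain rows instead of mutating a single Table

run = wandb.init(project="table-accumulation-sketch")  # hypothetical project name

for step in range(1000):
    rows.append([step, f"prediction at step {step}"])
    if step % 100 == 99:
        # Log a copy of everything accumulated so far, less frequently
        # than every step.
        run.log({"predictions": wandb.Table(columns=columns, data=rows)})

run.finish()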

@mcao516
mcao516 commented Nov 4, 2023

Still experiencing this issue. Following.

@wrmthorne

Also experiencing this issue. The code I have been using is an adaptation of the PPO multi-adapter example from TRL, with a logging frequency of every step. Each step sends a table with the query, response and reward. After just under 1000 steps, I saw this error and nothing more was reported. My training still ran to completion and the model checkpoints were saved, but I have no data on loss, rewards, etc.

@spfrommer

Also experiencing this issue when logging many tables.

@Kaiyotech

This still exists. Somehow. After 2 years.

@Reza-esfandiarpoor
Reza-esfandiarpoor commented Dec 17, 2023

Also facing the same issue with wandb==0.16.1!

@eminb61
eminb61 commented Dec 26, 2023

The problem still exists

@kptkin kptkin added the c:sdk:media Component: Relating to media label Feb 8, 2024
@kptkin
Collaborator
kptkin commented Feb 8, 2024

Hi all, we released a new version of the SDK, 0.16.3, which should hopefully mitigate this problem for some of you; see the release notes (PR #6891). We are slowly working toward a full solution, but hopefully this is a step in the right direction. Please give it a try and let us know if it helped.

@pvti
pvti commented Feb 15, 2024

In my case, this error happens while using the sweep functionality. I notice that it usually stops around the 24th run of a grid search.

  File "/home/van-tien.pham/anaconda3/lib/python3.9/site-packages/git/cmd.py", line 1315, in _call_process
    return self.execute(call, **exec_kwargs)
  File "/home/van-tien.pham/anaconda3/lib/python3.9/site-packages/git/cmd.py", line 985, in execute
    proc = Popen(
  File "/home/van-tien.pham/anaconda3/lib/python3.9/subprocess.py", line 951, in __init__
    self._execute_child(args, executable, preexec_fn, close_fds,
  File "/home/van-tien.pham/anaconda3/lib/python3.9/subprocess.py", line 1720, in _execute_child
    errpipe_read, errpipe_write = os.pipe()
OSError: [Errno 24] Too many open files
wandb: ERROR Abnormal program exit
Run 2pz72mmg errored: Exception('problem')
wandb: ERROR Run 2pz72mmg errored: Exception('problem')
wandb: Sweep Agent: Waiting for job.
^Cwandb: Ctrl + C detected. Stopping sweep.

@kptkin
Collaborator
kptkin commented Feb 17, 2024

@pvtien96 do you think you will be able to provide a small repro example to help us further debug this?

@anmolmann

Hi all, we released a new version of the SDK, 0.18.3, with a new backend, which should hopefully mitigate this problem for you. Please give it a try and let us know if it helped. If not, could you please share a small repro (maybe a code snippet, etc.) which can help us investigate this issue further?

@MaxWolf-01

The issue still persists in 0.18.5.

@zeyuanyin

The issue is still there in 0.19.5
