
OSError Too many open files: /tmp/tmphv67gzd0wandb-media #2825


Open
sarmientoj24 opened this issue Oct 23, 2021 · 42 comments · Fixed by #4617
Labels
a:sdk (Area: sdk related issues) · c:artifacts (Candidate for artifact branch) · c:sdk:media (Component: Relating to media)

Comments

@sarmientoj24

I have been using YOLOv5's wandb integration and it is giving me this error:

File "/opt/conda/lib/python3.8/shutil.py", line 712, in rmtree
OSError: [Errno 24] Too many open files: '/tmp/tmphv67gzd0wandb-media'
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/weakref.py", line 642, in _exitfunc
  File "/opt/conda/lib/python3.8/weakref.py", line 566, in __call__
  File "/opt/conda/lib/python3.8/tempfile.py", line 817, in _cleanup
  File "/opt/conda/lib/python3.8/tempfile.py", line 813, in _rmtree
  File "/opt/conda/lib/python3.8/shutil.py", line 714, in rmtree
  File "/opt/conda/lib/python3.8/shutil.py", line 712, in rmtree
OSError: [Errno 24] Too many open files: '/tmp/tmphv67gzd0wandb-media'
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/weakref.py", line 642, in _exitfunc
  File "/opt/conda/lib/python3.8/weakref.py", line 566, in __call__
  File "/opt/conda/lib/python3.8/tempfile.py", line 817, in _cleanup
  File "/opt/conda/lib/python3.8/tempfile.py", line 813, in _rmtree
  File "/opt/conda/lib/python3.8/shutil.py", line 714, in rmtree
  File "/opt/conda/lib/python3.8/shutil.py", line 712, in rmtree
OSError: [Errno 24] Too many open files: '/tmp/tmphv67gzd0wandb-media'

when doing genetic-algorithm hyperparameter evolution. Any idea why wandb is doing this?

@sarmientoj24 sarmientoj24 added the a:sdk Area: sdk related issues label Oct 23, 2021
@ramit-wandb
Contributor

Hi @sarmientoj24!

This is a known bug that we are currently working on fixing. We will let you know when we have a fix.

@Antsypc

I got OSError: Too many open files too. I run wandb in an Optuna subprocess.

def optuna_objective(trial):
    wandb.init()
    ...
    ...
    wandb.finish()
    return loss

Error Traceback

Traceback (most recent call last):
  File "/anaconda3/envs/tf2.6/lib/python3.8/site-packages/wandb/sdk/wandb_init.py", line 870, in init
  File "/anaconda3/envs/tf2.6/lib/python3.8/site-packages/wandb/sdk/wandb_init.py", line 600, in init
  File "/anaconda3/envs/tf2.6/lib/python3.8/site-packages/wandb/sdk/wandb_run.py", line 1738, in _on_start
  File "/anaconda3/envs/tf2.6/lib/python3.8/site-packages/wandb/sdk/wandb_run.py", line 1707, in _console_start
OSError: [Errno 24] Too many open files: '/workspace/wandb/offline-run-20211026_081159-2i9isn2w/files/output.log'

Environment

macOS Catalina 10.15.7
Python 3.8.12
wandb 0.12.5

I temporarily fixed the problem by setting ulimit -n 2048 before every Python run.
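
For reference, roughly the same effect can be had from inside the script with the standard-library resource module; this is just a minimal sketch (the value 2048 mirrors the ulimit above and is otherwise arbitrary):

import resource

# Raise this process's soft open-files limit toward the hard limit,
# roughly what `ulimit -n 2048` does in the launching shell.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
target = 2048 if hard == resource.RLIM_INFINITY else min(2048, hard)
resource.setrlimit(resource.RLIMIT_NOFILE, (max(soft, target), hard))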

@vanpelt
Contributor
vanpelt commented Oct 27, 2021

Thanks @Antsypc, we're aware of the issue and are actively working on a fix. It should be released in the next version of our client library due out in a week or so.

@github-actions
Contributor

This issue is stale because it has been open 60 days with no activity.

@github-actions github-actions bot added the stale label Dec 27, 2021
@sydholl sydholl closed this as completed May 9, 2022
@maxzw
maxzw commented Jun 22, 2022

I got OSError: Too many open files too. I run wandb in an Optuna subprocess.

def optuna_objective(trial):
    wandb.init()
    ...
    ...
    wandb.finish()
    return loss

Error Traceback

Traceback (most recent call last):
  File "/anaconda3/envs/tf2.6/lib/python3.8/site-packages/wandb/sdk/wandb_init.py", line 870, in init
  File "/anaconda3/envs/tf2.6/lib/python3.8/site-packages/wandb/sdk/wandb_init.py", line 600, in init
  File "/anaconda3/envs/tf2.6/lib/python3.8/site-packages/wandb/sdk/wandb_run.py", line 1738, in _on_start
  File "/anaconda3/envs/tf2.6/lib/python3.8/site-packages/wandb/sdk/wandb_run.py", line 1707, in _console_start
OSError: [Errno 24] Too many open files: '/workspace/wandb/offline-run-20211026_081159-2i9isn2w/files/output.log'

Environment

macOS Catalina 10.15.7
Python 3.8.12
wandb 0.12.5

I temporarily fixed the problem by setting ulimit -n 2048 before every Python run.

I have the same problem at this moment. I'm running HPO on a SLURM cluster and do not have permission to change the values mentioned above. Are there any other options for resolving this issue?

@exalate-issue-sync exalate-issue-sync bot reopened this Jun 22, 2022
@MBakirWB

@maxzw, thank you for writing in. This bug is still being addressed. We will update the community here once a fix has been implemented.
Regards

@github-actions github-actions bot removed the stale label Jul 25, 2022
@AIchberger

Any updates on this?

@da2r-20
da2r-20 commented Oct 18, 2022

Any updates?
We are also experiencing it

@AIchberger
AIchberger commented Oct 18, 2022

So, actually my problem was related to multiprocessing on a remote server (OS: Rocky Linux), which could be fixed by setting ulimit -n unlimited (or any high number) together with:

import torch.multiprocessing

torch.multiprocessing.set_sharing_strategy('file_system')  # share tensors via the filesystem instead of file descriptors
torch.multiprocessing.set_start_method('spawn')

@LucHayward

Just to confirm, I am still seeing this error. It seems to be related to starting too many jobs too rapidly in succession (my HPC uses SLURM) and is generally avoided by allowing some time (say 60 s) between jobs.

@kptkin
Collaborator
kptkin commented Mar 3, 2023

Hopefully fixed with PR #4617. If it is still an issue, feel free to re-open.

@kptkin kptkin closed this as completed Mar 3, 2023
@guyfreund
guyfreund commented Apr 15, 2023

Experiencing the same issue - it causes wandb to stop syncing my run logs to the app (screenshot attached).
@kptkin please reopen

@abhinav-kashyap-asus

I am still experiencing this :( This has to be opened again.

@kptkin kptkin reopened this May 3, 2023
@kptkin kptkin added the c:artifacts Candidate for artifact branch label May 3, 2023
@kptkin
Collaborator
kptkin commented May 3, 2023

@guyfreund Based on the traceback it looks like an issue that stems from artifacts. A few requests to help us solve this:

  • Could you provide the debug logs (you can find these logs in your run directory under run)?
  • If possible, a repro script or an explanation of your execution setup
  • Are you using the latest version, and did upgrading help with your issue?

@abhinav-kashyap-asus same request regarding additional information to help us debug this issue.

One of my colleagues will ticket this issue and tag the relevant engineer.

@jxmorris12

Happened to me too. Killed a long-running training run, huge inconvenience

@jonasjuerss

Is there any workaround? I am running jobs on SLURM and at some point wandb just stops logging because of this. I stopped the run via wandb, hoping it would sync after canceling, but it's just stuck in "stopping" forever. I really rely on this for my thesis.

@jxmorris12

@jonasjuerss This workaround worked for me (so far):

import resource
resource.setrlimit(
    resource.RLIMIT_CORE, (resource.RLIM_INFINITY, resource.RLIM_INFINITY)
)

@jonasjuerss

@jxmorris12 Unfortunately, that didn't help. Thanks anyway for posting, maybe it works for others.

@mukobi
mukobi commented May 19, 2023

For me, I figured out this issue arose when I was logging a large number of artifacts (in my case, even just a single wandb.Table, but at every time step for a few thousand steps) and using a cluster with a low ulimit -n (4096 in my case). I fixed it by changing my code to not log artifacts, meaning I lost a significant amount of the value of wandb in the first place, but at least I was able to run my experiments and have them finish without stalling.

If logging many artifacts is the main reason for this issue, then my guess is the wandb team made a naive assumption about how users would use their tools: they built software that opens, and keeps open, at least one file for every artifact logged, and that design did not generalize to real-world use cases. Especially with the new NLP tools, I hope this issue gets more attention, as it's really useful to be able to get a bunch of tabular data at each step.

@abhinav-kashyap-asus

I solved it by not logging the artefacts.
This is sad. There should be some way to do this more efficiently.
I wasted a lot of time with runs crashing or the logging stopping entirely.

@Jiuzhouh
Jiuzhouh commented Jun 5, 2023

I solved it by not logging the artefacts.
This is sad. There should be some way to do this more efficiently.
I wasted a lot of time with runs crashing or the logging stopping entirely.

How to disable logging the artefacts?

@mukobi
mukobi commented Jun 15, 2023

I solved it by not logging the artefacts.
This is sad. There should be some way to do this more efficiently.
I wasted a lot of time with runs crashing or the logging stopping entirely.

How to disable logging the artefacts?

In your code, do not log artifacts if you can avoid it (e.g. don't do wandb.log({... wandb.Table(...)}) or other things that create artifacts).
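
A minimal sketch of what that looks like in practice (the project name, metric name, and values here are illustrative, not from this thread): keep wandb.log calls to plain scalars, which do not create per-step media or artifact files.

import wandb

run = wandb.init(project="no-artifact-logging")  # hypothetical project name

for step in range(100):
    loss = 1.0 / (step + 1)  # placeholder metric
    # Plain scalar logging; no wandb.Table / wandb.Image, so no per-step
    # media files are created.
    run.log({"loss": loss})

run.finish()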

I hope some wandb people are looking at this. This seems pretty bad and renders wandb quite unusable for researchers using shared compute clusters who need to log many artifacts (especially for NLP or CV).

@MBakirWB
MBakirWB commented Jun 20, 2023

@mukobi, @Jiuzhouh, @jxmorris12, @abhinav-kashyap-asus: thank you for the continued follow-up on this, and our apologies that it got buried. We are going to revisit it, but would appreciate your assistance in understanding the issue and reproducing it. Could you provide the following? Either leave it here in a comment or email your responses to mohammad.bakir@wandb.com with the subject line wandb-#2825.

  1. For the run(s) that crash, the complete stack trace, plus the debug.log and debug-internal.log files located in the working directory of your project under wandb/<run-folder>/<logs>
  2. A brief description of your experiment setup (single/multiprocessing, multi-node, use of wandb integrations, etc.), the type of environment you are running in, the size of artifacts and number of files logged, and the wandb version
  3. A code example for reproduction

@shawnlewis
Contributor

We'll get this fixed. If anyone has a script to reproduce that would help.

@iwishiwasaneagle

I had this issue during a sweep. My hacky fix is to restart the sweep agent after every run and bump ulimit -n to 4096 from 1024. Hopefully this will work... this is a super annoying bug.
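
A rough sketch of the restart-per-run part of this workaround (the sweep id is a placeholder; this assumes the wandb agent CLI's --count flag, which limits how many runs a single agent executes):

import subprocess

sweep_id = "entity/project/abc123"  # hypothetical sweep id

for _ in range(50):  # total number of sweep runs to attempt
    # Launch a fresh agent process for each run so that any leaked file
    # descriptors are released when the process exits.
    subprocess.run(["wandb", "agent", "--count", "1", sweep_id], check=False)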

@mukobi
mukobi commented Jun 30, 2023

@MBakirWB @shawnlewis Here's some simple code to repro this; it artificially sets the process's open-files limit to 64 for testing, though from experience I expect this to also happen with 1024 or even 4096.

"""Reproduce https://github.com/wandb/wandb/issues/2825"""

import resource

import wandb

# Lower this process's soft open-files limit to 64 for the test; calling
# os.system("ulimit -n 64") would only affect a throwaway subshell.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
resource.setrlimit(resource.RLIMIT_NOFILE, (64, hard))

wandb.init()

for i in range(1000):
    # Log a very simple artifact at each iteration
    data = f"Step {i}"
    wandb.log({"my_table": wandb.Table(columns=["text"], data=[[data]])})
    print(f'Logged "{data}"')

Around step 186 I start to get the Too many open files error; then in the early 200s I start to see weirder tracebacks and errors. Finally, it hangs after Logged "Step 999" rather than finishing the run and exiting.

(terminal screenshot attached)

When I check the web UI, the logs and the view of the table seem to have stopped after Step 21, not even reaching the first CLI-visible error at step 186 (screenshot attached).

Run on Ubuntu via Windows Subsystem for Linux with the following uname -a (though first encountered on a Linux SLURM cluster for my university).

Linux DESKTOP-7VO7NFL 5.15.90.1-microsoft-standard-WSL2 #1 SMP Fri Jan 27 02:56:13 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

@ahalev
Contributor
ahalev commented Sep 1, 2023

Also having this issue; following.

@UmaisZahid
UmaisZahid commented Sep 10, 2023

I have also hit this issue - lost 12 long-running experiments, brutal!

@MaxWolf-01
MaxWolf-01 commented Oct 14, 2023

I'm also encountering this issue while using wandb.Table to log text predictions. We've already attempted various workarounds to log data less frequently, but the issue remains a massive annoyance for NLP tasks. Due to this bug, our models have crashed multiple times, many hours into training.

I don't know if logging the table only at the end of a long run would fix the issue. However, we can't always guarantee that we'll reach the end of the loop. And with StreamTables (a core feature, one would assume) waiting to be implemented over the past years, we need to resort to all sorts of ugly workarounds. See: #2981 (comment)

One workaround we tried (without success) was just adding the data to the wandb Table and logging a copy of the accumulated table less frequently, roughly the pattern sketched below.
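
For context, a sketch of that attempted workaround (column names, values, and the project name are illustrative; this is not presented as a fix): accumulate rows in a plain Python list and periodically log a fresh wandb.Table built from it.

import wandb

columns = ["step", "prediction"]
rows = []  # accumulate plain rows instead of mutating a single Table

run = wandb.init(project="table-accumulation-sketch")  # hypothetical project name

for step in range(1000):
    rows.append([step, f"prediction at step {step}"])
    if step % 100 == 99:
        # Log a copy of everything accumulated so far, less frequently
        # than every step.
        run.log({"predictions": wandb.Table(columns=columns, data=rows)})

run.finish()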

@mcao516
mcao516 commented Nov 4, 2023

Still experiencing this issue. Following.

@wrmthorne

Also experiencing this issue. The code I have been using is an adaptation of the PPO multi-adapter example from TRL, with a logging frequency of every step. Each step sends a table with the query, response and reward. After just under 1000 steps, I saw this error and nothing more was reported. My training still ran to completion and the model checkpoints were saved, but I have no data on loss, rewards, etc.

@spfrommer

Also experiencing this issue when logging many tables.

@Kaiyotech

This still exists. Somehow. After 2 years.

@Reza-esfandiarpoor
Reza-esfandiarpoor commented Dec 17, 2023

Also facing the same issue with wandb==0.16.1!

@eminb61
eminb61 commented Dec 26, 2023

The problem still exists

@kptkin kptkin added the c:sdk:media Component: Relating to media label Feb 8, 2024
@kptkin
Collaborator
kptkin commented Feb 8, 2024

Hi all, we released a new version of the SDK, 0.16.3, which should hopefully mitigate this problem for some of you; see the release notes (PR #6891). We are slowly working toward a full solution, but hopefully this is a step in the right direction. Please give it a try and let us know if it helped.

@pvti
pvti commented Feb 15, 2024

In my case, this error happens while using the sweep functionality. I notice that it usually stops around the 24th run of a grid search.

  File "/home/van-tien.pham/anaconda3/lib/python3.9/site-packages/git/cmd.py", line 1315, in _call_process
    return self.execute(call, **exec_kwargs)
  File "/home/van-tien.pham/anaconda3/lib/python3.9/site-packages/git/cmd.py", line 985, in execute
    proc = Popen(
  File "/home/van-tien.pham/anaconda3/lib/python3.9/subprocess.py", line 951, in __init__
    self._execute_child(args, executable, preexec_fn, close_fds,
  File "/home/van-tien.pham/anaconda3/lib/python3.9/subprocess.py", line 1720, in _execute_child
    errpipe_read, errpipe_write = os.pipe()
OSError: [Errno 24] Too many open files
wandb: ERROR Abnormal program exit
Run 2pz72mmg errored: Exception('problem')
wandb: ERROR Run 2pz72mmg errored: Exception('problem')
wandb: Sweep Agent: Waiting for job.
^Cwandb: Ctrl + C detected. Stopping sweep.

@kptkin
Collaborator
kptkin commented Feb 17, 2024

@pvtien96 do you think you will be able to provide a small repro example to help us further debug this?

@anmolmann

Hi all, we released a new version of the SDK, 0.18.3, with a new backend, which should hopefully mitigate this problem for you. Please give it a try and let us know if it helped. If not, could you please share a small repro (maybe a code snippet, etc.) which can help us investigate this issue further?

@MaxWolf-01

The issue still persists in 0.18.5.

@zeyuanyin

The issue is still there in 0.19.5
