This is an automatic job manager for running TPU jobs. It supports auto-resuming jobs on preempted TPUs or after GRPC errors, and monitoring job status.
Here is a quick guide to common usage; you can find more details in the full docs below.
TL;DR usage in two sentences: Use `tpu add-user` to add your username, then go to your working directory (where you have your scripts and code) and use `tpu set-cur 1 username` to set the working directory. Use `tpu run <tpu> username` (e.g. `tpu run v2-32-p2 xibo`) to run the job, and use `tpu monitor/check username` to see the status of all your jobs. (The `tpu run` command will auto-resume the job on preemption/GRPC errors for preemptible TPUs; you don't have to set anything.)
More usage in two sentences: Use `tpu tldr` to see useful commands, and `tpu clear username` to clear the finished/crashed jobs; use `tpu -a alias_name full_name username` (e.g. `tpu -a lr config.training.learning_rate`) to add a new alias, after which you can pass configs such as `tpu run v2-32-6 xibo lr=0.01`. Use `tfind` to search for TPUs in the spreadsheet, `tpu describe <tpu>` to check the environment of a TPU, and `tpu solve <tpu>` to fix the environment automatically.
REMEMBER TO UPDATE YOUR SCRIPTS!
1. Setup (IMPORTANT)
You should update your scripts to the newest version supporting command-line arguments. The newest scripts can be pulled from zhh's repo. The current finishing check is based on the final wandb output, so please make sure your scripts use wandb to log the final output.
Also, this script is not very robust against misuse, so please try not to do out-of-distribution things, for example setting your username to `run`, `false`, `v2-32-2`, or Chinese characters.
Use `tpu add-user` and follow the instructions to add your username.
2. Setting Working Directory & Running/Monitoring Jobs (IMPORTANT)
The working directory is the place where you have your scripts (`staging.sh` etc.) and code.
You can set multiple working directories and choose any of them when running code. The default working directory is `1`.
You can set the working directory, list your working directories, and run a job with:
```bash
tpu set-cur num username           # Set the working directory <num> to the current directory, default directory is 1
tpu ls username                    # List all the working directories
tpu run tpu_name username [dir=1]  # Run the job in working directory <dir>
```
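For instance, a minimal setup session might look like this (the paths and the username `xibo` are illustrative):
```bash
cd ~/proj/experiment1
tpu set-cur 1 xibo   # register this directory as working directory 1
cd ~/proj/experiment2
tpu set-cur 2 xibo   # register a second working directory
tpu ls xibo          # confirm both directories are registered
```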
The `tpu_name` is in the format of the pre-defined TPU aliases, like `v2-32-6`, `v2-32-p1`, or `v4-32-py2`. You can also pass a full name such as `kmh-tpuvm-v2-32-1`.
We also support passing only the TPU type, like `v2`, `v3`, `v23` (`v2` or `v3`), `v3+` (`v3` or `v4`), or `v3-32`, as well as `-n` and `-p` for normal/preemptible TPUs. If you pass only the TPU type, it will show all the TPUs of that type for you to choose from interactively. Alternatively, you can pass `-auto` to auto-select a free TPU of that type; if there are no free TPUs, it will show all the reserved TPUs for you to choose from.
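For example (the username is illustrative):
```bash
tpu run v3 xibo            # choose interactively among all v3 TPUs
tpu run v23 -p xibo -auto  # auto-select a free preemptible v2 or v3 TPU
```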
For all the aliases, use `tpu -lta/-sta` (list/show TPU aliases). You can also add aliases with `tpu -ata alias FULL_TPU_NAME` (add TPU alias). Please don't add aliases that may conflict with other keywords, for example `username`, `tag`, `config`, or `s`.
Example:
```bash
trun/tpu run v2-32-6 xibo               # Run the job in working directory 1 using tpu v2-32-6
trun/tpu run v2-32-p1 lyy 2/dir=2       # Run the job in working directory 2 using tpu v2-32-p1
trun/tpu run v2-32 v3-32 -p xibo -auto  # Auto-select a free preemptible TPU of type v2-32 or v3-32
```
The `tpu run` command opens a monitor window to track all your jobs. Alternatively, you can use:
```bash
tm/tpu monitor username
```
which updates the monitor window every 10 seconds. For one-time checks, use:
```bash
tck/tpu check username
```
The `run` command will automatically resume preempted TPU jobs; see section 2B and section 6 (More on Resuming/Rerunning) for details. Note a common failure mode for resuming: if a job A has been resumed and the new job A' hits an error before it stores a checkpoint, then the third job A'' resuming from A' will restart from the beginning instead of from A's checkpoint. So please keep an eye on your jobs and make sure that checkpoints are stored correctly.
2A. More Directory Operations (OPTIONAL)
```bash
tpu del-dir <num> username # Delete the working directory <num>
tpu swap-dir <num1> <num2> username # Swap the working directory <num1> and <num2>
```
2B. Advanced Running Settings (OPTIONAL)
The `run` command will ask whether to reapply when the TPU is preempted. You can add the flag `-apply` to skip the prompt.
You can add the flag `-q` to skip the monitor window.
You can pass `tag=your_tag` to add a tag to the job, which will be shown in the monitor window.
You can add tags to existing jobs by: `tpu add-tag window_num tag_name username`
You can change the default rules for resuming/rerunning by passing `rule=<rule>` to the `tpu run` command. (Default: auto-resume on GRPC errors, and auto-reapply and resume when preempted, for preemptible TPUs; do nothing for other TPUs, though you can set `rule=resume` to make them resume. See the More on Resuming/Rerunning section for details.)
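For example, a run combining several of these options might look like this (the tag is illustrative):
```bash
tpu run v2-32-p1 xibo -apply -q tag=ablation rule=rerun
```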
2C. Advanced Monitor Configs (OPTIONAL)
The monitor will show four things: the windows number(w
), the directory(d
), the tpu(t
), and the job status(s
). You can choose which to show by adding commands. There's also an additional flag "verbose"(v
) available, meaning to show the messages(cut) from tmux windows even for the running jobs with known status.(Should be used with s
) For example, to only show the working directory and the job status and detailed output of xibo, use:
tpu monitor xibo -dsv
If you don't want `tpu run` to open the monitor window, you can use `tpu set-settings monitor_after_run False username` to disable it. You can also set whether the monitor shows the TPU/directory by default; see the Customizing User Settings section for more details.
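A minimal sketch of tweaking these defaults (setting names are from the Customizing User Settings section; the username is illustrative):
```bash
tpu set-settings monitor_after_run False xibo  # don't open the monitor after tpu run
tpu set-settings monitor_tpu False xibo        # hide the TPU column by default
```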
2D. Spreadsheet Support (OPTIONAL, RECOMMENDED)
The `tpu run` command will automatically set the status in the spreadsheet to running-by-you. If you want to set the notes, you can add the `-ssn` flag (short for `--set-spreadsheet-notes`) to set the notes interactively, or pass `ssn="your notes"` to set them directly. (Note: please don't include `=` in the notes, which may introduce parsing errors, e.g. `ssn="ssn = test"`.) The notes set by `ssn` will be shown as the tag in the monitor window; if you don't want that, add the `-no-tag` flag.
You can also set the notes afterwards with `tpu ssn/asn <tpu> <notes>`, for example `tpu ssn v2-32-6 "This is a test"`. `ssn` resets the notes to `This is a test`, while `asn` appends the notes to the current notes.
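For instance, a hypothetical notes workflow (the notes text is illustrative):
```bash
tpu run v2-32-6 xibo ssn="baseline run" -no-tag  # set notes at launch, skip showing them as a tag
tpu asn v2-32-6 "with warmup"                    # append more notes later
```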
You can use `tpu find <all_tpu_types>` (or `tfind` for short) to look at the status of the TPUs in the spreadsheet. The format of the TPU types is like `v2`, `v3`, `v234` (or `v*`), or `v2-32`. You can also pass `-n` for normal TPUs and `-p` for preemptible TPUs. For example, to show the status of all non-preemptible v3-32 and v4 TPUs, you can run `tpu find v3-32 v4 -n`. If no `v?` is passed, it will show all the TPUs.
You can release a TPU with `tpu release/rel <tpu_name>`, which sets the status and the user to free ('闲的') in the spreadsheet. You can also use `tpu release/rel <tpu_name> <username>` to make sure that the TPU is currently owned by you (recommended).
3. Killing Jobs/Windows & Cleaning up (USEFUL)
As you run more and more jobs, tmux windows pile up and things get messy.
You can use (recommended occasionally):
```bash
tpu clean username
```
to kill all the tmux windows whose jobs are finished/errored/killed.
To kill a job, use:
```bash
tpu kill/kill-job/-k/-kj w=/-w=/window=/<windows_id> username
```
You can also just pass the window id; in that case the command treats the integer in the arguments as the window id. For example, you can use `tpu kill 101 xibo` to kill the job with window id 101, but passing `w=` is safer for future use.
Jobs whose child jobs were rerun/resumed will be killed based on the status of their children. Use `tpu clean username -re` to clean the rerun/resumed jobs too.
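A typical kill-and-clean sequence (the window id and username are illustrative):
```bash
tpu kill w=101 xibo  # kill the job in window 101
tpu clean xibo       # clean up windows of finished/errored/killed jobs
tpu clean xibo -re   # also clean jobs that were rerun/resumed
```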
IMPORTANT: If you have a job with a rerun rule that has hit a GRPC error, please remember to use `clean` to clear it if you manually kill the window; otherwise it may be rerun.
3A. Other killing commands (OPTIONAL)
To kill a specific tmux window (NOT RECOMMENDED):
```bash
tpu -kw/kill-window window_number username
```
After killing windows, some jobs may become "zombies" (i.e., jobs without associated windows). You can use these helpers to clean zombies (supported, but NOT RECOMMENDED):
```bash
tpu -czw username            # Clear all zombie windows
tpu -czj username            # Clear all zombie jobs
tpu clear-finished username  # Clear all finished jobs
tpu clear-error username     # Clear all error jobs
tpu clear-all username       # RECOMMENDED: Clear all finished/error jobs
```
The `clean` command integrates these actions, so using `kill-job` + `clean` is strongly recommended instead of manually killing windows with `tmux kill-window` or exiting the window yourself. (If you prefer to kill windows yourself, we recommend running `tpu clean username` occasionally to clear the job data associated with those windows, or others may get annoying warnings that the TPU is occupied by your dead jobs.)
4. Environment Operations (OPTIONAL)
We support common operations, such as:
```bash
tpu apply/reapply tpu_name  # Apply/reapply the TPU; reapply deletes and recreates the TPU
```
Environment operations are also supported:
```bash
tpu test tpu_name          # Test the TPU environment with commands interactively
tpu mount-disk tpu_name    # Mount the disk and set up wandb for the TPU
tpu describe tpu_name      # Describe the TPU environment
tpu check-status tpu_name  # Check the TPU status (e.g., PREEMPTED, READY, CREATING, etc.)
```
An automatic environment solver is available to address TPU environment issues.
Currently, it handles mounting issues, but contributions are welcome to grow it into a powerful one-line tool for solving the complex TPU environment problems you have encountered. That way, ideally we only need to manually fix each possible issue once!
```bash
tpu solve tpu_name  # Integrated automatic environment solver
```
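A plausible check-and-fix sequence (the TPU alias is illustrative):
```bash
tpu check-status v2-32-6  # confirm the TPU is READY rather than PREEMPTED
tpu describe v2-32-6      # inspect the environment
tpu solve v2-32-6         # attempt the automatic fix
```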
5. Passing Configs in Command Line (OPTIONAL)
We support passing configs on the command line via config aliases or the full config name. You can also set your own config aliases by:
```bash
tpu -a/-alias your_alias FULL_NAME username  # add/change an alias
tpu -sa username                             # list all the aliases
tpu del-config-alias your_alias username     # delete the alias
```
For example, you can do:
```bash
tpu -a lr config.training.learning_rate xibo
```
Then:
```bash
tpu run v2-32-6 xibo lr=0.01
tpu run v2-32-6 xibo config.training.learning_rate=0.01  # This is also supported
```
Some default aliases:
```
"lr": "config.training.learning_rate"
"bs": "config.training.batch_size"
"ep": "config.training.num_epochs"
"wd": "config.training.weight_decay"
"b1": "config.training.adam_b1"
"b2": "config.training.adam_b2"
"ckpt": "config.training.checkpoint_per_epoch"
```
6. More on Resuming/Rerunning (OPTIONAL)
You can manually resume/rerun a job by:
```bash
tpu resume window=<windows_id> username            # resume the job
tpu resume window=<windows_id> tpu=<tpu> username  # resume the job on a new TPU
tpu rerun window=<windows_id> username             # rerun the job
tpu rerun window=<windows_id> tpu=<tpu> username   # rerun the job on a new TPU
```
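For example, if the job in window 101 lost its TPU, a hypothetical recovery could be:
```bash
tpu resume window=101 tpu=v2-32-6 xibo  # continue from the last checkpoint on another TPU
```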
The difference between `resume` and `rerun` is that `resume` loads the job from the last checkpoint, while `rerun` starts a new job from the beginning.
Our default rules for resuming/rerunning are as follows: for preemptible TPUs, we reapply the TPU and resume the job when it is preempted, and resume the job when it encounters a GRPC error; for non-preemptible TPUs, we do not perform any operations.
You can pass `rule=<rule>` to the `tpu run` command to set the rules. The available rules are:
- `reapply`: Reapply when a GRPC error occurs or when preempted.
- `pass` (default for non-preemptible TPUs): Do nothing.
- `rerun`: Rerun when a GRPC error occurs, reapply when preempted.
- `pre` (default for preemptible TPUs): Resume when a GRPC error occurs, reapply and resume when preempted.
- `resume` (recommended for non-preemptible TPUs, may become the default someday): Resume when a GRPC error occurs, pass when preempted.
For example, if you want a job running on a preemptible TPU to be rerun instead of resumed on a GRPC error, you can do:
```bash
tpu run v2-32-p2 xibo rule=rerun
```
If you want a job running on a non-preemptible TPU to be resumed on a GRPC error, you can do:
```bash
tpu run v2-32-2 xibo rule=resume
```
You can see all the rules using:
```bash
tpu check-rules
```
If you want to know inside your program whether the job is a resumed job (for example, to set a new wandb name/note), you can add the `--log-stage` flag to `tpu run`; it will then pass an additional argument `config.stage` indicating the number of resumes of this job. (For example, if the job has been resumed twice, i.e., there are 3 runs in total including the current one, the current run will receive an extra `config.stage=2` config.)
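A minimal sketch of using this at launch:
```bash
tpu run v2-32-p2 xibo --log-stage  # resumed runs will receive config.stage=1, config.stage=2, ...
```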
We have a MONITOR that periodically keeps track of all job statuses and decides whether to resume/rerun. The default checking frequency for jobs to be rerun is about 30 minutes; that is, a job will wait at most 30 minutes to be resumed. If you run a job that hits a GRPC error immediately, you can tell the MONITOR to rerun it immediately by:
```bash
tpu ack
```
Then after no more than 3 minutes you should expect the job to be resumed (if not, contact the admin).
7. Customizing User Settings (OPTIONAL)
We support customizing settings for users, and you can set/get them by:
```bash
tpu set-settings key value username  # set a setting
tpu get-settings username            # get the settings
tpu reset-settings username          # reset all the settings
```
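For example (the values are illustrative; see the defaults below):
```bash
tpu set-settings monitor_upd_time 10 xibo  # refresh the monitor every 10 seconds
tpu set-settings time_zone cn xibo         # switch timestamps to UTC+8
```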
The current default settings and their meanings are:
```python
{
    "monitor_after_run": True,  # Whether to monitor the job after running
    "monitor_upd_time": 5,      # The update time for the monitor window
    "monitor_length": 800,      # The output capturing length used to determine the job status
    "monitor_dir": True,        # Whether to show the working directory in the monitor window
    "monitor_tpu": True,        # Whether to show the TPU name in the monitor window
    "monitor_verbose": False,   # Whether to show the output in the monitor window when the status is known
    "show_length": 200,         # The output capturing length used to show the job output
    "time_zone": "us",          # The user timezone; only 'us' (UTC-4) and 'cn' (UTC+8) are supported for now
    "extra_settings": {}        # Extra settings for future development
}
```
Also, to avoid concurrency issues in tmux window creation, we use a `windows_offset` to offset the window numbers for each user, and the number goes up by 1 for each new job. If you think the offset has grown too large, you can reset it to a smaller number by:
```bash
tpu reset-window-num <num> <username>  # reset the offset to <num>
```
Please be careful not to create conflicts with current jobs.
8. Documentation
```bash
tpu tldr
tpu -h command  # details of the command
```
Some of the commands' help messages are not up to date; please refer to this README for the latest usage.
Code Structure
The user interface is implemented in `tpu.py`, and the specific function implementations are in `utils/`.
`MONITOR.py` does the check-and-resume work and runs all day; it checks the jobs and occasionally runs unit tests according to `data["MONITOR_config"]`. (You can see the full format of `data.json` below; it is the key metadata we maintain to manage all the jobs.)
We use MONITOR to refer to the global monitor process, to distinguish it from the local monitor window of each user.
For `utils/`:
- `desciptions.py` does all the documentation work
- `operate.py` does the TPU remote operations
- `jobs.py` does the job management
- `directories.py` deals with the user working dirs
- `logger.py` does most of the logging with metadata
- `helpers.py` does the helper functions
- `error_handler.py` does the error handling work
- `unit_tests.py` does the unit tests (sanity checks)
- `sheet.py` does the spreadsheet operations
- `develop.py` does the developer tools, to safely modify the metadata and avoid conflicts with current jobs (see more in the next paragraph)
Data Format
The key data is stored in `data.json`, and the program reads and writes it using the API in `data_io.py`, which implements locking (via `lock.json`).
The structure of `data.json` is as follows:
Full `data.json` structure:
```json
{
  "users": {
    "username": {
      "id": 0,
      "name": "username",
      "tmux_name": "username",
      "working_dir": {"1": "/path"},
      "job_data": [],
      "config_aliases": {"lr": "config.training.lr"},
      "settings": {
        "monitor_after_run": true,
        "monitor_upd_time": 5,
        "monitor_length": 800,
        "monitor_verbose": false,
        "monitor_dir": true,
        "monitor_tpu": true,
        "show_length": 300,
        "time_zone": "us"
      },
      "windows_offset": 42,
      "logs": []
    }
  },
  "user_list": ["username"],
  "id_list": [0],
  "id_user_dict": {"0": "username"},
  "user_id_dict": {"username": 0},
  "tpu_aliases": {"v2-1": "kmh-tpuvm-v2-32-1"},
  "all_tpus": {
    "europe-west4-a": ["..."],
    "us-central1-a": ["..."],
    "us-central2-b": ["..."],
    "preemptible": ["..."]
  },
  "monitor_config": {
    "test_freq": 3600,
    "checking_freq": 600
  },
  "wandb_api_key": "...",
  "conda_env_name": "NNX",
  "monitor_all_check_time": 20,
  "MONITOR_logs": [],
  "ack_MONITOR": false
}
```
Each job is described as:
Full job structure:
```json
{
  "user": "username",
  "windows_id": 1,
  "job_dir_id": 1,
  "job_dir": "/your/code/path",
  "tpu": "kmh-tpuvm-v2-32-preemptible-1",
  "job_tags": null,
  "log_dir": "/your/log/path",
  "staage_dir": "/your/staging/path",
  "extra_configs": "--lr=0.01",
  "status": "running",
  "error": null,
  "stage": 0,
  "monitor": true,
  "rules": {
    "preempted": "reapply",
    "grpc": "resume"
  },
  "extra_msgs": {},
  "start_time": "20250420_011026",
  "customized_settings": {}
}
```
TODO
- More testing/docs
- Support restarting TPUs
- Customized monitor window
- Auto-choose the TPU to run a job
- More automatic environment solvers
- Per-user logging so that you can check what has happened since you last checked