This is an automatic job manager for running TPU jobs. It supports auto-resuming jobs on preempted TPUs or after GRPC errors, and monitoring job status.
Here is a quick guide to common usage; you can find more details in the full docs below.
TL;DR usage in two sentences: Use `tpu add-user` to add your username, then go to your working directory (where you have your scripts and code) and use `tpu set-cur 1 username` to set the working directory. Use `tpu run <tpu> username` (e.g. `tpu run v2-32-p2 xibo`) to run the job, and use `tpu monitor/check username` to see the status of all your jobs. (The `tpu run` command will auto-resume the job on preemption/GRPC errors for preemptible TPUs; you don't have to set anything.)
More usage in two sentences: Use `tpu tldr` to see useful commands, and `tpu clear username` to clear the finished/crashed jobs; use `tpu -a alias_name full_name username` (e.g. `tpu -a lr config.training.learning_rate`) to add a new alias, after which you can pass configs such as `tpu run v2-32-6 xibo lr=0.01`. Use `tfind` to search for TPUs in the spreadsheet, `tpu describe <tpu>` to check the environment of a TPU, and `tpu solve <tpu>` to fix the environment automatically.
REMEMBER TO UPDATE YOUR SCRIPTS!
1. Setup (IMPORTANT)
You should update your scripts to the newest version supporting command-line arguments. The newest scripts can be pulled from zhh's repo. The current finishing check is based on the final wandb output, so please make sure your scripts use wandb to log the final output.
Also, this script is not very robust against misuse, so please try not to do out-of-distribution things, for example setting your username to `run`, `false`, `v2-32-2`, or Chinese characters.
Use `tpu add-user` and follow the instructions to add your username.
2. Setting Working Directory & Running/Monitoring Jobs (IMPORTANT)
The working directory is the place where you have your scripts (`staging.sh` etc.) and code.
You can set multiple working directories and choose any of them when running code. The default working directory is `1`.
You can set the working directory, list your working directories, and run a job with:
```bash
tpu set-cur num username           # Set the working directory <num> to the current directory, default directory is 1
tpu ls username                    # List all the working directories
tpu run tpu_name username [dir=1]  # Run the job in working directory <dir>
```
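For instance, a minimal setup session might look like this (the paths and the username `xibo` are illustrative):
```bash
cd ~/proj/experiment1
tpu set-cur 1 xibo   # register this directory as working directory 1
cd ~/proj/experiment2
tpu set-cur 2 xibo   # register a second working directory
tpu ls xibo          # confirm both directories are registered
```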
The `tpu_name` is in the format of the pre-defined TPU aliases, like `v2-32-6`, `v2-32-p1`, or `v4-32-py2`. You can also pass a full name such as `kmh-tpuvm-v2-32-1`.
We also support passing only the TPU type, like `v2`, `v3`, `v23` (`v2` or `v3`), `v3+` (`v3` or `v4`), or `v3-32`, as well as `-n` and `-p` for normal/preemptible TPUs. If you pass only the TPU type, it will show all the TPUs of that type for you to choose from interactively. Alternatively, you can pass `-auto` to auto-select a free TPU of that type; if there are no free TPUs, it will show all the reserved TPUs for you to choose from.
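For example (the username is illustrative):
```bash
tpu run v3 xibo            # choose interactively among all v3 TPUs
tpu run v23 -p xibo -auto  # auto-select a free preemptible v2 or v3 TPU
```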
For all the aliases, use `tpu -lta/-sta` (list/show TPU aliases). You can also add aliases with `tpu -ata alias FULL_TPU_NAME` (add TPU alias). Please don't add aliases that may conflict with other keywords, for example `username`, `tag`, `config`, or `s`.
Example:
```bash
trun/tpu run v2-32-6 xibo               # Run the job in working directory 1 using tpu v2-32-6
trun/tpu run v2-32-p1 lyy 2/dir=2       # Run the job in working directory 2 using tpu v2-32-p1
trun/tpu run v2-32 v3-32 -p xibo -auto  # Auto-select a free preemptible TPU of type v2-32 or v3-32
```
The `tpu run` command opens a monitor window to track all your jobs. Alternatively, you can use:
```bash
tm/tpu monitor username
```
which updates the monitor window every 10 seconds. For one-time checks, use:
```bash
tck/tpu check username
```
The `run` command will automatically resume preempted TPU jobs; see section 2B and section 6 (More on Resuming/Rerunning) for details. Note a common failure mode for resuming: if a job A has been resumed and the new job A' hits an error before it stores a checkpoint, then the third job A'' resuming from A' will restart from the beginning instead of from A's checkpoint. So please keep an eye on your jobs and make sure that checkpoints are stored correctly.
2A. More Directory Operations (OPTIONAL)
```bash
tpu del-dir <num> username # Delete the working directory <num>
tpu swap-dir <num1> <num2> username # Swap the working directory <num1> and <num2>
```
2B. Advanced Running Settings (OPTIONAL)
The `run` command will ask whether to reapply when the TPU is preempted. You can add the flag `-apply` to skip the prompt.
You can add the flag `-q` to skip the monitor window.
You can pass `tag=your_tag` to add a tag to the job, which will be shown in the monitor window.
You can add tags to existing jobs by: `tpu add-tag window_num tag_name username`
You can change the default rules for resuming/rerunning by passing `rule=<rule>` to the `tpu run` command. (Default: auto-resume on GRPC errors, and auto-reapply and resume when preempted, for preemptible TPUs; do nothing for other TPUs, though you can set `rule=resume` to make them resume. See the More on Resuming/Rerunning section for details.)
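For example, a run combining several of these options might look like this (the tag is illustrative):
```bash
tpu run v2-32-p1 xibo -apply -q tag=ablation rule=rerun
```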
2C. Advanced Monitor Configs (OPTIONAL)
The monitor will show four things: the windows number(w
), the directory(d
), the tpu(t
), and the job status(s
). You can choose which to show by adding commands. There's also an additional flag "verbose"(v
) available, meaning to show the messages(cut) from tmux windows even for the running jobs with known status.(Should be used with s
) For example, to only show the working directory and the job status and detailed output of xibo, use:
tpu monitor xibo -dsv
If you don't want `tpu run` to open the monitor window, you can use `tpu set-settings monitor_after_run False username` to disable it. You can also set whether the monitor shows the TPU/directory by default; see the Customizing User Settings section for more details.
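A minimal sketch of tweaking these defaults (setting names are from the Customizing User Settings section; the username is illustrative):
```bash
tpu set-settings monitor_after_run False xibo  # don't open the monitor after tpu run
tpu set-settings monitor_tpu False xibo        # hide the TPU column by default
```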
2D. Spreadsheet Support (OPTIONAL, RECOMMENDED)
The `tpu run` command will automatically set the status in the spreadsheet to running-by-you. If you want to set the notes, you can add the `-ssn` flag (short for `--set-spreadsheet-notes`) to set the notes interactively, or pass `ssn="your notes"` to set them directly. (Note: please don't include `=` in the notes, which may introduce parsing errors, e.g. `ssn="ssn = test"`.) The notes set by `ssn` will be shown as the tag in the monitor window; if you don't want that, add the `-no-tag` flag.
You can also set the notes afterwards with `tpu ssn/asn <tpu> <notes>`, for example `tpu ssn v2-32-6 "This is a test"`. `ssn` resets the notes to `This is a test`, while `asn` appends the notes to the current notes.
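For instance, a hypothetical notes workflow (the notes text is illustrative):
```bash
tpu run v2-32-6 xibo ssn="baseline run" -no-tag  # set notes at launch, skip showing them as a tag
tpu asn v2-32-6 "with warmup"                    # append more notes later
```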
You can use `tpu find <all_tpu_types>` (or `tfind` for short) to look at the status of the TPUs in the spreadsheet. The format of the TPU types is like `v2`, `v3`, `v234` (or `v*`), or `v2-32`. You can also pass `-n` for normal TPUs and `-p` for preemptible TPUs. For example, to show the status of all non-preemptible v3-32 and v4 TPUs, you can run `tpu find v3-32 v4 -n`. If no `v?` is passed, it will show all the TPUs.
You can release a TPU with `tpu release/rel <tpu_name>`, which sets the status and the user to free ('闲的') in the spreadsheet. You can also use `tpu release/rel <tpu_name> <username>` to make sure that the TPU is currently owned by you (recommended).
3. Killing Jobs/Windows & Cleaning up (USEFUL)
As you run more and more jobs, tmux windows pile up and things get messy.
You can use (recommended occasionally):
```bash
tpu clean username
```
to kill all the tmux windows whose jobs are finished/errored/killed.
To kill a job, use:
```bash
tpu kill/kill-job/-k/-kj w=/-w=/window=/<windows_id> username
```
You can also just pass the window id; in that case the command treats the integer in the arguments as the window id. For example, you can use `tpu kill 101 xibo` to kill the job with window id 101, but passing `w=` is safer for future use.
Jobs whose child jobs were rerun/resumed will be killed based on the status of their children. Use `tpu clean username -re` to clean the rerun/resumed jobs too.
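A typical kill-and-clean sequence (the window id and username are illustrative):
```bash
tpu kill w=101 xibo  # kill the job in window 101
tpu clean xibo       # clean up windows of finished/errored/killed jobs
tpu clean xibo -re   # also clean jobs that were rerun/resumed
```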
IMPORTANT: If you have a job with a rerun rule that has hit a GRPC error, please remember to use `clean` to clear it if you manually kill the window; otherwise it may be rerun.
3A. Other killing commands (OPTIONAL)
To kill a specific tmux window (NOT RECOMMENDED):
```bash
tpu -kw/kill-window window_number username
```
After killing windows, some jobs may become "zombies" (i.e., jobs without associated windows). You can use these helpers to clean zombies (supported, but NOT RECOMMENDED):
```bash
tpu -czw username            # Clear all zombie windows
tpu -czj username            # Clear all zombie jobs
tpu clear-finished username  # Clear all finished jobs
tpu clear-error username     # Clear all error jobs
tpu clear-all username       # RECOMMENDED: Clear all finished/error jobs
```
The `clean` command integrates these actions, so using `kill-job` + `clean` is strongly recommended instead of manually killing windows with `tmux kill-window` or exiting the window yourself. (If you prefer to kill windows yourself, we recommend running `tpu clean username` occasionally to clear the job data associated with those windows, or others may get annoying warnings that the TPU is occupied by your dead jobs.)
4. Environment Operations (OPTIONAL)
We support common operations, such as:
```bash
tpu apply/reapply tpu_name  # Apply/reapply the TPU; reapply deletes and recreates the TPU
```
Environment operations are also supported:
```bash
tpu test tpu_name          # Test the TPU environment with commands interactively
tpu mount-disk tpu_name    # Mount the disk and set up wandb for the TPU
tpu describe tpu_name      # Describe the TPU environment
tpu check-status tpu_name  # Check the TPU status (e.g., PREEMPTED, READY, CREATING, etc.)
```
An automatic environment solver is available to address TPU environment issues.
Currently, it handles mounting issues, but contributions are welcome to grow it into a powerful one-line tool for solving the complex TPU environment problems you have encountered. That way, ideally we only need to manually fix each possible issue once!
```bash
tpu solve tpu_name  # Integrated automatic environment solver
```
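A plausible check-and-fix sequence (the TPU alias is illustrative):
```bash
tpu check-status v2-32-6  # confirm the TPU is READY rather than PREEMPTED
tpu describe v2-32-6      # inspect the environment
tpu solve v2-32-6         # attempt the automatic fix
```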
5. Passing Configs in Command Line (OPTIONAL)
We support passing configs on the command line via config aliases or the full config name. You can also set your own config aliases by:
```bash
tpu -a/-alias your_alias FULL_NAME username  # add/change an alias
tpu -sa username                             # list all the aliases
tpu del-config-alias your_alias username     # delete the alias
```
For example, you can do:
```bash
tpu -a lr config.training.learning_rate xibo
```
Then:
```bash
tpu run v2-32-6 xibo lr=0.01
tpu run v2-32-6 xibo config.training.learning_rate=0.01  # This is also supported
```
Some default aliases:
```
"lr": "config.training.learning_rate"
"bs": "config.training.batch_size"
"ep": "config.training.num_epochs"
"wd": "config.training.weight_decay"
"b1": "config.training.adam_b1"
"b2": "config.training.adam_b2"
"ckpt": "config.training.checkpoint_per_epoch"
```
6. More on Resuming/Rerunning (OPTIONAL)
You can manually resume/rerun a job by:
```bash
tpu resume window=<windows_id> username            # resume the job
tpu resume window=<windows_id> tpu=<tpu> username  # resume the job on a new TPU
tpu rerun window=<windows_id> username             # rerun the job
tpu rerun window=<windows_id> tpu=<tpu> username   # rerun the job on a new TPU
```
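For example, if the job in window 101 lost its TPU, a hypothetical recovery could be:
```bash
tpu resume window=101 tpu=v2-32-6 xibo  # continue from the last checkpoint on another TPU
```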
The difference between `resume` and `rerun` is that `resume` loads the job from the last checkpoint, while `rerun` starts a new job from the beginning.
Our default rules for resuming/rerunning are as follows: for preemptible TPUs, we reapply the TPU and resume the job when it is preempted, and resume the job when it encounters a GRPC error; for non-preemptible TPUs, we do not perform any operations.
You can pass `rule=<rule>` to the `tpu run` command to set the rules. The available rules are:
- `reapply`: Reapply when a GRPC error occurs or when preempted.
- `pass` (default for non-preemptible TPUs): Do nothing.
- `rerun`: Rerun when a GRPC error occurs, reapply when preempted.
- `pre` (default for preemptible TPUs): Resume when a GRPC error occurs, reapply and resume when preempted.
- `resume` (recommended for non-preemptible TPUs, may become the default someday): Resume when a GRPC error occurs, pass when preempted.
For example, if you want a job running on a preemptible TPU to be rerun instead of resumed on a GRPC error, you can do:
```bash
tpu run v2-32-p2 xibo rule=rerun
```
If you want a job running on a non-preemptible TPU to be resumed on a GRPC error, you can do:
```bash
tpu run v2-32-2 xibo rule=resume
```
You can see all the rules using:
```bash
tpu check-rules
```
If you want to know inside your program whether the job is a resumed job (for example, to set a new wandb name/note), you can add the `--log-stage` flag to `tpu run`; it will then pass an additional argument `config.stage` indicating the number of resumes of this job. (For example, if the job has been resumed twice, i.e., there are 3 runs in total including the current one, the current run will receive an extra `config.stage=2` config.)
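A minimal sketch of using this at launch:
```bash
tpu run v2-32-p2 xibo --log-stage  # resumed runs will receive config.stage=1, config.stage=2, ...
```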
We have a MONITOR that periodically keeps track of all job statuses and decides whether to resume/rerun. The default checking frequency for jobs to be rerun is about 30 minutes; that is, a job will wait at most 30 minutes to be resumed. If you run a job that hits a GRPC error immediately, you can tell the MONITOR to rerun it immediately by:
```bash
tpu ack
```
Then after no more than 3 minutes you should expect the job to be resumed (if not, contact the admin).
7. Customizing User Settings (OPTIONAL)
We support customizing settings for users, and you can set/get them by:
```bash
tpu set-settings key value username  # set a setting
tpu get-settings username            # get the settings
tpu reset-settings username          # reset all the settings
```
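For example (the values are illustrative; see the defaults below):
```bash
tpu set-settings monitor_upd_time 10 xibo  # refresh the monitor every 10 seconds
tpu set-settings time_zone cn xibo         # switch timestamps to UTC+8
```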
The current default settings and their meanings are:
```python
{
    "monitor_after_run": True,  # Whether to monitor the job after running
    "monitor_upd_time": 5,      # The update time for the monitor window
    "monitor_length": 800,      # The output capturing length used to determine the job status
    "monitor_dir": True,        # Whether to show the working directory in the monitor window
    "monitor_tpu": True,        # Whether to show the TPU name in the monitor window
    "monitor_verbose": False,   # Whether to show the output in the monitor window when the status is known
    "show_length": 200,         # The output capturing length used to show the job output
    "time_zone": "us",          # The user timezone; only 'us' (UTC-4) and 'cn' (UTC+8) are supported for now
    "extra_settings": {}        # Extra settings for future development
}
```
Also, to avoid concurrency issues in tmux window creation, we use a `windows_offset` to offset the window numbers for each user, and the number goes up by 1 for each new job. If you think the offset has grown too large, you can reset it to a smaller number by:
```bash
tpu reset-window-num <num> <username>  # reset the offset to <num>
```
Please be careful not to create conflicts with current jobs.
8. Documentation
```bash
tpu tldr
tpu -h command  # details of the command
```
Some of the commands' help messages are not up to date; please refer to this README for the latest usage.
Code Structure
The user interface is implemented in `tpu.py`, and the specific function implementations are in `utils/`.
`MONITOR.py` does the check-and-resume work and runs all day; it checks the jobs and occasionally runs unit tests according to `data["MONITOR_config"]`. (You can see the full format of `data.json` below; it is the key metadata we maintain to manage all the jobs.)
We use MONITOR to refer to the global monitor process, to distinguish it from the local monitor window of each user.
For `utils/`:
- `desciptions.py` does all the documentation work
- `operate.py` does the TPU remote operations
- `jobs.py` does the job management
- `directories.py` deals with the user working dirs
- `logger.py` does most of the logging with metadata
- `helpers.py` does the helper functions
- `error_handler.py` does the error handling work
- `unit_tests.py` does the unit tests (sanity checks)
- `sheet.py` does the spreadsheet operations
- `develop.py` does the developer tools, to safely modify the metadata and avoid conflicts with current jobs (see more in the next paragraph)
Data Format
The key data is stored in `data.json`, and the program reads and writes it using the API in `data_io.py`, which implements locking (via `lock.json`).
The structure of `data.json` is as follows:
Full `data.json` structure:
```json
{
  "users": {
    "username": {
      "id": 0,
      "name": "username",
      "tmux_name": "username",
      "working_dir": {"1": "/path"},
      "job_data": [],
      "config_aliases": {"lr": "config.training.lr"},
      "settings": {
        "monitor_after_run": true,
        "monitor_upd_time": 5,
        "monitor_length": 800,
        "monitor_verbose": false,
        "monitor_dir": true,
        "monitor_tpu": true,
        "show_length": 300,
        "time_zone": "us"
      },
      "windows_offset": 42,
      "logs": []
    }
  },
  "user_list": ["username"],
  "id_list": [0],
  "id_user_dict": {"0": "username"},
  "user_id_dict": {"username": 0},
  "tpu_aliases": {"v2-1": "kmh-tpuvm-v2-32-1"},
  "all_tpus": {
    "europe-west4-a": ["..."],
    "us-central1-a": ["..."],
    "us-central2-b": ["..."],
    "preemptible": ["..."]
  },
  "monitor_config": {
    "test_freq": 3600,
    "checking_freq": 600
  },
  "wandb_api_key": "...",
  "conda_env_name": "NNX",
  "monitor_all_check_time": 20,
  "MONITOR_logs": [],
  "ack_MONITOR": false
}
```
Each job is described as:
Full job structure:
```json
{
  "user": "username",
  "windows_id": 1,
  "job_dir_id": 1,
  "job_dir": "/your/code/path",
  "tpu": "kmh-tpuvm-v2-32-preemptible-1",
  "job_tags": null,
  "log_dir": "/your/log/path",
  "staage_dir": "/your/staging/path",
  "extra_configs": "--lr=0.01",
  "status": "running",
  "error": null,
  "stage": 0,
  "monitor": true,
  "rules": {
    "preempted": "reapply",
    "grpc": "resume"
  },
  "extra_msgs": {},
  "start_time": "20250420_011026",
  "customized_settings": {}
}
```
TODO
- More testing/docs
- Support restarting TPUs
- Customized monitor window
- Auto-choose the TPU to run a job
- More automatic environment solvers
- Per-user logging so that you can check what has happened since you last checked