
HAI Platform


The HAI Time-Sharing Scheduling Training Platform, deployable via docker-compose or Kubernetes, provides:

  • Time-sharing scheduling for training tasks
  • Training task management
  • Jupyter development container management
  • Studio UI
  • Haienv runtime environment management

External Dependencies

  1. A centralized storage system (e.g., NFS, Ceph, Weka)
    • Stores user code
    • Stores code execution logs
    • Stores required k8s configurations
  2. Kubernetes cluster with all compute nodes joined
  3. Recommended: Compute nodes with RDMA support and the rdma-sriov device plugin installed (a quick check follows this list)
    • If unavailable, configure HAS_RDMA_HCA_RESOURCE: '0' in launcher.manager_envs
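
Before deploying, items 2 and 3 can be sanity-checked with standard kubectl queries (a sketch; the grep pattern is only a guess at how your device plugin names the RDMA resource):

# All compute nodes should appear and be Ready
kubectl get nodes

# Look for an RDMA resource advertised by the rdma-sriov device plugin
kubectl describe nodes | grep -i rdma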

Quick Start

  1. Build

    Build the all-in-one hai-platform image:

    Note: To include the haienv 202207 runtime (with CUDA and Torch) as the training image, set export BUILD_TRAIN_IMAGE=1. For custom training images, see Appendix: Database Initialization for the train_environment configuration.

    # Replace IMAGE_REPO with your own repo
    $ IMAGE_REPO=registry.cn-hangzhou.aliyuncs.com/hfai/hai-platform bash one/release.sh
      Build success:
        hai-platform image: registry.cn-hangzhou.aliyuncs.com/hfai/hai-platform:fa07f13
        hai-cli wheels:
          /home/hai-platform/build/hai-1.0.0+fa07f13-py3-none-any.whl
          /home/hai-platform/build/haienv-1.4.1+fa07f13-py3-none-any.whl
          /home/hai-platform/build/haiworkspace-1.0.0+fa07f13-py3-none-any.whl

    Install hai-cli:

    pip3 install /home/hai-platform/build/hai-1.0.0+fa07f13-py3-none-any.whl
    pip3 install /home/hai-platform/build/haienv-1.4.1+fa07f13-py3-none-any.whl
    pip3 install /home/hai-platform/build/haiworkspace-1.0.0+fa07f13-py3-none-any.whl

    Prebuilt images and CLI:

    # Base image
    registry.cn-hangzhou.aliyuncs.com/hfai/hai-platform:latest
    # With haienv 202207 runtime
    registry.cn-hangzhou.aliyuncs.com/hfai/hai-platform:latest-202207
    
    pip3 install hai --extra-index-url https://pypi.hfai.high-flyer.cn/simple --trusted-host pypi.hfai.high-flyer.cn -U
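
    The prebuilt image can also be pulled directly instead of building locally (a sketch; pick the tag that matches whether you need the bundled haienv runtime):

    $ docker pull registry.cn-hangzhou.aliyuncs.com/hfai/hai-platform:latest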
  2. Deploy to Kubernetes

  • Get help:
    $ hai-up -h
    Usage:
      hai-up.sh config/run/up/dryrun/down/upgrade [option]
      Commands:
        config:   Print config script
        run/up:   Deploy platform
        dryrun:   Generate config template
        down:     Remove deployment
        upgrade:  Update hai-cli/hai-up
    
      Options:
        -h/--help:      Show help
        -p/--provider:  k8s/docker-compose (default: k8s)
        -c/--config:    Use environment config file
    
      Deployment Steps:
        1. Ensure:
           - Kubernetes cluster with LB and ingress
           - Shared filesystem mounted on all nodes
           - Docker/docker-compose installed (for docker-compose provider)
        2. Generate config: "hai-up config > config.sh"
        3. Deploy: "hai-up run -c config.sh"

Configure environment variables (shared FS path, node groups, training image, mounts, etc.) and deploy:

hai-up config > config.sh
# Edit config.sh

hai-up run -c config.sh

Use hai-cli to initialize and submit jobs:

HAI_SERVER_ADDR=$(kubectl -n hai-platform get svc hai-platform-svc -o jsonpath='{.status.loadBalancer.ingress[0].ip}')

# Use token from USER_INFO
hai-cli init ${TOKEN} --url http://${HAI_SERVER_ADDR}

# Python files should be in workspace: ${SHARED_FS_ROOT}/hai-platform/workspace/{user.user_name}
hai-cli python ${SHARED_FS_ROOT}/hai-platform/workspace/$(whoami)/test.py -- -n 1
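
If test.py does not exist yet, a placeholder can be created in the workspace first (a minimal sketch; the script body is purely illustrative):

mkdir -p ${SHARED_FS_ROOT}/hai-platform/workspace/$(whoami)
cat > ${SHARED_FS_ROOT}/hai-platform/workspace/$(whoami)/test.py <<'EOF'
# Placeholder training script: just report which node it runs on
import socket
print(f"hello from {socket.gethostname()}")
EOF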

To stop: hai-up down

Appendix: Configuration

Network Ports

Default ports:

  • 80: Web service
  • 5432: PostgreSQL
  • 6379: Redis
  • 8080: Studio UI
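
Once the platform is up, the web service can be probed on port 80 (a sketch, assuming HAI_SERVER_ADDR is set as in the Quick Start):

# Expect an HTTP response from the platform web service
curl -sSI http://${HAI_SERVER_ADDR}/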

Kubernetes Configuration

Deployment requires RBAC permissions for resource creation. The kubeconfig is mounted into the platform container at /root/.kube/config.
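
A sketch of staging the kubeconfig on the host side so it shows up at /root/.kube/config in the container (assumes the read-only kubeconfig mount listed under Mount Points below):

# Copy the cluster kubeconfig into the directory mounted at /root/.kube
mkdir -p ${HAI_PLATFORM_PATH}/kubeconfig
cp ~/.kube/config ${HAI_PLATFORM_PATH}/kubeconfig/config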

Node Groups

Supported groups: TRAINING_GROUP and JUPYTER_GROUP. Configure via:

export MARS_PREFIX="hai-platform-one"
export TRAINING_GROUP="training"
export JUPYTER_GROUP="jupyter_cpu"
export TRAINING_NODES="cn-hangzhou.172.23.183.227"
export JUPYTER_NODES="cn-hangzhou.172.23.183.226"
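
The node names in TRAINING_NODES and JUPYTER_NODES are assumed to match the names Kubernetes reports for the nodes (a sketch to list them):

# Node names as seen by the cluster
kubectl get nodes -o custom-columns=NAME:.metadata.name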

Mount Points

Default mounts:

  - '${HAI_PLATFORM_PATH}/kubeconfig:/root/.kube:ro'
  - '${HAI_PLATFORM_PATH}/log:/high-flyer/log'
  - '${DB_PATH}:/var/lib/postgresql/12/main'
  - '${HAI_PLATFORM_PATH}/redis:/var/lib/redis'
  - '${HAI_PLATFORM_PATH}/workspace/log:${HAI_PLATFORM_PATH}/workspace/log'
  - '${HAI_PLATFORM_PATH}/log/postgresql:/var/log/postgresql'
  - '${HAI_PLATFORM_PATH}/log/redis:/var/log/redis'
  - '${HAI_PLATFORM_PATH}/init.sql:/tmp/init.sql'
  - '${HAI_PLATFORM_PATH}/override.toml:/etc/hai_one_config/override.toml'
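
If the deployment scripts do not already create the host-side paths above, they can be pre-created on the shared filesystem (a hedged sketch; adjust to your layout):

# Pre-create host directories referenced by the default mounts
mkdir -p ${HAI_PLATFORM_PATH}/{kubeconfig,redis,workspace/log,log/postgresql,log/redis}
mkdir -p ${DB_PATH}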

Database Configuration

Built-in Database

PostgreSQL and Redis are configured via override.toml:

  [database.postgres]
  host = '${HAI_SERVER_ADDR}'
  port = 5432
  user = '${POSTGRES_USER}'
  password = '${POSTGRES_PASSWORD}'
  [database.redis]
  host = '${HAI_SERVER_ADDR}'
  port = 6379
  password = '${REDIS_PASSWORD}'
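
Connectivity to both services can be verified with standard clients from any host that reaches HAI_SERVER_ADDR (a sketch):

# PostgreSQL: list databases (prompts for POSTGRES_PASSWORD)
psql -h ${HAI_SERVER_ADDR} -p 5432 -U ${POSTGRES_USER} -l

# Redis: expect PONG
redis-cli -h ${HAI_SERVER_ADDR} -p 6379 -a ${REDIS_PASSWORD} ping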

Database Initialization

Sample init.sql:

-- Storage mounts
INSERT INTO "storage" (...) VALUES (...);

-- Quota settings
INSERT INTO "quota" (...) VALUES (...);

-- User setup
INSERT INTO "user" (...) VALUES (...);

-- Training environments
INSERT INTO "train_environment" (...) VALUES (...);

-- Node info
INSERT INTO "host" (...) VALUES (...);

Platform Configuration

override.toml example:

[experiment.log.dist]
dir = '${HAI_PLATFORM_PATH}/workspace/log/{user_name}'

[database.postgres]
host = '${HAI_SERVER_ADDR}'

[scheduler]
default_group = '${TRAINING_GROUP}'

[launcher]
api_server = '${HAI_SERVER_ADDR}'
task_namespace = '${TASK_NAMESPACE}'
manager_image = '${BASE_IMAGE}'

[jupyter]
shared_node_group_prefix = '${JUPYTER_GROUP}'

SSH Configuration

To enable SSH in task containers:

  1. Create initialization scripts for SSH key setup (see the sketch below)
  2. Mount the scripts to /usr/local/sbin/hf-scripts/post_system_init/ via the storage table
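
A hedged sketch of such a script (the AUTHORIZED_KEY variable is hypothetical, and the sshd setup assumes an OpenSSH server is present in the training image):

#!/bin/bash
# post_system_init script: install an authorized key and start sshd in the task container
mkdir -p /root/.ssh && chmod 700 /root/.ssh
echo "${AUTHORIZED_KEY}" >> /root/.ssh/authorized_keys   # AUTHORIZED_KEY: hypothetical, supply your own public key
chmod 600 /root/.ssh/authorized_keys
ssh-keygen -A          # generate host keys if missing
mkdir -p /run/sshd
/usr/sbin/sshd

Place the script in a directory that the storage table mounts to /usr/local/sbin/hf-scripts/post_system_init/ so it runs inside task containers.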
