Added support for AWS Batch. Added support for docker by yinweisu · Pull Request #1474 · dmlc/gluon-cv


Merged
zhreshold merged 5 commits into dmlc:master from yinweisu:batch
Nov 4, 2020


yinweisu
Collaborator
  • Updates for AWS Batch are under tools/batch
    • submit-job.py lets users launch an AWS Batch job that runs a given script
    • batch-test.py deploys a simple job on every job definition for testing
    • Docker setup for AWS Batch is under tools/batch/docker
      • Both Dockerfile.gpu and Dockerfile.cpu include gluon_cv_job.sh in the container
      • docker_deploy.sh builds the images from the updated Dockerfiles and pushes them
    • Templates for instances launched by AWS Batch are under tools/batch/template
      • The templates are needed because of a known issue: instances launched by AWS Batch have only 10 GB of storage
  • Updates for the GluonCV Docker setup are under tools/docker
    • Both shell scripts are used in the Docker container for Jupyter Notebook support
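The scripts themselves are not shown in this thread. As a minimal sketch of the workflow described above, assuming boto3's AWS Batch client (`submit_job`, `describe_jobs`, and the `describe_job_definitions` paginator), submitting a script, waiting for it, and fanning one test job out to every active job definition could look like this. The function names, the `test-` job-name prefix, and the polling interval are hypothetical, not taken from the PR.

```python
import time

# Hypothetical sketch of what tools/batch/submit-job.py and batch-test.py might do.
# Function names and the "test-" job-name prefix are illustrative, not from the PR.

TERMINAL_STATUSES = {"SUCCEEDED", "FAILED"}


def _batch_client(client):
    """Return the given client, or create a boto3 Batch client lazily."""
    if client is not None:
        return client
    import boto3  # imported lazily so the helpers can be exercised without AWS

    return boto3.client("batch")


def submit_script_job(job_name, job_queue, job_definition, command, client=None):
    """Submit one job whose container runs `command`; return the job ID."""
    client = _batch_client(client)
    response = client.submit_job(
        jobName=job_name,
        jobQueue=job_queue,
        jobDefinition=job_definition,  # "name", "name:revision", or an ARN
        containerOverrides={"command": list(command)},
    )
    return response["jobId"]


def wait_for_job(job_id, client=None, poll_seconds=30, sleep=time.sleep):
    """Poll describe_jobs until the job reaches SUCCEEDED or FAILED."""
    client = _batch_client(client)
    while True:
        job = client.describe_jobs(jobs=[job_id])["jobs"][0]
        if job["status"] in TERMINAL_STATUSES:
            return job["status"]
        sleep(poll_seconds)


def submit_to_all_definitions(job_queue, command, client=None):
    """Submit the same small job once per ACTIVE job definition revision."""
    client = _batch_client(client)
    job_ids = {}
    paginator = client.get_paginator("describe_job_definitions")
    for page in paginator.paginate(status="ACTIVE"):
        for definition in page["jobDefinitions"]:
            name = definition["jobDefinitionName"]
            ref = f"{name}:{definition['revision']}"
            job_ids[ref] = submit_script_job(
                f"test-{name}", job_queue, ref, command, client=client
            )
    return job_ids
```

For example, `wait_for_job(submit_script_job("demo", "my-queue", "my-def:1", ["nvidia-smi"]))` would submit one job and block until it finishes; the queue and definition names there are placeholders. Passing the client in as a parameter keeps the helpers testable without AWS credentials.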

@mli
Member
mli commented Oct 15, 2020

Job PR-1474-1 is done.
Docs are uploaded to http://gluon-vision-staging.s3-website-us-west-2.amazonaws.com/PR-1474/1/index.html
Code coverage of this PR: pr.svg vs. Master: master.svg

Member
@zhreshold zhreshold left a comment


looks good, see comments.

BTW, configure your text editor to always add a newline at the end of each file.
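Most editors have a setting for this (for example, EditorConfig's `insert_final_newline = true`). As a hedged sketch for catching the problem in CI instead, a small checker could look like this; the function names are hypothetical and not part of the PR.

```python
import sys
from pathlib import Path


def missing_final_newline(path):
    """True if a non-empty file does not end with a newline byte."""
    data = Path(path).read_bytes()
    return bool(data) and not data.endswith(b"\n")


def fix_final_newline(path):
    """Append a newline when it is missing; return True if the file changed."""
    if not missing_final_newline(path):
        return False
    with open(path, "ab") as f:
        f.write(b"\n")
    return True


def main(paths):
    """Report offending files on stderr and return an exit code (1 if any)."""
    bad = [p for p in paths if missing_final_newline(p)]
    for p in bad:
        print(f"missing newline at end of file: {p}", file=sys.stderr)
    return 1 if bad else 0
```

In a script, `sys.exit(main(sys.argv[1:]))` would turn this into a CI check over the files passed on the command line. Empty files are deliberately treated as fine, since they have no last line to terminate.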

import boto3
from botocore.compat import total_seconds

# Fetch definitions
Member

do we still need this?

@@ -0,0 +1,23 @@
{
Member

it's bad to have parentheses in the filename

Collaborator Author

Because ECS uses linux1 as the AMI for GPU-related instances and linux2 for CPU-related instances, I feel it's necessary to mark this in the file name. If I named it linux1-gpu, people might mistakenly think there is also a linux1-cpu, so I used parentheses instead. Is there a better naming convention I can use to achieve the same goal?
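The template files discussed here exist to work around the 10 GB storage limit mentioned in the PR description, and the author notes that GPU and CPU instances use different AMIs. As a minimal sketch, assuming the fix is an EC2 launch template that attaches a larger EBS volume, the template data could be built as below. The device names are assumptions that depend on the AMI (commonly `/dev/xvda` for the Amazon Linux 2 root volume and `/dev/xvdcz` for the Docker volume on the original ECS-optimized Amazon Linux AMI), so check the AMI's block device mapping before relying on them; the helper names, the 100 GiB size, and the template name used in the example are hypothetical.

```python
def storage_launch_template_data(device_name, size_gib, volume_type="gp2"):
    """Build EC2 LaunchTemplateData that attaches a larger EBS volume.

    device_name is an assumption that depends on the AMI; inspect the AMI's
    block device mapping to find the volume that actually needs to grow.
    """
    if size_gib <= 0:
        raise ValueError("size_gib must be positive")
    return {
        "BlockDeviceMappings": [
            {
                "DeviceName": device_name,
                "Ebs": {
                    "VolumeSize": size_gib,  # GiB
                    "VolumeType": volume_type,
                    "DeleteOnTermination": True,
                },
            }
        ]
    }


def create_template(name, data, client=None):
    """Register the template with EC2 (requires AWS credentials)."""
    if client is None:
        import boto3  # imported lazily so the builder can be used offline

        client = boto3.client("ec2")
    return client.create_launch_template(LaunchTemplateName=name, LaunchTemplateData=data)
```

An AWS Batch compute environment could then reference the template through the `launchTemplate` field of its compute resources. Because the volume to enlarge may differ per AMI, keeping one template per AMI family, as this PR does, avoids branching inside a single template.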

import time
from datetime import datetime

import boto3
Member

advise in the README that boto3 is required
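Alongside the README note the reviewer asks for, the scripts could fail fast with an install hint when boto3 is missing, rather than with a bare `ModuleNotFoundError` traceback. A hedged sketch; `require` is a hypothetical helper, not part of the PR:

```python
import importlib
import sys


def require(module_name, install_hint):
    """Import `module_name`, or exit with an install hint instead of a traceback."""
    try:
        return importlib.import_module(module_name)
    except ImportError:  # ModuleNotFoundError is a subclass of ImportError
        sys.exit(f"{module_name} is required by this script; install it with: {install_hint}")


# Example, at the top of a script such as submit-job.py:
#     boto3 = require("boto3", "pip install boto3")
```

`sys.exit` with a string prints the message to stderr and exits with status 1, which keeps the failure readable for users who skipped the README.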

@mli
Member
mli commented Oct 15, 2020

Job PR-1474-2 is done.
Docs are uploaded to http://gluon-vision-staging.s3-website-us-west-2.amazonaws.com/PR-1474/2/index.html
Code coverage of this PR: pr.svg vs. Master: master.svg

@mli
Member
mli commented Oct 21, 2020

Job PR-1474-3 is done.
Docs are uploaded to http://gluon-vision-staging.s3-website-us-west-2.amazonaws.com/PR-1474/3/index.html
Code coverage of this PR: pr.svg vs. Master: master.svg

@zhreshold zhreshold merged commit 5c248f3 into dmlc:master Nov 4, 2020
yinweisu added a commit to yinweisu/gluon-cv that referenced this pull request Nov 18, 2020
* incorporate autodatasets (dmlc#1496)

* Add torch clarification (dmlc#1495)

* Add torch clarification

* fix

* Fix auto detectors (dmlc#1497)

* fix yolo predictor

* fix predict

* fix config (dmlc#1498)

* Added support for AWS Batch. Added support for docker (dmlc#1474)

* Added support for AWS Batch. Added support for docker

* Fixed style. Removed code in comment. Updated README to include boto3 usage

* Renamed template file. Removed gluon aws id

* fix readme

* fix

* fix imports (dmlc#1499)

* fix imports

* fix

* fix image classification

* fix

* fix width height

* fix

* fix batch size

* fix

* fix

* none to empty string (dmlc#1502)

* [WIP] Tinycoco (dmlc#1501)

* Add minicoco

* update jenkins for minicoco

* fix

* renamed mini to tiny

* fix

* fix

* fix, add VOCDetectionTiny

* fix

* fix env

* fix

* fix

* fix

* fix

* fix

* fix

* fix

* fix

* fix

* fix

* fix

* fix

* test

* test

* test

* clean up

Co-authored-by: Joshua Z. Zhang <cheungchih@gmail.com>

* Fix rcnn target generator (dmlc#1508)

* fix not used rcnn target generator

* fix lint

* fix

* fix

* add get flops (dmlc#1509)

* warmup scheduler for video torch (dmlc#1510)

1. refine warmup logic: cfg.CONFIG.TRAIN.USE_WARMUP now controls whether warmup is enabled.
2. fix bug in gluoncv/torch/utils/lr_policy.py
3. change training configs
4. change ddp_train_pytorch and ddp_train_shortonly_pytorch; these changes were tested on EC2 machines

* update torchvideo model zoo (dmlc#1513)

* add ir-csn-152 into torchvideo model zoo (dmlc#1515)

* Revise danet.py (dmlc#1507)

The dropout layer should be placed before the classification layer.

* icnet missing background class (dmlc#1518)

* Add CSN model to torch video model zoo (dmlc#1517)

* add ircsn

* update model zoo

* fix lint

* Improve auto tasks (dmlc#1523)

* use in-memory pickle instead of disk file

* add feature extractor for image classification

* add tests

* fix

* fix lint

* more unittests

* fix

* fix

* Added github action and workflow for sanity check

* Removed container and actions.

* Added unit test

* Added build docs

* Fix

* Fix

* Fix

* Fix

* Test

* test

* Update unit test

* fix

* fix

* fix

* fix

* fix

* fix

* fix

* subclass coco

* fix

* fix

* fix

* fix

* rebase conflict

* fix rebase

* fix

* fix

* add aws authentication

* add aws authentication

* test

* test

* test

* test

* test

* fix log

* test

* test

* test

* test

* test

* test

* fix

* rebase

* add tiny motorbike

* fix

* model zoo

* test

* fix docker

* parallel jobs

* parallel jobs

* fix

* add torch

* add torch

* fix

* fix

* fix

* full test

* full test

* test build docs

* fix

* fix

* fix

* fix

* fix

* fix

* fix

* fix

* fix

* fix

* fix

* fix

* fix

* fix

* fix

* fix

* fix

* fix

* test branch

* test branch

* fix

* test

* test

* add comment

Co-authored-by: Joshua Z. Zhang <cheungchih@gmail.com>
Co-authored-by: Yi Zhu <yizhu59@gmail.com>
Co-authored-by: Xinyu Li <lixinyu.arthur@outlook.com>
Co-authored-by: Chunhui Liu <chunhuiliu960@gmail.com>
Co-authored-by: YANYI ZHANG <yz593@scarletmail.rutgers.edu>
Co-authored-by: BebDong <BebDong@users.noreply.github.com>
Co-authored-by: Kuang Haofei <haofeikuang@gmail.com>
yinweisu added a commit to yinweisu/gluon-cv that referenced this pull request Nov 24, 2020
zhreshold added a commit that referenced this pull request Dec 2, 2020
* [WIP] Github Actions (#1)

* [WIP] Test PR (#3)

* test

* full test

* full test

* full test

* test (#5)

* test

* fix

* change to 12x

* test comments

* change to pr_target

* [WIP] Full Test (#6)

* full test

* test model zoo

* test model zoo

* full test

* full test

* add auto

* add gpu_test.sh

* test efs modelzoo

* test efs modelzoo

* test efs modelzoo

* test without auto

* test repo name

* test repo name

* test repo name

* test repo name

* test sharemem

* full test (#8)

* test pr only on yinweisu

* test pr only on yinweisu

* update repo name

* test pr only on yinweisu (#9)

* full test on pr only yinweisu (#10)

* ready to pr

* fix

* change doc env name

* add torch to env

* add yacs to env

* fix path

Co-authored-by: Joshua Z. Zhang <cheungchih@gmail.com>
Co-authored-by: Yi Zhu <yizhu59@gmail.com>
Co-authored-by: Xinyu Li <lixinyu.arthur@outlook.com>
Co-authored-by: Chunhui Liu <chunhuiliu960@gmail.com>
Co-authored-by: YANYI ZHANG <yz593@scarletmail.rutgers.edu>
Co-authored-by: BebDong <BebDong@users.noreply.github.com>
Co-authored-by: Kuang Haofei <haofeikuang@gmail.com>
@yinweisu yinweisu deleted the batch branch January 12, 2021 22:11
3 participants