Added support for AWS Batch. Added support for docker #1474

yinweisu · 2020-10-14T23:27:11Z

Updates for AWS Batch are under tools/batch:
- submit-job.py lets users launch an AWS Batch job from a script
- batch-test.py deploys a simple job against every job definition for testing
- The Docker setup for AWS Batch is under tools/batch/docker
  - Both Dockerfile.gpu and Dockerfile.cpu include gluon_cv_job.sh in the container
  - docker_deploy.sh builds and pushes the updated Docker image
- Launch templates for instances started by AWS Batch are under tools/batch/template
  - The templates are needed because of a known issue: instances started by AWS Batch have only 10GB of storage
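
To illustrate the storage point, here is a minimal sketch of launch template data that requests a larger EBS volume. It is not the content of the template files in this PR: the device name `/dev/xvda`, the 100 GiB size, the `gp2` volume type, and the helper name `build_launch_template_data` are all assumptions chosen for illustration, and the right device depends on the AMI.

```python
import json


def build_launch_template_data(volume_size_gib=100, device_name="/dev/xvda"):
    """Return launch template data that enlarges an instance's EBS volume.

    Both arguments are illustrative assumptions, not values taken from
    this PR. Check the block devices of the AMI you actually use.
    """
    return {
        "BlockDeviceMappings": [
            {
                "DeviceName": device_name,
                "Ebs": {
                    "VolumeSize": volume_size_gib,  # size in GiB
                    "VolumeType": "gp2",
                    "DeleteOnTermination": True,
                },
            }
        ]
    }


if __name__ == "__main__":
    # Print JSON that could be saved as a launch template data file.
    print(json.dumps(build_launch_template_data(), indent=2))
```

The printed JSON follows the shape of EC2 launch template data, so it could be saved to a file and attached to a Batch compute environment.
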
Updates for the GluonCV Docker image are under tools/docker:
- Both shell scripts are used inside the Docker container for Jupyter notebook support

mli · 2020-10-15T00:31:04Z

Job PR-1474-1 is done.
Docs are uploaded to http://gluon-vision-staging.s3-website-us-west-2.amazonaws.com/PR-1474/1/index.html
Code coverage of this PR: vs. Master:

zhreshold

looks good, see comments.

BTW, configure your text editor to always add a newline at the end of the file.

zhreshold · 2020-10-15T21:28:33Z

tools/batch/submit-job.py

+import boto3
+from botocore.compat import total_seconds
+
+# Fetch definitions


do we still need this?

zhreshold · 2020-10-15T21:29:32Z

tools/batch/template/launch-template-data-linux1(gpu).json

@@ -0,0 +1,23 @@
+{


it's bad to have parentheses in the filename

Because ECS uses the linux1 AMI for GPU instances and the linux2 AMI for CPU instances, I feel it's necessary to mark that in the file name. If I use linux1-gpu, people might mistakenly think there is a linux1-cpu too, so I used parentheses. Is there a better naming convention that achieves the same goal?

zhreshold · 2020-10-15T21:31:02Z

tools/batch/submit-job.py

+import time
+from datetime import datetime
+
+import boto3


advise in the README that boto3 is required
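
For context on why boto3 is needed: a script like submit-job.py talks to AWS Batch through the boto3 `batch` client. The sketch below is not the code from this PR. The helper names `build_submit_args` and `submit`, and the job name, queue, definition, command, and region in the usage section, are hypothetical placeholders; only `boto3.client("batch")` and its `submit_job` call are real API.

```python
def build_submit_args(job_name, job_queue, job_definition, command,
                      timeout_seconds=None):
    """Assemble keyword arguments for the Batch client's submit_job call."""
    args = {
        "jobName": job_name,
        "jobQueue": job_queue,
        "jobDefinition": job_definition,
        # Override the container command with the user's script invocation.
        "containerOverrides": {"command": list(command)},
    }
    if timeout_seconds is not None:
        args["timeout"] = {"attemptDurationSeconds": timeout_seconds}
    return args


def submit(batch_client, **kwargs):
    """Submit a job through the given client and return its job ID."""
    response = batch_client.submit_job(**build_submit_args(**kwargs))
    return response["jobId"]


if __name__ == "__main__":
    # boto3 is imported here so the helpers above work without it installed.
    import boto3

    # All names below are placeholders, not resources defined by this PR.
    batch = boto3.client("batch", region_name="us-west-2")
    job_id = submit(
        batch,
        job_name="example-job",
        job_queue="example-queue",
        job_definition="example-definition",
        command=["python3", "train.py"],
        timeout_seconds=3600,
    )
    print(job_id)
```

Passing the client in as an argument keeps the helpers easy to exercise with a stub, without AWS credentials.
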

mli · 2020-10-15T23:00:23Z

Job PR-1474-2 is done.
Docs are uploaded to http://gluon-vision-staging.s3-website-us-west-2.amazonaws.com/PR-1474/2/index.html
Code coverage of this PR: vs. Master:

mli · 2020-10-21T23:23:49Z

Job PR-1474-3 is done.
Docs are uploaded to http://gluon-vision-staging.s3-website-us-west-2.amazonaws.com/PR-1474/3/index.html
Code coverage of this PR: vs. Master:

zhreshold reviewed Oct 15, 2020

yinweisu force-pushed the batch branch from d7c8f8d to ed92e18 Compare October 15, 2020 21:56

yinweisu added 4 commits November 4, 2020 11:10

Added support for AWS Batch. Added support for docker

94ffcb6

Fixed style. Removed code in commet. Updated README to include boto3 usage

bc58158

Renamed template file. Removed gluon aws id

d0004e8

fix readme

71bab01

yinweisu force-pushed the batch branch from ce6c203 to 71bab01 Compare November 4, 2020 19:10

fix

0afdd27

zhreshold approved these changes Nov 4, 2020

zhreshold merged commit 5c248f3 into dmlc:master Nov 4, 2020

yinweisu deleted the batch branch January 12, 2021 22:11
