Added support for AWS Batch. Added support for docker #1474
- Updates for AWS Batch are under tools/batch
  - submit-job.py lets users launch an AWS Batch job with a script
  - batch-test.py deploys a simple job on every job definition for testing
- The Docker setup for AWS Batch is under tools/batch/docker
  - Both Dockerfile.gpu and Dockerfile.cpu include gluon_cv_job.sh in the container
  - docker_deploy.sh builds and pushes the image from the updated Dockerfile
- Templates for instances launched by AWS Batch are under tools/batch/template
  - The templates are needed because of a known issue: instances launched by AWS Batch have only 10GB of storage
- Updates for the GluonCV Docker image are under tools/docker
  - Both shell scripts are used in the Docker container for Jupyter notebook support
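To illustrate the description above, here is a minimal sketch of how a script like submit-job.py might launch a job through boto3, and how a test script like batch-test.py might submit one job per job definition. This is not the code from this PR: the function names and the `test-` job-name prefix are assumptions made for illustration. The only boto3 Batch client calls used are `submit_job` and `describe_job_definitions`, and the client is passed in so the logic can be exercised without AWS credentials.

```python
from typing import Any, Dict, Iterator, List


def submit(client: Any, name: str, queue: str, definition: str,
           command: List[str]) -> str:
    """Submit one job and return its job ID.

    `client` is expected to behave like boto3.client("batch").
    """
    response = client.submit_job(
        jobName=name,
        jobQueue=queue,
        jobDefinition=definition,
        containerOverrides={"command": command},
    )
    return response["jobId"]


def active_definitions(client: Any) -> Iterator[Dict[str, Any]]:
    """Yield every ACTIVE job definition, following nextToken pagination."""
    kwargs: Dict[str, Any] = {"status": "ACTIVE"}
    while True:
        page = client.describe_job_definitions(**kwargs)
        yield from page["jobDefinitions"]
        token = page.get("nextToken")
        if not token:
            break
        kwargs["nextToken"] = token


def submit_to_all(client: Any, queue: str, command: List[str]) -> List[str]:
    """Submit the same simple command once per active job definition."""
    job_ids = []
    for definition in active_definitions(client):
        # Hypothetical naming scheme for the test jobs.
        name = "test-" + definition["jobDefinitionName"]
        job_ids.append(
            submit(client, name, queue, definition["jobDefinitionArn"], command)
        )
    return job_ids
```

With real credentials this could be called as `submit_to_all(boto3.client("batch"), "my-queue", ["echo", "hello"])`, where `my-queue` is a placeholder queue name.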
Job PR-1474-1 is done.
Looks good; see comments.
BTW, configure your text editor to always add a newline at the end of the file.
tools/batch/submit-job.py
import boto3
from botocore.compat import total_seconds

# Fetch definitions
Do we still need this?
@@ -0,0 +1,23 @@
{
It's bad to have parentheses in the filename.
ECS uses linux1 as the AMI for GPU instances and linux2 for CPU instances, so I feel it's necessary to mark that in the file name. If I name it linux1-gpu, people might mistakenly think there is a linux1-cpu too, which is why I used parentheses here. Is there a better naming convention that achieves the same goal?
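For context on what such a template might hold: the PR description says the templates exist to work around the 10GB storage limit on Batch-launched instances. Below is a minimal sketch of building EC2 launch template data with a larger EBS volume. The device name, the 100 GiB size, the `gp2` volume type, and the template names (written without parentheses) are assumptions for illustration, not values from this PR. The dictionary layout follows the request shape of `create_launch_template` in boto3's EC2 client.

```python
import json
from typing import Any, Dict


def launch_template(name: str, device: str, size_gib: int) -> Dict[str, Any]:
    """Build the keyword arguments for an EC2 create_launch_template call."""
    return {
        "LaunchTemplateName": name,
        "LaunchTemplateData": {
            "BlockDeviceMappings": [
                {
                    "DeviceName": device,
                    "Ebs": {
                        "VolumeSize": size_gib,  # size in GiB
                        "VolumeType": "gp2",
                        "DeleteOnTermination": True,
                    },
                }
            ]
        },
    }


# Hypothetical names and device paths; check them against the AMI in use.
gpu = launch_template("batch-gpu-linux1", "/dev/xvda", 100)
cpu = launch_template("batch-cpu-linux2", "/dev/xvda", 100)
print(json.dumps(gpu, indent=2))
```

Putting the instance type first in the name (`batch-gpu-linux1`, `batch-cpu-linux2`) is one possible way to record the AMI generation without implying that every combination exists.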
import time
from datetime import datetime

import boto3
Advise in the README that boto3 is required.
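Alongside a README note, a script can also report a missing dependency before it tries to import it. The sketch below is one way to do that and is not part of this PR: the helper name `require_module` and the hint text are assumptions, and it uses only the standard library.

```python
import importlib.util
import sys


def require_module(name: str, hint: str) -> bool:
    """Return True if `name` is importable; otherwise print a hint and return False."""
    if importlib.util.find_spec(name) is None:
        print(f"Missing dependency '{name}'. {hint}", file=sys.stderr)
        return False
    return True


if __name__ == "__main__":
    # Exit early with a readable message instead of a raw ImportError.
    if not require_module("boto3", "Install it with: pip install boto3"):
        sys.exit(1)
```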
Job PR-1474-2 is done.
Job PR-1474-3 is done.
* incorporate autodatasets (dmlc#1496) * Add torch clarification (dmlc#1495) * Add torch clarification * fix * Fix auto detectors (dmlc#1497) * fix yolo predictor * fix predict * fix config (dmlc#1498) * Added support for AWS Batch. Added support for docker (dmlc#1474) * Added support for AWS Batch. Added support for docker * Fixed style. Removed code in commet. Updated README to include boto3 usage * Renamed template file. Removed gluon aws id * fix readme * fix * fix imports (dmlc#1499) * fix imports * fix * fix image classification * fix * fix width height * fix * fix batch size * fix * fix * none to empty string (dmlc#1502) * [WIP] Tinycoco (dmlc#1501) * Add minicoco * update jenkins for minicoco * fix * renamed mini to tiny * fix * fix * fix, add VOCDetectionTiny * fix * fix env * fix * fix * fix * fix * fix * fix * fix * fix * fix * fix * fix * fix * test * test * test * clean up Co-authored-by: Joshua Z. Zhang <cheungchih@gmail.com> * Fix rcnn target generator (dmlc#1508) * fix not used rcnn target generator * fix lint * fix * fix * add get flops (dmlc#1509) * warmup scheduler for video torch (dmlc#1510) 1. refine warmup logic, now using cfg.CONFIG.TRAIN.USE_WARMUP to control open warmup or not. 2. fix bug in gluoncv/torch/utils/lr_policy.py 3. change training configs 4. change ddp_train_pytorch and ddp_train_shortonly_pytorch, This is tested on ec2 machines * update torchvideo model zoo (dmlc#1513) * add ir-csn-152 into torchvideo model zoo (dmlc#1515) * Revise danet.py (dmlc#1507) The dropout layer should be placed before the classification layer. * icnet missing background class (dmlc#1518) * Add CSN model to torch video model zoo (dmlc#1517) * add ircsn * update model zoo * fix lint * Improve auto tasks (dmlc#1523) * use in-memory pickle instead of disk file * add feature extractor for image classification * add tests * fix * fix lint * more unittests * fix * fix * Added github action and workflow for sanity check * Removed container and actions. 
* Added unit test * Added build docs * Fix * Fix * Fix * Fix * Test * test * Update unit test * fix * fix * fix * fix * fix * fix * fix * subclass coco * fix * fix * fix * fix * rebase conflict * fix rebase * fix * fix * add aws authentication * add aws authentication * test * test * test * test * test * fix log * test * test * test * test * test * test * fix * rebase * add tiny motorbike * fix * model zoo * test * fix docker * parallel jobs * parallel jobs * fix * add torch * add torch * fix * fix * fix * full test * full test * test build docs * fix * fix * fix * fix * fix * fix * fix * fix * fix * fix * fix * fix * fix * fix * fix * fix * fix * fix * test branch * test branch * fix * test * test * add comment Co-authored-by: Joshua Z. Zhang <cheungchih@gmail.com> Co-authored-by: Yi Zhu <yizhu59@gmail.com> Co-authored-by: Xinyu Li <lixinyu.arthur@outlook.com> Co-authored-by: Chunhui Liu <chunhuiliu960@gmail.com> Co-authored-by: YANYI ZHANG <yz593@scarletmail.rutgers.edu> Co-authored-by: BebDong <BebDong@users.noreply.github.com> Co-authored-by: Kuang Haofei <haofeikuang@gmail.com>