Training does not converge, because the dataset is too small? · Issue #334 · open-mmlab/mmpose · GitHub

Training does not converge, because the dataset is too small? #334


Closed
yulong314 opened this issue Dec 4, 2020 · 5 comments
Comments

yulong314 commented Dec 4, 2020

As described in #333, with the same tiny dataset, which has only 38 images and 10 annotations, training does not produce a usable model (no keypoints are detected when running top_down_img_demo_with_mmdet.py). Watching the log output on screen, I see that "mse_loss" and "loss" barely change across epochs.

[INFO ] text:_log_info:122 - Epoch [1][1/1] lr: 5.000e-07, eta: 0:14:13, time: 3.428, data_time: 2.963, memory: 3805, mse_loss: 0.0019, acc_pose: 0.0469, loss: 0.0019
...
[INFO ] text:_log_info:122 - Epoch [180][1/1] lr: 1.793e-05, eta: 0:03:53, time: 3.306, data_time: 2.907, memory: 4560, mse_loss: 0.0004, acc_pose: 0.9302, loss: 0.0004
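
For context on the [1/1] in these log lines: with 10 annotations and samples_per_gpu=8 (from the config below), each epoch contains a single iteration, so 180 epochs amount to only 180 optimizer steps. A minimal sketch of that arithmetic, assuming the dataloader keeps the last partial batch:

import math

num_annotations = 10      # size of the tiny dataset reported above
samples_per_gpu = 8       # batch size from the config below
iters_per_epoch = math.ceil(num_annotations / samples_per_gpu)  # -> 1

# 180 epochs at one iteration each is still inside the 500-iteration
# linear warmup configured below, so the lr never reaches its base value.
print(180 * iters_per_epoch < 500)  # True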

yulong314 (Author) commented:

Copying the config here from #333 for convenience:

log_level = 'INFO'
load_from = None
resume_from = None
dist_params = dict(backend='nccl')
workflow = [('train', 1)]
checkpoint_config = dict(interval=30)
evaluation = dict(interval=10, metric='mAP', key_indicator='AP')

optimizer = dict(
    type='Adam',
    lr=5e-4,
)
optimizer_config = dict(grad_clip=None)
# learning policy
lr_config = dict(
    policy='step',
    # warmup=None,
    warmup='linear',
    warmup_iters=500,
    warmup_ratio=0.001,
    step=[150, 200])
total_epochs = 250
log_config = dict(
    interval=1,
    hooks=[
        dict(type='TextLoggerHook'),
        # dict(type='TensorboardLoggerHook')
    ])

channel_cfg = dict(
    num_output_channels=32,
    dataset_joints=32,
    dataset_channel=[
        list(range(32)),
    ],
    inference_channel=list(range(32)))

# model settings
model = dict(
    type='TopDown',
    pretrained='https://download.openmmlab.com/mmpose/top_down/'
    'hrnet/hrnet_w48_coco_384x288_dark-741844ba_20200812.pth',
    backbone=dict(
        type='HRNet',
        in_channels=3,
        extra=dict(
            stage1=dict(
                num_modules=1,
                num_branches=1,
                block='BOTTLENECK',
                num_blocks=(4, ),
                num_channels=(64, )),
            stage2=dict(
                num_modules=1,
                num_branches=2,
                block='BASIC',
                num_blocks=(4, 4),
                num_channels=(48, 96)),
            stage3=dict(
                num_modules=4,
                num_branches=3,
                block='BASIC',
                num_blocks=(4, 4, 4),
                num_channels=(48, 96, 192)),
            stage4=dict(
                num_modules=3,
                num_branches=4,
                block='BASIC',
                num_blocks=(4, 4, 4, 4),
                num_channels=(48, 96, 192, 384))),
    ),
    keypoint_head=dict(
        type='TopDownSimpleHead',
        in_channels=48,
        out_channels=channel_cfg['num_output_channels'],
        num_deconv_layers=0,
        extra=dict(final_conv_kernel=1, ),
    ),
    train_cfg=dict(),
    test_cfg=dict(
        flip_test=True,
        post_process=True,
        shift_heatmap=True,
        unbiased_decoding=True,
        modulate_kernel=11),
    loss_pose=dict(type='JointsMSELoss', use_target_weight=True))

data_cfg = dict(
    image_size=[288, 384],
    heatmap_size=[72, 96],
    num_output_channels=channel_cfg['num_output_channels'],
    num_joints=channel_cfg['dataset_joints'],
    dataset_channel=channel_cfg['dataset_channel'],
    inference_channel=channel_cfg['inference_channel'],
    soft_nms=False,
    use_nms=False,
    nms_thr=1.0,
    oks_thr=0.9,
    vis_thr=0.2,
    bbox_thr=1.0,
    use_gt_bbox=True,
    image_thr=0.0,
    bbox_file=None,
)

train_pipeline = [
    dict(type='LoadImageFromFile'),
    # dict(type='TopDownRandomFlip', flip_prob=0.5),
    # dict(type='TopDownRandomFlipH', flip_prob=0.5),    
    # dict(
    #     type='TopDownHalfBodyTransform',
    #     num_joints_half_body=8,
    #     prob_half_body=0.3),
    dict(
        type='TopDownGetRandomScaleRotation', rot_factor=40, scale_factor=0.5),
    dict(type='TopDownAffine'),
    dict(type='ToTensor'),
    dict(
        type='NormalizeTensor',
        mean=[0.485, 0.456, 0.406],
        std=[0.229, 0.224, 0.225]),
    dict(type='TopDownGenerateTarget', sigma=3, unbiased_encoding=True),
    dict(
        type='Collect',
        keys=['img', 'target', 'target_weight'],
        meta_keys=[
            'image_file', 'joints_3d', 'joints_3d_visible', 'center', 'scale',
            'rotation', 'bbox_score', 'flip_pairs'
        ]),
]

val_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(type='TopDownAffine'),
    dict(type='ToTensor'),
    dict(
        type='NormalizeTensor',
        mean=[0.485, 0.456, 0.406],
        std=[0.229, 0.224, 0.225]),
    dict(
        type='Collect',
        keys=['img'],
        meta_keys=[
            'image_file', 'center', 'scale', 'rotation', 'bbox_score',
            'flip_pairs'
        ]),
]

test_pipeline = val_pipeline

data_root = '/mnt/data/wholebody'
data = dict(
    samples_per_gpu=8,
    workers_per_gpu=1,
    train=dict(
        type='TopDownCocoWeijingDataset',
        ann_file=f'{data_root}/12down_train.json',
        img_prefix=f'{data_root}/train_v3_ok_img/',
        data_cfg=data_cfg,
        pipeline=train_pipeline),
    val=dict(
        type='TopDownCocoWeijingDataset',
        ann_file=f'{data_root}/12down_train.json',
        img_prefix=f'{data_root}/train_v3_ok_img/',
        data_cfg=data_cfg,
        pipeline=train_pipeline),
    test=dict(
        type='TopDownCocoWeijingDataset',
        ann_file=f'{data_root}/12down_train.json',
        img_prefix=f'{data_root}/train_v3_ok_img/',
        data_cfg=data_cfg,
        pipeline=train_pipeline),
)

# load_from = '/home/sy/working/otherCodes/mmpose/work-dirs/flip2500/latest.pth'

innerlee (Contributor) commented Dec 4, 2020

Short answer: yes.

You may set the number of epochs to 10000 to actually see sufficient updates to the weights. Even that does not guarantee a good model.
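
For reference, a sketch of the corresponding config change; the step milestones here are hypothetical values rescaled to the longer run, not something given in this thread:

# Illustrative only: train long enough that the optimizer leaves the
# 500-iteration warmup and takes many full-lr steps.
total_epochs = 10000
lr_config = dict(
    policy='step',
    warmup='linear',
    warmup_iters=500,
    warmup_ratio=0.001,
    step=[6000, 8000])  # hypothetical milestones for the extended schedule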

innerlee added the question (Further information is requested) label Dec 4, 2020
yulong314 (Author) commented Dec 4, 2020

Short answer: yes.

You may set the number of epochs to 10000 to actually see sufficient updates to the weights. Even that does not guarantee a good model.

But after setting log interval=1, the model is able to infer keypoints after 250 epochs.
Before setting log interval=1, the model was NOT able to infer any keypoints; I had been trying repeatedly for a whole day.

innerlee (Contributor) commented Dec 4, 2020

Ideally, logging should not affect the training process. @jin-s13, could you investigate this?

innerlee removed the question (Further information is requested) label Dec 4, 2020
innerlee (Contributor) commented Dec 4, 2020

Please note that the config sets

warmup_iters=500,

which means that when the total number of iterations is small, the lr schedule may still be in the warm-up stage.
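
As a sanity check, here is a minimal sketch that reproduces the logged learning rates under this schedule. The warmup formula is an assumption based on mmcv's step policy with linear warmup (default gamma=0.1); since there is one iteration per epoch here, epoch and iteration counts coincide:

def lr_at(cur_iter, base_lr=5e-4, warmup_iters=500, warmup_ratio=0.001,
          steps=(150, 200), gamma=0.1):
    # Regular lr from the step policy; milestones are epochs, but with
    # one iteration per epoch they can be compared to cur_iter directly.
    regular_lr = base_lr * gamma ** sum(cur_iter >= s for s in steps)
    if cur_iter < warmup_iters:
        # Assumed mmcv-style linear warmup: ramp the regular lr upward.
        k = (1 - cur_iter / warmup_iters) * (1 - warmup_ratio)
        return regular_lr * (1 - k)
    return regular_lr

print(f'{lr_at(0):.3e}')    # 5.000e-07, matches the Epoch [1] log line
print(f'{lr_at(179):.3e}')  # 1.793e-05, matches the Epoch [180] log line

Note that by epoch 180 the run has already passed the step milestone at 150 (dropping the base lr tenfold) while still sitting inside the 500-iteration warmup ramp, which is why the logged lr is so small.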

@jin-s13 jin-s13 closed this as completed Dec 16, 2020
HAOCHENYE pushed a commit to HAOCHENYE/mmpose that referenced this issue Jun 27, 2023
* Improve registry infer_scope

* add warning info

* set scope as mmengine when failed to infer it

* refine message