DataParallel should support multiple inputs #649

Closed
apaszke opened this issue Jan 30, 2017 · 7 comments
Comments

apaszke (Contributor) commented Jan 30, 2017

No description provided.

everwind commented Feb 6, 2017

I implemented data_parallel with two inputs, but it does not work:

import threading

import torch
from torch.autograd import Variable
from torch.nn.parallel import replicate, scatter, gather


def data_parallel2(module, input1, input2, device_ids, output_device=None):
    """Evaluates module(input1, input2) in parallel across the GPUs given in device_ids.

    This is the functional version of the DataParallel module.

    Args:
        module: the module to evaluate in parallel
        input1, input2: inputs to the module
        device_ids: GPU ids on which to replicate module
        output_device: GPU location of the output. Use -1 to indicate the CPU.
            (default: device_ids[0])
    Returns:
        a Variable containing the result of module(input1, input2) located on
        output_device
    """
    if not device_ids:
        return module(input1, input2)

    if output_device is None:
        output_device = device_ids[0]

    replicas = replicate(module, device_ids)
    input1s = scatter(input1, device_ids)
    input2s = scatter(input2, device_ids)
    replicas = replicas[:len(input1s)]
    outputs = parallel_apply2(replicas, input1s, input2s)
    return gather(outputs, output_device)


def parallel_apply2(modules, input1s, input2s):
    assert len(modules) == len(input1s)
    # Fast track
    if len(modules) == 1:
        return (modules[0](input1s[0], input2s[0]),)

    lock = threading.Lock()
    results = {}

    def _worker(module, input1, input2, results, lock):
        # Unwrap nested containers until we reach a Variable, so we can
        # run the module on the device that input lives on.
        var_input1 = input1
        var_input2 = input2
        while not isinstance(var_input1, Variable):
            var_input1 = var_input1[0]
        while not isinstance(var_input2, Variable):
            var_input2 = var_input2[0]
        try:
            with torch.cuda.device_of(var_input1):
                output = module(input1, input2)
            with lock:
                results[input1] = output
        except Exception as e:
            with lock:
                results[input1] = e

    threads = [threading.Thread(target=_worker,
                                args=(module, input1, input2, results, lock))
               for module, input1, input2 in zip(modules, input1s, input2s)]

    # Run the workers and collect the outputs in input order,
    # re-raising any exception a worker stored.
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    outputs = []
    for input1 in input1s:
        output = results[input1]
        if isinstance(output, Exception):
            raise output
        outputs.append(output)
    return outputs
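For illustration, a hedged usage sketch of the helper above, with a hypothetical two-input module and made-up sizes (not from the original comment); note that both inputs are batch-first, so scatter chunks them along dim 0:

import torch
import torch.nn as nn
from torch.autograd import Variable


class TwoInputNet(nn.Module):
    # Hypothetical two-input module, illustrative only.
    def forward(self, x1, x2):
        return torch.cat([x1, x2], 1)


if torch.cuda.device_count() >= 2:
    model = TwoInputNet().cuda()
    # Batch-first inputs, so scatter() splits them along dim 0 (the batch).
    enc_inputs = Variable(torch.randn(64, 20, 128).cuda())   # (batch, enc_len, hidden)
    dec_inputs = Variable(torch.randn(64, 38, 128).cuda())   # (batch, dec_len, hidden)
    outputs = data_parallel2(model, enc_inputs, dec_inputs, device_ids=[0, 1])
    # outputs: (64, 58, 128), gathered on device_ids[0]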

soumith (Member) commented Feb 6, 2017

wdym by "it does not work"

everwind commented Feb 6, 2017

The error info:
Traceback (most recent call last):
File "translate.py", line 136, in <module>
data_parallel2(model,batch_enc_inputs, batch_dec_inputs, device_ids=[0, 1] )
File "/data1/plat/peakzeng/workspace/seq2seq-pytorch-example-master/data_parallel.py", line 115, in data_parallel2
outputs = parallel_apply2(replicas, input1s, input2s)
File "/data1/plat/peakzeng/workspace/seq2seq-pytorch-example-master/parallel_apply.py", line 86, in parallel_apply2
raise output
RuntimeError: equal number of batches expected at /data/plat/peakzeng/solfware/pytorch/torch/lib/THC/generic/THCTensorMathBlas.cu:441

soumith (Member) commented Feb 6, 2017

Looks like input1 and input2 don't have equal sizes, maybe? Either way, it looks like an implementation bug on your side.

everwind commented Feb 6, 2017

Yes, input1 and input2 don't have equal sizes: input1.size() = (input_len1, batch_size, hidden_size) and input2.size() = (input_len2, batch_size, hidden_size). If I make input1.size() = (batch_size, hidden_size, input_len1) and input2.size() = (batch_size, hidden_size, input_len2), will it work?

apaszke (Contributor, Author) commented Feb 6, 2017

Yes, it expects the batch dimension to be first.
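To make that concrete, here is a hedged sketch (hypothetical names and sizes, reusing the hypothetical model and data_parallel2 from above) of transposing time-major tensors so the batch dimension comes first before the parallel call; scatter always chunks along dim 0, so (seq_len, batch, hidden) inputs would otherwise be split along seq_len instead of the batch:

# Hypothetical (seq_len, batch, hidden) tensors; names and sizes are made up.
input1 = Variable(torch.randn(20, 64, 128).cuda())
input2 = Variable(torch.randn(38, 64, 128).cuda())

# Move the batch dimension to the front so scatter() chunks along the batch:
input1_bf = input1.transpose(0, 1).contiguous()   # (64, 20, 128)
input2_bf = input2.transpose(0, 1).contiguous()   # (64, 38, 128)

outputs = data_parallel2(model, input1_bf, input2_bf, device_ids=[0, 1])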

apaszke self-assigned this Feb 19, 2017
soumith closed this as completed Feb 21, 2017
JianboTang commented

@apaszke should the second dimension be equal as well? I encounter this problem even though the first dimension is equal:

Traceback (most recent call last):
File "/home/tangjianbo/anaconda2/lib/python2.7/pdb.py", line 1314, in main
pdb._runscript(mainpyfile)
File "/home/tangjianbo/anaconda2/lib/python2.7/pdb.py", line 1233, in _runscript
self.run(statement)
File "/home/tangjianbo/anaconda2/lib/python2.7/bdb.py", line 400, in run
exec cmd in globals, locals
File "", line 1, in
File "main_SimpleAoA.py", line 1, in
import argparse
File "main_SimpleAoA.py", line 352, in main
train_loss_epoch, train_prob_epoch = train(epoch)
File "main_SimpleAoA.py", line 300, in train
loss, prob = AoA(q1_input, q2_input, lb_input)
File "/home/tangjianbo/anaconda2/lib/python2.7/site-packages/torch/nn/modules/module.py", line 224, in call
result = self.forward(*input, **kwargs)
File "/home/tangjianbo/anaconda2/lib/python2.7/site-packages/torch/nn/parallel/data_parallel.py", line 60, in forward
outputs = self.parallel_apply(replicas, inputs, kwargs)
File "/home/tangjianbo/anaconda2/lib/python2.7/site-packages/torch/nn/parallel/data_parallel.py", line 70, in parallel_apply
return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
File "/home/tangjianbo/anaconda2/lib/python2.7/site-packages/torch/nn/parallel/parallel_apply.py", line 67, in parallel_apply
raise output
RuntimeError: invalid argument 7: equal number of batches expected at /pytorch/torch/lib/THC/generic/THCTensorMathBlas.cu:447
Uncaught exception. Entering post mortem debugging
Running 'cont' or 'step' will restart the program

/home/tangjianbo/anaconda2/lib/python2.7/site-packages/torch/nn/parallel/parallel_apply.py(67)parallel_apply()
-> raise output
(Pdb) up
/home/tangjianbo/anaconda2/lib/python2.7/site-packages/torch/nn/parallel/data_parallel.py(70)parallel_apply()
-> return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
(Pdb)
/home/tangjianbo/anaconda2/lib/python2.7/site-packages/torch/nn/parallel/data_parallel.py(60)forward()
-> outputs = self.parallel_apply(replicas, inputs, kwargs)
(Pdb)
/home/tangjianbo/anaconda2/lib/python2.7/site-packages/torch/nn/modules/module.py(224)__call__()
-> result = self.forward(*input, **kwargs)
(Pdb)
/data/tangjianbo/question_normalization/AoA/char/main_SimpleAoA.py(300)train()
-> loss, prob = AoA(q1_input, q2_input, lb_input)
(Pdb) q1_input.size()
torch.Size([64, 20, 128])
(Pdb) q2_input.size()
torch.Size([64, 38, 128])
(Pdb) lb_input.size()
torch.Size([64, 1])
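For context, this particular error comes from batched matrix multiply (bmm/baddbmm) in THC, which requires both operands to have the same size in dim 0; DataParallel splits every input's dim 0 evenly across GPUs, so a mismatch like this usually originates from a bmm inside the model whose operands differ in their leading dimension. A hedged, minimal illustration of that error class (made-up sizes; the exact message wording varies across PyTorch versions):

import torch

# bmm requires both operands to share the same dim-0 size (number of matrices).
a = torch.randn(20, 64, 128)
b = torch.randn(38, 128, 64)
if torch.cuda.is_available():
    a, b = a.cuda(), b.cuda()

try:
    torch.bmm(a, b)        # dim-0 mismatch: 20 vs 38 -> RuntimeError
except RuntimeError as e:
    print(e)               # on the THC build above: "equal number of batches expected"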

mrshenli pushed a commit to mrshenli/pytorch that referenced this issue Apr 11, 2020
jjsjann123 pushed a commit to jjsjann123/pytorch that referenced this issue Apr 11, 2021
Remove relative_compute_at_axis, getComputeAtRelPos, TensorView::compute_at_view_, Expose ComputeAtMap so that it can be used in the C++ tests
akashveramd pushed a commit to akashveramd/pytorch that referenced this issue Apr 9, 2025