8000 Tags · Steboss/Fuser · GitHub
[go: up one dir, main page]
More Web Proxy on the site http://driver.im/
Skip to content

Tags: Steboss/Fuser

Tags

v0.0.13

Toggle v0.0.13's commit message

Verified

This commit was created on GitHub.com and signed with GitHub’s verified signature. The key has expired.
Add device argument to FusionDefinition.execute (NVIDIA#178)

This is related to issue NVIDIA#175 

This option adds a strictly-checked hint of the device to run the Fusion
on. If inputs are passed which do not match the provided device, an
error will be emitted.

---------

Co-authored-by: Naoya Maruyama <naoyam@users.noreply.github.com>

v0.0.12

Toggle v0.0.12's commit message

Verified

This commit was created on GitHub.com and signed with GitHub’s verified signature. The key has expired.
[breaking change] Trunc div and true div (NVIDIA#360)

Fixes NVIDIA#328
- `ops.div` semantics changed from true division to trunc division
- `ops.truediv` added for true division
- version bumped

v0.0.11

Toggle v0.0.11's commit message

Verified

This commit was created on GitHub.com and signed with GitHub’s verified signature. The key has expired.
Bump nvfuser version for `signbit` operation (NVIDIA#352)

Related NVIDIA#344

v0.0.10

Toggle v0.0.10's commit message

Verified

This commit was created on GitHub.com and signed with GitHub’s verified signature. The key has expired.
segmenter hint (NVIDIA#262)

added a new op to hint segmentation, this is to target debug usage where
we would want to explore scheduler heuristics

TODO:
- [x] Rename the op to better reflect the intention
- [x] Plumb it to python API.
- [x] Python test.

v0.0.9

Toggle v0.0.9's commit message

Verified

This commit was created on GitHub.com and signed with GitHub’s verified signature. The key has expired.
Change the `contiguous` parameter of `define_tensor` in the python AP…

…I to `contiguity` (NVIDIA#191)

NM

v0.0.8

Toggle v0.0.8's commit message

Verified

This commit was created on GitHub.com and signed with GitHub’s verified signature. The key has expired.
Add `FusionDefinition` methods to print the `fusion_ir`, `scheduled_f…

…usion_ir`, and `cuda_code` (NVIDIA#6)

Resolves issue [NVIDIA#2387](csarofeen/pytorch#2387)
for @mruberry .

The following APIs are added to the Python `FusionDefinition`. They all
return strings instead of printing to `stdout` so the python user can do
what they want with the string output.

```python
fd.fusion_ir()
fd.last_scheduled_fusion_ir(tensor_transforms=False)
fd.scheduled_fusion_ir_for(inputs, tensor_transforms=False)
fd.last_cuda_code(intrinsic_code=False)
fd.cuda_code_for(inputs, intrinsic_code=False)
```

Example test case:

```python
import torch
from nvfuser import FusionDefinition

inputs = [
    torch.randn(4, 4, device='cuda'),
]

with FusionDefinition() as fd:
    t0 = fd.from_pytorch(inputs[0])
    t1 = fd.ops.relu(t0)
    fd.add_output(t1)

out = fd.execute(inputs)
```

**Fusion IR**
Gives a string for the unscheduled Fusion IR for a given definition.
Example Usage:
```python
print(fd.fusion_ir())
```
Output:
```
%kernel {
T1_g[ iS2{i0}, iS3{i1} ]
   = relu(T0_g[ iS0{i0}, iS1{i1} ]);
}
```

**Scheduled Fusion IR**
Gives a string for the scheduled Fusion IR with or without transforms
for either the last scheduled fusion or for a specific set of inputs.
Note, segmented fusions also print the segmented groupings to help the
user understand what the scheduled Fusion IR represents.

Example Usages:
```python
print(fd.last_scheduled_fusion_ir(tensor_transforms=True))
print(fd.scheduled_fusion_ir_for(inputs, tensor_transforms=True))
```
Output:
```
%kernel {
T2_l[ iblockIdx.x27{( ceilDiv(( ceilDiv(( ceilDiv(( T0.size[0] * T0.size[1] ), 128) ), 1) ), 1) )}, iUS28{1}, iS26{1}, ithreadIdx.x24{128} ] ca_pos( 2 )
   = T0_g[ iS34{( ceilDiv(( ceilDiv(( ceilDiv(( T0.size[0] * T0.size[1] ), 128) ), 1) ), 1) )}, iS35{1}, iS33{1}, iS31{128} ];
T3_l[ iblockIdx.x20{( ceilDiv(( ceilDiv(( ceilDiv(( T0.size[0] * T0.size[1] ), 128) ), 1) ), 1) )}, iUS21{1}, iS19{1}, ithreadIdx.x17{128} ] ca_pos( 4 ) produce_pos( 2 )
   = relu(T2_l[ iblockIdx.x27{( ceilDiv(( ceilDiv(( ceilDiv(( T0.size[0] * T0.size[1] ), 128) ), 1) ), 1) )}, iUS28{1}, iS26{1}, ithreadIdx.x24{128} ] ca_pos( 2 ));
T1_g[ iblockIdx.x13{( ceilDiv(( ceilDiv(( ceilDiv(( T0.size[0] * T0.size[1] ), 128) ), 1) ), 1) )}, iUS14{1}, iS12{1}, ithreadIdx.x10{128} ] ca_pos( 2 ) produce_pos( 4 )
   = T3_l[ iblockIdx.x20{( ceilDiv(( ceilDiv(( ceilDiv(( T0.size[0] * T0.size[1] ), 128) ), 1) ), 1) )}, iUS21{1}, iS19{1}, ithreadIdx.x17{128} ] ca_pos( 4 ) produce_pos( 2 );

TransformPrinter : 
T0_g[ iS34{( ceilDiv(( ceilDiv(( ceilDiv(( T0.size[0] * T0.size[1] ), 128) ), 1) ), 1) )}, iS35{1}, iS33{1}, iS31{128} ]
 root domain : (iS36{T0.size[0]},iS37{T0.size[1]})
 contiguity: t t
  Merge: iS36{T0.size[0]} and iS37{T0.size[1]} -> iS29{( T0.size[0] * T0.size[1] )}
  Split: iS29{( T0.size[0] * T0.size[1] )} by factor 128 -> iS30{( ceilDiv(( T0.size[0] * T0.size[1] ), 128) )}, iS31{128}, start offset: 0, stop offset: 0
  Split: iS30{( ceilDiv(( T0.size[0] * T0.size[1] ), 128) )} by factor 1 -> iS32{( ceilDiv(( ceilDiv(( T0.size[0] * T0.size[1] ), 128) ), 1) )}, iS33{1}, start offset: 0, stop offset: 0
  Split: iS32{( ceilDiv(( ceilDiv(( T0.size[0] * T0.size[1] ), 128) ), 1) )} by factor 1 -> iS34{( ceilDiv(( ceilDiv(( ceilDiv(( T0.size[0] * T0.size[1] ), 128) ), 1) ), 1) )}, iS35{1}, start offset: 0, stop offset: 0
T2_l[ iblockIdx.x27{( ceilDiv(( ceilDiv(( ceilDiv(( T0.size[0] * T0.size[1] ), 128) ), 1) ), 1) )}, iUS28{1}, iS26{1}, ithreadIdx.x24{128} ] ca_pos( 2 )
 root domain : (iS38{T0.size[0]},iS39{T0.size[1]})
 contiguity: t t
  Merge: iS38{T0.size[0]} and iS39{T0.size[1]} -> iS22{( T0.size[0] * T0.size[1] )}
  Split: iS22{( T0.size[0] * T0.size[1] )} by factor 128 -> iS23{( ceilDiv(( T0.size[0] * T0.size[1] ), 128) )}, ithreadIdx.x24{128}, start offset: 0, stop offset: 0
  Split: iS23{( ceilDiv(( T0.size[0] * T0.size[1] ), 128) )} by factor 1 -> iS25{( ceilDiv(( ceilDiv(( T0.size[0] * T0.size[1] ), 128) ), 1) )}, iS26{1}, start offset: 0, stop offset: 0
  Split: iS25{( ceilDiv(( ceilDiv(( T0.size[0] * T0.size[1] ), 128) ), 1) )} by factor 1 -> iblockIdx.x27{( ceilDiv(( ceilDiv(( ceilDiv(( T0.size[0] * T0.size[1] ), 128) ), 1) ), 1) )}, iUS28{1}, start offset: 0, stop offset: 0
T3_l[ iblockIdx.x20{( ceilDiv(( ceilDiv(( ceilDiv(( T0.size[0] * T0.size[1] ), 128) ), 1) ), 1) )}, iUS21{1}, iS19{1}, ithreadIdx.x17{128} ] ca_pos( 4 ) produce_pos( 2 )
 root domain : (iS40{T0.size[0]},iS41{T0.size[1]})
 contiguity: t t
  Merge: iS40{T0.size[0]} and iS41{T0.size[1]} -> iS15{( T0.size[0] * T0.size[1] )}
  Split: iS15{( T0.size[0] * T0.size[1] )} by factor 128 -> iS16{( ceilDiv(( T0.size[0] * T0.size[1] ), 128) )}, ithreadIdx.x17{128}, start offset: 0, stop offset: 0
  Split: iS16{( ceilDiv(( T0.size[0] * T0.size[1] ), 128) )} by factor 1 -> iS18{( ceilDiv(( ceilDiv(( T0.size[0] * T0.size[1] ), 128) ), 1) )}, iS19{1}, start offset: 0, stop offset: 0
  Split: iS18{( ceilDiv(( ceilDiv(( T0.size[0] * T0.size[1] ), 128) ), 1) )} by factor 1 -> iblockIdx.x20{( ceilDiv(( ceilDiv(( ceilDiv(( T0.size[0] * T0.size[1] ), 128) ), 1) ), 1) )}, iUS21{1}, start offset: 0, stop offset: 0
T1_g[ iblockIdx.x13{( ceilDiv(( ceilDiv(( ceilDiv(( T0.size[0] * T0.size[1] ), 128) ), 1) ), 1) )}, iUS14{1}, iS12{1}, ithreadIdx.x10{128} ] ca_pos( 2 ) produce_pos( 4 )
 root domain : (iS42{T0.size[0]},iS43{T0.size[1]})
 contiguity: t t
  Merge: iS42{T0.size[0]} and iS43{T0.size[1]} -> iS8{( T0.size[0] * T0.size[1] )}
  Split: iS8{( T0.size[0] * T0.size[1] )} by factor 128 -> iS9{( ceilDiv(( T0.size[0] * T0.size[1] ), 128) )}, ithreadIdx.x10{128}, start offset: 0, stop offset: 0
  Split: iS9{( ceilDiv(( T0.size[0] * T0.size[1] ), 128) )} by factor 1 -> iS11{( ceilDiv(( ceilDiv(( T0.size[0] * T0.size[1] ), 128) ), 1) )}, iS12{1}, start offset: 0, stop offset: 0
  Split: iS11{( ceilDiv(( ceilDiv(( T0.size[0] * T0.size[1] ), 128) ), 1) )} by factor 1 -> iblockIdx.x13{( ceilDiv(( ceilDiv(( ceilDiv(( T0.size[0] * T0.size[1] ), 128) ), 1) ), 1) )}, iUS14{1}, start offset: 0, stop offset: 0
}
```

**Cuda Code**
Gives a string for the Cuda Code for a fusion with or without the full
intrinsic code required to execute the kernel for either the last
scheduled fusion or for a specific set of inputs.

Example Usages:
```python
print(fd.last_cuda_code(intrinsic_code=False))
print(fd.cuda_code_for(inputs, intrinsic_code=False))
```

Output:
```c++
__global__ void kernel1(Tensor<float, 2> T0, Tensor<float, 2> T1) {
  int i50;
  i50 = ((nvfuser_index_t)threadIdx.x) + (128 * ((nvfuser_index_t)blockIdx.x));
  if ((i50 < (T0.size[0] * T0.size[1]))) {
    float T2[1];
    T2[0] = 0;
    T2[0]
       = T0[i50];
    float T3[1];
    T3[0]
       = relu(T2[0]);
    T1[i50]
       = T3[0];
  }
}
```

v0.0.7

Toggle v0.0.7's commit message

Verified

This commit was created on GitHub.com and signed with GitHub’s verified signature. The key has expired.
Add nextafter op (NVIDIA#44)

The `nextafter(x, y)` operation provides the nearest distinct
representable floating point value to `x` between `x` and `y`. In CUDA
these are obtained with the builtins `nextafter` and `nextafterf`. Other
types such as bfloat16 are not directly supported, though PyTorch has
implemented that case based on some code in musl:
https://github.com/pytorch/pytorch/pull/61829/files#diff-ece04c31934b3504382e10ed3e9a69f03ffabd81ad1a2a890aab19b1642f53c0R120.

In PyTorch, the torch.nextafter function is defined for any pair of
arguments which would normally be promoted to either float32 or float64.
So arguments which are both ints, bools, complex, or half-precision
floats are not supported. This PR implements a binary op macro with a
TypePromotionConfig that enforces that rule.

This is a translation/update of
csarofeen/pytorch#2510.

---------

Co-authored-by: Jacob Hinkle <jhinkle@nvidia.com>

v0.0.6

Toggle v0.0.6's commit message

Verified

This commit was created on GitHub.com and signed with GitHub’s verified signature. The key has expired.
Add pad to Python frontend (NVIDIA#10)

Fixes NVIDIA#8 

Based on NVIDIA#3

---------

Co-authored-by: Naoya Maruyama <nmaruyama@nvidia.com>
Co-authored-by: Jacob Hinkle <jhinkle@nvidia.com>

v0.0.5

Toggle v0.0.5's commit message

Verified

This commit was created on GitHub.com and signed with GitHub’s verified signature. The key has expired.
Add python `slice` API. (NVIDIA#27)

This PR adds the `slice` operation as is documented by the [XLA Slice
Op](https://www.tensorflow.org/xla/operation_semantics#slice).

---------

Co-authored-by: Kevin Stephano <me@nvdl-a112-d004.nvidia.com>
Co-authored-by: Naoya Maruyama <nmaruyama@nvidia.com>
0