Tags: Steboss/Fuser
Add device argument to FusionDefinition.execute (NVIDIA#178)
This is related to issue NVIDIA#175. This option adds a strictly checked hint for the device the Fusion should run on. If inputs are passed that do not match the provided device, an error is emitted.
---------
Co-authored-by: Naoya Maruyama <naoyam@users.noreply.github.com>
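A minimal sketch of the new argument, assuming `device` accepts a standard device string; the keyword comes from the PR title and details may differ:

```python
import torch
from nvfuser import FusionDefinition

inputs = [torch.randn(4, 4, device="cuda:0")]

with FusionDefinition() as fd:
    t0 = fd.from_pytorch(inputs[0])
    t1 = fd.ops.relu(t0)
    fd.add_output(t1)

# 'device' is a strictly checked hint: execution proceeds only if every
# input already resides on the named device; otherwise an error is raised.
out = fd.execute(inputs, device="cuda:0")
```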
[breaking change] Trunc div and true div (NVIDIA#360)
Fixes NVIDIA#328. The sketch after this list illustrates the change.
- `ops.div` semantics changed from true division to trunc division
- `ops.truediv` added for true division
- version bumped
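A hedged sketch of the new semantics; the expected values in the comments follow standard trunc vs. true division:

```python
import torch
from nvfuser import FusionDefinition

inputs = [
    torch.tensor([7.0, -7.0], device="cuda"),
    torch.tensor([2.0, 2.0], device="cuda"),
]

with FusionDefinition() as fd:
    a = fd.from_pytorch(inputs[0])
    b = fd.from_pytorch(inputs[1])
    fd.add_output(fd.ops.div(a, b))      # trunc division: [3., -3.]
    fd.add_output(fd.ops.truediv(a, b))  # true division:  [3.5, -3.5]

trunc_out, true_out = fd.execute(inputs)
```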
Bump nvfuser version for `signbit` operation (NVIDIA#352)
Related: NVIDIA#344
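For context, a small sketch of the operation itself, assuming it is exposed as `fd.ops.signbit` in the python API:

```python
import torch
from nvfuser import FusionDefinition

inputs = [torch.tensor([1.5, -0.0, -2.0], device="cuda")]

with FusionDefinition() as fd:
    t0 = fd.from_pytorch(inputs[0])
    # True wherever the sign bit is set, including for -0.0
    fd.add_output(fd.ops.signbit(t0))

(out,) = fd.execute(inputs)  # expected: [False, True, True]
```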
segmenter hint (NVIDIA#262)
Added a new op to hint segmentation. This targets debug usage where we want to explore scheduler heuristics; see the sketch after the checklist.
TODO:
- [x] Rename the op to better reflect the intention
- [x] Plumb it to the python API
- [x] Python test
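A sketch of the hint in use, assuming the renamed op is exposed to python as `fd.ops.segment_set` (the final name is an assumption; the PR text does not state it):

```python
import torch
from nvfuser import FusionDefinition

inputs = [torch.randn(4, 4, device="cuda")]

with FusionDefinition() as fd:
    t0 = fd.from_pytorch(inputs[0])
    t1 = fd.ops.relu(t0)
    # Assumed name: forces a segmentation boundary here so the two halves
    # are scheduled separately, useful when exploring scheduler heuristics.
    t2 = fd.ops.segment_set(t1)
    fd.add_output(fd.ops.neg(t2))

out = fd.execute(inputs)
```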
Change the `contiguous` parameter of `define_tensor` in the python API to `contiguity` (NVIDIA#191)
NM
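A sketch of the renamed parameter; `contiguity` comes from the PR, while the other keyword names shown are assumptions about the surrounding `define_tensor` signature:

```python
import torch
from nvfuser import FusionDefinition, DataType

with FusionDefinition() as fd:
    t0 = fd.define_tensor(
        shape=[-1, -1],           # symbolic sizes (assumed keyword)
        contiguity=[True, True],  # renamed from 'contiguous'
        dtype=DataType.Float,     # assumed keyword
    )
    fd.add_output(fd.ops.relu(t0))

out = fd.execute([torch.randn(4, 4, device="cuda")])
```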
Add `FusionDefinition` methods to print the `fusion_ir`, `scheduled_fusion_ir`, and `cuda_code` (NVIDIA#6)
Resolves issue [NVIDIA#2387](csarofeen/pytorch#2387) for @mruberry. The following APIs are added to the Python `FusionDefinition`. They all return strings instead of printing to `stdout` so the python user can do what they want with the string output.

```python
fd.fusion_ir()
fd.last_scheduled_fusion_ir(tensor_transforms=False)
fd.scheduled_fusion_ir_for(inputs, tensor_transforms=False)
fd.last_cuda_code(intrinsic_code=False)
fd.cuda_code_for(inputs, intrinsic_code=False)
```

Example test case:

```python
import torch
from nvfuser import FusionDefinition

inputs = [
    torch.randn(4, 4, device='cuda'),
]

with FusionDefinition() as fd:
    t0 = fd.from_pytorch(inputs[0])
    t1 = fd.ops.relu(t0)
    fd.add_output(t1)

out = fd.execute(inputs)
```

**Fusion IR**

Gives a string for the unscheduled Fusion IR for a given definition. Example Usage:

```python
print(fd.fusion_ir())
```

Output:

```
%kernel {
T1_g[ iS2{i0}, iS3{i1} ] = relu(T0_g[ iS0{i0}, iS1{i1} ]);
}
```

**Scheduled Fusion IR**

Gives a string for the scheduled Fusion IR with or without transforms for either the last scheduled fusion or for a specific set of inputs. Note, segmented fusions also print the segmented groupings to help the user understand what the scheduled Fusion IR represents. Example Usages:

```python
print(fd.last_scheduled_fusion_ir(tensor_transforms=True))
print(fd.scheduled_fusion_ir_for(inputs, tensor_transforms=True))
```

Output:

```
%kernel {
T2_l[ iblockIdx.x27{( ceilDiv(( ceilDiv(( ceilDiv(( T0.size[0] * T0.size[1] ), 128) ), 1) ), 1) )}, iUS28{1}, iS26{1}, ithreadIdx.x24{128} ] ca_pos( 2 ) = T0_g[ iS34{( ceilDiv(( ceilDiv(( ceilDiv(( T0.size[0] * T0.size[1] ), 128) ), 1) ), 1) )}, iS35{1}, iS33{1}, iS31{128} ];
T3_l[ iblockIdx.x20{( ceilDiv(( ceilDiv(( ceilDiv(( T0.size[0] * T0.size[1] ), 128) ), 1) ), 1) )}, iUS21{1}, iS19{1}, ithreadIdx.x17{128} ] ca_pos( 4 ) produce_pos( 2 ) = relu(T2_l[ iblockIdx.x27{( ceilDiv(( ceilDiv(( ceilDiv(( T0.size[0] * T0.size[1] ), 128) ), 1) ), 1) )}, iUS28{1}, iS26{1}, ithreadIdx.x24{128} ] ca_pos( 2 ));
T1_g[ iblockIdx.x13{( ceilDiv(( ceilDiv(( ceilDiv(( T0.size[0] * T0.size[1] ), 128) ), 1) ), 1) )}, iUS14{1}, iS12{1}, ithreadIdx.x10{128} ] ca_pos( 2 ) produce_pos( 4 ) = T3_l[ iblockIdx.x20{( ceilDiv(( ceilDiv(( ceilDiv(( T0.size[0] * T0.size[1] ), 128) ), 1) ), 1) )}, iUS21{1}, iS19{1}, ithreadIdx.x17{128} ] ca_pos( 4 ) produce_pos( 2 );

TransformPrinter :
T0_g[ iS34{( ceilDiv(( ceilDiv(( ceilDiv(( T0.size[0] * T0.size[1] ), 128) ), 1) ), 1) )}, iS35{1}, iS33{1}, iS31{128} ]
 root domain : (iS36{T0.size[0]},iS37{T0.size[1]})
 contiguity: t t
  Merge: iS36{T0.size[0]} and iS37{T0.size[1]} -> iS29{( T0.size[0] * T0.size[1] )}
  Split: iS29{( T0.size[0] * T0.size[1] )} by factor 128 -> iS30{( ceilDiv(( T0.size[0] * T0.size[1] ), 128) )}, iS31{128}, start offset: 0, stop offset: 0
  Split: iS30{( ceilDiv(( T0.size[0] * T0.size[1] ), 128) )} by factor 1 -> iS32{( ceilDiv(( ceilDiv(( T0.size[0] * T0.size[1] ), 128) ), 1) )}, iS33{1}, start offset: 0, stop offset: 0
  Split: iS32{( ceilDiv(( ceilDiv(( T0.size[0] * T0.size[1] ), 128) ), 1) )} by factor 1 -> iS34{( ceilDiv(( ceilDiv(( ceilDiv(( T0.size[0] * T0.size[1] ), 128) ), 1) ), 1) )}, iS35{1}, start offset: 0, stop offset: 0
T2_l[ iblockIdx.x27{( ceilDiv(( ceilDiv(( ceilDiv(( T0.size[0] * T0.size[1] ), 128) ), 1) ), 1) )}, iUS28{1}, iS26{1}, ithreadIdx.x24{128} ] ca_pos( 2 )
 root domain : (iS38{T0.size[0]},iS39{T0.size[1]})
 contiguity: t t
  Merge: iS38{T0.size[0]} and iS39{T0.size[1]} -> iS22{( T0.size[0] * T0.size[1] )}
  Split: iS22{( T0.size[0] * T0.size[1] )} by factor 128 -> iS23{( ceilDiv(( T0.size[0] * T0.size[1] ), 128) )}, ithreadIdx.x24{128}, start offset: 0, stop offset: 0
  Split: iS23{( ceilDiv(( T0.size[0] * T0.size[1] ), 128) )} by factor 1 -> iS25{( ceilDiv(( ceilDiv(( T0.size[0] * T0.size[1] ), 128) ), 1) )}, iS26{1}, start offset: 0, stop offset: 0
  Split: iS25{( ceilDiv(( ceilDiv(( T0.size[0] * T0.size[1] ), 128) ), 1) )} by factor 1 -> iblockIdx.x27{( ceilDiv(( ceilDiv(( ceilDiv(( T0.size[0] * T0.size[1] ), 128) ), 1) ), 1) )}, iUS28{1}, start offset: 0, stop offset: 0
T3_l[ iblockIdx.x20{( ceilDiv(( ceilDiv(( ceilDiv(( T0.size[0] * T0.size[1] ), 128) ), 1) ), 1) )}, iUS21{1}, iS19{1}, ithreadIdx.x17{128} ] ca_pos( 4 ) produce_pos( 2 )
 root domain : (iS40{T0.size[0]},iS41{T0.size[1]})
 contiguity: t t
  Merge: iS40{T0.size[0]} and iS41{T0.size[1]} -> iS15{( T0.size[0] * T0.size[1] )}
  Split: iS15{( T0.size[0] * T0.size[1] )} by factor 128 -> iS16{( ceilDiv(( T0.size[0] * T0.size[1] ), 128) )}, ithreadIdx.x17{128}, start offset: 0, stop offset: 0
  Split: iS16{( ceilDiv(( T0.size[0] * T0.size[1] ), 128) )} by factor 1 -> iS18{( ceilDiv(( ceilDiv(( T0.size[0] * T0.size[1] ), 128) ), 1) )}, iS19{1}, start offset: 0, stop offset: 0
  Split: iS18{( ceilDiv(( ceilDiv(( T0.size[0] * T0.size[1] ), 128) ), 1) )} by factor 1 -> iblockIdx.x20{( ceilDiv(( ceilDiv(( ceilDiv(( T0.size[0] * T0.size[1] ), 128) ), 1) ), 1) )}, iUS21{1}, start offset: 0, stop offset: 0
T1_g[ iblockIdx.x13{( ceilDiv(( ceilDiv(( ceilDiv(( T0.size[0] * T0.size[1] ), 128) ), 1) ), 1) )}, iUS14{1}, iS12{1}, ithreadIdx.x10{128} ] ca_pos( 2 ) produce_pos( 4 )
 root domain : (iS42{T0.size[0]},iS43{T0.size[1]})
 contiguity: t t
  Merge: iS42{T0.size[0]} and iS43{T0.size[1]} -> iS8{( T0.size[0] * T0.size[1] )}
  Split: iS8{( T0.size[0] * T0.size[1] )} by factor 128 -> iS9{( ceilDiv(( T0.size[0] * T0.size[1] ), 128) )}, ithreadIdx.x10{128}, start offset: 0, stop offset: 0
  Split: iS9{( ceilDiv(( T0.size[0] * T0.size[1] ), 128) )} by factor 1 -> iS11{( ceilDiv(( ceilDiv(( T0.size[0] * T0.size[1] ), 128) ), 1) )}, iS12{1}, start offset: 0, stop offset: 0
  Split: iS11{( ceilDiv(( ceilDiv(( T0.size[0] * T0.size[1] ), 128) ), 1) )} by factor 1 -> iblockIdx.x13{( ceilDiv(( ceilDiv(( ceilDiv(( T0.size[0] * T0.size[1] ), 128) ), 1) ), 1) )}, iUS14{1}, start offset: 0, stop offset: 0
}
```

**Cuda Code**

Gives a string for the Cuda Code for a fusion with or without the full intrinsic code required to execute the kernel for either the last scheduled fusion or for a specific set of inputs. Example Usages:

```python
print(fd.last_cuda_code(intrinsic_code=False))
print(fd.cuda_code_for(inputs, intrinsic_code=False))
```

Output:

```c++
__global__ void kernel1(Tensor<float, 2> T0, Tensor<float, 2> T1) {
  int i50;
  i50 = ((nvfuser_index_t)threadIdx.x) + (128 * ((nvfuser_index_t)blockIdx.x));
  if ((i50 < (T0.size[0] * T0.size[1]))) {
    float T2[1];
    T2[0] = 0;
    T2[0] = T0[i50];
    float T3[1];
    T3[0] = relu(T2[0]);
    T1[i50] = T3[0];
  }
}
```
Add nextafter op (NVIDIA#44)
The `nextafter(x, y)` operation returns the nearest distinct representable floating point value to `x` in the direction of `y`. In CUDA these are obtained with the builtins `nextafter` and `nextafterf`. Other types such as bfloat16 are not directly supported, though PyTorch has implemented that case based on some code in musl: https://github.com/pytorch/pytorch/pull/61829/files#diff-ece04c31934b3504382e10ed3e9a69f03ffabd81ad1a2a890aab19b1642f53c0R120. In PyTorch, the `torch.nextafter` function is defined for any pair of arguments that would normally be promoted to either float32 or float64, so arguments that are both ints, bools, complex, or half-precision floats are not supported. This PR implements a binary op macro with a `TypePromotionConfig` that enforces that rule. This is a translation/update of csarofeen/pytorch#2510.
---------
Co-authored-by: Jacob Hinkle <jhinkle@nvidia.com>
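A minimal sketch of the op from the python side; expected behavior mirrors `torch.nextafter`:

```python
import torch
from nvfuser import FusionDefinition

inputs = [
    torch.tensor([1.0, 1.0], device="cuda"),
    torch.tensor([2.0, 0.0], device="cuda"),
]

with FusionDefinition() as fd:
    x = fd.from_pytorch(inputs[0])
    y = fd.from_pytorch(inputs[1])
    # nearest representable float to x in the direction of y
    fd.add_output(fd.ops.nextafter(x, y))

(out,) = fd.execute(inputs)
# expected to match torch.nextafter(inputs[0], inputs[1])
```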
Add python `slice` API. (NVIDIA#27)
This PR adds the `slice` operation as documented by the [XLA Slice Op](https://www.tensorflow.org/xla/operation_semantics#slice).
---------
Co-authored-by: Kevin Stephano <me@nvdl-a112-d004.nvidia.com>
Co-authored-by: Naoya Maruyama <nmaruyama@nvidia.com>
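A sketch of the call, assuming XLA-style `start_indices`/`end_indices`/`strides` keywords as in the linked operation semantics:

```python
import torch
from nvfuser import FusionDefinition

inputs = [torch.arange(12, device="cuda").float().reshape(3, 4)]

with FusionDefinition() as fd:
    t0 = fd.from_pytorch(inputs[0])
    # rows 0:2 and columns 1:4, XLA Slice semantics
    t1 = fd.ops.slice(
        t0,
        start_indices=[0, 1],
        end_indices=[2, 4],
        strides=[1, 1],
    )
    fd.add_output(t1)

(out,) = fd.execute(inputs)  # shape (2, 3)
```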