Device mismatch during evaluation when training on mps #2385
Open
@erosenthal-square

Description


I ran into an issue trying to train flan-t5 on an M1 using torchmetrics. Training metrics worked fine, but I got the following stacktrace when calculating evaluation metrics:

...
  File "/Users/erosenthal/.pyenv/versions/pynlp/lib/python3.9/site-packages/composer/trainer/trainer.py", line 1804, in fit
    self._train_loop()
  File "/Users/erosenthal/.pyenv/versions/pynlp/lib/python3.9/site-packages/composer/trainer/trainer.py", line 2032, in _train_loop
    self._run_evaluators(Event.BATCH_END)
  File "/Users/erosenthal/.pyenv/versions/pynlp/lib/python3.9/site-packages/composer/trainer/trainer.py", line 2117, in _run_evaluators
    self._eval_loop(
  File "/Users/erosenthal/.pyenv/versions/pynlp/lib/python3.9/site-packages/composer/trainer/trainer.py", line 2833, in _eval_loop
    self._original_model.update_metric(
  File "/Users/erosenthal/.pyenv/versions/pynlp/lib/python3.9/site-packages/composer/models/huggingface.py", line 438, in update_metric
    metric.update(outputs, self.labels)  # pyright: ignore [reportGeneralTypeIssues]
  File "/Users/erosenthal/.pyenv/versions/pynlp/lib/python3.9/site-packages/torchmetrics/metric.py", line 400, in wrapped_func
    raise err
  File "/Users/erosenthal/.pyenv/versions/pynlp/lib/python3.9/site-packages/torchmetrics/metric.py", line 390, in wrapped_func
    update(*args, **kwargs)
  File "/Users/erosenthal/.pyenv/versions/pynlp/lib/python3.9/site-packages/composer/metrics/nlp.py", line 111, in update
    losses = self.loss_fn(logits, target)
  File "/Users/erosenthal/.pyenv/versions/pynlp/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/Users/erosenthal/.pyenv/versions/pynlp/lib/python3.9/site-packages/torch/nn/modules/loss.py", line 1174, in forward
    return F.cross_entropy(input, target, weight=self.weight,
  File "/Users/erosenthal/.pyenv/versions/pynlp/lib/python3.9/site-packages/torch/nn/functional.py", line 3029, in cross_entropy
    return torch._C._nn.cross_entropy_loss(input, target, weight, _Reduction.get_enum(reduction), ignore_index, label_smoothing)
RuntimeError: Placeholder storage has not been allocated on MPS device!

I believe the issue is related to the following snippet of code:

if isinstance(self.state.device, DeviceMPS):
    # torchmetrics math has numerical errors on M1 devices
    # running the compute on CPU instead
    outputs = self.state.outputs.cpu()
else:
    outputs = self.state.outputs
for _, metric in metrics.items():
    self._original_model.update_metric(
        self.state.batch,
        outputs,
        metric,
    )

The outputs tensor is explicitly moved to cpu if it's on mps, but the batch tensor is not, so you inevitably get a device mismatch when updating metrics. AFAICT, outputs are not explicitly moved to cpu when updating training metrics on mps, which is why I only saw this bug during evaluation.
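
For reference, a minimal standalone sketch of the mismatch (hypothetical code, assuming an MPS-capable machine): cross_entropy receives logits that were moved to cpu while the target is still on mps.

import torch
import torch.nn.functional as F

logits = torch.randn(4, 10).cpu()                  # outputs, moved to CPU
target = torch.randint(0, 10, (4,), device="mps")  # labels, still on MPS

# Mixing devices here should trigger a device-mismatch error like the one
# in the trace above.
F.cross_entropy(logits, target)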

If there really are numerical errors with torchmetrics on mps, then training metrics probably ought to be calculated on cpu as well so that the training and eval calculations stay consistent. Additionally, the batch tensor will need to be moved to cpu; a sketch of what that could look like follows.
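
A minimal sketch of one possible fix, assuming self.state.batch is a tensor or a (possibly nested) dict/list/tuple of tensors; _batch_to_cpu is a hypothetical helper, not an existing Composer function:

import torch

def _batch_to_cpu(x):
    # Recursively move any tensors in the batch to CPU so that metric
    # updates see tensors on a single device.
    if isinstance(x, torch.Tensor):
        return x.cpu()
    if isinstance(x, dict):
        return {k: _batch_to_cpu(v) for k, v in x.items()}
    if isinstance(x, (list, tuple)):
        return type(x)(_batch_to_cpu(v) for v in x)
    return x

if isinstance(self.state.device, DeviceMPS):
    # torchmetrics math has numerical errors on M1 devices,
    # so run the compute on CPU for both the outputs *and* the batch.
    outputs = self.state.outputs.cpu()
    batch = _batch_to_cpu(self.state.batch)
else:
    outputs = self.state.outputs
    batch = self.state.batch
for _, metric in metrics.items():
    self._original_model.update_metric(
        batch,
        outputs,
        metric,
    )

The same treatment would presumably apply wherever training metrics are updated, if those are also moved to cpu for parity.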
