Description
I ran into an issue trying to train flan-t5 on an M1 using torchmetrics. Training metrics worked fine, but I got the following stacktrace when calculating evaluation metrics:
```
...
  File "/Users/erosenthal/.pyenv/versions/pynlp/lib/python3.9/site-packages/composer/trainer/trainer.py", line 1804, in fit
    self._train_loop()
  File "/Users/erosenthal/.pyenv/versions/pynlp/lib/python3.9/site-packages/composer/trainer/trainer.py", line 2032, in _train_loop
    self._run_evaluators(Event.BATCH_END)
  File "/Users/erosenthal/.pyenv/versions/pynlp/lib/python3.9/site-packages/composer/trainer/trainer.py", line 2117, in _run_evaluators
    self._eval_loop(
  File "/Users/erosenthal/.pyenv/versions/pynlp/lib/python3.9/site-packages/composer/trainer/trainer.py", line 2833, in _eval_loop
    self._original_model.update_metric(
  File "/Users/erosenthal/.pyenv/versions/pynlp/lib/python3.9/site-packages/composer/models/huggingface.py", line 438, in update_metric
    metric.update(outputs, self.labels)  # pyright: ignore [reportGeneralTypeIssues]
  File "/Users/erosenthal/.pyenv/versions/pynlp/lib/python3.9/site-packages/torchmetrics/metric.py", line 400, in wrapped_func
    raise err
  File "/Users/erosenthal/.pyenv/versions/pynlp/lib/python3.9/site-packages/torchmetrics/metric.py", line 390, in wrapped_func
    update(*args, **kwargs)
  File "/Users/erosenthal/.pyenv/versions/pynlp/lib/python3.9/site-packages/composer/metrics/nlp.py", line 111, in update
    losses = self.loss_fn(logits, target)
  File "/Users/erosenthal/.pyenv/versions/pynlp/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/Users/erosenthal/.pyenv/versions/pynlp/lib/python3.9/site-packages/torch/nn/modules/loss.py", line 1174, in forward
    return F.cross_entropy(input, target, weight=self.weight,
  File "/Users/erosenthal/.pyenv/versions/pynlp/lib/python3.9/site-packages/torch/nn/functional.py", line 3029, in cross_entropy
    return torch._C._nn.cross_entropy_loss(input, target, weight, _Reduction.get_enum(reduction), ignore_index, label_smoothing)
RuntimeError: Placeholder storage has not been allocated on MPS device!
```
I believe the issue is related to the following snippet of code: `composer/composer/trainer/trainer.py`, lines 2846 to 2858 in ff59e86.
The `outputs` tensor is explicitly moved to `cpu` if it's on `mps`, but the `batch` tensor is not. Hence, you inevitably have a device mismatch when updating metrics. AFAICT, `outputs` is not explicitly moved to `cpu` when it's on `mps` during training-metric updates, which is why I only saw this bug during evaluation.
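For illustration only (this is not Composer's code, and the shapes are made up), here is a minimal sketch of the kind of mismatch described above: a cross-entropy call with logits moved to `cpu` while the labels stay on `mps` fails with a device error.

```python
import torch
import torch.nn.functional as F

# Minimal sketch of the device mismatch: "outputs" moved to cpu, "batch" labels
# left on mps. Requires an Apple Silicon machine with the MPS backend available.
if torch.backends.mps.is_available():
    logits = torch.randn(8, 32)                        # stand-in for outputs, on cpu
    labels = torch.randint(0, 32, (8,), device="mps")  # stand-in for batch labels, on mps
    try:
        F.cross_entropy(logits, labels)
    except RuntimeError as err:
        # Raises a device-mismatch RuntimeError analogous to the
        # "Placeholder storage has not been allocated on MPS device!" error above.
        print(err)
```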
If there really are numerical errors with `torchmetrics` on `mps`, then training metrics probably ought to be calculated on `cpu` as well, in order to bring parity to the training and eval calculations. Additionally, the `batch` tensor will need to be moved to `cpu`.
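A hedged sketch of what that could look like, using hypothetical helper names rather than Composer's actual internals: whenever the model runs on `mps`, move both `outputs` and the label tensor from `batch` to `cpu` before calling `metric.update`, and do it identically for the training and eval paths.

```python
import torch

# Hypothetical helper (illustrative names, not Composer's API): if metrics must
# be computed on cpu because of mps issues, move *both* sides of the comparison
# so their devices always match.
def move_metric_inputs_to_cpu(outputs: torch.Tensor, labels: torch.Tensor, device_type: str):
    if device_type == "mps":
        outputs = outputs.cpu()
        labels = labels.cpu()
    return outputs, labels

# Usage sketch inside a metric-update step (both training and eval):
# outputs, labels = move_metric_inputs_to_cpu(outputs, batch["labels"], device.type)
# metric.update(outputs, labels)
```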