Closed
Description
Currently, if there is a problem handling the Add event for a TfJob, the job just gets stuck rather than being reported as failed with an appropriate error message.
Here's an example:
apiVersion: mlkube.io/v1beta1
kind: TfJob
metadata:
  clusterName: ""
  creationTimestamp: 2017-08-21T13:41:49Z
  deletionGracePeriodSeconds: null
  deletionTimestamp: null
  name: dv2-eval-0641
  namespace: default
  resourceVersion: "877473"
  selfLink: /apis/mlkube.io/v1beta1/namespaces/default/tfjobs/eval-0641
  uid: 7b3eb3cd-8676-11e7-a025-42010a8e0097
spec:
  replicaSpecs:
  - replicas: 1
    template:
      spec:
        containers:
        - command:
          - python
          - -m
          - my_code.train
          - --master=
          - --checkpoint_dir=gs:/some/path
          - --eval_dir=gs:/some/path
          - --alsologtostderr
          image: gcr.io/cloud-ml-dev/image:latest
          name: tensorflow
        restartPolicy: OnFailure
    tfPort: 2222
    tfReplicaType: MASTER
  tensorBoard:
    logDir: null
In this case NewTBReplicaSet fails because logDir isn't specified.
The TfJob just remains in the state shown in the YAML above, which isn't helpful. The TfJob should be marked failed with a helpful error message.
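A minimal sketch of the desired behavior, using hypothetical stand-ins (the `TfJobStatus`, `setupTensorBoard`, and `onAdd` names below are illustrative, not the operator's actual API): when setup fails during the Add event, the handler should record a Failed state and a reason on the job's status instead of returning early and leaving the job stuck.

```go
package main

import (
	"errors"
	"fmt"
)

// State mirrors a simple job state machine.
type State string

const (
	StateRunning State = "Running"
	StateFailed  State = "Failed"
)

// TfJobStatus is a simplified stand-in for the job's status object.
type TfJobStatus struct {
	State  State
	Reason string
}

// setupTensorBoard stands in for NewTBReplicaSet: it fails when
// no log directory is configured, as in the YAML above.
func setupTensorBoard(logDir string) error {
	if logDir == "" {
		return errors.New("tensorBoard.logDir must be specified")
	}
	return nil
}

// onAdd sketches the proposed Add-event handling: on a setup error,
// mark the job failed with a descriptive reason rather than leaving
// it in its initial state.
func onAdd(logDir string, status *TfJobStatus) {
	if err := setupTensorBoard(logDir); err != nil {
		status.State = StateFailed
		status.Reason = fmt.Sprintf("invalid spec: %v", err)
		return
	}
	status.State = StateRunning
}

func main() {
	status := &TfJobStatus{}
	onAdd("", status) // logDir unspecified, as in the example job
	fmt.Println(status.State, "-", status.Reason)
}
```

With this shape, `kubectl get tfjob` (or inspecting the resource's status) would show why the job failed instead of a silently stalled object.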