Closed
Description
Currently, if there is a problem handling the Add event for a TfJob, the job just gets stuck rather than being reported as failed with an appropriate error message.
Here's an example:
apiVersion: mlkube.io/v1beta1
kind: TfJob
metadata:
  clusterName: ""
  creationTimestamp: 2017-08-21T13:41:49Z
  deletionGracePeriodSeconds: null
  deletionTimestamp: null
  name: dv2-eval-0641
  namespace: default
  resourceVersion: "877473"
  selfLink: /apis/mlkube.io/v1beta1/namespaces/default/tfjobs/eval-0641
  uid: 7b3eb3cd-8676-11e7-a025-42010a8e0097
spec:
  replicaSpecs:
  - replicas: 1
    template:
      spec:
        containers:
        - command:
          - python
          - -m
          - my_code.train
          - --master=
          - --checkpoint_dir=gs:/some/path
          - --eval_dir=gs:/some/path
          - --alsologtostderr
          image: gcr.io/cloud-ml-dev/image:latest
          name: tensorflow
        restartPolicy: OnFailure
    tfPort: 2222
    tfReplicaType: MASTER
  tensorBoard:
    logDir: null
In this case NewTBReplicaSet fails because logDir isn't specified.
The TfJob just remains in the state shown in the YAML above, which isn't helpful. The TfJob should be marked failed with a helpful error message.
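A minimal sketch of the desired behavior, using hypothetical stand-ins (the `TfJobStatus`, `setupTensorBoard`, and `onAdd` names below are illustrative, not the operator's actual API): when setup fails during the Add event, the handler should record a Failed state and a reason on the job's status instead of returning early and leaving the job stuck.

```go
package main

import (
	"errors"
	"fmt"
)

// State mirrors a simple job state machine.
type State string

const (
	StateRunning State = "Running"
	StateFailed  State = "Failed"
)

// TfJobStatus is a simplified stand-in for the job's status object.
type TfJobStatus struct {
	State  State
	Reason string
}

// setupTensorBoard stands in for NewTBReplicaSet: it fails when
// no log directory is configured, as in the YAML above.
func setupTensorBoard(logDir string) error {
	if logDir == "" {
		return errors.New("tensorBoard.logDir must be specified")
	}
	return nil
}

// onAdd sketches the proposed Add-event handling: on a setup error,
// mark the job failed with a descriptive reason rather than leaving
// it in its initial state.
func onAdd(logDir string, status *TfJobStatus) {
	if err := setupTensorBoard(logDir); err != nil {
		status.State = StateFailed
		status.Reason = fmt.Sprintf("invalid spec: %v", err)
		return
	}
	status.State = StateRunning
}

func main() {
	status := &TfJobStatus{}
	onAdd("", status) // logDir unspecified, as in the example job
	fmt.Println(status.State, "-", status.Reason)
}
```

With this shape, `kubectl get tfjob` (or inspecting the resource's status) would show why the job failed instead of a silently stalled object.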