If handling Add event fails, TfJob should be marked as failed with appropriate error · Issue #26 · kubeflow/trainer · GitHub
If handling Add event fails, TfJob should be marked as failed with appropriate error #26
Closed
@jlewi

Description


Currently, if handling the Add event for a TfJob fails, the job just gets stuck instead of being reported as failed with an appropriate error message.

Here's an example:

apiVersion: mlkube.io/v1beta1
kind: TfJob
metadata:
  clusterName: ""
  creationTimestamp: 2017-08-21T13:41:49Z
  deletionGracePeriodSeconds: null
  deletionTimestamp: null
  name: dv2-eval-0641
  namespace: default
  resourceVersion: "877473"
  selfLink: /apis/mlkube.io/v1beta1/namespaces/default/tfjobs/eval-0641
  uid: 7b3eb3cd-8676-11e7-a025-42010a8e0097
spec:
  replicaSpecs:
  - replicas: 1
    template:
      spec:
        containers:
        - command:
          - python
          - -m
          - my_code.train
          - --master=
          - --checkpoint_dir=gs:/some/path
          - --eval_dir=gs:/some/path
          - --alsologtostderr
          image: gcr.io/cloud-ml-dev/image:latest
          name: tensorflow
        restartPolicy: OnFailure
    tfPort: 2222
    tfReplicaType: MASTER
  tensorBoard:
    logDir: null

In this case, NewTBReplicaSet fails because tensorBoard.logDir isn't specified (it is null in the spec above).

The TfJob just remains in the state shown in the YAML above, with nothing to indicate that anything went wrong. That's not very helpful. The TfJob should instead be marked as failed with a helpful error message.
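
One way to picture the requested behavior is the minimal Go sketch below. It is an illustration under stated assumptions, not the operator's actual code: the types (`TfJob`, `TfJobStatus`, `TensorBoardSpec`), the phase constants, and the functions `setup` and `handleAdd` are all hypothetical and heavily simplified. `setup` stands in for whatever work the real Add handler performs (such as the NewTBReplicaSet call mentioned above), and the point is only that an error from it gets recorded on the job rather than dropped.

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical, simplified stand-ins for the real TfJob types.
type TensorBoardSpec struct {
	LogDir string
}

type TfJobSpec struct {
	TensorBoard *TensorBoardSpec
}

type TfJobPhase string

const (
	PhaseCreating TfJobPhase = "Creating"
	PhaseFailed   TfJobPhase = "Failed"
)

type TfJobStatus struct {
	Phase  TfJobPhase
	Reason string
}

type TfJob struct {
	Name   string
	Spec   TfJobSpec
	Status TfJobStatus
}

// setup is a hypothetical placeholder for the work done when an Add
// event arrives. Here it only rejects a TensorBoard spec with no logDir.
func setup(job *TfJob) error {
	if tb := job.Spec.TensorBoard; tb != nil && tb.LogDir == "" {
		return errors.New("tensorBoard.logDir must be specified")
	}
	return nil
}

// handleAdd records a setup failure on the job's status instead of
// leaving the job untouched.
func handleAdd(job *TfJob) {
	if err := setup(job); err != nil {
		job.Status.Phase = PhaseFailed
		job.Status.Reason = fmt.Sprintf("failed to handle Add event: %v", err)
		return
	}
	job.Status.Phase = PhaseCreating
}

func main() {
	job := &TfJob{
		Name: "dv2-eval-0641",
		Spec: TfJobSpec{TensorBoard: &TensorBoardSpec{}}, // logDir left empty
	}
	handleAdd(job)
	fmt.Printf("%s: phase=%s reason=%q\n", job.Name, job.Status.Phase, job.Status.Reason)
}
```

In a real controller the updated status would also need to be written back to the API server so that it shows up when the TfJob is inspected; that step is omitted here.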
