failed to launch command with executor: rpc error:  #5576
Closed

Description

@sirkjohannsen

We are testing Nomad 0.9 at the moment and have discovered that many of the services that ran fine under 0.8 no longer start properly.

Nomad version

Nomad v0.9.0 (18dd59056ee1d7b2df51256fe900a98460d3d6b9)

Operating system and Environment details

  • Ubuntu 16.04
  • VM instance on GCP
  • CPU: 2x Intel(R) Xeon(R) CPU @ 2.60GHz
  • RAM: 8GB

linux version 4.15.0-1029-gcp (buildd@lgw01-amd64-006) (gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.10)) #31~16.04.1-Ubuntu SMP Fri Mar 22 13:06:42 UTC 2019

We are running Nomad as both client and server on a single instance here (sandbox).
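
For context, the agent runs with a minimal combined client+server configuration along these lines (a sketch only; region, datacenter, data_dir and the meta key are taken from the job and logs below, everything else is illustrative):

# /etc/nomad.d/nomad.hcl -- illustrative single-node sandbox config
region     = "eu"
datacenter = "sandbox"
data_dir   = "/var/nomad"

server {
  enabled          = true
  bootstrap_expect = 1
}

client {
  enabled = true

  meta {
    xyz = "True"
  }
}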

Issue

With the exec driver we experience these errors at random.
Restarts sometimes help to recover from them, but this obviously breaks our deployment flow.

failed to launch command with executor: rpc error: code = Unknown desc = container process is already dead

or

failed to launch command with executor: rpc error: code = Unknown desc = cannot start an already running container
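
Recovering by hand typically amounts to re-submitting the job and waiting for a fresh allocation, e.g. (file name illustrative, we normally go through our deployment pipeline):

nomad job stop test-service
nomad job run test-service.nomad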

Reproduction steps

It happens randomly, but it happens all the time.
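
There is no deterministic trigger; redeploying the job a few times in a row is usually enough to hit it. A rough reproduction loop (purely illustrative, assuming the job below is saved as test-service.nomad):

for i in $(seq 1 10); do
  nomad job stop -purge test-service
  sleep 5
  nomad job run test-service.nomad
  sleep 60
  nomad job status test-service   # look for failed allocations / driver errors
done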

Job file (if appropriate)

One of many jobs where this randomly happens:

{
  "Stop": false,
  "Region": "eu",
  "Namespace": "default",
  "ID": "test-service",
  "ParentID": "",
  "Name": "test-service",
  "Type": "service",
  "Priority": 50,
  "AllAtOnce": false,
  "Datacenters": [
    "sandbox"
  ],
  "Constraints": [
    {
      "LTarget": "${meta.xyz}",
      "RTarget": "True",
      "Operand": "="
    },
    {
      "LTarget": "",
      "RTarget": "true",
      "Operand": "distinct_hosts"
    }
  ],
  "TaskGroups": [
    {
      "Name": "service",
      "Count": 1,
      "Constraints": null,
      "Tasks": [
        {
          "Name": "test",
          "Driver": "exec",
          "User": "www-data",
          "Config": {
            "args": [
              "-debug",
              "prometheus",
              "-listen_addr",
              ":${NOMAD_PORT_test_metrics}",
              "trace",
              "-collector_address",
              "localhost:9411",
              "log",
              "-environment",
              "sandbox",
              "-hostname",
              "${attr.unique.hostname}",
              "consul",
              "-uri",
              "http://localhost:8500",
              "rpc",
              "-address",
              ":${NOMAD_PORT_test_rpc}",
              "database",
              "-host",
              "localhost",
              "-user",
              "postgres",
              "-password",
              "",
              "-database",
              "sandbox"
            ],
            "command": "/usr/local/bin/test-service"
          },
          "Constraints": null,
          "Env": null,
          "Services": [
            {
              "Id": "",
              "Name": "test-service-metrics",
              "Tags": [
                "metrics",
                "test",
                "internal"
              ],
              "CanaryTags": null,
              "PortLabel": "test_metrics",
              "AddressMode": "auto",
              "Checks": [
                {
                  "Id": "",
                  "Name": "alive",
                  "Type": "tcp",
                  "Command": "",
                  "Args": null,
                  "Path": "",
                  "Protocol": "",
                  "PortLabel": "",
                  "AddressMode": "",
                  "Interval": 25000000000,
                  "Timeout": 2000000000,
                  "InitialStatus": "",
                  "TLSSkipVerify": false,
                  "Header": null,
                  "Method": "",
                  "CheckRestart": null,
                  "GRPCService": "",
                  "GRPCUseTLS": false
                }
              ],
              "CheckRestart": null
            },
            {
              "Id": "",
              "Name": "test-service-rpc",
              "Tags": [
                "test",
                "service",
                "internal"
              ],
              "CanaryTags": null,
              "PortLabel": "test_rpc",
              "AddressMode": "auto",
              "Checks": [
                {
                  "Id": "",
                  "Name": "service-health",
                  "Type": "grpc",
                  "Command": "",
                  "Args": null,
                  "Path": "",
                  "Protocol": "",
                  "PortLabel": "test_rpc",
                  "AddressMode": "",
                  "Interval": 25000000000,
                  "Timeout": 15000000000,
                  "InitialStatus": "",
                  "TLSSkipVerify": false,
                  "Header": null,
                  "Method": "",
                  "CheckRestart": null,
                  "GRPCService": "",
                  "GRPCUseTLS": false
                }
              ],
              "CheckRestart": null
            }
          ],
          "Resources": {
            "CPU": 20,
            "MemoryMB": 40,
            "DiskMB": 0,
            "IOPS": 0,
            "Networks": [
              {
                "Device": "",
                "CIDR": "",
                "IP": "",
                "MBits": 5,
                "ReservedPorts": null,
                "DynamicPorts": [
                  {
                    "Label": "test_rpc",
                    "Value": 0
                  },
                  {
                    "Label": "test_metrics",
                    "Value": 0
                  }
                ]
              }
            ]
          },
          "Meta": null,
          "KillTimeout": 5000000000,
          "LogConfig": {
            "MaxFiles": 20,
            "MaxFileSizeMB": 10
          },
          "Artifacts": [
            {
              "GetterSource": "s3::https://s3-eu-west-1.amazonaws.com/xxxxxxxx/test-service-01ba3eacaaeb9337da1a2e5cc18eff5c8f56c3e5.tar.xz",
              "GetterOptions": {
                "aws_access_key_id": "XXXXXXXXXX",
                "aws_access_key_secret": "XXXXXXXXXXXXXXXXXXXXXXXXX"
              },
              "GetterMode": "any",
              "RelativeDest": "local/"
            }
          ],
          "Vault": null,
          "Templates": null,
          "DispatchPayload": null,
          "Leader": false,
          "ShutdownDelay": 0,
          "KillSignal": "",
          "ID": "test-service.service.test"
        }
      ],
      "RestartPolicy": {
        "Interval": 300000000000,
        "Attempts": 5,
        "Delay": 5000000000,
        "Mode": "delay"
      },
      "ReschedulePolicy": {
        "Attempts": 0,
        "Interval": 0,
        "Delay": 30000000000,
        "DelayFunction": "exponential",
        "MaxDelay": 3600000000000,
        "Unlimited": true
      },
      "EphemeralDisk": {
        "Sticky": false,
        "Migrate": false,
        "SizeMB": 250
      },
      "Update": {
        "Stagger": 10000000000,
        "MaxParallel": 1,
        "HealthCheck": "task_states",
        "MinHealthyTime": 10000000000,
        "HealthyDeadline": 1500000000000,
        "ProgressDeadline": 1920000000000,
        "AutoRevert": false,
        "Canary": 0
      },
      "Migrate": {
        "MaxParallel": 1,
        "HealthCheck": "checks",
        "MinHealthyTime": 10000000000,
        "HealthyDeadline": 300000000000
      },
      "Meta": null,
      "ID": "test-service.service"
    }
  ],
  "Update": {
    "Stagger": 10000000000,
    "MaxParallel": 1,
    "HealthCheck": "",
    "MinHealthyTime": 0,
    "HealthyDeadline": 0,
    "ProgressDeadline": 0,
    "AutoRevert": false,
    "Canary": 0
  },
  "Periodic": null,
  "ParameterizedJob": null,
  "Dispatched": false,
  "Payload": null,
  "Reschedule": null,
  "Migrate": null,
  "Meta": null,
  "VaultToken": "",
  "Status": "running",
  "StatusDescription": "",
  "Stable": true,
  "Version": 2,
  "SubmitTime": 1555581591398130200,
  "CreateIndex": 53,
  "ModifyIndex": 147,
  "JobModifyIndex": 135
}
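
For readability, the gist of that job as an HCL sketch (heavily trimmed: most command args, the second service/check, the distinct_hosts constraint and the restart/update policies are omitted; this paraphrases the JSON above rather than reproducing our actual file):

job "test-service" {
  region      = "eu"
  datacenters = ["sandbox"]
  type        = "service"

  constraint {
    attribute = "${meta.xyz}"
    value     = "True"
  }

  group "service" {
    count = 1

    task "test" {
      driver = "exec"
      user   = "www-data"

      config {
        command = "/usr/local/bin/test-service"
        args    = ["prometheus", "-listen_addr", ":${NOMAD_PORT_test_metrics}"]  # args trimmed
      }

      artifact {
        source      = "s3::https://s3-eu-west-1.amazonaws.com/xxxxxxxx/test-service-<commit>.tar.xz"
        destination = "local/"
      }

      service {
        name = "test-service-metrics"
        port = "test_metrics"
        tags = ["metrics", "test", "internal"]

        check {
          name     = "alive"
          type     = "tcp"
          interval = "25s"
          timeout  = "2s"
        }
      }

      resources {
        cpu    = 20
        memory = 40

        network {
          mbits = 5
          port "test_rpc" {}
          port "test_metrics" {}
        }
      }
    }
  }
}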

Nomad logs (if appropriate)
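
The status output and agent logs below were collected with the standard CLI and journalctl, roughly as follows (unit name and time window are illustrative):

nomad alloc status 42a141b6
nomad alloc status 6e172afa
journalctl -u nomad --since "2019-04-18 09:55" --until "2019-04-18 10:15"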

alloc-status:

ID                   = 42a141b6
Eval ID              = 21107ced
Name                 = test-service.service[0]
Node ID              = 1fb13a5d
Job ID               = test-service
Job Version          = 0
Client Status        = failed
Client Description   = Failed tasks
Desired Status       = run
Desired Description  = <none>
Created              = 29m7s ago
Modified             = 27m55s ago
Deployment ID        = 6e64f7f8
Deployment Health    = unhealthy
Replacement Alloc ID = 6e172afa

Task "test" is "dead"
Task Resources
CPU     Memory  Disk     Addresses
20 MHz  40 MiB  250 MiB  test_rpc: 10.132.0.14:21227
                         test_metrics: 10.132.0.14:23471

Task Events:
Started At     = N/A
Finished At    = 2019-04-18T09:57:39Z
Total Restarts = 0
Last Restart   = N/A

Recent Events:
Time                  Type                   Description
2019-04-18T09:57:39Z  Alloc Unhealthy        Unhealthy because of failed task
2019-04-18T09:57:39Z  Not Restarting         Error was unrecoverable
2019-04-18T09:57:39Z  Driver Failure         failed to launch command with executor: rpc error: code = Unknown desc = cannot start an already running container
2019-04-18T09:56:29Z  Downloading Artifacts  Client is downloading artifacts
2019-04-18T09:56:29Z  Task Setup             Building Task Directory
2019-04-18T09:56:29Z  Received               Task received by client

alloc status after retry:

ID                   = 6e172afa
Eval ID              = 296d51f0
Name                 = test-service.service[0]
Node ID              = 1fb13a5d
Job ID               = test-service
Job Version          = 0
Client Status        = failed
Client Description   = Failed tasks
Desired Status       = run
Desired Description  = <none>
Created              = 28m16s ago
Modified             = 27m42s ago
Deployment ID        = 6e64f7f8
Deployment Health    = unhealthy
Replacement Alloc ID = 72725256

Task "test" is "dead"
Task Resources
CPU       Memory         Disk     Addresses
0/20 MHz  48 KiB/40 MiB  250 MiB  test_rpc: 10.132.0.14:31086
                                  test_metrics: 10.132.0.14:27131

Task Events:
Started At     = 2019-04-18T09:58:37Z
Finished At    = 2019-04-18T09:58:42Z
Total Restarts = 1
Last Restart   = 2019-04-18T09:58:37Z

Recent Events:
Time                  Type                   Description
2019-04-18T09:58:42Z  Alloc Unhealthy        Unhealthy because of failed task
2019-04-18T09:58:42Z  Not Restarting         Error was unrecoverable
2019-04-18T09:58:42Z  Driver Failure         failed to launch command with executor: rpc error: code = Unknown desc = container process is already dead
2019-04-18T09:58:37Z  Restarting             Task restarting in 5.057962516s
2019-04-18T09:58:37Z  Terminated             Exit Code: 137, Signal: 9
2019-04-18T09:58:37Z  Started                Task started by client
2019-04-18T09:58:10Z  Downloading Artifacts  Client is downloading artifacts
2019-04-18T09:58:09Z  Task Setup             Building Task Directory
2019-04-18T09:58:09Z  Received               Task received by client

logs:

Apr 18 09:56:29 sb-nomad09 nomad[23445]:     2019-04-18T09:56:29.977Z [INFO ] client.alloc_runner.task_runner.task_hook.logmon.nomad: opening fifo: alloc_id=42a141b6-7704-f0f1-139b-d4332c4c4c12 task=test @module=logmon path=/var/nomad/alloc/42a141b6-7704-f0f1-139b-d4332c4c4c12/alloc/logs/.test.stdout.fifo timestamp=2019-04-18T09:56:29.977Z
Apr 18 09:56:29 sb-nomad09 nomad[23445]:     2019-04-18T09:56:29.978Z [INFO ] client.alloc_runner.task_runner.task_hook.logmon.nomad: opening fifo: alloc_id=42a141b6-7704-f0f1-139b-d4332c4c4c12 task=test @module=logmon path=/var/nomad/alloc/42a141b6-7704-f0f1-139b-d4332c4c4c12/alloc/logs/.test.stderr.fifo timestamp=2019-04-18T09:56:29.977Z
Apr 18 09:56:29 sb-nomad09 nomad[23445]: client.alloc_runner.task_runner.task_hook.logmon.nomad: opening fifo: alloc_id=42a141b6-7704-f0f1-139b-d4332c4c4c12 task=test @module=logmon path=/var/nomad/alloc/42a141b6-7704-f0f1-139b-d4332c4c4c12/alloc/logs/.test.stdout.fifo timestamp=2019-04-18T09:56:29.977Z
Apr 18 09:56:29 sb-nomad09 nomad[23445]: client.alloc_runner.task_runner.task_hook.logmon.nomad: opening fifo: alloc_id=42a141b6-7704-f0f1-139b-d4332c4c4c12 task=test @module=logmon path=/var/nomad/alloc/42a141b6-7704-f0f1-139b-d4332c4c4c12/alloc/logs/.test.stderr.fifo timestamp=2019-04-18T09:56:29.977Z
Apr 18 09:57:39 sb-nomad09 nomad[23445]:     2019-04-18T09:57:39.814Z [ERROR] client.alloc_runner.task_runner: running driver failed: alloc_id=42a141b6-7704-f0f1-139b-d4332c4c4c12 task=test error="failed to launch command with executor: rpc error: code = Unknown desc = cannot start an already running container"
Apr 18 09:57:39 sb-nomad09 nomad[23445]: client.alloc_runner.task_runner: running driver failed: alloc_id=42a141b6-7704-f0f1-139b-d4332c4c4c12 task=test error="failed to launch command with executor: rpc error: code = Unknown desc = cannot start an already running container"
Apr 18 09:57:39 sb-nomad09 nomad[23445]:     2019-04-18T09:57:39.816Z [INFO ] client.alloc_runner.task_runner: not restarting task: alloc_id=42a141b6-7704-f0f1-139b-d4332c4c4c12 task=test reason="Error was unrecoverable"
Apr 18 09:57:39 sb-nomad09 nomad[23445]: client.alloc_runner.task_runner: not restarting task: alloc_id=42a141b6-7704-f0f1-139b-d4332c4c4c12 task=test reason="Error was unrecoverable"
Apr 18 09:57:39 sb-nomad09 nomad[23445]:     2019-04-18T09:57:39.820Z [INFO ] client.gc: marking allocation for GC: alloc_id=42a141b6-7704-f0f1-139b-d4332c4c4c12
Apr 18 09:57:39 sb-nomad09 nomad[23445]: client.gc: marking allocation for GC: alloc_id=42a141b6-7704-f0f1-139b-d4332c4c4c12
Apr 18 09:57:40 sb-nomad09 nomad[23445]:     2019-04-18T09:57:40.059Z [ERROR] client.alloc_runner.task_runner.task_hook.logmon.nomad: reading plugin stderr: alloc_id=42a141b6-7704-f0f1-139b-d4332c4c4c12 task=test error="read |0: file already closed"
Apr 18 09:57:40 sb-nomad09 nomad[23445]: client.alloc_runner.task_runner.task_hook.logmon.nomad: reading plugin stderr: alloc_id=42a141b6-7704-f0f1-139b-d4332c4c4c12 task=test error="read |0: file already closed"
Apr 18 10:13:07 sb-nomad09 nomad[23445]:     2019-04-18T10:13:07.697Z [ERROR] http: request failed: method=GET path=/v1/client/fs/logs/42a141b6-7704-f0f1-139b-d4332c4c4c12?task=test&type=stdout&plain=false error="task "test" not started yet. No logs available" code=404
Apr 18 10:13:07 sb-nomad09 nomad[23445]: http: request failed: method=GET path=/v1/client/fs/logs/42a141b6-7704-f0f1-139b-d4332c4c4c12?task=test&type=stdout&plain=false error="task "test" not started yet. No logs available" code=404
Apr 18 10:13:07 sb-nomad09 nomad[23445]:     2019-04-18T10:13:07.699Z [ERROR] http: request failed: method=GET path=/v1/client/fs/logs/42a141b6-7704-f0f1-139b-d4332c4c4c12?task=test&type=stderr&plain=false error="task "test" not started yet. No logs available" code=404
Apr 18 10:13:07 sb-nomad09 nomad[23445]: http: request failed: method=GET path=/v1/client/fs/logs/42a141b6-7704-f0f1-139b-d4332c4c4c12?task=test&type=stderr&plain=false error="task "test" not started yet. No logs available" code=404

2nd allocation (after retry):

Apr 18 09:58:09 sb-nomad09 nomad[23445]:     2019-04-18T09:58:09.995Z [INFO ] client.alloc_runner.task_runner.task_hook.logmon.nomad: opening fifo: alloc_id=6e172afa-8cdd-97c1-052d-9e918ff736ec task=test @module=logmon path=/var/nomad/alloc/6e172afa-8cdd-97c1-052d-9e918ff736ec/alloc/logs/.test.stdout.fifo timestamp=2019-04-18T09:58:09.995Z
Apr 18 09:58:09 sb-nomad09 nomad[23445]:     2019-04-18T09:58:09.995Z [INFO ] client.alloc_runner.task_runner.task_hook.logmon.nomad: opening fifo: alloc_id=6e172afa-8cdd-97c1-052d-9e918ff736ec task=test @module=logmon path=/var/nomad/alloc/6e172afa-8cdd-97c1-052d-9e918ff736ec/alloc/logs/.test.stderr.fifo timestamp=2019-04-18T09:58:09.995Z
Apr 18 09:58:09 sb-nomad09 nomad[23445]: client.alloc_runner.task_runner.task_hook.logmon.nomad: opening fifo: alloc_id=6e172afa-8cdd-97c1-052d-9e918ff736ec task=test @module=logmon path=/var/nomad/alloc/6e172afa-8cdd-97c1-052d-9e918ff736ec/alloc/logs/.test.stdout.fifo timestamp=2019-04-18T09:58:09.995Z
Apr 18 09:58:09 sb-nomad09 nomad[23445]: client.alloc_runner.task_runner.task_hook.logmon.nomad: opening fifo: alloc_id=6e172afa-8cdd-97c1-052d-9e918ff736ec task=test @module=logmon path=/var/nomad/alloc/6e172afa-8cdd-97c1-052d-9e918ff736ec/alloc/logs/.test.stderr.fifo timestamp=2019-04-18T09:58:09.995Z
Apr 18 09:58:37 sb-nomad09 nomad[23445]:     2019-04-18T09:58:37.040Z [ERROR] client.driver_mgr.exec: error receiving stream from Stats executor RPC, closing stream: alloc_id=6e172afa-8cdd-97c1-052d-9e918ff736ec driver=exec task_name=test error="rpc error: code = Unavailable desc = transport is closing"
Apr 18 09:58:37 sb-nomad09 nomad[23445]: client.driver_mgr.exec: error receiving stream from Stats executor RPC, closing stream: alloc_id=6e172afa-8cdd-97c1-052d-9e918ff736ec driver=exec task_name=test error="rpc error: code = Unavailable desc = transport is closing"
Apr 18 09:58:37 sb-nomad09 nomad[23445]:     2019-04-18T09:58:37.043Z [INFO ] client.alloc_runner.task_runner: restarting task: alloc_id=6e172afa-8cdd-97c1-052d-9e918ff736ec task=test reason="Restart within policy" delay=5.057962516s
Apr 18 09:58:37 sb-nomad09 nomad[23445]: client.alloc_runner.task_runner: restarting task: alloc_id=6e172afa-8cdd-97c1-052d-9e918ff736ec task=test reason="Restart within policy" delay=5.057962516s
Apr 18 09:58:42 sb-nomad09 nomad[23445]:     2019-04-18T09:58:42.134Z [INFO ] client.alloc_runner.task_runner.task_hook.logmon.nomad: opening fifo: alloc_id=6e172afa-8cdd-97c1-052d-9e918ff736ec task=test @module=logmon path=/var/nomad/alloc/6e172afa-8cdd-97c1-052d-9e918ff736ec/alloc/logs/.test.stdout.fifo timestamp=2019-04-18T09:58:42.134Z
Apr 18 09:58:42 sb-nomad09 nomad[23445]:     2019-04-18T09:58:42.134Z [INFO ] client.alloc_runner.task_runner.task_hook.logmon.nomad: opening fifo: alloc_id=6e172afa-8cdd-97c1-052d-9e918ff736ec task=test @module=logmon path=/var/nomad/alloc/6e172afa-8cdd-97c1-052d-9e918ff736ec/alloc/logs/.test.stderr.fifo timestamp=2019-04-18T09:58:42.134Z
Apr 18 09:58:42 sb-nomad09 nomad[23445]: client.alloc_runner.task_runner.task_hook.logmon.nomad: opening fifo: alloc_id=6e172afa-8cdd-97c1-052d-9e918ff736ec task=test @module=logmon path=/var/nomad/alloc/6e172afa-8cdd-97c1-052d-9e918ff736ec/alloc/logs/.test.stdout.fifo timestamp=2019-04-18T09:58:42.134Z
Apr 18 09:58:42 sb-nomad09 nomad[23445]: client.alloc_runner.task_runner.task_hook.logmon.nomad: opening fifo: alloc_id=6e172afa-8cdd-97c1-052d-9e918ff736ec task=test @module=logmon path=/var/nomad/alloc/6e172afa-8cdd-97c1-052d-9e918ff736ec/alloc/logs/.test.stderr.fifo timestamp=2019-04-18T09:58:42.134Z
Apr 18 09:58:42 sb-nomad09 nomad[23445]:     2019-04-18T09:58:42.750Z [ERROR] client.alloc_runner.task_runner: running driver failed: alloc_id=6e172afa-8cdd-97c1-052d-9e918ff736ec task=test error="failed to launch command with executor: rpc error: code = Unknown desc = container process is already dead"
Apr 18 09:58:42 sb-nomad09 nomad[23445]: client.alloc_runner.task_runner: running driver failed: alloc_id=6e172afa-8cdd-97c1-052d-9e918ff736ec task=test error="failed to launch command with executor: rpc error: code = Unknown desc = container process is already dead"
Apr 18 09:58:42 sb-nomad09 nomad[23445]:     2019-04-18T09:58:42.753Z [INFO ] client.alloc_runner.task_runner: not restarting task: alloc_id=6e172afa-8cdd-97c1-052d-9e918ff736ec task=test reason="Error was unrecoverable"
Apr 18 09:58:42 sb-nomad09 nomad[23445]: client.alloc_runner.task_runner: not restarting task: alloc_id=6e172afa-8cdd-97c1-052d-9e918ff736ec task=test reason="Error was unrecoverable"
Apr 18 09:58:42 sb-nomad09 nomad[23445]:     2019-04-18T09:58:42.757Z [INFO ] client.gc: marking allocation for GC: alloc_id=6e172afa-8cdd-97c1-052d-9e918ff736ec
Apr 18 09:58:42 sb-nomad09 nomad[23445]: client.gc: marking allocation for GC: alloc_id=6e172afa-8cdd-97c1-052d-9e918ff736ec
Apr 18 09:58:42 sb-nomad09 nomad[23445]:     2019-04-18T09:58:42.765Z [ERROR] client.alloc_runner.task_runner.task_hook.logmon.nomad: reading plugin stderr: alloc_id=6e172afa-8cdd-97c1-052d-9e918ff736ec task=test error="read |0: file already closed"
Apr 18 09:58:42 sb-nomad09 nomad[23445]: client.alloc_runner.task_runner.task_hook.logmon.nomad: reading plugin stderr: alloc_id=6e172afa-8cdd-97c1-052d-9e918ff736ec task=test error="read |0: file already closed"
