Description
We are testing nomad 0.9 at the moment and discovered that many of the services that ran fine in 0.8 now do not start properly.
Nomad version
Nomad v0.9.0 (18dd59056ee1d7b2df51256fe900a98460d3d6b9)
Operating system and Environment details
- Ubuntu 16.04
- VM instance on GCP
- CPU: 2x
Intel(R) Xeon(R) CPU @ 2.60GHz
- RAM: 8GB
linux version 4.15.0-1029-gcp (buildd@lgw01-amd64-006) (gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.10)) #31~16.04.1-Ubuntu SMP Fri Mar 22 13:06:42 UTC 2019
We are running nomad as client and server on single instance here (sandbox)
Issue
with the exec driver we experience these errors at random.
Restarts sometimes help to recover the issue but obviously breaks our deployment flow.
failed to launch command with executor: rpc error: code = Unknown desc = container process is already dead
or
failed to launch command with executor: rpc error: code = Unknown desc = cannot start an already running container
Reproduction steps
Its happening randomly but its happening all the time.
Job file (if appropriate)
one of many jobs where this randomly happens:
{
"Stop": false,
"Region": "eu",
"Namespace": "default",
"ID": "test-service",
"ParentID": "",
"Name": "test-service",
"Type": "service",
"Priority": 50,
"AllAtOnce": false,
"Datacenters": [
"sandbox"
],
"Constraints": [
{
"LTarget": "${meta.xyz}",
"RTarget": "True",
"Operand": "="
},
{
"LTarget": "",
"RTarget": "true",
"Operand": "distinct_hosts"
}
],
"TaskGroups": [
{
"Name": "service",
"Count": 1,
"Constraints": null,
"Tasks": [
{
"Name": "test",
"Driver": "exec",
"User": "www-data",
"Config": {
"args": [
"-debug",
"prometheus",
"-listen_addr",
":${NOMAD_PORT_test_metrics}",
"trace",
"-collector_address",
"localhost:9411",
"log",
"-environment",
"sandbox",
"-hostname",
"${attr.unique.hostname}",
"consul",
"-uri",
"http://localhost:8500",
"rpc",
"-address",
":${NOMAD_PORT_test_rpc}",
"database",
"-host",
"localhost",
"-user",
"postgres",
"-password",
"",
"-database",
"sandbox"
],
"command": "/usr/local/bin/test-service"
},
"Constraints": null,
"Env": null,
"Services": [
{
"Id": "",
"Name": "test-service-metrics",
"Tags": [
"metrics",
"test",
"internal"
],
"CanaryTags": null,
"PortLabel": "test_metrics",
"AddressMode": "auto",
"Checks": [
{
"Id": "",
"Name": "alive",
"Type": "tcp",
"Command": "",
"Args": null,
"Path": "",
"Protocol": "",
"PortLabel": "",
"AddressMode": "",
"Interval": 25000000000,
"Timeout": 2000000000,
"InitialStatus": "",
"TLSSkipVerify": false,
"Header": null,
"Method": "",
"CheckRestart": null,
"GRPCService": "",
"GRPCUseTLS": false
}
],
"CheckRestart": null
},
{
"Id": "",
"Name": "test-service-rpc",
"Tags": [
"test",
"service",
"internal"
],
"CanaryTags": null,
"PortLabel": "test_rpc",
"AddressMode": "auto",
"Checks": [
{
"Id": "",
"Name": "service-health",
"Type": "grpc",
"Command": "",
"Args": null,
"Path": "",
"Protocol": "",
"PortLabel": "test_rpc",
"AddressMode": "",
"Interval": 25000000000,
"Timeout": 15000000000,
"InitialStatus": "",
"TLSSkipVerify": false,
"Header": null,
"Method": "",
"CheckRestart": null,
"GRPCService": "",
"GRPCUseTLS": false
}
],
"CheckRestart": null
}
],
"Resources": {
"CPU": 20,
"MemoryMB": 40,
"DiskMB": 0,
"IOPS": 0,
"Networks": [
{
"Device": "",
"CIDR": "",
"IP": "",
"MBits": 5,
"ReservedPorts": null,
"DynamicPorts": [
{
"Label": "test_rpc",
"Value": 0
},
{
"Label": "test_metrics",
"Value": 0
}
]
}
]
},
"Meta": null,
"KillTimeout": 5000000000,
"LogConfig": {
"MaxFiles": 20,
"MaxFileSizeMB": 10
},
"Artifacts": [
{
"GetterSource": "s3::https://s3-eu-west-1.amazonaws.com/xxxxxxxx/test-service-01ba3eacaaeb9337da1a2e5cc18eff5c8f56c3e5.tar.xz",
"GetterOptions": {
"aws_access_key_id": "XXXXXXXXXX",
"aws_access_key_secret": "XXXXXXXXXXXXXXXXXXXXXXXXX"
},
"GetterMode": "any",
"RelativeDest": "local/"
}
],
"Vault": null,
"Templates": null,
"DispatchPayload": null,
"Leader": false,
"ShutdownDelay": 0,
"KillSignal": "",
"ID": "test-service.service.test"
}
],
"RestartPolicy": {
"Interval": 300000000000,
"Attempts": 5,
"Delay": 5000000000,
"Mode": "delay"
},
"ReschedulePolicy": {
"Attempts": 0,
"Interval": 0,
"Delay": 30000000000,
"DelayFunction": "exponential",
"MaxDelay": 3600000000000,
"Unlimited": true
},
"EphemeralDisk": {
"Sticky": false,
"Migrate": false,
"SizeMB": 250
},
"Update": {
"Stagger": 10000000000,
"MaxParallel": 1,
"HealthCheck": "task_states",
"MinHealthyTime": 10000000000,
"HealthyDeadline": 1500000000000,
"ProgressDeadline": 1920000000000,
"AutoRevert": false,
"Canary": 0
},
"Migrate": {
"MaxParallel": 1,
"HealthCheck": "checks",
"MinHealthyTime": 10000000000,
"HealthyDeadline": 300000000000
},
"Meta": null,
"ID": "test-service.service"
}
],
"Update": {
"Stagger": 10000000000,
"MaxParallel": 1,
"HealthCheck": "",
"MinHealthyTime": 0,
"HealthyDeadline": 0,
"ProgressDeadline": 0,
"AutoRevert": false,
"Canary": 0
},
"Periodic": null,
"ParameterizedJob": null,
"Dispatched": false,
"Payload": null,
"Reschedule": null,
"Migrate": null,
"Meta": null,
"VaultToken": "",
"Status": "running",
"StatusDescription": "",
"Stable": true,
"Version": 2,
"SubmitTime": 1555581591398130200,
"CreateIndex": 53,
"ModifyIndex": 147,
"JobModifyIndex": 135
}
Nomad logs (if appropriate)
alloc-status:
ID = 42a141b6
Eval ID = 21107ced
Name = test-service.service[0]
Node ID = 1fb13a5d
Job ID = test-service
Job Version = 0
Client Status = failed
Client Description = Failed tasks
Desired Status = run
Desired Description = <none>
Created = 29m7s ago
Modified = 27m55s ago
Deployment ID = 6e64f7f8
Deployment Health = unhealthy
Replacement Alloc ID = 6e172afa
Task "test" is "dead"
Task Resources
CPU Memory Disk Addresses
20 MHz 40 MiB 250 MiB test_rpc: 10.132.0.14:21227
test_metrics: 10.132.0.14:23471
Task Events:
Started At = N/A
Finished At = 2019-04-18T09:57:39Z
Total Restarts = 0
Last Restart = N/A
Recent Events:
Time Type Description
2019-04-18T09:57:39Z Alloc Unhealthy Unhealthy because of failed task
2019-04-18T09:57:39Z Not Restarting Error was unrecoverable
2019-04-18T09:57:39Z Driver Failure failed to launch command with executor: rpc error: code = Unknown desc = cannot start an already running container
2019-04-18T09:56:29Z Downloading Artifacts Client is downloading artifacts
2019-04-18T09:56:29Z Task Setup Building Task Directory
2019-04-18T09:56:29Z Received Task received by client
alloc status after retry:
ID = 6e172afa
Eval ID = 296d51f0
Name = test-service.service[0]
Node ID = 1fb13a5d
Job ID = test-service
Job Version = 0
Client Status = failed
Client Description = Failed tasks
Desired Status = run
Desired Description = <none>
Created = 28m16s ago
Modified = 27m42s ago
Deployment ID = 6e64f7f8
Deployment Health = unhealthy
Replacement Alloc ID = 72725256
Task "test" is "dead"
Task Resources
CPU Memory Disk Addresses
0/20 MHz 48 KiB/40 MiB 250 MiB test_rpc: 10.132.0.14:31086
test_metrics: 10.132.0.14:27131
Task Events:
Started At = 2019-04-18T09:58:37Z
Finished At = 2019-04-18T09:58:42Z
Total Restarts = 1
Last Restart = 2019-04-18T09:58:37Z
Recent Events:
Time Type Description
2019-04-18T09:58:42Z Alloc Unhealthy Unhealthy because of failed task
2019-04-18T09:58:42Z Not Restarting Error was unrecoverable
2019-04-18T09:58:42Z Driver Failure failed to launch command with executor: rpc error: code = Unknown desc = container process is already dead
2019-04-18T09:58:37Z Restarting Task restarting in 5.057962516s
2019-04-18T09:58:37Z Terminated Exit Code: 137, Signal: 9
2019-04-18T09:58:37Z Started Task started by client
2019-04-18T09:58:10Z Downloading Artifacts Client is downloading artifacts
2019-04-18T09:58:09Z Task Setup Building Task Directory
2019-04-18T09:58:09Z Received Task received by client
logs:
Apr 18 09:56:29 sb-nomad09 nomad[23445]: 2019-04-18T09:56:29.977Z [INFO ] client.alloc_runner.task_runner.task_hook.logmon.nomad: opening fifo: alloc_id=42a141b6-7704-f0f1-139b-d4332c4c4c12 task=test @module=logmon path=/var/nomad/alloc/42a141b6-7704-f0f1-139b-d4332c4c4c12/alloc/logs/.test.stdout.fifo timestamp=2019-04-18T09:56:29.977Z
Apr 18 09:56:29 sb-nomad09 nomad[23445]: 2019-04-18T09:56:29.978Z [INFO ] client.alloc_runner.task_runner.task_hook.logmon.nomad: opening fifo: alloc_id=42a141b6-7704-f0f1-139b-d4332c4c4c12 task=test @module=logmon path=/var/nomad/alloc/42a141b6-7704-f0f1-139b-d4332c4c4c12/alloc/logs/.test.stderr.fifo timestamp=2019-04-18T09:56:29.977Z
Apr 18 09:56:29 sb-nomad09 nomad[23445]: client.alloc_runner.task_runner.task_hook.logmon.nomad: opening fifo: alloc_id=42a141b6-7704-f0f1-139b-d4332c4c4c12 task=test @module=logmon path=/var/nomad/alloc/42a141b6-7704-f0f1-139b-d4332c4c4c12/alloc/logs/.test.stdout.fifo timestamp=2019-04-18T09:56:29.977Z
Apr 18 09:56:29 sb-nomad09 nomad[23445]: client.alloc_runner.task_runner.task_hook.logmon.nomad: opening fifo: alloc_id=42a141b6-7704-f0f1-139b-d4332c4c4c12 task=test @module=logmon path=/var/nomad/alloc/42a141b6-7704-f0f1-139b-d4332c4c4c12/alloc/logs/.test.stderr.fifo timestamp=2019-04-18T09:56:29.977Z
Apr 18 09:57:39 sb-nomad09 nomad[23445]: 2019-04-18T09:57:39.814Z [ERROR] client.alloc_runner.task_runner: running driver failed: alloc_id=42a141b6-7704-f0f1-139b-d4332c4c4c12 task=test error="failed to launch command with executor: rpc error: code = Unknown desc = cannot start an already running container"
Apr 18 09:57:39 sb-nomad09 nomad[23445]: client.alloc_runner.task_runner: running driver failed: alloc_id=42a141b6-7704-f0f1-139b-d4332c4c4c12 task=test error="failed to launch command with executor: rpc error: code = Unknown desc = cannot start an already running container"
Apr 18 09:57:39 sb-nomad09 nomad[23445]: 2019-04-18T09:57:39.816Z [INFO ] client.alloc_runner.task_runner: not restarting task: alloc_id=42a141b6-7704-f0f1-139b-d4332c4c4c12 task=test reason="Error was unrecoverable"
Apr 18 09:57:39 sb-nomad09 nomad[23445]: client.alloc_runner.task_runner: not restarting task: alloc_id=42a141b6-7704-f0f1-139b-d4332c4c4c12 task=test reason="Error was unrecoverable"
Apr 18 09:57:39 sb-nomad09 nomad[23445]: 2019-04-18T09:57:39.820Z [INFO ] client.gc: marking allocation for GC: alloc_id=42a141b6-7704-f0f1-139b-d4332c4c4c12
Apr 18 09:57:39 sb-nomad09 nomad[23445]: client.gc: marking allocation for GC: alloc_id=42a141b6-7704-f0f1-139b-d4332c4c4c12
Apr 18 09:57:40 sb-nomad09 nomad[23445]: 2019-04-18T09:57:40.059Z [ERROR] client.alloc_runner.task_runner.task_hook.logmon.nomad: reading plugin stderr: alloc_id=42a141b6-7704-f0f1-139b-d4332c4c4c12 task=test error="read |0: file already closed"
Apr 18 09:57:40 sb-nomad09 nomad[23445]: client.alloc_runner.task_runner.task_hook.logmon.nomad: reading plugin stderr: alloc_id=42a141b6-7704-f0f1-139b-d4332c4c4c12 task=test error="read |0: file already closed"
Apr 18 10:13:07 sb-nomad09 nomad[23445]: 2019-04-18T10:13:07.697Z [ERROR] http: request failed: method=GET path=/v1/client/fs/logs/42a141b6-7704-f0f1-139b-d4332c4c4c12?task=test&type=stdout&plain=false error="task "test" not started yet. No logs available" code=404
Apr 18 10:13:07 sb-nomad09 nomad[23445]: http: request failed: method=GET path=/v1/client/fs/logs/42a141b6-7704-f0f1-139b-d4332c4c4c12?task=test&type=stdout&plain=false error="task "test" not started yet. No logs available" code=404
Apr 18 10:13:07 sb-nomad09 nomad[23445]: 2019-04-18T10:13:07.699Z [ERROR] http: request failed: method=GET path=/v1/client/fs/logs/42a141b6-7704-f0f1-139b-d4332c4c4c12?task=test&type=stderr&plain=false error="task "test" not started yet. No logs available" code=404
Apr 18 10:13:07 sb-nomad09 nomad[23445]: http: request failed: method=GET path=/v1/client/fs/logs/42a141b6-7704-f0f1-139b-d4332c4c4c12?task=test&type=stderr&plain=false error="task "test" not started yet. No logs available" code=404
2nd allocation (after retry):
Apr 18 09:58:09 sb-nomad09 nomad[23445]: 2019-04-18T09:58:09.995Z [INFO ] client.alloc_runner.task_runner.task_hook.logmon.nomad: opening fifo: alloc_id=6e172afa-8cdd-97c1-052d-9e918ff736ec task=test @module=logmon path=/var/nomad/alloc/6e172afa-8cdd-97c1-052d-9e918ff736ec/alloc/logs/.test.stdout.fifo timestamp=2019-04-18T09:58:09.995Z
Apr 18 09:58:09 sb-nomad09 nomad[23445]: 2019-04-18T09:58:09.995Z [INFO ] client.alloc_runner.task_runner.task_hook.logmon.nomad: opening fifo: alloc_id=6e172afa-8cdd-97c1-052d-9e918ff736ec task=test @module=logmon path=/var/nomad/alloc/6e172afa-8cdd-97c1-052d-9e918ff736ec/alloc/logs/.test.stderr.fifo timestamp=2019-04-18T09:58:09.995Z
Apr 18 09:58:09 sb-nomad09 nomad[23445]: client.alloc_runner.task_runner.task_hook.logmon.nomad: opening fifo: alloc_id=6e172afa-8cdd-97c1-052d-9e918ff736ec task=test @module=logmon path=/var/nomad/alloc/6e172afa-8cdd-97c1-052d-9e918ff736ec/alloc/logs/.test.stdout.fifo timestamp=2019-04-18T09:58:09.995Z
Apr 18 09:58:09 sb-nomad09 nomad[23445]: client.alloc_runner.task_runner.task_hook.logmon.nomad: opening fifo: alloc_id=6e172afa-8cdd-97c1-052d-9e918ff736ec task=test @module=logmon path=/var/nomad/alloc/6e172afa-8cdd-97c1-052d-9e918ff736ec/alloc/logs/.test.stderr.fifo timestamp=2019-04-18T09:58:09.995Z
Apr 18 09:58:37 sb-nomad09 nomad[23445]: 2019-04-18T09:58:37.040Z [ERROR] client.driver_mgr.exec: error receiving stream from Stats executor RPC, closing stream: alloc_id=6e172afa-8cdd-97c1-052d-9e918ff736ec driver=exec task_name=test error="rpc error: code = Unavailable desc = transport is closing"
Apr 18 09:58:37 sb-nomad09 nomad[23445]: client.driver_mgr.exec: error receiving stream from Stats executor RPC, closing stream: alloc_id=6e172afa-8cdd-97c1-052d-9e918ff736ec driver=exec task_name=test error="rpc error: code = Unavailable desc = transport is closing"
Apr 18 09:58:37 sb-nomad09 nomad[23445]: 2019-04-18T09:58:37.043Z [INFO ] client.alloc_runner.task_runner: restarting task: alloc_id=6e172afa-8cdd-97c1-052d-9e918ff736ec task=test reason="Restart within policy" delay=5.057962516s
Apr 18 09:58:37 sb-nomad09 nomad[23445]: client.alloc_runner.task_runner: restarting task: alloc_id=6e172afa-8cdd-97c1-052d-9e918ff736ec task=test reason="Restart within policy" delay=5.057962516s
Apr 18 09:58:42 sb-nomad09 nomad[23445]: 2019-04-18T09:58:42.134Z [INFO ] client.alloc_runner.task_runner.task_hook.logmon.nomad: opening fifo: alloc_id=6e172afa-8cdd-97c1-052d-9e918ff736ec task=test @module=logmon path=/var/nomad/alloc/6e172afa-8cdd-97c1-052d-9e918ff736ec/alloc/logs/.test.stdout.fifo timestamp=2019-04-18T09:58:42.134Z
Apr 18 09:58:42 sb-nomad09 nomad[23445]: 2019-04-18T09:58:42.134Z [INFO ] client.alloc_runner.task_runner.task_hook.logmon.nomad: opening fifo: alloc_id=6e172afa-8cdd-97c1-052d-9e918ff736ec task=test @module=logmon path=/var/nomad/alloc/6e172afa-8cdd-97c1-052d-9e918ff736ec/alloc/logs/.test.stderr.fifo timestamp=2019-04-18T09:58:42.134Z
Apr 18 09:58:42 sb-nomad09 nomad[23445]: client.alloc_runner.task_runner.task_hook.logmon.nomad: opening fifo: alloc_id=6e172afa-8cdd-97c1-052d-9e918ff736ec task=test @module=logmon path=/var/nomad/alloc/6e172afa-8cdd-97c1-052d-9e918ff736ec/alloc/logs/.test.stdout.fifo timestamp=2019-04-18T09:58:42.134Z
Apr 18 09:58:42 sb-nomad09 nomad[23445]: client.alloc_runner.task_runner.task_hook.logmon.nomad: opening fifo: alloc_id=6e172afa-8cdd-97c1-052d-9e918ff736ec task=test @module=logmon path=/var/nomad/alloc/6e172afa-8cdd-97c1-052d-9e918ff736ec/alloc/logs/.test.stderr.fifo timestamp=2019-04-18T09:58:42.134Z
Apr 18 09:58:42 sb-nomad09 nomad[23445]: 2019-04-18T09:58:42.750Z [ERROR] client.alloc_runner.task_runner: running driver failed: alloc_id=6e172afa-8cdd-97c1-052d-9e918ff736ec task=test error="failed to launch command with executor: rpc error: code = Unknown desc = container process is already dead"
Apr 18 09:58:42 sb-nomad09 nomad[23445]: client.alloc_runner.task_runner: running driver failed: alloc_id=6e172afa-8cdd-97c1-052d-9e918ff736ec task=test error="failed to launch command with executor: rpc error: code = Unknown desc = container process is already dead"
Apr 18 09:58:42 sb-nomad09 nomad[23445]: 2019-04-18T09:58:42.753Z [INFO ] client.alloc_runner.task_runner: not restarting task: alloc_id=6e172afa-8cdd-97c1-052d-9e918ff736ec task=test reason="Error was unrecoverable"
Apr 18 09:58:42 sb-nomad09 nomad[23445]: client.alloc_runner.task_runner: not restarting task: alloc_id=6e172afa-8cdd-97c1-052d-9e918ff736ec task=test reason="Error was unrecoverable"
Apr 18 09:58:42 sb-nomad09 nomad[23445]: 2019-04-18T09:58:42.757Z [INFO ] client.gc: marking allocation for GC: alloc_id=6e172afa-8cdd-97c1-052d-9e918ff736ec
Apr 18 09:58:42 sb-nomad09 nomad[23445]: client.gc: marking allocation for GC: alloc_id=6e172afa-8cdd-97c1-052d-9e918ff736ec
Apr 18 09:58:42 sb-nomad09 nomad[23445]: 2019-04-18T09:58:42.765Z [ERROR] client.alloc_runner.task_runner.task_hook.logmon.nomad: reading plugin stderr: alloc_id=6e172afa-8cdd-97c1-052d-9e918ff736ec task=test error="read |0: file already closed"
Apr 18 09:58:42 sb-nomad09 nomad[23445]: client.alloc_runner.task_runner.task_hook.logmon.nomad: reading plugin stderr: alloc_id=6e172afa-8cdd-97c1-052d-9e918ff736ec task=test error="read |0: file already closed"