8000 Jobs executed twice after 24 hours · Issue #192 · dshearer/jobber · GitHub
[go: up one dir, main page]
More Web Proxy on the site http://driver.im/
Skip to content

Jobs executed twice after 24 hours #192

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
ckotte opened this issue Apr 26, 2018 · 32 comments
Closed

Jobs executed twice after 24 hours #192

ckotte opened this issue Apr 26, 2018 · 32 comments
Assignees
Labels
Milestone

Comments

@ckotte
Copy link
ckotte commented Apr 26, 2018

The first backup is executed at 2AM as scheduled, but all subsequent backups also start 2 seconds before 2AM!?

bash-4.4# jobber list
NAME                STATUS  SEC/MIN/HR/MDAY/MTH/WDAY  NEXT RUN TIME               NOTIFY ON SUCCESS  NOTIFY ON ERR  NOTIFY ON FAIL  ERR HANDLER
backup_confluence   Good    0 0 2 * * *               Apr 27 02:00:00 +0200 CEST  false              false          false           Continue
backup_jira         Good    0 0 2 * * *               Apr 27 02:00:00 +0200 CEST  false              false          false           Continue
backup_bitbucket    Good    0 0 2 * * *               Apr 27 02:00:00 +0200 CEST  false              false          false           Continue
backup_letsencrypt  Good    0 0 2 * * *               Apr 27 02:00:00 +0200 CEST  false              false          false           Continue
backup_syncthing    Good    0 0 2 * * *               Apr 27 02:00:00 +0200 CEST  false              false          false           Continue
bash-4.4# jobber log
TIME                  JOB                 SUCCEEDED  RESULT
Apr 26 02:00:00 2018  backup_bitbucket    true       Good
Apr 26 02:00:00 2018  backup_letsencrypt  true       Good
Apr 26 02:00:00 2018  backup_jira         true       Good
Apr 26 02:00:00 2018  backup_syncthing    true       Good
Apr 26 02:00:00 2018  backup_confluence   true       Good
Apr 26 01:59:58 2018  backup_bitbucket    false      Good
Apr 26 01:59:58 2018  backup_letsencrypt  true       Good
Apr 26 01:59:58 2018  backup_jira         true       Good
Apr 26 01:59:58 2018  backup_syncthing    true       Good
Apr 26 01:59:58 2018  backup_confluence   true       Good
Apr 25 02:00:00 2018  backup_bitbucket    true       Good
Apr 25 02:00:00 2018  backup_letsencrypt  true       Good
Apr 25 02:00:00 2018  backup_jira         true       Good
Apr 25 02:00:00 2018  backup_syncthing    true       Good
Apr 25 02:00:00 2018  backup_confluence   true       Good

This will create two backup files for each job and sometimes the backup fails because it can't access files while the same backup job runs more or less in parallel.

This issue happens with Jobber 1.2 and also with Jobber 1.3.2. I use a custom Docker container based on Alpine 3.7 with Jobber built from sources.

@dshearer dshearer self-assigned this Apr 27, 2018
@dshearer dshearer added the bug label Apr 27, 2018
@dshearer
Copy link
Owner

Um... That is quite strange. I think I need to ask you for steps to reproduce this. If you can publish a Docker image that reproduces this, that would be ideal.

@ckotte
Copy link
Author
ckotte commented May 3, 2018

Sorry, I didn't get a notification from Github.

I use this image: blacklabelops/jobber-cron. I also created a fork and modified it to support Jobber 1.3.2 with the same result.

I will try your docker image and give you feedback.

@ckotte
Copy link
Author
ckotte commented May 4, 2018

I have a similar behavior with your image. The jobs are executed twice right away:

/ $ jobber log
TIME                  JOB                 SUCCEEDED  RESULT
May  4 02:00:00 2018  backup_bitbucket    false      Good
May  4 02:00:00 2018  backup_letsencrypt  false      Good
May  4 02:00:00 2018  backup_jira         false      Good
May  4 02:00:00 2018  backup_syncthing    false      Good
May  4 02:00:00 2018  backup_confluence   false      Good
May  4 01:59:58 2018  backup_bitbucket    false      Good
May  4 01:59:58 2018  backup_syncthing    false      Good
May  4 01:59:58 2018  backup_letsencrypt  false      Good
May  4 01:59:58 2018  backup_confluence   false      Good
May  4 01:59:58 2018  backup_jira         false      Good

@dshearer
Copy link
Owner

Here's the scheduling code:

		// sleep till it's time to run it
		if now.Before(*job.NextRunTime) {
			afterChan := time.After(job.NextRunTime.Sub(now))
			select {
			case now = <-afterChan:
			case <-ctx.Done():
				// abort!
				heap.Push(&jq.q, job)
				return nil
			}
		}

		// schedule this job's next run
		job.NextRunTime = nextRunTime(job, now.Add(time.Second))
		if job.NextRunTime != nil {
			heap.Push(&jq.q, job)

I suspect that, for some reason, the time.After timer is firing two seconds early.

@ckotte
Copy link
Author
ckotte commented May 15, 2018

It's executed 2-3 seconds before and again at the scheduled time. I will try your container on another Docker host. Maybe it's related to the kernel and/or Docker version used on my primary Docker host...

Any hints on how can I debug this? Do I have to add logging of the variables to the code and compile it again or can I debug the compiled code?

@dshearer
Copy link
Owner
dshearer commented May 15, 2018 via email

@MohnishSingla
Copy link

I am facing the same issue with 1.2 version. In my case, it is happening alternate day. For example, refer below log:

2018-05-18 10:00:00.859403633 -0400 EDT DailyEmails root true Good
2018-05-18 09:59:58.856692681 -0400 EDT DailyEmails root true Good
2018-05-17 10:00:00.079149698 -0400 EDT DailyEmails root true Good
2018-05-16 10:00:00.079163695 -0400 EDT DailyEmails root true Good
2018-05-16 09:59:59.076448174 -0400 EDT DailyEmails root true Good

If we look at the logs, job executed twice on 16th and 18th whereas once on 17th.

@dshearer dshearer added this to the 1.3.3 milestone May 28, 2018
dshearer added a commit that referenced this issue May 28, 2018
- Queue was causing jobs to run twice within a few seconds of the
time they were supposed to run once.
@dshearer
Copy link
Owner

I've made a Docker image containing what I hope is a fix. Could one of you please try it out? It's dshearer/jobber:1.3.3-pr.1-alpine3.7

@ckotte
Copy link
Author
ckotte commented May 30, 2018

I tried it. Unfortunately, I have the same behavior:

/ $ jobber log
TIME                  JOB                 SUCCEEDED  RESULT
May 30 02:00:00 2018  backup_bitbucket    false      Good
May 30 02:00:00 2018  backup_letsencrypt  false      Good
May 30 02:00:00 2018  backup_jira         false      Good
May 30 02:00:00 2018  backup_syncthing    false      Good
May 30 02:00:00 2018  backup_confluence   false      Good
May 30 01:59:58 2018  backup_bitbucket    false      Good
May 30 01:59:58 2018  backup_letsencrypt  false      Good
May 30 01:59:58 2018  backup_jira         false      Good
May 30 01:59:58 2018  backup_confluence   false      Good
May 30 01:59:58 2018  backup_syncthing    false      Good
May 29 02:00:00 2018  backup_letsencrypt  false      Good
May 29 02:00:00 2018  backup_jira         false      Good
May 29 02:00:00 2018  backup_syncthing    false      Good
May 29 02:00:00 2018  backup_confluence   false      Good
May 29 02:00:00 2018  backup_bitbucket    false      Good

@dshearer
Copy link
Owner
dshearer commented Jun 2, 2018

I really can't imagine what is causing this. Unfortunately, I cannot reproduce it. If you can provide instructions for reproducing it, that would help. Otherwise, I need to think about this.

@ckotte
Copy link
Author
ckotte commented Jun 6, 2018

Sorry, I'm a bit busy at the moment.

I don't know how to reproduce. Maybe it's related to the kernel or Docker version I use..!? I need to do more tests to narrow down the issue..

Could you create a special version that logs the values of the variables etc. to a logfile during execution or something similar?

# docker --version
Docker version 17.07.0-ce, build fd7dc00
# uname -r
4.2.8

This is a part of the Dockerfile I used with your Docker image:

version: '2'

services:
  jobber-docker:
  [...]
  jobber-tools:
  [...]
  # TEST
  jobber-test:
    restart: always
    image: dshearer/jobber:1.3.3-pr.1-alpine3.7
    container_name: jobber-test
    hostname: jobber-test
    networks:
      - atlassian_backend
    volumes:
      - /xxx/Container/jobber_test:/home/jobberuser
      - /xxx/Backups/Test:/backup
      - /xxx/Container/postgres:/mnt/postgresql:ro
      - /xxx/Container/bitbucket:/mnt/bitbucket:ro
      - /xxx/Container/jira:/mnt/jira:ro
      - /xxx/Container/confluence:/mnt/confluence:ro
      - /xxx/Container/letsencrypt:/mnt/letsencrypt:ro
      - /xxx/Container/syncthing:/mnt/syncthing:ro
    environment:
      TZ: "Europe/Berlin"

networks:
  atlassian_backend:
    external: true

This is the test jobber file:

[/share/Container/jobber_test] # cat .jobber
[jobs]
- name: backup_confluence
  cmd: /backup/container-backup/container-backup.sh -p confluence 2>> /var/log/backup-confluence-err.log >> /var/log/backup-confluence.log
  time: '0 0 2 * * *'
  onError: Continue
  notifyOnError: false
  notifyOnFailure: false
- name: backup_jira
  cmd: /backup/container-backup/container-backup.sh -p jira 2>> /var/log/backup-jira-err.log >> /var/log/backup-jira.log
  time: '0 0 2 * * *'
  onError: Continue
  notifyOnError: false
  notifyOnFailure: false
- name: backup_bitbucket
  cmd: /backup/container-backup/container-backup.sh -p bitbucket 2>> /var/log/backup-bitbucket-err.log >> /var/log/backup-bitbucket.log
  time: '0 0 2 * * *'
  onError: Continue
  notifyOnError: false
  notifyOnFailure: false
- name: backup_letsencrypt
  cmd: /backup/container-backup/container-backup.sh -p letsencrypt 2>> /var/log/backup-letsencrypt-err.log >> /var/log/backup-letsencrypt.log
  time: '0 0 2 * * *'
  onError: Continue
  notifyOnError: false
  notifyOnFailure: false
- name: backup_syncthing
  cmd: /backup/container-backup/container-backup.sh -p syncthing 2>> /var/log/backup-syncthing-err.log >> /var/log/backup-syncthing.log
  time: '0 0 2 * * *'
  onError: Continue
  notifyOnError: false
  notifyOnFailure: false

Current status on jobber test container:

/ $ jobber log
TIME                  JOB                 SUCCEEDED  RESULT
Jun  6 02:00:00 2018  backup_bitbucket    false      Good
Jun  6 02:00:00 2018  backup_letsencrypt  false      Good
8000

Jun  6 02:00:00 2018  backup_jira         false      Good
Jun  6 02:00:00 2018  backup_syncthing    false      Good
Jun  6 02:00:00 2018  backup_confluence   false      Good
Jun  6 01:59:57 2018  backup_bitbucket    false      Good
Jun  6 01:59:57 2018  backup_letsencrypt  false      Good
Jun  6 01:59:57 2018  backup_jira         false      Good
Jun  6 01:59:57 2018  backup_syncthing    false      Good
Jun  6 01:59:57 2018  backup_confluence   false      Good
Jun  5 02:00:00 2018  backup_bitbucket    false      Good
Jun  5 02:00:00 2018  backup_letsencrypt  false      Good
Jun  5 02:00:00 2018  backup_jira         false      Good
Jun  5 02:00:00 2018  backup_syncthing    false      Good
Jun  5 02:00:00 2018  backup_confluence   false      Good
Jun  5 01:59:58 2018  backup_bitbucket    false      Good
Jun  5 01:59:58 2018  backup_letsencrypt  false      Good
Jun  5 01:59:58 2018  backup_jira         false      Good
Jun  5 01:59:58 2018  backup_syncthing    false      Good
Jun  5 01:59:58 2018  backup_confluence   false      Good
[...]

@MohnishSingla
Copy link

Hello,
For me, unfortunately this is happening only on one or two servers and m not able to repro it in my boxes.

@dshearer
Copy link
Owner
dshearer commented Jun 9, 2018

Thanks for the info. I'll keep working on this.

@dshearer
Copy link
Owner

Ok. @ckotte , I've done as you asked: I've made a build that logs some useful info when scheduling jobs. dshearer/jobber:1.3.3-pr.2-alpine-3.7

Here's an example of the output:

Next job to run is ExampleJob, at Jun 10 18:32:49.
It is now Jun 10 18:32:49, which is NOT before Jun 10 18:32:49
Running ExampleJob
jobberuser: echo "Jobber is running!"
Calling pop
Next job to run is ExampleJob, at Jun 10 18:32:50.
It is now Jun 10 18:32:49.
Sleeping for 999.251675ms.
It is now Jun 10 18:32:50, which is NOT before Jun 10 18:32:50
Running ExampleJob
jobberuser: echo "Jobber is running!"

It goes to stdout. You can also write this to disk with the logPath setting in your jobfile:

[prefs]
logPath: /path/to/logfile

[jobs]
...

@ckotte
Copy link
Author
ckotte commented Jun 11, 2018

thx. I updated my jobber test container with your new version. I will let you know.

@ckotte
Copy link
Author
ckotte commented Jun 12, 2018

Ok. Here is the output:

/ $ jobber log
TIME                  JOB                 SUCCEEDED  RESULT
Jun 12 02:00:00 2018  backup_bitbucket    false      Good
Jun 12 02:00:00 2018  backup_letsencrypt  false      Good
Jun 12 02:00:00 2018  backup_jira         false      Good
Jun 12 02:00:00 2018  backup_syncthing    false      Good
Jun 12 02:00:00 2018  backup_confluence   false      Good
Jun 12 01:59:59 2018  backup_bitbucket    false      Good
Jun 12 01:59:59 2018  backup_letsencrypt  false      Good
Jun 12 01:59:59 2018  backup_jira         false      Good
Jun 12 01:59:59 2018  backup_syncthing    false      Good
Jun 12 01:59:59 2018  backup_confluence   false      Good
Jun 12 01:59:58 2018  backup_bitbucket    false      Good
Jun 12 01:59:58 2018  backup_letsencrypt  false      Good
Jun 12 01:59:58 2018  backup_jira         false      Good
Jun 12 01:59:58 2018  backup_syncthing    false      Good
Jun 12 01:59:58 2018  backup_confluence   false      Good
Jun 11 02:00:00 2018  backup_bitbucket    false      Good
Jun 11 02:00:00 2018  backup_letsencrypt  false      Good
Jun 11 02:00:00 2018  backup_jira         false      Good
Jun 11 02:00:00 2018  backup_syncthing    false      Good
Jun 11 02:00:00 2018  backup_confluence   false      Good
Starting job-runner thread...
Listening for commands on /var/jobber/100/cmd.sock
JobManager Waiting...
Launching job runner thread
done
Calling pop
Next job to run is backup_confluence, at Jun 11 02:00:00.
It is now Jun 10 22:47:18.
Sleeping for 3h12m41.974725264s.
Jobberrunner Quitting
JobManager cancelling...
done
JobManager Waiting...
Main thread cancelled
JobRunner: cancelling
Consuming remaining run recs...
Run thread got 'stop'
JobRunner: cleaning up...
JobRunner done
Done onsuming remaining run recs
done
Jobberrunner Quitting
JobManager cancelling...
done
JobManager Waiting...
done
Starting job-runner thread...
Launching job runner thread
done
Calling pop
Next job to run is backup_confluence, at Jun 11 02:00:00.
It is now Jun 10 22:53:51.
Sleeping for 3h6m8.980066293s.
It is now Jun 11 02:00:00, which is NOT before Jun 11 02:00:00
Running backup_confluence
jobberuser: /backup/container-backup/container-backup.sh -p confluence 2>> /var/log/backup-confluence-err.log >> /var/log/backup-confluence.log
Calling pop
Next job to run is backup_syncthing, at Jun 11 02:00:00.
It is now Jun 11 02:00:00, which is NOT before Jun 11 02:00:00
Running backup_syncthing
jobberuser: /backup/container-backup/container-backup.sh -p syncthing 2>> /var/log/backup-syncthing-err.log >> /var/log/backup-syncthing.log
Calling pop
Next job to run is backup_jira, at Jun 11 02:00:00.
It is now Jun 11 02:00:00, which is NOT before Jun 11 02:00:00
Running backup_jira
jobberuser: /backup/container-backup/container-backup.sh -p jira 2>> /var/log/backup-jira-err.log >> /var/log/backup-jira.log
Calling pop
Next job to run is backup_letsencrypt, at Jun 11 02:00:00.
It is now Jun 11 02:00:00, which is NOT before Jun 11 02:00:00
ExecAndWait: /bin/sh: exit status 1
ExecAndWait: /bin/sh: exit status 1
ExecAndWait: /bin/sh: exit status 1
Running backup_letsencrypt
jobberuser: /backup/container-backup/container-backup.sh -p letsencrypt 2>> /var/log/backup-letsencrypt-err.log >> /var/log/backup-letsencrypt.log
Calling pop
Next job to run is backup_bitbucket, at Jun 11 02:00:00.
It is now Jun 11 02:00:00, which is NOT before Jun 11 02:00:00
ExecAndWait: /bin/sh: exit status 1
Running backup_bitbucket
jobberuser: /backup/container-backup/container-backup.sh -p bitbucket 2>> /var/log/backup-bitbucket-err.log >> /var/log/backup-bitbucket.log
Calling pop
Next job to run is backup_confluence, at Jun 12 02:00:00.
It is now Jun 11 02:00:00.
Sleeping for 23h59m59.754671253s.
ExecAndWait: /bin/sh: exit status 1
Got command
It is now Jun 12 01:59:58, which is NOT before Jun 12 02:00:00
Running backup_confluence
jobberuser: /backup/container-backup/container-backup.sh -p confluence 2>> /var/log/backup-confluence-err.log >> /var/log/backup-confluence.log
Calling pop
Next job to run is backup_syncthing, at Jun 12 02:00:00.
It is now Jun 12 01:59:58.
Sleeping for 122.217005ms.
ExecAndWait: /bin/sh: exit status 1
It is now Jun 12 01:59:58, which is NOT before Jun 12 02:00:00
Running backup_syncthing
jobberuser: /backup/container-backup/container-backup.sh -p syncthing 2>> /var/log/backup-syncthing-err.log >> /var/log/backup-syncthing.log
Calling pop
Next job to run is backup_jira, at Jun 12 02:00:00.
It is now Jun 12 01:59:58.
Sleeping for 30.217956ms.
ExecAndWait: /bin/sh: exit status 1
It is now Jun 12 01:59:58, which is NOT before Jun 12 02:00:00
Running backup_jira
jobberuser: /backup/container-backup/container-backup.sh -p jira 2>> /var/log/backup-jira-err.log >> /var/log/backup-jira.log
Calling pop
Next job to run is backup_letsencrypt, at Jun 12 02:00:00.
It is now Jun 12 01:59:58.
Sleeping for 29.751361ms.
ExecAndWait: /bin/sh: exit status 1
It is now Jun 12 01:59:58, which is NOT before Jun 12 02:00:00
Running backup_letsencrypt
jobberuser: /backup/container-backup/container-backup.sh -p letsencrypt 2>> /var/log/backup-letsencrypt-err.log >> /var/log/backup-letsencrypt.log
Calling pop
Next job to run is backup_bitbucket, at Jun 12 02:00:00.
It is now Jun 12 01:59:58.
Sleeping for 31.366824ms.
ExecAndWait: /bin/sh: exit status 1
It is now Jun 12 01:59:58, which is NOT before Jun 12 02:00:00
Running backup_bitbucket
jobberuser: /backup/container-backup/container-backup.sh -p bitbucket 2>> /var/log/backup-bitbucket-err.log >> /var/log/backup-bitbucket.log
Calling pop
Next job to run is backup_confluence, at Jun 12 02:00:00.
It is now Jun 12 01:59:58.
Sleeping for 1.784548946s.
ExecAndWait: /bin/sh: exit status 1
It is now Jun 12 01:59:59, which is NOT before Jun 12 02:00:00
Running backup_confluence
jobberuser: /backup/container-backup/container-backup.sh -p confluence 2>> /var/log/backup-confluence-err.log >> /var/log/backup-confluence.log
Calling pop
Next job to run is backup_syncthing, at Jun 12 02:00:00.
It is now Jun 12 01:59:59.
Sleeping for 122.076316ms.
ExecAndWait: /bin/sh: exit status 1
It is now Jun 12 01:59:59, which is NOT before Jun 12 02:00:00
Running backup_syncthing
jobberuser: /backup/container-backup/container-backup.sh -p syncthing 2>> /var/log/backup-syncthing-err.log >> /var/log/backup-syncthing.log
Calling pop
Next job to run is backup_jira, at Jun 12 02:00:00.
It is now Jun 12 01:59:59.
Sleeping for 30.214068ms.
ExecAndWait: /bin/sh: exit status 1
It is now Jun 12 01:59:59, which is NOT before Jun 12 02:00:00
Running backup_jira
jobberuser: /backup/container-backup/container-backup.sh -p jira 2>> /var/log/backup-jira-err.log >> /var/log/backup-jira.log
Calling pop
Next job to run is backup_letsencrypt, at Jun 12 02:00:00.
It is now Jun 12 01:59:59.
Sleeping for 29.762412ms.
ExecAndWait: /bin/sh: exit status 1
It is now Jun 12 01:59:59, which is NOT before Jun 12 02:00:00
Running backup_letsencrypt
jobberuser: /backup/container-backup/container-backup.sh -p letsencrypt 2>> /var/log/backup-letsencrypt-err.log >> /var/log/backup-letsencrypt.log
Calling pop
Next job to run is backup_bitbucket, at Jun 12 02:00:00.
It is now Jun 12 01:59:59.
Sleeping for 31.5993ms.
ExecAndWait: /bin/sh: exit status 1
It is now Jun 12 01:59:59, which is NOT before Jun 12 02:00:00
Running backup_bitbucket
jobberuser: /backup/container-backup/container-backup.sh -p bitbucket 2>> /var/log/backup-bitbucket-err.log >> /var/log/backup-bitbucket.log
Calling pop
Next job to run is backup_confluence, at Jun 12 02:00:00.
It is now Jun 12 01:59:59.
Sleeping for 784.902407ms.
ExecAndWait: /bin/sh: exit status 1
It is now Jun 12 02:00:00, which is NOT before Jun 12 02:00:00
Running backup_confluence
jobberuser: /backup/container-backup/container-backup.sh -p confluence 2>> /var/log/backup-confluence
8000
-err.log >> /var/log/backup-confluence.log
Calling pop
Next job to run is backup_syncthing, at Jun 12 02:00:00.
It is now Jun 12 02:00:00.
Sleeping for 91.668149ms.
ExecAndWait: /bin/sh: exit status 1
It is now Jun 12 02:00:00, which is NOT before Jun 12 02:00:00
Running backup_syncthing
jobberuser: /backup/container-backup/container-backup.sh -p syncthing 2>> /var/log/backup-syncthing-err.log >> /var/log/backup-syncthing.log
Calling pop
Next job to run is backup_jira, at Jun 12 02:00:00.
It is now Jun 12 02:00:00, which is NOT before Jun 12 02:00:00
ExecAndWait: /bin/sh: exit status 1
Running backup_jira
jobberuser: /backup/container-backup/container-backup.sh -p jira 2>> /var/log/backup-jira-err.log >> /var/log/backup-jira.log
Calling pop
Next job to run is backup_letsencrypt, at Jun 12 02:00:00.
It is now Jun 12 02:00:00, which is NOT before Jun 12 02:00:00
ExecAndWait: /bin/sh: exit status 1
Running backup_letsencrypt
jobberuser: /backup/container-backup/container-backup.sh -p letsencrypt 2>> /var/log/backup-letsencrypt-err.log >> /var/log/backup-letsencrypt.log
Calling pop
Next job to run is backup_bitbucket, at Jun 12 02:00:00.
It is now Jun 12 02:00:00, which is NOT before Jun 12 02:00:00
ExecAndWait: /bin/sh: exit status 1
Running backup_bitbucket
jobberuser: /backup/container-backup/container-backup.sh -p bitbucket 2>> /var/log/backup-bitbucket-err.log >> /var/log/backup-bitbucket.log
Calling pop
Next job to run is backup_confluence, at Jun 13 02:00:00.
It is now Jun 12 02:00:00.
Sleeping for 23h59m59.745693673s.
ExecAndWait: /bin/sh: exit status 1
Got command
Got list jobs cmd
Got command

@dshearer
Copy link
Owner

It is now Jun 12 01:59:59, which is NOT before Jun 12 02:00:00

I am flabbergasted.

@dshearer
Copy link
Owner

Sorry, I've been quite busy with some personal stuff. Would one of you be able to do some experiments?

  1. Please change the time of day these jobs run.
  2. Please change the frequency at which they run -- say, have them run every 2 hours.

Does the problem remain after either of these changes?

@ckotte
Copy link
Author
ckotte commented Jul 3, 2018

I will try this after I'm back from vacation in 2 weeks.

@ckotte
Copy link
Author
ckotte commented Jul 12, 2018

Same issue with different frequency. Looks like the issue always starts the next day...

/ $ jobber list
NAME                STATUS  SEC/MIN/HR/MDAY/MTH/WDAY  NEXT RUN TIME               NOTIFY ON SUCCESS  NOTIFY ON ERR  NOTIFY ON FAIL  ERR HANDLER
backup_confluence   Good    0 0 */1 * * *             Jul 12 10:00:00 +0200 CEST  false              false          false           Continue
backup_jira         Good    0 0 */2 * * *             Jul 12 10:00:00 +0200 CEST  false              false          false           Continue
backup_bitbucket    Good    0 0 8 * * *               Jul 13 08:00:00 +0200 CEST  false              false          false           Continue
backup_letsencrypt  Good    0 0 2 * * *               Jul 13 02:00:00 +0200 CEST  false              false          false           Continue
backup_syncthing    Good    0 0 10 * * *              Jul 12 10:00:00 +0200 CEST  false              false          false           Continue
/ $ jobber log
TIME                  JOB                 SUCCEEDED  RESULT
Jul 12 09:00:00 2018  backup_confluence   false      Good
Jul 12 08:59:58 2018  backup_confluence   false      Good
Jul 12 08:00:00 2018  backup_jira         false      Good
Jul 12 08:00:00 2018  backup_confluence   false      Good
Jul 12 08:00:00 2018  backup_bitbucket    false      Good
Jul 12 07:59:59 2018  backup_bitbucket    false      Good
Jul 12 07:00:00 2018  backup_confluence   false      Good
Jul 12 06:00:00 2018  backup_jira         false      Good
Jul 12 06:00:00 2018  backup_confluence   false      Good
Jul 12 05:59:59 2018  backup_jira         false      Good
Jul 12 05:00:00 2018  backup_confluence   false      Good
Jul 12 04:59:59 2018  backup_confluence   false      Good
Jul 12 04:00:00 2018  backup_jira         false      Good
Jul 12 04:00:00 2018  backup_confluence   false      Good
Jul 12 03:00:00 2018  backup_confluence   false      Good
Jul 12 02:00:00 2018  backup_jira         false      Good
Jul 12 02:00:00 2018  backup_confluence   false      Good
Jul 12 02:00:00 2018  backup_letsencrypt  false      Good
Jul 12 01:00:00 2018  backup_confluence   false      Good
Jul 12 00:00:00 2018  backup_jira         false      Good
Jul 12 00:00:00 2018  backup_confluence   false      Good
Jul 11 23:00:00 2018  backup_confluence   false      Good
Jul 11 22:00:00 2018  backup_jira         false      Good
Jul 11 22:00:00 2018  backup_confluence   false      Good
Jul 11 21:00:00 2018  backup_confluence   false      Good
Jul 11 20:00:00 2018  backup_jira         false      Good
Jul 11 20:00:00 2018  backup_confluence   false      Good
Jul 11 19:00:00 2018  backup_confluence   false      Good
Jul 11 18:00:00 2018  backup_jira         false      Good
Jul 11 18:00:00 2018  backup_confluence   false      Good
Jul 11 17:00:00 2018  backup_confluence   false      Good
Jul 11 16:00:00 2018  backup_jira         false      Good
Jul 11 16:00:00 2018  backup_confluence   false      Good
Jul 11 15:00:00 2018  backup_confluence   false      Good
Jul 11 14:00:00 2018  backup_jira         false      Good
Jul 11 14:00:00 2018  backup_confluence   false      Good
Jul 11 13:00:00 2018  backup_confluence   false      Good
Jul 11 12:00:00 2018  backup_jira         false      Good
Jul 11 12:00:00 2018  backup_confluence   false      Good
Jul 11 11:00:00 2018  backup_confluence   false      Good
Jul 11 10:00:00 2018  backup_jira         false      Good
Jul 11 10:00:00 2018  backup_syncthing    false      Good
Jul 11 10:00:00 2018  backup_confluence   false      Good

@dshearer
Copy link
Owner
dshearer commented Aug 2, 2018

I have finally reproduced this! I'll have a fix this weekend.

@dshearer
Copy link
Owner
dshearer commented Aug 6, 2018

Actually, this is taking longer than expected to make the fix. I'll keep this issue updated.

@ckotte
Copy link
Author
ckotte commented Aug 6, 2018

ok. thx. How could you reproduce this? Is it a general issue in the code or related to a specific setting?

@dshearer
Copy link
Owner
dshearer commented Aug 8, 2018 via email

@dshearer
Copy link
Owner

Ugh. I haven't been able to reproduce it again.

Sorry this is taking so long to fix. I can only work on this on the weekends.

@dshearer
Copy link
Owner

I don't know guys. Jobber's code is correct. This has something to do with Docker. I really don't know what to do at this point.

@ckotte
Copy link
Author
ckotte commented Aug 18, 2018

Thank you very much for your time and effort. Could you re-produce it in Docker again or only the first time?

@dshearer
Copy link
Owner

Only the first time. It is very frustrating.

@dshearer
Copy link
Owner

Okay, I reproduced it again, with a job called "Test" that runs every minute:

Next job to run is Test, at Aug 19 19:28:00.
It is now Aug 19 19:27:00.
Sleeping for 59.9998044s.
It is now Aug 19 19:27:59, which is NOT before Aug 19 19:28:00
Running Test
jobberuser: echo Hi
Calling pop
Next job to run is Test, at Aug 19 19:28:00.
It is now Aug 19 19:27:59.
Sleeping for 999.8574ms.
It is now Aug 19 19:28:00, which is NOT before Aug 19 19:28:00
Running Test

Here's the implementation: At startup, Jobber computes the next run time for each job, and then puts each job in a priority queue that is sorted by runtime. When Pop is called on this queue, it finds the next job that should run, sleeps until it's time to run it, and then returns it. Here's the part of the code corresponding to the output above (variable now is of type time.Time, and initially contains the current time):

// get next-scheduled job
job := heap.Pop(&jq.q).(*jobfile.Job)

var timeFmt = "Jan _2 15:04:05"

common.Logger.Printf("Next job to run is %v, at %v.", job.Name,
	job.NextRunTime.Format(timeFmt))

// sleep till it's time to run it
for now.Before(*job.NextRunTime) {
	sleepDur := job.NextRunTime.Sub(now)

	common.Logger.Printf("It is now %v.", now.Format(timeFmt))
	common.Logger.Printf("Sleeping for %v.", sleepDur)

	afterChan := time.After(sleepDur)
	select {
	case now = <-afterChan:
	case <-ctx.Done():
		// abort!
		heap.Push(&jq.q, job)
		return nil
	}
}

common.Logger.Printf("It is now %v, which is NOT before %v",
	now.Format(timeFmt), job.NextRunTime.Format(timeFmt))

The log indicates that, at some point, now == Aug 19 19:27:59, *job.NextRunTime == Aug 19 19:28:00, and yet now.Before(*job.NextRunTime) returned false.

@dshearer
Copy link
Owner

I've managed to reproduce this in a small program. I'll probably open a bug with the Go team.

@dshearer dshearer modified the milestones: 1.3.3, 1.3.4 Aug 25, 2018
dshearer added a commit that referenced this issue Sep 10, 2018
- A bug in Go's time package (golang/go#27090) was causing jobs
to be executed multiple times.  In this commit, we provide a workaround.
dshearer added a commit that referenced this issue Sep 10, 2018
(#192) Bugfix in queue.go: Job execed multiple times
@dshearer
Copy link
Owner

Implemented workaround.

@dshearer
Copy link
Owner

Fix is in v1.3.4. Please let me know if it works.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants
0