I have a large job that takes about 45 minutes to run, and it fails from time to time for various reasons. When this happens, I simply restart it and 99 times out of 100 it will finish just fine.
What I'm looking to do is either:
a) Upon a failure that causes the job to terminate, wait 5 minutes, then start the job again. There's no need to retry a second time if that attempt also fails, since there's probably something else going on that requires human intervention.
OR
b) 60 minutes after the normal start of the job, run a second job that checks to see if the first job failed. If so, run the first job again.
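For illustration only, a minimal Python sketch of both options might look like the following. It assumes the job is launched as an external command and signals failure with a non-zero exit code. The function names, the success-marker file convention, and the injectable `sleep` parameter are hypothetical and not tied to any particular scheduler.

```python
import subprocess
import time
from pathlib import Path

RETRY_DELAY_SECONDS = 5 * 60  # the 5-minute wait from option (a)


def run_job(command, marker=None):
    """Run the job once and return True if it exited with code 0.

    If `marker` is given (an assumed convention, not a scheduler feature),
    any stale marker is removed first and a fresh one is created on success.
    """
    if marker is not None:
        Path(marker).unlink(missing_ok=True)  # requires Python 3.8+
    succeeded = subprocess.run(command).returncode == 0
    if succeeded and marker is not None:
        Path(marker).touch()
    return succeeded


def run_with_one_retry(command, delay=RETRY_DELAY_SECONDS,
                       sleep=time.sleep, marker=None):
    """Option (a): on failure, wait `delay` seconds and retry exactly once."""
    if run_job(command, marker):
        return True
    sleep(delay)
    # A second failure is returned as-is so that a human can step in.
    return run_job(command, marker)


def rerun_if_failed(command, marker):
    """Option (b): meant to be scheduled about 60 minutes after the normal start.

    Reruns the job only when the success marker from the first run is missing.
    """
    if Path(marker).exists():
        return True
    return run_job(command, marker)
```

With option (a), the scheduler would start the wrapper, for example `run_with_one_retry(["/path/to/job"])`, instead of the job itself. With option (b), the first job would be started through `run_job(command, marker)` and the second scheduled job would call `rerun_if_failed(command, marker)` with the same marker path. Note that the check in (b) cannot tell a failed run from one that is still in progress, so it relies on the job normally finishing well inside the 60-minute window.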
Has anyone used either of these methods, or does anyone have general suggestions to share?