SoftTree Technologies
Technical Support Forums
wmweaver2
Joined: 29 Oct 2007 Posts: 111 Country: United States
Job in queue even though logs show it finished
The log says the job finished, but it is still sitting in the queue as running. Since we have our jobs running in synchronous mode, the next time the job was scheduled it didn't run.
I scanned the forums and see that other people have had similar issues, but I cannot find a root cause or a fix for the problem. For whatever reason the job actually finishes but never gets purged from the queue, so it cannot start again.
2008-07-03 07:10:57,979 [Job #122 - Tol_1 File Preprocess] DEBUG com.softtreetech.jscheduler.business.runner.ProgramJobRunner - runJob(): start
2008-07-03 07:10:57,991 [Job #122 - Tol_1 File Preprocess] DEBUG com.softtreetech.jscheduler.business.runner.security.SecurityService - authNativeUser: ftp_connect: 127.0.0.1
2008-07-03 07:10:58,047 [Job #122 - Tol_1 File Preprocess] DEBUG com.softtreetech.jscheduler.business.runner.security.SecurityService - authNativeUser: earchive login ok
2008-07-03 07:10:58,048 [Job #122 - Tol_1 File Preprocess] DEBUG com.softtreetech.jscheduler.business.runner.ProgramJobRunner - execProcess(): command line [/bin/sh -c ./ea_bat_filepreprocess.php] in work directory [/home/earchive/home/cgi-php/]
2008-07-03 07:10:58,048 [Job #122 - Tol_1 File Preprocess] DEBUG com.softtreetech.jscheduler.business.runner.ProgramJobRunner - runAs() username=earchive command=/bin/sh,-c,./ea_bat_filepreprocess.php workDir=/home/earchive/home/cgi-php
2008-07-03 07:10:58,048 [Job #122 - Tol_1 File Preprocess] DEBUG com.softtreetech.jscheduler.business.runner.ProgramJobRunner - exec : ./runas.pl,earchive,/bin/sh -c ./ea_bat_filepreprocess.php,/home/earchive/home/cgi-php
2008-07-03 07:10:58,055 [Job #122 - Tol_1 File Preprocess] DEBUG com.softtreetech.jscheduler.business.runner.ProgramJobRunner - waitForProcess(): start
2008-07-03 07:10:58,056 [Thread-1210] DEBUG com.softtreetech.jscheduler.business.runner.AbstractJobRunner$TimeoutVerifier - run(): start
2008-07-03 07:10:58,056 [Thread-1210] DEBUG com.softtreetech.jscheduler.business.runner.AbstractJobRunner$TimeoutVerifier - run(): timeout check not required
2008-07-03 07:11:27,822 [Job #122 - Tol_1 File Preprocess] DEBUG com.softtreetech.jscheduler.business.runner.ProgramJobRunner - waitForProcess(): end
2008-07-03 07:11:27,822 [Job #122 - Tol_1 File Preprocess] DEBUG com.softtreetech.jscheduler.business.runner.AbstractJobRunner - isFailed(...) : exit code 0
2008-07-03 07:11:27,822 [Job #122 - Tol_1 File Preprocess] DEBUG com.softtreetech.jscheduler.business.runner.ProgramJobRunner - killProcess start
2008-07-03 07:11:27,822 [Job #122 - Tol_1 File Preprocess] DEBUG com.softtreetech.jscheduler.business.runner.ProgramJobRunner - runJob(): end
Above is the log showing the job ran just like it does every other day, nothing different. However, it didn't get purged from the queue, and the job that runs after this one finishes started and ran as well.
[Job #123 - Tol_1 Get Images] DEBUG com.softtreetech.jscheduler.business.runner.ProgramJobRunner - runJob(): start
2008-07-03 07:12:04,855 [Job #123 - Tol_1 Get Images] DEBUG com.softtreetech.jscheduler.business.runner.security.SecurityService - authNativeUser: ftp_connect: 127.0.0.1
2008-07-03 07:12:05,066 [Job #123 - Tol_1 Get Images] DEBUG com.softtreetech.jscheduler.business.runner.security.SecurityService - authNativeUser: earchive login ok
2008-07-03 07:12:05,066 [Job #123 - Tol_1 Get Images] DEBUG com.softtreetech.jscheduler.business.runner.ProgramJobRunner - execProcess(): command line [/bin/sh -c ./ea_bat_getimages.php] in work directory [/home/earchive/home/cgi-php/]
2008-07-03 07:12:05,067 [Job #123 - Tol_1 Get Images] DEBUG com.softtreetech.jscheduler.business.runner.ProgramJobRunner - runAs() username=earchive command=/bin/sh,-c,./ea_bat_getimages.php workDir=/home/earchive/home/cgi-php
2008-07-03 07:12:05,067 [Job #123 - Tol_1 Get Images] DEBUG com.softtreetech.jscheduler.business.runner.ProgramJobRunner - exec : ./runas.pl,earchive,/bin/sh -c ./ea_bat_getimages.php,/home/earchive/home/cgi-php
2008-07-03 07:12:05,074 [Job #123 - Tol_1 Get Images] DEBUG com.softtreetech.jscheduler.business.runner.ProgramJobRunner - waitForProcess(): start
2008-07-03 07:12:05,075 [Thread-1213] DEBUG com.softtreetech.jscheduler.business.runner.AbstractJobRunner$TimeoutVerifier - run(): start
2008-07-03 07:12:05,075 [Thread-1213] DEBUG com.softtreetech.jscheduler.business.runner.AbstractJobRunner$TimeoutVerifier - run(): timeout check not required
2008-07-03 07:14:33,205 [Job #123 - Tol_1 Get Images] DEBUG com.softtreetech.jscheduler.business.runner.ProgramJobRunner - waitForProcess(): end
2008-07-03 07:14:33,206 [Job #123 - Tol_1 Get Images] DEBUG com.softtreetech.jscheduler.business.runner.AbstractJobRunner - isFailed(...) : exit code 0
2008-07-03 07:14:33,206 [Job #123 - Tol_1 Get Images] DEBUG com.softtreetech.jscheduler.business.runner.ProgramJobRunner - killProcess start
2008-07-03 07:14:33,206 [Job #123 - Tol_1 Get Images] DEBUG com.softtreetech.jscheduler.business.runner.ProgramJobRunner - runJob(): end
The job above requires the prior job to finish, and that is how it starts. Any help is greatly appreciated, even a workaround if we cannot get to the root of the problem.
Mon Jul 07, 2008 9:48 am
SysOp
Site Admin
Joined: 26 Nov 2006 Posts: 7949
Mon Jul 07, 2008 12:58 pm
wmweaver2
Joined: 29 Oct 2007 Posts: 111 Country: United States
Wow, that's a long post, but I tried to follow it. It does seem that exactly the same thing is happening to that user as is happening to us. What I'm trying to figure out is what the resolution is. Let me explain what we are doing a bit better first. We have 6 jobs that run 1 to 6 in order, using the "run another job on finish" setup, so the queue only ever shows job 1. What happened in our case is that job 1 was not removed from the queue, BUT jobs 2-6 ran anyway. The logs show that job 1 finished, yet it was never deleted from the queue, so when jobs 1-6 went to run again they couldn't, because the first instance of job 1 was still in the queue.
That being said, would I need to add a "custom kill command" to all 6 jobs or just job 1? If I just add a timeout to job 1, would that remove it from the queue? I'm just not sure what happens if jobs 2-6 are actually running when it times out. Any ideas?
Wed Jul 09, 2008 2:01 pm
SysOp
Site Admin
Joined: 26 Nov 2006 Posts: 7949
Sorry for not clarifying exactly why that other issue sounded similar. You are correct, the point was that this could also be an issue with the "kill" command. That would make sense if jobs 1 to 6 were scheduled to run in order but not linked directly. However, if they are linked and you have job 1 trigger job 2, job 2 trigger job 3, and so on, then the "kill" command is not the cause.
The kill command is not job specific; the suggested change is global and affects all jobs. By placing a script named "kill" in the scheduler home directory, you can make the scheduler use your custom kill instead of the system kill. This way you have full control over kill execution and, if necessary, over process logging. If jobs run remotely on one or more agents, you should deploy the same custom "kill" script to all agents.
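For illustration, a minimal custom "kill" wrapper could look like the sketch below. This is only a sketch, not SoftTree-supplied code: the log file location is made up, and it assumes the scheduler invokes the script with the same arguments it would pass to the system kill (a signal option and a process id).

#!/bin/sh
# Hypothetical custom "kill" placed in the scheduler home directory.
# It records every kill request for troubleshooting, then hands the
# original arguments off to the real system kill.
LOG=/tmp/scheduler-kill.log                    # assumed log location
echo "`date` kill invoked with: $*" >> "$LOG"  # record what gets killed and when
exec /bin/kill "$@"                            # forward to the system kill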
Wed Jul 09, 2008 3:08 pm
wmweaver2
Joined: 29 Oct 2007 Posts: 111 Country: United States
" However if they are linked and you have job 1 trigger job 2, job 2 -trigger job 3 and so on... then "kill" command is not a cause."
Any idea what is? The Job just never deleted from the queue but all the linked jobs ran and reported to the logs that they finished. I understand a root cause cannot always be determined and I have had this running for months without any issues. Even a work around at this point would be fine. The operator checked the HTML logs and it said the jobs ran fine yet it was stuck in the queue which put us in a really big bind.
Wed Jul 09, 2008 3:24 pm
SysOp
Site Admin
Joined: 26 Nov 2006 Posts: 7949
How did you link these jobs? Using job dependencies based on semaphores or using "on finish" event action of "run job" type?
Wed Jul 09, 2008 4:19 pm
wmweaver2
Joined: 29 Oct 2007 Posts: 111 Country: United States
"on finish" event action of "run job" is how we have them set up.
Thu Jul 10, 2008 8:42 am
SysOp
Site Admin
Joined: 26 Nov 2006 Posts: 7949
So when job #1 is done but still in the queue, the next job #2 is pushed to the queue but sits there because the queue is blocked by #1, and then you see #3, #4, #5, and #6 sitting in the same queue behind it? How could they get into the queue if they are linked to the previous job? Do they have their own time-based schedules, or [no schedule] for the schedule type?
Thu Jul 10, 2008 9:31 am
wmweaver2
Joined: 29 Oct 2007 Posts: 111 Country: United States
Well, the way it actually works is that job #1 is the only one we ever see in the queue, even when job #2 or a later job is running. All but job #1 have the [no schedule] type selected. The "run another job" function acts kind of buggy, but we are aware of how it works and haven't had a problem using it. It always reports the last job in the chain finished first, which we know can't be true since it is the LAST job in the chain. Also, like I said, the only job that ever shows in the queue is the first one, the scheduled one, not any of the jobs that are triggered on another job's finish. Here is a little bit of the logs showing that the jobs report finished in completely backwards order. It shows this way in the monitor as well. I can get you a screenshot of the queue next time it runs to show that only the first SCHEDULED job sits in the queue; the jobs that trigger off that job's finish never show in the queue.
The jobs actually run from 122 to 126 in order but show they finished in reverse. Using our actual job names, 122 Tol_1 File Preprocess is the scheduled job and the only one that ever goes in the queue. In the case I described earlier, 122-126 ran, finished, and sent finish messages to the log, but job 122 never left the queue, so the next time it was scheduled it couldn't start, because another instance of it was still in the queue even though the logs showed it finished.
EXAMPLE:
10-Jul-2008 07:35:17 DataBlasterI 122 Tol_1 File Preprocess Finish message sent.
10-Jul-2008 07:35:16 DataBlasterI 123 Tol_1 Get Images Finish message sent.
10-Jul-2008 07:35:14 DataBlasterI 124 Tol_1 File Reformat Finish message sent.
10-Jul-2008 07:35:13 DataBlasterI 125 Tol_1 File Load Finish message sent.
10-Jul-2008 07:35:12 DataBlasterI 126 Tol_1Web View Load Finish message sent.
Thu Jul 10, 2008 10:03 am
SysOp
Site Admin
Joined: 26 Nov 2006 Posts: 7949
That's a totally different game.
What you are observing in regard to job chaining is by design. The "Run job" action wasn't originally designed as a method for job chaining. This action is a sort of notification processing for the job it is attached to: anything triggered from that job is considered a continuation of that job's processing, a kind of nested processing. That's why triggered jobs run in the same process/thread and don't appear as separate entities in the job queue. The same thing would happen if you ran a job from a job script using the JobRun command. This is also the reason you see "finish" messages in reverse order: each of the triggered jobs waits for the next one to complete before reporting that it is complete.
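To illustrate (this is just a sketch of the behavior, not the scheduler's internal code, and the script names are made up), the chain effectively behaves as if each job's script ended with a call to the next job's script:

#!/bin/sh
# job1.sh -- behavioral sketch only
./do_job1_work.sh       # job 1's own work
./job2.sh               # the "on finish" action: job 2 runs as a nested call,
                        # and job 2 launches job 3 the same way, and so on
echo "job 1 finished"   # reached only after the whole nested chain returns

Because each job reports completion only after everything it triggered has returned, the last job in the chain is the first one to log a finish message.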
If I understand correctly, in your case it is not yet clear which of these jobs gets stuck; it is only known that the entire chain doesn't clear the queue. While we could spend some time checking and troubleshooting each job, it might just be easier to change the job chaining method and use dependencies, effectively separating the jobs and making each run standalone while still linking the process chain logically using semaphore files. This will also help us find out whether any of these jobs is individually causing the trouble.
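As a rough sketch of the semaphore method (the directory and file names below are placeholders; the actual dependency is configured in the job properties, not in the scripts):

# Appended as the last step of job 1's script -- signals completion.
touch /home/earchive/semaphores/job1.done

# Run as the first step of job 2's script -- consumes the semaphore so a
# later run of job 2 has to wait for a fresh signal from job 1.
rm -f /home/earchive/semaphores/job1.done

Each downstream job is then configured with a dependency on its predecessor's semaphore file, and the same pattern repeats for jobs 3 through 6.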
Thu Jul 10, 2008 10:20 am
wmweaver2
Joined: 29 Oct 2007 Posts: 111 Country: United States
"If I get it correctly, in your case it is not really clear yet which of these jobs get stuck. It is only known that the entire chain doesn't clear the queue. While we can spend some time checking and troubleshooting each job"
As you stated here this is really my biggest problem. I did troubleshoot every job in the chain and they all finished yet the first job, the one that shows in the queue never cleared out of the queue even though i got a finished messages for every job in the chain.
While i have set up dependencies to use semaphore files in the past we found it a bit taxing on the system. It has to scan multiple queues every couple minutes and we thought we might get in trouble with jobs overlapping. Since say the 1 pm and 3 pm have job1 that creates a semaphore file that job2 looks for. what if the 1PM job 1 isnt finished until 4 PM. Im not sure waht happens here Job2 is just scheduled to look for the semaphore file job 1 creates and both the 1 pm and 3pm should be waiting for the same semaphore file. I guess i can try it in test but im not sure scanning for semaphore files every 5 minutes for 50 + jobs is going to be very effective.
Thu Jul 10, 2008 10:49 am
SysOp
Site Admin
Joined: 26 Nov 2006 Posts: 7949
Here is my take on that. If the scheduler is set up on a local device, not a network share, checking for the existence of tens, hundreds, or even thousands of specific files once every few minutes should not be a "taxing" operation; there should be near-zero performance impact. On the other hand, no matter what happens to the system (restarts, shutdowns, crashes, etc.), job chains can continue running after restarts as long as the device/disk is in good shape. This method also supports system failovers, which the notification-action-based method doesn't, and you can assign dependent jobs to different job queues, while the notification-action-based method always runs them in the same queue as a single process. The only side effect I see is that there can be relatively brief gaps between runs of chained jobs, caused by the time the scheduler takes to check and find semaphores, which effectively increases the total processing time for the chain by several additional minutes. If that is acceptable for your job processing requirements, go with semaphores.
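To put the cost in perspective, one polling pass amounts to roughly a file-existence test per semaphore, something like this sketch (the paths are placeholders):

#!/bin/sh
# Roughly the work of one dependency-polling pass: one stat() per file.
for f in /home/earchive/semaphores/job1.done \
         /home/earchive/semaphores/job2.done
do
    if [ -f "$f" ]; then
        echo "semaphore present: $f"   # dependency satisfied
    fi
done

Even with 50+ jobs checked every few minutes, this is a trivial amount of I/O on a local disk.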
Thu Jul 10, 2008 11:10 am
wmweaver2
Joined: 29 Oct 2007 Posts: 111 Country: United States
I will try semaphore files again. I think I understand the software a lot better than I did when I first tried them, and this will run the jobs separately while still keeping them dependent on each other.
Thu Jul 10, 2008 12:18 pm