Job Queues still lock up

I sent them to the ftp site.

Thanks!

Hi,

I know you guys are probably busy looking over the logs I sent you.

Just want to make sure:

1. You have what you need and it makes sense now.
2. Have you determined any cause yet?

Thanks for the help!

Hi, we are still working on this. We have forwarded your log files to the development team and are currently waiting for their analysis results.

Hi,

Can you tell us how jobs 1234 and 1154 have been set up?

Are they independent and running on a schedule, or are they kicked off by other jobs using job chaining (the "run another job" option)?
Are they detached mode jobs?
Is a timeout parameter set? If yes, what is the parameter value? If job chaining is used, do these other jobs have the timeout and detached mode parameters set?

Thanks

All jobs should be running detached.
Job chaining is used. The job runs every 5 minutes and is looking for files, so lots of failures are expected.
If a file is present, the next job in the chain, which is an archive process, is called.

No timeout value is set for this job.

If I can provide more information please let me know.

Thanks.

Hi,

Unfortunately, the debug logs you copied to the FTP site cover different time windows. One log ends before 2011-01-12 21:20 (the time of the second referenced event); the other starts after 2011-01-06 15:40:03 (the time of the first referenced event). Because of these time differences, we were unable to reconcile the logs and match records from the agent system and the scheduler system for the events referenced as leading to queue lockups. We need your help finding such matching records. Specifically, we need the complete set of debug.log and schedule.log records for a job run leading to queue stagnation, and we need records for this event from both ends: the agent and the scheduler. There is also a chance this issue is related to job chaining, so we may also need matching records for the entire job chain run. Once we have a matching set, it should be relatively easy to figure out why a job gets stuck in a queue and blocks it forever.
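
Before uploading, it may help to confirm that both logs actually cover the event time. The following is only a rough sketch: it assumes every log line starts with a `YYYY-MM-DD HH:MM:SS` timestamp, which may not match the real debug.log layout, and the function names are hypothetical.

```python
from datetime import datetime

TS_FORMAT = "%Y-%m-%d %H:%M:%S"  # assumed timestamp layout, not confirmed

def log_window(lines):
    """Return (first, last) timestamps found at the start of the given lines."""
    stamps = []
    for line in lines:
        try:
            stamps.append(datetime.strptime(line[:19], TS_FORMAT))
        except ValueError:
            continue  # skip lines without a leading timestamp
    if not stamps:
        return None
    return min(stamps), max(stamps)

def covers(window, event):
    """True if the event time falls inside the log's time window."""
    return window is not None and window[0] <= event <= window[1]

def both_cover(agent_lines, scheduler_lines, event_text):
    """True only if the agent log and the scheduler log both span the event."""
    event = datetime.strptime(event_text, TS_FORMAT)
    return (covers(log_window(agent_lines), event)
            and covers(log_window(scheduler_lines), event))
```

If `both_cover` returns False for the event you are reporting, the two files cannot be matched against each other for that event.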

Thank you.

I have sent a new set of debug files for the same job 1154, same time, etc.

Hopefully I got it right this time.

Let me know.

Thank you. We are looking into this. It may take a little while because the provided files are very large.

Is job #1154 set up to run asynchronously? How many other jobs are assigned to the same queue? Are they asynchronous too? Why does this job run in a loop without pauses? Basically, as soon as one instance ends, another instance starts right away. That wouldn't be possible if queues were used correctly.

How many queues do you use? What is the average queue load in terms of assigned jobs?
The logs indicate many jobs starting virtually simultaneously and failing in a matter of milliseconds near the specified event time, then restarting right away.

PS. There seems to be a serious problem with the current job setup. After we figure out the cause of the queue issue, I suggest we spend some time reviewing the current design and checking how it can be tuned or reconfigured.

None of the jobs are set up to run asynchronously.
The jobs in the case of 1154 are looking for files on a remote system where the agent is installed.

Job 1154 is set up to run every 5 minutes. This is the case for all the jobs that I will term "file watcher" jobs, which is why there are so many messages in the log.

For queue file_ops_2, there are approximately 55 of these "file watcher" jobs running.

If there is a better way to handle this, I am open to suggestions. Let me know what other information I can provide.

Thanks.

For the file checker jobs, I would like to set them to asynchronous and see if the issue subsides for this queue. Since these are quick jobs, I see no harm in making these file checker jobs run asynchronously.

I will let you know how that goes.

Please don't change the jobs to asynchronous; that would only make the situation worse.

For the moment, let's try a different approach.

1. Rename the debug.log files on the scheduler and agent systems. The logs will start fresh, which will make it easier to troubleshoot any issues.
2. Set up a new queue named "job_1154_queue".
3. Set job #1154 to run in that queue.
4. Monitor the files in [scheduler home]/Queue/job_1154_queue for a couple of hours. Please check that files appear and disappear in that queue approximately every 5 minutes. If you can confirm that, we are good. If you see something else, the jobs are not running as you expect. Your logs indicate job 1154 is running much more often than every 5 minutes.
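
The monitoring in step 4 can be done by hand, but a small polling script makes the timing easier to read. This is just a sketch in plain Python, not a scheduler feature; the function names, polling interval, and output format are assumptions.

```python
import os
import time
from datetime import datetime

def snapshot(queue_dir):
    """Return the set of file names currently in the queue directory."""
    return set(os.listdir(queue_dir))

def diff_snapshots(before, after):
    """Return (appeared, disappeared) file names between two snapshots."""
    return sorted(after - before), sorted(before - after)

def watch(queue_dir, interval_seconds=10, rounds=6):
    """Poll the directory and print a timestamped line for every change."""
    previous = snapshot(queue_dir)
    for _ in range(rounds):
        time.sleep(interval_seconds)
        current = snapshot(queue_dir)
        appeared, disappeared = diff_snapshots(previous, current)
        now = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
        for name in appeared:
            print(f"{now} appeared    {name}")
        for name in disappeared:
            print(f"{now} disappeared {name}")
        previous = current
```

Calling `watch()` with the full path of the job_1154_queue directory and, say, `rounds=720` covers two hours at the default 10-second interval. Files that come and go between two polls are not seen, so a shorter interval may be needed.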

In the long run, I suggest switching the agent from "agent" mode to "scheduler" mode with the distributed server option enabled, and changing the remote file-watch jobs to regular local file-watch jobs. The scheduler (the former agent) would then watch for the files locally and trigger jobs on the current scheduler remotely. In other words, you would switch from "jobs checking files remotely and, if found, running processes locally" to "scheduler checking files locally and, if found, running processes remotely". The main advantages of this method: there is no need for frequent job runs, because file checking is done locally by the scheduler without running any jobs; there is no extensive network traffic and there are no artificial job exceptions; there is no need for extensive job queuing; and the number of required jobs is cut in half. Processes are run only when the scheduler finds the required files.
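
Conceptually, the "check locally, trigger only when found" pattern looks like the sketch below. It is plain Python for illustration, not the scheduler's own file-watch feature; the `*.dat` pattern and the `trigger` callback (a stand-in for submitting the remote archive job) are hypothetical.

```python
import glob
import os

def find_ready_files(watch_dir, pattern="*.dat"):
    """Return the files in watch_dir matching pattern (a purely local check)."""
    return sorted(glob.glob(os.path.join(watch_dir, pattern)))

def check_and_trigger(watch_dir, trigger, pattern="*.dat"):
    """Call trigger(path) once per matching file; do nothing when none exist."""
    found = find_ready_files(watch_dir, pattern)
    for path in found:
        trigger(path)  # stand-in for submitting the remote archive job
    return len(found)
```

When no file is present, nothing is submitted at all, so there is no failed job run to record and nothing to place in a queue.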

For reference, when "server" mode is enabled, the scheduler can run local jobs and also accept and run remote jobs submitted from other systems. That mode kind of supersedes the agent mode. There is one disadvantage, though: in the above situation, you would need to manage jobs in two places if you were to schedule jobs to run locally on both systems.

Last edited by SysOp on Fri Feb 11, 2011 10:30 am; edited 1 time in total

Here's something interesting.
I have made the file checker jobs asynchronous and the queue has not frozen up yet.

I had to make a few changes to the archiver script to ensure unique temp files get created, but so far so good.
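
For anyone making a similar change, one common way to get unique temp files when several instances of a script run at the same time is to let the operating system pick the name. This is a sketch in Python, not the actual archiver script; the function name, prefix, and suffix are made up.

```python
import os
import tempfile

def make_unique_temp(prefix="archive_", suffix=".tmp", directory=None):
    """Create a new, uniquely named temp file and return its path."""
    # mkstemp creates the file atomically, so two concurrent runs
    # can never end up with the same name.
    fd, path = tempfile.mkstemp(prefix=prefix, suffix=suffix, dir=directory)
    os.close(fd)  # close the descriptor; the caller reopens the file by path
    return path
```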

So, from the perspective of the scheduler, it appears that whatever needs to notify the master that the job has completed is not working when the jobs are not run asynchronously.

Maybe that can help with tracking the issue.

Since your changes are quite different from what had been suggested, let's wait for a few days and check if your changes resolve the issue. Please keep us posted on the status.

Due to some issues with the way the scripts were set up, I had to revert the change.
I am going to work on setting up some scripts so that they can kick off in the background.

While the jobs were running in asynchronous mode, however, I saw no issues with the queues locking up, which would make sense because they are released immediately as they are run.
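
As for kicking scripts off in the background, the general idea is to start the process and return without waiting for it. A minimal sketch in Python follows, assuming the script can be launched as an ordinary command; `start_in_background` is a hypothetical helper, not part of the scheduler.

```python
import subprocess
import sys

def start_in_background(command):
    """Start command without waiting for it; return the Popen handle."""
    return subprocess.Popen(
        command,
        stdin=subprocess.DEVNULL,   # the child must not wait on our input
        stdout=subprocess.DEVNULL,  # discard output so nothing blocks on a pipe
        stderr=subprocess.DEVNULL,
    )

if __name__ == "__main__":
    # Example: launch a short Python one-liner and return immediately.
    handle = start_in_background([sys.executable, "-c", "print('working')"])
    print("started process", handle.pid)
```

The caller returns as soon as the process is started, so whatever launched it is not held up while the real work runs. The trade-off is that the launcher no longer sees the script's exit status unless it checks the handle later.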