SoftTree Technologies
Technical Support Forums
Job Queues still lock up
SoftTree Technologies Forum Index » 24x7 Scheduler, Event Server, Automation Suite
seanc217



Joined: 23 May 2007
Posts: 272

I sent them to the FTP site.

Thanks!
Thu Jan 13, 2011 11:30 am
seanc217



Joined: 23 May 2007
Posts: 272

Hi,

I know you guys are probably busy looking over the logs I sent you.

I just want to check:

1. Do you have what you need, and does it make sense now?
2. Have you determined a cause yet?

Thanks for the help!
Fri Jan 14, 2011 4:53 pm
SysOp
Site Admin


Joined: 26 Nov 2006
Posts: 7838

Hi, we are still working on this. We have forwarded your log files to the development team and are currently waiting for their analysis results.
Tue Jan 18, 2011 8:53 am
SysOp
Site Admin


Joined: 26 Nov 2006
Posts: 7838

Hi,

Can you tell us how jobs 1234 and 1154 have been set up?

Are they independent and running on a schedule, or kicked off by other jobs using job chaining (the "run another job" option)?
Are they detached-mode jobs?
Is the timeout parameter set? If yes, what is the parameter value? If job chaining is used, do the other jobs in the chain have the timeout and detached-mode parameters set?


Thanks
Wed Jan 19, 2011 9:55 am
seanc217



Joined: 23 May 2007
Posts: 272

All jobs should be running detached.
Job chaining is used. The job runs every 5 minutes and looks for files, so lots of failures are expected.
If a file is present, the next job in the chain is called, which is an archive process.

No timeout value is set for this job.

If I can provide more information, please let me know.

Thanks.
Sun Jan 23, 2011 10:09 pm
SysOp
Site Admin


Joined: 26 Nov 2006
Posts: 7838

Hi,

Unfortunately, the debug logs you copied to the FTP site store trace records for different time windows. One log ends before 2011-01-12 21:20 (the time of the second referenced event); the other starts after 2011-01-06 15:40:03 (the time of the first referenced event). Because of the time differences, we were unable to reconcile the logs and match records from the agent system with records from the scheduler system for the events referenced as leading to queue lockups.

We need your help finding such matching records. Basically, we need the complete set of debug.log and schedule.log records for a job run leading to queue stagnation, and we need the records for this event from both ends, the agent and the scheduler. There is also a chance this issue is related to job chaining, so we may also need matching records for the entire job chain run. Once we have a matching set, it should be relatively easy to figure out why a job gets stuck in a queue and blocks it forever.
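
The time-window check described above can be sketched in Python. This is only an illustration: it assumes every log record starts with a `YYYY-MM-DD HH:MM:SS` timestamp, which may not match the real debug.log or schedule.log layout, and the helper names (`log_time_span`, `covers`, `records_near`) are made up for this example.

```python
import re
from datetime import datetime

# Assumption: each record begins with "YYYY-MM-DD HH:MM:SS".
# Adjust the pattern and format if the real log layout differs.
TS_PATTERN = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})")
TS_FORMAT = "%Y-%m-%d %H:%M:%S"


def log_time_span(lines):
    """Return (first, last) timestamps found in an iterable of log lines."""
    first = last = None
    for line in lines:
        match = TS_PATTERN.match(line)
        if not match:
            continue  # continuation line or unrecognized format
        stamp = datetime.strptime(match.group(1), TS_FORMAT)
        if first is None:
            first = stamp
        last = stamp
    return first, last


def covers(span, event_time):
    """True if event_time falls inside the (first, last) span of a log."""
    first, last = span
    return first is not None and first <= event_time <= last


def records_near(lines, event_time, window_seconds=600):
    """Yield timestamped lines within window_seconds of event_time."""
    for line in lines:
        match = TS_PATTERN.match(line)
        if match:
            stamp = datetime.strptime(match.group(1), TS_FORMAT)
            if abs((stamp - event_time).total_seconds()) <= window_seconds:
                yield line
```

Running `log_time_span` over the agent log and the scheduler log and calling `covers` on both spans with the same event time would show, before uploading anything, whether both files actually contain the event.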

Thank you.
Mon Jan 24, 2011 9:34 am
seanc217



Joined: 23 May 2007
Posts: 272

I have sent a new set of debug files
for the same job 1154, same time, etc.

Hopefully I got it right this time.

Let me know.
Mon Jan 24, 2011 1:21 pm
SysOp
Site Admin


Joined: 26 Nov 2006
Posts: 7838

Thank you. We are looking into this. It may take a little while; the provided files are very large.
Tue Jan 25, 2011 9:47 am
SysOp
Site Admin


Joined: 26 Nov 2006
Posts: 7838

Is job #1154 set up to run asynchronously? How many other jobs are assigned to the same queue? Are they asynchronous too? Why does this job run in a loop without pauses? Basically, as soon as one instance ends, another instance starts right away. That wouldn't be possible if queues were used correctly.

How many queues do you use? What is the average queue load in terms of assigned jobs?
The logs indicate many jobs starting virtually simultaneously and failing in a matter of milliseconds near the specified time event, then restarting right away.


PS. There seems to be a serious problem with the current job setup. After we figure out the cause of the queue issue, I suggest we spend some time reviewing the current design and checking how it can be tuned or reconfigured.
Tue Jan 25, 2011 11:19 am
seanc217



Joined: 23 May 2007
Posts: 272

None of the jobs are set up to run asynchronously.
The jobs, as in the case of 1154, look for files on a remote system where the agent is installed.

Job 1154 is set up to run every 5 minutes. This is the case for all the jobs that I will call "file watcher" jobs, which is why there are so many messages in the log.

For queue file_ops_2, there are approximately 55 of these "file watcher" jobs running.

If there is a better way to handle this, I am open to suggestions. Let me know what other information I can provide.

Thanks.
Tue Jan 25, 2011 11:41 am
seanc217



Joined: 23 May 2007
Posts: 272

For the file checker jobs, I would like to set them to asynchronous and see if the issue subsides for this queue. Since these are quick jobs, I see no harm in making them run asynchronously.

I'll let you know how that goes.
Tue Jan 25, 2011 3:42 pm
SysOp
Site Admin


Joined: 26 Nov 2006
Posts: 7838

Please don't change jobs to asynchronous; that would only make the situation worse.

For the moment, let's try a different approach.

1. Rename the debug.log files on the scheduler and agent systems. They will start fresh, which will make it easier to troubleshoot any issues.
2. Set up a new queue named "job_1154_queue".
3. Set job #1154 to run in that queue.
4. Monitor the files in [scheduler home]/Queue/job_1154_queue for a couple of hours. Please check that files appear and disappear in that queue approximately every 5 minutes. If you can confirm that, we are good. If you see something else, the jobs are not running as you expect. Your logs indicate job 1154 is running much more often than every 5 minutes.
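
Step 4 could be done by hand, but a small polling script makes the timing easier to read. The following is a minimal Python sketch, not a SoftTree tool: the function names are invented for this example, the directory path has to be filled in with the real [scheduler home], and a file that appears and disappears between two polls will be missed.

```python
import os
import time
from datetime import datetime


def snapshot(queue_dir):
    """Return the set of file names currently present in queue_dir."""
    try:
        return set(os.listdir(queue_dir))
    except FileNotFoundError:
        return set()  # directory not created yet


def diff_snapshots(previous, current):
    """Return (appeared, disappeared) file names, each as a sorted list."""
    return sorted(current - previous), sorted(previous - current)


def watch(queue_dir, poll_seconds=5, duration_seconds=2 * 60 * 60):
    """Print a timestamped line whenever a queue file appears or disappears."""
    deadline = time.monotonic() + duration_seconds
    previous = snapshot(queue_dir)
    while time.monotonic() < deadline:
        time.sleep(poll_seconds)
        current = snapshot(queue_dir)
        appeared, disappeared = diff_snapshots(previous, current)
        now = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
        for name in appeared:
            print(f"{now} appeared    {name}")
        for name in disappeared:
            print(f"{now} disappeared {name}")
        previous = current

# Example call (replace the first argument with the real queue directory):
# watch(os.path.join(scheduler_home, "Queue", "job_1154_queue"))
```

If the job really runs every 5 minutes, the "appeared" lines in the output should be roughly 5 minutes apart.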


In the long run, I suggest switching the agent from "agent" mode to "scheduler" mode with the distributed server option enabled, and changing the remote file-watch jobs to regular local file-watch jobs, so that the scheduler (the former agent) watches for the files locally and triggers jobs on the current scheduler remotely. In other words, switch from "jobs checking files remotely and, if found, running processes locally" to "scheduler checking files locally and, if found, running processes remotely". The main advantages of that method: there is no need for frequent job runs; file checking is done locally by the scheduler without running any jobs; there is no need for extensive network traffic, artificial job exceptions, or extensive job queuing; and the number of required jobs is cut in half. Processes are run only when the scheduler finds the required files.
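
The inversion described above (check locally, trigger remotely only when a file exists) can be illustrated with a short Python sketch. It does not use any 24x7 Scheduler API: `trigger` is a hypothetical callable standing in for whatever starts the processing job on the other system, and the function names are invented for this example.

```python
import fnmatch
import os


def find_new_files(watch_dir, pattern, already_seen):
    """Return sorted names in watch_dir that match pattern and are not yet seen."""
    names = fnmatch.filter(os.listdir(watch_dir), pattern)
    return sorted(set(names) - already_seen)


def poll_once(watch_dir, pattern, already_seen, trigger):
    """Check the local directory once and call trigger(path) for each new file.

    Nothing is triggered when no new file is present, so there are no
    "expected failure" runs as with a remote file-checking job.
    """
    new_names = find_new_files(watch_dir, pattern, already_seen)
    for name in new_names:
        trigger(os.path.join(watch_dir, name))  # hypothetical remote start
        already_seen.add(name)
    return new_names
```

The point of the design is that the check itself costs no job run and no network traffic; only a found file results in work being sent to the other system.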

Just in case: when "server" mode is enabled, the scheduler can both run local jobs and accept and run remote jobs submitted from other systems. That mode more or less supersedes the agent mode. There is one disadvantage, though: in the above situation, you would need to manage jobs in two places if you were to schedule jobs to run locally on both systems.


Last edited by SysOp on Fri Feb 11, 2011 10:30 am; edited 1 time in total
Wed Jan 26, 2011 10:31 am
seanc217



Joined: 23 May 2007
Posts: 272

Here's something interesting.
I have made the file checker jobs asynchronous and the queue has not frozen up yet.

I had to make a few changes to the archiver script to ensure unique temp files get created, but so far so good.
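
The post does not show the archiver script or say what language it is written in. As a general illustration of the unique-temp-file point, here is a minimal Python sketch using the standard `tempfile` module; the function name and the `archiver_` prefix are hypothetical.

```python
import os
import tempfile


def make_work_file(prefix="archiver_", suffix=".tmp", directory=None):
    """Create a uniquely named, empty temp file and return its path.

    mkstemp creates the file atomically, so two job instances started
    at the same moment cannot end up with the same name.
    """
    fd, path = tempfile.mkstemp(prefix=prefix, suffix=suffix, dir=directory)
    os.close(fd)  # the caller reopens the file by path when needed
    return path
```

This matters once jobs run asynchronously, because several instances can then overlap and a fixed temp file name would be shared between them.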

So, from the perspective of the scheduler, it appears that whatever needs to notify the master that the job has completed is not working when the jobs are not run asynchronously.

Maybe that can help with tracking down the issue.
Wed Jan 26, 2011 4:01 pm
SysOp
Site Admin


Joined: 26 Nov 2006
Posts: 7838

Since your changes are quite different from what had been suggested, let's wait for a few days and check if your changes resolve the issue. Please keep us posted on the status.
Thu Jan 27, 2011 11:00 am
seanc217



Joined: 23 May 2007
Posts: 272

Due to some issues with the way the scripts were set up, I had to revert the change.
I am going to work on setting up some scripts so that they can kick off in the background.

While the jobs were running in asynchronous mode, however, I saw no issues with the queues locking up, which would make sense because the queues are released immediately as the jobs are run.
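
The reasoning in this post can be shown with a toy model in Python. It is only a model of the hypothesis stated here (a synchronous job holds the queue until its completion is reported, an asynchronous job releases it at launch); it says nothing about how the scheduler is actually implemented, and all names in it are invented.

```python
from collections import deque


def run_queue(jobs, completion_lost=frozenset()):
    """Toy model of a serial job queue.

    jobs: list of (name, asynchronous) tuples in queue order.
    completion_lost: names of jobs whose completion notice never arrives.
    Returns (started, blocked_by): the names that started, and the name of
    the job holding the queue forever, or None if the queue drained.
    """
    pending = deque(jobs)
    started = []
    while pending:
        name, asynchronous = pending.popleft()
        started.append(name)
        if asynchronous:
            continue  # queue slot released as soon as the job is launched
        if name in completion_lost:
            return started, name  # queue keeps waiting for this job
    return started, None
```

With a lost completion notice for job 1154, `run_queue([("1154", False), ("1234", False)], {"1154"})` stops after the first job, while the same call with `("1154", True)` lets both jobs start.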
Thu Jan 27, 2011 4:58 pm
All times are GMT - 4 Hours
Page 4 of 5