SoftTree Technologies
Technical Support Forums
Job Queues still lock up
SoftTree Technologies Forum Index » 24x7 Scheduler, Event Server, Automation Suite
seanc217



Joined: 23 May 2007
Posts: 272

I am on the latest version of the multi-platform edition, and I am still having issues with queues locking up from time to time. I want to write a shell script to check for this condition and e-mail a support address when this happens. This script would be run via cron outside of the scheduler. What are some of the things I could check for when queues get backed up? I understand that if things are running, the queues will get backed up, but I would rather get false positives than no notification at all and miss our SLAs. One thing I am going to do is go into the Queues folder, look for files older than an hour or two, and fire off notifications if I find any. Of course, if there are jobs that run longer than this, I will get false positives. If you have any ideas on this, I would appreciate any input.

Thanks!
Mon May 24, 2010 2:12 pm
SysOp
Site Admin


Joined: 26 Nov 2006
Posts: 6714

Hi,

You can watch for .Q files in queue folders and for their file times. Technically you can simply watch for queue folder times and if they are recent, don't bother with individual files.

As you said, if you find files 2+ hours old, you still don't know if this is a problem or not. Yet, you may make the monitoring script alert somebody and ask them to investigate. I'm not aware of a sure way to implement a generic automated solution for avoiding false positives. In many cases, only the owner of the process knows for sure how to check the process status. Timing may be data driven; for example, more data can cause the same job to run longer on

Theoretically you can create a file listing all jobs, their ids, and expected run time durations and compare against that file. There is a way to figure out the job id and name from a .Q file name. That name is just a run-time job instance id. If you know that number, you can call the 24x7 API to get the list of queued jobs, compare their run-time ids, and from there get other properties.
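
For illustration, the file-time check described above could be sketched roughly as follows. This is only a sketch, written in Python rather than shell; the function names and the two-hour threshold are assumptions made for the example, and nothing here calls the 24x7 API. It reports .q files older than a threshold and includes the folder-time shortcut (skip the per-file scan if the queue folder itself changed recently).

```python
import time
from pathlib import Path


def folder_recently_changed(queue_dir, max_age_seconds, now=None):
    """True if the queue folder's own modification time is within the threshold."""
    now = time.time() if now is None else now
    return now - Path(queue_dir).stat().st_mtime <= max_age_seconds


def stale_queue_files(queue_dir, max_age_seconds, now=None):
    """Return (path, age_in_seconds) for every .q file older than the threshold.

    The result is sorted oldest first. A stale file is only a hint:
    a long-running job produces the same picture as a locked queue.
    """
    now = time.time() if now is None else now
    stale = []
    for entry in Path(queue_dir).iterdir():
        # Queue entries appear as <run-time instance id>.q (or .Q)
        if entry.is_file() and entry.suffix.lower() == ".q":
            age = now - entry.stat().st_mtime
            if age > max_age_seconds:
                stale.append((entry, age))
    return sorted(stale, key=lambda item: item[1], reverse=True)
```

A cron-driven wrapper could call `stale_queue_files(queue_dir, 2 * 3600)` for each queue folder and print one line per result; cron typically mails a job's output to the address configured in MAILTO, so printing may be enough to trigger the notification.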
Mon May 24, 2010 3:57 pm
barefootguru



Joined: 10 Aug 2007
Posts: 195

Are the queues ever empty when the system's running smoothly? e.g. if they're regularly empty at least once an hour the monitor job could wake up every hour and monitor the queue directories every 5 minutes for up to an hour. Empty queues within that hour = success, go back to sleep for an hour. No empty queues=failure, send e-mail.

Other tests could be the number of jobs in the queue directories, or as above, oldest timestamp.
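
A rough sketch of the "did the queue ever drain" test, in Python. The function names and defaults (12 checks, 5 minutes apart, i.e. about an hour) are assumptions chosen to match the example above; the `sleep` parameter exists only so the wait can be replaced in a test.

```python
import time
from pathlib import Path


def queued_job_count(queue_dir):
    """Count the .q files currently sitting in one queue folder."""
    return sum(
        1
        for entry in Path(queue_dir).iterdir()
        if entry.is_file() and entry.suffix.lower() == ".q"
    )


def queue_emptied(queue_dir, checks=12, interval_seconds=300, sleep=time.sleep):
    """Poll the folder up to `checks` times; True as soon as it is seen empty.

    Returns False if every check found at least one queued job, which is
    the condition that would trigger the e-mail.
    """
    for attempt in range(checks):
        if queued_job_count(queue_dir) == 0:
            return True
        if attempt < checks - 1:
            sleep(interval_seconds)  # no need to wait after the last check
    return False
```

`queued_job_count` on its own also covers the "number of jobs in the queue directories" test, for example by alerting when the count exceeds a limit you pick per queue.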

Or you could attack from the other end: Have high priority jobs in 24x7 which run frequently and update a heartbeat timestamp somewhere. Monitor job checks the heartbeats have been updated recently.
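
The monitor side of the heartbeat idea might look something like this Python sketch. It assumes each frequent 24x7 job simply touches a file of its own (the file names and the threshold are hypothetical), and the monitor reports any heartbeat file that is missing or too old.

```python
import time
from pathlib import Path


def stale_heartbeats(heartbeat_files, max_age_seconds, now=None):
    """Return (path, age_in_seconds) for heartbeats that are missing or too old.

    A missing file is reported with an age of None.
    """
    now = time.time() if now is None else now
    problems = []
    for name in heartbeat_files:
        path = Path(name)
        if not path.exists():
            problems.append((str(path), None))
            continue
        age = now - path.stat().st_mtime
        if age > max_age_seconds:
            problems.append((str(path), age))
    return problems
```

Using one heartbeat file per queue would let the alert say which queue stopped moving, rather than only that something did.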

Cheers
Tue Jun 01, 2010 5:15 pm
seanc217



Joined: 23 May 2007
Posts: 272

OK I got a queue that got locked up again last night.
I had enabled the debug.log to keep entries longer.

Here's what I got. The job that locks up the queue is: Job #797 - 010_check_visa_dm_cdh_run_trigger

2010-06-21 21:00:00,492 [Job #411 - 01_check_carms_trigger] DEBUG com.softtreetech.jscheduler.business.runner.RemoteJobRunner - runJob
com.softtreetech.jscheduler.common.SchedException: Job completed with exit code 1. This exit code does not satisfy job exit code condition. Job failed.
at com.softtreetech.jscheduler.business.agent.remote.RemoteAgentImpl.executeJob(Unknown Source)
at sun.reflect.GeneratedMethodAccessor58.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:592)
at sun.rmi.server.UnicastServerRef.dispatch(UnicastServerRef.java:294)
at sun.rmi.transport.Transport$1.run(Transport.java:153)
at java.security.AccessController.doPrivileged(Native Method)
at sun.rmi.transport.Transport.serviceCall(Transport.java:149)
at sun.rmi.transport.tcp.TCPTransport.handleMessages(TCPTransport.java:466)
at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run(TCPTransport.java:707)
at java.lang.Thread.run(Thread.java:595)
at sun.rmi.transport.StreamRemoteCall.exceptionReceivedFromServer(StreamRemoteCall.java:247)
at sun.rmi.transport.StreamRemoteCall.executeCall(StreamRemoteCall.java:223)
at sun.rmi.server.UnicastRef.invoke(UnicastRef.java:126)
at com.softtreetech.jscheduler.business.agent.remote.RemoteAgentImpl_Stub.executeJob(Unknown Source)
at com.softtreetech.jscheduler.business.runner.RemoteJobRunner.runJob(Unknown Source)
at com.softtreetech.jscheduler.business.runner.AbstractJobRunner.oO0000(Unknown Source)
at com.softtreetech.jscheduler.business.runner.AbstractJobRunner.Object(Unknown Source)
at com.softtreetech.jscheduler.business.runner.AbstractJobRunner.execute(Unknown Source)
at com.softtreetech.jscheduler.business.runner.JobExecutorImpl.execute(Unknown Source)
at com.softtreetech.jscheduler.business.runner.JobExecutorImpl$1.run(Unknown Source)
at java.lang.Thread.run(Thread.java:595)
2010-06-21 21:00:00,495 [Job #797 - 010_check_visa_dm_cdh_run_trigger] DEBUG com.softtreetech.jscheduler.business.runner.RemoteJobRunner - runJob
com.softtreetech.jscheduler.common.SchedException: Job completed with exit code 1. This exit code does not satisfy job exit code condition. Job failed.
at com.softtreetech.jscheduler.business.agent.remote.RemoteAgentImpl.executeJob(Unknown Source)
at sun.reflect.GeneratedMethodAccessor58.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:592)
at sun.rmi.server.UnicastServerRef.dispatch(UnicastServerRef.java:294)
at sun.rmi.transport.Transport$1.run(Transport.java:153)
at java.security.AccessController.doPrivileged(Native Method)
at sun.rmi.transport.Transport.serviceCall(Transport.java:149)
at sun.rmi.transport.tcp.TCPTransport.handleMessages(TCPTransport.java:466)
at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run(TCPTransport.java:707)
at java.lang.Thread.run(Thread.java:595)
at sun.rmi.transport.StreamRemoteCall.exceptionReceivedFromServer(StreamRemoteCall.java:247)
at sun.rmi.transport.StreamRemoteCall.executeCall(StreamRemoteCall.java:223)
at sun.rmi.server.UnicastRef.invoke(UnicastRef.java:126)
at com.softtreetech.jscheduler.business.agent.remote.RemoteAgentImpl_Stub.executeJob(Unknown Source)
at com.softtreetech.jscheduler.business.runner.RemoteJobRunner.runJob(Unknown Source)
at com.softtreetech.jscheduler.business.runner.AbstractJobRunner.oO0000(Unknown Source)
at com.softtreetech.jscheduler.business.runner.AbstractJobRunner.Object(Unknown Source)
at com.softtreetech.jscheduler.business.runner.AbstractJobRunner.execute(Unknown Source)
at com.softtreetech.jscheduler.business.runner.JobExecutorImpl.execute(Unknown Source)
at com.softtreetech.jscheduler.business.runner.JobExecutorImpl$1.run(Unknown Source)
at java.lang.Thread.run(Thread.java:595)
2010-06-21 21:00:00,495 [Job #411 - 01_check_carms_trigger] DEBUG com.softtreetech.jscheduler.business.queue.JobQueue - QUEUE_UNLOCKED
2010-06-21 21:00:00,572 [Job #411 - 01_check_carms_trigger] ERROR com.softtreetech.jscheduler.business.runner.JobExecutorImpl - Job errors: Remote job failed. Exit code: -1



One thing I notice is that there is never a QUEUE_UNLOCKED message for the job mentioned above. I think this is the issue. Can you take a look at this and let me know what you find out?
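
The same observation could be checked automatically on a debug.log fragment with something like the Python sketch below. It assumes the line layout seen in the excerpt above (timestamp, `[Job #id - name]`, level, logger, ` - `, message) and simply lists jobs that logged a `runJob` message but no `QUEUE_UNLOCKED` in the lines it is given. The function name is made up for the example, and because the same job runs many times, the result is only meaningful on a short window of the log around the incident.

```python
import re

# Layout assumed from the excerpt: "<date> <time> [Job #<id> - <name>] <LEVEL> <logger> - <message>"
LOG_LINE = re.compile(r"^\S+ \S+ \[Job #(\d+) - ([^\]]+)\] \w+ \S+ - (.*)$")


def jobs_without_unlock(lines):
    """Return {job_id: job_name} for jobs with a runJob line but no QUEUE_UNLOCKED."""
    started = {}
    unlocked = set()
    for line in lines:
        match = LOG_LINE.match(line)
        if not match:
            continue  # stack trace frames and exception text are skipped
        job_id, name, message = match.groups()
        if message.startswith("runJob"):
            started.setdefault(job_id, name)
        elif message.strip() == "QUEUE_UNLOCKED":
            unlocked.add(job_id)
    return {job_id: name for job_id, name in started.items() if job_id not in unlocked}
```

Fed the lines posted above, this would be expected to report job 797 and not job 411.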

Thanks!
Tue Jun 22, 2010 11:54 am
SysOp
Site Admin


Joined: 26 Nov 2006
Posts: 6714

This looks like the job is getting stuck on the remote site after the failure and the queue is waiting forever for that job to complete.

I believe I've seen something like that reported before and there should be some fix for that. I'm searching for that fix.
Tue Jun 22, 2010 1:11 pm
SysOp
Site Admin


Joined: 26 Nov 2006
Posts: 6714

It seems that the fix I tried to locate is not a separate fix. It is part of all 4.3.2xx builds. Which version/build are you running now on the remote agent system?
Wed Jun 23, 2010 9:20 am
seanc217



Joined: 23 May 2007
Posts: 272

Version 4.3.293
Wed Jun 23, 2010 10:08 am
SysOp
Site Admin


Joined: 26 Nov 2006
Posts: 6714

We are still diagnosing this issue. This seems to be an issue with the remote agent connection not being closed after a job exception is raised, which causes the job to hang in the queue.

Do you have the debug.log file from the agent system for the same job failure? If you do, please post the fragment of the log with the exception for that job failure and the next few lines. We'd like to see what happened after the exception was raised.
Thu Jun 24, 2010 9:32 am
seanc217



Joined: 23 May 2007
Posts: 272

I don't have the debug.log enabled and extended for when this kind of error happens.
I can enable this, so the next time it happens, I can send it along.

I will keep you posted when I get the same issue. Sometimes it takes a while for this to happen.
Thu Jun 24, 2010 10:43 am
seanc217



Joined: 23 May 2007
Posts: 272

Hi,

The queue finally locked up.
Here's the debug information from the agent this time...

2010-07-19 07:47:00,126 [Thread-433523] DEBUG com.softtreetech.jscheduler.business.runner.AbstractJobRunner$TimeoutVerifier - run(): start
2010-07-19 07:47:00,128 [Thread-433523] DEBUG com.softtreetech.jscheduler.business.runner.AbstractJobRunner$TimeoutVerifier - run(): timeout check not required
2010-07-19 07:47:00,156 [Job #498 - 010_check_ve_cmr_trigger] DEBUG com.softtreetech.jscheduler.business.runner.ProgramJobRunner - runJob(): start
2010-07-19 07:47:00,157 [Job #498 - 010_check_ve_cmr_trigger] DEBUG com.softtreetech.jscheduler.business.runner.security.SecurityService - authNativeUser: /opt/24x7_Scheduler/auth.pl
2010-07-19 07:47:00,196 [Job #498 - 010_check_ve_cmr_trigger] DEBUG com.softtreetech.jscheduler.business.runner.AbstractJobRunner - runAs() username=srv_etl command=/home/srv_etl/file_scripts/file_checker.ksh,/loads/work/etl/inbound/visa_extras/ve_cmr_trigger.txt,dsadm,N workDir=/home/srv_etl/file_scripts
2010-07-19 07:47:00,196 [Job #498 - 010_check_ve_cmr_trigger] DEBUG com.softtreetech.jscheduler.business.runner.AbstractJobRunner - exec : ./runas.pl,srv_etl,/home/srv_etl/file_scripts/file_checker.ksh /loads/work/etl/inbound/visa_extras/ve_cmr_trigger.txt dsadm N,/home/srv_etl/file_scripts
2010-07-19 07:47:00,213 [Job #498 - 010_check_ve_cmr_trigger] DEBUG com.softtreetech.jscheduler.business.runner.ProgramJobRunner - waitForProcess(): start
2010-07-19 07:47:00,384 [Job #498 - 010_check_ve_cmr_trigger] DEBUG com.softtreetech.jscheduler.business.runner.ProgramJobRunner - waitForProcess(): end
2010-07-19 07:47:00,384 [Job #498 - 010_check_ve_cmr_trigger] DEBUG com.softtreetech.jscheduler.business.runner.AbstractJobRunner - isFailed(...) : exit code 1
2010-07-19 07:47:00,531 [Job #498 - 010_check_ve_cmr_trigger] DEBUG com.softtreetech.jscheduler.business.runner.AbstractJobRunner - isFailed(...) : Enumeration found [0]
2010-07-19 07:47:00,531 [Job #498 - 010_check_ve_cmr_trigger] DEBUG com.softtreetech.jscheduler.business.runner.ProgramJobRunner - killProcess start
2010-07-19 07:47:00,643 [Job #498 - 010_check_ve_cmr_trigger] ERROR com.softtreetech.jscheduler.business.runner.JobExecutorImpl - Job errors: Job completed with exit code 1. This exit code does not satisfy job exit code condition. Job failed.


I really do not see anything here; however, the queue got stuck.
Here's the listing from the queue folder where it is stuck...

drwxr-x--- 15 srv_etl users 408 2010-07-08 12:29 ..
-rw-r----- 1 srv_etl users 1419 2010-07-19 07:47 3600868.q
-rw-r----- 1 srv_etl users 1427 2010-07-19 07:47 3600867.q
-rw-r----- 1 srv_etl users 1418 2010-07-19 07:48 3600875.q
-rw-r----- 1 srv_etl users 1444 2010-07-19 07:48 3600874.q
-rw-r----- 1 srv_etl users 1431 2010-07-19 07:52 3600896.q
-rw-r----- 1 srv_etl users 1419 2010-07-19 07:52 3600895.q
-rw-r----- 1 srv_etl users 1432 2010-07-19 07:52 3600894.q
-rw-r----- 1 srv_etl users 1431 2010-07-19 07:52 3600893.q
-rw-r----- 1 srv_etl users 1419 2010-07-19 07:52 3600892.q
-rw-r----- 1 srv_etl users 1419 2010-07-19 07:52 3600891.q
-rw-r----- 1 srv_etl users 1418 2010-07-19 07:53 3600905.q
-rw-r----- 1 srv_etl users 1444 2010-07-19 07:53 3600904.q
-rw-r----- 1 srv_etl users 1419 2010-07-19 07:57 3600935.q
-rw-r----- 1 srv_etl users 1419 2010-07-19 07:57 3600934.q
-rw-r----- 1 srv_etl users 1418 2010-07-19 07:58 3600942.q
-rw-r----- 1 srv_etl users 1444 2010-07-19 07:58 3600941.q
-rw-r----- 1 srv_etl users 1419 2010-07-19 08:02 3600960.q
-rw-r----- 1 srv_etl users 1419 2010-07-19 08:02 3600959.q
-rw-r----- 1 srv_etl users 1418 2010-07-19 08:03 3600967.q
-rw-r----- 1 srv_etl users 1444 2010-07-19 08:03 3600966.q
-rw-r----- 1 srv_etl users 1419 2010-07-19 08:07 3600984.q
-rw-r----- 1 srv_etl users 1419 2010-07-19 08:07 3600983.q
Mon Jul 19, 2010 11:10 am
SysOp
Site Admin


Joined: 26 Nov 2006
Posts: 6714

Thank you. We will compare this log to other logs and check for any differences. Do you happen to have the matching fragment of debug.log from the scheduler system? Also, can you post the matching records from the scheduler.log file on the agent system? That will give us all the pieces of the puzzle.
Mon Jul 19, 2010 11:31 am
seanc217



Joined: 23 May 2007
Posts: 272

This is from the master. The job that got locked up was number 498:

13-Jul-2010 07:47:00 AM 2 7xScmRht5hhVbDPyGb3xspoWOaY= 498 010_check_ve_cmr_trigger Remote job started.
13-Jul-2010 07:47:00 AM 2 7xScmRht5hhVbDPyGb3xspoWOaY= 605 010_check_retail_lend_trigger Agent "proddsn1" contacted.
13-Jul-2010 07:47:00 AM 2 7xScmRht5hhVbDPyGb3xspoWOaY= 498 010_check_ve_cmr_trigger Agent "proddsn1" contacted.
13-Jul-2010 07:47:00 AM 2 7xScmRht5hhVbDPyGb3xspoWOaY= 814 010_check_dw_s2_visa_dm_cdh_trigger Remote job started.
13-Jul-2010 07:47:00 AM 2 7xScmRht5hhVbDPyGb3xspoWOaY= 597 010_check_clrr_kalido_trigger Remote job started.
13-Jul-2010 07:47:00 AM 2 7xScmRht5hhVbDPyGb3xspoWOaY= 814 010_check_dw_s2_visa_dm_cdh_trigger Agent "proddsn1" contacted.
13-Jul-2010 07:47:00 AM 2 7xScmRht5hhVbDPyGb3xspoWOaY= 597 010_check_clrr_kalido_trigger Agent "proddsn1" contacted.
13-Jul-2010 07:47:00 AM 3 7xScmRht5hhVbDPyGb3xspoWOaY= 498 010_check_ve_cmr_trigger Remote job failed. Exit code: -1

This is the log from the agent system:

19-Jul-2010 07:47:00 AM 3 null 498 010_check_ve_cmr_trigger Job completed with exit code 1. This exit code does not satisfy job exit code condition. Job failed.
19-Jul-2010 07:47:00 AM 3 null 814 010_check_dw_s2_visa_dm_cdh_trigger Job completed with exit code 1. This exit code does not satisfy job exit code condition. Job failed.
19-Jul-2010 07:47:00 AM 3 null 498 010_check_ve_cmr_trigger Job completed with exit code 1. This exit code does not satisfy job exit code condition. Job failed.

Let me know if I can provide anything else.

Thanks.
Mon Jul 19, 2010 12:33 pm
seanc217



Joined: 23 May 2007
Posts: 272

Also one other thing.
I modified auth.pl so authentication is not required.

Maybe something with this is causing the lock up?

Here's the contents of auth.pl:

#!/usr/bin/perl
print "OK";
Mon Jul 19, 2010 12:45 pm
SysOp
Site Admin


Joined: 26 Nov 2006
Posts: 6714

Quote:
I modified auth.pl so authentication is not required.

Maybe something with this is causing the lock up?


Not likely.

It looks like something with the event sequencing between the master and the agent. We are still looking into that; it may take a little while to analyze the code.
Mon Jul 19, 2010 4:07 pm
seanc217



Joined: 23 May 2007
Posts: 272

OK thanks.
I will check in periodically.

If you want just e-mail me.

sean.conway@<nospam>yesbank.com

remove the <nospam> please.

Thanks.
Mon Jul 19, 2010 4:51 pm
All times are GMT - 4 Hours
Page 1 of 5