SoftTree Technologies SoftTree Technologies
Technical Support Forums
RegisterSearchFAQMemberlistUsergroupsLog in
Left files in Queue directory
Goto page 1, 2, 3, 4  Next
 
Reply to topic    SoftTree Technologies Forum Index » 24x7 Scheduler, Event Server, Automation Suite View previous topic
View next topic
Left files in Queue directory
Author Message
Redemann



Joined: 11 Jul 2007
Posts: 90
Country: Germany

Post Left files in Queue directory Reply with quote
Hello,

since a few days (after having no problems for months) we encounter some strange errors on our Linux master server.
What can be the reason for file that are "forgotten" in the Queue directory? For instance I just had a case that a job that is running every 5 minutes left it's queue-file and the next jobs did not run. I found the job "stuck" in the job monitor with status "runnning" (I'm using the web console) and was not able to delete it via web console. After that I deleted the queue-file with "rm" on the shell. After that I found the following in the debug.log:

Caused by: java.io.FileNotFoundException: /opt/scheduler/24x7_Scheduler_246/Queue/often_runners/118803.q (No such file or directory)

OK - this is a follow-up of the manual deletion.

In the /var/log/messages I found the following:

Exception in thread "Job #67 - bamacc:check_schedule" java.util.ConcurrentModificationException
at java.util.AbstractList$Itr.checkForComodification(AbstractList.java)
at java.util.AbstractList$Itr.remove(AbstractList.java)
at com.softtreetech.jscheduler.business.runner.JobExecutorImpl.notifyJobFinished(Unknown Source)
at com.softtreetech.jscheduler.business.runner.AbstractJobRunner.ÔO0000(Unknown Source)
at com.softtreetech.jscheduler.business.runner.AbstractJobRunner.Ó00000(Unknown Source)
at com.softtreetech.jscheduler.business.runner.AbstractJobRunner.execute(Unknown Source)
at com.softtreetech.jscheduler.business.runner.JobExecutorImpl.execute(Unknown Source)
at com.softtreetech.jscheduler.business.agent.remote.RemoteAgentImpl.executeJob(Unknown Source)
at sun.reflect.GeneratedMethodAccessor66.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java)
at java.lang.reflect.Method.invoke(Method.java)
at sun.rmi.server.UnicastServerRef.dispatch(UnicastServerRef.java)
at sun.rmi.transport.Transport$1.run(Transport.java)
at java.security.AccessController.doPrivileged1(Native Method)
at java.security.AccessController.doPrivileged(AccessController.java)
at sun.rmi.transport.Transport.serviceCall(Transport.java)
at sun.rmi.transport.tcp.TCPTransport.handleMessages(TCPTransport.java)
at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.handleRequest(TCPTransport.java)
at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run(TCPTransport.java)
at java.lang.Thread.run(Thread.java)
at sun.rmi.transport.StreamRemoteCall.exceptionReceivedFromServer(StreamRemoteCall.java:255)
at sun.rmi.transport.StreamRemoteCall.executeCall(StreamRemoteCall.java:233)
at sun.rmi.server.UnicastRef.invoke(UnicastRef.java:142)
at com.softtreetech.jscheduler.business.agent.remote.RemoteAgentImpl_Stub.executeJob(Unknown Source)
at com.softtreetech.jscheduler.business.runner.RemoteJobRunner.runJob(Unknown Source)
at com.softtreetech.jscheduler.business.runner.AbstractJobRunner.oO0000(Unknown Source)
at com.softtreetech.jscheduler.business.runner.AbstractJobRunner.Ó00000(Unknown Source)
at com.softtreetech.jscheduler.business.runner.AbstractJobRunner.execute(Unknown Source)
at com.softtreetech.jscheduler.business.runner.JobExecutorImpl.execute(Unknown Source)
at com.softtreetech.jscheduler.business.runner.JobExecutorImpl$1.run(Unknown Source)
at java.lang.Thread.run(Thread.java:619)

This fits to the stuck queue-file because this is the only job using the "often_runners" queue.

What could be the reason for this behavior?
If you need more information please let me know.

Thanks in advance for your help.
Tue May 06, 2008 8:34 am View user's profile Send private message
Redemann



Joined: 11 Jul 2007
Posts: 90
Country: Germany

Post Reply with quote
What I just found out in addition - please take at look at scheduler.log:

6-May-2008 07:02:56 AM 2 7xScmRht5hhVbDPyGb3xspoWOaY= 85 bamacc:pruef_bam Remote job started.
6-May-2008 07:02:57 AM 2 7xScmRht5hhVbDPyGb3xspoWOaY= 85 bamacc:pruef_bam Agent "bamacc" contacted.
6-May-2008 07:02:58 AM 2 7xScmRht5hhVbDPyGb3xspoWOaY= 85 bamacc:pruef_bam Remote job finished.
6-May-2008 07:02:59 AM 2 7xScmRht5hhVbDPyGb3xspoWOaY= 85 bamacc:pruef_bam Remote job started.
6-May-2008 07:03:00 AM 2 7xScmRht5hhVbDPyGb3xspoWOaY= 85 bamacc:pruef_bam Agent "bamacc" contacted.
6-May-2008 07:03:02 AM 2 7xScmRht5hhVbDPyGb3xspoWOaY= 85 bamacc:pruef_bam Remote job finished.
6-May-2008 07:03:04 AM 2 7xScmRht5hhVbDPyGb3xspoWOaY= 85 bamacc:pruef_bam Remote job started.
6-May-2008 07:03:05 AM 2 7xScmRht5hhVbDPyGb3xspoWOaY= 85 bamacc:pruef_bam Agent "bamacc" contacted.
6-May-2008 07:03:07 AM 2 7xScmRht5hhVbDPyGb3xspoWOaY= 85 bamacc:pruef_bam Remote job finished.
6-May-2008 07:03:09 AM 2 7xScmRht5hhVbDPyGb3xspoWOaY= 85 bamacc:pruef_bam Remote job started.
6-May-2008 07:03:10 AM 2 7xScmRht5hhVbDPyGb3xspoWOaY= 85 bamacc:pruef_bam Agent "bamacc" contacted.
6-May-2008 07:03:12 AM 2 7xScmRht5hhVbDPyGb3xspoWOaY= 85 bamacc:pruef_bam Remote job finished.
6-May-2008 07:03:14 AM 2 7xScmRht5hhVbDPyGb3xspoWOaY= 85 bamacc:pruef_bam Remote job started.
6-May-2008 07:03:15 AM 2 7xScmRht5hhVbDPyGb3xspoWOaY= 85 bamacc:pruef_bam Agent "bamacc" contacted.
6-May-2008 07:03:16 AM 2 7xScmRht5hhVbDPyGb3xspoWOaY= 85 bamacc:pruef_bam Remote job finished.
6-May-2008 07:03:17 AM 2 7xScmRht5hhVbDPyGb3xspoWOaY= 85 bamacc:pruef_bam Remote job started.
6-May-2008 07:03:18 AM 2 7xScmRht5hhVbDPyGb3xspoWOaY= 85 bamacc:pruef_bam Agent "bamacc" contacted.
6-May-2008 07:03:20 AM 2 7xScmRht5hhVbDPyGb3xspoWOaY= 85 bamacc:pruef_bam Remote job finished.
6-May-2008 07:03:21 AM 2 7xScmRht5hhVbDPyGb3xspoWOaY= 85 bamacc:pruef_bam Remote job started.
6-May-2008 07:03:22 AM 2 7xScmRht5hhVbDPyGb3xspoWOaY= 85 bamacc:pruef_bam Agent "bamacc" contacted.
6-May-2008 07:03:23 AM 2 7xScmRht5hhVbDPyGb3xspoWOaY= 85 bamacc:pruef_bam Remote job finished.
6-May-2008 07:03:24 AM 2 7xScmRht5hhVbDPyGb3xspoWOaY= 85 bamacc:pruef_bam Remote job started.
6-May-2008 07:03:25 AM 2 7xScmRht5hhVbDPyGb3xspoWOaY= 85 bamacc:pruef_bam Agent "bamacc" contacted.
6-May-2008 07:03:27 AM 2 7xScmRht5hhVbDPyGb3xspoWOaY= 85 bamacc:pruef_bam Remote job finished.
6-May-2008 07:03:28 AM 2 7xScmRht5hhVbDPyGb3xspoWOaY= 85 bamacc:pruef_bam Remote job started.
6-May-2008 07:03:29 AM 2 7xScmRht5hhVbDPyGb3xspoWOaY= 85 bamacc:pruef_bam Agent "bamacc" contacted.
6-May-2008 07:03:31 AM 2 7xScmRht5hhVbDPyGb3xspoWOaY= 85 bamacc:pruef_bam Remote job finished.
6-May-2008 07:03:32 AM 2 7xScmRht5hhVbDPyGb3xspoWOaY= 85 bamacc:pruef_bam Remote job started.
6-May-2008 07:03:34 AM 2 7xScmRht5hhVbDPyGb3xspoWOaY= 85 bamacc:pruef_bam Agent "bamacc" contacted.
6-May-2008 07:03:36 AM 2 7xScmRht5hhVbDPyGb3xspoWOaY= 85 bamacc:pruef_bam Remote job finished.
6-May-2008 07:03:39 AM 2 7xScmRht5hhVbDPyGb3xspoWOaY= 85 bamacc:pruef_bam Remote job started.
6-May-2008 07:03:40 AM 2 7xScmRht5hhVbDPyGb3xspoWOaY= 85 bamacc:pruef_bam Agent "bamacc" contacted.
6-May-2008 07:03:42 AM 2 7xScmRht5hhVbDPyGb3xspoWOaY= 85 bamacc:pruef_bam Remote job finished.
6-May-2008 07:03:43 AM 2 7xScmRht5hhVbDPyGb3xspoWOaY= 85 bamacc:pruef_bam Remote job started.
6-May-2008 07:03:44 AM 2 7xScmRht5hhVbDPyGb3xspoWOaY= 85 bamacc:pruef_bam Agent "bamacc" contacted.
6-May-2008 07:03:46 AM 2 7xScmRht5hhVbDPyGb3xspoWOaY= 85 bamacc:pruef_bam Remote job finished.
6-May-2008 07:03:47 AM 2 7xScmRht5hhVbDPyGb3xspoWOaY= 85 bamacc:pruef_bam Remote job started.
6-May-2008 07:03:48 AM 2 7xScmRht5hhVbDPyGb3xspoWOaY= 85 bamacc:pruef_bam Agent "bamacc" contacted.
6-May-2008 07:03:49 AM 2 7xScmRht5hhVbDPyGb3xspoWOaY= 85 bamacc:pruef_bam Remote job finished.
6-May-2008 07:03:50 AM 2 7xScmRht5hhVbDPyGb3xspoWOaY= 85 bamacc:pruef_bam Remote job started.
6-May-2008 07:03:51 AM 2 7xScmRht5hhVbDPyGb3xspoWOaY= 85 bamacc:pruef_bam Agent "bamacc" contacted.
6-May-2008 07:03:53 AM 2 7xScmRht5hhVbDPyGb3xspoWOaY= 85 bamacc:pruef_bam Remote job finished.
6-May-2008 07:03:54 AM 2 7xScmRht5hhVbDPyGb3xspoWOaY= 85 bamacc:pruef_bam Remote job started.
6-May-2008 07:03:55 AM 2 7xScmRht5hhVbDPyGb3xspoWOaY= 85 bamacc:pruef_bam Agent "bamacc" contacted.
6-May-2008 07:03:57 AM 2 7xScmRht5hhVbDPyGb3xspoWOaY= 85 bamacc:pruef_bam Remote job finished.
6-May-2008 07:03:58 AM 2 7xScmRht5hhVbDPyGb3xspoWOaY= 85 bamacc:pruef_bam Remote job started.
6-May-2008 07:03:59 AM 2 7xScmRht5hhVbDPyGb3xspoWOaY= 85 bamacc:pruef_bam Agent "bamacc" contacted.
6-May-2008 07:04:00 AM 2 7xScmRht5hhVbDPyGb3xspoWOaY= 85 bamacc:pruef_bam Remote job finished.
6-May-2008 07:04:02 AM 2 7xScmRht5hhVbDPyGb3xspoWOaY= 85 bamacc:pruef_bam Remote job started.
6-May-2008 07:04:03 AM 2 7xScmRht5hhVbDPyGb3xspoWOaY= 85 bamacc:pruef_bam Agent "bamacc" contacted.
6-May-2008 07:04:04 AM 2 7xScmRht5hhVbDPyGb3xspoWOaY= 85 bamacc:pruef_bam Remote job finished.

??? The job "pruef_bam" is configured to run every 10 minutes (Starttime 3/23/2004 at 10:29) - with worked since we use 24x7!
What can cause the scheduler to run it over and over again?

This behavior stopped after the next "normal" start time was reached 6-May 7:09. The the next start was 7:19 (ok), 7:29 (ok) and so on...
Tue May 06, 2008 10:06 am View user's profile Send private message
SysOp
Site Admin


Joined: 26 Nov 2006
Posts: 7949

Post Reply with quote
Such things could happen if you have a stagnant queue with some job(s) stuck and holding other jobs (or other instances of itself in the same queue). Over a period of time the queue accumulates more and more jobs which do not run and just sitt there waiting for the queue to become available to run them. Ass son as the offending job finishes or gets killed, all instances already sitting in the queue start running immediately one after another.
Tue May 06, 2008 11:07 am View user's profile Send private message
Redemann



Joined: 11 Jul 2007
Posts: 90
Country: Germany

Post Reply with quote
OK then... but what can cause a job running 100 times a day suddendly gets stuck in the queue and what about the "ConcurrentModificationException"-JAVA error.

This job is "written" to check if scheduler itself is up and running functional. This is as simple as possible and I cannot imagine that the job itself is able to cause any problem. Look at this:

Job Definition:
=========
Type Program
Command date > /tmp/check_schedule.txt
Start In /
Host bamacc
Asynchronous No
Detached Yes
Queue often_runners
Priority High
Timeout (minutes) 0
Logging Enabled No
Run As root

This reappeared last night (ConcurrentModificationException) with the same job. Here's the debug log:

2008-05-07 01:15:01,184 [Job #67 - bamacc:check_schedule] DEBUG com.softtreetech.jscheduler.business.queue.JobQueue - QUEUE_UNLOCKED
2008-05-07 01:20:00,877 [Job #67 - bamacc:check_schedule] DEBUG com.softtreetech.jscheduler.business.queue.JobQueue - QUEUE_UNLOCKED
2008-05-07 01:25:00,614 [Job #67 - bamacc:check_schedule] DEBUG com.softtreetech.jscheduler.business.queue.JobQueue - QUEUE_UNLOCKED
2008-05-07 01:30:00,586 [Job #67 - bamacc:check_schedule] DEBUG com.softtreetech.jscheduler.business.queue.JobQueue - QUEUE_UNLOCKED
2008-05-07 07:05:01,243 [Job #67 - bamacc:check_schedule] DEBUG com.softtreetech.jscheduler.business.queue.JobQueue - QUEUE_UNLOCKED
2008-05-07 07:10:00,597 [Job #67 - bamacc:check_schedule] DEBUG com.softtreetech.jscheduler.business.queue.JobQueue - QUEUE_UNLOCKED

You can see that the job last ran at 1:30 and then at 7:05 when operations team restarted the scheduler service.
Wed May 07, 2008 4:38 am View user's profile Send private message
SysOp
Site Admin


Joined: 26 Nov 2006
Posts: 7949

Post Reply with quote
This QUEUE_UNLOCKED message indicates adding new job to the queue. And since there is nothing between these events, we can assume that a job is not running/ending while new instances are added. Can you replace > with >> ? I mean make it append to check_schedule.txt file rather than overwrite it on each run. This will provide you with a time log of job runs

Since it is a remote job run, the issue could be caused by the connection rather than the job itself.. The job might be getting stuck while connecting or disconnecting from the agent. Can you verify this theory? If tracing is enabled on the agent side, you can compare times in the agent and scheduler debug.log files.

By the way, are there any unusual messages in the debug.log file of the agent? Any exceptions?
Wed May 07, 2008 9:24 am View user's profile Send private message
Redemann



Joined: 11 Jul 2007
Posts: 90
Country: Germany

Post Reply with quote
Alright then...

I followed you suggestions and changed the job to append the output of the 'date'-command to keep track of the job on the remote agent.
I took a look at all so-called "relevant" messages files on both (master and agent) debug.log, syslog messages, whatever and cannot find any unusal except the "java.util.ConcurrentModificationException" in syslog messages (I'm starting the master.sh with >> /varlog/messages 2>&1).

I also just enabled tracing on the remote agent.

BTW : We're now running on version 2.55 and after 2 hours the problem appeared again.

What really makes me wonder is the fact that I'm not aware of any changed in our environment that may point to the problem. The job database and config was untouched for weeks and we had not had any such problem in the past. The scheduler ran stable for several months without a problem and suddenly (starting on Apr. 30) we have these problems.

Anyway.. I'll let you know what I can find out.

If you have further ideas what to check - any idea is appreciated.

Thanks
Wed May 07, 2008 9:55 am View user's profile Send private message
SysOp
Site Admin


Joined: 26 Nov 2006
Posts: 7949

Post Reply with quote
ConcurrentModificationException means that something is being removed from some collection or queue by one internal process while another internal processes is iterating through collection elements and validating their statuses. Something must be going on very fast or very slow or the collection is very very large and takes long time to check statuses of collection items. My guess, the job was set to run asynchronous and was deleted manually while still running. When the connection ended, some internal processes had a reference to the background job queue still open, didn't reopen that reference and tried to delete that job again. I don't think this is a critical issue, or at least I don't think it has any direct side effects.

As a side note I think this exception should not happen. Some internal method is not synchronized allowing other processes to modify/remove elements from the collection while it is still working on it – some mutex is missing somewhere. I am going to submit this issue to the support tracking system for further review.
Wed May 07, 2008 10:31 am View user's profile Send private message
Redemann



Joined: 11 Jul 2007
Posts: 90
Country: Germany

Post Reply with quote
OK - here are some news...

It now happens really often, that jobs are not executed on an remote agent without any reason I can see. This morning there are again left files in the Queue directory which are shown as 'running' in the job monitor. But they are not running. These are extremly short and fast jobs running since years without changing any logic in the job. Some of the jobs could be deleted using the web-console; others cannot be deleted and always re-appeared in the job monitor.
This stuck job seems to prevent the job from running the next scheduled time until a complete restart of the scheduler proccess.

So I started to experiment with the "timeout" parameter hoping that if a job gets stuck the scheduler will "teminate" the job from it's own point of view (although the job already finished).
This seems to lead to the following messages in debug.log (on the remote agent):

2008-05-14 12:10:02,870 [Job #67 - bamacc:check_schedule] DEBUG com.softtreetech.jscheduler.business.runner.ProgramJobRunner - runJob(): st
art
2008-05-14 12:10:02,871 [Job #67 - bamacc:check_schedule] DEBUG com.softtreetech.jscheduler.business.runner.security.SecurityService - auth
NativeUser: ftp_connect: 127.0.0.1
2008-05-14 12:10:02,894 [Job #67 - bamacc:check_schedule] DEBUG com.softtreetech.jscheduler.business.runner.security.SecurityService - auth
NativeUser: root login ok
2008-05-14 12:10:02,894 [Job #67 - bamacc:check_schedule] DEBUG com.softtreetech.jscheduler.business.runner.ProgramJobRunner - execProcess(
): command line [date > /tmp/check_schedule.txt] in work directory [/]
2008-05-14 12:10:02,895 [Job #67 - bamacc:check_schedule] DEBUG com.softtreetech.jscheduler.business.runner.ProgramJobRunner - runAs() user
name=root command=date,>,/tmp/check_schedule.txt workDir=/
2008-05-14 12:10:02,895 [Job #67 - bamacc:check_schedule] DEBUG com.softtreetech.jscheduler.business.runner.ProgramJobRunner - exec : ./run
as.pl,root,date > /tmp/check_schedule.txt,/
2008-05-14 12:10:02,937 [Job #67 - bamacc:check_schedule] DEBUG com.softtreetech.jscheduler.business.runner.ProgramJobRunner - waitForProce
ss(): start
2008-05-14 12:10:02,937 [Thread-45] DEBUG com.softtreetech.jscheduler.business.runner.AbstractJobRunner$TimeoutVerifier - run(): start
2008-05-14 12:10:02,937 [Thread-45] DEBUG com.softtreetech.jscheduler.business.runner.AbstractJobRunner$TimeoutVerifier - run(): sleeping.
Max wait time is 60000 milliseconds
2008-05-14 12:10:03,349 [Job #67 - bamacc:check_schedule] DEBUG com.softtreetech.jscheduler.business.runner.ProgramJobRunner - waitForProce
ss(): end
2008-05-14 12:10:03,349 [Job #67 - bamacc:check_schedule] DEBUG com.softtreetech.jscheduler.business.runner.AbstractJobRunner - isFailed(..
.) : exit code 0
2008-05-14 12:10:03,349 [Job #67 - bamacc:check_schedule] DEBUG com.softtreetech.jscheduler.business.runner.ProgramJobRunner - killProcess
start
2008-05-14 12:10:03,349 [Thread-45] INFO com.softtreetech.jscheduler.business.runner.AbstractJobRunner$TimeoutVerifier - run():
java.lang.InterruptedException: operation interrupted
at java.lang.Thread.sleep(Native Method)
at com.softtreetech.jscheduler.business.runner.AbstractJobRunner$TimeoutVerifier.run(Unknown Source)
2008-05-14 12:10:03,349 [Job #67 - bamacc:check_schedule] DEBUG com.softtreetech.jscheduler.business.runner.ProgramJobRunner - runJob(): en
d
2008-05-14 12:10:03,350 [Thread-45] DEBUG com.softtreetech.jscheduler.business.runner.AbstractJobRunner$TimeoutVerifier - run(): ended

I configured a timeout of 2 minutes for Job #67.

At 2:50 pm this job (#67) was again NOT started. Is it possible that this may collide with another job running at 2:49 (with a timeout of 5 min). Here's the debug log (taken from the remote agent with AIX 5.3/JAVA 14.2) for this job (#85).

2008-05-14 14:49:02,883 [Job #85 - bamacc:pruef_bam] DEBUG com.softtreetech.jscheduler.business.runner.ProgramJobRunner - runJob(): start
2008-05-14 14:49:02,883 [Job #85 - bamacc:pruef_bam] DEBUG com.softtreetech.jscheduler.business.runner.security.SecurityService - authNativ
eUser: ftp_connect: 127.0.0.1
2008-05-14 14:49:02,902 [Job #85 - bamacc:pruef_bam] DEBUG com.softtreetech.jscheduler.business.runner.security.SecurityService - authNativ
eUser: root login ok
2008-05-14 14:49:02,902 [Job #85 - bamacc:pruef_bam] DEBUG com.softtreetech.jscheduler.business.runner.ProgramJobRunner - execProcess(): co
mmand line [/bam/bin/pruef_bam >>/bam/spool/pruef_bam.err] in work directory [/]
2008-05-14 14:49:02,902 [Job #85 - bamacc:pruef_bam] DEBUG com.softtreetech.jscheduler.business.runner.ProgramJobRunner - runAs() username=
root command=/bam/bin/pruef_bam,>>/bam/spool/pruef_bam.err workDir=/
2008-05-14 14:49:02,902 [Job #85 - bamacc:pruef_bam] DEBUG com.softtreetech.jscheduler.business.runner.ProgramJobRunner - exec : ./runas.pl
,root,/bam/bin/pruef_bam >>/bam/spool/pruef_bam.err,/
2008-05-14 14:49:02,956 [Thread-189] DEBUG com.softtreetech.jscheduler.business.runner.AbstractJobRunner$TimeoutVerifier - run(): start
2008-05-14 14:49:02,956 [Job #85 - bamacc:pruef_bam] DEBUG com.softtreetech.jscheduler.business.runner.ProgramJobRunner - waitForProcess():
start
2008-05-14 14:49:02,956 [Thread-189] DEBUG com.softtreetech.jscheduler.business.runner.AbstractJobRunner$TimeoutVerifier - run(): sleeping.
Max wait time is 300000 milliseconds
2008-05-14 14:50:03,513 [Job #85 - bamacc:pruef_bam] DEBUG com.softtreetech.jscheduler.business.runner.ProgramJobRunner - waitForProcess():
end
2008-05-14 14:50:03,513 [Job #85 - bamacc:pruef_bam] DEBUG com.softtreetech.jscheduler.business.runner.AbstractJobRunner - isFailed(...) :
exit code 0
2008-05-14 14:50:03,513 [Job #85 - bamacc:pruef_bam] DEBUG com.softtreetech.jscheduler.business.runner.ProgramJobRunner - killProcess start
2008-05-14 14:50:03,513 [Thread-189] INFO com.softtreetech.jscheduler.business.runner.AbstractJobRunner$TimeoutVerifier - run():
java.lang.InterruptedException: operation interrupted
at java.lang.Thread.sleep(Native Method)
at com.softtreetech.jscheduler.business.runner.AbstractJobRunner$TimeoutVerifier.run(Unknown Source)
2008-05-14 14:50:03,513 [Job #85 - bamacc:pruef_bam] DEBUG com.softtreetech.jscheduler.business.runner.ProgramJobRunner - runJob(): end
2008-05-14 14:50:03,513 [Thread-189] DEBUG com.softtreetech.jscheduler.business.runner.AbstractJobRunner$TimeoutVerifier - run(): ended


Both jobs are running in the same queue - but I think concurrent runtimes sould not affect another job to be started or not. Of course there are absolutely no dependencies between them.


Again : these jobs and their timing are existing for 2 years or more. I really cannot think of anything that changed that may affect the functionality/stabilty of the scheduler. In fact in changed back to JAVA 1.4.2 when upgraded to 255. I let the 246 run with JAVA 1.6.

Other JAVA exceptions (beside the still upcoming "ConcurrentModificationException"-error) cannot be found. BTW : the "Concurrent"-error is the output from starting the master.sh with ">> LOGFILE 2>&1" - these errors are not in the debug.log's

What's going on here?

Thanks in advance.

Joerg
Wed May 14, 2008 10:09 am View user's profile Send private message
SysOp
Site Admin


Joined: 26 Nov 2006
Posts: 7949

Post Reply with quote
Can you verify for me that you are running the same version of 24x7 in both paces? Maybe double check it again... just in case.

Second, please ensure that you are using the same JDK/JRE on both systems. Different versions can potentially cause undesired anomalies.


Concurrent run-time may indeed affect jobs in the same queue, especially if jobs are set to run synchronously as they should. If a job is set synchronously, the queue is starting it and waiting for the job to complete before moving on to the next job. If there are 2 jobs with exactly same run-time, one of the starts first and the other waits for the first to complete. To avoid the wait, assign these jobs to different queues. If the job is set to run asynchronously, the queue makes a connection to the agent first and then pushes the job to background queue and immediately starts working on the next jobs.

(running everything in background typically means chaos)

---
If JDK/JRE/24x7 versions are the same, please assign these jobs to a different queues and check if still experience this issue.
Wed May 14, 2008 11:11 am View user's profile Send private message
Redemann



Joined: 11 Jul 2007
Posts: 90
Country: Germany

Post Reply with quote
Alright - I verified that master and agent are running the same version (255) and the same JAVA version 1.4.2. Triple check... :-)

I just played around with some things for example a new queue where I added 2 jobs that are exactly the same except the name and got the following:
"Queue customers maximumum allowed size exceeded". With almost 2 jobs I asked myself??

When I tried to change the size with the web-console (size in MB!) I found out that this cannot be changed! Any queue size is 1 MB!
So I decided to stop the master running in backgroung and started over in GUI mode. To my surprise the old queues showed up with 1000 (or 50) entries possible. The new queue I created with the web console only has ONE. So 1 in master-GUI mean 1 entry, web-console interprets this to 1 MB which cannot be changed.

After that I saw that only the master GUI has an overview of "last run" and "next run" I saw that several jobs are marked as "not scheduled"! Getting deeper I found out that jobs with "run daily"-attribute had an end date shown up with "1/1/00". This may be a relict of converting the job database 1 year ago from Windows 3.24 platform to the JAVA edition. (But until I touched this with the web-console this has never been an issue - the jobs ran in the past as expected).

Unfourtunatley you can't just enter a NULL values to tell scheduler that the job should have NO end-date.
What has happend seems to be (not quite sure) that when you edit such a job with end-date "1/1/00" (or any enddate in the past?) with the web-console and then save it, the job is no longer scheduled (which can only be seen in the master-GUI). I edited every job from the master GUI setting the end date to 1/1/20 and the jobs are no longer marked as "not scheduled". Why do the user has to enter an end-date? This is randomly for I would guess 95% of jobs.

So from my point of view a "big warning" if using the web console to administer the job database. Some results are really strange and buggy.
You should do some bug-fixing this really nice feature. (I know, I know - easy to say. But if you provide functionality it should work).

Let's see the next days if there will be more strange issues after my complete "refresh" of the job datebase using the master GUI...

Thanks.
Thu May 15, 2008 8:43 am View user's profile Send private message
SysOp
Site Admin


Joined: 26 Nov 2006
Posts: 7949

Post Reply with quote
I'm not sure where to start. You described a number of different issues.

For beginning, it looks like you are using an old version of the web console creating bad jobs or saving invalid properties. That version might be incompatible with 4.1.255. Which version are you running? Is that a JSP based or ASP based variant, which built (you can see than on the login page)?

You might be dealing here with several different issues Bad job definitions saved from the web console could be the root cause of all that. Let's try to isolate that problem. Can you update old or create new jobs using GUI interface or using command line console and check if that corrects the problem?

The queue size in the MP and Windows versions have different meaning. In MP version it is measured in number of jobs that can fit the queue. In Windows version it is measured in MB as the size of the max size of corresponding queue folder.
Thu May 15, 2008 11:06 am View user's profile Send private message
Redemann



Joined: 11 Jul 2007
Posts: 90
Country: Germany

Post Reply with quote
The Web Console I'm using is version 4.1.0 build 263 and I have it running in tomcat 5.0 on a Linux RedHat ES 4. JAVA version is 1.6

Yesterday I touched every job with the GUI interface (changed the end-date to year 2020 and so on..) and saved the job database. This morning everything is fine again. All jobs ran as expected!

If I have time I will do some other tests regarding the master-GUI vs. web console and let you know about the results.
Fact is that for instance the queue size (no matter what it means at least) is not changeble in the web-console. Default value is "1". You can enter a different number via web-console but this is ignored. It remains on "1".

I'll post again if I have results.

Beside the web-console issues : Wouldn't it be a good idea to completly separate the scheduling-"core"-engine from the admin-GUI. In most cases (IMHO) the scheduling-engine ran in the background on a central (or more) servers in the data center. But if you want to modify jobs or dependencies I have to stop the no-gui-master and then call it up with the GUI again which of course interrupts the scheduling-service (and if configured cleaning up the logs). After that I have to shutdown the GUI and again start it up in the background without GUI. Or do you have an idea how to avoid this procedure?

Thank you.
Fri May 16, 2008 4:42 am View user's profile Send private message
SysOp
Site Admin


Joined: 26 Nov 2006
Posts: 7949

Post Reply with quote
Web-manag. Console build #263 is fine, that is actually the most recent. However using Java 1.6 is not ok. Java versions are not backward compatible. 1.6 fails to run properly libraries compiled in 1.4 and having dependencies on other 1.4 and earlier libraries. Typically you will not see any hard errors on the screen but internally the code does NOT run correctly causing various anomalies in the processing.

By the way, are you aware of the command console based interface that can be used to modify and monitor jobs without stopping the scheduler and bringing it up in GUI mode?
Fri May 16, 2008 12:02 pm View user's profile Send private message
Redemann



Joined: 11 Jul 2007
Posts: 90
Country: Germany

Post Reply with quote
>> By the way, are you aware of the command console based interface that can be used to modify and monitor jobs without stopping the scheduler and bringing it up in GUI mode?

Yes. Good tool if you need to change more than one thing in several jobs. But that's what I mean : a GUI version of cons.sh. :-) Would greatly improve the management comfort...

But...

Sorry that I have to ask again about stuck jobs in a queue. This weekend we ran again into this. I have a queue which is only used by a single job - the "check_schedule". It's the job I described some posts above. I picked up your suggestion and now the job creates a timestamp history.
After May 17th 5:45 am the job gets stuck.

ls -l /opt/scheduler/24x7_Scheduler/Queue/check_queue
-rw-r--r-- 1 root root 1471 17. Mai 05:45 123743.q
-rw-r--r-- 1 root root 1463 17. Mai 05:50 123745.q
-rw-r--r-- 1 root root 1463 17. Mai 05:55 123746.q
-rw-r--r-- 1 root root 1463 17. Mai 06:00 123750.q
-rw-r--r-- 1 root root 1463 17. Mai 06:05 123752.q
-rw-r--r-- 1 root root 1463 17. Mai 06:10 123754.q
-rw-r--r-- 1 root root 1463 17. Mai 06:15 123755.q
-rw-r--r-- 1 root root 1463 17. Mai 06:20 123757.q
-rw-r--r-- 1 root root 1463 17. Mai 06:25 123758.q
-rw-r--r-- 1 root root 1463 17. Mai 06:30 123760.q

Here is the history-timestamp file:

Sat May 17 05:30:02 DFT 2008
Sat May 17 05:35:03 DFT 2008
Sat May 17 05:40:03 DFT 2008
Sat May 17 05:45:03 DFT 2008
Sat May 17 08:55:02 DFT 2008
Sat May 17 09:00:03 DFT 2008
Sat May 17 09:05:02 DFT 2008

You can see that the first job stuck at 5:45 still created the timestamp but for whatever reason the job was not deleted from the queue. After that the queue ran full (max size 10) and the job did not run until the operations team restarted the master around 8:50 am. So the next sucessful run was at 8:55.

Because the "check_schedule" job runs every five minutes I did not enable the logging option. The only log can be seen in the debug.log on the remote agent. And here's another problem - because the debug.log is switched very often by the scheduler it is difficult/impossible to take a look into older debug.log files. Is it possible to take influence to keep debug.log's longer?

But the core question is still the same : Why does the job gets stuck? Why is the queue-file not being deleted?

Again : the "check_schedule" job is the *only* job using this queue so I cannot imagine any strange side effects. I cannot find anything in the logs - neither on the master nor on the remote agent.
The log file on the remote agent looks normal. The are no anomalies around this this time and other job using another queue ran fine. Here's a cut of the remote log - noting unusual.

17-May-2008 05:39:03 AM 2 null 85 bamacc:pruef_bam Remote job started.
17-May-2008 05:39:03 AM 2 null 85 bamacc:pruef_bam Job started.
17-May-2008 05:39:03 AM 2 null 85 bamacc:pruef_bam Job finished.
17-May-2008 05:39:03 AM 2 null 85 bamacc:pruef_bam Remote job finished.
17-May-2008 05:45:03 AM 2 null 69 bamacc:check_views Remote job started.
17-May-2008 05:45:03 AM 2 null 69 bamacc:check_views Job started.
17-May-2008 05:45:03 AM 2 null 69 bamacc:check_views Job finished.
17-May-2008 05:45:03 AM 2 null 69 bamacc:check_views Remote job finished.
17-May-2008 05:49:03 AM 2 null 85 bamacc:pruef_bam Remote job started.
17-May-2008 05:49:03 AM 2 null 85 bamacc:pruef_bam Job started.
17-May-2008 05:49:03 AM 2 null 85 bamacc:pruef_bam Job finished.
17-May-2008 05:49:03 AM 2 null 85 bamacc:pruef_bam Remote job finished.
17-May-2008 05:59:03 AM 2 null 85 bamacc:pruef_bam Remote job started.
17-May-2008 05:59:03 AM 2 null 85 bamacc:pruef_bam Job started.
17-May-2008 05:59:03 AM 2 null 85 bamacc:pruef_bam Job finished.
17-May-2008 05:59:03 AM 2 null 85 bamacc:pruef_bam Remote job finished.
17-May-2008 06:00:03 AM 2 null 79 bamacc:get_slr_data Remote job started.
17-May-2008 06:00:03 AM 2 null 79 bamacc:get_slr_data Job started.
17-May-2008 06:02:52 AM 2 null 79 bamacc:get_slr_data Job finished.
17-May-2008 06:02:52 AM 2 null 79 bamacc:get_slr_data Remote job finished.
17-May-2008 06:09:03 AM 2 null 85 bamacc:pruef_bam Remote job started.

Uhh - I forgot. I found one error. The output of the "master.sh > 2>&1". Again the "java.util.ConcurrentModificationException".
But unfortunately there is no timestamp for this exception. Here's the complete output bewteen the last 2 scheduler starts:

Fr Mai 16 09:57:28 CEST 2008 : Starting Scheduler service
Copyright (c) 2006 SoftTree Technologies, Inc.
24x7 Scheduler started in service mode.

Press Ctrl+C to abort
java.util.ConcurrentModificationException
at java.util.AbstractList$Itr.checkForComodification(Unknown Source)
at java.util.AbstractList$Itr.remove(Unknown Source)
at com.softtreetech.jscheduler.business.runner.JobExecutorImpl.notifyJobFinished(Unknown Source)
at com.softtreetech.jscheduler.business.runner.AbstractJobRunner.ÔO0000(Unknown Source)
at com.softtreetech.jscheduler.business.runner.AbstractJobRunner.Object(Unknown Source)
at com.softtreetech.jscheduler.business.runner.AbstractJobRunner.execute(Unknown Source)
at com.softtreetech.jscheduler.business.runner.JobExecutorImpl.execute(Unknown Source)
at com.softtreetech.jscheduler.business.runner.JobExecutorImpl$1.run(Unknown Source)
at java.lang.Thread.run(Unknown Source)
Sa Mai 17 08:49:41 CEST 2008 : Stopping Scheduler service
Sa Mai 17 08:49:58 CEST 2008 : Starting Scheduler service
Copyright (c) 2006 SoftTree Technologies, Inc.
24x7 Scheduler started in service mode.

Press Ctrl+C to abort

These are all informations I can provide.

Thanks for more input...
Mon May 19, 2008 5:27 am View user's profile Send private message
SysOp
Site Admin


Joined: 26 Nov 2006
Posts: 7949

Post Reply with quote
Quote:
But the core question is still the same : Why does the job gets stuck? Why is the queue-file not being deleted?


The answer might be available in the debug.log of the agent where the job is run. My guess, the remote connection gets terminated or ends abnormally and the job hangs on the scheduler side waiting for the connection to be released which doesn't happen, the connection is dead


Quote:
Is it possible to take influence to keep debug.log's longer?

Yes, here is what you can do.
1. Please make a backup of 24x7 files first
2. Shutdown the scheduler
3. Rename jscheduler.jar file to jscheduler.zip
4. Unzip log4j.properties file and edit this file in Notepad
5. In the file, find "log4j.appender.A2.MaxFileSize=100KB"
6. Change 100KB to 1MB or some other value and save the file
7. Zip the updated file back to jscheduler.zip
8. Rename .ZIP back to .JAR
Mon May 19, 2008 4:50 pm View user's profile Send private message
Display posts from previous:    
Reply to topic    SoftTree Technologies Forum Index » 24x7 Scheduler, Event Server, Automation Suite All times are GMT - 4 Hours
Goto page 1, 2, 3, 4  Next
Page 1 of 4

 
Jump to: 
You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot vote in polls in this forum


 

 

Powered by phpBB © 2001, 2005 phpBB Group
Design by Freestyle XL / Flowers Online.