SoftTree Technologies
Technical Support Forums
|
|
seanc217
Joined: 23 May 2007 Posts: 272
|
|
|
|
Hi There, I was wondering if you have any update on this issue yet?
This continues to happen from time to time, but I have cron jobs in place that alert me.
Let me know.
Thanks
|
|
Mon Aug 30, 2010 3:46 pm |
|
|
SysOp
Site Admin
Joined: 26 Nov 2006 Posts: 7857
|
|
|
|
Hi,
We are expecting a new maintenance version this week. I hope to see this issue resolved in the new version.
|
|
Mon Aug 30, 2010 4:08 pm |
|
|
seanc217
Joined: 23 May 2007 Posts: 272
|
|
|
|
Excellent. Can I get a copy of the new version?
|
|
Mon Aug 30, 2010 4:18 pm |
|
|
SysOp
Site Admin
Joined: 26 Nov 2006 Posts: 7857
|
|
|
|
Please wait for a couple of days. It should be available soon.
|
|
Mon Aug 30, 2010 5:54 pm |
|
|
seanc217
Joined: 23 May 2007 Posts: 272
|
|
|
|
Hi there,
I am not familiar with the issue number for this.
Please let me know if the issue in this thread was resolved in the latest release.
Thanks!
|
|
Mon Oct 11, 2010 4:06 pm |
|
|
SysOp
Site Admin
Joined: 26 Nov 2006 Posts: 7857
|
|
|
|
Hi,
I believe the issue number is 24x7-10781. This is the same issue that covers thread synchronization problems when code is run on multi-processor systems.
|
|
Mon Oct 11, 2010 11:44 pm |
|
|
seanc217
Joined: 23 May 2007 Posts: 272
|
|
|
|
Excellent.
Thanks!
|
|
Tue Oct 12, 2010 11:26 am |
|
|
seanc217
Joined: 23 May 2007 Posts: 272
|
|
|
|
Hi,
We upgraded to the latest version of the scheduler multi-platform edition; however, I am still having issues with the queues locking up. I have enabled tracing so I can get you log files once one of them locks up again. A question: once the queue locks up, the only way I have found to clear the queues is to shut down the master scheduler, delete the .q files, and then restart. Is there any other way I could clear the queue without restarting? Sometimes this happens at very inconvenient times when critical job processes are running, and I would prefer to keep it up and running.
Also, I need this to be a priority fix because it's annoying and causes a lot of problems on my production instance.
Thanks.
|
|
Mon Dec 20, 2010 5:22 pm |
|
|
SysOpJ
Joined: 20 Aug 2010 Posts: 95
|
|
|
|
Could you send us a recent debug.log file after a queue lockup?
Also, can you tell me the exact type and version of the OS, and number of processors on the system?
The queue is locked because the queue manager is waiting for the current job to exit, and that job, or a resource it is using, is stuck. Cleaning up that resource or job would release the queue.
PS. The resource can be local or remote, depending on the job setup.
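One way to gather the OS and processor details requested above is a short script run on the affected host. This is a minimal sketch that assumes Python 3 is available there (nothing in this thread says it is); the function name `system_summary` is made up for illustration and is not part of the scheduler.

```python
import os
import platform


def system_summary():
    """Collect basic OS and processor information for a support request."""
    return {
        "os": platform.system(),         # e.g. "Linux", "AIX", "Windows"
        "release": platform.release(),   # OS release string
        "version": platform.version(),   # detailed OS version string
        "machine": platform.machine(),   # hardware architecture
        "processors": os.cpu_count(),    # logical CPU count, or None if unknown
    }


# Print one "key: value" line per item so it can be pasted into a reply.
for key, value in system_summary().items():
    print(f"{key}: {value}")
```

Note that `os.cpu_count()` reports logical processors, which may differ from the number of physical CPUs.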
|
|
Tue Dec 21, 2010 11:16 am |
|
|
seanc217
Joined: 23 May 2007 Posts: 272
|
|
|
|
All of our jobs are remote jobs.
Right now things are not locking up, so I will get you logs when it does.
Thanks.
|
|
Wed Dec 22, 2010 2:17 pm |
|
|
seanc217
Joined: 23 May 2007 Posts: 272
|
|
|
|
One of my queues finally locked up. Job 1234 got stuck on a queue.
Here are pieces of the log from the agent:
2011-01-06 15:40:03,146 [Job #1548 - 010_check_ach_report_02_trigger] DEBUG com.softtreetech.jscheduler.business.agent.remote.RemoteAgentImpl - Starting job=010_check_fidfm2201, runtime id=6546870
2011-01-06 15:40:03,154 [Job #1234 - 010_check_fidfm2201] DEBUG com.softtreetech.jscheduler.business.runner.ProgramJobRunner - runJob(): start
2011-01-06 15:40:03,155 [Job #1234 - 010_check_fidfm2201] DEBUG com.softtreetech.jscheduler.business.runner.security.SecurityService - authNativeUser: /opt/24x7_Scheduler/auth.pl
2011-01-06 15:40:03,185 [Job #1234 - 010_check_fidfm2201] DEBUG com.softtreetech.jscheduler.business.runner.ProgramJobRunner - execProcess(): command line [/home/srv_etl/file_scripts/file_checker.ksh /loads/dropoff/fidelity/FIDFM2201.TXT fidelity Y] in work directory [/home/srv_etl/file_scripts]
2011-01-06 15:40:03,185 [Thread-1538847] DEBUG com.softtreetech.jscheduler.business.runner.AbstractJobRunner$TimeoutVerifier - run(): start
2011-01-06 15:40:03,186 [Thread-1538847] DEBUG com.softtreetech.jscheduler.business.runner.AbstractJobRunner$TimeoutVerifier - run(): timeout check not required
2011-01-06 15:40:03,186 [Job #1234 - 010_check_fidfm2201] DEBUG com.softtreetech.jscheduler.business.runner.AbstractJobRunner - runAs() username=srv_etl command=/home/srv_etl/file_scripts/file_checker.ksh,/loads/dropoff/fidelity/FIDFM2201.TXT,fidelity,Y workDir=/home/srv_etl/file_scripts
2011-01-06 15:40:03,186 [Job #1234 - 010_check_fidfm2201] DEBUG com.softtreetech.jscheduler.business.runner.AbstractJobRunner - exec : ./runas.pl,srv_etl,/home/srv_etl/file_scripts/file_checker.ksh /loads/dropoff/fidelity/FIDFM2201.TXT fidelity Y,/home/srv_etl/file_scripts
2011-01-06 15:40:03,204 [Job #1234 - 010_check_fidfm2201] DEBUG com.softtreetech.jscheduler.business.runner.ProgramJobRunner - waitForProcess(): start
2011-01-06 15:40:03,204 [Thread-1538848] DEBUG com.softtreetech.jscheduler.business.runner.AbstractJobRunner$TimeoutVerifier - run(): start
2011-01-06 15:40:03,216 [Thread-1538851] DEBUG com.softtreetech.jscheduler.business.runner.AbstractJobRunner$TimeoutVerifier - run(): start
2011-01-06 15:40:03,367 [Job #1234 - 010_check_fidfm2201] DEBUG com.softtreetech.jscheduler.business.runner.ProgramJobRunner - waitForProcess(): end
2011-01-06 15:40:03,368 [Job #1234 - 010_check_fidfm2201] DEBUG com.softtreetech.jscheduler.business.runner.AbstractJobRunner - isFailed(...) : exit code 1
2011-01-06 15:40:03,368 [Job #1234 - 010_check_fidfm2201] DEBUG com.softtreetech.jscheduler.business.runner.AbstractJobRunner - isFailed(...) : Enumeration found [0]
2011-01-06 15:40:03,368 [Job #1234 - 010_check_fidfm2201] DEBUG com.softtreetech.jscheduler.business.runner.ProgramJobRunner - killProcess start
2011-01-06 15:40:03,495 [Job #1234 - 010_check_fidfm2201] ERROR com.softtreetech.jscheduler.business.runner.JobExecutorImpl - Job errors: Job completed with exit code 1. This exit code does not satisfy job exit code condition. Job failed.
2011-01-06 15:40:03,496 [Job #1234 - 010_check_fidfm2201] DEBUG com.softtreetech.jscheduler.business.agent.remote.RemoteAgentImpl - Error occurred while running job=010_check_fidfm2201, runtime id=6546870
2011-01-06 15:40:03,600 [Job #1234 - 010_check_fidfm2201] DEBUG com.softtreetech.jscheduler.business.agent.remote.RemoteAgentImpl - Starting job=010_check_ach_report_03c_02c_trigger, runtime id=6546872
Here's the log from the master:
2011-01-06 15:40:03,485 [Job #1234 - 010_check_fidfm2201] DEBUG com.softtreetech.jscheduler.business.runner.RemoteJobRunner - runJob
com.softtreetech.jscheduler.common.SchedException: Job completed with exit code 1. This exit code does not satisfy job exit code condition. Job failed.
at com.softtreetech.jscheduler.business.agent.remote.RemoteAgentImpl.executeJob(Unknown Source)
at sun.reflect.GeneratedMethodAccessor68.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:592)
at sun.rmi.server.UnicastServerRef.dispatch(UnicastServerRef.java:294)
at sun.rmi.transport.Transport$1.run(Transport.java:153)
at java.security.AccessController.doPrivileged(Native Method)
at sun.rmi.transport.Transport.serviceCall(Transport.java:149)
at sun.rmi.transport.tcp.TCPTransport.handleMessages(TCPTransport.java:466)
at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run(TCPTransport.java:707)
at java.lang.Thread.run(Thread.java:595)
at sun.rmi.transport.StreamRemoteCall.exceptionReceivedFromServer(StreamRemoteCall.java:247)
at sun.rmi.transport.StreamRemoteCall.executeCall(StreamRemoteCall.java:223)
at sun.rmi.server.UnicastRef.invoke(UnicastRef.java:126)
at com.softtreetech.jscheduler.business.agent.remote.RemoteAgentImpl_Stub.executeJob(Unknown Source)
at com.softtreetech.jscheduler.business.runner.RemoteJobRunner.runJob(Unknown Source)
at com.softtreetech.jscheduler.business.runner.AbstractJobRunner.runJobIgnoringErrorsIfNeeded(Unknown Source)
at com.softtreetech.jscheduler.business.runner.AbstractJobRunner.startExecution(Unknown Source)
at com.softtreetech.jscheduler.business.runner.AbstractJobRunner.execute(Unknown Source)
at com.softtreetech.jscheduler.business.runner.JobExecutorImpl.execute(Unknown Source)
at com.softtreetech.jscheduler.business.runner.JobExecutorImpl$1.run(Unknown Source)
at java.lang.Thread.run(Thread.java:595)
From what I can see, the job may have kicked off twice. Can you tell that from the logs above?
Other information that might be useful:
If I go into the queue monitor for the queue that locked up here's the info I see:
Queue: file_ops_2
Job#: 8207997
Time queued: 6-Jan-2011 15:40:00
Priority: Normal
Status: RUNNING
Job ID: 1234
Job Name: 010_check_fidfm2201
Time Job started: 6-Jan-2011 15:40:02
Size: 1
Size in Queue: 1
System Process ID: 8207997 file_ops_2
|
|
Thu Jan 06, 2011 5:17 pm |
|
|
SysOpJ
Joined: 20 Aug 2010 Posts: 95
|
|
|
|
Thanks for the update. We're looking through the trace now to see if anything stands out.
|
|
Fri Jan 07, 2011 9:57 am |
|
|
seanc217
Joined: 23 May 2007 Posts: 272
|
|
|
|
Thanks. If this can be an interim fix, that would be great, because it's getting annoying when they lock up.
Let me know.
|
|
Fri Jan 07, 2011 3:37 pm |
|
|
SysOp
Site Admin
Joined: 26 Nov 2006 Posts: 7857
|
|
|
|
Hi,
There are a couple of suspicious lines in the posted trace from the agent. For example,
2011-01-06 15:40:03,146 [Job #1548 - 010_check_ach_report_02_trigger] DEBUG com.softtreetech.jscheduler.business.agent.remote.RemoteAgentImpl - Starting job=010_check_fidfm2201, runtime id=6546870
Job 010_check_fidfm2201 has Job #1234, not #1548.
Similarly, 2011-01-06 15:40:03,600 [Job #1234 - 010_check_fidfm2201] DEBUG com.softtreetech.jscheduler.business.agent.remote.RemoteAgentImpl - Starting job=010_check_ach_report_03c_02c_trigger, runtime id=6546872
The above would make sense if it were recorded in the master scheduler trace, but not in the agent trace. All job control and chaining is supposed to be handled by the scheduler. Can you describe how these jobs relate to each other? How do you start them?
Are there any records in the agent trace indicating other activities of jobs 010_check_ach_report_02_trigger and 010_check_ach_report_03c_02c_trigger?
The current theory is that the scheduler queue is stuck because the queue is waiting for the remote job to terminate. Your agent trace indicates that instead of terminating, the job is triggering some chain reaction. The fragment of the log is too short to see what happens after that.
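To find every such line in the full agent trace, a small script can compare the job named in the thread label with the job named after "Starting job=". This is a minimal sketch, not a SoftTree tool: the regular expression is inferred only from the two lines quoted above, and the names `LINE_RE` and `find_mismatches` are made up for illustration.

```python
import re

# Pattern inferred from the quoted agent trace lines (an assumption):
#   [Job #<number> - <job name>] ... Starting job=<job name>, runtime id=<id>
LINE_RE = re.compile(
    r"\[Job #(?P<num>\d+) - (?P<thread_job>[^\]]+)\].*?"
    r"Starting job=(?P<started>[^,\s]+), runtime id=(?P<runtime>\d+)"
)


def find_mismatches(lines):
    """Return (job number, thread job, started job, runtime id) for each
    line where the thread label and the started job name differ."""
    mismatches = []
    for line in lines:
        match = LINE_RE.search(line)
        if match and match.group("thread_job") != match.group("started"):
            mismatches.append(
                (
                    match.group("num"),
                    match.group("thread_job"),
                    match.group("started"),
                    match.group("runtime"),
                )
            )
    return mismatches


# The two suspicious lines quoted in this thread.
sample = [
    "2011-01-06 15:40:03,146 [Job #1548 - 010_check_ach_report_02_trigger] DEBUG "
    "com.softtreetech.jscheduler.business.agent.remote.RemoteAgentImpl - "
    "Starting job=010_check_fidfm2201, runtime id=6546870",
    "2011-01-06 15:40:03,600 [Job #1234 - 010_check_fidfm2201] DEBUG "
    "com.softtreetech.jscheduler.business.agent.remote.RemoteAgentImpl - "
    "Starting job=010_check_ach_report_03c_02c_trigger, runtime id=6546872",
]

for num, thread_job, started, runtime in find_mismatches(sample):
    print(f"Job #{num} ({thread_job}) started {started}, runtime id {runtime}")
```

For a real trace, pass an open file object instead of `sample`, since the function only iterates over lines.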
|
|
Sat Jan 08, 2011 1:02 pm |
|
|
seanc217
Joined: 23 May 2007 Posts: 272
|
|
|
|
Hi there,
Basically, I have trigger files that get created based on some event, like say when I receive a file.
These jobs check for the existence of the trigger file every 5 minutes.
When one is found, some jobs are kicked off. That's basically it.
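For readers unfamiliar with this pattern, the check itself can be sketched as below. This is only an illustration and not the file_checker.ksh script from the logs: the function name `check_trigger` and the convention of returning 0 when the file exists and 1 when it does not are assumptions.

```python
from pathlib import Path


def check_trigger(trigger_path):
    """Return 0 if the trigger file exists, 1 otherwise.

    The 0/1 convention is an assumption, chosen so a scheduler could
    treat a non-zero exit code as "no trigger file yet".
    """
    return 0 if Path(trigger_path).is_file() else 1
```

A job scheduled every 5 minutes could exit with this value, so that dependent jobs start only after a run that returns 0.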
I have the full debug file, but it's too big to post here. If you like I can e-mail both the master and agent logs to you.
Just let me know where to send them.
Thanks!
|
|
Sat Jan 08, 2011 9:50 pm |