SoftTree Technologies
Technical Support Forums
Redemann
Joined: 11 Jul 2007 Posts: 90 Country: Germany
I have some news regarding this issue:
Please have a look at these debug-logs taken from the remote agent:
2008-05-22 14:05:02,910 [Job #67 - bamacc:check_schedule] DEBUG com.softtreetech.jscheduler.business.runner.ProgramJobRunner - runJob(): start
2008-05-22 14:05:02,911 [Job #67 - bamacc:check_schedule] DEBUG com.softtreetech.jscheduler.business.runner.security.SecurityService - authNativeUser: ftp_connect: 127.0.0.1
2008-05-22 14:05:02,933 [Job #67 - bamacc:check_schedule] DEBUG com.softtreetech.jscheduler.business.runner.security.SecurityService - authNativeUser: root login ok
2008-05-22 14:05:02,933 [Job #67 - bamacc:check_schedule] DEBUG com.softtreetech.jscheduler.business.runner.ProgramJobRunner - execProcess(): command line [date > /tmp/check_schedule.txt] in work directory [/]
2008-05-22 14:05:02,933 [Job #67 - bamacc:check_schedule] DEBUG com.softtreetech.jscheduler.business.runner.ProgramJobRunner - runAs() username=root command=date,>,/tmp/check_schedule.txt workDir=/
2008-05-22 14:05:02,933 [Job #67 - bamacc:check_schedule] DEBUG com.softtreetech.jscheduler.business.runner.ProgramJobRunner - exec : ./runas.pl,root,date > /tmp/check_schedule.txt,/
2008-05-22 14:05:02,987 [Job #67 - bamacc:check_schedule] DEBUG com.softtreetech.jscheduler.business.runner.ProgramJobRunner - waitForProcess(): start
2008-05-22 14:05:03,079 [Job #67 - bamacc:check_schedule] DEBUG com.softtreetech.jscheduler.business.runner.ProgramJobRunner - waitForProcess(): end
2008-05-22 14:05:03,080 [Job #67 - bamacc:check_schedule] DEBUG com.softtreetech.jscheduler.business.runner.AbstractJobRunner - isFailed(...) : exit code 0
2008-05-22 14:05:03,080 [Job #67 - bamacc:check_schedule] DEBUG com.softtreetech.jscheduler.business.runner.ProgramJobRunner - killProcess start
2008-05-22 14:05:03,080 [Job #67 - bamacc:check_schedule] DEBUG com.softtreetech.jscheduler.business.runner.ProgramJobRunner - runJob(): end
2008-05-22 14:10:04,961 [Job #67 - bamacc:check_schedule] DEBUG com.softtreetech.jscheduler.business.runner.ProgramJobRunner - runJob(): start
2008-05-22 14:10:04,961 [Job #67 - bamacc:check_schedule] DEBUG com.softtreetech.jscheduler.business.runner.security.SecurityService - authNativeUser: ftp_connect: 127.0.0.1
2008-05-22 14:10:04,985 [Job #67 - bamacc:check_schedule] DEBUG com.softtreetech.jscheduler.business.runner.security.SecurityService - authNativeUser: root login ok
2008-05-22 14:10:04,985 [Job #67 - bamacc:check_schedule] DEBUG com.softtreetech.jscheduler.business.runner.ProgramJobRunner - execProcess(): command line [date > /tmp/check_schedule.txt] in work directory [/]
2008-05-22 14:10:04,985 [Job #67 - bamacc:check_schedule] DEBUG com.softtreetech.jscheduler.business.runner.ProgramJobRunner - runAs() username=root command=date,>,/tmp/check_schedule.txt workDir=/
2008-05-22 14:10:04,985 [Job #67 - bamacc:check_schedule] DEBUG com.softtreetech.jscheduler.business.runner.ProgramJobRunner - exec : ./runas.pl,root,date > /tmp/check_schedule.txt,/
2008-05-22 14:10:05,065 [Job #67 - bamacc:check_schedule] DEBUG com.softtreetech.jscheduler.business.runner.ProgramJobRunner - waitForProcess(): start
2008-05-22 14:10:05,159 [Job #67 - bamacc:check_schedule] DEBUG com.softtreetech.jscheduler.business.runner.ProgramJobRunner - waitForProcess(): end
2008-05-22 14:10:05,160 [Job #67 - bamacc:check_schedule] DEBUG com.softtreetech.jscheduler.business.runner.AbstractJobRunner - isFailed(...) : exit code 0
2008-05-22 14:10:05,160 [Job #67 - bamacc:check_schedule] DEBUG com.softtreetech.jscheduler.business.runner.ProgramJobRunner - killProcess start
2008-05-22 13:03:05,384 [Job #69 - bamacc:check_views] DEBUG com.softtreetech.jscheduler.business.runner.ProgramJobRunner - runJob(): start
2008-05-22 13:03:05,385 [Job #69 - bamacc:check_views] DEBUG com.softtreetech.jscheduler.business.runner.security.SecurityService - authNativeUser: ftp_connect: 127.0.0.1
2008-05-22 13:03:05,407 [Job #69 - bamacc:check_views] DEBUG com.softtreetech.jscheduler.business.runner.security.SecurityService - authNativeUser: informix login ok
2008-05-22 13:03:05,408 [Job #69 - bamacc:check_views] DEBUG com.softtreetech.jscheduler.business.runner.ProgramJobRunner - execProcess(): command line [/home/dcs/bin/check_views 2>/dev/null] in work directory [/home/informix]
2008-05-22 13:03:05,408 [Job #69 - bamacc:check_views] DEBUG com.softtreetech.jscheduler.business.runner.ProgramJobRunner - runAs() username=informix command=/home/dcs/bin/check_views,2>/dev/null workDir=/home/informix
2008-05-22 13:03:05,408 [Job #69 - bamacc:check_views] DEBUG com.softtreetech.jscheduler.business.runner.ProgramJobRunner - exec : ./runas.pl,informix,/home/dcs/bin/check_views 2>/dev/null,/home/informix
2008-05-22 13:03:05,461 [Job #69 - bamacc:check_views] DEBUG com.softtreetech.jscheduler.business.runner.ProgramJobRunner - waitForProcess(): start
2008-05-22 13:03:05,559 [Job #69 - bamacc:check_views] DEBUG com.softtreetech.jscheduler.business.runner.ProgramJobRunner - waitForProcess(): end
2008-05-22 13:03:05,559 [Job #69 - bamacc:check_views] DEBUG com.softtreetech.jscheduler.business.runner.AbstractJobRunner - isFailed(...) : exit code 0
2008-05-22 13:03:05,559 [Job #69 - bamacc:check_views] DEBUG com.softtreetech.jscheduler.business.runner.ProgramJobRunner - killProcess start
2008-05-22 13:03:05,559 [Job #69 - bamacc:check_views] DEBUG com.softtreetech.jscheduler.business.runner.ProgramJobRunner - runJob(): end
2008-05-22 14:10:04,962 [Job #69 - bamacc:check_views] DEBUG com.softtreetech.jscheduler.business.runner.ProgramJobRunner - runJob(): start
2008-05-22 14:10:04,962 [Job #69 - bamacc:check_views] DEBUG com.softtreetech.jscheduler.business.runner.security.SecurityService - authNativeUser: ftp_connect: 127.0.0.1
2008-05-22 14:10:04,999 [Job #69 - bamacc:check_views] DEBUG com.softtreetech.jscheduler.business.runner.security.SecurityService - authNativeUser: informix login ok
2008-05-22 14:10:05,002 [Job #69 - bamacc:check_views] DEBUG com.softtreetech.jscheduler.business.runner.ProgramJobRunner - execProcess(): command line [/home/dcs/bin/check_views 2>/dev/null] in work directory [/home/informix]
2008-05-22 14:10:05,002 [Job #69 - bamacc:check_views] DEBUG com.softtreetech.jscheduler.business.runner.ProgramJobRunner - runAs() username=informix command=/home/dcs/bin/check_views,2>/dev/null workDir=/home/informix
2008-05-22 14:10:05,002 [Job #69 - bamacc:check_views] DEBUG com.softtreetech.jscheduler.business.runner.ProgramJobRunner - exec : ./runas.pl,informix,/home/dcs/bin/check_views 2>/dev/null,/home/informix
2008-05-22 14:10:05,065 [Job #69 - bamacc:check_views] DEBUG com.softtreetech.jscheduler.business.runner.ProgramJobRunner - waitForProcess(): start
2008-05-22 14:10:05,160 [Job #69 - bamacc:check_views] DEBUG com.softtreetech.jscheduler.business.runner.ProgramJobRunner - waitForProcess(): end
2008-05-22 14:10:05,160 [Job #69 - bamacc:check_views] DEBUG com.softtreetech.jscheduler.business.runner.AbstractJobRunner - isFailed(...) : exit code 0
First look at the debug output for job #67. The run at 14:05 finishes with "killProcess start" and then "runJob(): end". For the run at 14:10, the last log entry is "killProcess start", and "runJob(): end" is missing! For exactly this run the queue file isn't deleted, so 24x7 assumes that the job is still running!
At nearly the same time another job (#69) was started. This one ended normally.
#67 and #69 do *NOT* use the same queue.
So it seems that the "killProcess start" is somehow unfinished. Why? It seems that this only happens if there is another job running at the same time (IMHO).
In addition you can see that job #69 at 13:03 finished with "killProcess start" and then "runJob(): end" and the one at 14:10 ended with "isFailed(...) : exit code 0".
I checked the log over and over and I'm sure I did not miss any lines. (I saved the log and could send it via mail if that would be useful for you.)
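The same check can be done mechanically. The following is a minimal sketch, assuming the debug.log line format quoted above (timestamp, bracketed job tag, then the message). The function name find_unfinished_runs is made up for illustration and is not part of 24x7. It reports every run that logged "runJob(): start" without a later "runJob(): end" for the same job.

```shell
#!/bin/bash
# Sketch only: find_unfinished_runs is a hypothetical helper, not part of 24x7.
# It assumes the debug.log line format quoted in this thread.
find_unfinished_runs() {
    awk '
        # The job tag is the bracketed field, e.g. [Job #67 - bamacc:check_schedule]
        match($0, /\[Job #[0-9]+ - [^]]*\]/) {
            job = substr($0, RSTART, RLENGTH)
            if (index($0, "runJob(): start")) {
                # A new start while an earlier run is still pending: report the earlier one.
                if (job in started) print "unfinished: " job " started " started[job]
                started[job] = $1 " " $2
            } else if (index($0, "runJob(): end")) {
                delete started[job]
            }
        }
        END {
            # Runs that never logged an end before the log stopped.
            for (job in started) print "unfinished: " job " started " started[job]
        }
    ' "$1"
}
```

Called as `find_unfinished_runs debug.log`, it prints one line per open run. A run that is still in progress when the log ends is reported too, so the result needs to be read against the time the log was saved.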
Seems we are getting closer...
Thanks in advance for your help.
Thu May 22, 2008 9:07 am

SysOp
Site Admin
Joined: 26 Nov 2006 Posts: 7966
Sorry for the delay in responding. The forum was closed for the weekend because of backend database server upgrades.
Yes, your observation brings us closer to finding the root cause. We knew that the job got stuck in the queue because it didn't complete and the agent never reported job completion. Now we know this occurs when the job cleanup process hangs. 24x7 always executes some kind of "kill process" command to terminate the external job process in case it is still running or has not completely shut down. The implementation of the "kill" differs between systems. From the trace I can see the job is running on a Unix system, so the "kill" command is just a regular Unix "kill [pid]" command.
Why could the system kill command hang on your system? Could it display an interactive prompt when it doesn't like the operation and then wait for user input?
Can you put a script file named "kill", with no extension, into the 24x7 directory and use it as a wrapper for the system kill command? If you can, please make the script invoke the system command from the /usr/bin directory (or wherever it is on your system), passing along the command-line parameter. The script should also log every call from 24x7 to a log file, including the command-line parameter value, the return code, and the response from the system kill command. This should help us get to the bottom of the issue and find out what happens when the "kill" operation hangs.
Mon May 26, 2008 12:56 pm

Redemann
Joined: 11 Jul 2007 Posts: 90 Country: Germany
>> Sorry for delay with the response. The forum was closed for weekend because of the backend database server upgrades.
No problem.
>> Can you put a script file with "kill" name and no extension into 24x7 directory and use it as a wrapper...
Sorry, I'm not sure I understand that. What should this script look like?
For instance (?):
/usr/local/24x7_Scheduler/kill (kill is an executable shell script)
kill-Script:
#!/bin/bash
/usr/bin/kill -TERM <what>
echo "log something into logfile" >> logfile
Can you provide me an example please?
PS: The remote agent is running AIX 5.3 (latest service level and latest Java 1.4.2 package from IBM)
Tue May 27, 2008 5:26 am

SysOp
Site Admin
Joined: 26 Nov 2006 Posts: 7966
I think your example script is just fine. The script receives one command-line parameter: the system process ID of the process it needs to terminate in case that process is still running. I would add one more echo line at the beginning so the script prints a diagnostic message before calling /usr/bin/kill, for example:
#!/bin/bash
# Wrapper around the system kill; logs every call made by 24x7.
echo "-------------" >> kill.log
echo "running kill for process $1" >> kill.log
/usr/bin/kill -TERM "$1" >> kill.log 2>&1
rc=$?   # save kill's exit code before another command overwrites it
echo "kill completed with exit code $rc" >> kill.log
exit $rc
Maybe also add the current date and time to each log line, to ease the troubleshooting.
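A slightly fuller version of such a wrapper might look like the sketch below. It is only an illustration under stated assumptions: REAL_KILL and KILL_LOG are hypothetical settings invented here (they are not 24x7 options), and it is not known from this thread how the agent actually invokes kill. The sketch timestamps every log line, redirects stdin from /dev/null so that a prompt for interactive input cannot block the call, and passes the real exit code back to the caller.

```shell
#!/bin/bash
# Sketch of a logging wrapper around the system kill command.
# REAL_KILL and KILL_LOG are hypothetical settings, not 24x7 options.
REAL_KILL=${REAL_KILL:-/usr/bin/kill}
KILL_LOG=${KILL_LOG:-kill.log}

# Append one timestamped line to the log file.
log() { echo "$(date '+%Y-%m-%d %H:%M:%S') : $*" >> "$KILL_LOG"; }

logged_kill() {
    local pid=$1 rc
    log "running kill for process $pid"
    # stdin comes from /dev/null, so the command cannot wait for keyboard input.
    "$REAL_KILL" -TERM "$pid" < /dev/null >> "$KILL_LOG" 2>&1
    rc=$?   # capture immediately, before any other command resets $?
    log "kill completed for $pid with exit code $rc"
    return "$rc"
}

# When run as a script with a process ID, act as the wrapper itself.
if [ "$#" -gt 0 ]; then
    logged_kill "$1"
    exit $?
fi
```

If a call hangs, the log shows a "running kill" line with no matching "kill completed" line, and the timestamps show when it happened.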
Tue May 27, 2008 9:10 am

SysOp
Site Admin
Joined: 26 Nov 2006 Posts: 7966
Hi. Could you please update us on the status of this issue? Did the implementation of a "custom" kill command resolve the issue?
Wed Jun 11, 2008 12:31 am

Redemann
Joined: 11 Jul 2007 Posts: 90 Country: Germany
Sorry, I was out of the office for 2 weeks and had no time to get into this.
In the last 2 weeks no more problems regarding this issue occurred.
I'll keep you up to date if I find time to check this...
Thank you.
Wed Jun 11, 2008 3:29 am

Redemann
Joined: 11 Jul 2007 Posts: 90 Country: Germany
It took a long time, but I was finally able to get back to this case and continue...
I just added a kill script to the 24x7 directory on the remote agent to check out your suggestion, but the Scheduler seems to ignore its existence. I even restarted the process.
root@bam00(bam_tcp):"/acc/24x7_Scheduler"$ cat kill
#!/usr/bin/bash
echo "-------------"
echo "`date` : running kill for process $1" >> kill.log
/usr/bin/kill -TERM $1 >> kill.log
echo "`date` : kill completed for $1 with exit code $?" >> kill.log
No kill.log is created.
Any idea?
Thu Jul 10, 2008 5:43 am

SysOp
Site Admin
Joined: 26 Nov 2006 Posts: 7966
Please check debug.log in /acc/24x7_Scheduler for text fragments like "/bin/sh kill" and see what happens after that. This should provide a clue as to why your kill script is not being picked up.
Thu Jul 10, 2008 6:40 pm

Redemann
Joined: 11 Jul 2007 Posts: 90 Country: Germany
I cannot find any hints:
root@bam00(bam_tcp):"/acc/24x7_Scheduler"$ grep kill debug.log
2008-07-18 15:30:00,298 [Job #67 - bamacc:check_schedule] DEBUG com.softtreetech.jscheduler.business.runner.ProgramJobRunner - killProcess start
2008-07-18 15:30:19,920 [Job #85 - bamacc:pruef_bam] DEBUG com.softtreetech.jscheduler.business.runner.ProgramJobRunner - killProcess start
2008-07-18 15:35:00,419 [Job #67 - bamacc:check_schedule] DEBUG com.softtreetech.jscheduler.business.runner.ProgramJobRunner - killProcess start
2008-07-18 15:39:03,829 [Job #85 - bamacc:pruef_bam] DEBUG com.softtreetech.jscheduler.business.runner.ProgramJobRunner - killProcess start
2008-07-18 15:40:00,385 [Job #67 - bamacc:check_schedule] DEBUG com.softtreetech.jscheduler.business.runner.ProgramJobRunner - killProcess start
2008-07-18 15:45:00,279 [Job #67 - bamacc:check_schedule] DEBUG com.softtreetech.jscheduler.business.runner.ProgramJobRunner - killProcess start
2008-07-18 15:49:04,841 [Job #85 - bamacc:pruef_bam] DEBUG com.softtreetech.jscheduler.business.runner.ProgramJobRunner - killProcess start
2008-07-18 15:50:00,340 [Job #67 - bamacc:check_schedule] DEBUG com.softtreetech.jscheduler.business.runner.ProgramJobRunner - killProcess start
2008-07-18 15:55:00,340 [Job #67 - bamacc:check_schedule] DEBUG com.softtreetech.jscheduler.business.runner.ProgramJobRunner - killProcess start
Only the "normal" kill messages. The kill script seems to be ignored.
The script exists (of course):
root@bam00(bam_tcp):"/acc/24x7_Scheduler"$ ls -l kill
-rwxr-xr-- 1 root system 196 Jul 10 11:08 kill
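One thing that can be checked from the shell is how the bare name "kill" gets resolved. In bash and ksh, kill is a shell builtin, so a command line that simply says "kill 1234" never looks at a file named kill unless the file is called by its path. The listing above also shows that users outside the owner and group have read-only access to the script. Whether either point applies depends on how the agent invokes kill, which this thread does not establish. The sketch below, with the made-up helper name check_kill_wrapper, just prints the relevant facts for a given directory.

```shell
#!/bin/bash
# Sketch only: check_kill_wrapper is a hypothetical helper, not part of 24x7.
check_kill_wrapper() {
    local dir=${1:-.} d
    # How this shell resolves the bare name; in bash and ksh "kill" is a builtin.
    type kill
    # External candidates, in PATH order.
    local IFS=:
    for d in $PATH; do
        if [ -x "$d/kill" ]; then echo "on PATH: $d/kill"; fi
    done
    # The wrapper itself: present and executable for the current user?
    if [ -x "$dir/kill" ]; then
        echo "$dir/kill is executable by $(id -un)"
    else
        echo "$dir/kill is missing or not executable by $(id -un)"
    fi
}
```

For example, `check_kill_wrapper /acc/24x7_Scheduler` could be run as the user the agent runs under, and again as a job user such as informix.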
Fri Jul 18, 2008 10:01 am

SysOp
Site Admin
Joined: 26 Nov 2006 Posts: 7966
Sorry, I must be looking at some other version of the scheduler. I will try to find out why the output is different.
Fri Jul 18, 2008 11:36 am

seanc217
Joined: 23 May 2007 Posts: 272
I am having similar issues where my queue gets locked up and then nothing gets executed. Is there any solution to this issue?
Thanks.
Mon Jul 06, 2009 12:11 pm

SysOp
Site Admin
Joined: 26 Nov 2006 Posts: 7966
Technically this is a side effect of some other issue. The root cause is that the job is not releasing some resources and the system is unable to terminate the job thread or some external processes associated with that thread. The thread gets stuck in the queue.
Does this occur randomly, or do you see some patterns? Morning jobs? Jobs A, B, and C running concurrently? Large files and longer job runs? Something else?
Mon Jul 06, 2009 5:35 pm

seanc217
Joined: 23 May 2007 Posts: 272
I see it in the morning most of the time, when I am running jobs that check for trigger files. Each job checks for a trigger file and, if it's there, kicks off a job via a notification event. If the trigger file is not there, the job exits with a status of 1; however, I do not disable the job. It's very possible that some concurrency is going on here, where multiple jobs are getting kicked off at the same time, because I check for files every 5 minutes for my high-priority jobs.
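For readers unfamiliar with this kind of job, a trigger-file checker of the sort described might look roughly like this. It is a sketch that assumes a Unix-like agent and a shell script; check_trigger and the path passed to it are hypothetical, not the actual scripts in use.

```shell
#!/bin/bash
# Sketch only: check_trigger and its argument are hypothetical.
check_trigger() {
    local trigger=$1
    if [ -f "$trigger" ]; then
        echo "$(date '+%Y-%m-%d %H:%M:%S') trigger found: $trigger"
        return 0    # zero status: the trigger file is present
    fi
    echo "$(date '+%Y-%m-%d %H:%M:%S') trigger missing: $trigger"
    return 1        # non-zero status: no trigger file this time
}
```

The job's command line would end with something like `check_trigger /path/to/trigger.file; exit $?`, so the scheduler sees status 1 whenever the file is absent.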
Let me know if I can provide more information.
Thanks.
Mon Jul 06, 2009 7:02 pm

SysOp
Site Admin
Joined: 26 Nov 2006 Posts: 7966
Can you find out whether the jobs getting stuck in the queue are doing their regular work or doing something different when they get stuck? Is a detailed step-by-step log of their activities available?
BTW, the next maintenance release should support email alerts for queues loaded with lots of jobs. The alert is not going to clear such jobs, but it should at least simplify management by automatically notifying you when attention is required, for example when a queue is filled with x number of jobs.
Tue Jul 07, 2009 12:42 pm

seanc217
Joined: 23 May 2007 Posts: 272
The scripts that sometimes get stuck are simple file checker scripts that just look for files on the remote agent. Nothing really tricky going on here. I was hoping that someone might have figured out what's going on with this, but it appears there is no definitive answer. One thing I do notice is that I have been running in GUI mode on a Windows server since Saturday and I have not seen this issue happen. The issue seems to happen when the scheduler is in service mode. Thanks for any input you can provide.
Tue Jul 07, 2009 2:05 pm |