 |
SoftTree Technologies
Technical Support Forums
David Ciechanowicz
Joined: 05 Nov 2003 Posts: 6
|
|
job flow handling |
|
Hi,

I have the same problem as the one described in these threads: "Queues Hang -- Oracle/24x7 on Same Server", "Job Hung in Queue as Running after Completion", "Job Hung in Queue Revisited", and others. Up to now I haven't found the reason for this behavior, and I haven't seen a patch or any real solution for it. Should we start porting our jobs to an enterprise-ready scheduler?

System details:
Uptime: 10 days 5 hours 35 minutes 36 seconds
Kernel version: Microsoft Windows 2000, Multiprocessor Free
Product type: Server
Product version: 5.0
Service pack: 4
Kernel build number: 2195
Registered organization: GE Capital Bank Poland
Registered owner: Saturnstrator (Dawid)
Install date: 2003-03-01, 09:45:47
Activation status: Not applicable
IE version: 5.0100
System root: C:\WINNT
Processors: 4
Processor speed: 2.8 GHz
Processor type: Intel(R) Xeon(TM) CPU
Physical memory: 2560 MB
Video driver: ATI Technologies Inc. RAGE XL PCI

Volume  Type       Format  Label      Size      Free     Free %
A:      Removable                                        0%
C:      Fixed      NTFS               9.2 GB    5.8 GB   63%
D:      CD-ROM     CDFS    W2SSEL_EN  404.1 MB           0%
E:      Fixed      NTFS    Dane       58.6 GB   58.5 GB  100%
H:      Remote                                           0%
K:      Remote                                           0%
N:      Remote                                           0%

OS Hot Fix            Installed
KB823182              2003-10-30
KB823559              2003-10-22
KB823980              2003-10-22
KB824105              2003-10-30
KB824141              2003-10-30
KB824146              2003-10-22
KB825119              2003-10-30
KB826232              2003-10-30
KB828035              2003-10-30
Q147222               2003-03-01
ServicePackUninstall  2003-10-22

Applications:
24x7 Automation Suite 3.3
Compaq Management Agents
Hewlett-Packard Survey Utility
Internet Explorer Q828750
LiveAdvisor (Symantec Corporation) 1.0.0.777
LiveUpdate 1.80 (Symantec Corporation) 1.80.19.0
Symantec AntiVirus Client 8.1.0.825
Symantec pcAnywhere 9.0
Version Control Agent 1.0
WebFldrs 9.00.3501
Windows 2000 Hotfix - KB823182 20030618.121409
Windows 2000 Hotfix - KB823559 20030627.135515
Windows 2000 Hotfix - KB823980 20030705.101654
Windows 2000 Hotfix - KB824105 20030716.151320
Windows 2000 Hotfix - KB824141 20030805.151423
Windows 2000 Hotfix - KB824146 20030823.144456
Windows 2000 Hotfix - KB825119 20030827.151123
Windows 2000 Hotfix - KB826232 20031007.160553
Windows 2000 Hotfix - KB828035 20031002.141358
Windows 2000 Service Pack 4

24x7 configuration: 103 script jobs (JAL) using semaphores, notification actions, etc.

I've tried using multiple queues and a single queue, with synchronous, asynchronous, and detached jobs. I've removed some semaphores and instead created jobs that use JobRun and JobGetStatus statements. I've put an Exit statement at the end of every job and... nothing has changed. They still randomly get stuck. The problem is exactly as Bill described it: by all visible results the job finishes, but the queue monitor shows it as running. Also, JobGetStatus returns -3 for that job - it's sick :-)

I've also found another 'job flow' bug: a job doesn't start after it has found its semaphore if the previous job got stuck (all jobs are asynchronous). The log entry shows that the job has found its trigger, but it is still waiting in the queue.

I've also noticed that 24x7, and especially some of its threads, has quite a reasonable number of page faults per second (10 when running jobs, 6 when idle) - I don't know if that's connected.

Regards,
David

ps.: Turning trace on is not an option for me. I've got some Clipper programs to run simultaneously, and turning trace on causes 24x7 to open only one NTVDM.
|
|
Mon Nov 10, 2003 6:20 am |
|
 |
SysOp
Site Admin
Joined: 26 Nov 2006 Posts: 7969
|
|
Re: job flow handling |
|
The general cause of this problem is that the job leaves some resources open, which at some point causes the job to hang after completion. Unfortunately, there is no common solution to this problem. As a proven workaround, try not to run many asynchronous jobs. Run jobs as synchronous and detached. To allow more jobs to run concurrently, create several job queues and spread the jobs across them.
|
|
Mon Nov 10, 2003 10:25 am |
|
 |
David Ciechanowicz
Joined: 05 Nov 2003 Posts: 6
|
|
Re: job flow handling |
|
I don't know what "general" means for you, but I did two weeks of extended performance testing and found no resource leaks in the jobs.

Findings:
1. All but 2 of the 104 jobs execute PL/SQL scripts through Oracle's SqlPlus. The scripts are spooled or interactive - they execute stored procedures on the database. So there is minimal chance that they leave any resources open.
2. If the problem were caused by a leaky sqlplus, the 'hang' condition would become more probable with every job run, and that is not the case.
3. The same 'hangups' were reported by other people, and as I remember they don't use sqlplus.

All the jobs are as simple as they can be: build the command line (string concatenation), run the job, get the PID, check whether a job with the given PID exists, and if not, consider the job done and log the message. So there is not a lot of room for a resource leak on the 24x7 side either. However, I've found one bug that happens from time to time - sometimes the scheduler doesn't detect that a job with the given PID no longer exists. Here is a code sample of the shared routine that detects whether a job with the given PID exists:

Routine definition: _ChkByPID (PID_No number) return Boolean

    Dim (lista_proc,String)
    Dim (poszuk, String)
    Dim (wynik,Boolean)
    ProcessList (lista_proc)
    ConCat ("\n",PID_NO, poszuk)
    Match (lista_proc,poszuk,wynik)
    Return wynik

If 24x7 (or PowerBuilder) takes the same approach to check whether a process/thread is running (PID/TID existence), that would explain the bug. Standard program jobs also hang sometimes, and that might be explained by the same issue.

The question is: can you do anything about that? Or is it a fault of PowerBuilder's internal features, so that process/thread checking cannot be fixed?

Regards,
David

Ps.: I'm using 15 queues and 50% of the jobs are synchronous, and I haven't observed any difference.
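One possible explanation for the missed detection, offered only as a guess: the _ChkByPID routine looks for the text "\n" followed by the PID digits inside the process list. If Match behaves like a plain substring search, a PID such as 123 would also be "found" while a process with PID 1234 is running. The Python sketch below illustrates the difference between a substring test and an exact comparison. The process-list layout (one "PID name" entry per line) is an assumption made for the illustration and not the documented output of ProcessList; both function names are hypothetical.

```python
def pid_in_list_substring(process_list: str, pid: int) -> bool:
    """Mimic the substring test: look for a newline followed by the PID digits."""
    return f"\n{pid}" in process_list


def pid_in_list_exact(process_list: str, pid: int) -> bool:
    """Compare the first whitespace-separated field of every line with the PID."""
    for line in process_list.splitlines():
        fields = line.split()
        if fields and fields[0] == str(pid):
            return True
    return False


# Hypothetical process list: one "PID name" entry per line (assumed format).
sample = "\n".join(["8 System", "1234 sqlplus.exe", "560 24x7.exe"])

# PID 123 has already exited, but PID 1234 is still running.
print(pid_in_list_substring(sample, 123))  # True: "\n123" is a prefix of "\n1234"
print(pid_in_list_exact(sample, 123))      # False
```

Under the same assumed layout, the substring test also misses the very first entry, because no newline precedes it.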
|
|
Wed Nov 19, 2003 11:34 am |
|
 |
SysOp
Site Admin
Joined: 26 Nov 2006 Posts: 7969
|
|
Re: job flow handling |
|
Let me concentrate on the following paragraph: "All the jobs are as simple as they can be. Build command line (string concatenation), run job, get pid, check if job with given pid exists, if not consider job done and log the message."

Do they run as asynchronous jobs? (I mean the 24x7 jobs, not the SQL*Plus processes.)
Why do you check for the PID? What do you do if you cannot find it in the process list?
How extensively do you use the Script Library?
Can you post a simple script?

Please check the following:
1. Start NT Performance Monitor.
2. Add the "Handle" and "Thread" counters for "24x7.exe" to the monitor.
3. Set the Performance Monitor to refresh automatically every 10 or 15 minutes and let it run for a day or two.
4. Check whether the system constantly loses handles or threads, or whether there is any pattern. If there is a pattern, try to identify which jobs were running at the time of the leak.

Please let us know how these jobs are set up. I will be able to answer your other questions once I better understand what you do.

PS. I am sure this hanging issue can be solved.
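The pattern check in step 4 can be sketched in Python. This is only an illustration: it assumes the counter values have been exported from Performance Monitor as (timestamp, handle count) samples, and the sample data, the function name find_jumps, and the threshold are all hypothetical.

```python
from datetime import datetime


def find_jumps(samples, threshold):
    """Return (timestamp, increase) pairs where the counter rose by at least
    `threshold` compared with the previous sample."""
    jumps = []
    for (_, prev), (when, cur) in zip(samples, samples[1:]):
        if cur - prev >= threshold:
            jumps.append((when, cur - prev))
    return jumps


# Hypothetical handle counts for 24x7.exe, sampled every 15 minutes.
samples = [
    (datetime(2003, 11, 19, 9, 0), 410),
    (datetime(2003, 11, 19, 9, 15), 412),
    (datetime(2003, 11, 19, 9, 30), 455),
    (datetime(2003, 11, 19, 9, 45), 456),
    (datetime(2003, 11, 19, 10, 0), 501),
]

for when, increase in find_jumps(samples, threshold=20):
    print(f"{when:%H:%M} handle count rose by {increase}")
```

The timestamps of the reported jumps can then be compared with the 24x7 job log to see which jobs were running at those moments.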
|
|
Wed Nov 19, 2003 8:53 pm |
|
 |
David Ciechanowicz
Joined: 05 Nov 2003 Posts: 6
|
|
Re: job flow handling |
|
: Let me concentrate on the following paragraph: "All the jobs are as
: simple as they can be. Build command line (string concatenation),
: run job, get pid, check if job with given pid exists, if not consider job done
: and log the message."
: Do they run as asynchronous jobs? (I mean the 24x7 jobs, not the SQL*Plus processes.)

Some of them. I would say about 50%.

: Why do you check for the PID? What do you do if you cannot find it in the process
: list?

Because checking by name is pointless when more than one sqlplus.exe is running. RunAndWait is not an option either - some of the jobs start ten simultaneous sqlplus processes. BTW, I've noticed that putting a delay of a few seconds between starting these processes lowers the hangup rate.

: How extensively do you use the Script Library?

There are three shared scripts. Two of them are used by every job. The first one I included in my previous post. The second is here:

_InformujAdministratorow (_jobid string, _jobname string, _message string) return [None]

    Dim (wiadomosc,String)
    ConCat ("GECB-DES-BATCH\nZadanie Nr ",_jobid,wiadomosc)
    ConCat (wiadomosc," ",wiadomosc)
    ConCat (wiadomosc,_jobname,wiadomosc)
    ConCat (wiadomosc,"\n",wiadomosc)
    ConCat (wiadomosc,"Info: ",wiadomosc)
    ConCat (wiadomosc,_message,wiadomosc)
    // Dawid
    MailSend ("BackupScheduler","","SMS","XXXX",wiadomosc)
    // Marcin
    MailSend ("BackupScheduler","","SMS","XXXX",wiadomosc)
    // Dominik
    MailSend ("BackupScheduler","","SMS","XXXX",wiadomosc)
    // Pawel
    MailSend ("BackupScheduler","","SMS","XXXX",wiadomosc)
    // Jarek
    MailSend ("BackupScheduler","","SMS","XXXX",wiadomosc)

This script is used as a notification action in case of a job error, to send an SMS to all system administrators. Up to now it has been executed twice.

: Can you post a simple script?

Here is one of the simplest:

    // PID of the running process
    Dim (PID_Fill,Number)
    Dim (ScriptName,String,"smok_fill.SQL")
    // temporary
    Dim (Poszuk,String)
    Dim (Wynik,Boolean)
    // Run configuration
    RunConfig ("WINDOW","MINIMIZE")
    RunConfig ("TITLE","Smok wypelnianie przedplat")
    // Start the queue update
    Concat ("sqlplus ",global.BatchConnectString,poszuk)
    ConCat (poszuk," @",poszuk)
    ConCat (poszuk,global.ScriptPath,poszuk)
    ConCat (poszuk,ScriptName,poszuk)
    RUN (poszuk,"",PID_Fill)
    Concat ("Uruchomiono wypelnianie kolejki przedplat - PID: ",PID_Fill,Poszuk)
    LogAddMessageEx ("INFO","@V"job_id"","@V"job_name"",Poszuk)
    // Wait for the update to finish
    Petla_Oczekujaca:
    Wait(5)
    _ChkByPID(PID_Fill,wynik)
    If (wynik,Petla_Oczekujaca,Brak_Sesji)
    // All parallel sessions have finished
    Brak_Sesji:
    Concat ("Wypelnianie kolejki przedplat koniec PID: ",PID_Fill,Poszuk)
    LogAddMessageEx ("INFO","@V"job_id"","@V"job_name"",Poszuk)

: Please check the following: 1. Start NT Performance Monitor

I've already checked that, along with over 30 other counters, for both the 24x7 process and its threads. I used a 15-second interval to catch the moment, but had no luck. The system appears to work fine.

: PS. I am sure this hanging issue can be solved.

I think so too.
|
|
Thu Nov 20, 2003 7:23 am |
|
 |
SysOp
Site Admin
Joined: 26 Nov 2006 Posts: 7969
|
|
Re: job flow handling |
|
I still don't understand why you check for the process ID. I mean, what is the purpose of doing that?

Suggestion: in all asynchronous scripts that call the Script Library, try adding a Wait of at least 3 seconds at the end of the job. I am just guessing that multiple fast concurrent calls to the Script Library may, in theory, cause some references to the internal Script Library program object to be lost. If this is the case, it possibly creates an internal resource leak and thus, in the end, causes the job to hang.

In theory, references to the internal Script Library object in the scheduler's code are not released immediately. Each call to the Script Library adds a new reference at the beginning; after the call completes, this reference is marked for deletion, and then the internal "garbage collection" process physically removes the reference from memory. If that "garbage collection" process doesn't keep up with the references being added and deleted, it could cause the described problem.

By the way, if you wish, you can use a single ConCatEx with a comma-separated list of strings to merge instead of multiple ConCat calls, and a single MailSend with a comma-separated list of recipients instead of multiple MailSend calls. This way you can make the scripts a little shorter and more efficient.
|
|
Thu Nov 20, 2003 10:02 am |
|
 |
David Ciechanowicz
Joined: 05 Nov 2003 Posts: 6
|
|
Re: job flow handling |
|
: I still don't understand why you check for the process ID. I mean, what is the
: purpose of doing that?

I check to be sure that a job has ended and that I can start another process. I use this method to do dynamic load balancing in some jobs. For example: a job consists of 30 executions of SQLPlus with various parameters. Our performance tests on the database show that 4 of these processes may run simultaneously. The script starts the processes and controls how many of them are running at any given moment. If there are fewer than four, it starts the next one in the queue; otherwise it waits until one of the previously started processes has finished. To control the number of concurrent jobs you just change one parameter in the script (max_session_number). I also use PIDs and PID logging to allow an administrator to stop (kill) any job he wants to.

: Suggestion: in all asynchronous scripts that call the Script Library, try
: adding a Wait of at least 3 seconds at the end of the job.

OK. I'll try this one.

: I am just guessing that multiple fast concurrent calls to the Script
: Library may, in theory, cause some references to the internal
: Script Library program object to be lost. If this is the case, it possibly
: creates an internal resource leak and thus, in the end, causes the job
: to hang. In theory, references to the internal Script Library
: object in the scheduler's code are not released immediately. Each
: call to the Script Library adds a new reference at the beginning;
: after the call completes, this reference is marked for deletion, and then
: the internal "garbage collection" process physically removes the
: reference from memory. If that "garbage collection" process
: doesn't keep up with the references being added and deleted, it could
: cause the described problem.

That sounds probable. I'll certainly try this one and let you know the results.

: By the way, if you wish, you can use a single ConCatEx with a comma-separated list of
: strings to merge instead of multiple ConCat calls, and a single MailSend with a
: comma-separated list of recipients instead of multiple MailSend calls. This way
: you can make the scripts a little shorter and more efficient.

Thanks for the advice - I didn't know about these extensions to JAL.
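The load-balancing scheme described in this post (start up to max_session_number processes, start the next one whenever a slot frees up) can be sketched in Python as follows. This is a minimal illustration and not the author's JAL script: the function name run_with_limit, the polling interval, and the trivial child commands standing in for the SQL*Plus runs are all assumptions.

```python
import subprocess
import sys
import time


def run_with_limit(commands, max_sessions, poll_seconds=0.1):
    """Run every command, keeping at most `max_sessions` child processes alive.

    Returns the PIDs in the order the processes were started.
    """
    if max_sessions < 1:
        raise ValueError("max_sessions must be at least 1")
    pending = list(commands)
    running = []
    started_pids = []
    while pending or running:
        # Keep only processes that are still alive; poll() returns None for those.
        running = [proc for proc in running if proc.poll() is None]
        # Fill free slots with the next commands in the queue.
        while pending and len(running) < max_sessions:
            proc = subprocess.Popen(pending.pop(0))
            running.append(proc)
            started_pids.append(proc.pid)
        if running:
            time.sleep(poll_seconds)
    return started_pids


# Stand-in for the SQL*Plus runs: six trivial Python child processes.
commands = [[sys.executable, "-c", "pass"] for _ in range(6)]
pids = run_with_limit(commands, max_sessions=4)
print(len(pids))  # 6
```

The sketch tracks each child through the object returned when it was started instead of searching a process list for its PID. The collected PIDs can still be logged so that an administrator can stop a specific process.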
|
|
Thu Nov 20, 2003 12:59 pm |
|
 |