SoftTree Technologies
Technical Support Forums
job flow handling

 
David Ciechanowicz



Joined: 05 Nov 2003
Posts: 6

Post subject: job flow handling

Hi,

I have the same problem as the one described in threads:
Queues Hang -- Oracle/24x7 on Same Server,
Job Hung in Queue as Running after Completion,
Job Hung in Queue Revisited
and others.

Up to now I haven't found the reason for this behavior.
I also haven't seen a patch or any real solution for it.
Should we start porting our jobs to an enterprise-ready scheduler?

System details:
Uptime: 10 days 5 hours 35 minutes 36 seconds
Kernel version: Microsoft Windows 2000, Multiprocessor Free
Product type: Server
Product version: 5.0
Service pack: 4
Kernel build number: 2195
Registered organization: GE Capital Bank Poland
Registered owner: Saturnstrator (Dawid)
Install date: 2003-03-01, 09:45:47
Activation status: Not applicable
IE version: 5.0100
System root: C:\WINNT
Processors: 4
Processor speed: 2.8 GHz
Processor type: Intel(R) Xeon(TM) CPU
Physical memory: 2560 MB
Video driver: ATI Technologies Inc. RAGE XL PCI
Volume  Type       Format  Label      Size      Free     Free %
A:      Removable                                        0%
C:      Fixed      NTFS               9.2 GB    5.8 GB   63%
D:      CD-ROM     CDFS    W2SSEL_EN  404.1 MB           0%
E:      Fixed      NTFS    Dane       58.6 GB   58.5 GB  100%
H:      Remote                                           0%
K:      Remote                                           0%
N:      Remote                                           0%
OS Hot Fix Installed
KB823182 2003-10-30
KB823559 2003-10-22
KB823980 2003-10-22
KB824105 2003-10-30
KB824141 2003-10-30
KB824146 2003-10-22
KB825119 2003-10-30
KB826232 2003-10-30
KB828035 2003-10-30
Q147222 2003-03-01
ServicePackUninstall 2003-10-22
Applications:
24x7 Automation Suite 3.3
Compaq Management Agents
Hewlett-Packard Survey Utility
Internet Explorer Q828750
LiveAdvisor (Symantec Corporation) 1.0.0.777
LiveUpdate 1.80 (Symantec Corporation) 1.80.19.0
Symantec AntiVirus Client 8.1.0.825
Symantec pcAnywhere 9.0
Version Control Agent 1.0
WebFldrs 9.00.3501
Windows 2000 Hotfix - KB823182 20030618.121409
Windows 2000 Hotfix - KB823559 20030627.135515
Windows 2000 Hotfix - KB823980 20030705.101654
Windows 2000 Hotfix - KB824105 20030716.151320
Windows 2000 Hotfix - KB824141 20030805.151423
Windows 2000 Hotfix - KB824146 20030823.144456
Windows 2000 Hotfix - KB825119 20030827.151123
Windows 2000 Hotfix - KB826232 20031007.160553
Windows 2000 Hotfix - KB828035 20031002.141358
Windows 2000 Service Pack 4

24x7 configuration:
103 script jobs (JAL) using semaphores, notification actions, etc.

I've tried using multiple queues and a single queue, with synchronous, asynchronous, and detached jobs.
I've removed some semaphores and replaced them with jobs that use the JobRun and JobGetStatus
statements. I've put an Exit statement at the end of every job and... nothing has changed.
The jobs still get stuck at random. The problem is exactly as Bill described it.
By all visible results the job finishes, but the queue monitor shows it as running.
Also, JobGetStatus returns -3 for that job - it's sick :-)
I've also found another 'job flow' bug:
A job doesn't start after it has found its semaphore if the previous job got stuck
(all jobs are asynchronous). The log entry shows that the job has found its trigger, but it's
still waiting in the queue.
I've also noticed that 24x7, especially some of its threads, has a noticeable number of page faults
per second (10 when running jobs, 6 idle) - I don't know if that's connected.

Regards,
David

P.S. Turning trace on is not an option for me. I have some Clipper programs to run simultaneously,
and turning trace on causes 24x7 to open only one NTVDM.

Mon Nov 10, 2003 6:20 am
SysOp
Site Admin


Joined: 26 Nov 2006
Posts: 7969

Post subject: Re: job flow handling

The general cause of this problem is that the job leaves some resources open, which at some point causes the job to hang after completion.
Unfortunately, there is no common solution to this problem.
As a proven workaround, try not to run many asynchronous jobs. Run jobs as synchronous and detached. To allow more jobs to run concurrently, create several job queues and spread the jobs across them.


Mon Nov 10, 2003 10:25 am
David Ciechanowicz



Joined: 05 Nov 2003
Posts: 6

Post subject: Re: job flow handling

I don't know what 'general' means to you, but I did two weeks of extended
performance testing and found no resource leaks in the jobs.
Findings:

1. All but 2 of the 104 jobs execute PL/SQL scripts through Oracle's SQL*Plus.
Those scripts are spooled or interactive - they execute stored procedures on
the database. So there is minimal chance they will leave any resources open.

2. If the problem were caused by a leaky sqlplus, then the 'hang' condition would
become more likely with every job run, and that's not the case.

3. The same 'hangups' were reported by other people and, as I remember, they don't
use sqlplus.

All the jobs are as simple as they can be: build the command line (string concatenation),
run the job, get the PID, check whether a job with the given PID exists, and if not, consider the job done and log the message.
So there is not a lot of room for a resource leak on the 24x7 side either. I have, however, found one
bug that happens from time to time - sometimes the scheduler doesn't detect that a job with the given PID
no longer exists. Here is a code sample of the shared routine that detects whether a job with the given PID exists:

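Routine definition:
_ChkByPID (PID_No number) return Boolean

// lista_proc = process list, poszuk = search string, wynik = result
Dim (lista_proc,String)
Dim (poszuk, String)
Dim (wynik,Boolean)

// Get the current process list and search it for a newline followed by the PID
ProcessList (lista_proc)
ConCat ("\n",PID_No, poszuk)
Match (lista_proc,poszuk,wynik)
Return wynik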

If 24x7 (or PowerBuilder) takes the same approach to check whether the
process/thread is running (PID/TID existence), then that would explain the bug.
Standard program jobs also hang sometimes, and that might be explained by the
same issue.
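
One way a text search for a newline followed by the PID could report a finished process as still running is a prefix collision: PID 123 also matches a line that starts with 1234. The Python sketch below only illustrates that possibility. It assumes the process list is newline-separated text with the PID at the start of each line (which the "\n" + PID search suggests), and the sample list and function names are made up for illustration. It says nothing about how ProcessList, Match, or 24x7 itself actually behave.

```python
def pid_in_list_substring(process_list: str, pid: int) -> bool:
    """Mimic the text search: look for newline + PID anywhere in the list."""
    return ("\n" + str(pid)) in process_list


def pid_in_list_exact(process_list: str, pid: int) -> bool:
    """Compare the PID with the first whole token of every line."""
    for line in process_list.splitlines():
        fields = line.split()
        if fields and fields[0] == str(pid):
            return True
    return False


# Hypothetical process list; the real ProcessList format is an assumption.
sample = "\n4 System\n1234 sqlplus.exe\n2088 24x7.exe"

# PID 123 is not running, yet the substring search finds "\n123" inside "\n1234".
print(pid_in_list_substring(sample, 123))
print(pid_in_list_exact(sample, 123))
```

If the real list has this shape, appending the separator that follows the PID to the search string would avoid the collision without changing the routine's structure.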

The question is: can you do anything about that? Or is it the fault of PowerBuilder's internal
features, so that process/thread checking cannot be fixed?

Regards,
David

P.S. I'm using 15 queues, 50% of the jobs are synchronous, and I haven't observed
any difference.


Wed Nov 19, 2003 11:34 am
SysOp
Site Admin


Joined: 26 Nov 2006
Posts: 7969

Post subject: Re: job flow handling

Let me concentrate on the following paragraph:

"All the jobs are as simple as they can be. Build command line (string concatenation),
run job, get pid, check if job with given pid exists, if not consider job done and log the message."

Do they run as asynchronous jobs? (I mean the 24x7 jobs, not the SQL*Plus processes.)
Why do you check for the PID? What do you do if you cannot find it in the process list?
How extensively do you use the Script Library?

Can you post some simple script?

Please check the following:
1. Start NT Performance Monitor
2. Add "Handle" and "Thread" counters for "24x7.exe" to monitor.
3. Set the Performance Monitor to refresh automatically every 10 or 15 minutes and let it run for a day or two.
4. Please check whether the system instantly loses handles or threads, or whether there is any pattern.
If there is a pattern, try to identify which jobs were running at the time of the leak. Please let us know how these jobs are set up.
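
Once the counter log has been collected, a steady upward trend is easier to spot numerically than by eye. The Python sketch below fits a least-squares line through (time, counter value) samples and reports the growth per hour. The function name and the sample values are invented for illustration; they are not measurements from this system.

```python
def growth_per_hour(samples):
    """Least-squares slope of a counter against time, in units per hour.

    samples: list of (seconds_since_start, counter_value) pairs.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    num = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    den = sum((t - mean_t) ** 2 for t, _ in samples)
    if den == 0:
        raise ValueError("samples must span more than one point in time")
    return num / den * 3600.0


# Made-up handle counts sampled every 15 minutes (900 seconds).
steady = [(0, 500), (900, 500), (1800, 500)]
leaking = [(0, 500), (900, 510), (1800, 520)]

print(growth_per_hour(steady))
print(growth_per_hour(leaking))
```

A slope near zero over a day or two suggests no handle or thread leak; a persistently positive slope is the pattern worth correlating with the job log.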

I will be able to answer your other questions once I understand better what you do.

PS. I am sure this hanging issue can be solved.


Wed Nov 19, 2003 8:53 pm
David Ciechanowicz



Joined: 05 Nov 2003
Posts: 6

Post subject: Re: job flow handling

: Let me concentrate on the following paragraph: "All the jobs are as
: simple as they can be. Build command line (string concatenation),
: run job, get pid, check if job with given pid exists, if not consider job done
: and log the message."
: Do they run as asynchronous jobs? (I mean 24x7 job not SQL*Plus processes)
Some of them. I would say it's about 50%

: Why do you check for pid? What do you do if you cannot find it in the process
: list?
Because checking by name is pointless when more than one sqlplus.exe is running.
RunAndWait is not an option either - some of the jobs start ten simultaneous
sqlplus processes. BTW, I've noticed that putting a delay of a few seconds between starting
these processes lowers the hangup rate.

: How extensively do you use the Script Library?
There are three shared scripts. Two of them are used by every job. I included the first one
in my previous post. The second is here:

_InformujAdministratorow (_jobid string, _jobname string, _message string) return [None]

Dim (wiadomosc,String)
ConCat ("GECB-DES-BATCH\nZadanie Nr ",_jobid,wiadomosc)
ConCat (wiadomosc," ",wiadomosc)
ConCat (wiadomosc,_jobname,wiadomosc)
ConCat (wiadomosc,"\n",wiadomosc)
ConCat (wiadomosc,"Info: ",wiadomosc)
ConCat (wiadomosc,_message,wiadomosc)

// Dawid
MailSend ("BackupScheduler","","SMS","XXXX",wiadomosc)
// Marcin
MailSend ("BackupScheduler","","SMS","XXXX",wiadomosc)
// Dominik
MailSend ("BackupScheduler","","SMS","XXXX",wiadomosc)
// Pawel
MailSend ("BackupScheduler","","SMS","XXXX",wiadomosc)
// Jarek
MailSend ("BackupScheduler","","SMS","XXXX",wiadomosc)

This script is used as a notification action in case of a job error to send
an SMS to all system administrators. Up to now it has been executed twice.

: Can you post some simple script?
Here is one of the simplest:

// PID of the running process
Dim (PID_Fill,Number)
Dim (ScriptName,String,"smok_fill.SQL")

// Temporary variables (Poszuk = search string, Wynik = result)
Dim (Poszuk,String)
Dim (Wynik,Boolean)

// Run configuration
RunConfig ("WINDOW","MINIMIZE")
RunConfig ("TITLE","Smok wypelnianie przedplat")

// Start the queue update
Concat ("sqlplus ",global.BatchConnectString,poszuk)
ConCat (poszuk," @",poszuk)
ConCat (poszuk,global.ScriptPath,poszuk)
ConCat (poszuk,ScriptName,poszuk)
RUN (poszuk,"",PID_Fill)
Concat ("Uruchomiono wypelnianie kolejki przedplat - PID: ",PID_Fill,Poszuk)
LogAddMessageEx ("INFO","@V"job_id"","@V"job_name"",Poszuk)

// Wait for the update to finish

// Petla_Oczekujaca = waiting loop
Petla_Oczekujaca:
Wait(5)

_ChkByPID(PID_Fill,wynik)
If (wynik,Petla_Oczekujaca,Brak_Sesji)

// All parallel sessions have finished (Brak_Sesji = no sessions left)
Brak_Sesji:

Concat ("Wypelnianie kolejki przedplat koniec PID: ",PID_Fill,Poszuk)
LogAddMessageEx ("INFO","@V"job_id"","@V"job_name"",Poszuk)

: Please check the following: 1. Start NT Performance Monitor
I've already checked that, along with over 30 other counters for both the 24x7 process and
its threads. I used a 15-second interval to catch the moment, but had no luck.
The system appears to work fine.

: PS. I am sure this hanging issue can be solved.
I think so too.

Thu Nov 20, 2003 7:23 am
SysOp
Site Admin


Joined: 26 Nov 2006
Posts: 7969

Post subject: Re: job flow handling

I still don't understand why you check for process id, I mean what is the purpose of doing that?

Suggestion: In all asynchronous scripts that call the Script Library try adding Wait with at least 3 seconds at the end of the job.

I am just guessing, but multiple fast concurrent calls to the Script Library may in theory cause some references to the internal Script Library program object to be lost. If so, this could create an internal resource leak and in the end cause the job to hang. In theory, references to the internal Script Library object in the scheduler's code are not released immediately. Each call to the Script Library adds a new reference at the start; after the call completes, this reference is marked for deletion, and the internal "garbage collection" process then physically removes it from memory. If that "garbage collection" process doesn't keep up with the references being added and deleted, it could cause the described problem.

By the way, if you wish, you can use a single ConCatEx with a comma-separated list of strings to merge instead of multiple ConCat calls, and a single MailSend with a comma-separated list of recipients instead of multiple MailSend calls. This makes the scripts a little shorter and more efficient.


Thu Nov 20, 2003 10:02 am
David Ciechanowicz



Joined: 05 Nov 2003
Posts: 6

Post subject: Re: job flow handling

: I still don't understand why you check for process id, I mean what is the
: purpose of doing that?

I check to be sure that the job has ended and that I can start another process. I use this
method to do dynamic load balancing in some jobs. For example, a job consists of 30
executions of SQL*Plus with various parameters. Our performance tests on the database show
that 4 of these processes can run simultaneously. The script starts the processes and controls
how many of them are running at any given moment. If there are fewer than four, it starts the next
one in the queue; otherwise it waits until one of the previously started processes has finished.
To control the number of concurrent processes you just change one parameter in the script
(max_session_number).
I also use PIDs and PID logging to allow an administrator to stop (kill) whichever job he wants.
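
The throttling loop described above can be sketched in Python as follows. This is an illustration of the idea, not the JAL script itself: it tracks child processes through subprocess.Popen handles instead of searching a process list for PIDs, and the names run_throttled, max_sessions and poll_interval are hypothetical (max_sessions plays the role of max_session_number).

```python
import subprocess
import sys
import time


def run_throttled(commands, max_sessions=4, poll_interval=0.1):
    """Run every command, keeping at most max_sessions processes alive at once.

    Returns (peak concurrency observed, list of exit codes).
    """
    pending = list(commands)
    running = []      # Popen objects for processes that are still alive
    exit_codes = []
    peak = 0
    while pending or running:
        # Popen.poll() returns None while the process is still running.
        still_running = []
        for proc in running:
            code = proc.poll()
            if code is None:
                still_running.append(proc)
            else:
                exit_codes.append(code)
        running = still_running
        # Start queued commands until the concurrency limit is reached.
        while pending and len(running) < max_sessions:
            running.append(subprocess.Popen(pending.pop(0)))
        peak = max(peak, len(running))
        if running:
            time.sleep(poll_interval)
    return peak, exit_codes


if __name__ == "__main__":
    # Six short stand-in processes instead of real sqlplus sessions.
    cmds = [[sys.executable, "-c", "import time; time.sleep(0.2)"] for _ in range(6)]
    print(run_throttled(cmds, max_sessions=2))
```

Each Popen object also exposes the child's PID through its pid attribute, so the PID can still be logged for an administrator who needs to stop a particular process.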

: Suggestion: In all asynchronous scripts that call the Script Library try
: adding Wait with at least 3 seconds at the end of the job.
Ok. I'll try this one.

: I am just guessing that multiple and fast concurrent calls to the Script
: Library may in theory cause losing of some references to the internal
: Script Library program object. If this is the case, then it possibly
: creates some internal resource leaking and thus in the end causes the job
: hanging effect. In theory, references to the internal Script Library
: object in the code of the scheduler are not released immediately. Each
: call to the Script Library in the beginning adds a new reference and
: after the call is complete this reference is marked for deletion and then
: the internal "garbage collection" process physically removes the
: references from memory. If that "garbage collection" process
: doesn't keep up with the new references being added and deleted it could
: cause the described problem.
That sounds plausible. I'll certainly try this one and let you know the results.

: By the way, if you wish you can use single ConCatEx comma-separated list of
: merged strings instead of multiple ConCat and also single MailSend with
: comma-separated list of recipients instead of multiple MailSend. This way
: you can make scripts a little bit shorter and more efficient.
Thanks for the advice - I didn't know about these extensions to JAL.


Thu Nov 20, 2003 12:59 pm
All times are GMT - 4 Hours