Author |
Message |
LeeD
Joined: 17 May 2007 Posts: 311 Country: New Zealand |
|
Scheduler hangs |
|
Hi guys
I'm having some problems with non-production installs of 24x7 running as a process.
In my dev environment, which I access via RDP over a VPN, I get temporary hangs every 1-2 minutes. The window says not responding, the menu bar goes white (?). When I click on the window, it comes back after 3 or 4 further seconds.
The timing is not predictable exactly.
On the copy I run on my workstation it's worse again: using the same job db, it hangs permanently rather than temporarily, and the process needs killing before it will come back.
There are plenty of custom jobs in the db obviously but nothing that screams out as broken.
Any ideas? Might a file watch that can't see its target cause this?
|
|
Wed Oct 03, 2007 5:43 pm |
|
 |
SysOp
Site Admin
Joined: 26 Nov 2006 Posts: 7951
|
|
|
|
If the menu bar goes white, it means the scheduler main process is busy doing something or waiting for something. This can happen if you have lots of event-based jobs, such as jobs watching for files or emails, or if these events are resource intensive. A typical example is a file watch for network-based files. This type of event monitoring is a common killer.
This issue could also be a result of job notification actions, but that is less likely. For example, if job logging is forwarded to a database table, or lots of emails are sent after many jobs, these can also make the main process busy.
I suggest you start by looking at jobs that check for network files or other network-based resources. If there are many of them, or they are affected by slow networks, such file monitoring should be redesigned and the events moved to the 24x7 Event Server. The jobs themselves, I mean the main business logic, can be kept in the scheduler, with just the monitoring part moved to the Event Server.
|
|
Wed Oct 03, 2007 10:38 pm |
|
 |
LeeD
Joined: 17 May 2007 Posts: 311 Country: New Zealand |
|
|
|
How many file watch jobs is too many? On a dedicated physical server.
|
|
Wed Oct 03, 2007 10:46 pm |
|
 |
SysOp
Site Admin
Joined: 26 Nov 2006 Posts: 7951
|
|
|
|
When they make the scheduler too busy checking files instead of running jobs, then you have "too many" of them.
Seriously, there is no good number. A single job looking for network files can easily hang the entire server graphical interface, not just the scheduler. Explorer is involved in many file operations, and if something goes wrong, it appears hung or very slow and screens don't redraw well.
Also, if you have file-mask-based watches that check directories containing lots of files, or files that constantly change, you can see a performance hit affecting related file operations.
How many such jobs do you have?
|
|
Wed Oct 03, 2007 11:01 pm |
|
 |
LeeD
Joined: 17 May 2007 Posts: 311 Country: New Zealand |
|
|
|
There's like 6. In prod, 4 work fine.
|
|
Wed Oct 03, 2007 11:02 pm |
|
 |
LeeD
Joined: 17 May 2007 Posts: 311 Country: New Zealand |
|
|
|
What makes the event server a better option?
|
|
Wed Oct 03, 2007 11:09 pm |
|
 |
SysOp
Site Admin
Joined: 26 Nov 2006 Posts: 7951
|
|
|
|
This is really nothing, provided the network is not involved and you are checking local files only. If the network is involved, wait until it starts happening again, then open Windows Explorer and type the same path in Explorer's address bar. Check how long it takes Explorer to display the folder contents and whether it hangs too.
If Explorer opens that folder fast, look for other event-based jobs or notification actions such as moving multi-megabyte files before/after job start, sending large emails, updating databases, etc.
If you have the HTML Reports option enabled, make sure the files are not generated/updated on a network drive. They should be generated locally and then copied periodically using some background job.
Please let us know what you find.
|
|
Wed Oct 03, 2007 11:12 pm |
|
 |
LeeD
Joined: 17 May 2007 Posts: 311 Country: New Zealand |
|
|
|
All of the file watches are network based.
So on my local workstation I have disabled the job that references a server that I know accesses slowly, and the account I am running under doesn't have permission to access the other network semaphores. I only have it on there to manipulate the job database (transfer jobs from dev to pre-prod file) but it's not usable in this manner.
In the dev environment all file watches are still network but are on virtual servers in the same network zone. Nothing dramatic.
HTML reports go to the local drive. There is no other major network activity on the dev environment.
|
|
Thu Oct 04, 2007 4:28 pm |
|
 |
SysOp
Site Admin
Joined: 26 Nov 2006 Posts: 7951
|
|
|
|
To avoid depending on network performance, you may need to switch the file watches to the Event Server, or convert the jobs to regular time-based jobs and set their job script to check for files first and, if none are found, exit without doing anything else. This way all file checking goes to the background and does not affect anything else.
My personal preference is the Event Server because it is much more efficient and specifically designed to handle such things. On the other hand, it requires installing an additional service. In the event definitions, you can set events to trigger the existing 24x7 jobs; at the same time, set the schedule of these jobs to [No Schedule].
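As a rough illustration of the "check first, exit if nothing is there" pattern, here is a minimal sketch in Python rather than in the scheduler's own script language. The function names and the file mask are hypothetical, and the `process` callback stands in for the real business logic of the job.

```python
import glob


def find_ready_files(pattern):
    """Return the files currently matching the mask, sorted by name.

    A missing or unreachable folder simply produces an empty list.
    """
    return sorted(glob.glob(pattern))


def run_job(pattern, process):
    """Run one scheduled pass: process every matching file, or do nothing.

    Returns the number of files handled, so 0 means "nothing to do".
    """
    files = find_ready_files(pattern)
    if not files:
        return 0  # nothing found; exit quietly until the next scheduled run
    for path in files:
        process(path)
    return len(files)


if __name__ == "__main__":
    # Hypothetical mask; a real job would point at the watched share.
    handled = run_job("incoming/*.csv", lambda p: print("processing", p))
    print("handled", handled, "file(s)")
```

Scheduled every few minutes as an ordinary time-based job, a script like this keeps any slow file check out of the scheduler's main loop. The trade-off is that files are picked up on the next run instead of immediately.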
|
|
Thu Oct 04, 2007 4:44 pm |
|
 |
LeeD
Joined: 17 May 2007 Posts: 311 Country: New Zealand |
|
|
|
Is there any disaster recovery or failover code built into event server?
Also, the network these are on is an internal company LAN with minimum bandwidth of 100 mb and insignificant latency. Performance as such should not be an issue, right?
|
|
Thu Oct 04, 2007 6:12 pm |
|
 |
LeeD
Joined: 17 May 2007 Posts: 311 Country: New Zealand |
|
|
|
Also what makes the event server less vulnerable to these issues? Surely if the network is a bottleneck then event server will suffer the same issues.
|
|
Thu Oct 04, 2007 6:13 pm |
|
 |
SysOp
Site Admin
Joined: 26 Nov 2006 Posts: 7951
|
|
|
|
The Event Server has a different architecture and no GUI. It runs as a service, using multiple threads to monitor different events. Anyway, before you look into this further, please verify that the issue is caused by network latency. You can, for example, set up a test script job that checks for files using the FileExists or Dir statement with the same file mask and records in a log file how long each check took. Review the log after some time and check the times. If you see spikes, then you have some latency. This test job can be run in the background periodically, or set to start once and loop forever until killed.
If the issue is not latency, you need to find what else causes these hang-ups. I suggested your network only because the network is the most common cause.
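The timing test described above could look roughly like the following. This is a sketch in Python, not in the scheduler's script language, so `glob.glob` stands in for the FileExists/Dir check. The function names, the tab-separated log layout, and the spike threshold are all assumptions made for illustration.

```python
import glob
import time


def timed_check(pattern):
    """Run one file-mask check and return (match_count, elapsed_seconds)."""
    start = time.perf_counter()
    matches = glob.glob(pattern)
    elapsed = time.perf_counter() - start
    return len(matches), elapsed


def probe(pattern, log_path, runs, pause_seconds):
    """Repeat the check, appending one timestamped line per run to log_path.

    Returns the list of elapsed times so they can be inspected afterwards.
    """
    timings = []
    with open(log_path, "a", encoding="utf-8") as log:
        for i in range(runs):
            count, elapsed = timed_check(pattern)
            stamp = time.strftime("%Y-%m-%d %H:%M:%S")
            log.write(f"{stamp}\t{pattern}\t{count}\t{elapsed:.3f}\n")
            log.flush()  # keep the log readable while the probe is running
            timings.append(elapsed)
            if i < runs - 1:
                time.sleep(pause_seconds)
    return timings


def spikes(timings, threshold_seconds):
    """Return the checks that took longer than the chosen threshold."""
    return [t for t in timings if t > threshold_seconds]
```

For example, `probe(mask, "filecheck.log", runs=120, pause_seconds=30)` would sample the same mask for about an hour. Any entries returned by `spikes(timings, 1.0)` would point at checks that took longer than a second.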
|
|
Thu Oct 04, 2007 6:43 pm |
|
 |
LeeD
Joined: 17 May 2007 Posts: 311 Country: New Zealand |
|
|
|
Ok so the file watch jobs are indeed the culprit as when I disable them all no more hangs.
I will try to figure out which one it is.
|
|
Thu Oct 04, 2007 8:21 pm |
|
 |
LeeD
Joined: 17 May 2007 Posts: 311 Country: New Zealand |
|
|
|
Seems it's none in particular. Network performance doesn't seem to be the problem; it's more the fact that the target folders aren't available.
Still I would have thought the scheduler would deal with it more elegantly than that.
|
|
Thu Oct 04, 2007 9:13 pm |
|
 |
SysOp
Site Admin
Joined: 26 Nov 2006 Posts: 7951
|
|
|
|
Well... it is not very good at dealing with issues such as latency. The file-checking process is part of the time/event loop, and to be efficient it has to be very fast. When the network is involved and things get slow, it starts affecting everything. There are of course alternative ways to deal with such issues. Just to list a few of them (don't pay attention to the order in which they are listed):
1. Don't check files on network shares; use local schedulers to check local files and trigger jobs on the master scheduler when they find the required files. This method relieves the network, but requires maintaining several scheduler instances and jobs in different places.
2. Use the 24x7 Event Server. This is the thing that has been built from the ground up to do the heavy lifting and check for events efficiently. A scheduler job can be attached to any particular event with a simple mouse click and by typing the job number in the event notification properties.
3. Don't schedule file-watch jobs using the "file exist" schedule type. Instead, schedule periodic time-based jobs that run in the background, check for the required files, and do their work or trigger other jobs when the required files become available. This method makes sense when frequent file checks are not needed; if such checks are done every 10-15 minutes, they should not put much stress on the system and network.
4. Use background jobs that sleep and loop forever, waiting for file changes using the DirWaitForUpdate command (not recommended for network files, because even brief network outages can trigger false runs and make programming such jobs more difficult).
5. Don't watch for files; make the processes that create these files trigger jobs in 24x7 after creating the file (this is by far the most efficient method).
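To illustrate why a slow check inside a single time/event loop stalls everything, and why handing the check to a worker thread avoids that, here is a generic Python sketch. It is not how the scheduler or the Event Server is implemented. The `check_with_timeout` helper and the pool size are assumptions made for the example.

```python
import os
from concurrent.futures import ThreadPoolExecutor, TimeoutError

# Small shared pool: slow checks occupy a worker, not the calling loop.
_pool = ThreadPoolExecutor(max_workers=4)


def check_with_timeout(path, timeout_seconds, exists=os.path.exists):
    """Check a path on a worker thread and wait at most timeout_seconds.

    Returns True or False if the check finished in time, or None if it is
    still pending (for example, an unavailable share that has not yet
    timed out at the OS level). The `exists` parameter lets a caller
    substitute a different check function.
    """
    future = _pool.submit(exists, path)
    try:
        return future.result(timeout=timeout_seconds)
    except TimeoutError:
        return None  # treat as "unknown" and let the main loop carry on
```

The worker thread still waits for the operating system call to return, so this does not make an unavailable folder respond any faster. It only keeps the caller, such as a loop that also has to redraw a window or start timed jobs, from being blocked while that happens.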
|
|
Thu Oct 04, 2007 10:22 pm |
|
 |
|