SoftTree Technologies
Technical Support Forums
Stuck Queues, Jobs not executing.

 
robertk
Joined: 14 Jul 2008
Posts: 10
Country: United States

I have several new, frequently running jobs that have been acting up. I'll use one as an example. We have a job titled "ScanMachineReader", a console app, which runs once every five minutes. It runs well for a while (e.g. for 6 hours, sometimes an entire day), and then I'll check the Job Monitor and notice that there's one instance that says "Running" even though the log file from the actual application says it has completed OK. Then other instances of this job just pile up in the queue with the message "Queued, Awaiting Start" (as shown in the Web Console).

Here are some key properties/requirements of the job:

Requirements:
--Run once every 5 minutes
--Only one instance can run at a time

Current Job properties:
--Runs once every 5 minutes
--Runs a JAL script which calls RunAndWait("...ScanMachineReader.exe", 1800, id) to launch the program (gives generous 30 minutes to finish if needed.)
--Executes in its own dedicated Queue (only this job uses this queue)
--Runs Detached
--Asynchronous process property is NOT CHECKED (i.e. runs synchronously)
--Skip this job if delay is over 2 minutes


I figured that using RunAndWait(), along with Asynchronous=False and skipping job instances if the delay is more than 2 minutes, would ensure that only one instance runs at a time, and certainly would not back up the queue. On the surface, it doesn't seem that RunAndWait() forcefully quits the job if it takes longer than 1800 seconds (30 minutes). It doesn't even seem to know the job has completed.
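
For reference, the behavior expected from a timeout here (wait for the program, and kill it if it overruns) can be sketched outside the scheduler. This is a minimal Python illustration of those semantics only; it says nothing about how 24x7 or RunAndWait() work internally, and `run_with_timeout` is a hypothetical helper name.

```python
import subprocess
import sys


def run_with_timeout(command, timeout_seconds):
    """Run command and wait for it.

    Returns the exit code, or None if the process overran the timeout
    and had to be killed.
    """
    try:
        completed = subprocess.run(command, timeout=timeout_seconds)
    except subprocess.TimeoutExpired:
        # subprocess.run() kills the child and waits for it before re-raising,
        # so by the time we get here the direct child is gone.
        return None
    return completed.returncode
```

Something like `run_with_timeout(["ScanMachineReader.exe"], 1800)` would mirror the 30-minute allowance described above. Note that only the direct child process is killed; anything that child started itself is not covered by this sketch.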

I have also noticed this behavior in the other frequently running jobs. When the scheduler is in this stuck/backed-up state, it generally becomes unstable and doesn't schedule other pre-existing, long-term, trouble-free jobs. To get out of this situation, I have to log into the server and restart the 24x7 Service periodically. Right now, this is hardly reliable.

Here is my 24x7 setup:

Windows 2008 Server, 64 bit
24x7 Scheduler, Version 3.5.2, running as a Service

Ideas?
Mon Dec 21, 2009 3:38 pm

SysOp
Site Admin
Joined: 26 Nov 2006
Posts: 7850

Please set a 5-minute timeout for the job so that, if it gets stuck for whatever reason, the scheduler can terminate the job process instead of waiting indefinitely for the job to clear the queue.

Hope this helps
Mon Dec 21, 2009 4:56 pm

robertk
Joined: 14 Jul 2008
Posts: 10
Country: United States

OK, I tried what you said...putting a value of 30 in the Timeout field and I'm still having troubles with my queues.

Here's what I did...
-- Ditched the script with RunAndWait() and just put the program in "Command Line" field on Step 2 of 13 of the Wizard.
-- Put 30 in the Timeout field

All ran well for about 12 hours. When I checked the scheduler this morning, two of my queues were piling up again...after I specified the timeout.

Digging deeper:

In the schedule.log file, I found these entries from last night at 11:30 PM:

Code:

12/22/2009 23:30:00.823000   0   0   0   Integration.BMSCLetters Queue Manager   Job #60 - waiting for the process to complete
12/22/2009 23:30:00.823000   0   0   0   Integration.ScanMachineReader Queue Manager   Job #58 - waiting for the process to complete
12/22/2009 23:30:01.073000   0   0   0   Integration.NSSubscriptionsToWACS Queue Manager   Job #53 - waiting for the process to complete
12/22/2009 23:30:01.120000   0   0   0   Integration.WISEstimatesToSales Queue Manager   Job #59 - waiting for the process to complete
12/22/2009 23:30:01.260000   0   0   0   Integration.CentralBillingScanReader Queue Manager   Job #61 - waiting for the process to complete


Zeroing in on Job #59 and #61, two jobs that are in their own single queues, I found that Job #59 had no more entries in the schedule.log file until after I restarted the scheduler service this morning. Job #61 continued until 1:30 AM this morning and then just stopped logging entries in the schedule.log file. Now, when I looked at the Job Monitor this morning, both Job #59 and #61 had one instance with a green flag (indicating it was running), and then both jobs had numerous coffee cup icons after the green flag, I guess to indicate instances in the queues that were waiting to be executed.
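
For anyone repeating this kind of check, finding the last schedule.log entry for each job can be scripted rather than done by eye. The following is a minimal Python sketch that assumes the column layout visible in the excerpt above (timestamp, three numeric columns, source, message, separated by tabs or runs of spaces). The meaning of the numeric columns is a guess based on the excerpts, and the function names are hypothetical.

```python
import re
from datetime import datetime

# Assumption: columns are separated by a tab or by two or more spaces.
FIELD_SEP = re.compile(r"[ ]{2,}|\s*\t\s*")
JOB_REF = re.compile(r"Job #(\d+)")


def parse_line(line):
    """Return (timestamp, job_id, message), or None if the line does not fit."""
    parts = FIELD_SEP.split(line.strip(), maxsplit=5)
    if len(parts) != 6:
        return None
    stamp, _col2, job_col, _col4, _source, message = parts
    try:
        when = datetime.strptime(stamp, "%m/%d/%Y %H:%M:%S.%f")
    except ValueError:
        return None
    # Queue Manager lines carry the job number in the message ("Job #59 - ...");
    # other lines appear to carry it in the third column.
    ref = JOB_REF.search(message)
    if ref:
        job_id = int(ref.group(1))
    elif job_col.isdigit():
        job_id = int(job_col)
    else:
        return None
    return when, job_id, message


def last_entry_per_job(lines):
    """Map each job number to the (timestamp, message) of its latest entry."""
    latest = {}
    for line in lines:
        parsed = parse_line(line)
        if parsed is None:
            continue
        when, job_id, message = parsed
        if job_id == 0:  # scheduler-level entries, not tied to a job
            continue
        if job_id not in latest or when >= latest[job_id][0]:
            latest[job_id] = (when, message)
    return latest
```

Passing the open log file to `last_entry_per_job()` and sorting the result by timestamp puts the jobs that went quiet earliest at the top, which is the pattern described for Job #59 and Job #61.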

Now, after I restarted the service this morning, from the log file, here are some entries:

Code:

12/23/2009 6:59:08.669000   0   0   0   24x7 Scheduler   [24x7 service] 24x7 Scheduler starting...
12/23/2009 7:00:00.607000   0   0   0   OPIS.OpisUsageFileLoader Queue Manager   Job #49 - waiting for the process to complete
12/23/2009 7:00:00.607000   0   0   0   Integration.WISEstimatesToSales Queue Manager   Job #59 - waiting for the process to complete
12/23/2009 7:00:00.732000   0   0   0   Integration.NSSubscriptionsToWACS Queue Manager   Job #53 - waiting for the process to complete
12/23/2009 7:00:00.732000   0   0   0   Integration.ScanMachineReader Queue Manager   Job #58 - waiting for the process to complete
12/23/2009 7:00:00.967000   0   0   0   Integration.CentralBillingScanReader Queue Manager   Job #61 - waiting for the process to complete
12/23/2009 7:00:00.982000   0   0   0   Integration.BMSCLetters Queue Manager   Job #60 - waiting for the process to complete
12/23/2009 7:00:01.592000   0   59   0   WISEstimatesToSales   Job started.
12/23/2009 7:00:02.154000   0   49   0   OPIS Usage File Loader   Job started.
12/23/2009 7:00:02.342000   0   59   0   WISEstimatesToSales   Start message sent.
12/23/2009 7:00:02.888000   0   49   0   OPIS Usage File Loader   Start message sent.
12/23/2009 7:00:03.295000   0   61   0   CentralBillingReader   Job started.


After the scheduler restart, things seemed to work as expected. But, unfortunately, I am just waiting for the queues to get stuck again. Both example jobs are in their own dedicated queues, run synchronously, and have a timeout of 30 minutes.

Why don't I see any log entries for Job #59 after 11:30 PM on 12/22/2009? I would at least expect to see something like "Job #59 - waiting for the process to complete" for the next 30 minutes (the timeout period), and then a successful attempt again after the scheduler forcefully stopped the job after the time out period.

Please shed some light on this...this is VERY frustrating.

Thanks
Robert
Wed Dec 23, 2009 11:17 am

SysOp
Site Admin
Joined: 26 Nov 2006
Posts: 7850

Please verify that both jobs are set to run detached. If not, please set them to run detached. This typically cures most of the problems of this type.

Also, please let us know whether these jobs execute any notification actions on job completion or job finish, such as sending emails, sending SNMP traps, updating a database, or any other sort of post-job processing activity that could play a role.
Wed Dec 23, 2009 7:47 pm

robertk
Joined: 14 Jul 2008
Posts: 10
Country: United States

OK, both of these jobs run detached. Both jobs write to a HealthMonitoring database, usually in OnStart and OnFinish, just to keep a log of when the job started, finished, or encountered an error.

I don't think the database processing is the cause of the errors however, because usually one job's queue will get stuck, but the other one will not and it will continue to write to the database. Even so, I have the timeout set on the jobs, but after the timeout period elapses there are no more log entries in the schedule.log file until I restart the scheduler.

In addition to these 2 jobs, I have 4 more, with similar properties, and they all sooner or later get stuck. To summarize, all these jobs have these similarities:
-- Run detached
-- Have timeouts set
-- Run SYNCHRONOUSLY
-- Have short time intervals of 10 minutes or less
-- All run in a dedicated queue

All of these jobs were recent additions to the scheduler, within the last 6 weeks. Up until then, the scheduler had behaved pretty well, with only a minor hiccup now and then.

Over the long holiday weekend, I had to periodically log in from home to restart the scheduler service, which was a real drag. Unfortunately, we have several more jobs in the pipeline to add to the scheduler soon, but now I am very reluctant to do so because of these problems.

Any other ideas?
Mon Dec 28, 2009 10:06 am

SysOp
Site Admin
Joined: 26 Nov 2006
Posts: 7850

Which version of the scheduler is that?


By the way I suspect that notification actions are the root cause, you may need to tune them a bit. They are executed by the scheduling engine before/after job completion, that's why jobs get stuck after they seem to complete.

Please describe how these notification actions are currently implemented. Are they using JAL, VBScript, external command, something else? If JAL or VBScript, which database driver is used for the connection?
Mon Dec 28, 2009 10:34 am

robertk
Joined: 14 Jul 2008
Posts: 10
Country: United States

Scheduler version:

24x7 Scheduler, Version 3.5.2, running as a Service
Windows 2008 Server SP1, 64 bit, 1 GB memory

Since your last post this morning, I commented out any calls to any database logging in the script events since they are not essential. Now, I only have "Job Error" checked on step 8 of 12 of the wizard and I send an email to myself if there is an error. Previously, I was using JAL to write info to the database.

However, since I made this change, one of the jobs' queues has backed up again (the job runs once a minute), and it has passed the timeout period of 10 minutes. I'm seeing more instances queuing up. Also, shouldn't I have gotten an error email for the elapsed timeout? And lastly, I have "Skip this job if delay is more than 5 minutes" set. Why are more instances being added to the queue if the delay is more than 5 minutes?
Mon Dec 28, 2009 12:25 pm

SysOp
Site Admin
Joined: 26 Nov 2006
Posts: 7850

Yes, you should have gotten error emails, provided that (1) email notification actions are enabled for this job, (2) global email settings are set up correctly and set to use SMTP, and (3) emails are not blocked by a firewall or something else.

Please check schedule.log for the time this issue occurred last and check if there are any messages like "Sending alert..." "alert sent" and similar.

I wish you had a 3.6.x version, which generates a debug trace for all the processing, so we could quickly find what goes wrong there. Well, let's try to figure that out without the trace.

Please post the complete script of the notification action for the database. Is this the only notification action for the job? If there are other actions, please let us know what they are. Emails can also cause problems if they get stuck when blocked.
Mon Dec 28, 2009 2:17 pm

robertk
Joined: 14 Jul 2008
Posts: 10
Country: United States

Hello again. :-)

In my previous post I mentioned that I took out any notification actions that write to the database in my problem jobs. My jobs are no longer executing any notification "scripts" that write to a database. They just use the built-in notification by having the "Job Error" option checked on the line that reads "Send email Message" on step 8 of 12 in the wizard. However, I did not see anything in schedule.log indicating that something was sent. Finding this interesting, I set up a very simple job that executed a batch file that didn't exist. Funny thing: I got the "Start" message in my inbox, but neither the "error" message nor the "finish" message. Here are all the log entries for my simple job titled "Test Error Email":

Code:

12/28/2009 14:16:55.221000   0   66   0   Test Error Email   Job started.
12/28/2009 14:16:55.268000   0   66   0   Test Error Email   Start message sent.
12/28/2009 14:16:55.283000   2   66   0   Test Error Email   Create process failed. Return code: 2 - The system cannot find the file specified. 
12/28/2009 14:16:55.299000   2   66   0   Test Error Email   Job failed. Executable file not found.


Anyway, this is a bit beside my original problem of the jobs piling up in job queues. Are my settings appropriate for my frequently running jobs (see previous post for settings)? What about my operating system and memory? I looked at the system requirements, and my system is above your minimum standards. Also, in your release notes for versions higher than mine, I see an ambiguous "Fixes for known bugs". Is there a list of these, or can you tell me if my type of problem has been addressed in a later release? What else can I do at this point?
Mon Dec 28, 2009 4:18 pm

SysOp
Site Admin
Joined: 26 Nov 2006
Posts: 7850

I'm really concerned about on finish/on error notification actions. I think that is where the issue is, not in the jobs. I understand that you commented out part of the code writing to the database, but you still have the notification action in place.

Please do me a favor, restart the scheduler in GUI mode, not as a service, and try your test email job. Let's see if that works. If it doesn't, we will need to find out the reason, maybe the first email breaks everything. If it works, let's see what happens to regular jobs, whether they can send error emails and whether they get stuck again.
Mon Dec 28, 2009 5:45 pm

robertk
Joined: 14 Jul 2008
Posts: 10
Country: United States

Unfortunately, I cannot currently run 24x7 in GUI mode since I don't have the admin credentials that the service uses to log into the server (security is over the top here.).

Anyway, for all of my frequently running jobs that run synchronously, I took out ALL notifications. Things were going well for most of yesterday until later in the afternoon, when one of the job queues got stuck again.

This morning I dug into the Windows logs (Control Panel > System and Maintenance > View Problem History) and found a boat load of errors, two of them since midnight. Here is the text of one of the logs:
Code:

Product
24x7.exe

Problem
Stopped working

Date
12/30/2009 12:30 AM

Status
Report Sent

Problem signature
Problem Event Name:   APPCRASH
Application Name:   24x7.exe
Application Version:   3.5.1.0
Application Timestamp:   3b453b55
Fault Module Name:   PBVM70.dll
Fault Module Version:   7.0.3.10095
Fault Module Timestamp:   3b45e726
Exception Code:   c0000005
Exception Offset:   002554dc
OS Version:   6.0.6001.2.1.0.272.7
Locale ID:   1033
Additional Information 1:   893c
Additional Information 2:   4b02343b23b7fd292c04c9bff2483f68
Additional Information 3:   f6f8
Additional Information 4:   8af36f96b541fe0e9b7a13e3857a5531

Extra information about the problem
Bucket ID:   1551555318


In the log above, I think the 24x7.exe is an instance of the scheduler, since I run things "detached". I also see similar entries for some of our applications that 24x7 runs in the queues. So, if one of my apps stops responding, shouldn't the timeout kill the process? Why then do I see numerous 24x7 crashes in the logs as well? BTW, I also see numerous similar logs for HTMLGEN.exe, which I think is your logging feature.

And finally, a separate question: I would like to know if I am setting my jobs up correctly for synchronous runs with only one instance running at a time. (Key Settings: Detached, timeout set, run SYNC, run single job in dedicated queue, Skip job if delay is over X minutes). I want to know if my approach would be considered a "best practice".

Thanks!
Wed Dec 30, 2009 11:13 am

SysOp
Site Admin
Joined: 26 Nov 2006
Posts: 7850

I think the job process crashes and Dr. Watson or another debugger kicks in, asking to debug the process; that's why it cannot be terminated on timeout. Since they are all running on the invisible services desktop, you cannot see these debugger prompts or process crash messages, and you cannot interact with them.

I also see a "Report Sent" status. Is that generated by the Windows Error Reporting Service configured to automatically send all errors to Microsoft? If that's the case, the Error Reporting Service would display an interactive "Message Sent" dialog, which you cannot see. By the way, 24x7 version 3.6 features its own built-in job crash handler, which generates diagnostic information and saves it in separate XML and log files, making it much easier to troubleshoot such issues.


Well, now we know that notification actions are likely not the issue and the initial theory was wrong. Please check your Windows event logs and try to match the timing of the crash reports against the timing of jobs getting stuck in the queue. If you find a pattern there, that would mean that job process crashes are causing the issue. In that case, you would first need to adjust the Windows settings so that crashed processes are let go without manual intervention and without sending error reports. This will give you time to search for the root cause of the process crashes. If you don't find a pattern, let's discuss how to reconfigure the jobs to make them run out of the queue and avoid queue stagnation issues.
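
The timing comparison suggested here can be done mechanically once both sets of timestamps are collected. Below is a minimal Python sketch; `crashes_near` and the 15-minute window are assumptions chosen for illustration (the problem reports quoted earlier only have minute resolution, so some slack is needed), and the timestamps have to be gathered by hand from the Windows problem history and from schedule.log.

```python
from datetime import datetime, timedelta


def crashes_near(stall_time, crash_times, window=timedelta(minutes=15)):
    """Return the crash timestamps that fall within `window` of the moment
    a job's log entries stopped (its apparent stall time)."""
    return sorted(c for c in crash_times if abs(c - stall_time) <= window)


# Crash time taken from the problem report quoted in this thread.
reported_crashes = [datetime(2009, 12, 30, 0, 30)]

# Hypothetical stall time, for illustration only; the real value would be
# the last schedule.log entry of the stuck job.
example_stall = datetime(2009, 12, 30, 0, 35)

matches = crashes_near(example_stall, reported_crashes)
```

If most stalls have at least one crash report in their window, that supports the crash theory; if the matches are rare, the cause is more likely elsewhere.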

Quote:

And finally, a separate question: I would like to know if I am setting my jobs up correctly for synchronous runs with only one instance running at a time. (Key Settings: Detached, timeout set, run SYNC, run single job in dedicated queue, Skip job if delay is over X minutes). I want to know if my approach would be considered a "best practice".


Yes, that's a good approach if you have a relatively small number of jobs. Please keep in mind that the number of queues in the Windows edition of 24x7 typically shouldn't exceed 10-12 because of resource constraints. In the Multi-platform version, any number of queues is OK because the resource management is much more efficient.
Wed Dec 30, 2009 11:50 am

robertk
Joined: 14 Jul 2008
Posts: 10
Country: United States

Sorry I haven't replied in a while, but we are still having problems. I think at this point, we are going to upgrade to the latest version and see how that goes (with improved diagnostics, etc.). If we need to move to the multi-platform edition with the better queue management as you mention, is the cost of moving to that version FREE if we already have a site license for the 24x7 Automation Suite? On your pricing page, it reads:

24x7 Scheduler Multi-platform Edition, v. 4.3, discount for 24x7 Automation Suite users (site license) FREE

Also, do you know if the Multi-platform edition can run simultaneously on the same box as the 24x7 Scheduler, Windows edition? That way we can port jobs from one version to the other on the same box without having to set up a new machine.


Thanks,
Robert
Wed Jan 06, 2010 12:49 pm

SysOp
Site Admin
Joined: 26 Nov 2006
Posts: 7850

The MP version is actually included with the site license. If your license includes the maintenance option, then upgrading to the latest versions is also free.

MP Edition can be run concurrently on the same box, unless you want to run it in Windows service mode. In service mode, only one service can be installed by default, because the name of the service is the same for the Windows and MP Editions. However, it is possible to set up multiple services manually by editing the registry (install/rename/restart).
Wed Jan 06, 2010 8:36 pm

All times are GMT - 4 Hours