Left files in Queue directory

The first thing that comes to mind when I hear the word "service" is that in service mode the environment settings and user account privileges typically differ from those in interactive application mode. Sometimes this leads to nasty issues, with jobs trying to access networked resources and hanging when those resources aren't accessible to them, waiting forever for some action. Simple boolean logic such as "IF files exist DO step 1 ELSE DO step 2" can take the wrong step if the service account simply cannot access the location where it needs to check for files. So it takes the wrong step and hangs there. Of course, this is just a theory. Your situation might be different, but if you can confirm that this never happens in interactive mode, you will know where to look for the root cause of the problem.
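To illustrate the point about the boolean check, here is a minimal sketch that separates "file is missing" from "location cannot be checked at all". The function name `check_trigger` and its return codes are hypothetical, not part of any scheduler; it only assumes a POSIX shell.

```shell
#!/bin/sh
# Hypothetical helper: report whether a trigger file exists, is missing,
# or cannot be checked because its directory is not reachable.
# Return codes (illustrative only): 0 = present, 1 = missing, 2 = cannot check.
check_trigger() {
    file=$1
    dir=$(dirname "$file")

    # If this account cannot see or search the directory, a plain
    # "does the file exist?" test would wrongly answer "missing".
    if [ ! -d "$dir" ] || [ ! -r "$dir" ] || [ ! -x "$dir" ]; then
        echo "cannot check: $dir is not accessible" >&2
        return 2
    fi

    if [ -e "$file" ]; then
        return 0
    fi
    return 1
}
```

A job script could then treat return code 2 as an error to report, rather than falling through to the "ELSE" step.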

I totally understand what you are saying.
However, I am running all my scripts under the same account and the jobs run fine. I think this has something to do with scripts getting called concurrently and something getting mixed up.

What can I do to debug something like this?

Thanks.

Can you run it as a non-service for a little while, just to establish whether service mode has a hand in it?

OK, it's definitely not a GUI vs. service mode issue. I just had a bunch of jobs back up in the queue.
I took a look at the box where the script runs, but I do not see any hung process there.

How can I troubleshoot further?

Thanks

I took a look at the agent log. The job that is stuck is called: 01_check_st_ctrl_trigger

8-Jul-2009 09:17:00 AM 2 null 470 01_check_st_ctrl_trigger Remote job started.
8-Jul-2009 09:17:00 AM 2 null 470 01_check_st_ctrl_trigger Job started.
8-Jul-2009 09:17:00 AM 3 null 470 01_check_st_ctrl_trigger Job completed with exit code 1. This exit code does not satisfy job exit code condition. Job failed.
8-Jul-2009 09:17:00 AM 3 null 470 01_check_st_ctrl_trigger Job completed with exit code 1. This exit code does not satisfy job exit code condition. Job failed.
8-Jul-2009 09:22:00 AM 2 null 470 01_check_st_ctrl_trigger Remote job started.
8-Jul-2009 09:22:00 AM 2 null 470 01_check_st_ctrl_trigger Job started.
8-Jul-2009 09:22:00 AM 3 null 470 01_check_st_ctrl_trigger Job completed with exit code 1. This exit code does not satisfy job exit code condition. Job failed.

Notice how, when the first occurrence happened, there were two messages with "Job completed with exit code 1."

However, when the job got stuck, there is only one occurrence.
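One way to spot such runs without reading the log by eye is to count the "Job completed" records per timestamp. This is only a sketch: it assumes each record starts with date, time and AM/PM as in the excerpt above, and `count_completions` is a made-up name.

```shell
#!/bin/sh
# Count "Job completed" records per timestamp in an agent log.
# Assumes the first three fields are date, time and AM/PM, as in the excerpt.
# Usage: count_completions LOGFILE
count_completions() {
    awk '/Job completed/ { n[$1 " " $2 " " $3]++ }
         END { for (k in n) print k ": " n[k] }' "$1" | sort
}
```

Runs that show a count of 1 where the normal runs show 2 would be the ones to compare against debug traces.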

Thanks.

Some further information: I tried to kill the job through the queue manager.
When I click Delete and confirm with Yes, it does not kill the process.

I know I am asking a lot of questions, but I am trying to figure this out.
How do I go about clearing the queue once it's stuck? I deleted all the .q files from the directory, but it is still not cleared.

Do I have to restart the scheduler every time this happens?

The other thing I am going to try is to have time-outs on my file_checker jobs. I'm thinking one minute is more than enough for the check script to run.
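If the time-out has to be enforced in the script itself rather than in the job settings, a wrapper along these lines might do. It is a sketch that assumes GNU coreutils `timeout` is installed on the agent host; `run_with_limit` is a hypothetical name.

```shell
#!/bin/sh
# Run a command with a time limit; assumes GNU coreutils `timeout` is available.
# Usage: run_with_limit SECONDS COMMAND [ARGS...]
run_with_limit() {
    limit=$1
    shift
    rc=0
    timeout "$limit" "$@" || rc=$?
    # GNU timeout exits with status 124 when the time limit was reached.
    if [ "$rc" -eq 124 ]; then
        echo "timed out after ${limit}s: $*" >&2
    fi
    return "$rc"
}

# Example: give the check script one minute.
# run_with_limit 60 ./file_checker.ksh <args>
```

The original exit code of the command is passed through unchanged, so the job's exit code condition still sees what the check script returned.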

Thanks for any input.

Do all these jobs run remotely? When they get stuck, how many jobs are running/being sent to the same agent (at that point in time)?

Please enable tracing on the agent. We should compare timing of events recorded in the debug.log file on the agent with log records on the scheduler side.

Also, in case they are all remote, please provide a brief description of what these jobs do, how they are currently set up, and how they authenticate to the agent.

Yes all the jobs run remotely.
I have enabled tracing on the agent.

It may be some time before this happens again, so I will keep you posted.

Thanks. I sense that this issue might be related to job concurrency and agent authentication.

I recall another case in which an agent was too busy to respond to all authentication requests in time, causing some requests to time out. For the timed-out authentications, the agent never sent any response back to the scheduler. As a result, some jobs were getting stuck, neither rejected nor started, and piling up in the queue. The solution chosen in that case was to set up 2 concurrent job queues for each agent (there were multiple agents in the configuration), spread the jobs evenly across the queues, and set them to run synchronously. Basically, that solution limited the number of concurrent remote sessions between each agent and scheduler pair to 2 at a time.

So set up 2 agents in the master scheduler that point to the same server?
Also set up 2 queues?

Then split some of the jobs between them?

Is there a maximum number that acts as a threshold for when this starts happening?

Thanks again for the help.

A couple more questions.

If you look at the log that I posted, it does not appear to be an agent timeout error, because the job was started. I think this has to do with the agent trying to kill the process, as in the original post. What's interesting is that one entry is missing. Here's what I posted before:

RUN BEFORE the hang:

8-Jul-2009 09:17:00 AM 2 null 470 01_check_st_ctrl_trigger Remote job started.
8-Jul-2009 09:17:00 AM 2 null 470 01_check_st_ctrl_trigger Job started.
8-Jul-2009 09:17:00 AM 3 null 470 01_check_st_ctrl_trigger Job completed with exit code 1. This exit code does not satisfy job exit code condition. Job failed.
8-Jul-2009 09:17:00 AM 3 null 470 01_check_st_ctrl_trigger Job completed with exit code 1. This exit code does not satisfy job exit code condition. Job failed.

RUN WHEN IT HANGS:

8-Jul-2009 09:22:00 AM 2 null 470 01_check_st_ctrl_trigger Remote job started.
8-Jul-2009 09:22:00 AM 2 null 470 01_check_st_ctrl_trigger Job started.
8-Jul-2009 09:22:00 AM 3 null 470 01_check_st_ctrl_trigger Job completed with exit code 1. This exit code does not satisfy job exit code condition. Job failed.

Notice how the extra entry for "Job completed with exit code 1." is missing from the second run?
I think, as with the poster who started this thread, it has something to do with the killing of the process.

Also, looking over the documentation, it's recommended to run shell scripts like this:

/bin/sh -c "/home/srv_etl/file_scripts/file_checker.ksh /loads/dropoff/momentum/execuwriters/1010.fil momentum N"

Usually I just put the full path to the script, like this:
/home/srv_etl/file_scripts/file_checker.ksh /loads/dropoff/momentum/execuwriters/1010.fil momentum N

Could this be causing issues?

In regard to the number of agents and queues, that's not exactly what I meant. The case I referred to had 10 agents or so running on different servers and working with the same scheduler.

I don't think the command really matters. I noticed the missing line in the output. Still, I'd like to check the debug.log file from the agent for differences in the job behavior.

How do you know that the agent is trying to kill the process? Does it say "terminating process" or "process killed" or something like that in the log on the agent system?

No, I was going off the original poster's comments; it appears very similar to what I am dealing with.

Thanks.

I created a check queue job that goes into all of my queues and checks for files older than an hour, so I can be notified via e-mail if the queues get stuck.

An hour is good for most queues. Some jobs can run for hours, but at least I will be alerted to check.
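For anyone wanting to do the same, the core of such a check can be sketched roughly like this. It assumes the queue entries are the .q files mentioned earlier in the thread and that `find` supports `-mmin` (GNU and BSD find do; it is not in POSIX). The function name and the default of 60 minutes are illustrative.

```shell
#!/bin/sh
# List queue files that have not been modified for more than a given
# number of minutes. Assumes queue entries are *.q files and that
# find supports -mmin (GNU/BSD extension).
# Usage: stale_queue_files QUEUE_DIR [MINUTES]
stale_queue_files() {
    queue_dir=$1
    minutes=${2:-60}
    # -mmin +N matches files last modified more than N minutes ago.
    find "$queue_dir" -type f -name '*.q' -mmin +"$minutes"
}
```

If the function prints anything, the calling job can send that list in the notification e-mail; a longer threshold can be passed for queues whose jobs legitimately run for hours.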

I will let you know when the queue gets stuck again and post any debug logs I get.

Thanks.