Timeouts until error notification

Hi Team,

When I test a remote agent connection (4.x with a Linux master and a Linux agent), it takes ages until the "connection failed" message appears. Until then the GUI is blocked: you cannot do anything in it, not even cancel the action. Is there a way to shorten the timeout when a network connection or authentication doesn't work?

Unfortunately it doesn't currently support custom timeout settings. The timeout is controlled by your system settings, which, if I'm not mistaken, is set to 75 seconds on most Linux systems. I'm not sure if you can change the default. Take a look at what your system manual ("man 7 tcp") says about the default timeout and related keepalive settings.
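To put a rough number on the system-level timeout mentioned above, here is a minimal sketch. It assumes the wait is dominated by unanswered SYN packets (for example, a firewall silently dropping them) and that the kernel uses plain exponential backoff. The function name syn_timeout and the initial timeout value passed to it are assumptions made for this illustration; the real figure depends on the kernel version, so treat the result as an estimate only.

```shell
#!/bin/sh
# Rough estimate of how long a blocking TCP connect() waits before giving up,
# assuming plain exponential backoff: the initial SYN plus N retransmissions,
# with each wait twice as long as the previous one.
# Both arguments are assumptions to check on your own system:
#   $1 = number of SYN retransmissions (net.ipv4.tcp_syn_retries)
#   $2 = initial retransmission timeout in seconds
syn_timeout() {
    retries=$1
    rto=$2
    total=0
    i=0
    while [ "$i" -le "$retries" ]; do
        total=$((total + rto))
        rto=$((rto * 2))
        i=$((i + 1))
    done
    echo "$total"
}

# Usage on the master host (reads /proc, so Linux only):
#   syn_timeout "$(cat /proc/sys/net/ipv4/tcp_syn_retries)" 1
```

If the estimate is in the range of minutes, lowering net.ipv4.tcp_syn_retries with sysctl should shorten it, but that setting applies to every outgoing connection on the host, not just the scheduler. It also does not help when the connection is established first and stalls later.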

OK - I will investigate this. But right now I'm not able to get a connection from the master to the remote agent.
The agent *is* behind a firewall. Because I'm just testing this new version (the old ra_server instance for version 3.x is still running on the agent host on port 1096), I configured the agent to run on port 10000, which is allowed in the firewall.

When I do a simple telnet to port 10000 on the host running the agent, I get a connection. I also get a connection when I test ftp (for authentication), and I can log in as user 'root' from the host running the master to the one running the agent. So far, so good.

But when I do a "test connect" from the master in the "remote agent" dialog box, the connection seems to time out (it takes minutes). When I run "netstat -an | grep 10000" on the host with the agent, I can see an established connection during the connection test (the source IP matches the master). The connection closes after approx. 10-20 seconds while the master GUI still doesn't respond. After 1-2 minutes the "connection failed" message appears.

I looked in every message and log file I can think of (Linux /var/log/... and the scheduler logs on master and agent) and cannot find any hint.

Is the agent doing any kind of "connect back" *from* the agent *to* the master? (This specific firewall rule allows only one-way connections from master to agent.)
I also tried PAM authentication - same behavior.
Oh, before I forget: simply running a test job using credentials doesn't work either. Same behavior: it takes a long time to time out when running the job in test mode.

Another agent in the same network, with no firewall in between, is working.

Any hints for me?

Please try creating a simple remote job and assigning it to the agent profile for which the test connection fails. Try running this job in Normal mode and see what happens.

Running a simple test job does not work.

I've set up the 4.x master on the original server running the 3.x version. I disabled 3.x to free up the default port 1096, to be sure of having an identical IP environment.
Telnet to port 1096 from master to agent and vice versa gives a connection. But running the job fails with error code -1. I am absolutely sure that this is not the exit code of the job. (The job is a simple shell script with a single "echo Hello world" line.)

As I said: the 3.x master/agent (ra_server) combination between these two hosts works without problems. With 4.x it does not.

Maybe this *is* a firewall problem, but I don't know how to find out which traffic may be blocked. Is version 4.x using additional ports which may cause the problem? Ping? UDP packets? Is it really using only 1096 tcp?

I've done some other tests and set up a master/agent combination running through our managed firewall. This connection works!

What I see in the firewall-log is the following:
There is an undocumented port which is used every time I do a "test connect" or run a job in normal or test mode: 59338

I'm confused! Why is this port used??

Besides, I wonder why port 21 (ftp) is never used during these tests, although I activated ftp-auth.
Another point: when I run the job in normal mode, port 10000 (the agent listener) is *not* used either!??

Please enlighten me.

Thanks in advance.

I found out that the mentioned port 59338 is a random port listener created by the java process when the agent is started.
When I stop and restart the agent another high port is opened randomly.

This is of course extremely unfriendly for a firewall. Is there a way to configure this port?

I am not sure why you are getting random port numbers. This is not supposed to happen. It should be using the number entered in settings (Tools/Options/Network). I can only guess that when the specified number is blocked for whatever reason, something makes it slip to a random number which is free but not really usable.

As for the ftp port 21, it is not used for connections. The authentication method is used by 24x7 on the target computer, not on the computer from which you are connecting. For example, if you have a 24x7 scheduler connecting to a 24x7 agent and FTP authentication is chosen on the agent, then during the connect phase the agent talks to the locally running ftp server and asks the server to validate the user and password. If PAM authentication is chosen, the agent uses the configured PAM module to talk to the operating system and asks the OS to authenticate the user. Please note that PAM is not available on all systems, which is why we support 2 alternative methods.

Now, to verify that the listener is listening on the specified port, you can execute the netstat -a command from a shell and check the output. If you've got many programs running, you can use grep with a "java" filter to narrow the output.
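As a minimal sketch of the check described above, the following filter narrows netstat output to TCP sockets listening on one port. It assumes the Linux net-tools column layout (local address in the fourth column, state in the last); the function name listening_on is made up for this example.

```shell
#!/bin/sh
# Filter `netstat -an` output (read from stdin) down to TCP sockets that are
# listening on the given port. Assumes the Linux net-tools layout, where the
# local address is field 4 and the state is the last field.
listening_on() {
    port=$1
    awk -v port="$port" '
        $1 ~ /^tcp/ && $NF == "LISTEN" {
            n = split($4, parts, ":")      # e.g. "0.0.0.0:10000"
            if (parts[n] == port) print
        }'
}

# Usage on the agent host:
#   netstat -an | listening_on 10000
```

No output means nothing is listening on that port. Established connections that merely involve the port are ignored on purpose.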

Regarding the used ports with java: This is what I don't understand:

When the agent is started (no matter with gui or nogui) the configured port (in this case 10000) is opened as listener + an additional random port.

Example:

ps -ef| grep jscheduler.jar
root 29840 29839 0 17:23 pts/5 00:00:00 /usr/java/jdk1.5.0_12/bin/java -Xms64m -Xmx96m -jar jscheduler.jar agent nogui

lsof -i | grep 29840
java 29840 root 7u IPv4 570141120 TCP *:10000 (LISTEN)
java 29840 root 9u IPv4 570141122 TCP *:36168 (LISTEN)

After shutting down the agent, both listeners have vanished.

Next startup:

[root@kasten 24x7_Scheduler]# ./agent.sh nogui &
This is a 30-day trial product. It may not be used for production purposes. (29 days left)
Copyright (c) 2006 SoftTree Technologies, Inc.
24x7 Scheduler started in agent mode.

[root@kasten root]# ps -ef| grep jscheduler.jar
root 30525 30524 24 17:32 pts/5 00:00:00 /usr/java/jdk1.5.0_12/bin/java -Xms64m -Xmx96m -jar jscheduler.jar agent nogui

[root@kasten root]# lsof -i | grep 30525
java 30525 root 7u IPv4 570158802 TCP *:10000 (LISTEN)
java 30525 root 9u IPv4 570158804 TCP *:36203 (LISTEN)

When I track the traffic via firewall I can see that BOTH ports (10000 AND the random port) are used for communication.
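The manual ps/lsof comparison above can be scripted. This is a minimal sketch that reads lsof output and prints every listening port of a given PID other than the configured agent port. The function name extra_listeners is made up for this example, and the parsing assumes lines shaped like the lsof output shown above, with the address in the second-to-last column.

```shell
#!/bin/sh
# Read `lsof -i` output on stdin and print every listening TCP port owned by
# the given PID that is not the configured agent port.
#   $1 = PID of the agent's java process
#   $2 = configured agent port
extra_listeners() {
    pid=$1
    configured=$2
    awk -v pid="$pid" -v configured="$configured" '
        $2 == pid && $NF == "(LISTEN)" {
            n = split($(NF - 1), parts, ":")   # e.g. "*:36168"
            if (parts[n] != configured) print parts[n]
        }'
}

# Usage (-P keeps port numbers numeric instead of service names):
#   lsof -i -P | extra_listeners 29840 10000
```

Running this after each agent restart gives the port that would need a firewall rule for that particular run, which makes the problem easy to show in a bug report.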

Any hints?

I don't understand this either. I am not aware of any random ports.

But wait, maybe you have the remote control option enabled and something weird is entered for the port, or it is simply not specified. Can you check the settings and see whether the remote control option is enabled?

Last edited by SysOp on Mon Jul 16, 2007 7:46 am; edited 1 time in total

I'm sorry, but...

"remote control" was enabled in the master config. But whether I enable or disable it doesn't change anything. From my point of view the problem lies with the agent. In the agent GUI the "remote control" option is greyed out and can neither be activated nor deactivated.

Meanwhile I downgraded the Java version from 1.5 to 1.4.2, but the behavior didn't change. With every restart of the agent, the java process opens a random port in addition to the configured one. When I try any remote operation from the master (agent check, running a job in normal or test mode), the master uses this open random port in addition to the configured port. There are always 2 entries in the firewall log, matching the open ports which I determine with the "ps" and "lsof" commands. I can repeat this test over and over. The real-time firewall log shows me that, and it's the only explanation for why the connection to the "problematic" host behind the second firewall times out: these random ports are not allowed. I don't know yet whether any meaningful communication is done over this random port. I can check this with a monitoring tool like ethereal if it makes sense...

Are you not able to verify/confirm this mystic random port with the "ps" and "lsof" commands in your own test environment?

We will try to reproduce this in the lab and check what could be creating this second random port.

Please let us know which exact version / build you are running.

The current suspicion is that some component or components are packaged with debugging info and java activates a port for the debugger hook. That port is random and normally should not cause any problems. You can control which port is used if you add something like the following to the command line:

-Xdebug -Xrunjdwp:transport=dt_socket,server=y,suspend=n,address=[desired port number]

This can be added to the command line within master.sh or master.bat depending on the platform.

I apologize for this thread growing and growing...

Unfortunately no tcp/11000 port (chosen as an example) is opened; the random port is still there.
I have modified agent.sh as follows:

[root@kasten 24x7_Scheduler]# cat agent.sh
#!/bin/sh

export JAVA_HOME=/usr/java/jdk1.5.0_12/

if [ -f $JAVA_HOME/bin/java ] ; then
if [ "$1" = "nogui" ] || [ "$2" = "nogui" ] ; then
$JAVA_HOME/bin/java -Xms64m -Xmx96m -jar jscheduler.jar agent $1 $2 -Xdebug -Xrunjdwp:transport=dt_socket,server=y,suspend=n,address=11000
else
$JAVA_HOME/bin/java -Xms64m -Xmx96m -jar jscheduler.jar agent $1 $2 -Xdebug -Xrunjdwp:transport=dt_socket,server=y,suspend=n,address=11000 &
fi
else
echo JAVA_HOME must be set to JDK or JRE 1.4.x or later distribution directory
exit 0
fi

Agent started with the script:
[root@kasten 24x7_Scheduler]# ./agent.sh

Right after the start:
[root@kasten 24x7_Scheduler]# ps -ef| grep jscheduler.jar | grep -v grep
root 8800 1 1 17:46 pts/0 00:00:02 /usr/java/jdk1.5.0_12//bin/java -Xms64m -Xmx96m -jar jscheduler.jar agent -Xdebug -Xrunjdwp:transport=dt_socket,server=y,suspend=n,address=11000

Checking the sockets with "lsof -i":
[root@kasten 24x7_Scheduler]# lsof -i| grep 8800
java 8800 root 6u IPv4 572343994 TCP localhost:57778->localhost:x11-ssh-offset (ESTABLISHED)
java 8800 root 8u IPv4 572344008 TCP *:10000 (LISTEN) <-- as configured, absolutely correct!
java 8800 root 10u IPv4 572344010 TCP *:57781 (LISTEN) <-- the mystic random port, changes whenever agent is started!
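
One detail about the test above: the java launcher passes everything after "-jar jscheduler.jar" to the application as program arguments, so the -Xdebug/-Xrunjdwp options placed at the end of the line are not read by the JVM. That alone would explain why nothing listens on port 11000. Here is a minimal sketch of the same command with the options moved in front of -jar. The function name agent_cmd is made up for this illustration, and whether the debugger hook is really the source of the random port is still only the hypothesis stated earlier in the thread.

```shell
#!/bin/sh
# Print the agent command line with the JVM options placed before -jar.
# Everything after "-jar jscheduler.jar" is handed to the application as
# program arguments, so options placed there never reach the JVM itself.
#   $1      = port for the debugger hook (11000 in the test above)
#   $2 ...  = agent arguments such as "nogui"
# JAVA_HOME must already be set, as in agent.sh.
agent_cmd() {
    debug_port=$1
    shift
    echo "$JAVA_HOME/bin/java -Xms64m -Xmx96m" \
         "-Xdebug -Xrunjdwp:transport=dt_socket,server=y,suspend=n,address=$debug_port" \
         "-jar jscheduler.jar agent $*"
}

# Usage: review the printed line, then run it from the install directory:
#   agent_cmd 11000 nogui
#   sh -c "$(agent_cmd 11000 nogui)"
```

If the extra listener then moves to port 11000, the debugger hook is confirmed as its origin. If port 11000 opens and a random listener is still there, the random port comes from somewhere else. Since this opens a debug port, it is best used only for the duration of the test.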

Version is the downloaded trial : Version 4.1 build 242
Java version can be seen in the path : JDK 1.5.0-12
Server : Standard Intel running Red Hat Enterprise Linux ES release 3 (Taroon)

BTW : Many thanks for your support in trying to solve this issue!

Thank you very much for the detailed info. We will continue looking into this and I will be posting status updates as new information comes to light.