fd_conn_rcv: Connection timed out

Discussion of the Co:Z Co-Processing Toolkit for z/OS
stvn
Posts: 6
Joined: Thu May 26, 2016 2:11 pm

fd_conn_rcv: Connection timed out

Post by stvn » Thu May 26, 2016 2:20 pm

Hello Everyone,
I am a newbie to this forum and am just learning about Co:Z. When transferring files, I see the following error once the job has been running for longer than two hours:
todsn-client.E.: handleCmdIO: read error on fd_conn_rcv: Connection timed out
If the job runs for less than two hours, it completes fine. Any help would be greatly appreciated.

dovetail
Site Admin
Posts: 1910
Joined: Thu Jul 29, 2004 12:12 pm

Re: fd_conn_rcv: Connection timed out

Post by dovetail » Tue May 31, 2016 8:33 am

What versions of the Co:Z Toolkit for z/OS and the Co:Z Target System Toolkit are you using?

stvn
Posts: 6
Joined: Thu May 26, 2016 2:11 pm

Re: fd_conn_rcv: Connection timed out

Post by stvn » Tue May 31, 2016 10:40 am

We are using version 1.2.0.

dovetail
Site Admin
Posts: 1910
Joined: Thu Jul 29, 2004 12:12 pm

Re: fd_conn_rcv: Connection timed out

Post by dovetail » Tue May 31, 2016 11:30 am

And which version of Co:Z are you running on z/OS?

stvn
Posts: 6
Joined: Thu May 26, 2016 2:11 pm

Re: fd_conn_rcv: Connection timed out

Post by stvn » Thu Jun 30, 2016 10:21 am

CoZLauncher.N.: version: 1.7.8 2011-01-17
cozagent.N.: version: 1.2.0 2015-05-01

dovetail
Site Admin
Posts: 1910
Joined: Thu Jul 29, 2004 12:12 pm

Re: fd_conn_rcv: Connection timed out

Post by dovetail » Thu Jun 30, 2016 11:09 am

There have been several fixes for the launcher since this (old) version of Co:Z for z/OS:

http://dovetail.com/docs/cozinstall/changes.html

Please retry with the current version (3.6.3).
Note: for testing, you can install an alternate version in a different directory and different datasets.

stvn
Posts: 6
Joined: Thu May 26, 2016 2:11 pm

Re: fd_conn_rcv: Connection timed out

Post by stvn » Tue Jul 05, 2016 8:55 am

We upgraded to 3.6.3 and we still had the same issue when the job took longer than two hours.

dovetail
Site Admin
Posts: 1910
Joined: Thu Jul 29, 2004 12:12 pm

Re: fd_conn_rcv: Connection timed out

Post by dovetail » Tue Jul 05, 2016 9:56 am

- What settings are you using in DD:COZCONF (COZCFGD and COZCFG)?

- What is the target operating system?

stvn
Posts: 6
Joined: Thu May 26, 2016 2:11 pm

Re: fd_conn_rcv: Connection timed out

Post by stvn » Tue Jul 05, 2016 10:49 am

COZCONF settings:
server-path=/app/pp/cozr363/bin/cozserver    
server-ports=8040-8059   
ssh-tunnel=false                       
saf-cert=MY-RING:MY-CERT                        
agent-path=/opt/dovetail/coz/bin/cozagent       
server-env-COZ_TRSUB_US-ASCII=ISO8859-1         
target-env-COZ_CLIENT_CODEPAGE=ISO8859-1 

Target Operating System:
 RHEL5

dovetail
Site Admin
Posts: 1910
Joined: Thu Jul 29, 2004 12:12 pm

Re: fd_conn_rcv: Connection timed out

Post by dovetail » Tue Jul 05, 2016 11:06 am

We think that the problem is that something in your network path is timing out this particular socket after two hours.

I can't tell from the information that you have provided whether the "todsn" that times out is the one for your file transfer.
More likely it is the internal todsn used for either DD:STDOUT or DD:STDERR redirection. It is common for these to not send any data until a message is issued on the target side to one of those standard handles, so the socket can sit idle for a long time.

If you are willing to test a beta release, we think it makes sense to change the socket options on these connections (using TCP_KEEPALIVE) so that they are less likely to time out.

stvn
Posts: 6
Joined: Thu May 26, 2016 2:11 pm

Re: fd_conn_rcv: Connection timed out

Post by stvn » Thu Jul 07, 2016 7:32 am

We tried making the following changes:
z/OS changes applied:
TCP_KEEPALIVE increased to 240 minutes

Linux changes applied:
tcp_keepalive_time 14400
tcp_keepalive_intvl 75
tcp_keepalive_probes 90

We then got the following error:
todsn(DD:STDOUT)[N]: 69454 bytes read; 587 records/68868 bytes written in 9746.222 seconds (7.126 Bytes/sec).
todsn-client(27282)[E]: handleCmdIO: read error on fd_conn_rcv: Connection timed out
todsn-client(27282)[E]: Error: no exit code received from CoZServer
[22:42:49.270522] CoZLauncher[D]: CoZAgent: completed with RC=103
[22:42:49.270590] CoZLauncher[T]: -> handleAgentCompletion(103)
cozagent[E]: STDERR DD Writer(27282) ended with RC=102

Any ideas from the error?

dovetail
Site Admin
Posts: 1910
Joined: Thu Jul 29, 2004 12:12 pm

Re: fd_conn_rcv: Connection timed out

Post by dovetail » Thu Jul 07, 2016 4:21 pm

The tcp_keepalive_time needs to be lower than the idle time after which your firewall(s) drop the connection. 14400 is probably not low enough; in fact, the default is usually 2 hours (7200 seconds), so you moved it in the wrong direction. Try something like 600 (10 minutes). Also, I don't believe that keepalives are enabled by default for sockets, so the setting may not matter, since Co:Z is not explicitly turning them on.
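
A minimal sketch of that comparison on the Linux target, assuming the firewall idle timeout is about two hours (7200 seconds is inferred from the symptom, not confirmed). The function name keepalive_below_timeout and the FIREWALL_TIMEOUT variable are made up for illustration; /proc/sys/net/ipv4/tcp_keepalive_time is the standard Linux location of the kernel setting.

```shell
#!/bin/sh
# Hypothetical helper: succeeds only if the keepalive idle time (seconds)
# is lower than the firewall idle timeout (seconds).
keepalive_below_timeout() {
    idle=$1
    fw_timeout=$2
    [ "$idle" -lt "$fw_timeout" ]
}

# Assumption: the firewall drops idle connections after about two hours.
FIREWALL_TIMEOUT=7200

# Read the current kernel value; fall back to 7200 if /proc is unavailable.
current=$(cat /proc/sys/net/ipv4/tcp_keepalive_time 2>/dev/null || echo 7200)

if keepalive_below_timeout "$current" "$FIREWALL_TIMEOUT"; then
    echo "tcp_keepalive_time=$current is below the assumed firewall timeout"
else
    echo "tcp_keepalive_time=$current is too high; try something like 600"
fi
```

The value can be lowered as root with "sysctl -w net.ipv4.tcp_keepalive_time=600", but the kernel setting only affects sockets that have keepalive switched on.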

We have tested an enhancement to Co:Z for z/OS that will allow you to turn on TCP_KEEPALIVE for the sockets and set the initial interval (tcp_keepalive_time) to whatever you want. This will be much better than changing the kernel settings.

Note 1:
Until the release that includes these enhancements is available, the only way that I know for sure to turn on TCP_KEEPALIVE is to use libkeepalive on Linux.

See: http://www.tldp.org/HOWTO/html_single/T ... ive-HOWTO/

To use this you would need an executable shell script to run instead of cozagent:

#! /bin/sh
# Use libkeepalive to make Co:Z sockets enable TCP_KEEPALIVE with the specified intervals
# This script should be chmod 755
#
export LD_PRELOAD=libkeepalive.so
export KEEPCNT=20
export KEEPIDLE=180 # this must be lower than firewall expiration
export KEEPINTVL=60
exec /opt/dovetail/coz/bin/cozagent "$@"

Then point the "agent-path" property to the full path name of this shell script.
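
For reference, here is a rough sketch of what the three values in the wrapper script work out to, assuming the usual TCP keepalive behaviour: the first probe goes out after KEEPIDLE seconds of silence, further probes follow every KEEPINTVL seconds, and the connection is dropped after KEEPCNT unanswered probes. The helper name keepalive_dead_after is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: approximate seconds of silence before an unresponsive
# peer is declared dead = idle time + probe interval * probe count.
keepalive_dead_after() {
    echo $(( $1 + $2 * $3 ))
}

# Values taken from the wrapper script above.
KEEPIDLE=180
KEEPINTVL=60
KEEPCNT=20

echo "first probe after ${KEEPIDLE}s of idle time"
echo "dead peer detected after roughly $(keepalive_dead_after "$KEEPIDLE" "$KEEPINTVL" "$KEEPCNT")s"
```

Only KEEPIDLE has to stay below the firewall's idle timeout; the other two control how quickly a connection that really is dead gets noticed.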

Note 2:
It is also possible that your firewalls are configured to ignore TCP keepalive packets when determining whether a connection has expired. If so, TCP_KEEPALIVE will not help no matter what you set it to, but this is probably not the case.

We will also add a feature that sends out actual data packets at the application level, for situations where TCP_KEEPALIVE is not enough.
