Multiple io-net crashes

bridged with qdn.public.qnxrtp.os
John Nagle

Multiple io-net crashes

Post by John Nagle » Tue Nov 18, 2003 5:56 am

We've been experiencing multiple io-net crashes on
QNX 6.2.1PE. We've now seen this on three different
sets of hardware. Bug reports, with dumps, have been
submitted. Earlier, we thought this was a versioning
problem, but we put full installs of 6.2.1PE on the
relevant machines and still have problems.

We're running QNET over Ethernet, not over IP.
All machines use ordinary Ethernet interfaces, but
the LAN is bridged with wireless bridges. Operating
over hard-wired 100baseT seems to work fine.

Operating over a path with a slow 802.11b bridge seems to cause QNET
serious problems, including io-net crashes.

Spawning programs and messaging using QNET across
the 802.11b bridge seems to get io-net into bad states.
At the user level, we get messages like
"ls: readdir of '/net/gcrear0' failed (Bad file descriptor)"
In syslog, we see

Nov 17 21:26:36 7 15 0 npm-qnet(stats): kif_client@82
kif_net_client Underflow(4294967295)

====

Nov 17 21:43:42 7 15 0 npm-qnet(kif): nd(00010004, 00010004),
server_id (40000029, 4000001f), client_id (0000002b, 0000002b),
v->buffer 0 at kif_client.c:705
(Bad file descriptor)

Nov 17 21:43:42 7 15 0 npm-qnet(L4): trans_input.c:438 (Bad
file descriptor)

Nov 17 21:43:42 7 15 0 npm-qnet(kif): nd(00010004, 00010004),
server_id (40000029, 4000001f), client_id (0000002b, 0000002b),
v->buffer 0 at kif_client.c:705
(Bad file descriptor)

Nov 17 21:43:42 7 15 0 npm-qnet(L4): trans_input.c:438 (Bad
file descriptor)

Nov 17 21:43:43 7 15 0 npm-qnet(kif): nd(00010004, 00010004),
server_id (40000029, 4000001f), client_id (0000002b, 0000002b),
v->buffer 0 at kif_client.c:705
(Bad file descriptor)

Nov 17 21:43:43 7 15 0 npm-qnet(L4): trans_input.c:438 (Bad
file descriptor)

What does it mean?
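
One observation on the first log line: 4294967295 is 2^32 - 1, which is the value an unsigned 32-bit counter holds after being decremented once from zero. That npm-qnet keeps such a counter is an assumption; the thread does not show its source. A minimal Python sketch of the arithmetic only:

```python
# Hypothetical illustration: emulate a 32-bit unsigned counter in Python.
# Nothing here is taken from npm-qnet itself.
MASK32 = 0xFFFFFFFF

def dec_u32(value):
    """Decrement value the way a 32-bit unsigned integer would (wraps below zero)."""
    return (value - 1) & MASK32

print(dec_u32(0))  # 4294967295
```

If that reading is right, the "Underflow" message would mean a statistic was decremented more often than it was incremented.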

We really need QNX messaging to work reliably. Our whole architecture
is based on it.

John Nagle
Team Overbot

Xiaodan Tang

Re: Multiple io-net crashes

Post by Xiaodan Tang » Tue Nov 18, 2003 5:21 pm

Hello John,

Would you please try this: when you mount QNET, start it with these
options:

io-net -d driver -p qnet ticksize=200,sstimer=0x00140014

See if this makes it work better on slow links.

I will think about other alternatives.
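
For reference, the sstimer value can be split arithmetically. Reading 0x00140014 as two 16-bit halves is an assumption (the option's layout is not described in this thread); under that reading both halves equal 20. What each half controls, and in what units, is not stated here.

```python
def split_halves(value):
    """Split a 32-bit value into (high 16 bits, low 16 bits)."""
    return (value >> 16) & 0xFFFF, value & 0xFFFF

high, low = split_halves(0x00140014)
print(high, low)  # 20 20
```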

-xtang


John Nagle

Re: Multiple io-net crashes

Post by John Nagle » Tue Nov 18, 2003 7:17 pm

We will try that as a debugging effort. But fixing
a fundamental reliability problem by adjusting time
delays is not a solution.

We will also send in some crash dumps of io-net.

Neither of those options is documented in the Helpviewer
database, incidentally.

John Nagle
Team Overbot




Bill Caroselli

Re: Multiple io-net crashes

Post by Bill Caroselli » Tue Nov 18, 2003 8:04 pm

I believe that you're adjusting a time-out, not a delay.

The driver is assuming that something is wrong even though everything
is still working fine, just slowly.
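
The distinction can be made concrete with a toy model. In this sketch all numbers are made up for illustration and are not QNET's real values: a reply that arrives later than the timeout is counted as a failure even though the link did deliver it.

```python
def spurious_timeouts(round_trips_ms, timeout_ms):
    """Count replies that did arrive, but only after the timeout had expired."""
    return sum(1 for rtt in round_trips_ms if rtt > timeout_ms)

# Hypothetical round-trip times in milliseconds.
wired = [1, 1, 2, 1, 3]
wireless = [8, 40, 15, 120, 30]

for timeout in (10, 200):
    print(timeout, spurious_timeouts(wired, timeout), spurious_timeouts(wireless, timeout))
# 10 0 4
# 200 0 0
```

With the short timeout the wired link looks healthy while the slow link appears to fail most of the time; lengthening the timeout removes the false failures without changing the link.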


John Nagle <nagle@downside.com> wrote:
JN > We will try that as a debugging effort. But fixing
JN > a fundamental reliability problem by adjusting time
JN > delays is not a solution.


Robert Rutherford

Re: Multiple io-net crashes

Post by Robert Rutherford » Tue Nov 18, 2003 11:52 pm

We have also recently seen a couple of io-net crashes.

This is on standard machines running 6.2.1A with dual Intel 82557 NICs. The
network is hardwired - there is no wireless LAN anywhere.

Coincidentally (or not?) the crashes have only occurred after we started
implementing inter-node native IPC over Ethernet.

We haven't spent any effort to get to the bottom of this yet (as it is only
very intermittent and we have more pressing bugs to fix) but I thought I
would mention it as possibly relevant to this thread.

Rob Rutherford


Xiaodan Tang

Re: Multiple io-net crashes

Post by Xiaodan Tang » Wed Nov 19, 2003 1:35 am

To get better response for "QNET over Ethernet (LAN)", QNET is tuned
for use on Ethernet by default. The aggressive timeout then affects links
that have a higher packet loss rate. (QNET does recognize when the interface
under it is PPP and adjusts the timeout automatically, but unfortunately the
wireless bridges claim to be "Ethernet".)

But you are right that this should never dump core.
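
A rough sketch of why the bridge goes unnoticed, assuming the timeout is chosen from the interface type the local driver reports. The table values and the function name are hypothetical and are not QNET's actual logic:

```python
# Hypothetical timeout profiles in milliseconds; not QNET's actual values.
TIMEOUT_MS = {"ethernet": 10, "ppp": 1000}

def pick_timeout(reported_type):
    """Choose a timeout from the interface type the driver reports."""
    return TIMEOUT_MS[reported_type]

# (description of the real path, type the local driver reports)
links = [
    ("wired 100baseT", "ethernet"),
    ("serial PPP", "ppp"),
    ("802.11b bridge behind an Ethernet NIC", "ethernet"),
]

for path, reported in links:
    print(f"{path}: {pick_timeout(reported)} ms")
```

The bridged path gets the same aggressive profile as the wired one, because nothing visible at the local interface distinguishes them.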

-xtang





John Nagle

Re: Multiple io-net crashes

Post by John Nagle » Wed Nov 19, 2003 5:51 am

Inter-node spawn seems to have at least the following
clear problems, which may or may not be relevant to the crashes.

1. If you spawn a process on another node, it's a child of io-net,
on the destination node. When it dies, it becomes a
zombie under io-net. io-net needs to check for dead children,
but apparently does not do so. The undocumented "no
zombies" flag on spawn seems to help. This probably
should be the default on remote spawns, since the parent/child
relationship doesn't work across node boundaries.

2. The "maproot" command to QNET seems to affect all UIDs, not just
root. If we set "maproot=99", but don't specify "mapany",
and user 99 is "nobody", we can only use the "on" command across nodes if
running as user "nobody". If we don't specify
"maproot=99", inter-node "on" works.

Our sysadmin should be trying the suggested timing tweaks. Where
do we send the io-net dumps?
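
Point 1 above comes down to a parent that never collects its children's exit status. On a POSIX system, a long-running parent can reap exited children without blocking by polling waitpid() with WNOHANG. The sketch below shows that generic pattern in Python on a single POSIX host; it is not io-net's code, and it does not use the QNX spawn flag mentioned above.

```python
import os
import time

def reap(pids):
    """Poll each known child once; return the set of pids still running."""
    still_running = set()
    for pid in pids:
        done, _status = os.waitpid(pid, os.WNOHANG)
        if done == 0:
            still_running.add(pid)  # not exited yet, check again later
    return still_running

def spawn_and_reap():
    """Fork one short-lived child, then poll until it has been collected."""
    child = os.fork()
    if child == 0:
        os._exit(0)  # child exits at once; it is a zombie until reaped
    pending = {child}
    while pending:
        pending = reap(pending)
        time.sleep(0.01)
    return pending  # empty set: no zombie left behind
```

A real server would fold the reap() call into its main loop or a SIGCHLD handler rather than sleeping.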

John Nagle
Team Overbot



Rennie Allen

Re: Multiple io-net crashes

Post by Rennie Allen » Wed Nov 19, 2003 3:46 pm

John Nagle wrote:
Inter-node spawn seems to have at least the following
clear problems, which may or may not be relevant to the crashes.

1. If you spawn a process on another node, it's a child of io-net,
on the destination node. When it dies, it becomes a
zombie under io-net. io-net needs to check for dead children,
but apparently does not do so. The undocumented "no
zombies" flag on spawn seems to help. This probably
should be the default on remote spawns, since the parent/child
relationship doesn't work across node boundaries.

My $0.02:

I happen to think that the parent/child relationship should extend
across node boundaries. I guess the problem comes when the network
is severed and the child later terminates: who would do the waitpid?

I think that it is OK to change ownership of the child to io-net,
if the virtual-circuit (or other bookkeeping entity) that represents
the connection between the remote parent and local child is destroyed
due to a network failure; and yes, io-net should be able to find out
when the child that it adopted in this way terminates, and perform
the waitpid.

David Gibbs

Re: Multiple io-net crashes

Post by David Gibbs » Wed Nov 19, 2003 3:55 pm

Rennie Allen <rallen@csical.com> wrote:
I think that it is OK to change ownership of the child to io-net,
if the virtual-circuit (or other bookkeeping entity) that represents
the connection between the remote parent and local child is destroyed
due to a network failure; and yes, io-net should be able to find out
when the child that it adopted in this way terminates, and perform
the waitpid.

In this case, I think the child should be re-parented to Proc. This
is consistent with the local case, where if the parent of a child
exits/terminates, the child gets reparented to Proc.

-David
--
QNX Training Services
http://www.qnx.com/support/training/
Please followup in this newsgroup if you have further questions.

John Nagle

Re: Multiple io-net crashes

Post by John Nagle » Wed Nov 19, 2003 6:39 pm

That would be nice, but it would require an API change.
"getppid()", etc. would have to return a node ID as well
as a process ID. That's not unreasonable, considering that
QNX provides a form of "kill" that accepts a node ID.
QNX already supports things like creating a pipe and
passing one end to a spawned process, so you can create
parent/child pipe connections. We've found that a useful
means of monitoring child death. When the child dies, the
pipe breaks.
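
A minimal sketch of the pipe technique, using fork() on a single POSIX host rather than a QNX inter-node spawn (that substitution is an assumption made so the example can run anywhere). The child holds the only write end; when it dies, the parent's read returns end-of-file.

```python
import os

def wait_for_child_death():
    """Fork a child that holds the write end of a pipe; block until it dies."""
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: keep only the write end, then exit (stands in for real work).
        os.close(read_fd)
        os._exit(0)
    # Parent: close its own copy of the write end so the child holds the only one.
    os.close(write_fd)
    data = os.read(read_fd, 1)  # returns b"" (EOF) once the child is gone
    os.close(read_fd)
    os.waitpid(pid, 0)          # collect the exit status so no zombie is left
    return data

print(wait_for_child_death())  # b''
```

The parent must close its own write end; otherwise the pipe never reports end-of-file, because a writer still exists.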

I'm more concerned about the unkillable zombies piling up
under io-net, which is clearly a defect. But even for that
we have a workaround.

The io-net crashes are the serious problem.

Remember, we're putting all this on a robot vehicle.
If io-net crashes, the hardware watchdog timer slams on
the brakes and kills the engine in about 200ms. The words
"QNX NET FAILED" appear in a big LED sign. Then other
watchdog timers reboot all the computers, and the
vehicle starts up again after a minute or so.
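
The watchdog behaviour can be modelled in a few lines. This is a software sketch only; the class and method names are hypothetical, and treating 200 ms as the watchdog's timeout (rather than the total reaction time) is an assumption.

```python
class Watchdog:
    """Software model of a watchdog: trips if not kicked within timeout_ms."""

    def __init__(self, timeout_ms, now_ms):
        self.timeout_ms = timeout_ms
        self.last_kick_ms = now_ms

    def kick(self, now_ms):
        """Record that the monitored service is still alive."""
        self.last_kick_ms = now_ms

    def expired(self, now_ms):
        """True once more than timeout_ms has passed since the last kick."""
        return now_ms - self.last_kick_ms > self.timeout_ms

wd = Watchdog(timeout_ms=200, now_ms=0)
wd.kick(now_ms=100)
print(wd.expired(now_ms=250))  # False: 150 ms since the last kick
print(wd.expired(now_ms=350))  # True: 250 ms since the last kick
```

Passing the clock in explicitly keeps the model deterministic; a real watchdog runs in hardware and does not depend on the software it monitors.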

John Nagle
Team Overbot


Bill Caroselli

Re: Multiple io-net crashes

Post by Bill Caroselli » Wed Nov 19, 2003 6:51 pm

Hi John

Just curious, is your robotic vehicle just for research or does it
have a practical reason for being?



Khian Hao Lim

Re: Multiple io-net crashes

Post by Khian Hao Lim » Wed Nov 19, 2003 8:10 pm

Hi Xiaodan,

I am with John Nagle. I tried your changes on our setup. It still produces the
same sloginfo errors. Could you please look at the rc.local below and see whether
I did anything wrong when implementing your changes?
The issue really does seem to be latency-dependent, as you said. Taking
out WEP encryption and the hubs between the links occasionally helped remove
the sloginfo errors (latency?). Any other fixes you recommend?

My setup:
---------------------
/etc/rc.d/rc.local

mount -T io-net -o busvendor=0x8086,busdevice=0x103a /lib/dll/devn-speedo.so
# Restart TCP/IP networking so that new Ethernet driver is attached to it.
netmanager -r all

# Start QNX native networking.
# map user to vehicle
mount -T io-net -o "ticksize=200,sstimer=0x00140014" /lib/dll/npm-qnet.so

-----------------------
Between node0 and node1 (node0 calls spawn with node1's nd):

Linksys WET11 bridge and Linksys WAP11 access point with 128-bit encryption,
between 2 hops of hubs

-----------------------
qnet and io-net are both Jan 18 2003 versions
-----------------------
sloginfo output after internode spawning:

Nov 19 11:34:58 7 15 0 npm-qnet(stats): kif_client@82
kif_net_client Underflow(4294967295)

Nov 19 11:34:58 7 15 0 npm-qnet(stats): kif_client@82
kif_net_client Underflow(4294967295)

Nov 19 11:34:58 7 15 0 npm-qnet(kif): nd(00010003, 00010003),
server_id (40000026, 4000001f), client_id (00000020, 00000020), v->buffer 0
at kif_client.c:705
(Bad file descriptor)

Nov 19 11:34:58 7 15 0 npm-qnet(L4): trans_input.c:438 (Bad file
descriptor)

Nov 19 11:34:58 7 15 0 npm-qnet(kif): nd(00010003, 00010003),
server_id (40000026, 4000001f), client_id (00000020, 00000020), v->buffer 0
at kif_client.c:705
(Bad file descriptor)

--------------------------------------
After internode spawning, I got the following:

node1$ ls /net/node0
ls: readdir of '/net/node0' failed (Bad file descriptor)
-------------------------------

Khian Hao Lim

John Nagle

Re: Multiple io-net crashes

Post by John Nagle » Sun Nov 30, 2003 5:34 am

Xiaodan Tang has identified a buffer overflow in io-net which is
causing some of our problems.

Currently, inter-node spawn is unreliable. Sometimes it
works, sometimes it fails, and on rare occasions, io-net
on the destination machine gets a segmentation fault.
Some sequences of chroot/spawn/exec seem to bring out the defect.
Details have been provided to Xiaodan by our Khian Hao.

Because we designed our system assuming this QNX feature
works, we have a serious problem. Who do we need to talk to
to get this fixed quickly? We've tried various workarounds,
but nothing really satisfactory or that we can trust has
emerged.

John Nagle
Team Overbot
650-326-9109






Alain Bonnefoy

Re: Multiple io-net crashes

Post by Alain Bonnefoy » Mon Dec 01, 2003 7:43 am

Hmm,
I've seen some zombies and, more rarely, io-net crashes caused by smbd,
as recently as last week.
Could be the same problem.

Alain

