[Barrelfish-users] Threads using sockets may block

Thu Feb 14 09:23:48 CET 2013

Hello.

Thanks for all your comments.  I investigated further; here are my findings:

* My original assessment was incorrect.  The snippets from the POSIX glue are
  unrelated: they refer to Unix sockets.  lwip also uses the default wait set
  unless told otherwise (see 'lwip_init_ex').  However, this may not be an
  issue, as lwip uses a single-threaded core in conjunction with message
  passing [1].

* The network stack initialization is a little suspect.  It isn't documented,
  so I copied the code from usr/net-test, but that snippet contains FIXMEs and
  an unused wait set.

* The hint from Kornilios was very promising.  I tried a few scenarios, and the
  problem indeed seems to involve the loopback interface:

  a) same machine, same domain: both sides block indefinitely (server in 'accept',
     client in 'read') [2].
  b) same machine, separate domains: server waiting in 'accept', client fails to
     connect (ERR_RST: Connection reset).
  c) client/server on separate machines: works.
  d) single domain, multi-threaded server: works.

  Something worth investigating: the trace logs for (a) show both sides waiting
  on their own mailboxes, yet an arriving packet cannot be delivered because a
  third mailbox (0x9135c8 in [2]) is full.  The current mailbox implementation
  only has space for a single message, and there is an open TODO to support the
  general case.

* I have resolved the original problem with SharedDB.  This investigation
  started while I was trying to diagnose hanging client requests.  Those turned
  out to be caused by something entirely unrelated (a faulty atomic integer
  meant the server never even entered its select loop)!
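
To make the mailbox observation for scenario (a) concrete, here is a minimal
sketch of a bounded mailbox.  It is not the code from lib/lwip/src/sys_arch.c:
all names (struct mbox, mbox_init, mbox_trypost, mbox_tryfetch,
MBOX_MAX_SLOTS) are hypothetical, and the calls are non-blocking, so a 'false'
return marks the point where a blocking implementation would wait.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical bounded mailbox; NOT the lwip/Barrelfish implementation. */
#define MBOX_MAX_SLOTS 8

struct mbox {
    void  *slots[MBOX_MAX_SLOTS];
    size_t capacity;   /* usable slots, 1..MBOX_MAX_SLOTS */
    size_t head;       /* index of the oldest queued message */
    size_t count;      /* number of messages currently queued */
};

void mbox_init(struct mbox *mb, size_t capacity)
{
    assert(capacity >= 1 && capacity <= MBOX_MAX_SLOTS);
    mb->capacity = capacity;
    mb->head = 0;
    mb->count = 0;
}

/* Returns false when the mailbox is full (a blocking post would wait here). */
bool mbox_trypost(struct mbox *mb, void *msg)
{
    if (mb->count == mb->capacity) {
        return false;
    }
    mb->slots[(mb->head + mb->count) % mb->capacity] = msg;
    mb->count++;
    return true;
}

/* Returns false when the mailbox is empty (a blocking fetch would wait here). */
bool mbox_tryfetch(struct mbox *mb, void **msg)
{
    if (mb->count == 0) {
        return false;
    }
    *msg = mb->slots[mb->head];
    mb->head = (mb->head + 1) % mb->capacity;
    mb->count--;
    return true;
}
```

With a capacity of 1, a second post is refused until someone fetches.  If the
only thread able to fetch is itself blocked on another mailbox, nothing makes
progress, which may be what the trace in [2] shows.  A capacity greater than 1
only postpones that point; whether it is enough here is an open question.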

Best,

--Zaheer

[1]: http://git.savannah.gnu.org/cgit/lwip.git/tree/doc/rawapi.txt#n25

[2]: Trace for scenario (a), client and server in the same domain:
[server] listening on port 5000.
lwip_accept(0)...
../lib/lwip/src/api/sockets.c:280 in lwip_accept(): before netconn_accept ...
../lib/lwip/src/api/api_lib.c:320 in netconn_accept(): before sys_arch_mbox_fetch ...
../lib/lwip/src/sys_arch.c:258 in sys_arch_mbox_fetch(): mailbox is empty, wait until something arrives (mbox = 0x953710)
<snipped>
[client] calling read on socket
lwip_recvfrom(1, 0x953260, 1024, 0x0, ..)
lwip_recvfrom: top while sock->lastdata=0x0
netconn_recv called on [0x805db3d8]
../lib/lwip/src/api/api_lib.c:395 in netconn_recv(): before sys_arch_mbox_fetch.
../lib/lwip/src/sys_arch.c:258 in sys_arch_mbox_fetch(): mailbox is empty, wait until something arrives (mbox = 0x953798)
../lib/lwip/src/core/ipv4/ip.c:664 in ip_output_if(): loopback #ifdef enabled, packet for myself? 1.
../lib/lwip/src/sys_arch.c:201 in sys_mbox_post(): mailbox full; waiting for some space (mbox = 0x9135c8)

> -----Original Message-----
> From: Roscoe Timothy
> Sent: 13 February 2013 21:03
> To: Kornilios Kourtis
> Cc: barrelfish-users at lists.inf.ethz.ch; Chothia Zaheer; Roscoe Timothy
> Subject: Re: [Barrelfish-users] Threads using sockets may block
> 
> 
> Hi there,
> 
> Just a quick naive question (not for Zaheer): why is the socket
> implementation using the default wait set?  Surely for current blocking
> semantics each (blocking) socket should have its own waitset?
> 
>  -- Mothy
> 
> At Wed, 13 Feb 2013 11:05:27 +0100, Kornilios Kourtis
> <kornilios.kourtis at inf.ethz.ch> wrote:
> > Hi Zaheer,
> >
> > On Mon, Feb 11, 2013 at 03:44:59PM +0000, Chothia  Zaheer wrote:
> > > Hello,
> > >
> > > > When multiple threads use the sockets API some calls may block
> > > > indefinitely.  It seems this is because they use the default waitset
> > > > -> lib/posixcompat/sockets.c:
> > >
> > >   ssize_t recv(int sockfd, void *buf, size_t len, int flags)
> > >                     // XXX: Assume it was on the default waitset
> > >                     err = us->u.active.binding->change_waitset
> > >                         (us->u.active.binding, get_default_waitset());
> > >
> > > A simple server-client example is attached.  Output looks like this:
> > >
> > [snip]
> > >   client: owner has the IP address 10.110.4.21
> > >   [server] listening on port 5000.
> > >   [client] created socket: fd = 4
> > >   [client] connecting to server at 10.110.4.21:5000 ...
> > >   [client] connected to server at 10.110.4.21:5000
> > >   [client] calling read on socket
> > >   netconn_recv called on [0x805d93d8]
> > [snip]
> >
> > Just adding to the comments: AFAICT you are running the client and the
> > server on the same machine, which typically requires some kind of
> > loopback mechanism on the network stack for the routing. Doing some
> > naive grepping in lwip, I got the following:
> >
> > lib/lwip/src/core/ipv4/ip.c:
> > 655-#if (LWIP_NETIF_LOOPBACK || LWIP_HAVE_LOOPIF)
> > 656-    if (ip_addr_cmp(dest, &netif->ip_addr)) {
> > 657:        /* Packet to self, enqueue it for loopback */
> > 658-        LWIP_DEBUGF(IP_DEBUG, ("netif_loop_output()"));
> > 659-        return netif_loop_output(netif, p, dest);
> > 660-    } else
> > 661-#endif                          /* (LWIP_NETIF_LOOPBACK || LWIP_HAVE_LOOPIF) */
> >
> > I'm not sure what our configuration options are, but it might be worth
> > making sure that the problem is not in loopback routing (e.g., by
> > using two different machines, or two different domains).
> >
> > cheers,
> > Kornilios.
> >
> > --
> > Kornilios Kourtis
> >