[Barrelfish-users] Possible bug?

Thu Sep 15 01:40:22 CEST 2011

I had a quick look at the code (was just looking at the snippets in your email before), but I'm afraid I can't reason through it in my head. Can you simplify it at all and still trigger the bug? I suspect there is a race condition: you are using two threads on each dispatcher, one of which is servicing the waitset and one of which is invoking methods on the binding using state produced by the other, and you appear to be synchronising those via the spin-yield loops. For one thing, I'd suggest seeing whether you can use the builtin mutexes (thread_mutex_lock/unlock) or semaphores, condition variables, etc. to express this logic.
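A minimal sketch of the blocking handoff suggested above, written against POSIX threads as a stand-in so that it can be compiled outside Barrelfish. The assumption is that the Barrelfish mutex and condition-variable primitives can be used in the same pattern; `struct send_slot`, `slot_acquire()`, `slot_release()` and `slot_demo()` are hypothetical names, not part of any Barrelfish API.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical "send slot": busy is true while a message is in flight. */
struct send_slot {
    pthread_mutex_t lock;
    pthread_cond_t  idle;
    bool            busy;
};

void slot_init(struct send_slot *s)
{
    pthread_mutex_init(&s->lock, NULL);
    pthread_cond_init(&s->idle, NULL);
    s->busy = false;
}

/* Sender thread: sleep (instead of spin-yielding) until the previous
 * send has completed, then claim the slot. */
void slot_acquire(struct send_slot *s)
{
    pthread_mutex_lock(&s->lock);
    while (s->busy) {
        pthread_cond_wait(&s->idle, &s->lock);
    }
    s->busy = true;
    pthread_mutex_unlock(&s->lock);
}

/* Dispatcher thread: called from the send continuation. */
void slot_release(struct send_slot *s)
{
    pthread_mutex_lock(&s->lock);
    assert(s->busy);               /* releasing an idle slot is a logic error */
    s->busy = false;
    pthread_cond_signal(&s->idle); /* wake the sender if it is waiting */
    pthread_mutex_unlock(&s->lock);
}

/* Stand-in for the continuation running on the dispatcher thread. */
static void *completion(void *arg)
{
    slot_release(arg);
    return NULL;
}

/* Demo: perform n "sends", each completed by a separate thread.
 * Returns the number of sends started, or -1 on thread creation failure. */
int slot_demo(int n)
{
    struct send_slot s;
    pthread_t t;
    int sent = 0;

    slot_init(&s);
    for (int i = 0; i < n; i++) {
        slot_acquire(&s);          /* blocks until the previous completion ran */
        if (i > 0) {
            pthread_join(t, NULL);
        }
        if (pthread_create(&t, NULL, completion, &s) != 0) {
            return -1;
        }
        sent++;
    }
    if (n > 0) {
        pthread_join(t, NULL);
    }
    pthread_cond_destroy(&s.idle);
    pthread_mutex_destroy(&s.lock);
    return sent;
}
```

The waiting thread re-checks `busy` in a loop because a condition variable wakeup only signals that the state may have changed.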

Even then, you should beware that binding objects are not thread-safe (this was motivated by avoiding overhead for single-threaded fully event-driven programs, but in hindsight is not ideal from a principle of least surprise). If there is a chance that two threads can call into the binding at the same time (e.g. event_dispatch() racing with a call on a send function), you need to protect that from the outside. I wasn't able to convince myself that this was the case in your program.
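As a rough illustration of "protecting the binding from the outside", the sketch below serialises two entry points on a shared object with one application-owned lock. It uses POSIX threads and a hypothetical `struct fake_binding` as stand-ins; it does not call any real Barrelfish or flounder function, and all names in it are assumptions made for the example.

```c
#include <assert.h>
#include <pthread.h>

#define ITERATIONS 100000

/* Hypothetical stand-in for a binding: its state has no internal locking. */
struct fake_binding {
    long calls;   /* mutated by every entry point */
};

/* One lock, owned by the application, guards every call into the binding. */
static pthread_mutex_t binding_lock = PTHREAD_MUTEX_INITIALIZER;

/* Stand-in for a send call made by the worker thread. */
static void guarded_send(struct fake_binding *b)
{
    pthread_mutex_lock(&binding_lock);
    b->calls++;
    pthread_mutex_unlock(&binding_lock);
}

/* Stand-in for the dispatcher thread handling one event on the binding. */
static void guarded_dispatch(struct fake_binding *b)
{
    pthread_mutex_lock(&binding_lock);
    b->calls++;
    pthread_mutex_unlock(&binding_lock);
}

static void *sender_thread(void *arg)
{
    for (int i = 0; i < ITERATIONS; i++) {
        guarded_send(arg);
    }
    return NULL;
}

static void *dispatch_thread(void *arg)
{
    for (int i = 0; i < ITERATIONS; i++) {
        guarded_dispatch(arg);
    }
    return NULL;
}

/* Runs both threads against one binding and returns the final call count. */
long binding_demo(void)
{
    struct fake_binding b = { 0 };
    pthread_t s, d;
    int rc;

    rc = pthread_create(&s, NULL, sender_thread, &b);
    assert(rc == 0);
    rc = pthread_create(&d, NULL, dispatch_thread, &b);
    assert(rc == 0);
    pthread_join(s, NULL);
    pthread_join(d, NULL);
    return b.calls;
}
```

In a real program the lock must not be held while the dispatcher thread is blocked waiting for events, otherwise the sending thread would be starved; how to arrange that depends on the program.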

Hope that helps somehow,
Andrew

From: zeus at aluzina.org [mailto:zeus at aluzina.org] On Behalf Of Zeus Gómez Marmolejo
Sent: Wednesday, 14 September, 2011 16:18
To: Baumann Andrew
Cc: barrelfish-users at lists.inf.ethz.ch
Subject: Re: [Barrelfish-users] Possible bug?

OK, yes, you are right. There is an extra malloc() done by the flounder stub in the sender when it receives the buffer back, and that buffer is never freed. I'm sending you the corrected version now. But in any case, the user page fault doesn't go away... :(

Thank you for your help

zeus.
On 15 September 2011 00:55, Zeus Gómez Marmolejo <zeus.gomez at bsc.es> wrote:
OK, I'm checking this now, but this is not the problem. If malloc() failed, the sender itself would fault when accessing the buffer, since the length is set to 65536 regardless.

But the sender never fails; it is always the receiver that fails. (The buffer on the receiver side is allocated in lib/barrelfish/flounder_support.c:424.) It's the user's job to free the pointer allocated by the flounder stub, and this is always done in the continuation closure (function endr()) on the receiver side.

zeus.

On 14 September 2011 19:59, Baumann Andrew <andrewb at inf.ethz.ch> wrote:

Hi,

A quick first thought: have you checked whether malloc() is failing and returning NULL on the sender side, which is then being delivered as a zero-length NULL buffer on the receiver? You do appear to have a memory leak on the receiver (someone needs to free the buffer p after it has been sent), so this might be the cause.
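The check suggested above could look roughly like the following. `alloc_send_buffer()` is a hypothetical helper invented for this sketch; the actual send call is left out because it depends on the generated flounder stubs.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define BUF_LEN 65536

/* Hypothetical helper: returns a buffer filled with `fill`, or NULL if the
 * allocation failed, so that the caller can skip the send instead of
 * passing a NULL pointer with a non-zero length. */
uint8_t *alloc_send_buffer(size_t len, uint8_t fill)
{
    assert(len > 0);   /* malloc(0) may legitimately return NULL */

    uint8_t *buf = malloc(len);
    if (buf == NULL) {
        return NULL;   /* report the failure rather than sending */
    }
    memset(buf, fill, len);
    return buf;
}
```

The caller would test the result against NULL before handing the buffer to the send function, and free it once the send has completed.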

Andrew

From: Zeus Gómez Marmolejo [mailto:zeus.gomez at bsc.es]
Sent: Wednesday, 14 September, 2011 10:29
To: barrelfish-users at lists.inf.ethz.ch
Subject: [Barrelfish-users] Possible bug?

Hi all,

I'm trying out Barrelfish's ability to send large buffers (currently 64 KB) between cores using flounder, and I'm experiencing strange behaviour. After sending some messages and running for about 1 minute in QEMU, the program segfaults or simply hangs. I'm not sure whether I am doing something wrong... I'm using the latest version of Barrelfish.

The program is based on "usr/tests/idctest/idctest.c". It spawns another instance on core 1 and sets up the binding. After that, it creates another thread to "dispatch events". The main loop on core 0 sends messages, while the main loop on core 1 just waits for a reply queue to have some messages and sends them back. It uses the same "test.if" interface that idctest.c uses. I'm sending it as an attachment. You can try it simply by copying it over the existing idctest.c.

So, core 0 is sending a message to core 1:

a = malloc(65536);
test_buf__tx(b, MKCONT(end, 0), a, 65536);

Before sending it, it obtains a lock to ensure that the next message is not sent before the previous one has been sent:

  while (__sync_lock_test_and_set(&busy, true))
    thread_yield();

The continuation closure is releasing the lock:

static void end(void *r)
{
  free (a);
  busy = false;
}

This function is always called by the second thread, which is always dispatching messages. I know this spinlock is not very efficient... Is there any other way to block the thread and have the other thread wake it up?
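A side note on the lock itself, as a sketch: GCC documents `__sync_lock_release()` as the counterpart of `__sync_lock_test_and_set()`, and it stores 0 with release semantics, which a plain `busy = false` assignment is not guaranteed to have. Whether this has anything to do with the fault is not established. The helper names `try_acquire()` and `release()` are hypothetical, and the flag is an `int` here purely for the example.

```c
#include <assert.h>
#include <stdbool.h>

/* Lock flag, manipulated only through the GCC __sync builtins. */
static int busy = 0;

/* Returns true if the lock was taken, false if it was already held. */
bool try_acquire(void)
{
    /* Atomically writes 1 and returns the previous value (acquire barrier). */
    return __sync_lock_test_and_set(&busy, 1) == 0;
}

/* Releases the lock taken by try_acquire(). */
void release(void)
{
    assert(busy != 0);           /* releasing an unheld lock is a logic error */
    /* Writes 0 with release semantics, unlike a plain `busy = 0` store. */
    __sync_lock_release(&busy);
}
```

A sender would loop on `try_acquire()` and yield while it returns false, and the continuation would call `release()`.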

When core 1 receives the message:

static void buf1(struct test_binding *b, uint8_t *p, size_t buflen)
{
  debug_printf("buf1 %u\n", p[65535]);
  ambf_buff_send(p);
}

It simply prints the last byte and sends the buffer back to core 0, queuing it for the other thread so as not to block the event dispatcher.

The program segfaults when accessing buffer p in the previous function, as p is sometimes NULL. In any case, no error is reported...
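To turn the silent fault into a reported error while debugging, the handler could validate its arguments before indexing the buffer. This is only a sketch: `last_byte_or_error()` is a hypothetical helper, and it prints with `fprintf()` where the program above uses `debug_printf()`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical check: returns the byte at index expected_len - 1, or -1 if
 * the buffer is NULL or too short to be indexed safely. */
int last_byte_or_error(const uint8_t *p, size_t buflen, size_t expected_len)
{
    assert(expected_len > 0);   /* the code below indexes expected_len - 1 */

    if (p == NULL || buflen < expected_len) {
        fprintf(stderr, "bad buffer: p=%p buflen=%zu\n",
                (const void *)p, buflen);
        return -1;
    }
    return p[expected_len - 1];
}
```

Logging `buflen` alongside the pointer shows whether a NULL `p` arrives together with a zero length or with the full 65536.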

I would like to know if you see something wrong here ...

Many thanks!!

--
Zeus Gómez Marmolejo
Barcelona Supercomputing Center
PhD student
http://www.bsc.es
