[Barrelfish-users] Corruption sending buffer
Baumann Andrew
andrewb at inf.ethz.ch
Wed Jan 18 19:01:57 CET 2012
Hi Zeus,
I was dubious, but you're right, it really is broken. I'm amazed the rest of the system works as well as it does! :)
The good news is that it's not just random memory corruption. If I print out the expected and received values, the received value is always exactly 32 higher than expected:
expected 200711824 but got 200711856
expected 218117022 but got 218117054
expected 292546689 but got 292546721
expected 360468278 but got 360468310
expected 370495166 but got 370495198
expected 377519952 but got 377519984
expected 581714255 but got 581714287
expected 619853741 but got 619853773
...
The default UMP channel length (which is chosen so as to fit two unidirectional channels in a 4k page) is 32 messages. I can't see any pattern to when the incorrect values arise (in the overall stream), so this suggests pretty strongly that there is a race condition occurring on wrap-around in the underlying channel. I'll try to get to the bottom of it...
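To make the wrap-around hypothesis concrete, here is a minimal model in C. It is not Barrelfish code: the function name and the assumption that message i lands in slot i % 32 are purely illustrative. It shows one way a value that is exactly 32 too high could appear, namely if the sender runs a full lap ahead and overwrites a slot before the receiver has read it.

```c
#include <assert.h>
#include <stdint.h>

#define RING_SLOTS 32u  /* default channel length quoted above */

/* Hypothetical model: the sender has written messages 0..sent-1, and
 * message i is stored in slot i % RING_SLOTS. Returns the value now
 * held in the slot that message `seq` was originally written to. */
uint32_t slot_value_for(uint32_t seq, uint32_t sent)
{
    assert(sent > seq);                 /* message seq must exist */
    uint32_t newest = sent - 1;         /* last message written */
    /* Number of times the slot has been overwritten since seq. */
    uint32_t laps = (newest - seq) / RING_SLOTS;
    return seq + laps * RING_SLOTS;
}
```

In this model, a receiver that is at most 31 messages behind always reads the value it expects. As soon as the sender gets 32 or more messages ahead, the slot holds the expected value plus 32, which matches the pattern in the log. Whether this is what actually happens in the UMP channel remains to be confirmed.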
Andrew
From: zeus at aluzina.org [mailto:zeus at aluzina.org] On Behalf Of Zeus Gómez Marmolejo
Sent: Tuesday, 17 January 2012 15:26
To: Baumann Andrew
Cc: barrelfish-users at lists.inf.ethz.ch
Subject: Re: [Barrelfish-users] Corruption sending buffer
Ok,
I've simplified the example even more. Now no buffer is sent at all, only a uint32_t passed as a parameter to the message handler. The integer is incremented by 1 with each message. The test keeps running until the integer received by the second core is not the expected one, at which point it aborts.
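The receiver-side check described above could look roughly like the following sketch. The struct and function names are hypothetical and are not taken from the attached test program. The sketch only illustrates the idea of comparing each received uint32_t against a locally tracked counter.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical receiver-side state; names are illustrative only. */
struct seq_check {
    uint32_t expected;   /* next value the receiver should see */
};

/* Returns true and advances the counter if `got` is the expected value.
 * On a mismatch, stores the signed difference (got - expected) in *delta
 * and leaves the counter unchanged. */
bool seq_check_recv(struct seq_check *sc, uint32_t got, int64_t *delta)
{
    assert(sc != NULL && delta != NULL);
    if (got != sc->expected) {
        *delta = (int64_t)got - (int64_t)sc->expected;
        return false;
    }
    sc->expected++;      /* unsigned wrap-around is well defined in C */
    return true;
}
```

Reporting the difference rather than just aborting makes it easier to tell a stale or skipped message from random garbage.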
As before, there are 2 cores, one sending and the other receiving with 1 thread per core. x86_64 is the architecture.
This error prevents my benchmarks from running: I'm sending a lot of messages from one node to another, and at some point the run aborts because the GASNet handler index is not correct.
Thanks for your help!
zeus.
On 17 January 2012 15:06, Zeus Gómez Marmolejo <zeus.gomez at bsc.es> wrote:
Hi Andrew,
On 16 January 2012 20:58, Baumann Andrew <andrewb at inf.ethz.ch> wrote:
Hi Zeus,
At first look your program appears to be correct if sizeof(unsigned) == 4 on both source and destination. Although sending 4kB buffers through the IDC system in this way is madness, it should work.
Yes, I forgot to say that the architecture I was running is x86_64.
Unfortunately I'm not in a good position to run this right now (... it's a long story involving snow and an out-of-date QEMU). When it fails, is the incorrect value one from the previous iteration, or is it garbage?
The incorrect value is 0 almost every time it fails.
BTW, there is a minor use-after-free bug in your debug_printf() in the receive handler.
You are right, but placing the debug_printf() before the free() doesn't make the error go away.
Andrew
From: Zeus Gómez Marmolejo [mailto:zeus.gomez at bsc.es]
Sent: Monday, 16 January 2012 11:39
To: barrelfish-users at lists.inf.ethz.ch
Subject: [Barrelfish-users] Corruption sending buffer
Dear Barrelfish developers,
I believe that I've found a bug in the message passing interface when a buffer is sent between two endpoints. I would like you to have a look at this example I'm sending to you. You can copy it to the folder "usr/tests/idctest" of the latest version of the public repository. It should build correctly.
This is a very simple example: it has 2 cores, with one thread per core. Only one core sends messages to the other. Each message is a simple buffer of 1024 unsigned integers in which the first and the last integer hold the same value, and that value is incremented with each message sent. The message handler on the receiver just checks that the first and the last integer of the buffer are the same.
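As a rough illustration of the buffer layout and check described above (not the attached program itself), the sender and receiver logic might be sketched like this. The function names are hypothetical, and zeroing the middle of the buffer is an assumption, since the description only specifies the first and last elements.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define BUF_WORDS 1024u   /* buffer of 1024 unsigned integers */

/* Sender side: stamp the same counter into the first and last word.
 * Assumption: the remaining words are simply zeroed. */
void fill_buffer(unsigned *buf, unsigned counter)
{
    assert(buf != NULL);
    memset(buf, 0, BUF_WORDS * sizeof *buf);
    buf[0] = counter;
    buf[BUF_WORDS - 1] = counter;
}

/* Receiver side: the buffer is considered intact only if the first
 * and last words still match. */
bool buffer_intact(const unsigned *buf)
{
    assert(buf != NULL);
    return buf[0] == buf[BUF_WORDS - 1];
}
```

This check only detects corruption that affects one end of the buffer differently from the other. A buffer delivered whole but out of order would pass.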
The application keeps running until it finds that the two integers differ, which means that the buffer was transferred incorrectly. I've tested this in QEMU and on a real machine, and after a while the program aborts because of a corrupt buffer. On both systems, the error appears within roughly 2 minutes.
Can you have a look at this example and check whether I'm doing something wrong?
Many thanks!!
--
Zeus Gómez Marmolejo
Barcelona Supercomputing Center
PhD student
http://www.bsc.es