[Barrelfish-users] Corruption sending buffer

Zeus Gómez Marmolejo zeus.gomez at bsc.es
Thu Jan 19 13:18:02 CET 2012


Wow, amazing!!! It really fixes the bug. That was a record-time fix!! Thank
you very much, Andrew :)

On 18 January 2012 at 20:15, Baumann Andrew <andrewb at inf.ethz.ch> wrote:

> I believe I’ve fixed the bug; it was actually more mundane than I
> guessed: the UMP stubs would send an ack for a received message before they
> had consumed the message payload.
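
To make the ordering problem described above concrete, here is a minimal single-threaded sketch. It is not the Barrelfish stub code: the struct, the function names and the ring size of 4 are all hypothetical, and the "sender" is simply run by hand at the worst possible moment to model the race.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SLOTS 4u /* small ring for illustration only (assumption) */

/* Hypothetical model of a one-way channel; not the real UMP layout. */
struct chan {
    uint32_t slot[SLOTS]; /* one payload word per message slot */
    uint32_t sent;        /* number of messages written by the sender */
    uint32_t acked;       /* number of messages released by the receiver */
    uint32_t next;        /* next counter value the sender will send */
};

/* Sender: write into every slot the receiver has released so far. */
static void sender_run(struct chan *c)
{
    while (c->sent - c->acked < SLOTS) {
        c->slot[c->sent % SLOTS] = c->next++;
        c->sent++;
    }
}

/* Receive one message. With ack_first set, the slot is released before
 * the payload is copied out, and the sender runs in between. */
static uint32_t receive(struct chan *c, bool ack_first)
{
    uint32_t seq = c->acked;
    uint32_t payload;

    assert(c->sent != seq); /* a message must be pending */
    if (ack_first) {
        c->acked = seq + 1;              /* slot released too early */
        sender_run(c);                   /* sender may overwrite it */
        payload = c->slot[seq % SLOTS];  /* reads the newer message */
    } else {
        payload = c->slot[seq % SLOTS];  /* consume the payload first */
        c->acked = seq + 1;              /* only then release the slot */
        sender_run(c);
    }
    return payload;
}

/* Send the counter values 0..n-1 and count how many arrive wrong. */
static unsigned count_corrupt(unsigned n, bool ack_first)
{
    struct chan c = {{0}, 0, 0, 0};
    unsigned bad = 0;

    sender_run(&c);
    for (uint32_t expected = 0; expected < n; expected++) {
        if (receive(&c, ack_first) != expected) {
            bad++;
        }
    }
    return bad;
}
```

In this model, count_corrupt(100, true) reports every message as wrong (each value is too high by exactly SLOTS), while count_corrupt(100, false) reports none. A real cross-core channel would also have to care about memory ordering, which this sketch ignores.
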
>
> There are two patches attached:
>
> ump_flounder.patch fixes the broken logic in the stub compiler. It should
> be the only one you need.
>
> ump_support.patch fixes a minor unrelated problem I noticed while looking
> for the bug: the “next sequence ID” field was mislabelled and started from
> 0 rather than 1. I believe this is independent of the previous patch, but
> since I’m testing with this code in my tree, I’m including it for
> completeness.
>
> Please let me know how you go…
>
> Andrew
>
> *From:* Baumann Andrew
> *Sent:* Wednesday, 18 January 2012 10:02
> *To:* 'Zeus Gómez Marmolejo'
> *Cc:* barrelfish-users at lists.inf.ethz.ch
> *Subject:* RE: [Barrelfish-users] Corruption sending buffer
>
> Hi Zeus,
>
> I was dubious, but you’re right, it really is broken. I’m amazed the rest
> of the system works as well as it does! :)
>
> The good news is that it’s not just random memory corruption. If I print
> out the expected and received values, the received value is always off by
> 32:
>
> expected 200711824 but got 200711856
> expected 218117022 but got 218117054
> expected 292546689 but got 292546721
> expected 360468278 but got 360468310
> expected 370495166 but got 370495198
> expected 377519952 but got 377519984
> expected 581714255 but got 581714287
> expected 619853741 but got 619853773
> …
>
> The default UMP channel length (which is chosen so as to fit two
> unidirectional channels in a 4k page) is 32 messages. I can’t see any
> pattern to when the incorrect values arise (in the overall stream), so this
> suggests pretty strongly that there is a race condition occurring on
> wrap-around in the underlying channel. I’ll try to get to the bottom of it…
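
The link between "off by 32" and a 32-message channel can be checked with a small sketch. It assumes, purely for illustration, that each message occupies exactly one slot of a 32-slot ring and that the payload is the incrementing counter from the test; the names below are hypothetical and not taken from the Barrelfish sources.

```c
#include <assert.h>
#include <stdint.h>

#define CHAN_LEN 32u /* default UMP channel length quoted in the thread */

/* Hypothetical ring: one payload word per message slot. */
struct ring {
    uint32_t slot[CHAN_LEN];
};

/* Message number seq always lands in slot seq % CHAN_LEN. */
static void ring_write(struct ring *r, uint32_t seq, uint32_t payload)
{
    r->slot[seq % CHAN_LEN] = payload;
}

static uint32_t ring_read(const struct ring *r, uint32_t seq)
{
    return r->slot[seq % CHAN_LEN];
}

/* What the receiver reads for message seq if the sender has already
 * wrapped around once and reused the slot before the read happened.
 * The payload of each message is its own counter value. */
static uint32_t read_after_lap(uint32_t seq)
{
    struct ring r = {{0}};

    for (uint32_t i = 0; i <= CHAN_LEN; i++) {
        ring_write(&r, seq + i, seq + i); /* last write reuses seq's slot */
    }
    return ring_read(&r, seq);
}
```

Under these assumptions, read_after_lap(200711824) yields 200711856, matching the first mismatch in the list above: the stale slot already holds the message sent 32 steps later.
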
>
> Andrew
>
> *From:* zeus at aluzina.org [mailto:zeus at aluzina.org] *On Behalf Of* Zeus
> Gómez Marmolejo
> *Sent:* Tuesday, 17 January 2012 15:26
> *To:* Baumann Andrew
> *Cc:* barrelfish-users at lists.inf.ethz.ch
> *Subject:* Re: [Barrelfish-users] Corruption sending buffer
>
> OK,
>
> I've simplified the example even more. Now no buffer is sent, only a
> uint32_t as a parameter of the message handler. This integer is
> incremented by 1 for each message. The program keeps going until it aborts
> because the integer received by the second core is not what it should be.
>
> As before, there are 2 cores, one sending and the other receiving, with 1
> thread per core. The architecture is x86_64.
>
> This error prevents my benchmarks from running, as I'm sending a lot of
> messages from one node to another and at some point it aborts because the
> GASNet handler index is not correct.
>
> Thanks for your help!
>
> zeus.
>
> On 17 January 2012 at 15:06, Zeus Gómez Marmolejo <zeus.gomez at bsc.es>
> wrote:
>
> Hi Andrew,
>
> On 16 January 2012 at 20:58, Baumann Andrew <andrewb at inf.ethz.ch>
> wrote:
>
> Hi Zeus,
>
> At first look your program appears to be correct if sizeof(unsigned) == 4
> on both source and destination. Although sending 4kB buffers through the
> IDC system in this way is madness, it should work.
>
> Yes, I forgot to say that the architecture I was running on is x86_64.
>
> Unfortunately I’m not in a good position to run this right now (… it’s a
> long story involving snow and an out-of-date QEMU). When it fails, is the
> incorrect value one from the previous iteration, or is it garbage?
>
> When it fails, the incorrect value is almost always 0.
>
> BTW, there is a minor use-after-free bug in your debug_printf() in the
> receive handler.
>
> You are right, but placing the debug_printf() before the free() doesn't
> make the error go away.
>
> Andrew
>
> *From:* Zeus Gómez Marmolejo [mailto:zeus.gomez at bsc.es]
> *Sent:* Monday, 16 January 2012 11:39
> *To:* barrelfish-users at lists.inf.ethz.ch
> *Subject:* [Barrelfish-users] Corruption sending buffer
>
> Dear Barrelfish developers,
>
> I believe that I've found a bug in the message passing interface when a
> buffer is sent between two endpoints. I would like you to have a look at
> the example I'm sending you. You can copy it to the folder
> "usr/tests/idctest" of the latest version of the public repository. It
> should build correctly.
>
> This is a very simple example: it uses 2 cores, with one thread per core.
> Only one core sends messages to the other. The message is a simple buffer
> of 1024 unsigned integers in which the first and last integers are the
> same; this value is incremented in each message sent. The message handler
> on the receiver just checks that the first and last integers of the
> buffer are the same.
>
> The application keeps running until it finds that the two integers differ,
> which means that the buffer was transferred incorrectly. I've tested this
> in QEMU and on a real machine, and after a while the program aborts
> because of a corrupt buffer. On both systems, the error happens within
> approximately 2 minutes.
>
> Can you have a look at this example and check whether I'm doing something
> wrong?
>
> Many thanks!!
>
> --
> Zeus Gómez Marmolejo
> Barcelona Supercomputing Center
> PhD student
> http://www.bsc.es



-- 
Zeus Gómez Marmolejo
Barcelona Supercomputing Center
PhD student
http://www.bsc.es