John Floren

Home | Blog | Uses | Links
Back to blog archive

Posted 2013/10/8

The Mystery of the Disappearing Memory

The other day, I was testing some Go programs I had written. One was a distributed key-value store, and the other was a parallel implementation of Conway’s Game of Life that uses the k-v store to distribute data and coordinate iterations. The Conway’s program has node 0 print out the “game board” after each iteration, which means it pulls down the results from every other node, stitches them together, and iterates through to print them out. I was running my tests on 4 virtual machines, each with about half a gig of memory.

Early runs with a 10x10 game board went fine. I started scaling up my tests, redirecting the output to a file when the game board printouts became too big to display on stdout. Then I tried running with a 2000x2000 gameboard. The “conway” process on node 0 was soon shut down by Linux’s OOM killer–I had managed to use up all the memory on the VM.

I restarted the test, and repeatedly ran “free”, watching as the available memory steadily declined on node 0, but not on any other node. What’s the difference between the two nodes? Node 0 runs two additional functions: printfield, and cleaniteration. Printfield grabs every other node’s output and prints it out. Cleaniteration goes through and deletes the data from earlier, no longer needed iterations.

Go is garbage collected, so I didn’t have to go through and count my mallocs and frees. I figured there must be a problem with how I delete elements in the key-value store; with a 2000x2000 game board, each node’s data takes up a full megabyte of memory, so if things weren’t being deleted, that would cause an OOM kill pretty quickly. After writing a test to spam the key-value store with values, then delete them, I determined that deletion seemed to be OK.

I tried commenting out the call to “cleaniteration”. Now every node’s free memory decreased rapidly as it ran, but node 0’s still decreased faster. The problem had to be in printfield.

Printfield creates a new array the size of the game board. It then goes into a doubly-nested for loop to grab each node’s output for that iteration and stick the output into the appropriate place in the board. Once the board is populated, it iterates through and prints. This is pretty much CS 101. I figure maybe there’s a bug with how we’re fetching the node data from the key-value store, so I instead have it use an empty array rather than call out to the key value store. Same problem.

Ok, let’s skip that whole part where we populate the game board. Now the function just allocates the game board and prints it out (very boring, since the array is initialized to zeroes). We’re still using up all the memory on the VM.

Just for kicks, let’s comment out the printing part. The function just allocates the empty array and returns. Upon return, Go should garbage collect the array.

Now we don’t see any memory problems.

What.

Oh. DAMN IT! It’s easy to forget, because I usually work on a regular desktop, but these VMs were booted with ramdisk root filesystems. That means everything is in RAM. Including /tmp. Where I had been sending the output of the program. Meaning on each iteration, I was writing out 4 megabytes as I printed the game board. This fills up available memory pretty quickly!

There was nothing wrong with my programs. There was nothing wrong with Go, although I suspected for a little while that it wasn’t properly deleting elements from maps–luckily, I didn’t make an ass of myself by emailing Russ, I just made an ass of myself by writing up the whole fiasco and posting it online. The only thing wrong was my invocation of the program, when I said “conway -s 2000 > /tmp/output”.

Is there a moral to this story? I guess it’s just that you should be mindful of your environment–if I had remembered that I was using a ramfs, and that the program would generate a ton of output, I’d have never wasted an afternoon on this nonsense. Still, it was a kind of fun detective exercise and it definitely made me go back and re-examine a lot of my code.