Debugging - into rabbit hole with docker containers

This post will tell a story about successful debugging session. As an author of a portion of code, we are naturally biased to not to think about worst ways of exploiting it. Luckily I have creative colleagues. Recently one of my colleagues did some additional testing and left me with a destructive, yet interesting bug related to Docker, sandboxing and abuse. In this post I will describe how I approached and fixed it.

In syncano - company I work at - we allow for custom code execution, it’s a part of our Backend as a Service solution. We run our code in Docker containers which should be more less safe if enough security is in place. We do such things like executing this code on separate machines that rest of the tasks. Those machines have less privileges and less knowledge about the system and containers itself are constrained on things like memory usage and time of execution.

So I was really surprised recently that snippet such as:

while True:
    print(“woopdido!”)

could cause us serious trouble. The symptoms were that codebox (abstraction for wrapper around container and code) wouldn’t time out, but will stay in the pending state. We would expect it to be terminated after 30 seconds. Moreover codeboxes scheduled next, after the one that was stuck, would also stay in pending state.

In our monitoring platform we could see that load on this kind of workers was surprisingly high and that codeboxes that were meant to time out after 15 or 30 seconds could run for hours! Clearly it was a bug in our implementation of timeout subroutine.

It was a bit surprising, because I remembered implementing this part of code and writing tests for it. If there was a regression in this part of code, we should catch it on our Continuous Integration server during build. Tests were green.

I’ve remembered that in our integration test cases I used time.sleep and quickly tested a codebox with sleep on our staging:

import time

while True:
    time.sleep(0.1)
    print(“woopdido!”)

And this time, the behavior was correct. Maybe it was connected to some kind of waiting? We used to send a signal to container to shut it down if hasn’t finished on time.

I quickly verified this hypothesis by trying out another example:

x = 0
while True:
    x += 1
return x

This time behavior was also correct. So it wasn’t specific for using time! It was pretty late when I found it out, so I just wrote a small note in the bug ticket. Next step was to recreate the same problem on my local machine next day.

Recreating the bug was quite messy. One side effect was making my machine swap. One of the first things I noticed was that we were using the pretty old version of the docker library for python.

I freezed version 0.6 and there was version 1.2.2 out there when I started debugging it. Time passes really fast. From one side of view, it’s bad that we haven’t updated this dependency for half a year, from the other side - it’s nice to have a piece of code that doesn’t have to be modified often.

Anyway, it turned out that new version has timeout functionality built in. Great, I thought, probably my custom implementation of timeout was wrong in some way and the official one would handle my case better.

Actually - it didn’t change anything. I also noticed during testing with few flavours of problematic scripts, that the whole problem it’s not really about timeout - but about resource usage. Even if container was killed on time, memory wasn’t returned to the OS. So problem lied somewhere else. My next suspect was extremely verbose printing.

I measured how much logs can be written to a file during 15 seconds assuming simple printing. Not too much - it would take long to display with tool such as less, less than few hundreds Mbs. It definitely didn’t match the behavior of eating all my RAM (16 Gb on my private laptop and 8Gb on my work machine). However if I did more ‘printing’ problem only had gotten worse:

for _ in range(1000000):
    print(“hello” * 1000000)

Would it my RAM with speed of half Gb per second - and before actual time out machine would start swapping like crazy. It shouldn’t happen, because newly spawned containers were memory constrained and not even one of my containers seemed to be busy - by observing their processes I would rather say that they were bored. Problem was probably on host.

I could easily observe the RAM / cpu disaster in a window with htop opened. There were two ways to get out from this situation - restart my computer or restart docker. Both very tedious, because with whole environment using docker it meant restarting the whole stack every time.

When I tried to access logs from the container which was running codebox containers, for few times I got an interesting exception:

# docker run -ti ubuntu:trusty bash
FATA[0028] Error response from daemon: Cannot start container 02533d9e9afcc81818d3d71aa7498e0f41e7904483fd041fa45a0638423a5d3a: [8] System error: fork/exec /usr/bin/docker: cannot allocate memory

Initially I thought that it was because docker had eaten all my RAM, but it turned out that I could do other things that required memory allocations. Puzzled, I googled this exact exception message.

It led me to the issue on docker’s github page. The main asker describe the similar problem, verbose logging resulted in big memory leaks. Bingo! That looked roughly like my problem.

github issue

memory leak in buffer grow #9139 Closed e-max opened this issue on Nov 13, 2014 · 29 comments

Memory leak issue in buffer grow. It sounded nasty, but I decided to read all the comments. Some devs suggested using different logging handlers available in new versions of docker, but it didn’t work for me - I had to restart my environment few times more. Moreover on the end of the issue page Docker devs say that those problems won’t be fixed until 1.7 (probably).

Sometimes you’re left to your own devices. I thought for a while about creating a patch for Docker, but it wouldn’t be there until the next release, I also wouldn’t be able to use my own patched version of docker, because we have some other environment constraints. Well, I’m also a very novice golang programmer and I had no experience with fixing memory allocation problems. So I had to think about workaround on our side.

If you’ve spend some time in terminal window on Linux or Mac, you’re probably familiar with stream redirection. If you can’t print to foreground, why not redirect it to the files? It’s a good solution because it’s language independent, so I wouldn’t have to worry about capturing streams in every language that we support for codebox. > sign in bash is much more elegant.

It sounded like the exact thing that I needed. I quickly changed the command string that was passed to newly created containers:

'python my_script.py' -> 'python my_script.py > logs'

I started the adversary container and… nothing changed. And my computer started swapping again. Duh, what could go wrong?

I remembered that sometimes writes in Copy On Write filesystems can give worse performance, than those on more typical ones, such as EXT4, however this behavior was clearly strange. I tried to mount logs file as volume, so that I would omit COW filesystem.

It didn’t work, not only my computer also ended up without free RAM, I also couldn’t see any logs in the mounted files. It was strange and on this point I was also pretty tired. I got some coffee, juggled for five minutes and helped fellow devops in his problem.

I clearly wasn’t redirecting correctly to the files. Luckily, the offensive containers were running for 15 to 30 seconds depending on my timeout configuration and I could make sure that stdout was redirected by looking up container’s logs.

$ docker log -f nifty_ritchie
logging
logging
logging
….

They weren’t redirected correctly. By sheer accident I remembered that I’ve read somewhere on some blog or even in docs, that wrapping calls of docker commands in bash -c ‘original bash stuff’ could solve some unexpected behavior.

That was it.

 'bash -c ‘python my_script.py > logs 2> errors’'

was a winning combination.

Summary

I was lucky that it took me only one day to debug this problem. Bugs always seem obvious in retrospect, but can be a real rabbit hole and consume lots of your precious time that could be spent on developing features.

Here are my rules of thumb for debugging:

recreate your bug in local environment that you control (if that’s possible)
write your hypothesis down
write down your assumptions and try to test them as fast as possible, preferably in isolation
always try to learn more about the problem
take breaks - thinking in front of computer can be really hard, especially if you’re deeply involved into problem, it's also very important to avoid getting unproductive and wasting your time, I've recently seen a very good talk 'Into the Rabbit Hole' by @asendecka that talks about exactly that.
talk to people :) - if they’re technical, they will help you spot flaws in your thinking, if they aren’t technical, at least you’ll get some perspective during the break

Debugging is more like art than a science. I would like to hear about your opinions and your best techniques. Just comment below, talk to me on twitter on @attilczuk or write me an email to justyna.ilczuk@gmail.com.