Developer experiences from the trenches

How To Crash With Kubernetes and Go

Sat 23 February 2019 by Michael Labbe
tags code

Kubernetes is so good at maintaining a user-facing veneer of a stable service that you might not even know that you are periodically crashing until you set up log aggregation and do a keyword search for panic. You can miss crash cues because pods spin up so transparently.

Okay, so your application can crash. You are using Go. What can you do about it? In practice, here are the steps we have found useful:

Log your panic record into a single log line so it can be tracked.
If a panic occurred while serving a RESTful request, return 500 to prevent client timeout while continuing to serve others.
Handle panic-inducing signals such as SIGSEGV gracefully.
Handle Kubernetes pod pre-shutdown SIGTERM messages.


Panic-inducing Signals

If you write a C program and do not explicitly handle SIGSEGV with signal(2), the receipt of SIGSEGV terminates the whole process, not just the offending thread.

Go is different from C. Go’s runtime has a default panic handler that catches these signals and turns them into a panic. Defer, Panic and Recover on the official blog covers the basic mechanism.

SIGSEGV (“segmentation violation”) is the most common one. Go will happily compile this SIGSEGV-generating code:

var diebad *int
*diebad++       // oh, no

The full list of panic reasons is described in the official panic.go source.
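
To see the conversion from signal to panic at work, here is a minimal sketch that triggers the nil dereference above and recovers from it. The helper name derefNil is hypothetical and exists only for illustration. On a typical platform the recovered value implements runtime.Error, which lets you tell runtime faults apart from panics raised by your own code.

```go
package main

import (
	"fmt"
	"runtime"
)

// derefNil (hypothetical helper) dereferences a nil pointer and reports
// what the Go runtime turned the resulting fault into.
func derefNil() (msg string, isRuntimeErr bool) {
	defer func() {
		if r := recover(); r != nil {
			// Faults converted by the runtime arrive as a panic value
			// that implements runtime.Error.
			_, isRuntimeErr = r.(runtime.Error)
			msg = fmt.Sprint(r)
		}
	}()

	var diebad *int
	*diebad++ // faults here; the runtime turns the signal into a panic
	return "no panic", false
}
```

Calling derefNil from main and printing both results shows the message the runtime attaches to the fault, without the process dying.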

Non-Panic Inducing Signals

Not every signal produces a Go panic — not by a long shot. Linux has over 50 signals. Version 7 had 15 different signals; SVR4 and 4.4BSD both have 31 different signals. Signals are a kernel interface exposed in userspace, and a primary means for processes to contend with their role in the larger operating system.

Let’s go over the non-panic inducing signals and discuss what they mean to our Kubernetes-driven Go program:

Unignorable signals: SIGKILL and SIGSTOP can’t be caught or ignored. They are provided by the kernel as a surefire way of killing or suspending a process. If SIGKILL is received, the process terminates without warning and we have to rely on logging coming from external sources. It is not recommended to use unignorable signals in automating your process restarts.
Flow-related signals: Many signals can be classified as supporting thread execution. These include SIGCONT and SIGPIPE. They do not interact with Kubernetes and we can safely ignore them or reserve them for any process-specific needs that come up.
Kubernetes-generated signals: Kubernetes sends SIGTERM to PID 1 in your container when it starts shutting down a pod, then waits out a grace period (thirty seconds by default). If you weren’t trapping this previously (and also not using a preStop hook), you are missing an opportunity to gracefully shut down your pod. By default, SIGTERM terminates the process in a Go program. The more aggressive SIGKILL is sent to any container that is still running after the grace period.

Handling Panics in Go

We’ve established that crashing signals in Go are received by its runtime panic handler, and that we want to override this behaviour to provide our own logging, stack tracing, and HTTP response to a calling client.

In some environments you can globally trap exceptions. For instance, on Windows in a C++ environment you can use Structured Exception Handling to unwind the stack and perform diagnostics.

Not so in Go. We have one technique: defer. We can set up a deferred function near the top of our goroutine stack that is executed if a panic occurs. Inside that function, calling recover tells us whether a panic is currently in progress. There are a number of gotchas with this technique:

defer does not run if os.Exit() is called. Make sure all error paths out of your process call panic or use runtime.Goexit().
defer (and recover) operate on goroutines, not processes. If you set a defer to run in main and then spawn a goroutine which panics, the defer will not be called.

We can use the latter trait to our advantage in our web service, providing a generic panic handler that logs, and a second panic handler inside the goroutine that responds to a web request, which returns a 500 error to the user.
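
Because recover is scoped to a goroutine, every goroutine you spawn needs its own deferred handler. The following is a minimal sketch of one way to package that, assuming a hypothetical helper named goSafe and a caller-supplied onPanic callback. Neither name comes from a library.

```go
package main

import (
	"fmt"
	"sync"
)

// goSafe (hypothetical helper) runs fn in a new goroutine that carries its
// own deferred recover. A defer in main would never see this goroutine's
// panic, so the handler has to live here. onPanic receives the panic value.
func goSafe(wg *sync.WaitGroup, fn func(), onPanic func(r interface{})) {
	wg.Add(1)
	go func() {
		defer wg.Done() // registered first, so it runs after the recover below
		defer func() {
			if r := recover(); r != nil {
				onPanic(r)
			}
		}()
		fn()
	}()
}

// runWorker demonstrates goSafe: the worker panics, the process survives,
// and the panic value is handed back as a string.
func runWorker() string {
	var wg sync.WaitGroup
	var got string
	goSafe(&wg, func() { panic("worker failed") }, func(r interface{}) {
		got = fmt.Sprint(r)
	})
	wg.Wait() // Done happens before Wait returns, so reading got is safe
	return got
}
```

In a real service the onPanic callback would call your logger. Whether the process should keep running after a worker panic is a judgment call that depends on what state the worker may have left behind.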

Global Panic Handler

The global panic handler is your opportunity to use your logger to provide all relevant diagnostics for crashes that occur outside of responding to an HTTP request:

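//
// Sample code to catch panics in the main goroutine
//
package main

import (
    "fmt"
    "os"
    "runtime/debug"
)

func main() {
    // Registered first, so it runs last and sees any panic that
    // unwinds the main goroutine.
    defer func() {
        r := recover()
        if r == nil {
            return // no panic underway
        }

        fmt.Printf("PanicHandler invoked because %v\n", r)

        // print the debug stack to standard error
        debug.PrintStack()

        // exit non-zero so the crash is visible as a failed container
        os.Exit(1)
    }()

    // ... the rest of your program runs here.
}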
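
The sample above prints a multi-line stack, which log aggregators tend to split into separate records. One way to meet the single-log-line goal from the start of this article is to encode the panic value and stack as one JSON object. This is a sketch under assumptions: the panicRecord schema and the formatPanic and capture helpers are hypothetical, so adapt the field names to whatever your aggregator expects.

```go
package main

import (
	"encoding/json"
	"fmt"
	"runtime/debug"
)

// panicRecord is a hypothetical log schema, not a standard format.
type panicRecord struct {
	Level string `json:"level"`
	Panic string `json:"panic"`
	Stack string `json:"stack"`
}

// formatPanic renders a recovered value and the current stack as one line
// of JSON. json.Marshal escapes the newlines inside the stack trace, so the
// result contains no raw line breaks.
func formatPanic(r interface{}) string {
	b, err := json.Marshal(panicRecord{
		Level: "panic",
		Panic: fmt.Sprint(r),
		Stack: string(debug.Stack()),
	})
	if err != nil {
		// Marshalling plain strings should not fail; keep a fallback anyway.
		return fmt.Sprintf("level=panic panic=%q", fmt.Sprint(r))
	}
	return string(b)
}

// capture runs fn and returns the single-line record if fn panics,
// or the empty string if it returns normally.
func capture(fn func()) (line string) {
	defer func() {
		if r := recover(); r != nil {
			// Called while the panic is unwinding, so the stack still
			// includes the frames of the panic site.
			line = formatPanic(r)
		}
	}()
	fn()
	return ""
}
```

In the global handler you would pass the value returned by recover to formatPanic and hand the result to your logger in place of the Printf and PrintStack pair.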

In-Request Panic Handler

Most (if not all) Go RESTful packages use a per-request goroutine to respond to incoming requests so they can be served concurrently. The top of this stack is under package control, and so it is up to the RESTful package maintainer to provide a panic handler.

go-restful defaults to doing nothing but offers an API to trap a panic, calling your designated callback. From there, it is up to you to log diagnostics and respond to the user. Check with your RESTful package for similar handlers.

go-restful’s built-in panic handler (implemented in logStackOnRecover) sends the stack trace back to the caller. Don’t use it. Write your own panic handler that leverages your logging solution and does not expose internals at a crash site to a client.
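
If your package offers no such hook, the same idea can be sketched with nothing but net/http. The middleware below is an illustration and not go-restful's API: recoverMiddleware and probe are hypothetical names, and probe exists only to exercise the handler in memory. The stack goes to the server log on one line, and the client sees a generic 500.

```go
package main

import (
	"log"
	"net/http"
	"net/http/httptest"
	"runtime/debug"
)

// recoverMiddleware (hypothetical) wraps a handler so that a panic in one
// request is logged server-side and answered with a bare 500, while other
// requests keep being served.
func recoverMiddleware(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, req *http.Request) {
		defer func() {
			if r := recover(); r != nil {
				// %q escapes the newlines in the stack, keeping one log line.
				log.Printf("panic serving %s: %v stack=%q", req.URL.Path, r, debug.Stack())
				// The client gets a generic message with no internals.
				http.Error(w, http.StatusText(http.StatusInternalServerError),
					http.StatusInternalServerError)
			}
		}()
		next.ServeHTTP(w, req)
	})
}

// probe (demo only) sends one in-memory request through a panicking handler
// and returns the status code and body the client would see.
func probe() (int, string) {
	h := recoverMiddleware(http.HandlerFunc(func(http.ResponseWriter, *http.Request) {
		panic("db pointer was nil")
	}))
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, "/orders", nil))
	return rec.Code, rec.Body.String()
}
```

One limitation of this sketch: if the handler has already written its response headers before panicking, the 500 cannot replace them, so the log line is the only reliable record.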

Terminating Gracefully on Request

Okay, at this point we are logging crash diagnostics, but what about amicable pod termination? Kubernetes is sending SIGTERM and because we are not yet trapping it, it is causing our process to silently exit.

Consider the case of a DB connection over TCP. An idle TCP connection exchanges no packets until one side sends something, so if our process goes away without the socket being closed cleanly, the other side may not notice. The result can be a half-open connection. Half-open connections are handled deep in your database driver and explicit disconnection is not necessary, but it is nice.

Explicit disconnection avoids the need for application-level keepalive round trips to discover a half-open connection. Correctly closing all TCP connections ensures your database-side connection count telemetry is accurate. Further, if a starting pod initializes a large enough database connection pool in the timeout window, it may temporarily exceed your max db connections because the half-open ones have not timed out yet!

//
// Sample code to trap SIGTERM
//
package main

import (
    "fmt"
    "os"
    "os/signal"
    "syscall"
)

func main() {
    sigs := make(chan os.Signal, 1)
    signal.Notify(sigs, syscall.SIGTERM)

    go func() {
        // before you trapped SIGTERM your process would
        // have exited, so we are now on borrowed time.
        //
        // By default, Kubernetes sends SIGTERM 30 seconds
        // before shutting down the pod.

        sig := <-sigs

        // Log the received signal
        fmt.Printf("LOG: Caught sig %v\n", sig)

        // ... close TCP connections here.

        // Gracefully exit.
        // (os.Exit skips deferred functions; runtime.Goexit()
        // runs only this goroutine's defers.)
        os.Exit(0)
    }()

    // ... run your service here. main must keep running for
    // the goroutine above to receive the signal.
    select {}
}

You may also want to trap SIGINT which usually occurs when the user types Control-C. These don’t happen in production, but if you see one in a log, you can quickly recognize you aren’t looking at production logs!
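
A variation that is easier to test keeps the waiting and cleanup in a function that takes the signal channel as a parameter, so a test can feed it a signal without the kernel being involved. This is a sketch with hypothetical names (awaitShutdown, trapSignals, simulate), and the cleanup functions stand in for whatever closes your own connections.

```go
package main

import (
	"os"
	"os/signal"
	"syscall"
)

// awaitShutdown blocks until a signal arrives on sigs, runs each cleanup in
// order, and returns the signal so the caller can log it and then exit.
func awaitShutdown(sigs <-chan os.Signal, cleanups ...func()) os.Signal {
	sig := <-sigs
	for _, cleanup := range cleanups {
		cleanup()
	}
	return sig
}

// trapSignals registers for SIGTERM (sent by Kubernetes) and SIGINT
// (Control-C) and returns the channel they are delivered on.
func trapSignals() chan os.Signal {
	sigs := make(chan os.Signal, 1)
	signal.Notify(sigs, syscall.SIGTERM, syscall.SIGINT)
	return sigs
}

// simulate (demo only) pushes a SIGTERM value into a plain channel and
// reports the signal name and how many cleanups ran.
func simulate() (string, int) {
	sigs := make(chan os.Signal, 1)
	sigs <- syscall.SIGTERM
	closed := 0
	sig := awaitShutdown(sigs, func() { closed++ }, func() { closed++ })
	return sig.String(), closed
}
```

In main you would call awaitShutdown(trapSignals(), ...) in place of the inline goroutine body. If main itself waits on it and then returns, main's deferred functions run as well, which sidesteps the os.Exit gotcha described earlier.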

No Exit Left Behind

At this point we have greatly reduced the number of ways your application can silently fail in production. Left alone, the resiliency of Kubernetes and the default behaviours of the Go runtime can sweep issues under the rug.

With just a few small code snippets, we are back in control of our exit conditions.

Crashing gracefully is about leaving a meaningful corpse for others to find.