Developer experiences from the trenches

Developer experiences from the trenches

Start of a new post

How To Crash With Kubernetes and Go

Sat 23 February 2019 by Michael Labbe
tags code 

Kubernetes is so good at maintaining a user-facing veneer of a stable service that you might not even know that you are periodically crashing until you set up log aggregation and do a keyword search for panic. You can miss crash cues because pods spin up so transparently.

Okay, so your application can crash. You are using Go. What can you do about it? In practice, here are the steps we have found useful:

  1. Log your panic record into a single log line so it can be tracked.
  2. If a panic occurred while serving a RESTful request, return 500 to prevent client timeout while continuing to serve others.
  3. Handle panic-inducing signals such as SIGSEGV gracefully.
  4. Handle Kubernetes pod pre-shutdown SIGTERM messages.

Maui

Panic-inducing Signals

If you write a C program and do not explicitly handle SIGSEGV with signal(2), the receipt of SIGSEGV terminates the offending thread.

Go is different from C. Go’s runtime has a default panic handler that catches these signals and turns them into a panic. Defer, Panic and Recover on the official blog covers the basic mechanism.

SIGSEGV (“segmentation violation”) is the most common one. Go will happily compile this SIGSEGV-generating code:

var diebad *int
*diebad++       // oh, no

The full list of panic reasons is described in the official panic.go source.

Non-Panic Inducing Signals

Not every signal produces a Go panic — not by a long shot. Linux has over 50 signals. Version 7 had 15 different signals; SVR4 and 4.4BSD both have 31 different signals. Signals are a kernel interface exposed in userspace, and a primary means for processes to contend with their role in the larger operating system.

Let’s go over the non-panic inducing signals and discuss what they mean to our Kubernetes-driven Go program:

Handling Panics in Go

We’ve established that crashing signals in Go are received by its runtime panic handler, and that we want to override this behaviour to provide our own logging, stack tracing, and http response to a calling client.

In some environments you can globally trap exceptions. For instance, on Windows in a c++ environment you can use Structured Exception Handling to unwind the stack and perform diagnostics.

Not so in Go. We have one technique: defer. We can set up a defer function near the top of our goroutine stack that is executed if a panic occurs. When there, we can detect if a panic is currently in progress. There are a number of gotchas with this technique:

We can use the latter trait to our advantage in our web service, providing a generic panic handler that logs, and a second panic handler inside the goroutine that responds to a web request that returns 500 error to the user.

Global Panic Handler

The global panic handler is your opportunity to employ your logger to use your logger to provide all relevant crash diagnostics that occur outside of responding to an HTTP request:

//
// Sample code to catch panics in the main goroutine
//
func main() {
    defer func() {
        r := recover()
        if r == nil {
            return // no panic underway
        }

        fmt.Printf("PanicHandler invoked because %v\n", r)

        // print debug stack
        debug.PrintStack()

        os.Exit(1)
    }()
}

In-Request Panic Handler

Most (if not all) Go RESTful packages use a per-request Goroutine to respond to incoming requests so they can perform in parallel. The top of this stack is under package control, and so it is up to the RESTful package maintainer to provide a panic handler.

go-restful defaults to doing nothing but offers an API to trap a panic, calling your designated callback. From there, it is up to you to log diagnostics and respond to the user. Check with your RESTful package for similar handlers.

go-restful’s default panic handler (implemented in logStackOnRecover) logs the stack trace back to the caller. Don’t use it. Write your own panic handler that leverages your logging solution and does not expose internals at a crash site to a client.

Terminating Gracefully on Request

Okay, at this point we are logging crash diagnostics, but what about amicable pod termination? Kubernetes is sending SIGTERM and because we are not yet trapping it, it is causing our process to silently exit.

Consider the case of a DB connection over TCP. If our process has open TCP connections, a TCP connection sits idle until one side sends a packet. Killing the process without closing a TCP socket results in a half-open connection. Half-open connections are handled deep in your database driver and explicit disconnection is not necessary, but it is nice.

It avoids the need for application-level keepalive round trips to discover a half-open connection. Correctly closing all TCP connections ensures your database-side connection count telemetry is accurate. Further, if a starting pod initializes a large enough database connection pool in the timeout window, it may temporarily exceed your max db connections because the half-closed ones have not timed out yet!

//
// Sample code to trap SIGTERM
//
func main() 
    sigs := make(chan os.Signal, 1)
    signal.Notify(sigs, syscall.SIGTERM)

    go func() {
        // before you trapped SIGTERM your process would
        // have exited, so we are now on borrowed time.
        //
        // Kubernetes sends SIGTERM 30 seconds before 
        // shutting down the pod.

        sig := <-sigs

        // Log the received signal
        fmt.Printf("LOG: Caught sig ")
        fmt.Println(sig)

        // ... close TCP connections here.

        // Gracefully exit.
        // (Use runtime.GoExit() if you need to call defers)
        os.Exit(0)
    }()
}

You may also want to trap SIGINT which usually occurs when the user types Control-C. These don’t happen in production, but if you see one in a log, you can quickly recognize you aren’t looking at production logs!

No Exit Left Behind

At this point we have deeply limited the number of ways your application can silently fail in production. The resiliency of Kubernetes and the default behaviours of the Go runtime can sweep issues under the rug.

With just a few small code snippets, we are back in control of our exit conditions.

Crashing is about leaving a meaningful corpse for others to find.

More posts by Michael Labbe

rss
We built Frogtoss Labs for creative developers and gamers. We give back to the community by sharing designs, code and tools, while telling the story about ongoing independent game development at Frogtoss.