Thursday, November 8, 2012

Funny performance characteristics of DBI tools

These days I'm working on dynamic binary instrumentation (DBI) tools built with DynamoRIO, in particular Dr. Memory.  One of the things people always as is, "So what's the slowdown?  Are you faster than Valgrind?"  The answer is incredibly complicated, as performance questions usually are.  The easy answer is, "On SPEC?  10x slowdown, and yes, we are faster than Valgrind."  Go to the drmemory.org front page and you can see our pretty graph of spec slowdowns and how we are twice as fast as Valgrind.

OK, great!  Unfortunately, it turns out that most apps aren't at all like SPEC.

My team's goal is to find bugs in Chrome, so we want to run Chrome and its tests, not SPEC.  So what's different about Chrome?  Many things, but the biggest difference in one word is: V8.  V8 is the JavaScript engine that gives Chrome much of its performance edge, and it loves to optimize, deoptimize, and generally modify its generated code.  This creates a problem for DBI systems like DynamoRIO and Valgrind because they actually execute instrumented code out of a code cache, and not from the original application PC.  DBI systems need to maintain code cache consistency.

Valgrind doesn't actually try to solve this problem.  It requires the application to annotate all of its code modifications before re-executing the modified code.  Search for "VALGRIND_DISCARD_TRANSLATIONS" for more information on how this works.

DynamoRIO was originally a security product designed to run on unmodified Windows applications, so this approach was a non-starter.  Instead, DR uses page protections and code sandboxing to detect modification.  Sandboxing is where we insert extra code to check that the code we're about to execute is unmodified on every instruction.  When we use page protections, we mark all read, write, execute pages as read-only.  When the app writes its code, we catch the page fault, flush the code, and mark the page writable.

In theory, with those two techniques we are able to provide perfect application transparency.  However, it they come at a very high performance cost.  I'm currently running a V8 test case that takes .1 seconds to execute natively.  The version running under DynamoRIO has been running for 50 minutes while I've been writing this blog post, and it's actually making progress based on the output.  That gives us approximately a 32400x slowdown!

Generally speaking, our slowdown isn't this bad.  This particular test case is stress testing the optimizer.  But it demonstrates how hard it is to answer the question of performance.

Still, there's a lot of room for improvement here.  In particular, we are considering integrating parts of the Valgrind approach where we get help from the app to maintain code cache consistency, but I don't want to give up on the dream of a perfectly transparent DBI system yet.  Our rough idea for how to do this is to have the app tell us which areas of virtual memory it will maintain consistency for, and for the rest of the memory, we'll use our normal cache consistency protocol.  This naturally handles two JITs in the same process, one which is cooperating with us, and one which isn't.

Hopefully I'll write another blog post when we get this stuff implemented.

Monday, July 16, 2012

The environment is a command line with a standard format

This isn't particularly deep, but I was playing with _start and totally static ELF exes yesterday, and I had this minor realization about the environment.  It lives completely in userspace, and it's basically just a command line with a standard format.

People have varying opinions about environment variables.  Because of the way your shell handles them, they're usually implicit and easily forgotten.  People rarely specify them manually, and when they have to, they grumble.  But, as far as the kernel is concerned, they're just another null-terminated array of strings to pass to userspace when you exec a new process.  Truly, the execve prototype is:

int execve(const char *fname, char **argv, char **envp);

There's really nothing special about envp.  You can pass argv there if you like, but if you exec a regular exe that uses glibc, it probably won't understand you.

The environment has a standard, agreed-upon format, which is not true for argv.  Imagine if we used environment variables for most arguments.  You'd never have to write another flag parser again.  Different, unrelated components of your program can all read from it without conflicting with each other.

On the other hand, it's a totally global and flat namespace, kind of like DOM ids and CSS class names.  Everyone's nervous that they're going to trip over each other.  So, people tend to shy away from using it, and every exe has a slightly different flags format.

Monday, December 12, 2011

Stack crawling in DrMemory without frame pointers or debug info

I work on DrMemory, a dynamic instrumentation tool for finding memory bugs in native apps similar to memcheck, more widely known as just Valgrind. We have to be able to generate stack traces in a wide variety of interesting situations, and we want to do it quickly and without depending on slow, non-portable, and hard to integrate libraries.

If you remember your early computer systems classes, you'll recall those exam questions about unwinding the stack by chasing the base pointer. The reality is that for years we've been stuck with x86_32, which has a measly 8 general purpose registers, and compilers have been itching to grab that tender little %ebp register. So, they took it, and now the world is much more complicated because walking the stack is a whole lot harder.

To unwind the stack, we need to answer the question: given the stack pointer (SP) and program counter (PC), what is the offset from the SP to the next return address? With that, we can repeat the question until we hit main (or NULL or whatever).

Maintaining a huge table mapping any arbitrary module offset to a stack offset would waste a lot of space, so most systems use some form of state machine encoding that can compute the value on the fly, should you need to walk the stack. (Win64 SEH uses an interesting mechanism involving carefully constructed prologues which I'd like to learn more about.) If stack walking is rare, such as for an "exceptional" exception or when in the debugger, then the state machine (DWARF CFI) works great. But if you need to do it all the time and you need to do it on multiple platforms, it will be slow, you will need platform-specific code, or you will need to depend on some other library. Pulling arbitrary code into a tool that runs in the same as the address space as the application can often break isolation between the tool and the app, so we don't like to do it very often.

However, we're in dynamic binary instrumentation land, so we get to see every instruction as it's executed. The next most obvious solution for unwinding is what we call the "shadow stack" solution. Every time you see a call or ret instruction, you just push or pop the PC on to or off of the current thread's shadow stack. If you need to record lots of stack traces, then this may be your best bet, especially since you can imagine logging these updates and then recording a trace is as easy as logging a timestamp. I believe that this is what ThreadSanitizer does, because they need to produce a stack trace not just for the current memory access, but for the other memory access that races with the current thread.

For a tool like DrMemory, we only need to store stack traces for every heap routine invocation, which is not nearly quite as often. We also don't want to unnecessarily slow down an app that performs many calls in a loop without calling any heap routines or creating any memory bug reports. You also have to worry about a lot of corner cases, such as an app which clobbers the return addr to go somewhere else, or longjmp, or exception handling.

The solution we use is I think quite elegant. First, if there are frame pointers, we use them, because it's fast (like iterating a linked list that's in cache :). While the compiler may get a 10% perf boost from an extra register, we get a bigger boost by not doing a "scan", which is what I am about to describe.

If we don't have frame pointers, we start with the stack pointer, and scan each pointer-aligned cell for a value that looks like a pointer into executable memory. This part is obvious, and you can get Windbg to do the same thing and symbolize it to boot (anyone know how to do this in gdb?).

The problem is that most code rarely initializes its entire stack frame. In any given frame, there is a lot of spill space for temporaries in unexecuted code, or arrays that are not filled. This means that stale return addresses from prior frames are still on the stack. The clever bit is realizing that we're already shadowing every memory write to track definedness information to find uninitialized reads, so we actually *know* which bytes are initialized and which are not. That means we can tell what is a probable return address and what is stale.

Still, what if the app stores a function pointer to the stack? That will look like a code pointer and show up as an extra frame. The next bit of cleverness is where we take the candidate return address and look backwards in the code for likely encodings of a call instruction. This is pretty straightforward, as there are only a handful of encodings for direct and indirect calls, and we can just enumerate the opcodes along with their offsets. If it's a function pointer, this test should fail, unless we're really unlucky.

And with that, we have our (mostly) reliable, (reasonably) performant, and (mostly) portable stack crawler! :) To my knowledge, Memcheck doesn't attempt any of this craziness, and encourages you to recompile with FPs if it gets a weird looking stack.

Edit: Comments on G+ are preferred, since I'll actually respond.

Thursday, August 11, 2011

Early injection of DBI frameworks on Linux

I just had an interesting discussion with Qin Zhao about how the various instrumentation frameworks out there achieve early injection. I haven't actually cross-checked this all at all, so it may not be 100% accurate, but I thought it was interesting enough to write down.

Early injection is the ability of a binary instrumentation framework to interpret an application from the very first instruction. On Linux, applications do not start life in userspace at main. They start life at _start, and that usually hops into the dynamic loader to bring in the shared libraries such as libc. Usually this code is very similar for every application, and you don't need to instrument it to find uses of uninitialized memory in your application. However, its possible you have an app that doesn't use the dynamic loader, or there really are bugs there you want to see. A related problem is how to load your own dynamic libraries without conflicting with the loader being used by the application. Typically the DBI framework will load two copies of libc: one for the application, and one for the framework and its libraries. DynamoRIO, Valgrind, and Pin all have different ways of loading and injecting with different tradeoffs.

DynamoRIO is the easiest, because we don't support early injection on Linux. (Windows is a different story which I can't speak to.) We simply set LD_LIBRARY_PATH and LD_PRELOAD in a shell script that wraps the application invocation. This way, the loader will pull in one of our shared libraries, and call _init, at which point we take control. Instead of returning back to the loader, we start interpreting from where we would have returned. For DynamoRIO's dynamic libraries, the most important of which is the actual tool that you want to run, ie your race detector or memory debugger, we have our own private loader which keeps its own libraries on the side and does all the memory mapping and relocations itself. This is a pain, since it practically reimplements all of the system loader, but we do have the ability to place restrictions on the kinds of libraries we load to make things easier.

Valgrind is the next easiest, since it's basically the inverse of the above strategy. The command "valgrind" is a proper binary, so the system loader kicks in and loads all of Valgrind's libraries for it. It can use dlopen to bring in the tools it wants. To load libraries for the application, Valgrind has a private loader similar to our own, except that it loads the application's libraries instead of Valgrind's.

Finally, Pin has the most interesting strategy. Pin does not have it's own loader. Pin is also a proper binary. They start by loading all the libraries they ever want, including the tool you want to run, using the system loader (dlopen, etc). Then they fork a child process. The *parent*, not the child, sets up a call to exec to match the invocation of the application. Somehow, probably using ptrace or some other API, the child is able to pause the parent application before the first instruction and attach. The child then copies its memory map into the parent. If there are conflicts, it errors out. The child then redirects execution to Pin's entry point, which eventually starts interpreting the application from _start.

Pin's approach allows early injection without the maintenance burden of a private loader. However, it can cause conflicts with the application binary and requires heuristics to set preferred addresses which won't conflict. It also slows down startup, due to the extra forking and memory mapping and copying. Personally, I think this is pretty cool, even if it has some transparency issues.

Thursday, May 5, 2011

FLAGS dependencies on x86

One big annoyance I have with x86 in my work on DynamoRIO is the special FLAGS register. You can't read or write to it like a normal register, and 2/3 of the bits are reserved or always 0 or 1. The only easy way to get flags in or out is to use pushf and popf, but in DynamoRIO we can't use those without switching off the application stack, since it could be doing things like using the x86_64 Linux red zone. So, we try to avoid clobbering FLAGS in inserted code whenever possible. For example:

You wanted to subtract a constant? Instead of "sub $-1, %rax" try "lea -1(%rax), %rax". Works for add, too.

With the more complex base, displacement, index, scale, and offset form of the x86 memory operand, you can get even more fancy, accomplishing small shifts and multiplies and register to register adds using the LEA instruction. You can even get a 3-operand add like so:
lea 0(%rax, %rbx, 1), %rcx

Which adds %rax and %rbx and stores the result in %rcx. Another old trick is multiplication by 5, which works like so:
lea 0(%rax, %rax, 4), %rax

This is powerful enough to perform register to register adds and simple arithmetic with constants, but we can't yet subtract a register from another. Unfortunately, SUB and NEG both touch FLAGS. However, the NOT instruction does not. This means, assuming %rax is dead, we can rewrite "sub %rax, %rbx" like so:
not %rax
lea 1(%rax, %rbx, 1), %rbx

Remembering how two's complement works, negation is equivalent to a complement and add one. Addition is commutative, so we can fold the increment into the displacement field of the LEA.

Finally, you can also use JRCXZ (JECXZ in 32-bit land) to perform a jump if RCX is zero, without looking at flags. Putting it all together, you can implement equality checks without changing flags like so (assuming one of %rdi or %rsi and %rcx are dead or have been saved):

Flags code:
cmp %rdi, %rsi
je .Llabel

No flags code:
not %rsi
lea 1(%rdi, %rsi, 1), %rcx
jrcxz .Llabel

I haven't checked, but I imagine the compare is faster. However, in DynamoRIO, we have to save the flags for the cmp, which requires three instructions to save and to restore. With that factored in, I have a feeling the no flags version may be preferable. I will have to implement the no-flags transformation and see if it performs better.

Unfortunately, the toy client I am developing with has code like this:
lea -1(%rdx), %eax
test %rax, %rdi
jz .Llabel

Which performs an alignment check (assuming rdx is a power of two). I'm not sure how to implement an arithmetic AND (which is what TEST does) without touching flags.

Wednesday, April 6, 2011

HipHop talk

I just went to a talk by Facebook HipHop engineer Haiping Zhao about various systems issues FB has had. There was a discussion of language issues with data movement in distributed systems and how to get that right, but I'm going to focus on HipHop, Facebook's open source PHP to C++ translator, since that's more up my alley.

Thinking carefully about HipHop, it seems to make better tradeoffs for large web deployments than the (at this point) traditional JIT approach. HipHop essentially trades compilation time, code size, and maximum performance for performance predictability and immediate startup performance. Furthermore, static compilation is probably perceived as being less likely to have bugs than a JIT compilation to someone in charge of deploying the system, although I would argue against that perception.

The main argument against static compilation of dynamic languages is that you don't have enough information to generate efficient code, and that you need to be able to run it before compiling it. To address that, HipHop is uses feedback directed (aka profile guided) optimization, even though PGO has never caught on in the C/C++ world. Essentially, they translate all their PHP to instrumented C++ to get types of variables, and then run that on an instance with a model workload. The compile takes about 15 minutes to make a 1 GB binary (the gold linker helps), and the profiling takes about another 15 minutes. Then they re-do the compilation with the profile information. This allows optimizations like speculating the target of a call, and speculating the type of a variable as being int or string, the easy to optimize types.

I would argue that this approach won't have as good peak performance as a JIT, which only has to generate the specialized version of the code, and can do some more clever optimizations. Also, a JIT can always fallback to the interpreter for the general case. If your profiling ends up being wrong (unlikely), a JIT can respond appropriately.

The drawback of the JIT is that you don't know when or if your code is going to get compiled, and you have to wait for that to happen. Debugging performance problems with JITed code is tricky. On the other hand, there are many solid tools for profiling C++, and mapping that back to PHP doesn't sound too hard.

It's probably possible to do a little better in the static model by avoiding generating or loading code for slow paths that you feel will never actually be loaded. I imagine right now the generated code looks like:

if (is_int(v)) { /* large int code */ }
else { /* massive general code */ }

The else branch is probably never taken. It seems like there should be techniques to either move the else branch far away from the hot code so that the cold code is never even faulted in from disk, or you could try to rig up a fallback to the interpreter, which I think would be much harder and not help your code size. On the other hand, branch predictors these days are good and moving code around is really only going to show up as instruction cache effects, which are pretty hard to predict. HipHop probably already uses __builtin_expect to tell g++ which branch it expects to take as well.

One other major thing I remembered during the talk is that optimizing PHP turns out to be much easier than Python. In Python, functions, classes, and globals in a module all share the same namespace. In PHP, variables start with a $ and functions do not. Short of eval, which HipHop forbids, there is no way to shadow a function without the compiler noticing that it might be shadowed. For example, this works:

if ($a) {
function foo() { return true; }
} else {
function foo() { return false; }
}

But this is not possible:

function foo() { return true; }
function bar() { return false; }
foo = bar; // No can do.

You can also invoke functions through variables, like so:

function foo() { return true; }
$func = "foo";
$bar = $func();

That will set $bar to true. This is generally rare and easy to implement by falling back to the standard dynamic lookup.

OTOH, local variable lookup seems like it sucks because PHP supports something like the following:

$which = $a ? "foo" : "bar";
$$which = "string";

Which will assign "string" to the variable $foo or $bar depending on whether $a is true or false. Hopefully the scoping rules restrict this to the scope of the local function, so you only need to worry about this feature in functions that use it.

Another feature of HipHop is that, once you forbid certain features like eval or using include to pull in generated or uploaded source from the filesystem, you get to do whole program compilation. One of the major optimization blockers even for C and C++ is that the compiler only gets to see a translation unit (TU) at a time. HipHop gets to see the whole program, so it makes it possible to do the checks I mentioned earlier, like making sure that there is only a single definition of a function 'foo' at top-level which is always present, so you can statically bind it at the call-site even if it's in a different source file.

Finally, Haiping mentioned that the massive binary produced is deployed to servers using BitTorrent, which surprisingly seems like exactly the right tool for the job. You could try to be more clever if you knew something about the network topology, or you could just use a general, well-tested tool and call it a day. =D

P.S. Hopefully none of this is considered FB confidential and Haiping doesn't get in any trouble. It was a very interesting talk!

Wednesday, March 30, 2011

Building spec benchmarks with newer libstdc++

This post is mostly for the sake of others searching with the same problem.

I was having issues building the 483.xalancbmk benchmark with a not-so-new libstdc++ (4.3.4). There's a particular source file (FormatterToHTML.cpp) that uses memset without including string.h or cstring. This was the error message:

FormatterToHTML.cpp:139: error: 'memset' was not declared in this scope

It depended on an include of 'vector' to pull in 'cstring' as a transitive dep. That seems to have changed. Since the SPEC benchmarks go so far as to checksum the benchmark sources before letting you run them, it was easier to add '#include <cstring>' to /usr/include/c++/4.3.4/vector than fix the application. =(