Thursday, 26 January 2017

Trampolines on MIPS for GCC nested functions under VxWorks

Taking the address of a GCC nested function generates a trampoline on most platforms, so that the stack frame address can be recovered during the callback.

Nested functions are a GCC extension to C (the feature is native to Pascal); clang offers code blocks as an alternative.

Trampolines are described in Lexical Closures for C++ (Thomas M. Breuel, USENIX C++ Conference Proceedings, October 17-21, 1988). The trampoline is a small piece of generated code that loads a literal, such as the stack frame pointer or any other useful value, and then jumps to the callback address, so that the auto variables of the containing function can be accessed in their stack frame.

This means that your callback can access any number of auto variables without the need for a cookie pointer or an associated struct.
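
For contrast, here is a minimal sketch of the conventional cookie-pointer pattern that a nested function lets you avoid. All names in it (for_each, sum_state, add_if_large, sum_large) are hypothetical and chosen only for illustration; it is plain portable C and does not use the GCC extension.

```c
#include <stddef.h>

/* Conventional callback API: the caller supplies an opaque "cookie" that is
   handed back to the callback so it can find its state. */
typedef void (*visit_fn)(int value, void *cookie);

void for_each(const int *items, size_t count, visit_fn visit, void *cookie)
{
    for (size_t i = 0; i < count; i++)
        visit(items[i], cookie);
}

/* State that a nested function could have read directly from the
   enclosing function's auto variables. */
struct sum_state {
    int total;
    int threshold;
};

/* The callback has to cast the cookie back to the struct it expects. */
void add_if_large(int value, void *cookie)
{
    struct sum_state *state = cookie;
    if (value >= state->threshold)
        state->total += value;
}

int sum_large(const int *items, size_t count, int threshold)
{
    struct sum_state state = { 0, threshold };
    for_each(items, count, add_if_large, &state);
    return state.total;
}
```

With a nested function, total and threshold would simply be locals of sum_large, and the callback API would not need the extra cookie parameter at all; the trampoline carries the frame address instead.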

The MIPS trampoline structure is explained here by Ian Lance Taylor in 2006.

On systems with a unified cache, the data writes that generate the code automatically become visible to the instruction cache. On other systems (MIPS, ARM) the data cache and instruction cache may be independent, so after generating the trampoline, the data cache must be flushed and the instruction cache invalidated for that region, to be sure that the CPU will read the newly generated instructions.
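
As a rough sketch of that write-then-synchronise sequence in portable GCC C, the following uses the __builtin___clear_cache builtin, which takes a begin and an end address. The function name publish_code is hypothetical. What the builtin expands to depends on the target: on cache-coherent targets it is typically nothing, while on others it may become a call into a system-specific routine.

```c
#include <stddef.h>
#include <string.h>

/* Copy generated instruction bytes into their final location, then ask the
   compiler to make that range visible to instruction fetch.
   publish_code is an illustrative name, not part of any library. */
void publish_code(void *dst, const void *src, size_t len)
{
    char *begin = dst;

    memcpy(begin, src, len);                     /* data-side writes */
    __builtin___clear_cache(begin, begin + len); /* range is [begin, end) */
}
```

Only after this call returns is it safe to jump into the copied region, and even then only if the target's implementation really covers every CPU that might execute it.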

This Arm Community post explains the difficulty quite well, but do not expect the __clear_cache builtin function to have the same calling signature as the _flush_cache function, the call to which is generated by a set of definable macros.

The _flush_cache function is system-specific: you need whatever flushes the data cache and invalidates the instruction cache properly across all CPUs of your target system.

If it were easy, the compiler would have done it; but it doesn't know how to clear the caches for whatever system you might be targeting.

The proper way to do this on a MIPS system is not clear. Privileged instructions may be required, or user-mode instructions such as SYNCI may be enough.

GCC has supported the -msynci option since GCC 4.3. From the release notes:

"GCC now supports an -msynci option, which specifies that synci is enough to flush the instruction cache, without help from the operating system. GCC uses this information to optimize automatically-generated cache flush operations, such as those used for nested functions in C. There is also a --with-synci configure-time option, which makes -msynci the default."

Maybe the OS kernel exports a system call or function to deal with this. Unless one of these is specified, GCC nested function trampolines on MIPS under VxWorks tend to fail at link time, like this:

(.text+0x146370): undefined reference to `_flush_cache'
vxWorks.bin: In function `zig':
(.text+0x146370): relocation truncated to fit: R_MIPS_26 against `_flush_cache'

because the compiler has no real idea how to flush the caches on your MIPS system, and this function is not provided.

I can redefine the name of the missing function, so it calls the cache flush function on my system with this compiler option:
-mflush-func=really_flush_cache

(.text+0x146370): undefined reference to `really_flush_cache'
vxWorks.bin: In function `zig':
(.text+0x146370): relocation truncated to fit: R_MIPS_26 against `really_flush_cache'

If only I knew what the name of the cache flush function was, and that it had a matching calling signature.

Flush function signature

The documentation is very poor on the arguments to be provided to the flush cache function, and this seems to vary from platform to platform.

In the documentation, CLEAR_INSN_CACHE and TARGET_TRAMPOLINE_INIT are of special relevance; in one case CLEAR_INSN_CACHE is given this definition:

#define CLEAR_INSN_CACHE(beg, end) mips_sync_icache (beg, end - beg)

By adding debugging to a flush-cache function wrapper to inspect its arguments, I see that this (start address, length in bytes) form matches the arguments passed when compiling under my VxWorks tool chain. Those macros can be redefined, though, so it may not match a GCC built for a different system, and I must acknowledge that my flush function is not portable.
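
A debugging wrapper of that kind might look like the sketch below. The name debug_flush_cache and the three global variables are hypothetical; the two-argument signature is the one observed in this post and is an assumption for any other GCC configuration. The real cache maintenance call is deliberately left out so that the sketch stays target-neutral.

```c
#include <stddef.h>
#include <stdio.h>

/* Last arguments seen, kept so they can be inspected from a debugger. */
const char *last_flush_beg;
size_t last_flush_bytes;
unsigned long flush_calls;

/* Hypothetical wrapper, selected with -mflush-func=debug_flush_cache.
   Assumes the (start, length) signature observed in this post. */
void debug_flush_cache(char *beg, size_t bytes)
{
    last_flush_beg = beg;
    last_flush_bytes = bytes;
    flush_calls++;

    fprintf(stderr, "flush_cache: beg=%p bytes=%lu\n",
            (void *)beg, (unsigned long)bytes);

    /* The target's real cache maintenance call would go here. */
}
```

If the logged length looks like an address rather than a small byte count, the tool chain is probably passing (begin, end) instead, and the wrapper needs to subtract.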

Possibly a portable implementation is given here. I note that other JIT systems (e.g. SLJIT, used by PCRE-SLJIT) and code generators (libffcall) attempt to provide their own portable implementations, but these may not work on some multi-core systems.

Cache Operations

Linux

#include <asm/cachectl.h>
int cacheflush(char *addr, int nbytes, int cache);

VxWorks

VxWorks provides cacheInvalidate, cacheFlush and cacheClear (flush and invalidate), and in particular cacheTextUpdate, which seems to be just what we want.

If cacheTextUpdate works, we could just have the compiler option:

-mflush-func=cacheTextUpdate

Or we might prefer a wrapper that allows us to vary it, or add debugging statements:

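#include <cacheLib.h>

/* Write the new instructions back to memory, then discard any stale
   instruction-cache lines covering the same range. */
void _flush_cache(char* beg, size_t bytes) {
    cacheFlush(DATA_CACHE, beg, bytes);
    cacheInvalidate(INSTRUCTION_CACHE, beg, bytes);
}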

or

#include <cacheLib.h>
void _flush_cache(char* beg, size_t bytes) {
    cacheTextUpdate(beg, bytes);
}

A clue: search for cacheTextUpdate in this initial attempt to add MIPS VxWorks support.

It Lives!

Does it work? Well, the trampoline works; but is it treating the cache properly or will it randomly fail? I don't know.

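#include <vxWorks.h>
#include <stdio.h>
#include <cacheLib.h>

/* Called by the code GCC emits after it writes the trampoline. */
void _flush_cache(char* beg, size_t bytes) {
    cacheTextUpdate(beg, bytes);
}

/* Knows nothing about zig's stack frame; it just calls the pointer. */
void zag(void(*zog)(void)) {
    printf("ZAG\n");
    zog();
}

void zig(void) {
    const char* where = NULL;

    /* Nested function: reads 'where' from zig's stack frame. */
    void zog(void) {
        printf("ZOG from %s\n", where);
    }

    where = __FUNCTION__;

    printf("ZIG\n");
    zag(zog);   /* taking zog's address creates the trampoline */
}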

Call function zig() to see:

ZIG
ZAG
ZOG from zig

If you want to see the generated trampoline code, use GCC flags -save-temps -fverbose-asm and then look at the .s file.

An end note for smug x86 users

From: http://lkml.iu.edu/hypermail/linux/kernel/0806.0/1702.html
If you do the sub-word write using a regular store, you are now invoking the _one_ non-coherent part of the x86 memory pipeline: the store buffer. Normal stores can (and will) be forwarded to subsequent loads from the store buffer, and they are not strongly ordered wrt cache coherency while they are buffered.
...
...if you do a regular store to a partial word, with no serializing instructions between that and a subsequent load of the whole word, the value of the store can be bypassed from the store buffer, and the load from the other part of the word can be carried out _before_ the store has actually gotten that cacheline exclusively!
So when you do
  movb reg,(byteptr)
  movl (byteptr),reg
you may actually get old data in the upper 24 bits, along with new data in the lower 8.
I think.
Anyway, be careful. The cacheline itself will always be coherent, but the store buffer is not going to be part of the coherency rules, and without serialization (or locked ops), you _are_ going to invoke the store buffer!