Tuesday, 25 April 2017

Too many arguments; or: keyword arguments in C

Sometimes you extend a generic helper function and it starts to take too many optional arguments, which can be left as NULL if they don't matter.

Maybe your function formats and signs a message that could have quite a variety of fields.

int sign(struct key *key, const char *this, const char *that)
{
    x_sign(key, this);
    ...
}
...

int result = sign(key, "This", "That");

and then you find yourself wanting to add const char* other.

Yeah... and there will be a lot of caller code to fixup (unless you want to use the other macro trick for default arguments, which I'll write up later). 

Or will there?

struct sign_args {
    const char* this;
    const char* that;
    const char* other;
};

int sign(struct key *key, struct sign_args *args)
#define sign(key, ...) sign(key, &(struct sign_args){ __VA_ARGS__ })
{
    x_sign(key,args->this);
    ...
}
and it is called the same way:

sign(key, "This", "That", "Other");

but you also get keyword arguments for free!

sign(key, .other = "Other"); // leaves this & that NULL

And this expands out to a valid compound literal expression. The temporary struct (an lvalue) is created (probably on the stack) and disposed of automatically at the end of the expression.

sign(key, &(struct sign_args){ .other = "Other" });

Another advantage is that now as your arguments are passed as a pointer, it is very easy to tunnel them all through the usual void* callback systems. In the case below, it briefly obtains an interactively unlocked key to sign with:

int batch_sign(struct key *key, struct sign_args *args);
int sign(struct key *key, struct sign_args *args) 
#define sign(key, ...) sign(key, &(struct sign_args){ __VA_ARGS__})
{
  x_with_unlocked_key(key, batch_sign, args);
  ...
}

Is it efficient? Every time the args are passed, only a pointer is passed, so that can be efficient; but the struct does have to be created once and may be a little more work than passing a struct as an argument or passing the arguments.

The main points to note are that

  • the original arguments should keep their same order at the start of the args struct
  • as the macro has the same name as your function, be sure to define it AFTER your function
  • the arguments are passed by pointer and so shared as they are tunnelled.
    This can make it easy to pass back a temporary value from an auto-variable. 
Of course you don't have to pass a pointer to the struct, you could pass the struct, by removing the & from the macro:

int sign(struct key *key, struct sign_args *args)
#define sign(key, ...) sign(key, (struct sign_args){ __VA_ARGS__ })
{
    x_sign(key,args.this);
    ...
}
it is called the same way:

sign(key, "This", "That", "Other");

but expands out slightly differently, passing the struct instead of a pointer to it, e.g.:

sign(key, .other = "Other"); // leaves this & that NULL

which this expands out to:

sign(key, (struct sign_args){ .other = "Other" });

Of course there is still nothing to prevent you from passing &args via a callback, or making and passing a copy.

Thursday, 26 January 2017

Trampolines on MIPS for GCC nested functions under VxWorks

Taking the address of a GCC nested function generates a trampoline  on most platforms so that the stack frame address can be recovered during the callback.

Nested functions are a GCC extension to C (available for Pascal), with an alternative of code blocks under clang.

Trampolines are described in Lexical Closures for C++ (Thomas M. Breuel, USENIX C++ Conference Proceedings, October 17-21, 1988), and this trampoline is a piece of generated code that can recover a stack frame pointer or any other useful value by loading a literal before making a jump to the callback address, so that auto variables of the containing function can be accessed in their stack frame.

It means that your callback can recover any number of auto-variables without the need for a cookie pointer or an associated struct.

The MIPS trampoline structure is explained here by Ian Lance Taylor in 2006.

On Systems with a unified cache, data-writes to generate the code automatically become visible to the instruction cache, but on other systems (MIPS, ARM) the data cache and instruction cache may be independent, so after generating the trampoline, the data cache must be flushed and the instruction cache invalidated for that region, to be sure that the CPU will read the newly generated instructions.

This Arm Community post explains the difficulty quite well, although do not expect the specification for the __clear_cache builtin function to have the same calling signature as the _flush_cache function whose calling is generated by a set of definable macros.

The _flush_cache function is system specific, you need whatever flushes the data cache and invalidates the instruction cache properly across all CPU's for your target system.

If it were easy, the compiler would have done it; but it doesn't know how to clear the caches for whatever system you might be targeting.

The proper way to do this on a MIPS system is not clear. There may be privileged instructions required, maybe user instructions such as SYNCI will work.

Since GCC 3.1 GCC has support for the -msynci option
GCC now supports an -msynci option, which specifies that synci is enough to flush the instruction cache, without help from the operating system. GCC uses this information to optimize automatically-generated cache flush operations, such as those used for nested functions in C. There is also a --with-synci configure-time option, which makes -msynci the default.
Maybe the OS kernel exports a system call or function to deal with this, so unless this is specified, the GCC nested function trampolines on MIPS under VxWorks tend to fail at link time, like this:

(.text+0x146370): undefined reference to `_flush_cache'
vxWorks.bin: In function `zig':
(.text+0x146370): relocation truncated to fit: R_MIPS_26 against `_flush_cache'

because the compiler has no real idea how to flush the caches on your MIPS system, and this function is not provided.

I can redefine the name of the missing function, so it calls the cache flush function on my system with this compiler option:
-mflush-func=really_flush_cache

(.text+0x146370): undefined reference to `really_flush_cache'
vxWorks.bin: In function `zig':
(.text+0x146370): relocation truncated to fit: R_MIPS_26 against `really_flush_cache'

If only I knew what the name of the cache flush function was, and that it had a matching calling signature.

Flush function signature

The documentation is very poor on the arguments to be provided to the flush cache function, and this seems to vary from platform to platform.

In the documentation, of special relevance are CLEAR_INSN_CACHE and TARGET_TRAMPOLINE_INIT which in one case is given this definition:

#define CLEAR_INSN_CACHE(beg, end) mips_sync_icache (beg, end - beg)

By adding debugging to inspect the arguments of a flush-cache function wrapper, I see that this matches the arguments passed when compiling under my VxWorks tool chain; but given that those macros can be re-defined it may not match GCC building for a different system so I must acknowledge that my flush function is not portable.

Possibly a portable implementation is given here, and I note that other JIT systems (e.g. SLJT used by PCRE-SLJT) and code generators (libffcall) attempt to provide their own portable implementations but which may not work on some multi-core systems.

Cache Operations

Linux

#include <asm/cachectl.h>
int cacheflush(char *addr, int nbytes, int cache);

VxWorks

VxWorks provides cacheInvalidate, cacheFlush and cacheClear (invalidate and flush); and particularly cacheTextUpdate which seems just what we want:

If cacheTextUpdate works, we could just have the compiler option:

-mflush-func=cacheTextUpdate

Or we might prefer a wrapper that allows us to vary it, or add debugging statements:

#include <cacheLib.h>

void _flush_cache(char* beg, size_t bytes) {
    cacheFlush(DATA_CACHE, beg, bytes);
    cacheInvalidate(INSTRUCTION_CACHE, beg, bytes);
}

or

#include <cacheLib.h>
void _flush_cache(char* beg, size_t bytes) {
    cacheTextUpdate(beg, bytes);
}

A clue, search for cacheTextUpdate in this initial attempt to add MIPS vxWorks support: 

It Lives!

Does it work? Well, the trampoline works; but is it treating the cache properly or will it randomly fail? I don't know.

#include <cacheLib.h>

void _flush_cache(char* beg, size_t bytes) {
    cacheTextUpdate(beg, bytes);
}

void zag(void(*zog)(void)) {
    printf("ZAG\n");
    zog();
}

void zig() {
  const char* where = NULL;

  void zog() {
    printf("ZOG from %s\n", where);
  }

  where = __FUNCTION__;

  printf("ZIG\n");
  zag(zog);
}

Call function zag( ) to see

ZIG
ZAG
ZOG from zig

If you want to see the generated trampoline code, use GCC flags -save-temps -fverbose-asm and then look at the .s file.

An end note for smug x86 users

From: http://lkml.iu.edu/hypermail/linux/kernel/0806.0/1702.html
If you do the sub-word write using a regular store, you are now invoking the _one_ non-coherent part of the x86 memory pipeline: the store buffer. Normal stores can (and will) be forwarded to subsequent loads from the store buffer, and they are not strongly ordered wrt cache coherency while they are buffered.
...
...if you do a regular store to a partial word, with no serializing instructions between that and a subsequent load of the whole word, the value of the store can be bypassed from the store buffer, and the load from the other part of the word can be carried out _before_ the store has actually gotten that cacheline exclusively!
So when you do
  movb reg,(byteptr)
  movl (byteptr),reg
you may actually get old data in the upper 24 bits, along with new data in the lower 8.
I think.
Anyway, be careful. The cacheline itself will always be coherent, but the store buffer is not going to be part of the coherency rules, and without serialization (or locked ops), you _are_ going to invoke the store buffer!