futex2

Author

André Almeida <andrealmeid@collabora.com>

futex, or fast user mutex, is a set of syscalls to allow userspace to create performant synchronization mechanisms, such as mutexes, semaphores and conditional variables in userspace. C standard libraries, like glibc, uses it as a means to implement more high level interfaces like pthreads.

The interface

uAPI functions

long sys_futex_wait(void __user *uaddr, u64 val, unsigned int flags, struct __kernel_timespec __user *timo)

Wait on a futex address if (*uaddr) == val

Parameters

void __user * uaddr

User address of futex

u64 val

Expected value of futex

unsigned int flags

Specify the size of futex and the clockid

struct __kernel_timespec __user * timo

Optional absolute timeout.

Description

The user thread is put to sleep, waiting for a futex_wake() at uaddr, if the value at *uaddr is the same as val (otherwise, the syscall returns immediately with -EAGAIN).

Returns 0 on success, error code otherwise.

long sys_futex_waitv(struct futex_waitv __user *waiters, unsigned int nr_futexes, unsigned int flags, struct __kernel_timespec __user *timo)

Wait on a list of futexes

Parameters

struct futex_waitv __user * waiters

List of futexes to wait on

unsigned int nr_futexes

Length of futexv

unsigned int flags

Flag for timeout (monotonic/realtime)

struct __kernel_timespec __user * timo

Optional absolute timeout.

Description

Given an array of struct futex_waitv, wait on each uaddr. The thread wakes if a futex_wake() is performed at any uaddr. The syscall returns immediately if any waiter has *uaddr != val. *timo is an optional timeout value for the operation. Each waiter has individual flags. The flags argument for the syscall should be used solely for specifying the timeout as realtime, if needed. Flags for shared futexes, sizes, etc. should be used on the individual flags of each waiter.

Returns the array index of one of the awaken futexes. There’s no given information of how many were awakened, or any particular attribute of it (if it’s the first awakened, if it is of the smaller index…).

long sys_futex_wake(void __user *uaddr, unsigned int nr_wake, unsigned int flags)

Wake a number of futexes waiting on an address

Parameters

void __user * uaddr

Address of futex to be woken up

unsigned int nr_wake

Number of futexes waiting in uaddr to be woken up

unsigned int flags

Flags for size and shared

Description

Wake nr_wake threads waiting at uaddr.

Returns the number of woken threads on success, error code otherwise.

long sys_futex_requeue(struct futex_requeue __user *uaddr1, struct futex_requeue __user *uaddr2, unsigned int nr_wake, unsigned int nr_requeue, u64 cmpval, unsigned int flags)

Wake futexes at uaddr1 and requeue from uaddr1 to uaddr2

Parameters

struct futex_requeue __user * uaddr1

Address of futexes to be waken/dequeued

struct futex_requeue __user * uaddr2

Address for the futexes to be enqueued

unsigned int nr_wake

Number of futexes waiting in uaddr1 to be woken up

unsigned int nr_requeue

Number of futexes to be requeued from uaddr1 to uaddr2

u64 cmpval

Expected value at uaddr1

unsigned int flags

Reserved flags arg for requeue operation expansion. Must be 0.

Description

If (uaddr1->uaddr == cmpval), wake at uaddr1->uaddr a nr_wake number of waiters and then, remove a number of nr_requeue waiters at uaddr1->uaddr and add then to uaddr2->uaddr list. Each uaddr has its own set of flags, that must be defined at struct futex_requeue (such as size, shared, NUMA).

Return the number of the woken futexes + the number of requeued ones on success, error code otherwise.

uAPI structures

struct futex_waitv

A waiter for vectorized wait

Definition

struct futex_waitv {
  __u64 val;
  void __user *uaddr;
  unsigned int flags;
};

Members

val

Expected value at uaddr

uaddr

User address to wait on

flags

Flags for this waiter

struct futex_requeue

Define an address and its flags for requeue operation

Definition

struct futex_requeue {
  void __user *uaddr;
  unsigned int flags;
};

Members

uaddr

User address of one of the requeue arguments

flags

Flags for this address

The flag argument

The flag is used to specify the size of the futex word (FUTEX_[8, 16, 32, 64]). It’s mandatory to define one, since there’s no default size.

By default, the timeout uses a monotonic clock, but can be used as a realtime one by using the FUTEX_REALTIME_CLOCK flag.

By default, futexes are of the private type, that means that this user address will be accessed by threads that share the same memory region. This allows for some internal optimizations, so they are faster. However, if the address needs to be shared with different processes (like using mmap() or shm()), they need to be defined as shared and the flag FUTEX_SHARED_FLAG is used to set that.

By default, the operation has no NUMA-awareness, meaning that the user can’t choose the memory node where the kernel side futex data will be stored. The user can choose the node where it wants to operate by setting the FUTEX_NUMA_FLAG and using the following structure (where X can be 8, 16, 32 or 64):

struct futexX_numa {
        __uX value;
        __sX hint;
};

This structure should be passed at the void *uaddr of futex functions. The address of the structure will be used to be waited on/waken on, and the value will be compared to val as usual. The hint member is used to define which node the futex will use. When waiting, the futex will be registered on a kernel-side table stored on that node; when waking, the futex will be searched for on that given table. That means that there’s no redundancy between tables, and the wrong hint value will lead to undesired behavior. Userspace is responsible for dealing with node migrations issues that may occur. hint can range from [0, MAX_NUMA_NODES), for specifying a node, or -1, to use the same node the current process is using.

When not using FUTEX_NUMA_FLAG on a NUMA system, the futex will be stored on a global table on allocated on the first node.

The timo argument

As per the Y2038 work done in the kernel, new interfaces shouldn’t add timeout options known to be buggy. Given that, timo should be a 64-bit timeout at all platforms, using an absolute timeout value.

Implementation

The internal implementation follows a similar design to the original futex. Given that we want to replicate the same external behavior of current futex, this should be somewhat expected.

Waiting

For the wait operations, they are all treated as if you want to wait on N futexes, so the path for futex_wait and futex_waitv is the basically the same. For both syscalls, the first step is to prepare an internal list for the list of futexes to wait for (using struct futexv_head). For futex_wait() calls, this list will have a single object.

We have a hash table, where waiters register themselves before sleeping. Then the wake function checks this table looking for waiters at uaddr. The hash bucket to be used is determined by a struct futex_key, that stores information to uniquely identify an address from a given process. Given the huge address space, there’ll be hash collisions, so we store information to be later used on collision treatment.

First, for every futex we want to wait on, we check if (*uaddr == val). This check is done holding the bucket lock, so we are correctly serialized with any futex_wake() calls. If any waiter fails the check above, we dequeue all futexes. The check (*uaddr == val) can fail for two reasons:

  • The values are different, and we return -EAGAIN. However, if while dequeueing we found that some futexes were awakened, we prioritize this and return success.

  • When trying to access the user address, we do so with page faults disabled because we are holding a bucket’s spin lock (and can’t sleep while holding a spin lock). If there’s an error, it might be a page fault, or an invalid address. We release the lock, dequeue everyone (because it’s illegal to sleep while there are futexes enqueued, we could lose wakeups) and try again with page fault enabled. If we succeed, this means that the address is valid, but we need to do all the work again. For serialization reasons, we need to have the spin lock when getting the user value. Additionally, for shared futexes, we also need to recalculate the hash, since the underlying mapping mechanisms could have changed when dealing with page fault. If, even with page fault enabled, we can’t access the address, it means it’s an invalid user address, and we return -EFAULT. For this case, we prioritize the error, even if some futexes were awaken.

If the check is OK, they are enqueued on a linked list in our bucket, and proceed to the next one. If all waiters succeed, we put the thread to sleep until a futex_wake() call, timeout expires or we get a signal. After waking up, we dequeue everyone, and check if some futex was awakened. This dequeue is done by iteratively walking at each element of struct futex_head list.

All enqueuing/dequeuing operations requires to hold the bucket lock, to avoid racing while modifying the list.

Waking

We get the bucket that’s storing the waiters at uaddr, and wake the required number of waiters, checking for hash collision.

There’s an optimization that makes futex_wake() not take the bucket lock if there’s no one to be woken on that bucket. It checks an atomic counter that each bucket has, if it says 0, then the syscall exits. In order for this to work, the waiter thread increases it before taking the lock, so the wake thread will correctly see that there’s someone waiting and will continue the path to take the bucket lock. To get the correct serialization, the waiter issues a memory barrier after increasing the bucket counter and the waker issues a memory barrier before checking it.

Requeuing

The requeue path first checks for each struct futex_requeue and their flags. Then, it will compare the expected value with the one at uaddr1::uaddr. Following the same serialization explained at Waking, we increase the atomic counter for the bucket of uaddr2 before taking the lock. We need to have both buckets locks at same time so we don’t race with other futex operation. To ensure the locks are taken in the same order for all threads (and thus avoiding deadlocks), every requeue operation takes the “smaller” bucket first, when comparing both addresses.

If the compare with user value succeeds, we proceed by waking nr_wake futexes, and then requeuing nr_requeue from bucket of uaddr1 to the uaddr2. This consists in a simple list deletion/addition and replacing the old futex key with the new one.

Futex keys

There are two types of futexes: private and shared ones. The private are futexes meant to be used by threads that share the same memory space, are easier to be uniquely identified and thus can have some performance optimization. The elements for identifying one are: the start address of the page where the address is, the address offset within the page and the current->mm pointer.

Now, for uniquely identifying a shared futex:

  • If the page containing the user address is an anonymous page, we can just use the same data used for private futexes (the start address of the page, the address offset within the page and the current->mm pointer); that will be enough for uniquely identifying such futex. We also set one bit at the key to differentiate if a private futex is used on the same address (mixing shared and private calls does not work).

  • If the page is file-backed, current->mm maybe isn’t the same one for every user of this futex, so we need to use other data: the page->index, a UUID for the struct inode and the offset within the page.

Note that members of futex_key don’t have any particular meaning after they are part of the struct - they are just bytes to identify a futex. Given that, we don’t need to use a particular name or type that matches the original data, we only need to care about the bitsize of each component and make both private and shared fit in the same memory space.

Source code documentation

struct futex_key

Components to build unique key for a futex

Definition

struct futex_key {
  u64 pointer;
  unsigned long index;
  unsigned long offset;
};

Members

pointer

Pointer to current->mm or inode’s UUID for file backed futexes

index

Start address of the page containing futex or index of the page

offset

Address offset of uaddr in a page

struct futex_waiter

List entry for a waiter

Definition

struct futex_waiter {
  void __user *uaddr;
  struct futex_key key;
  struct list_head list;
  u64 val;
  unsigned int flags;
  struct futex_bucket *bucket;
  unsigned int index;
};

Members

uaddr

Virtual address of userspace futex

key

Information that uniquely identify a futex

list

List node struct

val

Expected value for this waiter

flags

Flags

bucket

Pointer to the bucket for this waiter

index

Index of waiter in futexv list

struct futex_waiter_head

List of futexes to be waited

Definition

struct futex_waiter_head {
  struct task_struct *task;
  bool hint;
  struct futex_waiter objects[0];
};

Members

task

Task to be awaken

hint

Was someone on this list awakened?

objects

List of futexes

struct futex_bucket

A bucket of futex’s hash table

Definition

struct futex_bucket {
  atomic_t waiters;
  spinlock_t lock;
  struct list_head list;
};

Members

waiters

Number of waiters in the bucket

lock

Bucket lock

list

List of waiters on this bucket

u64 futex_get_inode_uuid(struct inode *inode)

Gets an UUID for an inode

Parameters

struct inode *inode

inode to get UUID

Description

Generate a machine wide unique identifier for this inode.

This relies on u64 not wrapping in the life-time of the machine; which with 1ns resolution means almost 585 years.

This further relies on the fact that a well formed program will not unmap the file while it has a (shared) futex waiting on it. This mapping will have a file reference which pins the mount and inode.

If for some reason an inode gets evicted and read back in again, it will get a new sequence number and will _NOT_ match, even though it is the exact same file.

It is important that match_futex() will never have a false-positive, esp. for PI futexes that can mess up the state. The above argues that false-negatives are only possible for malformed programs.

Return

UUID for the given inode

int futex_get_shared_key(uintptr_t address, struct mm_struct *mm, struct futex_key *key)

Get a key for a shared futex

Parameters

uintptr_t address

Futex memory address

struct mm_struct *mm

Current process mm_struct pointer

struct futex_key *key

Key struct to be filled

Return

0 on success, error code otherwise

struct futex_bucket *futex_get_bucket(void __user *uaddr, struct futex_key *key, bool shared)

Check if the user address is valid, prepare internal data and calculate the hash

Parameters

void __user *uaddr

futex user address

struct futex_key *key

data that uniquely identifies a futex

bool shared

is this a shared futex?

Description

For private futexes, each uaddr will be unique for a given mm_struct, and it won’t be freed for the life time of the process. For shared futexes, check futex_get_shared_key().

Return

address of bucket on success, error code otherwise

int futex_get_user(u32 *uval, u32 __user *uaddr)

Get the userspace value on this address

Parameters

u32 *uval

variable to store the value

u32 __user *uaddr

userspace address

Description

Check the comment at futex_enqueue() for more information.

int futex_setup_time(struct __kernel_timespec __user *timo, struct hrtimer_sleeper *timeout, unsigned int flags)

Prepare the timeout mechanism and start it.

Parameters

struct __kernel_timespec __user *timo

Timeout value from userspace

struct hrtimer_sleeper *timeout

Pointer to hrtimer handler

unsigned int flags

Flags from userspace, to decide which clockid to use

Return

0 on success, error code otherwise

int futex_dequeue_multiple(struct futex_waiter_head *futexv, unsigned int nr)

Remove multiple futexes from hash table

Parameters

struct futex_waiter_head *futexv

list of waiters

unsigned int nr

number of futexes to be removed

Description

This function is used if (a) something went wrong while enqueuing, and we need to undo our work (then nr <= nr_futexes) or (b) we woke up, and thus need to remove every waiter, check if some was indeed woken and return. Before removing a waiter, we check if it’s on the list, since we have no clue who have been waken.

Return

  • -1
    • If no futex was woken during the removal

  • 0>= - At least one futex was found woken, index of the last one

int futex_enqueue(struct futex_waiter_head *futexv, unsigned int nr_futexes, int *awakened)

Check the value and enqueue a futex on a wait list

Parameters

struct futex_waiter_head *futexv

List of futexes

unsigned int nr_futexes

Number of futexes in the list

int *awakened

If a futex was awakened during enqueueing, store the index here

Description

Get the value from the userspace address and compares with the expected one.

Getting the value from user futex address:

Since we are in a hurry, we use a spin lock and we can’t sleep. Try to get the value with page fault disabled (when enable, we might sleep).

If we fail, we aren’t sure if the address is invalid or is just a page fault. Then, release the lock (so we can sleep) and try to get the value with page fault enabled. In order to trigger a page fault handling, we just call __get_user() again. If we sleep with enqueued futexes, we might miss a wake, so dequeue everything before sleeping.

If get_user succeeds, this mean that the address is valid and we do the work again. Since we just handled the page fault, the page is likely pinned in memory and we should be luckier this time and be able to get the value. If we fail anyway, we will try again.

If even with page faults enabled we get and error, this means that the address is not valid and we return from the syscall.

If we got an unexpected value or need to treat a page fault and realized that a futex was awakened, we can priority this and return success.

In success, enqueue the futex in the correct bucket

Return

  • 1 - We were awake in the process and nothing is enqueued

  • 0 - Everything is enqueued and we are ready to sleep

  • 0< - Something went wrong, nothing is enqueued, return error code

int __futex_waitv(struct futex_waiter_head *futexv, unsigned int nr_futexes, struct __kernel_timespec __user *timo, unsigned int flags)

Enqueue the list of futexes and wait to be woken

Parameters

struct futex_waiter_head *futexv

List of futexes to wait

unsigned int nr_futexes

Length of futexv

struct __kernel_timespec __user *timo

Timeout

unsigned int flags

Timeout flags

Return

  • 0 >= - Hint of which futex woke us

  • 0 < - Error code

int compat_futex_parse_waitv(struct futex_waiter_head *futexv, struct compat_futex_waitv __user *uwaitv, unsigned int nr_futexes)

Parse a waitv array from userspace

Parameters

struct futex_waiter_head *futexv

Kernel side list of waiters to be filled

struct compat_futex_waitv __user *uwaitv

Userspace list to be parsed

unsigned int nr_futexes

Length of futexv

Return

Error code on failure, pointer to a prepared futexv otherwise

int futex_parse_waitv(struct futex_waiter_head *futexv, struct futex_waitv __user *uwaitv, unsigned int nr_futexes)

Parse a waitv array from userspace

Parameters

struct futex_waiter_head *futexv

Kernel side list of waiters to be filled

struct futex_waitv __user *uwaitv

Userspace list to be parsed

unsigned int nr_futexes

Length of futexv

Return

Error code on failure, pointer to a prepared futexv otherwise

struct futex_waiter_head *futex_get_parent(uintptr_t waiter, unsigned int index)

For a given futex in a futexv list, get a pointer to the futexv

Parameters

uintptr_t waiter

Address of futex in the list

unsigned int index

Index of futex in the list

Return

A pointer to its futexv struct

void futex_mark_wake(struct futex_waiter *waiter, struct futex_bucket *bucket, struct wake_q_head *wake_q)

Find the task to be wake and add it in wake queue

Parameters

struct futex_waiter *waiter

Waiter to be wake

struct futex_bucket *bucket

Bucket to be decremented

struct wake_q_head *wake_q

Wake queue to insert the task

int futex_parse_requeue(struct futex_requeue *rq, struct futex_requeue __user *uaddr, bool *shared)

Copy a user struct futex_requeue and check it’s flags

Parameters

struct futex_requeue *rq

Kernel struct

struct futex_requeue __user *uaddr

Address of user struct

bool *shared

Out parameter, defines if this is a shared futex

Return

0 on success, error code otherwise