N2699 - Sized Memory Deallocation

David Goldblatt, Facebook (davidtgoldblatt@gmail.com)

Chris Kennelly, Google (ckennelly@google.com)

Mark Santaniello, Facebook (marksan@fb.com)

2021-03-24

Summary

Several heap allocators1 (hereafter, "allocators") expose as extensions variants of free that accept, in addition to the address of the allocation, an additional argument "reminding" the allocator of the size of that allocation. These extensions can reduce deallocation cost by 30%, allow extra security-hardening functionality, and currently ship in several implementations. We propose standardizing these functions, so that these improvements are available portably.

void free_sized(void *ptr, size_t size);

Would deallocate the storage pointed to by ptr, informing the allocator that the storage was allocated with a call to one of malloc, realloc, or calloc with the associated size (or, in the case of calloc, with two arguments having size as their product).

void free_aligned_sized(void *ptr, size_t alignment, size_t size);

Would behave similarly, but could be used for storage allocated by aligned_alloc.

Under the API we propose, the following code snippet:

void *buf = malloc(size);
use(buf, size);
free(buf);

void *aligned_buf = aligned_alloc(alignment, size);
use_aligned(buf, size, alignment);
free(buf);

Could be written equivalently (but perhaps more performantly, or with more safety checking) as:

void *buf = malloc(size);
use(buf, size);
free_sized(buf, size);

void *aligned_buf = aligned_alloc(size, alignment);
use_aligned(buf, size, alignment);
free_aligned_sized(buf, alignment, size);

Implementing both proposed functions by falling back to free is valid, so the worst-case implementation overhead is minimal.

Background

Motivation

Many modern allocators divide memory into discrete sizes, keeping their metadata out-of-band. It is also common to cache recently-freed allocations per-thread, or per-core. When deallocating, this type of allocator must determine the true size of the object to put it into the correct per-thread or per-core cache. Looking up the size given just a pointer requires walking a hashtable or radix tree, which requires several memory loads and (often) cache misses. The combination of caching and out-of-band metadata motivate the inclusion of the sized-deallocation extensions.

Many times, when an application is deallocating storage, it already "knows" the size of that storage in some application-specific way This information can be passed to the allocator such that it can modify the appropriate free list without the typical metadata accesses to look up the size of the allocation. This lookup can be the largest single cost of deallocation; in some popular allocators taking up 30% of cycles along the deallocation pathways23 .

Moreover, sized-deallocation can be an aid in testing and security-hardening. If storage is deallocated with an incorrect size, either heap corruption has occurred, or the application has incorrectly tracked the sizes of its allocations, risking a heap overrun vulnerability. In either case, the process can be aborted (or the error otherwise managed). When the equivalent C++ functionality was deployed in both Google and Facebook’s codebases, it exposed a significant number of correctness bugs caused by incorrect type-tracking in applications. With a standardized name, tools like Address Sanitizer can flag these sorts of errors, as they do in C++ code.

Allocators which do not benefit from the additional size information, can easily provide trivial implementations of free_sized and free_aligned_sized which simply ignore it:

void free_sized(void *ptr, size_t /* unused */ size) {
    free(ptr);
}

void free_aligned_sized(void *ptr, size_t /* unused */ alignment, size_t /* unused */ size) {
    free(ptr);
}

Cost of non-portability

To users

Because sized-deallocation is non-standard, portable code has to resort to #ifdefs and linker hacks to use such functionality only conditionally. For example, the deallocation wrappers in BoringSSL requires a declaration for a weak version of one such extension, sdallocx, which is checked against NULL at runtime4:

The true cost is likely also paid in performance; most users simply don't bother using non-portable extensions, and get slower or less secure programs as a result.

To other language implementations

C is often a shared substrate on which other languages build; Rust and most C++ implementations delegate their language-specific allocation interfaces to the C ones (e.g. C++’s operator new / operator delete are typically implemented as calls to malloc and free). Scripting languages often delegate to the underlying heap allocator. Even though these languages support sized-deallocation, they can't inform the underlying allocator of sizes portably. To maximally utilize underlying functionality, each language implementation needs to have custom support for each libc the implementation might be used alongside (one per available extension). Standardizing names for these solves this M:N problem.

Current Sized-Deallocation Support

Two allocators (TCMalloc and jemalloc) expose this functionality under the name sdallocx. The latter is the system allocator for FreeBSD, where sdallocx is exposed as an extension.

GrapheneOS (a security-focused mobile OS compatible with Android apps and devices) uses a hardened allocator that exposes free_sized, for the heap consistency checking purposes indicated above.

Mimalloc (a malloc replacement from Microsoft research, embedded in various language runtimes) exposes these functions as mi_free_size and mi_free_size_aligned.

libumem ships with Solaris and illumos, and exposes sized deallocation via umem_free5.

C++ exposes equivalent functionality via additional overloads of operator delete as of C++146.

The Rust allocator interface is implicitly sized (i.e. all deallocations are sized-deallocations).

Informal implementor polling

We reached out to the allocator authors we were able to; we’ve summarized discussions here.

TCMalloc and jemalloc use this functionality on their fast paths, and regard it as a vital performance improvement (the authors list includes maintainers from both projects). Both do security hardening using size hints, in varying degrees depending on configuration (e.g. "always validate"/"never validate"/"check down paths where we touch metadata anyways, like when flushing thread-local caches").

The GrapheneOS hardened malloc implementer already implements free_sized under the described semantics (using it for the security checking purposes), and supports standardization. He notes the C++ benefits as well.

The OpenBSD implementer sees the ability to specify the size to free as a good thing, though he is interested more in the security benefits than in the performance ones.

The Windows C runtime delegates to the underlying kernel library HeapAlloc/HeapFree APIs, and the maintainers like the idea of being able to turn on size checking as a testing mechanism in debug builds. The current HeapAlloc implementation wouldn’t see a performance benefit from avoiding the metadata lookups, and the maintainers are unsure they would trust the user-provided size without checking. They would rather the free_sized implementation call some new underlying HeapAlloc functionality than just delegate to free. This would let them catch any user bugs now, and avoid uncovering latent bugs if at some point in the future they decide to use the size hints for a performance improvement. They’re willing to add such support functions to the HeapAlloc API.

The current glibc implementation wouldn't see a performance benefit from sized deallocation, but might after some future enhancements the maintainers are considering. They like the security hardening possibilities, but wouldn’t risk trusting the user-specified size without validating it.

The musl maintainer is generally skeptical of API extensions, and leans that way here as well. He notes that the performance and security advantages can be in direct tension -- if an allocator trusts user sizes uncritically, then they also present a whole new attack vector. He sees some value in the increased hardening possibilities.

Suggested Wording

To be added to the memory management functions section (7.22.3, relative to n2596)

7.22.3.X The free_sized function

Synopsis

1.

#include <stdlib.h>
void free_sized(void *ptr, size_t size);

Description

2. If ptr is the result obtained from a call to malloc(size), realloc(old_ptr, size), or calloc(nmemb, memb_size), where nmemb * memb_size is equal to size, this function behaves equivalently to free(ptr). Otherwise, the result is undefined.

3. NOTE: A conforming implementation may simply ignore size and call free.

4. NOTE: The result of an aligned_alloc call may not be passed to free_sized.

5. Implementations may provide extensions to query the usable size of an allocation, or to determine the usable size of the allocation that would result if a request for some other size were to succeed. Such implementations should allow passing the resulting usable size as the size parameter, and provide functionality equivalent to free in such cases.

Returns

6. The free_sized function returns no value.

7.22.3.Y The free_aligned_sized function

Synopsis

1.

#include <stdlib.h>
void free_aligned_sized(void *ptr, size_t alignment, size_t size);

Description

2. If ptr is the result obtained from a call to aligned_alloc(alignment, size), this function behaves equivalently to free(ptr). Otherwise, the result is undefined.

3. NOTE: A conforming implementation may simply ignore size and alignment and call free.

4. NOTE: The result of a malloc, calloc, or realloc call may not be passed to free_aligned_sized.

5. Implementations may provide extensions to query the usable size of an allocation, or to determine the usable size of the allocation that would result if a request for some other size were to succeed. Such implementations should allow passing the resulting usable size as the size parameter, and provide functionality equivalent to free in such cases.

Returns

6. The free_aligned_sized function returns no value.

Design discussion

Naming

It’s hard to prove that any given name added to the standard library won’t collide with existing code. There are two Google hits for free_sized are in allocator implementations, and match the proposed prototype. One is from a seemingly unmaintained user library that declares it as a static function; it would have to change if recompiled against this declaration of free_sized, but there would not be any ABI breakage if the library were simply relinked. free_aligned_sized appears to be a globally unique name, at least as far as Google is concerned.

sized_free and aligned_sized_free are other reasonable choices, with the former having somewhat more Google-able name conflicts.

Restrictions on aligned deallocation

These semantics disallow passing an allocation obtained from alloc_aligned to free_sized (i.e. omitting the alignment parameter during deallocation), or vice versa. So, if a program keeps track of the size of allocation requested, but not whether or not it was allocated with user-specified alignment, it has to use free to deallocate the storage rather than free_sized. This may seem unintuitive; informing the allocator of a true fact is disallowed.

This stems from an implementation constraint (one shared by all existing implementations the authors are aware of); many allocators implement aligned allocation by increasing the effective size of the returned storage. Passing the smaller size back to free_sized would cause an incorrect size class computation, rendering the performance and security benefits unobtainable.

An alternate approach would be to only include free_aligned_sized, and require allocations obtained from malloc, realloc, or calloc to pass 0 for alignment. This shaves down the specification by a function, but at the cost of making the remaining function somewhat more complex, and using a dynamic property (the value of alignment) to pass information available statically (the source of the allocation). It also forces languages built on C to go through a trampoline even for the common case of allocations with default alignment, which imposes modest performance costs on them.

Argument ordering

This wording has free_aligned_sized parallel alloc_aligned, and its parameter order is (ptr, alignment, size). C++ and Rust instead make their argument ordering (ptr, size, alignment). By matching aligned_alloc, we disallow those languages from implementing their deallocation redirects purely in the linker (i.e. they need a host-language trampoline function that swaps their second and third argument before calling free_aligned_sized).

An alternative to consider would be renaming to free_sized_aligned, and swapping the second and third arguments. Practically, a small enough fraction of deallocations will use this function that it’s unclear that this overhead matters.

Application Life-Cycle

Existing systems often deploy custom allocators by using ELF symbol interposition. This means that existing system administrators may be preloading custom allocators to provide developer tooling functionality or improved performance with certain applications given a particular workload (e.g. Using LD_PRELOAD to load jemalloc). Newly deployed applications that use the sized deallocation provided by the system C library will likely fail to operate correctly unless the interposers are also updated to match the system C library, or the C library takes care to handle this case specially. In deploying a new system C library that provides the new sized deallocation API the surrounding ecosystem should be coordinated to ensure the distribution is consistently providing interposers for the new API where required. The C library has avenues to address this issue on its own: at the time of the first call to the libc malloc, the allocator can set a flag indicating that it’s the libc allocator that’s in use. The libc free_sized and free_aligned_sized implementations can check the flag and delegate to free if some other allocator is providing the in-use definition of malloc.

Performance / security tradeoffs

Some reviewers made comments along the lines of the musl maintainer: that while performance-focused allocators might become more performant with sized allocation, it might be at the cost of security, since uncritically trusting user-provided sizes may present a new attack vector. Likewise, security-focused allocators might become more secure with sized deallocation, but without a performance improvement (since they need to validate sizes anyway).

While individual implementations might choose those portions of the design space, it's worth noting that both types of allocators can improve along both dimensions simultaneously; making the size information immediately available can improve instruction-level parallelism for redzone checking in security-focused allocators (by breaking address dependencies between cache misses), and can allow extra security checking in performance-focused ones.

calloc overflow handling

The current wording of calloc does not directly address the possibility that the product of its arguments overflows a size_t, but it nevertheless returns a non-NULL pointer (one whose size we cannot describe with a size_t). In such cases, the size argument passed to free_sized would be smaller than the true size of the allocation. There are a few ways we could handle this.

We recommend the first approach for the purposes of this paper, but would encourage consideration of the third as well.


  1. C standard library implementations include dynamic storage allocation, but often make this functionality replaceable with alternate definitions of malloc, free, etc. supplied by a different vendor. We have tried to be consistent in our terminology, and use "implementation" to refer to a whole C implementation in its entirety, and "allocator" to refer to any particular implementation of aligned_alloc, calloc, free, malloc, realloc specifically.

  2. This is based on the savings in the deployment experience of the jemalloc and TCMalloc teams. This opportunity has been rediscovered many times by many different vendors; see e.g. https://www.samxi.org/papers/kanev_asplos2017.pdf or https://research.fb.com/wp-content/uploads/2020/03/Accelerometer-Understanding-Acceleration-Opportunities-for-Data-Center-Overheads-at-Hyperscale.pdf .

  3. The true savings are probably even higher; avoiding these metadata lookups reduces the allocator’s cache footprint generally, so that other parts of the program benefit as well.

  4. See https://github.com/google/boringssl/blob/94634a72b1cbc327dccfefd577436576992c644e/crypto/mem.c#L104 and the surrounding code.

  5. https://docs.oracle.com/cd/E88353_01/html/E37843/umem-free-3malloc.html

  6. http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2013/n3536.html