Audience: LEWG, SG14, WG21
Document number: D0447R9
Date: 2019-10-11
Project: Introduction of std::colony to the standard library
Reply-to: Matthew Bentley <mattreecebentley@gmail.com>

Introduction of std::colony to the standard library

Table of Contents

  1. Introduction
  2. Motivation and Scope
  3. Impact On the Standard
  4. Design Decisions
  5. Technical Specifications
  6. Acknowledgements
  7. Appendixes:
    1. Member functions list
    2. Reference implementation benchmarks
    3. Frequently Asked Questions
    4. Specific responses to previous committee feedback
    5. Typical game engine requirements
    6. Questions for reviewers
    7. Paper revision history

I. Introduction

The purpose of a container in the standard library cannot be to provide the most optimal solution for all scenarios. Inevitably in fields such as high-performance trading or gaming, the optimal solution within critical loops will be a custom-made one that fits that scenario perfectly. However, outside of the most critical of hot paths, there is a wide range of application for more generalised solutions.

Colony is a formalisation, extension and optimization of what is typically known as a 'bucket array' container in game programming circles; similar structures exist in various incarnations across the high-performance computing, high performance trading, physics simulation, robotics, server/client application and particle simulation fields (see: https://groups.google.com/a/isocpp.org/forum/#!topic/sg14/1iWHyVnsLBQ).

The concept of a bucket array is: you have multiple memory blocks of elements, and a boolean token for each element which denotes whether or not that element is 'active' or 'erased'. If it is 'erased', it is skipped over during iteration. When all elements in a block are erased, the block is removed, so that iteration does not lose performance by having to skip empty blocks. If an insertion occurs when all the blocks are full, a new memory block is allocated.
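The description above can be made concrete with a minimal sketch. It is an illustration only, not the reference implementation: `BucketArray`, `Block`, `Handle` and the fixed block size of 4 are all hypothetical names and values chosen for brevity, and erased slots are deliberately never reused.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical minimal bucket array of ints (illustrative names throughout).
class BucketArray {
public:
    static constexpr std::size_t block_size = 4;  // assumed fixed block size

    struct Block {
        std::array<int, block_size> elements{};
        std::array<bool, block_size> active{};  // boolean token per element
        std::size_t used = 0;                   // slots handed out so far
        std::size_t live = 0;                   // currently active elements
    };
    struct Handle { Block* block; std::size_t slot; };

    // Insert at the back; allocate a new block only when the last one is full.
    Handle insert(int value) {
        if (blocks_.empty() || blocks_.back()->used == block_size)
            blocks_.push_back(std::make_unique<Block>());
        Block& b = *blocks_.back();
        b.elements[b.used] = value;
        b.active[b.used] = true;
        ++b.live;
        return Handle{&b, b.used++};
    }

    // Mark the element erased; remove the block once nothing in it is active.
    void erase(Handle h) {
        assert(h.block->active[h.slot]);
        h.block->active[h.slot] = false;
        if (--h.block->live == 0) {
            auto it = std::find_if(blocks_.begin(), blocks_.end(),
                [&](const std::unique_ptr<Block>& b) { return b.get() == h.block; });
            blocks_.erase(it);
        }
    }

    int& get(Handle h) {
        assert(h.block->active[h.slot]);
        return h.block->elements[h.slot];
    }

    // Iterate over all blocks, branching on the boolean token of every slot.
    long sum() const {
        long total = 0;
        for (const auto& b : blocks_)
            for (std::size_t i = 0; i != b->used; ++i)
                if (b->active[i]) total += b->elements[i];
        return total;
    }

    std::size_t block_count() const { return blocks_.size(); }

private:
    std::vector<std::unique_ptr<Block>> blocks_;
};
```

Because each block is allocated separately, the address of an element never changes while that element is active, even when other blocks are added or removed.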

The advantages of this structure are as follows: because a skipfield is used, no reallocation of elements is necessary upon erasure. Because the structure uses multiple memory blocks, insertions to a full container also do not trigger reallocations. This means that element memory locations stay stable and pointers/references stay valid regardless of erasure/insertion. This is highly desirable, for example, in game programming because there are usually multiple elements in different containers which need to reference each other during gameplay and elements are being inserted or erased in real time.

Problematic aspects of a typical bucket array are that they tend to have a fixed memory block size, do not re-use memory locations from erased elements, and utilize a boolean skipfield. The fixed block size (as opposed to block sizes with a growth factor) and lack of erased-element re-use lead to far more allocations/deallocations than are necessary. Given that allocation is typically a costly operation in most operating systems, this becomes important in performance-critical environments. The boolean skipfield makes iteration time complexity undefined, as there is no way of knowing ahead of time how many erased elements occur between any two non-erased elements. It also requires branching code, which may cause issues on processors with deep pipelines and poor branch-prediction failure performance.

A colony uses a non-boolean, largely non-branching method for skipping runs of erased elements, which allows for O(1) amortized iteration time complexity and more-predictable iteration performance than a bucket array. It also utilizes a growth factor for memory blocks and reuses erased element locations upon insertion, which leads to fewer allocations/reallocations. Because it reuses erased element memory space, the exact location of insertion is undefined, unless no erasures have occurred or an equal number of erasures and insertions have occurred (in which case the insertion location is the back of the container). The container is therefore considered unordered but sortable. Lastly, because there is no way of predicting in advance where erasures ('skips') may occur during iteration, an O(1) time complexity [] operator is impossible and the container is bidirectional, but not random-access.

[Figures omitted: visual demonstrations of inserting to a full vector, inserting to a full colony, randomly erasing from a vector, and randomly erasing from a colony.]

There are two patterns for accessing stored elements in a colony: the first is to iterate over the container and process each element (or skip some elements using the advance/prev/next/iterator ++/-- functions). The second is to store the iterator returned by the insert() function (or a pointer derived from the iterator) in some other structure and access the inserted element in that way.

II. Motivation and Scope

Note: Throughout this document I will use the term 'link' to denote any form of referencing between elements whether it be via iterators/pointers/indexes/references/ids/etc.

There are situations where data is heavily interlinked, iterated over frequently, and changing often. An example is the typical video game engine. Most games will have a central generic 'entity' or 'actor' class, regardless of their overall schema (an entity class does not imply an ECS). Entity/actor objects tend to be 'has a'-style objects rather than 'is a'-style objects, which link to, rather than contain, shared resources like sprites, sounds and so on. Those shared resources are usually located in separate containers/arrays so that they can be re-used by multiple entities. Entities are in turn referenced by other structures within a game engine, such as quadtrees/octrees, level structures, and so on.

Entities may be erased at any time (for example, a wall gets destroyed and no longer is required to be processed by the game's engine, so is erased) and new entities inserted (for example, a new enemy is spawned). While this is all happening the links between entities, resources and superstructures such as levels and quadtrees, must stay valid in order for the game to run. The order of the entities and resources themselves within the containers is, in the context of a game, typically unimportant, so an unordered container is okay.

Unfortunately the container with the best iteration performance in the standard library, vector[1], loses pointer validity to elements within it upon insertion, and pointer/index validity upon erasure. This tends to lead to sophisticated and often restrictive workarounds when developers attempt to utilize vector or similar containers under the above circumstances.
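As a small illustration of the invalidation described above, the following sketch uses only std::vector. The two helper function names are hypothetical; the first shows an index 'link' changing meaning after a non-back erasure, and the second detects the capacity change that accompanies a reallocation (which invalidates every pointer into the vector).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Returns the value found at index 'link' after the first element is erased,
// or -1 if the link now points past the end. (Hypothetical helper.)
inline int value_at_link_after_front_erase(std::vector<int> v, std::size_t link) {
    assert(!v.empty());
    v.erase(v.begin());  // every later element shifts down by one index
    return link < v.size() ? v[link] : -1;
}

// Appends one element and reports whether the vector was full beforehand and
// its capacity changed, i.e. whether its elements were reallocated.
// (Hypothetical helper.)
inline bool grows_on_push_back(std::vector<int>& v) {
    const std::size_t old_capacity = v.capacity();
    const bool was_full = (v.size() == old_capacity);
    v.push_back(0);
    return was_full && v.capacity() != old_capacity;
}
```

With the vector {10, 20, 30}, a link to index 1 originally refers to 20 but refers to 30 after the front element is erased, and a link to index 2 no longer refers to any element.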

std::list and the like are not suitable due to their poor locality, which leads to poor cache performance during iteration. This is however an ideal situation for a container such as colony, which has a high degree of locality. Even though that locality can be punctuated by gaps from erased elements, it still works out better in terms of iteration performance[1] than every existing standard library container other than deque/vector, regardless of the ratio of erased to non-erased elements.

Some more specific requirements for containers in the context of game development are listed in the appendix.

As another example, particle simulation (weather, physics etcetera) often involves large clusters of particles which interact with external objects and each other. The particles each have individual properties (spin, momentum, direction etc) and are being created and destroyed continuously. Therefore the order of the particles is unimportant; what is important is the speed of erasure and insertion. No current standard library container has both strong insertion and non-back erasure speed, so again this is a good match for colony.

Reports from other fields suggest that, because most developers aren't aware of containers such as this, they often end up using solutions which are sub-par for iteration such as std::map and std::list in order to preserve pointer validity, when most of their processing work is actually iteration-based. So, introducing this container would both create a convenient solution to these situations, as well as increasing awareness of better-performing approaches in general. It will also ease communication across fields, as opposed to the current scenario where each field uses a similar container but each has a different name for it.

III. Impact On the Standard

This is a pure library addition; no changes to the standard are necessary aside from the introduction of the colony container.
A reference implementation of colony is available for download and use here.

IV. Design Decisions

The three core aspects of a colony from an abstract perspective are:

  1. A collection of element memory blocks + metadata, to prevent reallocation during insertion (as opposed to a single memory block)
  2. A non-boolean skipfield, to enable O(1) skipping of erased elements during iteration (as opposed to reallocating subsequent elements during erasure)
  3. An erased-element location recording mechanism, to enable the re-using of memory from erased elements during subsequent insertions

Each memory block houses multiple elements. The metadata about each block may or may not be allocated with the blocks themselves (could be contained in a separate structure). This metadata might include, for example, the number of erased elements within each block and the block's capacity - which would allow the container to know when the block is empty. A non-boolean skipfield is required in order to skip over erased elements during iteration while maintaining O(1) amortized iteration time complexity. Finally, a mechanism for keeping track of elements which have been erased must be present, so that those memory locations can be reused upon subsequent element insertions.

The following aspects of a colony must be implementation-defined in order to allow for variance in implementations:

But their implementation is significantly constrained by the requirements of the container (lack of reallocation and stable pointers to non-erased elements regardless of erasures/insertions, etcetera).

In terms of the reference implementation, the specific structure and mechanisms have changed many times over the course of development; however, the interface to the container and its time complexity guarantees have remained largely unchanged (with the exception of the time complexity for updating skipfield nodes). So it is reasonably likely that, regardless of specific implementation, it will be possible to maintain this general specification without obviating future improvements in implementation, so long as time complexity guarantees for updating skipfields are left implementation-defined.

Below I will explain the reference implementation's approach in terms of the three aspects described above, along with some alternatives for implementation.

1. Collection of element memory blocks + metadata

In the reference implementation this is essentially a doubly-linked list of 'group' structs containing (a) memory blocks, (b) memory block metadata and (c) skipfields. The memory blocks and skipfields have a growth factor of 2 from one group to the next. The metadata includes information necessary for an iterator to iterate over colony elements, such as the last insertion point within the memory block, and other information useful to specific functions, such as the total number of non-erased elements in the node. This approach keeps the operation of freeing empty memory blocks from the colony container at O(1) time complexity. Further information is available here.
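A rough sketch of such a structure follows. It assumes a default minimum capacity of 8 and omits the element memory, the skipfield and any maximum-capacity clamp; `Group` and `GroupList` are illustrative names rather than the reference implementation's types. It shows the growth factor of 2 and why unlinking an empty block from a doubly-linked list is O(1).

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical per-block metadata; element storage and skipfield omitted.
struct Group {
    std::size_t capacity;
    std::size_t live = 0;  // number of non-erased elements in this block
    Group* prev = nullptr;
    Group* next = nullptr;
    explicit Group(std::size_t cap) : capacity(cap) {}
};

struct GroupList {
    Group* head = nullptr;
    Group* tail = nullptr;

    GroupList() = default;
    GroupList(const GroupList&) = delete;
    GroupList& operator=(const GroupList&) = delete;
    ~GroupList() { while (head) remove(head); }

    // Append a block twice the capacity of the current last block.
    Group* append(std::size_t min_capacity = 8) {
        Group* g = new Group(tail ? tail->capacity * 2 : min_capacity);
        g->prev = tail;
        if (tail) tail->next = g; else head = g;
        tail = g;
        return g;
    }

    // Unlink and free a block: a constant number of pointer updates,
    // independent of how many blocks the list holds.
    void remove(Group* g) {
        assert(g != nullptr);
        if (g->prev) g->prev->next = g->next; else head = g->next;
        if (g->next) g->next->prev = g->prev; else tail = g->prev;
        delete g;
    }
};
```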

An alternative implementation could be to use a vector of pointers to dynamically-allocated memory blocks + skipfields in a single struct, with a separate vector of memory block metadata structs. Such an approach would have some advantages in terms of increasing the locality for metadata during iteration, but would create reallocation costs when memory blocks + their skipfields and metadata were removed upon becoming empty.

A vector of memory blocks, as opposed to a vector of pointers to memory blocks, would not work as it would (a) disallow a growth factor in the memory blocks and (b) invalidate pointers to elements in subsequent blocks when a memory block became empty of elements and was therefore removed from the vector. In short it would negate all of a colony's beneficial aspects.

2. Non-boolean skipfield

The reference implementation currently uses a skipfield pattern called the Bentley pattern (current version of paper in-progress). This effectively encodes the run-length of sequences of contiguous erased elements, into a skipfield, which allows for O(1) time complexity during iteration. Since there is no branching involved in iterating over the skipfield aside from end-of-block checks, it is less problematic than a boolean skipfield (which has to branch for every skipfield read) in terms of CPUs which don't handle branching or branch-prediction failure efficiently.

This pattern stores and modifies the run-lengths during insertion and erasure, with O(1) time complexity. It has a lot of similarities to the advanced jump-counting skipfield pattern, which was the pattern previously used by the reference implementation.
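To illustrate the general idea of a run-length skipfield with O(1) updates, here is a minimal sketch. It is not claimed to be the Bentley pattern itself: it assumes that only the first and last node of each run of erased elements hold the run length, that non-erased nodes hold zero, and that one extra zero node sits past the end of the block as a sentinel. The `Skipfield` type and its members are hypothetical, and reinsertion into erased slots is not shown.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical run-length skipfield for a single block of n elements.
struct Skipfield {
    std::vector<unsigned short> nodes;

    // One extra node (always zero) acts as an end-of-block sentinel.
    explicit Skipfield(std::size_t n) : nodes(n + 1, 0) {}

    std::size_t size() const { return nodes.size() - 1; }

    // Erase a currently non-erased element i, merging with any adjacent runs.
    // Only the first and last node of the resulting run are written, so the
    // cost is O(1) regardless of run length. The run length is assumed to fit
    // in the node type, which is what limits the block size.
    void erase(std::size_t i) {
        assert(i < size());
        const std::size_t left = (i > 0) ? nodes[i - 1] : 0;  // run ending at i-1
        const std::size_t right = nodes[i + 1];               // run starting at i+1
        const unsigned short run = static_cast<unsigned short>(left + right + 1);
        nodes[i - left] = run;   // first node of the merged run
        nodes[i + right] = run;  // last node of the merged run
    }

    // Index of the first non-erased element, or size() if there is none.
    std::size_t first() const { return nodes[0]; }

    // Index of the next non-erased element after non-erased element i,
    // without branching on individual nodes.
    std::size_t next(std::size_t i) const {
        ++i;
        return i + nodes[i];
    }
};
```

Forward iteration only ever lands on the first node of a run, which is why interior nodes of a run can be left stale in this sketch.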

Using an advanced jump-counting skipfield is an alternative, though the skipfield update time complexity guarantees for that pattern are effectively undefined, or between O(1) and O(skipfield length) for each insertion/erasure. In practice those updates result in one memcpy operation which resolves to a single block-copy operation, but it is still a little slower than the Bentley skipfield. The skipfield type you use will also typically have an effect on the type of memory-reuse mechanism you can utilize.

A boolean skipfield is not usable because it makes iteration time complexity undefined - it could for example result in thousands of branching statements + skipfield reads for a single ++ operation in the case of many consecutive erased elements. In the high-performance fields for which this container was initially designed, this brings with it unacceptable latency.

3. Erased-element location recording mechanism

The reference implementation currently uses two things to keep track of erased element locations:

  1. Metadata for each memory block includes a 'next block with erasures' pointer. The container itself contains a 'blocks with erasures' intrusive list-head pointer. These are used by the container to create an intrusive singly-linked list of memory blocks with erased elements which can be re-used for future insertions.
  2. Metadata for each memory block also includes a 'free list head' index number, which gives the index within the memory block, of the last erased element. The memory space of this element is reinterpret_cast'd as two index numbers, the first ("previous" index) giving the index of the previously erased element, the second ("next" index) giving the next index in the sequence (in this case a unique number because it's the head of the free list), and so on - this forms a free list of erased element memory locations which may be re-used.
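The per-block free list can be sketched as follows. This is a simplification under stated assumptions: it uses a singly-linked list of indices rather than the doubly-linked one described above, a union rather than reinterpret_cast, and int elements; `SlotBlock` and its members are hypothetical names. Tracking which slots are erased is left to the skipfield and is not shown.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical block whose erased slots store the free list in place.
class SlotBlock {
    union Slot {
        int value;              // active when the slot holds an element
        std::size_t next_free;  // active when the slot has been erased
    };
    static constexpr std::size_t none = static_cast<std::size_t>(-1);

    std::vector<Slot> slots_;
    std::size_t free_head_ = none;  // index of the most recently erased slot
    std::size_t used_ = 0;          // slots handed out from the back so far

public:
    explicit SlotBlock(std::size_t capacity) : slots_(capacity) {}

    // Reuse the most recently erased slot if there is one, else take the
    // next unused slot at the back. Returns the slot index.
    std::size_t insert(int v) {
        std::size_t i;
        if (free_head_ != none) {
            i = free_head_;
            free_head_ = slots_[i].next_free;
        } else {
            assert(used_ < slots_.size());
            i = used_++;
        }
        slots_[i].value = v;
        return i;
    }

    // Push the slot onto the free list; no allocation is needed because the
    // link is stored in the erased element's own memory.
    void erase(std::size_t i) {
        slots_[i].next_free = free_head_;
        free_head_ = i;
    }

    int& at(std::size_t i) { return slots_[i].value; }
};
```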

Previous versions of the reference implementation used a singly-linked free list instead of a doubly-linked one. This is possible with the advanced jump-counting skipfield, but not with the Bentley pattern, for various reasons.

One cannot use a stack of pointers to erased elements for this mechanism, as early versions of the reference implementation did, because this can create allocations during erasure, which changes the exception guarantees of erase. One could instead scan all skipfields until an erased location is found, though this would be slow.

Implementation of iterator class

The reference implementation's iterator stores a pointer to the current 'group' struct mentioned above, plus a pointer to the current element and a pointer to its corresponding skipfield node. An alternative approach is to store the group pointer + an index, since the index can indicate both the offset from the memory block for the element, as well as the offset from the start of the skipfield for the skipfield node. However multiple implementations and benchmarks across many processors have shown this to be worse-performing than the separate pointer-based approach, despite the increased memory cost for the iterator class itself.

++ operation is as follows, utilizing the reference implementation's Bentley skipfield pattern:

  1. Add 1 to the existing element and skipfield pointers.
  2. Dereference skipfield pointer to get content of skipfield node, add content of skipfield node to both the skipfield pointer and the element pointer. If the node is erased, its value will be a positive integer indicating the number of nodes until the next non-erased node; if not erased, it will be zero.
  3. If element pointer is beyond end of element memory block, change group pointer to next group, element pointer to the start of the next group's element memory block, skipfield pointer to the start of the next group's skipfield. Then go back to 2.

-- operation is the same except both step 1 and 2 involve subtraction rather than adding, and step 3 checks to see if element pointer is before the beginning of the element memory blocks and if so relocates to the previous group rather than the next group.
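The ++ steps above can be sketched as follows. The sketch assumes int elements, a skipfield with one extra zero node past the end of each block (so that step 2 can always dereference safely), and that an iterator at the end of the last block acts as end(); `Group`, `Iterator` and `begin_of` are illustrative names, not the reference implementation's.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical block: elements plus a run-length skipfield holding
// elements.size() + 1 nodes, the last of which is always zero.
struct Group {
    std::vector<int> elements;
    std::vector<unsigned short> skipfield;
    Group* next;
    int* end() { return elements.data() + elements.size(); }
};

struct Iterator {
    Group* group;
    int* element;
    unsigned short* skip;

    void increment() {
        ++element;  // step 1: advance both pointers by one
        ++skip;
        for (;;) {
            const unsigned short jump = *skip;  // step 2: add node content
            element += jump;
            skip += jump;
            if (element != group->end() || group->next == nullptr)
                return;  // on a live element, or at the end of the last block
            group = group->next;  // step 3: move to the start of the next block
            element = group->elements.data();
            skip = group->skipfield.data();
        }
    }
};

// Iterator to the first non-erased element of g (assumes g has one).
inline Iterator begin_of(Group& g) {
    assert(g.skipfield.size() == g.elements.size() + 1);
    const unsigned short jump = g.skipfield[0];
    return Iterator{&g, g.elements.data() + jump, g.skipfield.data() + jump};
}
```

The loop body runs at most twice per increment as long as empty blocks are removed, since the first element examined in a non-empty next block is either live or the start of a run that ends inside that block.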

Results of implementation

In practical application the reference implementation is generally faster for insertion and (non-back) erasure than current standard library containers, and generally faster for iteration than any container except vector and deque. See benchmarks here.

V. Technical Specifications

Time complexities for basic operations

General specification

Colony meets the requirements of the C++ Container, AllocatorAwareContainer, and ReversibleContainer concepts.

For the most part the syntax and semantics of colony functions are very similar to those of existing standard library containers. The formal description is as follows:

template <class T, class Allocator = std::allocator<T>, typename Skipfield_Type = unsigned short> class colony

T - the element type. In general T must meet the requirements of Erasable, CopyAssignable and CopyConstructible.
However, if emplace is utilized to insert elements into the colony, and no functions which involve copying or moving are utilized, T is only required to meet the requirements of Erasable.
If move-insert is utilized instead of emplace, T must also meet the requirements of MoveConstructible.

Allocator - an allocator that is used to acquire memory to store the elements. The type must meet the requirements of Allocator. The behavior is undefined if Allocator::value_type is not the same as T.

Skipfield_Type - an unsigned integer type. This type is used to form the skipfield which skips over erased T elements. In terms of the reference implementation, this also acts as a limiting factor on the maximum size of memory blocks, due to the way that the skipfield pattern works (e.g. unsigned short is 16-bit on most platforms, which constrains the size of individual memory blocks to a maximum of 65535 elements). unsigned short has been found to be the optimal type for the current reference implementation. However, in the case of small collections (i.e. < 1000 elements) in a memory-constrained environment, it may be useful to reduce the memory usage of the skipfield by reducing the skipfield bit depth to an 8-bit unsigned type such as unsigned char. The reduced skipfield size may also reduce cache saturation in this case without impacting iteration speed, due to the low number of elements. However, whether or not this constitutes a performance advantage is largely situational, so it is best to leave control in the end user's hands.
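The relationship between the skipfield type, the block size limit and skipfield memory use can be expressed directly. The two helper templates below are hypothetical and only restate the arithmetic in the paragraph above; they assume one skipfield node per element.

```cpp
#include <cstddef>
#include <limits>
#include <type_traits>

// Largest run length, and therefore largest block capacity, that a skipfield
// node of the given type can represent. (Hypothetical helper.)
template <typename SkipfieldType>
constexpr std::size_t max_block_capacity() {
    static_assert(std::is_unsigned<SkipfieldType>::value,
                  "the skipfield type must be an unsigned integer type");
    return std::numeric_limits<SkipfieldType>::max();
}

// Bytes of skipfield needed for n elements, assuming one node per element.
// (Hypothetical helper.)
template <typename SkipfieldType>
constexpr std::size_t skipfield_bytes(std::size_t n) {
    return n * sizeof(SkipfieldType);
}
```

On a platform with 8-bit bytes, max_block_capacity<unsigned char>() is 255, and an unsigned char skipfield for 1000 elements occupies 1000 bytes, half of what a 16-bit unsigned short skipfield would need.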

Basic example of usage (using reference implementation)

#include <iostream>
#include "plf_colony.h"

int main(int argc, char **argv)
{
  plf::colony<int> i_colony;

  // Insert 100 ints:
  for (int i = 0; i != 100; ++i)
  {
    i_colony.insert(i);
  }

  // Erase every second int (erase() returns an iterator to the next element, which ++it then steps over):
  for (plf::colony<int>::iterator it = i_colony.begin(); it != i_colony.end(); ++it)
  {
    it = i_colony.erase(it);
  }

  // Total the remaining ints:
  int total = 0;

  for (plf::colony<int>::iterator it = i_colony.begin(); it != i_colony.end(); ++it)
  {
    total += *it;
  }

  std::cout << "Total: " << total << std::endl;
  std::cin.get();
  return 0;
} 

Example demonstrating pointer stability

#include <iostream>
#include "plf_colony.h"

int main(int argc, char **argv)
{
  plf::colony<int> i_colony;
  plf::colony<int>::iterator it;
  plf::colony<int *> p_colony;
  plf::colony<int *>::iterator p_it;

  // Insert 100 ints to i_colony and pointers to those ints to p_colony:
  for (int i = 0; i != 100; ++i)
  {
    it = i_colony.insert(i);
    p_colony.insert(&(*it));
  }

  // Erase half of the ints:
  for (it = i_colony.begin(); it != i_colony.end(); ++it)
  {
    it = i_colony.erase(it);
  }

  // Erase half of the int pointers:
  for (p_it = p_colony.begin(); p_it != p_colony.end(); ++p_it)
  {
    p_it = p_colony.erase(p_it);
  }

  // Total the remaining ints via the pointer colony (pointers will still be valid even after insertions and erasures):
  int total = 0;

  for (p_it = p_colony.begin(); p_it != p_colony.end(); ++p_it)
  {
    total += *(*p_it);
  }

  std::cout << "Total: " << total << std::endl;

  if (total == 2500)
  {
    std::cout << "Pointers still valid!" << std::endl;
  }

  std::cin.get();
  return 0;
} 

Iterator Invalidation

Operations | Iterators invalidated
All read-only operations, swap, std::swap, free_unused_memory | Never
clear, sort, reinitialize, operator = | Always
change_block_sizes, change_minimum_block_size, change_maximum_block_size | Only if supplied minimum block size is larger than smallest block in colony, or supplied maximum block size is smaller than largest block in colony.
erase | Only for the erased element. If an iterator is == end() it may be invalidated if the last element in the colony is erased, in some cases (similar to std::deque). If a reverse_iterator is == rend() it may be invalidated if the first element in the colony is erased, in some cases.
insert, emplace | If an iterator is == end() it may be invalidated by a subsequent insert/emplace, in some cases.

Member types

Member type              Definition
value_type               T
allocator_type           Allocator
skipfield_type           T_skipfield_type
size_type                std::allocator_traits<Allocator>::size_type
difference_type          std::allocator_traits<Allocator>::difference_type
reference                value_type &
const_reference          const value_type &
pointer                  std::allocator_traits<Allocator>::pointer
const_pointer            std::allocator_traits<Allocator>::const_pointer
iterator                 BidirectionalIterator
const_iterator           Constant BidirectionalIterator
reverse_iterator         BidirectionalIterator
const_reverse_iterator   Constant BidirectionalIterator

Constructors

standard
  colony()
  explicit colony(const allocator_type &alloc)

fill
  colony(size_type n, Skipfield_type min_block_size = 8, Skipfield_type max_block_size = std::numeric_limits<Skipfield_type>::max(), const allocator_type &alloc = allocator_type())
  explicit colony(size_type n, const value_type &element, Skipfield_type min_block_size = 8, Skipfield_type max_block_size = std::numeric_limits<Skipfield_type>::max(), const allocator_type &alloc = allocator_type())

range
  template<typename InputIterator> colony(const InputIterator &first, const InputIterator &last, Skipfield_type min_block_size = 8, Skipfield_type max_block_size = std::numeric_limits<Skipfield_type>::max(), const allocator_type &alloc = allocator_type())

copy
  colony(const colony &source)
  colony(const colony &source, const allocator_type &alloc)

move
  colony(colony &&source) noexcept
  colony(colony &&source, const allocator_type &alloc)

  Note: postcondition state of source colony is the same as that of an empty colony.

initializer list
  colony(const std::initializer_list<value_type> &element_list, Skipfield_type min_block_size = 8, Skipfield_type max_block_size = std::numeric_limits<Skipfield_type>::max(), const allocator_type &alloc = allocator_type())

Some constructor usage examples

Iterators

Iterators are bidirectional but also provide O(1) time complexity >, <, >= and <= operators for convenience (for example, for use in for loops when skipping over multiple elements per loop). The O(1) complexity of these operators are achieved by keeping a record of the order of memory blocks in some way (in the reference implementation this is done via assigning a number to each memory block in its metadata), comparing the relative order of the two iterators' memory blocks via this number, then comparing the memory locations of the elements themselves, if they happen to be in the same memory block. The full list of operators for iterator, reverse_iterator, const_iterator and const_reverse_iterator follow:
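The comparison scheme described above can be sketched in a few lines. The sketch assumes that each block's metadata carries an ordering number and that blocks hold int elements; `Group`, `Iterator` and the `number` member are illustrative names only.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical block carrying its position in the sequence of blocks.
struct Group {
    std::size_t number;
    int elements[4];
};

struct Iterator {
    const Group* group;
    const int* element;
};

// O(1): compare block numbers first, and element addresses only when both
// iterators are in the same block (where pointer comparison is well-defined).
inline bool operator<(const Iterator& a, const Iterator& b) {
    assert(a.group != nullptr && b.group != nullptr);
    if (a.group != b.group) return a.group->number < b.group->number;
    return a.element < b.element;
}
inline bool operator>(const Iterator& a, const Iterator& b) { return b < a; }
inline bool operator<=(const Iterator& a, const Iterator& b) { return !(b < a); }
inline bool operator>=(const Iterator& a, const Iterator& b) { return !(a < b); }
```

Note that the block numbers only need to be ordered, not contiguous, so removing a block does not require renumbering the others.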

operator *
operator ->
operator ++
operator --
operator =
operator ==
operator !=
operator <
operator >
operator <=
operator >=
base() (reverse_iterator and const_reverse_iterator only)

For more information see the member functions list in the appendices.

VI. Acknowledgements

Matt would like to thank: Glen Fernandes and Ion Gaztanaga for restructuring advice, Robert Ramey for documentation advice, various Boost and SG14 members for support, Baptiste Wicht for teaching me how to construct decent benchmarks, Jonathan Wakely for standards-compliance advice and critiques, Sean Middleditch, Patrice Roy and Guy Davidson for critiques, support and bug reports, that guy from Lionhead for annoying me enough to force me to implement the original skipfield pattern, Jon Blow for some initial advice and Mike Acton for some influence.
Also Nico Josuttis for doing such an excellent job in terms of explaining the general format of the structure to the committee.

Appendices

Appendix A: Member functions

Insert

single element     iterator insert (const value_type &val)
fill               iterator insert (size_type n, const value_type &val)
range              template <class InputIterator> iterator insert (InputIterator first, InputIterator last)
move               iterator insert (value_type&& val)
initializer list   iterator insert (std::initializer_list<value_type> il)

Erase

single element     iterator erase(const_iterator it)
range              void erase(const_iterator first, const_iterator last)

Other functions

Non-member functions

Note: the four immediately above are member functions in the reference implementation as a workaround for an unfixed bug in MSVC2013.

Appendix B - reference implementation benchmarks

Benchmark results for the colony v5 reference implementation under GCC 8.1 x64 on an Intel Xeon E3-1241 (Haswell) are here.

Old benchmark results for an earlier version of colony under MSVC 2015 update 3, on an Intel Xeon E3-1241 (Haswell) are here. There is no commentary for the MSVC results.

Appendix C - Frequently Asked Questions

  1. Where is it worth using a colony in place of other std:: containers?

    As mentioned, it is worthwhile for performance reasons in situations where:

    1. Insertion order is unimportant
    2. Insertions and erasures to the container occur frequently in performance-critical code, and
    3. Links to non-erased container elements may not be invalidated by insertion or erasure.

    Under these circumstances a colony will generally out-perform other std:: containers. In addition, because it never invalidates pointer references to container elements (except when the element being pointed to has been previously erased) it may make many programming tasks involving inter-relating structures in an object-oriented or modular environment much faster, and could be considered in those circumstances.

  2. What are some examples of situations where a colony might improve performance?

    Some ideal situations to use a colony: cellular/atomic simulation, persistent octtrees/quadtrees, game entities or destructible-objects in a video game, particle physics, anywhere where objects are being created and destroyed continuously. Also, anywhere where a vector of pointers to dynamically-allocated objects or a std::list would typically end up being used in order to preserve pointer stability but where order is unimportant.

  3. Is it similar to a deque?

    A deque is reasonably dissimilar to a colony - being a double-ended queue, it requires a different internal framework. In addition, being a random-access container, having a growth factor for memory blocks in a deque is problematic (not impossible though). A deque and colony have no comparable performance characteristics except for insertion (assuming a good deque implementation). Deque erasure performance varies wildly depending on the implementation, but is generally similar to vector erasure performance. A deque invalidates pointers to subsequent container elements when erasing elements, which a colony does not, and is ordered.

  4. What are the thread-safe guarantees?

    Unlike a std::vector, a colony can be read from and inserted into at the same time (assuming different locations for read and write), however it cannot be iterated over and written to at the same time. If we look at a (non-concurrent implementation of) std::vector's threadsafe matrix to see which basic operations can occur at the same time, it reads as follows (please note push_back() is the same as insertion in this regard):

    std::vector  Insertion  Erasure  Iteration  Read
    Insertion    No         No       No         No
    Erasure      No         No       No         No
    Iteration    No         No       Yes        Yes
    Read         No         No       Yes        Yes

    In other words, multiple reads and iterations over iterators can happen simultaneously, but the potential reallocation and pointer/iterator invalidation caused by insertion/push_back and erasure means those operations cannot occur at the same time as anything else.

    Colony on the other hand does not invalidate pointers/iterators to non-erased elements during insertion and erasure, resulting in the following matrix:

    colony       Insertion  Erasure  Iteration  Read
    Insertion    No         No       No         Yes
    Erasure      No         No       No         Mostly*
    Iteration    No         No       Yes        Yes
    Read         Yes        Mostly*  Yes        Yes

    * Erasures will not invalidate iterators unless the iterator points to the erased element.

    In other words, reads may occur at the same time as insertions and erasures (provided that the element being erased is not the element being read), multiple reads and iterations may occur at the same time, but iterations may not occur at the same time as an erasure or insertion, as either of these may change the state of the skipfield which is being iterated over. Note that iterators pointing to end() may be invalidated by insertion.

    So, colony could be considered more inherently threadsafe than a (non-concurrent implementation of) std::vector, but still has some areas which would require mutexes or atomics to navigate in a multithreaded environment.

  5. Any pitfalls to watch out for?

    Because erased-element memory locations may be reused by insert() and emplace(), insertion position is essentially random unless no erasures have been made, or an equal number of erasures and insertions have been made.

  6. What is the purpose of limiting memory block minimum and maximum sizes?

    One reason might be to ensure that memory blocks match a certain processor's cache or memory pathway sizes. Another reason to do this is that it is slightly slower to obtain an erased-element location from the list of groups-with-erasures (subsequently utilizing that group's free list of erased locations) and to reuse that space than to insert a new element to the back of the colony (the default behaviour when there are no previously-erased elements). If there are any erased elements in the colony, the colony will recycle those memory locations, unless the entire block is empty, at which point it is freed to memory.

    So if a block size is large, and many erasures occur but the block is not completely emptied, iterative performance might suffer due to large memory gaps between any two non-erased elements and subsequent drop in data locality and cache performance. In that scenario you may want to experiment with benchmarking and limiting the minimum/maximum sizes of the blocks, such that memory blocks are freed earlier and find the optimal size for the given use case.

  7. What is colony's Abstract Data Type (ADT)?

    Though I am happy to be proven wrong, I suspect colonies/bucket arrays are their own abstract data type. Some have suggested its ADT is of type bag. I would somewhat dispute this, as it does not have typical bag functionality such as searching based on value (you can use std::find but it's O(n)), and adding this functionality would slow down other performance characteristics. Multisets/bags are also not sortable (by means other than automatically by key value). Colony does not utilize key values, is sortable, and does not provide the sort of functionality frequently associated with a bag (e.g. counting the number of times a specific value occurs).

  8. Why must blocks be removed when empty?

    Two reasons:

    1. Standards compliance: if blocks aren't removed then ++ and -- iterator operations become undefined in terms of time complexity, making them non-compliant with the C++ standard. At the moment they are O(1) amortized, typically one update for both skipfield and element pointers, but two if a skipfield jump takes the iterator beyond the bounds of the current block and into the next block. But if empty blocks are allowed, there could be anywhere between 1 and std::numeric_limits<size_type>::max() empty blocks between the current element and the next. Essentially you get the same scenario as you do when iterating over a boolean skipfield. It would be possible to move these to the back of the colony as trailing blocks, or house them in a separate list or vector for future usage, but this may create performance issues if any of the blocks are not at their maximum size (see below).
    2. Performance: iterating over empty blocks is slower than them not being present, of course - but also if you have to allow for empty blocks while iterating, then you have to include a while loop in every iteration operation, which increases cache misses and code size. The strategy of removing blocks when they become empty also statistically removes (assuming randomized erasure patterns) smaller blocks from the colony before larger blocks, which has a net result of improving iteration, because with a larger block, more iterations within the block can occur before the end-of-block condition is reached and a jump to the next block (and subsequent cache miss) occurs. Lastly, pushing to the back of a colony, provided there is still space and no new block needs to be allocated, will be faster than recycling memory locations as each subsequent insertion occurs in a subsequent memory location (which is cache-friendlier) and also less computational work is necessary. If a block is removed its recyclable memory locations are also of course removed, hence subsequent insertions are more likely to be pushed to the back of the colony.

  9. Why not preserve empty memory blocks for future use, in a separate list or vector instead of freeing them to the OS, or leave this decision undefined by the specification?

    The default scenario, for reasons of predictability, should be to free the memory block rather than making this undefined. If a scenario calls for retaining memory blocks instead of deallocating them, this should be left to an allocator to manage. Otherwise you get unpredictable memory behaviour across implementations, and this is one of the things that SG14 members have complained about time and time again: the lack of predictable behaviour across standard library implementations. Ameliorating this unpredictability is best in my view.

  10. Memory block sizes - what are they based on, how do they expand, etc

    In the reference implementation memory block sizes start from either the default minimum size (8 elements, larger if the type stored is small) or an amount defined by the programmer (with a minimum of 3 elements). Subsequent block sizes then increase the total capacity of the colony by a factor of 2 (so, 1st block 8 elements, 2nd 8 elements, 3rd 16 elements, 4th 32 elements etcetera) until the maximum block size is reached. The default maximum block size is the maximum possible number that the skipfield bitdepth is capable of representing (std::numeric_limits<skipfield_type>::max()). By default the skipfield bitdepth is 16 so the maximum size of a block is 65535 elements.
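    The growth rule described above can be sketched as follows. This is one reading of that description, not code from the reference implementation: block_capacities is a hypothetical helper, and it assumes that each new block is sized to match the colony's current total capacity (so that the total doubles) and is clamped to the maximum block size.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical helper: capacities of the first block_count blocks.
std::vector<std::size_t> block_capacities(std::size_t min_block,
                                          std::size_t max_block,
                                          std::size_t block_count)
{
    std::vector<std::size_t> sizes;
    std::size_t total = 0;
    for (std::size_t i = 0; i < block_count; ++i) {
        // The first block uses the minimum size; each later block matches
        // the current total capacity, which doubles that total.
        std::size_t next = (total == 0) ? min_block : total;
        // No block may exceed the maximum block size.
        next = std::min(next, max_block);
        sizes.push_back(next);
        total += next;
    }
    return sizes;
}
```

    With a minimum of 8 and the default maximum of 65535, the first five blocks hold 8, 8, 16, 32 and 64 elements, matching the sequence given above.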

    However the skipfield bitdepth is also a template parameter which can be set to any unsigned integer type - unsigned char, unsigned int, uint64_t, etc. Unsigned short (guaranteed to be at least 16 bit, equivalent to C++11's uint_least16_t type) was found to have the best performance in real-world testing due to the balance between memory contiguousness, memory waste and the number of allocations.
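    As a small illustration of how the skipfield type bounds the block size, the following sketch derives the largest representable block capacity from the type. The function name max_block_capacity is hypothetical and is not part of the proposed interface.

```cpp
#include <cstddef>
#include <cstdint>
#include <limits>
#include <type_traits>

// Hypothetical helper: the largest block capacity that a given
// skipfield type can represent.
template <class skipfield_type>
constexpr std::size_t max_block_capacity()
{
    static_assert(std::is_unsigned<skipfield_type>::value,
                  "the skipfield type must be an unsigned integer type");
    return std::numeric_limits<skipfield_type>::max();
}
```

    For example, max_block_capacity<std::uint8_t>() is 255 and max_block_capacity<std::uint16_t>() is 65535. A narrower type reduces the per-element skipfield overhead at the cost of smaller maximum blocks.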

  11. Can a colony be used with SIMD instructions?

    No and yes. Yes if you're careful, no if you're not.
    On platforms which support scatter and gather operations via hardware (e.g. AVX512) you can use colony with SIMD as much as you want, using gather to load elements from disparate or sequential locations, directly into a SIMD register, in parallel. Then use scatter to push the post-SIMD-process values elsewhere after. On platforms which do not support this in hardware, you would need to manually implement a scalar gather-and-scatter operation which may be significantly slower.
    In situations where gather and scatter operations are too expensive, which require elements to be contiguous in memory for SIMD processing, this is more complicated. When you have a bunch of erasures in a colony, there's no guarantee that your objects will be contiguous in memory, even though they are sequential during iteration. Some of them may also be in different memory blocks to each other. In these situations if you want to use SIMD with colony, you must do the following:

    Generally if you want to use SIMD without gather/scatter, it's probably preferable to use a vector or an array.
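    The scalar gather-and-scatter fallback mentioned above might look something like the following sketch. It is an illustration only: lane_count is an assumed SIMD width, scale_via_gather_scatter is a hypothetical name, and the container is assumed to hold float elements. Elements are copied into a small contiguous buffer, processed there (a loop that a compiler may be able to vectorise), and then written back to their original locations.

```cpp
#include <array>
#include <cstddef>
#include <list>

// Assumed SIMD width, in floats; adjust for the target platform.
constexpr std::size_t lane_count = 8;

// Multiplies every element by factor, lane_count elements at a time.
// Container is assumed to hold float elements.
template <class Container>
void scale_via_gather_scatter(Container& elements, float factor)
{
    std::array<float, lane_count> lanes{};
    std::array<float*, lane_count> sources{};

    auto it = elements.begin();
    const auto last = elements.end();
    while (it != last) {
        // Scalar gather: copy up to lane_count elements into the buffer.
        std::size_t filled = 0;
        for (; filled < lane_count && it != last; ++filled, ++it) {
            sources[filled] = &*it;
            lanes[filled] = *it;
        }
        // Contiguous processing step.
        for (std::size_t i = 0; i < filled; ++i) {
            lanes[i] *= factor;
        }
        // Scalar scatter: write the results back where they came from.
        for (std::size_t i = 0; i < filled; ++i) {
            *sources[i] = lanes[i];
        }
    }
}
```

    The function works with any container of float that provides forward iteration, for example a std::list<float>. The two extra copies per element are the cost referred to above when hardware gather and scatter are unavailable.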

Appendix D - Specific responses to previous committee feedback

  1. "Why not 'bag'? Colony is too selective a name."

    'bag' is problematic partially because it has been synonymous with a multiset (and colony is not one of those) in both computer science and mathematics since the 1970s, and partially because it's a bit vague - it doesn't describe how the container works. However I accept that it is a familiar name and describes a similar territory for most programmers, and will accept that as a name if needed. 'colony' is an intuitive name if you understand the container, and allows for easy conveyance of how it functions internally (colony = human colony/ant colony etc, memory blocks = houses, elements = people/ants in the houses who come and go). The claim that the use of the word is selective in terms of its meaning is also true for vector, set, 'bag', and many other C++ names.

  2. "Unordered and no associative lookup, so this only supports use cases where you're going to do something to every element."

    As noted, the container was originally designed for highly object-oriented situations where you have many elements in different containers linking to many other elements in other containers. This linking can be done with pointers or iterators in colony: insert returns an iterator which can be dereferenced to get a pointer, and pointers can be converted into iterators with the supplied functions (for erase etc). Because pointers/iterators stay stable regardless of insertion/erasure, this usage is unproblematic. You could say the pointer is equivalent to a key in this case (but without the overhead). That is the first access pattern; the second is straight iteration over the container, as you say. In addition, the container does have (typically better than O(n)) advance/next/prev implementations, so multiple elements can be skipped.
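    The pointer-as-key access pattern can be illustrated with the sketch below. std::list is used purely as a stand-in because, like colony, it keeps pointers to elements valid across unrelated insertions and erasures; the texture and sprite types and the function name_after_churn are hypothetical.

```cpp
#include <list>
#include <string>

// Hypothetical element types: a sprite links to a texture by raw pointer.
struct texture { std::string name; };
struct sprite  { const texture* skin; };

// Returns the name reached through the sprite's pointer after unrelated
// insertions and erasures in the texture container.
std::string name_after_churn()
{
    std::list<texture> textures;   // stand-in for a colony of textures
    std::list<sprite> sprites;     // stand-in for a colony of sprites

    textures.push_back(texture{"grass"});
    // insert returns an iterator; dereferencing it yields the element,
    // whose address is then stored as the link.
    auto stone = textures.insert(textures.end(), texture{"stone"});
    textures.push_back(texture{"water"});
    sprites.push_back(sprite{&*stone});

    // Unrelated erasure and insertion do not disturb the link.
    textures.pop_front();
    textures.push_back(texture{"lava"});

    return sprites.front().skin->name;
}
```

    Here the stored pointer plays the role of the key: the lookup is a single dereference and remains valid until the linked element itself is erased.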

  3. "Do we really need the skipfield_type template argument?"

    This argument currently promotes use of the container in heavily memory-constrained environments, and in high-performance small-N collections (where the type of the skipfield can be reduced to 8 bits without having a negative effect on maximum block sizes and subsequent iteration speed). See more explanation in V. Technical Specifications. Unfortunately this parameter also means operator = and some other functions won't work between colonies of the same type but differing skipfield types. Further, the template argument is chiefly relevant to the use of the skipfield patterns utilized in the reference implementations, and there may be better techniques.

    However, the parameter can always be ignored in an implementation. Retaining it, even if significantly advanced structures are discovered for skipping elements, harms nothing and can be deprecated if necessary. At this point in time I do not personally see many alternatives to the two skipfield patterns which have been used in the reference implementations, both of which benefit from having this optional parameter. Please note, that is not the same as saying there are no alternatives, just ones never thought of yet. This is something I am flexible on, as a singular skipfield type will cover the majority of scenarios.

    Research into this area has determined that there is only really an advantage to using unsigned char for the skipfield type if the number of elements is under 1000, and not in all scenarios. So whether or not this constitutes a performance gain is largely scenario-dependent, certainly it always constitutes a memory usage reduction but the relative effect of this depends on the size of your stored type.

  4. "Prove this is not an allocator"

    I'm not really sure how to answer this, as I don't see the resemblance, unless you count maps, vectors etc as being allocators also. The only aspect of it which resembles what an allocator might do, is the memory re-use mechanism. It would be impossible for an allocator to perform a similar function while still allowing the container to iterate over the data linearly in memory, preserving locality, in the manner described in this document.

  5. "If this is for games, won't game devs just write their own versions for specific types in order to get a 1% speed increase anyway?"

    This is true for many/most AAA game companies who are on the bleeding edge, but they also do this for vector etc, so they aren't the target audience of std:: for the most part; sub-AAA game companies are more likely to use third party/pre-existing tools. As mentioned earlier, this structure (bucket-array-like) crops up in many, many fields, not just game dev. So the target audience is probably everyone other than AAA gaming, but even then, it facilitates communication across fields and companies as to this type of container, giving it a standardised name and understanding.

  6. "Is there active research in this problem space? Is it likely to change in future?"

    The only current analysis has been around the question of whether it's possible for this specification to fail to allow for a better implementation in future. This is unlikely given the container's requirements and how this impacts on implementation. Bucket arrays have been around since the 1990s, there's been no significant innovation in them until now. I've been researching/working on colony since early 2015, and while I can't say for sure that a better implementation might not be possible, I am confident that no change should be necessary to the specification to allow for future implementations, if it is done correctly.

    The requirement of allowing no reallocations upon insertion or erasure narrows the possible implementation strategies significantly. Memory blocks have to be independently allocated so that they can be removed (when empty) without triggering reallocation of subsequent elements. There is a limited number of ways to do that and keep track of the memory blocks at the same time. Erased element locations must be recorded (for future re-use by insertion) in a way that doesn't create allocations upon erasure, and there is a limited number of ways to do this also. Multiple consecutive erased elements have to be skipped in O(1) time, and again there are limits to how many ways you can do that. That covers the three core aspects upon which this specification is based. See IV. Design Decisions for the various ways these aspects can be designed.

    Skipfield update time complexity should, I think, be left implementation-defined, as defining time complexity may obviate better solutions which are faster but are not necessarily O(1). Skipfield updates occur during erasure, insertion, splicing, sorting and container copying. I have looked into alternatives to a 1-node-per-element skipfield, such as a compressed skipfield (a series of numbers denoting alternating lengths of non-erased/erased elements), but all the possible implementations I can think of either involve resizing of an array on-the-fly (which doesn't work well with low latency) and/or slowing down iteration time significantly.
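    To make the O(1) skipping requirement concrete, here is a simplified sketch of a 1-node-per-element skipfield in which the first node of each run of erased elements stores the length of that run. It is not the reference implementation's skipfield: build_skipfield and live_indices are hypothetical names, and the skipfield is rebuilt from a list of erased flags rather than updated incrementally on erasure and insertion.

```cpp
#include <cstddef>
#include <vector>

// Builds a skipfield from per-element erased flags. A value of 0 marks a
// live element; the first node of each erased run holds the run length.
// One extra trailing 0 acts as an end sentinel.
std::vector<std::size_t> build_skipfield(const std::vector<bool>& erased)
{
    std::vector<std::size_t> skip(erased.size() + 1, 0);
    std::size_t i = 0;
    while (i < erased.size()) {
        if (!erased[i]) {
            ++i;
            continue;
        }
        const std::size_t run_start = i;
        while (i < erased.size() && erased[i]) {
            ++i;
        }
        skip[run_start] = i - run_start;
    }
    return skip;
}

// Returns the indices visited by an iteration. Each run of erased
// elements, however long, is skipped with a single addition.
std::vector<std::size_t> live_indices(const std::vector<std::size_t>& skip)
{
    std::vector<std::size_t> out;
    if (skip.empty()) return out;
    const std::size_t count = skip.size() - 1;  // exclude the sentinel
    std::size_t i = skip[0];                    // skip a leading erased run
    while (i < count) {
        out.push_back(i);
        ++i;
        i += skip[i];  // adds 0 for a live element, the run length otherwise
    }
    return out;
}
```

    Note that the increment contains no loop and no branch on the skipfield value, which is the property that a boolean skipfield cannot provide.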

Appendix E - Typical game engine requirements

Here are some more specific requirements with regards to game engines, verified by game developers within SG14:

  1. Elements within data collections refer to elements within other data collections (through a variety of methods - indices, pointers, etc). These references must stay valid throughout the course of the game/level. Any container which causes pointer or index invalidation creates difficulties or necessitates workarounds.
  2. Order is unimportant for the most part. The majority of data is simply iterated over, transformed, referred to and utilized with no regard to order.
  3. Erasing or otherwise "deactivating" objects occurs frequently in performance-critical code. For this reason methods of erasure which create strong performance penalties are avoided.
  4. Inserting new objects in performance-critical code (during gameplay) is also common - for example, a tree drops leaves, or a player spawns in an online multiplayer game.
  5. It is not always clear in advance how many elements there will be in a container at the beginning of development, or at the beginning of a level during play. Genericized game engines in particular have to adapt to considerably different user requirements and scopes. For this reason extensible containers which can expand and contract in realtime are necessary.
  6. Due to the effects of cache on performance, memory storage which is more-or-less contiguous is preferred.
  7. Memory waste is avoided.

std::vector in its default state does not meet these requirements due to:

  1. Poor (non-fill) singular insertion performance (regardless of insertion position) due to the need for reallocation upon reaching capacity
  2. Insert invalidates pointers/iterators to all elements
  3. Erase invalidates pointers/iterators/indexes to all elements after the erased element

Game developers therefore either develop custom solutions for each scenario or implement workarounds for vector. The most common workarounds are most likely the following or derivatives:

  1. Using a boolean flag or similar to indicate the inactivity of an object (as opposed to actually erasing from the vector). Elements flagged as inactive are skipped during iteration.

    Advantages: Fast "deactivation". Easy to manage in multi-access environments.
    Disadvantages: Can be slower to iterate due to branching.
  2. Using a vector of data and a secondary vector of indexes. When erasing, the erasure occurs only in the vector of indexes, not the vector of data. When iterating it iterates over the vector of indexes and accesses the data from the vector of data via the remaining indexes.

    Advantages: Fast iteration.
    Disadvantages: Erasure still incurs some reallocation cost which can increase jitter.
  3. Combining a swap-and-pop approach to erasure with some form of dereferenced lookup system to enable contiguous element iteration (sometimes called a 'packed array': http://bitsquid.blogspot.ca/2011/09/managing-decoupling-part-4-id-lookup.html).

    Advantages: Iteration is at standard vector speed.
    Disadvantages: Erasure will be slow if objects are large and/or non-trivially copyable, thereby making swap costs large. All link-based access to elements incur additional costs due to the dereferencing system.
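Two of the workarounds above can be sketched briefly. The code below is illustrative only: the particle type and the function names are hypothetical, the first function shows the boolean-flag approach (workaround 1), and the second shows only the swap-and-pop erasure step of workaround 3, without the accompanying lookup system.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical element type carrying its own activity flag.
struct particle {
    float x;
    bool active;
};

// Workaround 1: inactive elements stay in the vector and are skipped
// during iteration, at the cost of a branch per element.
float sum_active(const std::vector<particle>& particles)
{
    float total = 0.0f;
    for (const particle& p : particles) {
        if (p.active) total += p.x;
    }
    return total;
}

// Workaround 3 (erasure step only): overwrite the erased element with the
// last element, then drop the last slot. Precondition: index < size().
template <class T>
void swap_and_pop(std::vector<T>& elements, std::size_t index)
{
    if (index + 1 != elements.size()) {
        elements[index] = std::move(elements.back());
    }
    elements.pop_back();
}
```

Erasing index 1 from {10, 20, 30, 40} with swap_and_pop leaves {10, 40, 30}: the element that was last now lives at a different index, which is why this approach needs a lookup layer for any external references.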

Colony brings a more generic solution to these contexts. While some developers, particularly AAA developers, will almost always develop a custom solution for specific use-cases within their engine, I believe most sub-AAA and indie developers are more likely to rely on third party solutions. Regardless, standardising the container will allow for greater cross-discipline communication.

Appendix F - Questions for reviewers

Please feel free to get in touch with information and opinions on the following topics:

Appendix G - Paper revision history