==========================================================

Related: N2219: Clarifying Pointer Provenance (Q1-Q20) v3 N2223: Clarifying the C Memory Object Model: Introduction to N2219 - N2222, N2090, Section 4 of N2012, Questions 3/15, 4/15, and 5/15 of our survey, Section 2.1-2.9 (Q1-20) of our N2013, and DR260

This is a revision of N2219, which was based on N2090, which was based on N2012 (Section 4) and N2013.

1 Summary

1.1 Problem

C pointer values could traditionally be considered to be concrete numeric values (our survey, Questions 3, 4, and 5, indicates many still do). The DR260 Committee Response suggests otherwise, hinting at a notion of provenance associated to values that keeps track of their "origins":

"Implementations are permitted to track the origins of a bit-pattern and treat those representing an indeterminate value as distinct from those representing a determined value. They may also treat pointers based on different origins as distinct even though they are bitwise identical."

Current compilers exploit this, using it to justify alias analysis based on provenance distinctions. However, the DR260 CR has not been incorporated in the standard text, and it also leaves many specific questions unclear: it's ambiguous whether some idioms are allowed or not, and unclear what alias analysis and optimisation are allowed to do.

In this note we go through a sequence of specific questions, supported by concrete examples, and make a candidate proposal for discussion.

1.2 Status and summary of changes since N2219

We're reasonably confident in the basic proposal, with provenances (single or empty) carried with pointer and integer values. The main open questions are:

the treatment of IO and an escape hatch for inter-object pointer arithmetic
whether provenance should be purely per-allocation or also involve subobject-granularity checks
the treatment of provenance for objects within allocated regions, and reconciling that with C++

Changes:

We've added discussion of provenance escape hatches, for IO and inter-object arithmetic - and the problem with the N2090/N2219 wildcard provenance notion
We've added a realloc example.
We've added discussion of the problem with lost escapes.
We now propose a more refined treatment of subobjects.
We've noted that the previous TC phrasing was mis-drafted in some ways.

1.3 Acknowledgements

This has been informed by helpful discussions with many people, including members of WG14 (including among others Martin Sebor, Jens Gustedt, Martin Uecker, Florian Weimar, Blaine Garst, David Keaton, and Freek Wiedijk), Richard Smith, Chris Lattner, Kostya Serebryany, Nuno Lopes, Juneyoung Lee, David Chisnall, Robert Watson, Paul McKenney, Gil Hur, Frédéric Besson, John T. Criswell, Xavier Leroy, Santosh Nagarakatte, Steve Zdancewic, and all the contributors to our surveys. Please let us know if we've forgotten to include anyone.
The work has been partly funded by the UK EPSRC REMS Programme Grant, EP/K008528/1.

1.4 Outline of proposal

The basic idea is to associate a provenance with every pointer value, essentially identifying the original allocation the pointer is derived from. This is for the C abstract machine as defined in the standard: compilers might rely on provenance in their alias analysis and optimisation, but one would not expect normal implementations to record or manipulate provenance at runtime (though dynamic or static analysis tools might). Accordingly, provenances do not have any representation.

Pointer values and integer values both carry a provenance, normally either a single provenance ID or the "empty" provenance. We also need some escape hatch(es) to support pointers derived from I/O or from rare cases of inter-object pointer arithmetic, and for device memory; we return to these below.
On every allocation (of objects with static, thread, automatic, and allocated storage duration), the abstract machine nondeterministically chooses a fresh provenance ID (unique in the entire execution), and the resulting pointer value carries that single provenance ID.
At any access via a pointer value, its numeric address must be consistent with its provenance, with undefined behaviour otherwise. In particular:
- access via a pointer value with a single provenance ID must be at least within the memory footprint of the corresponding original allocation. We impose a tighter condition w.r.t. subobjects below.
- access via a pointer value with empty provenance is undefined behaviour (except where the numeric value is within an implementation-defined set of "device" memory addresses).
This undefined behaviour is what justifies optimisation based on provenance-based alias analysis.

Then there are many specific choices of how provenance is affected by arithmetic operations and suchlike. We first discuss the questions and then summarise our proposal.

Note that provenance-based alias analysis should not be confused with type-based alias analysis, and it is (contrary to the expectations of some expert C users and OS developers) still observable in current mainstream implementations even with -fno-strict-aliasing, opening the door to subtle bugs. Compilers should probably provide an option to turn it off (which proposals for "safe C" might want to mandate). We define the behaviour with and without a hypothetical -fno-provenance flag here, but exactly how that option is expressed might be left implementation-defined (in some implementations it might not be a command-line flag). There should be a corresponding feature-test macro to expose whether it has been enabled. What granularity it should have (e.g. per-function / per-compilation-unit / per-compilation / per-link-job) needs thought.

2 Questions and examples

2.1 Basic provenance

2.1.1 Q1. Must the pointer used for a memory access have the right provenance, i.e. be derived from the pointer to the original allocation (with undefined behaviour otherwise)? (This lets compilers do provenance-based alias analysis)

Our reading of C11 and proposal for C2x: C11: yes (clear from DR260CR, but not in the standard text). C2x proposal: yes. C2x proposal (no-provenance option): no

Here DR260CR clearly says yes. Our experimental data and discussions with compiler developers show that recent compilers do assume non-aliasing of pointers with identical representation values but distinct provenance. This is incompatible with a concrete semantics of pointers (in which pointers would be fully characterised by their representation values). Tracking of provenance in the abstract machine is therefore necessary to make these compilers sound with respect to the standard. For example, consider the following pathological code (adapted from DR260).

Example: provenance_basic_global_yx.c (experimental data)

#include <stdio.h>
#include <string.h> 
int  y=2, x=1;
int main() {
  int *p = &x + 1;
  int *q = &y;
  printf("Addresses: p=%p q=%p\n",(void*)p,(void*)q);
  if (memcmp(&p, &q, sizeof(p)) == 0) {
    *p = 11;  // does this have undefined behaviour?
    printf("x=%d y=%d *p=%d *q=%d\n",x,y,*p,*q);
  }
  return 0;
}

Depending on the implementation, x and y might happen to be allocated in adjacent memory, in which case &x+1 and &y will have bitwise-identical representation values, the memcmp will succeed, and p (derived from a pointer to x) will have the same representation value as a pointer to a different object, y, at the point of the update *p=11. This can occur in practice, e.g. with GCC 8.1 -O2. Its output of

x=1 y=2 *p=11 *q=2

suggests that the compiler is reasoning that *p does not alias with y or *q, and hence that the initial value of y=2 can be propagated to the final printf.

This outcome would not be correct with respect to a concrete semantics, and so to make the compiler sound it is necessary for this program to be deemed to have undefined behaviour. Note that this example does not involve type-based alias analysis, and the outcome is not affected by GCC's -fno-strict-aliasing flag. One might ask whether the mere formation of the pointer x+1 is legal, but this is explicitly permitted by the ISO standard.

This test is only interesting in executions in which the variables happen to be allocated adjacently. Different compilers and flag choices lead to different allocations, so we also include the xy variant below. Whether the variables have global or automatic storage duration should not have any effect on the semantics here, though it obviously it can affect the allocation addresses; we also include automatic storage-duration versions below. One can also write analogous tests using distinct malloc'd regions, but they are typically not allocated adjacently, so constructing the aliasing pointer is already illegal according to ISO for those.

Example: provenance_basic_global_xy.c (code and experimental data)

Example: provenance_basic_auto_yx.c (code and experimental data)

Example: provenance_basic_auto_xy.c (code and experimental data)

2.1.2 Q2. Can equality testing on pointers be affected by pointer provenance information?

Our reading of C11 and proposal for C2x: C11: yes (clear from DR260CR). C2x proposal: yes (nondeterministically). C2x proposal (no-provenance option): no

Example: provenance_equality_global_yx.c (experimental data)

#include <stdio.h>
#include <string.h> 
int  y=2, x=1;
int main() {
  int *p = &x + 1;
  int *q = &y;
  printf("Addresses: p=%p q=%p\n",(void*)p,(void*)q);
  _Bool b = (p==q);
  // can this be false even with identical addresses?
  printf("(p==q) = %s\n", b?"true":"false");
  return 0;
}

This is also allowed according to DR260CR. We observe (eg) GCC 8.1 -O2 regarding two pointers with different provenance as nonequal (according to ==) even though they have the same address. This happens in some circumstances but not others (e.g. it happens if the test is pulled into a simple separate function, but not if it is in a separate compilation unit), so we suggest that whether pointer equality takes provenance into account or not should be made indeterminate in the standard (again to make the observed compiler behaviour sound with respect to the standard). Note that requiring equality to always take provenance into account would require implementations to track provenance at runtime.

The ISO C11 standard text is too strong here: 6.5.9p6 says "Two pointers compare equal if and only if both are [...] or one is a pointer to one past the end of one array object and the other is a pointer to the start of a different array object that happens to immediately follow the first array object in the address space", which requires such pointers to compare equal (reasonable pre-DR260CR, but not after it). We don't expect programmers to rely on that behaviour and GCC does not satisfy it, so, to be consistent with DR260CR and with the indeterminate behaviour we suggest, it should permit them to compare either equal or non-equal.

Example: provenance_equality_global_xy.c (code and experimental data)

Example: provenance_equality_global_fn_yx.c (code and experimental data)

Example: provenance_equality_global_fn_xy.c (code and experimental data)

2.2 Pointer provenance via integer types

2.2.1 Q3. Can one make a usable pointer via casts to `intptr_t` and back?

Our reading of C11 and proposal for C2x: C11: yes (when intptr_t is present). C2x proposal: yes (when intptr_t is present). C2x proposal (no-provenance option): yes (when intptr_t is present)

ISO C11 optionally allows implementations to provide the type intptr_t (along with an unsigned variant) with guaranteed round-trip properties for pointer/integer casts. The following should be allowed, and that means that the C abstract machine should track provenance via such casts to and from integer values, otherwise implementations would have to make very coarse aliasing assumptions.

Example: provenance_roundtrip_via_intptr_t.c (experimental data)

#include <stdio.h>
#include <inttypes.h>
int  x=1;
int main() {
  int *p = &x;
  intptr_t i = (intptr_t)p;
  int *q = (int *)i;
  *q = 11; // is this free of undefined behaviour?
  printf("*p=%d  *q=%d\n",*p,*q);  
}

2.2.2 Q4. Can one make a usable pointer via casts to unsigned long and back?

Our reading of C11 and proposal for C2x: C11: the text does not clearly specify. C2x proposal: implementation defined. C2x proposal (no-provenance option): implementation defined

It also seems to be common practice (e.g. in Linux) to extend these roundtrip properties to unsigned long, as in the example below, when its implementation is large enough. We suggest that it be implementation-defined which integer types support this - typically this might be all the types with wide-enough implementation representations, but implementations can restrict to intptr_t/uintptr_t if they choose.

Example: provenance_roundtrip_via_unsigned_long.c (experimental data)

#include <stdio.h>
int  x=1;
int main() {
  int *p = &x;
  unsigned long i = (unsigned long)p;
  int *q = (int *)i;
  *q = 11; // is this free of undefined behaviour?
  printf("*p=%d  *q=%d\n",*p,*q);  
}

One should also ask whether provenance can be carried via float types. We tentatively suggest not.

2.2.3 Q5. Must provenance information be tracked via casts to integer types and integer arithmetic?

Our reading of C11 and proposal for C2x: C11: the text does not clearly specify. C2x proposal: yes. C2x proposal (no-provenance option): no

Example: provenance_basic_using_intptr_t_global_xy.c (code and experimental data)

Example: provenance_basic_using_intptr_t_global_yx.c (experimental data)

#include <stdio.h>
#include <string.h> 
#include <stdint.h>
#include <inttypes.h>
int  y = 2, x = 1;
int main() {
  intptr_t ux = (intptr_t)&x;
  intptr_t uy = (intptr_t)&y;
  intptr_t offset = 4;
  int *p = (int *)(ux + offset);
  int *q = &y;
  printf("Addresses: &x=%"PRIiPTR" p=%p &y=%"PRIiPTR\
         "\n",ux,(void*)p,uy);
  if (memcmp(&p, &q, sizeof(p)) == 0) {
    *p = 11; // does this have undefined behaviour?
    printf("x=%d  y=%d  *p=%d  *q=%d\n",x,y,*p,*q); 
  }
}

Given the type intptr_t, this question asks whether one can return to a concrete view of pointers as simple numbers, by casting to intptr_t followed by integer arithmetic and casting back to a pointer type. Here again, we observe GCC behaving the same as with Q1, reasoning that pointers obtained in this way cannot alias even if they have the same numerical values. This observation is reinforced by the GCC documentation, which mentions an "original pointer" associated to integer values cast to pointer type, so the answer seems to be "yes". This leads to many more questions regarding the specifics of how provenance information affect the semantics of each integer operator. Some of these are discussed in the next subsection and the remainder are given a complete treatment in the summary of our proposal at the end.

2.2.4 Q6. Can one use bit manipulation and integer casts to store information in unused bits of pointers?

Our reading of C11 and proposal for C2x: C11: the text does not clearly specify. C2x proposal: yes. C2x proposal (no-provenance option): no

Now we extend the example of Q3, that cast a pointer to intptr_t and back, to use bitwise logical operations on the integer value to store some tag bits. The following code exhibits a strong form of this, storing the address and tag bit combination as a pointer (which thereby creates a misaligned pointer value, though one not used for accesses); a weaker form would store the combined value only as an integer.

Example: provenance_tag_bits_via_uintptr_t_1.c (experimental data)

#include <assert.h>
#include <stdio.h>
#include <stdint.h>
int x=1;
int main() {
  int *p = &x;
  // cast &x to an integer 
  uintptr_t i = (uintptr_t) p;
  // check the bottom two bits of an int* are not used
  assert(_Alignof(int) >= 4);
  assert((i & 3u) == 0u);
  // construct an integer like &x with low-order bit set
  i = i | 1u;  
  // cast back to a pointer
  int *q = (int *) i; // does this have defined behaviour?
  // cast to integer and mask out the low-order two bits
  uintptr_t j = ((uintptr_t)q) & ~((uintptr_t)3u);  
  // cast back to a pointer
  int *r = (int *) j; 
  // are r and p now equivalent?  
  *r = 11;           //  does this have defined behaviour? 
  _Bool b = (r==p);  //  is this true?
  printf("x=%i *r=%i (r==p)=%s\n",x,*r,b?"true":"false");  
}

The standard leaves conversions between integer and pointer types implementation-defined (6.3.2.3p{5,6}), but it is common practice to use unused pointer bits (either low-order bits from alignment requirements or high-order bits beyond the maximum address range). We suggest that the set of unused bits for pointer types of each alignment should be made implementation-defined, to make this practice legal where necessary.

Note that in our proposal XOR'ing with a pure number and then back (as in some pointer encryption schemes) is well-defined.

Moreover, where the standard does give a guarantee about round-trip casts, e.g. for round-trips through intptr_t (7.20.1.4p1), it says only that the result "will compare equal". In a provenance-aware semantics, that may not be enough to make the result usable to reference memory; the standard text should be strengthened here to guarantee that by giving the result a usable provenance - the same as that of the source. And similarly for casts to void * and back, and adding and removing qualifiers (c.f. 6.3.2.3p{1,2}).

2.2.5 Q7. Can equality testing on integers that are derived from pointer values be affected by their provenance?

Our reading of C11 and proposal for C2x: C11: the text does not clearly specify. C2x proposal: no. C2x proposal (no-provenance option): no

Example: provenance_equality_uintptr_t_global_xy.c (code and experimental data)

Example: provenance_equality_uintptr_t_global_yx.c (experimental data)

#include <stdio.h>
#include <inttypes.h> 
int y=2, x=1;
int main() {
  uintptr_t p = (uintptr_t)(&x + 1);
  uintptr_t q = (uintptr_t)&y;
  printf("Addresses: p=%" PRIxPTR " q=%" PRIxPTR "\n",
         p,q);
  _Bool b = (p==q);
  // can this be false even with identical addresses?
  printf("(p==q) = %s\n", b?"true":"false");
  return 0;
}

GCC did at one point print false for this, but it was regarded as a bug and fixed (from 4.7.1 to 4.8). DR 260CR does not address the question. We believe that integer equality testing should not be affected by provenance, i.e. "no".

This example is inspired by one from Krebbers' PhD thesis.

2.3 Pointers constructed from multiple provenances

2.3.1 Q8. Should intra-object pointer subtraction give provenance-free integer results?

Our reading of C11 and proposal for C2x: C11: the text does not clearly specify. C2x proposal: yes. C2x proposal (no-provenance option): yes

DR260CR does not address this, but it is uncontroversially "yes" in practice: an intra-object pointer subtraction, say between the addresses of two elements of an array, should give a provenance-free integer offset that can then be used for indexing into that or another array.

Example: provenance_multiple_1_global.c (code and experimental data)

Example: provenance_multiple_2_global.c (experimental data)

#include <stdio.h>
int y[2], x[2];
int main() {
  int *p = &(x[0]) + (&(y[1])-&(y[0]));
  *p = 11;  // is this free of undefined behaviour?
  printf("x[1]=%d *p=%d\n",x[1],*p);
  return 0;
}

We say a provenance-free integer offset because the fact that an offset is calculated from pointers to y should not, when added to a pointer to a distinct (perhaps adjacent) object x, license its use to access y. The example below should not be allowed to access y[0]. We see GCC 8.1 -O2 optimise based on alias analysis here.

Example: provenance_multiple_4_global_yx.c (experimental data)

#include <stdio.h>
#include <string.h> 
int  y[2], x[2];
int main() {
  int *p = &x[1] + (&y[1]-&y[0]) + 0;
  int *q = &y[0];
  printf("Addresses: p=%p q=%p\n",(void*)p,(void*)q);
  if (memcmp(&p, &q, sizeof(p)) == 0) {
    *p = 11;  // does this have undefined behaviour?
    printf("y[0]=%d *p=%d *q=%d\n",y[0],*p,*q);
  }
  return 0;
}

2.3.2 Q9. Can one make a usable offset between two separately allocated objects by inter-object subtraction (using either pointer or integer arithmetic), to make a usable pointer to the second by adding the offset to the first?

Our reading of C11 and proposal for C2x: C11: the text does not clearly specify. C2x proposal: no (without explicit programmer annotation). C2x proposal (no-provenance option): yes

This is asking about pointers that potentially have multiple provenances, which is not addressed in DR260CR or current GCC or Clang compiler documentation - they refer to "the origin" of a pointer as if there were necessarily only one. The example below is a variant of the Q5 provenance_basic_using_intptr_t_global_yx.c in which the constant offset is replaced by a subtraction (here after casting from pointer to integer type).

Example: pointer_offset_from_subtraction_1_global.c (experimental data)

#include <stdio.h>
#include <string.h> 
#include <stdint.h>
#include <inttypes.h>
int  y = 2, x=1;
int main() {
  intptr_t ux = (intptr_t)&x;
  intptr_t uy = (intptr_t)&y;
  intptr_t offset = uy - ux;
  printf("Addresses: &x=%"PRIiPTR" &y=%"PRIiPTR\
         " offset=%"PRIiPTR" \n",ux,uy,offset);
  int *p = (int *)(ux + offset);
  int *q = &y;
  if (memcmp(&p, &q, sizeof(p)) == 0) {
    *p = 11; // is this free of undefined behaviour?
    printf("x=%d  y=%d  *p=%d  *q=%d\n",x,y,*p,*q); 
  }
}

Our survey responses suggest that compilers do not in general support this, and we imagine it is uncommon in practice. For normal code, we suggest "no". Here are analogous tests with automatic storage duration variables and malloc'd regions, and the latter using a constant rather than calculated offset. We see GCC 8.1 -O2 optimise based on a no-aliasing assumption only for the last.

Example: pointer_offset_from_subtraction_1_auto.c (code and experimental data)

Example: pointer_offset_from_subtraction_1_malloc.c (code and experimental data)

Example: pointer_offset_constant_8_malloc.c (code and experimental data)

However, there do seem to be specific important use cases. These include fixing up pointers to objects after realloc calls, as in Q9b below, and the Linux and FreeBSD per-CPU-variable implementations - though it is unclear whether the latter are between multiple allocations in the C sense. We suspect there are other cases, e.g. within fancy allocators. What kind of escape hatch is needed to support this code needs further discussion. The basic question is how such code should communicate with compiler alias analyses.

2.3.3 Q9b. Can one make a usable offset between a malloc'd region and a realloc'd version thereof by inter-object subtraction, to make a usable pointer to data within the latter by adding the offset to some pointer to data within the former?

Our reading of C11 and proposal for C2x: C11: no. C2x proposal: yes (with suitable annotation, TBD). C2x proposal (no-provenance option): yes

Example: pointer_offset_from_subtraction_realloc_1.c (experimental data)

#include <stdio.h>
#include <string.h> 
#include <stdlib.h>
#include <stddef.h>
#include <assert.h>
int main() {
  // malloc a region and initialise some data within it
  void *p=malloc(2*sizeof(int));                      // allocation P
  assert(p != NULL); // pick out the interesting execution
  int *i=(int*)p;
  *(i+0)=10;
  *(i+1)=11;
  // construct some pointers to that data
  int *j0 = i+0;
  int *j1 = i+1;
  // realloc the region
  void *q=realloc(p,65536*sizeof(int));                    // allocation Q
  assert(q != NULL && q != p); // pick out the interesting execution (UB?)
  // calculate the offset between the two allocations
  ptrdiff_t offset=(unsigned char*)q-(unsigned char*)p;// does this have UB?
  // try to adapt the data pointers to new allocation by adding the offset
  int *k0 = (int*)((unsigned char *)j0 + offset);      // do these have UB?
  int *k1 = (int*)((unsigned char *)j1 + offset); 
  // and use one of those for an access
  *k0=20;                                              // does this have UB? 
  printf("Addresses: p=%p q=%p\n",(void*)p,(void*)q);  // does this have UB?
  printf("*k0=%d *k1=%d\n",*k0,*k1);
  return 0;
}

This example uses pointer arithmetic to fix up pointers to data within a malloc'd region after it is realloc'd, calculating an offset that it uniformly adds to each such pointer (as suggested by Florian Weimar).

On the one hand, something like this seems to be needed, otherwise the utility of realloc would be limited - one couldn't have pointers from outside the region to anything except the start of the region across the realloc call, and one couldn't have any pointers within the region to itself. We don't know how widely they those occur in practice.

But in ISO C11, it is undefined behaviour, for several reasons.

In executions in which the realloc returns a distinct address to that it is passed, the original object has been deallocated (7.22.3.5p2), so its lifetime has ended, and so has that of the integers within it (following 6.2.4p2 "An object exists, has a constant address, and retains its last-stored value throughout its lifetime"). But then 6.2.4p2 continues "The value of a pointer becomes indeterminate when the object it points to (or just past) reaches the end of its lifetime", so the values of p, j0, and j1 are indeterminate in the second assert, the subsequent arithmetic, and the printf.
Even if one ignores that, the pointer subtraction is undefined, as the two pointers are not within or one-past the same array object (6.5.6p9).
And, ignoring that, the arithmetic of adding offset goes outside the original allocation, which is also undefined (6.5.6p8).
Additionally, as someone (Fred Tydeman?) remarked for Q9 at WG14 Brno, because ptrdiff_t is a signed type, the pointer subtraction could also overflow. To avoid that, one either needs an extension or option that enables wrapping pointer arithmetic, or some (perhaps implementation-defined) bounds on the size of the heap address space.

It seems clear that ISO C2x should permit some version of this code, probably requiring some programmer annotation to explain to the compiler that exotic pointer arithmetic is involved. But what form should that annotation take?

See N2222 Q31 and N2222 Q43 for additional (older) discussion of the pointer-lifetime-end-zap and out-of-bounds pointer construction issues.

2.3.4 Q9c. Can one make a usable offset between two objects within a single allocated region by inter-object subtraction (using either pointer or integer arithmetic), to make a usable pointer to the second by adding the offset to the first?

Our reading of C11 and proposal for C2x: C11: the text does not clearly specify. C2x proposal: yes. C2x proposal (no-provenance option): yes

Within a malloc'd region, it seems this has to be supported, to allow the use of the region to hold an array - as in the example below. In our proposal, this is allowed because provenance checking is done on a per-allocation granularity.

Example: pointer_offset_from_subtraction_within_malloc_int_1.c (experimental data)

#include <stdio.h>
#include <string.h> 
#include <stdlib.h>
#include <stddef.h>
int main() {
  void *a = malloc(4*sizeof(int)); // allocation P
  // initialise two elements of a notional array within the allocation
  int *p1 = (int*)((unsigned char*)a+1*sizeof(int));
  int *p3 = (int*)((unsigned char*)a+3*sizeof(int));
  *p1 = 1;
  *p3 = 3;
  // calculate an unsigned char* offset between pointers to those elements
  ptrdiff_t offset=(unsigned char*)p3-(unsigned char*)p1;  // provenance ?
  // add the offset to a pointer to the first
  unsigned char *q1 = (unsigned char*)p1;                  // provenance P
  unsigned char *q3 = (unsigned char*)p1 + offset;         // provenance ?
  int *r1 = (int*)q1;
  int *r3 = (int*)q3;
  printf("Addresses: a=%p p3=%p r3=%p\n",a,(void*)p3,(void*)r3);
  // if that has the same representation as the pointer to the third...
  if (memcmp(&p3, &r3, sizeof(p3)) == 0) {
    // try to use it to access that
    *r3 = 11;  // is this free of undefined behaviour?
    printf("*p1=%d *r1=%d *r3=%d \n",
           *p1, *r1, *r3);
  }
  return 0;
}

The following example is similar but using structs, and accessing their members only via lvalues of the member types, to illustrate the case where the compiler has no explicit information that the accesses are conceptually to parts of an array of structs.

Example: pointer_offset_from_subtraction_within_malloc_struct_1.c (experimental data)

#include <stdio.h>
#include <string.h> 
#include <stdlib.h>
#include <stddef.h>
typedef struct { char c; int i; } st;
int main() {
  void *a = malloc(4*sizeof(st)); // allocation P
  // initialise one member of two elements of a notional array of structs
  char *p1 = (char*)((unsigned char*)a+1*sizeof(int)+offsetof(st,c));
  int  *p3 = (int *)((unsigned char*)a+3*sizeof(int)+offsetof(st,i));
  *p1 = 'a';
  *p3 = 3;
  // calculate an unsigned char* offset between pointers to those elements
  ptrdiff_t offset=((unsigned char*)p3-offsetof(st,i)) -
                   ((unsigned char*)p1-offsetof(st,c));  // provenance ?
  // add the offset to a pointer to the first struct
  unsigned char *q1  = (unsigned char*)p1 - offsetof(st,c);// provenance P
  unsigned char *q3  = ((unsigned char*)p1 - offsetof(st,c)) + offset;     // provenance P
  // and adapt to point to the i element of the third
  unsigned char *q3i = q3 + offsetof(st,i);                // provenance P
  char *r1 = (char*)q1;
  int  *r3 = (int*)q3i;
  printf("Addresses: a=%p p3=%p r3=%p\n",a,(void*)p3,(void*)r3);
  // if that has the same representation as the pointer to the i member of the third...
  if (memcmp(&p3, &r3, sizeof(p3)) == 0) {
    // try to use it to access that
    *r3 = 33;  // is this free of undefined behaviour?
    printf("*p3=%d *r3=%d \n", *p3, *r3);
  }
  return 0;
}

2.3.5 Q11. Is the XOR linked list idiom supported?

Our reading of C11 and proposal for C2x: C11: the text does not clearly specify. C2x proposal: no. C2x proposal (no-provenance option): yes

The classic XOR linked list algorithm (implementing a doubly linked list with only one pointer per node, by storing the XOR of two pointers) also makes essential use of pointers constructed from pointer values with multiple distinct provenances. In this example we XOR the integer values from two pointers and XOR the result again with one of them.

Example: pointer_offset_xor_auto.c (code and experimental data)

Example: pointer_offset_xor_global.c (experimental data)

#include <stdio.h>
#include <inttypes.h>
int x=1;
int y=2;  
int main() {
  int *p = &x;
  int *q = &y;
  uintptr_t i = (uintptr_t) p;
  uintptr_t j = (uintptr_t) q;
  uintptr_t k = i ^ j;
  uintptr_t l = k ^ i;
  int *r = (int *)l;
  // are r and q now equivalent?  
  *r = 11;     // does this have defined behaviour?             
  _Bool b = (r==q); 
  printf("x=%i y=%i *r=%i (r==p)=%s\n",x,y,*r,
         b?"true":"false");  
}

It appears (though it is not completely clear) that this idiom is not important in modern practice. One respondent remarks that the XOR list implementation interacts badly with modern pipelines and the space saving is not a big win; another doesn't know of modern usages, though suspects that it is probably still used in places. We don't know whether current compiler alias analysis permits it or not. Our suggested semantics would not allow it.

If one did want to allow it in specific code, it's not clear what kind of escape hatch would be necessary. At the point of construction of the final pointer, that code doesn't have any other way to refer to the provenance of the original allocation, so an attribute to add a specific provenance to a pointer would not suffice for this case. A general wildcard would, but that would be problematic w.r.t. alias analysis.

2.4 Pointer provenance via pointer representation copying

2.4.1 Q13. Can one make a usable copy of a pointer by copying its representation bytes using the library memcpy?

Our reading of C11 and proposal for C2x: C11: yes. C2x proposal: yes. C2x proposal (no-provenance option): yes

Example: pointer_copy_memcpy.c (experimental data)

#include <stdio.h>
#include <string.h>
int x=1;
int main() {
  int *p = &x;
  int *q;
  memcpy (&q, &p, sizeof p); 
  *q = 11; // is this free of undefined behaviour?
  printf("*p=%d  *q=%d\n",*p,*q);  
}

The ISO C11 text does not explicitly address this. In a pre-provenance semantics, before DR260, it did not need to, but now (as this surely should be allowed) the standard needs to guarantee that the result has the appropriate provenance to make it usable. One could allow it by special-casing memcpy() to preserve provenance, but the following questions suggest a less ad hoc approach.

2.4.2 Q14. Can one make a usable copy of a pointer by copying its representation bytes (unchanged) in user code?

Our reading of C11 and proposal for C2x: C11: yes (from effective types text?). C2x proposal: yes. C2x proposal (no-provenance option): yes

Example: pointer_copy_user_dataflow_direct_bytewise.c (experimental data)

#include <stdio.h>
#include <string.h>
int  x=1;
void user_memcpy(unsigned char* dest, 
                 unsigned char *src, size_t n) {
  while (n > 0)  {      
    *dest = *src;
    src += 1;
    dest += 1;
    n -= 1;
  }
}
int main() {
  int *p = &x;
  int *q;
  user_memcpy((unsigned char*)&q, 
              (unsigned char*)&p, sizeof(int *));
  *q = 11; // is this free of undefined behaviour?
  printf("*p=%d  *q=%d\n",*p,*q);
}

ISO C11 and DR260CR again do not mention this explicitly (though the 6.5p6 effective type text weakly implies it is allowed). We believe it is widely relied on. (The only exceptions we are aware of are capability machines such as IBM system 38 and descendents, and CHERI. In CHERI you have to copy pointers at pointer types for it to work properly, but capability loads and stores can operate generically, because the capability registers have tag bits. There is also some new tagged memory support for Oracle Sparc, to find invalid pointers; we've not looked into that.)

Our proposed semantics makes it legal by regarding each representation byte (as an integer value) as having the provenance of the original pointer, and the result pointer, being composed of representation bytes of which at least one has that provenance and none have a conflicting provenance, as having the same.

One could instead require, more restrictively, that one has all the original bytes of some legitimate pointer. There may not be much reasonable code that would be sensitive to the distinctions between these, but there is some, e.g. manipulations of pointers where one knows the high-order bytes are the same, as in the survey response mentioning encoding 64-bit pointers in 48 bits. Our semantics will permit that.

Real memcpy() implementations are more complex. The glibc memcpy involves copying byte-by-byte, as above, and also word-by-word and, using virtual memory manipulation, page-by-page. Word-by-word copying is not permitted by the ISO standard, as it violates the effective type rules, but we believe C2x should support it in some form. Virtual memory manipulation is outside our scope at present.

2.4.3 Q15. Can one make a usable copy of a pointer by copying its representation bytes by user code that indirectly computes the identity function on those bytes?

Our reading of C11 and proposal for C2x: C11: the text does not clearly specify. C2x proposal: yes (if multiple pointers are not mixed). C2x proposal (no-provenance option): yes

The following code is a simple version of this, just using a XOR twice; one should imagine a more complex transform, with the transform and its inverse separated in the code and in time so that the compiler cannot analyse them.

Example: pointer_copy_user_dataflow_indirect_bytewise.c (experimental data)

#include <stdio.h>
#include <string.h>
int  x=1;
void user_memcpy2(unsigned char* dest, 
                  unsigned char *src, size_t n) {
  while (n > 0)  {      
    *dest = ((*src) ^ 1) ^ 1;
    src += 1;
    dest += 1;
    n -= 1;
  }
}
int main() {
  int *p = &x;
  int *q;
  user_memcpy2((unsigned char*)&q, (unsigned char*)&p, 
              sizeof(p));
  *q = 11; // is this free of undefined behaviour?
  printf("*p=%d  *q=%d\n",*p,*q);
}

Whether this should be supported is unclear, and DR260 CR does not speak to it (it calls out the library memcpy and memmove as special cases: "Note that using assignment or bitwise copying via memcpy or memmove of a determinate value makes the destination acquire the same determinate value.").

Our proposal would allow it (without escape-hatch annotations) only in the case where bytes of different pointer values are not mixed during the computation; it would forbid code that compresses or encrypts multiple pointers together, as such a computation would (somewhere) be combining values of different provenances, with an empty-provenance result. That would need an escape hatch.

Richard Smith argued that pointer construction via intptr_t and back via any value-dependent identity function should be required to work. That would admit the encryption/compressing of multiple pointers, which our proposal forbids. But defining that notion of "value-dependent" is exactly the thing that is hard in the concurrency thin-air problem, and we don't believe compilers can be made to respect it in general.

2.4.4 Q16. Can one carry provenance through dataflow alone (not through control flow)?

Our reading of C11 and proposal for C2x: C11: the text does not clearly specify. C2x proposal: yes. C2x proposal (no-provenance option): n/a

Our provenance examples so far have all only involved dataflow; we also have to ask if a usable pointer can be constructed via non-dataflow control-flow paths, e.g. if testing equality of an unprovenanced integer value against a valid pointer permits the integer to be used as if it had the same provenance as the pointer. We don't expect that this is relied on in practice, and our proposed semantics does not permit it - we track provenance only through dataflow. This needs to be discussed with respect to current compiler analysis behaviour.

For example, consider a version of the previous indirect memcpy example with a control-flow choice on the value of the bytes:

Example: pointer_copy_user_ctrlflow_bytewise.c (code and experimental data)

Example: pointer_copy_user_ctrlflow_bytewise_abbrev.c

#include <stdio.h>
#include <string.h>
#include <assert.h>
#include <limits.h>
int  x=1;
unsigned char control_flow_copy(unsigned char c) {
  assert(UCHAR_MAX==255);
  switch (c) {
  case 0: return(0);
  case 1: return(1);
  case 2: return(2);
  ...
  case 255: return(255);
  }
}
void user_memcpy2(unsigned char* dest, 
                  unsigned char *src, size_t n) {
  while (n > 0)  {      
    *dest = control_flow_copy(*src);
    src += 1;
    dest += 1;
    n -= 1;
  }
}
int main() {
  int *p = &x;
  int *q;
  user_memcpy2((unsigned char*)&q, (unsigned char*)&p, 
              sizeof(p));
  *q = 11; // does this have undefined behaviour?
  printf("*p=%d  *q=%d\n",*p,*q);
}

Similarly, one can imagine copying a pointer via uintptr_t bit-by-bit via a control-flow choice for each bit.

Finally, contrasting with the first two examples above, that recover all the concrete value information of the original pointer, we can consider a variant of the Q5 provenance_basic_using_intptr_t_global_yx.c example in which there is a control-flow choice based on partial information of the intended target pointer (here just whether q is null) and the concrete value information is obtained otherwise:

Example: provenance_basic_mixed_global_offset+4.c (experimental data)

#include <stdio.h>
#include <string.h> 
#include <stdint.h>
#include <inttypes.h>
int  y = 2, x=1;
int main() {
  intptr_t ux = (intptr_t)&x;
  intptr_t uy = (intptr_t)&y;
  intptr_t offset = 4;
  printf("Addresses: &x=%"PRIiPTR" &y=%"PRIiPTR\
         "\n",ux,uy);
  int *q = &y;
  if (q != NULL) {
    int *p = (int *)(ux + offset);
    if (memcmp(&p, &q, sizeof(p)) == 0) {
      *p = 11; // does this have undefined behaviour?
      printf("x=%d  y=%d  *p=%d  *q=%d\n",x,y,*p,*q); 
    }
  }
}

A semantics that tracks provenance only through dataflow dependency is our preferred option. It seems to be the simplest, and is probably compatible with programming practice - we imagine that none of these idioms occur in normal practice. Our proposal would forbid the above examples, while permitting the dataflow bitwise copy example.

Allowing provenance to be propagated via any control-flow dependency would allow all these examples, but it seems clear that the last example should be forbidden, in ISO or de facto semantics, and indeed GCC is again doing an optimisation that would not be sound if it were. In real code we imagine that many pointer accesses are in some way control-flow dependent on others, given the many null-pointer checks required in C, so tracking that would neither be feasible nor useful.

2.5 Pointer provenance and union type punning

2.5.1 Q17. Is type punning between integer and pointer values allowed?

Our reading of C11 and proposal for C2x: C11: yes (if the representations are the same). C2x proposal: implementation defined (if the representations are the same). C2x proposal (no-provenance option): implementation defined (if the representations are the same)

The following example (analogous to the provenance_roundtrip_via_intptr_t.c of Q3) constructs a pointer by casting a pointer to uintptr_t, storing that in a member of a union of that type, and then reading from a member of the union of pointer type.

Example: provenance_union_punning_1_global.c (experimental data)

#include <stdio.h>
#include <string.h> 
#include <inttypes.h>
int x=1;
typedef union { uintptr_t ui; int *p; } un;
int main() {
  un u; 
  int *px = &x;
  uintptr_t i = (uintptr_t)px;
  u.ui = i;
  int *p = u.p;
  printf("Addresses: p=%p &x=%p\n",(void*)p,(void*)&x);
  *p = 11;  // is this free of undefined behaviour?
  printf("x=%d *p=%d\n",x,*p);
  return 0;
}

The ISO standard says "the appropriate part of the object representation of the value is reinterpreted as an object representation in the new type", but says little about that reinterpretation, and DR260 CR does not speak to the provenance of the result.

For any particular implementation, pointers to normal object types might or might not have the same representation as uintptr_t values. That might well not hold for some implementations, but it does for our usual ones. Our semantics has to have an implementation-defined map for the conversions between pointer representations and uintptr_t in any case, so we can say that this example is allowed, preserving the original provenance, iff that is the identity.

2.6 Pointer provenance via IO, device memory, and linking

2.6.1 Q19. Can one make a usable pointer via IO?

Our reading of C11 and proposal for C2x: C11: yes (clear for %p; unclear elsewhere). C2x proposal: yes (modulo IO-escape check for %p, and with programmer annotation for other cases). C2x proposal (no-provenance option): yes

We now consider the extreme example of pointer provenance flowing via IO, if one writes the address of an object to a file and reads it back in. We have three versions: one using fprintf/fscanf and the %p format, one using fwrite/fread on the pointer representation bytes, and one converting the pointer to and from uintptr_t and using fprintf/fscanf on that value with the PRIuPTR/SCNuPTR formats.
The first gives a syntactic indication of a potentially escaping pointer value, while the others (after preprocessing) do not. Giving just the first in full:

Example: provenance_via_io_percentp_global.c (experimental data)

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <inttypes.h>
int x=1;
int main() {
  int *p = &x;
  FILE *f = fopen(
    "provenance_via_io_percentp_global.tmp","w+b");
  printf("Addresses: p=%p\n",(void*)p);
  // print pointer address to a file
  fprintf(f,"%p\n",(void*)p);
  rewind(f);
  void *rv;
  int n = fscanf(f,"%p\n",&rv);
  int *r = (int *)rv;
  if (n != 1) exit(EXIT_FAILURE);
  printf("Addresses: r=%p\n",(void*)r);
  // are r and p now equivalent?  
  *r=12; // is this free of undefined behaviour?                                                           
  _Bool b1 = (r==p); // do they compare equal?                      
  _Bool b2 = (0==memcmp(&r,&p,sizeof(r)));//same reps?
  printf("x=%i *r=%i b1=%s b2=%s\n",x,*r,
         b1?"true":"false",b2?"true":"false");
}

Example: provenance_via_io_bytewise_global.c (code and experimental data)

Example: provenance_via_io_uintptr_t_global.c (code and experimental data)

This is used in practice: in graphics code for marshalling/unmarshalling (using %p), in xlib (using SCNuPTR), and in debuggers.

In the ISO standard, the standard text for fprintf and scanf for %p say that this should work: "If the input item is a value converted earlier during the same program execution, the pointer that results shall compare equal to that value; otherwise the behavior of the %p conversion is undefined" (modulo the usual remarks about "compare equal"), and the text for uintptr_t and the presence of SCNuPTR in inttypes.h implies the same there.

To make the standard allow the %p case in a provenance-aware C abstract machine, we suggest that the provenances (of pointer or integer values) the pointers output during an execution should be recorded (in the abstract machine, not in normal implementations). Then when a pointer is read via %p, we check whether the numeric address is strictly within any live allocation whose provenance has escaped via IO. If it has, we give it the associated provenance, otherwise we give it the empty provenance (thus making it UB to use for accesses). (Freek Wiedijk observes one might want to allow one-past pointers, which is nicely uniform but means it can be ambiguous which of two provenances to allow.) To nail down what counts as "output", one might enumerate a subset of the standard library call arguments, or, more elegantly, introduce some annotation for those.

This does not directly support reading pointers from IO by reading their representation bytes or by reading a uintptr_t value. In those cases there's no clear syntactic indication that the pointer might be coming from IO rather than other integer/pointer manipulation (as in our earlier examples), and so no clear place to tell alias analysis what's going on. This seems to need an explicit programmer annotation of some form.

2.6.2 Q20. Can one make a usable pointer from a concrete address (of device memory)?

Our reading of C11 and proposal for C2x: C11: no. C2x proposal: yes (in implementation-defined device memory). C2x proposal (no-provenance option): yes (likewise)

C programs should normally not form pointers from particular concrete addresses. For example, the following should normally be considered to have undefined behaviour, as address 0xABC might not be mapped or, if it is, might alias with other data used by the runtime. By the ISO standard it does have undefined behaviour, consistent with an abstract view of pointers.

Example: pointer_from_concrete_address_1.c (experimental data)

int main() { 
  // on systems where 0xABC is not a legal non-stack/heap 
  // address, does this have undefined behaviour?
  *((int *)0xABC) = 123;
}

But in some circumstances, especially for embedded devices, it is idiomatic to use concrete addresses in C to access memory-mapped devices, e.g.

Example: pointer_from_concrete_address_2.c (experimental data)

#define PORTBASE 0x40000000
unsigned int volatile * const port = 
  (unsigned int *) PORTBASE;
int main() {
  unsigned int value = 0;
  // on systems where PORTBASE is a legal non-stack/heap 
  // address, does this have defined behaviour?
  *port = value; /* write to port */
  value = *port; /* read from port */
}

Our suggestion is to introduce an implementation-defined set of "device memory" addresses (which may depend on linking), and which is guaranteed to be disjoint from normal C-accessible stack, heap, and program memory, for which the creation of such pointers be allowed.

Linking provides many mechanisms for controlling memory layout. It doesn't seem feasible to incorporate much of that into the C standard, but perhaps we should provide a more coherent semantic interface to allow it to be tied in?

2.7 Pointer provenance for subobjects

2.7.1 Q86. Are provenance checks only on a per-allocation granularity (not per-subobject)?

Our reading of C11 and proposal for C2x: C11: the text does not clearly specify. C2x proposal: yes. C2x proposal (no-provenance option): n/a

Our previous proposal (N2219 and before) had a simple per-allocation notion of provenance, allowing access via a pointer anywhere within the memory footprint of the original allocation (though note that without '-fno-strict-aliasing' there would be some additional restrictions). In some cases this flexibility is needed, e.g. for any bytewise computation of the representation of a struct, e.g. via a user-code 'memcpy', as below, or to serialise/unserialise it, where an unsigned char * pointer has to be able to traverse the entire allocation footprint.

Example: pointer_copy_user_dataflow_direct_bytewise_struct.c (experimental data)

#include <stdio.h>
#include <string.h>
typedef struct { int i; float f; } st;
st s1 = {.i=1, .f=1.0 };
st s2;
void user_memcpy(unsigned char* dest, 
                 unsigned char *src, size_t n) {
  while (n > 0)  {      
    *dest = *src;
    src += 1;
    dest += 1;
    n -= 1;
  }
}
int main() {
  st *p = &s1;
  st *q = &s2;
  user_memcpy((unsigned char*)q, 
              (unsigned char*)p, sizeof(st));
  // is this free of undefined behaviour?
  printf("p->i = %d  q->i = %d\n",p->i,q->i);
}

But in other cases per-allocation provenance allows intra-allocation examples which we don't think should be supported, e.g. as below, and which will unnecessarily impede alias analysis. Existing compilers may well do this (according to Nuno Lopes, GCC and ICC already support some subobject analysis, and people are working on it for LLVM. We don't know whether or not these analyses are turned on even with -fno-strict-aliasing.)

Example: provenance_intra_object_1.c (experimental data)

#include <stdio.h>
#include <string.h> 
typedef struct { int x; int y; } st;
int main() {
  st s = { .x=1, .y=2 };
  int *p = &s.x + 1;
  int *q = &s.y;
  printf("Addresses: p=%p q=%p\n",(void*)p,(void*)q);
  if (memcmp(&p, &q, sizeof(p)) == 0) {
    *p = 11;  // is this free of undefined behaviour?
    printf("s.x=%d s.y=%d *p=%d *q=%d\n",s.x,s.y,*p,*q);
  }
}

A possible reconcilation of all this would be to have a per-subobject provenance restriction by default, but relax this (to per-allocation provenance) for pointers that have been formed by an explicit cast. Perhaps only for casts to void *, unsigned char *, intptr_t, or uintptr_t, or perhaps (for simplicity) for all casts. This semantics was suggested by documentation from the RV-match group.

We agree that something like that is needed in some situations, especially given the status of compiler subobject alias analysis mentioned above. But is it already covered by effective types, or do we need separate subobject provenance enforcement even with -fno-strict-aliasing? That isn't clear. The per-allocation provenance proposal is attracticely simple.

For the moment the detailed proposal we provide does not do subobject provenance, so it allows the above example.

To enforce subobject provenance, we'd adapt the proposal to keep, with every pointer and integer value, both a provenance ID and a subrange of the original allocation footprint. When constructing a pointer to a struct member, by &(x.p), we'd take the subrange of that member. When constructing a pointer to a union member, we'd take the size of that member. For a pointer constructed as the address of a specific array element, it's unclear whether one should take just that element or the whole array. Then on any cast, we'd expand the subrange to that of the original allocation. Similarly on any representation-byte write, except that we'd like a user-memcopy to reconstruct the original limited pointer value.

2.8 Pointer provenance in C++, and provenance within allocated regions

[This is from discussion with Richard Smith, C++ editor, at EuroLLVM 2018]

Ignoring malloc'd regions and wildcard-provenance pointers, our proposal seems broadly consistent with the current C++ draft for the "C fragment" of C++ (whatever that is exactly).

Note that this relies on the "implicit forwarding rule", and that in C++ they split allocation lifetime and object lifetime.

For plain-old-data (POD) object creation within malloc'd regions, which in practice will happen incrementally as struct/union members or array elements are written, they have an exotic style for angelically specifying ghost "object creation" in whatever way is necessary to make the program defined (if such exists). It'd probably be better to accumulate constraints explicitly.

In C++, if you create two distinct objects in the same allocated region from operator-new (a possibly user-defined allocator), then implementations will assume you can't get from one to the other by pointer arithmetic.

If you really malloc a region and create two POD objects, do the same rules apply? In principle, when you create an object, the pointer you get can only be used to access that object. In practice, RS doesn't know of implementations that take advantage of that (except maybe XL C/C++, which is rather aggressive for this), and there probably is user code that takes advantage of that. C++ text currently forbids it. Probably it should be allowed? Otherwise how could one use incrementally initialised arrays in an allocated region? Question 9b above illustrates this.

XL C/C++ might have an optimisation that assumes if you pass a function a pointer to a member, it assumes it doesn't modify the rest of the struct. That's arguably over-aggressive.

Also relevant here is our N2013 Q75. Can an unsigned character array with static or automatic storage duration be used (in the same way as a malloc’d region) to hold values of other types?. As we say there, a literal reading of the ISO effective type rules for C forbids this, but we believe it needs to be allowed. C++ explicitly allows it.

2.9 The problem with lost escapes

2.9.1 Q87. Can provenance be carried via values which could be algebraically optimised away?

Our reading of C11 and proposal for C2x: C11: the text does not clearly specify. C2x proposal: yes. C2x proposal (no-provenance option): n/a

[From discussion with Richard Smith]

Our provenance proposal allows computations that erase the numeric value (and hence a concrete view of the "semantic dependencies") of a pointer but retain provenance. This makes examples like that below allowed in our proposal. Note that we don't believe it is especially desirable to allow this particular example - this is just a consequence of choosing a straightforward provenance semantics that allows the bytewise copying and the bitwise manipulation of pointers of Q14 and Q15.

Example: provenance_lost_escape_2.c (code and experimental data)

Example: provenance_lost_escape_1.c (experimental data)

#include <stdio.h>
#include <string.h> 
#include <stdint.h> 
int x=1;
int main() {
  int *p = &x;                             // assume allocated at 0x6009b8
  uintptr_t i1 = (intptr_t)p;              // value 0x6009b8 provanance x
  uintptr_t i2 = i1 & 0x00000000FFFFFFFF;  // 
  uintptr_t i3 = i2 & 0xFFFFFFFF00000000;  // value 0x0,     provenance x
  uintptr_t i4 = i3 + 0x6009b8;            // value 0x6009b8 provanance x
  int *q = (int *)i4;
  printf("Addresses: p=%p\n",(void*)p);
  if (memcmp(&i1, &i4, sizeof(i1)) == 0) {
    *q = 11;  // does this have defined behaviour?
    printf("x=%d *p=%d *q=%d\n",x,*p,*q);
  }
  return 0;
}

However, in implementations some optimisation may be done before alias analysis - in particular, in LLVM, algebraic optimisations are - and those optimisations might erase the escape of &x, replacing it and all the calculation of i3 by 0x0 (a similar example would have i3 = i1-i1).
But then alias analysis might be unable to see that *q could access x, and so report that it couldn't, and hence enable optimisations that are unsound for this case. The basic point is that escapes are not preserved by optimisation.

Implementations that take a very conservative view of all pointers formed from integers (deeming them to potentially alias with almost anything) would automatically be ok w.r.t. this, but implementations that do track provenance through integers would need to do more work.

A possible solution, which would need some implementation work but (we guess) not very much, would be to require those initial optimisation passes to record the escapes involved in computations they erase, so that that could be passed in explicitly to alias analysis.

This would also be necessary for a more refined treatment of pointers from I/O, in which the source-language semantics permits them to alias only with source-language I/O-escaped pointers - more on that below.

2.10 Provenance escape hatches, for IO and inter-object arithmetic - and the problem with wildcards

Our previous proposal suggested that pointers from I/O, and pointers from inter-object arithmetic explicitly flagged as such, could be supported with a "wildcard" provenance, allowing them to be used to access any current allocation with the same concrete address and type. But (as pointed out by several people, including Martin Uecker), this would be problematic: given the existence of such pointers, unless they are somehow kept distinct (e.g. by having distinct types for potentially-wildcard pointers), which seems challenging, a compiler would have to assume that any pointer-typed function argument could be such a wildcard, and then in the semantics it might legitimately alias (and be used to access) function local variables, including those that are only created later in the function.

The LLVM internal-language semantics proposal by Lopes, Lee, Hur, et al. uses timestamps to limit what their (somewhat different) wildcard pointers can alias to.

For the IO case, one could keep track in the semantics of the provenances which have escaped via IO (in all forms, including %p and integer IO of intptr_t-cast pointers and pointer representation bytes). Then regard a read pointer as potentially aliasing any of those (of the same type, without -fno-strict-aliasing).

For pointers constructed via escape-hatch-tagged inter-object pointer arithmetic, it's less clear what would be sensible.
What syntactic form should such an escape hatch take? One could define a source-language-semantics notion of all the provenances which have escaped by the combination of casts to integer type, access to their representations, and I/O, and regard wildcard provenances as allowed to access any of those. That's rather crude - is it too liberal for implemented alias analysis?

At the other extreme, one could require the programmer to explicitly annotate allocations that might later be aliased by pointers constructed by wild casts. Syntactically, these might just be function calls, so we'd have void mark_as_escaping(void *p) and void * wild_cast(intptr_t i) (probably these should have better names, so as not to be confused with normal escape analysis). The former would just record the provenance/allocation of p in the abstract machine; the latter would check the integer value of i is within one of those allocations, and, if it is, give the result the corresponding provenance - with UB otherwise.

A proper solution to this would also handle allocators and their abstraction boundaries. In general an allocator (malloc or other OS or user allocators) starts with some region of memory which is essentially private to it, obtained from linking or from some other allocator, and hands out pointers to parts thereof. As far as the rest of the code is concerned, this should have fresh provenance - roughly as captured by the GCC malloc function attribute: "This tells the compiler that a function is malloc-like, i.e., that the pointer P returned by the function cannot alias any other pointer valid when the function returns, and moreover no pointers to valid objects occur in any storage addressed by P. Using this attribute can improve optimization. Functions like malloc and calloc have this property because they return a pointer to uninitialized or zeroed-out storage. However, functions like realloc do not have this property, as they can return a pointer to storage containing pointers.". When such pointers are passed back into the allocator, e.g. in a call to free, they can alias with the whole region, though presumably the allocator doesn't write into the issued memory until it reallocates it. It might write into metadata via a computation based on the pointer passed into free - which could be handled via an explicit re-provenancing operation, taking that pointer and giving it the whole-region provenance. Proper treatment of this needs annotations to manually annotate lifetime too? Do existing alias analysis implementations get this right? With link-time optimisation (LTO) it presumably really matters.

2.11 Algebraic laws

2.11.1 Q12. For arithmetic over provenanced integer values, is the provenance of the result invariant under commutativity and under plus/minus associativity?

Our reading of C11 and proposal for C2x: C11: the text does not clearly specify. C2x proposal: no. C2x proposal (no-provenance option): n/a

Normal integer arithmetic or modular arithmetic satisfies various algebraic laws, e.g. (a+b)=(b+a) (commutativity for +) and a+(b-c) = (a+b)-c (which we call plus/minus associativity, in the absence of a standard name). Do these still hold for provenanced values? For C pointer arithmetic, addition of two pointers is a type error, so the question does not arise for those two laws. But in semantics in which integer values also carry provenance data of some kind, we have the same question for analogous examples that do the arithmetic at uintptr_t type, e.g. asking whether the following two programs behave the same:

Example: pointer_arith_algebraic_properties_2_global.c (experimental data)

#include <stdio.h>
#include <inttypes.h>
int y[2], x[2];
int main() {
  int *p=(int*)(((uintptr_t)&(x[0])) + 
    (((uintptr_t)&(y[1]))-((uintptr_t)&(y[0]))));
  *p = 11;  // is this free of undefined behaviour?
  printf("x[1]=%d *p=%d\n",x[1],*p);
  return 0;
}

Example: pointer_arith_algebraic_properties_3_global.c (experimental data)

#include <stdio.h>
#include <inttypes.h>
int y[2], x[2];
int main() {
  int *p=(int*)(
    (((uintptr_t)&(x[0])) + ((uintptr_t)&(y[1])))
    -((uintptr_t)&(y[0])) );
  *p = 11;  // is this free of undefined behaviour?
  //(equivalent to the &x[0]+(&(y[1])-&(y[0])) version?)
  printf("x[1]=%d *p=%d\n",x[1],*p);
  return 0;
}

Note that because the arithmetic is happening at uintptr_t, constructing the transiently "out of bounds" value is not in itself illegal.

In our proposal, provanance for binary operations on integers is unaffected by swapping the arguments, so they are commutative iff the underlying operation is. For plus/minus associativity, both examples would have empty provenance, as they combine integer values with distinct single provenances.

Automatic storage-duration analogues:

Example: pointer_arith_algebraic_properties_2_auto.c (code and experimental data)

Example: pointer_arith_algebraic_properties_3_auto.c (code and experimental data)

3 Proposed semantics, in detail

This gives details for the core proposal only, without escape hatches and without subobject provenance. It is essentially the N2219 proposal with wildcard provenances removed.

Pointer values and integer values both carry a provenance, either the "empty" provenance or a single provenance ID.
On every allocation (of objects with static, thread, automatic, and allocated storage duration), we choose a fresh provenance ID (unique in the entire execution), and the resulting pointer value carries that single provenance ID.
At any access via a pointer value, then without the -fno-provenance option, its numeric address must be consistent with its provenance, with undefined behaviour otherwise. In particular:
- access via a pointer value with empty provenance is undefined behaviour (except where the numeric value is within an implementation-defined set of "device" memory addresses);
- access via a pointer value with a single provenance ID must be within the corresponding allocation;
This and the result of == are the only places where provenance is used - all the rest is just calculating provenances appropriately.

With the -fno-provenance option, the numeric address and footprint of the access must be within some live allocation, with undefined behaviour otherwise. Note that the effective-types semantics (without -fno-strict-aliasing) will impose additional constraints here.
NULL pointers constructed from integer constant expressions have the empty provenance.
With the -fno-provenanance option, whether pointer equality comparison (with == or !=) takes the associated provenances into account or not is indeterminate. Without that option, pointer equality comparison should not take the associated provenances into account.
Pointer relational comparison (with <, <=, >, and >=) is unaffected by the associated provenances.
All casts among pointer and integer types preserve provenance.
The result of subtraction of two pointer values is an integer value with empty provenance, irrespective of the operand provenances (in particular, irrespective of whether they point within the same object or not).
The result of an addition or subtraction of a pointer value and an integer value has the provenance of the pointer value
The result of operations on integer values is as follows:
- the result of the address-of has the corresponding provenance of the object associated with the lvalue, for non-function-pointers, or empty for function pointers.
- the provenance of the result of the unary * operator is whatever was stored
- integer unary +, unary -, and ~ operators preserve the original provenance; logical negation ! gives a value with empty provenance.
- sizeof and _Alignof operators give values with empty provenance
- multiplicative and additive operators, bitwise AND, bitwise exclusive OR, and bitwise inclusive OR operators have provenance as follows:
  - if both have empty provenance, the result has that
  - if exactly one argument has non-empty provenance, the result has that
  - if both have the same single provenance, the result has that
  - if they have different single provenances, the result has empty provenance
  Summarising:
```
                  |  empty    ID       ID'   
         ---------+--------------------------
         empty    |  empty    ID       ID'   
         ID       |  ID       ID       empty 
         ID'      |  ID'      empty    ID'   
```
  Note that this use of empty for the ID/ID' cases is liberal as far as alias analysis is concerned, but requires programmers to be conservative.
- bitwise shift has the provenance of its first operand
- relational, equality, logical AND, logical OR, and constant expressions give values with empty provenance
- prefix increment and decrement operators follow the pointer or integer arithmetic rules above
- the conditional operator gives the provenance of the second or third operand as appropriate; simple assignment gives the provenance of the expression, as do function arguments; compound assignment follows the pointer or integer arithmetic rules above; the comma operator gives the provenance of the second operand,
The representation bytes of a pointer have the provenance of the pointer.
The provenance of a pointer constructed by writing individual representation bytes
- if all have the empty provenance, the result has the empty provenance
- if one has a single provenance and the others all have either the same single provenance or empty provenance, the result has the single provenance
- if two have distinct single provenance, the result has empty provenance
Provenance is not propagated via control flow (e.g. by conditionals that check equality of a pointer value).
To permit pointers to be constructed via IO (e.g. via %p or by marshalling and unmarshalling their representation bytes, in the same runtime), there are several choices, and it would be useful to know what compiler alias analysis actually does here.

4 Proposed Technical Corrigendum

TODO: UPDATE This is currently the old N2219 proposal. The wildcard aspects need to be removed, and it needs to be adapted to support escape hatches and perhaps also subobject provenance.

TODO: FIX There are errors in the lvalue-conversion and equality text, identified by Martin Sebor, that need to be fixed.

add the following term to (§3):

provenance
abstract information associated to each concrete value of pointer or integer type. It can either be the empty provenance, a single provenance for a given object or region, or the wildcard provenance.

add a new clause to (§6.3.2.1):

Without the -fno-provenance option, when the lvalue conversion is performed on an lvalue with pointer type, if it has the empty provenance and the associated memory location is not within the implementation-defined device memory, the behavior is undefined. If it has a single provenance and the object designated by the lvalue is not the same as or within the object from the single provenance, the behavior is undefined.

NOTE: for the wildcard provenance there is no additional check.

With the -fno-provenance option, when the lvalue conversion is performed on an lvalue with pointer type, if the associated memory location is not within a current allocation, the behaviour is undefined.

add a new clause to (§6.3.2.1):

The provenance of a value resulting from an lvalue conversion is has follows: if all the bytes of the designated object have the empty provenance, then so does the resulting value; if some bytes have a single provenance and all the other bytes have the empty provenance, then the resulting value has the same single provenance; if any two bytes have different single provenances, then the resulting value has the empty provenance; if any byte has the wildcard provenance, then so does the resulting value.

append after the first sentence of (§6.3.2.3#5):

There is an implementation-defined set of integer values for which the resulting pointer is within a device region of storage.

Null pointers add the following clause to (§6.3.2.3):

A null pointer has the empty provenance

Conversions between pointer and integer type
- append to the end of (§6.3.2.3#5):
  
  The resulting pointer has the provenance of the converted integer.
- append to the end of (§6.3.2.3#6):
  
  The resulting integer has the provenance of the converted pointer.
- add the following clause to (§6.3.2.3):
  
  All conversions between two pointer types leave the provenance unchanged.
Expression operators add the following clause to (§6.5) (for later reference we call this clause X, to be replaced by its section number)

Some expression operators evaluate to an integer value when both operands have integer type. For these the provenance of the resulting integer is as follows: if the value of both operands have empty provenance, so does the result; if the value of one operand has a single provenance and the other has empty provenance, or both values have the same single provenance, the result has that single provenance; if the values of the two operands have different single provenances, the result has the empty provenance; if the value of one operand has the wildcard provenance, the result has the wildcard provenance.

Address operator append to the end of (§6.5.3.2#3):

When the operand designates an object, the result has the single provenance of the outermost object containing that object.

Unary arithmetic operators add the following clause to (§6.5.3.3):

For all these operators, when the operand has integer type, the result has the empty provenance.

sizeof and _Alignof operators append to the end of both (§6.5.3.4#2) and (§6.5.3.4#3):

The result has the empty provenance.

Multiplicative operators add the following clause to (§6.5.5):

When both operands have integer type, the resulting integer has provenance as described in X.

Additive operators
- add the following to (§6.5.6#5) (or as a new clause):
When two integers are added or subtracted, the resulting integer has provenance as described in X.
- append to the end of (§6.5.6#8):
The result has the provenance of the pointer operand.
- append to the end of (§6.5.6#9):
The result has the empty provenance.
Bitwise shift operators add the following clause to (§6.5.7):

The result has the provenance of the first operand.

Relational operator append to the last sentence of (§6.5.8#6):

, and has the empty provenance.

Equality operator
- append to the penultimate sentence of (§6.5.9#3):
, and has the empty provenance.
- modify (§6.5.9#6) from the current:
Two pointers compare equal if and only if both are null pointers, both are pointers to the same object (including a pointer to an object and a subobject at its beginning) or function, both are pointers to one past the last element of the same array object, or one is a pointer to one past the end of one array object and the other is a pointer to the start of a different array object that happens to immediately follow the first array object in the address space.109)
```
    to read:
```
Without the -fno-provenance option, two pointers compare equal if and only if both are null pointers, both are pointers to the same object (including a pointer to an object and a subobject at its beginning) or function, both are pointers to one past the last element of the same array object, or one is a pointer to one past the end of one array object and the other is a pointer to the start of a different array object that happens to immediately follow the first array object in the address space, and they have the same single provenance (for the other cases involving their provenances, it is unspecified whether the pointers compare equal). With the -fno-provenance option, two pointers compare equal iff they have the same numeric address.109)
Bitwise AND operator append to the end of (§6.5.10#4):

The result has provenance as described in X.

Bitwise exclusive OR operator append to the end of (§6.5.11#4):

The result has provenance as described in X.

Bitwise inclusive OR operator append to the end of (§6.5.12#4):

The result has provenance as described in X.

Logical AND operator append to the last sentence of (§6.5.13#3):

, and has the empty provenance.

Logical OR operator append to the last sentence of (§6.5.14#3):

, and has the empty provenance.

Object representation modify the following sentence from (§6.2.6.1#4): after the second sentence

The value may be copied into an object of type unsigned char [n] (e.g., by memcpy)

   to read:

The value may be copied into an object of type unsigned char [n] (e.g., by memcpy), if the value is an integer or a pointer (and therefore has a provenance), the elements of the resulting array all have that provenance.

Provenance via IO modify the sentence from (§7.21.6.2#12 %p conversion specifier):

If the input item is a value converted earlier during the same program execution, the pointer that results shall compare equal to that value

   to read:

If the input item is a value converted earlier during the same program execution, the pointer that results shall compare equal to the most recent such value and have the same provenance

n2263: Clarifying Pointer Provenance v4

Kayvan Memarian, Victor Gomes, Peter Sewell. University of Cambridge

Wed May 30 13:16:51 EDT 2018