AdaptiveCpp runtime specification
The AdaptiveCpp runtime library follows the requirements of a SYCL runtime library as described in the SYCL specification. The following specification assumes the SYCL specification, but expands on it where AdaptiveCpp provides stronger or slightly different guarantees. It is assumed that the reader is at least familiar with the SYCL programming model.
AdaptiveCpp buffer-accessor model
Buffer behavior
Overview
A buffer
is an object that provides storage of a fixed size, and makes that storage accessible on an arbitrary amount of devices. To this end, it manages allocations of the fixed buffer size on all devices where the buffer is accessed.
Persistent allocations
A goal of the buffer
implementation is delivering predictable performance; as such all allocations managed by a buffer
shall be of the fixed buffer size. No reallocations shall occur during the lifetime of the buffer
without explicit user request, and managed allocations shall not be freed without explicit user request before buffer
destruction. Once a buffer
object has started to manage an allocation on a particular device, this allocation shall be used for all operations that access the buffer
object on that device.
A pointer to buffer data obtained in a kernel shall be valid and point to the same memory for all subsequent kernels that are executed on the same device as long as the buffer object exists.
Explicit USM as foundation
Memory management operations of the buffer
and storage shall be performed using SYCL 2020 USM pointers. This implies inherent interoperability between USM pointers and buffers. For example, if a pointer to a memory allocation managed by a buffer
is obtained by the user, it shall behave like a USM pointer and USM operations shall work with that pointer as if it were a pointer obtained from a USM memory allocation function.
If buffer
allocates memory, this shall be done using explicit USM allocations by default. If the buffer
provides additional interoperability mechanisms that allow constructing buffers on top of user-provided USM pointers, those may be of other USM allocation types. In this case, the allocation shall still be interpreted by the buffer
as a USM allocation that is bound to a single device.
For allocations on CPU backends, a buffer
implementation may use USM host allocations (i.e. page-locked memory).
Mapping between allocations and devices
Allocations managed by a buffer
shall not be shared between different physical devices; instead a buffer shall allocate individual memory buffers for each physical device on which it operates. This allows the scheduler and user to make stronger assumptions regarding necessary data migration and the performance impact of executing kernels simultaneously on different devices that read data from the same buffer
. Allocations may only be shared between different SYCL devices that refer to the same physical hardware. For example, it may be desirable to have a single host allocation that is used by all CPU devices if there are multiple CPU backends available.
Allocation behavior
Memory shall be allocated lazily on a particular device when a buffer
is first used on that device. However, some buffer
constructors may require that data from a user-provided input pointer is copied to internal buffer storage. In this case buffer
will perform an allocation in the constructor, typically on the host device, to hold that data.
Comments
- If a device pointer has been extracted from a buffer, it is valid at least until buffer destruction, and can be used for USM operations - provided the user manually synchronizes these USM operations with any operations the
buffer
is involved in. - Because there are no partial allocations, accessing memory outside the bounds of a ranged accessor, but within the buffer bounds, is not undefined behavior. However, it is not guaranteed that this data is up-to-date and there might be other kernels operating on it simultaneously if the user does not manually synchronize (details below).
Data transfers, accessors and dependencies
Data state tracking and pages
A SYCL implementation needs to track whether data stored in the buffer
in an allocation on a particular device is up-to-date or outdated. This information allows it to determine whether the implicit requirements formulated by accessors need to be translated into actual data transfers.
In the AdaptiveCpp model, the range of the buffer is interpreted as a 3D grid that is divided into 3D chunks of fixed size in each dimension. These chunks will in the following be referred to as pages (unrelated to virtual memory pages of the operating system). An implementation may expose mechanisms that allow the user to set the page size in each dimension in the buffer
constructor. The page size determines the granularity of memory management and data state tracking.
For each allocation managed on each device, the buffer
implementation shall track for each page whether the data contained within the page is up-to-date or outdated.
If a page is fully contained within or overlaps with the accessed range of an accessor (taking into account the accessor's access offset and range), we use the terminology that the page is part of the accessor's page range.
Using an accessor that is not of a read-only access mode on a device d shall cause all pages within its page range to be marked as outdated on all allocations except for those on device d. This is because the implementation has to assume that data was modified on d
.
Data transfers generated from accessors (see below) shall cause transferred pages to be marked as up-to-date on the target allocation.
If a buffer
is reinterpreted to a data type of different size than the original buffer element size or reshaped into a different range, the implementation may assume a page range for accessors to the reinterpreted buffer that is larger than the page range as defined above.
Data transfers
Accessors of discard
access mode (no_init
in SYCL 2020) shall never lead to data transfers.
Accessors referring to a buffer
that does not contain any initialized data (e.g. because it was never written to and was not constructed with a user-provided input pointer) shall never lead to data transfers.
When a non-discard
accessor is used on a particular device, a data transfer shall occur only if at least one of the pages within the accessor's page range is marked as outdated.
The implementation shall attempt to minimize both the number of transferred pages and the total number of backend data transfers, although the precise mechanism used and the detailed optimization criteria are implementation-defined.
Dependencies
Two accessors referring to the same buffer
are considered conflicting, if one or both are not of read-only access mode and their page ranges overlap.
Two accessors referring to different buffer
objects are never conflicting.
If two accessors are conflicting, a dependency is established between the command groups that they are used in. Dependent command groups are executed in submission order.
Independent command groups may be executed in parallel. For example, this includes the possibility of executing kernels in parallel on the same device, if this is supported by the backend and hardware.
Comments
- A smaller page size means a finer data management granularity; it may allow for more operations to be executed without dependencies in between them, but may also lead to a larger runtime overhead when tracking data state. The optimal page size is therefore a tradeoff.
- Note that in the AdaptiveCpp model, subbuffers are neither needed, nor necessary, nor recommended to obtain parallel execution of kernels.