Skip to content

Latest commit

 

History

History
204 lines (152 loc) · 6.92 KB

sycl_ext_oneapi_auto_local_range.asciidoc

File metadata and controls

204 lines (152 loc) · 6.92 KB

sycl_ext_oneapi_auto_local_range

Notice

Copyright © 2022-2022 Intel Corporation. All rights reserved.

Khronos® is a registered trademark and SYCL™ and SPIR™ are trademarks of The Khronos Group Inc. OpenCL™ is a trademark of Apple Inc. used by permission by Khronos.

Contact

To report problems with this extension, please open a new issue at:

Dependencies

This extension is written against the SYCL 2020 revision 4 specification. All references below to the "core SYCL specification" or to section numbers in the SYCL specification refer to that revision.

Status

This is a proposed extension specification, intended to gather community feedback. Interfaces defined in this specification may not be implemented yet or may be in a preliminary state. The specification itself may also change in incompatible ways before it is finalized. Shipping software products should not rely on APIs defined in this specification.

Overview

SYCL provides a basic form of parallel_for that allows developers to specify the total number of work-items to launch (i.e. the size of the global range) without also specifying the number of work-items per work-group (i.e. the size of the local range). However, this form of parallel_for does not provide access to the sycl::group or sycl::sub_group classes; if a developer wants to use these classes, a kernel must be launched with a sycl::nd_range that specifies both the global and local sizes.

The only way to allow an implementation to choose the local work-group size with the ND-range form of parallel_for is to use kernel queries, following an approach like the one shown below:

auto bundle = sycl::get_kernel_bundle(q.get_context());
auto kernel = bundle.get_kernel<class KernelName>();
auto multiple = kernel.get_info<sycl::info::kernel_device_specific::preferred_work_group_size_multiple>(q.get_device());
auto max = kernel.get_info<sycl::info::kernel_device_specific::work_group_size>(q.get_device());
sycl::range<1> local;
if (N % multiple == 0) {
  // Use largest work-group size compatible with preferred multiple
  local = static_cast<size_t>(max / multiple) * multiple;
}
else {
  local = /* find largest work-group size smaller than max that divides N */;
}
q.parallel_for<class KernelName>(sycl::nd_range<1>{N, local}, [=](sycl::nd_item<1>) {
  /* kernel body */
});

The example above is very verbose, and requires developers to learn about kernel naming, sycl::kernel_bundle, sycl::kernel and a number of device/kernel queries. Extending the logic to support two- and three-dimensional kernels complicates things further. With this extension, the example simplifies to:

q.parallel_for(sycl::nd_range<1>{N, sycl::ext::oneapi::experimental::auto_range<1>}, [=](sycl::nd_item<1>) {
  /* kernel body */
});

The SYCL 2020 specification recommends that extensions should not alter existing constructors without ensuring that one of the parameters comes from the vendor’s extension namespace. This restriction prevents this extension from defaulting the second argument of the sycl::nd_range constructor to auto_range, which would make the example above even simpler:

q.parallel_for(sycl::nd_range<1>{N}, [=](sycl::nd_item<1>) {
  /* kernel body */
});

If this extension is proposed for inclusion in a future SYCL standard, altering the definition of sycl::nd_range should be considered.

Specification

Feature test macro

This extension provides a feature-test macro as described in the core SYCL specification. An implementation supporting this extension must predefine the macro SYCL_EXT_ONEAPI_AUTO_LOCAL_RANGE to one of the values defined in the table below. Applications can test for the existence of this macro to determine if the implementation supports this feature, or applications can test the macro’s value to determine which of the extension’s features the implementation supports.

Value Description

1

Initial version of this extension.

auto_range

This extension defines a new sycl::ext::oneapi::experimental::auto_range variable which can be used to define a sycl::nd_range with an unspecified work-group size. If such a sycl::nd_range object is used to launch a SYCL kernel, an implementation is free to launch the kernel with any valid work-group size (as defined by the SYCL specification).

The manner in which a work-group size is selected is implementation-defined. However, the total number of work-items launched by the kernel must match the number specified as the first argument to the sycl::nd_range constructor; an implementation may not adjust the size of the global range.

Note
Developers must take care to avoid awkward global range sizes when using an auto_range, to avoid performance issues. Since implementations must still choose a work-group size that divides the total number of work-items, the implementation may be forced to choose a sub-optimal work-group size (e.g. if the total number of work-items is a prime number, the work-group size must be 1). Although this division requirement holds for all SYCL kernels, it is easier to miss when using an auto_range.
namespace sycl {
namespace ext {
namespace oneapi {
namespace experimental {

template <int Dimensions>
static const inline range<Dimensions> auto_range = /* implementation-defined */;

}
}
}
}
Note
The auto_range variable is not constexpr because sycl::range is not required to be a literal type by SYCL 2020. If this changes in the future, this extension will be updated.

Implementation notes

This non-normative section provides information about one possible implementation of this extension. It is not part of the specification of the extension’s API.

The value of auto_range is implementation-defined to maximize freedom for implementations. If an implementation wants to use a reserved value to represent a request for an automatic local range, a range with every element set to 0 has no valid interpretation in SYCL 2020 and may be a logical choice. A trivial implementation of this extension can use a range with every element set to 1, since this is always a valid local range.

Implementations using OpenCL backends can build on existing OpenCL functionality where clEnqueueNDRangeKernel is called with a local_work_size value of NULL.

Implementations using other backends (e.g. Level Zero, CUDA) can use a combination of device and kernel queries to determine a good work-group size.

Issues

None.