From patchwork Sat Jan 15 01:05:59 2022
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
X-Patchwork-Submitter: Hridya Valsaraju
X-Patchwork-Id: 80011
Date: Fri, 14 Jan 2022 17:05:59 -0800
In-Reply-To: <20220115010622.3185921-1-hridya@google.com>
Message-Id: <20220115010622.3185921-2-hridya@google.com>
References: <20220115010622.3185921-1-hridya@google.com>
X-Mailer: git-send-email 2.34.1.703.g22d0c6ccf7-goog
Subject: [RFC 1/6] gpu: rfc: Proposal for a GPU cgroup controller
From: Hridya Valsaraju
To: David Airlie, Daniel Vetter, Maarten Lankhorst, Maxime Ripard,
 Thomas Zimmermann, Jonathan Corbet, Greg Kroah-Hartman, Arve Hjønnevåg,
 Todd Kjos, Martijn Coenen, Joel Fernandes, Christian Brauner,
 Hridya Valsaraju, Suren Baghdasaryan, Sumit Semwal, Benjamin Gaignard,
 Liam Mark, Laura Abbott, Brian Starkey, John Stultz, Christian König,
 Tejun Heo, Zefan Li, Johannes Weiner, Dave Airlie, Kenneth Graunke,
 Simon Ser, Jason Ekstrand, Matthew Auld, Matthew Brost, Li Li,
 Marco Ballesio, Finn Behrens, Hang Lu, Wedson Almeida Filho,
 Masahiro Yamada, Andrew Morton, Nathan Chancellor, Kees Cook,
 Nick Desaulniers, Miguel Ojeda, Vipin Sharma, Chris Down,
 Daniel Borkmann, Vlastimil Babka, Arnd Bergmann,
 dri-devel@lists.freedesktop.org, linux-doc@vger.kernel.org,
 linux-kernel@vger.kernel.org, linux-media@vger.kernel.org,
 linaro-mm-sig@lists.linaro.org, cgroups@vger.kernel.org
Cc: Kenny.Ho@amd.com, daniels@collabora.com, kaleshsingh@google.com,
 tjmercier@google.com
X-Mailing-List: linux-media@vger.kernel.org

This patch adds a proposal for a new GPU cgroup controller for
accounting/limiting GPU and GPU-related memory allocations.
The proposed controller is based on the DRM cgroup controller[1] and
follows the design of the RDMA cgroup controller.

The new cgroup controller would:
* Allow setting per-cgroup limits on the total size of buffers charged
  to it.
* Allow setting per-device limits on the total size of buffers
  allocated by a device within a cgroup.
* Expose a per-device/allocator breakdown of the buffers charged to a
  cgroup.

The prototype in the following patches is only for memory accounting
using the GPU cgroup controller and does not implement limit setting.

[1]: https://lore.kernel.org/amd-gfx/20210126214626.16260-1-brian.welty@intel.com/

Signed-off-by: Hridya Valsaraju
---

Hi all,

Here is the RFC documentation for the GPU cgroup controller that we
talked about at LPC 2021 along with a prototype. I reached out to Tejun
with the idea recently and he mentioned that cgroup-aware BPF (by Kenny
Ho) or the new misc cgroup controller can also be considered as
alternatives to track GPU resources. I am sending the RFC to the list to
give everyone else a chance to chime in with their thoughts as well so
that we can reach an agreement on how to proceed.

Thanks in advance!
Regards,
Hridya

 Documentation/gpu/rfc/gpu-cgroup.rst | 192 +++++++++++++++++++++++++++
 Documentation/gpu/rfc/index.rst      |   4 +
 2 files changed, 196 insertions(+)
 create mode 100644 Documentation/gpu/rfc/gpu-cgroup.rst

diff --git a/Documentation/gpu/rfc/gpu-cgroup.rst b/Documentation/gpu/rfc/gpu-cgroup.rst
new file mode 100644
index 000000000000..9bff23007b22
--- /dev/null
+++ b/Documentation/gpu/rfc/gpu-cgroup.rst
@@ -0,0 +1,192 @@
+===================================
+GPU cgroup controller
+===================================
+
+Goals
+=====
+This document outlines a plan to create a cgroup v2 controller subsystem
+for the per-cgroup accounting of device and system memory allocated by the GPU
+and related subsystems.
+
+The new cgroup controller would:
+
+* Allow setting per-cgroup limits on the total size of buffers charged to it.
+
+* Allow setting per-device limits on the total size of buffers allocated by a
+  device/allocator within a cgroup.
+
+* Expose a per-device/allocator breakdown of the buffers charged to a cgroup.
+
+Alternatives Considered
+=======================
+
+The following alternatives were considered:
+
+The memory cgroup controller
+____________________________
+
+1. As was noted in [1], memory accounting provided by the GPU cgroup
+controller is not a good fit for integration into memcg due to the
+differences in how accounting is performed. It implements a mechanism
+for the allocator attribution of GPU and GPU-related memory by
+charging each buffer to the cgroup of the process on behalf of which
+the memory was allocated. The buffer stays charged to the cgroup until
+it is freed regardless of whether the process retains any references
+to it. On the other hand, the memory cgroup controller offers a more
+fine-grained charging and uncharging behavior depending on the kind of
+page being accounted.
+
+2. Memcg performs accounting in units of pages. In the DMA-BUF buffer sharing model,
+a process takes a reference to the entire buffer (hence keeping it alive) even if
+it is only accessing parts of it. Therefore, per-page memory tracking for DMA-BUF
+memory accounting would only introduce additional overhead without any benefits.
+
+[1]: https://patchwork.kernel.org/project/dri-devel/cover/20190501140438.9506-1-brian.welty@intel.com/#22624705
+
+Userspace service to keep track of buffer allocations and releases
+__________________________________________________________________
+
+1. There is no way for a userspace service to intercept all allocations and releases.
+2. If the service process gets killed or restarted, all accounting so far is lost.
+
+UAPI
+====
+When enabled, the new cgroup controller would create the following files in every cgroup.
+
+::
+
+        gpu.memory.current (R)
+        gpu.memory.max (R/W)
+
+gpu.memory.current is a read-only file and would contain per-device memory allocations
+in a key-value format where the key is a string representing the device name
+and the value is the size of memory charged to the device in the cgroup in bytes.
+
+For example:
+
+::
+
+        cat /sys/kernel/fs/cgroup1/gpu.memory.current
+        dev1 4194304
+        dev2 4194304
+
+The string key for each device is set by the device driver when the device registers
+with the GPU cgroup controller to participate in resource accounting (see section
+'Design and Implementation' for more details).
+
+gpu.memory.max is a read/write file. It would show the current total
+size limits on memory usage for the cgroup and the limits on total memory usage
+for each allocator/device.
+
+Setting a total limit for a cgroup can be done as follows:
+
+::
+
+        echo "total 41943040" > /sys/kernel/fs/cgroup1/gpu.memory.max
+
+Setting a total limit for a particular device/allocator can be done as follows:
+
+::
+
+        echo "dev1 4194304" > /sys/kernel/fs/cgroup1/gpu.memory.max
+
+In this example, 'dev1' is the string key set by the device driver during
+registration.
+
+Design and Implementation
+=========================
+
+The cgroup controller would closely follow the design of the RDMA cgroup controller
+subsystem where each cgroup maintains a list of resource pools.
+Each resource pool contains a struct gpucg_device pointer and a counter that tracks
+the current total charged and the maximum limit set for the device.
+
+The code block below is a preliminary sketch of what the core kernel data structures
+and APIs would look like.
+
+.. code-block:: c
+
+        /**
+         * The GPU cgroup controller data structure.
+         */
+        struct gpucg {
+                struct cgroup_subsys_state css;
+                /* list of all resource pools that belong to this cgroup */
+                struct list_head rpools;
+        };
+
+        struct gpucg_device {
+                /*
+                 * list of various resource pools in various cgroups that the device is
+                 * part of.
+                 */
+                struct list_head rpools;
+                /* list of all devices registered for GPU cgroup accounting */
+                struct list_head dev_node;
+                /* name to be used as identifier for accounting and limit setting */
+                const char *name;
+        };
+
+        struct gpucg_resource_pool {
+                /* The device whose resource usage is tracked by this resource pool */
+                struct gpucg_device *device;
+
+                /* list of all resource pools for the cgroup */
+                struct list_head cg_node;
+
+                /*
+                 * list maintained by the gpucg_device to keep track of its
+                 * resource pools
+                 */
+                struct list_head dev_node;
+
+                /* tracks memory usage of the resource pool */
+                struct page_counter total;
+        };
+
+        /**
+         * gpucg_register_device - Registers a device for memory accounting using the
+         * GPU cgroup controller.
+         *
+         * @gpucg_dev: The device to register for memory accounting. Must remain valid
+         * after registration.
+         * @name: Pointer to a string literal to denote the name of the device.
+         */
+        void gpucg_register_device(struct gpucg_device *gpucg_dev, const char *name);
+
+        /**
+         * gpucg_try_charge - charge memory to the specified gpucg and gpucg_device.
+         *
+         * @gpucg: The gpu cgroup to charge the memory to.
+         * @device: The device to charge the memory to.
+         * @usage: size of memory to charge in bytes.
+         *
+         * Return: 0 if the charge succeeds, or an error code
+         * otherwise.
+         */
+        int gpucg_try_charge(struct gpucg *gpucg, struct gpucg_device *device, u64 usage);
+
+        /**
+         * gpucg_uncharge - uncharge memory from the specified gpucg and gpucg_device.
+         *
+         * @gpucg: The gpu cgroup to uncharge the memory from.
+         * @device: The device to uncharge the memory from.
+         * @usage: size of memory to uncharge in bytes.
+         */
+        void gpucg_uncharge(struct gpucg *gpucg, struct gpucg_device *device, u64 usage);
+
+Future Work
+===========
+Additional GPU resources can be supported by adding new controller files.
+
+Upstreaming Plan
+================
+* Decide on a UAPI that accommodates all use-cases for the upstream GPU ecosystem
+  as well as for Android.
+
+* Prototype the GPU cgroup controller and integrate its usage into the DMA-BUF
+  system heap.
+
+* Demonstrate its usage from userspace in the Android Open Source Project.
+
+* Send out RFCs to LKML for the GPU cgroup controller and iterate.
diff --git a/Documentation/gpu/rfc/index.rst b/Documentation/gpu/rfc/index.rst
index 91e93a705230..0a9bcd94e95d 100644
--- a/Documentation/gpu/rfc/index.rst
+++ b/Documentation/gpu/rfc/index.rst
@@ -23,3 +23,7 @@ host such documentation:
 .. toctree::
 
     i915_scheduler.rst
+
+.. toctree::
+
+    gpu-cgroup.rst
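
For anyone who wants to poke at the proposed accounting semantics outside the kernel, here is a rough userspace model in C. It is only a sketch under stated assumptions, not the prototype from this series: every `model_*` identifier is hypothetical, a fixed-size array stands in for the resource pool lists, plain `uint64_t` fields stand in for `struct page_counter`, the cgroup hierarchy is ignored, and the choice of `-ENOMEM`/`-ENOSPC` as error codes is an assumption (the RFC only says "an error code"). It shows a charge being checked against both the per-device limit and the cgroup total, and the per-device breakdown being rendered in the `<name> <bytes>` format described for gpu.memory.current.

```c
/* Userspace model only. Every identifier here is hypothetical, not kernel API. */
#include <assert.h>
#include <errno.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define MODEL_MAX_POOLS 8

/* Stand-in for gpucg_resource_pool: one device's usage within one cgroup. */
struct model_pool {
    const char *dev_name;   /* key a driver would pass at registration */
    uint64_t current;       /* bytes currently charged */
    uint64_t max;           /* per-device limit in bytes */
};

/* Stand-in for struct gpucg: a cgroup, its total limit, and its pools. */
struct model_gpucg {
    struct model_pool pools[MODEL_MAX_POOLS];
    size_t npools;
    uint64_t total_current;
    uint64_t total_max;
};

void model_init(struct model_gpucg *cg)
{
    memset(cg, 0, sizeof(*cg));
    cg->total_max = UINT64_MAX;     /* no total limit until one is set */
}

static struct model_pool *model_find_pool(struct model_gpucg *cg,
                                          const char *dev_name)
{
    for (size_t i = 0; i < cg->npools; i++)
        if (strcmp(cg->pools[i].dev_name, dev_name) == 0)
            return &cg->pools[i];
    return NULL;
}

/* Overflow-safe check that current + usage stays within max. */
static int model_fits(uint64_t current, uint64_t max, uint64_t usage)
{
    return current <= max && usage <= max - current;
}

/* Charge usage bytes to (cg, dev_name); the pool is created on first use. */
int model_try_charge(struct model_gpucg *cg, const char *dev_name,
                     uint64_t usage)
{
    struct model_pool *pool = model_find_pool(cg, dev_name);

    if (!pool) {
        if (cg->npools == MODEL_MAX_POOLS)
            return -ENOSPC;
        pool = &cg->pools[cg->npools++];
        pool->dev_name = dev_name;
        pool->current = 0;
        pool->max = UINT64_MAX;     /* no per-device limit by default */
    }
    if (!model_fits(pool->current, pool->max, usage) ||
        !model_fits(cg->total_current, cg->total_max, usage))
        return -ENOMEM;

    pool->current += usage;
    cg->total_current += usage;
    return 0;
}

/* Uncharge usage bytes; the caller must have charged at least that much. */
void model_uncharge(struct model_gpucg *cg, const char *dev_name,
                    uint64_t usage)
{
    struct model_pool *pool = model_find_pool(cg, dev_name);

    assert(pool && usage <= pool->current);
    pool->current -= usage;
    cg->total_current -= usage;
}

/* Render the per-device breakdown, one "<name> <bytes>" line per pool. */
int model_format_current(const struct model_gpucg *cg, char *buf, size_t len)
{
    size_t off = 0;

    if (len == 0)
        return 0;
    buf[0] = '\0';
    for (size_t i = 0; i < cg->npools; i++) {
        int n = snprintf(buf + off, len - off, "%s %" PRIu64 "\n",
                         cg->pools[i].dev_name, cg->pools[i].current);
        if (n < 0 || (size_t)n >= len - off)
            break;                  /* stop rather than emit a torn line */
        off += (size_t)n;
    }
    return (int)off;
}
```

With a total limit of 8388608 bytes, charging 4194304 bytes each to "dev1" and "dev2" succeeds, a further one-byte charge is rejected, and the formatted breakdown reads "dev1 4194304" and "dev2 4194304", matching the gpu.memory.current example in the document. Note that in this model the charge stays with the cgroup until it is explicitly uncharged, which mirrors the "charged until freed" behavior the RFC contrasts with memcg.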