[RFC,v4,5/8] dmabuf: Add gpu cgroup charge transfer function

Message ID 20220328035951.1817417-6-tjmercier@google.com (mailing list archive)
State Not Applicable
Series Proposal for a GPU cgroup controller

Commit Message

T.J. Mercier March 28, 2022, 3:59 a.m. UTC
  From: Hridya Valsaraju <hridya@google.com>

The dma_buf_charge_transfer function provides a way for processes to
transfer charge of a buffer to a different process. This is essential
for the cases where a central allocator process does allocations for
various subsystems, hands over the fd to the client who requested the
memory and drops all references to the allocated memory.
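
As an illustration of that flow, here is a minimal, hypothetical
allocator-side sketch (gpucg_get()/gpucg_put() and
dma_buf_transfer_charge() are interfaces from this series; client_task,
a handle on the requesting process, is assumed):

/*
 * Hypothetical central-allocator helper: after handing the fd to the
 * client, move the cgroup charge along with it so the allocator is no
 * longer billed for the buffer.
 */
static int handover_buffer_charge(struct dma_buf *dmabuf,
				  struct task_struct *client_task)
{
	struct gpucg *dest = gpucg_get(client_task);
	int ret = dma_buf_transfer_charge(dmabuf, dest);

	gpucg_put(dest); /* the buffer keeps its own reference on success */
	return ret;
}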

Signed-off-by: Hridya Valsaraju <hridya@google.com>
Signed-off-by: T.J. Mercier <tjmercier@google.com>

---
v4 changes
Adjust ordering of charge/uncharge during transfer to avoid potentially
hitting a cgroup limit, per Michal Koutný. (Charging the destination
before uncharging the source would transiently double-count the buffer,
so a transfer could spuriously fail against a limit, e.g. on a shared
ancestor cgroup, even though net usage never changes.)

v3 changes
Use more common dual author commit message format per John Stultz.

v2 changes
Move dma-buf cgroup charge transfer from a dma_buf_op defined by every
heap to a single dma-buf function for all heaps per Daniel Vetter and
Christian König.
---
 drivers/dma-buf/dma-buf.c  | 48 ++++++++++++++++++++++++++++++
 include/linux/cgroup_gpu.h | 12 ++++++++
 include/linux/dma-buf.h    |  2 ++
 kernel/cgroup/gpu.c        | 60 +++++++++++++++++++++++++++++++++++++++
 4 files changed, 122 insertions(+)
  

Comments

Michal Koutný March 29, 2022, 3:21 p.m. UTC | #1
Hi.

On Mon, Mar 28, 2022 at 03:59:44AM +0000, "T.J. Mercier" <tjmercier@google.com> wrote:
> From: Hridya Valsaraju <hridya@google.com>
> 
> The dma_buf_charge_transfer function provides a way for processes to

(s/dma_buf_charge_transfer/dma_buf_transfer_charge/)

> transfer charge of a buffer to a different process. This is essential
> for the cases where a central allocator process does allocations for
> various subsystems, hands over the fd to the client who requested the
> memory and drops all references to the allocated memory.

I understood from [1] some buffers are backed by regular RAM. How are
these charges going to be transferred (if so)?


Thanks,
Michal

[1]
https://lore.kernel.org/r/CABdmKX2NSAKMC6rReMYfo2SSVNxEXcS466hk3qF6YFt-j-+_NQ@mail.gmail.com
  
T.J. Mercier April 1, 2022, 6:41 p.m. UTC | #2
On Tue, Mar 29, 2022 at 8:21 AM Michal Koutný <mkoutny@suse.com> wrote:
>
> Hi.
>
> On Mon, Mar 28, 2022 at 03:59:44AM +0000, "T.J. Mercier" <tjmercier@google.com> wrote:
> > From: Hridya Valsaraju <hridya@google.com>
> >
> > The dma_buf_charge_transfer function provides a way for processes to
>
> (s/dma_buf_charge_transfer/dma_buf_transfer_charge/)
>
Doh! Thanks.

> > transfer charge of a buffer to a different process. This is essential
> > for the cases where a central allocator process does allocations for
> > various subsystems, hands over the fd to the client who requested the
> > memory and drops all references to the allocated memory.
>
> I understood from [1] some buffers are backed by regular RAM. How are
> these charges going to be transferred (if so)?
>
This link doesn't work for me, but I think you're referring to the
discussion about your "RAM_backed_buffers" comment from March 23rd. I
wanted to do a simple test to confirm my own understanding here, but
that got delayed due to some problems on my end. Anyway the test I did
goes like this: enable memcg and gpu cgroup tracking and run a process
that allocates 100MiB of dmabufs. Observe memcg and gpu accounting
values before and after the allocation.

Before
# cat memory.current gpu.memory.current
14909440
system 0

<Test program does the allocation of 100MiB of dmabufs>

After
# cat memory.current gpu.memory.current
48025600
system 104857600

So the memcg value increases by about 30 MiB while the gpu value
increases by 100 MiB. This is with kmem enabled, and the
/proc/<pid>/maps file for this process indicates that the majority of
that 30 MiB is kernel memory. I think this result shows that neither
the kernel nor
process memory overlap with the gpu cgroup tracking of these
allocations. So despite the fact that these buffers are in main
memory, they are allocated in a way that does not result in memcg
attribution. (It looks to me like __GFP_ACCOUNT is not set for these.)
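
For reference, the system heap's page allocations use gfp flags along
these lines (paraphrased from drivers/dma-buf/heaps/system_heap.c, not
a verbatim quote):

/*
 * An order-0 allocation as the system heap performs it: note the
 * absence of __GFP_ACCOUNT, so memcg never attributes these pages to
 * the allocating task.
 */
struct page *page = alloc_pages(GFP_HIGHUSER | __GFP_ZERO, 0);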

>
> Thanks,
> Michal
>
> [1]
> https://lore.kernel.org/r/CABdmKX2NSAKMC6rReMYfo2SSVNxEXcS466hk3qF6YFt-j-+_NQ@mail.gmail.com
  
Michal Koutný April 5, 2022, 12:12 p.m. UTC | #3
On Fri, Apr 01, 2022 at 11:41:36AM -0700, "T.J. Mercier" <tjmercier@google.com> wrote:
> This link doesn't work for me, but I think you're referring to the
> discussion about your "RAM_backed_buffers" comment from March 23rd.

(Oops, it's a non-public message. But yes, you guessed it right ;-))

> Anyway the test I did goes like this: enable memcg and gpu cgroup
> tracking and run a process that allocates 100MiB of dmabufs. Observe
> memcg and gpu accounting values before and after the allocation.

Thanks for this measurement/demo.

> Before
> # cat memory.current gpu.memory.current
> 14909440
> system 0
> 
> <Test program does the allocation of 100MiB of dmabufs>
> 
> After
> # cat memory.current gpu.memory.current
> 48025600
> system 104857600
> 
> So the memcg value increases by about 30 MiB while the gpu value
> increases by 100 MiB.

> This is with kmem enabled, and the /proc/<pid>/maps
> file for this process indicates that the majority of that 30 MiB is
> kernel memory.

> I think this result shows that neither the kernel nor process memory
> overlap with the gpu cgroup tracking of these allocations.

It depends on how the semantics of the 'system' entry are defined, no?
As I grasped from the other thread, the 'total' is going to be removed, so
'system' represents exclusively device memory?


> So despite the fact that these buffers are in main memory, they are
> allocated in a way that does not result in memcg attribution. (It
> looks to me like __GFP_ACCOUNT is not set for these.)

(I thought you knew what dmabufs your program used :-p)

So, the goal is to do the tracking and migrations only via the gpu cg
layer, regardless of how memcg charges it (or not).

(I have no opinion on that, I'm just summing it so that we're on the
same page.)

Michal
  
T.J. Mercier April 5, 2022, 5:48 p.m. UTC | #4
On Tue, Apr 5, 2022 at 5:12 AM Michal Koutný <mkoutny@suse.com> wrote:
>
> On Fri, Apr 01, 2022 at 11:41:36AM -0700, "T.J. Mercier" <tjmercier@google.com> wrote:
> > This link doesn't work for me, but I think you're referring to the
> > discussion about your "RAM_backed_buffers" comment from March 23rd.
>
> (Oops, it's a non-public message. But yes, you guessed it right ;-))
>
> > Anyway the test I did goes like this: enable memcg and gpu cgroup
> > tracking and run a process that allocates 100MiB of dmabufs. Observe
> > memcg and gpu accounting values before and after the allocation.
>
> Thanks for this measurement/demo.
>
> > Before
> > # cat memory.current gpu.memory.current
> > 14909440
> > system 0
> >
> > <Test program does the allocation of 100MiB of dmabufs>
> >
> > After
> > # cat memory.current gpu.memory.current
> > 48025600
> > system 104857600
> >
> > So the memcg value increases by about 30 MiB while the gpu value
> > increases by 100 MiB.
>
> > This is with kmem enabled, and the /proc/<pid>/maps
> > file for this process indicates that the majority of that 30 MiB is
> > kernel memory.
>
> > I think this result shows that neither the kernel nor process memory
> > overlap with the gpu cgroup tracking of these allocations.
>
> It depends on how the semantics of the 'system' entry are defined, no?
> As I grasped from the other thread, the 'total' is going to be removed, so
> 'system' represents exclusively device memory?
>
That's right. The system charges (soon to be renamed "system-heap")
result only from an allocator (in this case the system heap) deciding
to call gpucg_try_charge for the buffer, which is accounted entirely
as device memory.
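
In code terms the charging side is roughly this shape (a hedged sketch,
not the actual heap code; sys_heap_gpucg_dev stands in for whatever
gpucg_device the system heap registers via gpucg_register_device()):

/* At export time, in the allocating process (here the system heap): */
struct gpucg *gpucg = gpucg_get(current);
int ret = gpucg_try_charge(gpucg, &sys_heap_gpucg_dev, dmabuf->size);

if (ret) {
	/* over the cgroup's limit: fail the allocation */
	gpucg_put(gpucg);
	return ret;
}
/* record the charge so dma_buf_transfer_charge() can move it later */
dmabuf->gpucg = gpucg;
dmabuf->gpucg_dev = &sys_heap_gpucg_dev;
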
>
> > So despite the fact that these buffers are in main memory, they are
> > allocated in a way that does not result in memcg attribution. (It
> > looks to me like __GFP_ACCOUNT is not set for these.)
>
> (I thought you knew what dmabufs your program used :-p)
>
I'm coming up to speed on a lot of new-to-me code here. :)
Just for completeness, these buffers were allocated with
libdmabufheap's AllocSystem.
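
AllocSystem boils down to the dma-heap uAPI, roughly (a minimal,
untested sketch):

#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/dma-heap.h>

int alloc_system_dmabuf(__u64 len)
{
	struct dma_heap_allocation_data data = {
		.len = len,
		.fd_flags = O_RDWR | O_CLOEXEC,
	};
	int ret, heap_fd = open("/dev/dma_heap/system", O_RDWR);

	if (heap_fd < 0)
		return -1;
	ret = ioctl(heap_fd, DMA_HEAP_IOCTL_ALLOC, &data);
	close(heap_fd);
	/* data.fd is the new dmabuf fd; this is what gets charged */
	return ret < 0 ? -1 : data.fd;
}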

> So, the goal is to do the tracking and migrations only via the gpu cg
> layer, regardless of how memcg charges it (or not).
>
> (I have no opinion on that, I'm just summing it so that we're on the
> same page.)
>
Yes, this reflects my intention and current state of the code in this series.

> Michal

Thanks,
T.J.
  

Patch

diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c
index 1ee5c60d3d6d..7748c3453b91 100644
--- a/drivers/dma-buf/dma-buf.c
+++ b/drivers/dma-buf/dma-buf.c
@@ -1380,6 +1380,54 @@  void dma_buf_vunmap(struct dma_buf *dmabuf, struct dma_buf_map *map)
 }
 EXPORT_SYMBOL_NS_GPL(dma_buf_vunmap, DMA_BUF);
 
+/**
+ * dma_buf_transfer_charge - Change the GPU cgroup to which the provided dma_buf is charged.
+ * @dmabuf:	[in]	buffer whose charge will be migrated to a different GPU cgroup
+ * @gpucg:	[in]	the destination GPU cgroup for dmabuf's charge
+ *
+ * Only tasks that belong to the same cgroup the buffer is currently charged to
+ * may call this function, otherwise it will return -EPERM.
+ *
+ * Returns 0 on success, or a negative errno code otherwise.
+ */
+int dma_buf_transfer_charge(struct dma_buf *dmabuf, struct gpucg *gpucg)
+{
+#ifdef CONFIG_CGROUP_GPU
+	struct gpucg *current_gpucg;
+	int ret;
+
+	/* If the source and destination cgroups are the same, don't do anything. */
+	current_gpucg = gpucg_get(current);
+	if (current_gpucg == gpucg) {
+		ret = 0;
+		goto skip_transfer;
+	}
+
+	/*
+	 * Verify that the cgroup of the process requesting the transfer is the
+	 * same as the one the buffer is currently charged to.
+	 */
+	mutex_lock(&dmabuf->lock);
+	if (current_gpucg != dmabuf->gpucg) {
+		ret = -EPERM;
+		goto err;
+	}
+
+	ret = gpucg_transfer_charge(current_gpucg, gpucg, dmabuf->gpucg_dev, dmabuf->size);
+	if (ret)
+		goto err;
+	dmabuf->gpucg = gpucg;
+err:
+	mutex_unlock(&dmabuf->lock);
+skip_transfer:
+	gpucg_put(current_gpucg);
+	return ret;
+#else
+	return 0;
+#endif /* CONFIG_CGROUP_GPU */
+}
+EXPORT_SYMBOL_NS_GPL(dma_buf_transfer_charge, DMA_BUF);
+
 #ifdef CONFIG_DEBUG_FS
 static int dma_buf_debug_show(struct seq_file *s, void *unused)
 {
diff --git a/include/linux/cgroup_gpu.h b/include/linux/cgroup_gpu.h
index c90069719022..e30f15d5e9be 100644
--- a/include/linux/cgroup_gpu.h
+++ b/include/linux/cgroup_gpu.h
@@ -87,6 +87,10 @@  static inline struct gpucg *gpucg_parent(struct gpucg *cg)
 
 int gpucg_try_charge(struct gpucg *gpucg, struct gpucg_device *device, u64 usage);
 void gpucg_uncharge(struct gpucg *gpucg, struct gpucg_device *device, u64 usage);
+int gpucg_transfer_charge(struct gpucg *source,
+			  struct gpucg *dest,
+			  struct gpucg_device *device,
+			  u64 usage);
 void gpucg_register_device(struct gpucg_device *gpucg_dev, const char *name);
 #else /* CONFIG_CGROUP_GPU */
 
@@ -121,6 +125,14 @@  static inline void gpucg_uncharge(struct gpucg *gpucg,
 				  struct gpucg_device *device,
 				  u64 usage) {}
 
+static inline int gpucg_transfer_charge(struct gpucg *source,
+					struct gpucg *dest,
+					struct gpucg_device *device,
+					u64 usage)
+{
+	return 0;
+}
+
 static inline void gpucg_register_device(struct gpucg_device *gpucg_dev,
 					 const char *name) {}
 #endif /* CONFIG_CGROUP_GPU */
diff --git a/include/linux/dma-buf.h b/include/linux/dma-buf.h
index 742f29c3daaf..646827156213 100644
--- a/include/linux/dma-buf.h
+++ b/include/linux/dma-buf.h
@@ -646,4 +646,6 @@  int dma_buf_mmap(struct dma_buf *, struct vm_area_struct *,
 		 unsigned long);
 int dma_buf_vmap(struct dma_buf *dmabuf, struct dma_buf_map *map);
 void dma_buf_vunmap(struct dma_buf *dmabuf, struct dma_buf_map *map);
+
+int dma_buf_transfer_charge(struct dma_buf *dmabuf, struct gpucg *gpucg);
 #endif /* __DMA_BUF_H__ */
diff --git a/kernel/cgroup/gpu.c b/kernel/cgroup/gpu.c
index ac4c470914b5..40531323d6da 100644
--- a/kernel/cgroup/gpu.c
+++ b/kernel/cgroup/gpu.c
@@ -247,6 +247,66 @@  void gpucg_uncharge(struct gpucg *gpucg, struct gpucg_device *device, u64 usage)
 	css_put_many(&gpucg->css, nr_pages);
 }
 
+/**
+ * gpucg_transfer_charge - Transfer a GPU charge from one cgroup to another.
+ * @source:	[in]	The GPU cgroup the charge will be transferred from.
+ * @dest:	[in]	The GPU cgroup the charge will be transferred to.
+ * @device:	[in]	The GPU cgroup device corresponding to the charge.
+ * @usage:	[in]	The size of the memory in bytes.
+ *
+ * Returns 0 on success, or a negative errno code otherwise.
+ */
+int gpucg_transfer_charge(struct gpucg *source,
+			  struct gpucg *dest,
+			  struct gpucg_device *device,
+			  u64 usage)
+{
+	struct page_counter *counter;
+	u64 nr_pages;
+	struct gpucg_resource_pool *rp_source, *rp_dest;
+	int ret = 0;
+
+	nr_pages = PAGE_ALIGN(usage) >> PAGE_SHIFT;
+
+	mutex_lock(&gpucg_mutex);
+	rp_source = find_cg_rpool_locked(source, device);
+	if (unlikely(!rp_source)) {
+		ret = -ENOENT;
+		goto exit_early;
+	}
+
+	rp_dest = get_cg_rpool_locked(dest, device);
+	if (IS_ERR(rp_dest)) {
+		ret = PTR_ERR(rp_dest);
+		goto exit_early;
+	}
+
+	/*
+	 * First uncharge from the pool it's currently charged to. This ordering avoids double
+	 * charging while the transfer is in progress, which could cause us to hit a limit.
+	 * If the try_charge fails for this transfer, we need to be able to reverse this uncharge,
+	 * so we continue to hold the gpucg_mutex here.
+	 */
+	page_counter_uncharge(&rp_source->total, nr_pages);
+	css_put_many(&source->css, nr_pages);
+
+	/* Now attempt the new charge */
+	if (page_counter_try_charge(&rp_dest->total, nr_pages, &counter)) {
+		css_get_many(&dest->css, nr_pages);
+	} else {
+		/*
+		 * The new charge failed, so reverse the uncharge from above. This should always
+		 * succeed since charges on source are blocked by gpucg_mutex.
+		 */
+		WARN_ON(!page_counter_try_charge(&rp_source->total, nr_pages, &counter));
+		css_get_many(&source->css, nr_pages);
+		ret = -ENOMEM;
+	}
+exit_early:
+	mutex_unlock(&gpucg_mutex);
+	return ret;
+}
+
 /**
  * gpucg_register_device - Registers a device for memory accounting using the
  * GPU cgroup controller.