Message ID | 20240530201616.1316526-3-almasrymina@google.com (mailing list archive) |
---|---|
State | Not Applicable |
Headers |
Received: from sy.mirrors.kernel.org ([147.75.48.161]) by linuxtv.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.96) (envelope-from <linux-media+bounces-12286-patchwork=linuxtv.org@vger.kernel.org>) id 1sCmD4-00043E-0a for patchwork@linuxtv.org; Thu, 30 May 2024 20:17:15 +0000 Received: from smtp.subspace.kernel.org (wormhole.subspace.kernel.org [52.25.139.140]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by sy.mirrors.kernel.org (Postfix) with ESMTPS id 5B4ADB22CC9 for <patchwork@linuxtv.org>; Thu, 30 May 2024 20:17:10 +0000 (UTC) Received: from localhost.localdomain (localhost.localdomain [127.0.0.1]) by smtp.subspace.kernel.org (Postfix) with ESMTP id 3259E158DBE; Thu, 30 May 2024 20:16:29 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=google.com header.i=@google.com header.b="A7QPoA+v" X-Original-To: linux-media@vger.kernel.org Received: from mail-yw1-f201.google.com (mail-yw1-f201.google.com [209.85.128.201]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id E36DD158A1A for <linux-media@vger.kernel.org>; Thu, 30 May 2024 20:16:23 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.128.201 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1717100187; cv=none; b=fSvsJeJ84VI657BNTJUtfRTmQO/nd20cYYpcb3nQc/fHwb1VNTxM0MdW1gKdZho6SdueMsz+U4eCfjgBAsCejOuoWQjhfF+qqQVCBSL5VkZuOe/nccAMEV8SKdIwgbmtg0BUFx1cddrEPSgWlmbpH9Mfxf0gGHmDLoEZS9gulYI= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1717100187; c=relaxed/simple; bh=aBnYje6A82XaEKFyLV9gCbV9fgLuPXwNN+l/HwJvBXA=; h=Date:In-Reply-To:Mime-Version:References:Message-ID:Subject:From: To:Cc:Content-Type; b=PcM0jg+9LEGPnmHEOK8/Za23UJawyXcMYpfDEnlrpCRgPm1+SOvJl4MWivFe7YGw5r1ZotCIONHfNILeaATDNT1eVx4k0R+UPOMNUDbzio+2zfv0hqCIy01cPfthdNrt+3M+ivw9+WEbZ93kIk11IgAnIUZVzSR8umOxudTU5Zs= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=reject dis=none) header.from=google.com; spf=pass smtp.mailfrom=flex--almasrymina.bounces.google.com; dkim=pass (2048-bit key) header.d=google.com header.i=@google.com header.b=A7QPoA+v; arc=none smtp.client-ip=209.85.128.201 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=reject dis=none) header.from=google.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=flex--almasrymina.bounces.google.com Received: by mail-yw1-f201.google.com with SMTP id 00721157ae682-629f8a92145so5090557b3.0 for <linux-media@vger.kernel.org>; Thu, 30 May 2024 13:16:23 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20230601; t=1717100183; x=1717704983; darn=vger.kernel.org; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:from:to:cc:subject:date:message-id:reply-to; bh=2EPevCT7/2IjjDXAWaiob99tAoIzY7exVAkr0lb7mAI=; b=A7QPoA+vWRgks12DbQcyyuCzLN3AdkvhIKcn2khvdMl+oNMIDXkWEet9lYrpi1Q/Jz bF3n1YnPASjfwoevU+4b/rzxB2yPMEW3pPXakBHexyiYswIcYMhsVQN8VkAn5MN1ngfG hecKLjyzCt6Vw8rgbgUqDwOkrvPmqd0nTi+GuU+tHApc0VgjhNfJxiJqIjW8cixs8caM bepHo1p4BLiSYqWXYC0QqFge2ApRXt3+99xoPRArxDlw36vWrcuPFVG4DYfV8wpY7Zlx 2iF3tUlX24Ju1paSp38Mwuh2zrjsWDY1IU34GX9ahBtcxAle+xJqrExJKFzM5DNMsFgE ZWvw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1717100183; x=1717704983; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=2EPevCT7/2IjjDXAWaiob99tAoIzY7exVAkr0lb7mAI=; b=fJZGYumDuw/8JHpFvP/2B+NR34mREKny1A3gEFuYn4fCldVYNyLYZH5PlOaXx7zxkB m1hX/eSc6jWqDlBl4uX6wZpbzeoEN6uq5Bm1okn2wqzfda7CXhcWLJXFyh/r3OkgFNrP WnPw1IKvWhtUFBH0yUqafx0yUtDipEWh0ckYoQPbF9Wjy8FAPKPXmuVXcxrLke40tjwD CtP18CQlOY3kXckCjrG+BGC2wTegyhipwz4PX/w8qB79fI5AwThrw3zbvr0CJEXRqkjt tPLPMa7cpXjeCQaP1JXP/T/6eP5JCA7OFGhLLb9QKQZz7lt6STgp66AWSt9dM7nWEsob sl4Q== X-Forwarded-Encrypted: i=1; AJvYcCWySlUG+kqGwSsedzrcRzJJQUdazxKGFYtaFZKuQzau+kd20kGJzkeMLGVlzwJSC1rwkv6Md8WK6UrAV0IoO5xJ/cEOmaQ58fDaRQI= X-Gm-Message-State: AOJu0YwdU4rfz57RzeVDd5GtKJmxuhCiNIEh50kj0eEFuCRIDLfLUZwM OizQpypcytzvI/pe5mOjBJpDRTgoLVMBH2geZzA8uEd8IwTlqf1jilGKvHX1LUel+YBt+RVWK8S TQg6oo4dlu4sOMGfsYzWe2g== X-Google-Smtp-Source: AGHT+IEzZ69JqoW6GJOeqts3Sq7g6h07tqQC0w1VVoiY60y4YSnBAwIT5aGEAkdPhLJOkoxRCW8vNj1B6RaErZFLlg== X-Received: from almasrymina.c.googlers.com ([fda3:e722:ac3:cc00:20:ed76:c0a8:4bc5]) (user=almasrymina job=sendgmr) by 2002:a05:690c:e:b0:62a:415b:a137 with SMTP id 00721157ae682-62c6bc22eecmr8577877b3.1.1717100182928; Thu, 30 May 2024 13:16:22 -0700 (PDT) Date: Thu, 30 May 2024 20:16:01 +0000 In-Reply-To: <20240530201616.1316526-1-almasrymina@google.com> Precedence: bulk X-Mailing-List: linux-media@vger.kernel.org List-Id: <linux-media.vger.kernel.org> List-Subscribe: <mailto:linux-media+subscribe@vger.kernel.org> List-Unsubscribe: <mailto:linux-media+unsubscribe@vger.kernel.org> Mime-Version: 1.0 References: <20240530201616.1316526-1-almasrymina@google.com> X-Mailer: git-send-email 2.45.1.288.g0e0cd299f1-goog Message-ID: <20240530201616.1316526-3-almasrymina@google.com> Subject: [PATCH net-next v10 02/14] net: page_pool: create hooks for custom page providers From: Mina Almasry <almasrymina@google.com> To: netdev@vger.kernel.org, linux-kernel@vger.kernel.org, linux-doc@vger.kernel.org, linux-alpha@vger.kernel.org, linux-mips@vger.kernel.org, linux-parisc@vger.kernel.org, sparclinux@vger.kernel.org, linux-trace-kernel@vger.kernel.org, linux-arch@vger.kernel.org, bpf@vger.kernel.org, linux-kselftest@vger.kernel.org, linux-media@vger.kernel.org, dri-devel@lists.freedesktop.org Cc: Mina Almasry <almasrymina@google.com>, "David S. Miller" <davem@davemloft.net>, Eric Dumazet <edumazet@google.com>, Jakub Kicinski <kuba@kernel.org>, Paolo Abeni <pabeni@redhat.com>, Donald Hunter <donald.hunter@gmail.com>, Jonathan Corbet <corbet@lwn.net>, Richard Henderson <richard.henderson@linaro.org>, Ivan Kokshaysky <ink@jurassic.park.msu.ru>, Matt Turner <mattst88@gmail.com>, Thomas Bogendoerfer <tsbogend@alpha.franken.de>, "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>, Helge Deller <deller@gmx.de>, Andreas Larsson <andreas@gaisler.com>, Jesper Dangaard Brouer <hawk@kernel.org>, Ilias Apalodimas <ilias.apalodimas@linaro.org>, Steven Rostedt <rostedt@goodmis.org>, Masami Hiramatsu <mhiramat@kernel.org>, Mathieu Desnoyers <mathieu.desnoyers@efficios.com>, Arnd Bergmann <arnd@arndb.de>, Alexei Starovoitov <ast@kernel.org>, Daniel Borkmann <daniel@iogearbox.net>, Andrii Nakryiko <andrii@kernel.org>, Martin KaFai Lau <martin.lau@linux.dev>, Eduard Zingerman <eddyz87@gmail.com>, Song Liu <song@kernel.org>, Yonghong Song <yonghong.song@linux.dev>, John Fastabend <john.fastabend@gmail.com>, KP Singh <kpsingh@kernel.org>, Stanislav Fomichev <sdf@google.com>, Hao Luo <haoluo@google.com>, Jiri Olsa <jolsa@kernel.org>, Steffen Klassert <steffen.klassert@secunet.com>, Herbert Xu <herbert@gondor.apana.org.au>, David Ahern <dsahern@kernel.org>, Willem de Bruijn <willemdebruijn.kernel@gmail.com>, Shuah Khan <shuah@kernel.org>, Sumit Semwal <sumit.semwal@linaro.org>, " =?utf-8?q?Christian_K=C3=B6nig?= " <christian.koenig@amd.com>, Pavel Begunkov <asml.silence@gmail.com>, David Wei <dw@davidwei.uk>, Jason Gunthorpe <jgg@ziepe.ca>, Yunsheng Lin <linyunsheng@huawei.com>, Shailend Chand <shailend@google.com>, Harshitha Ramamurthy <hramamurthy@google.com>, Shakeel Butt <shakeel.butt@linux.dev>, Jeroen de Borst <jeroendb@google.com>, Praveen Kaligineedi <pkaligineedi@google.com>, Christoph Hellwig <hch@infradead.org> Content-Type: text/plain; charset="UTF-8" X-LSpam-Score: -11.1 (-----------) X-LSpam-Report: No, score=-11.1 required=5.0 tests=ARC_SIGNED=0.001,ARC_VALID=-0.1,BAYES_00=-1.9,DKIMWL_WL_MED=-1,DKIM_SIGNED=0.1,DKIM_VALID=-0.1,DKIM_VALID_AU=-0.1,DMARC_PASS=-0.001,HEADER_FROM_DIFFERENT_DOMAINS=0.5,MAILING_LIST_MULTI=-1,SPF_HELO_NONE=0.001,SPF_PASS=-0.001,USER_IN_DEF_DKIM_WL=-7.5 autolearn=unavailable autolearn_force=no |
Series |
Device Memory TCP
|
|
Commit Message
Mina Almasry
May 30, 2024, 8:16 p.m. UTC
From: Jakub Kicinski <kuba@kernel.org> The page providers which try to reuse the same pages will need to hold onto the ref, even if page gets released from the pool - as in releasing the page from the pp just transfers the "ownership" reference from pp to the provider, and provider will wait for other references to be gone before feeding this page back into the pool. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Mina Almasry <almasrymina@google.com> --- - This is implemented by Jakub in his RFC: https://lore.kernel.org/netdev/f8270765-a27b-6ccf-33ea-cda097168d79@redhat.com/T/ I take no credit for the idea or implementation; I only added minor edits to make this workable with device memory TCP, and removed some hacky test code. This is a critical dependency of device memory TCP and thus I'm pulling it into this series to make it revewable and mergeable. - There is a pending discussion about the acceptance of the page_pool memory provider hooks: https://lore.kernel.org/netdev/20240403002053.2376017-3-almasrymina@google.com/ I'm unsure if the discussion has been resolved yet. Sending the series anyway to get reviews/feedback on the (unrelated) rest of the series. Cc: Christoph Hellwig <hch@infradead.org> v10: - Renamed alloc_pages -> alloc_netmems. alloc_pages is now a preprocessor macro, and reusing the string results in a build error. RFC v3 -> v1 - Removed unusued mem_provider. (Yunsheng). - Replaced memory_provider & mp_priv with netdev_rx_queue (Jakub). --- include/net/page_pool/types.h | 12 ++++++++++ net/core/page_pool.c | 43 +++++++++++++++++++++++++++++++---- 2 files changed, 50 insertions(+), 5 deletions(-)
Comments
On Thu, May 30, 2024 at 08:16:01PM +0000, Mina Almasry wrote: > I'm unsure if the discussion has been resolved yet. Sending the series > anyway to get reviews/feedback on the (unrelated) rest of the series. As far as I'm concerned it is not. I've not seen any convincing argument for more than page/folio allocator including larger order / huge page and dmabuf.
On Fri, May 31, 2024 at 10:35 PM Christoph Hellwig <hch@infradead.org> wrote: > > On Thu, May 30, 2024 at 08:16:01PM +0000, Mina Almasry wrote: > > I'm unsure if the discussion has been resolved yet. Sending the series > > anyway to get reviews/feedback on the (unrelated) rest of the series. > > As far as I'm concerned it is not. I've not seen any convincing > argument for more than page/folio allocator including larger order / > huge page and dmabuf. > Thanks Christoph, this particular patch series adds dmabuf, so I assume no objection there. I assume the objection is that you want the generic, extensible hooks removed. To be honest, I don't think the hooks are an integral part of the design, and at this point I think we've argued for them enough. I think we can easily achieve the same thing with just raw if statements in a couple of places. We can always add the hooks if and only if we actually justify many memory providers. Any objections to me removing the hooks and directing to memory allocations via simple if statements? Something like (very rough draft, doesn't compile): diff --git a/net/core/page_pool.c b/net/core/page_pool.c index 92be1aaf18ccc..2cc986455bce6 100644 --- a/net/core/page_pool.c +++ b/net/core/page_pool.c @@ -557,8 +557,8 @@ netmem_ref page_pool_alloc_netmem(struct page_pool *pool, gfp_t gfp) return netmem; /* Slow-path: cache empty, do real allocation */ - if (static_branch_unlikely(&page_pool_mem_providers) && pool->mp_ops) - netmem = pool->mp_ops->alloc_pages(pool, gfp); + if (unlikely(page_pool_is_dmabuf(pool))) + netmem = mp_dmabuf_devmem_alloc_pages(): else netmem = __page_pool_alloc_pages_slow(pool, gfp); return netmem;
On 6/3/24 15:17, Mina Almasry wrote: > On Fri, May 31, 2024 at 10:35 PM Christoph Hellwig <hch@infradead.org> wrote: >> >> On Thu, May 30, 2024 at 08:16:01PM +0000, Mina Almasry wrote: >>> I'm unsure if the discussion has been resolved yet. Sending the series >>> anyway to get reviews/feedback on the (unrelated) rest of the series. >> >> As far as I'm concerned it is not. I've not seen any convincing >> argument for more than page/folio allocator including larger order / >> huge page and dmabuf. >> > > Thanks Christoph, this particular patch series adds dmabuf, so I > assume no objection there. I assume the objection is that you want the > generic, extensible hooks removed. > > To be honest, I don't think the hooks are an integral part of the > design, and at this point I think we've argued for them enough. I > think we can easily achieve the same thing with just raw if statements > in a couple of places. We can always add the hooks if and only if we > actually justify many memory providers. > > Any objections to me removing the hooks and directing to memory > allocations via simple if statements? Something like (very rough > draft, doesn't compile): The question for Christoph is what exactly is the objection here? Why we would not be using well defined ops when we know there will be more users? Repeating what I said in the last thread, for io_uring it's used to implement the flow of buffers from userspace to the kernel, the ABI, which is orthogonal to the issue of what memory type it is and how it came there. And even if you mandate unnecessary dmabuf condoms for user memory in one form or another IMHO for no clear reason, the callbacks (or yet another if-else) would still be needed. Sure, Mina can drop and hard code devmem path to easy the pain for him and delay the discussion, but then shortly after I will be re-sending same shit. So, what's the convincing argument _not_ to have it? > > diff --git a/net/core/page_pool.c b/net/core/page_pool.c > index 92be1aaf18ccc..2cc986455bce6 100644 > --- a/net/core/page_pool.c > +++ b/net/core/page_pool.c > @@ -557,8 +557,8 @@ netmem_ref page_pool_alloc_netmem(struct page_pool > *pool, gfp_t gfp) > return netmem; > > /* Slow-path: cache empty, do real allocation */ > - if (static_branch_unlikely(&page_pool_mem_providers) && pool->mp_ops) > - netmem = pool->mp_ops->alloc_pages(pool, gfp); > + if (unlikely(page_pool_is_dmabuf(pool))) > + netmem = mp_dmabuf_devmem_alloc_pages(): > else > netmem = __page_pool_alloc_pages_slow(pool, gfp); > return netmem; > >
On Mon, Jun 3, 2024 at 7:52 AM Pavel Begunkov <asml.silence@gmail.com> wrote: > > On 6/3/24 15:17, Mina Almasry wrote: > > On Fri, May 31, 2024 at 10:35 PM Christoph Hellwig <hch@infradead.org> wrote: > >> > >> On Thu, May 30, 2024 at 08:16:01PM +0000, Mina Almasry wrote: > >>> I'm unsure if the discussion has been resolved yet. Sending the series > >>> anyway to get reviews/feedback on the (unrelated) rest of the series. > >> > >> As far as I'm concerned it is not. I've not seen any convincing > >> argument for more than page/folio allocator including larger order / > >> huge page and dmabuf. > >> > > > > Thanks Christoph, this particular patch series adds dmabuf, so I > > assume no objection there. I assume the objection is that you want the > > generic, extensible hooks removed. > > > > To be honest, I don't think the hooks are an integral part of the > > design, and at this point I think we've argued for them enough. I > > think we can easily achieve the same thing with just raw if statements > > in a couple of places. We can always add the hooks if and only if we > > actually justify many memory providers. > > > > Any objections to me removing the hooks and directing to memory > > allocations via simple if statements? Something like (very rough > > draft, doesn't compile): > > The question for Christoph is what exactly is the objection here? Why we > would not be using well defined ops when we know there will be more > users? Repeating what I said in the last thread, for io_uring it's used > to implement the flow of buffers from userspace to the kernel, the ABI, > which is orthogonal to the issue of what memory type it is and how it > came there. And even if you mandate unnecessary dmabuf condoms for user > memory in one form or another IMHO for no clear reason, the callbacks > (or yet another if-else) would still be needed. > > Sure, Mina can drop and hard code devmem path to easy the pain for > him and delay the discussion, but then shortly after I will be > re-sending same shit. You don't need to re-send the same ops again, right? You can add io uring support without ops. Something like: diff --git a/net/core/page_pool.c b/net/core/page_pool.c index 92be1aaf18ccc..2cc986455bce6 100644 --- a/net/core/page_pool.c +++ b/net/core/page_pool.c @@ -557,8 +557,8 @@ netmem_ref page_pool_alloc_netmem(struct page_pool *pool, gfp_t gfp) return netmem; /* Slow-path: cache empty, do real allocation */ - if (static_branch_unlikely(&page_pool_mem_providers) && pool->mp_ops) - netmem = pool->mp_ops->alloc_pages(pool, gfp); + if (unlikely(page_pool_is_dmabuf(pool))) + netmem = mp_dmabuf_devmem_alloc_pages(): + else if (unlikely(page_pool_is_iouring(pool))) + netmem = mp_io_uring_alloc_pages(): else netmem = __page_pool_alloc_pages_slow(pool, gfp); return netmem; So IMO, the ops themselves, which Christoph is repeatedly nacking, are not that important. I humbly think the energy should be spent convincing maintainers of the use case of io uring memory, not the ops. The ops are a cosmetic change to the code, and can be added later. Christoph is nacking the ops because it gives people too much rope [1]. But if you disagree and think the ops themselves are important for a reason I missed, I'm happy waiting until agreement is reached here. Sorry, just voicing my 2 cents. [1] https://lore.kernel.org/netdev/ZjjHUh1eINPg1wkn@infradead.org/
On Mon, Jun 03, 2024 at 07:17:05AM -0700, Mina Almasry wrote: > On Fri, May 31, 2024 at 10:35 PM Christoph Hellwig <hch@infradead.org> wrote: > > > > On Thu, May 30, 2024 at 08:16:01PM +0000, Mina Almasry wrote: > > > I'm unsure if the discussion has been resolved yet. Sending the series > > > anyway to get reviews/feedback on the (unrelated) rest of the series. > > > > As far as I'm concerned it is not. I've not seen any convincing > > argument for more than page/folio allocator including larger order / > > huge page and dmabuf. > > > > Thanks Christoph, this particular patch series adds dmabuf, so I > assume no objection there. I assume the objection is that you want the > generic, extensible hooks removed. Exactly! Note that this isn't a review of the dmabuf bits as there are people more qualified with me. > To be honest, I don't think the hooks are an integral part of the > design, and at this point I think we've argued for them enough. I > think we can easily achieve the same thing with just raw if statements > in a couple of places. We can always add the hooks if and only if we > actually justify many memory providers. > > Any objections to me removing the hooks and directing to memory > allocations via simple if statements? Something like (very rough > draft, doesn't compile): I like this approach, thanks! You might still want to keep the static key, though.
On Mon, Jun 03, 2024 at 03:52:32PM +0100, Pavel Begunkov wrote: > The question for Christoph is what exactly is the objection here? Why we > would not be using well defined ops when we know there will be more > users? The point is that there should be no more users. If you need another case you are doing something very wrong.
On 6/3/24 16:43, Mina Almasry wrote: > On Mon, Jun 3, 2024 at 7:52 AM Pavel Begunkov <asml.silence@gmail.com> wrote: >> >> On 6/3/24 15:17, Mina Almasry wrote: >>> On Fri, May 31, 2024 at 10:35 PM Christoph Hellwig <hch@infradead.org> wrote: >>>> >>>> On Thu, May 30, 2024 at 08:16:01PM +0000, Mina Almasry wrote: >>>>> I'm unsure if the discussion has been resolved yet. Sending the series >>>>> anyway to get reviews/feedback on the (unrelated) rest of the series. >>>> >>>> As far as I'm concerned it is not. I've not seen any convincing >>>> argument for more than page/folio allocator including larger order / >>>> huge page and dmabuf. >>>> >>> >>> Thanks Christoph, this particular patch series adds dmabuf, so I >>> assume no objection there. I assume the objection is that you want the >>> generic, extensible hooks removed. >>> >>> To be honest, I don't think the hooks are an integral part of the >>> design, and at this point I think we've argued for them enough. I >>> think we can easily achieve the same thing with just raw if statements >>> in a couple of places. We can always add the hooks if and only if we >>> actually justify many memory providers. >>> >>> Any objections to me removing the hooks and directing to memory >>> allocations via simple if statements? Something like (very rough >>> draft, doesn't compile): >> >> The question for Christoph is what exactly is the objection here? Why we >> would not be using well defined ops when we know there will be more >> users? Repeating what I said in the last thread, for io_uring it's used >> to implement the flow of buffers from userspace to the kernel, the ABI, >> which is orthogonal to the issue of what memory type it is and how it >> came there. And even if you mandate unnecessary dmabuf condoms for user >> memory in one form or another IMHO for no clear reason, the callbacks >> (or yet another if-else) would still be needed. >> >> Sure, Mina can drop and hard code devmem path to easy the pain for >> him and delay the discussion, but then shortly after I will be >> re-sending same shit. > > You don't need to re-send the same ops again, right? You can add io > uring support without ops. Something like: > > diff --git a/net/core/page_pool.c b/net/core/page_pool.c > index 92be1aaf18ccc..2cc986455bce6 100644 > --- a/net/core/page_pool.c > +++ b/net/core/page_pool.c > @@ -557,8 +557,8 @@ netmem_ref page_pool_alloc_netmem(struct page_pool > *pool, gfp_t gfp) > return netmem; > > /* Slow-path: cache empty, do real allocation */ > - if (static_branch_unlikely(&page_pool_mem_providers) && pool->mp_ops) > - netmem = pool->mp_ops->alloc_pages(pool, gfp); > + if (unlikely(page_pool_is_dmabuf(pool))) > + netmem = mp_dmabuf_devmem_alloc_pages(): > + else if (unlikely(page_pool_is_iouring(pool))) > + netmem = mp_io_uring_alloc_pages(): > else > netmem = __page_pool_alloc_pages_slow(pool, gfp); > return netmem; > > So IMO, the ops themselves, which Christoph is repeatedly nacking, are > not that important. > > I humbly think the energy should be spent convincing maintainers of > the use case of io uring memory, not the ops. The ops are a cosmetic I haven't seen any arguments against from the (net) maintainers so far. Nor I see any objection against callbacks from them (considering that either option adds an if). And just not to confuse folks, it's just user pages, not some weird special io_uring memory. > change to the code, and can be added later. Christoph is nacking the > ops because it gives people too much rope [1]. Yes, it is cosmetic, just as much as removing it is a cosmetic change. You can apply same "too much rope" argument basically to anything. Take io_uring, nothing would change in the process, it'd still be sent to net and reviewed exactly same way, while being less clean, with poorer subsystem separation, allowing custom formats / argument list, etc. I think it's cleaner with callbacks, Mr. Christoph has other beliefs and keeps coercing to them, even though from time to time it backfires for the author, just personal experience. > But if you disagree and think the ops themselves are important for a > reason I missed, I'm happy waiting until agreement is reached here. > Sorry, just voicing my 2 cents. > > [1] https://lore.kernel.org/netdev/ZjjHUh1eINPg1wkn@infradead.org/ >
On 6/5/24 09:24, Christoph Hellwig wrote: > On Mon, Jun 03, 2024 at 03:52:32PM +0100, Pavel Begunkov wrote: >> The question for Christoph is what exactly is the objection here? Why we >> would not be using well defined ops when we know there will be more >> users? > > The point is that there should be no more users. If you need another Does that "No more" stops after devmem tcp? Or after io_uring proposal? For the latter I explained why io_uring has to do it for good design and that's it's not even related to the memory type used. > case you are doing something very wrong. That's not a very illuminating answer
On 6/7/24 7:42 AM, Pavel Begunkov wrote: > I haven't seen any arguments against from the (net) maintainers so > far. Nor I see any objection against callbacks from them (considering > that either option adds an if). I have said before I do not understand why the dmabuf paradigm is not sufficient for both device memory and host memory. A less than ideal control path to put hostmem in a dmabuf wrapper vs extra checks and changes in the datapath. The former should always be preferred. I also do not understand why the ifq cache and overloading xdp functions have stuck around; I always thought both were added by Jonathan to simplify kernel ports during early POC days.
On Fri, Jun 07, 2024 at 08:27:29AM -0600, David Ahern wrote: > On 6/7/24 7:42 AM, Pavel Begunkov wrote: > > I haven't seen any arguments against from the (net) maintainers so > > far. Nor I see any objection against callbacks from them (considering > > that either option adds an if). > > I have said before I do not understand why the dmabuf paradigm is not > sufficient for both device memory and host memory. A less than ideal > control path to put hostmem in a dmabuf wrapper vs extra checks and > changes in the datapath. The former should always be preferred. I think Pavel explained this - his project is principally to replace the lifetime policy of pages in the data plane. He wants to change when a page is considered available for re-allocation because userspace may continue to use the page after the netstack thinks it is done with it. It sounds like having a different source of the pages is the less important part. IMHO it seems to compose poorly if you can only use the io_uring lifecycle model with io_uring registered memory, and not with DMABUF memory registered through Mina's mechanism. Jason
On 6/7/24 15:27, David Ahern wrote: > On 6/7/24 7:42 AM, Pavel Begunkov wrote: >> I haven't seen any arguments against from the (net) maintainers so >> far. Nor I see any objection against callbacks from them (considering >> that either option adds an if). > > I have said before I do not understand why the dmabuf paradigm is not > sufficient for both device memory and host memory. A less than ideal > control path to put hostmem in a dmabuf wrapper vs extra checks and > changes in the datapath. The former should always be preferred. If we're talking about types of memory specifically, I'm not strictly against wrapping into dmabuf in kernel, but that just doesn't give anything. But the main reason for allocations there is the difference in approaches to the api. With io_uring the allocation callback is responsible for getting buffers back from the user (via a shared ring). No locking for the ring, and buffers are already in the context (napi) where they would be consumed from. Removes some headaches for the user (like batching before returning buffers), and should go better with smaller buffers and such. > I also do not understand why the ifq cache I'm not sure what you mean by ifq cache. Can you elaborate? > and overloading xdp functions Assuming it's about setup via xdp, it was marked for remaking in RFCs for longer than desired but it's gone now in our tree (but maybe not in the latest series). > have stuck around; I always thought both were added by Jonathan to > simplify kernel ports during early POC days.
On 6/7/24 16:42, Pavel Begunkov wrote: > On 6/7/24 15:27, David Ahern wrote: >> On 6/7/24 7:42 AM, Pavel Begunkov wrote: >>> I haven't seen any arguments against from the (net) maintainers so >>> far. Nor I see any objection against callbacks from them (considering >>> that either option adds an if). >> >> I have said before I do not understand why the dmabuf paradigm is not >> sufficient for both device memory and host memory. A less than ideal >> control path to put hostmem in a dmabuf wrapper vs extra checks and >> changes in the datapath. The former should always be preferred. > > If we're talking about types of memory specifically, I'm not strictly > against wrapping into dmabuf in kernel, but that just doesn't give > anything. And the reason I don't have too strong of an opinion on that is mainly because it's just setup/cleanup path. > But the main reason for allocations there is the difference in > approaches to the api. With io_uring the allocation callback is > responsible for getting buffers back from the user (via a shared > ring). No locking for the ring, and buffers are already in the > context (napi) where they would be consumed from. Removes some > headaches for the user (like batching before returning buffers), > and should go better with smaller buffers and such. > >> I also do not understand why the ifq cache > > I'm not sure what you mean by ifq cache. Can you elaborate? > >> and overloading xdp functions > > Assuming it's about setup via xdp, it was marked for remaking in > RFCs for longer than desired but it's gone now in our tree (but > maybe not in the latest series). > >> have stuck around; I always thought both were added by Jonathan to >> simplify kernel ports during early POC days. >
On Fri, Jun 7, 2024 at 8:47 AM Pavel Begunkov <asml.silence@gmail.com> wrote: > > On 6/7/24 16:42, Pavel Begunkov wrote: > > On 6/7/24 15:27, David Ahern wrote: > >> On 6/7/24 7:42 AM, Pavel Begunkov wrote: > >>> I haven't seen any arguments against from the (net) maintainers so > >>> far. Nor I see any objection against callbacks from them (considering > >>> that either option adds an if). > >> > >> I have said before I do not understand why the dmabuf paradigm is not > >> sufficient for both device memory and host memory. A less than ideal > >> control path to put hostmem in a dmabuf wrapper vs extra checks and > >> changes in the datapath. The former should always be preferred. > > > > If we're talking about types of memory specifically, I'm not strictly > > against wrapping into dmabuf in kernel, but that just doesn't give > > anything. > > And the reason I don't have too strong of an opinion on that is > mainly because it's just setup/cleanup path. > I agree wrapping io uring in dmabuf seems to be an unnecessary detour. I never understood the need or upside to do that, but it could be a lack of understanding on my part. However, the concern that David brings up may materialize. I've had to spend a lot of time minimizing or justifying checks to the code with page pool benchmarks that detect even 1 cycle regressions. You may be asked to run the same benchmarks and minimize similar overhead. The benchmark in question is Jesper's bench_page_pool_simple. I've forked it and applied it on top of net-next here: https://github.com/mina/linux/commit/927596f87ab5791a8a6ba8597ba2189747396e54 As io_uring ZC comes close to merging, I suspect it would be good to run this to understand the regression in the fast path, if any. If there are no to little regressions, I have no concerns over io uring memory not being wrapped in dmabufs, and David may agree as well. -- Thanks, Mina
On 2024-06-07 17:27, David Ahern wrote: > I also do not understand why the ifq cache and overloading xdp functions > have stuck around; I always thought both were added by Jonathan to > simplify kernel ports during early POC days. Setting up an Rx queue for ZC w/ a different pp will be done properly using the new queue API that Mina merged recently. Those custom XDP hooks will be gone in a non-RFC patchset.
On 2024-06-07 17:52, Jason Gunthorpe wrote: > IMHO it seems to compose poorly if you can only use the io_uring > lifecycle model with io_uring registered memory, and not with DMABUF > memory registered through Mina's mechanism. By this, do you mean io_uring must be exclusively used to use this feature? And you'd rather see the two decoupled, so userspace can register w/ say dmabuf then pass it to io_uring? > > Jason
On 6/10/24 01:37, David Wei wrote: > On 2024-06-07 17:52, Jason Gunthorpe wrote: >> IMHO it seems to compose poorly if you can only use the io_uring >> lifecycle model with io_uring registered memory, and not with DMABUF >> memory registered through Mina's mechanism. > > By this, do you mean io_uring must be exclusively used to use this > feature? > > And you'd rather see the two decoupled, so userspace can register w/ say > dmabuf then pass it to io_uring? Personally, I have no clue what Jason means. You can just as well say that it's poorly composable that write(2) to a disk cannot post a completion into a XDP ring, or a netlink socket, or io_uring's main completion queue, or name any other API. The devmem TCP callback can implement it in a way feasible to the project, but it cannot directly post events to an unrelated API like io_uring. And devmem attaches buffers to a socket, for which a ring for returning buffers might even be a nuisance.
On 6/7/24 17:59, Mina Almasry wrote: > On Fri, Jun 7, 2024 at 8:47 AM Pavel Begunkov <asml.silence@gmail.com> wrote: >> >> On 6/7/24 16:42, Pavel Begunkov wrote: >>> On 6/7/24 15:27, David Ahern wrote: >>>> On 6/7/24 7:42 AM, Pavel Begunkov wrote: >>>>> I haven't seen any arguments against from the (net) maintainers so >>>>> far. Nor I see any objection against callbacks from them (considering >>>>> that either option adds an if). >>>> >>>> I have said before I do not understand why the dmabuf paradigm is not >>>> sufficient for both device memory and host memory. A less than ideal >>>> control path to put hostmem in a dmabuf wrapper vs extra checks and >>>> changes in the datapath. The former should always be preferred. >>> >>> If we're talking about types of memory specifically, I'm not strictly >>> against wrapping into dmabuf in kernel, but that just doesn't give >>> anything. >> >> And the reason I don't have too strong of an opinion on that is >> mainly because it's just setup/cleanup path. >> > > I agree wrapping io uring in dmabuf seems to be an unnecessary detour. > I never understood the need or upside to do that, but it could be a > lack of understanding on my part. > > However, the concern that David brings up may materialize. I've had to > spend a lot of time minimizing or justifying checks to the code with > page pool benchmarks that detect even 1 cycle regressions. You may be > asked to run the same benchmarks and minimize similar overhead. > > The benchmark in question is Jesper's bench_page_pool_simple. I've > forked it and applied it on top of net-next here: > https://github.com/mina/linux/commit/927596f87ab5791a8a6ba8597ba2189747396e54 > > As io_uring ZC comes close to merging, I suspect it would be good to > run this to understand the regression in the fast path, if any. If > there are no to little regressions, I have no concerns over io uring > memory not being wrapped in dmabufs, and David may agree as well. That's the easiest part as io_uring only reusing call points you added for devmem and thus doesn't add anything new on top to hot paths.
On Mon, Jun 10, 2024 at 02:07:01AM +0100, Pavel Begunkov wrote: > On 6/10/24 01:37, David Wei wrote: > > On 2024-06-07 17:52, Jason Gunthorpe wrote: > > > IMHO it seems to compose poorly if you can only use the io_uring > > > lifecycle model with io_uring registered memory, and not with DMABUF > > > memory registered through Mina's mechanism. > > > > By this, do you mean io_uring must be exclusively used to use this > > feature? > > > > And you'd rather see the two decoupled, so userspace can register w/ say > > dmabuf then pass it to io_uring? > > Personally, I have no clue what Jason means. You can just as > well say that it's poorly composable that write(2) to a disk > cannot post a completion into a XDP ring, or a netlink socket, > or io_uring's main completion queue, or name any other API. There is no reason you shouldn't be able to use your fast io_uring completion and lifecycle flow with DMABUF backed memory. Those are not widly different things and there is good reason they should work together. Pretending they are totally different just because two different people wrote them is a very siloed view. > The devmem TCP callback can implement it in a way feasible to > the project, but it cannot directly post events to an unrelated > API like io_uring. And devmem attaches buffers to a socket, > for which a ring for returning buffers might even be a nuisance. If you can't compose your io_uring completion mechanism with a DMABUF provided backing store then I think it needs more work. Jason
Am 10.06.24 um 14:16 schrieb Jason Gunthorpe: > On Mon, Jun 10, 2024 at 02:07:01AM +0100, Pavel Begunkov wrote: >> On 6/10/24 01:37, David Wei wrote: >>> On 2024-06-07 17:52, Jason Gunthorpe wrote: >>>> IMHO it seems to compose poorly if you can only use the io_uring >>>> lifecycle model with io_uring registered memory, and not with DMABUF >>>> memory registered through Mina's mechanism. >>> By this, do you mean io_uring must be exclusively used to use this >>> feature? >>> >>> And you'd rather see the two decoupled, so userspace can register w/ say >>> dmabuf then pass it to io_uring? >> Personally, I have no clue what Jason means. You can just as >> well say that it's poorly composable that write(2) to a disk >> cannot post a completion into a XDP ring, or a netlink socket, >> or io_uring's main completion queue, or name any other API. > There is no reason you shouldn't be able to use your fast io_uring > completion and lifecycle flow with DMABUF backed memory. Those are not > widly different things and there is good reason they should work > together. Well there is the fundamental problem that you can't use io_uring to implement the semantics necessary for a dma_fence. That's why we had to reject the io_uring work on DMA-buf sharing from Google a few years ago. But this only affects the dma_fence synchronization part of DMA-buf, but *not* the general buffer sharing. Regards, Christian. > > Pretending they are totally different just because two different > people wrote them is a very siloed view. > >> The devmem TCP callback can implement it in a way feasible to >> the project, but it cannot directly post events to an unrelated >> API like io_uring. And devmem attaches buffers to a socket, >> for which a ring for returning buffers might even be a nuisance. > If you can't compose your io_uring completion mechanism with a DMABUF > provided backing store then I think it needs more work. > > Jason
On 6/10/24 6:16 AM, Jason Gunthorpe wrote: > On Mon, Jun 10, 2024 at 02:07:01AM +0100, Pavel Begunkov wrote: >> On 6/10/24 01:37, David Wei wrote: >>> On 2024-06-07 17:52, Jason Gunthorpe wrote: >>>> IMHO it seems to compose poorly if you can only use the io_uring >>>> lifecycle model with io_uring registered memory, and not with DMABUF >>>> memory registered through Mina's mechanism. >>> >>> By this, do you mean io_uring must be exclusively used to use this >>> feature? >>> >>> And you'd rather see the two decoupled, so userspace can register w/ say >>> dmabuf then pass it to io_uring? >> >> Personally, I have no clue what Jason means. You can just as >> well say that it's poorly composable that write(2) to a disk >> cannot post a completion into a XDP ring, or a netlink socket, >> or io_uring's main completion queue, or name any other API. > > There is no reason you shouldn't be able to use your fast io_uring > completion and lifecycle flow with DMABUF backed memory. Those are not > widly different things and there is good reason they should work > together. > > Pretending they are totally different just because two different > people wrote them is a very siloed view. > >> The devmem TCP callback can implement it in a way feasible to >> the project, but it cannot directly post events to an unrelated >> API like io_uring. And devmem attaches buffers to a socket, >> for which a ring for returning buffers might even be a nuisance. > > If you can't compose your io_uring completion mechanism with a DMABUF > provided backing store then I think it needs more work. > exactly. io_uring, page_pool, dmabuf - all kernel building blocks for solutions. This why I was pushing for Mina's set not to be using the name `devmem` - it is but one type of memory and with dmabuf it should not matter if it is gpu or host (or something else later on - cxl?).
On Mon, Jun 10, 2024 at 5:38 AM Christian König <christian.koenig@amd.com> wrote: > > Am 10.06.24 um 14:16 schrieb Jason Gunthorpe: > > On Mon, Jun 10, 2024 at 02:07:01AM +0100, Pavel Begunkov wrote: > >> On 6/10/24 01:37, David Wei wrote: > >>> On 2024-06-07 17:52, Jason Gunthorpe wrote: > >>>> IMHO it seems to compose poorly if you can only use the io_uring > >>>> lifecycle model with io_uring registered memory, and not with DMABUF > >>>> memory registered through Mina's mechanism. > >>> By this, do you mean io_uring must be exclusively used to use this > >>> feature? > >>> > >>> And you'd rather see the two decoupled, so userspace can register w/ say > >>> dmabuf then pass it to io_uring? > >> Personally, I have no clue what Jason means. You can just as > >> well say that it's poorly composable that write(2) to a disk > >> cannot post a completion into a XDP ring, or a netlink socket, > >> or io_uring's main completion queue, or name any other API. > > There is no reason you shouldn't be able to use your fast io_uring > > completion and lifecycle flow with DMABUF backed memory. Those are not > > widly different things and there is good reason they should work > > together. > > Well there is the fundamental problem that you can't use io_uring to > implement the semantics necessary for a dma_fence. > > That's why we had to reject the io_uring work on DMA-buf sharing from > Google a few years ago. > Any chance someone can link me to this? io_uring, as far as my primitive understanding goes, is not yet very adopted at Google, and I'm curious what this effort is.
On Mon, Jun 10, 2024 at 02:38:18PM +0200, Christian König wrote: > Am 10.06.24 um 14:16 schrieb Jason Gunthorpe: > > On Mon, Jun 10, 2024 at 02:07:01AM +0100, Pavel Begunkov wrote: > > > On 6/10/24 01:37, David Wei wrote: > > > > On 2024-06-07 17:52, Jason Gunthorpe wrote: > > > > > IMHO it seems to compose poorly if you can only use the io_uring > > > > > lifecycle model with io_uring registered memory, and not with DMABUF > > > > > memory registered through Mina's mechanism. > > > > By this, do you mean io_uring must be exclusively used to use this > > > > feature? > > > > > > > > And you'd rather see the two decoupled, so userspace can register w/ say > > > > dmabuf then pass it to io_uring? > > > Personally, I have no clue what Jason means. You can just as > > > well say that it's poorly composable that write(2) to a disk > > > cannot post a completion into a XDP ring, or a netlink socket, > > > or io_uring's main completion queue, or name any other API. > > There is no reason you shouldn't be able to use your fast io_uring > > completion and lifecycle flow with DMABUF backed memory. Those are not > > widly different things and there is good reason they should work > > together. > > Well there is the fundamental problem that you can't use io_uring to > implement the semantics necessary for a dma_fence. > > That's why we had to reject the io_uring work on DMA-buf sharing from Google > a few years ago. > > But this only affects the dma_fence synchronization part of DMA-buf, but > *not* the general buffer sharing. More precisely, it only impacts the userspace/data access implicit synchronization part of dma-buf. For tracking buffer movements like on invalidations/refault with a dynamic dma-buf importer/exporter I think the dma-fence rules are acceptable. At least they've been for rdma drivers. But the escape hatch is to (temporarily) pin the dma-buf, which is exactly what direct I/O also does when accessing pages. So aside from the still unsolved question on how we should account/track pinned dma-buf, there shouldn't be an issue. Or at least I'm failing to see one. And for synchronization to data access the dma-fence stuff on dma-buf is anyway rather deprecated on the gpu side too, exactly because of all these limitations. On the gpu side we've been moving to free-standing drm_syncobj instead, but those are fairly gpu specific and any other subsystem should be able to just reuse what they have already to signal transaction completions. Cheers, Sima > > Regards, > Christian. > > > > > Pretending they are totally different just because two different > > people wrote them is a very siloed view. > > > > > The devmem TCP callback can implement it in a way feasible to > > > the project, but it cannot directly post events to an unrelated > > > API like io_uring. And devmem attaches buffers to a socket, > > > for which a ring for returning buffers might even be a nuisance. > > If you can't compose your io_uring completion mechanism with a DMABUF > > provided backing store then I think it needs more work. > > > > Jason >
On 6/10/24 16:16, David Ahern wrote: > On 6/10/24 6:16 AM, Jason Gunthorpe wrote: >> On Mon, Jun 10, 2024 at 02:07:01AM +0100, Pavel Begunkov wrote: >>> On 6/10/24 01:37, David Wei wrote: >>>> On 2024-06-07 17:52, Jason Gunthorpe wrote: >>>>> IMHO it seems to compose poorly if you can only use the io_uring >>>>> lifecycle model with io_uring registered memory, and not with DMABUF >>>>> memory registered through Mina's mechanism. >>>> >>>> By this, do you mean io_uring must be exclusively used to use this >>>> feature? >>>> >>>> And you'd rather see the two decoupled, so userspace can register w/ say >>>> dmabuf then pass it to io_uring? >>> >>> Personally, I have no clue what Jason means. You can just as >>> well say that it's poorly composable that write(2) to a disk >>> cannot post a completion into a XDP ring, or a netlink socket, >>> or io_uring's main completion queue, or name any other API. >> >> There is no reason you shouldn't be able to use your fast io_uring >> completion and lifecycle flow with DMABUF backed memory. Those are not >> widly different things and there is good reason they should work >> together. Let's not mix up devmem TCP and dmabuf specifically, as I see it your question was concerning the latter: "... DMABUF memory registered through Mina's mechanism". io_uring's zcrx can trivially get dmabuf support in future, as mentioned it's mostly the setup side. ABI, buffer workflow and some details is a separate issue, and I don't see how further integration aside from what we're already sharing is beneficial, on opposite it'll complicate things. >> Pretending they are totally different just because two different >> people wrote them is a very siloed view. io_uring zcrx and devmem? They are not, nobody is saying otherwise, _very_ similar approaches if anything but with different API, which is the reason we already use common infra. >>> The devmem TCP callback can implement it in a way feasible to >>> the project, but it cannot directly post events to an unrelated >>> API like io_uring. And devmem attaches buffers to a socket, >>> for which a ring for returning buffers might even be a nuisance. >> >> If you can't compose your io_uring completion mechanism with a DMABUF >> provided backing store then I think it needs more work. As per above, it conflates devmem TCP with dmabuf. > exactly. io_uring, page_pool, dmabuf - all kernel building blocks for > solutions. This why I was pushing for Mina's set not to be using the > name `devmem` - it is but one type of memory and with dmabuf it should > not matter if it is gpu or host (or something else later on - cxl?).
On 6/10/24 16:41, Mina Almasry wrote: > On Mon, Jun 10, 2024 at 5:38 AM Christian König > <christian.koenig@amd.com> wrote: >> >> Am 10.06.24 um 14:16 schrieb Jason Gunthorpe: >>> On Mon, Jun 10, 2024 at 02:07:01AM +0100, Pavel Begunkov wrote: >>>> On 6/10/24 01:37, David Wei wrote: >>>>> On 2024-06-07 17:52, Jason Gunthorpe wrote: >>>>>> IMHO it seems to compose poorly if you can only use the io_uring >>>>>> lifecycle model with io_uring registered memory, and not with DMABUF >>>>>> memory registered through Mina's mechanism. >>>>> By this, do you mean io_uring must be exclusively used to use this >>>>> feature? >>>>> >>>>> And you'd rather see the two decoupled, so userspace can register w/ say >>>>> dmabuf then pass it to io_uring? >>>> Personally, I have no clue what Jason means. You can just as >>>> well say that it's poorly composable that write(2) to a disk >>>> cannot post a completion into a XDP ring, or a netlink socket, >>>> or io_uring's main completion queue, or name any other API. >>> There is no reason you shouldn't be able to use your fast io_uring >>> completion and lifecycle flow with DMABUF backed memory. Those are not >>> widly different things and there is good reason they should work >>> together. >> >> Well there is the fundamental problem that you can't use io_uring to >> implement the semantics necessary for a dma_fence. >> >> That's why we had to reject the io_uring work on DMA-buf sharing from >> Google a few years ago. >> > > Any chance someone can link me to this? io_uring, as far as my > primitive understanding goes, is not yet very adopted at Google, and > I'm curious what this effort is. I'm curious as well, I don't remember it floating anywhere in mailing lists. The only discussion I recall was about DRM_IOCTL_SYNCOBJ_TIMELINE_WAIT, but it didn't get through only because someone pushed for evenfds.
On Mon, Jun 10, 2024 at 08:20:08PM +0100, Pavel Begunkov wrote: > On 6/10/24 16:16, David Ahern wrote: > > > There is no reason you shouldn't be able to use your fast io_uring > > > completion and lifecycle flow with DMABUF backed memory. Those are not > > > widly different things and there is good reason they should work > > > together. > > Let's not mix up devmem TCP and dmabuf specifically, as I see it > your question was concerning the latter: "... DMABUF memory registered > through Mina's mechanism". io_uring's zcrx can trivially get dmabuf > support in future, as mentioned it's mostly the setup side. ABI, > buffer workflow and some details is a separate issue, and I don't > see how further integration aside from what we're already sharing > is beneficial, on opposite it'll complicate things. Again, I am talking about composability here, duplicating the DMABUF stuff into io_uring is not composable, it is just duplicating things. It does not match the view that there should be two distinct layers here, one that provides the pages and one that manages the lifecycle. As HCH pushes for pages either come from the allocator and get to use the struct folio or the come from a dmabuf and they don't. That is it, the only two choices. The iouring stuff is trying to confuse the source of the pages with the lifecycle - which is surely convenient, but is why Christoph is opposing it. Jason
On Mon, Jun 10, 2024 at 02:38:18PM +0200, Christian König wrote: > Well there is the fundamental problem that you can't use io_uring to > implement the semantics necessary for a dma_fence. What is the exact problem there?
On Mon, Jun 10, 2024 at 09:16:43AM -0600, David Ahern wrote: > > exactly. io_uring, page_pool, dmabuf - all kernel building blocks for > solutions. This why I was pushing for Mina's set not to be using the > name `devmem` - it is but one type of memory and with dmabuf it should > not matter if it is gpu or host (or something else later on - cxl?). While not really realted to the rest of the discussion I agree. It really is dmabuf integration now, so let's call it that?
On Fri, Jun 07, 2024 at 02:45:55PM +0100, Pavel Begunkov wrote: > On 6/5/24 09:24, Christoph Hellwig wrote: > > On Mon, Jun 03, 2024 at 03:52:32PM +0100, Pavel Begunkov wrote: > > > The question for Christoph is what exactly is the objection here? Why we > > > would not be using well defined ops when we know there will be more > > > users? > > > > The point is that there should be no more users. If you need another > > Does that "No more" stops after devmem tcp? There should be no other memory source other than the page allocator and dmabuf. If you need different life time control for your zero copy proposal don't mix that up with the contol of the memory source.
Am 11.06.24 um 08:25 schrieb Christoph Hellwig: > On Mon, Jun 10, 2024 at 02:38:18PM +0200, Christian König wrote: >> Well there is the fundamental problem that you can't use io_uring to >> implement the semantics necessary for a dma_fence. > What is the exact problem there? It's an intentional design decision that dma_fences can be waited on with quite a bunch of locks held. Including the DMA-buf reservation lock, mmap lock, anything page fault related, shrinker etc... When you give userspace control over the signaling of a dma_fence then that has the same effect as returning to userspace with those locks held - you can basically trivially deadlock the system. I think nearly a dozen implementations fell into that trap: https://www.kernel.org/doc/html/v5.9/driver-api/dma-buf.html#indefinite-dma-fences It's well understood and documented by now why this approach doesn't work. So not much of an issue any more, we just have to reject implementations from time to time which try doing the same thing again. Regards, Christian.
On Mon, Jun 10, 2024 at 11:26 PM Christoph Hellwig <hch@infradead.org> wrote: > > On Mon, Jun 10, 2024 at 09:16:43AM -0600, David Ahern wrote: > > > > exactly. io_uring, page_pool, dmabuf - all kernel building blocks for > > solutions. This why I was pushing for Mina's set not to be using the > > name `devmem` - it is but one type of memory and with dmabuf it should > > not matter if it is gpu or host (or something else later on - cxl?). > > While not really realted to the rest of the discussion I agree. > It really is dmabuf integration now, so let's call it that? My mental model is that the feature folks care about is the ability to use TCP with device memory, and dmabuf is an implementation detail that is the format that device memory is packaged in. Although not likely given this discussion, in theory we could want to extend devmem TCP to support p2pdma for nvme, or some other format if a new one arises in device drivers. I also think it's more obvious to an end user what 'devmem TCP' aims to do rather than 'dmabuf TCP' especially if the user is not a kernel developer familiar with dmabuf.
On Mon, Jun 10, 2024 at 3:15 PM Jason Gunthorpe <jgg@ziepe.ca> wrote: > > On Mon, Jun 10, 2024 at 08:20:08PM +0100, Pavel Begunkov wrote: > > On 6/10/24 16:16, David Ahern wrote: > > > > > There is no reason you shouldn't be able to use your fast io_uring > > > > completion and lifecycle flow with DMABUF backed memory. Those are not > > > > widly different things and there is good reason they should work > > > > together. > > > > Let's not mix up devmem TCP and dmabuf specifically, as I see it > > your question was concerning the latter: "... DMABUF memory registered > > through Mina's mechanism". io_uring's zcrx can trivially get dmabuf > > support in future, as mentioned it's mostly the setup side. ABI, > > buffer workflow and some details is a separate issue, and I don't > > see how further integration aside from what we're already sharing > > is beneficial, on opposite it'll complicate things. > > Again, I am talking about composability here, duplicating the DMABUF > stuff into io_uring is not composable, it is just duplicating things. > > It does not match the view that there should be two distinct layers > here, one that provides the pages and one that manages the > lifecycle. As HCH pushes for pages either come from the allocator and > get to use the struct folio or the come from a dmabuf and they > don't. That is it, the only two choices. > > The iouring stuff is trying to confuse the source of the pages with > the lifecycle - which is surely convenient, but is why Christoph is > opposing it. > Just curious: in Pavel's effort, io_uring - which is not a device - is trying to share memory with the page_pool, which is also not a device. And Pavel is being asked to wrap the memory in a dmabuf. Is dmabuf going to be the kernel's standard for any memory sharing between any 2 components in the future, even when they're not devices? As in you expect dmabuf exporters which are not devices to be added to the kernel? Currently the only dmabuf exporter which is not a device (AFAIK) is udmabuf, which is used for testing and emulation, not really a production thing, I think.
On Tue, Jun 11, 2024 at 11:09:15AM -0700, Mina Almasry wrote: > Just curious: in Pavel's effort, io_uring - which is not a device - is > trying to share memory with the page_pool, which is also not a device. > And Pavel is being asked to wrap the memory in a dmabuf. Is dmabuf > going to be the kernel's standard for any memory sharing between any 2 > components in the future, even when they're not devices? dmabuf is how we are refcounting non-struct page memory, there is nothing about it that says it has to be MMIO memory, or even that the memory doesn't have struct pages. All it says is that the memory is alive according to dmabuf refcounting rules. And the importer obviously don't get to touch the underlying folios, if any. Jason
On 6/12/24 6:06 AM, Jason Gunthorpe wrote: > On Tue, Jun 11, 2024 at 11:09:15AM -0700, Mina Almasry wrote: > >> Just curious: in Pavel's effort, io_uring - which is not a device - is >> trying to share memory with the page_pool, which is also not a device. >> And Pavel is being asked to wrap the memory in a dmabuf. Is dmabuf >> going to be the kernel's standard for any memory sharing between any 2 >> components in the future, even when they're not devices? > > dmabuf is how we are refcounting non-struct page memory, there is > nothing about it that says it has to be MMIO memory, or even that the > memory doesn't have struct pages. > > All it says is that the memory is alive according to dmabuf > refcounting rules. And the importer obviously don't get to touch the > underlying folios, if any. > In addition, the io_uring developers should be considering the use case of device memory. There is no reason for this design to be limited to host memory. io_uring should not care (it is not peeking inside the memory buffers); it is just memory references. One of io_uring's primary benefits is avoiding system calls. io_uring works with TCP sockets. Let it work with any dmabuf without concern of memory type. The performance benefits the Google crowd sees with system call based apps should be even better with io_uring. Focus on primitives, building blocks with solid APIs for other subsystems to leverage and let them be wired up in ways you cannot imagine today.
On 6/11/24 07:34, Christoph Hellwig wrote: > On Fri, Jun 07, 2024 at 02:45:55PM +0100, Pavel Begunkov wrote: >> On 6/5/24 09:24, Christoph Hellwig wrote: >>> On Mon, Jun 03, 2024 at 03:52:32PM +0100, Pavel Begunkov wrote: >>>> The question for Christoph is what exactly is the objection here? Why we >>>> would not be using well defined ops when we know there will be more >>>> users? >>> >>> The point is that there should be no more users. If you need another >> >> Does that "No more" stops after devmem tcp? > > There should be no other memory source other than the page allocator > and dmabuf. If you need different life time control for your > zero copy proposal don't mix that up with the contol of the memory > source. No idea how I'm mixing it up when I was explaining exactly this all along as well as that the callback (and presumably the call site in general) you was so eager to nack is used exactly to implement the life time control.
On 6/10/24 23:15, Jason Gunthorpe wrote: > On Mon, Jun 10, 2024 at 08:20:08PM +0100, Pavel Begunkov wrote: >> On 6/10/24 16:16, David Ahern wrote: > >>>> There is no reason you shouldn't be able to use your fast io_uring >>>> completion and lifecycle flow with DMABUF backed memory. Those are not >>>> widly different things and there is good reason they should work >>>> together. >> >> Let's not mix up devmem TCP and dmabuf specifically, as I see it >> your question was concerning the latter: "... DMABUF memory registered >> through Mina's mechanism". io_uring's zcrx can trivially get dmabuf >> support in future, as mentioned it's mostly the setup side. ABI, >> buffer workflow and some details is a separate issue, and I don't >> see how further integration aside from what we're already sharing >> is beneficial, on opposite it'll complicate things. > > Again, I am talking about composability here, duplicating the DMABUF > stuff into io_uring is not composable, it is just duplicating things. Ok, then registering, say, a dmabuf via devmem TCP and then using it in io_uring. Let's say we make devmem TCP API to be able to register a dmabuf without using it, from where io_uring can take ownership over it and use in the flow. And I strongly believe the same memory region/dmabuf should never be used by both at the same time and hence lifetime of any such memory should be exclusively bound to io_uring. That leaves the user api, where to add memory you need to create a netlink socket and pass everything through it, which is an extra step, and then letting know io_uring that it can use the memory, not forgetting to eject it from netlink. That's not a good api as far as it goes with io_uring. I don't think slight duplicating of registration is a problem when the upside is much cleaner API. Internals, however, can be easily shared. We can even say that the net stack should provide helpers like init_page_pool_from_dmabuf_fd() and now allow poking into related bits aside from it (initialising net_iov / etc.). > It does not match the view that there should be two distinct layers > here, one that provides the pages and one that manages the > lifecycle. As HCH pushes for pages either come from the allocator and > get to use the struct folio or the come from a dmabuf and they > don't. That is it, the only two choices. > > The iouring stuff is trying to confuse the source of the pages with > the lifecycle - which is surely convenient, but is why Christoph is > opposing it.
On Mon, Jun 17, 2024 at 07:04:43PM +0100, Pavel Begunkov wrote: > > There should be no other memory source other than the page allocator > > and dmabuf. If you need different life time control for your > > zero copy proposal don't mix that up with the contol of the memory > > source. > > No idea how I'm mixing it up when I was explaining exactly this > all along as well as that the callback (and presumably the call > site in general) you was so eager to nack is used exactly to > implement the life time control. And that's exactly my point. You want to use one callback to mix allocation source and life time control. That's the perfect recipe to create an un-extensible un-composable mess.
On 6/18/24 07:43, Christoph Hellwig wrote: > On Mon, Jun 17, 2024 at 07:04:43PM +0100, Pavel Begunkov wrote: >>> There should be no other memory source other than the page allocator >>> and dmabuf. If you need different life time control for your >>> zero copy proposal don't mix that up with the contol of the memory >>> source. >> >> No idea how I'm mixing it up when I was explaining exactly this >> all along as well as that the callback (and presumably the call >> site in general) you was so eager to nack is used exactly to >> implement the life time control. > > And that's exactly my point. You want to use one callback to mix > allocation source and life time control. No, it only takes the role of life time control and doesn't care about the source. The allocation source step with corresponding initialisation happens separately and priorly, at initialisation time. > That's the perfect recipe > to create an un-extensible un-composable mess
diff --git a/include/net/page_pool/types.h b/include/net/page_pool/types.h index b088d131aeb0d..b038b838f042f 100644 --- a/include/net/page_pool/types.h +++ b/include/net/page_pool/types.h @@ -51,6 +51,7 @@ struct pp_alloc_cache { * @dev: device, for DMA pre-mapping purposes * @netdev: netdev this pool will serve (leave as NULL if none or multiple) * @napi: NAPI which is the sole consumer of pages, otherwise NULL + * @queue: struct netdev_rx_queue this page_pool is being created for. * @dma_dir: DMA mapping direction * @max_len: max DMA sync memory size for PP_FLAG_DMA_SYNC_DEV * @offset: DMA sync address offset for PP_FLAG_DMA_SYNC_DEV @@ -64,6 +65,7 @@ struct page_pool_params { int nid; struct device *dev; struct napi_struct *napi; + struct netdev_rx_queue *queue; enum dma_data_direction dma_dir; unsigned int max_len; unsigned int offset; @@ -127,6 +129,13 @@ struct page_pool_stats { }; #endif +struct memory_provider_ops { + int (*init)(struct page_pool *pool); + void (*destroy)(struct page_pool *pool); + struct page *(*alloc_netmems)(struct page_pool *pool, gfp_t gfp); + bool (*release_page)(struct page_pool *pool, struct page *page); +}; + struct page_pool { struct page_pool_params_fast p; @@ -193,6 +202,9 @@ struct page_pool { */ struct ptr_ring ring; + void *mp_priv; + const struct memory_provider_ops *mp_ops; + #ifdef CONFIG_PAGE_POOL_STATS /* recycle stats are per-cpu to avoid locking */ struct page_pool_recycle_stats __percpu *recycle_stats; diff --git a/net/core/page_pool.c b/net/core/page_pool.c index f4444b4e39e63..251c9356c9202 100644 --- a/net/core/page_pool.c +++ b/net/core/page_pool.c @@ -26,6 +26,8 @@ #include "page_pool_priv.h" +static DEFINE_STATIC_KEY_FALSE(page_pool_mem_providers); + #define DEFER_TIME (msecs_to_jiffies(1000)) #define DEFER_WARN_INTERVAL (60 * HZ) @@ -186,6 +188,7 @@ static int page_pool_init(struct page_pool *pool, int cpuid) { unsigned int ring_qsize = 1024; /* Default */ + int err; page_pool_struct_check(); @@ -267,7 +270,22 @@ static int page_pool_init(struct page_pool *pool, if (pool->dma_map) get_device(pool->p.dev); + if (pool->mp_ops) { + err = pool->mp_ops->init(pool); + if (err) { + pr_warn("%s() mem-provider init failed %d\n", __func__, + err); + goto free_ptr_ring; + } + + static_branch_inc(&page_pool_mem_providers); + } + return 0; + +free_ptr_ring: + ptr_ring_cleanup(&pool->ring, NULL); + return err; } static void page_pool_uninit(struct page_pool *pool) @@ -569,7 +587,10 @@ struct page *page_pool_alloc_pages(struct page_pool *pool, gfp_t gfp) return page; /* Slow-path: cache empty, do real allocation */ - page = __page_pool_alloc_pages_slow(pool, gfp); + if (static_branch_unlikely(&page_pool_mem_providers) && pool->mp_ops) + page = pool->mp_ops->alloc_netmems(pool, gfp); + else + page = __page_pool_alloc_pages_slow(pool, gfp); return page; } EXPORT_SYMBOL(page_pool_alloc_pages); @@ -627,10 +648,13 @@ void __page_pool_release_page_dma(struct page_pool *pool, struct page *page) void page_pool_return_page(struct page_pool *pool, struct page *page) { int count; + bool put; - __page_pool_release_page_dma(pool, page); - - page_pool_clear_pp_info(page); + put = true; + if (static_branch_unlikely(&page_pool_mem_providers) && pool->mp_ops) + put = pool->mp_ops->release_page(pool, page); + else + __page_pool_release_page_dma(pool, page); /* This may be the last page returned, releasing the pool, so * it is not safe to reference pool afterwards. @@ -638,7 +662,10 @@ void page_pool_return_page(struct page_pool *pool, struct page *page) count = atomic_inc_return_relaxed(&pool->pages_state_release_cnt); trace_page_pool_state_release(pool, page, count); - put_page(page); + if (put) { + page_pool_clear_pp_info(page); + put_page(page); + } /* An optimization would be to call __free_pages(page, pool->p.order) * knowing page is not part of page-cache (thus avoiding a * __page_cache_release() call). @@ -937,6 +964,12 @@ static void __page_pool_destroy(struct page_pool *pool) page_pool_unlist(pool); page_pool_uninit(pool); + + if (pool->mp_ops) { + pool->mp_ops->destroy(pool); + static_branch_dec(&page_pool_mem_providers); + } + kfree(pool); }