[net-next,v18,07/14] memory-provider: dmabuf devmem memory provider

Message ID 20240805212536.2172174-8-almasrymina@google.com (mailing list archive)
State Not Applicable
Headers
Series Device Memory TCP |

Commit Message

Mina Almasry Aug. 5, 2024, 9:25 p.m. UTC
  Implement a memory provider that allocates dmabuf devmem in the form of
net_iov.

The provider receives a reference to the struct netdev_dmabuf_binding
via the pool->mp_priv pointer. The driver needs to set this pointer for
the provider in the net_iov.

The provider obtains a reference on the netdev_dmabuf_binding which
guarantees the binding and the underlying mapping remains alive until
the provider is destroyed.

Usage of PP_FLAG_DMA_MAP is required for this memory provide such that
the page_pool can provide the driver with the dma-addrs of the devmem.

Support for PP_FLAG_DMA_SYNC_DEV is omitted for simplicity & p.order !=
0.

Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Kaiyuan Zhang <kaiyuanz@google.com>
Signed-off-by: Mina Almasry <almasrymina@google.com>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>

---

v17:
- Use ASSERT_RTNL (Jakub)

v16:
- Add DEBUG_NET_WARN_ON_ONCE(!rtnl_is_locked()), to catch cases if
  page_pool_init without rtnl_locking when the queue is provided. In
  this case, the queue configuration may be changed while we're initing
  the page_pool, which could be a race.

v13:
- Return on warning (Pavel).
- Fixed pool->recycle_stats not being freed on error (Pavel).
- Applied reviewed-by from Pavel.

v11:
- Rebase to not use the ops. (Christoph)

v8:
- Use skb_frag_size instead of frag->bv_len to fix patch-by-patch build
  error

v6:
- refactor new memory provider functions into net/core/devmem.c (Pavel)

v2:
- Disable devmem for p.order != 0

v1:
- static_branch check in page_is_page_pool_iov() (Willem & Paolo).
- PP_DEVMEM -> PP_IOV (David).
- Require PP_FLAG_DMA_MAP (Jakub).

---
 include/net/mp_dmabuf_devmem.h |  44 ++++++++++++++
 include/net/netmem.h           |  15 +++++
 include/net/page_pool/types.h  |   2 +
 net/core/devmem.c              |  78 +++++++++++++++++++++++++
 net/core/page_pool.c           | 102 ++++++++++++++++++++++++---------
 net/core/page_pool_priv.h      |   3 +
 6 files changed, 218 insertions(+), 26 deletions(-)
 create mode 100644 include/net/mp_dmabuf_devmem.h
  

Comments

Jakub Kicinski Aug. 6, 2024, 8:59 p.m. UTC | #1
On Mon,  5 Aug 2024 21:25:20 +0000 Mina Almasry wrote:
> +	if (pool->p.queue) {
> +		/* We rely on rtnl_lock()ing to make sure netdev_rx_queue
> +		 * configuration doesn't change while we're initializing the
> +		 * page_pool.
> +		 */
> +		ASSERT_RTNL();
> +		pool->mp_priv = pool->p.queue->mp_params.mp_priv;

How do you know that the driver:
 - supports net_iov at all (let's not make implicit assumptions based
   on presence of queue API);
 - supports net_iov in current configuration (eg header-data split is
   enabled)
 - supports net_iov for _this_ pool (all drivers must have separate
   buffer pools for headers and data for this to work, some will use
   page pool for both)

What comes to mind is adding an "I can gobble up net_iovs from this
pool" flag in page pool params (the struct that comes from the driver),
and then on the installation path we can check if after queue reset
the refcount of the binding has increased. If it did - driver has
created a pool as we expected, otherwise - fail, something must be off.
Maybe that's a bit hacky?
  
Mina Almasry Aug. 8, 2024, 8:36 p.m. UTC | #2
On Tue, Aug 6, 2024 at 4:59 PM Jakub Kicinski <kuba@kernel.org> wrote:
>
...
> On Mon,  5 Aug 2024 21:25:20 +0000 Mina Almasry wrote:
> > +     if (pool->p.queue) {
> > +             /* We rely on rtnl_lock()ing to make sure netdev_rx_queue
> > +              * configuration doesn't change while we're initializing the
> > +              * page_pool.
> > +              */
> > +             ASSERT_RTNL();
> > +             pool->mp_priv = pool->p.queue->mp_params.mp_priv;
>

Hi Jakub,

Sorry for the late reply, it took a bit of code reading to understand
what you mean with the deactivation request on the other patch, but I
think I got it down and have a patch on the way.

> How do you know that the driver:
>  - supports net_iov at all (let's not make implicit assumptions based
>    on presence of queue API);
>  - supports net_iov in current configuration (eg header-data split is
>    enabled)
>  - supports net_iov for _this_ pool (all drivers must have separate
>    buffer pools for headers and data for this to work, some will use
>    page pool for both)
>
> What comes to mind is adding an "I can gobble up net_iovs from this
> pool" flag in page pool params (the struct that comes from the driver),

This already sorta exists in the current iteration, although maybe in
an implicit way. As written, drivers need to set params.queue,
otherwise core will not attempt to grab the mp information from
params.queue. A driver can set params.queue for its data pages pool
and not set it for the headers pool. AFAICT that deals with all 3
issues you present above.

The awkward part is if params.queue starts getting used for other
reasons rather than passing mp configuration, but as of today that's
not the case so I didn't add the secondary flag. If you want a second
flag to be added preemptively, I can do that, no problem. Can you
confirm params.queue is not good enough?

> and then on the installation path we can check if after queue reset
> the refcount of the binding has increased. If it did - driver has
> created a pool as we expected, otherwise - fail, something must be off.
> Maybe that's a bit hacky?

What's missing is for core to check at binding time that the driver
supports net_iov. I had relied on the implicit presence of the
queue-API.

What you're proposing works, but AFAICT it's quite hacky, yes. I
basically need to ASSERT_RTNL in net_devmem_binding_get() to ensure
nothing can increment the refcount while the binding is happening so
that the refcount check is valid.

I think a less hacky approach is to add a function to the queue-API
like ndo_queue_supported_features(), which lets the driver declare
that it supports net_iov at a given rx queue. However I'm open to both
approaches. What do you prefer?
  
Jakub Kicinski Aug. 9, 2024, 2:24 a.m. UTC | #3
On Thu, 8 Aug 2024 16:36:24 -0400 Mina Almasry wrote:
> > How do you know that the driver:
> >  - supports net_iov at all (let's not make implicit assumptions based
> >    on presence of queue API);
> >  - supports net_iov in current configuration (eg header-data split is
> >    enabled)
> >  - supports net_iov for _this_ pool (all drivers must have separate
> >    buffer pools for headers and data for this to work, some will use
> >    page pool for both)
> >
> > What comes to mind is adding an "I can gobble up net_iovs from this
> > pool" flag in page pool params (the struct that comes from the driver),  
> 
> This already sorta exists in the current iteration, although maybe in
> an implicit way. As written, drivers need to set params.queue,
> otherwise core will not attempt to grab the mp information from
> params.queue. A driver can set params.queue for its data pages pool
> and not set it for the headers pool. AFAICT that deals with all 3
> issues you present above.
> 
> The awkward part is if params.queue starts getting used for other
> reasons rather than passing mp configuration, but as of today that's
> not the case so I didn't add the secondary flag. If you want a second
> flag to be added preemptively, I can do that, no problem. Can you
> confirm params.queue is not good enough?

I'd prefer a flag. The setting queue in a param struct is not a good
API for conveying that the page pool is for netmem payloads only.

> > and then on the installation path we can check if after queue reset
> > the refcount of the binding has increased. If it did - driver has
> > created a pool as we expected, otherwise - fail, something must be off.
> > Maybe that's a bit hacky?  
> 
> What's missing is for core to check at binding time that the driver
> supports net_iov. I had relied on the implicit presence of the
> queue-API.
> 
> What you're proposing works, but AFAICT it's quite hacky, yes. I
> basically need to ASSERT_RTNL in net_devmem_binding_get() to ensure
> nothing can increment the refcount while the binding is happening so
> that the refcount check is valid.

True. Shooting from the hip, but we could walk the page pools of the
netdev and find the one that has the right mp installed, and matches
queue? The page pools are on a list hooked up to the netdev, trivial
to walk.

> I think a less hacky approach is to add a function to the queue-API
> like ndo_queue_supported_features(), which lets the driver declare
> that it supports net_iov at a given rx queue. However I'm open to both
> approaches. What do you prefer?

I kinda like trying to query the page pools more, because it's both
fewer driver changes, and it actually validates that the driver did 
the right thing based on outcomes. Driver callback may have bugs.

If you prefer strongly - fine, but hm.
  
Mina Almasry Aug. 9, 2024, 2:10 p.m. UTC | #4
On Thu, Aug 8, 2024 at 10:24 PM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Thu, 8 Aug 2024 16:36:24 -0400 Mina Almasry wrote:
> > > How do you know that the driver:
> > >  - supports net_iov at all (let's not make implicit assumptions based
> > >    on presence of queue API);
> > >  - supports net_iov in current configuration (eg header-data split is
> > >    enabled)
> > >  - supports net_iov for _this_ pool (all drivers must have separate
> > >    buffer pools for headers and data for this to work, some will use
> > >    page pool for both)
> > >
> > > What comes to mind is adding an "I can gobble up net_iovs from this
> > > pool" flag in page pool params (the struct that comes from the driver),
> >
> > This already sorta exists in the current iteration, although maybe in
> > an implicit way. As written, drivers need to set params.queue,
> > otherwise core will not attempt to grab the mp information from
> > params.queue. A driver can set params.queue for its data pages pool
> > and not set it for the headers pool. AFAICT that deals with all 3
> > issues you present above.
> >
> > The awkward part is if params.queue starts getting used for other
> > reasons rather than passing mp configuration, but as of today that's
> > not the case so I didn't add the secondary flag. If you want a second
> > flag to be added preemptively, I can do that, no problem. Can you
> > confirm params.queue is not good enough?
>
> I'd prefer a flag. The setting queue in a param struct is not a good
> API for conveying that the page pool is for netmem payloads only.
>
> > > and then on the installation path we can check if after queue reset
> > > the refcount of the binding has increased. If it did - driver has
> > > created a pool as we expected, otherwise - fail, something must be off.
> > > Maybe that's a bit hacky?
> >
> > What's missing is for core to check at binding time that the driver
> > supports net_iov. I had relied on the implicit presence of the
> > queue-API.
> >
> > What you're proposing works, but AFAICT it's quite hacky, yes. I
> > basically need to ASSERT_RTNL in net_devmem_binding_get() to ensure
> > nothing can increment the refcount while the binding is happening so
> > that the refcount check is valid.
>
> True. Shooting from the hip, but we could walk the page pools of the
> netdev and find the one that has the right mp installed, and matches
> queue? The page pools are on a list hooked up to the netdev, trivial
> to walk.
>

I think this is good, and it doesn't seem hacky to me, because we can
check the page_pools of the netdev while we hold rtnl, so we can be
sure nothing is messing with the pp configuration in the meantime.
Like you say below it does validate the driver rather than rely on the
driver saying it's doing the right thing. I'll look into putting this
in the next version.


--
Thanks,
Mina
  
Pavel Begunkov Aug. 9, 2024, 3:45 p.m. UTC | #5
On 8/9/24 15:10, Mina Almasry wrote:
> On Thu, Aug 8, 2024 at 10:24 PM Jakub Kicinski <kuba@kernel.org> wrote:
>>
>> On Thu, 8 Aug 2024 16:36:24 -0400 Mina Almasry wrote:
>>>> How do you know that the driver:
>>>>   - supports net_iov at all (let's not make implicit assumptions based
>>>>     on presence of queue API);
>>>>   - supports net_iov in current configuration (eg header-data split is
>>>>     enabled)
>>>>   - supports net_iov for _this_ pool (all drivers must have separate
>>>>     buffer pools for headers and data for this to work, some will use
>>>>     page pool for both)
>>>>
>>>> What comes to mind is adding an "I can gobble up net_iovs from this
>>>> pool" flag in page pool params (the struct that comes from the driver),
>>>
>>> This already sorta exists in the current iteration, although maybe in
>>> an implicit way. As written, drivers need to set params.queue,
>>> otherwise core will not attempt to grab the mp information from
>>> params.queue. A driver can set params.queue for its data pages pool
>>> and not set it for the headers pool. AFAICT that deals with all 3
>>> issues you present above.
>>>
>>> The awkward part is if params.queue starts getting used for other
>>> reasons rather than passing mp configuration, but as of today that's
>>> not the case so I didn't add the secondary flag. If you want a second
>>> flag to be added preemptively, I can do that, no problem. Can you
>>> confirm params.queue is not good enough?
>>
>> I'd prefer a flag. The setting queue in a param struct is not a good
>> API for conveying that the page pool is for netmem payloads only.
>>
>>>> and then on the installation path we can check if after queue reset
>>>> the refcount of the binding has increased. If it did - driver has
>>>> created a pool as we expected, otherwise - fail, something must be off.
>>>> Maybe that's a bit hacky?
>>>
>>> What's missing is for core to check at binding time that the driver
>>> supports net_iov. I had relied on the implicit presence of the
>>> queue-API.
>>>
>>> What you're proposing works, but AFAICT it's quite hacky, yes. I
>>> basically need to ASSERT_RTNL in net_devmem_binding_get() to ensure
>>> nothing can increment the refcount while the binding is happening so
>>> that the refcount check is valid.
>>
>> True. Shooting from the hip, but we could walk the page pools of the
>> netdev and find the one that has the right mp installed, and matches
>> queue? The page pools are on a list hooked up to the netdev, trivial
>> to walk.
>>
> 
> I think this is good, and it doesn't seem hacky to me, because we can
> check the page_pools of the netdev while we hold rtnl, so we can be
> sure nothing is messing with the pp configuration in the meantime.
> Like you say below it does validate the driver rather than rely on the
> driver saying it's doing the right thing. I'll look into putting this
> in the next version.

Why not have a flag set by the driver and advertising whether it
supports providers or not, which should be checked for instance in
netdev_rx_queue_restart()? If set, the driver should do the right
thing. That's in addition to a new pp_params flag explicitly telling
if pp should use providers. It's more explicit and feels a little
less hacky.
  
Jakub Kicinski Aug. 10, 2024, 3:52 a.m. UTC | #6
On Fri, 9 Aug 2024 16:45:50 +0100 Pavel Begunkov wrote:
> > I think this is good, and it doesn't seem hacky to me, because we can
> > check the page_pools of the netdev while we hold rtnl, so we can be
> > sure nothing is messing with the pp configuration in the meantime.
> > Like you say below it does validate the driver rather than rely on the
> > driver saying it's doing the right thing. I'll look into putting this
> > in the next version.  
> 
> Why not have a flag set by the driver and advertising whether it
> supports providers or not, which should be checked for instance in
> netdev_rx_queue_restart()? If set, the driver should do the right
> thing. That's in addition to a new pp_params flag explicitly telling
> if pp should use providers. It's more explicit and feels a little
> less hacky.

You mean like I suggested in the previous two emails? :)

Given how easy the check is to implement, I think it's worth
adding as a sanity check. But the flag should be the main API,
if the sanity check starts to be annoying we'll ditch it.
  
Mina Almasry Aug. 11, 2024, 2:21 a.m. UTC | #7
On Fri, Aug 9, 2024 at 11:52 PM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Fri, 9 Aug 2024 16:45:50 +0100 Pavel Begunkov wrote:
> > > I think this is good, and it doesn't seem hacky to me, because we can
> > > check the page_pools of the netdev while we hold rtnl, so we can be
> > > sure nothing is messing with the pp configuration in the meantime.
> > > Like you say below it does validate the driver rather than rely on the
> > > driver saying it's doing the right thing. I'll look into putting this
> > > in the next version.
> >
> > Why not have a flag set by the driver and advertising whether it
> > supports providers or not, which should be checked for instance in
> > netdev_rx_queue_restart()? If set, the driver should do the right
> > thing. That's in addition to a new pp_params flag explicitly telling
> > if pp should use providers. It's more explicit and feels a little
> > less hacky.
>
> You mean like I suggested in the previous two emails? :)
>
> Given how easy the check is to implement, I think it's worth
> adding as a sanity check. But the flag should be the main API,
> if the sanity check starts to be annoying we'll ditch it.

I think we're talking about 2 slightly different flags, AFAIU.

Pavel and I are suggesting the driver reports "I support memory
providers" directly to core (via the queue-api or what not), and we
check that flag directly in netdev_rx_queue_restart(), and fail
immediately if the support is not there.

Jakub is suggesting a page_pool_params flag which lets the driver
report "I support memory providers". If the driver doesn't support it
but core is trying to configure that, then the page_pool_create will
fail, which will cause the queue API operation
(ndo_queue_alloc_mem_alloc) to fail, which causes
netdev_rx_queue_restart() to fail.

Both are fine, I don't see any extremely strong reason to pick one of
the other. I prefer Jakub's suggestion, just because it's closer to
the page_pool and may be more reusable in the future. I'll err on the
side of that unless I hear strong preference to the contrary.

I also think the additional check that Jakub is requesting is easy to
implement and unobjectionable. It would let core validate that the
driver did actually create the page_pool with the memory provider. I
think one of the goals of the queue API was to allow core to do more
validation on driver configuration anyway.
  
Pavel Begunkov Aug. 11, 2024, 9:51 p.m. UTC | #8
On 8/11/24 03:21, Mina Almasry wrote:
> On Fri, Aug 9, 2024 at 11:52 PM Jakub Kicinski <kuba@kernel.org> wrote:
>>
>> On Fri, 9 Aug 2024 16:45:50 +0100 Pavel Begunkov wrote:
>>>> I think this is good, and it doesn't seem hacky to me, because we can
>>>> check the page_pools of the netdev while we hold rtnl, so we can be
>>>> sure nothing is messing with the pp configuration in the meantime.
>>>> Like you say below it does validate the driver rather than rely on the
>>>> driver saying it's doing the right thing. I'll look into putting this
>>>> in the next version.
>>>
>>> Why not have a flag set by the driver and advertising whether it
>>> supports providers or not, which should be checked for instance in
>>> netdev_rx_queue_restart()? If set, the driver should do the right
>>> thing. That's in addition to a new pp_params flag explicitly telling
>>> if pp should use providers. It's more explicit and feels a little
>>> less hacky.
>>
>> You mean like I suggested in the previous two emails? :)
>>
>> Given how easy the check is to implement, I think it's worth
>> adding as a sanity check. But the flag should be the main API,
>> if the sanity check starts to be annoying we'll ditch it.
> 
> I think we're talking about 2 slightly different flags, AFAIU.>
> Pavel and I are suggesting the driver reports "I support memory
> providers" directly to core (via the queue-api or what not), and we
> check that flag directly in netdev_rx_queue_restart(), and fail
> immediately if the support is not there.

I might've misread Jakub, but yes, I believe it's different. It'd
communicate about support for providers to upper layers, so we can
fail even before attempting to allocate a new queue and init a
page pool.

> Jakub is suggesting a page_pool_params flag which lets the driver
> report "I support memory providers". If the driver doesn't support it
> but core is trying to configure that, then the page_pool_create will
> fail, which will cause the queue API operation
> (ndo_queue_alloc_mem_alloc) to fail, which causes
> netdev_rx_queue_restart() to fail.

And I'm not against this way either if we explicitly get an error
back instead of trying to figure it out post-factum like by
checking the references and possibly reverting the allocation.
Maybe that's where I was confused, and that refcount thing was
suggested as a WARN_ONCE?

FWIW, I think it warrants two flags. The first saying that the
driver supports providers at all:

page_pool_init() {
	if (rxq->mp_params)
		if (!(flags & PP_PROVIDERS_SUPPORTED))
			goto fail;
}

And the second telling whether the driver wants to install
providers for this particular page pool, so if there is a
separate pool for headers we can set it with plain old kernel
pages.

payload_pool = page_pool_create(rqx, PP_PROVIDERS_SUPPORTED);
header_pool = page_pool_create(rqx, PP_PROVIDERS_SUPPORTED |
                                     PP_IGNORE_PROVIDERS);

(or invert the flag). That's assuming page_pool_params::queue is
a generic thing and we don't want to draw equivalence between
it and memory providers.
  
Jakub Kicinski Aug. 12, 2024, 5:57 p.m. UTC | #9
On Sun, 11 Aug 2024 22:51:13 +0100 Pavel Begunkov wrote:
> > I think we're talking about 2 slightly different flags, AFAIU.>
> > Pavel and I are suggesting the driver reports "I support memory
> > providers" directly to core (via the queue-api or what not), and we
> > check that flag directly in netdev_rx_queue_restart(), and fail
> > immediately if the support is not there.  
> 
> I might've misread Jakub, but yes, I believe it's different. It'd
> communicate about support for providers to upper layers, so we can
> fail even before attempting to allocate a new queue and init a
> page pool.

Got it. Since allocating memory happens before stopping traffic
I think it's acceptable to stick to a single flag.

> > Jakub is suggesting a page_pool_params flag which lets the driver
> > report "I support memory providers". If the driver doesn't support it
> > but core is trying to configure that, then the page_pool_create will
> > fail, which will cause the queue API operation
> > (ndo_queue_alloc_mem_alloc) to fail, which causes
> > netdev_rx_queue_restart() to fail.  
> 
> And I'm not against this way either if we explicitly get an error
> back instead of trying to figure it out post-factum like by
> checking the references and possibly reverting the allocation.
> Maybe that's where I was confused, and that refcount thing was
> suggested as a WARN_ONCE?

Yup, the refcount (now: check of the page pool list) was meant
as a WARN_ONCE() to catch bad drivers.

> FWIW, I think it warrants two flags. The first saying that the
> driver supports providers at all:
> 
> page_pool_init() {
> 	if (rxq->mp_params)
> 		if (!(flags & PP_PROVIDERS_SUPPORTED))
> 			goto fail;
> }
> 
> And the second telling whether the driver wants to install
> providers for this particular page pool, so if there is a
> separate pool for headers we can set it with plain old kernel
> pages.

The implementation of the queue API should be resilient against
failures in alloc, and not being MP capable is just a form of 
alloc failure. I don't see the upside of double-flag. 

> payload_pool = page_pool_create(rqx, PP_PROVIDERS_SUPPORTED);
> header_pool = page_pool_create(rqx, PP_PROVIDERS_SUPPORTED |
>                                      PP_IGNORE_PROVIDERS);

Also don't see the upside of the explicit "non-capable" flag,
but I haven't thought of that. Is there any use?

One important note. The flag should not be tied to memory providers
but rather to netmem, IOW unreadable memory. MP is an internal detail,
the important fact from the driver-facing API perspective is that the
driver doesn't need struct pages.

> (or invert the flag). That's assuming page_pool_params::queue is
> a generic thing and we don't want to draw equivalence between
> it and memory providers.
  
Mina Almasry Aug. 12, 2024, 6:55 p.m. UTC | #10
On Mon, Aug 12, 2024 at 1:57 PM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Sun, 11 Aug 2024 22:51:13 +0100 Pavel Begunkov wrote:
> > > I think we're talking about 2 slightly different flags, AFAIU.>
> > > Pavel and I are suggesting the driver reports "I support memory
> > > providers" directly to core (via the queue-api or what not), and we
> > > check that flag directly in netdev_rx_queue_restart(), and fail
> > > immediately if the support is not there.
> >
> > I might've misread Jakub, but yes, I believe it's different. It'd
> > communicate about support for providers to upper layers, so we can
> > fail even before attempting to allocate a new queue and init a
> > page pool.
>
> Got it. Since allocating memory happens before stopping traffic
> I think it's acceptable to stick to a single flag.
>
> > > Jakub is suggesting a page_pool_params flag which lets the driver
> > > report "I support memory providers". If the driver doesn't support it
> > > but core is trying to configure that, then the page_pool_create will
> > > fail, which will cause the queue API operation
> > > (ndo_queue_alloc_mem_alloc) to fail, which causes
> > > netdev_rx_queue_restart() to fail.
> >
> > And I'm not against this way either if we explicitly get an error
> > back instead of trying to figure it out post-factum like by
> > checking the references and possibly reverting the allocation.
> > Maybe that's where I was confused, and that refcount thing was
> > suggested as a WARN_ONCE?
>
> Yup, the refcount (now: check of the page pool list) was meant
> as a WARN_ONCE() to catch bad drivers.
>
> > FWIW, I think it warrants two flags. The first saying that the
> > driver supports providers at all:
> >
> > page_pool_init() {
> >       if (rxq->mp_params)
> >               if (!(flags & PP_PROVIDERS_SUPPORTED))
> >                       goto fail;
> > }
> >
> > And the second telling whether the driver wants to install
> > providers for this particular page pool, so if there is a
> > separate pool for headers we can set it with plain old kernel
> > pages.
>
> The implementation of the queue API should be resilient against
> failures in alloc, and not being MP capable is just a form of
> alloc failure. I don't see the upside of double-flag.
>
> > payload_pool = page_pool_create(rqx, PP_PROVIDERS_SUPPORTED);
> > header_pool = page_pool_create(rqx, PP_PROVIDERS_SUPPORTED |
> >                                      PP_IGNORE_PROVIDERS);
>
> Also don't see the upside of the explicit "non-capable" flag,
> but I haven't thought of that. Is there any use?
>

There are 2 things we're trying to accomplish:

1. Drivers need to be able to say "I support unreadable netmem".
Failure to report unreadable netmem support should cause the netlink
API to fail when the user tries to bind dmabuf/io uring memory.

2. Drivers need to be able to say "I want a header pool (with readable
netmem)" or "I want a data pool (potentially with unreadable netmem)".

Pavel is suggesting implementing both of these in 2 different flags.

Jakub is suggesting implementing both with 1 flag which says "I can
support unreadable netmem for this pool" , and guarding against #1
with a refcount check to detect if a dmabuf pool should have been
created but wasn't.

I prefer Jakub's suggestion, but beware that if we go with Jakub's
suggestion, we can't WARN_ON when the core-net check fails, because
it's not a kernel warning. All that's happening is that the user asked
for dmabuf binding but the driver didn't ask for it (because likely it
doesn't support it). That's not cause for a warning to be printed in
the kernel. The netlink API should just fail and return -EOPNOTSUPP or
something.

> One important note. The flag should not be tied to memory providers
> but rather to netmem, IOW unreadable memory. MP is an internal detail,
> the important fact from the driver-facing API perspective is that the
> driver doesn't need struct pages.
>

Agreed.
  
Pavel Begunkov Aug. 12, 2024, 6:57 p.m. UTC | #11
On 8/12/24 18:57, Jakub Kicinski wrote:
> On Sun, 11 Aug 2024 22:51:13 +0100 Pavel Begunkov wrote:
>>> I think we're talking about 2 slightly different flags, AFAIU.>
>>> Pavel and I are suggesting the driver reports "I support memory
>>> providers" directly to core (via the queue-api or what not), and we
>>> check that flag directly in netdev_rx_queue_restart(), and fail
>>> immediately if the support is not there.
>>
>> I might've misread Jakub, but yes, I believe it's different. It'd
>> communicate about support for providers to upper layers, so we can
>> fail even before attempting to allocate a new queue and init a
>> page pool.
> 
> Got it. Since allocating memory happens before stopping traffic
> I think it's acceptable to stick to a single flag.

I agree, it's an failure case of init path, shouldn't be
a problem.


>>> Jakub is suggesting a page_pool_params flag which lets the driver
>>> report "I support memory providers". If the driver doesn't support it
>>> but core is trying to configure that, then the page_pool_create will
>>> fail, which will cause the queue API operation
>>> (ndo_queue_alloc_mem_alloc) to fail, which causes
>>> netdev_rx_queue_restart() to fail.
>>
>> And I'm not against this way either if we explicitly get an error
>> back instead of trying to figure it out post-factum like by
>> checking the references and possibly reverting the allocation.
>> Maybe that's where I was confused, and that refcount thing was
>> suggested as a WARN_ONCE?
> 
> Yup, the refcount (now: check of the page pool list) was meant
> as a WARN_ONCE() to catch bad drivers.
> 
>> FWIW, I think it warrants two flags. The first saying that the
>> driver supports providers at all:
>>
>> page_pool_init() {
>> 	if (rxq->mp_params)
>> 		if (!(flags & PP_PROVIDERS_SUPPORTED))
>> 			goto fail;
>> }
>>
>> And the second telling whether the driver wants to install
>> providers for this particular page pool, so if there is a
>> separate pool for headers we can set it with plain old kernel
>> pages.
> 
> The implementation of the queue API should be resilient against
> failures in alloc, and not being MP capable is just a form of
> alloc failure. I don't see the upside of double-flag.
> 
>> payload_pool = page_pool_create(rqx, PP_PROVIDERS_SUPPORTED);
>> header_pool = page_pool_create(rqx, PP_PROVIDERS_SUPPORTED |
>>                                       PP_IGNORE_PROVIDERS);
> 
> Also don't see the upside of the explicit "non-capable" flag,
> but I haven't thought of that. Is there any use?

Considering that we pass a mp to page pool indirectly via
a queue

rxq->mp_param = devmemtcp;
... // later
pp_params.queue = rxq;
page_pool_create(pp_params);

How can we create a separate header pool without mp and let
the other data pool be initialized with mp? We can make
sure pp_params.queue is NULL, but API isn't sound, I think
the plans are to reuse the queue argument for some other
purposes.

param_tmp = rxq->mp_param;
rxq->mp_param = NULL;
page_pool_create(pp_params);
rxq->mp_param = param_tmp;

Temporarily erasing mp_param is another option, but I think
by far the simplest is another flag.

> One important note. The flag should not be tied to memory providers
> but rather to netmem, IOW unreadable memory. MP is an internal detail,
> the important fact from the driver-facing API perspective is that the
> driver doesn't need struct pages.

Sure, didn't want to go into describing how these are not
the same thing. Naming should reflect it, and if we expect
providers that don't produce net_iov, we might want to have
a flag in mp_param or so.
  
Pavel Begunkov Aug. 12, 2024, 7:04 p.m. UTC | #12
On 8/12/24 19:57, Pavel Begunkov wrote:
> On 8/12/24 18:57, Jakub Kicinski wrote:
>> On Sun, 11 Aug 2024 22:51:13 +0100 Pavel Begunkov wrote:
>>>> I think we're talking about 2 slightly different flags, AFAIU.>
>>>> Pavel and I are suggesting the driver reports "I support memory
>>>> providers" directly to core (via the queue-api or what not), and we
>>>> check that flag directly in netdev_rx_queue_restart(), and fail
>>>> immediately if the support is not there.
>>>
>>> I might've misread Jakub, but yes, I believe it's different. It'd
>>> communicate about support for providers to upper layers, so we can
>>> fail even before attempting to allocate a new queue and init a
>>> page pool.
>>
>> Got it. Since allocating memory happens before stopping traffic
>> I think it's acceptable to stick to a single flag.
> 
> I agree, it's an failure case of init path, shouldn't be
> a problem.
> 
> 
>>>> Jakub is suggesting a page_pool_params flag which lets the driver
>>>> report "I support memory providers". If the driver doesn't support it
>>>> but core is trying to configure that, then the page_pool_create will
>>>> fail, which will cause the queue API operation
>>>> (ndo_queue_alloc_mem_alloc) to fail, which causes
>>>> netdev_rx_queue_restart() to fail.
>>>
>>> And I'm not against this way either if we explicitly get an error
>>> back instead of trying to figure it out post-factum like by
>>> checking the references and possibly reverting the allocation.
>>> Maybe that's where I was confused, and that refcount thing was
>>> suggested as a WARN_ONCE?
>>
>> Yup, the refcount (now: check of the page pool list) was meant
>> as a WARN_ONCE() to catch bad drivers.
>>
>>> FWIW, I think it warrants two flags. The first saying that the
>>> driver supports providers at all:
>>>
>>> page_pool_init() {
>>>     if (rxq->mp_params)
>>>         if (!(flags & PP_PROVIDERS_SUPPORTED))
>>>             goto fail;
>>> }
>>>
>>> And the second telling whether the driver wants to install
>>> providers for this particular page pool, so if there is a
>>> separate pool for headers we can set it with plain old kernel
>>> pages.
>>
>> The implementation of the queue API should be resilient against
>> failures in alloc, and not being MP capable is just a form of
>> alloc failure. I don't see the upside of double-flag.
>>
>>> payload_pool = page_pool_create(rqx, PP_PROVIDERS_SUPPORTED);
>>> header_pool = page_pool_create(rqx, PP_PROVIDERS_SUPPORTED |
>>>                                       PP_IGNORE_PROVIDERS);
>>
>> Also don't see the upside of the explicit "non-capable" flag,
>> but I haven't thought of that. Is there any use?

Or maybe I don't get what you're asking, I explained
why to have that "PP_IGNORE_PROVIDERS" on top of the flag
saying that it's supported.

Which "non-capable" flag you have in mind? A page pool create
flag or one facing upper layers like devmem tcp?


> Considering that we pass a mp to page pool indirectly via
> a queue
> 
> rxq->mp_param = devmemtcp;
> ... // later
> pp_params.queue = rxq;
> page_pool_create(pp_params);
> 
> How can we create a separate header pool without mp and let
> the other data pool be initialized with mp? We can make
> sure pp_params.queue is NULL, but API isn't sound, I think
> the plans are to reuse the queue argument for some other
> purposes.
> 
> param_tmp = rxq->mp_param;
> rxq->mp_param = NULL;
> page_pool_create(pp_params);
> rxq->mp_param = param_tmp;
> 
> Temporarily erasing mp_param is another option, but I think
> by far the simplest is another flag.
> 
>> One important note. The flag should not be tied to memory providers
>> but rather to netmem, IOW unreadable memory. MP is an internal detail,
>> the important fact from the driver-facing API perspective is that the
>> driver doesn't need struct pages.
> 
> Sure, didn't want to go into describing how these are not
> the same thing. Naming should reflect it, and if we expect
> providers that don't produce net_iov, we might want to have
> a flag in mp_param or so.
>
  
Pavel Begunkov Aug. 12, 2024, 7:10 p.m. UTC | #13
On 8/12/24 19:55, Mina Almasry wrote:
> On Mon, Aug 12, 2024 at 1:57 PM Jakub Kicinski <kuba@kernel.org> wrote:
>>
>> On Sun, 11 Aug 2024 22:51:13 +0100 Pavel Begunkov wrote:
>>>> I think we're talking about 2 slightly different flags, AFAIU.>
>>>> Pavel and I are suggesting the driver reports "I support memory
>>>> providers" directly to core (via the queue-api or what not), and we
>>>> check that flag directly in netdev_rx_queue_restart(), and fail
>>>> immediately if the support is not there.
>>>
>>> I might've misread Jakub, but yes, I believe it's different. It'd
>>> communicate about support for providers to upper layers, so we can
>>> fail even before attempting to allocate a new queue and init a
>>> page pool.
>>
>> Got it. Since allocating memory happens before stopping traffic
>> I think it's acceptable to stick to a single flag.
>>
>>>> Jakub is suggesting a page_pool_params flag which lets the driver
>>>> report "I support memory providers". If the driver doesn't support it
>>>> but core is trying to configure that, then the page_pool_create will
>>>> fail, which will cause the queue API operation
>>>> (ndo_queue_alloc_mem_alloc) to fail, which causes
>>>> netdev_rx_queue_restart() to fail.
>>>
>>> And I'm not against this way either if we explicitly get an error
>>> back instead of trying to figure it out post-factum like by
>>> checking the references and possibly reverting the allocation.
>>> Maybe that's where I was confused, and that refcount thing was
>>> suggested as a WARN_ONCE?
>>
>> Yup, the refcount (now: check of the page pool list) was meant
>> as a WARN_ONCE() to catch bad drivers.
>>
>>> FWIW, I think it warrants two flags. The first saying that the
>>> driver supports providers at all:
>>>
>>> page_pool_init() {
>>>        if (rxq->mp_params)
>>>                if (!(flags & PP_PROVIDERS_SUPPORTED))
>>>                        goto fail;
>>> }
>>>
>>> And the second telling whether the driver wants to install
>>> providers for this particular page pool, so if there is a
>>> separate pool for headers we can set it with plain old kernel
>>> pages.
>>
>> The implementation of the queue API should be resilient against
>> failures in alloc, and not being MP capable is just a form of
>> alloc failure. I don't see the upside of double-flag.
>>
>>> payload_pool = page_pool_create(rqx, PP_PROVIDERS_SUPPORTED);
>>> header_pool = page_pool_create(rqx, PP_PROVIDERS_SUPPORTED |
>>>                                       PP_IGNORE_PROVIDERS);
>>
>> Also don't see the upside of the explicit "non-capable" flag,
>> but I haven't thought of that. Is there any use?
>>
> 
> There are 2 things we're trying to accomplish:
> 
> 1. Drivers need to be able to say "I support unreadable netmem".
> Failure to report unreadable netmem support should cause the netlink
> API to fail when the user tries to bind dmabuf/io uring memory.
> 
> 2. Drivers need to be able to say "I want a header pool (with readable
> netmem)" or "I want a data pool (potentially with unreadable netmem)".
> 
> Pavel is suggesting implementing both of these in 2 different flags.
> 
> Jakub is suggesting implementing both with 1 flag which says "I can
> support unreadable netmem for this pool" , and guarding against #1
> with a refcount check to detect if a dmabuf pool should have been
> created but wasn't.

That would be iffy IIUC, but I think Jakub just explicitly said
that the refcount trick was just for debugging purposes and not
for gauging errors like "providers are not supported by the driver".

"Yup, the refcount (now: check of the page pool list) was meant
as a WARN_ONCE() to catch bad drivers."


> I prefer Jakub's suggestion, but beware that if we go with Jakub's
> suggestion, we can't WARN_ON when the core-net check fails, because
> it's not a kernel warning. All that's happening is that the user asked
> for dmabuf binding but the driver didn't ask for it (because likely it
> doesn't support it). That's not cause for a warning to be printed in
> the kernel. The netlink API should just fail and return -EOPNOTSUPP or
> something.
  
Jakub Kicinski Aug. 12, 2024, 11:57 p.m. UTC | #14
On Mon, 12 Aug 2024 20:10:39 +0100 Pavel Begunkov wrote:
> > 1. Drivers need to be able to say "I support unreadable netmem".
> > Failure to report unreadable netmem support should cause the netlink
> > API to fail when the user tries to bind dmabuf/io uring memory.
> > 
> > 2. Drivers need to be able to say "I want a header pool (with readable
> > netmem)" or "I want a data pool (potentially with unreadable netmem)".
> > 
> > Pavel is suggesting implementing both of these in 2 different flags.
> > 
> > Jakub is suggesting implementing both with 1 flag which says "I can
> > support unreadable netmem for this pool" , and guarding against #1
> > with a refcount check to detect if a dmabuf pool should have been
> > created but wasn't.  
> 
> That would be iffy IIUC, but I think Jakub just explicitly said
> that the refcount trick was just for debugging purposes and not
> for gauging errors like "providers are not supported by the driver".
> 
> "Yup, the refcount (now: check of the page pool list) was meant
> as a WARN_ONCE() to catch bad drivers."

Sorry, insufficient caffeine level in the morning.
We can't WARN_ONCE(), indeed.
  
Jakub Kicinski Aug. 13, 2024, 12:15 a.m. UTC | #15
On Mon, 12 Aug 2024 20:04:41 +0100 Pavel Begunkov wrote:
> >> Also don't see the upside of the explicit "non-capable" flag,
> >> but I haven't thought of that. Is there any use?  
> 
> Or maybe I don't get what you're asking, I explained
> why to have that "PP_IGNORE_PROVIDERS" on top of the flag
> saying that it's supported.
> 
> Which "non-capable" flag you have in mind? A page pool create
> flag or one facing upper layers like devmem tcp?

Let me rephrase - what's the point of having both PP_PROVIDERS_SUPPORTED
and PP_IGNORE_PROVIDERS at the page pool level? PP_CAP_NET(MEM|IOV),
and it's either there or it's not.

If you're thinking about advertising the support all the way to the
user, I'm not sure if page pool is the right place to do so. It's more
of a queue property.

BTW, Mina, the core should probably also check that XDP isn't installed
before / while the netmem is bound to a queue.
  
Pavel Begunkov Aug. 13, 2024, 1:56 a.m. UTC | #16
On 8/13/24 01:15, Jakub Kicinski wrote:
> On Mon, 12 Aug 2024 20:04:41 +0100 Pavel Begunkov wrote:
>>>> Also don't see the upside of the explicit "non-capable" flag,
>>>> but I haven't thought of that. Is there any use?
>>
>> Or maybe I don't get what you're asking, I explained
>> why to have that "PP_IGNORE_PROVIDERS" on top of the flag
>> saying that it's supported.
>>
>> Which "non-capable" flag you have in mind? A page pool create
>> flag or one facing upper layers like devmem tcp?
> 
> Let me rephrase - what's the point of having both PP_PROVIDERS_SUPPORTED
> and PP_IGNORE_PROVIDERS at the page pool level? PP_CAP_NET(MEM|IOV),
> and it's either there or it's not.

The second flag solves a problem with initializing page pools
with headers, but let's forget about it for now, it's rather a
small nuance, would probably reappear when someone would try to
use pp_params->queue for purposes different from memory providers.

> If you're thinking about advertising the support all the way to the
> user, I'm not sure if page pool is the right place to do so. It's more
> of a queue property.

Nope. Only the first "SUPPORTED" flag serves that purpose in a way
by failing setup like netlink devmem dmabuf binding and returning
the error back to user.

> BTW, Mina, the core should probably also check that XDP isn't installed
> before / while the netmem is bound to a queue.
  
Pavel Begunkov Aug. 13, 2024, 2:31 a.m. UTC | #17
On 8/13/24 00:57, Jakub Kicinski wrote:
> On Mon, 12 Aug 2024 20:10:39 +0100 Pavel Begunkov wrote:
>>> 1. Drivers need to be able to say "I support unreadable netmem".
>>> Failure to report unreadable netmem support should cause the netlink
>>> API to fail when the user tries to bind dmabuf/io uring memory.
>>>
>>> 2. Drivers need to be able to say "I want a header pool (with readable
>>> netmem)" or "I want a data pool (potentially with unreadable netmem)".
>>>
>>> Pavel is suggesting implementing both of these in 2 different flags.
>>>
>>> Jakub is suggesting implementing both with 1 flag which says "I can
>>> support unreadable netmem for this pool" , and guarding against #1
>>> with a refcount check to detect if a dmabuf pool should have been
>>> created but wasn't.
>>
>> That would be iffy IIUC, but I think Jakub just explicitly said
>> that the refcount trick was just for debugging purposes and not
>> for gauging errors like "providers are not supported by the driver".
>>
>> "Yup, the refcount (now: check of the page pool list) was meant
>> as a WARN_ONCE() to catch bad drivers."
> 
> Sorry, insufficient caffeine level in the morning.
> We can't WARN_ONCE(), indeed.

I'm getting lost, so repeating myself a bit. What I think
would be a good approach is if we get an error back from
the driver if it doesn't support netiov / providers.

netdev_rx_queue_restart() {
	...
	err = dev->queue_mgmt_ops->ndo_queue_mem_alloc();
	if (err == -EOPNOTSUPP) // the driver doesn't support netiov
		return -EOPNOTSUPP;
	...
}

That can be done if drivers opt in to support providers,
e.g. via a page pool flag.

What I think wouldn't be a great option is getting back a
"success" from the driver even though it ignored

netdev_rx_queue_restart() {
	...
	err = dev->queue_mgmt_ops->ndo_queue_mem_alloc();
	if (err)
		return err;

	// we get err==0 even if the driver doesn't support
	// providers, verify it is _actually_ installed
	if (rxq->mp_params) {
		// or walking pp list, same thing
		if (rxq->mp_params->refcount == 0)
			goto fail;
	}
}

And if we go with the first version, the refcount check can
also be added but as a warning. Maybe it's easier to put it
into code and discuss then.
  
Mina Almasry Aug. 13, 2024, 8:39 a.m. UTC | #18
On Mon, Aug 12, 2024 at 8:15 PM Jakub Kicinski <kuba@kernel.org> wrote:
> BTW, Mina, the core should probably also check that XDP isn't installed
> before / while the netmem is bound to a queue.

Sorry if noob question, but what is the proper check for this? I tried
adding this to net_devmem_bind_dmabuf_to_queue():

if (xdp_rxq_info_is_reg(&rxq->xdp_rxq))
                 return -EEXIST;

But quickly found out that in  netif_alloc_rx_queues() we initialize
all the rxq->xdp_rxq to state REGISTERED regardless whether xdp is
installed or not, so this check actually fails.

Worthy of note is that GVE holds an instance of xdp_rxq_info in
gve_rx_ring, and seems to use that for its xdp information, not the
one that hangs off of netdev_rx_queue in core.

Additionally, my understanding of XDP is limited, but why do we want
to disable it? My understanding is that XDP is a kernel bypass that
hands the data directly to userspace. In theory at least there should
be no issue binding dmabuf to a queue, then getting the data in the
queue via an XDP program instead of via TCP sockets or io uring. Is
there some fundamental reason why dmabuf and XDP are incompatible?
  
Mina Almasry Aug. 13, 2024, 9:03 a.m. UTC | #19
On Tue, Aug 13, 2024 at 4:39 AM Mina Almasry <almasrymina@google.com> wrote:
>
> On Mon, Aug 12, 2024 at 8:15 PM Jakub Kicinski <kuba@kernel.org> wrote:
> > BTW, Mina, the core should probably also check that XDP isn't installed
> > before / while the netmem is bound to a queue.
>
> Sorry if noob question, but what is the proper check for this? I tried
> adding this to net_devmem_bind_dmabuf_to_queue():
>
> if (xdp_rxq_info_is_reg(&rxq->xdp_rxq))
>                  return -EEXIST;
>
> But quickly found out that in  netif_alloc_rx_queues() we initialize
> all the rxq->xdp_rxq to state REGISTERED regardless whether xdp is
> installed or not, so this check actually fails.
>
> Worthy of note is that GVE holds an instance of xdp_rxq_info in
> gve_rx_ring, and seems to use that for its xdp information, not the
> one that hangs off of netdev_rx_queue in core.
>

To elaborate further, in order to disable binding dmabuf and XDP on
the same rx queue for GVE, AFAIT the check would need to be inside of
GVE. Inside of GVE I'd check if gve_priv->xdp_prog is installed, and
check if the gve_rx_ring->xdp_info is registered. If so, then the rx
queue is XDP enabled, and should not be bound to dmabuf. I think that
would work.

At the moment I can't think of a check inside of core that would be
compatible with GVE, but above you clearly are specifically asking for
a check in core. Any pointers to what you have in mind would be
appreciated here, but I'll try to take a deeper look.

> Additionally, my understanding of XDP is limited, but why do we want
> to disable it? My understanding is that XDP is a kernel bypass that
> hands the data directly to userspace. In theory at least there should
> be no issue binding dmabuf to a queue, then getting the data in the
> queue via an XDP program instead of via TCP sockets or io uring. Is
> there some fundamental reason why dmabuf and XDP are incompatible?
>
> --
> Thanks,
> Mina
  
Jakub Kicinski Aug. 13, 2024, 2:33 p.m. UTC | #20
On Tue, 13 Aug 2024 04:39:47 -0400 Mina Almasry wrote:
> On Mon, Aug 12, 2024 at 8:15 PM Jakub Kicinski <kuba@kernel.org> wrote:
> > BTW, Mina, the core should probably also check that XDP isn't installed
> > before / while the netmem is bound to a queue.  
> 
> Sorry if noob question, but what is the proper check for this?

	if (dev_xdp_prog_count(netdev))

> I tried adding this to net_devmem_bind_dmabuf_to_queue():
> 
> if (xdp_rxq_info_is_reg(&rxq->xdp_rxq))
>                  return -EEXIST;
> 
> But quickly found out that in  netif_alloc_rx_queues() we initialize
> all the rxq->xdp_rxq to state REGISTERED regardless whether xdp is
> installed or not, so this check actually fails.
> 
> Worthy of note is that GVE holds an instance of xdp_rxq_info in
> gve_rx_ring, and seems to use that for its xdp information, not the
> one that hangs off of netdev_rx_queue in core.
> 
> Additionally, my understanding of XDP is limited, but why do we want
> to disable it? My understanding is that XDP is a kernel bypass that
> hands the data directly to userspace. In theory at least there should
> be no issue binding dmabuf to a queue, then getting the data in the
> queue via an XDP program instead of via TCP sockets or io uring. Is
> there some fundamental reason why dmabuf and XDP are incompatible?

You're conflating XDP with AF_XDP, two different things, slightly
related but different.
  
Jakub Kicinski Aug. 13, 2024, 2:39 p.m. UTC | #21
On Tue, 13 Aug 2024 03:31:13 +0100 Pavel Begunkov wrote:
> I'm getting lost, so repeating myself a bit. What I think
> would be a good approach is if we get an error back from
> the driver if it doesn't support netiov / providers.
> 
> netdev_rx_queue_restart() {
> 	...
> 	err = dev->queue_mgmt_ops->ndo_queue_mem_alloc();
> 	if (err == -EOPNOTSUPP) // the driver doesn't support netiov
> 		return -EOPNOTSUPP;
> 	...
> }
> 
> That can be done if drivers opt in to support providers,
> e.g. via a page pool flag.
> 
> What I think wouldn't be a great option is getting back a
> "success" from the driver even though it ignored

page pool params are not the right place for a supported flag.
Sooner or later we'll want to expose this flag to user space.
  
Pavel Begunkov Aug. 13, 2024, 3:11 p.m. UTC | #22
On 8/13/24 15:39, Jakub Kicinski wrote:
> On Tue, 13 Aug 2024 03:31:13 +0100 Pavel Begunkov wrote:
>> I'm getting lost, so repeating myself a bit. What I think
>> would be a good approach is if we get an error back from
>> the driver if it doesn't support netiov / providers.
>>
>> netdev_rx_queue_restart() {
>> 	...
>> 	err = dev->queue_mgmt_ops->ndo_queue_mem_alloc();
>> 	if (err == -EOPNOTSUPP) // the driver doesn't support netiov
>> 		return -EOPNOTSUPP;
>> 	...
>> }
>>
>> That can be done if drivers opt in to support providers,
>> e.g. via a page pool flag.
>>
>> What I think wouldn't be a great option is getting back a
>> "success" from the driver even though it ignored
> 
> page pool params are not the right place for a supported flag.
> Sooner or later we'll want to expose this flag to user space.

Fair enough, it appeared to me that's what you was suggesting

"What comes to mind is adding an "I can gobble up net_iovs from this
pool" flag in page pool params (the struct that comes from the driver),
and then on the installation path we can check ..."

We can also move it from pp flags to queue API callbacks, however if we
want to expose it to the userspace, I'd imagine we need a queue flag set
by the driver, which then can be queried by netlink or whichever
interface is appropriate. And it can be used can be used to fail
netdev_rx_queue_restart() for queues/drivers that don't support mp.

netdev_rx_queue_restart() {
	if (rxq->mp_params && !rxq->netiov_supported)
		fail;
}

Perhaps, I don't understand what approach you contemplate, but
maybe Mina has cracked it.
  
Jakub Kicinski Aug. 13, 2024, 3:26 p.m. UTC | #23
On Tue, 13 Aug 2024 16:11:15 +0100 Pavel Begunkov wrote:
> On 8/13/24 15:39, Jakub Kicinski wrote:
> > On Tue, 13 Aug 2024 03:31:13 +0100 Pavel Begunkov wrote:  
> >> I'm getting lost, so repeating myself a bit. What I think
> >> would be a good approach is if we get an error back from
> >> the driver if it doesn't support netiov / providers.
> >>
> >> netdev_rx_queue_restart() {
> >> 	...
> >> 	err = dev->queue_mgmt_ops->ndo_queue_mem_alloc();
> >> 	if (err == -EOPNOTSUPP) // the driver doesn't support netiov
> >> 		return -EOPNOTSUPP;
> >> 	...
> >> }
> >>
> >> That can be done if drivers opt in to support providers,
> >> e.g. via a page pool flag.
> >>
> >> What I think wouldn't be a great option is getting back a
> >> "success" from the driver even though it ignored  
> > 
> > page pool params are not the right place for a supported flag.
> > Sooner or later we'll want to expose this flag to user space.  
> 
> Fair enough, it appeared to me that's what you was suggesting
> 
> "What comes to mind is adding an "I can gobble up net_iovs from this
> pool" flag in page pool params (the struct that comes from the driver),
> and then on the installation path we can check ..."

Yes, we still need one flag in page pool params -- functioning like 
the inverse of PP_IGNORE_PROVIDERS, I'd call it something like
PP_CAP_NET(MEM|IOV). To distinguish the header and data pools.

> We can also move it from pp flags to queue API callbacks, however if we
> want to expose it to the userspace, I'd imagine we need a queue flag set
> by the driver, which then can be queried by netlink or whichever
> interface is appropriate. And it can be used can be used to fail
> netdev_rx_queue_restart() for queues/drivers that don't support mp.
> 
> netdev_rx_queue_restart() {
> 	if (rxq->mp_params && !rxq->netiov_supported)
> 		fail;
> }

Yup!
  

Patch

diff --git a/include/net/mp_dmabuf_devmem.h b/include/net/mp_dmabuf_devmem.h
new file mode 100644
index 0000000000000..300a2356eed00
--- /dev/null
+++ b/include/net/mp_dmabuf_devmem.h
@@ -0,0 +1,44 @@ 
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+/*
+ * Dmabuf device memory provider.
+ *
+ * Authors:	Mina Almasry <almasrymina@google.com>
+ *
+ */
+#ifndef _NET_MP_DMABUF_DEVMEM_H
+#define _NET_MP_DMABUF_DEVMEM_H
+
+#include <net/netmem.h>
+
+#if defined(CONFIG_DMA_SHARED_BUFFER) && defined(CONFIG_GENERIC_ALLOCATOR)
+int mp_dmabuf_devmem_init(struct page_pool *pool);
+
+netmem_ref mp_dmabuf_devmem_alloc_netmems(struct page_pool *pool, gfp_t gfp);
+
+void mp_dmabuf_devmem_destroy(struct page_pool *pool);
+
+bool mp_dmabuf_devmem_release_page(struct page_pool *pool, netmem_ref netmem);
+#else
+static inline int mp_dmabuf_devmem_init(struct page_pool *pool)
+{
+	return -EOPNOTSUPP;
+}
+
+static inline netmem_ref mp_dmabuf_devmem_alloc_netmems(struct page_pool *pool,
+							gfp_t gfp)
+{
+	return 0;
+}
+
+static inline void mp_dmabuf_devmem_destroy(struct page_pool *pool)
+{
+}
+
+static inline bool mp_dmabuf_devmem_release_page(struct page_pool *pool,
+						 netmem_ref netmem)
+{
+	return false;
+}
+#endif
+
+#endif /* _NET_MP_DMABUF_DEVMEM_H */
diff --git a/include/net/netmem.h b/include/net/netmem.h
index 99531780e53af..34dd146f9e940 100644
--- a/include/net/netmem.h
+++ b/include/net/netmem.h
@@ -100,6 +100,21 @@  static inline struct page *netmem_to_page(netmem_ref netmem)
 	return (__force struct page *)netmem;
 }
 
+static inline struct net_iov *netmem_to_net_iov(netmem_ref netmem)
+{
+	if (netmem_is_net_iov(netmem))
+		return (struct net_iov *)((__force unsigned long)netmem &
+					  ~NET_IOV);
+
+	DEBUG_NET_WARN_ON_ONCE(true);
+	return NULL;
+}
+
+static inline netmem_ref net_iov_to_netmem(struct net_iov *niov)
+{
+	return (__force netmem_ref)((unsigned long)niov | NET_IOV);
+}
+
 static inline netmem_ref page_to_netmem(struct page *page)
 {
 	return (__force netmem_ref)page;
diff --git a/include/net/page_pool/types.h b/include/net/page_pool/types.h
index 4afd6dd56351a..e01ccb42e582f 100644
--- a/include/net/page_pool/types.h
+++ b/include/net/page_pool/types.h
@@ -52,6 +52,7 @@  struct pp_alloc_cache {
  * @nid:	NUMA node id to allocate from pages from
  * @dev:	device, for DMA pre-mapping purposes
  * @napi:	NAPI which is the sole consumer of pages, otherwise NULL
+ * @queue:	struct netdev_rx_queue this page_pool is being created for.
  * @dma_dir:	DMA mapping direction
  * @max_len:	max DMA sync memory size for PP_FLAG_DMA_SYNC_DEV
  * @offset:	DMA sync address offset for PP_FLAG_DMA_SYNC_DEV
@@ -66,6 +67,7 @@  struct page_pool_params {
 		int		nid;
 		struct device	*dev;
 		struct napi_struct *napi;
+		struct netdev_rx_queue *queue;
 		enum dma_data_direction dma_dir;
 		unsigned int	max_len;
 		unsigned int	offset;
diff --git a/net/core/devmem.c b/net/core/devmem.c
index befff59a2ee64..f8e78907c6c68 100644
--- a/net/core/devmem.c
+++ b/net/core/devmem.c
@@ -17,8 +17,11 @@ 
 #include <linux/genalloc.h>
 #include <linux/dma-buf.h>
 #include <net/devmem.h>
+#include <net/mp_dmabuf_devmem.h>
 #include <net/netdev_queues.h>
 
+#include "page_pool_priv.h"
+
 /* Device memory support */
 
 #if defined(CONFIG_DMA_SHARED_BUFFER) && defined(CONFIG_GENERIC_ALLOCATOR)
@@ -297,4 +300,79 @@  void dev_dmabuf_uninstall(struct net_device *dev)
 					xa_erase(&binding->bound_rxqs, xa_idx);
 	}
 }
+
+/*** "Dmabuf devmem memory provider" ***/
+
+int mp_dmabuf_devmem_init(struct page_pool *pool)
+{
+	struct net_devmem_dmabuf_binding *binding = pool->mp_priv;
+
+	if (!binding)
+		return -EINVAL;
+
+	if (!pool->dma_map)
+		return -EOPNOTSUPP;
+
+	if (pool->dma_sync)
+		return -EOPNOTSUPP;
+
+	if (pool->p.order != 0)
+		return -E2BIG;
+
+	net_devmem_dmabuf_binding_get(binding);
+	return 0;
+}
+
+netmem_ref mp_dmabuf_devmem_alloc_netmems(struct page_pool *pool, gfp_t gfp)
+{
+	struct net_devmem_dmabuf_binding *binding = pool->mp_priv;
+	netmem_ref netmem;
+	struct net_iov *niov;
+	dma_addr_t dma_addr;
+
+	niov = net_devmem_alloc_dmabuf(binding);
+	if (!niov)
+		return 0;
+
+	dma_addr = net_devmem_get_dma_addr(niov);
+
+	netmem = net_iov_to_netmem(niov);
+
+	page_pool_set_pp_info(pool, netmem);
+
+	if (page_pool_set_dma_addr_netmem(netmem, dma_addr))
+		goto err_free;
+
+	pool->pages_state_hold_cnt++;
+	trace_page_pool_state_hold(pool, netmem, pool->pages_state_hold_cnt);
+	return netmem;
+
+err_free:
+	net_devmem_free_dmabuf(niov);
+	return 0;
+}
+
+void mp_dmabuf_devmem_destroy(struct page_pool *pool)
+{
+	struct net_devmem_dmabuf_binding *binding = pool->mp_priv;
+
+	net_devmem_dmabuf_binding_put(binding);
+}
+
+bool mp_dmabuf_devmem_release_page(struct page_pool *pool, netmem_ref netmem)
+{
+	if (WARN_ON_ONCE(!netmem_is_net_iov(netmem)))
+		return false;
+
+	if (WARN_ON_ONCE(atomic_long_read(netmem_get_pp_ref_count_ref(netmem)) !=
+		     1))
+		return false;
+
+	page_pool_clear_pp_info(netmem);
+
+	net_devmem_free_dmabuf(netmem_to_net_iov(netmem));
+
+	/* We don't want the page pool put_page()ing our net_iovs. */
+	return false;
+}
 #endif
diff --git a/net/core/page_pool.c b/net/core/page_pool.c
index c5c303746d494..b70f779cb7856 100644
--- a/net/core/page_pool.c
+++ b/net/core/page_pool.c
@@ -13,6 +13,7 @@ 
 
 #include <net/page_pool/helpers.h>
 #include <net/xdp.h>
+#include <net/netdev_rx_queue.h>
 
 #include <linux/dma-direction.h>
 #include <linux/dma-mapping.h>
@@ -21,6 +22,9 @@ 
 #include <linux/poison.h>
 #include <linux/ethtool.h>
 #include <linux/netdevice.h>
+#include <linux/genalloc.h>
+#include <net/devmem.h>
+#include <net/mp_dmabuf_devmem.h>
 
 #include <trace/events/page_pool.h>
 
@@ -28,6 +32,7 @@ 
 #include "netmem_priv.h"
 
 DEFINE_STATIC_KEY_FALSE(page_pool_mem_providers);
+EXPORT_SYMBOL(page_pool_mem_providers);
 
 #define DEFER_TIME (msecs_to_jiffies(1000))
 #define DEFER_WARN_INTERVAL (60 * HZ)
@@ -190,6 +195,7 @@  static int page_pool_init(struct page_pool *pool,
 			  int cpuid)
 {
 	unsigned int ring_qsize = 1024; /* Default */
+	int err;
 
 	page_pool_struct_check();
 
@@ -271,7 +277,35 @@  static int page_pool_init(struct page_pool *pool,
 	if (pool->dma_map)
 		get_device(pool->p.dev);
 
+	if (pool->p.queue) {
+		/* We rely on rtnl_lock()ing to make sure netdev_rx_queue
+		 * configuration doesn't change while we're initializing the
+		 * page_pool.
+		 */
+		ASSERT_RTNL();
+		pool->mp_priv = pool->p.queue->mp_params.mp_priv;
+	}
+
+	if (pool->mp_priv) {
+		err = mp_dmabuf_devmem_init(pool);
+		if (err) {
+			pr_warn("%s() mem-provider init failed %d\n", __func__,
+				err);
+			goto free_ptr_ring;
+		}
+
+		static_branch_inc(&page_pool_mem_providers);
+	}
+
 	return 0;
+
+free_ptr_ring:
+	ptr_ring_cleanup(&pool->ring, NULL);
+#ifdef CONFIG_PAGE_POOL_STATS
+	if (!pool->system)
+		free_percpu(pool->recycle_stats);
+#endif
+	return err;
 }
 
 static void page_pool_uninit(struct page_pool *pool)
@@ -455,28 +489,6 @@  static bool page_pool_dma_map(struct page_pool *pool, netmem_ref netmem)
 	return false;
 }
 
-static void page_pool_set_pp_info(struct page_pool *pool, netmem_ref netmem)
-{
-	netmem_set_pp(netmem, pool);
-	netmem_or_pp_magic(netmem, PP_SIGNATURE);
-
-	/* Ensuring all pages have been split into one fragment initially:
-	 * page_pool_set_pp_info() is only called once for every page when it
-	 * is allocated from the page allocator and page_pool_fragment_page()
-	 * is dirtying the same cache line as the page->pp_magic above, so
-	 * the overhead is negligible.
-	 */
-	page_pool_fragment_netmem(netmem, 1);
-	if (pool->has_init_callback)
-		pool->slow.init_callback(netmem, pool->slow.init_arg);
-}
-
-static void page_pool_clear_pp_info(netmem_ref netmem)
-{
-	netmem_clear_pp_magic(netmem);
-	netmem_set_pp(netmem, NULL);
-}
-
 static struct page *__page_pool_alloc_page_order(struct page_pool *pool,
 						 gfp_t gfp)
 {
@@ -572,7 +584,10 @@  netmem_ref page_pool_alloc_netmem(struct page_pool *pool, gfp_t gfp)
 		return netmem;
 
 	/* Slow-path: cache empty, do real allocation */
-	netmem = __page_pool_alloc_pages_slow(pool, gfp);
+	if (static_branch_unlikely(&page_pool_mem_providers) && pool->mp_priv)
+		netmem = mp_dmabuf_devmem_alloc_netmems(pool, gfp);
+	else
+		netmem = __page_pool_alloc_pages_slow(pool, gfp);
 	return netmem;
 }
 EXPORT_SYMBOL(page_pool_alloc_netmem);
@@ -608,6 +623,28 @@  s32 page_pool_inflight(const struct page_pool *pool, bool strict)
 	return inflight;
 }
 
+void page_pool_set_pp_info(struct page_pool *pool, netmem_ref netmem)
+{
+	netmem_set_pp(netmem, pool);
+	netmem_or_pp_magic(netmem, PP_SIGNATURE);
+
+	/* Ensuring all pages have been split into one fragment initially:
+	 * page_pool_set_pp_info() is only called once for every page when it
+	 * is allocated from the page allocator and page_pool_fragment_page()
+	 * is dirtying the same cache line as the page->pp_magic above, so
+	 * the overhead is negligible.
+	 */
+	page_pool_fragment_netmem(netmem, 1);
+	if (pool->has_init_callback)
+		pool->slow.init_callback(netmem, pool->slow.init_arg);
+}
+
+void page_pool_clear_pp_info(netmem_ref netmem)
+{
+	netmem_clear_pp_magic(netmem);
+	netmem_set_pp(netmem, NULL);
+}
+
 static __always_inline void __page_pool_release_page_dma(struct page_pool *pool,
 							 netmem_ref netmem)
 {
@@ -636,8 +673,13 @@  static __always_inline void __page_pool_release_page_dma(struct page_pool *pool,
 void page_pool_return_page(struct page_pool *pool, netmem_ref netmem)
 {
 	int count;
+	bool put;
 
-	__page_pool_release_page_dma(pool, netmem);
+	put = true;
+	if (static_branch_unlikely(&page_pool_mem_providers) && pool->mp_priv)
+		put = mp_dmabuf_devmem_release_page(pool, netmem);
+	else
+		__page_pool_release_page_dma(pool, netmem);
 
 	/* This may be the last page returned, releasing the pool, so
 	 * it is not safe to reference pool afterwards.
@@ -645,8 +687,10 @@  void page_pool_return_page(struct page_pool *pool, netmem_ref netmem)
 	count = atomic_inc_return_relaxed(&pool->pages_state_release_cnt);
 	trace_page_pool_state_release(pool, netmem, count);
 
-	page_pool_clear_pp_info(netmem);
-	put_page(netmem_to_page(netmem));
+	if (put) {
+		page_pool_clear_pp_info(netmem);
+		put_page(netmem_to_page(netmem));
+	}
 	/* An optimization would be to call __free_pages(page, pool->p.order)
 	 * knowing page is not part of page-cache (thus avoiding a
 	 * __page_cache_release() call).
@@ -965,6 +1009,12 @@  static void __page_pool_destroy(struct page_pool *pool)
 
 	page_pool_unlist(pool);
 	page_pool_uninit(pool);
+
+	if (pool->mp_priv) {
+		mp_dmabuf_devmem_destroy(pool);
+		static_branch_dec(&page_pool_mem_providers);
+	}
+
 	kfree(pool);
 }
 
diff --git a/net/core/page_pool_priv.h b/net/core/page_pool_priv.h
index 90665d40f1eb7..e5b019ab0b735 100644
--- a/net/core/page_pool_priv.h
+++ b/net/core/page_pool_priv.h
@@ -9,4 +9,7 @@  int page_pool_list(struct page_pool *pool);
 void page_pool_detached(struct page_pool *pool);
 void page_pool_unlist(struct page_pool *pool);
 
+void page_pool_set_pp_info(struct page_pool *pool, netmem_ref netmem);
+void page_pool_clear_pp_info(netmem_ref netmem);
+
 #endif