get_file() unsafe under epoll (was Re: [syzbot] [fs?] [io-uring?] general protection fault in __ep_remove)

Message ID 202405031110.6F47982593@keescook (mailing list archive)
State New
Headers
Series get_file() unsafe under epoll (was Re: [syzbot] [fs?] [io-uring?] general protection fault in __ep_remove) |

Commit Message

Kees Cook May 3, 2024, 6:26 p.m. UTC
  On Fri, May 03, 2024 at 06:54:22PM +0700, Bui Quang Minh wrote:
> [...]
> Root cause:
> AFAIK, eventpoll (epoll) does not increase the registered file's reference.
> To ensure the safety, when the registered file is deallocated in __fput,
> eventpoll_release is called to unregister the file from epoll. When calling
> poll on epoll, epoll will loop through registered files and call vfs_poll on
> these files. In the file's poll, file is guaranteed to be alive, however, as
> epoll does not increase the registered file's reference, the file may be
> dying
> and it's not safe the get the file for later use. And dma_buf_poll violates
> this. In the dma_buf_poll, it tries to get_file to use in the callback. This
> leads to a race where the dmabuf->file can be fput twice.
> 
> Here is the race occurs in the above proof-of-concept
> 
> close(dmabuf->file)
> __fput_sync (f_count == 1, last ref)
> f_count-- (f_count == 0 now)
> __fput
>                                     epoll_wait
>                                     vfs_poll(dmabuf->file)
>                                     get_file(dmabuf->file)(f_count == 1)
> eventpoll_release
> dmabuf->file deallocation
>                                     fput(dmabuf->file) (f_count == 1)
>                                     f_count--
>                                     dmabuf->file deallocation
> 
> I am not familiar with the dma_buf so I don't know the proper fix for the
> issue. About the rule that don't get the file for later use in poll callback
> of
> file, I wonder if it is there when only select/poll exist or just after
> epoll
> appears.
> 
> I hope the analysis helps us to fix the issue.

Thanks for doing this analysis! I suspect at least a start of a fix
would be this:



What's the right way to deal with the dmabuf situation? (And I suspect
it applies to get_dma_buf_unless_doomed() as well...)

-Kees
  

Comments

Jens Axboe May 3, 2024, 6:49 p.m. UTC | #1
On 5/3/24 12:26 PM, Kees Cook wrote:
> Thanks for doing this analysis! I suspect at least a start of a fix
> would be this:
> 
> diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c
> index 8fe5aa67b167..15e8f74ee0f2 100644
> --- a/drivers/dma-buf/dma-buf.c
> +++ b/drivers/dma-buf/dma-buf.c
> @@ -267,9 +267,8 @@ static __poll_t dma_buf_poll(struct file *file, poll_table *poll)
>  
>  		if (events & EPOLLOUT) {
>  			/* Paired with fput in dma_buf_poll_cb */
> -			get_file(dmabuf->file);
> -
> -			if (!dma_buf_poll_add_cb(resv, true, dcb))
> +			if (!atomic_long_inc_not_zero(&dmabuf->file) &&
> +			    !dma_buf_poll_add_cb(resv, true, dcb))
>  				/* No callback queued, wake up any other waiters */

Don't think this is sane at all. I'm assuming you meant:

	atomic_long_inc_not_zero(&dmabuf->file->f_count);

but won't fly as you're not under RCU in the first place. And what
protects it from being long gone before you attempt this anyway? This is
sane way to attempt to fix it, it's completely opposite of what sane ref
handling should look like.

Not sure what the best fix is here, seems like dma-buf should hold an
actual reference to the file upfront rather than just stash a pointer
and then later _hope_ that it can just grab a reference. That seems
pretty horrible, and the real source of the issue.

> Due to this issue I've proposed fixing get_file() to detect pathological states:
> https://lore.kernel.org/lkml/20240502222252.work.690-kees@kernel.org/

I don't think this would catch this case, as the memory could just be
garbage at this point.
  
Kees Cook May 3, 2024, 7:22 p.m. UTC | #2
On Fri, May 03, 2024 at 12:49:11PM -0600, Jens Axboe wrote:
> On 5/3/24 12:26 PM, Kees Cook wrote:
> > Thanks for doing this analysis! I suspect at least a start of a fix
> > would be this:
> > 
> > diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c
> > index 8fe5aa67b167..15e8f74ee0f2 100644
> > --- a/drivers/dma-buf/dma-buf.c
> > +++ b/drivers/dma-buf/dma-buf.c
> > @@ -267,9 +267,8 @@ static __poll_t dma_buf_poll(struct file *file, poll_table *poll)
> >  
> >  		if (events & EPOLLOUT) {
> >  			/* Paired with fput in dma_buf_poll_cb */
> > -			get_file(dmabuf->file);
> > -
> > -			if (!dma_buf_poll_add_cb(resv, true, dcb))
> > +			if (!atomic_long_inc_not_zero(&dmabuf->file) &&
> > +			    !dma_buf_poll_add_cb(resv, true, dcb))
> >  				/* No callback queued, wake up any other waiters */
> 
> Don't think this is sane at all. I'm assuming you meant:
> 
> 	atomic_long_inc_not_zero(&dmabuf->file->f_count);

Oops, yes, sorry. I was typed from memory instead of copy/paste.

> but won't fly as you're not under RCU in the first place. And what
> protects it from being long gone before you attempt this anyway? This is
> sane way to attempt to fix it, it's completely opposite of what sane ref
> handling should look like.
> 
> Not sure what the best fix is here, seems like dma-buf should hold an
> actual reference to the file upfront rather than just stash a pointer
> and then later _hope_ that it can just grab a reference. That seems
> pretty horrible, and the real source of the issue.

AFAICT, epoll just doesn't hold any references at all. It depends,
I think, on eventpoll_release() (really eventpoll_release_file())
synchronizing with epoll_wait() (but I don't see how this happens, and
the race seems to be against ep_item_poll() ...?)

I'm really confused about how eventpoll manages the lifetime of polled
fds.

> > Due to this issue I've proposed fixing get_file() to detect pathological states:
> > https://lore.kernel.org/lkml/20240502222252.work.690-kees@kernel.org/
> 
> I don't think this would catch this case, as the memory could just be
> garbage at this point.

It catches it just fine! :) I tested it against the published PoC.

And for cases where further allocations have progressed far enough to
corrupt the freed struct file and render the check pointless, nothing
different has happened than what happens today. At least now we have a
window to catch the situation across the time frame before it is both
reallocated _and_ the contents at the f_count offset gets changed to
non-zero.
  
Jens Axboe May 3, 2024, 7:35 p.m. UTC | #3
On 5/3/24 1:22 PM, Kees Cook wrote:
> On Fri, May 03, 2024 at 12:49:11PM -0600, Jens Axboe wrote:
>> On 5/3/24 12:26 PM, Kees Cook wrote:
>>> Thanks for doing this analysis! I suspect at least a start of a fix
>>> would be this:
>>>
>>> diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c
>>> index 8fe5aa67b167..15e8f74ee0f2 100644
>>> --- a/drivers/dma-buf/dma-buf.c
>>> +++ b/drivers/dma-buf/dma-buf.c
>>> @@ -267,9 +267,8 @@ static __poll_t dma_buf_poll(struct file *file, poll_table *poll)
>>>  
>>>  		if (events & EPOLLOUT) {
>>>  			/* Paired with fput in dma_buf_poll_cb */
>>> -			get_file(dmabuf->file);
>>> -
>>> -			if (!dma_buf_poll_add_cb(resv, true, dcb))
>>> +			if (!atomic_long_inc_not_zero(&dmabuf->file) &&
>>> +			    !dma_buf_poll_add_cb(resv, true, dcb))
>>>  				/* No callback queued, wake up any other waiters */
>>
>> Don't think this is sane at all. I'm assuming you meant:
>>
>> 	atomic_long_inc_not_zero(&dmabuf->file->f_count);
> 
> Oops, yes, sorry. I was typed from memory instead of copy/paste.

Figured :-)

>> but won't fly as you're not under RCU in the first place. And what
>> protects it from being long gone before you attempt this anyway? This is
>> sane way to attempt to fix it, it's completely opposite of what sane ref
>> handling should look like.
>>
>> Not sure what the best fix is here, seems like dma-buf should hold an
>> actual reference to the file upfront rather than just stash a pointer
>> and then later _hope_ that it can just grab a reference. That seems
>> pretty horrible, and the real source of the issue.
> 
> AFAICT, epoll just doesn't hold any references at all. It depends,
> I think, on eventpoll_release() (really eventpoll_release_file())
> synchronizing with epoll_wait() (but I don't see how this happens, and
> the race seems to be against ep_item_poll() ...?)
>
> I'm really confused about how eventpoll manages the lifetime of polled
> fds.

epoll doesn't hold any references, and it's got some ugly callback to
deal with that. It's not ideal, nor pretty, but that's how it currently
works. See eventpoll_release() and how it's called. This means that
epoll itself is supposedly safe from the file going away, even though it
doesn't hold a reference to it.

Except that in this case, the file is already gone by the time
eventpoll_release() is called. Which presumably is some interaction with
the somewhat suspicious file reference management that dma-buf is doing.
But I didn't look into that much, outside of noting it looks a bit
suspect.

>>> Due to this issue I've proposed fixing get_file() to detect pathological states:
>>> https://lore.kernel.org/lkml/20240502222252.work.690-kees@kernel.org/
>>
>> I don't think this would catch this case, as the memory could just be
>> garbage at this point.
> 
> It catches it just fine! :) I tested it against the published PoC.

Sure it _may_ catch the issue, but the memory may also just be garbage
at that point. Not saying it's a useless addition, outside of the usual
gripes of making the hot path slower, just that it won't catch all
cases. Which I guess is fine and expected.

> And for cases where further allocations have progressed far enough to
> corrupt the freed struct file and render the check pointless, nothing
> different has happened than what happens today. At least now we have a
> window to catch the situation across the time frame before it is both
> reallocated _and_ the contents at the f_count offset gets changed to
> non-zero.

Right.
  
Kees Cook May 3, 2024, 7:59 p.m. UTC | #4
On Fri, May 03, 2024 at 01:35:09PM -0600, Jens Axboe wrote:
> On 5/3/24 1:22 PM, Kees Cook wrote:
> > On Fri, May 03, 2024 at 12:49:11PM -0600, Jens Axboe wrote:
> >> On 5/3/24 12:26 PM, Kees Cook wrote:
> >>> Thanks for doing this analysis! I suspect at least a start of a fix
> >>> would be this:
> >>>
> >>> diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c
> >>> index 8fe5aa67b167..15e8f74ee0f2 100644
> >>> --- a/drivers/dma-buf/dma-buf.c
> >>> +++ b/drivers/dma-buf/dma-buf.c
> >>> @@ -267,9 +267,8 @@ static __poll_t dma_buf_poll(struct file *file, poll_table *poll)
> >>>  
> >>>  		if (events & EPOLLOUT) {
> >>>  			/* Paired with fput in dma_buf_poll_cb */
> >>> -			get_file(dmabuf->file);
> >>> -
> >>> -			if (!dma_buf_poll_add_cb(resv, true, dcb))
> >>> +			if (!atomic_long_inc_not_zero(&dmabuf->file) &&
> >>> +			    !dma_buf_poll_add_cb(resv, true, dcb))
> >>>  				/* No callback queued, wake up any other waiters */
> >>
> >> Don't think this is sane at all. I'm assuming you meant:
> >>
> >> 	atomic_long_inc_not_zero(&dmabuf->file->f_count);
> > 
> > Oops, yes, sorry. I was typed from memory instead of copy/paste.
> 
> Figured :-)
> 
> >> but won't fly as you're not under RCU in the first place. And what
> >> protects it from being long gone before you attempt this anyway? This is
> >> sane way to attempt to fix it, it's completely opposite of what sane ref
> >> handling should look like.
> >>
> >> Not sure what the best fix is here, seems like dma-buf should hold an
> >> actual reference to the file upfront rather than just stash a pointer
> >> and then later _hope_ that it can just grab a reference. That seems
> >> pretty horrible, and the real source of the issue.
> > 
> > AFAICT, epoll just doesn't hold any references at all. It depends,
> > I think, on eventpoll_release() (really eventpoll_release_file())
> > synchronizing with epoll_wait() (but I don't see how this happens, and
> > the race seems to be against ep_item_poll() ...?)
> >
> > I'm really confused about how eventpoll manages the lifetime of polled
> > fds.
> 
> epoll doesn't hold any references, and it's got some ugly callback to
> deal with that. It's not ideal, nor pretty, but that's how it currently
> works. See eventpoll_release() and how it's called. This means that
> epoll itself is supposedly safe from the file going away, even though it
> doesn't hold a reference to it.

Right -- what remains unclear to me is how struct file lifetime is
expected to work in the struct file_operations::poll callbacks. Because
using get_file() there looks clearly unsafe...

> Except that in this case, the file is already gone by the time
> eventpoll_release() is called. Which presumably is some interaction with
> the somewhat suspicious file reference management that dma-buf is doing.
> But I didn't look into that much, outside of noting it looks a bit
> suspect.

Not yet, though. Here's (one) race state from the analysis. I added lines
for the dma_fence_add_callback()/dma_buf_poll_cb() case, since that's
the case that would escape any eventpoll_release/epoll_wait
synchronization (if it exists?):

close(dmabuf->file)
__fput_sync (f_count == 1, last ref)
f_count-- (f_count == 0 now)
__fput
                                     epoll_wait
                                     vfs_poll(dmabuf->file)
                                     get_file(dmabuf->file)(f_count == 1)
                                     dma_fence_add_callback()
eventpoll_release
dmabuf->file deallocation
                                     dma_buf_poll_cb()
                                     fput(dmabuf->file) (f_count == 1)
                                     f_count--
                                     dmabuf->file deallocation

Without fences to create a background callback, we just do a double-free:

close(dmabuf->file)
__fput_sync (f_count == 1, last ref)
f_count-- (f_count == 0 now)
__fput
                                     epoll_wait
                                     vfs_poll(dmabuf->file)
                                     get_file(dmabuf->file)(f_count == 1)
                                     dma_buf_poll_cb()
                                     fput(dmabuf->file) (f_count == 1)
                                     f_count--
                                     eventpoll_release
                                     dmabuf->file deallocation
eventpoll_release
dmabuf->file deallocation


get_file(), via epoll_wait()->vfs_poll()->dma_buf_poll(), has raised
f_count again. Then eventpoll_release() is doing things to remove
dmabuf->file from the eventpoll lists, but I *think* this is synchronized
so that an epoll_wait() will only call .poll handlers with a valid
(though possibly f_count==0) file, but I can't figure out where that
happens. (If it's not happening, we have a much bigger problem, but I
imagine we'd see massive corruption all the time, which we don't.)

So, yeah, I can't figure out how eventpoll_release() and epoll_wait()
are expected to behave safely for .poll handlers.

Regardless, for the simple case: it seems like it's just totally illegal
to use get_file() in a poll handler. Is this known/expected? And if so,
how can dmabuf possibly deal with that?
  
Kees Cook May 3, 2024, 8:28 p.m. UTC | #5
On Fri, May 03, 2024 at 12:59:52PM -0700, Kees Cook wrote:
> So, yeah, I can't figure out how eventpoll_release() and epoll_wait()
> are expected to behave safely for .poll handlers.
> 
> Regardless, for the simple case: it seems like it's just totally illegal
> to use get_file() in a poll handler. Is this known/expected? And if so,
> how can dmabuf possibly deal with that?

Is this the right approach? It still feels to me like get_file() needs
to happen much earlier...

diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index 882b89edc52a..c6c29facf228 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -991,9 +991,13 @@ static __poll_t ep_item_poll(const struct epitem *epi, poll_table *pt,
 	__poll_t res;
 
 	pt->_key = epi->event.events;
-	if (!is_file_epoll(file))
-		res = vfs_poll(file, pt);
-	else
+	if (!is_file_epoll(file)) {
+		if (atomic_long_inc_not_zero(&file->f_count)) {
+			res = vfs_poll(file, pt);
+			fput(file);
+		} else
+			res = EPOLLERR;
+	} else
 		res = __ep_eventpoll_poll(file, pt, depth);
 	return res & epi->event.events;
 }
  
Al Viro May 3, 2024, 9:11 p.m. UTC | #6
On Fri, May 03, 2024 at 01:28:37PM -0700, Kees Cook wrote:
> 
> Is this the right approach? It still feels to me like get_file() needs
> to happen much earlier...

I don't believe it needs to happen at all.  The problem is not that
->release() can be called during ->poll() - it can't and it doesn't.
It's that this instance of ->poll() is trying to extend the lifetime
of that struct file, when it might very well be past the point of no
return.

What we need is
	* promise that ep_item_poll() won't happen after eventpoll_release_file().
AFAICS, we do have that.
	* ->poll() not playing silly buggers.

As it is, dma_buf ->poll() is very suspicious regardless of that
mess - it can grab reference to file for unspecified interval.
Have that happen shortly before reboot and you are asking for failing
umount.

->poll() must be refcount-neutral wrt file passed to it.  I'm seriously
tempted to make ->poll() take const struct file * and see if there's
anything else that would fall out.
  
Linus Torvalds May 3, 2024, 9:24 p.m. UTC | #7
On Fri, 3 May 2024 at 14:11, Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> What we need is
>         * promise that ep_item_poll() won't happen after eventpoll_release_file().
> AFAICS, we do have that.
>         * ->poll() not playing silly buggers.

No. That is not enough at all.

Because even with perfectly normal "->poll()", and even with the
ep_item_poll() happening *before* eventpoll_release_file(), you have
this trivial race:

  ep_item_poll()
     ->poll()

and *between* those two operations, another CPU does "close()", and
that causes eventpoll_release_file() to be called, and now f_count
goes down to zero while ->poll() is running.

So you do need to increment the file count around the ->poll() call, I feel.

Or, alternatively, you'd need to serialize with
eventpoll_release_file(), but that would need to be some sleeping lock
held over the ->poll() call.

> As it is, dma_buf ->poll() is very suspicious regardless of that
> mess - it can grab reference to file for unspecified interval.

I think that's actually much preferable to what epoll does, which is
to keep using files without having reference counts to them (and then
relying on magically not racing with eventpoll_release_file().

                Linus
  
Al Viro May 3, 2024, 9:30 p.m. UTC | #8
On Fri, May 03, 2024 at 02:24:45PM -0700, Linus Torvalds wrote:
> Because even with perfectly normal "->poll()", and even with the
> ep_item_poll() happening *before* eventpoll_release_file(), you have
> this trivial race:
> 
>   ep_item_poll()
>      ->poll()
> 
> and *between* those two operations, another CPU does "close()", and
> that causes eventpoll_release_file() to be called, and now f_count
> goes down to zero while ->poll() is running.
> 
> So you do need to increment the file count around the ->poll() call, I feel.
> 
> Or, alternatively, you'd need to serialize with
> eventpoll_release_file(), but that would need to be some sleeping lock
> held over the ->poll() call.
> 
> > As it is, dma_buf ->poll() is very suspicious regardless of that
> > mess - it can grab reference to file for unspecified interval.
> 
> I think that's actually much preferable to what epoll does, which is
> to keep using files without having reference counts to them (and then
> relying on magically not racing with eventpoll_release_file().

eventpoll_release_file() calling __ep_remove() while ep_item_poll()
is something we need to avoid anyway - having epi freed under
ep_item_poll() would be a problem regardless of struct file
lifetime issues.
  
Al Viro May 3, 2024, 9:36 p.m. UTC | #9
On Fri, May 03, 2024 at 10:11:09PM +0100, Al Viro wrote:
> On Fri, May 03, 2024 at 01:28:37PM -0700, Kees Cook wrote:
> > 
> > Is this the right approach? It still feels to me like get_file() needs
> > to happen much earlier...
> 
> I don't believe it needs to happen at all.  The problem is not that
> ->release() can be called during ->poll() - it can't and it doesn't.
> It's that this instance of ->poll() is trying to extend the lifetime
> of that struct file, when it might very well be past the point of no
> return.
> 
> What we need is
> 	* promise that ep_item_poll() won't happen after eventpoll_release_file().
> AFAICS, we do have that.
> 	* ->poll() not playing silly buggers.
> 
> As it is, dma_buf ->poll() is very suspicious regardless of that
> mess - it can grab reference to file for unspecified interval.
> Have that happen shortly before reboot and you are asking for failing
> umount.
> 
> ->poll() must be refcount-neutral wrt file passed to it.  I'm seriously
> tempted to make ->poll() take const struct file * and see if there's
> anything else that would fall out.

... the last part is no-go - poll_wait() must be able to grab a reference
(well, the callback in it must)
  
Linus Torvalds May 3, 2024, 9:42 p.m. UTC | #10
On Fri, 3 May 2024 at 14:36, Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> ... the last part is no-go - poll_wait() must be able to grab a reference
> (well, the callback in it must)

Yeah. I really think that *poll* itself is doing everything right. It
knows that it's called with a file pointer with a reference, and it
adds its own references as needed.

And I think that's all fine - both for dmabuf in particular, but for
poll in general. That's how things are *supposed* to work. You can
keep references to other things in your 'struct file *', knowing that
files are properly refcounted, and won't go away while you are dealing
with them.

The problem, of course, is that then epoll violates that "called with
reference" part.  epoll very much by design does *not* take references
to the files it keeps track of, and then tears them down at close()
time.

Now, epoll has its reasons for doing that. They are even good reasons.
But that does mean that when epoll needs to deal with that hackery.

I wish we could remove epoll entirely, but that isn't an option, so we
need to just make sure that when it accesses the ffd.file pointer, it
does so more carefully.

              Linus
  
Al Viro May 3, 2024, 9:53 p.m. UTC | #11
On Fri, May 03, 2024 at 02:42:22PM -0700, Linus Torvalds wrote:
> On Fri, 3 May 2024 at 14:36, Al Viro <viro@zeniv.linux.org.uk> wrote:
> >
> > ... the last part is no-go - poll_wait() must be able to grab a reference
> > (well, the callback in it must)
> 
> Yeah. I really think that *poll* itself is doing everything right. It
> knows that it's called with a file pointer with a reference, and it
> adds its own references as needed.

Not really.  Note that select's __pollwait() does *NOT* leave a reference
at the mercy of driver - it's stuck into poll_table_entry->filp and
the poll_freewait() knows how to take those out.


dmabuf does something very different - it grabs the damn thing into
its private data structures and for all we know it could keep it for
a few hours, until some even materializes.
  
Christian Brauner May 4, 2024, 9:45 a.m. UTC | #12
On Fri, May 03, 2024 at 11:26:32AM -0700, Kees Cook wrote:
> On Fri, May 03, 2024 at 06:54:22PM +0700, Bui Quang Minh wrote:
> > [...]
> > Root cause:
> > AFAIK, eventpoll (epoll) does not increase the registered file's reference.
> > To ensure the safety, when the registered file is deallocated in __fput,
> > eventpoll_release is called to unregister the file from epoll. When calling
> > poll on epoll, epoll will loop through registered files and call vfs_poll on
> > these files. In the file's poll, file is guaranteed to be alive, however, as
> > epoll does not increase the registered file's reference, the file may be
> > dying
> > and it's not safe the get the file for later use. And dma_buf_poll violates
> > this. In the dma_buf_poll, it tries to get_file to use in the callback. This
> > leads to a race where the dmabuf->file can be fput twice.
> > 
> > Here is the race occurs in the above proof-of-concept
> > 
> > close(dmabuf->file)
> > __fput_sync (f_count == 1, last ref)
> > f_count-- (f_count == 0 now)
> > __fput
> >                                     epoll_wait
> >                                     vfs_poll(dmabuf->file)
> >                                     get_file(dmabuf->file)(f_count == 1)
> > eventpoll_release
> > dmabuf->file deallocation
> >                                     fput(dmabuf->file) (f_count == 1)
> >                                     f_count--
> >                                     dmabuf->file deallocation
> > 
> > I am not familiar with the dma_buf so I don't know the proper fix for the
> > issue. About the rule that don't get the file for later use in poll callback
> > of
> > file, I wonder if it is there when only select/poll exist or just after
> > epoll
> > appears.
> > 
> > I hope the analysis helps us to fix the issue.
> 
> Thanks for doing this analysis! I suspect at least a start of a fix
> would be this:
> 
> diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c
> index 8fe5aa67b167..15e8f74ee0f2 100644
> --- a/drivers/dma-buf/dma-buf.c
> +++ b/drivers/dma-buf/dma-buf.c
> @@ -267,9 +267,8 @@ static __poll_t dma_buf_poll(struct file *file, poll_table *poll)
>  
>  		if (events & EPOLLOUT) {
>  			/* Paired with fput in dma_buf_poll_cb */
> -			get_file(dmabuf->file);
> -
> -			if (!dma_buf_poll_add_cb(resv, true, dcb))
> +			if (!atomic_long_inc_not_zero(&dmabuf->file) &&
> +			    !dma_buf_poll_add_cb(resv, true, dcb))
>  				/* No callback queued, wake up any other waiters */
>  				dma_buf_poll_cb(NULL, &dcb->cb);
>  			else
> @@ -290,9 +289,8 @@ static __poll_t dma_buf_poll(struct file *file, poll_table *poll)
>  
>  		if (events & EPOLLIN) {
>  			/* Paired with fput in dma_buf_poll_cb */
> -			get_file(dmabuf->file);
> -
> -			if (!dma_buf_poll_add_cb(resv, false, dcb))
> +			if (!atomic_long_inc_not_zero(&dmabuf->file) &&
> +			    !dma_buf_poll_add_cb(resv, false, dcb))
>  				/* No callback queued, wake up any other waiters */
>  				dma_buf_poll_cb(NULL, &dcb->cb);
>  			else
> 
> 
> But this ends up leaving "active" non-zero, and at close time it runs
> into:
> 
>         BUG_ON(dmabuf->cb_in.active || dmabuf->cb_out.active);
> 
> But the bottom line is that get_file() is unsafe to use in some places,
> one of which appears to be in the poll handler. There are maybe some
> other fragile places too, like in drivers/gpu/drm/vmwgfx/ttm_object.c:
> 
> static bool __must_check get_dma_buf_unless_doomed(struct dma_buf *dmabuf)
> {
> 	return atomic_long_inc_not_zero(&dmabuf->file->f_count) != 0L;
> }
> 
> Which I also note involves a dmabuf...
> 
> Due to this issue I've proposed fixing get_file() to detect pathological states:
> https://lore.kernel.org/lkml/20240502222252.work.690-kees@kernel.org/
> 
> But that has run into some push-back. I'm hoping that seeing this epoll
> example will help illustrate what needs fixing a little better.
> 
> I think the best current proposal is to just WARN sooner instead of a
> full refcount_t implementation:
> 
> 
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 8dfd53b52744..e09107d0a3d6 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -1040,7 +1040,8 @@ struct file_handle {
>  
>  static inline struct file *get_file(struct file *f)
>  {
> -	atomic_long_inc(&f->f_count);
> +	long prior = atomic_long_fetch_inc_relaxed(&f->f_count);
> +	WARN_ONCE(!prior, "struct file::f_count incremented from zero; use-after-free condition present!\n");
>  	return f;
>  }
>  
> 
> 
> What's the right way to deal with the dmabuf situation? (And I suspect
> it applies to get_dma_buf_unless_doomed() as well...)

No, it doesn't. That's safe afaict as I've explained at lenght in
the other thread.
  
Christian Brauner May 4, 2024, 9:59 a.m. UTC | #13
On Fri, May 03, 2024 at 12:59:52PM -0700, Kees Cook wrote:
> On Fri, May 03, 2024 at 01:35:09PM -0600, Jens Axboe wrote:
> > On 5/3/24 1:22 PM, Kees Cook wrote:
> > > On Fri, May 03, 2024 at 12:49:11PM -0600, Jens Axboe wrote:
> > >> On 5/3/24 12:26 PM, Kees Cook wrote:
> > >>> Thanks for doing this analysis! I suspect at least a start of a fix
> > >>> would be this:
> > >>>
> > >>> diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c
> > >>> index 8fe5aa67b167..15e8f74ee0f2 100644
> > >>> --- a/drivers/dma-buf/dma-buf.c
> > >>> +++ b/drivers/dma-buf/dma-buf.c
> > >>> @@ -267,9 +267,8 @@ static __poll_t dma_buf_poll(struct file *file, poll_table *poll)
> > >>>  
> > >>>  		if (events & EPOLLOUT) {
> > >>>  			/* Paired with fput in dma_buf_poll_cb */
> > >>> -			get_file(dmabuf->file);
> > >>> -
> > >>> -			if (!dma_buf_poll_add_cb(resv, true, dcb))
> > >>> +			if (!atomic_long_inc_not_zero(&dmabuf->file) &&
> > >>> +			    !dma_buf_poll_add_cb(resv, true, dcb))
> > >>>  				/* No callback queued, wake up any other waiters */
> > >>
> > >> Don't think this is sane at all. I'm assuming you meant:
> > >>
> > >> 	atomic_long_inc_not_zero(&dmabuf->file->f_count);
> > > 
> > > Oops, yes, sorry. I was typed from memory instead of copy/paste.
> > 
> > Figured :-)
> > 
> > >> but won't fly as you're not under RCU in the first place. And what
> > >> protects it from being long gone before you attempt this anyway? This is
> > >> sane way to attempt to fix it, it's completely opposite of what sane ref
> > >> handling should look like.
> > >>
> > >> Not sure what the best fix is here, seems like dma-buf should hold an
> > >> actual reference to the file upfront rather than just stash a pointer
> > >> and then later _hope_ that it can just grab a reference. That seems
> > >> pretty horrible, and the real source of the issue.
> > > 
> > > AFAICT, epoll just doesn't hold any references at all. It depends,
> > > I think, on eventpoll_release() (really eventpoll_release_file())
> > > synchronizing with epoll_wait() (but I don't see how this happens, and
> > > the race seems to be against ep_item_poll() ...?)
> > >
> > > I'm really confused about how eventpoll manages the lifetime of polled
> > > fds.
> > 
> > epoll doesn't hold any references, and it's got some ugly callback to
> > deal with that. It's not ideal, nor pretty, but that's how it currently
> > works. See eventpoll_release() and how it's called. This means that
> > epoll itself is supposedly safe from the file going away, even though it
> > doesn't hold a reference to it.
> 
> Right -- what remains unclear to me is how struct file lifetime is
> expected to work in the struct file_operations::poll callbacks. Because
> using get_file() there looks clearly unsafe...

If you're in ->poll() you're holding the epoll mutex and
eventpoll_release_file() needs to acquire ep->mtx as well. So if you're
in ->poll() then you know that eventpoll_release_file() can't progress
and therefore eventpoll_release() can't make progress. So
f_op->release() won't be able to be called as it happens after
eventpoll_release() in __fput(). But f_count being able to go to zero is
expected.
  
Daniel Vetter May 6, 2024, 12:23 p.m. UTC | #14
On Fri, May 03, 2024 at 10:53:03PM +0100, Al Viro wrote:
> On Fri, May 03, 2024 at 02:42:22PM -0700, Linus Torvalds wrote:
> > On Fri, 3 May 2024 at 14:36, Al Viro <viro@zeniv.linux.org.uk> wrote:
> > >
> > > ... the last part is no-go - poll_wait() must be able to grab a reference
> > > (well, the callback in it must)
> > 
> > Yeah. I really think that *poll* itself is doing everything right. It
> > knows that it's called with a file pointer with a reference, and it
> > adds its own references as needed.
> 
> Not really.  Note that select's __pollwait() does *NOT* leave a reference
> at the mercy of driver - it's stuck into poll_table_entry->filp and
> the poll_freewait() knows how to take those out.
> 
> 
> dmabuf does something very different - it grabs the damn thing into
> its private data structures and for all we know it could keep it for
> a few hours, until some even materializes.

dma_fence must complete in reasonable amount of time, where "reasonable"
is roughly in line with other i/o (including the option that there's
timeouts if the hw's gone busted).

So definitely not hours (aside from driver bugs when things go really
wrong ofc), but more like a few seconds in a worst case scenario.
-Sima
  
Stefan Metzmacher May 6, 2024, 5:46 p.m. UTC | #15
Am 03.05.24 um 23:24 schrieb Linus Torvalds:
> On Fri, 3 May 2024 at 14:11, Al Viro <viro@zeniv.linux.org.uk> wrote:
>>
>> What we need is
>>          * promise that ep_item_poll() won't happen after eventpoll_release_file().
>> AFAICS, we do have that.
>>          * ->poll() not playing silly buggers.
> 
> No. That is not enough at all.
> 
> Because even with perfectly normal "->poll()", and even with the
> ep_item_poll() happening *before* eventpoll_release_file(), you have
> this trivial race:
> 
>    ep_item_poll()
>       ->poll()
> 
> and *between* those two operations, another CPU does "close()", and
> that causes eventpoll_release_file() to be called, and now f_count
> goes down to zero while ->poll() is running.
> 
> So you do need to increment the file count around the ->poll() call, I feel.
> 
> Or, alternatively, you'd need to serialize with
> eventpoll_release_file(), but that would need to be some sleeping lock
> held over the ->poll() call.
> 
>> As it is, dma_buf ->poll() is very suspicious regardless of that
>> mess - it can grab reference to file for unspecified interval.
> 
> I think that's actually much preferable to what epoll does, which is
> to keep using files without having reference counts to them (and then
> relying on magically not racing with eventpoll_release_file().

I think it's a very important detail that epoll does not take
real references. Otherwise an application level 'close()' on a socket
would not trigger a tcp disconnect, when an fd is still registered with
epoll.

I noticed that some parts of Samba currently rely on this when I tried
to convert tevent from epoll to IORING_OP_POLL_ADD (which takes a longer term reference)

And I guess there will be other applications also relying on the current epoll
behavior. That a closed fs automatically removes itself from epoll.

A short term reference just around ->poll() might be fine,
but please no reference via EPOLL_CTL_ADD.

Changing that can cause security problems in user space.

I haven't followed all details of this thread,
please ignore me if that's all clear already :-)

Thanks!
metze
  
Linus Torvalds May 6, 2024, 6:17 p.m. UTC | #16
On Mon, 6 May 2024 at 10:46, Stefan Metzmacher <metze@samba.org> wrote:
>
> I think it's a very important detail that epoll does not take
> real references. Otherwise an application level 'close()' on a socket
> would not trigger a tcp disconnect, when an fd is still registered with
> epoll.

Yes, exactly.

epoll() ends up actually wanting the lifetime of the ep item be the
lifetime of the file _descriptor_, not the lifetime of the file
itself.

We approximate that - badly - with epoll not taking a reference on the
file pointer, and then at final fput() it tears things down.

But that has two real issues, and one of them is that "oh, now epoll
has file pointers that are actually dead" that caused this thread.

The other issue is that "approximates" thing, where it means that
duplicating the file pointer (dup*() and fork() end unix socket file
sending etc) will not mean that the epoll ref is also out of sync with
the lifetime of the file descriptor.

That's why I suggested that "clean up epoll references at
file_close_fd() time instead:

  https://lore.kernel.org/all/CAHk-=wj6XL9MGCd_nUzRj6SaKeN0TsyTTZDFpGdW34R+zMZaSg@mail.gmail.com/

because it would actually really *fix* the lifetime issue of ep items.

In the process, it would make it possible to actually take a f_count
reference at EPOLL_CTL_ADD time, since now the lifetime of the EP
wouldn't be tied to the lifetime of the 'struct file *' pointer, it
would be properly tied to the lifetime of the actual file descriptor
that you are adding.

So it would be a huge conceptual cleanup too.

(Of course - at that point EPOLL_CTL_ADD still doesn't actually _need_
a reference to the file, since the file being open in itself is
already that reference - but the point here being that there would
*be* a reference that the epoll code would effectively have, and you'd
never be in the situation we were in where there would be stale "dead"
file pointers that just haven't gone through the cleanup yet).

But I'd rather not touch the epoll code more than I have to.

Which is why I applied the minimal patch for just "refcount over
vfs_poll()", and am just mentioning my suggestion in the hope that
some eager beaver would like to see how painful it would do to make
the bigger surgery...

                   Linus
  
David Laight May 8, 2024, 8:47 a.m. UTC | #17
From: Linus Torvalds
> Sent: 06 May 2024 19:18
...
> Which is why I applied the minimal patch for just "refcount over
> vfs_poll()", and am just mentioning my suggestion in the hope that
> some eager beaver would like to see how painful it would do to make
> the bigger surgery...

I wonder if I can work out how it (doesn't) currently work...

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
  

Patch

diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c
index 8fe5aa67b167..15e8f74ee0f2 100644
--- a/drivers/dma-buf/dma-buf.c
+++ b/drivers/dma-buf/dma-buf.c
@@ -267,9 +267,8 @@  static __poll_t dma_buf_poll(struct file *file, poll_table *poll)
 
 		if (events & EPOLLOUT) {
 			/* Paired with fput in dma_buf_poll_cb */
-			get_file(dmabuf->file);
-
-			if (!dma_buf_poll_add_cb(resv, true, dcb))
+			if (!atomic_long_inc_not_zero(&dmabuf->file) &&
+			    !dma_buf_poll_add_cb(resv, true, dcb))
 				/* No callback queued, wake up any other waiters */
 				dma_buf_poll_cb(NULL, &dcb->cb);
 			else
@@ -290,9 +289,8 @@  static __poll_t dma_buf_poll(struct file *file, poll_table *poll)
 
 		if (events & EPOLLIN) {
 			/* Paired with fput in dma_buf_poll_cb */
-			get_file(dmabuf->file);
-
-			if (!dma_buf_poll_add_cb(resv, false, dcb))
+			if (!atomic_long_inc_not_zero(&dmabuf->file) &&
+			    !dma_buf_poll_add_cb(resv, false, dcb))
 				/* No callback queued, wake up any other waiters */
 				dma_buf_poll_cb(NULL, &dcb->cb);
 			else


But this ends up leaving "active" non-zero, and at close time it runs
into:

        BUG_ON(dmabuf->cb_in.active || dmabuf->cb_out.active);

But the bottom line is that get_file() is unsafe to use in some places,
one of which appears to be in the poll handler. There are maybe some
other fragile places too, like in drivers/gpu/drm/vmwgfx/ttm_object.c:

static bool __must_check get_dma_buf_unless_doomed(struct dma_buf *dmabuf)
{
	return atomic_long_inc_not_zero(&dmabuf->file->f_count) != 0L;
}

Which I also note involves a dmabuf...

Due to this issue I've proposed fixing get_file() to detect pathological states:
https://lore.kernel.org/lkml/20240502222252.work.690-kees@kernel.org/

But that has run into some push-back. I'm hoping that seeing this epoll
example will help illustrate what needs fixing a little better.

I think the best current proposal is to just WARN sooner instead of a
full refcount_t implementation:


diff --git a/include/linux/fs.h b/include/linux/fs.h
index 8dfd53b52744..e09107d0a3d6 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1040,7 +1040,8 @@  struct file_handle {
 
 static inline struct file *get_file(struct file *f)
 {
-	atomic_long_inc(&f->f_count);
+	long prior = atomic_long_fetch_inc_relaxed(&f->f_count);
+	WARN_ONCE(!prior, "struct file::f_count incremented from zero; use-after-free condition present!\n");
 	return f;
 }