[v2,2/3] mutex: add support for reservation style locks, v2

Message ID 20130228102502.15191.14146.stgit@patser (mailing list archive)
State Not Applicable, archived
Headers

Commit Message

Maarten Lankhorst Feb. 28, 2013, 10:25 a.m. UTC
  New version. All of the documentation has been moved from the commit log to
Documentation/reservation-mutex-design.txt

Missing at the moment, maybe TODO?

Add a lockdep check in the *_slow calls that verifies that the lock
being nested into has no nested lock any more? This would be a check
to make sure that mutex_unreserve_unlock has been called on all other
locks correctly.

Changes since RFC patch v1:
 - Updated to use atomic_long instead of atomic, since the reservation_id was a long.
 - added mutex_reserve_lock_slow and mutex_reserve_lock_intr_slow
 - removed mutex_locked_set_reservation_id (or w/e it was called)
Changes since RFC patch v2:
 - remove use of __mutex_lock_retval_arg, add warnings when using wrong combination of
   mutex_(,reserve_)lock/unlock.
Changes since v1:
 - Add __always_inline to __mutex_lock_common, otherwise reservation paths can be
   triggered from normal locks, because __builtin_constant_p might evaluate to false
   for the constant 0 in that case. Tests for this have been added in the next patch.
 - Updated documentation slightly.

Signed-off-by: Maarten Lankhorst <maarten.lankhorst@canonical.com>
---
 Documentation/reservation-mutex-design.txt |  136 ++++++++++++
 include/linux/mutex.h                      |   86 +++++++
 kernel/mutex.c                             |  326 +++++++++++++++++++++++++++-
 3 files changed, 531 insertions(+), 17 deletions(-)
 create mode 100644 Documentation/reservation-mutex-design.txt


--
To unsubscribe from this list: send the line "unsubscribe linux-media" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
  

Comments

Peter Zijlstra April 2, 2013, 10:56 a.m. UTC | #1
On Thu, 2013-02-28 at 11:25 +0100, Maarten Lankhorst wrote:
> +mutex_reserve_lock_slow and mutex_reserve_lock_intr_slow:
> +  Similar to mutex_reserve_lock, except it won't backoff with
> -EAGAIN.
> +  This is useful when mutex_reserve_lock failed with -EAGAIN, and you
> +  unreserved all reservation_locks so no deadlock can occur.
> +

I don't particularly like these function names, with lock
implementations the _slow post-fix is typically used for slow path
implementations, not API type interfaces.



--
To unsubscribe from this list: send the line "unsubscribe linux-media" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
  
Peter Zijlstra April 2, 2013, 10:57 a.m. UTC | #2
On Thu, 2013-02-28 at 11:25 +0100, Maarten Lankhorst wrote:
> +struct ticket_mutex {
> +       struct mutex base;
> +       atomic_long_t reservation_id;
> +};

I'm not sure this is a good name, esp. due to the potential confusion
vs the ticket spinlocks we have. 

--
To unsubscribe from this list: send the line "unsubscribe linux-media" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
  
Peter Zijlstra April 2, 2013, 11 a.m. UTC | #3
On Thu, 2013-02-28 at 11:25 +0100, Maarten Lankhorst wrote:
> +Reservation type mutexes

> +struct ticket_mutex {

> +extern int __must_check _mutex_reserve_lock(struct ticket_mutex *lock,

That's two different names and two different forms of one (for a total
of 3 variants) for the same scheme.

FAIL...

Also, is there anything in CS literature that comes close to this? I'd
think the DBMS people would have something similar with their
transactional systems. What do they call it?

--
To unsubscribe from this list: send the line "unsubscribe linux-media" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
  
Peter Zijlstra April 2, 2013, 11:04 a.m. UTC | #4
On Thu, 2013-02-28 at 11:25 +0100, Maarten Lankhorst wrote:
> +The algorithm that TTM came up with for dealing with this problem is
> +quite simple.  For each group of buffers (execbuf) that need to be
> +locked, the caller would be assigned a unique reservation_id, from a
> +global counter.  In case of deadlock while locking all the buffers
> +associated with a execbuf, the one with the lowest reservation_id
> +wins, and the one with the higher reservation_id unlocks all of the
> +buffers that it has already locked, and then tries again.

So the thing that made me cringe inside when I read this was making it
all work on -rt, where we also need to take PI into account and ensure
the entire thing is deterministic.

It _might_ be 'easy' and we could fall back to PI mutex behaviour in
the case the reservation IDs match; however the entire for-all-bufs
retry loops do worry me still.

Head hurts, needs more time to ponder. It would be good if someone else
(this would probably be you maarten) would also consider this and
explore
this 'interesting' problem space :-)

--
To unsubscribe from this list: send the line "unsubscribe linux-media" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
  
Maarten Lankhorst April 2, 2013, 2:57 p.m. UTC | #5
Hey,

Thanks for reviewing.

Op 02-04-13 13:00, Peter Zijlstra schreef:
> On Thu, 2013-02-28 at 11:25 +0100, Maarten Lankhorst wrote:
>> +Reservation type mutexes
>> +struct ticket_mutex {
>> +extern int __must_check _mutex_reserve_lock(struct ticket_mutex *lock,
> That's two different names and two different forms of one (for a total
> of 3 variants) for the same scheme.
>
> FAIL...
It's been hard since I haven't seen anything similar in the kernel, I originally went with tickets
since that's what ttm originally called it, and tried to kill as many references as I could
when I noticed ticket mutexes already being taken.

I'll fix up the ticket_mutex -> reservation_mutex, and mutex_reserve_* -> reserve_mutex_*

> On Thu, 2013-02-28 at 11:25 +0100, Maarten Lankhorst wrote:
>> +mutex_reserve_lock_slow and mutex_reserve_lock_intr_slow:
>> +  Similar to mutex_reserve_lock, except it won't backoff with
>> -EAGAIN.
>> +  This is useful when mutex_reserve_lock failed with -EAGAIN, and you
>> +  unreserved all reservation_locks so no deadlock can occur.
>> +
> I don't particularly like these function names, with lock
> implementations the _slow post-fix is typically used for slow path
> implementations, not API type interfaces.
 I didn't intend for drivers to use the new calls directly, but rather through a wrapper,
for example by ttm_eu_reserve_buffers in drivers/gpu/drm/ttm/ttm_execbuf_util.c

> Also, is there anything in CS literature that comes close to this? I'd
> think the DBMS people would have something similar with their
> transactional systems. What do they call it?
I didn't study cs, but judging from your phrasing I guess you mean you want me to call it transaction_mutexes instead?

> Head hurts, needs more time to ponder. It would be good if someone else
> (this would probably be you maarten) would also consider this and
> explore
> this 'interesting' problem space :-)
My head too, evil priority stuff!

Hacky but pragmatical workaround for now: use a real mutex around all the reserve_mutex_lock* calls instead of a virtual lock.
It can be unlocked as soon as all locks have been taken, before any actual work is done.

It only slightly kills the point of having a reservation in the first place, but at least it won't break completely -rt completely for now.

~Maarten

--
To unsubscribe from this list: send the line "unsubscribe linux-media" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
  
Daniel Vetter April 2, 2013, 3:56 p.m. UTC | #6
On Tue, Apr 2, 2013 at 1:00 PM, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> Also, is there anything in CS literature that comes close to this? I'd
> think the DBMS people would have something similar with their
> transactional systems. What do they call it?

I've looked around a bit and in dbms row-locking land this seems to be
called the wound-wait deadlock avoidance algorithm. It's the same
approach where if you encounter an older ticket (there called
transaction timestamp) you drop all locked rows and retry (or abort)
the transaction. If you encounter a newer ticket when trying to grab a
lock simply do a blocking wait. So ticket/reservation in Maartens
patches is the analog of timestamp/transaction.
-Daniel
  
Peter Zijlstra April 2, 2013, 4:59 p.m. UTC | #7
On Tue, 2013-04-02 at 16:57 +0200, Maarten Lankhorst wrote:
> Hey,
> 
> Thanks for reviewing.

Only partway through so far :-)

> Op 02-04-13 13:00, Peter Zijlstra schreef:
> > On Thu, 2013-02-28 at 11:25 +0100, Maarten Lankhorst wrote:
> >> +Reservation type mutexes
> >> +struct ticket_mutex {
> >> +extern int __must_check _mutex_reserve_lock(struct ticket_mutex *lock,
> > That's two different names and two different forms of one (for a total
> > of 3 variants) for the same scheme.
> >
> > FAIL...

> It's been hard since I haven't seen anything similar in the kernel, I
> originally went with tickets since that's what ttm originally called
> it, and tried to kill as many references as I could when I noticed
> ticket mutexes already being taken.

Ticket mutexes as such don't exist, but we have ticket based spinlock
implementations. It seems a situation ripe for confusion to have two
locking primitives (mutex, spinlock) with similar names (ticket) but
vastly different semantics.

> I'll fix up the ticket_mutex -> reservation_mutex, and mutex_reserve_*
> -> reserve_mutex_*

Do a google for "lock reservation" and observe the results.. its some
scheme where they pre-assign lock ownership to the most likely thread.

> > On Thu, 2013-02-28 at 11:25 +0100, Maarten Lankhorst wrote:
> >> +mutex_reserve_lock_slow and mutex_reserve_lock_intr_slow:
> >> +  Similar to mutex_reserve_lock, except it won't backoff with
> >> -EAGAIN.
> >> +  This is useful when mutex_reserve_lock failed with -EAGAIN, and you
> >> +  unreserved all reservation_locks so no deadlock can occur.
> >> +
> > I don't particularly like these function names, with lock
> > implementations the _slow post-fix is typically used for slow path
> > implementations, not API type interfaces.

>  I didn't intend for drivers to use the new calls directly, but rather
> through a wrapper, for example by ttm_eu_reserve_buffers in
> drivers/gpu/drm/ttm/ttm_execbuf_util.c

You're providing a generic interface to the core kernel, other people
will end up using it. Providing a proper API is helpful.

> > Also, is there anything in CS literature that comes close to this? I'd
> > think the DBMS people would have something similar with their
> > transactional systems. What do they call it?

> I didn't study cs, but judging from your phrasing I guess you mean you
> want me to call it transaction_mutexes instead?

Nah, me neither, I just hate reinventing names for something that's
already got a perfectly fine name under which a bunch of people know
it.

See the email from Daniel, apparently its known as wound-wait deadlock
avoidance -- its actually described in the "deadlock" wikipedia
article.

So how about we call the thing something like:

  struct ww_mutex; /* wound/wait */

  int mutex_wound_lock(struct ww_mutex *); /* returns -EDEADLK */
  int mutex_wait_lock(struct ww_mutex *);  /* does not fail */

Hmm.. thinking about that,.. you only need that second variant because
you don't have a clear lock to wait for the 'older' process to
complete; but having the unconditional wait makes the entire thing
prone to accidents and deadlocks when the 'user' (read your fellow
programmers) make a mistake.

Ideally we'd only have the one primitive that returns -EDEADLK and use
a 'proper' mutex to wait on or somesuch.. let me ponder this a bit more.

> > Head hurts, needs more time to ponder. It would be good if someone else 
> > (this would probably be you maarten) would also consider this explore 
> > this 'interesting' problem space :-) 

> My head too, evil priority stuff!
> 
> Hacky but pragmatical workaround for now: use a real mutex around all
> the reserve_mutex_lock* calls instead of a virtual lock. It can be
> unlocked as soon as all locks have been taken, before any actual work
> is done.
> 
> It only slightly kills the point of having a reservation in the first
> place, but at least it won't break completely -rt completely for now.

Yeah, global lock, yay :-(

--
To unsubscribe from this list: send the line "unsubscribe linux-media" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
  
Daniel Vetter April 2, 2013, 5:23 p.m. UTC | #8
On Tue, Apr 2, 2013 at 6:59 PM, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
>> > Head hurts, needs more time to ponder. It would be good if someone else
>> > (this would probably be you maarten) would also consider this explore
>> > this 'interesting' problem space :-)
>
>> My head too, evil priority stuff!
>>
>> Hacky but pragmatical workaround for now: use a real mutex around all
>> the reserve_mutex_lock* calls instead of a virtual lock. It can be
>> unlocked as soon as all locks have been taken, before any actual work
>> is done.
>>
>> It only slightly kills the point of having a reservation in the first
>> place, but at least it won't break completely -rt completely for now.
>
> Yeah, global lock, yay :-(

We've discussed this quite a bit on irc and came up with a bunch of
other ideas. The global lock is completely transparent to users, since
the lockdep annotations already rely on ticket_init/fini being a
virtual lock. So we can always fall back to that option.

For more fancy approaches we need to consider the aim first - do we
just want to prevent deadlocks through PI or do we aim for bounded
per-reservation_mutex wait block-to-acquire times for the thread with
highest rt-prio.

If it's just the former I think we can get by by piggy-packing on top
of the existing PI mutex code. Only downside is that threads can lock
arbitrary many reservation locks and so we're looking at boosting an
entire tree of processes. Otoh common operations done while holding
such a lock are swapping buffer objects in or waiting for gpu
rendering. And since we can easily queue up a few ms of rendering rt
guarantees are out the window ;-)

If that's not good enough and the global lock not scalable enough we
could try to limit the fan-out by setting a PI-boost flag in the
reservation ticket (in additional to the normal PI boosting for the
reservation mutex). Threads which are boosted in that fashion will get
a -EAGAIN on the next mutex_reserv_lock call, ensuring that the
blocking doesn't spread to further threads. But that requires that we
pass around pointers to tickets instead of values, so lifetime fun
(atm the ticket is on the stack) and probably tons of races in
updating the ticket boost state. I'd like to avoid that until we've
demonstrated a need for it ...

In any way I think that all three approaches should fit into the
proposed interfaces, so we should be able to do something sane here.
But since I have pretty much zero clue about rt I have no idea which
of the first two approaches would be preferable.

Cheers, Daniel
  
Daniel Vetter April 2, 2013, 5:30 p.m. UTC | #9
On Tue, Apr 2, 2013 at 6:59 PM, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
>> > Also, is there anything in CS literature that comes close to this? I'd
>> > think the DBMS people would have something similar with their
>> > transactional systems. What do they call it?
>
>> I didn't study cs, but judging from your phrasing I guess you mean you
>> want me to call it transaction_mutexes instead?
>
> Nah, me neither, I just hate reinventing names for something that's
> already got a perfectly fine name under which a bunch of people know
> it.
>
> See the email from Daniel, apparently its known as wound-wait deadlock
> avoidance -- its actually described in the "deadlock" wikipedia
> article.
>
> So how about we call the thing something like:
>
>   struct ww_mutex; /* wound/wait */
>
>   int mutex_wound_lock(struct ww_mutex *); /* returns -EDEADLK */
>   int mutex_wait_lock(struct ww_mutex *);  /* does not fail */

I'm not sold on this prefix, since wound-wait is just the
implementation detail of how it detects/handles deadlocks. For users a
really dumb strategy of just doing a mutex trylock and always
returning -EAGAIN if that fails (together with a msleep(rand) in the
slowpath) would have the same interface. Almost at least, we could
ditch the ticket - but the ticket is used as a virtual lock for the
lockdep annotation, so ditching it would also reduce lockdep
usefulness (due to all those trylocks). So in case we ever switch the
deadlock/backoff algo ww_ would be a bit misleading.

Otoh reservation_ is just what it's used for in graphics-land, so not
that much better. I don't really have a good idea for what this is
besides mutexes_with_magic_deadlock_detection_and_backoff. Which is a
bit long.
-Daniel
  
Peter Zijlstra April 4, 2013, 12:01 p.m. UTC | #10
I'm sorry, this email ended up quite a bit longer than I had hoped for;
please bear with me.

On Tue, 2013-04-02 at 18:59 +0200, Peter Zijlstra wrote:
>   struct ww_mutex; /* wound/wait */
> 
>   int mutex_wound_lock(struct ww_mutex *); /* returns -EDEADLK */
>   int mutex_wait_lock(struct ww_mutex *);  /* does not fail */
> 
> Hmm.. thinking about that,.. you only need that second variant because
> you don't have a clear lock to wait for the 'older' process to
> complete; but having the unconditional wait makes the entire thing
> prone to accidents and deadlocks when the 'user' (read your fellow
> programmers) make a mistake.
> 
> Ideally we'd only have the one primitive that returns -EDEADLK and use
> a 'proper' mutex to wait on or somesuch.. let me ponder this a bit
> more.

So I had me a little ponder..

Its all really about breaking the symmetry (equivalence of both sides)
of the deadlock. Lets start with the simplest AB-BA deadlock:

	task-O	task-Y
	  A
		B
		A <-- blocks on O
	  B <-- blocks on Y

In this case the wound-wait algorithm says that the older task (O) has
precedence over the younger task (Y) and we'll 'kill' Y to allow
progress for O.

Now clearly we don't really want to kill Y, instead we'll allow its
lock operation to return -EDEADLK.

This would suggest that our 'new' mutex implementation (ww_mutex
henceforth) has owner tracking -- so O acquiring B can find Y to 'kill'
-- and that we have a new wakeup state TASK_DEADLOCK similar to
TASK_WAKEKILL (can we avoid this by using the extra state required
below? I don't think so since we'd have the risk of waking Y while it
would be blocked on a !ww_mutex)

Now this gets a little more interesting if we change the scenario a
little:

	task-O	task-Y
	  A
		B
	  B <-- blocks on Y
		* <-- could be A

In this case when O blocks Y isn't actually blocked, so our
TASK_DEADLOCK wakeup doesn't actually achieve anything.

This means we also have to track (task) state so that once Y tries to
acquire A (creating the actual deadlock) we'll not wait so our
TASK_DEADLOCK wakeup doesn't actually achieve anything.

Note that Y doesn't need to acquire A in order to return -EDEADLK, any
acquisition from the same set (see below) would return -EDEADLK even if
there isn't an actual deadlock. This is the cost of heuristic; we could
walk the actual block graph but that would be prohibitively expensive
since we'd have to do this on every acquire.

This raises the point of what would happen if Y were to acquire a
'regular' mutex in between the series of ww_mutexes. In this case we'd
clearly have an actual deadlock scenario. However lockdep would catch
this for we'd clearly invert the lock class order:

	ww_mutex -> other -> ww_mutex

For the more complex scenarios there's the case of multiple waiters. In
this case its desirable to order the waiters in age order so that we
don't introduce age inversion -- this minimizes the amount of -EDEADLK
occurrences, but also allows doing away with the unconditional wait.

With the lock queue ordering we can simply immediately re-try the lock
acquisition sequence. Since the older task is already queued it will
obtain the lock, even if we re-queue ourselves before the lock is
assigned.

Now the one thing I've so far 'ignored' is how to assign age. Forward
progress guarantees demand that the age doesn't get reset on retries;
doing so would mean you're always pushing yourself fwd, decreasing your
'priority', never getting to be the most eligible. However it also
means that we should (re)set the time every time we start a 'new'
acquisition sequence. If we'd use a static (per task) age the task with
the lowest age would be able to 'starve' all other.

This means we need means to mark the start of an acquisition sequence;
one that is not included in the restart loop.

Having a start suggests having an end marker too, and indeed we can
find a reason for having one; suppose our second scenario above, where
Y has already acquired all locks it needs to proceed. In this case we
would have marked Y to fail on another acquisition. This doesn't seem
like a problem until you consider the case where we nest different
mm_mutex sets. In this case Y would -EDEADLK on the first of the second
(nested) set, even though re-trying that set is pointless, O doesn't
care.

Furthermore, considering the first scenario, O could block on a lock of
the first set when Y is blocked on a lock of the second (nested) set;
we need means of discerning the different lock sets. This cannot be
done by local means, that is, any context created by the start/end
markers would be local to that task and would not be recognizable by
another as to belong to the same group.

Instead we'd need to create a set-class, much like lockdep has and
somehow ensure all locks in a set are of that class.

One thing I'm not entirely sure of is the strict need for a local
context it could be used to track the set-class, but I'm not entirely
sure we need that to clean up state at the end marker. I haven't been
creative enough to construct a fail case where both O and Y nest and a
pending EDEADLK state could cross the end marker wrongly.

However, having a local context, even if empty, does put a few
constraints on the API usage, which as per below, is a good thing.

To recap, we now have:

  struct ww_mutex_key { };
  struct ww_mutex;
  struct ww_mutex_ctx;

  void __ww_mutex_init(struct ww_mutex *, void *);

  #define ww_mutex_init(ww_mutex) 	 		\
    do {					\
	static struct ww_mutex_key __key;		\
	__ww_mutex_init(ww_mutex, &__key);		\
    } while (0)

  void ww_mutex_acquire_start(struct ww_mutex_ctx *);
  void ww_mutex_acquire_end  (struct ww_mutex_ctx *);

  int  ww_mutex_lock(struct ww_mutex_ctx *, struct ww_mutex *);
  void ww_mutex_unlock(struct ww_mutex *);

Which are to be used like:

  int bufs_lock(bufs)
  {
	struct ww_mutex_ctx ww_ctx;

	ww_mutex_acquire_start(&ww_ctx);
  retry:
	for-all-buffers(buf, bufs) {
		err = ww_mutex_lock(&ww_ctx, &buf->lock);
		if (err)
			goto error;
	}

	ww_mutex_acquire_end(&ww_ctx);

	return 0; /* success */

  error:
	for-all-buffers(buf2, bufs) {
		if (buf2 == buf)
			break;
		ww_mutex_unlock(&buf->lock);
	}
	if (err == -EDEADLK)
		goto again;

	return err; /* other error */
  }

  void bufs_unlock(bufs)
  {
	for-all-buffers(buf, bufs)
		ww_mutex_unlock(&buf->lock);
  }

This API has a few problems as per Rusty's API guidelines, it is fairly
trivial to use it wrong; mostly the placement of the start and end
markers are easy to get wrong, but of course one can get all kinds of
badness from doing the retry wrong. At least having the local context
forces one to use the start/end markers.

The only remaining 'problem' left is RT, however the above suggests a
very clean solution; change the lock queueing order to first sort on
priority (be it the SCHED_FIFO/RR static priority or SCHED_DEADLINE
dynamic priority) and secondly on the age. 

Note that (re)setting the age at every acquisition set is really only a
means to provide FIFO fairness between tasks, RT tasks don't strictly
require this, they have their own ordering.

With all that this locking scheme should be deterministic; and in cases
where the older task would block (the young task has passed the end
marker) we can apply PI and push Y's priority.


Did I miss something?

Anyway, I haven't actually looked at the details of your implementation
yet, I'll get to that next, but I think something like the above should
be what we want to aim for.

*phew* thanks for reading this far ! :-)

--
To unsubscribe from this list: send the line "unsubscribe linux-media" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
  
Daniel Vetter April 4, 2013, 1:31 p.m. UTC | #11
On Thu, Apr 04, 2013 at 02:01:48PM +0200, Peter Zijlstra wrote:
> 
> I'm sorry, this email ended up quite a bit longer than I had hoped for;
> please bear with me.
> 
> On Tue, 2013-04-02 at 18:59 +0200, Peter Zijlstra wrote:
> >   struct ww_mutex; /* wound/wait */
> > 
> >   int mutex_wound_lock(struct ww_mutex *); /* returns -EDEADLK */
> >   int mutex_wait_lock(struct ww_mutex *);  /* does not fail */
> > 
> > Hmm.. thinking about that,.. you only need that second variant because
> > you don't have a clear lock to wait for the 'older' process to
> > complete; but having the unconditional wait makes the entire thing
> > prone to accidents and deadlocks when the 'user' (read your fellow
> > programmers) make a mistake.
> > 
> > Ideally we'd only have the one primitive that returns -EDEADLK and use
> > a 'proper' mutex to wait on or somesuch.. let me ponder this a bit
> > more.
> 
> So I had me a little ponder..

I need to chew some more through this, but figured some early comments to
clarify a few things would be good.

> Its all really about breaking the symmetry (equivalence of both sides)
> of the deadlock. Lets start with the simplest AB-BA deadlock:
> 
> 	task-O	task-Y
> 	  A
> 		B
> 		A <-- blocks on O
> 	  B <-- blocks on Y
> 
> In this case the wound-wait algorithm says that the older task (O) has
> precedence over the younger task (Y) and we'll 'kill' Y to allow
> progress for O.
> 
> Now clearly we don't really want to kill Y, instead we'll allow its
> lock operation to return -EDEADLK.
> 
> This would suggest that our 'new' mutex implementation (ww_mutex
> henceforth) has owner tracking -- so O acquiring B can find Y to 'kill'
> -- and that we have a new wakeup state TASK_DEADLOCK similar to
> TASK_WAKEKILL (can we avoid this by using the extra state required
> below? I don't think so since we'd have the risk of waking Y while it
> would be blocked on a !ww_mutex)

We do add some form of owner tracking by storing the reservation ticket
of the current holder into every ww_mutex. So when task-Y in your above
example tries to acquire lock A it notices that it's behind in the global
queue and immedieately returns -EAGAIN to indicate the deadlock.

Aside, there's a bit of fun in that ttm uses -EDEADLCK for when you try to
reserve the same buffer twice - it can easily detect that by comparing the
lock owner ticket with the provided one and if it matches bail out. The
special error code is useful since userspace could prepare a gpu
batch command submission which lists a buffer twice, which is a userspace
bug with the current drm/gem apis. But since userspace usually doesn't try
to trick the kernel it's better to detect this lazily, and due to the
retry logic we have all the relevant error bail-out code anyway.

Hence we've kept the special -EDEADLCK semantics even for the ww_mutex
stuff.

> Now this gets a little more interesting if we change the scenario a
> little:
> 
> 	task-O	task-Y
> 	  A
> 		B
> 	  B <-- blocks on Y
> 		* <-- could be A

The current code at least simple blocks task-O on B until task-Y unlocks
B. Deadlocks cannot happen since if task-Y ever tries to acquire a lock
which is held by an older task (e.g. lock A) it will bail out with
-EAGAIN.

> In this case when O blocks Y isn't actually blocked, so our
> TASK_DEADLOCK wakeup doesn't actually achieve anything.
> 
> This means we also have to track (task) state so that once Y tries to
> acquire A (creating the actual deadlock) we'll not wait so our
> TASK_DEADLOCK wakeup doesn't actually achieve anything.
> 
> Note that Y doesn't need to acquire A in order to return -EDEADLK, any
> acquisition from the same set (see below) would return -EDEADLK even if
> there isn't an actual deadlock. This is the cost of heuristic; we could
> walk the actual block graph but that would be prohibitively expensive
> since we'd have to do this on every acquire.

Hm, I guess your aim with the TASK_DEADLOCK wakeup is to bound the wait
times of older task. This could be interesting for RT, but I'm unsure of
the implications. The trick with the current code is that the oldest task
will never see an -EAGAIN ever and hence is guaranteed to make forward
progress. If the task is really unlucky though it might be forced to wait
for a younger task for every ww_mutex it tries to acquire.

The thing is now that you're not expected to hold these locks for a long
time - if you need to synchronously stall while holding a lock performance
will go down the gutters anyway. And since most current gpus/co-processors
still can't really preempt fairness isn't that high a priority, either.
So we didn't think too much about that.

> This raises the point of what would happen if Y were to acquire a
> 'regular' mutex in between the series of ww_mutexes. In this case we'd
> clearly have an actual deadlock scenario. However lockdep would catch
> this for we'd clearly invert the lock class order:
> 
> 	ww_mutex -> other -> ww_mutex
> 
> For the more complex scenarios there's the case of multiple waiters. In
> this case its desirable to order the waiters in age order so that we
> don't introduce age inversion -- this minimizes the amount of -EDEADLK
> occurrences, but also allows doing away with the unconditional wait.
> 
> With the lock queue ordering we can simply immediately re-try the lock
> acquisition sequence. Since the older task is already queued it will
> obtain the lock, even if we re-queue ourselves before the lock is
> assigned.
> 
> Now the one thing I've so far 'ignored' is how to assign age. Forward
> progress guarantees demand that the age doesn't get reset on retries;
> doing so would mean you're always pushing yourself fwd, decreasing your
> 'priority', never getting to be the most eligible. However it also
> means that we should (re)set the time every time we start a 'new'
> acquisition sequence. If we'd use a static (per task) age the task with
> the lowest age would be able to 'starve' all other.
> 
> This means we need means to mark the start of an acquisition sequence;
> one that is not included in the restart loop.
> 
> Having a start suggests having an end marker too, and indeed we can
> find a reason for having one; suppose our second scenario above, where
> Y has already acquired all locks it needs to proceed. In this case we
> would have marked Y to fail on another acquisition. This doesn't seem
> like a problem until you consider the case where we nest different
> mm_mutex sets. In this case Y would -EDEADLK on the first of the second
> (nested) set, even though re-trying that set is pointless, O doesn't
> care.

Another big reason for having a start/end marker like you've describe is
lockdep support. Lockdep is oblivious to the magic deadlock avoidance, so
when trying to just annotate the ww_mutexes themselves you can only
annotate them as trylocks (which in a way they are). But that completely
breaks lockdep warnings for the above mentioned

> 	ww_mutex -> other -> ww_mutex

loop. So we don't want that. The trick Maarten's patches employ is to add
a fake/virtual lockdep lock for the reservation ticket/age itself, which
is acquired in start and dropped again in end. All the ww_mutex locking is
then annotated as blocking locks nested within that virtual mutex (this is
already used in mm_take_all_locks). Maarten added a few patches to ensure
that lockdep properly checks this nesting, too:

commit d094595078d00b63839d0c5ccb8b184ef242cb45
Author: Maarten Lankhorst <maarten.lankhorst@canonical.com>
Date:   Thu Sep 13 11:39:51 2012 +0200

    lockdep: Check if nested lock is actually held

> Furthermore, considering the first scenario, O could block on a lock of
> the first set when Y is blocked on a lock of the second (nested) set;
> we need means of discerning the different lock sets. This cannot be
> done by local means, that is, any context created by the start/end
> markers would be local to that task and would not be recognizable by
> another as to belong to the same group.
> 
> Instead we'd need to create a set-class, much like lockdep has and
> somehow ensure all locks in a set are of that class.

I'm a bit confused about the different classes you're talking about. Since
the ticket queue is currently a global counter there's only one class of
ww_mutexes. I guess we could change that once a second user shows up, but
currently with dma-bufs we _want_ all reservartion mutexes to be in the
same class. And if you nest reservation mutexe loops (i.e. nest calls to
acquire_start in your example below) we want that to fail. So currently we
only have one lockdep class for the mutexes plus one class for the virtual
lock everything nests in.

> One thing I'm not entirely sure of is the strict need for a local
> context it could be used to track the set-class, but I'm not entirely
> sure we need that to clean up state at the end marker. I haven't been
> creative enough to construct a fail case where both O and Y nest and a
> pending EDEADLK state could cross the end marker wrongly.
> 
> However, having a local context, even if empty, does put a few
> constraints on the API usage, which as per below, is a good thing.
> 
> To recap, we now have:
> 
>   struct ww_mutex_key { };
>   struct ww_mutex;
>   struct ww_mutex_ctx;
> 
>   void __ww_mutex_init(struct ww_mutex *, void *);
> 
>   #define ww_mutex_init(ww_mutex) 	 		\
>     do {					\
> 	static struct ww_mutex_key __key;		\
> 	__ww_mutex_init(ww_mutex, &__key);		\
>     } while (0)
> 
>   void ww_mutex_acquire_start(struct ww_mutex_ctx *);
>   void ww_mutex_acquire_end  (struct ww_mutex_ctx *);
> 
>   int  ww_mutex_lock(struct ww_mutex_ctx *, struct ww_mutex *);
>   void ww_mutex_unlock(struct ww_mutex *);
> 
> Which are to be used like:
> 
>   int bufs_lock(bufs)
>   {
> 	struct ww_mutex_ctx ww_ctx;
> 
> 	ww_mutex_acquire_start(&ww_ctx);
>   retry:
> 	for-all-buffers(buf, bufs) {
> 		err = ww_mutex_lock(&ww_ctx, &buf->lock);
> 		if (err)
> 			goto error;
> 	}
> 
> 	ww_mutex_acquire_end(&ww_ctx);
> 
> 	return 0; /* success */
> 
>   error:
> 	for-all-buffers(buf2, bufs) {
> 		if (buf2 == buf)
> 			break;
> 		ww_mutex_unlock(&buf->lock);
> 	}
> 	if (err == -EDEADLK)
> 		goto again;
> 
> 	return err; /* other error */
>   }
> 
>   void bufs_unlock(bufs)
>   {
> 	for-all-buffers(buf, bufs)
> 		ww_mutex_unlock(&buf->lock);
>   }
> 
> This API has a few problems as per Rusty's API guidelines, it is fairly
> trivial to use it wrong; mostly the placement of the start and end
> markers are easy to get wrong, but of course one can get all kinds of
> badness from doing the retry wrong. At least having the local context
> forces one to use the start/end markers.

I share your concerns about the api, it's way too easy to get wrong. But
with the lockdep nesting checks we should at least be covered for the
ordering in the fastpath. For the error paths themselves I've just thought
about ranodm -EAGAIN injection as a debug option. With that we should at
least be able to cover these paths sufficiently in regression test suites.

Would be a new patch for Maarten to write though ;-)

> The only remaining 'problem' left is RT, however the above suggests a
> very clean solution; change the lock queueing order to first sort on
> priority (be it the SCHED_FIFO/RR static priority or SCHED_DEADLINE
> dynamic priority) and secondly on the age. 
> 
> Note that (re)setting the age at every acquisition set is really only a
> means to provide FIFO fairness between tasks, RT tasks don't strictly
> require this, they have their own ordering.
> 
> With all that this locking scheme should be deterministic; and in cases
> where the older task would block (the young task has passed the end
> marker) we can apply PI and push Y's priority.

We've discussed this approach of using (rt-prio, age) instead of just age
to determine the the "oldness" of a task for deadlock-breaking with
-EAGAIN. The problem is that through PI boosting or normal rt-prio changes
while tasks are trying to acquire ww_mutexes you can create acyclic loops
in the blocked-on graph of ww_mutexes after the fact and so cause
deadlocks. So we've convinced ourselves that this approche doesn't work.

The only somewhat similar approache we couldn't poke holes into is to set
a PI-boost flag on the reservation ticket (your ww_ctx) of the task a
higher-prio RT thread is blocking on. Any task who has set that flag on
its ww_ctx would then unconditionally fail with -EAGAIN on the next
ww_mutex_lock. See my other mail from yesterday for some of the ideas
we've discussed about RT-compliant ww_mutexes on irc.

> Did I miss something?
> 
> Anyway, I haven't actually looked at the details of your implementation
> yet, I'll get to that next, but I think something like the above should
> be what we want to aim for.
> 
> *phew* thanks for reading this far ! :-)

Well, it was a good read and I'm rather happy that we agree on the ww_ctx
thing (whatever it's called in the end), even though we have slightly
different reasons for it.

I don't really have a useful idea to make the retry handling for users
more rusty-compliant though, and I'm still unhappy with all current naming
proposals ;-)

Cheers, Daniel
  
Peter Zijlstra April 4, 2013, 4:33 p.m. UTC | #12
On Thu, 2013-04-04 at 15:31 +0200, Daniel Vetter wrote:
> We do add some form of owner tracking by storing the reservation
> ticket
> of the current holder into every ww_mutex. So when task-Y in your
> above
> example tries to acquire lock A it notices that it's behind in the
> global
> queue and immedieately returns -EAGAIN to indicate the deadlock.
> 
> Aside, there's a bit of fun in that ttm uses -EDEADLCK for when you
> try to
> reserve the same buffer twice - it can easily detect that by comparing
> the
> lock owner ticket with the provided one and if it matches bail out. 

Sure, but this should bear no influence on the design of the mutex
primitive. At most we should ensure this situation is recognizable.

> Hence we've kept the special -EDEADLCK semantics even for the ww_mutex
> stuff.

There's EDEADLK and EDEADLOCK, EDEADLCK does alas not exist :-)

> > Now this gets a little more interesting if we change the scenario a
> > little:
> > 
> >       task-O  task-Y
> >         A
> >               B
> >         B <-- blocks on Y
> >               * <-- could be A
> 
> The current code at least simple blocks task-O on B until task-Y
> unlocks
> B. Deadlocks cannot happen since if task-Y ever tries to acquire a
> lock
> which is held by an older task (e.g. lock A) it will bail out with
> -EAGAIN.

Agreed, O would have to wait until Y unlocks B. It was a 'detail' I
skimped over for the sake of brevity. I was merely trying to illustrate
the ways in which we must bring Y to return -EDEADLK (or -EAGAIN in
your version).

--
To unsubscribe from this list: send the line "unsubscribe linux-media" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
  
Peter Zijlstra April 4, 2013, 4:38 p.m. UTC | #13
On Thu, 2013-04-04 at 15:31 +0200, Daniel Vetter wrote:
> Hm, I guess your aim with the TASK_DEADLOCK wakeup is to bound the
> wait
> times of older task.

No, imagine the following:

struct ww_mutex A, B;
struct mutex C;

	task-O	task-Y	task-X
	  	A
		B
			C
		C
	B

At this point O finds that Y owns B and thus we want to make Y 'yield'
B to make allow B progress. Since Y is blocked, we'll send a wakeup.
However Y is blocked on a different locking primitive; one that doesn't
collaborate in the -EDEADLK scheme therefore we don't want the wakeup to
succeed.



--
To unsubscribe from this list: send the line "unsubscribe linux-media" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
  
Peter Zijlstra April 4, 2013, 4:39 p.m. UTC | #14
On Thu, 2013-04-04 at 15:31 +0200, Daniel Vetter wrote:
> The trick with the current code is that the oldest task
> will never see an -EAGAIN ever and hence is guaranteed to make forward
> progress. If the task is really unlucky though it might be forced to
> wait
> for a younger task for every ww_mutex it tries to acquire.

Agreed on that.. while I didn't state this my proposed thing should
behave the same. It follows from the symmetry breaking in that only
younger tasks can get 'kill'ed.

--
To unsubscribe from this list: send the line "unsubscribe linux-media" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
  
Peter Zijlstra April 4, 2013, 4:41 p.m. UTC | #15
On Thu, 2013-04-04 at 15:31 +0200, Daniel Vetter wrote:
> The thing is now that you're not expected to hold these locks for a
> long
> time - if you need to synchronously stall while holding a lock
> performance
> will go down the gutters anyway. And since most current
> gpus/co-processors
> still can't really preempt fairness isn't that high a priority,
> either.
> So we didn't think too much about that.

Yeah but you're proposing a new synchronization primitive for the core
kernel.. all such 'fun' details need to be considered, not only those
few that bear on the one usecase.

--
To unsubscribe from this list: send the line "unsubscribe linux-media" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
  
Peter Zijlstra April 4, 2013, 4:43 p.m. UTC | #16
On Thu, 2013-04-04 at 15:31 +0200, Daniel Vetter wrote:
> Another big reason for having a start/end marker like you've describe
> is
> lockdep support.

Yeah, I saw how you did that.. but there's other ways of making it work
too, you could for instance create a new validation state for this type
of lock.

That said, I didn't consider lockdep too much, I first want the regular
semantics clear.

--
To unsubscribe from this list: send the line "unsubscribe linux-media" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
  
Peter Zijlstra April 4, 2013, 4:46 p.m. UTC | #17
On Thu, 2013-04-04 at 15:31 +0200, Daniel Vetter wrote:
> I'm a bit confused about the different classes you're talking about.
> Since
> the ticket queue is currently a global counter there's only one class
> of
> ww_mutexes. 

Right, so that's not something that's going to fly.. we need to support
multiple users, including nesting of those. Again, you're adding
something to the generic kernel infrastructure, it had better be usable
:-)

> I guess we could change that once a second user shows up

No, we fix that before it all goes in. I would _so_ hate to find out it
cannot be 'fixed' and be stuck with a half-arsed sync primitive in the
core kernel that's only every usable by the one existent user.

So for now, forget all about TTM, DMA-BUF and other such coolness
except to verify that whatever we end up with does indeed work for the
case you need it for ;-)

--
To unsubscribe from this list: send the line "unsubscribe linux-media" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
  
Peter Zijlstra April 4, 2013, 4:49 p.m. UTC | #18
On Thu, 2013-04-04 at 15:31 +0200, Daniel Vetter wrote:
> We've discussed this approach of using (rt-prio, age) instead of just
> age
> to determine the the "oldness" of a task for deadlock-breaking with
> -EAGAIN. The problem is that through PI boosting or normal rt-prio
> changes
> while tasks are trying to acquire ww_mutexes you can create acyclic
> loops
> in the blocked-on graph of ww_mutexes after the fact and so cause
> deadlocks. So we've convinced ourselves that this approche doesn't
> work.

Could you pretty please draw me a picture, I'm having trouble seeing
what you mean.

AFAICT as long as you boost Y while its the lock owner things should
work out, no?



--
To unsubscribe from this list: send the line "unsubscribe linux-media" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
  
Peter Zijlstra April 4, 2013, 4:54 p.m. UTC | #19
On Thu, 2013-04-04 at 15:31 +0200, Daniel Vetter wrote:
> Well, it was a good read and I'm rather happy that we agree on the
> ww_ctx
> thing (whatever it's called in the end), even though we have slightly
> different reasons for it.

Yeah, I tried various weirdness to get out from under it, but the whole
progress/fairness thing made it rather hard. Ideally you'd be able to
use some existing scheduler state since its the same goal, but the
whole wakeup-retry muck makes that hard.

> I don't really have a useful idea to make the retry handling for users
> more rusty-compliant though, and I'm still unhappy with all current
> naming
> proposals ;-)

Ah, naming,.. yeah I'm not too terribly attached to most of them. I
just want to avoid something that's reasonably well known to mean
something different.

Furthermore, since we use the wound/wait symmetry breaking it would
make sense for that to appear somewhere in the name.



--
To unsubscribe from this list: send the line "unsubscribe linux-media" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
  
Daniel Vetter April 4, 2013, 4:56 p.m. UTC | #20
On Thu, Apr 4, 2013 at 3:31 PM, Daniel Vetter <daniel@ffwll.ch> wrote:
>> In this case when O blocks Y isn't actually blocked, so our
>> TASK_DEADLOCK wakeup doesn't actually achieve anything.
>>
>> This means we also have to track (task) state so that once Y tries to
>> acquire A (creating the actual deadlock) we'll not wait so our
>> TASK_DEADLOCK wakeup doesn't actually achieve anything.
>>
>> Note that Y doesn't need to acquire A in order to return -EDEADLK, any
>> acquisition from the same set (see below) would return -EDEADLK even if
>> there isn't an actual deadlock. This is the cost of heuristic; we could
>> walk the actual block graph but that would be prohibitively expensive
>> since we'd have to do this on every acquire.
>
> Hm, I guess your aim with the TASK_DEADLOCK wakeup is to bound the wait
> times of older task. This could be interesting for RT, but I'm unsure of
> the implications. The trick with the current code is that the oldest task
> will never see an -EAGAIN ever and hence is guaranteed to make forward
> progress. If the task is really unlucky though it might be forced to wait
> for a younger task for every ww_mutex it tries to acquire.

[Aside: I'm writing this while your replies trickle in, but I think
it's not yet answered already.]

Ok, I've discussed this a lot with Maarten on irc and I think I see a
bit clearer now what's the aim with the new sleep state. Or at least I
have an illusion about it ;-) So let me try to recap my understanding
to check whether we're talking roughly about the same idea.

I think for starters we need to have a slightly more interesting example:

3 threads O, M, Y: O has the oldest ww_age/ticket, Y the youngest, M
is in between.
2 ww_mutexes: A, B

Y has already acquired ww_mutex A, M has already acquired ww_mutex B.

Now O wants to acquire B and M wants to acquire A (let's ignore
detailed ordering for now), resulting in O blocking on M (M holds B
already, but O is older) and M blocking on Y (same for lock B).

Now first question to check my understanding: Your aim with that
special wakeup is to kick M so that it backs off and drops B? That way
O does not need to wait for Y to complete whatever it's currently
doing, unlock A and then in turn M to complete whatever it's doing so
that it can unlock A&B and finally allows O to grab the lock.

Presuming I'm still following we should be able to fix this with the
new sleep state TASK_DEADLOCK and a flag somewhere in the thread info
(let's call it PF_GTFO for simplicity). Then every time a task does a
blocking wait on a ww_mutex it would set this special sleep state and
also check the PF_GTFO bit. If the later is set, it bails out with
-EAGAIN (so that all locks are dropped).

Now if a task wants to take a lock and notices that it's held by a
younger locker it can set that flag and wake the thread up (need to
think about all the races a bit, but we should be able to make this
work). Then it can do the normal blocking mutex slowpath and wait for
the unlock.

Now if O and M race a bit against each another M should either get
woken (if it's already blocked on Y) and back off, or notice that the
thread flag is set before it even tries to grab another mutex (and so
before the block tree can extend further to Y). And the special sleep
state is to make sure we don't cause any other spurious interrupts.
-Daniel
--
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch
--
To unsubscribe from this list: send the line "unsubscribe linux-media" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
  
Daniel Vetter April 4, 2013, 4:59 p.m. UTC | #21
On Thu, Apr 4, 2013 at 6:38 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Thu, 2013-04-04 at 15:31 +0200, Daniel Vetter wrote:
>> Hm, I guess your aim with the TASK_DEADLOCK wakeup is to bound the
>> wait
>> times of older task.
>
> No, imagine the following:
>
> struct ww_mutex A, B;
> struct mutex C;
>
>         task-O  task-Y  task-X
>                 A
>                 B
>                         C
>                 C
>         B
>
> At this point O finds that Y owns B and thus we want to make Y 'yield'
> B to make allow B progress. Since Y is blocked, we'll send a wakeup.
> However Y is blocked on a different locking primitive; one that doesn't
> collaborate in the -EDEADLK scheme therefore we don't want the wakeup to
> succeed.

Yeah, I've thought about this some more and the special sleep state
seems to be only required to ensure we don't cause spurious wakeups
for other any other reasons a task blocks. But I think we can use that
kick-current-holder approach to ensure that older tasks get the lock
in a more timely fashion than the current code. I've tried to detail
it a bit with another 3 task example - that seems to be the point
where the fun starts ;-)
-Daniel
--
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch
--
To unsubscribe from this list: send the line "unsubscribe linux-media" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
  
Daniel Vetter April 4, 2013, 8:44 p.m. UTC | #22
On Thu, Apr 4, 2013 at 6:49 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Thu, 2013-04-04 at 15:31 +0200, Daniel Vetter wrote:
>> We've discussed this approach of using (rt-prio, age) instead of just
>> age
>> to determine the the "oldness" of a task for deadlock-breaking with
>> -EAGAIN. The problem is that through PI boosting or normal rt-prio
>> changes
>> while tasks are trying to acquire ww_mutexes you can create acyclic
>> loops
>> in the blocked-on graph of ww_mutexes after the fact and so cause
>> deadlocks. So we've convinced ourselves that this approche doesn't
>> work.
>
> Could you pretty please draw me a picture, I'm having trouble seeing
> what you mean.
>
> AFAICT as long as you boost Y while its the lock owner things should
> work out, no?

So this one here took a bit of pondering. But I think you'll like the
conclusion.

Now the deadlock detection works because it impose a dag onto all
lockers which clearly denotes who'll wait and who will get wounded.
The problem with using (rt_prio, ww_age/ticket) instead of just the
ticket si that it's ambigous in the face of PI boosting. E.g.
- rt_prio of A > rt_prio of B, but
- task A is younger than task B

Before boosting task A wins, but after boosting (when only the age
matters) task B wins, since it's now older. So now when B comes around
and tries to grab a lock A currently holds, it'll also block.

Now the cool thing with TASK_KILLABLE (hey, I start to appreciate it,
bear with me!) is that this is no problem - it limits the length of
any chain of blocked tasks to just one link of blocked tasks:
- If the length is currently one it cannot be extended - the blocked
task will set the PF_GTFO flag on the current holder, and since that
would be checked before blocking on another ww_mutex the chain can't
grow past one blocked task.
- If the current holder itself is blocked on another ww_mutex it'll be
in TASK_DEADLOCK state and get woken up.

In both case the current holder of the lock will either continue with
what it's been doing (PI-boosted ofc) until it unlocks everything or
hits the PF_GTFO check and backs off from all currently held locks. RT
mission accomplished I think since the wait time for the highest-prio
task for grabbing a lock is bounded by the lock hold time for the
completely uncontended case. And within a given rt-prio the normal
wait/wound algorithm will resolve conflicts (and ensure forward
progress).

So I'm now rather convinced that with the TASK_DEADLOCK implementation
we can make (rt_prio, age) lock ordering work. But it's not an
artifact of consistent PI boosting (I've raced down that alley first
trying to prove it to no avail), but the fact that lock holder kicking
through PF_GTFO naturally limits blocked task chains on ww_mutexes to
just one link. That in turn makes any post-PI-boosted considerations
moot and so can't result in havoc due to a PI-boost having flipped an
edge in the lock_er_ ordering DAG (we sort tasks with wait/wound, not
locks, hence it's not a lock ordering DAG).

Please poke holes into this argument, I've probably missed a sublety
somewhere ...
-Daniel
--
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch
--
To unsubscribe from this list: send the line "unsubscribe linux-media" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
  
Peter Zijlstra April 8, 2013, 10:39 a.m. UTC | #23
On Thu, 2013-04-04 at 18:56 +0200, Daniel Vetter wrote:
> On Thu, Apr 4, 2013 at 3:31 PM, Daniel Vetter <daniel@ffwll.ch> wrote:
> >> In this case when O blocks Y isn't actually blocked, so our
> >> TASK_DEADLOCK wakeup doesn't actually achieve anything.
> >>
> >> This means we also have to track (task) state so that once Y tries to
> >> acquire A (creating the actual deadlock) we'll not wait so our
> >> TASK_DEADLOCK wakeup doesn't actually achieve anything.
> >>
> >> Note that Y doesn't need to acquire A in order to return -EDEADLK, any
> >> acquisition from the same set (see below) would return -EDEADLK even if
> >> there isn't an actual deadlock. This is the cost of heuristic; we could
> >> walk the actual block graph but that would be prohibitively expensive
> >> since we'd have to do this on every acquire.
> >
> > Hm, I guess your aim with the TASK_DEADLOCK wakeup is to bound the wait
> > times of older task. This could be interesting for RT, but I'm unsure of
> > the implications. The trick with the current code is that the oldest task
> > will never see an -EAGAIN ever and hence is guaranteed to make forward
> > progress. If the task is really unlucky though it might be forced to wait
> > for a younger task for every ww_mutex it tries to acquire.
> 
> [Aside: I'm writing this while your replies trickle in, but I think
> it's not yet answered already.]
> 
> Ok, I've discussed this a lot with Maarten on irc and I think I see a
> bit clearer now what's the aim with the new sleep state. Or at least I
> have an illusion about it ;-) So let me try to recap my understanding
> to check whether we're talking roughly about the same idea.
> 
> I think for starters we need to have a slightly more interesting example:
> 
> 3 threads O, M, Y: O has the oldest ww_age/ticket, Y the youngest, M
> is in between.
> 2 ww_mutexes: A, B
> 
> Y has already acquired ww_mutex A, M has already acquired ww_mutex B.
> 
> Now O wants to acquire B and M wants to acquire A (let's ignore
> detailed ordering for now), resulting in O blocking on M (M holds B
> already, but O is older) and M blocking on Y (same for lock B).

drawing the picture for myself:

	task-O	task-M	task-Y
			A
		B
	B
		A

> Now first question to check my understanding: Your aim with that
> special wakeup is to kick M so that it backs off and drops B? That way
> O does not need to wait for Y to complete whatever it's currently
> doing, unlock A and then in turn M to complete whatever it's doing so
> that it can unlock A&B and finally allows O to grab the lock.

No, we always need to wait for locks to be unlocked. The sole purpose
of the special wakeups state is to not wake other (!ww_mutex) locks
that might be held by the task holding the contended ww_mutex. While
all schedule() sites should deal with spurious wakeups its a sad fact
of life that they do not :/

> Presuming I'm still following we should be able to fix this with the
> new sleep state TASK_DEADLOCK and a flag somewhere in the thread info
> (let's call it PF_GTFO for simplicity).

I'm reading "Get The F*ck Out" ? I like the name, except PF_flags are
unsuitable since they are not atomic and we'd need to set it from
another thread.

>  Then every time a task does a
> blocking wait on a ww_mutex it would set this special sleep state and
> also check the PF_GTFO bit.

So its the contending task (O for B) setting PF_GTFO on the owning task
(M for B), right?

But yeah, all ww_mutex sleep states should have the new TASK_DEADLOCK
sleep state added.

>  If the later is set, it bails out with
> -EAGAIN (so that all locks are dropped).

I would really rather see -EDEADLK for that..

> Now if a task wants to take a lock and notices that it's held by a
> younger locker it can set that flag and wake the thread up (need to
> think about all the races a bit, but we should be able to make this
> work). Then it can do the normal blocking mutex slowpath and wait for
> the unlock.

Right.

> Now if O and M race a bit against each another M should either get
> woken (if it's already blocked on Y) and back off, or notice that the
> thread flag is set before it even tries to grab another mutex 

ww_mutex, it should block just fine on regular mutexes and other
primitives.

> (and so
> before the block tree can extend further to Y). And the special sleep
> state is to make sure we don't cause any other spurious interrupts.

Right, I think we're understanding one another here.

--
To unsubscribe from this list: send the line "unsubscribe linux-media" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
  
Daniel Vetter April 8, 2013, 11:50 a.m. UTC | #24
On Mon, Apr 08, 2013 at 12:39:24PM +0200, Peter Zijlstra wrote:
> On Thu, 2013-04-04 at 18:56 +0200, Daniel Vetter wrote:
> > Presuming I'm still following we should be able to fix this with the
> > new sleep state TASK_DEADLOCK and a flag somewhere in the thread info
> > (let's call it PF_GTFO for simplicity).
> 
> I'm reading "Get The F*ck Out" ? I like the name, except PF_flags are
> unsuitable since they are not atomic and we'd need to set it from
> another thread.

I think the PF_ flag is a non-starter for a different reason, too. To make
different clases of ww_mutexes nest properly, we need to make sure that we
don't dead/live-lock trying to wake up a task holdinga ww_mutex of class
A, while that task is trying to acquire ww_mutexes of class B. Picture:

	task 1			task 2			task 3
							holds lock B
	ticket_A = acquire_start(class A)
	ww_mutex_lock(lock A, ticket_A)			busy doing something

	ticket_B = acquire_start(class B)
	ww_mutex_lock(lock B, ticket_B)
	-> contention with task 3
				
				ticket_task2 = acquire_start(class A)
				ww_mutex_lock(lock A, ticket_task2)
				-> contention with task 1

If we go with the PF_GTFO task flag, task 2 will set it on task 1
(handwave locking/atomicity again here ;-). But taks 1 will only back off
from any locks in class B. Or at least I think we should impose that
restriction, since otherwise a ww_mutex acquire loop will leak out badly
abstraction-wise and making nesting pretty much impossible in practice.

But if we really want nesting to work transparently, then task 1 should be
allowed to continue taking locks from class A (after an acquire_end(class
B) call ofc). But then it will have missed the wakeup from task 2, so task
2 blocks for too long and task 1 doesn't back off from further acquiring
locks in class A.

Presuming my reasoning for the rt case isn't broken, this break deadlock
avoidance if we sort by (rt_prio, ticket).

So I think we need to move the PF_GTFO flag to the ticket to make sure
that task1 will notice that there's contention with one of the locks it
holds from class A and that it better gtfo asap (i.e. back off, drop all
currently held locks in class A and go into the slowpath).

But I still need to think the nesting case through a bit more (and draw
some diagrams), so not sure yet. One thing I still need to ponder is how
to best stack tickets (for tasks doing nested ww_mutex locking) and where
all we need a pointer to relevant tickets and what kind of fun this
entails ... I need to ponder this some more.

> >  Then every time a task does a
> > blocking wait on a ww_mutex it would set this special sleep state and
> > also check the PF_GTFO bit.
> 
> So its the contending task (O for B) setting PF_GTFO on the owning task
> (M for B), right?

Yeah, the idea is that the contending task sets this bit on the current
holder (whether we put that bit into the task or the ticket), so that the
current owner can back off and drop all currently held locks at the next
opportunity (either due to being woken up in TASK_DEADLCOK state or in the
next ww_mutex_lock call).

> But yeah, all ww_mutex sleep states should have the new TASK_DEADLOCK
> sleep state added.
> 
> >  If the later is set, it bails out with
> > -EAGAIN (so that all locks are dropped).
> 
> I would really rather see -EDEADLK for that..

I agree it makes more sense for a general api (outside of ttm), but I
struggle a bit to come up with a good errno for the "you hold this lock
already" case. Best I could do is -EBUSY ...

Hm, looking through errno.h I've just spotted -EALREADY. Seems to be used
all over (not just networking), so I guess we could extend it's meaning a
bit?

> > Now if a task wants to take a lock and notices that it's held by a
> > younger locker it can set that flag and wake the thread up (need to
> > think about all the races a bit, but we should be able to make this
> > work). Then it can do the normal blocking mutex slowpath and wait for
> > the unlock.
> 
> Right.
> 
> > Now if O and M race a bit against each another M should either get
> > woken (if it's already blocked on Y) and back off, or notice that the
> > thread flag is set before it even tries to grab another mutex 
> 
> ww_mutex, it should block just fine on regular mutexes and other
> primitives.

Yes, that special behaviour should only apply to ww_mutexes not to any
other locking. I've not been terribly careful here ...
-Daniel
  
Steven Rostedt April 9, 2013, 10:18 p.m. UTC | #25
On Tue, Apr 02, 2013 at 06:59:14PM +0200, Peter Zijlstra wrote:
> 
> So how about we call the thing something like:
> 
>   struct ww_mutex; /* wound/wait */

Reading this I can't help but think of Elmer Fudd saying "Round Robin"
as "Wound Wobin"

-- Steve

> 
>   int mutex_wound_lock(struct ww_mutex *); /* returns -EDEADLK */
>   int mutex_wait_lock(struct ww_mutex *);  /* does not fail */
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-media" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
  
Steven Rostedt April 9, 2013, 10:27 p.m. UTC | #26
On Thu, Apr 04, 2013 at 06:38:36PM +0200, Peter Zijlstra wrote:
> On Thu, 2013-04-04 at 15:31 +0200, Daniel Vetter wrote:
> > Hm, I guess your aim with the TASK_DEADLOCK wakeup is to bound the
> > wait
> > times of older task.
> 
> No, imagine the following:
> 
> struct ww_mutex A, B;
> struct mutex C;
> 
> 	task-O	task-Y	task-X
> 	  	A
> 		B
> 			C
> 		C
> 	B
> 
> At this point O finds that Y owns B and thus we want to make Y 'yield'
> B to make allow B progress. Since Y is blocked, we'll send a wakeup.
> However Y is blocked on a different locking primitive; one that doesn't
> collaborate in the -EDEADLK scheme therefore we don't want the wakeup to
> succeed.

I'm confused to why the above is a problem. Task-X will eventually
release C, and then Y will release B and O will get to continue. Do we
have to drop them once the owner is blocked? Can't we follow the chain
like the PI code does?

-- Steve

--
To unsubscribe from this list: send the line "unsubscribe linux-media" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
  
Steven Rostedt April 9, 2013, 10:28 p.m. UTC | #27
On Thu, Apr 04, 2013 at 06:41:02PM +0200, Peter Zijlstra wrote:
> On Thu, 2013-04-04 at 15:31 +0200, Daniel Vetter wrote:
> > The thing is now that you're not expected to hold these locks for a
> > long
> > time - if you need to synchronously stall while holding a lock
> > performance
> > will go down the gutters anyway. And since most current
> > gpus/co-processors
> > still can't really preempt fairness isn't that high a priority,
> > either.
> > So we didn't think too much about that.
> 
> Yeah but you're proposing a new synchronization primitive for the core
> kernel.. all such 'fun' details need to be considered, not only those
> few that bear on the one usecase.

Which bares the question, what other use cases are there?

-- Steve

--
To unsubscribe from this list: send the line "unsubscribe linux-media" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
  
Steven Rostedt April 9, 2013, 10:42 p.m. UTC | #28
On Thu, Apr 04, 2013 at 06:56:58PM +0200, Daniel Vetter wrote:
> 
> I think for starters we need to have a slightly more interesting example:
> 
> 3 threads O, M, Y: O has the oldest ww_age/ticket, Y the youngest, M
> is in between.
> 2 ww_mutexes: A, B
> 
> Y has already acquired ww_mutex A, M has already acquired ww_mutex B.
> 

Now I probably missed it in the thread somewhere, but what makes task O
the oldest and Y the youngest? Is it the actual time from when the task
was created? What about setting an age as soon as it starts the process
of grabbing one of these locks? And it keeps the age until it
successfully grabs and releases all the locks again. It wont reset if it
had to drop the locks and start over.

Or am I totally not understanding what's going on here? Which could also
be the case ;-)

-- Steve

--
To unsubscribe from this list: send the line "unsubscribe linux-media" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
  
Peter Zijlstra April 10, 2013, 7:34 a.m. UTC | #29
On Tue, 2013-04-09 at 18:42 -0400, Steven Rostedt wrote:
> What about setting an age as soon as it starts the process
> of grabbing one of these locks? And it keeps the age until it
> successfully grabs and releases all the locks again. It wont reset if
> it
> had to drop the locks and start over.


That is indeed the proposed mechanism. It ensures FIFO fairness between
the various threads that try to acquire a set.

--
To unsubscribe from this list: send the line "unsubscribe linux-media" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
  
Daniel Vetter April 10, 2013, 8:27 a.m. UTC | #30
On Wed, Apr 10, 2013 at 12:27 AM, Steven Rostedt <rostedt@goodmis.org> wrote:
> On Thu, Apr 04, 2013 at 06:38:36PM +0200, Peter Zijlstra wrote:
>> On Thu, 2013-04-04 at 15:31 +0200, Daniel Vetter wrote:
>> > Hm, I guess your aim with the TASK_DEADLOCK wakeup is to bound the
>> > wait
>> > times of older task.
>>
>> No, imagine the following:
>>
>> struct ww_mutex A, B;
>> struct mutex C;
>>
>>       task-O  task-Y  task-X
>>               A
>>               B
>>                       C
>>               C
>>       B
>>
>> At this point O finds that Y owns B and thus we want to make Y 'yield'
>> B to make allow B progress. Since Y is blocked, we'll send a wakeup.
>> However Y is blocked on a different locking primitive; one that doesn't
>> collaborate in the -EDEADLK scheme therefore we don't want the wakeup to
>> succeed.
>
> I'm confused to why the above is a problem. Task-X will eventually
> release C, and then Y will release B and O will get to continue. Do we
> have to drop them once the owner is blocked? Can't we follow the chain
> like the PI code does?

Just waiting until every task already holding a lock completes and
unlucks it is indeed a viable solution - it's the currently
implemented algorithm in ttm and Maarten's current patches.

The nice thing with Peter's wakeup idea on top is:
- It bounds blocked times.
- And (at least I think so) it's the key thing making PI boosting
possible without any ugly PI inversion deadlocks happening. See

Message-ID: <CAKMK7uEUdtiDDCRPwpiumkrST6suFY7YuQcPAXR_nJ0XHKzsAw@mail.gmail.com>

for my current reasoning about this (I have not yet managed to poke a
hole into it).
-Daniel
--
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch
--
To unsubscribe from this list: send the line "unsubscribe linux-media" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
  
Daniel Vetter April 10, 2013, 9:33 a.m. UTC | #31
On Tue, Apr 09, 2013 at 06:28:08PM -0400, Steven Rostedt wrote:
> On Thu, Apr 04, 2013 at 06:41:02PM +0200, Peter Zijlstra wrote:
> > On Thu, 2013-04-04 at 15:31 +0200, Daniel Vetter wrote:
> > > The thing is now that you're not expected to hold these locks for a
> > > long
> > > time - if you need to synchronously stall while holding a lock
> > > performance
> > > will go down the gutters anyway. And since most current
> > > gpus/co-processors
> > > still can't really preempt fairness isn't that high a priority,
> > > either.
> > > So we didn't think too much about that.
> > 
> > Yeah but you're proposing a new synchronization primitive for the core
> > kernel.. all such 'fun' details need to be considered, not only those
> > few that bear on the one usecase.
> 
> Which bares the question, what other use cases are there?

Tbh I don't see any other either - but I agree with Peter and thinking
things through and making the api a bit more generic seems to help in
clarifying the semantics. Reminds me that I still need to draw a few
diagrams ;-)
-Daniel
  
Daniel Vetter April 10, 2013, 10:34 a.m. UTC | #32
On Mon, Apr 08, 2013 at 01:50:26PM +0200, Daniel Vetter wrote:
> On Mon, Apr 08, 2013 at 12:39:24PM +0200, Peter Zijlstra wrote:
> > On Thu, 2013-04-04 at 18:56 +0200, Daniel Vetter wrote:
> > > Presuming I'm still following we should be able to fix this with the
> > > new sleep state TASK_DEADLOCK and a flag somewhere in the thread info
> > > (let's call it PF_GTFO for simplicity).
> > 
> > I'm reading "Get The F*ck Out" ? I like the name, except PF_flags are
> > unsuitable since they are not atomic and we'd need to set it from
> > another thread.
> 
> I think the PF_ flag is a non-starter for a different reason, too. To make
> different clases of ww_mutexes nest properly, we need to make sure that we
> don't dead/live-lock trying to wake up a task holdinga ww_mutex of class
> A, while that task is trying to acquire ww_mutexes of class B. Picture:
> 
> 	task 1			task 2			task 3
> 							holds lock B
> 	ticket_A = acquire_start(class A)
> 	ww_mutex_lock(lock A, ticket_A)			busy doing something
> 
> 	ticket_B = acquire_start(class B)
> 	ww_mutex_lock(lock B, ticket_B)
> 	-> contention with task 3
> 				
> 				ticket_task2 = acquire_start(class A)
> 				ww_mutex_lock(lock A, ticket_task2)
> 				-> contention with task 1
> 
> If we go with the PF_GTFO task flag, task 2 will set it on task 1
> (handwave locking/atomicity again here ;-). But taks 1 will only back off
> from any locks in class B. Or at least I think we should impose that
> restriction, since otherwise a ww_mutex acquire loop will leak out badly
> abstraction-wise and making nesting pretty much impossible in practice.
> 
> But if we really want nesting to work transparently, then task 1 should be
> allowed to continue taking locks from class A (after an acquire_end(class
> B) call ofc). But then it will have missed the wakeup from task 2, so task
> 2 blocks for too long and task 1 doesn't back off from further acquiring
> locks in class A.
> 
> Presuming my reasoning for the rt case isn't broken, this break deadlock
> avoidance if we sort by (rt_prio, ticket).
> 
> So I think we need to move the PF_GTFO flag to the ticket to make sure
> that task1 will notice that there's contention with one of the locks it
> holds from class A and that it better gtfo asap (i.e. back off, drop all
> currently held locks in class A and go into the slowpath).
> 
> But I still need to think the nesting case through a bit more (and draw
> some diagrams), so not sure yet. One thing I still need to ponder is how
> to best stack tickets (for tasks doing nested ww_mutex locking) and where
> all we need a pointer to relevant tickets and what kind of fun this
> entails ... I need to ponder this some more.

Ok, so I think the nesting case should all work if we have a
per-task/ticket flag to signal contention. The point I've mused over a bit
is how to get at both the flag and the corresponding task to do the
wakeup. I think for non-rt the simplest way is to store a pointer to the
ticket in the ww_mutex, and the ticket the holds both the contention flag
and a pointer to the task. Lifetime of that stuff would be protected by
the wait_lock from disappearing untimely, which should allow the
lock/unlock fastpaths to set/clear it non-atomically - only careful places
is in the contented unlock slowpath. So rough api sketch for nesting
ww_mutexes:

struct ww_class {
	atomic_t next_ticket;
	/* virtual lock to annotate the acquire_start/end sections. */
	struct lock_class_key acquire_key;
	/* lockdep class of all ww_mutexes in this class */
	struct lock_class_key mutex_key;
	/* for debug/tracepts */
	const char *name 
};

DEFINE_WW_CLASS(name) /* ... and all the other usual magic init funcitons */

struct ww_acquire_ctx {
	struct task_struct *task; /* invariant */
	usinged ticket; /* invariant */
	/* atomic is probably overkill, but need to review the resulting
	 * code with all the barriers first. */
	atomic_t contention;
}

void ww_acquire_start(struct ww_acuire_ctx *ctx, struct ww_class *class);
void ww_acquire_end(struct ww_acuire_ctx *ctx);

Originally I've thought we could store the pointer to the current acquire
ctx in task_struct (since we need that handy for PI boosting anyway), but
that makes things a bit ugly with nesting. Having a (somewhat) redundant
pointer to the acquiring taks in the ctx seemed less evil.

struct ww_mutex {
	struct mutex base;
	struct ww_acquire_ctx *holding_ctx;
}

I've considered whether we should try to pull clever tricks like with
lockdep keys and make the ww_class more implicit (by auto-instantiating it
somewhere and adding a pointer to it in each ww_mutex). But I think we
should not hide this complexity, but instead make it explicit.

All the ww_mutex_(un)lock* variants would then also need to take the
context instead of the ticket. Besides the usual return codes matching the
normal mutex functions we'd have:

-EDEADLK: Deadlock detected (or got kicked by a higher-prio task in RT),
drop all currently held locks (of the same class) and block on this lock
(since it's the contention point) with ww_mutex_lock_slow.

-EALREADY: Lock already held by the same acquire context. We need this to
be able to efficiently yell at evil userspace.

If ww_mutex_lock goes into the slowpath and it's not one of the above
cases (i.e. wait, not wound) then it sets the contention flag
(unconditionally) and wakes up the current holder iff the current holder
is in the TASK_DEADLOCK state.

If ww_mutex_lock gets woken up it checks the contextion flag on the
acquire context. If it is set it backs off and returns -EDEADLK. Obviously
it also needs to check that before blocking, and we might even want to
check unconditionally before attempting any lock operation (to ensure we
get out of the way of higher-prio tasks quicker).

One thing we probably need is to make TASK_DEADLOCK a flag, since we still
want the different interruptible types of blocking. At least i915 goes to
great lengths to make sure that we take every lock which could be held
while waiting for the gpu with interruptible sleeps (or at least
killable).

Did I miss anything?
-Daniel
  
Daniel Vetter April 17, 2013, 7:08 p.m. UTC | #33
On Wed, Apr 10, 2013 at 12:28 AM, Steven Rostedt <rostedt@goodmis.org> wrote:
> On Thu, Apr 04, 2013 at 06:41:02PM +0200, Peter Zijlstra wrote:
>> On Thu, 2013-04-04 at 15:31 +0200, Daniel Vetter wrote:
>> > The thing is now that you're not expected to hold these locks for a
>> > long
>> > time - if you need to synchronously stall while holding a lock
>> > performance
>> > will go down the gutters anyway. And since most current
>> > gpus/co-processors
>> > still can't really preempt fairness isn't that high a priority,
>> > either.
>> > So we didn't think too much about that.
>>
>> Yeah but you're proposing a new synchronization primitive for the core
>> kernel.. all such 'fun' details need to be considered, not only those
>> few that bear on the one usecase.
>
> Which bares the question, what other use cases are there?

Just stumbled over one I think: If we have a big graph of connected
things (happens really often for video pipelines). And we want
multiple users to use them in parallel. But sometimes a configuration
change could take way too long and so would unduly stall a 2nd thread
with just a global mutex, then per-object ww_mutexes would be a fit:
You'd start with grabbing all the locks for the objects you want to
change anything with, then grab anything in the graph that you also
need to check. Thanks to loop detection and self-recursion this would
all nicely work out, even for cyclic graphs of objects.
-Daniel
--
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch
--
To unsubscribe from this list: send the line "unsubscribe linux-media" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
  
Ville Syrjälä April 18, 2013, 5:37 p.m. UTC | #34
On Wed, Apr 17, 2013 at 09:08:17PM +0200, Daniel Vetter wrote:
> On Wed, Apr 10, 2013 at 12:28 AM, Steven Rostedt <rostedt@goodmis.org> wrote:
> > On Thu, Apr 04, 2013 at 06:41:02PM +0200, Peter Zijlstra wrote:
> >> On Thu, 2013-04-04 at 15:31 +0200, Daniel Vetter wrote:
> >> > The thing is now that you're not expected to hold these locks for a
> >> > long
> >> > time - if you need to synchronously stall while holding a lock
> >> > performance
> >> > will go down the gutters anyway. And since most current
> >> > gpus/co-processors
> >> > still can't really preempt fairness isn't that high a priority,
> >> > either.
> >> > So we didn't think too much about that.
> >>
> >> Yeah but you're proposing a new synchronization primitive for the core
> >> kernel.. all such 'fun' details need to be considered, not only those
> >> few that bear on the one usecase.
> >
> > Which bares the question, what other use cases are there?
> 
> Just stumbled over one I think: If we have a big graph of connected
> things (happens really often for video pipelines). And we want
> multiple users to use them in parallel. But sometimes a configuration
> change could take way too long and so would unduly stall a 2nd thread
> with just a global mutex, then per-object ww_mutexes would be a fit:
> You'd start with grabbing all the locks for the objects you want to
> change anything with, then grab anything in the graph that you also
> need to check. Thanks to loop detection and self-recursion this would
> all nicely work out, even for cyclic graphs of objects.

Indeed, that would make the locking for atomic modeset/page flip very
easy to handle, while still allowing the use of suitable fine grained
locks. I like the idea.
  

Patch

diff --git a/Documentation/reservation-mutex-design.txt b/Documentation/reservation-mutex-design.txt
new file mode 100644
index 0000000..4e2866c
--- /dev/null
+++ b/Documentation/reservation-mutex-design.txt
@@ -0,0 +1,136 @@ 
+Reservation type mutexes
+---
+
+Please read mutex-design.txt first, as it applies to reservation mutexes too.
+
+GPU's do operations that commonly involve many buffers.  Those buffers
+can be shared across contexts/processes, exist in different memory
+domains (for example VRAM vs system memory), and so on.  And with
+PRIME / dmabuf, they can even be shared across devices.  So there are
+a handful of situations where the driver needs to wait for buffers to
+become ready.  If you think about this in terms of waiting on a buffer
+mutex for it to become available, this presents a problem because
+there is no way to guarantee that buffers appear in a execbuf/batch in
+the same order in all contexts.  That is directly under control of
+userspace, and a result of the sequence of GL calls that an application
+makes.  Which results in the potential for deadlock.  The problem gets
+more complex when you consider that the kernel may need to migrate the
+buffer(s) into VRAM before the GPU operates on the buffer(s), which
+may in turn require evicting some other buffers (and you don't want to
+evict other buffers which are already queued up to the GPU), but for a
+simplified understanding of the problem you can ignore this.
+
+The algorithm that TTM came up with for dealing with this problem is
+quite simple.  For each group of buffers (execbuf) that need to be
+locked, the caller would be assigned a unique reservation_id, from a
+global counter.  In case of deadlock while locking all the buffers
+associated with a execbuf, the one with the lowest reservation_id
+wins, and the one with the higher reservation_id unlocks all of the
+buffers that it has already locked, and then tries again.
+
+How it is used:
+---------------
+
+A very simplified version:
+
+    int lock_execbuf(execbuf, ticket)
+    {
+        struct buf *res_buf = NULL;
+
+        /* acquiring locks, before queuing up to GPU: */
+        *ticket = assign_global_seqno();
+
+    retry:
+        for (buf in execbuf->buffers) {
+            if (buf == res_buf) {
+                res_buf = NULL;
+                continue;
+            }
+            ret = mutex_reserve_lock(&buf->lock, ticket, ticket->seqno);
+            if (ret < 0)
+                goto err;
+        }
+
+        /* now everything is good to go, submit job to GPU: */
+        ...
+
+        return 0;
+
+    err:
+        for (all buf2 before buf in execbuf->buffers)
+            mutex_unreserve_unlock(&buf2->lock);
+        if (res_buf)
+            mutex_unreserve_unlock(&res_buf->lock);
+
+        if (ret == -EAGAIN) {
+            /* we lost out in a seqno race, lock and retry.. */
+            mutex_reserve_lock_slow(&buf->lock, ticket, ticket->seqno);
+            res_buf = buf;
+            goto retry;
+        }
+        release_global_seqno(ticket);
+
+        return ret;
+    }
+
+    int unlock_execbuf(execbuf, ticket)
+    {
+        /* when GPU is finished; */
+        for (buf in execbuf->buffers)
+            mutex_unreserve_unlock(&buf->lock);
+        release_global_seqno(ticket);
+    }
+
+Functions:
+----------
+
+mutex_reserve_lock, and mutex_reserve_lock_interruptible:
+  Lock a reservation_lock with a reservation_id set. reservation_id must not
+  be set to 0, since this is a special value that means no reservation_id.
+
+  Normally if reservation_id is not set, or is older than the
+  reservation_id which is currently set on the mutex, the behavior will
+  be to wait normally.  However, if  the reservation_id is newer than
+  the current reservation_id, -EAGAIN will be returned.
+
+  These functions will return -EDEADLK instead of -EAGAIN if
+  reservation_id is the same as the reservation_id that's attempted to
+  lock the mutex with, since in that case you presumably attempted to
+  lock the same lock twice.
+
+mutex_reserve_lock_slow and mutex_reserve_lock_intr_slow:
+  Similar to mutex_reserve_lock, except it won't backoff with -EAGAIN.
+  This is useful when mutex_reserve_lock failed with -EAGAIN, and you
+  unreserved all reservation_locks so no deadlock can occur.
+
+mutex_unreserve_unlock:
+   Unlock a reservation_lock reserved with one of the mutex_reserve_*lock*
+   calls.
+
+Design:
+  ticket_mutex encapsulates a struct mutex, this means no extra overhead for
+  normal mutex locks, which are far more common. As such there is only a small
+  increase in code size if reservation mutexes are not used.
+
+  In general, not much contention is expected. The locks are intended to
+  serialize access to resources for devices. The only way to make wakeups
+  smarter would be at the cost of adding a field to struct mutex_waiter. This
+  would add overhead to all cases where normal mutexes are used, and
+  ticket_mutexes are less performance sensitive. For this reason
+  mutex_reserve_lock wakes up all waiters, so they can recheck themselves.
+
+Lockdep:
+  Lockdep operates on classes of locks. Ideally there should be only 2 lockdep
+  classes, one for the reservation_id itself, and one shared across all
+  reservation mutexes.
+
+  It may seem odd that reservation_id needs a lock class, but it's needed for
+  lockdep to work correctly. Because the lock class of all reservation mutexes
+  are equal, without nesting lockdep will complain. Using multiple lockdep
+  classes will result in reduced effectiveness.
+
+  The reason only a single lockdep class is wanted for reservation mutexes is
+  because it's always nested to the reservation_id's "lock", or only a single
+  lock is taken without any nesting at all. In the latter case no more than 1
+  mutex should be locked at the same time, which lockdep can only verify
+  if the lock class is always the same.
diff --git a/include/linux/mutex.h b/include/linux/mutex.h
index 9121595..602c247 100644
--- a/include/linux/mutex.h
+++ b/include/linux/mutex.h
@@ -62,6 +62,11 @@  struct mutex {
 #endif
 };
 
+struct ticket_mutex {
+	struct mutex base;
+	atomic_long_t reservation_id;
+};
+
 /*
  * This is the control structure for tasks blocked on mutex,
  * which resides on the blocked task's kernel stack:
@@ -109,12 +114,24 @@  static inline void mutex_destroy(struct mutex *lock) {}
 		__DEBUG_MUTEX_INITIALIZER(lockname) \
 		__DEP_MAP_MUTEX_INITIALIZER(lockname) }
 
+#define __TICKET_MUTEX_INITIALIZER(lockname) \
+		{ .base = __MUTEX_INITIALIZER(lockname) \
+		, .reservation_id = ATOMIC_LONG_INIT(0) }
+
 #define DEFINE_MUTEX(mutexname) \
 	struct mutex mutexname = __MUTEX_INITIALIZER(mutexname)
 
 extern void __mutex_init(struct mutex *lock, const char *name,
 			 struct lock_class_key *key);
 
+static inline void __ticket_mutex_init(struct ticket_mutex *lock,
+				       const char *name,
+				       struct lock_class_key *key)
+{
+	__mutex_init(&lock->base, name, key);
+	atomic_long_set(&lock->reservation_id, 0);
+}
+
 /**
  * mutex_is_locked - is the mutex locked
  * @lock: the mutex to be queried
@@ -133,26 +150,91 @@  static inline int mutex_is_locked(struct mutex *lock)
 #ifdef CONFIG_DEBUG_LOCK_ALLOC
 extern void mutex_lock_nested(struct mutex *lock, unsigned int subclass);
 extern void _mutex_lock_nest_lock(struct mutex *lock, struct lockdep_map *nest_lock);
+
 extern int __must_check mutex_lock_interruptible_nested(struct mutex *lock,
 					unsigned int subclass);
 extern int __must_check mutex_lock_killable_nested(struct mutex *lock,
 					unsigned int subclass);
 
+extern int __must_check _mutex_reserve_lock(struct ticket_mutex *lock,
+					struct lockdep_map *nest_lock,
+					unsigned long reservation_id);
+
+extern int __must_check _mutex_reserve_lock_interruptible(struct ticket_mutex *,
+					struct lockdep_map *nest_lock,
+					unsigned long reservation_id);
+
+extern void _mutex_reserve_lock_slow(struct ticket_mutex *lock,
+				     struct lockdep_map *nest_lock,
+				     unsigned long reservation_id);
+
+extern int __must_check _mutex_reserve_lock_intr_slow(struct ticket_mutex *,
+					struct lockdep_map *nest_lock,
+					unsigned long reservation_id);
+
 #define mutex_lock(lock) mutex_lock_nested(lock, 0)
 #define mutex_lock_interruptible(lock) mutex_lock_interruptible_nested(lock, 0)
 #define mutex_lock_killable(lock) mutex_lock_killable_nested(lock, 0)
 
 #define mutex_lock_nest_lock(lock, nest_lock)				\
 do {									\
-	typecheck(struct lockdep_map *, &(nest_lock)->dep_map);		\
+	typecheck(struct lockdep_map *, &(nest_lock)->dep_map);	\
 	_mutex_lock_nest_lock(lock, &(nest_lock)->dep_map);		\
 } while (0)
 
+#define mutex_reserve_lock(lock, nest_lock, reservation_id)		\
+({									\
+	typecheck(struct lockdep_map *, &(nest_lock)->dep_map);	\
+	_mutex_reserve_lock(lock, &(nest_lock)->dep_map, reservation_id);	\
+})
+
+#define mutex_reserve_lock_interruptible(lock, nest_lock, reservation_id)	\
+({									\
+	typecheck(struct lockdep_map *, &(nest_lock)->dep_map);	\
+	_mutex_reserve_lock_interruptible(lock, &(nest_lock)->dep_map,	\
+					   reservation_id);		\
+})
+
+#define mutex_reserve_lock_slow(lock, nest_lock, reservation_id)	\
+do {									\
+	typecheck(struct lockdep_map *, &(nest_lock)->dep_map);	\
+	_mutex_reserve_lock_slow(lock, &(nest_lock)->dep_map, reservation_id);	\
+} while (0)
+
+#define mutex_reserve_lock_intr_slow(lock, nest_lock, reservation_id)	\
+({									\
+	typecheck(struct lockdep_map *, &(nest_lock)->dep_map);	\
+	_mutex_reserve_lock_intr_slow(lock, &(nest_lock)->dep_map,	\
+				      reservation_id);			\
+})
+
 #else
 extern void mutex_lock(struct mutex *lock);
 extern int __must_check mutex_lock_interruptible(struct mutex *lock);
 extern int __must_check mutex_lock_killable(struct mutex *lock);
 
+extern int __must_check _mutex_reserve_lock(struct ticket_mutex *lock,
+					    unsigned long reservation_id);
+extern int __must_check _mutex_reserve_lock_interruptible(struct ticket_mutex *,
+						unsigned long reservation_id);
+
+extern void _mutex_reserve_lock_slow(struct ticket_mutex *lock,
+				     unsigned long reservation_id);
+extern int __must_check _mutex_reserve_lock_intr_slow(struct ticket_mutex *,
+						unsigned long reservation_id);
+
+#define mutex_reserve_lock(lock, nest_lock, reservation_id)		\
+	_mutex_reserve_lock(lock, reservation_id)
+
+#define mutex_reserve_lock_interruptible(lock, nest_lock, reservation_id)	\
+	_mutex_reserve_lock_interruptible(lock, reservation_id)
+
+#define mutex_reserve_lock_slow(lock, nest_lock, reservation_id)	\
+	_mutex_reserve_lock_slow(lock, reservation_id)
+
+#define mutex_reserve_lock_intr_slow(lock, nest_lock, reservation_id)	\
+	_mutex_reserve_lock_intr_slow(lock, reservation_id)
+
 # define mutex_lock_nested(lock, subclass) mutex_lock(lock)
 # define mutex_lock_interruptible_nested(lock, subclass) mutex_lock_interruptible(lock)
 # define mutex_lock_killable_nested(lock, subclass) mutex_lock_killable(lock)
@@ -167,6 +249,8 @@  extern int __must_check mutex_lock_killable(struct mutex *lock);
  */
 extern int mutex_trylock(struct mutex *lock);
 extern void mutex_unlock(struct mutex *lock);
+extern void mutex_unreserve_unlock(struct ticket_mutex *lock);
+
 extern int atomic_dec_and_mutex_lock(atomic_t *cnt, struct mutex *lock);
 
 #ifndef CONFIG_HAVE_ARCH_MUTEX_CPU_RELAX
diff --git a/kernel/mutex.c b/kernel/mutex.c
index 84a5f07..d6999a5 100644
--- a/kernel/mutex.c
+++ b/kernel/mutex.c
@@ -127,16 +127,116 @@  void __sched mutex_unlock(struct mutex *lock)
 
 EXPORT_SYMBOL(mutex_unlock);
 
+/**
+ * mutex_unreserve_unlock - release the mutex
+ * @lock: the mutex to be released
+ *
+ * Unlock a mutex that has been locked by this task previously
+ * with _mutex_reserve_lock*.
+ *
+ * This function must not be used in interrupt context. Unlocking
+ * of a unlocked mutex is not allowed.
+ */
+void __sched mutex_unreserve_unlock(struct ticket_mutex *lock)
+{
+	/*
+	 * The unlocking fastpath is the 0->1 transition from 'locked'
+	 * into 'unlocked' state:
+	 */
+
+	/*
+	 * mark mutex as no longer part of a reservation, next
+	 * locker can set this again
+	 */
+#ifdef CONFIG_DEBUG_MUTEXES
+	unsigned long rid;
+
+	rid = atomic_long_xchg(&lock->reservation_id, 0);
+
+	/*
+	 * If this WARN_ON triggers, you used mutex_lock to acquire,
+	 * but released with mutex_unreserve_unlock in this call.
+	 */
+	DEBUG_LOCKS_WARN_ON(!rid);
+#else
+	atomic_long_set(&lock->reservation_id, 0);
+
+	/*
+	 * When debugging is enabled we must not clear the owner before time,
+	 * the slow path will always be taken, and that clears the owner field
+	 * after verifying that it was indeed current.
+	 */
+	mutex_clear_owner(&lock->base);
+#endif
+	__mutex_fastpath_unlock(&lock->base.count, __mutex_unlock_slowpath);
+}
+EXPORT_SYMBOL(mutex_unreserve_unlock);
+
+static inline int __sched
+__mutex_lock_check_reserve(struct mutex *lock, unsigned long reservation_id)
+{
+	struct ticket_mutex *m = container_of(lock, struct ticket_mutex, base);
+	unsigned long cur_id;
+
+	cur_id = atomic_long_read(&m->reservation_id);
+	if (!cur_id)
+		return 0;
+
+	if (unlikely(reservation_id == cur_id))
+		return -EDEADLK;
+
+	if (unlikely(reservation_id - cur_id <= LONG_MAX))
+		return -EAGAIN;
+
+	return 0;
+}
+
+/*
+ * after acquiring lock with fastpath or when we lost out in contested
+ * slowpath, set reservation_id and wake up any waiters so they can recheck.
+ *
+ * This function is never called when CONFIG_DEBUG_LOCK_ALLOC is set,
+ * as the fastpath and opportunistic spinning are disabled in that case.
+ */
+static __always_inline void
+mutex_set_reservation_fastpath(struct ticket_mutex *lock,
+			       unsigned long reservation_id)
+{
+	unsigned long flags;
+	struct mutex_waiter *cur;
+
+	atomic_long_set(&lock->reservation_id, reservation_id);
+
+	/*
+	 * Check if lock is contended, if not there is nobody to wake up
+	 */
+	if (likely(atomic_read(&lock->base.count) == 0))
+		return;
+
+	/*
+	 * Uh oh, we raced in fastpath, wake up everyone in this case,
+	 * so they can see the new reservation_id
+	 */
+	spin_lock_mutex(&lock->base.wait_lock, flags);
+	list_for_each_entry(cur, &lock->base.wait_list, list) {
+		debug_mutex_wake_waiter(&lock->base, cur);
+		wake_up_process(cur->task);
+	}
+	spin_unlock_mutex(&lock->base.wait_lock, flags);
+}
+
 /*
  * Lock a mutex (possibly interruptible), slowpath:
  */
-static inline int __sched
+static __always_inline int __sched
 __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
-		    struct lockdep_map *nest_lock, unsigned long ip)
+		    struct lockdep_map *nest_lock, unsigned long ip,
+		    unsigned long reservation_id, bool res_slow)
 {
 	struct task_struct *task = current;
 	struct mutex_waiter waiter;
 	unsigned long flags;
+	int ret;
 
 	preempt_disable();
 	mutex_acquire_nest(&lock->dep_map, subclass, 0, nest_lock, ip);
@@ -163,6 +263,12 @@  __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
 	for (;;) {
 		struct task_struct *owner;
 
+		if (!__builtin_constant_p(reservation_id) && !res_slow) {
+			ret = __mutex_lock_check_reserve(lock, reservation_id);
+			if (ret)
+				goto err_nowait;
+		}
+
 		/*
 		 * If there's an owner, wait for it to either
 		 * release the lock or go to sleep.
@@ -173,6 +279,13 @@  __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
 
 		if (atomic_cmpxchg(&lock->count, 1, 0) == 1) {
 			lock_acquired(&lock->dep_map, ip);
+			if (res_slow) {
+				struct ticket_mutex *m;
+				m = container_of(lock, struct ticket_mutex, base);
+
+				mutex_set_reservation_fastpath(m, reservation_id);
+			}
+
 			mutex_set_owner(lock);
 			preempt_enable();
 			return 0;
@@ -228,15 +341,16 @@  __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
 		 * TASK_UNINTERRUPTIBLE case.)
 		 */
 		if (unlikely(signal_pending_state(state, task))) {
-			mutex_remove_waiter(lock, &waiter,
-					    task_thread_info(task));
-			mutex_release(&lock->dep_map, 1, ip);
-			spin_unlock_mutex(&lock->wait_lock, flags);
+			ret = -EINTR;
+			goto err;
+		}
 
-			debug_mutex_free_waiter(&waiter);
-			preempt_enable();
-			return -EINTR;
+		if (!__builtin_constant_p(reservation_id) && !res_slow) {
+			ret = __mutex_lock_check_reserve(lock, reservation_id);
+			if (ret)
+				goto err;
 		}
+
 		__set_task_state(task, state);
 
 		/* didn't get the lock, go to sleep: */
@@ -251,6 +365,41 @@  done:
 	mutex_remove_waiter(lock, &waiter, current_thread_info());
 	mutex_set_owner(lock);
 
+	if (!__builtin_constant_p(reservation_id)) {
+		struct ticket_mutex *m = container_of(lock,
+						      struct ticket_mutex,
+						      base);
+		struct mutex_waiter *cur;
+
+		/*
+		 * this should get optimized out for the common case,
+		 * and is only important for _mutex_reserve_lock
+		 */
+
+#ifdef CONFIG_DEBUG_LOCK_ALLOC
+		unsigned long old_id;
+		old_id = atomic_long_xchg(&m->reservation_id, reservation_id);
+
+		/*
+		 * If this WARN_ON triggers, you used mutex_lock to acquire,
+		 * but released with mutex_unreserve_unlock in this call.
+		 */
+		DEBUG_LOCKS_WARN_ON(old_id);
+#else
+		atomic_long_set(&m->reservation_id, reservation_id);
+#endif
+
+		/*
+		 * give any possible sleeping processes the chance to wake up,
+		 * so they can recheck if they have to back off from
+		 * reservations
+		 */
+		list_for_each_entry(cur, &lock->wait_list, list) {
+			debug_mutex_wake_waiter(lock, cur);
+			wake_up_process(cur->task);
+		}
+	}
+
 	/* set it to 0 if there are no waiters left: */
 	if (likely(list_empty(&lock->wait_list)))
 		atomic_set(&lock->count, 0);
@@ -261,6 +410,19 @@  done:
 	preempt_enable();
 
 	return 0;
+
+err:
+	mutex_remove_waiter(lock, &waiter, task_thread_info(task));
+	spin_unlock_mutex(&lock->wait_lock, flags);
+	debug_mutex_free_waiter(&waiter);
+
+#ifdef CONFIG_MUTEX_SPIN_ON_OWNER
+err_nowait:
+#endif
+	mutex_release(&lock->dep_map, 1, ip);
+
+	preempt_enable();
+	return ret;
 }
 
 #ifdef CONFIG_DEBUG_LOCK_ALLOC
@@ -268,7 +430,8 @@  void __sched
 mutex_lock_nested(struct mutex *lock, unsigned int subclass)
 {
 	might_sleep();
-	__mutex_lock_common(lock, TASK_UNINTERRUPTIBLE, subclass, NULL, _RET_IP_);
+	__mutex_lock_common(lock, TASK_UNINTERRUPTIBLE,
+			    subclass, NULL, _RET_IP_, 0, 0);
 }
 
 EXPORT_SYMBOL_GPL(mutex_lock_nested);
@@ -277,7 +440,8 @@  void __sched
 _mutex_lock_nest_lock(struct mutex *lock, struct lockdep_map *nest)
 {
 	might_sleep();
-	__mutex_lock_common(lock, TASK_UNINTERRUPTIBLE, 0, nest, _RET_IP_);
+	__mutex_lock_common(lock, TASK_UNINTERRUPTIBLE,
+			    0, nest, _RET_IP_, 0, 0);
 }
 
 EXPORT_SYMBOL_GPL(_mutex_lock_nest_lock);
@@ -286,7 +450,8 @@  int __sched
 mutex_lock_killable_nested(struct mutex *lock, unsigned int subclass)
 {
 	might_sleep();
-	return __mutex_lock_common(lock, TASK_KILLABLE, subclass, NULL, _RET_IP_);
+	return __mutex_lock_common(lock, TASK_KILLABLE,
+				   subclass, NULL, _RET_IP_, 0, 0);
 }
 EXPORT_SYMBOL_GPL(mutex_lock_killable_nested);
 
@@ -295,10 +460,63 @@  mutex_lock_interruptible_nested(struct mutex *lock, unsigned int subclass)
 {
 	might_sleep();
 	return __mutex_lock_common(lock, TASK_INTERRUPTIBLE,
-				   subclass, NULL, _RET_IP_);
+				   subclass, NULL, _RET_IP_, 0, 0);
 }
 
 EXPORT_SYMBOL_GPL(mutex_lock_interruptible_nested);
+
+int __sched
+_mutex_reserve_lock(struct ticket_mutex *lock, struct lockdep_map *nest,
+		    unsigned long reservation_id)
+{
+	DEBUG_LOCKS_WARN_ON(!reservation_id);
+
+	might_sleep();
+	return __mutex_lock_common(&lock->base, TASK_UNINTERRUPTIBLE,
+				   0, nest, _RET_IP_, reservation_id, 0);
+}
+EXPORT_SYMBOL_GPL(_mutex_reserve_lock);
+
+
+int __sched
+_mutex_reserve_lock_interruptible(struct ticket_mutex *lock,
+				  struct lockdep_map *nest,
+				  unsigned long reservation_id)
+{
+	DEBUG_LOCKS_WARN_ON(!reservation_id);
+
+	might_sleep();
+	return __mutex_lock_common(&lock->base, TASK_INTERRUPTIBLE,
+				   0, nest, _RET_IP_, reservation_id, 0);
+}
+EXPORT_SYMBOL_GPL(_mutex_reserve_lock_interruptible);
+
+void __sched
+_mutex_reserve_lock_slow(struct ticket_mutex *lock, struct lockdep_map *nest,
+			 unsigned long reservation_id)
+{
+	DEBUG_LOCKS_WARN_ON(!reservation_id);
+
+	might_sleep();
+	__mutex_lock_common(&lock->base, TASK_UNINTERRUPTIBLE, 0,
+			    nest, _RET_IP_, reservation_id, 1);
+}
+EXPORT_SYMBOL_GPL(_mutex_reserve_lock_slow);
+
+int __sched
+_mutex_reserve_lock_intr_slow(struct ticket_mutex *lock,
+			      struct lockdep_map *nest,
+			      unsigned long reservation_id)
+{
+	DEBUG_LOCKS_WARN_ON(!reservation_id);
+
+	might_sleep();
+	return __mutex_lock_common(&lock->base, TASK_INTERRUPTIBLE, 0,
+				   nest, _RET_IP_, reservation_id, 1);
+}
+EXPORT_SYMBOL_GPL(_mutex_reserve_lock_intr_slow);
+
+
 #endif
 
 /*
@@ -401,20 +619,39 @@  __mutex_lock_slowpath(atomic_t *lock_count)
 {
 	struct mutex *lock = container_of(lock_count, struct mutex, count);
 
-	__mutex_lock_common(lock, TASK_UNINTERRUPTIBLE, 0, NULL, _RET_IP_);
+	__mutex_lock_common(lock, TASK_UNINTERRUPTIBLE, 0,
+			    NULL, _RET_IP_, 0, 0);
 }
 
 static noinline int __sched
 __mutex_lock_killable_slowpath(struct mutex *lock)
 {
-	return __mutex_lock_common(lock, TASK_KILLABLE, 0, NULL, _RET_IP_);
+	return __mutex_lock_common(lock, TASK_KILLABLE, 0,
+				   NULL, _RET_IP_, 0, 0);
 }
 
 static noinline int __sched
 __mutex_lock_interruptible_slowpath(struct mutex *lock)
 {
-	return __mutex_lock_common(lock, TASK_INTERRUPTIBLE, 0, NULL, _RET_IP_);
+	return __mutex_lock_common(lock, TASK_INTERRUPTIBLE, 0,
+				   NULL, _RET_IP_, 0, 0);
+}
+
+static noinline int __sched
+__mutex_lock_reserve_slowpath(struct ticket_mutex *lock, unsigned long rid)
+{
+	return __mutex_lock_common(&lock->base, TASK_UNINTERRUPTIBLE, 0,
+				   NULL, _RET_IP_, rid, 0);
+}
+
+static noinline int __sched
+__mutex_lock_interruptible_reserve_slowpath(struct ticket_mutex *lock,
+					    unsigned long rid)
+{
+	return __mutex_lock_common(&lock->base, TASK_INTERRUPTIBLE, 0,
+				   NULL, _RET_IP_, rid, 0);
 }
+
 #endif
 
 /*
@@ -470,6 +707,63 @@  int __sched mutex_trylock(struct mutex *lock)
 }
 EXPORT_SYMBOL(mutex_trylock);
 
+#ifndef CONFIG_DEBUG_LOCK_ALLOC
+int __sched
+_mutex_reserve_lock(struct ticket_mutex *lock, unsigned long rid)
+{
+	int ret;
+
+	might_sleep();
+
+	ret = __mutex_fastpath_lock_retval(&lock->base.count);
+
+	if (likely(!ret)) {
+		mutex_set_reservation_fastpath(lock, rid);
+		mutex_set_owner(&lock->base);
+	} else
+		ret = __mutex_lock_reserve_slowpath(lock, rid);
+	return ret;
+}
+EXPORT_SYMBOL(_mutex_reserve_lock);
+
+int __sched
+_mutex_reserve_lock_interruptible(struct ticket_mutex *lock, unsigned long rid)
+{
+	int ret;
+
+	might_sleep();
+
+	ret = __mutex_fastpath_lock_retval(&lock->base.count);
+
+	if (likely(!ret)) {
+		mutex_set_reservation_fastpath(lock, rid);
+		mutex_set_owner(&lock->base);
+	} else
+		ret = __mutex_lock_interruptible_reserve_slowpath(lock, rid);
+	return ret;
+}
+EXPORT_SYMBOL(_mutex_reserve_lock_interruptible);
+
+void __sched
+_mutex_reserve_lock_slow(struct ticket_mutex *lock, unsigned long rid)
+{
+	might_sleep();
+	__mutex_lock_common(&lock->base, TASK_UNINTERRUPTIBLE,
+			    0, NULL, _RET_IP_, rid, 1);
+}
+EXPORT_SYMBOL(_mutex_reserve_lock_slow);
+
+int __sched
+_mutex_reserve_lock_intr_slow(struct ticket_mutex *lock, unsigned long rid)
+{
+	might_sleep();
+	return __mutex_lock_common(&lock->base, TASK_INTERRUPTIBLE,
+				   0, NULL, _RET_IP_, rid, 1);
+}
+EXPORT_SYMBOL(_mutex_reserve_lock_intr_slow);
+
+#endif
+
 /**
  * atomic_dec_and_mutex_lock - return holding mutex if we dec to 0
  * @cnt: the atomic which we are to dec