Message ID | 20230322224403.35742-1-robdclark@gmail.com (mailing list archive) |
---|---|
State | Not Applicable |
Headers | From: Rob Clark <robdclark@gmail.com>; To: dri-devel@lists.freedesktop.org; Cc: Rob Clark <robdclark@chromium.org>, Luben Tuikov <luben.tuikov@amd.com>, David Airlie <airlied@gmail.com>, Daniel Vetter <daniel@ffwll.ch>, Sumit Semwal <sumit.semwal@linaro.org>, Christian König <christian.koenig@amd.com>, linux-kernel@vger.kernel.org, linux-media@vger.kernel.org, linaro-mm-sig@lists.linaro.org; Subject: [RFC] drm/scheduler: Unwrap job dependencies; Date: Wed, 22 Mar 2023 15:44:03 -0700 |
Series | [RFC] drm/scheduler: Unwrap job dependencies |
Commit Message
Rob Clark
March 22, 2023, 10:44 p.m. UTC
From: Rob Clark <robdclark@chromium.org>

Container fences have burner contexts, which makes the trick to store at
most one fence per context somewhat useless if we don't unwrap array or
chain fences.

Signed-off-by: Rob Clark <robdclark@chromium.org>
---
tbh, I'm not sure why we weren't doing this already, unless there is
something I'm overlooking

 drivers/gpu/drm/scheduler/sched_main.c | 43 +++++++++++++++++---------
 1 file changed, 28 insertions(+), 15 deletions(-)
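To illustrate the problem the commit message describes (an exposition-only sketch, not part of the patch): merging fences — e.g. merging sync_file fds in userspace — produces a dma_fence_array with a freshly allocated fence context of its own, so the scheduler's per-context deduplication can never match a container against the engine fences inside it. Unwrapping exposes the stable per-engine contexts that the dedup logic keys on:

```c
/*
 * Exposition-only sketch: a container fence carries a one-off ("burner")
 * context, while the fences inside it keep their stable per-engine
 * contexts, which is what the scheduler's dedup keys on.
 */
#include <linux/dma-fence-unwrap.h>

static void show_fence_contexts(struct dma_fence *fence)
{
	struct dma_fence_unwrap iter;
	struct dma_fence *f;

	/* the container itself: freshly allocated context, never dedups */
	pr_info("container ctx=%llu\n", fence->context);

	/* the contained fences: per-engine contexts suitable for dedup */
	dma_fence_unwrap_for_each(f, &iter, fence)
		pr_info("  leaf ctx=%llu seqno=%llu\n", f->context, f->seqno);
}
```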
Comments
On 22.03.23 at 23:44, Rob Clark wrote:
> From: Rob Clark <robdclark@chromium.org>
>
> Container fences have burner contexts, which makes the trick to store at
> most one fence per context somewhat useless if we don't unwrap array or
> chain fences.

Mhm, we intentionally kept them not unwrapped since this way they only
occupy one fence slot.

But it might be better to unwrap them if you add many of those
dependencies.

> [SNIP]
> -int drm_sched_job_add_dependency(struct drm_sched_job *job,
> -				 struct dma_fence *fence)
> +static int _add_dependency(struct drm_sched_job *job, struct dma_fence *fence)

Please keep the drm_sched_job_ prefix here even for static functions.
The symbol _add_dependency just sucks in a backtrace, especially when
it's tail optimized.

> [SNIP]
> +int drm_sched_job_add_dependency(struct drm_sched_job *job,
> +				 struct dma_fence *fence)

Maybe name the new function drm_sched_job_unwrap_add_dependency or
something like this.

I need to double check, but I think for some cases we don't need or
don't even want this in the driver.

Christian.
On Thu, Mar 23, 2023 at 12:35 AM Christian König <christian.koenig@amd.com> wrote:
>
> On 22.03.23 at 23:44, Rob Clark wrote:
> > From: Rob Clark <robdclark@chromium.org>
> >
> > Container fences have burner contexts, which makes the trick to store at
> > most one fence per context somewhat useless if we don't unwrap array or
> > chain fences.
>
> Mhm, we intentionally kept them not unwrapped since this way they only
> occupy one fence slot.
>
> But it might be better to unwrap them if you add many of those
> dependencies.
>
> [SNIP]
>
> Maybe name the new function drm_sched_job_unwrap_add_dependency or
> something like this.
>
> I need to double check, but I think for some cases we don't need or
> don't even want this in the driver.

I'd be curious to know the cases where you don't want this.. one thing
I was thinking about, what if you have a container fence with two
contained fences. One is on the same ctx as the job, one is not but
signals sooner. You end up artificially waiting on both, which seems
sub-optimal.

Anyways, I can make this a new entrypoint which unwraps, and/or rename
the internal static function, if we think this is a good idea.

BR,
-R

> Christian.
On 23.03.23 at 14:54, Rob Clark wrote:
> On Thu, Mar 23, 2023 at 12:35 AM Christian König
> <christian.koenig@amd.com> wrote:
>> [SNIP]
>> Maybe name the new function drm_sched_job_unwrap_add_dependency or
>> something like this.
>>
>> I need to double check, but I think for some cases we don't need or
>> don't even want this in the driver.
> I'd be curious to know the cases where you don't want this.. one thing
> I was thinking about, what if you have a container fence with two
> contained fences. One is on the same ctx as the job, one is not but
> signals sooner. You end up artificially waiting on both, which seems
> sub-optimal.

Well resv objects don't contain other containers, for example.

Then we also have a use case in amdgpu where fences need to be
explicitly waited for even when they are from the same ctx as the job,
because otherwise we wouldn't see everything cache coherent.

On the other hand we currently handle that amdgpu use case differently,
and the extra overhead of unwrapping fences even if they can't be
containers is probably negligible.

> Anyways, I can make this a new entrypoint which unwraps, and/or rename
> the internal static function, if we think this is a good idea.

If you think that's unnecessary, keeping your original approach is fine
with me as well.

Regards,
Christian.

>
> BR,
> -R
On Thu, Mar 23, 2023 at 7:03 AM Christian König <christian.koenig@amd.com> wrote:
>
> On 23.03.23 at 14:54, Rob Clark wrote:
> > [SNIP]
> > I'd be curious to know the cases where you don't want this.. one thing
> > I was thinking about, what if you have a container fence with two
> > contained fences. One is on the same ctx as the job, one is not but
> > signals sooner. You end up artificially waiting on both, which seems
> > sub-optimal.
>
> Well resv objects don't contain other containers, for example.

I suppose I have the explicit sync case more in mind, where the
dependent fence ends up being a chain or array (if userspace is
merging fence fd's).

> Then we also have a use case in amdgpu where fences need to be
> explicitly waited for even when they are from the same ctx as the job,
> because otherwise we wouldn't see everything cache coherent.

This was the kinda weird case I wanted to make sure I wasn't breaking.
I remember seeing something fly by for this, but can't find it now or
remember what amdgpu's solution was..

> On the other hand we currently handle that amdgpu use case differently,
> and the extra overhead of unwrapping fences even if they can't be
> containers is probably negligible.
>
> > Anyways, I can make this a new entrypoint which unwraps, and/or rename
> > the internal static function, if we think this is a good idea.
>
> If you think that's unnecessary, keeping your original approach is fine
> with me as well.

I'm going to assume unnecessary until someone speaks up with their
weird specific case ;-)

BR,
-R

> Regards,
> Christian.
On Thu, Mar 23, 2023 at 2:30 PM Rob Clark <robdclark@gmail.com> wrote:
>
> On Thu, Mar 23, 2023 at 7:03 AM Christian König
> <christian.koenig@amd.com> wrote:
> > [SNIP]
>
> I'm going to assume unnecessary until someone speaks up with their
> weird specific case ;-)

So, this patch turns out to blow up spectacularly with dma_fence
refcnt underflows when I enable DRIVER_SYNCOBJ_TIMELINE .. I think,
because it starts unwrapping fence chains, possibly in parallel with
fence signaling on the retire path. Is it supposed to be permissible
to unwrap a fence chain concurrently?

BR,
-R
On 04.12.23 at 22:54, Rob Clark wrote:
> On Thu, Mar 23, 2023 at 2:30 PM Rob Clark <robdclark@gmail.com> wrote:
>> [SNIP]
> So, this patch turns out to blow up spectacularly with dma_fence
> refcnt underflows when I enable DRIVER_SYNCOBJ_TIMELINE .. I think,
> because it starts unwrapping fence chains, possibly in parallel with
> fence signaling on the retire path. Is it supposed to be permissible
> to unwrap a fence chain concurrently?

The DMA-fence chain object and helper functions were designed so that
concurrent accesses to all elements are always possible.

See dma_fence_chain_walk() and dma_fence_chain_get_prev() for example.
dma_fence_chain_walk() starts with a reference to the current fence (the
anchor of the walk) and tries to grab an up-to-date reference on the
previous fence in the chain. Only after that reference is successfully
acquired do we drop the reference to the anchor where we started.

Same for dma_fence_array_first(), dma_fence_array_next(). Here we hold a
reference to the array, which in turn holds references to each fence
inside the array until it is destroyed itself.

When this blows up we have somehow mixed up the references somewhere.

Regards,
Christian.

>
> BR,
> -R
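For reference, the reference-juggling described above is wrapped up in the dma_fence_chain_for_each() helper. A hedged sketch of a concurrency-safe walk (exposition only, not code from this thread):

```c
#include <linux/dma-fence-chain.h>

/*
 * dma_fence_chain_for_each() grabs a reference on the anchor, and
 * dma_fence_chain_walk() only drops the current node once it holds a
 * reference on the previous one, so the walk is safe against chain
 * nodes being concurrently signaled and garbage collected.
 */
static void walk_chain(struct dma_fence *head)
{
	struct dma_fence *iter;

	dma_fence_chain_for_each(iter, head) {
		struct dma_fence_chain *chain = to_dma_fence_chain(iter);
		/* chain is NULL once the walk reaches a non-chain fence */
		struct dma_fence *f = chain ? chain->fence : iter;

		pr_info("ctx=%llu seqno=%llu\n", f->context, f->seqno);
	}
	/* references are dropped as the loop advances; nothing left to put */
}
```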
On Mon, Dec 4, 2023 at 10:46 PM Christian König <christian.koenig@amd.com> wrote:
>
> On 04.12.23 at 22:54, Rob Clark wrote:
> > [SNIP]
> > So, this patch turns out to blow up spectacularly with dma_fence
> > refcnt underflows when I enable DRIVER_SYNCOBJ_TIMELINE ..
>
> The DMA-fence chain object and helper functions were designed so that
> concurrent accesses to all elements are always possible.
>
> [SNIP]
>
> When this blows up we have somehow mixed up the references somewhere.

That's what it looked like to me, but wanted to make sure I wasn't
overlooking something subtle. And in this case, the fence actually
should be the syncobj timeline point fence, not the fence chain.
Virtgpu has essentially the same logic (there we really do want to
unwrap fences so we can pass host fences back to the host rather than
waiting in the guest); I'm not sure if it would blow up in the same way.

BR,
-R
On 05.12.23 at 16:41, Rob Clark wrote:
> [SNIP]
> That's what it looked like to me, but wanted to make sure I wasn't
> overlooking something subtle. And in this case, the fence actually
> should be the syncobj timeline point fence, not the fence chain.
> Virtgpu has essentially the same logic (there we really do want to
> unwrap fences so we can pass host fences back to the host rather than
> waiting in the guest); I'm not sure if it would blow up in the same way.

Well, do you have a backtrace of what exactly happens?

Maybe we have some _put() before _get() or something like this.

Thanks,
Christian.
On Tue, Dec 5, 2023 at 7:58 AM Christian König <christian.koenig@amd.com> wrote:
>
> [SNIP]
>
> Well, do you have a backtrace of what exactly happens?
>
> Maybe we have some _put() before _get() or something like this.

I hacked up something to store the backtrace in dma_fence_release()
(and leak the block so the backtrace would still be around later when
dma_fence_get/put was later called) and ended up with:

[  152.811360] freed at:
[  152.813718] dma_fence_release+0x30/0x134
[  152.817865] dma_fence_put+0x38/0x98 [gpu_sched]
[  152.822657] drm_sched_job_add_dependency+0x160/0x18c [gpu_sched]
[  152.828948] drm_sched_job_add_syncobj_dependency+0x58/0x88 [gpu_sched]
[  152.835770] msm_ioctl_gem_submit+0x580/0x1160 [msm]
[  152.841070] drm_ioctl_kernel+0xec/0x16c
[  152.845132] drm_ioctl+0x2e8/0x3f4
[  152.848646] vfs_ioctl+0x30/0x50
[  152.851982] __arm64_sys_ioctl+0x80/0xb4
[  152.856039] invoke_syscall+0x8c/0x120
[  152.859919] el0_svc_common.constprop.0+0xc0/0xdc
[  152.864777] do_el0_svc+0x24/0x30
[  152.868207] el0_svc+0x8c/0xd8
[  152.871365] el0t_64_sync_handler+0x84/0x12c
[  152.875771] el0t_64_sync+0x190/0x194

I suppose that doesn't guarantee that this was the problematic put.
But dropping this patch to unwrap the fence makes the problem go
away..

BR,
-R

> Thanks,
> Christian.
On Tue, Dec 5, 2023 at 8:56 AM Rob Clark <robdclark@gmail.com> wrote:
>
> [SNIP]
>
> I suppose that doesn't guarantee that this was the problematic put.
> But dropping this patch to unwrap the fence makes the problem go
> away..

Oh, hmm, _add_dependency() is consuming the fence reference

BR,
-R
On 05.12.23 at 18:14, Rob Clark wrote:
> [SNIP]
> Oh, hmm, _add_dependency() is consuming the fence reference

Yeah, I was just about to point that out as well :)

Should be trivial to fix,
Christian
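To make the refcounting issue concrete: dma_fence_unwrap_for_each() yields borrowed references to the contained fences, while _add_dependency() consumes a reference on every fence it is handed. One plausible shape of the "trivial fix" mentioned above — a sketch of the reference discipline only, not necessarily the patch that eventually landed:

```c
/*
 * Sketch only: take a reference on each unwrapped fence before handing
 * it to _add_dependency(), which consumes one, and drop the caller's
 * reference on the container at the end so the documented "@fence is
 * consumed in both the success and error cases" contract still holds.
 */
int drm_sched_job_add_dependency(struct drm_sched_job *job,
				 struct dma_fence *fence)
{
	struct dma_fence_unwrap iter;
	struct dma_fence *f;
	int ret = 0;

	dma_fence_unwrap_for_each(f, &iter, fence) {
		dma_fence_get(f);	/* _add_dependency() consumes this */
		ret = _add_dependency(job, f);
		if (ret)
			break;
	}

	dma_fence_put(fence);	/* drop the reference the caller handed over */
	return ret;
}
```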
Patch

```diff
diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
index c2ee44d6224b..f59e5335afbb 100644
--- a/drivers/gpu/drm/scheduler/sched_main.c
+++ b/drivers/gpu/drm/scheduler/sched_main.c
@@ -41,20 +41,21 @@
  * 4. Entities themselves maintain a queue of jobs that will be scheduled on
  *    the hardware.
  *
  * The jobs in a entity are always scheduled in the order that they were pushed.
  */
 
 #include <linux/kthread.h>
 #include <linux/wait.h>
 #include <linux/sched.h>
 #include <linux/completion.h>
+#include <linux/dma-fence-unwrap.h>
 #include <linux/dma-resv.h>
 #include <uapi/linux/sched/types.h>
 
 #include <drm/drm_print.h>
 #include <drm/drm_gem.h>
 #include <drm/gpu_scheduler.h>
 #include <drm/spsc_queue.h>
 
 #define CREATE_TRACE_POINTS
 #include "gpu_scheduler_trace.h"
@@ -665,41 +666,27 @@ void drm_sched_job_arm(struct drm_sched_job *job)
 	sched = entity->rq->sched;
 
 	job->sched = sched;
 	job->s_priority = entity->rq - sched->sched_rq;
 	job->id = atomic64_inc_return(&sched->job_id_count);
 
 	drm_sched_fence_init(job->s_fence, job->entity);
 }
 EXPORT_SYMBOL(drm_sched_job_arm);
 
-/**
- * drm_sched_job_add_dependency - adds the fence as a job dependency
- * @job: scheduler job to add the dependencies to
- * @fence: the dma_fence to add to the list of dependencies.
- *
- * Note that @fence is consumed in both the success and error cases.
- *
- * Returns:
- * 0 on success, or an error on failing to expand the array.
- */
-int drm_sched_job_add_dependency(struct drm_sched_job *job,
-				 struct dma_fence *fence)
+static int _add_dependency(struct drm_sched_job *job, struct dma_fence *fence)
 {
 	struct dma_fence *entry;
 	unsigned long index;
 	u32 id = 0;
 	int ret;
 
-	if (!fence)
-		return 0;
-
 	/* Deduplicate if we already depend on a fence from the same context.
 	 * This lets the size of the array of deps scale with the number of
 	 * engines involved, rather than the number of BOs.
 	 */
 	xa_for_each(&job->dependencies, index, entry) {
 		if (entry->context != fence->context)
 			continue;
 
 		if (dma_fence_is_later(fence, entry)) {
 			dma_fence_put(entry);
@@ -709,20 +696,46 @@ int drm_sched_job_add_dependency(struct drm_sched_job *job,
 		}
 		return 0;
 	}
 
 	ret = xa_alloc(&job->dependencies, &id, fence, xa_limit_32b, GFP_KERNEL);
 	if (ret != 0)
 		dma_fence_put(fence);
 
 	return ret;
 }
+
+/**
+ * drm_sched_job_add_dependency - adds the fence as a job dependency
+ * @job: scheduler job to add the dependencies to
+ * @fence: the dma_fence to add to the list of dependencies.
+ *
+ * Note that @fence is consumed in both the success and error cases.
+ *
+ * Returns:
+ * 0 on success, or an error on failing to expand the array.
+ */
+int drm_sched_job_add_dependency(struct drm_sched_job *job,
+				 struct dma_fence *fence)
+{
+	struct dma_fence_unwrap iter;
+	struct dma_fence *f;
+	int ret = 0;
+
+	dma_fence_unwrap_for_each (f, &iter, fence) {
+		ret = _add_dependency(job, f);
+		if (ret)
+			break;
+	}
+
+	return ret;
+}
 EXPORT_SYMBOL(drm_sched_job_add_dependency);
 
 /**
  * drm_sched_job_add_resv_dependencies - add all fences from the resv to the job
  * @job: scheduler job to add the dependencies to
  * @resv: the dma_resv object to get the fences from
  * @usage: the dma_resv_usage to use to filter the fences
  *
  * This adds all fences matching the given usage from @resv to @job.
  * Must be called with the @resv lock held.
```
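For context on how this entrypoint is reached (see the msm backtrace above, which goes through drm_sched_job_add_syncobj_dependency()), here is a hedged sketch of a driver submit path feeding an explicit-sync fence into it; the helper name and its place in the submit ioctl are illustrative assumptions:

```c
#include <drm/drm_syncobj.h>
#include <drm/gpu_scheduler.h>

/*
 * Illustrative caller sketch: resolve a userspace syncobj handle to its
 * fence -- possibly a chain or array container -- and hand ownership of
 * that reference to the scheduler.
 */
static int submit_add_in_fence(struct drm_sched_job *job,
			       struct drm_file *file,
			       u32 handle, u64 point)
{
	struct dma_fence *fence;
	int ret;

	ret = drm_syncobj_find_fence(file, handle, point, 0, &fence);
	if (ret)
		return ret;

	/* consumes the fence reference in both the success and error cases */
	return drm_sched_job_add_dependency(job, fence);
}
```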