Message ID | 20230707092414.866760-1-zyytlz.wz@163.com (mailing list archive) |
---|---|
State | Accepted |
Delegated to: | Hans Verkuil |
Headers |
Received: from vger.kernel.org ([23.128.96.18]) by www.linuxtv.org with esmtp (Exim 4.92) (envelope-from <linux-media-owner@vger.kernel.org>) id 1qHhiw-00Cz6m-Fw; Fri, 07 Jul 2023 09:25:58 +0000 Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S232285AbjGGJZ4 (ORCPT <rfc822;mkrufky@linuxtv.org> + 1 other); Fri, 7 Jul 2023 05:25:56 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:49676 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229572AbjGGJZz (ORCPT <rfc822;linux-media@vger.kernel.org>); Fri, 7 Jul 2023 05:25:55 -0400 Received: from m12.mail.163.com (m12.mail.163.com [220.181.12.216]) by lindbergh.monkeyblade.net (Postfix) with ESMTP id 919B5172B; Fri, 7 Jul 2023 02:25:52 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=163.com; s=s110527; h=From:Subject:Date:Message-Id:MIME-Version; bh=1ZSFg IwAdkqH7N1HARcWbHynu04UbnWZF0dHrxv5TFE=; b=V6IkZMTIcRqKgYiXxGieM DqGVE9KJlUxFfbDN59IkXE8ducvoNEY/dunfwg9ZIh5yTVFF2XoL1xKcjWECIrnm NduxOiQ6Oxy2z0wcnXqsCjhSbI/DJPekxiNu4PFgqkUvt5ZVHhG7vouT1bra7Bvt RVHaIrxMOJg7XvFqcZsT2I= Received: from leanderwang-LC2.localdomain (unknown [111.206.145.21]) by zwqz-smtp-mta-g5-4 (Coremail) with SMTP id _____wBXXT3A2adkGf2qBw--.56525S2; Fri, 07 Jul 2023 17:24:16 +0800 (CST) From: Zheng Wang <zyytlz.wz@163.com> To: Kyrie.Wu@mediatek.com Cc: bin.liu@mediatek.com, mchehab@kernel.org, matthias.bgg@gmail.com, angelogioacchino.delregno@collabora.com, linux-media@vger.kernel.org, linux-kernel@vger.kernel.org, linux-arm-kernel@lists.infradead.org, linux-mediatek@lists.infradead.org, Irui.Wang@mediatek.com, security@kernel.org, hackerzheng666@gmail.com, 1395428693sheep@gmail.com, alex000young@gmail.com, Zheng Wang <zyytlz.wz@163.com> Subject: [RESEND PATCH v2] media: mtk-jpeg: Fix use after free bug due to uncanceled work Date: Fri, 7 Jul 2023 17:24:14 +0800 Message-Id: <20230707092414.866760-1-zyytlz.wz@163.com> X-Mailer: git-send-email 2.25.1 MIME-Version: 1.0 Content-Transfer-Encoding: 8bit X-CM-TRANSID: _____wBXXT3A2adkGf2qBw--.56525S2 X-Coremail-Antispam: 1Uf129KBjvJXoW7uF17uw15Jr4xtw4xCFy7trb_yoW8Wr43pr W3K3yUCrWUGFs0qr1UJ3W7ZFyrCwnxKa1xWr17uw4Iv393Jrs7JryFya48tFWIyF92kayf Jr18X34xGr4qvFJanT9S1TB71UUUUUUqnTZGkaVYY2UrUUUUjbIjqfuFe4nvWSU5nxnvy2 9KBjDUYxBIdaVFxhVjvjDU0xZFpf9x0ziaZXrUUUUU= X-Originating-IP: [111.206.145.21] X-CM-SenderInfo: h2113zf2oz6qqrwthudrp/1tbiXAWlU1Xl7LcvLgAAsF X-Spam-Status: No, score=-2.1 required=5.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,FREEMAIL_FROM, RCVD_IN_DNSWL_NONE,RCVD_IN_MSPIKE_BL,RCVD_IN_MSPIKE_L4,SPF_HELO_NONE, SPF_PASS,T_SCC_BODY_TEXT_LINE autolearn=ham autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net Precedence: bulk List-ID: <linux-media.vger.kernel.org> X-Mailing-List: linux-media@vger.kernel.org X-LSpam-Score: -2.5 (--) X-LSpam-Report: No, score=-2.5 required=5.0 tests=BAYES_00=-1.9,DKIM_SIGNED=0.1,DKIM_VALID=-0.1,DKIM_VALID_AU=-0.1,FREEMAIL_FORGED_FROMDOMAIN=0.001,FREEMAIL_FROM=0.001,HEADER_FROM_DIFFERENT_DOMAINS=0.5,MAILING_LIST_MULTI=-1 autolearn=ham autolearn_force=no |
Series |
[RESEND,v2] media: mtk-jpeg: Fix use after free bug due to uncanceled work
|
|
Commit Message
Zheng Wang
July 7, 2023, 9:24 a.m. UTC
In mtk_jpeg_probe, &jpeg->job_timeout_work is bound with
mtk_jpeg_job_timeout_work. Then mtk_jpeg_dec_device_run
and mtk_jpeg_enc_device_run may be called to start the
work.
If we remove the module which will call mtk_jpeg_remove
to make cleanup, there may be a unfinished work. The
possible sequence is as follows, which will cause a
typical UAF bug.
Fix it by canceling the work before cleanup in the mtk_jpeg_remove
CPU0 CPU1
|mtk_jpeg_job_timeout_work
mtk_jpeg_remove |
v4l2_m2m_release |
kfree(m2m_dev); |
|
| v4l2_m2m_get_curr_priv
| m2m_dev->curr_ctx //use
Fixes: b2f0d2724ba4 ("[media] vcodec: mediatek: Add Mediatek JPEG Decoder Driver")
Signed-off-by: Zheng Wang <zyytlz.wz@163.com>
---
- v2: use cancel_delayed_work_sync instead of cancel_delayed_work suggested by Kyrie.
---
drivers/media/platform/mediatek/jpeg/mtk_jpeg_core.c | 1 +
1 file changed, 1 insertion(+)
Comments
On 07/07/2023 11:24, Zheng Wang wrote: > In mtk_jpeg_probe, &jpeg->job_timeout_work is bound with > mtk_jpeg_job_timeout_work. Then mtk_jpeg_dec_device_run > and mtk_jpeg_enc_device_run may be called to start the > work. > If we remove the module which will call mtk_jpeg_remove > to make cleanup, there may be a unfinished work. The > possible sequence is as follows, which will cause a > typical UAF bug. > > Fix it by canceling the work before cleanup in the mtk_jpeg_remove > > CPU0 CPU1 > > |mtk_jpeg_job_timeout_work > mtk_jpeg_remove | > v4l2_m2m_release | > kfree(m2m_dev); | > | > | v4l2_m2m_get_curr_priv > | m2m_dev->curr_ctx //use Reviewed-by: Alexandre Mergnat <amergnat@baylibre.com>
Hi, This issue has not been resolved for a long time. Is there anyone who can help? Best regards, Zheng Alexandre Mergnat <amergnat@baylibre.com> 于2023年7月7日周五 22:11写道: > > > > On 07/07/2023 11:24, Zheng Wang wrote: > > In mtk_jpeg_probe, &jpeg->job_timeout_work is bound with > > mtk_jpeg_job_timeout_work. Then mtk_jpeg_dec_device_run > > and mtk_jpeg_enc_device_run may be called to start the > > work. > > If we remove the module which will call mtk_jpeg_remove > > to make cleanup, there may be a unfinished work. The > > possible sequence is as follows, which will cause a > > typical UAF bug. > > > > Fix it by canceling the work before cleanup in the mtk_jpeg_remove > > > > CPU0 CPU1 > > > > |mtk_jpeg_job_timeout_work > > mtk_jpeg_remove | > > v4l2_m2m_release | > > kfree(m2m_dev); | > > | > > | v4l2_m2m_get_curr_priv > > | m2m_dev->curr_ctx //use > > Reviewed-by: Alexandre Mergnat <amergnat@baylibre.com> > > -- > Regards, > Alexandre
Friendly ping Zheng Hacker <hackerzheng666@gmail.com> 于2023年7月16日周日 00:08写道: > > Hi, > > This issue has not been resolved for a long time. Is there anyone who can help? > > Best regards, > Zheng > > Alexandre Mergnat <amergnat@baylibre.com> 于2023年7月7日周五 22:11写道: > > > > > > > > On 07/07/2023 11:24, Zheng Wang wrote: > > > In mtk_jpeg_probe, &jpeg->job_timeout_work is bound with > > > mtk_jpeg_job_timeout_work. Then mtk_jpeg_dec_device_run > > > and mtk_jpeg_enc_device_run may be called to start the > > > work. > > > If we remove the module which will call mtk_jpeg_remove > > > to make cleanup, there may be a unfinished work. The > > > possible sequence is as follows, which will cause a > > > typical UAF bug. > > > > > > Fix it by canceling the work before cleanup in the mtk_jpeg_remove > > > > > > CPU0 CPU1 > > > > > > |mtk_jpeg_job_timeout_work > > > mtk_jpeg_remove | > > > v4l2_m2m_release | > > > kfree(m2m_dev); | > > > | > > > | v4l2_m2m_get_curr_priv > > > | m2m_dev->curr_ctx //use > > > > Reviewed-by: Alexandre Mergnat <amergnat@baylibre.com> > > > > -- > > Regards, > > Alexandre
On 18/07/2023 05:07, Zheng Hacker wrote: > Friendly ping > > Zheng Hacker <hackerzheng666@gmail.com> 于2023年7月16日周日 00:08写道: >> >> Hi, >> >> This issue has not been resolved for a long time. Is there anyone who can help? >> >> Best regards, >> Zheng >> >> Alexandre Mergnat <amergnat@baylibre.com> 于2023年7月7日周五 22:11写道: >>> >>> >>> >>> On 07/07/2023 11:24, Zheng Wang wrote: >>>> In mtk_jpeg_probe, &jpeg->job_timeout_work is bound with >>>> mtk_jpeg_job_timeout_work. Then mtk_jpeg_dec_device_run >>>> and mtk_jpeg_enc_device_run may be called to start the >>>> work. >>>> If we remove the module which will call mtk_jpeg_remove >>>> to make cleanup, there may be a unfinished work. The >>>> possible sequence is as follows, which will cause a >>>> typical UAF bug. >>>> >>>> Fix it by canceling the work before cleanup in the mtk_jpeg_remove >>>> >>>> CPU0 CPU1 >>>> >>>> |mtk_jpeg_job_timeout_work >>>> mtk_jpeg_remove | >>>> v4l2_m2m_release | >>>> kfree(m2m_dev); | >>>> | >>>> | v4l2_m2m_get_curr_priv >>>> | m2m_dev->curr_ctx //use >>> >>> Reviewed-by: Alexandre Mergnat <amergnat@baylibre.com> >>> >>> -- >>> Regards, >>> Alexandre Hi Zheng, If you asking me to merge patch, sorry but I can't, I'm just a reviewer. I invite you to ping the maintainers directly: Bin Liu <bin.liu@mediatek.com> (supporter:MEDIATEK JPEG DRIVER) Mauro Carvalho Chehab <mchehab@kernel.org> (maintainer:MEDIA INPUT INFRASTRUCTURE (V4L/DVB)) Matthias Brugger <matthias.bgg@gmail.com> (maintainer:ARM/Mediatek SoC support) Otherwise, I misunderstood what you asking me. If so, can you rephrase your question please?
Thanks dor your kind reply. I'll try to connect them later. Best regards, Zheng Alexandre Mergnat <amergnat@baylibre.com> 于2023年7月19日周三 18:17写道: > > > > On 18/07/2023 05:07, Zheng Hacker wrote: > > Friendly ping > > > > Zheng Hacker <hackerzheng666@gmail.com> 于2023年7月16日周日 00:08写道: > >> > >> Hi, > >> > >> This issue has not been resolved for a long time. Is there anyone who can help? > >> > >> Best regards, > >> Zheng > >> > >> Alexandre Mergnat <amergnat@baylibre.com> 于2023年7月7日周五 22:11写道: > >>> > >>> > >>> > >>> On 07/07/2023 11:24, Zheng Wang wrote: > >>>> In mtk_jpeg_probe, &jpeg->job_timeout_work is bound with > >>>> mtk_jpeg_job_timeout_work. Then mtk_jpeg_dec_device_run > >>>> and mtk_jpeg_enc_device_run may be called to start the > >>>> work. > >>>> If we remove the module which will call mtk_jpeg_remove > >>>> to make cleanup, there may be a unfinished work. The > >>>> possible sequence is as follows, which will cause a > >>>> typical UAF bug. > >>>> > >>>> Fix it by canceling the work before cleanup in the mtk_jpeg_remove > >>>> > >>>> CPU0 CPU1 > >>>> > >>>> |mtk_jpeg_job_timeout_work > >>>> mtk_jpeg_remove | > >>>> v4l2_m2m_release | > >>>> kfree(m2m_dev); | > >>>> | > >>>> | v4l2_m2m_get_curr_priv > >>>> | m2m_dev->curr_ctx //use > >>> > >>> Reviewed-by: Alexandre Mergnat <amergnat@baylibre.com> > >>> > >>> -- > >>> Regards, > >>> Alexandre > > Hi Zheng, > > If you asking me to merge patch, sorry but I can't, I'm just a reviewer. > I invite you to ping the maintainers directly: > > Bin Liu <bin.liu@mediatek.com> (supporter:MEDIATEK JPEG DRIVER) > Mauro Carvalho Chehab <mchehab@kernel.org> (maintainer:MEDIA INPUT > INFRASTRUCTURE (V4L/DVB)) > Matthias Brugger <matthias.bgg@gmail.com> (maintainer:ARM/Mediatek SoC > support) > > Otherwise, I misunderstood what you asking me. If so, can you rephrase > your question please? > > -- > Regards, > Alexandre
On Fri, Jul 7, 2023 at 5:25 PM Zheng Wang <zyytlz.wz@163.com> wrote: > > In mtk_jpeg_probe, &jpeg->job_timeout_work is bound with > mtk_jpeg_job_timeout_work. Then mtk_jpeg_dec_device_run > and mtk_jpeg_enc_device_run may be called to start the > work. > If we remove the module which will call mtk_jpeg_remove > to make cleanup, there may be a unfinished work. The > possible sequence is as follows, which will cause a > typical UAF bug. > > Fix it by canceling the work before cleanup in the mtk_jpeg_remove > > CPU0 CPU1 > > |mtk_jpeg_job_timeout_work > mtk_jpeg_remove | > v4l2_m2m_release | > kfree(m2m_dev); | > | > | v4l2_m2m_get_curr_priv > | m2m_dev->curr_ctx //use > Fixes: b2f0d2724ba4 ("[media] vcodec: mediatek: Add Mediatek JPEG Decoder Driver") > Signed-off-by: Zheng Wang <zyytlz.wz@163.com> Reviewed-by: Chen-Yu Tsai <wenst@chromium.org>
Il 07/07/23 11:24, Zheng Wang ha scritto: > In mtk_jpeg_probe, &jpeg->job_timeout_work is bound with > mtk_jpeg_job_timeout_work. Then mtk_jpeg_dec_device_run > and mtk_jpeg_enc_device_run may be called to start the > work. > If we remove the module which will call mtk_jpeg_remove > to make cleanup, there may be a unfinished work. The > possible sequence is as follows, which will cause a > typical UAF bug. > > Fix it by canceling the work before cleanup in the mtk_jpeg_remove > > CPU0 CPU1 > > |mtk_jpeg_job_timeout_work > mtk_jpeg_remove | > v4l2_m2m_release | > kfree(m2m_dev); | > | > | v4l2_m2m_get_curr_priv > | m2m_dev->curr_ctx //use > Fixes: b2f0d2724ba4 ("[media] vcodec: mediatek: Add Mediatek JPEG Decoder Driver") > Signed-off-by: Zheng Wang <zyytlz.wz@163.com> Reviewed-by: AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com>
Hello, Is there any follow-up about this issue? Thanks, Zheng AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com> 于2023年7月20日周四 15:45写道: > > Il 07/07/23 11:24, Zheng Wang ha scritto: > > In mtk_jpeg_probe, &jpeg->job_timeout_work is bound with > > mtk_jpeg_job_timeout_work. Then mtk_jpeg_dec_device_run > > and mtk_jpeg_enc_device_run may be called to start the > > work. > > If we remove the module which will call mtk_jpeg_remove > > to make cleanup, there may be a unfinished work. The > > possible sequence is as follows, which will cause a > > typical UAF bug. > > > > Fix it by canceling the work before cleanup in the mtk_jpeg_remove > > > > CPU0 CPU1 > > > > |mtk_jpeg_job_timeout_work > > mtk_jpeg_remove | > > v4l2_m2m_release | > > kfree(m2m_dev); | > > | > > | v4l2_m2m_get_curr_priv > > | m2m_dev->curr_ctx //use > > Fixes: b2f0d2724ba4 ("[media] vcodec: mediatek: Add Mediatek JPEG Decoder Driver") > > Signed-off-by: Zheng Wang <zyytlz.wz@163.com> > > Reviewed-by: AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com> > >
Hello Zheng, On 7/7/23 12:24, Zheng Wang wrote: > In mtk_jpeg_probe, &jpeg->job_timeout_work is bound with > mtk_jpeg_job_timeout_work. Then mtk_jpeg_dec_device_run > and mtk_jpeg_enc_device_run may be called to start the > work. > If we remove the module which will call mtk_jpeg_remove > to make cleanup, there may be a unfinished work. The > possible sequence is as follows, which will cause a > typical UAF bug. > > Fix it by canceling the work before cleanup in the mtk_jpeg_remove > > CPU0 CPU1 > > |mtk_jpeg_job_timeout_work > mtk_jpeg_remove | > v4l2_m2m_release | > kfree(m2m_dev); | > | > | v4l2_m2m_get_curr_priv > | m2m_dev->curr_ctx //use > Fixes: b2f0d2724ba4 ("[media] vcodec: mediatek: Add Mediatek JPEG Decoder Driver") > Signed-off-by: Zheng Wang <zyytlz.wz@163.com> > --- > - v2: use cancel_delayed_work_sync instead of cancel_delayed_work suggested by Kyrie. > --- > drivers/media/platform/mediatek/jpeg/mtk_jpeg_core.c | 1 + > 1 file changed, 1 insertion(+) > > diff --git a/drivers/media/platform/mediatek/jpeg/mtk_jpeg_core.c b/drivers/media/platform/mediatek/jpeg/mtk_jpeg_core.c > index 0051f372a66c..6069ecf420b0 100644 > --- a/drivers/media/platform/mediatek/jpeg/mtk_jpeg_core.c > +++ b/drivers/media/platform/mediatek/jpeg/mtk_jpeg_core.c > @@ -1816,6 +1816,7 @@ static void mtk_jpeg_remove(struct platform_device *pdev) > { > struct mtk_jpeg_dev *jpeg = platform_get_drvdata(pdev); > > + cancel_delayed_work_sync(&jpeg->job_timeout_work); > pm_runtime_disable(&pdev->dev); > video_unregister_device(jpeg->vdev); > v4l2_m2m_release(jpeg->m2m_dev); AFAICS, there is a fundamental problem here. The job_timeout_work uses v4l2_m2m_get_curr_priv() and at the time when driver module is unloaded, all the v4l contexts must be closed and released. Hence the v4l2_m2m_get_curr_priv() shall return NULL and crash the kernel when work is executed before cancel_delayed_work_sync(). At the time when mtk_jpeg_remove() is invoked, there shall be no job_timeout_work running in background because all jobs should be completed before context is released. If you'll look at v4l2_m2m_cancel_job(), you can see that it waits for the task completion before closing context. You shouldn't be able to remove driver module while it has active/opened v4l contexts. If you can do that, then this is yours bug that needs to be fixed. In addition to this all, the job_timeout_work is initialized only for the single-core JPEG device. I'd expect this patch should crash multi-core JPEG devices.
Dmitry Osipenko <dmitry.osipenko@collabora.com> 于2023年8月23日周三 02:51写道: > > Hello Zheng, > > On 7/7/23 12:24, Zheng Wang wrote: > > In mtk_jpeg_probe, &jpeg->job_timeout_work is bound with > > mtk_jpeg_job_timeout_work. Then mtk_jpeg_dec_device_run > > and mtk_jpeg_enc_device_run may be called to start the > > work. > > If we remove the module which will call mtk_jpeg_remove > > to make cleanup, there may be a unfinished work. The > > possible sequence is as follows, which will cause a > > typical UAF bug. > > > > Fix it by canceling the work before cleanup in the mtk_jpeg_remove > > > > CPU0 CPU1 > > > > |mtk_jpeg_job_timeout_work > > mtk_jpeg_remove | > > v4l2_m2m_release | > > kfree(m2m_dev); | > > | > > | v4l2_m2m_get_curr_priv > > | m2m_dev->curr_ctx //use > > Fixes: b2f0d2724ba4 ("[media] vcodec: mediatek: Add Mediatek JPEG Decoder Driver") > > Signed-off-by: Zheng Wang <zyytlz.wz@163.com> > > --- > > - v2: use cancel_delayed_work_sync instead of cancel_delayed_work suggested by Kyrie. > > --- > > drivers/media/platform/mediatek/jpeg/mtk_jpeg_core.c | 1 + > > 1 file changed, 1 insertion(+) > > > > diff --git a/drivers/media/platform/mediatek/jpeg/mtk_jpeg_core.c b/drivers/media/platform/mediatek/jpeg/mtk_jpeg_core.c > > index 0051f372a66c..6069ecf420b0 100644 > > --- a/drivers/media/platform/mediatek/jpeg/mtk_jpeg_core.c > > +++ b/drivers/media/platform/mediatek/jpeg/mtk_jpeg_core.c > > @@ -1816,6 +1816,7 @@ static void mtk_jpeg_remove(struct platform_device *pdev) > > { > > struct mtk_jpeg_dev *jpeg = platform_get_drvdata(pdev); > > > > + cancel_delayed_work_sync(&jpeg->job_timeout_work); > > pm_runtime_disable(&pdev->dev); > > video_unregister_device(jpeg->vdev); > > v4l2_m2m_release(jpeg->m2m_dev); > > AFAICS, there is a fundamental problem here. The job_timeout_work uses > v4l2_m2m_get_curr_priv() and at the time when driver module is unloaded, > all the v4l contexts must be closed and released. Hence the > v4l2_m2m_get_curr_priv() shall return NULL and crash the kernel when > work is executed before cancel_delayed_work_sync(). > Hi Dmitry, Thanks for your reply. I think you're right. As m2m_dev is freed in v4l2_m2m_release, the invoking in v4l2_m2m_get_curr_priv might cause either UAF or null pointer dereference bug. I am sure that context is closed when we invoke mtk_jpeg_remove. But I'm not sure if context is released when mtk_jpegdec_timeout_work running. > At the time when mtk_jpeg_remove() is invoked, there shall be no > job_timeout_work running in background because all jobs should be > completed before context is released. If you'll look at > v4l2_m2m_cancel_job(), you can see that it waits for the task completion > before closing context. Yes, so I think the better way is to put the cancel_delayed_work_sync invoking into v4l2_m2m_ctx_release function? > > You shouldn't be able to remove driver module while it has active/opened > v4l contexts. If you can do that, then this is yours bug that needs to > be fixed. > > In addition to this all, the job_timeout_work is initialized only for > the single-core JPEG device. I'd expect this patch should crash > multi-core JPEG devices. > I think that's true. As I'm not familiar with the code here. Could you please give me some advice about the patch? Best regards, Zheng > -- > Best regards, > Dmitry >
On 8/24/23 11:20, Zheng Hacker wrote: > Dmitry Osipenko <dmitry.osipenko@collabora.com> 于2023年8月23日周三 02:51写道: > >> >> Hello Zheng, >> >> On 7/7/23 12:24, Zheng Wang wrote: >>> In mtk_jpeg_probe, &jpeg->job_timeout_work is bound with >>> mtk_jpeg_job_timeout_work. Then mtk_jpeg_dec_device_run >>> and mtk_jpeg_enc_device_run may be called to start the >>> work. >>> If we remove the module which will call mtk_jpeg_remove >>> to make cleanup, there may be a unfinished work. The >>> possible sequence is as follows, which will cause a >>> typical UAF bug. >>> >>> Fix it by canceling the work before cleanup in the mtk_jpeg_remove >>> >>> CPU0 CPU1 >>> >>> |mtk_jpeg_job_timeout_work >>> mtk_jpeg_remove | >>> v4l2_m2m_release | >>> kfree(m2m_dev); | >>> | >>> | v4l2_m2m_get_curr_priv >>> | m2m_dev->curr_ctx //use >>> Fixes: b2f0d2724ba4 ("[media] vcodec: mediatek: Add Mediatek JPEG Decoder Driver") >>> Signed-off-by: Zheng Wang <zyytlz.wz@163.com> >>> --- >>> - v2: use cancel_delayed_work_sync instead of cancel_delayed_work suggested by Kyrie. >>> --- >>> drivers/media/platform/mediatek/jpeg/mtk_jpeg_core.c | 1 + >>> 1 file changed, 1 insertion(+) >>> >>> diff --git a/drivers/media/platform/mediatek/jpeg/mtk_jpeg_core.c b/drivers/media/platform/mediatek/jpeg/mtk_jpeg_core.c >>> index 0051f372a66c..6069ecf420b0 100644 >>> --- a/drivers/media/platform/mediatek/jpeg/mtk_jpeg_core.c >>> +++ b/drivers/media/platform/mediatek/jpeg/mtk_jpeg_core.c >>> @@ -1816,6 +1816,7 @@ static void mtk_jpeg_remove(struct platform_device *pdev) >>> { >>> struct mtk_jpeg_dev *jpeg = platform_get_drvdata(pdev); >>> >>> + cancel_delayed_work_sync(&jpeg->job_timeout_work); >>> pm_runtime_disable(&pdev->dev); >>> video_unregister_device(jpeg->vdev); >>> v4l2_m2m_release(jpeg->m2m_dev); >> >> AFAICS, there is a fundamental problem here. The job_timeout_work uses >> v4l2_m2m_get_curr_priv() and at the time when driver module is unloaded, >> all the v4l contexts must be closed and released. Hence the >> v4l2_m2m_get_curr_priv() shall return NULL and crash the kernel when >> work is executed before cancel_delayed_work_sync(). >> > > Hi Dmitry, > > Thanks for your reply. I think you're right. As m2m_dev is freed in > v4l2_m2m_release, > the invoking in v4l2_m2m_get_curr_priv might cause either UAF or null > pointer dereference > bug. I am sure that context is closed when we invoke mtk_jpeg_remove. > But I'm not sure if > context is released when mtk_jpegdec_timeout_work running. > >> At the time when mtk_jpeg_remove() is invoked, there shall be no >> job_timeout_work running in background because all jobs should be >> completed before context is released. If you'll look at >> v4l2_m2m_cancel_job(), you can see that it waits for the task completion >> before closing context. > > Yes, so I think the better way is to put the cancel_delayed_work_sync > invoking into > v4l2_m2m_ctx_release function? The v4l2_m2m_ctx_release() already should wait for the job_timeout_work completion or for the interrupt fire. Apparently it doesn't work in yours case. You'll need to debug why v4l job or job_timeout_work is running after v4l2_m2m_ctx_release(), it shouldn't happen. The interrupt handler cancels job_timeout_work, you shouldn't need to flush the work. Technically, interrupt handler may race with job_timeout_work, but the timeout is set to 1 second and in practice should be difficult to trigger the race. The interrupt handler needs to be threaded, it should use cancel_delayed_work_sync() and check the return value of this function. >> >> You shouldn't be able to remove driver module while it has active/opened >> v4l contexts. If you can do that, then this is yours bug that needs to >> be fixed. >> >> In addition to this all, the job_timeout_work is initialized only for >> the single-core JPEG device. I'd expect this patch should crash >> multi-core JPEG devices. >> > > I think that's true. As I'm not familiar with the code here. Could you > please give me some advice about the patch? We'll need to understand why v4l2_m2m_ctx_release() doesn't work as expected before thinking about the patch.
Dmitry Osipenko <dmitry.osipenko@collabora.com> 于2023年8月28日周一 10:04写道: > > On 8/24/23 11:20, Zheng Hacker wrote: > > Dmitry Osipenko <dmitry.osipenko@collabora.com> 于2023年8月23日周三 02:51写道: > > > >> > >> Hello Zheng, > >> > >> On 7/7/23 12:24, Zheng Wang wrote: > >>> In mtk_jpeg_probe, &jpeg->job_timeout_work is bound with > >>> mtk_jpeg_job_timeout_work. Then mtk_jpeg_dec_device_run > >>> and mtk_jpeg_enc_device_run may be called to start the > >>> work. > >>> If we remove the module which will call mtk_jpeg_remove > >>> to make cleanup, there may be a unfinished work. The > >>> possible sequence is as follows, which will cause a > >>> typical UAF bug. > >>> > >>> Fix it by canceling the work before cleanup in the mtk_jpeg_remove > >>> > >>> CPU0 CPU1 > >>> > >>> |mtk_jpeg_job_timeout_work > >>> mtk_jpeg_remove | > >>> v4l2_m2m_release | > >>> kfree(m2m_dev); | > >>> | > >>> | v4l2_m2m_get_curr_priv > >>> | m2m_dev->curr_ctx //use > >>> Fixes: b2f0d2724ba4 ("[media] vcodec: mediatek: Add Mediatek JPEG Decoder Driver") > >>> Signed-off-by: Zheng Wang <zyytlz.wz@163.com> > >>> --- > >>> - v2: use cancel_delayed_work_sync instead of cancel_delayed_work suggested by Kyrie. > >>> --- > >>> drivers/media/platform/mediatek/jpeg/mtk_jpeg_core.c | 1 + > >>> 1 file changed, 1 insertion(+) > >>> > >>> diff --git a/drivers/media/platform/mediatek/jpeg/mtk_jpeg_core.c b/drivers/media/platform/mediatek/jpeg/mtk_jpeg_core.c > >>> index 0051f372a66c..6069ecf420b0 100644 > >>> --- a/drivers/media/platform/mediatek/jpeg/mtk_jpeg_core.c > >>> +++ b/drivers/media/platform/mediatek/jpeg/mtk_jpeg_core.c > >>> @@ -1816,6 +1816,7 @@ static void mtk_jpeg_remove(struct platform_device *pdev) > >>> { > >>> struct mtk_jpeg_dev *jpeg = platform_get_drvdata(pdev); > >>> > >>> + cancel_delayed_work_sync(&jpeg->job_timeout_work); > >>> pm_runtime_disable(&pdev->dev); > >>> video_unregister_device(jpeg->vdev); > >>> v4l2_m2m_release(jpeg->m2m_dev); > >> > >> AFAICS, there is a fundamental problem here. The job_timeout_work uses > >> v4l2_m2m_get_curr_priv() and at the time when driver module is unloaded, > >> all the v4l contexts must be closed and released. Hence the > >> v4l2_m2m_get_curr_priv() shall return NULL and crash the kernel when > >> work is executed before cancel_delayed_work_sync(). > >> > > > > Hi Dmitry, > > > > Thanks for your reply. I think you're right. As m2m_dev is freed in > > v4l2_m2m_release, > > the invoking in v4l2_m2m_get_curr_priv might cause either UAF or null > > pointer dereference > > bug. I am sure that context is closed when we invoke mtk_jpeg_remove. > > But I'm not sure if > > context is released when mtk_jpegdec_timeout_work running. > > > >> At the time when mtk_jpeg_remove() is invoked, there shall be no > >> job_timeout_work running in background because all jobs should be > >> completed before context is released. If you'll look at > >> v4l2_m2m_cancel_job(), you can see that it waits for the task completion > >> before closing context. > > > > Yes, so I think the better way is to put the cancel_delayed_work_sync > > invoking into > > v4l2_m2m_ctx_release function? > > The v4l2_m2m_ctx_release() already should wait for the job_timeout_work > completion or for the interrupt fire. Apparently it doesn't work in > yours case. You'll need to debug why v4l job or job_timeout_work is > running after v4l2_m2m_ctx_release(), it shouldn't happen. > Yes, v4l2_m2m_cancel_job waits for m2m_ctx->job_flags to be ~TRANS_RUNNING, the mtk_jpeg_job_timeout_work will finally invoke v4l2_m2m_job_finish to trigger that. However, this is not the only path to call v4l2_m2m_job_finish. Here is a invoking chain: v4l_streamon ->v4l2_m2m_ioctl_streamon ->v4l2_m2m_streamon ->v4l2_m2m_try_schedule ->v4l2_m2m_try_run ->mtk_jpeg_dec_device_run ->schedule_delayed_work(&jpeg->job_timeout_work... ->error path goto dec_end ->v4l2_m2m_job_finish In some specific situation, it starts the worker and also calls v4l2_m2m_job_finish, which might make v4l2_m2m_cancel_job continues. > The interrupt handler cancels job_timeout_work, you shouldn't need to > flush the work. It will, but as I said, there might be an early invocation chain to start the work.(Not very sure) > > Technically, interrupt handler may race with job_timeout_work, but the > timeout is set to 1 second and in practice should be difficult to > trigger the race. The interrupt handler needs to be threaded, it should > use cancel_delayed_work_sync() and check the return value of this function. > Yes, it's better to use cancel_delayed_work_sync here. > >> > >> You shouldn't be able to remove driver module while it has active/opened > >> v4l contexts. If you can do that, then this is yours bug that needs to > >> be fixed. > >> > >> In addition to this all, the job_timeout_work is initialized only for > >> the single-core JPEG device. I'd expect this patch should crash > >> multi-core JPEG devices. > >> > > > > I think that's true. As I'm not familiar with the code here. Could you > > please give me some advice about the patch? > > We'll need to understand why v4l2_m2m_ctx_release() doesn't work as > expected before thinking about the patch. > > -- > Best regards, > Dmitry >
Friendly ping. Zheng Hacker <hackerzheng666@gmail.com> 于2023年8月31日周四 16:18写道: > > Dmitry Osipenko <dmitry.osipenko@collabora.com> 于2023年8月28日周一 10:04写道: > > > > On 8/24/23 11:20, Zheng Hacker wrote: > > > Dmitry Osipenko <dmitry.osipenko@collabora.com> 于2023年8月23日周三 02:51写道: > > > > > >> > > >> Hello Zheng, > > >> > > >> On 7/7/23 12:24, Zheng Wang wrote: > > >>> In mtk_jpeg_probe, &jpeg->job_timeout_work is bound with > > >>> mtk_jpeg_job_timeout_work. Then mtk_jpeg_dec_device_run > > >>> and mtk_jpeg_enc_device_run may be called to start the > > >>> work. > > >>> If we remove the module which will call mtk_jpeg_remove > > >>> to make cleanup, there may be a unfinished work. The > > >>> possible sequence is as follows, which will cause a > > >>> typical UAF bug. > > >>> > > >>> Fix it by canceling the work before cleanup in the mtk_jpeg_remove > > >>> > > >>> CPU0 CPU1 > > >>> > > >>> |mtk_jpeg_job_timeout_work > > >>> mtk_jpeg_remove | > > >>> v4l2_m2m_release | > > >>> kfree(m2m_dev); | > > >>> | > > >>> | v4l2_m2m_get_curr_priv > > >>> | m2m_dev->curr_ctx //use > > >>> Fixes: b2f0d2724ba4 ("[media] vcodec: mediatek: Add Mediatek JPEG Decoder Driver") > > >>> Signed-off-by: Zheng Wang <zyytlz.wz@163.com> > > >>> --- > > >>> - v2: use cancel_delayed_work_sync instead of cancel_delayed_work suggested by Kyrie. > > >>> --- > > >>> drivers/media/platform/mediatek/jpeg/mtk_jpeg_core.c | 1 + > > >>> 1 file changed, 1 insertion(+) > > >>> > > >>> diff --git a/drivers/media/platform/mediatek/jpeg/mtk_jpeg_core.c b/drivers/media/platform/mediatek/jpeg/mtk_jpeg_core.c > > >>> index 0051f372a66c..6069ecf420b0 100644 > > >>> --- a/drivers/media/platform/mediatek/jpeg/mtk_jpeg_core.c > > >>> +++ b/drivers/media/platform/mediatek/jpeg/mtk_jpeg_core.c > > >>> @@ -1816,6 +1816,7 @@ static void mtk_jpeg_remove(struct platform_device *pdev) > > >>> { > > >>> struct mtk_jpeg_dev *jpeg = platform_get_drvdata(pdev); > > >>> > > >>> + cancel_delayed_work_sync(&jpeg->job_timeout_work); > > >>> pm_runtime_disable(&pdev->dev); > > >>> video_unregister_device(jpeg->vdev); > > >>> v4l2_m2m_release(jpeg->m2m_dev); > > >> > > >> AFAICS, there is a fundamental problem here. The job_timeout_work uses > > >> v4l2_m2m_get_curr_priv() and at the time when driver module is unloaded, > > >> all the v4l contexts must be closed and released. Hence the > > >> v4l2_m2m_get_curr_priv() shall return NULL and crash the kernel when > > >> work is executed before cancel_delayed_work_sync(). > > >> > > > > > > Hi Dmitry, > > > > > > Thanks for your reply. I think you're right. As m2m_dev is freed in > > > v4l2_m2m_release, > > > the invoking in v4l2_m2m_get_curr_priv might cause either UAF or null > > > pointer dereference > > > bug. I am sure that context is closed when we invoke mtk_jpeg_remove. > > > But I'm not sure if > > > context is released when mtk_jpegdec_timeout_work running. > > > > > >> At the time when mtk_jpeg_remove() is invoked, there shall be no > > >> job_timeout_work running in background because all jobs should be > > >> completed before context is released. If you'll look at > > >> v4l2_m2m_cancel_job(), you can see that it waits for the task completion > > >> before closing context. > > > > > > Yes, so I think the better way is to put the cancel_delayed_work_sync > > > invoking into > > > v4l2_m2m_ctx_release function? > > > > The v4l2_m2m_ctx_release() already should wait for the job_timeout_work > > completion or for the interrupt fire. Apparently it doesn't work in > > yours case. You'll need to debug why v4l job or job_timeout_work is > > running after v4l2_m2m_ctx_release(), it shouldn't happen. > > > > Yes, v4l2_m2m_cancel_job waits for m2m_ctx->job_flags to be ~TRANS_RUNNING, > the mtk_jpeg_job_timeout_work will finally invoke v4l2_m2m_job_finish > to trigger that. > > However, this is not the only path to call v4l2_m2m_job_finish. Here > is a invoking chain: > v4l_streamon > ->v4l2_m2m_ioctl_streamon > ->v4l2_m2m_streamon > ->v4l2_m2m_try_schedule > ->v4l2_m2m_try_run > ->mtk_jpeg_dec_device_run > ->schedule_delayed_work(&jpeg->job_timeout_work... > ->error path goto dec_end > ->v4l2_m2m_job_finish > > In some specific situation, it starts the worker and also calls > v4l2_m2m_job_finish, which might > make v4l2_m2m_cancel_job continues. > > > The interrupt handler cancels job_timeout_work, you shouldn't need to > > flush the work. > > It will, but as I said, there might be an early invocation chain to > start the work.(Not very sure) > > > > > Technically, interrupt handler may race with job_timeout_work, but the > > timeout is set to 1 second and in practice should be difficult to > > trigger the race. The interrupt handler needs to be threaded, it should > > use cancel_delayed_work_sync() and check the return value of this function. > > > > Yes, it's better to use cancel_delayed_work_sync here. > > > >> > > >> You shouldn't be able to remove driver module while it has active/opened > > >> v4l contexts. If you can do that, then this is yours bug that needs to > > >> be fixed. > > >> > > >> In addition to this all, the job_timeout_work is initialized only for > > >> the single-core JPEG device. I'd expect this patch should crash > > >> multi-core JPEG devices. > > >> > > > > > > I think that's true. As I'm not familiar with the code here. Could you > > > please give me some advice about the patch? > > > > We'll need to understand why v4l2_m2m_ctx_release() doesn't work as > > expected before thinking about the patch. > > > > -- > > Best regards, > > Dmitry > >
Hi Dmitry, The patch is on the stable queue. Could you please take a look at my analysis? Thanks for your help. Best regards, Zheng Zheng Hacker <hackerzheng666@gmail.com> 于2023年9月5日周二 12:24写道: > > Friendly ping. > > Zheng Hacker <hackerzheng666@gmail.com> 于2023年8月31日周四 16:18写道: > > > > Dmitry Osipenko <dmitry.osipenko@collabora.com> 于2023年8月28日周一 10:04写道: > > > > > > On 8/24/23 11:20, Zheng Hacker wrote: > > > > Dmitry Osipenko <dmitry.osipenko@collabora.com> 于2023年8月23日周三 02:51写道: > > > > > > > >> > > > >> Hello Zheng, > > > >> > > > >> On 7/7/23 12:24, Zheng Wang wrote: > > > >>> In mtk_jpeg_probe, &jpeg->job_timeout_work is bound with > > > >>> mtk_jpeg_job_timeout_work. Then mtk_jpeg_dec_device_run > > > >>> and mtk_jpeg_enc_device_run may be called to start the > > > >>> work. > > > >>> If we remove the module which will call mtk_jpeg_remove > > > >>> to make cleanup, there may be a unfinished work. The > > > >>> possible sequence is as follows, which will cause a > > > >>> typical UAF bug. > > > >>> > > > >>> Fix it by canceling the work before cleanup in the mtk_jpeg_remove > > > >>> > > > >>> CPU0 CPU1 > > > >>> > > > >>> |mtk_jpeg_job_timeout_work > > > >>> mtk_jpeg_remove | > > > >>> v4l2_m2m_release | > > > >>> kfree(m2m_dev); | > > > >>> | > > > >>> | v4l2_m2m_get_curr_priv > > > >>> | m2m_dev->curr_ctx //use > > > >>> Fixes: b2f0d2724ba4 ("[media] vcodec: mediatek: Add Mediatek JPEG Decoder Driver") > > > >>> Signed-off-by: Zheng Wang <zyytlz.wz@163.com> > > > >>> --- > > > >>> - v2: use cancel_delayed_work_sync instead of cancel_delayed_work suggested by Kyrie. > > > >>> --- > > > >>> drivers/media/platform/mediatek/jpeg/mtk_jpeg_core.c | 1 + > > > >>> 1 file changed, 1 insertion(+) > > > >>> > > > >>> diff --git a/drivers/media/platform/mediatek/jpeg/mtk_jpeg_core.c b/drivers/media/platform/mediatek/jpeg/mtk_jpeg_core.c > > > >>> index 0051f372a66c..6069ecf420b0 100644 > > > >>> --- a/drivers/media/platform/mediatek/jpeg/mtk_jpeg_core.c > > > >>> +++ b/drivers/media/platform/mediatek/jpeg/mtk_jpeg_core.c > > > >>> @@ -1816,6 +1816,7 @@ static void mtk_jpeg_remove(struct platform_device *pdev) > > > >>> { > > > >>> struct mtk_jpeg_dev *jpeg = platform_get_drvdata(pdev); > > > >>> > > > >>> + cancel_delayed_work_sync(&jpeg->job_timeout_work); > > > >>> pm_runtime_disable(&pdev->dev); > > > >>> video_unregister_device(jpeg->vdev); > > > >>> v4l2_m2m_release(jpeg->m2m_dev); > > > >> > > > >> AFAICS, there is a fundamental problem here. The job_timeout_work uses > > > >> v4l2_m2m_get_curr_priv() and at the time when driver module is unloaded, > > > >> all the v4l contexts must be closed and released. Hence the > > > >> v4l2_m2m_get_curr_priv() shall return NULL and crash the kernel when > > > >> work is executed before cancel_delayed_work_sync(). > > > >> > > > > > > > > Hi Dmitry, > > > > > > > > Thanks for your reply. I think you're right. As m2m_dev is freed in > > > > v4l2_m2m_release, > > > > the invoking in v4l2_m2m_get_curr_priv might cause either UAF or null > > > > pointer dereference > > > > bug. I am sure that context is closed when we invoke mtk_jpeg_remove. > > > > But I'm not sure if > > > > context is released when mtk_jpegdec_timeout_work running. > > > > > > > >> At the time when mtk_jpeg_remove() is invoked, there shall be no > > > >> job_timeout_work running in background because all jobs should be > > > >> completed before context is released. If you'll look at > > > >> v4l2_m2m_cancel_job(), you can see that it waits for the task completion > > > >> before closing context. > > > > > > > > Yes, so I think the better way is to put the cancel_delayed_work_sync > > > > invoking into > > > > v4l2_m2m_ctx_release function? > > > > > > The v4l2_m2m_ctx_release() already should wait for the job_timeout_work > > > completion or for the interrupt fire. Apparently it doesn't work in > > > yours case. You'll need to debug why v4l job or job_timeout_work is > > > running after v4l2_m2m_ctx_release(), it shouldn't happen. > > > > > > > Yes, v4l2_m2m_cancel_job waits for m2m_ctx->job_flags to be ~TRANS_RUNNING, > > the mtk_jpeg_job_timeout_work will finally invoke v4l2_m2m_job_finish > > to trigger that. > > > > However, this is not the only path to call v4l2_m2m_job_finish. Here > > is a invoking chain: > > v4l_streamon > > ->v4l2_m2m_ioctl_streamon > > ->v4l2_m2m_streamon > > ->v4l2_m2m_try_schedule > > ->v4l2_m2m_try_run > > ->mtk_jpeg_dec_device_run > > ->schedule_delayed_work(&jpeg->job_timeout_work... > > ->error path goto dec_end > > ->v4l2_m2m_job_finish > > > > In some specific situation, it starts the worker and also calls > > v4l2_m2m_job_finish, which might > > make v4l2_m2m_cancel_job continues. > > > > > The interrupt handler cancels job_timeout_work, you shouldn't need to > > > flush the work. > > > > It will, but as I said, there might be an early invocation chain to > > start the work.(Not very sure) > > > > > > > > Technically, interrupt handler may race with job_timeout_work, but the > > > timeout is set to 1 second and in practice should be difficult to > > > trigger the race. The interrupt handler needs to be threaded, it should > > > use cancel_delayed_work_sync() and check the return value of this function. > > > > > > > Yes, it's better to use cancel_delayed_work_sync here. > > > > > >> > > > >> You shouldn't be able to remove driver module while it has active/opened > > > >> v4l contexts. If you can do that, then this is yours bug that needs to > > > >> be fixed. > > > >> > > > >> In addition to this all, the job_timeout_work is initialized only for > > > >> the single-core JPEG device. I'd expect this patch should crash > > > >> multi-core JPEG devices. > > > >> > > > > > > > > I think that's true. As I'm not familiar with the code here. Could you > > > > please give me some advice about the patch? > > > > > > We'll need to understand why v4l2_m2m_ctx_release() doesn't work as > > > expected before thinking about the patch. > > > > > > -- > > > Best regards, > > > Dmitry > > >
On 8/31/23 11:18, Zheng Hacker wrote: >> The v4l2_m2m_ctx_release() already should wait for the job_timeout_work >> completion or for the interrupt fire. Apparently it doesn't work in >> yours case. You'll need to debug why v4l job or job_timeout_work is >> running after v4l2_m2m_ctx_release(), it shouldn't happen. >> > Yes, v4l2_m2m_cancel_job waits for m2m_ctx->job_flags to be ~TRANS_RUNNING, > the mtk_jpeg_job_timeout_work will finally invoke v4l2_m2m_job_finish > to trigger that. > > However, this is not the only path to call v4l2_m2m_job_finish. Here > is a invoking chain: > v4l_streamon > ->v4l2_m2m_ioctl_streamon > ->v4l2_m2m_streamon > ->v4l2_m2m_try_schedule > ->v4l2_m2m_try_run > ->mtk_jpeg_dec_device_run > ->schedule_delayed_work(&jpeg->job_timeout_work... > ->error path goto dec_end > ->v4l2_m2m_job_finish > > In some specific situation, it starts the worker and also calls > v4l2_m2m_job_finish, which might > make v4l2_m2m_cancel_job continues. Then the error path should cancel the job_timeout_work, or better job needs to be run after the dec/enc has been started and not before. Looking further at the code, I'm confused by this hunk: mtk_jpeg_dec_start(comp_jpeg[hw_id]->reg_base); v4l2_m2m_job_finish(jpeg->m2m_dev, ctx->fh.m2m_ctx); The job should be marked as finished when h/w has finished processing the job and not right after the job has been started. So the job is always completed and mtk_jpeg_job_timeout_work() doesn't work as expected, am I missing something?
On 9/19/23 21:24, Dmitry Osipenko wrote: > On 8/31/23 11:18, Zheng Hacker wrote: >>> The v4l2_m2m_ctx_release() already should wait for the job_timeout_work >>> completion or for the interrupt fire. Apparently it doesn't work in >>> yours case. You'll need to debug why v4l job or job_timeout_work is >>> running after v4l2_m2m_ctx_release(), it shouldn't happen. >>> >> Yes, v4l2_m2m_cancel_job waits for m2m_ctx->job_flags to be ~TRANS_RUNNING, >> the mtk_jpeg_job_timeout_work will finally invoke v4l2_m2m_job_finish >> to trigger that. >> >> However, this is not the only path to call v4l2_m2m_job_finish. Here >> is a invoking chain: >> v4l_streamon >> ->v4l2_m2m_ioctl_streamon >> ->v4l2_m2m_streamon >> ->v4l2_m2m_try_schedule >> ->v4l2_m2m_try_run >> ->mtk_jpeg_dec_device_run >> ->schedule_delayed_work(&jpeg->job_timeout_work... >> ->error path goto dec_end >> ->v4l2_m2m_job_finish >> >> In some specific situation, it starts the worker and also calls >> v4l2_m2m_job_finish, which might >> make v4l2_m2m_cancel_job continues. > > Then the error path should cancel the job_timeout_work, or better job s/job/timeout work/ > needs to be run after the dec/enc has been started and not before. > > Looking further at the code, I'm confused by this hunk: > > mtk_jpeg_dec_start(comp_jpeg[hw_id]->reg_base); > v4l2_m2m_job_finish(jpeg->m2m_dev, ctx->fh.m2m_ctx); > > The job should be marked as finished when h/w has finished processing > the job and not right after the job has been started. So the job is > always completed and mtk_jpeg_job_timeout_work() doesn't work as > expected, am I missing something? >
Dmitry Osipenko <dmitry.osipenko@collabora.com> 于2023年9月20日周三 02:24写道: > > On 8/31/23 11:18, Zheng Hacker wrote: > >> The v4l2_m2m_ctx_release() already should wait for the job_timeout_work > >> completion or for the interrupt fire. Apparently it doesn't work in > >> yours case. You'll need to debug why v4l job or job_timeout_work is > >> running after v4l2_m2m_ctx_release(), it shouldn't happen. > >> > > Yes, v4l2_m2m_cancel_job waits for m2m_ctx->job_flags to be ~TRANS_RUNNING, > > the mtk_jpeg_job_timeout_work will finally invoke v4l2_m2m_job_finish > > to trigger that. > > > > However, this is not the only path to call v4l2_m2m_job_finish. Here > > is a invoking chain: > > v4l_streamon > > ->v4l2_m2m_ioctl_streamon > > ->v4l2_m2m_streamon > > ->v4l2_m2m_try_schedule > > ->v4l2_m2m_try_run > > ->mtk_jpeg_dec_device_run > > ->schedule_delayed_work(&jpeg->job_timeout_work... > > ->error path goto dec_end > > ->v4l2_m2m_job_finish > > > > In some specific situation, it starts the worker and also calls > > v4l2_m2m_job_finish, which might > > make v4l2_m2m_cancel_job continues. > > Then the error path should cancel the job_timeout_work, or better job > needs to be run after the dec/enc has been started and not before. > Hi, Sorry for my late reply for I just went on a long vacation. Get it. I'll write another patch and change the summary to the lack of canceling job in error path. > Looking further at the code, I'm confused by this hunk: > > mtk_jpeg_dec_start(comp_jpeg[hw_id]->reg_base); > v4l2_m2m_job_finish(jpeg->m2m_dev, ctx->fh.m2m_ctx); > > The job should be marked as finished when h/w has finished processing > the job and not right after the job has been started. So the job is > always completed and mtk_jpeg_job_timeout_work() doesn't work as > expected, am I missing something? After reading the code I still don't know. I didn't see any function like mtk_jpeg_dec_end. The same thing happens on mtk_jpeg_enc_start. I think I'd better fix the first problem and wait for someone familiar with the second part. Best regards, Zheng > > -- > Best regards, > Dmitry >
On 10/8/23 12:13, Zheng Hacker wrote: > Dmitry Osipenko <dmitry.osipenko@collabora.com> 于2023年9月20日周三 02:24写道: >> >> On 8/31/23 11:18, Zheng Hacker wrote: >>>> The v4l2_m2m_ctx_release() already should wait for the job_timeout_work >>>> completion or for the interrupt fire. Apparently it doesn't work in >>>> yours case. You'll need to debug why v4l job or job_timeout_work is >>>> running after v4l2_m2m_ctx_release(), it shouldn't happen. >>>> >>> Yes, v4l2_m2m_cancel_job waits for m2m_ctx->job_flags to be ~TRANS_RUNNING, >>> the mtk_jpeg_job_timeout_work will finally invoke v4l2_m2m_job_finish >>> to trigger that. >>> >>> However, this is not the only path to call v4l2_m2m_job_finish. Here >>> is a invoking chain: >>> v4l_streamon >>> ->v4l2_m2m_ioctl_streamon >>> ->v4l2_m2m_streamon >>> ->v4l2_m2m_try_schedule >>> ->v4l2_m2m_try_run >>> ->mtk_jpeg_dec_device_run >>> ->schedule_delayed_work(&jpeg->job_timeout_work... >>> ->error path goto dec_end >>> ->v4l2_m2m_job_finish >>> >>> In some specific situation, it starts the worker and also calls >>> v4l2_m2m_job_finish, which might >>> make v4l2_m2m_cancel_job continues. >> >> Then the error path should cancel the job_timeout_work, or better job >> needs to be run after the dec/enc has been started and not before. >> > > Hi, > > Sorry for my late reply for I just went on a long vacation. > > Get it. I'll write another patch and change the summary to the lack of > canceling job in error path. > >> Looking further at the code, I'm confused by this hunk: >> >> mtk_jpeg_dec_start(comp_jpeg[hw_id]->reg_base); >> v4l2_m2m_job_finish(jpeg->m2m_dev, ctx->fh.m2m_ctx); >> >> The job should be marked as finished when h/w has finished processing >> the job and not right after the job has been started. So the job is >> always completed and mtk_jpeg_job_timeout_work() doesn't work as >> expected, am I missing something? > > After reading the code I still don't know. I didn't see any function > like mtk_jpeg_dec_end. The same thing > happens on mtk_jpeg_enc_start. I think I'd better fix the first > problem and wait for someone familiar with > the second part. I missed that the code mentioned above is related to the multi-core hw version, while you care about single-core. Nevertheless, the multi-core device_run() looks incorrect, So, the error code paths need to be corrected. Please try to revert yours fix and test this change: diff --git a/drivers/media/platform/mediatek/jpeg/mtk_jpeg_core.c b/drivers/media/platform/mediatek/jpeg/mtk_jpeg_core.c index 0051f372a66c..fd3b0587fcad 100644 --- a/drivers/media/platform/mediatek/jpeg/mtk_jpeg_core.c +++ b/drivers/media/platform/mediatek/jpeg/mtk_jpeg_core.c @@ -1254,9 +1254,6 @@ static void mtk_jpegdec_worker(struct work_struct *work) v4l2_m2m_src_buf_remove(ctx->fh.m2m_ctx); v4l2_m2m_dst_buf_remove(ctx->fh.m2m_ctx); - schedule_delayed_work(&comp_jpeg[hw_id]->job_timeout_work, - msecs_to_jiffies(MTK_JPEG_HW_TIMEOUT_MSEC)); - mtk_jpeg_set_dec_src(ctx, &src_buf->vb2_buf, &bs); if (mtk_jpeg_set_dec_dst(ctx, &jpeg_src_buf->dec_param, @@ -1266,6 +1263,9 @@ static void mtk_jpegdec_worker(struct work_struct *work) goto setdst_end; } + schedule_delayed_work(&comp_jpeg[hw_id]->job_timeout_work, + msecs_to_jiffies(MTK_JPEG_HW_TIMEOUT_MSEC)); + spin_lock_irqsave(&comp_jpeg[hw_id]->hw_lock, flags); ctx->total_frame_num++; mtk_jpeg_dec_reset(comp_jpeg[hw_id]->reg_base); @@ -1330,13 +1330,13 @@ static void mtk_jpeg_dec_device_run(void *priv) if (ret < 0) goto dec_end; - schedule_delayed_work(&jpeg->job_timeout_work, - msecs_to_jiffies(MTK_JPEG_HW_TIMEOUT_MSEC)); - mtk_jpeg_set_dec_src(ctx, &src_buf->vb2_buf, &bs); if (mtk_jpeg_set_dec_dst(ctx, &jpeg_src_buf->dec_param, &dst_buf->vb2_buf, &fb)) goto dec_end; + schedule_delayed_work(&jpeg->job_timeout_work, + msecs_to_jiffies(MTK_JPEG_HW_TIMEOUT_MSEC)); + spin_lock_irqsave(&jpeg->hw_lock, flags); mtk_jpeg_dec_reset(jpeg->reg_base); mtk_jpeg_dec_set_config(jpeg->reg_base,
Thanks for your patch. I think this should fix the problem. As I have no experience in reverting, can I submit the patch with your fix as well as reverting my fix? Best regards, Zheng Dmitry Osipenko <dmitry.osipenko@collabora.com> 于2023年10月20日周五 03:56写道: > > On 10/8/23 12:13, Zheng Hacker wrote: > > Dmitry Osipenko <dmitry.osipenko@collabora.com> 于2023年9月20日周三 02:24写道: > >> > >> On 8/31/23 11:18, Zheng Hacker wrote: > >>>> The v4l2_m2m_ctx_release() already should wait for the job_timeout_work > >>>> completion or for the interrupt fire. Apparently it doesn't work in > >>>> yours case. You'll need to debug why v4l job or job_timeout_work is > >>>> running after v4l2_m2m_ctx_release(), it shouldn't happen. > >>>> > >>> Yes, v4l2_m2m_cancel_job waits for m2m_ctx->job_flags to be ~TRANS_RUNNING, > >>> the mtk_jpeg_job_timeout_work will finally invoke v4l2_m2m_job_finish > >>> to trigger that. > >>> > >>> However, this is not the only path to call v4l2_m2m_job_finish. Here > >>> is a invoking chain: > >>> v4l_streamon > >>> ->v4l2_m2m_ioctl_streamon > >>> ->v4l2_m2m_streamon > >>> ->v4l2_m2m_try_schedule > >>> ->v4l2_m2m_try_run > >>> ->mtk_jpeg_dec_device_run > >>> ->schedule_delayed_work(&jpeg->job_timeout_work... > >>> ->error path goto dec_end > >>> ->v4l2_m2m_job_finish > >>> > >>> In some specific situation, it starts the worker and also calls > >>> v4l2_m2m_job_finish, which might > >>> make v4l2_m2m_cancel_job continues. > >> > >> Then the error path should cancel the job_timeout_work, or better job > >> needs to be run after the dec/enc has been started and not before. > >> > > > > Hi, > > > > Sorry for my late reply for I just went on a long vacation. > > > > Get it. I'll write another patch and change the summary to the lack of > > canceling job in error path. > > > >> Looking further at the code, I'm confused by this hunk: > >> > >> mtk_jpeg_dec_start(comp_jpeg[hw_id]->reg_base); > >> v4l2_m2m_job_finish(jpeg->m2m_dev, ctx->fh.m2m_ctx); > >> > >> The job should be marked as finished when h/w has finished processing > >> the job and not right after the job has been started. So the job is > >> always completed and mtk_jpeg_job_timeout_work() doesn't work as > >> expected, am I missing something? > > > > After reading the code I still don't know. I didn't see any function > > like mtk_jpeg_dec_end. The same thing > > happens on mtk_jpeg_enc_start. I think I'd better fix the first > > problem and wait for someone familiar with > > the second part. > > I missed that the code mentioned above is related to the multi-core hw version, while you care about single-core. Nevertheless, the multi-core device_run() looks incorrect, > > So, the error code paths need to be corrected. Please try to revert yours fix and test this change: > > diff --git a/drivers/media/platform/mediatek/jpeg/mtk_jpeg_core.c b/drivers/media/platform/mediatek/jpeg/mtk_jpeg_core.c > index 0051f372a66c..fd3b0587fcad 100644 > --- a/drivers/media/platform/mediatek/jpeg/mtk_jpeg_core.c > +++ b/drivers/media/platform/mediatek/jpeg/mtk_jpeg_core.c > @@ -1254,9 +1254,6 @@ static void mtk_jpegdec_worker(struct work_struct *work) > v4l2_m2m_src_buf_remove(ctx->fh.m2m_ctx); > v4l2_m2m_dst_buf_remove(ctx->fh.m2m_ctx); > > - schedule_delayed_work(&comp_jpeg[hw_id]->job_timeout_work, > - msecs_to_jiffies(MTK_JPEG_HW_TIMEOUT_MSEC)); > - > mtk_jpeg_set_dec_src(ctx, &src_buf->vb2_buf, &bs); > if (mtk_jpeg_set_dec_dst(ctx, > &jpeg_src_buf->dec_param, > @@ -1266,6 +1263,9 @@ static void mtk_jpegdec_worker(struct work_struct *work) > goto setdst_end; > } > > + schedule_delayed_work(&comp_jpeg[hw_id]->job_timeout_work, > + msecs_to_jiffies(MTK_JPEG_HW_TIMEOUT_MSEC)); > + > spin_lock_irqsave(&comp_jpeg[hw_id]->hw_lock, flags); > ctx->total_frame_num++; > mtk_jpeg_dec_reset(comp_jpeg[hw_id]->reg_base); > @@ -1330,13 +1330,13 @@ static void mtk_jpeg_dec_device_run(void *priv) > if (ret < 0) > goto dec_end; > > - schedule_delayed_work(&jpeg->job_timeout_work, > - msecs_to_jiffies(MTK_JPEG_HW_TIMEOUT_MSEC)); > - > mtk_jpeg_set_dec_src(ctx, &src_buf->vb2_buf, &bs); > if (mtk_jpeg_set_dec_dst(ctx, &jpeg_src_buf->dec_param, &dst_buf->vb2_buf, &fb)) > goto dec_end; > > + schedule_delayed_work(&jpeg->job_timeout_work, > + msecs_to_jiffies(MTK_JPEG_HW_TIMEOUT_MSEC)); > + > spin_lock_irqsave(&jpeg->hw_lock, flags); > mtk_jpeg_dec_reset(jpeg->reg_base); > mtk_jpeg_dec_set_config(jpeg->reg_base, > > -- > Best regards, > Dmitry >
diff --git a/drivers/media/platform/mediatek/jpeg/mtk_jpeg_core.c b/drivers/media/platform/mediatek/jpeg/mtk_jpeg_core.c index 0051f372a66c..6069ecf420b0 100644 --- a/drivers/media/platform/mediatek/jpeg/mtk_jpeg_core.c +++ b/drivers/media/platform/mediatek/jpeg/mtk_jpeg_core.c @@ -1816,6 +1816,7 @@ static void mtk_jpeg_remove(struct platform_device *pdev) { struct mtk_jpeg_dev *jpeg = platform_get_drvdata(pdev); + cancel_delayed_work_sync(&jpeg->job_timeout_work); pm_runtime_disable(&pdev->dev); video_unregister_device(jpeg->vdev); v4l2_m2m_release(jpeg->m2m_dev);