[RESEND,v2] media: mtk-jpeg: Fix use after free bug due to uncanceled work

Message ID 20230707092414.866760-1-zyytlz.wz@163.com (mailing list archive)
State Accepted
Delegated to: Hans Verkuil
Headers
Series [RESEND,v2] media: mtk-jpeg: Fix use after free bug due to uncanceled work |

Commit Message

Zheng Wang July 7, 2023, 9:24 a.m. UTC
  In mtk_jpeg_probe, &jpeg->job_timeout_work is bound with
mtk_jpeg_job_timeout_work. Then mtk_jpeg_dec_device_run
and mtk_jpeg_enc_device_run may be called to start the
work.
If we remove the module which will call mtk_jpeg_remove
to make cleanup, there may be a unfinished work. The
possible sequence is as follows, which will cause a
typical UAF bug.

Fix it by canceling the work before cleanup in the mtk_jpeg_remove

CPU0                  CPU1

                    |mtk_jpeg_job_timeout_work
mtk_jpeg_remove     |
  v4l2_m2m_release  |
    kfree(m2m_dev); |
                    |
                    | v4l2_m2m_get_curr_priv
                    |   m2m_dev->curr_ctx //use
Fixes: b2f0d2724ba4 ("[media] vcodec: mediatek: Add Mediatek JPEG Decoder Driver")
Signed-off-by: Zheng Wang <zyytlz.wz@163.com>
---
- v2: use cancel_delayed_work_sync instead of cancel_delayed_work suggested by Kyrie.
---
 drivers/media/platform/mediatek/jpeg/mtk_jpeg_core.c | 1 +
 1 file changed, 1 insertion(+)
  

Comments

Alexandre Mergnat July 7, 2023, 2:11 p.m. UTC | #1
On 07/07/2023 11:24, Zheng Wang wrote:
> In mtk_jpeg_probe, &jpeg->job_timeout_work is bound with
> mtk_jpeg_job_timeout_work. Then mtk_jpeg_dec_device_run
> and mtk_jpeg_enc_device_run may be called to start the
> work.
> If we remove the module which will call mtk_jpeg_remove
> to make cleanup, there may be a unfinished work. The
> possible sequence is as follows, which will cause a
> typical UAF bug.
> 
> Fix it by canceling the work before cleanup in the mtk_jpeg_remove
> 
> CPU0                  CPU1
> 
>                      |mtk_jpeg_job_timeout_work
> mtk_jpeg_remove     |
>    v4l2_m2m_release  |
>      kfree(m2m_dev); |
>                      |
>                      | v4l2_m2m_get_curr_priv
>                      |   m2m_dev->curr_ctx //use

Reviewed-by: Alexandre Mergnat <amergnat@baylibre.com>
  
Zheng Hacker July 15, 2023, 4:08 p.m. UTC | #2
Hi,

This issue has not been resolved for a long time. Is there anyone who can help?

Best regards,
Zheng

Alexandre Mergnat <amergnat@baylibre.com> 于2023年7月7日周五 22:11写道:
>
>
>
> On 07/07/2023 11:24, Zheng Wang wrote:
> > In mtk_jpeg_probe, &jpeg->job_timeout_work is bound with
> > mtk_jpeg_job_timeout_work. Then mtk_jpeg_dec_device_run
> > and mtk_jpeg_enc_device_run may be called to start the
> > work.
> > If we remove the module which will call mtk_jpeg_remove
> > to make cleanup, there may be a unfinished work. The
> > possible sequence is as follows, which will cause a
> > typical UAF bug.
> >
> > Fix it by canceling the work before cleanup in the mtk_jpeg_remove
> >
> > CPU0                  CPU1
> >
> >                      |mtk_jpeg_job_timeout_work
> > mtk_jpeg_remove     |
> >    v4l2_m2m_release  |
> >      kfree(m2m_dev); |
> >                      |
> >                      | v4l2_m2m_get_curr_priv
> >                      |   m2m_dev->curr_ctx //use
>
> Reviewed-by: Alexandre Mergnat <amergnat@baylibre.com>
>
> --
> Regards,
> Alexandre
  
Zheng Hacker July 18, 2023, 3:07 a.m. UTC | #3
Friendly ping

Zheng Hacker <hackerzheng666@gmail.com> 于2023年7月16日周日 00:08写道:
>
> Hi,
>
> This issue has not been resolved for a long time. Is there anyone who can help?
>
> Best regards,
> Zheng
>
> Alexandre Mergnat <amergnat@baylibre.com> 于2023年7月7日周五 22:11写道:
> >
> >
> >
> > On 07/07/2023 11:24, Zheng Wang wrote:
> > > In mtk_jpeg_probe, &jpeg->job_timeout_work is bound with
> > > mtk_jpeg_job_timeout_work. Then mtk_jpeg_dec_device_run
> > > and mtk_jpeg_enc_device_run may be called to start the
> > > work.
> > > If we remove the module which will call mtk_jpeg_remove
> > > to make cleanup, there may be a unfinished work. The
> > > possible sequence is as follows, which will cause a
> > > typical UAF bug.
> > >
> > > Fix it by canceling the work before cleanup in the mtk_jpeg_remove
> > >
> > > CPU0                  CPU1
> > >
> > >                      |mtk_jpeg_job_timeout_work
> > > mtk_jpeg_remove     |
> > >    v4l2_m2m_release  |
> > >      kfree(m2m_dev); |
> > >                      |
> > >                      | v4l2_m2m_get_curr_priv
> > >                      |   m2m_dev->curr_ctx //use
> >
> > Reviewed-by: Alexandre Mergnat <amergnat@baylibre.com>
> >
> > --
> > Regards,
> > Alexandre
  
Alexandre Mergnat July 19, 2023, 10:17 a.m. UTC | #4
On 18/07/2023 05:07, Zheng Hacker wrote:
> Friendly ping
> 
> Zheng Hacker <hackerzheng666@gmail.com> 于2023年7月16日周日 00:08写道:
>>
>> Hi,
>>
>> This issue has not been resolved for a long time. Is there anyone who can help?
>>
>> Best regards,
>> Zheng
>>
>> Alexandre Mergnat <amergnat@baylibre.com> 于2023年7月7日周五 22:11写道:
>>>
>>>
>>>
>>> On 07/07/2023 11:24, Zheng Wang wrote:
>>>> In mtk_jpeg_probe, &jpeg->job_timeout_work is bound with
>>>> mtk_jpeg_job_timeout_work. Then mtk_jpeg_dec_device_run
>>>> and mtk_jpeg_enc_device_run may be called to start the
>>>> work.
>>>> If we remove the module which will call mtk_jpeg_remove
>>>> to make cleanup, there may be a unfinished work. The
>>>> possible sequence is as follows, which will cause a
>>>> typical UAF bug.
>>>>
>>>> Fix it by canceling the work before cleanup in the mtk_jpeg_remove
>>>>
>>>> CPU0                  CPU1
>>>>
>>>>                       |mtk_jpeg_job_timeout_work
>>>> mtk_jpeg_remove     |
>>>>     v4l2_m2m_release  |
>>>>       kfree(m2m_dev); |
>>>>                       |
>>>>                       | v4l2_m2m_get_curr_priv
>>>>                       |   m2m_dev->curr_ctx //use
>>>
>>> Reviewed-by: Alexandre Mergnat <amergnat@baylibre.com>
>>>
>>> --
>>> Regards,
>>> Alexandre

Hi Zheng,

If you asking me to merge patch, sorry but I can't, I'm just a reviewer. 
I invite you to ping the maintainers directly:

Bin Liu <bin.liu@mediatek.com> (supporter:MEDIATEK JPEG DRIVER)
Mauro Carvalho Chehab <mchehab@kernel.org> (maintainer:MEDIA INPUT 
INFRASTRUCTURE (V4L/DVB))
Matthias Brugger <matthias.bgg@gmail.com> (maintainer:ARM/Mediatek SoC 
support)

Otherwise, I misunderstood what you asking me. If so, can you rephrase 
your question please?
  
Zheng Hacker July 20, 2023, 3:17 a.m. UTC | #5
Thanks dor your kind reply. I'll try to connect them later.

Best regards,
Zheng

Alexandre Mergnat <amergnat@baylibre.com> 于2023年7月19日周三 18:17写道:
>
>
>
> On 18/07/2023 05:07, Zheng Hacker wrote:
> > Friendly ping
> >
> > Zheng Hacker <hackerzheng666@gmail.com> 于2023年7月16日周日 00:08写道:
> >>
> >> Hi,
> >>
> >> This issue has not been resolved for a long time. Is there anyone who can help?
> >>
> >> Best regards,
> >> Zheng
> >>
> >> Alexandre Mergnat <amergnat@baylibre.com> 于2023年7月7日周五 22:11写道:
> >>>
> >>>
> >>>
> >>> On 07/07/2023 11:24, Zheng Wang wrote:
> >>>> In mtk_jpeg_probe, &jpeg->job_timeout_work is bound with
> >>>> mtk_jpeg_job_timeout_work. Then mtk_jpeg_dec_device_run
> >>>> and mtk_jpeg_enc_device_run may be called to start the
> >>>> work.
> >>>> If we remove the module which will call mtk_jpeg_remove
> >>>> to make cleanup, there may be a unfinished work. The
> >>>> possible sequence is as follows, which will cause a
> >>>> typical UAF bug.
> >>>>
> >>>> Fix it by canceling the work before cleanup in the mtk_jpeg_remove
> >>>>
> >>>> CPU0                  CPU1
> >>>>
> >>>>                       |mtk_jpeg_job_timeout_work
> >>>> mtk_jpeg_remove     |
> >>>>     v4l2_m2m_release  |
> >>>>       kfree(m2m_dev); |
> >>>>                       |
> >>>>                       | v4l2_m2m_get_curr_priv
> >>>>                       |   m2m_dev->curr_ctx //use
> >>>
> >>> Reviewed-by: Alexandre Mergnat <amergnat@baylibre.com>
> >>>
> >>> --
> >>> Regards,
> >>> Alexandre
>
> Hi Zheng,
>
> If you asking me to merge patch, sorry but I can't, I'm just a reviewer.
> I invite you to ping the maintainers directly:
>
> Bin Liu <bin.liu@mediatek.com> (supporter:MEDIATEK JPEG DRIVER)
> Mauro Carvalho Chehab <mchehab@kernel.org> (maintainer:MEDIA INPUT
> INFRASTRUCTURE (V4L/DVB))
> Matthias Brugger <matthias.bgg@gmail.com> (maintainer:ARM/Mediatek SoC
> support)
>
> Otherwise, I misunderstood what you asking me. If so, can you rephrase
> your question please?
>
> --
> Regards,
> Alexandre
  
Chen-Yu Tsai July 20, 2023, 3:40 a.m. UTC | #6
On Fri, Jul 7, 2023 at 5:25 PM Zheng Wang <zyytlz.wz@163.com> wrote:
>
> In mtk_jpeg_probe, &jpeg->job_timeout_work is bound with
> mtk_jpeg_job_timeout_work. Then mtk_jpeg_dec_device_run
> and mtk_jpeg_enc_device_run may be called to start the
> work.
> If we remove the module which will call mtk_jpeg_remove
> to make cleanup, there may be a unfinished work. The
> possible sequence is as follows, which will cause a
> typical UAF bug.
>
> Fix it by canceling the work before cleanup in the mtk_jpeg_remove
>
> CPU0                  CPU1
>
>                     |mtk_jpeg_job_timeout_work
> mtk_jpeg_remove     |
>   v4l2_m2m_release  |
>     kfree(m2m_dev); |
>                     |
>                     | v4l2_m2m_get_curr_priv
>                     |   m2m_dev->curr_ctx //use
> Fixes: b2f0d2724ba4 ("[media] vcodec: mediatek: Add Mediatek JPEG Decoder Driver")
> Signed-off-by: Zheng Wang <zyytlz.wz@163.com>

Reviewed-by: Chen-Yu Tsai <wenst@chromium.org>
  
AngeloGioacchino Del Regno July 20, 2023, 7:45 a.m. UTC | #7
Il 07/07/23 11:24, Zheng Wang ha scritto:
> In mtk_jpeg_probe, &jpeg->job_timeout_work is bound with
> mtk_jpeg_job_timeout_work. Then mtk_jpeg_dec_device_run
> and mtk_jpeg_enc_device_run may be called to start the
> work.
> If we remove the module which will call mtk_jpeg_remove
> to make cleanup, there may be a unfinished work. The
> possible sequence is as follows, which will cause a
> typical UAF bug.
> 
> Fix it by canceling the work before cleanup in the mtk_jpeg_remove
> 
> CPU0                  CPU1
> 
>                      |mtk_jpeg_job_timeout_work
> mtk_jpeg_remove     |
>    v4l2_m2m_release  |
>      kfree(m2m_dev); |
>                      |
>                      | v4l2_m2m_get_curr_priv
>                      |   m2m_dev->curr_ctx //use
> Fixes: b2f0d2724ba4 ("[media] vcodec: mediatek: Add Mediatek JPEG Decoder Driver")
> Signed-off-by: Zheng Wang <zyytlz.wz@163.com>

Reviewed-by: AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com>
  
Zheng Hacker July 25, 2023, 2:39 a.m. UTC | #8
Hello,

Is there any follow-up about this issue?

Thanks,
Zheng

AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com>
于2023年7月20日周四 15:45写道:
>
> Il 07/07/23 11:24, Zheng Wang ha scritto:
> > In mtk_jpeg_probe, &jpeg->job_timeout_work is bound with
> > mtk_jpeg_job_timeout_work. Then mtk_jpeg_dec_device_run
> > and mtk_jpeg_enc_device_run may be called to start the
> > work.
> > If we remove the module which will call mtk_jpeg_remove
> > to make cleanup, there may be a unfinished work. The
> > possible sequence is as follows, which will cause a
> > typical UAF bug.
> >
> > Fix it by canceling the work before cleanup in the mtk_jpeg_remove
> >
> > CPU0                  CPU1
> >
> >                      |mtk_jpeg_job_timeout_work
> > mtk_jpeg_remove     |
> >    v4l2_m2m_release  |
> >      kfree(m2m_dev); |
> >                      |
> >                      | v4l2_m2m_get_curr_priv
> >                      |   m2m_dev->curr_ctx //use
> > Fixes: b2f0d2724ba4 ("[media] vcodec: mediatek: Add Mediatek JPEG Decoder Driver")
> > Signed-off-by: Zheng Wang <zyytlz.wz@163.com>
>
> Reviewed-by: AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com>
>
>
  
Dmitry Osipenko Aug. 22, 2023, 6:51 p.m. UTC | #9
Hello Zheng,

On 7/7/23 12:24, Zheng Wang wrote:
> In mtk_jpeg_probe, &jpeg->job_timeout_work is bound with
> mtk_jpeg_job_timeout_work. Then mtk_jpeg_dec_device_run
> and mtk_jpeg_enc_device_run may be called to start the
> work.
> If we remove the module which will call mtk_jpeg_remove
> to make cleanup, there may be a unfinished work. The
> possible sequence is as follows, which will cause a
> typical UAF bug.
> 
> Fix it by canceling the work before cleanup in the mtk_jpeg_remove
> 
> CPU0                  CPU1
> 
>                     |mtk_jpeg_job_timeout_work
> mtk_jpeg_remove     |
>   v4l2_m2m_release  |
>     kfree(m2m_dev); |
>                     |
>                     | v4l2_m2m_get_curr_priv
>                     |   m2m_dev->curr_ctx //use
> Fixes: b2f0d2724ba4 ("[media] vcodec: mediatek: Add Mediatek JPEG Decoder Driver")
> Signed-off-by: Zheng Wang <zyytlz.wz@163.com>
> ---
> - v2: use cancel_delayed_work_sync instead of cancel_delayed_work suggested by Kyrie.
> ---
>  drivers/media/platform/mediatek/jpeg/mtk_jpeg_core.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/drivers/media/platform/mediatek/jpeg/mtk_jpeg_core.c b/drivers/media/platform/mediatek/jpeg/mtk_jpeg_core.c
> index 0051f372a66c..6069ecf420b0 100644
> --- a/drivers/media/platform/mediatek/jpeg/mtk_jpeg_core.c
> +++ b/drivers/media/platform/mediatek/jpeg/mtk_jpeg_core.c
> @@ -1816,6 +1816,7 @@ static void mtk_jpeg_remove(struct platform_device *pdev)
>  {
>  	struct mtk_jpeg_dev *jpeg = platform_get_drvdata(pdev);
>  
> +	cancel_delayed_work_sync(&jpeg->job_timeout_work);
>  	pm_runtime_disable(&pdev->dev);
>  	video_unregister_device(jpeg->vdev);
>  	v4l2_m2m_release(jpeg->m2m_dev);

AFAICS, there is a fundamental problem here. The job_timeout_work uses
v4l2_m2m_get_curr_priv() and at the time when driver module is unloaded,
all the v4l contexts must be closed and released. Hence the
v4l2_m2m_get_curr_priv() shall return NULL and crash the kernel when
work is executed before cancel_delayed_work_sync().

At the time when mtk_jpeg_remove() is invoked, there shall be no
job_timeout_work running in background because all jobs should be
completed before context is released. If you'll look at
v4l2_m2m_cancel_job(), you can see that it waits for the task completion
before closing context.

You shouldn't be able to remove driver module while it has active/opened
v4l contexts. If you can do that, then this is yours bug that needs to
be fixed.

In addition to this all, the job_timeout_work is initialized only for
the single-core JPEG device. I'd expect this patch should crash
multi-core JPEG devices.
  
Zheng Hacker Aug. 24, 2023, 8:20 a.m. UTC | #10
Dmitry Osipenko <dmitry.osipenko@collabora.com> 于2023年8月23日周三 02:51写道:

>
> Hello Zheng,
>
> On 7/7/23 12:24, Zheng Wang wrote:
> > In mtk_jpeg_probe, &jpeg->job_timeout_work is bound with
> > mtk_jpeg_job_timeout_work. Then mtk_jpeg_dec_device_run
> > and mtk_jpeg_enc_device_run may be called to start the
> > work.
> > If we remove the module which will call mtk_jpeg_remove
> > to make cleanup, there may be a unfinished work. The
> > possible sequence is as follows, which will cause a
> > typical UAF bug.
> >
> > Fix it by canceling the work before cleanup in the mtk_jpeg_remove
> >
> > CPU0                  CPU1
> >
> >                     |mtk_jpeg_job_timeout_work
> > mtk_jpeg_remove     |
> >   v4l2_m2m_release  |
> >     kfree(m2m_dev); |
> >                     |
> >                     | v4l2_m2m_get_curr_priv
> >                     |   m2m_dev->curr_ctx //use
> > Fixes: b2f0d2724ba4 ("[media] vcodec: mediatek: Add Mediatek JPEG Decoder Driver")
> > Signed-off-by: Zheng Wang <zyytlz.wz@163.com>
> > ---
> > - v2: use cancel_delayed_work_sync instead of cancel_delayed_work suggested by Kyrie.
> > ---
> >  drivers/media/platform/mediatek/jpeg/mtk_jpeg_core.c | 1 +
> >  1 file changed, 1 insertion(+)
> >
> > diff --git a/drivers/media/platform/mediatek/jpeg/mtk_jpeg_core.c b/drivers/media/platform/mediatek/jpeg/mtk_jpeg_core.c
> > index 0051f372a66c..6069ecf420b0 100644
> > --- a/drivers/media/platform/mediatek/jpeg/mtk_jpeg_core.c
> > +++ b/drivers/media/platform/mediatek/jpeg/mtk_jpeg_core.c
> > @@ -1816,6 +1816,7 @@ static void mtk_jpeg_remove(struct platform_device *pdev)
> >  {
> >       struct mtk_jpeg_dev *jpeg = platform_get_drvdata(pdev);
> >
> > +     cancel_delayed_work_sync(&jpeg->job_timeout_work);
> >       pm_runtime_disable(&pdev->dev);
> >       video_unregister_device(jpeg->vdev);
> >       v4l2_m2m_release(jpeg->m2m_dev);
>
> AFAICS, there is a fundamental problem here. The job_timeout_work uses
> v4l2_m2m_get_curr_priv() and at the time when driver module is unloaded,
> all the v4l contexts must be closed and released. Hence the
> v4l2_m2m_get_curr_priv() shall return NULL and crash the kernel when
> work is executed before cancel_delayed_work_sync().
>

Hi Dmitry,

Thanks for your reply. I think you're right. As m2m_dev is freed in
v4l2_m2m_release,
the invoking in v4l2_m2m_get_curr_priv might cause either UAF or null
pointer dereference
bug. I am sure that context is closed when we invoke mtk_jpeg_remove.
But I'm not sure if
context is released when mtk_jpegdec_timeout_work running.

> At the time when mtk_jpeg_remove() is invoked, there shall be no
> job_timeout_work running in background because all jobs should be
> completed before context is released. If you'll look at
> v4l2_m2m_cancel_job(), you can see that it waits for the task completion
> before closing context.

Yes, so I think the better way is to put the cancel_delayed_work_sync
invoking into
v4l2_m2m_ctx_release function?

>
> You shouldn't be able to remove driver module while it has active/opened
> v4l contexts. If you can do that, then this is yours bug that needs to
> be fixed.
>
> In addition to this all, the job_timeout_work is initialized only for
> the single-core JPEG device. I'd expect this patch should crash
> multi-core JPEG devices.
>

I think that's true. As I'm not familiar with the code here. Could you
please give me some advice about the patch?

Best regards,
Zheng

> --
> Best regards,
> Dmitry
>
  
Dmitry Osipenko Aug. 28, 2023, 2:04 a.m. UTC | #11
On 8/24/23 11:20, Zheng Hacker wrote:
> Dmitry Osipenko <dmitry.osipenko@collabora.com> 于2023年8月23日周三 02:51写道:
> 
>>
>> Hello Zheng,
>>
>> On 7/7/23 12:24, Zheng Wang wrote:
>>> In mtk_jpeg_probe, &jpeg->job_timeout_work is bound with
>>> mtk_jpeg_job_timeout_work. Then mtk_jpeg_dec_device_run
>>> and mtk_jpeg_enc_device_run may be called to start the
>>> work.
>>> If we remove the module which will call mtk_jpeg_remove
>>> to make cleanup, there may be a unfinished work. The
>>> possible sequence is as follows, which will cause a
>>> typical UAF bug.
>>>
>>> Fix it by canceling the work before cleanup in the mtk_jpeg_remove
>>>
>>> CPU0                  CPU1
>>>
>>>                     |mtk_jpeg_job_timeout_work
>>> mtk_jpeg_remove     |
>>>   v4l2_m2m_release  |
>>>     kfree(m2m_dev); |
>>>                     |
>>>                     | v4l2_m2m_get_curr_priv
>>>                     |   m2m_dev->curr_ctx //use
>>> Fixes: b2f0d2724ba4 ("[media] vcodec: mediatek: Add Mediatek JPEG Decoder Driver")
>>> Signed-off-by: Zheng Wang <zyytlz.wz@163.com>
>>> ---
>>> - v2: use cancel_delayed_work_sync instead of cancel_delayed_work suggested by Kyrie.
>>> ---
>>>  drivers/media/platform/mediatek/jpeg/mtk_jpeg_core.c | 1 +
>>>  1 file changed, 1 insertion(+)
>>>
>>> diff --git a/drivers/media/platform/mediatek/jpeg/mtk_jpeg_core.c b/drivers/media/platform/mediatek/jpeg/mtk_jpeg_core.c
>>> index 0051f372a66c..6069ecf420b0 100644
>>> --- a/drivers/media/platform/mediatek/jpeg/mtk_jpeg_core.c
>>> +++ b/drivers/media/platform/mediatek/jpeg/mtk_jpeg_core.c
>>> @@ -1816,6 +1816,7 @@ static void mtk_jpeg_remove(struct platform_device *pdev)
>>>  {
>>>       struct mtk_jpeg_dev *jpeg = platform_get_drvdata(pdev);
>>>
>>> +     cancel_delayed_work_sync(&jpeg->job_timeout_work);
>>>       pm_runtime_disable(&pdev->dev);
>>>       video_unregister_device(jpeg->vdev);
>>>       v4l2_m2m_release(jpeg->m2m_dev);
>>
>> AFAICS, there is a fundamental problem here. The job_timeout_work uses
>> v4l2_m2m_get_curr_priv() and at the time when driver module is unloaded,
>> all the v4l contexts must be closed and released. Hence the
>> v4l2_m2m_get_curr_priv() shall return NULL and crash the kernel when
>> work is executed before cancel_delayed_work_sync().
>>
> 
> Hi Dmitry,
> 
> Thanks for your reply. I think you're right. As m2m_dev is freed in
> v4l2_m2m_release,
> the invoking in v4l2_m2m_get_curr_priv might cause either UAF or null
> pointer dereference
> bug. I am sure that context is closed when we invoke mtk_jpeg_remove.
> But I'm not sure if
> context is released when mtk_jpegdec_timeout_work running.
> 
>> At the time when mtk_jpeg_remove() is invoked, there shall be no
>> job_timeout_work running in background because all jobs should be
>> completed before context is released. If you'll look at
>> v4l2_m2m_cancel_job(), you can see that it waits for the task completion
>> before closing context.
> 
> Yes, so I think the better way is to put the cancel_delayed_work_sync
> invoking into
> v4l2_m2m_ctx_release function?

The v4l2_m2m_ctx_release() already should wait for the job_timeout_work
completion or for the interrupt fire. Apparently it doesn't work in
yours case. You'll need to debug why v4l job or job_timeout_work is
running after v4l2_m2m_ctx_release(), it shouldn't happen.

The interrupt handler cancels job_timeout_work, you shouldn't need to
flush the work.

Technically, interrupt handler may race with job_timeout_work, but the
timeout is set to 1 second and in practice should be difficult to
trigger the race. The interrupt handler needs to be threaded, it should
use cancel_delayed_work_sync() and check the return value of this function.

>>
>> You shouldn't be able to remove driver module while it has active/opened
>> v4l contexts. If you can do that, then this is yours bug that needs to
>> be fixed.
>>
>> In addition to this all, the job_timeout_work is initialized only for
>> the single-core JPEG device. I'd expect this patch should crash
>> multi-core JPEG devices.
>>
> 
> I think that's true. As I'm not familiar with the code here. Could you
> please give me some advice about the patch?

We'll need to understand why v4l2_m2m_ctx_release() doesn't work as
expected before thinking about the patch.
  
Zheng Hacker Aug. 31, 2023, 8:18 a.m. UTC | #12
Dmitry Osipenko <dmitry.osipenko@collabora.com> 于2023年8月28日周一 10:04写道:
>
> On 8/24/23 11:20, Zheng Hacker wrote:
> > Dmitry Osipenko <dmitry.osipenko@collabora.com> 于2023年8月23日周三 02:51写道:
> >
> >>
> >> Hello Zheng,
> >>
> >> On 7/7/23 12:24, Zheng Wang wrote:
> >>> In mtk_jpeg_probe, &jpeg->job_timeout_work is bound with
> >>> mtk_jpeg_job_timeout_work. Then mtk_jpeg_dec_device_run
> >>> and mtk_jpeg_enc_device_run may be called to start the
> >>> work.
> >>> If we remove the module which will call mtk_jpeg_remove
> >>> to make cleanup, there may be a unfinished work. The
> >>> possible sequence is as follows, which will cause a
> >>> typical UAF bug.
> >>>
> >>> Fix it by canceling the work before cleanup in the mtk_jpeg_remove
> >>>
> >>> CPU0                  CPU1
> >>>
> >>>                     |mtk_jpeg_job_timeout_work
> >>> mtk_jpeg_remove     |
> >>>   v4l2_m2m_release  |
> >>>     kfree(m2m_dev); |
> >>>                     |
> >>>                     | v4l2_m2m_get_curr_priv
> >>>                     |   m2m_dev->curr_ctx //use
> >>> Fixes: b2f0d2724ba4 ("[media] vcodec: mediatek: Add Mediatek JPEG Decoder Driver")
> >>> Signed-off-by: Zheng Wang <zyytlz.wz@163.com>
> >>> ---
> >>> - v2: use cancel_delayed_work_sync instead of cancel_delayed_work suggested by Kyrie.
> >>> ---
> >>>  drivers/media/platform/mediatek/jpeg/mtk_jpeg_core.c | 1 +
> >>>  1 file changed, 1 insertion(+)
> >>>
> >>> diff --git a/drivers/media/platform/mediatek/jpeg/mtk_jpeg_core.c b/drivers/media/platform/mediatek/jpeg/mtk_jpeg_core.c
> >>> index 0051f372a66c..6069ecf420b0 100644
> >>> --- a/drivers/media/platform/mediatek/jpeg/mtk_jpeg_core.c
> >>> +++ b/drivers/media/platform/mediatek/jpeg/mtk_jpeg_core.c
> >>> @@ -1816,6 +1816,7 @@ static void mtk_jpeg_remove(struct platform_device *pdev)
> >>>  {
> >>>       struct mtk_jpeg_dev *jpeg = platform_get_drvdata(pdev);
> >>>
> >>> +     cancel_delayed_work_sync(&jpeg->job_timeout_work);
> >>>       pm_runtime_disable(&pdev->dev);
> >>>       video_unregister_device(jpeg->vdev);
> >>>       v4l2_m2m_release(jpeg->m2m_dev);
> >>
> >> AFAICS, there is a fundamental problem here. The job_timeout_work uses
> >> v4l2_m2m_get_curr_priv() and at the time when driver module is unloaded,
> >> all the v4l contexts must be closed and released. Hence the
> >> v4l2_m2m_get_curr_priv() shall return NULL and crash the kernel when
> >> work is executed before cancel_delayed_work_sync().
> >>
> >
> > Hi Dmitry,
> >
> > Thanks for your reply. I think you're right. As m2m_dev is freed in
> > v4l2_m2m_release,
> > the invoking in v4l2_m2m_get_curr_priv might cause either UAF or null
> > pointer dereference
> > bug. I am sure that context is closed when we invoke mtk_jpeg_remove.
> > But I'm not sure if
> > context is released when mtk_jpegdec_timeout_work running.
> >
> >> At the time when mtk_jpeg_remove() is invoked, there shall be no
> >> job_timeout_work running in background because all jobs should be
> >> completed before context is released. If you'll look at
> >> v4l2_m2m_cancel_job(), you can see that it waits for the task completion
> >> before closing context.
> >
> > Yes, so I think the better way is to put the cancel_delayed_work_sync
> > invoking into
> > v4l2_m2m_ctx_release function?
>
> The v4l2_m2m_ctx_release() already should wait for the job_timeout_work
> completion or for the interrupt fire. Apparently it doesn't work in
> yours case. You'll need to debug why v4l job or job_timeout_work is
> running after v4l2_m2m_ctx_release(), it shouldn't happen.
>

Yes, v4l2_m2m_cancel_job waits for m2m_ctx->job_flags to be  ~TRANS_RUNNING,
the mtk_jpeg_job_timeout_work will finally invoke v4l2_m2m_job_finish
to trigger that.

However, this is not the only path to call v4l2_m2m_job_finish. Here
is a invoking chain:
v4l_streamon
  ->v4l2_m2m_ioctl_streamon
    ->v4l2_m2m_streamon
      ->v4l2_m2m_try_schedule
        ->v4l2_m2m_try_run
          ->mtk_jpeg_dec_device_run
            ->schedule_delayed_work(&jpeg->job_timeout_work...
            ->error path goto dec_end
            ->v4l2_m2m_job_finish

In some specific situation, it starts the worker and also calls
v4l2_m2m_job_finish, which might
make v4l2_m2m_cancel_job continues.

> The interrupt handler cancels job_timeout_work, you shouldn't need to
> flush the work.

It will, but as I said, there might be an early invocation chain to
start the work.(Not very sure)

>
> Technically, interrupt handler may race with job_timeout_work, but the
> timeout is set to 1 second and in practice should be difficult to
> trigger the race. The interrupt handler needs to be threaded, it should
> use cancel_delayed_work_sync() and check the return value of this function.
>

Yes, it's better to use cancel_delayed_work_sync here.

> >>
> >> You shouldn't be able to remove driver module while it has active/opened
> >> v4l contexts. If you can do that, then this is yours bug that needs to
> >> be fixed.
> >>
> >> In addition to this all, the job_timeout_work is initialized only for
> >> the single-core JPEG device. I'd expect this patch should crash
> >> multi-core JPEG devices.
> >>
> >
> > I think that's true. As I'm not familiar with the code here. Could you
> > please give me some advice about the patch?
>
> We'll need to understand why v4l2_m2m_ctx_release() doesn't work as
> expected before thinking about the patch.
>
> --
> Best regards,
> Dmitry
>
  
Zheng Hacker Sept. 5, 2023, 4:24 a.m. UTC | #13
Friendly ping.

Zheng Hacker <hackerzheng666@gmail.com> 于2023年8月31日周四 16:18写道:
>
> Dmitry Osipenko <dmitry.osipenko@collabora.com> 于2023年8月28日周一 10:04写道:
> >
> > On 8/24/23 11:20, Zheng Hacker wrote:
> > > Dmitry Osipenko <dmitry.osipenko@collabora.com> 于2023年8月23日周三 02:51写道:
> > >
> > >>
> > >> Hello Zheng,
> > >>
> > >> On 7/7/23 12:24, Zheng Wang wrote:
> > >>> In mtk_jpeg_probe, &jpeg->job_timeout_work is bound with
> > >>> mtk_jpeg_job_timeout_work. Then mtk_jpeg_dec_device_run
> > >>> and mtk_jpeg_enc_device_run may be called to start the
> > >>> work.
> > >>> If we remove the module which will call mtk_jpeg_remove
> > >>> to make cleanup, there may be a unfinished work. The
> > >>> possible sequence is as follows, which will cause a
> > >>> typical UAF bug.
> > >>>
> > >>> Fix it by canceling the work before cleanup in the mtk_jpeg_remove
> > >>>
> > >>> CPU0                  CPU1
> > >>>
> > >>>                     |mtk_jpeg_job_timeout_work
> > >>> mtk_jpeg_remove     |
> > >>>   v4l2_m2m_release  |
> > >>>     kfree(m2m_dev); |
> > >>>                     |
> > >>>                     | v4l2_m2m_get_curr_priv
> > >>>                     |   m2m_dev->curr_ctx //use
> > >>> Fixes: b2f0d2724ba4 ("[media] vcodec: mediatek: Add Mediatek JPEG Decoder Driver")
> > >>> Signed-off-by: Zheng Wang <zyytlz.wz@163.com>
> > >>> ---
> > >>> - v2: use cancel_delayed_work_sync instead of cancel_delayed_work suggested by Kyrie.
> > >>> ---
> > >>>  drivers/media/platform/mediatek/jpeg/mtk_jpeg_core.c | 1 +
> > >>>  1 file changed, 1 insertion(+)
> > >>>
> > >>> diff --git a/drivers/media/platform/mediatek/jpeg/mtk_jpeg_core.c b/drivers/media/platform/mediatek/jpeg/mtk_jpeg_core.c
> > >>> index 0051f372a66c..6069ecf420b0 100644
> > >>> --- a/drivers/media/platform/mediatek/jpeg/mtk_jpeg_core.c
> > >>> +++ b/drivers/media/platform/mediatek/jpeg/mtk_jpeg_core.c
> > >>> @@ -1816,6 +1816,7 @@ static void mtk_jpeg_remove(struct platform_device *pdev)
> > >>>  {
> > >>>       struct mtk_jpeg_dev *jpeg = platform_get_drvdata(pdev);
> > >>>
> > >>> +     cancel_delayed_work_sync(&jpeg->job_timeout_work);
> > >>>       pm_runtime_disable(&pdev->dev);
> > >>>       video_unregister_device(jpeg->vdev);
> > >>>       v4l2_m2m_release(jpeg->m2m_dev);
> > >>
> > >> AFAICS, there is a fundamental problem here. The job_timeout_work uses
> > >> v4l2_m2m_get_curr_priv() and at the time when driver module is unloaded,
> > >> all the v4l contexts must be closed and released. Hence the
> > >> v4l2_m2m_get_curr_priv() shall return NULL and crash the kernel when
> > >> work is executed before cancel_delayed_work_sync().
> > >>
> > >
> > > Hi Dmitry,
> > >
> > > Thanks for your reply. I think you're right. As m2m_dev is freed in
> > > v4l2_m2m_release,
> > > the invoking in v4l2_m2m_get_curr_priv might cause either UAF or null
> > > pointer dereference
> > > bug. I am sure that context is closed when we invoke mtk_jpeg_remove.
> > > But I'm not sure if
> > > context is released when mtk_jpegdec_timeout_work running.
> > >
> > >> At the time when mtk_jpeg_remove() is invoked, there shall be no
> > >> job_timeout_work running in background because all jobs should be
> > >> completed before context is released. If you'll look at
> > >> v4l2_m2m_cancel_job(), you can see that it waits for the task completion
> > >> before closing context.
> > >
> > > Yes, so I think the better way is to put the cancel_delayed_work_sync
> > > invoking into
> > > v4l2_m2m_ctx_release function?
> >
> > The v4l2_m2m_ctx_release() already should wait for the job_timeout_work
> > completion or for the interrupt fire. Apparently it doesn't work in
> > yours case. You'll need to debug why v4l job or job_timeout_work is
> > running after v4l2_m2m_ctx_release(), it shouldn't happen.
> >
>
> Yes, v4l2_m2m_cancel_job waits for m2m_ctx->job_flags to be  ~TRANS_RUNNING,
> the mtk_jpeg_job_timeout_work will finally invoke v4l2_m2m_job_finish
> to trigger that.
>
> However, this is not the only path to call v4l2_m2m_job_finish. Here
> is a invoking chain:
> v4l_streamon
>   ->v4l2_m2m_ioctl_streamon
>     ->v4l2_m2m_streamon
>       ->v4l2_m2m_try_schedule
>         ->v4l2_m2m_try_run
>           ->mtk_jpeg_dec_device_run
>             ->schedule_delayed_work(&jpeg->job_timeout_work...
>             ->error path goto dec_end
>             ->v4l2_m2m_job_finish
>
> In some specific situation, it starts the worker and also calls
> v4l2_m2m_job_finish, which might
> make v4l2_m2m_cancel_job continues.
>
> > The interrupt handler cancels job_timeout_work, you shouldn't need to
> > flush the work.
>
> It will, but as I said, there might be an early invocation chain to
> start the work.(Not very sure)
>
> >
> > Technically, interrupt handler may race with job_timeout_work, but the
> > timeout is set to 1 second and in practice should be difficult to
> > trigger the race. The interrupt handler needs to be threaded, it should
> > use cancel_delayed_work_sync() and check the return value of this function.
> >
>
> Yes, it's better to use cancel_delayed_work_sync here.
>
> > >>
> > >> You shouldn't be able to remove driver module while it has active/opened
> > >> v4l contexts. If you can do that, then this is yours bug that needs to
> > >> be fixed.
> > >>
> > >> In addition to this all, the job_timeout_work is initialized only for
> > >> the single-core JPEG device. I'd expect this patch should crash
> > >> multi-core JPEG devices.
> > >>
> > >
> > > I think that's true. As I'm not familiar with the code here. Could you
> > > please give me some advice about the patch?
> >
> > We'll need to understand why v4l2_m2m_ctx_release() doesn't work as
> > expected before thinking about the patch.
> >
> > --
> > Best regards,
> > Dmitry
> >
  
Zheng Hacker Sept. 12, 2023, 9:29 a.m. UTC | #14
Hi Dmitry,

The patch is on the stable queue. Could you please take a look at my
analysis? Thanks for your help.

Best regards,
Zheng

Zheng Hacker <hackerzheng666@gmail.com> 于2023年9月5日周二 12:24写道:
>
> Friendly ping.
>
> Zheng Hacker <hackerzheng666@gmail.com> 于2023年8月31日周四 16:18写道:
> >
> > Dmitry Osipenko <dmitry.osipenko@collabora.com> 于2023年8月28日周一 10:04写道:
> > >
> > > On 8/24/23 11:20, Zheng Hacker wrote:
> > > > Dmitry Osipenko <dmitry.osipenko@collabora.com> 于2023年8月23日周三 02:51写道:
> > > >
> > > >>
> > > >> Hello Zheng,
> > > >>
> > > >> On 7/7/23 12:24, Zheng Wang wrote:
> > > >>> In mtk_jpeg_probe, &jpeg->job_timeout_work is bound with
> > > >>> mtk_jpeg_job_timeout_work. Then mtk_jpeg_dec_device_run
> > > >>> and mtk_jpeg_enc_device_run may be called to start the
> > > >>> work.
> > > >>> If we remove the module which will call mtk_jpeg_remove
> > > >>> to make cleanup, there may be a unfinished work. The
> > > >>> possible sequence is as follows, which will cause a
> > > >>> typical UAF bug.
> > > >>>
> > > >>> Fix it by canceling the work before cleanup in the mtk_jpeg_remove
> > > >>>
> > > >>> CPU0                  CPU1
> > > >>>
> > > >>>                     |mtk_jpeg_job_timeout_work
> > > >>> mtk_jpeg_remove     |
> > > >>>   v4l2_m2m_release  |
> > > >>>     kfree(m2m_dev); |
> > > >>>                     |
> > > >>>                     | v4l2_m2m_get_curr_priv
> > > >>>                     |   m2m_dev->curr_ctx //use
> > > >>> Fixes: b2f0d2724ba4 ("[media] vcodec: mediatek: Add Mediatek JPEG Decoder Driver")
> > > >>> Signed-off-by: Zheng Wang <zyytlz.wz@163.com>
> > > >>> ---
> > > >>> - v2: use cancel_delayed_work_sync instead of cancel_delayed_work suggested by Kyrie.
> > > >>> ---
> > > >>>  drivers/media/platform/mediatek/jpeg/mtk_jpeg_core.c | 1 +
> > > >>>  1 file changed, 1 insertion(+)
> > > >>>
> > > >>> diff --git a/drivers/media/platform/mediatek/jpeg/mtk_jpeg_core.c b/drivers/media/platform/mediatek/jpeg/mtk_jpeg_core.c
> > > >>> index 0051f372a66c..6069ecf420b0 100644
> > > >>> --- a/drivers/media/platform/mediatek/jpeg/mtk_jpeg_core.c
> > > >>> +++ b/drivers/media/platform/mediatek/jpeg/mtk_jpeg_core.c
> > > >>> @@ -1816,6 +1816,7 @@ static void mtk_jpeg_remove(struct platform_device *pdev)
> > > >>>  {
> > > >>>       struct mtk_jpeg_dev *jpeg = platform_get_drvdata(pdev);
> > > >>>
> > > >>> +     cancel_delayed_work_sync(&jpeg->job_timeout_work);
> > > >>>       pm_runtime_disable(&pdev->dev);
> > > >>>       video_unregister_device(jpeg->vdev);
> > > >>>       v4l2_m2m_release(jpeg->m2m_dev);
> > > >>
> > > >> AFAICS, there is a fundamental problem here. The job_timeout_work uses
> > > >> v4l2_m2m_get_curr_priv() and at the time when driver module is unloaded,
> > > >> all the v4l contexts must be closed and released. Hence the
> > > >> v4l2_m2m_get_curr_priv() shall return NULL and crash the kernel when
> > > >> work is executed before cancel_delayed_work_sync().
> > > >>
> > > >
> > > > Hi Dmitry,
> > > >
> > > > Thanks for your reply. I think you're right. As m2m_dev is freed in
> > > > v4l2_m2m_release,
> > > > the invoking in v4l2_m2m_get_curr_priv might cause either UAF or null
> > > > pointer dereference
> > > > bug. I am sure that context is closed when we invoke mtk_jpeg_remove.
> > > > But I'm not sure if
> > > > context is released when mtk_jpegdec_timeout_work running.
> > > >
> > > >> At the time when mtk_jpeg_remove() is invoked, there shall be no
> > > >> job_timeout_work running in background because all jobs should be
> > > >> completed before context is released. If you'll look at
> > > >> v4l2_m2m_cancel_job(), you can see that it waits for the task completion
> > > >> before closing context.
> > > >
> > > > Yes, so I think the better way is to put the cancel_delayed_work_sync
> > > > invoking into
> > > > v4l2_m2m_ctx_release function?
> > >
> > > The v4l2_m2m_ctx_release() already should wait for the job_timeout_work
> > > completion or for the interrupt fire. Apparently it doesn't work in
> > > yours case. You'll need to debug why v4l job or job_timeout_work is
> > > running after v4l2_m2m_ctx_release(), it shouldn't happen.
> > >
> >
> > Yes, v4l2_m2m_cancel_job waits for m2m_ctx->job_flags to be  ~TRANS_RUNNING,
> > the mtk_jpeg_job_timeout_work will finally invoke v4l2_m2m_job_finish
> > to trigger that.
> >
> > However, this is not the only path to call v4l2_m2m_job_finish. Here
> > is a invoking chain:
> > v4l_streamon
> >   ->v4l2_m2m_ioctl_streamon
> >     ->v4l2_m2m_streamon
> >       ->v4l2_m2m_try_schedule
> >         ->v4l2_m2m_try_run
> >           ->mtk_jpeg_dec_device_run
> >             ->schedule_delayed_work(&jpeg->job_timeout_work...
> >             ->error path goto dec_end
> >             ->v4l2_m2m_job_finish
> >
> > In some specific situation, it starts the worker and also calls
> > v4l2_m2m_job_finish, which might
> > make v4l2_m2m_cancel_job continues.
> >
> > > The interrupt handler cancels job_timeout_work, you shouldn't need to
> > > flush the work.
> >
> > It will, but as I said, there might be an early invocation chain to
> > start the work.(Not very sure)
> >
> > >
> > > Technically, interrupt handler may race with job_timeout_work, but the
> > > timeout is set to 1 second and in practice should be difficult to
> > > trigger the race. The interrupt handler needs to be threaded, it should
> > > use cancel_delayed_work_sync() and check the return value of this function.
> > >
> >
> > Yes, it's better to use cancel_delayed_work_sync here.
> >
> > > >>
> > > >> You shouldn't be able to remove driver module while it has active/opened
> > > >> v4l contexts. If you can do that, then this is yours bug that needs to
> > > >> be fixed.
> > > >>
> > > >> In addition to this all, the job_timeout_work is initialized only for
> > > >> the single-core JPEG device. I'd expect this patch should crash
> > > >> multi-core JPEG devices.
> > > >>
> > > >
> > > > I think that's true. As I'm not familiar with the code here. Could you
> > > > please give me some advice about the patch?
> > >
> > > We'll need to understand why v4l2_m2m_ctx_release() doesn't work as
> > > expected before thinking about the patch.
> > >
> > > --
> > > Best regards,
> > > Dmitry
> > >
  
Dmitry Osipenko Sept. 19, 2023, 6:24 p.m. UTC | #15
On 8/31/23 11:18, Zheng Hacker wrote:
>> The v4l2_m2m_ctx_release() already should wait for the job_timeout_work
>> completion or for the interrupt fire. Apparently it doesn't work in
>> yours case. You'll need to debug why v4l job or job_timeout_work is
>> running after v4l2_m2m_ctx_release(), it shouldn't happen.
>>
> Yes, v4l2_m2m_cancel_job waits for m2m_ctx->job_flags to be  ~TRANS_RUNNING,
> the mtk_jpeg_job_timeout_work will finally invoke v4l2_m2m_job_finish
> to trigger that.
> 
> However, this is not the only path to call v4l2_m2m_job_finish. Here
> is a invoking chain:
> v4l_streamon
>   ->v4l2_m2m_ioctl_streamon
>     ->v4l2_m2m_streamon
>       ->v4l2_m2m_try_schedule
>         ->v4l2_m2m_try_run
>           ->mtk_jpeg_dec_device_run
>             ->schedule_delayed_work(&jpeg->job_timeout_work...
>             ->error path goto dec_end
>             ->v4l2_m2m_job_finish
> 
> In some specific situation, it starts the worker and also calls
> v4l2_m2m_job_finish, which might
> make v4l2_m2m_cancel_job continues.

Then the error path should cancel the job_timeout_work, or better job
needs to be run after the dec/enc has been started and not before.

Looking further at the code, I'm confused by this hunk:

	mtk_jpeg_dec_start(comp_jpeg[hw_id]->reg_base);
	v4l2_m2m_job_finish(jpeg->m2m_dev, ctx->fh.m2m_ctx);

The job should be marked as finished when h/w has finished processing
the job and not right after the job has been started. So the job is
always completed and mtk_jpeg_job_timeout_work() doesn't work as
expected, am I missing something?
  
Dmitry Osipenko Sept. 19, 2023, 6:25 p.m. UTC | #16
On 9/19/23 21:24, Dmitry Osipenko wrote:
> On 8/31/23 11:18, Zheng Hacker wrote:
>>> The v4l2_m2m_ctx_release() already should wait for the job_timeout_work
>>> completion or for the interrupt fire. Apparently it doesn't work in
>>> yours case. You'll need to debug why v4l job or job_timeout_work is
>>> running after v4l2_m2m_ctx_release(), it shouldn't happen.
>>>
>> Yes, v4l2_m2m_cancel_job waits for m2m_ctx->job_flags to be  ~TRANS_RUNNING,
>> the mtk_jpeg_job_timeout_work will finally invoke v4l2_m2m_job_finish
>> to trigger that.
>>
>> However, this is not the only path to call v4l2_m2m_job_finish. Here
>> is a invoking chain:
>> v4l_streamon
>>   ->v4l2_m2m_ioctl_streamon
>>     ->v4l2_m2m_streamon
>>       ->v4l2_m2m_try_schedule
>>         ->v4l2_m2m_try_run
>>           ->mtk_jpeg_dec_device_run
>>             ->schedule_delayed_work(&jpeg->job_timeout_work...
>>             ->error path goto dec_end
>>             ->v4l2_m2m_job_finish
>>
>> In some specific situation, it starts the worker and also calls
>> v4l2_m2m_job_finish, which might
>> make v4l2_m2m_cancel_job continues.
> 
> Then the error path should cancel the job_timeout_work, or better job

s/job/timeout work/

> needs to be run after the dec/enc has been started and not before.
> 
> Looking further at the code, I'm confused by this hunk:
> 
> 	mtk_jpeg_dec_start(comp_jpeg[hw_id]->reg_base);
> 	v4l2_m2m_job_finish(jpeg->m2m_dev, ctx->fh.m2m_ctx);
> 
> The job should be marked as finished when h/w has finished processing
> the job and not right after the job has been started. So the job is
> always completed and mtk_jpeg_job_timeout_work() doesn't work as
> expected, am I missing something?
>
  
Zheng Hacker Oct. 8, 2023, 9:13 a.m. UTC | #17
Dmitry Osipenko <dmitry.osipenko@collabora.com> 于2023年9月20日周三 02:24写道:
>
> On 8/31/23 11:18, Zheng Hacker wrote:
> >> The v4l2_m2m_ctx_release() already should wait for the job_timeout_work
> >> completion or for the interrupt fire. Apparently it doesn't work in
> >> yours case. You'll need to debug why v4l job or job_timeout_work is
> >> running after v4l2_m2m_ctx_release(), it shouldn't happen.
> >>
> > Yes, v4l2_m2m_cancel_job waits for m2m_ctx->job_flags to be  ~TRANS_RUNNING,
> > the mtk_jpeg_job_timeout_work will finally invoke v4l2_m2m_job_finish
> > to trigger that.
> >
> > However, this is not the only path to call v4l2_m2m_job_finish. Here
> > is a invoking chain:
> > v4l_streamon
> >   ->v4l2_m2m_ioctl_streamon
> >     ->v4l2_m2m_streamon
> >       ->v4l2_m2m_try_schedule
> >         ->v4l2_m2m_try_run
> >           ->mtk_jpeg_dec_device_run
> >             ->schedule_delayed_work(&jpeg->job_timeout_work...
> >             ->error path goto dec_end
> >             ->v4l2_m2m_job_finish
> >
> > In some specific situation, it starts the worker and also calls
> > v4l2_m2m_job_finish, which might
> > make v4l2_m2m_cancel_job continues.
>
> Then the error path should cancel the job_timeout_work, or better job
> needs to be run after the dec/enc has been started and not before.
>

Hi,

Sorry for my late reply for I just went on a long vacation.

Get it. I'll write another patch and change the summary to the lack of
canceling job in error path.

> Looking further at the code, I'm confused by this hunk:
>
>         mtk_jpeg_dec_start(comp_jpeg[hw_id]->reg_base);
>         v4l2_m2m_job_finish(jpeg->m2m_dev, ctx->fh.m2m_ctx);
>
> The job should be marked as finished when h/w has finished processing
> the job and not right after the job has been started. So the job is
> always completed and mtk_jpeg_job_timeout_work() doesn't work as
> expected, am I missing something?

After reading the code I still don't know. I didn't see any function
like mtk_jpeg_dec_end. The same thing
happens on mtk_jpeg_enc_start. I think I'd better fix the first
problem and wait for someone familiar with
the second part.

Best regards,
Zheng

>
> --
> Best regards,
> Dmitry
>
  
Dmitry Osipenko Oct. 19, 2023, 7:56 p.m. UTC | #18
On 10/8/23 12:13, Zheng Hacker wrote:
> Dmitry Osipenko <dmitry.osipenko@collabora.com> 于2023年9月20日周三 02:24写道:
>>
>> On 8/31/23 11:18, Zheng Hacker wrote:
>>>> The v4l2_m2m_ctx_release() already should wait for the job_timeout_work
>>>> completion or for the interrupt fire. Apparently it doesn't work in
>>>> yours case. You'll need to debug why v4l job or job_timeout_work is
>>>> running after v4l2_m2m_ctx_release(), it shouldn't happen.
>>>>
>>> Yes, v4l2_m2m_cancel_job waits for m2m_ctx->job_flags to be  ~TRANS_RUNNING,
>>> the mtk_jpeg_job_timeout_work will finally invoke v4l2_m2m_job_finish
>>> to trigger that.
>>>
>>> However, this is not the only path to call v4l2_m2m_job_finish. Here
>>> is a invoking chain:
>>> v4l_streamon
>>>   ->v4l2_m2m_ioctl_streamon
>>>     ->v4l2_m2m_streamon
>>>       ->v4l2_m2m_try_schedule
>>>         ->v4l2_m2m_try_run
>>>           ->mtk_jpeg_dec_device_run
>>>             ->schedule_delayed_work(&jpeg->job_timeout_work...
>>>             ->error path goto dec_end
>>>             ->v4l2_m2m_job_finish
>>>
>>> In some specific situation, it starts the worker and also calls
>>> v4l2_m2m_job_finish, which might
>>> make v4l2_m2m_cancel_job continues.
>>
>> Then the error path should cancel the job_timeout_work, or better job
>> needs to be run after the dec/enc has been started and not before.
>>
> 
> Hi,
> 
> Sorry for my late reply for I just went on a long vacation.
> 
> Get it. I'll write another patch and change the summary to the lack of
> canceling job in error path.
> 
>> Looking further at the code, I'm confused by this hunk:
>>
>>         mtk_jpeg_dec_start(comp_jpeg[hw_id]->reg_base);
>>         v4l2_m2m_job_finish(jpeg->m2m_dev, ctx->fh.m2m_ctx);
>>
>> The job should be marked as finished when h/w has finished processing
>> the job and not right after the job has been started. So the job is
>> always completed and mtk_jpeg_job_timeout_work() doesn't work as
>> expected, am I missing something?
> 
> After reading the code I still don't know. I didn't see any function
> like mtk_jpeg_dec_end. The same thing
> happens on mtk_jpeg_enc_start. I think I'd better fix the first
> problem and wait for someone familiar with
> the second part.

I missed that the code mentioned above is related to the multi-core hw version, while you care about single-core. Nevertheless, the multi-core device_run() looks incorrect,

So, the error code paths need to be corrected. Please try to revert yours fix and test this change: 

diff --git a/drivers/media/platform/mediatek/jpeg/mtk_jpeg_core.c b/drivers/media/platform/mediatek/jpeg/mtk_jpeg_core.c
index 0051f372a66c..fd3b0587fcad 100644
--- a/drivers/media/platform/mediatek/jpeg/mtk_jpeg_core.c
+++ b/drivers/media/platform/mediatek/jpeg/mtk_jpeg_core.c
@@ -1254,9 +1254,6 @@ static void mtk_jpegdec_worker(struct work_struct *work)
 	v4l2_m2m_src_buf_remove(ctx->fh.m2m_ctx);
 	v4l2_m2m_dst_buf_remove(ctx->fh.m2m_ctx);
 
-	schedule_delayed_work(&comp_jpeg[hw_id]->job_timeout_work,
-			      msecs_to_jiffies(MTK_JPEG_HW_TIMEOUT_MSEC));
-
 	mtk_jpeg_set_dec_src(ctx, &src_buf->vb2_buf, &bs);
 	if (mtk_jpeg_set_dec_dst(ctx,
 				 &jpeg_src_buf->dec_param,
@@ -1266,6 +1263,9 @@ static void mtk_jpegdec_worker(struct work_struct *work)
 		goto setdst_end;
 	}
 
+	schedule_delayed_work(&comp_jpeg[hw_id]->job_timeout_work,
+			      msecs_to_jiffies(MTK_JPEG_HW_TIMEOUT_MSEC));
+
 	spin_lock_irqsave(&comp_jpeg[hw_id]->hw_lock, flags);
 	ctx->total_frame_num++;
 	mtk_jpeg_dec_reset(comp_jpeg[hw_id]->reg_base);
@@ -1330,13 +1330,13 @@ static void mtk_jpeg_dec_device_run(void *priv)
 	if (ret < 0)
 		goto dec_end;
 
-	schedule_delayed_work(&jpeg->job_timeout_work,
-			      msecs_to_jiffies(MTK_JPEG_HW_TIMEOUT_MSEC));
-
 	mtk_jpeg_set_dec_src(ctx, &src_buf->vb2_buf, &bs);
 	if (mtk_jpeg_set_dec_dst(ctx, &jpeg_src_buf->dec_param, &dst_buf->vb2_buf, &fb))
 		goto dec_end;
 
+	schedule_delayed_work(&jpeg->job_timeout_work,
+			      msecs_to_jiffies(MTK_JPEG_HW_TIMEOUT_MSEC));
+
 	spin_lock_irqsave(&jpeg->hw_lock, flags);
 	mtk_jpeg_dec_reset(jpeg->reg_base);
 	mtk_jpeg_dec_set_config(jpeg->reg_base,
  
Zheng Hacker Oct. 20, 2023, 2:51 a.m. UTC | #19
Thanks for your patch. I think this should fix the problem. As I have
no experience in reverting, can I submit the patch with your fix as
well as reverting my fix?

Best regards,
Zheng

Dmitry Osipenko <dmitry.osipenko@collabora.com> 于2023年10月20日周五 03:56写道:
>
> On 10/8/23 12:13, Zheng Hacker wrote:
> > Dmitry Osipenko <dmitry.osipenko@collabora.com> 于2023年9月20日周三 02:24写道:
> >>
> >> On 8/31/23 11:18, Zheng Hacker wrote:
> >>>> The v4l2_m2m_ctx_release() already should wait for the job_timeout_work
> >>>> completion or for the interrupt fire. Apparently it doesn't work in
> >>>> yours case. You'll need to debug why v4l job or job_timeout_work is
> >>>> running after v4l2_m2m_ctx_release(), it shouldn't happen.
> >>>>
> >>> Yes, v4l2_m2m_cancel_job waits for m2m_ctx->job_flags to be  ~TRANS_RUNNING,
> >>> the mtk_jpeg_job_timeout_work will finally invoke v4l2_m2m_job_finish
> >>> to trigger that.
> >>>
> >>> However, this is not the only path to call v4l2_m2m_job_finish. Here
> >>> is a invoking chain:
> >>> v4l_streamon
> >>>   ->v4l2_m2m_ioctl_streamon
> >>>     ->v4l2_m2m_streamon
> >>>       ->v4l2_m2m_try_schedule
> >>>         ->v4l2_m2m_try_run
> >>>           ->mtk_jpeg_dec_device_run
> >>>             ->schedule_delayed_work(&jpeg->job_timeout_work...
> >>>             ->error path goto dec_end
> >>>             ->v4l2_m2m_job_finish
> >>>
> >>> In some specific situation, it starts the worker and also calls
> >>> v4l2_m2m_job_finish, which might
> >>> make v4l2_m2m_cancel_job continues.
> >>
> >> Then the error path should cancel the job_timeout_work, or better job
> >> needs to be run after the dec/enc has been started and not before.
> >>
> >
> > Hi,
> >
> > Sorry for my late reply for I just went on a long vacation.
> >
> > Get it. I'll write another patch and change the summary to the lack of
> > canceling job in error path.
> >
> >> Looking further at the code, I'm confused by this hunk:
> >>
> >>         mtk_jpeg_dec_start(comp_jpeg[hw_id]->reg_base);
> >>         v4l2_m2m_job_finish(jpeg->m2m_dev, ctx->fh.m2m_ctx);
> >>
> >> The job should be marked as finished when h/w has finished processing
> >> the job and not right after the job has been started. So the job is
> >> always completed and mtk_jpeg_job_timeout_work() doesn't work as
> >> expected, am I missing something?
> >
> > After reading the code I still don't know. I didn't see any function
> > like mtk_jpeg_dec_end. The same thing
> > happens on mtk_jpeg_enc_start. I think I'd better fix the first
> > problem and wait for someone familiar with
> > the second part.
>
> I missed that the code mentioned above is related to the multi-core hw version, while you care about single-core. Nevertheless, the multi-core device_run() looks incorrect,
>
> So, the error code paths need to be corrected. Please try to revert yours fix and test this change:
>
> diff --git a/drivers/media/platform/mediatek/jpeg/mtk_jpeg_core.c b/drivers/media/platform/mediatek/jpeg/mtk_jpeg_core.c
> index 0051f372a66c..fd3b0587fcad 100644
> --- a/drivers/media/platform/mediatek/jpeg/mtk_jpeg_core.c
> +++ b/drivers/media/platform/mediatek/jpeg/mtk_jpeg_core.c
> @@ -1254,9 +1254,6 @@ static void mtk_jpegdec_worker(struct work_struct *work)
>         v4l2_m2m_src_buf_remove(ctx->fh.m2m_ctx);
>         v4l2_m2m_dst_buf_remove(ctx->fh.m2m_ctx);
>
> -       schedule_delayed_work(&comp_jpeg[hw_id]->job_timeout_work,
> -                             msecs_to_jiffies(MTK_JPEG_HW_TIMEOUT_MSEC));
> -
>         mtk_jpeg_set_dec_src(ctx, &src_buf->vb2_buf, &bs);
>         if (mtk_jpeg_set_dec_dst(ctx,
>                                  &jpeg_src_buf->dec_param,
> @@ -1266,6 +1263,9 @@ static void mtk_jpegdec_worker(struct work_struct *work)
>                 goto setdst_end;
>         }
>
> +       schedule_delayed_work(&comp_jpeg[hw_id]->job_timeout_work,
> +                             msecs_to_jiffies(MTK_JPEG_HW_TIMEOUT_MSEC));
> +
>         spin_lock_irqsave(&comp_jpeg[hw_id]->hw_lock, flags);
>         ctx->total_frame_num++;
>         mtk_jpeg_dec_reset(comp_jpeg[hw_id]->reg_base);
> @@ -1330,13 +1330,13 @@ static void mtk_jpeg_dec_device_run(void *priv)
>         if (ret < 0)
>                 goto dec_end;
>
> -       schedule_delayed_work(&jpeg->job_timeout_work,
> -                             msecs_to_jiffies(MTK_JPEG_HW_TIMEOUT_MSEC));
> -
>         mtk_jpeg_set_dec_src(ctx, &src_buf->vb2_buf, &bs);
>         if (mtk_jpeg_set_dec_dst(ctx, &jpeg_src_buf->dec_param, &dst_buf->vb2_buf, &fb))
>                 goto dec_end;
>
> +       schedule_delayed_work(&jpeg->job_timeout_work,
> +                             msecs_to_jiffies(MTK_JPEG_HW_TIMEOUT_MSEC));
> +
>         spin_lock_irqsave(&jpeg->hw_lock, flags);
>         mtk_jpeg_dec_reset(jpeg->reg_base);
>         mtk_jpeg_dec_set_config(jpeg->reg_base,
>
> --
> Best regards,
> Dmitry
>
  

Patch

diff --git a/drivers/media/platform/mediatek/jpeg/mtk_jpeg_core.c b/drivers/media/platform/mediatek/jpeg/mtk_jpeg_core.c
index 0051f372a66c..6069ecf420b0 100644
--- a/drivers/media/platform/mediatek/jpeg/mtk_jpeg_core.c
+++ b/drivers/media/platform/mediatek/jpeg/mtk_jpeg_core.c
@@ -1816,6 +1816,7 @@  static void mtk_jpeg_remove(struct platform_device *pdev)
 {
 	struct mtk_jpeg_dev *jpeg = platform_get_drvdata(pdev);
 
+	cancel_delayed_work_sync(&jpeg->job_timeout_work);
 	pm_runtime_disable(&pdev->dev);
 	video_unregister_device(jpeg->vdev);
 	v4l2_m2m_release(jpeg->m2m_dev);