diff mbox series

media: hantro: Check whether reset op is defined before use

Message ID	20230824013858.303105-1-marex@denx.de (mailing list archive)
State	Accepted
Delegated to:	Hans Verkuil
Headers	sender: marex@denx.de) by phobos.denx.de (Postfix) with ESMTPSA id 78638807C4; Thu, 24 Aug 2023 03:39:09 +0200 (CEST) From: Marek Vasut <marex@denx.de> To: linux-media@vger.kernel.org Cc: Marek Vasut <marex@denx.de>, Adam Ford <aford173@gmail.com>, Benjamin Gaignard <benjamin.gaignard@collabora.com>, Ezequiel Garcia <ezequiel@vanguardiasur.com.ar>, Mauro Carvalho Chehab <mchehab@kernel.org>, Philipp Zabel <p.zabel@pengutronix.de>, linux-rockchip@lists.infradead.org Subject: [PATCH] media: hantro: Check whether reset op is defined before use Date: Thu, 24 Aug 2023 03:38:58 +0200 Message-Id: <20230824013858.303105-1-marex@denx.de> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Precedence: bulk
Series	media: hantro: Check whether reset op is defined before use \| media: hantro: Check whether reset op is defined before use

Commit Message

Marek Vasut Aug. 24, 2023, 1:38 a.m. UTC

  The i.MX8MM/N/P does not define the .reset op since reset of the VPU is
done by genpd. Check whether the .reset op is defined before calling it
to avoid NULL pointer dereference.

Note that the Fixes tag is set to the commit which removed the reset op
from i.MX8M Hantro G2 implementation, this is because before this commit
all the implementations did define the .reset op.

Fixes: 6971efb70ac3 ("media: hantro: Allow i.MX8MQ G1 and G2 to run independently")
Signed-off-by: Marek Vasut <marex@denx.de>
---
Cc: Adam Ford <aford173@gmail.com>
Cc: Benjamin Gaignard <benjamin.gaignard@collabora.com>
Cc: Ezequiel Garcia <ezequiel@vanguardiasur.com.ar>
Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
Cc: Philipp Zabel <p.zabel@pengutronix.de>
Cc: linux-media@vger.kernel.org
Cc: linux-rockchip@lists.infradead.org
---
 drivers/media/platform/verisilicon/hantro_drv.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

Comments

Chen-Yu Tsai Aug. 24, 2023, 2:45 a.m. UTC | #1

On Thu, Aug 24, 2023 at 9:39 AM Marek Vasut <marex@denx.de> wrote:
>
> The i.MX8MM/N/P does not define the .reset op since reset of the VPU is
> done by genpd. Check whether the .reset op is defined before calling it
> to avoid NULL pointer dereference.
>
> Note that the Fixes tag is set to the commit which removed the reset op
> from i.MX8M Hantro G2 implementation, this is because before this commit
> all the implementations did define the .reset op.
>
> Fixes: 6971efb70ac3 ("media: hantro: Allow i.MX8MQ G1 and G2 to run independently")
> Signed-off-by: Marek Vasut <marex@denx.de>

Had the same change in my local tree, so

Reviewed-by: Chen-Yu Tsai <wenst@chromium.org>
Tested-by: Chen-Yu Tsai <wenst@chromium.org>

> ---
> Cc: Adam Ford <aford173@gmail.com>
> Cc: Benjamin Gaignard <benjamin.gaignard@collabora.com>
> Cc: Ezequiel Garcia <ezequiel@vanguardiasur.com.ar>
> Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
> Cc: Philipp Zabel <p.zabel@pengutronix.de>
> Cc: linux-media@vger.kernel.org
> Cc: linux-rockchip@lists.infradead.org
> ---
>  drivers/media/platform/verisilicon/hantro_drv.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/media/platform/verisilicon/hantro_drv.c b/drivers/media/platform/verisilicon/hantro_drv.c
> index 423fc85d79ee3..50ec24c753e9e 100644
> --- a/drivers/media/platform/verisilicon/hantro_drv.c
> +++ b/drivers/media/platform/verisilicon/hantro_drv.c
> @@ -125,7 +125,8 @@ void hantro_watchdog(struct work_struct *work)
>         ctx = v4l2_m2m_get_curr_priv(vpu->m2m_dev);
>         if (ctx) {
>                 vpu_err("frame processing timed out!\n");
> -               ctx->codec_ops->reset(ctx);
> +               if (ctx->codec_ops->reset)
> +                       ctx->codec_ops->reset(ctx);
>                 hantro_job_finish(vpu, ctx, VB2_BUF_STATE_ERROR);
>         }
>  }
> --
> 2.40.1
>
>
> _______________________________________________
> Linux-rockchip mailing list
> Linux-rockchip@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/linux-rockchip

Adam Ford Aug. 24, 2023, 10:39 a.m. UTC | #2

On Wed, Aug 23, 2023 at 8:39 PM Marek Vasut <marex@denx.de> wrote:
>
> The i.MX8MM/N/P does not define the .reset op since reset of the VPU is
> done by genpd. Check whether the .reset op is defined before calling it
> to avoid NULL pointer dereference.
>
> Note that the Fixes tag is set to the commit which removed the reset op
> from i.MX8M Hantro G2 implementation, this is because before this commit
> all the implementations did define the .reset op.

I am surprised I didn't have issues when I was testing the 8MQ and
8MM, but this makes sense.
>
> Fixes: 6971efb70ac3 ("media: hantro: Allow i.MX8MQ G1 and G2 to run independently")
> Signed-off-by: Marek Vasut <marex@denx.de>

Reviewed-by: Adam Ford <aford173@gmail.com>
> ---
> Cc: Adam Ford <aford173@gmail.com>
> Cc: Benjamin Gaignard <benjamin.gaignard@collabora.com>
> Cc: Ezequiel Garcia <ezequiel@vanguardiasur.com.ar>
> Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
> Cc: Philipp Zabel <p.zabel@pengutronix.de>
> Cc: linux-media@vger.kernel.org
> Cc: linux-rockchip@lists.infradead.org
> ---
>  drivers/media/platform/verisilicon/hantro_drv.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/media/platform/verisilicon/hantro_drv.c b/drivers/media/platform/verisilicon/hantro_drv.c
> index 423fc85d79ee3..50ec24c753e9e 100644
> --- a/drivers/media/platform/verisilicon/hantro_drv.c
> +++ b/drivers/media/platform/verisilicon/hantro_drv.c
> @@ -125,7 +125,8 @@ void hantro_watchdog(struct work_struct *work)
>         ctx = v4l2_m2m_get_curr_priv(vpu->m2m_dev);
>         if (ctx) {
>                 vpu_err("frame processing timed out!\n");
> -               ctx->codec_ops->reset(ctx);
> +               if (ctx->codec_ops->reset)
> +                       ctx->codec_ops->reset(ctx);
>                 hantro_job_finish(vpu, ctx, VB2_BUF_STATE_ERROR);
>         }
>  }
> --
> 2.40.1
>

Marek Vasut Aug. 24, 2023, 1:08 p.m. UTC | #3

On 8/24/23 12:39, Adam Ford wrote:
> On Wed, Aug 23, 2023 at 8:39 PM Marek Vasut <marex@denx.de> wrote:
>>
>> The i.MX8MM/N/P does not define the .reset op since reset of the VPU is
>> done by genpd. Check whether the .reset op is defined before calling it
>> to avoid NULL pointer dereference.
>>
>> Note that the Fixes tag is set to the commit which removed the reset op
>> from i.MX8M Hantro G2 implementation, this is because before this commit
>> all the implementations did define the .reset op.
> 
> I am surprised I didn't have issues when I was testing the 8MQ and
> 8MM, but this makes sense.

You need to trigger the VPU watchdog to trigger the crash, that means 
you have to get the VPU into some weird state where it fails to decode 
frame. Then it triggers the reset and ... boom.

See this patch, that contains a gstreamer invocation to generate such 
trigger condition input data:

[PATCH] media: verisilicon: Do not enable G2 postproc downscale if 
source is narrower than destination

"
To generate input test data to trigger this bug, use e.g.:
$ gst-launch-1.0 videotestsrc ! 
video/x-raw,width=272,height=256,format=I420 ! \
                  vp9enc ! matroskamux ! filesink location=/tmp/test.vp9
To trigger the bug upon decoding (note that the NV12 must be forced, as
that assures the output data would pass the G2 postproc):
$ gst-launch-1.0 filesrc location=/tmp/test.vp9 ! matroskademux ! 
vp9parse ! \
                  v4l2slvp9dec ! video/x-raw,format=NV12 ! videoconvert 
! fbdevsink
"

Chen-Yu Tsai Aug. 25, 2023, 7:09 a.m. UTC | #4

On Thu, Aug 24, 2023 at 9:08 PM Marek Vasut <marex@denx.de> wrote:
>
> On 8/24/23 12:39, Adam Ford wrote:
> > On Wed, Aug 23, 2023 at 8:39 PM Marek Vasut <marex@denx.de> wrote:
> >>
> >> The i.MX8MM/N/P does not define the .reset op since reset of the VPU is
> >> done by genpd. Check whether the .reset op is defined before calling it
> >> to avoid NULL pointer dereference.
> >>
> >> Note that the Fixes tag is set to the commit which removed the reset op
> >> from i.MX8M Hantro G2 implementation, this is because before this commit
> >> all the implementations did define the .reset op.
> >
> > I am surprised I didn't have issues when I was testing the 8MQ and
> > 8MM, but this makes sense.
>
> You need to trigger the VPU watchdog to trigger the crash, that means
> you have to get the VPU into some weird state where it fails to decode
> frame. Then it triggers the reset and ... boom.
>
> See this patch, that contains a gstreamer invocation to generate such
> trigger condition input data:
>
> [PATCH] media: verisilicon: Do not enable G2 postproc downscale if
> source is narrower than destination
>
> "
> To generate input test data to trigger this bug, use e.g.:
> $ gst-launch-1.0 videotestsrc !
> video/x-raw,width=272,height=256,format=I420 ! \
>                   vp9enc ! matroskamux ! filesink location=/tmp/test.vp9
> To trigger the bug upon decoding (note that the NV12 must be forced, as
> that assures the output data would pass the G2 postproc):
> $ gst-launch-1.0 filesrc location=/tmp/test.vp9 ! matroskademux !
> vp9parse ! \
>                   v4l2slvp9dec ! video/x-raw,format=NV12 ! videoconvert
> ! fbdevsink
> "

Does it completely recover afterwards? In my previous trials the hardware
ended up in some bizzare state: while decoding succeeds, the output's md5sum
didn't match up.

ChenYu

Marek Vasut Aug. 25, 2023, 8:33 a.m. UTC | #5

On 8/25/23 09:09, Chen-Yu Tsai wrote:
> On Thu, Aug 24, 2023 at 9:08 PM Marek Vasut <marex@denx.de> wrote:
>>
>> On 8/24/23 12:39, Adam Ford wrote:
>>> On Wed, Aug 23, 2023 at 8:39 PM Marek Vasut <marex@denx.de> wrote:
>>>>
>>>> The i.MX8MM/N/P does not define the .reset op since reset of the VPU is
>>>> done by genpd. Check whether the .reset op is defined before calling it
>>>> to avoid NULL pointer dereference.
>>>>
>>>> Note that the Fixes tag is set to the commit which removed the reset op
>>>> from i.MX8M Hantro G2 implementation, this is because before this commit
>>>> all the implementations did define the .reset op.
>>>
>>> I am surprised I didn't have issues when I was testing the 8MQ and
>>> 8MM, but this makes sense.
>>
>> You need to trigger the VPU watchdog to trigger the crash, that means
>> you have to get the VPU into some weird state where it fails to decode
>> frame. Then it triggers the reset and ... boom.
>>
>> See this patch, that contains a gstreamer invocation to generate such
>> trigger condition input data:
>>
>> [PATCH] media: verisilicon: Do not enable G2 postproc downscale if
>> source is narrower than destination
>>
>> "
>> To generate input test data to trigger this bug, use e.g.:
>> $ gst-launch-1.0 videotestsrc !
>> video/x-raw,width=272,height=256,format=I420 ! \
>>                    vp9enc ! matroskamux ! filesink location=/tmp/test.vp9
>> To trigger the bug upon decoding (note that the NV12 must be forced, as
>> that assures the output data would pass the G2 postproc):
>> $ gst-launch-1.0 filesrc location=/tmp/test.vp9 ! matroskademux !
>> vp9parse ! \
>>                    v4l2slvp9dec ! video/x-raw,format=NV12 ! videoconvert
>> ! fbdevsink
>> "
> 
> Does it completely recover afterwards? In my previous trials the hardware
> ended up in some bizzare state: while decoding succeeds, the output's md5sum
> didn't match up.

Have you got a testcase that triggers this, one I can try ?

I am not entirely sure whether this is happening here as well or not, 
but I can imagine that the power domain went down and back up between 
tests, so the VPU would be power cycled (and therefore reset) that way. 
So, I think it is worth testing that.

Chen-Yu Tsai Aug. 25, 2023, 8:52 a.m. UTC | #6

On Fri, Aug 25, 2023 at 4:33 PM Marek Vasut <marex@denx.de> wrote:
>
> On 8/25/23 09:09, Chen-Yu Tsai wrote:
> > On Thu, Aug 24, 2023 at 9:08 PM Marek Vasut <marex@denx.de> wrote:
> >>
> >> On 8/24/23 12:39, Adam Ford wrote:
> >>> On Wed, Aug 23, 2023 at 8:39 PM Marek Vasut <marex@denx.de> wrote:
> >>>>
> >>>> The i.MX8MM/N/P does not define the .reset op since reset of the VPU is
> >>>> done by genpd. Check whether the .reset op is defined before calling it
> >>>> to avoid NULL pointer dereference.
> >>>>
> >>>> Note that the Fixes tag is set to the commit which removed the reset op
> >>>> from i.MX8M Hantro G2 implementation, this is because before this commit
> >>>> all the implementations did define the .reset op.
> >>>
> >>> I am surprised I didn't have issues when I was testing the 8MQ and
> >>> 8MM, but this makes sense.
> >>
> >> You need to trigger the VPU watchdog to trigger the crash, that means
> >> you have to get the VPU into some weird state where it fails to decode
> >> frame. Then it triggers the reset and ... boom.
> >>
> >> See this patch, that contains a gstreamer invocation to generate such
> >> trigger condition input data:
> >>
> >> [PATCH] media: verisilicon: Do not enable G2 postproc downscale if
> >> source is narrower than destination
> >>
> >> "
> >> To generate input test data to trigger this bug, use e.g.:
> >> $ gst-launch-1.0 videotestsrc !
> >> video/x-raw,width=272,height=256,format=I420 ! \
> >>                    vp9enc ! matroskamux ! filesink location=/tmp/test.vp9
> >> To trigger the bug upon decoding (note that the NV12 must be forced, as
> >> that assures the output data would pass the G2 postproc):
> >> $ gst-launch-1.0 filesrc location=/tmp/test.vp9 ! matroskademux !
> >> vp9parse ! \
> >>                    v4l2slvp9dec ! video/x-raw,format=NV12 ! videoconvert
> >> ! fbdevsink
> >> "
> >
> > Does it completely recover afterwards? In my previous trials the hardware
> > ended up in some bizzare state: while decoding succeeds, the output's md5sum
> > didn't match up.
>
> Have you got a testcase that triggers this, one I can try ?
>
> I am not entirely sure whether this is happening here as well or not,
> but I can imagine that the power domain went down and back up between
> tests, so the VPU would be power cycled (and therefore reset) that way.
> So, I think it is worth testing that.

This was last year while I was writing HEVC decoding code for Chromium.
IIRC the SAODBLK_A_MainConcept_4 test vector from the official HEVC test
suite does cause our stack to crash, but Gstreamer seemed to handle it
OK. It could be that the Chromium decoder stack is passing bad values to
the decoder.


ChenYu

Marek Vasut Aug. 26, 2023, 9:44 p.m. UTC | #7

On 8/25/23 10:52, Chen-Yu Tsai wrote:
> On Fri, Aug 25, 2023 at 4:33 PM Marek Vasut <marex@denx.de> wrote:
>>
>> On 8/25/23 09:09, Chen-Yu Tsai wrote:
>>> On Thu, Aug 24, 2023 at 9:08 PM Marek Vasut <marex@denx.de> wrote:
>>>>
>>>> On 8/24/23 12:39, Adam Ford wrote:
>>>>> On Wed, Aug 23, 2023 at 8:39 PM Marek Vasut <marex@denx.de> wrote:
>>>>>>
>>>>>> The i.MX8MM/N/P does not define the .reset op since reset of the VPU is
>>>>>> done by genpd. Check whether the .reset op is defined before calling it
>>>>>> to avoid NULL pointer dereference.
>>>>>>
>>>>>> Note that the Fixes tag is set to the commit which removed the reset op
>>>>>> from i.MX8M Hantro G2 implementation, this is because before this commit
>>>>>> all the implementations did define the .reset op.
>>>>>
>>>>> I am surprised I didn't have issues when I was testing the 8MQ and
>>>>> 8MM, but this makes sense.
>>>>
>>>> You need to trigger the VPU watchdog to trigger the crash, that means
>>>> you have to get the VPU into some weird state where it fails to decode
>>>> frame. Then it triggers the reset and ... boom.
>>>>
>>>> See this patch, that contains a gstreamer invocation to generate such
>>>> trigger condition input data:
>>>>
>>>> [PATCH] media: verisilicon: Do not enable G2 postproc downscale if
>>>> source is narrower than destination
>>>>
>>>> "
>>>> To generate input test data to trigger this bug, use e.g.:
>>>> $ gst-launch-1.0 videotestsrc !
>>>> video/x-raw,width=272,height=256,format=I420 ! \
>>>>                     vp9enc ! matroskamux ! filesink location=/tmp/test.vp9
>>>> To trigger the bug upon decoding (note that the NV12 must be forced, as
>>>> that assures the output data would pass the G2 postproc):
>>>> $ gst-launch-1.0 filesrc location=/tmp/test.vp9 ! matroskademux !
>>>> vp9parse ! \
>>>>                     v4l2slvp9dec ! video/x-raw,format=NV12 ! videoconvert
>>>> ! fbdevsink
>>>> "
>>>
>>> Does it completely recover afterwards? In my previous trials the hardware
>>> ended up in some bizzare state: while decoding succeeds, the output's md5sum
>>> didn't match up.
>>
>> Have you got a testcase that triggers this, one I can try ?
>>
>> I am not entirely sure whether this is happening here as well or not,
>> but I can imagine that the power domain went down and back up between
>> tests, so the VPU would be power cycled (and therefore reset) that way.
>> So, I think it is worth testing that.
> 
> This was last year while I was writing HEVC decoding code for Chromium.
> IIRC the SAODBLK_A_MainConcept_4 test vector from the official HEVC test
> suite does cause our stack to crash, but Gstreamer seemed to handle it
> OK. It could be that the Chromium decoder stack is passing bad values to
> the decoder.

That can be easily tested with ftrace enabled. I was just tracking down 
an issue with gstreamer and added the following patch to the hantro 
driver. Then:

echo > /sys/kernel/debug/tracing/trace
<run fail test>
cat /sys/kernel/debug/tracing/trace > /tmp/fail.trace
echo > /sys/kernel/debug/tracing/trace
<run pass test>
cat /sys/kernel/debug/tracing/trace > /tmp/pass.trace
# remove time stamps etc.
diff /tmp/{fail,pass}.trace

You should see whether some register programming differs between 
gstreamer and chromium.

diff --git a/drivers/media/platform/verisilicon/hantro.h 
b/drivers/media/platform/verisilicon/hantro.h
index dea35a501ba30..529f1ab478ec8 100644
--- a/drivers/media/platform/verisilicon/hantro.h
+++ b/drivers/media/platform/verisilicon/hantro.h
@@ -353,8 +353,7 @@ extern int hantro_debug;

  #define vpu_debug(level, fmt, args...)                         \
         do {                                                    \
-               if (hantro_debug & BIT(level))          \
-                       pr_info("%s:%d: " fmt,                  \
+               trace_printk("%s:%d: " fmt,                     \
                                  __func__, __LINE__, ##args);   \
         } while (0)

Chen-Yu Tsai Aug. 30, 2023, 3:38 a.m. UTC | #8

On Sun, Aug 27, 2023 at 5:44 AM Marek Vasut <marex@denx.de> wrote:
>
> On 8/25/23 10:52, Chen-Yu Tsai wrote:
> > On Fri, Aug 25, 2023 at 4:33 PM Marek Vasut <marex@denx.de> wrote:
> >>
> >> On 8/25/23 09:09, Chen-Yu Tsai wrote:
> >>> On Thu, Aug 24, 2023 at 9:08 PM Marek Vasut <marex@denx.de> wrote:
> >>>>
> >>>> On 8/24/23 12:39, Adam Ford wrote:
> >>>>> On Wed, Aug 23, 2023 at 8:39 PM Marek Vasut <marex@denx.de> wrote:
> >>>>>>
> >>>>>> The i.MX8MM/N/P does not define the .reset op since reset of the VPU is
> >>>>>> done by genpd. Check whether the .reset op is defined before calling it
> >>>>>> to avoid NULL pointer dereference.
> >>>>>>
> >>>>>> Note that the Fixes tag is set to the commit which removed the reset op
> >>>>>> from i.MX8M Hantro G2 implementation, this is because before this commit
> >>>>>> all the implementations did define the .reset op.
> >>>>>
> >>>>> I am surprised I didn't have issues when I was testing the 8MQ and
> >>>>> 8MM, but this makes sense.
> >>>>
> >>>> You need to trigger the VPU watchdog to trigger the crash, that means
> >>>> you have to get the VPU into some weird state where it fails to decode
> >>>> frame. Then it triggers the reset and ... boom.
> >>>>
> >>>> See this patch, that contains a gstreamer invocation to generate such
> >>>> trigger condition input data:
> >>>>
> >>>> [PATCH] media: verisilicon: Do not enable G2 postproc downscale if
> >>>> source is narrower than destination
> >>>>
> >>>> "
> >>>> To generate input test data to trigger this bug, use e.g.:
> >>>> $ gst-launch-1.0 videotestsrc !
> >>>> video/x-raw,width=272,height=256,format=I420 ! \
> >>>>                     vp9enc ! matroskamux ! filesink location=/tmp/test.vp9
> >>>> To trigger the bug upon decoding (note that the NV12 must be forced, as
> >>>> that assures the output data would pass the G2 postproc):
> >>>> $ gst-launch-1.0 filesrc location=/tmp/test.vp9 ! matroskademux !
> >>>> vp9parse ! \
> >>>>                     v4l2slvp9dec ! video/x-raw,format=NV12 ! videoconvert
> >>>> ! fbdevsink
> >>>> "
> >>>
> >>> Does it completely recover afterwards? In my previous trials the hardware
> >>> ended up in some bizzare state: while decoding succeeds, the output's md5sum
> >>> didn't match up.
> >>
> >> Have you got a testcase that triggers this, one I can try ?
> >>
> >> I am not entirely sure whether this is happening here as well or not,
> >> but I can imagine that the power domain went down and back up between
> >> tests, so the VPU would be power cycled (and therefore reset) that way.
> >> So, I think it is worth testing that.
> >
> > This was last year while I was writing HEVC decoding code for Chromium.
> > IIRC the SAODBLK_A_MainConcept_4 test vector from the official HEVC test
> > suite does cause our stack to crash, but Gstreamer seemed to handle it
> > OK. It could be that the Chromium decoder stack is passing bad values to
> > the decoder.
>
> That can be easily tested with ftrace enabled. I was just tracking down
> an issue with gstreamer and added the following patch to the hantro
> driver. Then:
>
> echo > /sys/kernel/debug/tracing/trace
> <run fail test>
> cat /sys/kernel/debug/tracing/trace > /tmp/fail.trace
> echo > /sys/kernel/debug/tracing/trace
> <run pass test>
> cat /sys/kernel/debug/tracing/trace > /tmp/pass.trace
> # remove time stamps etc.
> diff /tmp/{fail,pass}.trace
>
> You should see whether some register programming differs between
> gstreamer and chromium.

I ended up using VISL to compare the controls set. I did find a bug.
It still hard hangs after a couple frames, so I guess I'd need to use
your method, but do printk instead.

BTW, I wonder if we shouldn't add a reset op, if only just to stop the
hardware? That is, do the same two register writes as in the Hantro G2
interrupt handler.

ChenYu


> diff --git a/drivers/media/platform/verisilicon/hantro.h
> b/drivers/media/platform/verisilicon/hantro.h
> index dea35a501ba30..529f1ab478ec8 100644
> --- a/drivers/media/platform/verisilicon/hantro.h
> +++ b/drivers/media/platform/verisilicon/hantro.h
> @@ -353,8 +353,7 @@ extern int hantro_debug;
>
>   #define vpu_debug(level, fmt, args...)                         \
>          do {                                                    \
> -               if (hantro_debug & BIT(level))          \
> -                       pr_info("%s:%d: " fmt,                  \
> +               trace_printk("%s:%d: " fmt,                     \
>                                   __func__, __LINE__, ##args);   \
>          } while (0)

Marek Vasut Aug. 30, 2023, 7:13 p.m. UTC | #9

On 8/30/23 05:38, Chen-Yu Tsai wrote:
> On Sun, Aug 27, 2023 at 5:44 AM Marek Vasut <marex@denx.de> wrote:
>>
>> On 8/25/23 10:52, Chen-Yu Tsai wrote:
>>> On Fri, Aug 25, 2023 at 4:33 PM Marek Vasut <marex@denx.de> wrote:
>>>>
>>>> On 8/25/23 09:09, Chen-Yu Tsai wrote:
>>>>> On Thu, Aug 24, 2023 at 9:08 PM Marek Vasut <marex@denx.de> wrote:
>>>>>>
>>>>>> On 8/24/23 12:39, Adam Ford wrote:
>>>>>>> On Wed, Aug 23, 2023 at 8:39 PM Marek Vasut <marex@denx.de> wrote:
>>>>>>>>
>>>>>>>> The i.MX8MM/N/P does not define the .reset op since reset of the VPU is
>>>>>>>> done by genpd. Check whether the .reset op is defined before calling it
>>>>>>>> to avoid NULL pointer dereference.
>>>>>>>>
>>>>>>>> Note that the Fixes tag is set to the commit which removed the reset op
>>>>>>>> from i.MX8M Hantro G2 implementation, this is because before this commit
>>>>>>>> all the implementations did define the .reset op.
>>>>>>>
>>>>>>> I am surprised I didn't have issues when I was testing the 8MQ and
>>>>>>> 8MM, but this makes sense.
>>>>>>
>>>>>> You need to trigger the VPU watchdog to trigger the crash, that means
>>>>>> you have to get the VPU into some weird state where it fails to decode
>>>>>> frame. Then it triggers the reset and ... boom.
>>>>>>
>>>>>> See this patch, that contains a gstreamer invocation to generate such
>>>>>> trigger condition input data:
>>>>>>
>>>>>> [PATCH] media: verisilicon: Do not enable G2 postproc downscale if
>>>>>> source is narrower than destination
>>>>>>
>>>>>> "
>>>>>> To generate input test data to trigger this bug, use e.g.:
>>>>>> $ gst-launch-1.0 videotestsrc !
>>>>>> video/x-raw,width=272,height=256,format=I420 ! \
>>>>>>                      vp9enc ! matroskamux ! filesink location=/tmp/test.vp9
>>>>>> To trigger the bug upon decoding (note that the NV12 must be forced, as
>>>>>> that assures the output data would pass the G2 postproc):
>>>>>> $ gst-launch-1.0 filesrc location=/tmp/test.vp9 ! matroskademux !
>>>>>> vp9parse ! \
>>>>>>                      v4l2slvp9dec ! video/x-raw,format=NV12 ! videoconvert
>>>>>> ! fbdevsink
>>>>>> "
>>>>>
>>>>> Does it completely recover afterwards? In my previous trials the hardware
>>>>> ended up in some bizzare state: while decoding succeeds, the output's md5sum
>>>>> didn't match up.
>>>>
>>>> Have you got a testcase that triggers this, one I can try ?
>>>>
>>>> I am not entirely sure whether this is happening here as well or not,
>>>> but I can imagine that the power domain went down and back up between
>>>> tests, so the VPU would be power cycled (and therefore reset) that way.
>>>> So, I think it is worth testing that.
>>>
>>> This was last year while I was writing HEVC decoding code for Chromium.
>>> IIRC the SAODBLK_A_MainConcept_4 test vector from the official HEVC test
>>> suite does cause our stack to crash, but Gstreamer seemed to handle it
>>> OK. It could be that the Chromium decoder stack is passing bad values to
>>> the decoder.
>>
>> That can be easily tested with ftrace enabled. I was just tracking down
>> an issue with gstreamer and added the following patch to the hantro
>> driver. Then:
>>
>> echo > /sys/kernel/debug/tracing/trace
>> <run fail test>
>> cat /sys/kernel/debug/tracing/trace > /tmp/fail.trace
>> echo > /sys/kernel/debug/tracing/trace
>> <run pass test>
>> cat /sys/kernel/debug/tracing/trace > /tmp/pass.trace
>> # remove time stamps etc.
>> diff /tmp/{fail,pass}.trace
>>
>> You should see whether some register programming differs between
>> gstreamer and chromium.
> 
> I ended up using VISL to compare the controls set. I did find a bug.
> It still hard hangs after a couple frames, so I guess I'd need to use
> your method, but do printk instead.
> 
> BTW, I wonder if we shouldn't add a reset op, if only just to stop the
> hardware? That is, do the same two register writes as in the Hantro G2
> interrupt handler.

You mean these two ?

38         vdpu_write(vpu, 0, G2_REG_INTERRUPT);
39         vdpu_write(vpu, G2_REG_CONFIG_DEC_CLK_GATE_E, G2_REG_CONFIG);

As far as I understand this, that only clears IRQ and gates the clock 
off, but it doesn't reset the IP state, does it ?

Chen-Yu Tsai Aug. 31, 2023, 3:26 a.m. UTC | #10

On Thu, Aug 31, 2023 at 3:13 AM Marek Vasut <marex@denx.de> wrote:
>
> On 8/30/23 05:38, Chen-Yu Tsai wrote:
> > On Sun, Aug 27, 2023 at 5:44 AM Marek Vasut <marex@denx.de> wrote:
> >>
> >> On 8/25/23 10:52, Chen-Yu Tsai wrote:
> >>> On Fri, Aug 25, 2023 at 4:33 PM Marek Vasut <marex@denx.de> wrote:
> >>>>
> >>>> On 8/25/23 09:09, Chen-Yu Tsai wrote:
> >>>>> On Thu, Aug 24, 2023 at 9:08 PM Marek Vasut <marex@denx.de> wrote:
> >>>>>>
> >>>>>> On 8/24/23 12:39, Adam Ford wrote:
> >>>>>>> On Wed, Aug 23, 2023 at 8:39 PM Marek Vasut <marex@denx.de> wrote:
> >>>>>>>>
> >>>>>>>> The i.MX8MM/N/P does not define the .reset op since reset of the VPU is
> >>>>>>>> done by genpd. Check whether the .reset op is defined before calling it
> >>>>>>>> to avoid NULL pointer dereference.
> >>>>>>>>
> >>>>>>>> Note that the Fixes tag is set to the commit which removed the reset op
> >>>>>>>> from i.MX8M Hantro G2 implementation, this is because before this commit
> >>>>>>>> all the implementations did define the .reset op.
> >>>>>>>
> >>>>>>> I am surprised I didn't have issues when I was testing the 8MQ and
> >>>>>>> 8MM, but this makes sense.
> >>>>>>
> >>>>>> You need to trigger the VPU watchdog to trigger the crash, that means
> >>>>>> you have to get the VPU into some weird state where it fails to decode
> >>>>>> frame. Then it triggers the reset and ... boom.
> >>>>>>
> >>>>>> See this patch, that contains a gstreamer invocation to generate such
> >>>>>> trigger condition input data:
> >>>>>>
> >>>>>> [PATCH] media: verisilicon: Do not enable G2 postproc downscale if
> >>>>>> source is narrower than destination
> >>>>>>
> >>>>>> "
> >>>>>> To generate input test data to trigger this bug, use e.g.:
> >>>>>> $ gst-launch-1.0 videotestsrc !
> >>>>>> video/x-raw,width=272,height=256,format=I420 ! \
> >>>>>>                      vp9enc ! matroskamux ! filesink location=/tmp/test.vp9
> >>>>>> To trigger the bug upon decoding (note that the NV12 must be forced, as
> >>>>>> that assures the output data would pass the G2 postproc):
> >>>>>> $ gst-launch-1.0 filesrc location=/tmp/test.vp9 ! matroskademux !
> >>>>>> vp9parse ! \
> >>>>>>                      v4l2slvp9dec ! video/x-raw,format=NV12 ! videoconvert
> >>>>>> ! fbdevsink
> >>>>>> "
> >>>>>
> >>>>> Does it completely recover afterwards? In my previous trials the hardware
> >>>>> ended up in some bizzare state: while decoding succeeds, the output's md5sum
> >>>>> didn't match up.
> >>>>
> >>>> Have you got a testcase that triggers this, one I can try ?
> >>>>
> >>>> I am not entirely sure whether this is happening here as well or not,
> >>>> but I can imagine that the power domain went down and back up between
> >>>> tests, so the VPU would be power cycled (and therefore reset) that way.
> >>>> So, I think it is worth testing that.
> >>>
> >>> This was last year while I was writing HEVC decoding code for Chromium.
> >>> IIRC the SAODBLK_A_MainConcept_4 test vector from the official HEVC test
> >>> suite does cause our stack to crash, but Gstreamer seemed to handle it
> >>> OK. It could be that the Chromium decoder stack is passing bad values to
> >>> the decoder.
> >>
> >> That can be easily tested with ftrace enabled. I was just tracking down
> >> an issue with gstreamer and added the following patch to the hantro
> >> driver. Then:
> >>
> >> echo > /sys/kernel/debug/tracing/trace
> >> <run fail test>
> >> cat /sys/kernel/debug/tracing/trace > /tmp/fail.trace
> >> echo > /sys/kernel/debug/tracing/trace
> >> <run pass test>
> >> cat /sys/kernel/debug/tracing/trace > /tmp/pass.trace
> >> # remove time stamps etc.
> >> diff /tmp/{fail,pass}.trace
> >>
> >> You should see whether some register programming differs between
> >> gstreamer and chromium.
> >
> > I ended up using VISL to compare the controls set. I did find a bug.
> > It still hard hangs after a couple frames, so I guess I'd need to use
> > your method, but do printk instead.
> >
> > BTW, I wonder if we shouldn't add a reset op, if only just to stop the
> > hardware? That is, do the same two register writes as in the Hantro G2
> > interrupt handler.
>
> You mean these two ?
>
> 38         vdpu_write(vpu, 0, G2_REG_INTERRUPT);
> 39         vdpu_write(vpu, G2_REG_CONFIG_DEC_CLK_GATE_E, G2_REG_CONFIG);

Yes.

> As far as I understand this, that only clears IRQ and gates the clock
> off, but it doesn't reset the IP state, does it ?

That's right, but it would stop the hardware from continuing to do whatever
it is doing before it gets shut down through runtime PM. I'm not sure it
would make that much of a difference, but it did seem like something that
could be done.

ChenYu

diff mbox series

Patch

diff --git a/drivers/media/platform/verisilicon/hantro_drv.c b/drivers/media/platform/verisilicon/hantro_drv.c
index 423fc85d79ee3..50ec24c753e9e 100644
--- a/drivers/media/platform/verisilicon/hantro_drv.c
+++ b/drivers/media/platform/verisilicon/hantro_drv.c
@@ -125,7 +125,8 @@  void hantro_watchdog(struct work_struct *work)
 	ctx = v4l2_m2m_get_curr_priv(vpu->m2m_dev);
 	if (ctx) {
 		vpu_err("frame processing timed out!\n");
-		ctx->codec_ops->reset(ctx);
+		if (ctx->codec_ops->reset)
+			ctx->codec_ops->reset(ctx);
 		hantro_job_finish(vpu, ctx, VB2_BUF_STATE_ERROR);
 	}
 }