LinuxTV Patchwork Revert 95f408bb Ryzen DMA related RiSC engine stall fixes

login
register
mail settings
Submitter Mauro Carvalho Chehab
Date Dec. 5, 2018, 11:07 a.m.
Message ID <20181205090721.43e7f36c@coco.lan>
Download mbox | patch
Permalink /patch/53318/
State RFC
Headers show

Comments

Mauro Carvalho Chehab - Dec. 5, 2018, 11:07 a.m.
Em Sun, 21 Oct 2018 15:45:39 +0200
Markus Dobel <markus.dobel@gmx.de> escreveu:

> The original commit (the one reverted in this patch) introduced a 
> regression,
> making a previously flawless adapter unresponsive after running a few 
> hours
> to days. Since I never experienced the problems that the original commit 
> is
> supposed to fix, I propose to revert the change until a regression-free
> variant is found.
> 
> Before submitting this, I've been running a system 24x7 with this revert 
> for
> several weeks now, and it's running stable again.
> 
> It's not a pure revert, as the original commit does not revert cleanly
> anymore due to other changes, but content-wise it is.
> 
> Signed-off-by: Markus Dobel <markus.dobel@gmx.de>
> ---
>   drivers/media/pci/cx23885/cx23885-core.c | 60 ------------------------
>   drivers/media/pci/cx23885/cx23885-reg.h  | 14 ------
>   2 files changed, 74 deletions(-)
> 
> diff --git a/drivers/media/pci/cx23885/cx23885-core.c 
> b/drivers/media/pci/cx23885/cx23885-core.c
> index 39804d830305..606f6fc0e68b 100644
> --- a/drivers/media/pci/cx23885/cx23885-core.c
> +++ b/drivers/media/pci/cx23885/cx23885-core.c
> @@ -601,25 +601,6 @@ static void cx23885_risc_disasm(struct 
> cx23885_tsport *port,

Patch was mangled by your e-mailer: it broke longer lines, causing
it to not apply.

Also, before just reverting the entire thing, could you please check
if the enclosed hack would solve it?

If so, it should be easy to add a quirk at drivers/pci/quirks.c
in order to detect the Ryzen models with a bad DMA engine that
require periodic resets, and then make cx23885 to use it.

We did similar tricks before with some broken DMA engines, at
the time we had overlay support on drivers and AMD controllers
didn't support PCI2PCI DMA transfers.

Brad,

Could you please address this issue?





Thanks,
Mauro
Brad Love - Dec. 6, 2018, 4:37 p.m.
Hi Mauro & Markus,


On 05/12/2018 05.07, Mauro Carvalho Chehab wrote:
> Em Sun, 21 Oct 2018 15:45:39 +0200
> Markus Dobel <markus.dobel@gmx.de> escreveu:
>
>> The original commit (the one reverted in this patch) introduced a 
>> regression,
>> making a previously flawless adapter unresponsive after running a few 
>> hours
>> to days. Since I never experienced the problems that the original commit 
>> is
>> supposed to fix, I propose to revert the change until a regression-free
>> variant is found.
>>
>> Before submitting this, I've been running a system 24x7 with this revert 
>> for
>> several weeks now, and it's running stable again.
>>
>> It's not a pure revert, as the original commit does not revert cleanly
>> anymore due to other changes, but content-wise it is.
>>
>> Signed-off-by: Markus Dobel <markus.dobel@gmx.de>
>> ---
>>   drivers/media/pci/cx23885/cx23885-core.c | 60 ------------------------
>>   drivers/media/pci/cx23885/cx23885-reg.h  | 14 ------
>>   2 files changed, 74 deletions(-)
>>
>> diff --git a/drivers/media/pci/cx23885/cx23885-core.c 
>> b/drivers/media/pci/cx23885/cx23885-core.c
>> index 39804d830305..606f6fc0e68b 100644
>> --- a/drivers/media/pci/cx23885/cx23885-core.c
>> +++ b/drivers/media/pci/cx23885/cx23885-core.c
>> @@ -601,25 +601,6 @@ static void cx23885_risc_disasm(struct 
>> cx23885_tsport *port,
> Patch was mangled by your e-mailer: it broke longer lines, causing
> it to not apply.
>
> Also, before just reverting the entire thing, could you please check
> if the enclosed hack would solve it?
>
> If so, it should be easy to add a quirk at drivers/pci/quirks.c
> in order to detect the Ryzen models with a bad DMA engine that
> require periodic resets, and then make cx23885 to use it.
>
> We did similar tricks before with some broken DMA engines, at
> the time we had overlay support on drivers and AMD controllers
> didn't support PCI2PCI DMA transfers.
>
> Brad,
>
> Could you please address this issue?


I'll try to address this today or tomorrow. Since the original patch was
applied I have not received any complaints from ryzen users, but we have
accumulated a few reports from Intel users with a variety of
motherboards that do now encounter issue. Strangely system load affects
the repro; low/no load exhibits the error condition, high system load
everything is fine. I'll do my best to send in a ryzen specific patch by
the weekend.

Regards,

Brad



>
>
> diff --git a/drivers/media/pci/cx23885/cx23885-core.c b/drivers/media/pci/cx23885/cx23885-core.c
> index 39804d830305..8b012bee6b32 100644
> --- a/drivers/media/pci/cx23885/cx23885-core.c
> +++ b/drivers/media/pci/cx23885/cx23885-core.c
> @@ -603,8 +603,14 @@ static void cx23885_risc_disasm(struct cx23885_tsport *port,
>  
>  static void cx23885_clear_bridge_error(struct cx23885_dev *dev)
>  {
> -	uint32_t reg1_val = cx_read(TC_REQ); /* read-only */
> -	uint32_t reg2_val = cx_read(TC_REQ_SET);
> +	uint32_t reg1_val, reg2_val;
> +
> +	/* TODO: check for Ryzen quirk */
> +	if (1)
> +		return;
> +
> +	reg1_val = cx_read(TC_REQ); /* read-only */
> +	reg2_val = cx_read(TC_REQ_SET);
>  
>  	if (reg1_val && reg2_val) {
>  		cx_write(TC_REQ, reg1_val);
>
>
>
> Thanks,
> Mauro
Markus Dobel - Dec. 6, 2018, 5:18 p.m.
Hi everyone,

I will try if the hack mentioned fixes the issue for me on the weekend (but I assume, as if effectively removes the function).

Just in case this is of interest, I neither have Ryzen nor Intel, but an HP Microserver G7 with an AMD Turion II Neo  N54L, so the machine is more on the slow side. 

Regards, Markus
Mauro Carvalho Chehab - Dec. 6, 2018, 6:01 p.m.
Em Thu, 06 Dec 2018 18:18:23 +0100
Markus Dobel <markus.dobel@gmx.de> escreveu:

> Hi everyone,
> 
> I will try if the hack mentioned fixes the issue for me on the weekend (but I assume, as if effectively removes the function).

It should, but it keeps a few changes. Just want to be sure that what
would be left won't cause issues. If this works, the logic that would
solve Ryzen DMA fixes will be contained into a single point, making
easier to maintain it.

> 
> Just in case this is of interest, I neither have Ryzen nor Intel, but an HP Microserver G7 with an AMD Turion II Neo  N54L, so the machine is more on the slow side. 

Good to know. It would probably worth to check if this Ryzen
bug occors with all versions of it or with just a subset.
I mean: maybe it is only at the first gen or Ryzen and doesn't
affect Ryzen 2 (or vice versa).

The PCI quirks logic will likely need to detect the PCI ID of
the memory controllers found at the buggy CPUs, in order to enable
the quirk only for the affected ones.

It could be worth talking with AMD people in order to be sure about
the differences at the DMA engine side.

Thanks,
Mauro
Alex Deucher - Dec. 6, 2018, 6:36 p.m.
On Thu, Dec 6, 2018 at 1:05 PM Mauro Carvalho Chehab <mchehab@kernel.org> wrote:
>
> Em Thu, 06 Dec 2018 18:18:23 +0100
> Markus Dobel <markus.dobel@gmx.de> escreveu:
>
> > Hi everyone,
> >
> > I will try if the hack mentioned fixes the issue for me on the weekend (but I assume, as if effectively removes the function).
>
> It should, but it keeps a few changes. Just want to be sure that what
> would be left won't cause issues. If this works, the logic that would
> solve Ryzen DMA fixes will be contained into a single point, making
> easier to maintain it.
>
> >
> > Just in case this is of interest, I neither have Ryzen nor Intel, but an HP Microserver G7 with an AMD Turion II Neo  N54L, so the machine is more on the slow side.
>
> Good to know. It would probably worth to check if this Ryzen
> bug occors with all versions of it or with just a subset.
> I mean: maybe it is only at the first gen or Ryzen and doesn't
> affect Ryzen 2 (or vice versa).

The original commit also mentions some Xeons are affected too.  Seems
like this is potentially an issue on the device side rather than the
platform.

>
> The PCI quirks logic will likely need to detect the PCI ID of
> the memory controllers found at the buggy CPUs, in order to enable
> the quirk only for the affected ones.
>
> It could be worth talking with AMD people in order to be sure about
> the differences at the DMA engine side.
>

It's not clear to me what the pci or platform quirk would do.  The
workaround seems to be in the driver, not on the platform.

Alex
Mauro Carvalho Chehab - Dec. 6, 2018, 7:07 p.m.
Em Thu, 6 Dec 2018 13:36:24 -0500
Alex Deucher <alexdeucher@gmail.com> escreveu:

> On Thu, Dec 6, 2018 at 1:05 PM Mauro Carvalho Chehab <mchehab@kernel.org> wrote:
> >
> > Em Thu, 06 Dec 2018 18:18:23 +0100
> > Markus Dobel <markus.dobel@gmx.de> escreveu:
> >  
> > > Hi everyone,
> > >
> > > I will try if the hack mentioned fixes the issue for me on the weekend (but I assume, as if effectively removes the function).  
> >
> > It should, but it keeps a few changes. Just want to be sure that what
> > would be left won't cause issues. If this works, the logic that would
> > solve Ryzen DMA fixes will be contained into a single point, making
> > easier to maintain it.
> >  
> > >
> > > Just in case this is of interest, I neither have Ryzen nor Intel, but an HP Microserver G7 with an AMD Turion II Neo  N54L, so the machine is more on the slow side.  
> >
> > Good to know. It would probably worth to check if this Ryzen
> > bug occors with all versions of it or with just a subset.
> > I mean: maybe it is only at the first gen or Ryzen and doesn't
> > affect Ryzen 2 (or vice versa).  
> 
> The original commit also mentions some Xeons are affected too.  Seems
> like this is potentially an issue on the device side rather than the
> platform.

Maybe.

> >
> > The PCI quirks logic will likely need to detect the PCI ID of
> > the memory controllers found at the buggy CPUs, in order to enable
> > the quirk only for the affected ones.
> >
> > It could be worth talking with AMD people in order to be sure about
> > the differences at the DMA engine side.
> >  
> 
> It's not clear to me what the pci or platform quirk would do.  The
> workaround seems to be in the driver, not on the platform.

Yeah, the fix should be at the driver, but pci/quirk.c would be able
to detect memory controllers that would require a hack inside the
driver, in a similar way to what the media PCI drivers already do
for DMA controllers that don't support pci2pci transfers.

There, basically the PCI core (drivers/pci/pci.c and 
drivers/pci/quirks.c) sets a flag (pci_pci_problems) indicating
potential issues.

Then, the driver compares such flag in order to enable the specific quirk.

Ok, there would be a different way to handle it. The driver could use a 
logic similar to the one I wrote for drivers/edac/i7core_edac.c. There,
the logic seeks for some specific PCI device IDs using pci_get_device(),
adjusting the code accordingly, depending on the detected PCI devices.

In other words, running something like this (untested), at probe time should
produce similar results:

	/*
	 * FIXME: It probably makes sense to also be able to identify specific
	 * versions of the same PCI ID, just in case a latter stepping got a
	 * fix for the issue.
	 */
	const static struct {
		int vendor, dev;
	} broken_dev_id[] = {
		PCI_VENDOR_ID_AMD, PCI_DEVICE_ID_foo,
		PCI_VENDOR_ID_AMD, PCI_DEVICE_ID_bar,
	},

	bool cx23885_does_dma_require_reset(void) 
	{
		int i;
		struct pci_dev *pdev = NULL;

		for (i = 0; i < sizeof(broken_dev_id); i++) {
			pdev = pci_get_device(broken_dev_id[i].vendor, broken_dev_id[i].dev, NULL);
			if (pdev) {
				pci_put_device(pdev);
				return true;
			}
		}
		return false;
	}

Should work. In any case, we need to know what memory controllers 
have problems, and what are their PCI IDs, and add them (if not there yet)
at include/linux/pci_ids.h

Thanks,
Mauro

Patch

diff --git a/drivers/media/pci/cx23885/cx23885-core.c b/drivers/media/pci/cx23885/cx23885-core.c
index 39804d830305..8b012bee6b32 100644
--- a/drivers/media/pci/cx23885/cx23885-core.c
+++ b/drivers/media/pci/cx23885/cx23885-core.c
@@ -603,8 +603,14 @@  static void cx23885_risc_disasm(struct cx23885_tsport *port,
 
 static void cx23885_clear_bridge_error(struct cx23885_dev *dev)
 {
-	uint32_t reg1_val = cx_read(TC_REQ); /* read-only */
-	uint32_t reg2_val = cx_read(TC_REQ_SET);
+	uint32_t reg1_val, reg2_val;
+
+	/* TODO: check for Ryzen quirk */
+	if (1)
+		return;
+
+	reg1_val = cx_read(TC_REQ); /* read-only */
+	reg2_val = cx_read(TC_REQ_SET);
 
 	if (reg1_val && reg2_val) {
 		cx_write(TC_REQ, reg1_val);

Privacy Policy