[0/3] Chunk Heap Support on DMA-HEAP

Message ID 20200818080415.7531-1-hyesoo.yu@samsung.com (mailing list archive)

Message

Hyesoo Yu Aug. 18, 2020, 8:04 a.m. UTC
This patch series introduces a new dma heap, the chunk heap.
The heap is needed for special HW that requires bulk allocation of
fixed high-order pages. For example, a 64MB dma-buf is made up of
1024 fixed order-4 pages.
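
The arithmetic in the example above can be checked with a small sketch. It
assumes 4KB base pages (the cover letter does not state the page size), and
the names SKETCH_PAGE_SIZE, chunk_bytes() and chunks_needed() are
hypothetical helpers written for illustration, not part of the patch.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: 4KB base pages; the cover letter does not state this. */
#define SKETCH_PAGE_SIZE 4096UL

/* Bytes covered by one chunk of the given page order (order-4 -> 64KB). */
static size_t chunk_bytes(unsigned int order)
{
	return SKETCH_PAGE_SIZE << order;
}

/* Number of fixed-order chunks needed to back a buffer of len bytes. */
static size_t chunks_needed(size_t len, unsigned int order)
{
	size_t chunk = chunk_bytes(order);

	assert(len > 0);
	return (len + chunk - 1) / chunk;	/* round up to whole chunks */
}
```

Under that assumption, chunks_needed(64UL << 20, 4) gives the 1024 order-4
chunks mentioned above.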

The chunk heap uses alloc_pages_bulk to allocate high-order pages.
https://lore.kernel.org/linux-mm/20200814173131.2803002-1-minchan@kernel.org

The chunk heap is registered through the device tree with an alignment and
a phandle to a contiguous memory allocator (CMA) memory node. The alignment
defines the chunk page size. For example, an alignment of 0x1_0000 means the
chunk page size is 64KB. If the device node doesn't have a CMA region, the
registration of the chunk heap fails.
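
As a rough illustration of how an alignment maps to a chunk order, here is a
user-space sketch. It assumes 4KB base pages and a power-of-two alignment;
alignment_to_order() and SKETCH_PAGE_SHIFT are hypothetical names. In the
kernel, get_order() performs this kind of conversion, though whether the
patch uses it is not stated here.

```c
#include <assert.h>

/* Assumption: 4KB base pages, i.e. a page shift of 12. */
#define SKETCH_PAGE_SHIFT 12

/* Page order for a power-of-two alignment given in bytes. */
static unsigned int alignment_to_order(unsigned long alignment)
{
	unsigned int order = 0;

	/* The alignment must be at least one page and a power of two. */
	assert(alignment >= (1UL << SKETCH_PAGE_SHIFT));
	assert((alignment & (alignment - 1)) == 0);

	while ((1UL << (SKETCH_PAGE_SHIFT + order)) < alignment)
		order++;
	return order;
}
```

With these assumptions, alignment_to_order(0x10000) yields order 4, which
matches the 64KB chunk page size in the example.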

The patchset includes the following:
 - export the dma-heap API so a kernel module can register a dma heap.
 - add the chunk heap implementation.
 - document the device tree binding used to register the chunk heap.

Hyesoo Yu (3):
  dma-buf: add missing EXPORT_SYMBOL_GPL() for dma heaps
  dma-buf: heaps: add chunk heap to dmabuf heaps
  dma-heap: Devicetree binding for chunk heap

 .../devicetree/bindings/dma-buf/chunk_heap.yaml    |  46 +++++
 drivers/dma-buf/dma-heap.c                         |   2 +
 drivers/dma-buf/heaps/Kconfig                      |   9 +
 drivers/dma-buf/heaps/Makefile                     |   1 +
 drivers/dma-buf/heaps/chunk_heap.c                 | 222 +++++++++++++++++++++
 drivers/dma-buf/heaps/heap-helpers.c               |   2 +
 6 files changed, 282 insertions(+)
 create mode 100644 Documentation/devicetree/bindings/dma-buf/chunk_heap.yaml
 create mode 100644 drivers/dma-buf/heaps/chunk_heap.c
  

Comments

Brian Starkey Aug. 18, 2020, 10:55 a.m. UTC | #1
Hi,

On Tue, Aug 18, 2020 at 05:04:12PM +0900, Hyesoo Yu wrote:
> These patch series to introduce a new dma heap, chunk heap.
> That heap is needed for special HW that requires bulk allocation of
> fixed high order pages. For example, 64MB dma-buf pages are made up
> to fixed order-4 pages * 1024.
> 
> The chunk heap uses alloc_pages_bulk to allocate high order page.
> https://lore.kernel.org/linux-mm/20200814173131.2803002-1-minchan@kernel.org
> 
> The chunk heap is registered by device tree with alignment and memory node
> of contiguous memory allocator(CMA). Alignment defines chunk page size.
> For example, alignment 0x1_0000 means chunk page size is 64KB.
> The phandle to memory node indicates contiguous memory allocator(CMA).
> If device node doesn't have cma, the registration of chunk heap fails.

This reminds me of an ion heap developed at Arm several years ago:
https://git.linaro.org/landing-teams/working/arm/kernel.git/tree/drivers/staging/android/ion/ion_compound_page.c

Some more descriptive text here:
https://github.com/ARM-software/CPA

It maintains a pool of high-order pages with a worker thread to
attempt compaction and allocation to keep the pool filled, with high
and low watermarks to trigger freeing/allocating of chunks.
It implements a shrinker to allow the system to reclaim the pool under
high memory pressure.
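
The watermark idea can be sketched as a simple decision function. This is
only an illustration of the description above, not CPA's actual code; the
enum, the function name and the exact thresholds are assumptions.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical actions a pool worker could take after checking the pool. */
enum pool_action {
	POOL_REFILL,	/* below the low watermark: allocate more chunks */
	POOL_IDLE,	/* between the watermarks: nothing to do */
	POOL_TRIM	/* above the high watermark: free chunks */
};

/* Decide what to do for a pool currently holding nr_chunks chunks. */
static enum pool_action pool_check(size_t nr_chunks, size_t low, size_t high)
{
	assert(low <= high);

	if (nr_chunks < low)
		return POOL_REFILL;
	if (nr_chunks > high)
		return POOL_TRIM;
	return POOL_IDLE;
}
```

A shrinker, as described above, would be an additional path that frees
pooled chunks under memory pressure regardless of the watermarks.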

Is maintaining a pool something you considered? From the
alloc_pages_bulk thread it sounds like you want to allocate 300M at a
time, so I expect if you tuned the pool size to match that it could
work quite well.

That implementation isn't using a CMA region, but a similar approach
could definitely be applied.

Thanks,
-Brian

> 
> The patchset includes the following:
>  - export dma-heap API to register kernel module dma heap.
>  - add chunk heap implementation.
>  - document of device tree to register chunk heap
> 
> Hyesoo Yu (3):
>   dma-buf: add missing EXPORT_SYMBOL_GPL() for dma heaps
>   dma-buf: heaps: add chunk heap to dmabuf heaps
>   dma-heap: Devicetree binding for chunk heap
> 
>  .../devicetree/bindings/dma-buf/chunk_heap.yaml    |  46 +++++
>  drivers/dma-buf/dma-heap.c                         |   2 +
>  drivers/dma-buf/heaps/Kconfig                      |   9 +
>  drivers/dma-buf/heaps/Makefile                     |   1 +
>  drivers/dma-buf/heaps/chunk_heap.c                 | 222 +++++++++++++++++++++
>  drivers/dma-buf/heaps/heap-helpers.c               |   2 +
>  6 files changed, 282 insertions(+)
>  create mode 100644 Documentation/devicetree/bindings/dma-buf/chunk_heap.yaml
>  create mode 100644 drivers/dma-buf/heaps/chunk_heap.c
> 
> -- 
> 2.7.4
>
  
John Stultz Aug. 18, 2020, 8:55 p.m. UTC | #2
On Tue, Aug 18, 2020 at 12:45 AM Hyesoo Yu <hyesoo.yu@samsung.com> wrote:
>
> These patch series to introduce a new dma heap, chunk heap.
> That heap is needed for special HW that requires bulk allocation of
> fixed high order pages. For example, 64MB dma-buf pages are made up
> to fixed order-4 pages * 1024.
>
> The chunk heap uses alloc_pages_bulk to allocate high order page.
> https://lore.kernel.org/linux-mm/20200814173131.2803002-1-minchan@kernel.org
>
> The chunk heap is registered by device tree with alignment and memory node
> of contiguous memory allocator(CMA). Alignment defines chunk page size.
> For example, alignment 0x1_0000 means chunk page size is 64KB.
> The phandle to memory node indicates contiguous memory allocator(CMA).
> If device node doesn't have cma, the registration of chunk heap fails.
>
> The patchset includes the following:
>  - export dma-heap API to register kernel module dma heap.
>  - add chunk heap implementation.
>  - document of device tree to register chunk heap
>
> Hyesoo Yu (3):
>   dma-buf: add missing EXPORT_SYMBOL_GPL() for dma heaps
>   dma-buf: heaps: add chunk heap to dmabuf heaps
>   dma-heap: Devicetree binding for chunk heap

Hey! Thanks so much for sending this out! I'm really excited to see
these heaps be submitted and reviewed on the list!

The first general concern I have with your series is that it adds a dt
binding for the chunk heap, which we've gotten a fair amount of
pushback on.

A possible alternative might be something like what Kunihiko Hayashi
proposed for non-default CMA heaps:
  https://lore.kernel.org/lkml/1594948208-4739-1-git-send-email-hayashi.kunihiko@socionext.com/

This approach would instead allow a driver to register a CMA area with
the chunk heap implementation.

However (and this was the catch with Kunihiko Hayashi's patch), this
requires that the driver also be upstream, as we need an in-tree user
of such code.

Also, it might be good to provide some further rationale on why this
heap is beneficial over the existing CMA heap. In general, focusing
the commit messages more on why we might want the patch, rather
than what the patch does, is helpful.

"Special hardware" that doesn't have upstream drivers isn't very
compelling for most maintainers.

That said, I'm very excited to see these sorts of submissions, as I
know lots of vendors have historically had very custom out of tree ION
heaps, and I think it would be a great benefit to the community to
better understand the experience vendors have in optimizing
performance on their devices, so we can create good common solutions
upstream. So I look forward to your insights on future revisions of
this patch series!

thanks
-john
  
Cho KyongHo Aug. 19, 2020, 3:46 a.m. UTC | #3
On Tue, Aug 18, 2020 at 11:55:57AM +0100, Brian Starkey wrote:
> Hi,
> 
> On Tue, Aug 18, 2020 at 05:04:12PM +0900, Hyesoo Yu wrote:
> > These patch series to introduce a new dma heap, chunk heap.
> > That heap is needed for special HW that requires bulk allocation of
> > fixed high order pages. For example, 64MB dma-buf pages are made up
> > to fixed order-4 pages * 1024.
> > 
> > The chunk heap uses alloc_pages_bulk to allocate high order page.
> > https://lore.kernel.org/linux-mm/20200814173131.2803002-1-minchan@kernel.org
> > 
> > The chunk heap is registered by device tree with alignment and memory node
> > of contiguous memory allocator(CMA). Alignment defines chunk page size.
> > For example, alignment 0x1_0000 means chunk page size is 64KB.
> > The phandle to memory node indicates contiguous memory allocator(CMA).
> > If device node doesn't have cma, the registration of chunk heap fails.
> 
> This reminds me of an ion heap developed at Arm several years ago:
> https://git.linaro.org/landing-teams/working/arm/kernel.git/tree/drivers/staging/android/ion/ion_compound_page.c
> 
> Some more descriptive text here:
> https://github.com/ARM-software/CPA
> 
> It maintains a pool of high-order pages with a worker thread to
> attempt compaction and allocation to keep the pool filled, with high
> and low watermarks to trigger freeing/allocating of chunks.
> It implements a shrinker to allow the system to reclaim the pool under
> high memory pressure.
> 
> Is maintaining a pool something you considered? From the
> alloc_pages_bulk thread it sounds like you want to allocate 300M at a
> time, so I expect if you tuned the pool size to match that it could
> work quite well.
> 
> That implementation isn't using a CMA region, but a similar approach
> could definitely be applied.
> 
I seriously considered CPA for our product, but we developed our own
heap because of the pool in CPA.
High-order pages are required by some specific users, such as the Netflix
app. Moreover, the required number of bytes is increasing dramatically
because of today's high-resolution videos and displays.

Gathering lots of free high-order pages in the background during
run-time means reserving that amount of pages from the entire available
system memory. Moreover, the gathered pages are soon reclaimed whenever
the system is suffering from memory pressure (e.g. camera recording,
heavy games). So we had to consider allocating hundreds of megabytes at
a time. Of course we don't allocate all buffers with a single call to
alloc_pages_bulk(). But a single buffer is still very large.
A single frame of 8K HDR video needs 95MB (7680*4320*2*1.5). Even a
single frame of 4K HDR video needs 24MB, and 4K HDR is now popular on
Netflix, YouTube and Google Play video.
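
The frame sizes above can be reproduced with a short sketch. It reads the
formula as width * height * 2 bytes per sample * 1.5 for the chroma planes,
which is an interpretation of the numbers and not something stated in the
mail; the function names are hypothetical. The 64KB chunk size is taken from
the cover letter's example.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Frame size in bytes, following the formula in the mail:
 * width * height * bytes_per_sample * 1.5.
 * Assumption: the 1.5 factor accounts for subsampled chroma planes.
 */
static uint64_t frame_bytes(uint64_t width, uint64_t height,
			    uint64_t bytes_per_sample)
{
	assert(width > 0 && height > 0 && bytes_per_sample > 0);
	/* The x1.5 is written as *3/2 to stay in integer math. */
	return width * height * bytes_per_sample * 3 / 2;
}

/* Number of chunk_size-byte chunks needed to hold one frame. */
static uint64_t chunks_for_frame(uint64_t bytes, uint64_t chunk_size)
{
	assert(chunk_size > 0);
	return (bytes + chunk_size - 1) / chunk_size;	/* round up */
}
```

For 7680x4320 this gives 99,532,800 bytes (about 95MB counted in units of
2^20 bytes), or 1519 chunks of 64KB; for 3840x2160 it gives 24,883,200 bytes
(about 24MB), or 380 chunks.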

> Thanks,
> -Brian

Thank you!

KyongHo
  
Brian Starkey Aug. 19, 2020, 1:22 p.m. UTC | #4
Hi KyongHo,

On Wed, Aug 19, 2020 at 12:46:26PM +0900, Cho KyongHo wrote:
> I have seriously considered CPA in our product but we developed our own
> because of the pool in CPA.

Oh good, I'm glad you considered it :-)

> The high-order pages are required by some specific users like Netflix
> app. Moreover required number of bytes are dramatically increasing
> because of high resolution videos and displays in these days.
> 
> Gathering lots of free high-order pages in the background during
> run-time means reserving that amount of pages from the entier available
> system memory. Moreover the gathered pages are soon reclaimed whenever
> the system is sufferring from memory pressure (i.e. camera recording,
> heavy games).

Aren't these two things in contradiction? If they're easily reclaimed
then they aren't "reserved" in any detrimental way. And if you don't
want them to be reclaimed, then you need them to be reserved...

The approach you have here assigns the chunk of memory as a reserved
CMA region which the kernel is going to try not to use too - similar
to the CPA pool.

I suppose it's a balance depending on how much you're willing to wait
for migration on the allocation path. CPA has the potential to get you
faster allocations, but the downside is you need to make it a little
more "greedy".

Cheers,
-Brian
  
Cho KyongHo Aug. 21, 2020, 7:38 a.m. UTC | #5
Hi Brian,

On Wed, Aug 19, 2020 at 02:22:04PM +0100, Brian Starkey wrote:
> Hi KyongHo,
> 
> On Wed, Aug 19, 2020 at 12:46:26PM +0900, Cho KyongHo wrote:
> > I have seriously considered CPA in our product but we developed our own
> > because of the pool in CPA.
> 
> Oh good, I'm glad you considered it :-)
> 
> > The high-order pages are required by some specific users like Netflix
> > app. Moreover required number of bytes are dramatically increasing
> > because of high resolution videos and displays in these days.
> > 
> > Gathering lots of free high-order pages in the background during
> > run-time means reserving that amount of pages from the entier available
> > system memory. Moreover the gathered pages are soon reclaimed whenever
> > the system is sufferring from memory pressure (i.e. camera recording,
> > heavy games).
> 
> Aren't these two things in contradiction? If they're easily reclaimed
> then they aren't "reserved" in any detrimental way. And if you don't
> want them to be reclaimed, then you need them to be reserved...
> 
> The approach you have here assigns the chunk of memory as a reserved
> CMA region which the kernel is going to try not to use too - similar
> to the CPA pool.
> 
> I suppose it's a balance depending on how much you're willing to wait
> for migration on the allocation path. CPA has the potential to get you
> faster allocations, but the downside is you need to make it a little
> more "greedy".
> 
I understand why you see this as a contradiction, but I don't think it is.
The kernel page allocator now prefers free pages in CMA when allocating
movable pages, since commit
https://lore.kernel.org/linux-mm/CAAmzW4P6+3O_RLvgy_QOKD4iXw+Hk3HE7Toc4Ky7kvQbCozCeA@mail.gmail.com/

We are trying to reduce unused pages to improve performance. So, unused
pages in a pool should be easily reclaimed. That is why we do not
secure free pages in a special pool for a specific use case. Instead we
have tried to reduce performance bottlenecks in page migration so that a
large amount of memory can be allocated when it is needed.