From patchwork Fri Apr 4 08:28:22 2014
X-Patchwork-Submitter: Hugh Dickins
X-Patchwork-Id: 23406
Date: Fri, 4 Apr 2014 01:28:22 -0700 (PDT)
From: Hugh Dickins
To: Linus Torvalds
Cc: Andrew Morton, Jan Kara, Roland Dreier, Oleg Nesterov,
    Konstantin Khlebnikov, "Kirill A. Shutemov", Mauro Carvalho Chehab,
    Omar Ramirez Luna, Inki Dae, linux-kernel@vger.kernel.org,
    linux-mm@kvack.org, linux-rdma@vger.kernel.org,
    linux-media@vger.kernel.org
Subject: [PATCH] mm: get_user_pages(write,force) refuse to COW in shared areas

get_user_pages(write=1, force=1) has always had odd behaviour on
write-protected shared mappings: although it demands FMODE_WRITE access
to the underlying object (do_mmap_pgoff sets neither VM_SHARED nor
VM_MAYWRITE without that), it ends up with do_wp_page substituting
private anonymous Copied-On-Write pages for the shared file pages in
the area.

That was long ago intentional, as a safety measure to prevent ptrace
setting a breakpoint (or POKETEXT or POKEDATA) from inadvertently
corrupting the underlying executable.  Yet exec and dynamic loaders
open the file read-only, and use MAP_PRIVATE rather than MAP_SHARED.
The traditional odd behaviour still causes surprises and bugs in mm,
and is probably not what any caller wants - even the comment on the
flag says "You do not want this" (although it's undoubtedly necessary
for overriding userspace protections in some contexts, and good when
!write).

Let's stop doing that.  But it would be dangerous to remove the
long-standing safety at this stage, so just make
get_user_pages(write,force) fail with EFAULT when applied to a
write-protected shared area.  Infiniband may in future want to force
write through to the underlying object: we can add another FOLL_* flag
later to enable that if required.

Odd though the old behaviour was, there is no doubt that we may turn
out to break userspace with this change, and have to revert it quickly.
Issue a WARN_ON_ONCE to help debug the changed case (easily triggered
by userspace, so only once to prevent spamming the logs); and delay a
few associated cleanups until this change is proved.

get_user_pages callers who might see trouble from this change:
  ptrace poking, or writing to /proc/<pid>/mem
  drivers/infiniband/
  drivers/media/v4l2-core/
  drivers/gpu/drm/exynos/exynos_drm_gem.c
  drivers/staging/tidspbridge/core/tiomap3430.c
if they ever apply get_user_pages to write-protected shared mappings
of an object which was opened for writing.

I went to apply the same change to mm/nommu.c, but retreated: NOMMU has
no place for COW, and its VM_flags conventions are not the same; I'd be
more likely to screw up NOMMU than make an improvement there.

Suggested-by: Linus Torvalds
Signed-off-by: Hugh Dickins
Acked-by: Kirill A. Shutemov
---
You suggested something like this in the LKML discussion of
887843961c4b and "bad rss-counter message in 3.14rc5" on March 18th,
and I agreed to remind you "early in the 3.15 merge window".
Sorry, this comes a few days later than intended: and it's the first
time I've posted it, so it has seen no exposure in mmotm or linux-next,
nor any approval from those Cc'ed, though I did mention it to a few.
Up to you: you may prefer to hold it over, or give it exposure soonest;
I've seen no problem from it, but then I'm not likely to.

I may have exaggerated the accounting difficulties of the present
behaviour: even with this change, write,force still violates the
Committed_AS accounting which Konstantin had patches to fix; but I hope
we can do that more simply now, with some kind of cmpxchg setting
VM_ACCOUNT here in vm_flags, without mmap_sem for writing.

 mm/memory.c |   66 ++++++++++++++++++++++++++++++++++----------------
 1 file changed, 45 insertions(+), 21 deletions(-)

--- 3.14/mm/memory.c	2014-03-30 20:40:15.000000000 -0700
+++ linux/mm/memory.c	2014-04-03 15:26:41.884372480 -0700
@@ -1705,15 +1705,6 @@ long __get_user_pages(struct task_struct

 	VM_BUG_ON(!!pages != !!(gup_flags & FOLL_GET));

-	/*
-	 * Require read or write permissions.
-	 * If FOLL_FORCE is set, we only require the "MAY" flags.
-	 */
-	vm_flags  = (gup_flags & FOLL_WRITE) ?
-			(VM_WRITE | VM_MAYWRITE) : (VM_READ | VM_MAYREAD);
-	vm_flags &= (gup_flags & FOLL_FORCE) ?
-			(VM_MAYREAD | VM_MAYWRITE) : (VM_READ | VM_WRITE);
-
 	/*
 	 * If FOLL_FORCE and FOLL_NUMA are both set, handle_mm_fault
 	 * would be called on PROT_NONE ranges. We must never invoke
@@ -1741,7 +1732,7 @@ long __get_user_pages(struct task_struct

 			/* user gate pages are read-only */
 			if (gup_flags & FOLL_WRITE)
-				return i ? : -EFAULT;
+				goto efault;
 			if (pg > TASK_SIZE)
 				pgd = pgd_offset_k(pg);
 			else
@@ -1751,12 +1742,12 @@ long __get_user_pages(struct task_struct
 			BUG_ON(pud_none(*pud));
 			pmd = pmd_offset(pud, pg);
 			if (pmd_none(*pmd))
-				return i ? : -EFAULT;
+				goto efault;
 			VM_BUG_ON(pmd_trans_huge(*pmd));
 			pte = pte_offset_map(pmd, pg);
 			if (pte_none(*pte)) {
 				pte_unmap(pte);
-				return i ? : -EFAULT;
+				goto efault;
 			}
 			vma = get_gate_vma(mm);
 			if (pages) {
@@ -1769,7 +1760,7 @@ long __get_user_pages(struct task_struct
 					page = pte_page(*pte);
 				else {
 					pte_unmap(pte);
-					return i ? : -EFAULT;
+					goto efault;
 				}
 			}
 			pages[i] = page;
@@ -1780,10 +1771,42 @@ long __get_user_pages(struct task_struct
 			goto next_page;
 		}

-		if (!vma ||
-		    (vma->vm_flags & (VM_IO | VM_PFNMAP)) ||
-		    !(vm_flags & vma->vm_flags))
-			return i ? : -EFAULT;
+		if (!vma)
+			goto efault;
+		vm_flags = vma->vm_flags;
+		if (vm_flags & (VM_IO | VM_PFNMAP))
+			goto efault;
+
+		if (gup_flags & FOLL_WRITE) {
+			if (!(vm_flags & VM_WRITE)) {
+				if (!(gup_flags & FOLL_FORCE))
+					goto efault;
+				/*
+				 * We used to let the write,force case do COW
+				 * in a VM_MAYWRITE VM_SHARED !VM_WRITE vma, so
+				 * ptrace could set a breakpoint in a read-only
+				 * mapping of an executable, without corrupting
+				 * the file (yet only when that file had been
+				 * opened for writing!).  Anon pages in shared
+				 * mappings are surprising: now just reject it.
+				 */
+				if (!is_cow_mapping(vm_flags)) {
+					WARN_ON_ONCE(vm_flags & VM_MAYWRITE);
+					goto efault;
+				}
+			}
+		} else {
+			if (!(vm_flags & VM_READ)) {
+				if (!(gup_flags & FOLL_FORCE))
+					goto efault;
+				/*
+				 * Is there actually any vma we can reach here
+				 * which does not have VM_MAYREAD set?
+				 */
+				if (!(vm_flags & VM_MAYREAD))
+					goto efault;
+			}
+		}

 		if (is_vm_hugetlb_page(vma)) {
 			i = follow_hugetlb_page(mm, vma, pages, vmas,
@@ -1837,7 +1860,7 @@ long __get_user_pages(struct task_struct
 					return -EFAULT;
 				}
 				if (ret & VM_FAULT_SIGBUS)
-					return i ? i : -EFAULT;
+					goto efault;
 				BUG();
 			}
@@ -1895,6 +1918,8 @@ next_page:
 		} while (nr_pages && start < vma->vm_end);
 	} while (nr_pages);
 	return i;
+efault:
+	return i ? : -EFAULT;
 }
 EXPORT_SYMBOL(__get_user_pages);

@@ -1962,9 +1987,8 @@ int fixup_user_fault(struct task_struct
  * @start:	starting user address
  * @nr_pages:	number of pages from start to pin
  * @write:	whether pages will be written to by the caller
- * @force:	whether to force write access even if user mapping is
- *		readonly. This will result in the page being COWed even
- *		in MAP_SHARED mappings. You do not want this.
+ * @force:	whether to force access even when user mapping is currently
+ *		protected (but never forces write access to shared mapping).
  * @pages:	array that receives pointers to the pages pinned.
  *		Should be at least nr_pages long. Or NULL, if caller
  *		only intends to ensure the pages are faulted in.