ThreadSanitizer warnings for cThread

Message ID Y5TCSnuu3uNGxtUX@jyty
State New
Headers
Series ThreadSanitizer warnings for cThread |

Commit Message

Marko Mäkelä Dec. 10, 2022, 5:30 p.m. UTC
  Because of the heap-use-after-free race condition that was rather easily 
reproducible with AddressSanitizer (-fsanitize=address), I thought that 
I should finally try to learn to use ThreadSanitizer (TSAN, 
-fsanitize=thread in GCC and clang).

https://clang.llvm.org/docs/ThreadSanitizer.html

Because VDR makes use of POSIX thread synchronization primitives, no 
additional instrumentation via <sanitizer/tsan_interface.h> should be 
necessary.

Before C++11 defined a memory model for multi-threaded applications, 
semantics around shared data structures were rather unclear, and I would 
guess that most multi-threaded pre-C++11 code bases would trip 
ThreadSanitizer. Also, multi-threaded CPUs were rare in the early days, 
and the Total Store Order of the x86 is very forgiving, compared to the 
weak memory model of ARM (see 
https://www.cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html for some 
examples).

To silence one prominent source of TSAN warnings in the cThread 
destructor, I applied the attached patch. It is not ideal, because 
std::atomic defaults to std::memory_order_seq_cst while 
std::memory_order_relaxed would likely suffice here.

Even after applying the first patch, a simple test with no DVB receiver 
device and no valid data directory would still produce a couple of data 
race reports (see the end of this message). I recorded a trace of such a 
run with "rr record" (https://rr-project.org) and debugged it in "rr 
replay". Unsurprisingly, I did not find actual evidence of a race 
condition.

Finally, I figured out what is causing the first report: 
cThread::description is not protected by cThread::mutex. Possibly, most 
cThread data members (including cThread::active) should be protected by 
cThread::mutex?

With both attached patches applied, the report of the first race will 
disappear. The second race is apparently about some memory that is 
allocated inside opendir(). I did not figure it out yet.

Related to this, cThread::Cancel() especially when invoked with 
WaitSeconds=-1 looks problematic to me, and I see that VDR is invoking 
pthread_detach() and never invoking pthread_join(). The second patch 
includes an attempt to clean this up as well.

Both patches are just a proof of concept; I did not test them beyond the 
simple failing-to-start VDR run under TSAN. Unfortunately, TSAN is not 
available for my primary VDR hardware, running on 32-bit ARM.

With best regards,

	Marko

vdr: error while reading '/var/lib/vdr/sources.conf'
vdr: error while reading '/var/lib/vdr/channels.conf'
vdr: no primary device found - using first device!
==================
WARNING: ThreadSanitizer: data race on vptr (ctor/dtor vs virtual call) (pid=96847)
   Write of size 8 at 0x7ffc7f773e60 by main thread:
     #0 cThread::~cThread() /dev/shm/vdr/thread.c:249 (vdr+0x1d72b8)
     #1 cEpgDataReader::~cEpgDataReader() /dev/shm/vdr/epg.h:236 (vdr+0xa956b)
     #2 main /dev/shm/vdr/vdr.c:731 (vdr+0xa956b)

   Previous read of size 8 at 0x7ffc7f773e60 by thread T2:
     #0 cThread::StartThread(cThread*) /dev/shm/vdr/thread.c:293 (vdr+0x1d76d9)

   Location is stack of main thread.

   Location is global '<null>' at 0x000000000000 ([stack]+0x1fe60)

   Thread T2 'epg data reader' (tid=96855, finished) created by main thread at:
     #0 pthread_create ../../../../src/libsanitizer/tsan/tsan_interceptors_posix.cpp:1001 (libtsan.so.2+0x5e686)
     #1 cThread::Start() /dev/shm/vdr/thread.c:316 (vdr+0x1d6fe0)
     #2 main /dev/shm/vdr/vdr.c:804 (vdr+0xaa477)

SUMMARY: ThreadSanitizer: data race on vptr (ctor/dtor vs virtual call) /dev/shm/vdr/thread.c:249 in cThread::~cThread()
==================
==================
WARNING: ThreadSanitizer: data race (pid=96847)
   Write of size 8 at 0x7b04000005a0 by main thread:
     #0 free ../../../../src/libsanitizer/tsan/tsan_interceptors_posix.cpp:706 (libtsan.so.2+0x47e82)
     #1 cString::~cString() /dev/shm/vdr/tools.c:1097 (vdr+0x1e52df)
     #2 cxa_at_exit_wrapper ../../../../src/libsanitizer/tsan/tsan_interceptors_posix.cpp:389 (libtsan.so.2+0x2dee3)

   Previous read of size 8 at 0x7b04000005a0 by thread T1:
     #0 opendir ../../../../src/libsanitizer/sanitizer_common/sanitizer_common_interceptors.inc:3271 (libtsan.so.2+0x4c641)
     #1 cReadDir::cReadDir(char const*) /dev/shm/vdr/tools.c:1553 (vdr+0x1ea8bd)
     #2 cVideoDirectoryScannerThread::ScanVideoDir(char const*, int, int) /dev/shm/vdr/recording.c:1439 (vdr+0x180620)
     #3 cVideoDirectoryScannerThread::Action() /dev/shm/vdr/recording.c:1433 (vdr+0x180bfc)
     #4 cThread::StartThread(cThread*) /dev/shm/vdr/thread.c:293 (vdr+0x1d76ea)

   Thread T1 'video directory' (tid=96854, running) created by main thread at:
     #0 pthread_create ../../../../src/libsanitizer/tsan/tsan_interceptors_posix.cpp:1001 (libtsan.so.2+0x5e686)
     #1 cThread::Start() /dev/shm/vdr/thread.c:316 (vdr+0x1d6fe0)
     #2 cRecordings::Update(bool) /dev/shm/vdr/recording.c:1554 (vdr+0x175387)
     #3 main /dev/shm/vdr/vdr.c:788 (vdr+0xaa3f8)

SUMMARY: ThreadSanitizer: data race /dev/shm/vdr/tools.c:1097 in cString::~cString()
==================
ThreadSanitizer: reported 2 warnings
  

Comments

Marko Mäkelä Dec. 10, 2022, 5:46 p.m. UTC | #1
Sat, Dec 10, 2022 at 07:30:50PM +0200, Marko Mäkelä wrote:
>Finally, I figured out what is causing the first report: 
>cThread::description is not protected by cThread::mutex.

Sorry, I failed to notice that even after applying both patches, both 
TSAN reports are still there. The race condition apparently is about the 
Thread->Action() member function call, which is not protected by 
anything. Perhaps cThread::StartThread() should acquire the 
Thread->mutex very early on? Or TSAN simply would insist that 
cThread::Cancel() be always called before the destructor?

	Marko
  
Marko Mäkelä Dec. 12, 2022, 2:53 p.m. UTC | #2
Sat, Dec 10, 2022 at 07:30:50PM +0200, Marko Mäkelä wrote:
>Because of the heap-use-after-free race condition that was rather 
>easily reproducible with AddressSanitizer (-fsanitize=address), I 
>thought that I should finally try to learn to use ThreadSanitizer 
>(TSAN, -fsanitize=thread in GCC and clang).
>
>https://clang.llvm.org/docs/ThreadSanitizer.html
>
>Because VDR makes use of POSIX thread synchronization primitives, no 
>additional instrumentation via <sanitizer/tsan_interface.h> should be 
>necessary.
>
>Before C++11 defined a memory model for multi-threaded applications, 
>semantics around shared data structures were rather unclear, and I 
>would guess that most multi-threaded pre-C++11 code bases would trip 
>ThreadSanitizer. Also, multi-threaded CPUs were rare in the early 
>days, and the Total Store Order of the x86 is very forgiving, compared 
>to the weak memory model of ARM (see 
>https://www.cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html for some 
>examples).

https://github.com/google/sanitizers/wiki/ThreadSanitizerPopularDataRaces 
gives some examples of races, which seem to be possible in VDR. Since 
there are not many virtual member functions in a multithreaded software 
component that I maintain, I was not even aware that a vptr could be 
modified inside a destructor.

Of course, may be a huge gap between something bad reported by 
ThreadSanitizer and something bad actually being rather easily 
reproducible. For example, lock-order-inversion potentially causes a 
deadlock. An actual deadlock requires an (un)fortunate scheduling of 
multiple threads that are acquiring some locks in the opposite order.

The wiki page includes the following:
>On architectures other than x86, the cache effects or out-of-order 
>instruction execution may lead to other subtle problems.

Some race conditions on x86 may be merely about a missing "compiler 
barrier", which would prevent reordering some code at compilation time.

On ARM, POWER, RISC-V and other CPUs that implement a weak memory model, 
special instructions are needed for guaranteeing the correct memory 
ordering, for example, when publishing a dynamically constructed object 
in a shared data structure:
https://en.cppreference.com/w/cpp/atomic/memory_order#Release-Acquire_ordering
On the x86, there is no difference between that and relaxed memory 
ordering, perhaps expect for "compiler barriers".

	Marko
  
Klaus Schmidinger Dec. 14, 2022, 9:21 a.m. UTC | #3
On 10.12.22 18:30, Marko Mäkelä wrote:
> ...
> Finally, I figured out what is causing the first report: cThread::description is not protected by cThread::mutex. Possibly, most cThread data members (including cThread::active) should be protected by cThread::mutex?

Unless I'm missing something, cThread::description is never accessed while the thread
is actually running, locking isn't necessary here.

> ...
> Related to this, cThread::Cancel() especially when invoked with WaitSeconds=-1 looks problematic to me, and I see that VDR is invoking pthread_detach() and never invoking pthread_join(). The second patch includes an attempt to clean this up as well.

Is there an actual problem that requires this?
It has been that way for many, many years, so I'd like to see more than
"looks problematic to me" before I dare touch this ;-).

Klaus
  
Marko Mäkelä Dec. 14, 2022, 10:05 a.m. UTC | #4
Wed, Dec 14, 2022 at 10:21:10AM +0100, Klaus Schmidinger wrote:
>Is there an actual problem that requires this?
>It has been that way for many, many years, so I'd like to see more than 
>"looks problematic to me" before I dare touch this ;-).

The only problem that I am currently aware of is something specific to 
rpihddevice shutdown and not VDR itself, and so far it only occurred 
when I compiled the code with -fsanitize=undefined.

Some 15 years ago, VDR startup with softdevice (not softhddevice) would 
hang once in a few months (say, once in 100 to 300 attempts) on a 
single-core system, but I don't remember trying to analyze it back then.  
It should have been on a 2.6 kernel and NPTL (not LinuxThreads), that 
is, not too different from what we have now.

Code around thread creation and destruction is indeed tricky, and the 
current implementation could be fine, only not "in good style" according 
to ThreadSanitizer. If there is not too much thread creation and 
destruction during normal usage, any actual race conditions might only 
affect startup or shutdown.

I could give ThreadSanitizer another try, on a PC and a USB receiver 
stick, to see what it complains during normal usage, and if something 
can be fixed easily. My main motivation is to learn to use the tool, in 
order to make use of it in another (much larger) code base that will 
require some additional instrumentation due to the use of some custom 
synchronization primitives.

	Marko
  

Patch

diff --git a/thread.c b/thread.c
index 93eb8c0d..5f76ca76 100644
--- a/thread.c
+++ b/thread.c
@@ -249,7 +249,11 @@  cThread::cThread(const char *Description, bool LowPriority)
 cThread::~cThread()
 {
   Cancel(); // just in case the derived class didn't call it
+  if (childTid)
+    pthread_join(childTid, NULL);
+  mutex.Lock();
   free(description);
+  mutex.Unlock();
 }
 
 void cThread::SetPriority(int Priority)
@@ -266,6 +270,7 @@  void cThread::SetIOPriority(int Priority)
 
 void cThread::SetDescription(const char *Description, ...)
 {
+  mutex.Lock();
   free(description);
   description = NULL;
   if (Description) {
@@ -274,6 +279,7 @@  void cThread::SetDescription(const char *Description, ...)
      description = strdup(cString::vsprintf(Description, ap));
      va_end(ap);
      }
+  mutex.Unlock();
 }
 
 void *cThread::StartThread(cThread *Thread)
@@ -291,8 +297,10 @@  void *cThread::StartThread(cThread *Thread)
      Thread->SetIOPriority(7);
      }
   Thread->Action();
+  Thread->mutex.Lock();
   if (Thread->description)
      dsyslog("%s thread ended (pid=%d, tid=%d)", Thread->description, getpid(), Thread->childThreadId);
+  Thread->mutex.Unlock();
   Thread->running = false;
   Thread->active = false;
   return NULL;
@@ -314,7 +322,6 @@  bool cThread::Start(void)
      if (!active) {
         active = running = true;
         if (pthread_create(&childTid, NULL, (void *(*) (void *))&StartThread, (void *)this) == 0) {
-           pthread_detach(childTid); // auto-reap
            }
         else {
            LOG_ERROR;
@@ -364,6 +371,7 @@  void cThread::Cancel(int WaitSeconds)
         esyslog("ERROR: %s thread %d won't end (waited %d seconds) - canceling it...", description ? description : "", childThreadId, WaitSeconds);
         }
      pthread_cancel(childTid);
+     pthread_join(childTid, NULL);
      childTid = 0;
      active = false;
      }