Because of the heap-use-after-free race condition that was rather easily reproducible with AddressSanitizer (-fsanitize=address), I thought that I should finally try to learn to use ThreadSanitizer (TSAN, -fsanitize=thread in GCC and clang).
https://clang.llvm.org/docs/ThreadSanitizer.html
Because VDR makes use of POSIX thread synchronization primitives, no additional instrumentation via <sanitizer/tsan_interface.h> should be necessary.
Before C++11 defined a memory model for multi-threaded applications, semantics around shared data structures were rather unclear, and I would guess that most multi-threaded pre-C++11 code bases would trip ThreadSanitizer. Also, multi-threaded CPUs were rare in the early days, and the Total Store Order of the x86 is very forgiving, compared to the weak memory model of ARM (see https://www.cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html for some examples).
To silence one prominent source of TSAN warnings in the cThread destructor, I applied the attached patch. It is not ideal, because std::atomic defaults to std::memory_order_seq_cst while std::memory_order_relaxed would likely suffice here.
Even after applying the first patch, a simple test with no DVB receiver device and no valid data directory would still produce a couple of data race reports (see the end of this message). I recorded a trace of such a run with "rr record" (https://rr-project.org) and debugged it in "rr replay". Unsurprisingly, I did not find actual evidence of a race condition.
Finally, I figured out what is causing the first report: cThread::description is not protected by cThread::mutex. Possibly, most cThread data members (including cThread::active) should be protected by cThread::mutex?
With both attached patches applied, the report of the first race will disappear. The second race is apparently about some memory that is allocated inside opendir(). I did not figure it out yet.
Related to this, cThread::Cancel() especially when invoked with WaitSeconds=-1 looks problematic to me, and I see that VDR is invoking pthread_detach() and never invoking pthread_join(). The second patch includes an attempt to clean this up as well.
Both patches are just a proof of concept; I did not test them beyond the simple failing-to-start VDR run under TSAN. Unfortunately, TSAN is not available for my primary VDR hardware, running on 32-bit ARM.
With best regards,
Marko
vdr: error while reading '/var/lib/vdr/sources.conf' vdr: error while reading '/var/lib/vdr/channels.conf' vdr: no primary device found - using first device! ================== WARNING: ThreadSanitizer: data race on vptr (ctor/dtor vs virtual call) (pid=96847) Write of size 8 at 0x7ffc7f773e60 by main thread: #0 cThread::~cThread() /dev/shm/vdr/thread.c:249 (vdr+0x1d72b8) #1 cEpgDataReader::~cEpgDataReader() /dev/shm/vdr/epg.h:236 (vdr+0xa956b) #2 main /dev/shm/vdr/vdr.c:731 (vdr+0xa956b)
Previous read of size 8 at 0x7ffc7f773e60 by thread T2: #0 cThread::StartThread(cThread*) /dev/shm/vdr/thread.c:293 (vdr+0x1d76d9)
Location is stack of main thread.
Location is global '<null>' at 0x000000000000 ([stack]+0x1fe60)
Thread T2 'epg data reader' (tid=96855, finished) created by main thread at: #0 pthread_create ../../../../src/libsanitizer/tsan/tsan_interceptors_posix.cpp:1001 (libtsan.so.2+0x5e686) #1 cThread::Start() /dev/shm/vdr/thread.c:316 (vdr+0x1d6fe0) #2 main /dev/shm/vdr/vdr.c:804 (vdr+0xaa477)
SUMMARY: ThreadSanitizer: data race on vptr (ctor/dtor vs virtual call) /dev/shm/vdr/thread.c:249 in cThread::~cThread() ================== ================== WARNING: ThreadSanitizer: data race (pid=96847) Write of size 8 at 0x7b04000005a0 by main thread: #0 free ../../../../src/libsanitizer/tsan/tsan_interceptors_posix.cpp:706 (libtsan.so.2+0x47e82) #1 cString::~cString() /dev/shm/vdr/tools.c:1097 (vdr+0x1e52df) #2 cxa_at_exit_wrapper ../../../../src/libsanitizer/tsan/tsan_interceptors_posix.cpp:389 (libtsan.so.2+0x2dee3)
Previous read of size 8 at 0x7b04000005a0 by thread T1: #0 opendir ../../../../src/libsanitizer/sanitizer_common/sanitizer_common_interceptors.inc:3271 (libtsan.so.2+0x4c641) #1 cReadDir::cReadDir(char const*) /dev/shm/vdr/tools.c:1553 (vdr+0x1ea8bd) #2 cVideoDirectoryScannerThread::ScanVideoDir(char const*, int, int) /dev/shm/vdr/recording.c:1439 (vdr+0x180620) #3 cVideoDirectoryScannerThread::Action() /dev/shm/vdr/recording.c:1433 (vdr+0x180bfc) #4 cThread::StartThread(cThread*) /dev/shm/vdr/thread.c:293 (vdr+0x1d76ea)
Thread T1 'video directory' (tid=96854, running) created by main thread at: #0 pthread_create ../../../../src/libsanitizer/tsan/tsan_interceptors_posix.cpp:1001 (libtsan.so.2+0x5e686) #1 cThread::Start() /dev/shm/vdr/thread.c:316 (vdr+0x1d6fe0) #2 cRecordings::Update(bool) /dev/shm/vdr/recording.c:1554 (vdr+0x175387) #3 main /dev/shm/vdr/vdr.c:788 (vdr+0xaa3f8)
SUMMARY: ThreadSanitizer: data race /dev/shm/vdr/tools.c:1097 in cString::~cString() ================== ThreadSanitizer: reported 2 warnings
Sat, Dec 10, 2022 at 07:30:50PM +0200, Marko Mäkelä wrote:
Finally, I figured out what is causing the first report: cThread::description is not protected by cThread::mutex.
Sorry, I failed to notice that even after applying both patches, both TSAN reports are still there. The race condition apparently is about the Thread->Action() member function call, which is not protected by anything. Perhaps cThread::StartThread() should acquire the Thread->mutex very early on? Or TSAN simply would insist that cThread::Cancel() be always called before the destructor?
Marko
Sat, Dec 10, 2022 at 07:30:50PM +0200, Marko Mäkelä wrote:
Because of the heap-use-after-free race condition that was rather easily reproducible with AddressSanitizer (-fsanitize=address), I thought that I should finally try to learn to use ThreadSanitizer (TSAN, -fsanitize=thread in GCC and clang).
https://clang.llvm.org/docs/ThreadSanitizer.html
Because VDR makes use of POSIX thread synchronization primitives, no additional instrumentation via <sanitizer/tsan_interface.h> should be necessary.
Before C++11 defined a memory model for multi-threaded applications, semantics around shared data structures were rather unclear, and I would guess that most multi-threaded pre-C++11 code bases would trip ThreadSanitizer. Also, multi-threaded CPUs were rare in the early days, and the Total Store Order of the x86 is very forgiving, compared to the weak memory model of ARM (see https://www.cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html for some examples).
https://github.com/google/sanitizers/wiki/ThreadSanitizerPopularDataRaces gives some examples of races, which seem to be possible in VDR. Since there are not many virtual member functions in a multithreaded software component that I maintain, I was not even aware that a vptr could be modified inside a destructor.
Of course, may be a huge gap between something bad reported by ThreadSanitizer and something bad actually being rather easily reproducible. For example, lock-order-inversion potentially causes a deadlock. An actual deadlock requires an (un)fortunate scheduling of multiple threads that are acquiring some locks in the opposite order.
The wiki page includes the following:
On architectures other than x86, the cache effects or out-of-order instruction execution may lead to other subtle problems.
Some race conditions on x86 may be merely about a missing "compiler barrier", which would prevent reordering some code at compilation time.
On ARM, POWER, RISC-V and other CPUs that implement a weak memory model, special instructions are needed for guaranteeing the correct memory ordering, for example, when publishing a dynamically constructed object in a shared data structure: https://en.cppreference.com/w/cpp/atomic/memory_order#Release-Acquire_orderi... On the x86, there is no difference between that and relaxed memory ordering, perhaps expect for "compiler barriers".
Marko
On 10.12.22 18:30, Marko Mäkelä wrote:
... Finally, I figured out what is causing the first report: cThread::description is not protected by cThread::mutex. Possibly, most cThread data members (including cThread::active) should be protected by cThread::mutex?
Unless I'm missing something, cThread::description is never accessed while the thread is actually running, locking isn't necessary here.
... Related to this, cThread::Cancel() especially when invoked with WaitSeconds=-1 looks problematic to me, and I see that VDR is invoking pthread_detach() and never invoking pthread_join(). The second patch includes an attempt to clean this up as well.
Is there an actual problem that requires this? It has been that way for many, many years, so I'd like to see more than "looks problematic to me" before I dare touch this ;-).
Klaus
Wed, Dec 14, 2022 at 10:21:10AM +0100, Klaus Schmidinger wrote:
Is there an actual problem that requires this? It has been that way for many, many years, so I'd like to see more than "looks problematic to me" before I dare touch this ;-).
The only problem that I am currently aware of is something specific to rpihddevice shutdown and not VDR itself, and so far it only occurred when I compiled the code with -fsanitize=undefined.
Some 15 years ago, VDR startup with softdevice (not softhddevice) would hang once in a few months (say, once in 100 to 300 attempts) on a single-core system, but I don't remember trying to analyze it back then. It should have been on a 2.6 kernel and NPTL (not LinuxThreads), that is, not too different from what we have now.
Code around thread creation and destruction is indeed tricky, and the current implementation could be fine, only not "in good style" according to ThreadSanitizer. If there is not too much thread creation and destruction during normal usage, any actual race conditions might only affect startup or shutdown.
I could give ThreadSanitizer another try, on a PC and a USB receiver stick, to see what it complains during normal usage, and if something can be fixed easily. My main motivation is to learn to use the tool, in order to make use of it in another (much larger) code base that will require some additional instrumentation due to the use of some custom synchronization primitives.
Marko