Geant4 Version: 10.07.p04, 11.4.0
Operating System: Ubuntu, RHEL8, Mac
Compiler/Version: Various
CMake Version: Various
We are running into a problem where we cannot do multiple runs, with geometry changes, in a single job. We see a segfault, with a partial traceback that looks like:
#5 0x000015394a3476de in __memcmp_evex_movbe () from /lib64/libc.so.6
#6 0x000015394d680c5d in G4TransportationManager::IsWorldExisting(G4String const&) () from /scratch/group/mitchcomp/eb/x86_64/sw/Geant4/10.7.4-foss-2022b/lib64/libG4geometry.so
#7 0x000015394d6825dd in G4TransportationManager::GetParallelWorld(G4String const&) () from /scratch/group/mitchcomp/eb/x86_64/sw/Geant4/10.7.4-foss-2022b/lib64/libG4geometry.so
#8 0x000015394ea637fc in G4ParallelWorldProcess::SetParallelWorld(G4String) () from /scratch/group/mitchcomp/eb/x86_64/sw/Geant4/10.7.4-foss-2022b/lib64/libG4processes.so
#9 0x000015394ea6519a in G4ParallelWorldProcessStore::UpdateWorlds() () from /scratch/group/mitchcomp/eb/x86_64/sw/Geant4/10.7.4-foss-2022b/lib64/libG4processes.so
(This traceback is from 10.07.p04, but we also see the crash in 11.4.0 jobs.)
As near as I can tell, the transportation code in the second run is seeing stale volume pointers from the first geometry build. I would have expected those to be cleared out when the geometry is rebuilt. Is this something others have seen when using parallel worlds?
I was able to reproduce this problem. It seems to need both a parallel world and a field. In my tests, if I remove the field, the problem goes away and repeated geometry rebuilds run fine.
So at the moment it looks like the crash is triggered by the combination of:
parallel world + field + geometry rebuild + next run
Thank you for the confirmation, Dmitri! In our CDMS simulation framework, we have exactly that geometry configuration, with parallel worlds (for importance biasing in the shielding) and an electric field for our detector crystals.
In fact, the use case where I ran into this is generating a series of runs that ramp the voltage up, which is just the same two macro lines copied over and over with a different voltage each time.
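For anyone trying to picture that macro, here is a minimal sketch of what such a voltage ramp might look like. The `/mydet/voltage` command is a made-up placeholder, not the actual CDMS command and not a Geant4 built-in; it is assumed to update the field and flag the geometry for a rebuild. Only `/run/initialize` and `/run/beamOn` are standard Geant4 commands, and the event counts are arbitrary.

```
# Hypothetical sketch of a voltage-ramp macro.
# /mydet/voltage is an assumed user-defined command (placeholder name),
# taken here to change the field and trigger a geometry rebuild.
/run/initialize

# Run 1
/mydet/voltage 1 volt
/run/beamOn 100

# Run 2: same two lines, new voltage
/mydet/voltage 2 volt
/run/beamOn 100

# Run 3
/mydet/voltage 3 volt
/run/beamOn 100
```

With a parallel world and a field in the geometry, a macro of this shape would be expected to hit the crash at the start of the second run, consistent with the combination described above.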
I don’t have an MRE, since the CDMS simulation framework is large and cumbersome. Should I file a Bugzilla report anyway and reference you as a reproducer?
I have created Bugzilla report #2714, including the relevant portion of traceback from our CDMS job. Dmitri, if you have an MRE, would you mind attaching it there?
Many thanks to @gcosmo for fixing this bug so quickly, and so cleanly, and to @dkonst for pointing me at the Geant4 dev MR. It looks like I may be able to backport the fix to Geant4 10.07.p04 for our CDMS framework (which won't be on that version for much longer!), and we’ll get it in 11.4.2 when that comes out later in the spring(?).