Stale or corrupted world-volume pointer in Navigator after geometry rebuilds

mkelsey · September 26, 2019, 10:13pm

I am running a simulation which includes tracking charged particles through a volume with an associated electric field. I’ve set up my application so that I can adjust the detector voltage with a macro command; I should be (and used to be!) able to do a whole series of runs in a single job.

Since the voltage is used to build the G4UniformElectricField instance, which is part of the geometry, my job does the necessary cleanup and rebuilding of the geometry before each run. The commands I use are:

G4RunManager* rm = G4RunManager::GetRunManager();
rm->ReinitializeGeometry(true);
rm->GeometryHasBeenModified();
rm->InitializeGeometry();

Unfortunately, after a small number of runs in which this procedure is invoked, my job segfaults within G4ParallelWorldProcess. The backtrace looks like

(gdb) bt
#0  0x00007ffff035c42c in G4LogicalVolume::GetSolid() const ()
at /afs/slac.stanford.edu/package/geant4/vol51/geant4.10.03.p03/source/geometry/management/src/G4LogicalVolume.cc:368
#1  0x00007ffff039c4d6 in G4PropagatorInField::ComputeStep(G4FieldTrack&, double, double&, G4VPhysicalVolume*) ()
at /afs/slac.stanford.edu/package/geant4/vol51/geant4.10.03.p03/source/geometry/navigation/src/G4PropagatorInField.cc:217
#2  0x00007ffff03993c0 in G4PathFinder::DoNextCurvedStep(G4FieldTrack const&, double, G4VPhysicalVolume*) ()
at /afs/slac.stanford.edu/package/geant4/vol51/geant4.10.03.p03/source/geometry/navigation/src/G4PathFinder.cc:1207
#3  0x00007ffff0397492 in G4PathFinder::ComputeStep(G4FieldTrack const&, double, int, int, double&, ELimited&, G4FieldTrack&, G4VPhysicalVolume*) ()
at /afs/slac.stanford.edu/package/geant4/vol51/geant4.10.03.p03/source/geometry/navigation/src/G4PathFinder.cc:242
#4  0x00007ffff25c3997 in G4ParallelWorldProcess::AlongStepGetPhysicalInteractionLength(G4Track const&, double, double, double&, G4GPILSelection*) ()
at /afs/slac.stanford.edu/package/geant4/vol51/geant4.10.03.p03/source/processes/scoring/src/G4ParallelWorldProcess.cc:312

The specific line where it fails is

   fNavigator->GetWorldVolume()->GetLogicalVolume()->
           GetSolid()->DistanceToOut(StartPointA, VelocityUnit) )

where the GetSolid() call returns a null pointer. This is obviously “impossible” since the world volume has to exist and have been built properly (solid -> LV -> PV). I can’t tell whether the problem is a stale pointer in fNavigator, or if the current (correct) world-volume or other geometry object has gotten corrupted.

Are there any geometry store, or tracking related, cleanups which I should be doing in my application, beyond telling the RunManager about the geometry change?

gcosmo · October 4, 2019, 8:03am

Hi Mike, if you wipe out your mass geometry where a parallel geometry is also defined and associated to… you obviously need to re-do the association to the parallel world each time.

mkelsey · October 4, 2019, 2:07pm

Yes, of course. Does RunManager::ReinitializeGeometry(true) do that (i.e., does it delete the parallel world volumes)? Certainly our geometry constructor does rebuild the parallel world.

asaim · October 4, 2019, 3:06pm

Hi Mike,
Yes, RunManager::ReinitializeGeometry(true) deletes all solids, logical volumes and physical volumes, and strips out all logical volume pointers from regions except the “defaultRegion”.
https://geant4.kek.jp/lxr/source/run/src/G4RunManager.cc?v=10.3.p3#L923
But if you have a parallel world, the corresponding G4ParallelWorldProcess has to be informed of the deletion of the obsolete world and then set the new parallel world.
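To illustrate the point above, here is a minimal, self-contained sketch of why a consumer that caches a pointer to a named world has to be re-bound after a rebuild, even when the name is reused. All types (`World`, `GeometryStandIn`, `ProcessStandIn`) are hypothetical stand-ins, not Geant4 classes. `std::weak_ptr` is an assumption made only so the sketch can detect staleness safely; with the raw pointers that Geant4 passes around, a stale pointer cannot be detected this way.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <string>
#include <utility>

// Hypothetical stand-in for a world volume.
struct World {
  std::string name;
  int generation;
};

// Hypothetical stand-in for the geometry owner: a rebuild destroys the old
// world and creates a new one under the same name.
class GeometryStandIn {
public:
  void Rebuild(const std::string& name) {
    ++fGeneration;
    fWorlds[name] = std::make_shared<World>(World{name, fGeneration});
  }

  // Returns an empty pointer when no world of that name exists.
  std::shared_ptr<World> Find(const std::string& name) const {
    auto it = fWorlds.find(name);
    if (it == fWorlds.end()) return std::shared_ptr<World>();
    return it->second;
  }

private:
  std::map<std::string, std::shared_ptr<World>> fWorlds;
  int fGeneration = 0;
};

// Hypothetical stand-in for a process that caches a pointer to its world.
class ProcessStandIn {
public:
  explicit ProcessStandIn(std::string worldName)
    : fWorldName(std::move(worldName)) {}

  // Re-resolve the cached world by name; false if no such world exists.
  bool Rebind(const GeometryStandIn& geom) {
    fWorld = geom.Find(fWorldName);
    return !fWorld.expired();
  }

  // True when there is no live cached world (never bound, or destroyed).
  bool IsStale() const { return fWorld.expired(); }

  int Generation() const {
    auto w = fWorld.lock();
    assert(w && "cached world is stale; call Rebind() first");
    return w->generation;
  }

private:
  std::string fWorldName;
  std::weak_ptr<World> fWorld;
};
```

In this sketch, after each `Rebuild("Scorers")` the process reports `IsStale()` until `Rebind()` is called again, although the world name never changes.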

mkelsey · October 4, 2019, 4:35pm

Thanks, Makoto. I wondered about that, but I didn’t see an obvious way to do so. The G4ParallelWorldPhysics constructor takes the name of the world as an argument, and in our code we reuse the same name when the geometry gets rebuilt.

I don’t see any method to discard the obsolete world, but I do see the public G4ParallelWorldProcess::SetParallelWorld(string) and (PV) functions. We aren’t calling that function at present, since we pass the world name in the ctor for the physics builder, rather than the plain process.

If we rebuild the geometry, do we need to modify the geometry builder itself to find the parallel world process and do the notification? Or does the rm->InitializeGeometry() call take care of it?

mkelsey · October 4, 2019, 4:47pm

… Oh, I see. Maybe it shouldn’t happen automatically, because there’s no guarantee that the new geometry has the same named parallel world as before.

So the geometry builder does have to handle the notification explicitly.

asaim · October 4, 2019, 4:56pm

Yes, your geometry builder, or some other code that is invoked before the start of the next run, should do it.
How many parallel worlds do you have and how many of them are rebuilt at once? If all the parallel worlds are rebuilt simultaneously, you may simply invoke this method
https://geant4.kek.jp/lxr/source/processes/scoring/src/G4ParallelWorldProcessStore.cc#L81
and you are done. Please make sure to invoke it in your master thread.

mkelsey · October 4, 2019, 8:08pm

Thank you, Makoto. That’s much easier than what I was doing (scanning the process table for a particle to find the parallel-world process, then calling it from my G4VUserParallelWorld instance with the new world PV).

It looks like G4ParallelWorldProcessStore::UpdateWorlds() needs to be called after the parallel worlds have been constructed. Does that mean I can’t call it from my geometry builder Construct() (which is only invoked from master, if I recall correctly)?

asaim · October 4, 2019, 8:30pm

G4ParallelWorldProcessStore::UpdateWorlds() may be invoked at any time once G4ParallelWorldProcess is created. Thus, you can call it at the very bottom of your builder’s Construct() method, but it has to be protected with a static boolean so that it is not invoked the first time Construct() is called.
Alternatively, you may invoke G4ParallelWorldProcessStore::UpdateWorlds() from your BeginOfRunAction().
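The static-boolean guard mentioned above could look roughly like the following sketch. `StoreStandIn` and `BuilderStandIn` are hypothetical stand-ins, not the Geant4 classes; only the guard pattern is the point. The guard covers the first-call case only and says nothing about whether the parallel worlds already exist when the update runs.

```cpp
#include <cassert>

// Hypothetical stand-in for the process store: counts refresh requests.
struct StoreStandIn {
  int updates = 0;
  void UpdateWorlds() { ++updates; }
};

// Hypothetical stand-in for a geometry builder; Construct() runs once per
// geometry build or rebuild.
struct BuilderStandIn {
  StoreStandIn* store = nullptr;

  void Construct() {
    assert(store != nullptr && "builder needs a store to notify");

    // ... build the mass geometry here ...

    // Function-local static: true only during the first call in the job.
    static bool firstCall = true;
    if (firstCall) {
      firstCall = false;    // initial build: nothing to refresh yet
      return;
    }
    store->UpdateWorlds();  // every later call is a rebuild, so refresh
  }
};
```

Calling `Construct()` three times in this sketch leaves `store.updates` at 2: the initial build is skipped and each rebuild triggers one refresh.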

mkelsey · October 4, 2019, 8:52pm

Okay, I’m trying this now. …No, it failed on the second run. The failure is inside the UpdateWorlds() function:

G4TransportationManager::GetParallelWorld(G4String const&) (in libG4geometry.dylib) (G4TransportationManager.cc:172)
G4ParallelWorldProcess::SetParallelWorld(G4String) (in libG4processes.dylib) (G4ParallelWorldProcess.cc:105)
G4ParallelWorldProcessStore::UpdateWorlds() (in libG4processes.dylib) (G4ParallelWorldProcessStore.cc:87)
CDMSGeomConstructor::Construct() (in CDMS_G4DMC) (CDMSGeomConstructor.cc:443)
[...]

I guess this is because the new parallel worlds can’t be built (by calling our CDMSParallelWorld::Construct()) until after the main Construct() has returned the mass-world volume.

I will try using BeginOfRunAction(), and see if that works.

asaim · October 4, 2019, 9:06pm

Of course, the new parallel world has to be built before G4ParallelWorldProcessStore::UpdateWorlds() is invoked.
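The ordering constraint can be written down as a small state check. This is a sketch under assumptions: `RebuildSequence` and its phases are hypothetical and do not exist in Geant4, and the rule that the parallel world is built only after the mass world follows the guess made earlier in this thread rather than documented behaviour.

```cpp
#include <cassert>
#include <stdexcept>

// Hypothetical stand-in that tracks how far a geometry rebuild has progressed.
class RebuildSequence {
public:
  enum class Phase { Cleared, MassBuilt, ParallelBuilt, Updated };

  void Clear()         { fPhase = Phase::Cleared; }
  void ConstructMass() { fPhase = Phase::MassBuilt; }

  void ConstructParallel() {
    if (fPhase != Phase::MassBuilt)
      throw std::logic_error("parallel world requested before mass world");
    fPhase = Phase::ParallelBuilt;
  }

  // Refreshing is only meaningful once the new parallel world exists.
  void UpdateWorlds() {
    if (fPhase == Phase::Cleared || fPhase == Phase::MassBuilt)
      throw std::logic_error("update requested before parallel world exists");
    fPhase = Phase::Updated;
  }

  Phase Current() const { return fPhase; }

private:
  Phase fPhase = Phase::Cleared;
};

// Runs one full rebuild in the required order.
inline void RebuildInOrder(RebuildSequence& seq) {
  seq.Clear();
  seq.ConstructMass();
  seq.ConstructParallel();
  seq.UpdateWorlds();
  assert(seq.Current() == RebuildSequence::Phase::Updated);
}
```

An `UpdateWorlds()` issued right after `ConstructMass()`, which corresponds to calling it at the bottom of the mass-geometry Construct(), throws in this sketch, while `RebuildInOrder()` completes.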

mkelsey · October 4, 2019, 9:31pm

Yes, indeed. But it didn’t help. I have the UpdateWorlds() call in my BeginOfRunAction(); I ran in the debugger (LLDB) and confirmed that it was being called at the start of each run. The first three runs (each with a geometry rebuild in between) ran to completion, but the fourth failed with the same traceback I originally posted.

Starting at G4ParallelWorldProcess::AlongStepGetPhysicalInteractionLength(), the actual crash occurs via G4PropagatorInField::ComputeStep(), with an invalid world logical volume:

fNavigator->GetWorldVolume()->GetLogicalVolume()->
          GetSolid()->DistanceToOut(StartPointA, VelocityUnit));

The cached world volume in the parallel-world navigator is clearly corrupt:

(lldb) p *(fNavigator->fTopPhysical)
(G4VPhysicalVolume) $3 = {
  instanceID = 167479552
  flogical = 0x000000000000002c
  fname = (std::__1::string = "")
  flmother = 0x0000000000000000
  pvdata = 0x0000000109fb8960
}

asaim · October 4, 2019, 9:48pm

Is the geometry also rebuilt before the fourth run? How did the fourth run differ from the previous runs? Have you changed the name of the parallel world?

mkelsey · October 4, 2019, 10:09pm

All the runs in my test job have essentially identical physical geometry (same volumes, same names, etc.). What’s changing is the electric field attached to one of the volumes.

We’re using the CDMS simulation framework for this, which has the “restriction” that if you change any configuration related to any part of the geometry, the whole thing gets rebuilt (specifically, the user has to trigger a rebuild, or the old geometry will just get reused). That was much simpler to write, and more bulletproof.

It does seem to be related specifically to the electric field, and the Navigator using G4PropagatorInField for its work. If I do a similar job, but disable the electric field, I can process as many runs as I want without a failure.

mkelsey · October 8, 2019, 6:47pm

To be clear, the overall geometry in this test is relatively simple:

World: 5 x 5 x 5 m box of G4_Galactic

Detector: 100 cm diameter x 3.333 cm tall cylinder of G4_Ge, with some flats cut off. Placed at (0,0,0) in the World.

Parallel World: named “Scorers”, with NO internal volumes, just the registered world.

The detector volume has a G4UniformElectricField attached to it.

The geometry is completely deleted and rebuilt from scratch, identically, for each of my runs. The only thing which changes is the magnitude assigned to the electric field for each run.

mkelsey · October 16, 2019, 7:42pm

After doing some development work in our simulation, I went back to the version of the code that showed this problem, did a clean rebuild (yes, I had done multiple builds when I reported the issue), and now it won’t segfault.