Strange Geant4 failures for some beams

I have already run several thousand Geant4 calculations between 1 MeV and 30 GeV for 107 benchmarks. The goal was to validate a reference deterministic nuclear-reactor code for electron transport. The Geant4 results were very satisfactory, in close agreement with the deterministic solver.

Recently, for various reasons, I had to reinstall Geant4 10.7 and Geant4 11.2 on clusters and supercomputers and rerun all the Geant4 simulations from scratch. Suddenly, the simulations became unstable. For several beams I still obtained the usual, expected results; for others, Geant4 fails abnormally. These new simulations were multi-threaded, so I reinstalled everything in single-threaded mode, but the problem persists. Geant4 succeeds for roughly 60-70% of the beams and fails for the rest. The failures are completely random and vary from one cluster to another. Do you have an explanation for this? Is it a problem related to the Geant4 installation?

The benchmark shown here is a slab of water, followed by a slab of aluminum, a slab of steel, and then another slab of water. Fig1 shows a typical case of Geant4 success (e.g., 2 MeV and 4 MeV). Fig2 shows a typical case of Geant4 failure (e.g., 1 MeV). Both figures are for the same benchmark, the same Geant4 application, the same classes and headers, the same calculation scheme, the same PhysicsList, and the same number of events, executed on the same cluster. The circles show the dose predicted by Geant4; the continuous lines show the dose predicted by the deterministic nuclear reactor physics code, Dragon5. The simulation is purely electron-based with a unidirectional electron beam. A StackingAction is applied to kill gamma photons, for reasons related to electron-library validation. All our published data in the literature are based on this functional StackingAction, and we have taken the effort to adapt it to multithreading.
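For context, the gamma-kill decision in our StackingAction follows the standard `G4UserStackingAction::ClassifyNewTrack` pattern (return `fKill` for gammas, `fUrgent` otherwise). The sketch below mirrors that logic in plain, framework-free C++ so the idea is visible without downloading the attachments; the names `TrackFate` and `ClassifySecondary` are illustrative stand-ins, not the attached class:

```cpp
#include <string>

// Illustrative stand-in for the decision made in ClassifyNewTrack:
// every gamma track is killed, every other track is processed normally.
// In the real Geant4 class this would return fKill / fUrgent from
// G4UserStackingAction::ClassifyNewTrack(const G4Track*).
enum class TrackFate { Urgent, Kill };

inline TrackFate ClassifySecondary(const std::string& particleName) {
    if (particleName == "gamma") return TrackFate::Kill;  // drop all photons
    return TrackFate::Urgent;                             // keep e-, e+, ...
}
```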

The results shown here can be reproduced with these configurations:
Geant4 Version: v10.7.p02
Operating System: AlmaLinux 9.3
Compiler/Version: gcc/9.3.0
CMake Version: cmake/3.18.4

Geant4 Version: v10.7.p02
Operating System: Rocky Linux 8.9 (Green Obsidian)
Compiler/Version: gcc/9.3.0
CMake Version: cmake/3.18.4



Fig1

Hello @Ahmed-Naceur,

It would be great if you could provide a “reproducer” of the failures you’re encountering. This will help us investigate the issue more effectively.

Thank you!


Thank you very much @dkonst for this quick response and for the help. Much appreciated!

I prepared a simplified version of the benchmark that reproduces the same random errors and successes. The version here is the multi-threaded version. The forum doesn’t allow uploading a compressed folder with the classes and headers, so I will try to upload them here individually, if that works for you. Otherwise, I’m not sure what you mean by “reproducer”—do you mean just the .mac files? The ‘E2_inf_mobetron_uni_*’ files are the .mac files for the 1 MeV, 2 MeV, 3 MeV and 4 MeV beams. I renamed .mac to .txt to respect the forum’s upload restrictions. For each beam, the .mac file is automatically generated by a bash script, and the detector (the DetectorConstruction.cc class) has its dimensions set to roughly the range of the beam. Note that the dimensions in y and z are ‘infinite’ to keep the simulation 1D for comparison purposes.

E2_inf_mobetron_uni_1.0MeV_200mesh_emlivermore.txt (4.4 KB)
E2_inf_mobetron_uni_2.0MeV_200mesh_emlivermore.txt (4.4 KB)
E2_inf_mobetron_uni_3.0MeV_200mesh_emlivermore.txt (4.4 KB)
E2_inf_mobetron_uni_4.0MeV_200mesh_emlivermore.txt (4.4 KB)
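For readers skimming the thread, a generated macro looks schematically like the following; the command sequence and values here are illustrative only, and the authoritative files are the attachments above:

```
# Schematic of one generated beam macro (illustrative values;
# see the attached .txt files for the real command sequence)
/run/initialize
/gun/particle e-
/gun/energy 1.0 MeV
/run/beamOn 15000000
```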

Here is the main script.
electron.cc (2.2 KB)

Here are the simplified classes:
ActionInitialization.cc (1.3 KB)
DetectorConstruction.cc (8.5 KB)
EventAction.cc (755 Bytes)
PhysicsList.cc (7.2 KB)
PhysicsListMessenger.cc (869 Bytes)
PrimaryGeneratorAction.cc (1.4 KB)
RunAction.cc (3.4 KB)
StackingAction.cc (1.7 KB)
SteppingAction.cc (1.5 KB)

Here are the headers:
ActionInitialization.hh (665 Bytes)
DetectorConstruction.hh (907 Bytes)
EventAction.hh (812 Bytes)
PhysicsList.hh (846 Bytes)
PhysicsListMessenger.hh (735 Bytes)
PrimaryGeneratorAction.hh (897 Bytes)
RunAction.hh (933 Bytes)
StackingAction.hh (867 Bytes)
SteppingAction.hh (766 Bytes)

Hello @Ahmed-Naceur,

Thank you for the clarification, and I apologize for the misunderstanding earlier. If I understand correctly now, your application is not crashing but producing varying physical results on different nodes. For example, with a 1 MeV beam, you sometimes obtain results that align with your reference (Dragon5), while other times, the results differ significantly. Is that correct?

To help diagnose the issue, could you try running the simulations without using multi-threading (MT) and see if the problem persists?
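If rebuilding in a serial configuration is inconvenient, a quick first check — assuming your application uses `G4MTRunManager` and reads a macro — is to pin the run to a single worker thread from the macro itself. Note that this command only exists in MT builds and must appear before `/run/initialize`:

```
# Force one worker thread in an MT build (set before /run/initialize)
/run/numberOfThreads 1
/run/initialize
```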

Cheers,
Dmitri


Hello @civanch and @atolosad,

Could you please have a look at the “Physics List” used in this simulation and let us know if it looks fine for you?

Also, do you foresee any issues with running this configuration in multi-threaded (MT) mode?

Thank you in advance for your assistance.

Cheers,

Dmitri


Hello,

At first glance, there is no problem with the Physics List.

It is always difficult to comment on an external application without a reproducer. It is much easier if a macro file for an existing Geant4 example is given. I would suggest reproducing the problem in $G4INSTALL/examples/extended/electromagnetic/TestEm5 (one slab) or in TestEm3 (up to 10 slabs). The macro file which demonstrates the problem may then be attached to this thread.
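As a rough sketch, the four-slab water/aluminum/steel/water benchmark could be approximated with a TestEm3 macro along these lines. The material names come from the NIST database; the thicknesses, transverse size, and event count are placeholders you would have to adapt to your geometry:

```
# Sketch of a TestEm3 macro mimicking the water/Al/steel/water benchmark
/testem/det/setNbOfAbsor 4
/testem/det/setAbsor 1 G4_WATER 2 cm
/testem/det/setAbsor 2 G4_Al 1 cm
/testem/det/setAbsor 3 G4_STAINLESS-STEEL 1 cm
/testem/det/setAbsor 4 G4_WATER 2 cm
/testem/det/setSizeYZ 40 cm
/testem/phys/addPhysics emlivermore
/run/initialize
/gun/particle e-
/gun/energy 1 MeV
/run/beamOn 1000000
```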

VI


Thank you, @dkonst, for the follow-up and assistance, and thank you, @civanch, for the verification and suggestion. Here are my answers to your questions, @dkonst (sorry for the delay; I was on vacation). What you initially said is true.

(1) Yes, the Geant4 application is not crashing. However, it is unstable and can randomly produce completely erroneous doses. Let me explain. We are irradiating heterogeneous as well as homogeneous benchmarks from Z=1 to Z=99 for energies between 1 MeV and 30 GeV. In total, there are 107 benchmarks and about 500 beams per benchmark. A good portion of these results and meta-analyses have been published in Refs. [1], [2], and [3], including Geant4 dose profiles. Those previous results were obtained on a single CPU on conventional machines.

After my new multi-threaded Geant4 installation on clusters and supercomputers, Geant4 behaves very unexpectedly. For example, consider the case of 20 beams for a benchmark “A”: 1 MeV, 2 MeV, 3 MeV, up to 20 MeV. The dose is predicted very well for all beams except 7 MeV and 14 MeV; for these two beams, Geant4 completely fails, giving non-physical results (see the example in Fig. 2, above). If I rerun these simulations on another cluster for the same benchmark “A”, all beams succeed except, this time, 5 MeV, 8 MeV, and 10 MeV. So the failing beams are entirely random. Therefore, to answer your second question: no, it is not a matter of doses not aligning, but rather that Geant4 gives a totally “chaotic” response in an “unexpected” and “random” manner for some beams. Geant4 is the reference for our study. Attached, you can see another example for another benchmark ‘B’ with 11 slabs: the 400 MeV figure shows a reasonable response from Geant4, while the 35 MeV figure shows a completely unexpected and inappropriate response.

(2) Yes, I have tried single-threading (with the same scripts shared above), and the same behavior persists. I then reinstalled Geant4 in single-threaded mode (on these same clusters and supercomputers) and modified the application shared above so that all classes are entirely single-threaded. The main was modified as follows (see
electron_1thread.cc (1.7 KB)). At that point I managed to drastically reduce the failure rate (say, from 20% of 500 beams down to 2% on these same clusters). But as you can see, some failures remain, and I need multi-threaded execution.

Have I answered your question, @dkonst?

What do you think? The ideal would be to run a non-regression test of the examples mentioned by @civanch, for beams between 1 MeV and 20 MeV with 15 million events each, and observe the deposited dose. If a regression appears, it means there is a problem with the Geant4 installation on these clusters and supercomputers. I don’t know whether you have a simpler diagnostic or whether you detect something else. Do you think the quality of the cmake and gcc installed on these machines could be the reason? These are the only pre-installed prerequisites over which I have no control on these clusters and supercomputers.


