I had multiple jobs running on our university cluster, some for around 70 days, and suddenly everything stopped with a segmentation fault on the common macro file. Attempts to restart the runs have resulted in a few particles being simulated before the same seg fault pops up again in the execute_error file, with nothing in the output file.
/var/spool/slurmdspool/job12639/slurm_script: line 14: 164200 Segmentation fault (core dumped) ./MultiOrtho run1E4.mac
run1E4.mac
# Macro file for Mono-Ortho
#
# Can be run in batch, without graphic
# or interactively: Idle> /control/execute run1.mac
#
# Change the default number of workers (in multi-threading mode)
#/run/numberOfThreads 4
#
# Initialize kernel
/run/initialize
/process/inactivate ePairProd
/process/inactivate hPairProd
/process/inactivate muPairProd
/process/inactivate eBrem
/process/inactivate hBrems
# /process/inactivate CoulombScat
#
/control/verbose 0
/run/verbose 0
/event/verbose 0
/tracking/verbose 0
#
# gamma 6 MeV to the direction (0.,0.,1.)
#
#/gun/particle gamma
#/gun/energy 6 MeV
#
/run/beamOn 10000
Line 14 is the call to /process/inactivate eBrem, which the seg fault hits even if I comment that line out as # /process/inactivate eBrem or #/process/inactivate eBrem.
Does Geant4 call the source code actively during a run? It seems highly improbable that runs 70 days and 10 hours long would hit identical programming errors simultaneously, and that all new runs would hit the same error within a minute. I suspect the install has been corrupted or something, and that's what's causing this error.
Thanks
Geant4 Version: 11.2.2
Operating System: server linux of some sort
Compiler/Version: Unknown
CMake Version: 3.26
C++ is a compiled language. Editing the source code won’t make any difference unless you recompile and relink your executables.
Your macro file as posted has no useful content (so you must be doing everything in your application source code), and you have not provided a traceback of the segfault, so there’s nothing even to start with.
Start by doing a rebuild, beginning with cmake -DCMAKE_BUILD_TYPE=RelWithDebInfo, so that you can get a sensible traceback.
Then try running one of your jobs interactively in the debugger. When it segfaults, you will be able to get your own traceback, look at the line of code where the traceback happened, and then investigate the values of variables, or step back through the calling frames to see what triggered the problem.
If you are unfamiliar with how to use a debugger, there are many, many resources available; that is out of scope for this forum.
Thank you Michael for the quick response. I find it suspicious that code that was stable for 70 days and about 6E3 source particles suddenly all crashes at the same time, given my understanding that it had already been compiled and was simply running the executable. The one thing that did catch my notice: although squeue reported run times in the 60+ day range, seff showed only about 10 hours of wall time for some of the job IDs, even though the outputs recorded 6-9E3 particles and showed file activity up until near the crash.
I’ll try to figure out how to add the traceback on the seg fault, as the cluster managers just said the examples run fine. Going through the examples list on the manual page doesn’t point to any that have a /process/inactivate call I could use to test whether that is working.
I’ll try to update if I can find a culprit as this technique of suppressing certain reactions greatly speeds up the runs for just checking the dominant physics.
“Worked for 70 days and then suddenly all crashes”. I mean, my gut tells me this is a memory leak. In addition to recompiling with debug flags I would also watch the memory usage of the cores you are using.
What are you simulating that needs days for 6000 particles?
Running on a cluster really makes a lot of this harder to do than a local run would be. The split between Linux on the cluster and Windows on my local machine has somewhat stovepiped my work onto the cluster. I’m trying to add the needed calls to the cmake build script and then track down where the log files go.
cmake -Wdev --debug-output --trace ..
I got more in the CMakeConfigureLog this time, but none of the CMakeFiles/CMakeOutput.log or CMakeFiles/CMakeError.log files I’ve seen referenced in debugger-mode posts.
The reason I’m stovepiped on the cluster comes from the answer to the question of “what takes that long”: it’s a several-thousand-element 3D radiography simulation (hundreds of AssemblyVolume placements of a multilayer configuration, with hundreds of individual Divisions) where I’m looking at the effects of internal scatter within the ~1.3 m long detector volume. This is also why I deactivate those reaction classes: earlier runs with all the physics enabled showed them to be relatively minor contributors to charge creation in the volume.
I have been continuing to work on this and have come to a realization: the “line 14” in the error refers to the SLURM script, i.e. the call ./MultiOrtho run01.mac, not line 14 of the macro. Looking at the execute file, it successfully runs 10 particles and then the seg fault happens. I have been unable to extract any more information about how it’s failing, as the error log has never gotten any more verbose despite adding the cmake debug flags. I’ll try to bring the project over to Windows and see if running cmake and a debugger by hand will track this down, as I’ve not had any luck on the Linux cluster.
I’m not sure this is telling you anything useful. That is a log of how the program was configured to be built and has nothing to say about what happens at runtime. Debugging on a cluster can be awkward, but you can configure/build Geant4 with its built-in backtrace support enabled.
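If memory serves, the relevant configure option is the built-in backtrace one, something along the lines of the command below; do double-check the option name against the build options documentation for 11.2 before relying on it.
cmake -DCMAKE_BUILD_TYPE=RelWithDebInfo -DGEANT4_BUILD_BUILTIN_BACKTRACE=ON ..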
My apologies for turning this into a C++-type tutorial, but I’m well out of my depth and have not been able to really find any local help.
I am trying to implement the G4_BACKTRACE and the output destination has me tripped up. Is that a call to an output location, and do I need to give it a directory or call stderr?
I’m not sure I understand your question. All you need to do to enable automatic backtracing is either:
Build Geant4 from scratch as I described in my last post.
or, in your main program, add:
// with the other headers
#include <G4Backtrace.hh>
...
// inside main(), generally before constructing the run manager
G4Backtrace::Enable();
The default set of signals handled is SIGQUIT, SIGILL, SIGABRT, SIGKILL, SIGBUS, and SIGSEGV, and it should write any backtrace to std::cerr, so the trace should appear wherever your job is sending the error stream.
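For orientation, here is a minimal sketch of where those two lines sit in a typical main(). The run-manager setup and the commented registration step are only illustrative placeholders, not taken from your MultiOrtho code:
// Minimal sketch only: registrations are placeholders for your own classes.
#include "G4Backtrace.hh"
#include "G4RunManagerFactory.hh"
#include "G4String.hh"
#include "G4UImanager.hh"

int main(int argc, char** argv)
{
    // Install the signal handlers first, before any Geant4 objects exist,
    // so a crash anywhere later writes a backtrace to std::cerr.
    G4Backtrace::Enable();

    auto* runManager = G4RunManagerFactory::CreateRunManager();

    // ... register detector construction, physics list, action initialization ...

    // Batch mode: execute the macro given on the command line,
    // e.g. ./MultiOrtho run1E4.mac
    if (argc > 1) {
        auto* ui = G4UImanager::GetUIpointer();
        ui->ApplyCommand(G4String("/control/execute ") + argv[1]);
    }

    delete runManager;
    return 0;
}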
Thanks for the detailed instructions. I can’t do the rebuild from scratch, that’s not a permission level I have, but I was able to insert the above commands into the main.cc and successfully rebuild the project. This time I was able to catch the output in stderr piped into a file. I also turned on the right level of verbosity to help figure out what the particles were doing in context when it happens.
CAUGHT SIGNAL: 11 ### address: 0x140, signal = SIGSEGV, value = 11, description = segmentation violation. Address not mapped to object.
I think this means the geometry is incomplete at that location, so there isn’t anything there despite a logical volume being defined.
Speculation on why all runs errored simultaneously when it had been working for days: I have been using a common error file name in the scripts, so perhaps some sort of check failed once that file became non-empty.
This looks like an issue in your ProcessHits with your sensitive detector [1/20]. If I am reading the trace right, it is maybe trying to print a string [2/20] that was null. So the geometry could have some overlap issues, or floating-point errors at boundaries if the subtraction operator was used a lot in the geometry.
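If that reading is right, a defensive check in ProcessHits is a cheap way to confirm it while you chase the geometry. A rough sketch of the kind of guard I mean (the class and variable names are made up, not from your code):
#include "G4VSensitiveDetector.hh"
#include "G4Step.hh"
#include "G4VPhysicalVolume.hh"
#include "G4ios.hh"

class OrthoSD : public G4VSensitiveDetector
{
public:
    explicit OrthoSD(const G4String& name) : G4VSensitiveDetector(name) {}

    G4bool ProcessHits(G4Step* step, G4TouchableHistory*) override
    {
        const auto& touchable = step->GetPreStepPoint()->GetTouchableHandle();
        const G4VPhysicalVolume* pv = touchable->GetVolume();

        // At overlapping or numerically marginal boundaries the touchable can
        // fail to resolve a volume; dereferencing a null pointer to fetch a
        // name is exactly the kind of thing that segfaults with
        // "address not mapped to object".
        if (pv == nullptr) {
            G4cerr << GetName() << ": step with no resolved volume, hit skipped"
                   << G4endl;
            return false;
        }

        G4cout << "Hit in " << pv->GetName() << G4endl;  // safe: pv checked above
        // ... fill the hit collection here ...
        return true;
    }
};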
Thanks for the interpretation of the backtrace. I was suspecting some sort of geometry error, but my mind was not going in this direction. The error cropping up now makes a bit of sense, as I had just corrected the Z spacing relative to the earlier setups, which ran but were geometrically inaccurate.
Not so much subtraction, but there is a lot of addition in the components, as they are supposed to be line-to-line in the stackup. The Z positions in the assembly elements are all summations off of the front-face reference: Ref + Obj1_Z + Obj2_Z + Obj3_Z, etc. I’ll start chasing this line of thought and report back.
Footnote: 95% sure this is the solution but want to test before declaring it such.
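To make that concrete, here is a rough sketch of the line-to-line stackup I mean, with each layer centre accumulated off the front-face reference. The layer names, sizes, and materials are placeholders, not my actual geometry:
#include <vector>

#include "G4AssemblyVolume.hh"
#include "G4Box.hh"
#include "G4LogicalVolume.hh"
#include "G4NistManager.hh"
#include "G4RotationMatrix.hh"
#include "G4SystemOfUnits.hh"
#include "G4ThreeVector.hh"

// Each layer is placed immediately behind the previous one by accumulating
// full thicknesses from the front-face reference (Ref + Obj1_Z + Obj2_Z + ...).
void BuildStack(G4AssemblyVolume* assembly, G4double zRef)
{
    auto* mat = G4NistManager::Instance()->FindOrBuildMaterial("G4_Si");

    const std::vector<G4double> thickness = {0.5*mm, 1.0*mm, 0.3*mm};  // placeholders
    G4double zFront = zRef;  // running front face of the next layer

    for (std::size_t i = 0; i < thickness.size(); ++i) {
        // G4Box takes HALF-lengths; mixing half- and full thicknesses in the
        // running sum is an easy way to create the overlaps suspected above.
        auto* solid = new G4Box("Layer", 5.*cm, 5.*cm, 0.5*thickness[i]);
        auto* logic = new G4LogicalVolume(solid, mat, "LayerLV");

        G4ThreeVector pos(0., 0., zFront + 0.5*thickness[i]);  // layer centre
        G4RotationMatrix* rot = nullptr;                       // no rotation
        assembly->AddPlacedVolume(logic, pos, rot);

        zFront += thickness[i];  // next layer starts exactly where this one ends
    }
}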
I’m going to break the geometry discussion off into a separate thread, as I think we’ve reached the end of the troubleshooting here and of the question of why everything crashed at once. That latter conclusion holds if there is some internal check that terminates a run once the error log file that all the scripts write to becomes non-empty.
#SBATCH --error=execute_error.txt
If that file is the same for all runs, then once execute_error.txt has non-zero length, will any script referencing it terminate?
The culprit was bad physics-list handling of the neutron interaction, whatever interaction that was. I made no changes to the geometry, only added FTFP_BERT_HPT to the physics list, and the runs have been working flawlessly. I also learned my lesson and now have unique error file names.
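For anyone landing on this thread later: one way to get per-job error files, so one crashing job can’t be confused with or trip up the others, is SLURM’s filename patterns, e.g. %j for the job ID. A sketch of the relevant batch-script lines:
#SBATCH --error=execute_error_%j.txt    # %j expands to the numeric SLURM job ID
#SBATCH --output=execute_output_%j.txt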