Parallel Computing Performance Worse than Serial

Hi Geant4 community members,

I am pretty new to Geant4. I am testing the multi-threaded performance of the B1 example, but I got some bizarre results. The best performance comes from running in serial; with the multi-threaded run manager, the run time is much longer. Even if I set the number of threads to 1 (which, in theory, should be equivalent to running in serial), the performance is still much worse than in serial mode. I also noticed that the CPU usage is much higher in the serial run (63.1%), whereas in the multi-threaded runs it stays at about 16-18% no matter how many threads I use. I don’t know if this is related to the issue.

I did not change the code of the B1 example, except that in the main function I set the number of threads directly by adding the line “runManager->SetNumberOfThreads(2);” after initializing the run manager.

I used the Qt GUI to run the simulation with the command “/run/beamOn 10000” in each case. It looks to me as if my multi-threaded setup was correct: with 4 threads, each thread processes 2500 events, and with 2 threads, each thread processes about 5000 events. But I find the performance really puzzling. Please let me know if you need more information to diagnose the issue.

My laptop is a 2018 MacBook Pro with the following configuration:
Operating system: macOS 10.15.7.
Processor: 2.6 GHz 6-Core Intel Core i7.
Memory: 16 GB 2400 MHz DDR4.
Graphics: Radeon Pro 560X 4 GB; Intel UHD Graphics 630 1536 MB.

Here is the console output of the runs:
When running in multi-thread, with number of threads = 4:

##############################################################################
Available UI session types: [ Qt, GAG, tcsh, csh ]
G4WT0 > /control/saveHistory
G4WT3 > /control/saveHistory
G4WT1 > /control/saveHistory
G4WT2 > /control/saveHistory
G4WT0 > /run/verbose 2
G4WT3 > /run/verbose 2
G4WT2 > /run/verbose 2
G4WT1 > /run/verbose 2
G4WT0 > /run/initialize
G4WT2 > /run/initialize
G4WT1 > /run/initialize
G4WT0 > /run/physicsModified
G4WT3 > /run/initialize
G4WT2 > /run/physicsModified
G4WT1 > /run/physicsModified
G4WT3 > /run/physicsModified
G4WT1 > /tracking/storeTrajectory 2
G4WT0 > /tracking/storeTrajectory 2
G4WT3 > /tracking/storeTrajectory 2
G4WT2 > /tracking/storeTrajectory 2
G4WT1 > ### Run 0 starts on worker thread 1.
G4WT0 > ### Run 0 starts on worker thread 0.
G4WT3 > ### Run 0 starts on worker thread 3.
G4WT2 > ### Run 0 starts on worker thread 2.


G4WT1 > Thread-local run terminated.
G4WT1 > Run Summary
G4WT1 > Number of events processed : 2500
G4WT1 > User=1.840000s Real=10.763816s Sys=0.100000s [Cpu=18.0%]
G4WT1 >
G4WT1 > --------------------End of Local Run------------------------
G4WT1 > The run consists of 2500 gamma of 6 MeV
G4WT1 > Cumulated dose per run, in scoring volume : 112.939 picoGy rms = 6.26384 picoGy
G4WT1 > ------------------------------------------------------------
G4WT1 >
G4WT2 > Thread-local run terminated.
G4WT2 > Run Summary
G4WT2 > Number of events processed : 2500
G4WT2 > User=1.840000s Real=10.764333s Sys=0.100000s [Cpu=18.0%]
G4WT2 >
G4WT2 > --------------------End of Local Run------------------------
G4WT2 > The run consists of 2500 gamma of 6 MeV
G4WT2 > Cumulated dose per run, in scoring volume : 106.1 picoGy rms = 6.11031 picoGy
G4WT2 > ------------------------------------------------------------
G4WT2 >
G4WT0 > Thread-local run terminated.
G4WT0 > Run Summary
G4WT0 > Number of events processed : 2500
G4WT0 > User=1.840000s Real=10.765963s Sys=0.100000s [Cpu=18.0%]
G4WT0 >
G4WT0 > --------------------End of Local Run------------------------
G4WT0 > The run consists of 2500 gamma of 6 MeV
G4WT0 > Cumulated dose per run, in scoring volume : 105.887 picoGy rms = 6.19659 picoGy
G4WT0 > ------------------------------------------------------------
G4WT0 >
G4WT3 > Thread-local run terminated.
G4WT3 > Run Summary
G4WT3 > Number of events processed : 2500
G4WT3 > User=1.840000s Real=10.766906s Sys=0.100000s [Cpu=18.0%]
G4WT3 >
G4WT3 > --------------------End of Local Run------------------------
G4WT3 > The run consists of 2500 gamma of 6 MeV
G4WT3 > Cumulated dose per run, in scoring volume : 106.636 picoGy rms = 6.19897 picoGy
G4WT3 > ------------------------------------------------------------
G4WT3 >

##############################################################################

When running in multi-thread, with number of threads = 2:
##############################################################################
G4WT0 > /control/saveHistory
G4WT1 > /control/saveHistory
G4WT0 > /run/verbose 2
G4WT0 > /run/initialize
G4WT0 > /run/physicsModified
G4WT1 > /run/verbose 2
G4WT1 > /run/initialize
G4WT1 > /run/physicsModified
G4WT1 > /tracking/storeTrajectory 2
G4WT0 > /tracking/storeTrajectory 2
G4WT1 > ### Run 0 starts on worker thread 1.
G4WT0 > ### Run 0 starts on worker thread 0.


G4WT1 > Thread-local run terminated.
G4WT1 > Run Summary
G4WT1 > Number of events processed : 4970
G4WT1 > User=1.690000s Real=10.759053s Sys=0.110000s [Cpu=16.7%]
G4WT1 >
G4WT1 > --------------------End of Local Run------------------------
G4WT1 > The run consists of 4970 gamma of 6 MeV
G4WT1 > Cumulated dose per run, in scoring volume : 231.798 picoGy rms = 9.16681 picoGy
G4WT1 > ------------------------------------------------------------
G4WT1 >
G4WT0 > Thread-local run terminated.
G4WT0 > Run Summary
G4WT0 > Number of events processed : 5030
G4WT0 > User=1.690000s Real=10.761573s Sys=0.110000s [Cpu=16.7%]
G4WT0 >
G4WT0 > --------------------End of Local Run------------------------
G4WT0 > The run consists of 5030 gamma of 6 MeV
G4WT0 > Cumulated dose per run, in scoring volume : 199.764 picoGy rms = 8.32225 picoGy
G4WT0 > ------------------------------------------------------------
G4WT0 >

##############################################################################

When running in multi-thread, with number of threads = 1:
##############################################################################
Available UI session types: [ Qt, GAG, tcsh, csh ]
G4WT0 > /control/saveHistory
G4WT0 > /run/verbose 2
G4WT0 > /run/initialize
G4WT0 > /run/physicsModified
G4WT0 > /tracking/storeTrajectory 2
G4WT0 > ### Run 0 starts on worker thread 0.


G4WT0 > Thread-local run terminated.
G4WT0 > Run Summary
G4WT0 > Number of events processed : 10000
G4WT0 > User=1.380000s Real=7.935211s Sys=0.060000s [Cpu=18.1%]
G4WT0 >
G4WT0 > --------------------End of Local Run------------------------
G4WT0 > The run consists of 10000 gamma of 6 MeV
G4WT0 > Cumulated dose per run, in scoring volume : 431.562 picoGy rms = 12.3859 picoGy
G4WT0 > ------------------------------------------------------------
G4WT0 >
##############################################################################

When running in serial:
##############################################################################
Run terminated.
Run Summary
Number of events processed : 10000
User=0.990000s Real=1.649343s Sys=0.050000s [Cpu=63.1%]

--------------------End of Global Run-----------------------
The run consists of 10000 gamma of 6 MeV
Cumulated dose per run, in scoring volume : 410.143 picoGy rms = 11.9778 picoGy
##############################################################################

Does exampleB1 open any output files? You could be getting caught up with mutexes and I/O contention.

Visualisation runs in its own thread, receiving events one by one from the multiple worker threads, and it holds up the worker threads if the queue for drawing gets full. Did you get a warning message such as this one?

G4WT2 > WARNING: The number of events in the visualisation queue has exceeded
  the maximum, 100.
  If, during a multithreaded run, the simulation gets ahead of the
  visualisation by more than this maximum, the simulation is delayed
  until the vis sub-thread has drawn a few more events and removed them
  from the queue.  You may change this maximum number of events with
  "/vis/multithreading/maxEventQueueSize <N>", where N is the maximum
  number you wish to allow.  N <= 0 means "unlimited".
  Alternatively you may choose to discard events for drawing by setting
  "/vis/multithreading/actionOnEventQueueFull discard".
  To avoid visualisation altogether: "/vis/disable".
  And maybe "/tracking/storeTrajectories 0".

So use either /vis/disable or /vis/multithreading/actionOnEventQueueFull discard before running a large number of events.
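
As a minimal sketch of the first option, the commands could be entered in the session (or placed in a macro file) in this order. The event count is simply the one used in the original post, and nothing here is specific to exampleB1.

```
# Turn visualisation off so the worker threads are not held up by the drawing queue
/vis/disable
# Then run the same number of events as in the original test
/run/beamOn 10000
```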

I don’t think it opens any output files. I will double check it.

Hi Allison,

Thank you so very much! After disabling the visualization, I can indeed see enhanced performance.

BTW: I did see the warning message, but ignored it… This is a terrible habit that I need to get out of badly…

Really appreciate your help. This problem has been bothering me for a while.