Multithreading isn't effective beyond a certain number of cores

rahmanyster · January 19, 2020, 8:45pm

Hello all, I am trying to simulate a problem that requires tracking a large number of particles, so I am using my university's cluster to speed up the simulation. The problem is that beyond a certain number of CPU cores, the simulation time no longer decreases, so adding more cores has no effect. For instance, the simulation time for my code is the same on 256 cores as on 1024 cores. I would appreciate any suggestions regarding this case.
Thank you in advance.

donahuw2 · January 20, 2020, 7:09pm

Geant4 does have some inherent scalability issues. These have been known for a while. When the multithreading package was first released, there was some excellent data on the scalability.
(https://www.slac.stanford.edu/xorg/geant4/tutorial/MC2015G4WS/Multithreading.pdf)

At 256 cores you have reached the scalability limits of adding more processors.

Another question for you: are you using the Geant4 MPI suite? Most clusters don’t have 256 or more cores on a single compute node (most have ~32 cores per node). Typically, the compute nodes need to communicate data between them. This is not the same as multithreading an application to run on multiple cores.

rahmanyster · January 20, 2020, 10:11pm

Thank you for your explanations. They were very insightful, and now everything makes sense. The cluster at our university has 72 cores per node, and looking at my test data, using more than 72 cores (one node) doesn’t make a difference. For instance, the run times for 72, 144, and 288 cores are all the same.
Again, thank you for your explanations. Unfortunately, I am a novice with Geant4, so I am going to ask another question: does Geant4 MPI make any difference in run time if I use, for example, 144 cores instead of 72?

jrmadsen · January 24, 2020, 9:25pm

To be clear, Geant4 does not have “inherent scalability issues”. Geant4 falls under the category of what is called “embarrassingly parallel” because no implicit ordering or synchronization is required at a very high level in the code. The limitation is at the hardware level, not in Geant4.
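
As a rough illustration of what “embarrassingly parallel” means, here is a minimal Python sketch. It is not Geant4 code and does not reflect how Geant4 actually schedules events; `simulate_event` and `run_partitioned` are hypothetical names made up for this example. Each "event" depends only on its own ID, so the events can be dealt out to any number of workers, in any order, and the combined result is unchanged.

```python
import random


def simulate_event(event_id):
    # Hypothetical stand-in for tracking one event. The random generator is
    # seeded only by the event's own ID, so the result never depends on
    # any other event or on the order in which events are processed.
    rng = random.Random(event_id)
    return rng.randint(0, 1000)


def run_partitioned(n_events, n_workers):
    # Deal event IDs round-robin to the workers. Each worker sums its own
    # events without talking to the others; the partial sums are combined
    # once at the end.
    partial_sums = []
    for worker in range(n_workers):
        my_events = range(worker, n_events, n_workers)
        partial_sums.append(sum(simulate_event(e) for e in my_events))
    return sum(partial_sums)


if __name__ == "__main__":
    # The total is identical no matter how many workers share the events.
    print(run_partitioned(1000, 1) == run_partitioned(1000, 72))
```

The workers here run one after another for simplicity; the point is only that no step requires one worker to wait for, or exchange data with, another.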

MPI is a standard for communicating between processes. Each process has one or more threads. All threads in a process must live on the same “computer” (node is the proper term). You are currently running one process with 72 threads. The node has only 72 cores, so two MPI processes, each running 72 threads on that same node, won’t provide you any benefit. The hardware limits you to doing only 72 things at one time. You use Geant4 MPI to link one process with 72 threads running on one node to another process with 72 threads running on an entirely separate node.
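
The arithmetic behind this can be sketched in a few lines of Python. This is a deliberately simple model, assuming one thread per core and ignoring hyperthreading, memory bandwidth, and communication cost; `effective_parallelism` and the `(node_id, threads)` placement format are hypothetical names chosen for illustration, not part of Geant4 or MPI.

```python
def effective_parallelism(cores_per_node, placements):
    """Estimate how many threads can truly run at the same time.

    placements is a list of (node_id, threads) pairs, one pair per process.
    Simplifying assumption: a node runs at most one thread per core.
    """
    # Add up the threads requested on each node across all processes.
    requested = {}
    for node_id, threads in placements:
        requested[node_id] = requested.get(node_id, 0) + threads
    # Each node contributes at most its core count, however many
    # threads are requested on it.
    return sum(min(cores_per_node, t) for t in requested.values())


if __name__ == "__main__":
    # One process with 72 threads on a 72-core node.
    print(effective_parallelism(72, [(0, 72)]))
    # Two processes with 72 threads each on the SAME node: no gain.
    print(effective_parallelism(72, [(0, 72), (0, 72)]))
    # Two processes with 72 threads each on SEPARATE nodes.
    print(effective_parallelism(72, [(0, 72), (1, 72)]))
```

Under this model the three cases give 72, 72, and 144: oversubscribing one node does not help, while spreading processes across nodes does.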