CPU Farm recommendations

swissco67 · January 28, 2020, 9:24pm

We’re considering acquiring a CPU farm for future GEANT4 simulations.

So I'm wondering if there are any lessons learned or recommendations on what kind of hardware to acquire, i.e. best bang for the buck and future compatible.

To start, we’d be interested in getting 64 cores - preferably on a single node so we don’t have to mess with MPI.

Alternatively, any recommendations for great external HPC access?

bmorgan · January 29, 2020, 11:38am

What you choose will be completely dependent on the profile of the application(s) to be run on it. It would be best to profile them first to find out specific requirements in things like CPU clock vs cores, RAM, I/O etc. Only that will give a reasonable bang-for-buck number.
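As a rough starting point for that kind of profiling, the sketch below times one run of a simulation executable and reports its CPU time and peak resident memory. It assumes Linux (where `ru_maxrss` is reported in KiB) and only uses the Python standard library; the function name `profile_command` and the command line in the usage note are hypothetical, not taken from this thread.

```python
import resource
import subprocess
import sys
import time


def profile_command(cmd):
    """Run cmd to completion; return (wall s, CPU s, peak RSS in MiB)."""
    before = resource.getrusage(resource.RUSAGE_CHILDREN)
    start = time.perf_counter()
    subprocess.run(cmd, check=True)
    wall = time.perf_counter() - start
    after = resource.getrusage(resource.RUSAGE_CHILDREN)

    # User + system CPU time consumed by the child process.
    cpu = (after.ru_utime - before.ru_utime) + (after.ru_stime - before.ru_stime)

    # Assumption: Linux, where ru_maxrss is in KiB. It is the peak of the
    # largest child waited for so far, so profile one command per script run.
    peak_mib = after.ru_maxrss / 1024
    return wall, cpu, peak_mib


if __name__ == "__main__" and len(sys.argv) > 1:
    wall, cpu, peak = profile_command(sys.argv[1:])
    print(f"wall {wall:.1f} s, cpu {cpu:.1f} s, "
          f"cpu/wall {cpu / wall:.2f}, peak RSS {peak:.0f} MiB")
```

Saved as, say, `profile_run.py`, it could be invoked as `python profile_run.py ./mySim run.mac` (both names are placeholders). A CPU/wall ratio close to the number of threads suggests the run is CPU-bound; a much lower ratio hints at waiting on I/O, memory, or locks.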

Jason_Jonson · January 29, 2020, 12:36pm

I think there is a Ryzen Threadripper being released soon that has 64 cores and 128 threads on a single CPU, which I believe will retail at around $4500. Then there are the questions of RAM and storage, but this CPU has lots of PCIe lanes for that.
https://www.amd.com/en/products/cpu/amd-ryzen-threadripper-3990x

swissco67 · January 29, 2020, 11:54pm

Currently running my simulations on the 2 systems below and getting greatly different run times, which is very surprising. The simulation loads a 180MB GDML model that has 1 detector location and runs 10 million electrons to compute the accumulated dose in this detector.

System 1 Hardware
Intel Xeon X650 @ 2.67GHz (2 processors), L1 cache 256KB, L2 1MB, L3 24MB
Installed 24GB RAM
2 sockets, 4 cores per socket, 1 thread per core
PassMark - CPU Mark 7266

Virtual Linux session (VM) installed on System 1, running Ubuntu 18.04.3 LTS
1 socket, 4 cores per socket, 1 thread per core
CPU MHz: 2659.93
BogoMIPS: 5319.86
Mem 7.8GB
Swap 472MB

System 2: Stand-alone Linux Install (Ubuntu 19.10)
Hardware: i5-3470 @ 3.2GHz, L1 cache 128KB, L2 1MB, L3 6MB
8GB memory 2GB swap
1 socket, 4 cores per socket, 1 thread per core
CPU MHz: 1596.44
BogoMIPS: 6385.70
RAM type: DDR3, 1.6GHz
PassMark - CPU Mark 6733

The simulation uses multi-threaded parallel processing and is run with 4 threads on both systems.

A run with 1E6 electrons on the VM system takes ~2 days; a run with 10E6 particles takes over 8 days. On the Linux box, the same runs take about 8 to 10 times longer to complete!
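To make these figures easier to compare across machines, they can be converted to events per second. The sketch below only restates the approximate numbers quoted above ("~2 days", "over 8 days", "8 to 10 times longer"), so the results are rough estimates rather than measurements; the function name is made up for illustration.

```python
SECONDS_PER_DAY = 86_400


def events_per_second(n_events, days):
    """Average throughput for a run of n_events that took `days` days."""
    return n_events / (days * SECONDS_PER_DAY)


# VM on System 1, using the approximate run times quoted in the post.
vm_small = events_per_second(1e6, 2)    # ~2 days
vm_large = events_per_second(10e6, 8)   # "over 8 days", so an upper bound

# Native Linux box: quoted as 8 to 10 times slower for the same runs.
linux_small_days = (2 * 8, 2 * 10)      # 16 to 20 days for 1E6 electrons

print(f"VM, 1E6 events:  {vm_small:.1f} events/s")
print(f"VM, 10E6 events: at most {vm_large:.1f} events/s")
print(f"Linux box, 1E6 events: {linux_small_days[0]} to {linux_small_days[1]} days")
```

Single-digit to low double-digit events per second on 4 threads is the number to compare against when benchmarking candidate hardware with the same geometry and physics.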

When I start another job on the VM machine, both jobs crash. I assume it is running out of memory, but I'm not sure. When I start 2 parallel jobs on the Linux box, they run fine.

Any idea why the virtual Linux session on System 1 runs so much more efficiently than the native Linux install on System 2?

It seems run performance is not scaling with CPU benchmark score. I'm not sure what the critical elements are for configuring an optimized system with >32 threads.

amadio · January 30, 2020, 7:54am

Your System 1 has an L1 cache twice as large and an L3 cache four times as large. This makes a big difference for Geant4 simulations. I would recommend machines with larger caches for best performance.

Jason_Jonson · January 30, 2020, 2:52pm

Also RAM availability perhaps? In my experience, trying to track more than one particle at a time in complex geometries causes RAM use to explode.

amadio · January 31, 2020, 12:26pm

Are you using Geant4-MT? The memory footprint with it should be much lower than with multiple processes, as cross section and geometry data are shared between threads.
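A back-of-the-envelope model shows why this matters when sizing RAM for a 64-core node. The sketch below assumes the footprint splits cleanly into a shared part (geometry, cross sections) and a per-worker part; the 1500 MB and 200 MB figures are purely hypothetical placeholders, not measurements of any Geant4 application.

```python
def footprint_multiprocess(n_workers, shared_mb, per_worker_mb):
    """Every process holds its own copy of the shared data."""
    return n_workers * (shared_mb + per_worker_mb)


def footprint_multithreaded(n_workers, shared_mb, per_worker_mb):
    """Shared data is held once; only per-worker state is replicated."""
    return shared_mb + n_workers * per_worker_mb


# Hypothetical numbers for illustration only.
SHARED_MB = 1500      # geometry + cross-section tables
PER_WORKER_MB = 200   # per-thread event and tracking state

for n in (4, 64):
    mp = footprint_multiprocess(n, SHARED_MB, PER_WORKER_MB)
    mt = footprint_multithreaded(n, SHARED_MB, PER_WORKER_MB)
    print(f"{n:2d} workers: {mp} MB as processes, {mt} MB as threads")
```

Under these assumed numbers the gap widens with core count, which is why the shared and per-thread parts of a real application are worth measuring separately before choosing how much RAM to buy.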

Jason_Jonson · January 31, 2020, 12:29pm

Indeed I am. I do not run multiple processes. I noticed the increased RAM usage especially when generating particles with the GPS and setting the /gps/number command above 1.