No performance improvement using Multi-Threading

Geant4 Version: 11.3.2
Operating System: Ubuntu 24.04 LTS (Noble Numbat)
Compiler/Version: g++ (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0
CMake Version: 3.28.3


Hi :slight_smile:
I have updated my application to run in multithreading mode. After some fixes, it now completes the job without crashing, but I don’t see any performance improvement. It is taking the same amount of time as the serial application.

I tried to understand why, but I cannot find any reason. Are there any suggestions?

I'll attach the parts of the code that may be related to the issue:

RunAction.cpp

#include "RunAction.h"

#include "RunMetadata.h"
#include <filesystem>

RunAction::RunAction()
{
    // print the event number every 1000 events
    G4RunManager::GetRunManager()->SetPrintProgress(1000);

    fHistManager = HistogramManager::Instance();
}

RunAction::~RunAction()
{
    delete fHistManager;
}

void RunAction::BeginOfRunAction(const G4Run *run)
{
    HistogramManager::Instance()->Book();

    G4RunManager *runManager = G4RunManager::GetRunManager();

    // inform the runManager not to save the random number seed
    runManager->SetRandomNumberStore(false);
}

void RunAction::EndOfRunAction(const G4Run *run)
{
    G4int nEvents = run->GetNumberOfEvent();

    HistogramManager::Instance()->Close();

    auto analysisManager = G4AnalysisManager::Instance();

    auto runMetadata = RunMetadata::Instance();
    runMetadata->SetNumberOfEvents(nEvents);

    std::string filename = analysisManager->GetFileName();
    std::string stem = std::filesystem::path(filename).stem().string();                // filename without extension
    std::filesystem::path parent_path = std::filesystem::path(filename).parent_path(); // return the path to the file location

    std::string jsonFilename = (parent_path / stem).string() + ".json";

    runMetadata->SaveToJson(jsonFilename);
}

All the creation and filling of ntuples is done in this class:

HistogramManager.cpp

#include "HistogramManager.h"

#include "G4SystemOfUnits.hh"

#include "MyHit.h"
#include "Names.h"

// Define the thread-local singleton pointer
G4ThreadLocal HistogramManager *HistogramManager::fInstance = nullptr;

HistogramManager *HistogramManager::Instance()
{
    if (!fInstance)
    {
        fInstance = new HistogramManager();
    }
    return fInstance;
}

HistogramManager::HistogramManager()
{
    G4AnalysisManager::Instance()->SetDefaultFileType("root"); // Default file type
    G4AnalysisManager::Instance()->SetFileName("output");      // Default file name
    G4AnalysisManager::Instance()->SetVerboseLevel(1);

    // Only merge in MT mode, to avoid a warning when running in sequential mode
#ifdef G4MULTITHREADED
    G4cout << "[INFO] SetNtupleMerging set to true!" << G4endl;
    G4AnalysisManager::Instance()->SetNtupleMerging(true);
#endif
}

HistogramManager::~HistogramManager()
{
}

void HistogramManager::Book()
{
    if (fIsBooked)
        return;

    Open();
    CreateNTuples();

    fIsBooked = true;
    G4cout << "[HistogramManager] Histograms and ntuples booked." << G4endl;
}

void HistogramManager::Open()
{
    G4AnalysisManager::Instance()->OpenFile(); // uses macro-set filename
    G4cout << "[HistogramManager] File opened." << G4endl;
}

void HistogramManager::Close()
{
    G4AnalysisManager::Instance()->Write();
    G4AnalysisManager::Instance()->CloseFile();
    G4cout << "[HistogramManager] File written and closed." << G4endl;
}

// -------------------------------------------------------------
// --------------- C R E A T E     M E T H O D S ---------------
// -------------------------------------------------------------

void HistogramManager::CreateNTuples()
{

    //! NTuple: Hit info

    G4AnalysisManager::Instance()->CreateNtuple(Names::nTupleHitData.name, "Run Hit data"); // ID: 1

    G4AnalysisManager::Instance()->CreateNtupleDColumn(Names::nTupleHitData.id, Names::cEnergyDeposited.name);

    // .... OTHER CREATE COLUMNS.....

    G4AnalysisManager::Instance()->FinishNtuple();

    //! NTuple: Event summary
    G4AnalysisManager::Instance()->CreateNtuple(Names::nTupleEventSummary.name, "Event summary"); // ID: 2
    // ... CREATE COLUMNS ...
    G4AnalysisManager::Instance()->FinishNtuple();
}

// ----------------------------------------------------------
// --------------- F I L L      M E T H O D S ---------------
// ----------------------------------------------------------

void HistogramManager::FillHitData(const MyHit *hit)
{

    // Fill the Ntuple: Hit Data
    G4AnalysisManager::Instance()->FillNtupleDColumn(Names::nTupleHitData.id, Names::cEnergyDeposited.id, hit->GetEdep());
    
    // .... OTHER FILL COMMANDS .....

    G4AnalysisManager::Instance()->AddNtupleRow(Names::nTupleHitData.id);
}

void HistogramManager::FillEventSummary(G4double totalEdep, G4int nHits, G4int nTracks, int id)
{
    G4AnalysisManager::Instance()->FillNtupleDColumn(Names::nTupleEventSummary.id, Names::cTotEdep.id, totalEdep / MeV);
    // .... OTHER FILL COMMANDS .....
    G4AnalysisManager::Instance()->AddNtupleRow(Names::nTupleEventSummary.id);
}

And they are called in the EventAction.cpp

#include "EventAction.h"

#include "MyHit.h"
#include "G4THitsCollection.hh"
#include "G4SDManager.hh"
#include "G4AnalysisManager.hh"
#include "HistogramManager.h"
#include "G4Event.hh"
#include "G4Threading.hh"

#include "Names.h"
#include <set>

// Called automatically by Geant4 at the begin of each event
void EventAction::BeginOfEventAction(const G4Event *) {}

// Called automatically by Geant4 at the end of each event
void EventAction::EndOfEventAction(const G4Event *event)
{
    // Retrieve the hit collections container for this event
    G4HCofThisEvent *hce = event->GetHCofThisEvent();

    // If no hit collections exist, skip this event
    if (!hce)
        return;

    // Cache the collection ID once per thread (the SD name includes the thread ID,
    // so a plain static shared across workers would be wrong in MT mode)
    static G4ThreadLocal G4int hcIDActive = -1;

    if (hcIDActive < 0)
        hcIDActive = G4SDManager::GetSDMpointer()->GetCollectionID(Names::activeAreaSensitiveDetector + "_thread" + std::to_string(G4Threading::G4GetThreadId()) + "/" + Names::activeAreaHitsCollection);

    // Get the actual hits collections for this event
    auto hitsCollectionActive = static_cast<G4THitsCollection<MyHit> *>(hce->GetHC(hcIDActive));

    // If the collection was not found, skip
    if (!hitsCollectionActive)
        return;

    // Initialize per-event variables using the active area collection
    G4double totalEdep = 0.;                         // Sum of all energy deposited in this event (active area)
    G4int numHits = hitsCollectionActive->entries(); // Number of hits recorded in active area
    G4double totalChargedEdep = 0.;                  // Total energy deposited by charged particles in active area
    std::set<G4int> trackIDs;

    auto analysisManager = G4AnalysisManager::Instance();
    auto histogramManager = HistogramManager::Instance();

    // Loop over hits in the active area collection
    for (G4int iHit = 0; iHit < numHits; ++iHit)
    {
        auto hit = (*hitsCollectionActive)[iHit];

        trackIDs.insert(hit->GetTrackID());

        if (hit->GetIsActiveArea())
        {
            // Account only E Dep in the active Area
            totalEdep += hit->GetEdep();
        }

        histogramManager->FillHitData(hit);
    }

    // Fill the per-event summary ntuple
    histogramManager->FillEventSummary(totalEdep, numHits, trackIDs.size(), event->GetEventID());
}

Perhaps I’m doing something wrong, but I’m unable to identify the issue. I’m saving the output to a .root file and some run metadata to a .json file.

Many thanks for considering my request.
Michael

How many worker threads did you request? The macro command is /run/numberOfThreads (before /run/initialize), and the C++ function is runManager->SetNumberOfThreads().
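For reference, a minimal macro sketch (the thread count here is just an example; it must come before `/run/initialize`):

```
# set worker threads before initialization
/run/numberOfThreads 4
/run/initialize
/run/beamOn 10000000
```

The equivalent C++ call, `runManager->SetNumberOfThreads(4)`, likewise has to happen before `runManager->Initialize()`.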

I requested 4 threads, and in the terminal output, I can see the 4 threads working. I also tried with 2, and I saw both in that case too.

Hmmm. In my experience there’s a surprising amount of initialization time (at least with our simulation code) before the run (with multiple threads) starts. Depending on the number of events you’re generating, it’s possible that you’re seeing mostly that baseline. Try a large number of events (10k, 100k, …), enough to have significant running time. Then you should be able to see the effect of threading.

Thanks, I tried simulating 1e7 events. Do you think I should try more?

That ought to be enough, but maybe you can get a bit more timing information. Add the line /run/printProgress 100000 to your macro. Time how long it takes to get to the “Run starts” message, which should be the same no matter how many threads you have. Then see how long the job takes for different numbers of threads.
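A macro sketch of that timing check (event counts are just examples):

```
# report progress every 100k events, so you can see the event rate
/run/printProgress 100000
/run/beamOn 10000000
```

Running the whole job under the shell's `time` command for, say, 1, 2, and 4 threads, and subtracting the (roughly constant) time up to "Run starts", should isolate the event-loop speedup.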

Are you running this on an HPC cluster, or your personal laptop/desktop machine? How many cores do you have available? You won’t see any improvement (and will likely see worse performance) if you request more threads than cores.

I’m running on my personal laptop.

These are the cores available, from lscpu:

Core(s) per socket: 4
Thread(s) per core: 2