Multiple output files

Florian · January 29, 2020, 8:07pm

Hi,

I am looking into saving the deposited energy (and maybe more) for each particle in a run of several thousand particles, as well as the run average. All of it should be saved as histograms in ROOT format. I wanted this output written to different files, but while adapting my previous code (based on the TestEm11 example, which provides mean histograms), I found that the analysis manager can only have one instance. I initially wanted one instance dedicated to run-level information (i.e. the average) and one instance for each event that would open and close its own file.

I know that saving all the information I need in ntuples and then post-processing it into different files would work, but ntuples are heavier than histograms and are not my first choice.

Is there an easy way to have two files open at the same time? I know that multithreading opens several files and then merges them.

Thanks,
Florian

Geant4 10.5

Florian · January 31, 2020, 11:33pm

Hi,
Another option would be to create a histogram for each event, but then the RAM usage explodes and limits the number of events that can be processed in one run.

Could you help me figure out solutions that I might not have foreseen?

Thank you in advance

allison · February 1, 2020, 10:12am

Not sure what you’re trying to do. Ntuples are a fine way to go.

Try posting your question on the “Getting Started” Forum.

Florian · February 7, 2020, 5:50pm

Hi,
Let me rephrase it so that it is (hopefully) clearer.

My code was adapted from the TestEm11 example, where histogram 1 (longitudinal energy profile) records the average energy deposited per event along the depth of the material.
My goal is now to track each particle so that I have the fluctuations. This means 50 000 particles per run to be tracked.
My Geant4 application runs on a cluster and uses multithreading, so I need it to be optimized in time, cores, RAM and disk usage.

I thought of different possibilities to adapt my code but each comes with difficulties. Thus, I would like your help to improve those solutions or maybe think of another solution that might be more efficient.

* Create a histogram for each particle and store them all in one file. This requires enough RAM to hold all the histograms (50 000 of them, on each thread), which seems to exceed the RAM available on the cluster.

* Create a file for each particle and store in it a histogram of the longitudinal energy deposition profile. Opening a file needs an analysis manager instance. From my understanding and testing, this instance needs to be created at the master thread level and there can only be one instance. While this option seems impossible, if it worked it would drastically reduce the RAM usage, since only the histograms currently being filled by the worker threads would be in RAM (one per core, so 40 in my case, compared to the 50 000 histograms of the previous solution).

* Use an ntuple. The ntuple stores each step's energy deposition, flagged with the event information, so that it can be post-processed into histograms outside Geant4. I tried to develop this solution but it seems to have some drawbacks:
  * It drastically slows down the Geant4 run: about 50% slower. It may well be my code that is faulty, but I also think it might come from the ntuple being filled at each step.
  * The output files are heavy, from 0.6 GB to 7.7 GB, which is more than I think the histograms would produce. Thus, using ntuples will require care with data storage.

I would be really grateful if you could help me see more clearly into this and help find an adapted structure solution to this situation.

Thanks in advance!

allison · February 9, 2020, 3:05pm

I’m sorry. What you say does not make sense to me:

“Create a histogram for each particle” It suggests to me that you have not understood the concept of a histogram. It is the representation of data accumulated over a run.

“Create a file for each particle”: What do you mean, “each particle”? “Each type of particle”? (There are hundreds.) “Each particle in each event”? (There would be millions.) “Each track”? Why would you create a file for each? That’s crazy.

Ntuple allows you to have just one file. Even if you use histograms you can accumulate data in them over a run and write them out in one file, I believe.

Like I said, “Try posting your question on the “Getting Started” Forum.”

Florian · February 10, 2020, 4:21pm

Thanks for the reply.

My understanding of a histogram is a “representation of the distribution of numerical data”. It lets me obtain distributions according to a binning. So by using histograms I can drastically (and easily, just through the bin filling) reduce the amount of data. The run aspect is thus not a requirement in any way, except maybe for performance.

In my case (maybe not clear enough despite the example value of 50 000 per run), one primary particle, an ion, is generated per event, and it is the particle I am referring to. I mainly track the primary. So this still means, indeed, a lot of histograms (50 000). Huge, yes; not optimized, maybe; crazy, no. This was the reason for me seeking advice.

Creating a file for each one was meant to avoid keeping in memory the histogram of an event that has already been processed.

And indeed I can accumulate data in histograms, but creating 50 000 histograms seems to crash Geant4 because of the huge demand on RAM. Hence my choice of keeping in memory only the events being processed and writing each one to file as soon as it is processed.
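
The "fill, write, reuse" idea can be sketched in plain C++ without any Geant4 dependency. This is only an illustration under assumptions: `DepthProfile`, `Fill`, `FlushAndReset` and `Bin` are hypothetical names, not Geant4 or g4tools API, and a text line stands in for whatever output format is actually used.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical fixed-binning depth profile (not a Geant4 or g4tools class).
class DepthProfile {
public:
  DepthProfile(std::size_t nBins, double zMin, double zMax)
    : fZMin(zMin), fZMax(zMax), fBins(nBins, 0.0) {
    assert(nBins > 0 && zMax > zMin);
  }

  // Add an energy deposit at depth z; deposits outside [zMin, zMax) are dropped.
  void Fill(double z, double edep) {
    if (z < fZMin || z >= fZMax) return;
    const double frac = (z - fZMin) / (fZMax - fZMin);
    const auto i = static_cast<std::size_t>(frac * fBins.size());
    fBins[std::min(i, fBins.size() - 1)] += edep;  // clamp against rounding at the upper edge
  }

  // Return "eventID bin0 bin1 ..." as one text line and zero the bins for the next event.
  std::string FlushAndReset(int eventID) {
    std::ostringstream line;
    line << eventID;
    for (double& b : fBins) { line << ' ' << b; b = 0.0; }
    line << '\n';
    return line.str();
  }

  double Bin(std::size_t i) const { return fBins.at(i); }

private:
  double fZMin, fZMax;
  std::vector<double> fBins;
};
```

With one such object per worker thread, the number of profiles alive at any time equals the number of threads (40 in the setup described here) rather than the number of events.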

I will move this to Getting Started. Thanks for your replies.

Florian · February 10, 2020, 4:24pm

Moved from Recording, Visualizing and Persisting Data to Getting Started.

bmorgan · February 11, 2020, 2:00pm

From the description of your use case, it sounds like an Ntuple with post-processing might be the better solution. @ivana can advise on why the Ntuple solution is slow, if you can post that portion of your code here.

Florian · February 11, 2020, 4:08pm

Hi,
here are the relevant parts where Ntuples are present:

In HistoManager.cc:

analysisManager->SetNtupleMerging(true);
analysisManager->CreateNtuple("Landau", "Edep and TrackLength");
analysisManager->CreateNtupleDColumn("EventID");
analysisManager->CreateNtupleDColumn("Detector");
analysisManager->CreateNtupleDColumn("Step");
analysisManager->CreateNtupleDColumn("Edep");
analysisManager->FinishNtuple();

In SteppingAction.cc:

if (nhist <= fDetector->GetNbOfAbsor() && nhist > 0)
{
  // One ntuple row per step: event ID, absorber index, depth and energy deposit
  analysisManager->FillNtupleDColumn(0, 0, currentEvent->GetEventID());
  analysisManager->FillNtupleDColumn(0, 1, nhist-1);
  analysisManager->FillNtupleDColumn(0, 2, zshiftedAbsor/um);
  analysisManager->FillNtupleDColumn(0, 3, edep/keV);
  analysisManager->AddNtupleRow();
}

In RunAction:
Begin of Run:

auto analysisManager = G4AnalysisManager::Instance();
analysisManager->OpenFile();
fHistoManager->Setup();

End of Run:

fHistoManager->Save();

void HistoManager::Save()
{
  // Write the histograms and ntuples, then close the output file
  auto analysisManager = G4AnalysisManager::Instance();
  analysisManager->Write();
  analysisManager->CloseFile();
}

Hope it can help you understand what I did wrong.

Thanks for your help

Florian · February 11, 2020, 5:14pm

Also my files are named in the macro file using:

/analysis/setFileName Result/{particleloc}/config{ndetector}{particleloc}{eKinLoc}MeV_nucl_{angleLoc}_deg

I was working on the post-processing but faced some issues, and one thing that appeared was a corrupted header in the output ROOT file. This is a comment I received on the ROOT forum when trying to solve my post-processing issues:

Indeed the file is odd … It reports that it was written by ROOT v4 … but more importantly it is self-inconsistent. It claims that the original file name was 122 characters long when in fact it was only 43 characters long (“Result/alpha/config300_alpha_6MeV_nucl.root”). The extra 80 characters mean that TFile inadvertently overwrites the first basket when updating its header.

I don’t know whether all of this is linked to my previous issues.

ivana · February 12, 2020, 8:08pm

Hi Florian,

Thank you for providing code.
It is not clear how you want to process 50000 histograms to evaluate the fluctuations. I would just note here that the histograms contain statistical data for each bin; see more details here:
http://geant4-userdoc.web.cern.ch/geant4-userdoc/UsersGuides/ForApplicationDeveloper/html/Analysis/g4tools.html
So you could check whether you can use one histogram for all particles and get the statistics from the additional information saved per bin.
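
As an illustration of what per-bin statistics can provide, here is a plain C++ sketch that keeps, for each depth bin, the running sum and sum of squares of the per-event deposit, from which a mean and a spread follow. The class and method names are hypothetical and this is not the g4tools implementation; it only shows the kind of information a single accumulator can hold without storing one histogram per event.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical per-bin accumulator (illustrative names, not g4tools API).
class BinMoments {
public:
  explicit BinMoments(std::size_t nBins)
    : fEvent(nBins, 0.0), fSum(nBins, 0.0), fSum2(nBins, 0.0) {}

  // Called per step: add the deposit to the current event's total for this bin.
  void AddStep(std::size_t bin, double edep) { fEvent.at(bin) += edep; }

  // Called once per event: fold the event totals into the running moments.
  void EndEvent() {
    for (std::size_t i = 0; i < fEvent.size(); ++i) {
      fSum[i]  += fEvent[i];
      fSum2[i] += fEvent[i] * fEvent[i];
      fEvent[i] = 0.0;
    }
    ++fNEvents;
  }

  // Mean per-event deposit in this bin.
  double Mean(std::size_t bin) const {
    assert(fNEvents > 0);
    return fSum.at(bin) / fNEvents;
  }

  // Population standard deviation of the per-event deposit in this bin.
  double StdDev(std::size_t bin) const {
    const double m = Mean(bin);
    const double var = fSum2.at(bin) / fNEvents - m * m;
    return var > 0.0 ? std::sqrt(var) : 0.0;
  }

private:
  std::vector<double> fEvent, fSum, fSum2;
  std::size_t fNEvents = 0;
};
```

Memory use is three arrays of `nBins` doubles regardless of the number of events. The trade-off is that only the first two moments per bin survive, not the full shape of the distribution or the correlation between bins within one event.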

If you really need to save the data for each step, and so use an ntuple, then several factors may affect performance. The data are not written to the file on each AddNtupleRow() call; they are accumulated in buffers, and only when a buffer is full are the data written to the file.
You mention that you run in MT on a cluster with ntuple merging mode activated. Ntuple merging can affect performance, because filled buffers are not written to a file directly but are first transferred to the master thread, which performs the writing. Note that the SetNtupleMerging() function can be called with additional parameters that can be used to tune performance:

void SetNtupleMerging(G4bool mergeNtuples,
                      G4int nofReducedNtupleFiles = 0,
                      G4bool rowWise = true,
                      unsigned int basketSize = fgkDefaultBasketSize);

I suggest first checking the impact of writing ntuples on the simulation time in sequential mode with a smaller number of particles: is the slowdown there also ~50%? If so, you can try increasing the basket size: this will increase memory consumption, but writing will happen less often.
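
The buffer size trade-off can be pictured with a toy model in plain C++. It is only a sketch under assumptions: `BufferedRows` and `Row` are hypothetical, and `Flush()` stands in for the expensive operation (a disk write, or in merging mode a transfer to the master thread). It does not reproduce how Geant4 or ROOT size their baskets.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One ntuple row (hypothetical layout mirroring the four columns used in this thread).
struct Row { int eventID; int detector; double z; double edep; };

// Toy basket: rows pile up in memory and are handed over only when the buffer is full.
class BufferedRows {
public:
  explicit BufferedRows(std::size_t capacity) : fCapacity(capacity) {
    assert(capacity > 0);
    fRows.reserve(capacity);
  }

  void Add(const Row& r) {
    fRows.push_back(r);
    if (fRows.size() == fCapacity) Flush();
  }

  // Stand-in for the costly write or transfer; here it only counts.
  void Flush() {
    if (fRows.empty()) return;
    fRowsWritten += fRows.size();
    ++fFlushes;
    fRows.clear();
  }

  std::size_t Flushes() const { return fFlushes; }
  std::size_t RowsWritten() const { return fRowsWritten; }

private:
  std::size_t fCapacity;
  std::vector<Row> fRows;
  std::size_t fFlushes = 0;
  std::size_t fRowsWritten = 0;
};
```

For 1000 rows, a capacity of 10 gives 100 flushes and a capacity of 100 gives 10, at the cost of holding ten times as many rows in memory per buffer.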

Best regards,

Florian · February 14, 2020, 10:46pm

Hi @Ivana,

Thanks for your help.

Ideally, I need those histograms to be passed to another program for pulse analysis. This second program simulates the carriers based on the energy deposition in every step of a particle (1 micron in our case, for a full detector of 4800 um in total). So I need to account for the fluctuations in each step and propagate them to the other steps. But I could not think of any way other than having either the energy distribution in each micron over the 50 000 particles (either 4800 histograms or ntuples), or each particle's energy deposition profile (either 50 000 histograms or ntuples).

I read through the link and I might be missing a point, but I don’t see a way to reconstruct the energy deposition of each particle, or the distribution in each micron, from the average histogram alone.

Thus, my current code uses ntuples (the RAM apparently cannot hold the histograms), which allows post-processing to obtain either the energy profile of each particle or the distribution in each micron.
However, my implementation of ntuples takes a huge amount of storage and slows down my runs.
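
The two post-processing views mentioned here (one profile per particle, or one distribution per depth bin) can both be derived from the same rows. The following plain C++ sketch assumes the rows have already been read back into memory. `StepRow`, `ProfilesByEvent` and `DistributionByBin` are hypothetical names, and reading the ROOT file itself is left out.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <vector>

// One ntuple row as read back in post-processing (hypothetical layout).
struct StepRow { int eventID; double z_um; double edep_keV; };

using Profile = std::vector<double>;

// profiles[eventID][bin] = energy deposited by that event in that depth bin.
std::map<int, Profile> ProfilesByEvent(const std::vector<StepRow>& rows,
                                       std::size_t nBins, double binWidth_um) {
  assert(binWidth_um > 0.0);
  std::map<int, Profile> profiles;
  for (const StepRow& r : rows) {
    const double b = r.z_um / binWidth_um;
    if (b < 0.0 || b >= static_cast<double>(nBins)) continue;  // outside the binned range
    auto it = profiles.find(r.eventID);
    if (it == profiles.end()) it = profiles.emplace(r.eventID, Profile(nBins, 0.0)).first;
    it->second[static_cast<std::size_t>(b)] += r.edep_keV;
  }
  return profiles;
}

// dist[bin] = per-event deposits in that bin, one entry per event, in event ID order.
std::vector<std::vector<double>> DistributionByBin(const std::map<int, Profile>& profiles,
                                                   std::size_t nBins) {
  std::vector<std::vector<double>> dist(nBins);
  for (const auto& kv : profiles)
    for (std::size_t i = 0; i < nBins; ++i) dist[i].push_back(kv.second.at(i));
  return dist;
}
```

Events with no row inside the binned range get no profile in this sketch, so a bin's distribution only counts events that deposited somewhere in the detector.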

I measured the core usage for the following tests:
* MT + ntuples: 340% usage out of 6 cores, so about 57% efficiency
* MT + no ntuples, only the previous averaged histograms: 600% out of 6 cores, so 100% efficiency
* Single thread + ntuples: 100% out of 1 core, so 100% efficiency
* MT + ntuples + changed buffer size: 340% out of 6 cores is the maximum, reached with the default configuration. Increasing or decreasing the buffer size seems to actually slow it down more.

I hope this information clarifies things and helps improve my code.

Thanks in advance for any further help you can provide.
Best regards,

mkelsey · February 15, 2020, 2:04am

Instead of storing each track/step as a separate histogram, you might want to look into writing out “rich trajectories” instead. Those store the step positions and related step information for each track (see http://geant4-userdoc.web.cern.ch/geant4-userdoc/UsersGuides/ForApplicationDeveloper/html/TrackingAndPhysics/tracking.html#track-traj).

These are used primarily for visualization, but since they are stored in G4Event (in the G4TrajectoryContainer object), you should be able to access them in your event and run action and write them out to your file. They can also be written out by vis itself in the HepRep format.

Florian · February 27, 2020, 8:42pm

Hi @mkelsey,

Sorry to come back this late; I did not have time to try implementing your suggestion before.
While trying now, I am faced with a simple issue that I am having a hard time solving: how do I get access to the trajectory?

G4TrajectoryContainer* trajectoryContainer=anEvent->GetTrajectoryContainer();

I got access to the trajectory container and also to TrajectoryVector* Vtrack = trajectoryContainer->GetVector(); but for some reason I can’t access the trajectory itself. Despite reading the implementation of those classes, I did not manage to obtain a G4Trajectory. Could you help me get past this small hindrance?

Thanks in advance

mkelsey · February 27, 2020, 10:02pm

In the section of the App Guide I referenced above, there’s a paragraph,

The creation of a trajectory can be controlled by invoking G4TrackingManager::SetStoreTrajectory(G4bool) . The UI command /tracking/storeTrajectory bool does the same. The user can set this flag for each individual track from his/her G4UserTrackingAction::PreUserTrackingAction() method.

Did this work?

Did you look to see if the vector Vtrack had any entries (i.e., is Vtrack->size() > 0)?

Are you accessing this container late enough in the event that you know there have already been tracks processed?

Florian · February 28, 2020, 5:31pm

Hi
My code in EventAction:

G4TrajectoryContainer* trajectoryContainer = anEvent->GetTrajectoryContainer();
G4int n_trajectories = 0;
if (trajectoryContainer) n_trajectories = trajectoryContainer->entries();
G4cout << "::::::-------- n_trajectories= " << n_trajectories << "----------------------:::::" << G4endl;
if (trajectoryContainer) {  // guard: the container pointer may be null
  TrajectoryVector* Vtrack = trajectoryContainer->GetVector();
  G4cout << "::::::-------- VtrackSize= " << Vtrack->size() << "----------------------:::::" << G4endl;
}

This prints out n_trajectories= 1 and VtrackSize=1;
I tried G4Trajectory* track= Vtrack[0]; to get a trajectory but it does not compile:
error: cannot convert ‘TrajectoryVector {aka std::vector<G4VTrajectory*>}’ to ‘G4Trajectory*’ in initialization

mkelsey · February 29, 2020, 7:08am

The compiler error reads strangely, but I can see the problem. Your Vtrack is a pointer to a vector, not a vector itself. So you need to write:

G4VTrajectory* track = (*Vtrack)[0];

and then maybe downcast it to G4Trajectory if you really need to do so. You can also, apparently, avoid the vector access entirely:

G4VTrajectory* track = (*trajectoryContainer)[0];

should work, according to the .hh file (trajectoryContainer is itself a pointer, so it needs the same dereference).
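
The pointer-versus-vector distinction can be reproduced without Geant4. In the sketch below, `VTrajectory`, `Trajectory` and `TrajectoryVec` are stand-in types invented for the example (they only mimic the shape of G4VTrajectory, G4Trajectory and TrajectoryVector), and `FirstTrajectory` is a hypothetical helper.

```cpp
#include <vector>

struct VTrajectory { virtual ~VTrajectory() = default; };  // stand-in for the abstract base
struct Trajectory : VTrajectory { int trackID = 0; };      // stand-in for the concrete class
using TrajectoryVec = std::vector<VTrajectory*>;           // stand-in for the vector typedef

// Returns the first trajectory as the concrete type,
// or nullptr if the vector is missing, empty, or holds another type.
Trajectory* FirstTrajectory(TrajectoryVec* v) {
  if (!v || v->empty()) return nullptr;
  VTrajectory* base = (*v)[0];             // dereference first: v[0] would be a whole vector
  return dynamic_cast<Trajectory*>(base);  // checked downcast, null on type mismatch
}
```

Writing `v->at(0)` instead of `(*v)[0]` is equivalent here and adds a bounds check.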

Florian · March 5, 2020, 4:09pm

Hi,
thanks, I managed to access it.
Looking at G4VTrajectory, I understood that I need to add some stored attributes to get the energy deposited in each step, as trajectories only store positions in their default form.
Do you know if one of the Geant4 examples shows the full implementation of a “rich” trajectory?

mkelsey · March 5, 2020, 7:15pm

I don’t know for sure. I see that a lot of the “extended” examples make use of trajectories (find examples -name '*.cc'|xargs grep Trajectory). It looks to me as though runAndEvent/RE04 might be your best bet. It stores a lot of useful stuff about the track in the RE04Trajectory subclass.
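
The data a "rich" trajectory would carry for this use case can be sketched in plain C++. This is not the RE04 code and has no Geant4 dependency: `RichPoint` and `RichTrajectory` are hypothetical, and in a real application the class would presumably derive from G4VTrajectory and fill its points from the G4Step passed to its step-appending method.

```cpp
#include <vector>

// Per-step record: depth of the step and energy deposited in it (hypothetical layout).
struct RichPoint { double z_um; double edep_keV; };

// Minimal container for one track's step-by-step deposition profile.
class RichTrajectory {
public:
  explicit RichTrajectory(int trackID) : fTrackID(trackID) {}

  // Record one step; a Geant4 version would extract these values from the step object.
  void AppendStep(double z_um, double edep_keV) { fPoints.push_back({z_um, edep_keV}); }

  int TrackID() const { return fTrackID; }
  const std::vector<RichPoint>& Points() const { return fPoints; }

  // Total energy deposited along the track.
  double TotalEdep() const {
    double sum = 0.0;
    for (const RichPoint& p : fPoints) sum += p.edep_keV;
    return sum;
  }

private:
  int fTrackID;
  std::vector<RichPoint> fPoints;
};
```

Stored this way, the points of the primary's trajectory can be read at the end of each event and written out, so only the current event's steps are held in memory.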

Florian · March 20, 2020, 4:10pm

Thanks, I managed to do this implementation. I learned a lot.