Storage and post-processing of large particle datasets

Dear all,

I would like to ask for suggestions on how to store and manipulate large datasets in Geant4. From my simulations, I have saved the information for 1e7 neutrons in a .txt file laid out as follows (each line holds one quantity, stored as a scalar or a vector):

output.txt

energy particle 1
position (x,y,z) particle 1
direction (dx, dy, dz) particle 1
energy particle 2
position (x,y,z) particle 2
direction (dx, dy, dz) particle 2
.........
.........
.........
energy particle N
position (x,y,z) particle N
direction (dx, dy, dz) particle N

This file is around 400 MB. I aim to increase the computational cost of the simulation by an order of magnitude or more, which means GBs or tens of GBs of storage. In addition, post-processing this information with my Python scripts is slow: extracting and manipulating one output file takes 1 hour, reading the text file line by line.
I would appreciate suggestions both for reducing the size of this file (possibly by using another format) and for speeding up the post-processing in Python. I guess the HDF5 format could be a useful option, but I am open to all suggestions.
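
For readers facing the same layout, here is a minimal sketch of how the three-line records could be converted to fixed-size binary records using only the Python standard library. It assumes (the post does not say) that each line holds plain whitespace-separated numbers; the names iter_particles, text_to_binary and read_binary are made up for illustration. HDF5 or ROOT would store the same kind of packed numbers, with added metadata and optional compression.

```python
import io
import struct

# Assumed layout (not confirmed by the post): three whitespace-separated
# numeric lines per particle: energy, then "x y z", then "dx dy dz".
RECORD = struct.Struct("<7f")  # 7 little-endian float32 values = 28 bytes


def iter_particles(lines):
    """Yield (E, x, y, z, dx, dy, dz) tuples from groups of three lines."""
    it = iter(lines)
    # zip(it, it, it) consumes the stream three lines at a time and
    # silently drops an incomplete trailing group.
    for energy_line, pos_line, dir_line in zip(it, it, it):
        energy = float(energy_line)
        x, y, z = map(float, pos_line.split())
        dx, dy, dz = map(float, dir_line.split())
        yield (energy, x, y, z, dx, dy, dz)


def text_to_binary(text_stream, binary_stream):
    """Write one packed record per particle; return the number written."""
    count = 0
    for particle in iter_particles(text_stream):
        binary_stream.write(RECORD.pack(*particle))
        count += 1
    return count


def read_binary(binary_stream):
    """Read all packed records back as a list of 7-tuples."""
    return list(RECORD.iter_unpack(binary_stream.read()))


# Tiny in-memory demonstration with two hypothetical particles.
sample = "1.5\n0.0 1.0 2.0\n0.0 0.0 1.0\n2.5\n3.0 4.0 5.0\n1.0 0.0 0.0\n"
buffer = io.BytesIO()
n_written = text_to_binary(io.StringIO(sample), buffer)
buffer.seek(0)
particles = read_binary(buffer)
print(n_written, len(buffer.getvalue()), particles[1])
```

With 7 float32 values each particle takes 28 bytes, against roughly 40 bytes per particle in the text file above (about 400 MB for 1e7 neutrons). float32 keeps about 7 significant decimal digits, so whether this is acceptable depends on the precision needed.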

Thanks in advance for your time.
Best regards,

Christian

Your use case sounds like it’s exactly what binary encoded .root files were made for. The G4AnalysisManager within Geant4 is already able to output .root files. You could save the information you work with as an Ntuple. I encourage you to look into the CERN ROOT project. You can find more information about G4AnalysisManager here: Analysis Manager Classes — Book For Application Developers 11.0 documentation

Edit: Feel free to take a look at my project MicroTrackGenerator. Rather than using the limited version of ROOT already built into Geant4 through the analysis manager, my project makes use of the full ROOT library and outputs to a TTree rather than an Ntuple. Take a look at the SteppingAction to see how the trees are made and how x,y,z position and energy deposited are output to the tree for every step.

Working with ROOT will not only make your output files the smallest possible – but the functionality built into ROOT for analyzing data will also make your analysis/post-processing much faster than working with .txt. When multithreading on 4 cores of my laptop I’m able to read in ~500MB/s of track information using ROOT.

Joseph

Dear @JDecunha ,

thank you very much for your suggestion. I am a complete beginner with ROOT and have only just started using it. Your implementation looks aimed at experts, and I will certainly take a look at it later.

So far, I have tried exporting the information collected from my simulations as .root files. In addition, I have started working with PyROOT, since I need some specific Python libraries for post-processing my results.

Unfortunately, I have run into problems extracting the information from the .root files. I would like to ask for some suggestions, taking the .root file from the B4a example as a reference. Below is the Python script I used to read the generated file, B4.root (which I am not allowed to attach here because of its format):


import ROOT

fileName="./build/B4.root"
myFile = ROOT.TFile.Open(fileName, "READ")

tree=myFile.Get("B4")
leaf = tree.GetBranch("Eabs").GetLeaf("Eabs")
myFile.ls(), tree.Scan(),  leaf.GetValue(), tree.Print()

In Jupyter Notebook, this prints the following:

(20, 22.873348309946593)

TFile**		B4aTestRoot/build/B4.root	
 TFile*		std_examples/B4aTestRoot/build/B4.root	
  KEY: TTree	B4;1	Edep and TrackL
  KEY: TH1D	Eabs;1	Edep in absorber
  KEY: TH1D	Egap;1	Edep in gap
  KEY: TH1D	Labs;1	trackL in absorber
  KEY: TH1D	Lgap;1	trackL in gap

************************************************************
*    Row   * Eabs.Eabs * Egap.Egap * Labs.Labs * Lgap.Lgap *
************************************************************
*        0 * 43.042040 *         0 * 29.575223 *         0 *
*        1 * 27.340958 *         0 * 19.127214 *         0 *
*        2 * 45.473479 *         0 * 30.956585 *         0 *
*        3 * 4.7084119 *         0 * 1.4942436 *         0 *
*        4 * 3.6049559 *         0 * 0.7217933 *         0 *
*        5 * 3.4328333 *         0 * 0.5616289 *         0 *
*        6 * 51.597768 * 0.2504165 * 39.337433 * 0.6461602 *
*        7 * 3.3216662 *         0 * 0.4482391 *         0 *
*        8 * 44.348139 * 1.4075870 * 33.634048 * 7.6039427 *
*        9 * 3.2214064 *         0 * 0.3639539 *         0 *
*       10 * 3.1431678 *         0 * 0.3159866 *         0 *
*       11 * 29.325894 *         0 * 20.652747 *         0 *
*       12 * 48.283826 *         0 * 36.003877 *         0 *
*       13 * 31.511904 * 1.4158138 * 23.978050 * 8.1664451 *
*       14 * 32.828006 *         0 * 23.612736 *         0 *
*       15 * 10.381831 *         0 * 6.1633032 *         0 *
*       16 * 3.2690086 *         0 * 0.3967726 *         0 *
*       17 * 3.1745242 *         0 * 0.3311820 *         0 *
*       18 * 47.452206 * 7.2540443 * 33.439702 * 39.732736 *
*       19 * 22.873348 *         0 * 15.977893 *         0 *
************************************************************

******************************************************************************
*Tree    :B4        : Edep and TrackL                                        *
*Entries :       20 : Total =            4621 bytes  File  Size =       2805 *
*        :          : Tree compression factor =   1.00                       *
******************************************************************************
*Br    0 :Eabs      : Double_t B4                                            *
*Entries :       20 : Total  Size=       1053 bytes  File Size  =        548 *
*Baskets :        4 : Basket Size=      32000 bytes  Compression=   1.00     *
*............................................................................*
*Br    1 :Egap      : Double_t B4                                            *
*Entries :       20 : Total  Size=       1053 bytes  File Size  =        548 *
*Baskets :        4 : Basket Size=      32000 bytes  Compression=   1.00     *
*............................................................................*
*Br    2 :Labs      : Double_t B4                                            *
*Entries :       20 : Total  Size=       1053 bytes  File Size  =        548 *
*Baskets :        4 : Basket Size=      32000 bytes  Compression=   1.00     *
*............................................................................*
*Br    3 :Lgap      : Double_t B4                                            *
*Entries :       20 : Total  Size=       1053 bytes  File Size  =        548 *
*Baskets :        4 : Basket Size=      32000 bytes  Compression=   1.00     *
*............................................................................*

Here I am not able to extract the values in each column, only the one in the last row (22.873348309946593).
I tried to get the individual values as follows:

leaf.GetValue(1), leaf.GetValue(2), leaf.GetValue(3), leaf.GetValue(20)

which returns:

(0.0, 5.73e-322, 2.4e-322, 5.667e-321)

I installed the latest version of ROOT and updated Jupyter and Python in the Anaconda environment, although I cannot rule out errors due to bugs.

Do you have any suggestions, please?

Thank you very much for your time.
Best regards,

Christian

Hi @christian_castagna,

I think your question might better be addressed on the ROOT forums. But for reading out a ROOT TTree I would advise you take a look at section 14.14.3 of the following document: Chapter: Trees. I think you are running into trouble because you are taking the leaves of the tree and trying to output their values – I believe you want to take a branch of the tree and iterate through / output its entries, not its leaves.

Joseph

Dear @JDecunha ,

thank you very much for your suggestion. I have managed to do it. I report the script here, for extracting the variable Eabs, in case it helps other people with the same problem:

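import ROOT

fileName = "B4.root"
myFile = ROOT.TFile.Open(fileName, "READ")
tree = myFile.Get("B4")

# Loop over all entries and print Eabs for each one
for i in range(tree.GetEntries()):
    tree.GetEntry(i)  # load entry i into the tree's branch buffers
    Eabs = tree.Eabs  # value of the Eabs branch for the current entry
    print(Eabs)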

Now I am comparing .root files against .txt files. Unfortunately, in terms of storage, the size does not decrease much: the .root file takes 388 MB and the .txt file 404 MB.
Considering that the .root file is already compressed, I expected a larger gain. Do you have an explanation and/or a suggestion, please?

In addition, after zip compression the .root file takes 343 MB and the .txt file 177 MB. In this case the .txt file comes out smaller.

However, for opening the file with myFile = ROOT.TFile.Open(fileName, "READ") (not reading the individual entries), ROOT is faster by a factor of 20, which is significant.

Thank you very much for your time.
Best regards,

Christian

Hi @christian_castagna,

I think your question would be best addressed on the CERN ROOT forums. A couple of guiding comments that might help you though:

1.) Keep in mind that .root files can contain many different data sets. There may be other information in your .root file than just the single TTree you are looking at, which could be increasing the space of your file. You could investigate this by calling myFile.ls() to list the contents of your file.

2.) Is the data you are saving to your .txt file and your .root file of equal precision? i.e. I imagine you are outputting floats or doubles to your .root file. Does the information saved to your .txt have the same number of significant figures? This might be why the space saving is not evident.

3.) ROOT TFiles can have varying levels of compression, from not compressed at all to highly compressed (at the expense of taking longer to read out). It is possible that your TFile is not compressed and that this is responsible for the size you are observing. The ROOT: TFile Class Reference shows some functions you may find helpful (GetCompressionAlgorithm(), GetCompressionLevel(), GetCompressionSettings(), and functions to change the compression such as SetCompressionLevel(int setting)).
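
Points 2 and 3 can be illustrated with a small standard-library experiment. This is only a sketch on synthetic random values, not on the actual simulation output: it writes the same numbers as text with 7 significant figures, as packed float64 and as packed float32, then compresses each with zlib (the DEFLATE algorithm used by ordinary .zip files). The value range, the sample size and the report helper are arbitrary choices for the example.

```python
import random
import struct
import zlib

# Synthetic stand-in data (an assumption, not the real simulation output).
random.seed(1)
values = [random.uniform(0.0, 50.0) for _ in range(20000)]

# Same numbers, three encodings.
text = "".join(f"{v:.7g}\n" for v in values).encode("ascii")
doubles = struct.pack(f"<{len(values)}d", *values)  # 8 bytes per value
floats = struct.pack(f"<{len(values)}f", *values)   # 4 bytes per value


def report(label, raw):
    """Print raw and zlib-compressed sizes; return both in bytes."""
    packed = zlib.compress(raw, 6)
    ratio = len(packed) / len(raw)
    print(f"{label:8s} raw={len(raw):7d} B  "
          f"deflate={len(packed):7d} B  ratio={ratio:.2f}")
    return len(raw), len(packed)


text_raw, text_zip = report("text", text)
f64_raw, f64_zip = report("float64", doubles)
f32_raw, f32_zip = report("float32", floats)
```

In such a test the text and float64 encodings tend to have similar raw sizes, while the text compresses much better: it uses only a handful of distinct characters, whereas the mantissa bytes of floating-point numbers look close to random to a general-purpose compressor. This would be consistent with the .zip sizes reported above, but it is only one possible explanation.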

Best,

Joseph

Dear @JDecunha ,

thank you very much for the information and for the help.
Best regards,

Christian