Commit 79d981b2 authored by Enrico Guiraud

[DF][NFC] Improve doxygen docs

- add example usages to Filter, Define, Snapshot
- add some links to other parts of the documentation where relevant
- small fixes here and there
parent 205f015e
...@@ -138,6 +138,25 @@ public: ...@@ -138,6 +138,25 @@ public:
AddDefaultColumns(); AddDefaultColumns();
} }
////////////////////////////////////////////////////////////////////////////
/// \brief Cast any RDataFrame node to a common type ROOT::RDF::RNode.
/// Different RDataFrame methods return different C++ types. All nodes, however,
/// can be cast to this common type at the cost of a small performance penalty.
/// This allows, for example, storing RDataFrame nodes in a vector, or passing them
/// around via (non-template, C++11) helper functions.
/// Example usage:
/// ~~~{.cpp}
/// // a function that conditionally adds a Range to a RDataFrame node.
/// RNode MaybeAddRange(RNode df, bool mustAddRange)
/// {
/// return mustAddRange ? df.Range(1) : df;
/// }
/// // use as :
/// ROOT::RDataFrame df(10);
/// auto maybeRanged = MaybeAddRange(df, true);
/// ~~~
/// Note that it is not a problem to pass RNodes by value.
operator RNode() const operator RNode() const
{ {
return RNode(std::static_pointer_cast<::ROOT::Detail::RDF::RNodeBase>(fProxiedPtr), *fLoopManager, fCustomColumns, return RNode(std::static_pointer_cast<::ROOT::Detail::RDF::RNodeBase>(fProxiedPtr), *fLoopManager, fCustomColumns,
...@@ -163,6 +182,15 @@ public: ...@@ -163,6 +182,15 @@ public:
/// Even if multiple actions or transformations depend on the same filter, /// Even if multiple actions or transformations depend on the same filter,
/// it is executed once per entry. If its result is requested more than /// it is executed once per entry. If its result is requested more than
/// once, the cached result is served. /// once, the cached result is served.
///
/// ### Example usage:
/// ~~~{.cpp}
/// // C++ callable (function, functor class, lambda...) that takes two parameters of the types of "x" and "y"
/// auto filtered = df.Filter(myCut, {"x", "y"});
///
/// // String: it must contain valid C++ except that column names can be used instead of variable names
/// auto filtered = df.Filter("x*y > 0");
/// ~~~
template <typename F, typename std::enable_if<!std::is_convertible<F, std::string>::value, int>::type = 0> template <typename F, typename std::enable_if<!std::is_convertible<F, std::string>::value, int>::type = 0>
RInterface<RDFDetail::RFilter<F, Proxied>, DS_t> RInterface<RDFDetail::RFilter<F, Proxied>, DS_t>
Filter(F f, const ColumnNames_t &columns = {}, std::string_view name = "") Filter(F f, const ColumnNames_t &columns = {}, std::string_view name = "")
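The "executed once per entry, cached result served afterwards" behaviour described in the doc comment above can be modelled without ROOT. The following is only a conceptual sketch: `CachedFilter`, `CheckFilters`, `GetNEvaluations` and `CountEvaluationsForTwoConsumers` are hypothetical names chosen for illustration, not ROOT's actual `RFilter` implementation.

```cpp
#include <cassert>
#include <functional>
#include <utility>

// Hypothetical stand-in for a filter node (NOT ROOT's RFilter): it only models
// "evaluate the predicate once per entry, then serve the cached result".
class CachedFilter {
public:
   explicit CachedFilter(std::function<bool(long long)> predicate) : fPredicate(std::move(predicate)) {}

   // Returns whether `entry` passes the filter.
   // The predicate runs only when `entry` differs from the last checked entry.
   bool CheckFilters(long long entry)
   {
      if (entry != fLastCheckedEntry) {
         fLastResult = fPredicate(entry);
         fLastCheckedEntry = entry;
         ++fNEvaluations;
      }
      return fLastResult;
   }

   unsigned int GetNEvaluations() const { return fNEvaluations; }

private:
   std::function<bool(long long)> fPredicate;
   long long fLastCheckedEntry = -1; // assumes entry numbers are non-negative
   bool fLastResult = false;
   unsigned int fNEvaluations = 0;
};

// Two downstream consumers query the same filter for each of 4 entries.
// Returns how many times the predicate was actually evaluated.
inline unsigned int CountEvaluationsForTwoConsumers()
{
   CachedFilter isEven([](long long entry) { return entry % 2 == 0; });
   unsigned int nPassed = 0;
   for (long long entry = 0; entry < 4; ++entry) {
      if (isEven.CheckFilters(entry)) ++nPassed; // first consumer: predicate runs
      if (isEven.CheckFilters(entry)) ++nPassed; // second consumer: cached result
   }
   assert(nPassed == 4); // entries 0 and 2 pass, each seen by both consumers
   return isEven.GetNEvaluations();
}
```

In this toy model the predicate is evaluated four times for eight queries, which is the property the documentation promises for filters shared by several actions or transformations.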
...@@ -254,12 +282,20 @@ public: ...@@ -254,12 +282,20 @@ public:
/// in the dataset from subsequent transformations/actions. /// in the dataset from subsequent transformations/actions.
/// ///
/// Use cases include: /// Use cases include:
///
/// * caching the results of complex calculations for easy and efficient multiple access /// * caching the results of complex calculations for easy and efficient multiple access
/// * extraction of quantities of interest from complex objects /// * extraction of quantities of interest from complex objects
/// * column aliasing, i.e. changing the name of a branch/column
/// ///
/// An exception is thrown if the name of the new column is already in use. /// An exception is thrown if the name of the new column is already in use in this branch of the computation graph.
///
/// ### Example usage:
/// ~~~{.cpp}
/// // assuming a function with signature:
/// double myComplexCalculation(const RVec<float> &muon_pts);
/// // we can pass it directly to Define
/// auto df_with_define = df.Define("newColumn", myComplexCalculation, {"muon_pts"});
/// // alternatively, we can pass the body of the function as a string, as in Filter:
/// auto df_with_define = df.Define("newColumn", "x*x + y*y");
/// ~~~
template <typename F, typename std::enable_if<!std::is_convertible<F, std::string>::value, int>::type = 0> template <typename F, typename std::enable_if<!std::is_convertible<F, std::string>::value, int>::type = 0>
RInterface<Proxied, DS_t> Define(std::string_view name, F expression, const ColumnNames_t &columns = {}) RInterface<Proxied, DS_t> Define(std::string_view name, F expression, const ColumnNames_t &columns = {})
{ {
...@@ -283,8 +319,8 @@ public: ...@@ -283,8 +319,8 @@ public:
/// The following two calls are equivalent, although `DefineSlot` is slightly more performant: /// The following two calls are equivalent, although `DefineSlot` is slightly more performant:
/// ~~~{.cpp} /// ~~~{.cpp}
/// int function(unsigned int, double, double); /// int function(unsigned int, double, double);
/// Define("x", function, {"tdfslot_", "column1", "column2"}) /// df.Define("x", function, {"tdfslot_", "column1", "column2"})
/// DefineSlot("x", function, {"column1", "column2"}) /// df.DefineSlot("x", function, {"column1", "column2"})
/// ~~~ /// ~~~
/// ///
/// See Define for more information. /// See Define for more information.
...@@ -361,6 +397,11 @@ public: ...@@ -361,6 +397,11 @@ public:
/// \param[in] alias name of the column alias /// \param[in] alias name of the column alias
/// \param[in] columnName of the column to be aliased /// \param[in] columnName of the column to be aliased
/// Aliasing an alias is supported. /// Aliasing an alias is supported.
///
/// ### Example usage:
/// ~~~{.cpp}
/// auto df_with_alias = df.Alias("simple_name", "very_long&complex_name!!!");
/// ~~~
RInterface<Proxied, DS_t> Alias(std::string_view alias, std::string_view columnName) RInterface<Proxied, DS_t> Alias(std::string_view alias, std::string_view columnName)
{ {
// The symmetry with Define is clear. We want to: // The symmetry with Define is clear. We want to:
...@@ -392,8 +433,25 @@ public: ...@@ -392,8 +433,25 @@ public:
/// \param[in] filename The name of the output TFile /// \param[in] filename The name of the output TFile
/// \param[in] columnList The list of names of the columns/branches to be written /// \param[in] columnList The list of names of the columns/branches to be written
/// \param[in] options RSnapshotOptions struct with extra options to pass to TFile and TTree /// \param[in] options RSnapshotOptions struct with extra options to pass to TFile and TTree
/// \return a `RDataFrame` that uses the snapshotted tree as dataset
/// ///
/// This function returns a `RDataFrame` built with the output tree as a source. /// ### Example invocations:
/// ~~~{.cpp}
/// // without specifying template parameters (column types automatically deduced)
/// df.Snapshot("outputTree", "outputFile.root", {"x", "y"});
///
/// // specifying template parameters ("x" is `int`, "y" is `float`)
/// df.Snapshot<int, float>("outputTree", "outputFile.root", {"x", "y"});
/// ~~~
///
/// #### Using Snapshot as a lazy action
/// To book a Snapshot without triggering the event loop, one needs to set the appropriate flag in
/// `RSnapshotOptions`:
/// ~~~{.cpp}
/// RSnapshotOptions opts;
/// opts.fLazy = true;
/// df.Snapshot("outputTree", "outputFile.root", {"x"}, opts);
/// ~~~
template <typename... BranchTypes> template <typename... BranchTypes>
RResultPtr<RInterface<RLoopManager>> RResultPtr<RInterface<RLoopManager>>
Snapshot(std::string_view treename, std::string_view filename, const ColumnNames_t &columnList, Snapshot(std::string_view treename, std::string_view filename, const ColumnNames_t &columnList,
...@@ -411,6 +469,8 @@ public: ...@@ -411,6 +469,8 @@ public:
/// ///
/// This function returns a `RDataFrame` built with the output tree as a source. /// This function returns a `RDataFrame` built with the output tree as a source.
/// The types of the columns are automatically inferred and do not need to be specified. /// The types of the columns are automatically inferred and do not need to be specified.
///
/// See above for a more complete description and example usages.
RResultPtr<RInterface<RLoopManager>> Snapshot(std::string_view treename, std::string_view filename, RResultPtr<RInterface<RLoopManager>> Snapshot(std::string_view treename, std::string_view filename,
const ColumnNames_t &columnList, const ColumnNames_t &columnList,
const RSnapshotOptions &options = RSnapshotOptions()) const RSnapshotOptions &options = RSnapshotOptions())
...@@ -473,6 +533,8 @@ public: ...@@ -473,6 +533,8 @@ public:
/// ///
/// This function returns a `RDataFrame` built with the output tree as a source. /// This function returns a `RDataFrame` built with the output tree as a source.
/// The types of the columns are automatically inferred and do not need to be specified. /// The types of the columns are automatically inferred and do not need to be specified.
///
/// See above for a more complete description and example usages.
RResultPtr<RInterface<RLoopManager>> Snapshot(std::string_view treename, std::string_view filename, RResultPtr<RInterface<RLoopManager>> Snapshot(std::string_view treename, std::string_view filename,
std::string_view columnNameRegexp = "", std::string_view columnNameRegexp = "",
const RSnapshotOptions &options = RSnapshotOptions()) const RSnapshotOptions &options = RSnapshotOptions())
...@@ -492,6 +554,8 @@ public: ...@@ -492,6 +554,8 @@ public:
/// ///
/// This function returns a `RDataFrame` built with the output tree as a source. /// This function returns a `RDataFrame` built with the output tree as a source.
/// The types of the columns are automatically inferred and do not need to be specified. /// The types of the columns are automatically inferred and do not need to be specified.
///
/// See above for a more complete description and example usages.
RResultPtr<RInterface<RLoopManager>> Snapshot(std::string_view treename, std::string_view filename, RResultPtr<RInterface<RLoopManager>> Snapshot(std::string_view treename, std::string_view filename,
std::initializer_list<std::string> columnList, std::initializer_list<std::string> columnList,
const RSnapshotOptions &options = RSnapshotOptions()) const RSnapshotOptions &options = RSnapshotOptions())
......
...@@ -50,7 +50,7 @@ alt="DOI"></a> ...@@ -50,7 +50,7 @@ alt="DOI"></a>
- [Transformations](#transformations) -- manipulating data - [Transformations](#transformations) -- manipulating data
- [Actions](#actions) -- getting results - [Actions](#actions) -- getting results
- [Parallel execution](#parallel-execution) -- how to use it and common pitfalls - [Parallel execution](#parallel-execution) -- how to use it and common pitfalls
- [Class reference](#reference) -- most methods are implemented in the RInterface base class - [Class reference](#reference) -- most methods are implemented in the [RInterface](https://root.cern/doc/master/classROOT_1_1RDF_1_1RInterface.html) base class
## <a name="cheatsheet"></a>Cheat sheet ## <a name="cheatsheet"></a>Cheat sheet
These are the operations which can be performed with RDataFrame These are the operations which can be performed with RDataFrame
...@@ -169,16 +169,15 @@ d.Filter(IsGoodEvent).Foreach(DoStuff); ...@@ -169,16 +169,15 @@ d.Filter(IsGoodEvent).Foreach(DoStuff);
<tr> <tr>
<td> <td>
~~~{.cpp} ~~~{.cpp}
TTree *t = static_cast<TTree*>( TTree *t = nullptr;
file->Get("myTree") file->GetObject("myTree", t);
); t->Draw("x", "y > 2");
t->Draw("var", "var > 2");
~~~ ~~~
</td> </td>
<td> <td>
~~~{.cpp} ~~~{.cpp}
ROOT::RDataFrame d("myTree", file); ROOT::RDataFrame d("myTree", file);
auto h = d.Filter("var > 2").Histo1D("var"); auto h = d.Filter("y > 2").Histo1D("x");
~~~ ~~~
</td> </td>
</tr> </tr>
...@@ -196,22 +195,24 @@ RDataFrame's constructor is where the user specifies the dataset and, optionally ...@@ -196,22 +195,24 @@ RDataFrame's constructor is where the user specifies the dataset and, optionally
operations should work with. Here are the most common methods to construct a RDataFrame object: operations should work with. Here are the most common methods to construct a RDataFrame object:
~~~{.cpp} ~~~{.cpp}
// single file -- all ctors are equivalent // single file -- all ctors are equivalent
RDataFrame d1("treeName", "file.root");
TFile *f = TFile::Open("file.root"); TFile *f = TFile::Open("file.root");
RDataFrame d2("treeName", f); // same as TTreeReader
TTree *t = nullptr; TTree *t = nullptr;
f->GetObject("treeName", t); f->GetObject("treeName", t);
RDataFrame d1("treeName", "file.root");
RDataFrame d2("treeName", f); // same as TTreeReader
RDataFrame d3(*t); // TTreeReader takes a pointer, RDF takes a reference RDataFrame d3(*t); // TTreeReader takes a pointer, RDF takes a reference
// multiple files -- all ctors are equivalent // multiple files -- all ctors are equivalent
RDataFrame d3("myTree", {"file1.root", "file2.root"});
std::vector<std::string> files = {"file1.root", "file2.root"}; std::vector<std::string> files = {"file1.root", "file2.root"};
RDataFrame d3("myTree", files);
RDataFrame d4("myTree", "file*.root"); // see TRegexp's documentation for a list of valid regexes
TChain chain("myTree"); TChain chain("myTree");
chain.Add("file1.root"); chain.Add("file1.root");
chain.Add("file2.root"); chain.Add("file2.root");
RDataFrame d3(chain);
RDataFrame d4("myTree", {"file1.root", "file2.root"});
RDataFrame d5("myTree", files);
RDataFrame d6("myTree", "file*.root"); // see TRegexp's documentation for a list of valid regexes
RDataFrame d7(chain);
~~~ ~~~
Additionally, users can construct a RDataFrame specifying just an integer number. This is the number of "events" that Additionally, users can construct a RDataFrame specifying just an integer number. This is the number of "events" that
will be generated by this RDataFrame. will be generated by this RDataFrame.
...@@ -223,16 +224,6 @@ This is useful to generate simple data-sets on the fly: the contents of each eve ...@@ -223,16 +224,6 @@ This is useful to generate simple data-sets on the fly: the contents of each eve
transformation (explained below). For example, we have used this method to generate Pythia events (with a `Define` transformation (explained below). For example, we have used this method to generate Pythia events (with a `Define`
transformation) and write them to disk in parallel (with the `Snapshot` action). transformation) and write them to disk in parallel (with the `Snapshot` action).
### Programmatically get the list of column names
The list of column names available in the dataset can be obtained with the `GetColumnsNames` method:
~~~{.cpp}
RDataFrame d("myTree", "file.root");
auto colNames = d.GetColumnNames();
for (auto &&colName : colNames) {
std::cout << colName << std::endl;
}
~~~
### Filling a histogram ### Filling a histogram
Let's now tackle a very common task, filling a histogram: Let's now tackle a very common task, filling a histogram:
~~~{.cpp} ~~~{.cpp}
...@@ -387,12 +378,19 @@ Simple as that. More details are given [below](#parallel-execution). ...@@ -387,12 +378,19 @@ Simple as that. More details are given [below](#parallel-execution).
Here is a list of the most important features that have been omitted in the "Crash course" for brevity. Here is a list of the most important features that have been omitted in the "Crash course" for brevity.
You don't need to read all these to start using `RDataFrame`, but they are useful to save typing time and runtime. You don't need to read all these to start using `RDataFrame`, but they are useful to save typing time and runtime.
### Treatment of columns holding collections ### Programmatically get the list of column names
The `GetColumnNames()` method returns the list of valid column names for the dataset:
~~~{.cpp}
RDataFrame d("myTree", "file.root");
std::vector<std::string> colNames = d.GetColumnNames();
~~~
### Reading and manipulating collections
When using RDataFrame to read data from a ROOT file, users can specify that the type of a branch is `RVec<T>` When using RDataFrame to read data from a ROOT file, users can specify that the type of a branch is `RVec<T>`
to indicate the branch is a c-style array, an STL array or any other collection type associated to a to indicate the branch is a c-style array, a `std::vector` or any other collection type associated to a
contiguous storage in memory. contiguous storage in memory.
Column values of type `RVec<T>` perform no copy of the underlying array data, it's in some sense a view, Column values of type `RVec<T>` perform no copy of the underlying array data
and offer a rich interface to operate on the array elements in a vectorised fashion. and offer a rich interface to operate on the array elements in a vectorised fashion.
The `RVec<T>` type signals to RDataFrame that a special behaviour needs to be adopted when snapshotting The `RVec<T>` type signals to RDataFrame that a special behaviour needs to be adopted when snapshotting
...@@ -400,16 +398,15 @@ a dataset on disk. Indeed, if columns which are variable size C arrays are treat ...@@ -400,16 +398,15 @@ a dataset on disk. Indeed, if columns which are variable size C arrays are treat
RDataFrame will correctly persistify them - if anything else is adopted, for example `std::span`, only RDataFrame will correctly persistify them - if anything else is adopted, for example `std::span`, only
the first element of the array will be written. the first element of the array will be written.
### Callbacks Learn more on [RVec](https://root.cern/doc/master/classROOT_1_1VecOps_1_1RVec.html).
Acting on a RResultPtr, it is possible to register a callback that RDataFrame will execute "everyNEvents"
on a partial result.
The callback must be a callable that takes a reference to the result type as argument and returns nothing. ### Callbacks
RDataFrame, acting as a full fledged data processing framework, will invoke registered callbacks passing It's possible to schedule execution of arbitrary functions (callbacks) during the event loop.
partial action results as arguments to them (e.g. a histogram filled with a part of the selected events). Callbacks can be used e.g. to inspect partial results of the analysis while the event loop is running,
drawing a partially-filled histogram every time a certain number of new entries is processed, or even
displaying a progress bar while the event loop runs.
Callbacks can be used e.g. to inspect partial results of the analysis while the event loop is running. For For example one can draw an up-to-date version of a result histogram every 100 entries like this:
example one can draw an up-to-date version of a result histogram every 100 entries like this:
~~~{.cpp} ~~~{.cpp}
auto h = tdf.Histo1D("x"); auto h = tdf.Histo1D("x");
TCanvas c("c","x hist"); TCanvas c("c","x hist");
...@@ -417,6 +414,12 @@ h.OnPartialResult(100, [&c](TH1D &h_) { c.cd(); h_.Draw(); c.Update(); }); ...@@ -417,6 +414,12 @@ h.OnPartialResult(100, [&c](TH1D &h_) { c.cd(); h_.Draw(); c.Update(); });
h->Draw(); // event loop runs here, this `Draw` is executed after the event loop is finished h->Draw(); // event loop runs here, this `Draw` is executed after the event loop is finished
~~~ ~~~
Callbacks are registered to a RResultPtr and must be callables that take a reference to the result type as argument
and return nothing. RDataFrame will invoke registered callbacks passing partial action results as arguments to them
(e.g. a histogram filled with a part of the selected events).
Read more on RResultPtr::OnPartialResult().
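
The callback contract described above (a callable that takes a reference to the partial result, returns nothing, and is invoked every N entries) can be sketched with the standard library alone. This is a conceptual model under stated assumptions: `CountingResult`, its `Run` method and `RecordPartialCounts` are hypothetical, the "result" is a plain entry counter, and nothing here reflects how RDataFrame schedules callbacks internally.

```cpp
#include <cassert>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical model of a result that supports partial-result callbacks.
// The result being built is simply the number of processed entries.
class CountingResult {
public:
   // Callbacks take a reference to the partial result and return nothing.
   using Callback_t = std::function<void(unsigned long long &)>;

   // Register `callback` to be invoked every `everyNEvents` processed entries.
   void OnPartialResult(unsigned long long everyNEvents, Callback_t callback)
   {
      assert(everyNEvents > 0);
      fCallbacks.emplace_back(everyNEvents, std::move(callback));
   }

   // Toy single-threaded event loop: process `nEntries` entries, firing each
   // callback whenever the running count is a multiple of its period.
   unsigned long long Run(unsigned long long nEntries)
   {
      fCount = 0;
      for (unsigned long long i = 0; i < nEntries; ++i) {
         ++fCount;
         for (auto &periodAndCallback : fCallbacks) {
            if (fCount % periodAndCallback.first == 0)
               periodAndCallback.second(fCount);
         }
      }
      return fCount;
   }

private:
   std::vector<std::pair<unsigned long long, Callback_t>> fCallbacks;
   unsigned long long fCount = 0;
};

// Records the partial counts observed by a callback that fires every 100 entries
// during a 250-entry loop.
inline std::vector<unsigned long long> RecordPartialCounts()
{
   std::vector<unsigned long long> seen;
   CountingResult result;
   result.OnPartialResult(100, [&seen](unsigned long long &partial) { seen.push_back(partial); });
   result.Run(250);
   return seen;
}
```

In this sketch the callback fires at entries 100 and 200 and not at the end of the 250-entry loop, so a final update (like the trailing `h->Draw()` in the example above) still has to be issued after the loop finishes.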
### Default branch lists ### Default branch lists
When constructing a `RDataFrame` object, it is possible to specify a **default column list** for your analysis, in the When constructing a `RDataFrame` object, it is possible to specify a **default column list** for your analysis, in the
usual form of a list of strings representing branch/column names. The default column list will be used as a fallback usual form of a list of strings representing branch/column names. The default column list will be used as a fallback
...@@ -582,14 +585,6 @@ Objects read from each column are **built once and never copied**, for maximum e ...@@ -582,14 +585,6 @@ Objects read from each column are **built once and never copied**, for maximum e
When "upstream" filters are not passed, subsequent filters, temporary column expressions and actions are not evaluated, When "upstream" filters are not passed, subsequent filters, temporary column expressions and actions are not evaluated,
so it might be advisable to put the strictest filters first in the chain. so it might be advisable to put the strictest filters first in the chain.
### Using Snapshot as a lazy action
To use Snapshot without triggering the event loop, `RSnapshotOptions` is needed.
~~~{.cpp}
RSnapshotOptions opts;
opts.fLazy = true;
df.Snapshot("outputTree", "outputFile.root", {"x"}, opts);
~~~
### <a name="representgraph"></a>Printing the computation graph ### <a name="representgraph"></a>Printing the computation graph
It is possible to print the computation graph from any node to obtain a dot representation either on the standard output It is possible to print the computation graph from any node to obtain a dot representation either on the standard output
or in a file. or in a file.
......