[DF][NFC] Improve doxygen docs

- add example usages to Filter, Define, Snapshot
- add some links to other parts of the documentation where relevant
- small fixes here and there

Commit 79d981b2 authored 6 years ago by Enrico Guiraud
--- a/tree/dataframe/inc/ROOT/RDFInterface.hxx
+++ b/tree/dataframe/inc/ROOT/RDFInterface.hxx
@@ -138,6 +138,25 @@ public:
      AddDefaultColumns();
   }
+   ////////////////////////////////////////////////////////////////////////////
+   /// \brief Cast any RDataFrame node to a common type ROOT::RDF::RNode.
+   /// Different RDataFrame methods return different C++ types. All nodes, however,
+   /// can be cast to this common type at the cost of a small performance penalty.
+   /// This allows, for example, storing RDataFrame nodes in a vector, or passing them
+   /// around via (non-template, C++11) helper functions.
+   /// Example usage:
+   /// ~~~{.cpp}
+   /// // a function that conditionally adds a Range to a RDataFrame node.
+   /// RNode MaybeAddRange(RNode df, bool mustAddRange)
+   /// {
+   ///    return mustAddRange ? df.Range(1) : df;
+   /// }
+   /// // use as:
+   /// ROOT::RDataFrame df(10);
+   /// auto maybeRanged = MaybeAddRange(df, true);
+   /// ~~~
+   /// Note that it is not a problem to pass RNodes by value.
   operator RNode() const
   {
      return RNode(std::static_pointer_cast<::ROOT::Detail::RDF::RNodeBase>(fProxiedPtr), *fLoopManager, fCustomColumns,
@@ -163,6 +182,15 @@ public:
   /// Even if multiple actions or transformations depend on the same filter,
   /// it is executed once per entry. If its result is requested more than
   /// once, the cached result is served.
+   ///
+   /// ### Example usage:
+   /// ~~~{.cpp}
+   /// // C++ callable (function, functor class, lambda...) that takes two parameters of the types of "x" and "y"
+   /// auto filtered = df.Filter(myCut, {"x", "y"});
+   ///
+   /// // String: it must contain valid C++ except that column names can be used instead of variable names
+   /// auto filtered = df.Filter("x*y > 0");
+   /// ~~~
   template <typename F, typename std::enable_if<!std::is_convertible<F, std::string>::value, int>::type = 0>
   RInterface<RDFDetail::RFilter<F, Proxied>, DS_t>
   Filter(F f, const ColumnNames_t &columns = {}, std::string_view name = "")
@@ -254,12 +282,20 @@ public:
   /// in the dataset from subsequent transformations/actions.
   ///
   /// Use cases include:
-   ///
   /// * caching the results of complex calculations for easy and efficient multiple access
   /// * extraction of quantities of interest from complex objects
-   /// * column aliasing, i.e. changing the name of a branch/column
   ///
-   /// An exception is thrown if the name of the new column is already in use.
+   /// An exception is thrown if the name of the new column is already in use in this branch of the computation graph.
+   ///
+   /// ### Example usage:
+   /// ~~~{.cpp}
+   /// // assuming a function with signature:
+   /// double myComplexCalculation(const RVec<float> &muon_pts);
+   /// // we can pass it directly to Define
+   /// auto df_with_define = df.Define("newColumn", myComplexCalculation, {"muon_pts"});
+   /// // alternatively, we can pass the body of the function as a string, as in Filter:
+   /// auto df_with_define = df.Define("newColumn", "x*x + y*y");
+   /// ~~~
   template <typename F, typename std::enable_if<!std::is_convertible<F, std::string>::value, int>::type = 0>
   RInterface<Proxied, DS_t> Define(std::string_view name, F expression, const ColumnNames_t &columns = {})
   {
@@ -283,8 +319,8 @@ public:
   /// The following two calls are equivalent, although `DefineSlot` is slightly more performant:
   /// ~~~{.cpp}
   /// int function(unsigned int, double, double);
-   /// Define("x", function, {"tdfslot_", "column1", "column2"})
+   /// df.Define("x", function, {"tdfslot_", "column1", "column2"})
-   /// DefineSlot("x", function, {"column1", "column2"})
+   /// df.DefineSlot("x", function, {"column1", "column2"})
   /// ~~~
   ///
   /// See Define for more information.
@@ -361,6 +397,11 @@ public:
   /// \param[in] alias name of the column alias
   /// \param[in] columnName of the column to be aliased
   /// Aliasing an alias is supported.
+   /// 
+   /// ### Example usage:
+   /// ~~~{.cpp}
+   /// auto df_with_alias = df.Alias("simple_name", "very_long&complex_name!!!");
+   /// ~~~
   RInterface<Proxied, DS_t> Alias(std::string_view alias, std::string_view columnName)
   {
      // The symmetry with Define is clear. We want to:
@@ -392,8 +433,25 @@ public:
   /// \param[in] filename The name of the output TFile
   /// \param[in] columnList The list of names of the columns/branches to be written
   /// \param[in] options RSnapshotOptions struct with extra options to pass to TFile and TTree
+   /// \return a `RDataFrame` that uses the snapshotted tree as dataset
   ///
-   /// This function returns a `RDataFrame` built with the output tree as a source.
+   /// ### Example invocations:
+   /// ~~~{.cpp}
+   /// // without specifying template parameters (column types automatically deduced)
+   /// df.Snapshot("outputTree", "outputFile.root", {"x", "y"});
+   ///
+   /// // specifying template parameters ("x" is `int`, "y" is `float`)
+   /// df.Snapshot<int, float>("outputTree", "outputFile.root", {"x", "y"});
+   /// ~~~
+   ///
+   /// #### Using Snapshot as a lazy action
+   /// To book a Snapshot without triggering the event loop, one needs to set the appropriate flag in
+   /// `RSnapshotOptions`:
+   /// ~~~{.cpp}
+   /// RSnapshotOptions opts;
+   /// opts.fLazy = true;
+   /// df.Snapshot("outputTree", "outputFile.root", {"x"}, opts);
+   /// ~~~
   template <typename... BranchTypes>
   RResultPtr<RInterface<RLoopManager>>
   Snapshot(std::string_view treename, std::string_view filename, const ColumnNames_t &columnList,
@@ -411,6 +469,8 @@ public:
   ///
   /// This function returns a `RDataFrame` built with the output tree as a source.
   /// The types of the columns are automatically inferred and do not need to be specified.
+   ///
+   /// See above for a more complete description and example usages.
   RResultPtr<RInterface<RLoopManager>> Snapshot(std::string_view treename, std::string_view filename,
                                                 const ColumnNames_t &columnList,
                                                 const RSnapshotOptions &options = RSnapshotOptions())
@@ -473,6 +533,8 @@ public:
   ///
   /// This function returns a `RDataFrame` built with the output tree as a source.
   /// The types of the columns are automatically inferred and do not need to be specified.
+   ///
+   /// See above for a more complete description and example usages.
   RResultPtr<RInterface<RLoopManager>> Snapshot(std::string_view treename, std::string_view filename,
                                                 std::string_view columnNameRegexp = "",
                                                 const RSnapshotOptions &options = RSnapshotOptions())
@@ -492,6 +554,8 @@ public:
   ///
   /// This function returns a `RDataFrame` built with the output tree as a source.
   /// The types of the columns are automatically inferred and do not need to be specified.
+   ///
+   /// See above for a more complete description and example usages.
   RResultPtr<RInterface<RLoopManager>> Snapshot(std::string_view treename, std::string_view filename,
                                                 std::initializer_list<std::string> columnList,
                                                 const RSnapshotOptions &options = RSnapshotOptions())

--- a/tree/dataframe/src/RDataFrame.cxx
+++ b/tree/dataframe/src/RDataFrame.cxx
@@ -50,7 +50,7 @@ alt="DOI"></a>
 - [Transformations](#transformations) -- manipulating data
 - [Actions](#actions) -- getting results
 - [Parallel execution](#parallel-execution) -- how to use it and common pitfalls
- [Class reference](#reference) -- most methods are implemented in the RInterface base class
+- [Class reference](#reference) -- most methods are implemented in the [RInterface](https://root.cern/doc/master/classROOT_1_1RDF_1_1RInterface.html) base class
 ## <a name="cheatsheet"></a>Cheat sheet
 These are the operations which can be performed with RDataFrame
@@ -169,16 +169,15 @@ d.Filter(IsGoodEvent).Foreach(DoStuff);
 <tr>
   <td>
 ~~~{.cpp}
-TTree *t = static_cast<TTree*>(
+TTree *t = nullptr;
-   file->Get("myTree")
+file->GetObject("myTree", t);
-);
+t->Draw("x", "y > 2");
-t->Draw("var", "var > 2");
 ~~~
   </td>
   <td>
 ~~~{.cpp}
 ROOT::RDataFrame d("myTree", file);
-auto h = d.Filter("var > 2").Histo1D("var");
+auto h = d.Filter("y > 2").Histo1D("x");
 ~~~
   </td>
 </tr>
@@ -196,22 +195,24 @@ RDataFrame's constructor is where the user specifies the dataset and, optionally
 operations should work with. Here are the most common methods to construct a RDataFrame object:
 ~~~{.cpp}
 // single file -- all ctors are equivalent
-RDataFrame d1("treeName", "file.root");
 TFile *f = TFile::Open("file.root");
-RDataFrame d2("treeName", f); // same as TTreeReader
 TTree *t = nullptr;
 f->GetObject("treeName", t);
+RDataFrame d1("treeName", "file.root");
+RDataFrame d2("treeName", f); // same as TTreeReader
 RDataFrame d3(*t); // TTreeReader takes a pointer, RDF takes a reference
 // multiple files -- all ctors are equivalent
-RDataFrame d3("myTree", {"file1.root", "file2.root"});
 std::vector<std::string> files = {"file1.root", "file2.root"};
-RDataFrame d3("myTree", files);
-RDataFrame d4("myTree", "file*.root"); // see TRegexp's documentation for a list of valid regexes
 TChain chain("myTree");
 chain.Add("file1.root");
 chain.Add("file2.root");
-RDataFrame d3(chain);
+RDataFrame d4("myTree", {"file1.root", "file2.root"});
+RDataFrame d5("myTree", files);
+RDataFrame d6("myTree", "file*.root"); // see TRegexp's documentation for a list of valid regexes
+RDataFrame d7(chain);
 ~~~
 Additionally, users can construct a RDataFrame specifying just an integer number. This is the number of "events" that
 will be generated by this RDataFrame.
@@ -223,16 +224,6 @@ This is useful to generate simple data-sets on the fly: the contents of each eve
 transformation (explained below). For example, we have used this method to generate Pythia events (with a `Define`
 transformation) and write them to disk in parallel (with the `Snapshot` action).
-### Programmatically get the list of column names
-The list of column names available in the dataset can be obtained with the `GetColumnsNames` method:
-~~~{.cpp}
-RDataFrame d("myTree", "file.root");
-auto colNames = d.GetColumnNames();
-for (auto &&colName : colNames) {
-   std::cout << colName << std::endl;
-   }
-~~~
 ### Filling a histogram
 Let's now tackle a very common task, filling a histogram:
 ~~~{.cpp}
@@ -387,12 +378,19 @@ Simple as that. More details are given [below](#parallel-execution).
 Here is a list of the most important features that have been omitted in the "Crash course" for brevity.
 You don't need to read all these to start using `RDataFrame`, but they are useful to save typing time and runtime.
-### Treatment of columns holding collections
+### Programmatically get the list of column names
+The `GetColumnNames()` method returns the list of valid column names for the dataset:
+~~~{.cpp}
+RDataFrame d("myTree", "file.root");
+std::vector<std::string> colNames = d.GetColumnNames();
+~~~
+### Reading and manipulating collections
 When using RDataFrame to read data from a ROOT file, users can specify that the type of a branch is `RVec<T>`
-to indicate the branch is a c-style array, an STL array or any other collection type associated to a
+to indicate the branch is a C-style array, a `std::vector` or any other collection type associated with a
 contiguous storage in memory.
-Column values of type `RVec<T>` perform no copy of the underlying array data, it's in some sense a view,
+Column values of type `RVec<T>` perform no copy of the underlying array data
 and offer a rich interface to operate on the array elements in a vectorised fashion.
 The `RVec<T>` type signals to RDataFrame that a special behaviour needs to be adopted when snapshotting
@@ -400,16 +398,15 @@ a dataset on disk. Indeed, if columns which are variable size C arrays are treat
 RDataFrame will correctly persistify them - if anything else is adopted, for example `std::span`, only
 the first element of the array will be written.
-### Callbacks
+Learn more on [RVec](https://root.cern/doc/master/classROOT_1_1VecOps_1_1RVec.html).
-Acting on a RResultPtr, it is possible to register a callback that RDataFrame will execute "everyNEvents"
-on a partial result.
-The callback must be a callable that takes a reference to the result type as argument and returns nothing.
+### Callbacks
-RDataFrame, acting as a full fledged data processing framework, will invoke registered callbacks passing
+It's possible to schedule execution of arbitrary functions (callbacks) during the event loop.
-partial action results as arguments to them (e.g. a histogram filled with a part of the selected events).
+Callbacks can be used e.g. to inspect partial results of the analysis while the event loop is running,
+drawing a partially-filled histogram every time a certain number of new entries is processed, or even
+displaying a progress bar while the event loop runs.
-Callbacks can be used e.g. to inspect partial results of the analysis while the event loop is running. For
+For example one can draw an up-to-date version of a result histogram every 100 entries like this:
-example one can draw an up-to-date version of a result histogram every 100 entries like this:
 ~~~{.cpp}
 auto h = tdf.Histo1D("x");
 TCanvas c("c","x hist");
@@ -417,6 +414,12 @@ h.OnPartialResult(100, [&c](TH1D &h_) { c.cd(); h_.Draw(); c.Update(); });
 h->Draw(); // event loop runs here, this `Draw` is executed after the event loop is finished
 ~~~
+Callbacks are registered to a RResultPtr and must be callables that take a reference to the result type as argument
+and return nothing. RDataFrame will invoke registered callbacks passing partial action results as arguments to them
+(e.g. a histogram filled with a part of the selected events).
+Read more on RResultPtr::OnPartialResult().
 ### Default branch lists
 When constructing a `RDataFrame` object, it is possible to specify a **default column list** for your analysis, in the
 usual form of a list of strings representing branch/column names. The default column list will be used as a fallback
@@ -582,14 +585,6 @@ Objects read from each column are **built once and never copied**, for maximum e
 When "upstream" filters are not passed, subsequent filters, temporary column expressions and actions are not evaluated,
 so it might be advisable to put the strictest filters first in the chain.
-### Using Snapshot as a lazy action
-To use Snapshot without triggering the event loop, `RSnapshotOptions` is needed.
-~~~{.cpp}
-RSnapshotOptions opts;
-opts.fLazy = true;
-df.Snapshot("outputTree", "outputFile.root", {"x"}, opts);
-~~~
 ### <a name="representgraph"></a>Printing the computation graph
 It is possible to print the computation graph from any node to obtain a dot representation either on the standard output
 or in a file.