From b550d3b74bdfd8efbff9ce52faa8652df0242d22 Mon Sep 17 00:00:00 2001
From: Enrico Guiraud <enrico.guiraud@cern.ch>
Date: Sat, 11 Mar 2017 23:24:58 +0100
Subject: [PATCH] [TDF] Update user guide

- add Histo{2D,3D}, Profile{1D,2D} to docs
- Extra actions -> Queries
- temporary branch -> temporary column
- Histo -> Histo1D
---
 tree/treeplayer/src/TDataFrame.cxx | 29 +++++++++++++++--------------
 1 file changed, 15 insertions(+), 14 deletions(-)

diff --git a/tree/treeplayer/src/TDataFrame.cxx b/tree/treeplayer/src/TDataFrame.cxx
index c8ae86602b0..3ef1e879b80 100644
--- a/tree/treeplayer/src/TDataFrame.cxx
+++ b/tree/treeplayer/src/TDataFrame.cxx
@@ -67,7 +67,7 @@ common operations: building blocks to trigger custom calculations are available
 1. **build a data-frame** object by specifying your data-set
 2. **apply a series of transformations** to your data
    1. **filter** (e.g. apply some cuts) or
-   2. create a **temporary branch** (e.g. make available an alias or the result of a non trivial operation involving other branches)
+   2. create a **temporary column** (e.g. the result of an expensive computation on branches, or an alias for a branch)
 3. **apply actions** to the transformed data to produce results (e.g. fill a histogram)
 4.
 <table>
@@ -138,7 +138,7 @@ h->Draw();
 ~~~
 The first line creates a `TDataFrame` associated to the `TTree` "myTree". This tree has a branch named "MET".
-`Histo` is an action; it returns a smart pointer (a `TActionResultPtr` to be precise) to a `TH1F` histogram filled with the `MET` of all events.
+`Histo1D` is an action; it returns a smart pointer (a `TActionResultProxy` to be precise) to a `TH1F` histogram filled with the `MET` of all events.
 If the quantity stored in the branch is a collection, the histogram is filled with its elements.
 There are many other possible [actions](#overview), and all their results are wrapped in smart pointers; we'll see why in a minute.
@@ -154,7 +154,7 @@ std::cout << *c << std::endl;
 ~~~
 `Filter` takes a function (a lambda in this example, but it can be any kind of function or even a functor class) and a list of branch names. The filter function is applied to the specified branches for each event; it is required to return a `bool` which signals whether the event passes the filter (`true`) or not (`false`). You can think of your data as "flowing" through the chain of calls, being transformed, filtered and finally used to perform actions. Multiple `Filter` calls can be chained one after another.
-### Creating a temporary branch
+### Creating a temporary column
 Let's now consider the case in which "myTree" contains two quantities "x" and "y", but our analysis relies on a derived quantity `z = sqrt(x*x + y*y)`. Using the `AddColumn` transformation, we can create a new column in the data-set containing the variable "z":
 ~~~{.cpp}
@@ -167,7 +167,7 @@ auto zMean = d.AddColumn("z", sqrtSum, {"x","y"})
               .Mean("z");
 std::cout << *zMean << std::endl;
 ~~~
-`AddColumn` creates the variable "z" by applying `sqrtSum` to "x" and "y". Later in the chain of calls we refer to variables created with `AddColumn` as if they were actual tree branches, but they are evaluated on the fly, once per event. As with filters, `AddColumn` calls can be chained with other transformations to create multiple temporary branches.
+`AddColumn` creates the variable "z" by applying `sqrtSum` to "x" and "y". Later in the chain of calls we refer to variables created with `AddColumn` as if they were actual tree branches, but they are evaluated on the fly, once per event. As with filters, `AddColumn` calls can be chained with other transformations to create multiple temporary columns.
 ### Executing multiple actions
 As a final example let us apply two different cuts on branch "MET" and fill two different histograms with the "pt\_v" of the filtered events.
@@ -212,19 +212,19 @@ auto min = d2.Filter([](double b2) { return b2 > 0; }, {"b2"}).Min();
 ~~~
 ### Branch type guessing and explicit declaration of branch types
-C++ is a statically typed language: all types must be known at compile-time. This includes the types of the `TTree` branches we want to work on. For filters, temporary branches and some of the actions, **branch types are deduced from the signature** of the relevant filter function/temporary branch expression/action function:
+C++ is a statically typed language: all types must be known at compile-time. This includes the types of the `TTree` branches we want to work on. For filters, temporary columns and some of the actions, **branch types are deduced from the signature** of the relevant filter function/temporary column expression/action function:
 ~~~{.cpp}
 // here b1 is deduced to be `int` and b2 to be `double`
 dataFrame.Filter([](int x, double y) { return x > 0 && y < 0.; }, {"b1", "b2"});
 ~~~
 If we specify an incorrect type for one of the branches, an exception with an informative message will be thrown at runtime, when the branch value is actually read from the `TTree`: the implementation of `TDataFrame` allows the detection of type mismatches. The same would happen if we swapped the order of "b1" and "b2" in the branch list passed to `Filter`.
-Certain actions, on the other hand, do not take a function as argument (e.g. `Histo`), so we cannot deduce the type of the branch at compile-time. In this case **`TDataFrame` tries to guess the type of the branch**, trying out the most common ones and `std::vector` thereof. This is why we never needed to specify the branch types for all actions in the above snippets.
+Certain actions, on the other hand, do not take a function as argument (e.g. `Histo1D`), so we cannot deduce the type of the branch at compile-time. In this case **`TDataFrame` tries to guess the type of the branch**, trying out the most common ones and `std::vector` thereof. This is why we never needed to specify the branch types for all actions in the above snippets.
 When the branch type is not a common one such as `int`, `double`, `char` or `float` it is therefore good practice to specify it as a template parameter to the action itself, like this:
 ~~~{.cpp}
 dataFrame.Histo1D("b1"); // OK if b1 is a "common" type
-dataFrame.Histo<Object_t>("myObject"); // OK, "myObject" is deduced to be of type `Object_t`
+dataFrame.Histo1D<Object_t>("myObject"); // OK, "myObject" is deduced to be of type `Object_t`
 // dataFrame.Histo1D("myObject"); // THROWS an exception
 ~~~
@@ -262,7 +262,7 @@ You see how we created one `double` variable for each thread in the pool, and la
 ### Call graphs (storing and reusing sets of transformations)
 **Sets of transformations can be stored as variables** and reused multiple times to create **call graphs** in which several paths of filtering/creation of branches are executed simultaneously; we often refer to this as "storing the state of the chain".
-This feature can be used, for example, to create a temporary branch once and use it in several subsequent filters or actions, or to apply a strict filter to the data-set *before* executing several other transformations and actions, effectively reducing the amount of events processed.
+This feature can be used, for example, to create a temporary column once and use it in several subsequent filters or actions, or to apply a strict filter to the data-set *before* executing several other transformations and actions, effectively reducing the amount of events processed.
 Let's try to make this clearer with a commented example:
 ~~~{.cpp}
@@ -290,8 +290,8 @@ h2->Draw(); // first access to an action result: run event-loop!
 h3->Draw("SAME"); // event loop does not need to be run again here..
 std::cout << "Entries in h1: " << h1->GetEntries() << std::endl; // ..or here
 ~~~
-`TDataFrame` detects when several actions use the same filter or the same temporary branch, and **only evaluates each filter or temporary branch once per event**, regardless of how many times that result is used down the call graph. Objects read from each branch are **built once and never copied**, for maximum efficiency.
-When "upstream" filters are not passed, subsequent filters, temporary branch expressions and actions are not evaluated, so it might be advisable to put the strictest filters first in the chain.
+`TDataFrame` detects when several actions use the same filter or the same temporary column, and **only evaluates each filter or temporary column once per event**, regardless of how many times that result is used down the call graph. Objects read from each branch are **built once and never copied**, for maximum efficiency.
+When "upstream" filters are not passed, subsequent filters, temporary column expressions and actions are not evaluated, so it might be advisable to put the strictest filters first in the chain.
 ## <a name="transformations"></a>Transformations
 ### Filters
@@ -309,8 +309,8 @@ Statistics are retrieved through a call to the `Report` method:
 Stats are printed in the same order as named filters have been added to the graph, and *refer to the latest event-loop* that has been run using the relevant `TDataFrame`. If `Report` is called before the event-loop has been run at least once, a run is triggered.
-### Temporary branches
-Temporary branches are created by invoking `AddColumn(name, f, branchList)`. As usual, `f` can be any callable object (function, lambda expression, functor class...); it takes the values of the branches listed in `branchList` (a list of strings) as parameters, in the same order as they are listed in `branchList`. `f` must return the value that will be assigned to the temporary branch.
+### Temporary columns
+Temporary columns are created by invoking `AddColumn(name, f, branchList)`. As usual, `f` can be any callable object (function, lambda expression, functor class...); it takes the values of the branches listed in `branchList` (a list of strings) as parameters, in the same order as they are listed in `branchList`. `f` must return the value that will be assigned to the temporary column.
 A new variable is created called `name`, accessible as if it was contained in the dataset from subsequent transformations/actions.
@@ -334,10 +334,11 @@ In the following, whenever we say an action "returns" something, we always mean
 |------------------|-----------------|
 | Count | Return the number of events processed. |
 | Take | Build a collection of values of a branch. |
-| Histo | Fill a histogram with the values of a branch that passed all filters. |
+| Histo{1D,2D,3D} | Fill a {one,two,three}-dimensional histogram with the branch values that passed all filters. |
 | Max | Return the maximum of processed branch values. |
 | Mean | Return the mean of processed branch values. |
 | Min | Return the minimum of processed branch values. |
+| Profile{1D,2D} | Fill a {one,two}-dimensional profile with the branch values that passed all filters. |
 | Reduce | Reduce (e.g. sum, merge) entries using the function (lambda, functor...) passed as argument. The function must have signature `T(T,T)` where `T` is the type of the branch. Return the final result of the reduction operation. An optional parameter allows initialization of the result object to non-default values. |
 | **Instant actions** | **Description** |
@@ -345,7 +346,7 @@ In the following, whenever we say an action "returns" something, we always mean
 | Foreach | Execute a user-defined function on each entry. Users are responsible for the thread-safety of this lambda when executing with implicit multi-threading enabled. |
 | ForeachSlot | Same as `Foreach`, but the user-defined function must take an extra `unsigned int slot` as its first parameter. `slot` will take a different value, `0` to `nThreads - 1`, for each thread of execution. This is meant as a helper in writing thread-safe `Foreach` actions when using `TDataFrame` after `ROOT::EnableImplicitMT()`. `ForeachSlot` works just as well with single-thread execution: in that case `slot` will always be `0`. |
-| **Extra** | **Description** |
+| **Queries** | **Description** |
 |-----------|-----------------|
 | Report | This is not properly an action, since when `Report` is called it does not book an operation to be performed on each entry. Instead, it interrogates the data-frame directly to print a cutflow report, i.e. statistics on how many entries have been accepted and rejected by the filters. See the section on [named filters](#named-filters-and-cutflow-reports) for a more detailed explanation. |
-- 
GitLab