Skip to content
Snippets Groups Projects
  1. Apr 21, 2021
  2. Apr 20, 2021
  3. Apr 19, 2021
    • Vincenzo Eduardo Padulano's avatar
      [DF] Give priority to user provided npartitions · f040e051
      Vincenzo Eduardo Padulano authored
      The distributed RDataFrame constructor accepts an optional `npartitions` keyword argument. Previously, if this argument was provided by the user, it set the number of partitions in which the rdf would split the distributed computations.
      But then, right before starting the execution, the distributed backend implementation tried to optimize this number. In the case of Spark, an educated guess for the number of partitions would be spark.executor.cores * spark.executor.instances, that is the number of distributed nodes times the number of cores used for each node.
      If we let this optimization happen just before the start of the execution, it means we completely disregard the user provided value for `npartitions`. Instead, the backend guessing at a number of partitions should happen only if the user doesn't supply one.
      This commit addresses the issue by moving the call to `backend.optimize_npartitions` inside the initialization of the distributed dataframe object, plus adds a couple of tests to check the behaviour in the Spark backend.
      f040e051
Loading