Many computing-intensive processes in R involve the repeated evaluation of a function over many items or parameter sets. These so-called embarrassingly parallel calculations can be run serially with the
Map function, or in parallel on a single machine with
mcMap (from the parallel package).
The rslurm package simplifies the process of distributing this type of calculation across a computing cluster that uses the Slurm workload manager. Its main function,
slurm_apply (and the related
slurm_map), automatically divides the computation over multiple nodes and writes the necessary submission scripts. The package also includes functions to retrieve and combine the output from different nodes, as well as wrappers for common Slurm commands.
To illustrate a typical rslurm workflow, we use a simple function that takes a mean and standard deviation as parameters, generates a million normal deviates and returns the sample mean and standard deviation.
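A function matching this description could be sketched as follows (its name and argument names, test_func, par_mu, and par_sd, correspond to the code and output shown later; the exact body is illustrative):

```r
test_func <- function(par_mu, par_sd) {
  # Generate a million normal deviates with the given parameters
  samp <- rnorm(10^6, mean = par_mu, sd = par_sd)
  # Return the sample mean and standard deviation as a named vector
  c(s_mu = mean(samp), s_sd = sd(samp))
}
```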
We then create a parameter data frame where each row is a parameter set and each column matches an argument of the function.
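For instance (the number of parameter sets is illustrative; only the first three rows are displayed below):

```r
pars <- data.frame(par_mu = 1:10,
                   par_sd = seq(0.1, 1, length.out = 10))
head(pars, 3)
```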
##   par_mu par_sd
## 1      1    0.1
## 2      2    0.2
## 3      3    0.3
We can now pass that function and the parameters data frame to
slurm_apply, specifying the number of cluster nodes to use and the number of CPUs per node. The latter (
cpus_per_node) determines how many processes will be forked on each node, as the
mc.cores argument of
parallel::mcMap.
library(rslurm)

sjob <- slurm_apply(test_func, pars, jobname = 'test_apply',
                    nodes = 2, cpus_per_node = 2, submit = FALSE)
## Submission scripts output in directory _rslurm_test_apply
The output of
slurm_apply is a
slurm_job object that stores a few pieces of information (job name, job ID, and the number of nodes) needed to retrieve the job’s output.
The default argument
submit = TRUE would submit a generated script to the Slurm cluster and print a message confirming that the job has been submitted to Slurm, assuming you are running R on a Slurm head node. When working from an R session without direct access to the cluster, you must set
submit = FALSE. Either way, the function creates a folder called
_rslurm_[jobname] in the working directory that contains scripts and data files. This folder may be moved to a Slurm head node, the shell command
sbatch submit.sh run from within the folder, and the folder moved back to your working directory. The contents of the
_rslurm_[jobname] folder after completion of the
test_apply job, following either manual or automatic (i.e. with
submit = TRUE) submission to the cluster, include one
results_*.RDS file for each node:
## [1] "results_0.RDS" "results_1.RDS"
The results from all the nodes can be read back into R with the
get_slurm_out function.
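For example, using the table output format discussed below (the object name res is arbitrary):

```r
res <- get_slurm_out(sjob, outtype = 'table')
head(res, 3)
```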
##       s_mu       s_sd
## 1 1.000137 0.09995552
## 2 2.000144 0.19988175
## 3 2.999822 0.30030102
The utility function
print_job_status displays the status of a submitted job (i.e. in queue, running or completed), and
cancel_slurm will remove a job from the queue, aborting its execution if necessary. These functions are R wrappers for the Slurm command line utilities squeue and scancel, respectively.
When outtype = 'table', the outputs from each function evaluation are row-bound into a single data frame; this is an appropriate format when the function returns a simple vector. The default
outtype = 'raw' combines the outputs into a list and can thus handle arbitrarily complex return objects.
res_raw <- get_slurm_out(sjob, outtype = 'raw')
res_raw[1:3]
## [[1]]
##       s_mu       s_sd
## 1.00013690 0.09995552
##
## [[2]]
##      s_mu      s_sd
## 2.0001445 0.1998817
##
## [[3]]
##     s_mu     s_sd
## 2.999822 0.300301
The utility function
cleanup_files deletes the temporary folder for the specified Slurm job.
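For example, to remove the folder created by the test_apply job above:

```r
cleanup_files(sjob)
```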
In addition to
slurm_apply, rslurm also defines a
slurm_call function, which sends a single function call to the cluster. It is analogous in syntax to the base R function
do.call, accepting a function and a named list of parameters as arguments.
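A call mirroring the earlier test_func example might look like this sketch (the parameter values are illustrative):

```r
sjob <- slurm_call(test_func, list(par_mu = 5, par_sd = 1),
                   jobname = 'test_call', submit = FALSE)
```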
## Submission scripts output in directory _rslurm_test_call
Because
slurm_call involves a single process on a single node, it does not recognize the
nodes and cpus_per_node arguments; otherwise, it accepts the same additional arguments (detailed in the sections below) as
slurm_apply.
The function passed to
slurm_apply can only receive atomic parameters stored within a data frame. Suppose we want instead to apply a function
func to a list of complex R objects,
obj_list. In that case we can use the function
slurm_map, which is similar in syntax to
lapply from base R and
map from the
purrr package. Its first argument is a list which can contain objects of any type, and its second argument is a function that acts on a single element of the list.
sjob <- slurm_map(obj_list, func, nodes = 2, cpus_per_node = 2)
The output generated by
slurm_map is structured the same way as
slurm_apply. The procedures for checking the job status, extracting the results of the job, and cleaning up job files are also the same as described above.
Each of the tasks started by
slurm_map begins by default in an “empty” R environment containing only the function to be evaluated and its parameters. If we want to pass additional arguments to the function that do not vary with each task, we can simply add them as additional arguments to
slurm_map, as in this example, where we want to take the logarithm of many integers but always use
log(x, base = 2).
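A sketch of such a call (the input list here is illustrative):

```r
sjob <- slurm_map(as.list(1:100), log, base = 2,
                  nodes = 2, cpus_per_node = 2)
```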
To pass additional objects to the jobs that aren’t explicitly included as arguments to the function passed to
slurm_map, we can use the argument
global_objects. For example, we might want to use an inline function that calls two other previously defined functions.
sjob <- slurm_apply(function(a, b) c(func1(a), func2(b)),
                    data.frame(a, b),
                    global_objects = c("func1", "func2"),
                    nodes = 2, cpus_per_node = 2)
The
global_objects argument specifies the names of any R objects (besides the parameters data frame) that must be accessed by the function passed to
slurm_apply. These objects are saved to a
.RDS file that is loaded on each cluster node prior to evaluating the function in parallel.
By default, all R packages attached to the current R session will also be attached (with
library) on each cluster node, though this can be modified with the optional
pkgs argument.
Particular clusters may require the specification of additional Slurm options, such as time and memory limits for the job. The
slurm_options argument allows you to set any of the command line options recognized by the Slurm
sbatch command. It should be formatted as a named list, using the long names of each option (e.g. “time” rather than “t”). Flags, i.e. command line options that are toggled rather than set to a particular value, should be set to TRUE in
slurm_options. For example, the following code sets the command line options --time=1:00:00 and --share.
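A sketch of such a call, reusing the test_func and pars objects defined earlier (these option values also appear in the job-dependency example later in this document):

```r
sopt <- list(time = '1:00:00', share = TRUE)
sjob <- slurm_apply(test_func, pars, slurm_options = sopt)
```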
As mentioned above, the
slurm_apply function creates a job-specific folder. This folder contains the parameters as an RDS file and (if applicable) the objects specified as
global_objects saved together in an RData file. The function also generates an R script (
slurm_run.R) to be run on each cluster node, as well as a Bash script (
submit.sh) to submit the job to Slurm.
More specifically, the Bash script tells Slurm to create a job array and the R script takes advantage of the unique
SLURM_ARRAY_TASK_ID environment variable that Slurm will set on each cluster node. This variable is read by
slurm_run.R, which allows each instance of the script to operate on a different parameter subset and write its output to a different results file. The R script calls
parallel::mcMap to parallelize calculations on each node.
The Slurm
--dependency option can be utilized by taking the job ID from the
slurm_job object returned by
slurm_apply.
The ID can be manually added to the slurm options. In the following example, the job ID of
sjob1 is used to ensure that
sjob2 does not begin running until after
sjob1 has completed.
# Job1
sopt1 <- list(time = '1:00:00', share = TRUE)
sjob1 <- slurm_apply(test_func, pars, slurm_options = sopt1)

# Job2 depends on Job1
pars2 <- data.frame(par_mu = 2:20,
                    par_sd = seq(0.2, 2, length.out = 20))
sopt2 <- c(sopt1, list(dependency = sprintf("afterany:%s", sjob1$jobid)))
sjob2 <- slurm_apply(test_func2, pars2, slurm_options = sopt2)
The slurm_run.R and
submit.sh scripts are generated from templates, using the
whisker package; these templates can be found in the
rslurm/templates subfolder in your R package library. There are two templates for each script, one for
slurm_apply and the other (with the word “single” in its title) for
slurm_call.
While you should avoid changing any existing lines in the template scripts, you may want to add
#SBATCH lines to the
submit.sh templates in order to permanently set certain Slurm command line options and thus customize the package to your particular cluster setup.