A Better Pipeline For LSF
February 21, 2022
Simply put, I need a better toolkit to work efficiently. For the better part of the past year I have been working interchangeably between local machines and NCSU’s computing cluster. As I move between the two, there is a constant sense that the way I am working is inefficient and that there has to be a better way. The feeling is in large part due to the mismatch of writing data-intensive code locally but only being able to test it on the cluster due to various factors. At the time of writing, the current workflow is:
- Write code locally
- Push to git
- SSH into the cluster
- Clone/fetch changes to code
- Write or change job-script to reflect new changes
Job scripts are written in tcsh, and while this works well enough for small programs and code bases, it quickly becomes unmanageable as the scope grows and with it the need for iterative testing and debugging. Compounding the inefficiency, in an academic setting major changes to the direction of the code base can and do happen daily as new breakthroughs are realized and future ones tested. One can only go through writing a new job script to test and debug every change to data I/O or algorithm design so many times, only for a job to fail 4 hours in because a path had a single character wrong. Additionally, my work is not confined to one language but is mostly split between Python, R, and Bash, all passing data to each other through file I/O. A standard way to pipeline these processes is flat-file scripting, where each program writes its output to disk to pass to the next. While this method works locally, it is not a productive way to work on the cluster.
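To make the flat-file handoff concrete, here is a minimal sketch in Python. Both stages are stand-ins that I run with the current Python interpreter so the example works anywhere; in practice the second stage might be an Rscript or Bash call. The names run_stage, run_demo, raw.json, and summary.json are made up for illustration.

```python
import json
import subprocess
import sys
import tempfile
from pathlib import Path


def run_stage(cmd, expected_output):
    """Run one stage and confirm that it left its output file on disk."""
    subprocess.run(cmd, check=True)
    if not expected_output.exists():
        raise FileNotFoundError(f"stage did not write {expected_output}")
    return expected_output


def run_demo(workdir):
    """Chain two stages that only communicate through files in workdir."""
    raw = workdir / "raw.json"
    summary = workdir / "summary.json"
    # Stage 1 stands in for a preprocessing script: it writes a file.
    stage1 = (
        "import json, sys; from pathlib import Path; "
        "Path(sys.argv[1]).write_text(json.dumps([1, 2, 3]))"
    )
    # Stage 2 stands in for the next program: it only sees the file on disk.
    stage2 = (
        "import json, sys; from pathlib import Path; "
        "data = json.loads(Path(sys.argv[1]).read_text()); "
        "Path(sys.argv[2]).write_text(json.dumps({'total': sum(data)}))"
    )
    run_stage([sys.executable, "-c", stage1, str(raw)], raw)
    run_stage([sys.executable, "-c", stage2, str(raw), str(summary)], summary)
    return json.loads(summary.read_text())


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        print(run_demo(Path(tmp)))
```

Every handoff is a path, which is exactly why a single wrong character in one of them is enough to sink a job.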
What I want is a wrapper written in a higher-level language than tcsh that would provide better tooling and checking for jobs. The question then is: what is the best way to create it?
First, a quick overview of the LSF cluster. LSF (Platform Load Sharing Facility) is a batch scheduler that allows users to submit non-interactive jobs to an HPC cluster. Users interact with the cluster by SSHing into login nodes, constructing job scripts, and then using them to send computations to compute nodes. The scheduler then handles all the queuing and optimizes the distribution of compute resources.
The bsub command is then how users send off jobs:
bsub < jobfile.csh
where jobfile.csh might look like:
#BSUB -n 6 # Number of MPI tasks
#BSUB -R span[hosts=1] # Keep all tasks on one node
#BSUB -x # Exclusive use of nodes
#BSUB -J GlobalIndex # Name of job
#BSUB -W 2:30 # Wall clock time
#BSUB -o out.%J # Standard out
#BSUB -e err.%J # Standard error
module load conda
conda activate geo-ml
python3 job.py -i ... -o ... --kwargs ...
Rscript job.R ...
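As a first step toward a wrapper, the same job file could be generated from Python instead of written by hand. The following is only a sketch: render_job_script, submit, and the PROFILE dictionary are names I made up, and the only LSF behavior it relies on is that bsub reads the job script from standard input, as in the bsub < jobfile.csh call above. The script arguments are left elided exactly as in the job file.

```python
import subprocess

# Hypothetical project profile: one entry per #BSUB directive.
# A value of None means the flag takes no argument (for example -x).
PROFILE = {
    "n": 6,
    "R": "span[hosts=1]",
    "x": None,
    "J": "GlobalIndex",
    "W": "2:30",
    "o": "out.%J",
    "e": "err.%J",
}

COMMANDS = [
    "module load conda",
    "conda activate geo-ml",
    "python3 job.py -i ... -o ... --kwargs ...",
    "Rscript job.R ...",
]


def render_job_script(directives, commands):
    """Build the text of a job script from directives and shell commands."""
    lines = []
    for flag, value in directives.items():
        if value is None:
            lines.append(f"#BSUB -{flag}")
        else:
            lines.append(f"#BSUB -{flag} {value}")
    lines.extend(commands)
    return "\n".join(lines) + "\n"


def submit(script_text, dry_run=True):
    """Return the script unchanged on a dry run, otherwise pipe it to bsub."""
    if dry_run:
        return script_text
    result = subprocess.run(
        ["bsub"], input=script_text, text=True, capture_output=True, check=True
    )
    return result.stdout


if __name__ == "__main__":
    # Dry run by default: print what would be sent to the scheduler.
    print(submit(render_job_script(PROFILE, COMMANDS)))
```

Keeping the directives in a dictionary means the cluster parameters can live in one place per project rather than being retyped in every job script.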
What I want
So then what do I want?
- Dry runs
- Project-based job parameters
- Sub-job testing
I imagine something similar to Snakemake, with a high-level Pythonic API that provides modern benefits that tcsh simply cannot. It would be a tool that offers dry runs to check parameter paths, handles the sending of jobs, and provides “project profiles”, or configuration files for projects, that keep cluster parameters consistent between users and jobs. However, most of the pipelining software out there falls short in that it is not quite modular enough for experimentation. Rather, these tools excel at running the same pipeline, specified by a single file, whenever new data becomes available and producing the outputs. That is fantastic for its use case, but it limits you to a single main file for the pipeline, without the ability to specify which portion of the pipeline (sub-jobs) you need to run. This is particularly limiting when one needs to test sub-portions of a pipeline as the code base changes. If I can effectively test sub-portions of the pipeline, ensure their validity with dry runs, and better profile them, I think it would make my work significantly more efficient.
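To give a rough idea of the dry-run and sub-job pieces, here is a minimal sketch. Nothing in it is an existing tool: Step, select, dry_run, and the file paths are hypothetical names for illustration, and the only check performed is whether each input either exists on disk or is produced by an earlier selected step.

```python
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class Step:
    """One sub-job: a command plus the files it reads and writes."""
    name: str
    command: str
    inputs: list = field(default_factory=list)
    outputs: list = field(default_factory=list)


def select(steps, only=None):
    """Return every step, or just the named subset, in pipeline order."""
    if only is None:
        return list(steps)
    unknown = set(only) - {s.name for s in steps}
    if unknown:
        raise KeyError(f"unknown step(s): {sorted(unknown)}")
    return [s for s in steps if s.name in only]


def dry_run(steps, root="."):
    """List missing inputs without running any command."""
    root = Path(root)
    produced = set()  # outputs promised by earlier selected steps
    problems = []
    for step in steps:
        for path in step.inputs:
            if path not in produced and not (root / path).exists():
                problems.append(f"{step.name}: missing input {path}")
        produced.update(step.outputs)
    return problems


# Hypothetical two-step pipeline; the paths are placeholders.
pipeline = [
    Step("index", "python3 job.py", ["data/raw.csv"], ["data/index.csv"]),
    Step("model", "Rscript job.R", ["data/index.csv"], ["data/fit.rds"]),
]
```

Calling dry_run(select(pipeline, only=["model"])) would check just the R step, which is the kind of sub-portion testing I am after: a wrong path shows up before the job is queued rather than hours into it.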