
pollux.codes

Files for the pollux.codes site
git clone git://pollux.codes/git/pollux.codes.git
commit bfd35fcb6827175c27ca5ab4d4e92e2d7bd3dba9
parent 1ef50547ca2b81b6309002888ef9e28aa17878be
Author: Pollux <pollux@pollux.codes>
Date:   Fri, 30 May 2025 17:41:03 -0500

new blog post: data analysis using makefiles

Signed-off-by: Pollux <pollux@pollux.codes>

Diffstat:
A content/blog/data-analysis-using-makefiles.md | 247 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 247 insertions(+), 0 deletions(-)

diff --git a/content/blog/data-analysis-using-makefiles.md b/content/blog/data-analysis-using-makefiles.md
@@ -0,0 +1,247 @@

+++
title = 'Data Analysis Using Makefiles'
date = 2025-05-30T23:30:48-05:00
blogtags = ['programming']
+++

As an astrophysicist, I frequently have to perform complex data analysis
involving many sequential steps, some of which may take hours to complete. I
then have to fine-tune this analysis to correct errors, consider alternative
methods, and so on. This can be very time consuming, since each alteration
requires me not only to re-run the required analysis steps, but also to figure
out which steps need to be re-run. If I make a mistake, it can cost me many
hours of waiting for things to re-run.

I have recently settled on using make to manage my data-processing steps. Make
was originally created to help compile large programs, but many of its
compilation optimizations can also be leveraged to process data more
intelligently and to automatically figure out the fastest way to update data
products. I will talk about how I used these optimizations to improve my
data-analysis workflow, as well as the Makefile I will be using for the
foreseeable future.

## Code as Prerequisites

Among other things, a makefile defines targets and rules for making them. A
target is (usually) a file that needs to be created, such as an executable or an
object file in the case of compilation. Each target can have a rule associated
with it: a sequence of one or more shell commands that are executed in order
to (re-)create the file when needed.

A typical rule for linking a program from several compiled object files may
look like this:

```make
my-program: main.o utils.o
	gcc -o my-program main.o utils.o
```

This rule tells make that if it needs to create the file `my-program`, it just
needs to run the command on the second line, which will create it.
The rule also has several prerequisites, listed after the colon, which are all
the files used as input to the command. This allows make to update the target
file only when its prerequisites have changed. It does this by comparing
last-modified times: if any prerequisite was modified more recently than the
target, make assumes the target needs to be updated and runs the rule again.

Make gets really powerful when we chain these together. The prerequisite of one
rule can itself be a target with a separate rule. Make will then rebuild files
only when needed, minimizing the time required to update all the output files
when the input changes. For data analysis, we can use this to re-do a
processing step only if its input data files have changed. A rule to accomplish
this may look something like this:

```make
data/my-processed-data.csv: data/my-input-data.csv
	python my-pipeline-step.py
```

This way, we keep the processed data up to date and regenerate it only when
the input data file changes. Just as with compilation, the input data file can
itself be a target with its own rule, allowing a complex network of
inter-dependent data products to be navigated easily.

One small issue with this rule is that the target data file won't be
regenerated if I alter the code, as I often do. Luckily, this is quite simple
to fix: we can just add the program being run to the list of prerequisites.

```make
data/my-processed-data.csv: my-pipeline-step.py \
		data/my-input-data.csv
	python my-pipeline-step.py
```

Now if I modify the code, the data files will be updated accordingly.

## Generating Prerequisites Automatically

As cool as this is, it can still get tedious in large projects. I have had
makefiles with several dozen rules, which can get very difficult to maintain.
I also have to remember to add new rules whenever I create a new file, and to
update existing rules whenever I change which files each step uses.

A similar problem exists when compiling C programs, in the form of header
files. Each header file contains the declarations of the functions implemented
in the corresponding C file. Each C file will often include the headers of
other C files in order to use the functions they implement. This means that if
a header is modified, every C file that includes it must be recompiled. When it
comes to using make, that means each compiled target needs those headers in its
list of prerequisites. This is annoying to maintain, since C files are often
modified to include new headers or to drop headers they no longer need, and
each of these changes must be reflected in the Makefile for the program to
compile properly.

Recognizing this issue, compilers come with a nifty feature to generate
prerequisite lists for C files. How it works is much easier to explain with an
example. In a clean, empty directory, create a file named `main.c` with the
following code:

```c
#include "myheader.h"
#include "myotherheader.h"
```

Also make sure those header files exist; it's okay if they are empty.

Now run the following command to compile this file into an object:

```sh
gcc -c -MMD -MF main.d -o main.o main.c
```

You will get a `main.o` object file, but you will also get a `main.d` file with
the following contents:

```make
main.o: main.c myheader.h myotherheader.h
```

This is a make rule! Importantly, it contains the headers we included in our
`main.c` file. We can now simply include this file in our main Makefile, along
with a second rule specifying how to compile `main.c` (and create `main.d`),
and we have a self-updating makefile! This works because when multiple rules
exist for a single target, make combines all of their prerequisites.
```make
main.o: main.c
	gcc -c -MMD -MF main.d -o main.o main.c

-include main.d
```

Make is also intelligent enough to re-read `main.d` if it changes while
`main.o` is being created: it will double-check that the target is still up to
date after its prerequisites have changed, and update it again if it isn't.

Before I apply this technique to improve our data-processing Makefile, there is
one more optimization we can apply.

## Pattern Rules and Automatic Variables

Large programs often have hundreds of C files; some even have many thousands.
Each of these files needs to be compiled individually in order to create an
executable. A naive makefile for such a project may look like this:

```make
my-large-program: main.o my-code.o utils.o ... # (ad nauseam)
	gcc -o my-large-program main.o my-code.o utils.o ...

my-code.o: my-code.c
	gcc -c -o my-code.o my-code.c

utils.o: utils.c
	gcc -c -o utils.o utils.c

# ...
# (ad nauseam)
```

Notice how repetitive this is? Each C file has a separate rule, but each one is
nearly identical. Additionally, we have a single rule that lists every compiled
object file as a prerequisite.

Make has a few features to make this much simpler, namely pattern rules and
automatic variables. These let us define entire classes of rules at once, and
dynamically alter the rule based on the target being made.

Here is another makefile that does more or less the same thing as the one
above, but only has two rules:

```make
cfiles = $(wildcard *.c)
ofiles = $(patsubst %.c,%.o,$(cfiles))

my-large-program: $(ofiles)
	gcc -o $@ $^

%.o: %.c
	gcc -c -o $@ $^
```

We define a couple of variables, `cfiles` and `ofiles`. The first is a list of
all files in the working directory ending in `.c`; the second is the list of
corresponding object files, which we compute by simply replacing the file
extension on each element of `cfiles`.
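As a quick sanity check, the same wildcard-and-substitute logic can be sketched in plain shell. The directory and file names below are hypothetical stand-ins, and the `sed` expression is only a loose approximation of `patsubst` (which anchors the pattern to the whole word):

```shell
# Sketch of what $(wildcard *.c) and $(patsubst %.c,%.o,...) compute.
# Names here are illustrative, not from a real project.
mkdir -p patsubst-demo && cd patsubst-demo
touch main.c utils.c

cfiles="$(echo *.c)"                           # like $(wildcard *.c)
ofiles="$(echo "$cfiles" | sed 's/\.c/.o/g')"  # like $(patsubst %.c,%.o,$(cfiles))
echo "$ofiles"
```

Shell globs expand in sorted order, so `cfiles` is `main.c utils.c` and `ofiles` comes out as `main.o utils.o`.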
We then use the list of object files as the prerequisites of the main program,
which saves us from listing them out individually. We also have a pattern rule,
which is used whenever make needs to build a file ending in `.o`. The variables
`$@` and `$^` are automatic variables, which are substituted with the target
and the prerequisites, respectively. Using these, we can define how to compile
every C file with only a single rule, simplifying our makefile significantly.
This also allows our makefile to adjust itself automatically when we add new C
files.

## Putting It All Together

When we combine all of these features of make, along with some inline shell
scripting and string manipulation, we arrive at the following makefile:

```make
datadir = data
executables = $(shell find . -maxdepth 1 -executable -type f)

.makerules: $(executables)
	@printf 'Regenerating makefile...\n'
	@echo -n 'all:' > .makerules
	@for file in $^; do\
		provides=$$(cat $${file} | grep '^# PROVIDES: '\
			| cut -d ' ' -f3-);\
		echo -n ' $$(patsubst %,$(datadir)/%,\
			'$${provides}')';\
	done >> .makerules
	@echo >> .makerules
	@for file in $^; do\
		depends=$$(cat $${file} | grep '^# DEPENDS: '\
			| cut -d ' ' -f3-);\
		provides=$$(cat $${file} | grep '^# PROVIDES: '\
			| cut -d ' ' -f3-);\
		echo '$$(patsubst %,$(datadir)/%,'$${provides}')':\
			$${file} '$$(patsubst %,$(datadir)/%,'\
			$${depends}')';\
		echo '\t./'$${file};\
	done >> .makerules

-include .makerules
```

What this makefile does is generate the rules needed to perform the various
steps of the data analysis, and regenerate them when the code changes. The
names of the data files are read from the scripts themselves: each piece of
code must contain the following comment:

```python
# DEPENDS: <prerequisite data files>
# PROVIDES: <produced data files>
```

These comments are parsed from each file and turned into makefile rules, which
are then included in the main makefile.
A rule called `all`, which is run by default, is also created, with every
produced data file as a prerequisite. This way, simply running `make` by itself
will ensure every data file is up to date, and will only re-run a
data-processing step if either the code or one of the data files listed under
`DEPENDS` has changed, thus minimizing the amount of processing done to update
the data.
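To make the comment-scraping step concrete, here is a small shell sketch of the same `grep`/`cut` extraction the `.makerules` recipe performs, run against a hypothetical pipeline script (the script and data file names are illustrative). The final `printf` prints the rule in its expanded form, i.e. what make sees after evaluating the `$(patsubst ...)` calls:

```shell
# Create a hypothetical pipeline script carrying the two magic comments.
cat > my-pipeline-step.py <<'EOF'
#!/usr/bin/env python3
# DEPENDS: my-input-data.csv
# PROVIDES: my-processed-data.csv
EOF

# Extract the file lists the same way the .makerules recipe does.
depends=$(grep '^# DEPENDS: ' my-pipeline-step.py | cut -d ' ' -f3-)
provides=$(grep '^# PROVIDES: ' my-pipeline-step.py | cut -d ' ' -f3-)

# Assemble the rule this script would contribute, in expanded form.
printf 'data/%s: my-pipeline-step.py data/%s\n\t./my-pipeline-step.py\n' \
    "$provides" "$depends"
```

Once `.makerules` is regenerated and included, running `make` would rebuild `data/my-processed-data.csv` whenever either the script or its input file changes.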