commit bfd35fcb6827175c27ca5ab4d4e92e2d7bd3dba9
parent 1ef50547ca2b81b6309002888ef9e28aa17878be
Author: Pollux <pollux@pollux.codes>
Date: Fri, 30 May 2025 17:41:03 -0500
new blog post: data analysis using makefiles
Signed-off-by: Pollux <pollux@pollux.codes>
Diffstat:
1 file changed, 247 insertions(+), 0 deletions(-)
diff --git a/content/blog/data-analysis-using-makefiles.md b/content/blog/data-analysis-using-makefiles.md
@@ -0,0 +1,247 @@
++++
+title = 'Data Analysis Using Makefiles'
+date = 2025-05-30T23:30:48-05:00
+blogtags = ['programming']
++++
+
+As an astrophysicist, I frequently have to perform complex data analysis
+involving many sequential steps, some of which may take hours to complete. I
+then have to fine-tune this analysis to correct errors, consider alternative
+methods, and so on. This can be very time consuming, since each alteration
+requires me not only to re-run the required analysis steps, but also to figure
+out which steps need to be re-run. If I make a mistake, it can cost me many
+hours of waiting for things to re-run.
+
+I have recently settled on using make for managing my data processing
+steps. Make was originally created to help compile large programs, but many of
+its optimizations for compilation can also be leveraged to process data more
+intelligently, and to automatically figure out the fastest way to update data
+products. In this post I will talk about how I used these optimizations to
+improve my data-analysis workflow, and share the Makefile I will be using for
+the foreseeable future.
+
+## Code as Prerequisites
+
+Among other things, a makefile defines targets and rules for making them. A
+target is (usually) a file that needs to be created, such as an executable or an
+object file in the case of compilation. Each target can have a rule associated
+with it, which is a sequence of one or more shell commands that are executed in
+order to (re-)create the file if needed.
+
+A typical rule for linking a program from several compiled object files may
+look like this:
+
+```make
+my-program: main.o utils.o
+ gcc -o my-program main.o utils.o
+```
+
+This rule tells make that if it needs to create the file `my-program`, it just
+needs to run the command on the second line, which will create it. The rule
+also has several prerequisites, listed after the colon, which are all the files
+used as input to the command. This allows make to intelligently update the
+target file only if its prerequisites have changed. It does this by looking at
+the last-modified time of each file. If any prerequisite was modified more
+recently than the target, make assumes that the target needs to be updated and
+runs the rule again.
+
+Make gets really powerful when we chain these rules together: a prerequisite of
+one rule can itself be a target with its own rule. Make will intelligently
+update the files only when needed, minimizing the time required to bring all
+the output files up to date when the input changes. For data analysis, we can
+use this to re-do processing steps only if the input data files have changed. A
+rule to accomplish this may look something like this:
+
+```make
+data/my-processed-data.csv: data/my-input-data.csv
+ python my-pipeline-step.py
+```
+
+This way, we can keep the processed data up to date, and only regenerate it if
+the input data file has changed. Just as with compilation, this input data file
+can itself be a target with its own rule, allowing a complex network of
+inter-dependent data products to be navigated easily.
+
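+As a sketch of what such a chain might look like (the script and file names
+here are hypothetical, purely for illustration), a raw data file can feed a
+cleaning step, whose output then feeds a fitting step:
+
+```make
+data/cleaned.csv: data/raw.csv
+	python clean-data.py
+
+data/fit-results.csv: data/cleaned.csv
+	python fit-model.py
+```
+
+Running `make data/fit-results.csv` then re-runs only the steps whose inputs
+are newer than their outputs.
+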
+One small issue with these rules is that the target data files won't be
+regenerated if I alter the code, as I often do. Luckily, this is quite simple
+to fix. We can just add the program being run to the list of prerequisites.
+
+```make
+data/my-processed-data.csv: my-pipeline-step.py \
+ data/my-input-data.csv
+ python my-pipeline-step.py
+```
+
+Now if I modify the code, the data files will be updated accordingly.
+
+## Generating Prerequisites Automatically
+
+As cool as this is, it can still get tedious in large projects. I have had
+makefiles with several dozen rules, which can get very difficult to maintain.
+I also have to remember to add new rules whenever I create a new
+file, and remember to update the rules themselves when I change the files that
+each step uses.
+
+A similar problem exists when compiling C programs, in the form of header
+files. Each header file contains the declarations of functions implemented
+within the corresponding C file. Each C file will often include the headers of
+other C files to use the functions they implement. This means that if a header
+is modified, all the C files that include it need to be recompiled. When it
+comes to using make, that means each compiled target needs to have these
+headers in its list of prerequisites. This is annoying to maintain, since C
+files are often modified to include new headers, or to stop including headers
+they previously did. Each of these changes must be reflected in the Makefile in
+order for the program to compile properly.
+
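+In makefile terms, that means hand-writing rules along these lines (file names
+made up for illustration) and keeping the header list in sync with each file's
+`#include` lines by hand:
+
+```make
+main.o: main.c myheader.h myotherheader.h
+	gcc -c -o main.o main.c
+```
+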
+Recognizing this issue, compilers come with a nifty feature to generate
+prerequisite lists for C files. How this works is much easier to explain with
+an example. In a clean, empty directory, create a file named `main.c` with the
+following code:
+
+```c
+#include "myheader.h"
+#include "myotherheader.h"
+```
+
+Also make sure those header files exist; it's okay if they are empty.
+
+Now run the following command to compile this file into an object file.
+
+```sh
+gcc -c -MMD -MF main.d -o main.o main.c
+```
+
+You will get a `main.o` object file, but you will also get a `main.d` file with
+the following contents:
+
+```make
+main.o: main.c myheader.h myotherheader.h
+```
+
+This is a make rule! Importantly, it contains the headers we included in our
+`main.c` file. We can now simply include this file in our main Makefile, along
+with a second rule specifying how to compile `main.c` (and create `main.d`), and we
+have a self-updating makefile! This works because when multiple rules exist for
+a single target, make combines all the prerequisites together.
+
+```make
+main.o: main.c
+ gcc -c -MMD -MF main.d -o main.o main.c
+
+-include main.d
+```
+
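+To make the combining concrete: once `main.d` has been included, make has two
+rules for `main.o` (one from the Makefile, one from `main.d`) and behaves as if
+we had written a single rule:
+
+```make
+main.o: main.c myheader.h myotherheader.h
+	gcc -c -MMD -MF main.d -o main.o main.c
+```
+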
+Make is also intelligent enough to re-read `main.d` if it changes while
+`main.o` is being created. It will double-check that the target is still up to
+date even after its prerequisite list has changed, and update it again if it
+isn't.
+
+Before I apply this technique to improve our data-processing Makefile, there is
+one more optimization that we can apply.
+
+## Pattern Rules and Automatic Variables
+
+Large programs often have hundreds of C files, some even have many
+thousands. Each of these files needs to be compiled individually in order to
+create an executable. A naive makefile for such a project may look like this:
+
+```make
+
+my-large-program: main.o my-code.o utils.o ... # (ad nauseam)
+ gcc -o my-large-program main.o my-code.o utils.o ...
+
+my-code.o: my-code.c
+ gcc -c -o my-code.o my-code.c
+
+utils.o: utils.c
+ gcc -c -o utils.o utils.c
+
+# ...
+# (ad nauseam)
+
+```
+
+Notice how repetitive this is? Each C file has a separate rule, but each one is
+nearly identical. Additionally, we have a single rule that has every compiled
+object file as a prerequisite.
+
+Make has a few features to make this much simpler, namely pattern rules and
+automatic variables. These let us define entire classes of rules at once, and
+dynamically alter the rule based on the target being made.
+
+Here is another makefile that does more or less the same thing as the one above,
+but only has two rules.
+
+```make
+cfiles = $(wildcard *.c)
+ofiles = $(patsubst %.c,%.o,$(cfiles))
+
+my-large-program: $(ofiles)
+ gcc -o $@ $^
+
+%.o: %.c
+ gcc -c -o $@ $^
+```
+
+We define a couple of variables, namely `cfiles` and `ofiles`. The first is
+just a list of all files in the working directory ending with `.c`; the second
+is the list of corresponding object files, which we compute by simply replacing
+the file extension on each element of `cfiles`. We then use the list of object
+files as the prerequisites of the main program, which saves us from having to
+list them out individually. We also have a pattern rule, which is used whenever
+make needs to create a file ending with `.o`. The variables `$@` and `$^` are
+automatic variables, which are substituted with the target and the
+prerequisites, respectively. Using these, we can define how to compile every C
+file with only a single rule, simplifying our makefile significantly. This also
+allows our makefile to automatically adjust itself when we add new C files.
+
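+For completeness, here is a sketch of how pattern rules can be combined with
+the dependency-file trick from the previous section. This is a common idiom
+rather than something I use in the data-analysis makefile below. Note the
+automatic variable `$<`, which expands to just the first prerequisite, so the
+headers pulled in from the `.d` files are not passed to the compiler:
+
+```make
+cfiles = $(wildcard *.c)
+ofiles = $(patsubst %.c,%.o,$(cfiles))
+dfiles = $(patsubst %.c,%.d,$(cfiles))
+
+my-large-program: $(ofiles)
+	gcc -o $@ $^
+
+# generate a .d prerequisite file alongside each object file
+%.o: %.c
+	gcc -c -MMD -MF $(patsubst %.o,%.d,$@) -o $@ $<
+
+-include $(dfiles)
+```
+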
+## Putting It All Together
+
+When we combine all of these features of make, along with some inline shell
+scripting and string manipulation, we arrive at the following makefile:
+
+```make
+datadir = data
+executables = $(shell find . -maxdepth 1 -executable -type f)
+
+.makerules: $(executables)
+ @printf 'Regenerating makefile...\n'
+ @echo -n 'all:' > .makerules
+ @for file in $^; do\
+ provides=$$(cat $${file} | grep '^# PROVIDES: '\
+ | cut -d ' ' -f3-);\
+ echo -n ' $$(patsubst %,$(datadir)/%,\
+ '$${provides}')';\
+ done >> .makerules
+ @echo >> .makerules
+ @for file in $^; do\
+ depends=$$(cat $${file} | grep '^# DEPENDS: '\
+ | cut -d ' ' -f3-);\
+ provides=$$(cat $${file} | grep '^# PROVIDES: '\
+ | cut -d ' ' -f3-);\
+ echo '$$(patsubst %,$(datadir)/%,'$${provides}')':\
+ $${file} '$$(patsubst %,$(datadir)/%,'\
+ $${depends}')';\
+ echo '\t./'$${file};\
+ done >> .makerules
+
+-include .makerules
+```
+
+This makefile generates the rules needed to perform the various steps of the
+data analysis, and regenerates them whenever the code changes. The names of
+the data files are read from the scripts themselves. Each script must contain
+the following comments:
+
+```python
+# DEPENDS: <prerequisite data files>
+# PROVIDES: <produced data files>
+```
+
+These comments are parsed from each file and turned into makefile rules, which
+are then included in the main makefile. A target called `all`, which is built
+by default, is also created with every produced data file as a prerequisite.
+This way, simply running `make` by itself will ensure every data file is up to
+date, and will only re-run a data processing step if either the code or one of
+the data files listed under `DEPENDS` has changed, thus minimizing the amount
+of processing needed to keep the data current.
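+
+As a concrete (hypothetical) example, if an executable script `./clean-data.py`
+in the project root contains `# DEPENDS: raw.csv` and `# PROVIDES: cleaned.csv`,
+the generated `.makerules` will contain rules that, once make expands the
+embedded `patsubst` calls, are roughly equivalent to:
+
+```make
+all: data/cleaned.csv
+
+data/cleaned.csv: ./clean-data.py data/raw.csv
+	./clean-data.py
+```
+
+Running `make` with no arguments then brings `data/cleaned.csv` up to date,
+re-running `./clean-data.py` only when the script itself or `data/raw.csv` has
+changed.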