+++
title = 'Data Analysis Using Makefiles'
date = 2025-05-30T23:30:48-05:00
blogtags = ['programming']
+++

As an astrophysicist, I frequently have to perform complex data analysis involving many sequential steps, some of which may take hours to complete. I then have to fine-tune this analysis to correct errors, consider alternative methods, and so on. This can be very time-consuming, since each alteration requires me not only to re-run the required analysis steps, but also to figure out which steps need to be re-run. If I make a mistake, it can cost me many hours of waiting for things to re-run.

I have recently settled on using make to manage my data-processing steps. Make was originally designed to help with the compilation of large programs, but many of its optimizations for compilation can also be leveraged to process data more intelligently and to automatically figure out the fastest way to update data products. I will talk about how I used these optimizations to improve my data-analysis workflow, as well as the Makefile I will be using for the foreseeable future.

Code as Prerequisites

Among other things, a makefile defines targets and rules for making them. A target is (usually) a file that needs to be created, such as an executable or an object file in the case of compilation. Each target can have a rule associated with it, which is a sequence of one or more shell commands that are executed in order to (re-)create the file if needed.

A typical rule for linking a program from several compiled object files may look like this:

my-program: main.o utils.o
    gcc -o my-program main.o utils.o

This rule tells make that if it needs to create the file my-program, it just needs to run the command on the second line, which will create it. The rule also has several prerequisites, listed after the colon, which are all the files used as input to the command. This allows make to intelligently update the target file only if its prerequisites have changed. It does this by looking at the last-modified time of each file: if any of the prerequisites were modified more recently than the target, make assumes the target needs to be updated and runs the rule again.

Make gets really powerful when we chain these rules together: the prerequisite of one rule can itself be a target with a separate rule. Make will intelligently update files only when needed, minimizing the time required to bring all the output files up to date when the input changes. For data analysis, we can use this to re-do a processing step only if its input data files have changed. A rule to accomplish this may look something like this:

data/my-processed-data.csv: data/my-input-data.csv
    python my-pipeline-step.py

This way, we keep the processed data up to date, regenerating it only when the input data file has changed. Just as with compilation, the input data file can itself be the target of its own rule, allowing a complex network of inter-dependent data products to be navigated easily.
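
For instance, here is a minimal sketch of such a chain, using a hypothetical raw data file and cleaning script as the upstream step:

# hypothetical upstream step that produces the input file
data/my-input-data.csv: data/raw-measurements.csv
    python my-cleaning-step.py

data/my-processed-data.csv: data/my-input-data.csv
    python my-pipeline-step.py

Asking make for data/my-processed-data.csv will now run only whichever of the two steps is actually out of date.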

One small issue with these rules is that the target data files won't be regenerated if I alter the code, as I often do. Luckily, this is quite simple to fix: we just add the program being run to the list of prerequisites.

data/my-processed-data.csv: my-pipeline-step.py \
                            data/my-input-data.csv
    python my-pipeline-step.py

Now if I modify the code, the data files will be updated accordingly.

Generating Prerequisites Automatically

As cool as this is, it can still get tedious in large projects. I have had makefiles with several dozen rules, which can be very difficult to maintain. I also have to remember to add new rules whenever I create a new file, and to update the existing rules whenever I change which files each step uses.

A similar problem exists when compiling C programs, in the form of header files. Each header file contains the declarations of functions implemented in the corresponding C file, and each C file will often include the headers of other C files in order to use the functions they implement. This means that if a header is modified, all the C files that include it need to be recompiled. When it comes to using make, each compiled target therefore needs those headers in its list of prerequisites. This is annoying to maintain, since C files are often modified to include new headers or to drop headers they previously included, and each of these changes must be reflected in the Makefile for the program to compile properly.
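
In makefile terms, that hand-maintained bookkeeping looks something like this (utils.h and config.h are just hypothetical header names here):

# header prerequisites listed by hand; this line must be edited
# whenever main.c changes which headers it includes
main.o: main.c utils.h config.h
    gcc -c -o main.o main.c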

Recognizing this issue, compilers come with a nifty feature to generate prerequisite lists for C files. How this works is much easier to explain with an example. In a clean, empty directory, create a file named main.c with the following code:

#include "myheader.h"
#include "myotherheader.h"

Also make sure those header files exist; it's okay if they are empty.

Now run the following command to compile this file into an object file:

gcc -c -MMD -MF main.d -o main.o main.c

You will get a main.o object file, but you will also get a main.d file with the following contents:

main.o: main.c myheader.h myotherheader.h

This is a make rule! Importantly, it contains the headers we included in our main.c file. We can now simply include this file in our main Makefile, along with a second rule specifying how to compile main.c (and create main.d), and we have a self-updating makefile! This works because when multiple rules exist for a single target, make combines all the prerequisites together.

main.o: main.c
    gcc -c -MMD -MF main.d -o main.o main.c

-include main.d

Make is also intelligent enough to re-read main.d if it changes while main.o is being created: it will double-check that the target is still up to date even after its prerequisites have changed, and update it again if it isn't.

Before I apply this technique to improve our data-processing Makefile, there is one more optimization we can make use of.

Pattern Rules and Automatic Variables

Large programs often have hundreds of C files; some even have many thousands. Each of these files needs to be compiled individually in order to create an executable. A naive makefile for such a project may look like this:


my-large-program: main.o my-code.o utils.o ... # (ad nauseam)
    gcc -o my-large-program main.o my-code.o utils.o ...

my-code.o: my-code.c
    gcc -c -o my-code.o my-code.c

utils.o: utils.c
    gcc -c -o utils.o utils.c

# ...
# (ad nauseam)

Notice how repetitive this is? Each C file has a separate rule, but every rule is nearly identical. Additionally, we have a single rule that lists every compiled object file as a prerequisite.

Make has a few features to make this much simpler, namely pattern rules and automatic variables. These let us define entire classes of rules at once, and dynamically alter the rule based on the target being made.

Here is another makefile that does more or less the same thing as the one above, but only has two rules.

cfiles = $(wildcard *.c)
ofiles = $(patsubst %.c,%.o,$(cfiles))

my-large-program: $(ofiles)
    gcc -o $@ $^

%.o: %.c
    gcc -c -o $@ $^

We define a couple of variables, cfiles and ofiles. The first is a list of all files in the working directory ending with .c; the second is the list of corresponding object files, which we compute by simply replacing the file extension on each element of cfiles. We then use the list of object files as the prerequisites of the main program, which saves us from having to list them out individually. We also have a pattern rule, which is used whenever make needs to build a file ending with .o. The variables $@ and $^ are automatic variables, which are substituted with the target and the prerequisites, respectively. Using these, we can define how to compile every C file with a single rule, simplifying our makefile significantly. This also allows the makefile to adjust itself automatically when we add new C files.
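
As an aside, these features combine nicely with the automatic prerequisite generation from the previous section. A minimal sketch of that combination might look like the following; note the use of $<, the automatic variable for the first prerequisite, so that only the C file (and not any headers pulled in from the generated .d files) is passed to the compiler:

cfiles = $(wildcard *.c)
ofiles = $(patsubst %.c,%.o,$(cfiles))
dfiles = $(patsubst %.c,%.d,$(cfiles))

my-large-program: $(ofiles)
    gcc -o $@ $^

# compile each C file, generating its .d prerequisite list as a side effect
%.o: %.c
    gcc -c -MMD -MF $*.d -o $@ $<

# pull in whatever prerequisite lists have been generated so far
-include $(dfiles)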

Putting It All Together

When we combine all of these features of make, along with some inline shell scripting and string manipulation, we arrive at the following makefile:

datadir = data
executables = $(shell find . -maxdepth 1 -executable -type f)

.makerules: $(executables)
    @printf 'Regenerating makefile...\n'
    @echo -n 'all:' > .makerules
    @for file in $^; do\
        provides=$$(cat $${file} | grep '^# PROVIDES: '\
                                 | cut -d ' ' -f3-);\
        echo -n ' $$(patsubst %,$(datadir)/%,\
                              '$${provides}')';\
    done >> .makerules
    @echo >> .makerules
    @for file in $^; do\
        depends=$$(cat $${file} | grep '^# DEPENDS: '\
                                | cut -d ' ' -f3-);\
        provides=$$(cat $${file} | grep '^# PROVIDES: '\
                                 | cut -d ' ' -f3-);\
        echo '$$(patsubst %,$(datadir)/%,'$${provides}')':\
              $${file} '$$(patsubst %,$(datadir)/%,'\
              $${depends}')';\
        echo '\t./'$${file};\
    done >> .makerules

-include .makerules

This makefile generates the rules needed to perform the various steps of the data analysis, and updates them whenever the code changes. The names of the data files are read from the scripts themselves: each script must contain the following comments:

# DEPENDS: <prerequisite data files>
# PROVIDES: <produced data files>

These comments are parsed from each file and turned into makefile rules, which are then included in the main makefile. A rule called all, which is run by default, is also created and has every data file produced as a prerequisite. This way, simply running make by itself will ensure every data file is up to date, and will only re-run a data processing step if either the code or one of the data files listed under DEPENDS has been changed, thus minimizing the amount of processing done to update the data.
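
For example, a hypothetical script named clean-data.py containing # DEPENDS: raw.csv and # PROVIDES: clean.csv would contribute rules roughly equivalent to the following (the generated .makerules writes them via $(patsubst ...) calls, but this is what they expand to):

all: data/clean.csv

data/clean.csv: ./clean-data.py data/raw.csv
    ./clean-data.py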