Post 3: Data-Parallel Functional Programming in Datalog
In the early days of our PPoSS project, we built Slog, an untyped logic programming language that extends Datalog to be much closer to Scheme. In this post, I want to describe Slog informally, covering both our initial excitement and the key challenges we faced in our prototype implementations. Slog was initially designed around MPI and InfiniBand, running on Theta (a now-retired supercomputer at Argonne). We still maintain an old version of the Slog 1.0 repository, but our focus has moved on; the reality is that the codebase had grown into a gnarly mishmash of partial implementations layered on a research prototype compiler. We have recently started new efforts to reboot the ideas from the original Slog project on modern hardware, especially GPUs and GPU clusters. In previous attempts, we had planned to integrate Slog with SMT solving; personally, I am wary of this in light of the recent LLM takeover. However, I remain optimistic about high-performance exact reasoning, especially in a form that enables richly-expressive queries involving nontrivial logical reasoning.
Functional Programming in Datalog
Slog starts by adding a single feature to Datalog: facts become first-class citizens that may be referenced, linked, and used to trigger rule evaluation. Formally, the language is Datalog∃!: every fact's existence is witnessed by an identity, and facts are written as S-expressions that may reference other facts by their identities:
(= id0 (ref "x"))
(= id1 (lam "x" id0))
(= id2 (app id1 id1))
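;; taken together, id2 names the nested term (app (lam "x" (ref "x")) (lam "x" (ref "x")))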
Notice how in this style we can write a nested term in a flattened manner. Vanilla Datalog only allows us to write rules of the form (R s ...), where each s must be either a literal or a variable. Using the flattening above, we can generalize Datalog to work over S-expressions:
[(free e y) (=/= x y) --> (free ?(lam x e) y)]
This rule is written in Slog using forward implication; rules can be written in either direction, and I chose the forward style here to make it look closer to logic. There is something special happening in this rule: the ? construct. This “huh”-clause in the head of the rule triggers on the existence of a ground candidate matching the pattern: whenever some (lam x e) fact appears, the rule fires to check whether y is free in e. The ?-clause is crucial because it grounds the rule: there is no way we could quantify over all expressions (lam x e). Instead, we trigger on every insertion of a new (lam x e). This ability to trigger on pattern matches and demand further evaluation, used in conjunction with the fully-data-parallel backend, is what enables Slog to fully data-parallelize applications such as the memoized lambda calculus.
We need some more rules to complete the program:
(free ?(ref x) x) ;; a variable reference is free at the top level
[(free e0 x) --> (free ?(app e0 e1) x)]
[(free e1 x) --> (free ?(app e0 e1) x)]
Now, we can take databases like this:
(app (lam "x" (app (ref "x") (ref "y"))) (lam "z" (ref "x")))
In Slog, the representation of such an S-expression ubiquitously employs sharing: an expression like the one above is actually a tree in which each node is a fixed-size tuple assigned a 64-bit hash, which is then mapped (via a bucket/sub-bucket decomposition) to a unique node within the cluster. This means you never “have” a linked tuple all in one place at one time, until the very end, when we do a massive parallel I/O phase to write the output to disk (this is a bit of a simplification).
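To make the sharing concrete, here is a sketch of how that database would be flattened (the idN names are just stand-ins for the 64-bit hashes), along with the free facts the program above derives:

(= id0 (ref "x"))
(= id1 (ref "y"))
(= id2 (app id0 id1))   ;; (app (ref "x") (ref "y"))
(= id3 (lam "x" id2))
(= id4 (lam "z" id0))   ;; note the sharing: (ref "x") hashes to id0 again
(= id5 (app id3 id4))   ;; the whole term
;; derived by the free-variable rules:
(free id0 "x")
(free id1 "y")
(free id2 "x") (free id2 "y")
(free id3 "y")          ;; "x" is blocked by the binder
(free id4 "x")
(free id5 "x") (free id5 "y")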
Slog grew out of work on massively-parallel binary joins on supercomputers, enabling some of the largest transitive-closure computations around. Transitive closure (TC) is of course the poster child for Datalog: it is a very simple problem, and essentially any reachability-based analysis reduces to it. TC has its uses, but the real question is how to leverage the power of high-performance Datalog engines to attack general-purpose problems. In Slog, the idea was to exploit this design, plus all of the standard Scheme tricks, to achieve fully-data-parallel programming in a general sense.
The idea is rather simple: once you have these first-class S-expressions as triggers for existential rules, you almost trivially obtain a fully-memoizing, data-parallel small-step interpreter of the lambda calculus. I will get to that in a bit, but first let me explain the basics with fib:
(define (fib n)
  (match n
    [0 1]
    [1 1]
    [n (+ (fib (- n 1)) (fib (- n 2)))]))
Now, let’s look at its translation into Slog. First, we recognize that
the code involves a function. In Slog, we need to trigger functions on
“demands” for their results. In this case, we use (do-fib N) as a
demand to evaluate (fib N). Once we observe a demand for N, we
will either (a) return the result, in the base case, or (b)
materialize a demand for strictly-smaller Ns, which will (by
induction) eventually materialize a result. First I’ll show the base
cases:
(fib ?(do-fib 0) 1)
(fib ?(do-fib 1) 1)
Notice how they generate facts of the form (fib (do-fib N) n), where n has been definitively ascertained to be the result of (fib N). Now, the inductive case needs a guard (in a traditional language, the guards would be checked left-to-right before the body fires):
[(do-fib n) (> n 1) --> (continue0 n (do-fib {- n 1}))]
[(continue0 n (do-fib n1)) (fib (do-fib n1) v0) --> (continue1 n v0 (do-fib {- n 2}))]
[(continue1 n v0 (do-fib n2)) (fib (do-fib n2) v1) --> (fib (do-fib n) {+ v0 v1})]
In this last case, I manually materialized a continuation, i.e., a demand to continue the computation: we want to (a) first do (fib (- n 1)) and compute it as v0, (b) then do the same for n-2, and (c) finally return their sum. A typical functional language lets us do this by adhering to a specific evaluation order, so we can write nested expressions and have the compiler linearize them for us (e.g., via SSA, ANF, or CPS conversion) through semantics-preserving transformations. Note that the code uses {- n 1}, a new construct which performs native integer subtraction: Slog 1.0 (MPI) has limited support for integer arithmetic, restricted to a 36-bit integral range (since all values are 64 bits).
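To make this concrete, here is a rough sketch of the facts that get derived, roughly in the order they appear across iterations, if we insert the demand (do-fib 3) (IDs elided; nested facts shown inline):

(do-fib 3)                   ;; the initial demand
(continue0 3 (do-fib 2))     ;; n > 1: demand fib(2), remember the continuation
(continue0 2 (do-fib 1))     ;; likewise for the sub-demand
(fib (do-fib 1) 1)           ;; base case fires on (do-fib 1)
(continue1 2 1 (do-fib 0))   ;; fib(1) = 1 is now known; demand fib(0)
(fib (do-fib 0) 1)           ;; base case
(fib (do-fib 2) 2)           ;; 1 + 1
(continue1 3 2 (do-fib 1))   ;; fib(2) = 2 is known; (do-fib 1) is already memoized
(fib (do-fib 3) 3)           ;; 2 + 1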
The last construct, !, performs CPS conversion on the code, emitting an “inner rule” (which I will elide for brevity) that continues the computation. This enables us to write the last case of fib using direct-style recursion:
[(do-fib n) (> n 1) --> (fib !(do-fib {- n 1}) v0)
                        (fib !(do-fib {- n 2}) v1)
                        (fib (do-fib n) {+ v0 v1})]
Informally, ! says: “materialize a ground fact under the !, which is then the unique tuple which plugs into the ! to continue the head of the rule.” In the first case, we materialize (do-fib {- n 1}), which is assigned a fresh ID. Crucially, the usage of ! now splits the rule into a subordinate rule which triggers on: (do-fib n) ∧ (> n 1) ∧ (fib (do-fib {- n 1}) v0). This subordinate rule grounds v0, and within its scope we materialize (do-fib {- n 2}), and so on, ultimately effecting the kind of CPS-converted structure I showed in the manual version above.
Leveraging just ? and !, we can trivially write eager, memoized,
fully-data-parallel functional programs in direct style. For example,
here is an implementation of the lambda calculus (credit to former
student
Davis):
; Beta reduction
[(subst !(do-subst eb x ea) eb+)
 -->
 (step ?(app (lambda x eb) ea)
       eb+)]
; Capture avoiding substitution
; Note: notice how the `!` in the body here is grounded implicitly by
; the `?(do-subst (app ef ea) x e+)` in the _head_, since `?` in the
; head forms an implicit body clause.
[(subst !(do-subst ef x e+) ef+) (subst !(do-subst ea x e+) ea+)
 --> (subst ?(do-subst (app ef ea) x e+) (app ef+ ea+))]
; Substitute under a lambda whose binder differs from x
[(subst !(do-subst eb x e+) eb+) (=/= x y)
 --> (subst ?(do-subst (lambda y eb) x e+) (lambda y eb+))]
; A lambda that rebinds x shadows the substitution
(subst ?(do-subst (lambda x eb) x e+) (lambda x eb))
; A reference to x is replaced by e+
(subst ?(do-subst (ref x) x e+) e+)
; Free variables
(free x ?(ref x))
[(free x ?(app e0 e1))
 <-- (or (free x e0) (free x e1))]
[(free x ?(lambda a e0))
 <-- (=/= x a) (free x e0)]
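As a quick worked example (a sketch; in the actual engine these terms are flattened and hash-consed as described above), inserting the application of the identity function to another identity function drives the rules like so:

(app (lambda "x" (ref "x")) (lambda "y" (ref "y")))        ;; inserted term
(do-subst (ref "x") "x" (lambda "y" (ref "y")))            ;; materialized by the ! in the beta rule
(subst (do-subst (ref "x") "x" (lambda "y" (ref "y")))     ;; the (ref x) case of do-subst fires
       (lambda "y" (ref "y")))
(step (app (lambda "x" (ref "x")) (lambda "y" (ref "y")))  ;; the beta rule's head completes
      (lambda "y" (ref "y")))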
Some Profound Implications and Serious Challenges
Fully data-parallel programming at the finest possible granularity. Spiritually, our ideas have a close relationship to thread compaction, but there are clearly some downsides: nothing this good comes for free. The whole fixpoint computation happens in parallel, across a huge number of cores (even GPU cores, now). All-to-all communication exchanges, deduplicates, and materializes data: deduplication and materialization work because each tuple is routed to a specific core in a deterministic way (based on a hash) known by everyone. This means that any functional program you write in Slog is fully broken down, at the per-execution-step level, into iterations of the fixpoint loop. Thus, if a rule is desugared into N subordinate rules, executing it will actually take N iteration cycles. Slog uses semi-naive evaluation: when a potentially-new fact is discovered, we first check whether it already exists in the full database; if not, we add it to a “delta” database, and we only consider new triggers against the delta database. This semi-naive evaluation works in concert with memoization to build up the computation. Last, because all tuples in Slog are represented in a linked fashion (and consequently the nodes of a single tree end up strewn throughout the cluster), rules which match on nested expressions also necessitate desugaring into subordinate rules (and commensurately extra communication).
This is all to say: Slog’s massive data parallelism comes at a huge constant-factor cost. Imagine a straight-line block of well-optimized assembly which uses the cache efficiently. Now imagine that same code, implemented in Slog, constantly having to literally send out InfiniBand packets in a tight loop! Massive slowdown, to the point that it is clearly a waste of energy to do this on traditional supercomputers for general-purpose computation. For workloads that are join-dominant (essentially Datalog), this might be a solid approach. But for other workloads, you want to be able to exploit the cache, registers, and branch predictors: CPUs are blazing fast on straight-line (or relatively predictable) code, and we throw a lot of that away when doing this very fine-grained thread compaction.
That realization killed some of my enthusiasm for doing general-purpose programming directly in Slog via these abstractions, among several other issues. First, like many Datalogs, we persistently hit join-ordering issues: any nontrivial Datalog program has multiple clauses conjoined in the body, and this is exacerbated by how the desugaring happens in complex usages of ! and the like. The issue is that in Slog, due to the way we achieved parallelization, we needed to materialize binary joins, physically splitting them out as separate subcomputations resident in memory. If you have a lot of throughput, this might not be so bad with respect to runtime, but the killer is asymptotic overhead: for some queries (e.g., triangles), you are destined to produce an asymptotically larger number of intermediate results than the final output. This translates to a real practical limitation: if a large Datalog application (like a program analysis) gets stuck on intermediate join blowup, it really defeats the point of using the massive parallel throughput a supercomputer might offer.
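To make the triangle point concrete, here is a sketch using the forward-rule syntax from above (relation names are just for illustration):

;; the triangle query, as one multi-way rule
[(edge a b) (edge b c) (edge c a) --> (triangle a b c)]
;; a binary-join plan first materializes all "wedges" (2-edge paths):
[(edge a b) (edge b c) --> (wedge a b c)]
[(wedge a b c) (edge c a) --> (triangle a b c)]

On a star graph, for instance, the wedge relation has Θ(|E|²) tuples even though there are no triangles at all; worst-case optimal joins avoid materializing that intermediate.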
One other thing is that Slog natively supports programming with multimethods, due to its ability to write rules which trigger ad hoc on the structure of any of their arguments. So does that mean we solved massively-parallel multimethods? Well, not quite: due to the desugaring I mentioned, multimethods turn into multi-way joins, which end up being hard to plan well (especially with the explicit binary joins Slog v1 used).
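To see why, consider the app case of do-subst above: it is effectively a “method” that dispatches on the structure of the demand’s first argument. Over the flattened representation, that nested ?-pattern becomes, roughly (this is only a sketch, not the actual compiler output), an extra join clause in the body, on top of the clauses that were already there:

;; sketch only: the nested pattern ?(do-subst (app ef ea) x e+) turns into a join
;; against the flattened (= t (app ef ea)) facts
[(do-subst t x e+) (= t (app ef ea))
 (subst !(do-subst ef x e+) ef+) (subst !(do-subst ea x e+) ea+)
 --> (subst t (app ef+ ea+))]

Each ! clause then splits off further subordinate rules, so a single “method” is really a pipeline of joins that the engine must order well.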
Tricky joins became a real sticking point for many of us as we began to aspire to use our cluster Datalog engines for real applications. Recently, my student Yihao and I have spent more of our time focusing on GPU Datalogs (he also has some multi-GPU work that extends Slog, though not in a way that meaningfully uses the S-expression machinery I just described). We have been thinking about the future of massively-parallel analytic reasoning on GPUs, and the degree to which the ideas from Slog apply. Right now, our focus is on redesigning our engines for modern GPUs and GPU clusters, and on leveraging state-of-the-art worst-case-optimal join algorithms to scale GPU-assisted reasoning as far as possible. We’ll post more updates about this later, but we’re hoping to release a high-performance prototype in the coming months, and we’re in the early stages of a Python frontend to expose our GPU reasoning facilities.
I see the Slog project as a very exciting case study in fully data-parallel, thread-compacted, memoized functional programming with ad-hoc polymorphism. The excitement, I still feel, lies in leveraging that execution paradigm to build more interesting general-purpose reasoning machinery such as proof-search systems. Slog’s tabled-but-top-down execution model (enabled via defunctionalization and materialization of thunks) supports, in principle, tactics such as iterative deepening or even speculative exploration. One issue in those settings is that, when you saturate a massive algebra of terms, you often want to equate whole classes of terms à la EUF. In Slog, you simply can’t do this: terms are explicit and cannot be unified in a meaningful way, and meaningfully extending congruence closure, Egglog, and the like to the fully-distributed Slog setting is not immediately obvious (though not altogether impossible). Notably, engines such as the recent Egglog rewrite also place an algorithmic focus on worst-case optimal joins. Because we’re interested in incorporating union-find as a key part of our data structures, our recent efforts have focused on shared-memory systems (I am not sure we see the necessity of multi-node E-graphs quite yet). We have pursued some initial efforts at integrating Egglog into Ascent, and E-graphs are a key milestone we’re hoping to support in our upcoming Slog2.
As someone who has focused a lot on analytic reasoning, I think there is a prudent concern about not competing directly with excellent projects such as Mojo. These represent excellent general-purpose tooling that will surely destroy our systems whenever the workload is not fundamentally reasoning-bound (by which I mean dominated primarily by materialization and joins). Right now, we are focused on large-scale analytic reasoning workloads, but I think the future probably lies in integrating our high-performance reasoning with LLMs and agentic pipelines as well. I see the kind of exact, database-backed reasoning we’re doing as a natural complement to generative LLMs. Those settings involve large amounts of data, along with the need to deal with provenance, incremental updating, and so on. I also expect there are serious performance gains to be had from directly fusing some of the LLM and reasoning machinery, since GPU bandwidth and cache efficiency will always be a killer, though this is largely speculation on my part for now.