Post 2: NSF PPoSS Status Update–2024 Year in Review
In this post I’ll discuss our group’s published progress from 2024. For the past few years, we’ve been working to design the highest-performance declarative analytics languages and reasoning engines for complex, logically-specified tasks. The applications of our systems span many domains, including program analysis (e.g., DOOP), security (e.g., ddisasm), graph and social-media mining, knowledge representation, business analytics (e.g., RDFox), and medical reasoning (e.g., MediKanren). These applications are united by their need to combine complex declarative specifications with scalability on leadership-class systems (supercomputers, GPU clusters, datacenter GPUs, etc.) and internet-scale datasets (large social-media graphs, the Linux kernel, etc.). During 2024, our collaboration, funded by an NSF PPoSS Large award, had papers accepted at several top conferences detailing our work on state-of-the-art Datalog engines on GPUs (accepted at ASPLOS ‘25 and AAAI ‘25) and supercomputers (accepted at VLDB ‘25).
Prior to PPoSS
This post by Tom Gilray gives some background on the award.
We were first motivated by designing real-world context-sensitive program analysis tools: while an input program might be tens of thousands of lines, analyzing it in a sufficiently sound and precise manner might demand the enumeration of billions of analysis states. Additionally, program analysis systems are extremely complex, both in their specification and in their implementation, especially if you want to design a high-performance, parallel system. Prior to the declarative specification and synthesis of program analyses (DOOP is thousands to tens of thousands of lines), program analysis toolkits for real languages (e.g., WALA for Java) were large monoliths (hundreds of thousands or millions of lines) whose implementations were quite far, intellectually, from the underlying mathematical specification.
My collaborators and I spent years designing ad-hoc implementations of sound program analysis algorithms, often disappointed that our implementations lagged state-of-the-art systems by orders of magnitude. We realized that we would need experience implementing our analyses using the fastest and most scalable tools available. After experimenting with many of the popular state-of-the-art engines, we realized that there were still practical and theoretical limitations preventing us from immediately applying them to design high-performance program analysis systems. This led us to design new logic programming languages and engines, targeting a breadth of high-performance hardware and language extensions.
Of course, there are many different concerns here: how does this relate to traditional persistent databases, how do you deal with hierarchies of different storage and computation capacities, what trade-offs do you make surrounding error correction, etc. For now, we try to make things as simple as possible for ourselves, getting things to work in-memory (VRAM, RAM, etc.) and taking a materialization-based strategy. A big focus of ours has been scaling large Datalog benchmarks, including traditional graph algorithms (transitive closure, same generation) along with more practical program analysis workloads (context-sensitive points-to analysis of Linux, etc.). In pursuit of this goal, our students (and us too!) have built an exciting array of open-source engines, each demonstrating specific instantiations of these choices.
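To make “materialization” and semi-naïve evaluation concrete, here is a minimal, hand-written C++ sketch of transitive closure (illustrative only; this is not code from any of our engines): each round joins only the facts derived in the previous round (the delta) against edge, and the loop terminates once no new facts materialize.

```cpp
#include <cstdint>
#include <iostream>
#include <set>
#include <utility>

using Fact = std::pair<uint32_t, uint32_t>;

int main() {
  std::set<Fact> edge = {{1, 2}, {2, 3}, {3, 4}};
  std::set<Fact> path(edge.begin(), edge.end());  // everything materialized so far
  std::set<Fact> delta = path;                    // facts derived in the last round

  while (!delta.empty()) {
    std::set<Fact> next;
    // path(x, z) <- delta(x, y), edge(y, z): join only the new facts.
    for (const auto& [x, y] : delta)
      for (const auto& [y2, z] : edge)
        if (y == y2 && !path.count({x, z}))
          next.insert({x, z});
    path.insert(next.begin(), next.end());
    delta = std::move(next);
  }
  std::cout << "path has " << path.size() << " facts\n";  // 6 for this input
}
```

Every engine discussed below implements (much) faster variants of exactly this loop; the interesting questions are how relations are stored, indexed, joined, and distributed.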
Optimizing Datalog for the GPU (ASPLOS ‘25)
Paper: https://arxiv.org/pdf/2311.02206
GitHub: https://github.com/harp-lab/gdlog
The first paper is set to appear at ASPLOS ‘25, and details the design of the first truly state-of-the-art GPU Datalog solver. While there was substantial prior work advancing Datalog on the GPU (RedFox, Shovon’s USENIX ATC paper), there were no general-purpose libraries that supported realistically-sized Datalog programs while remaining competitive with simply using the CPU. Since the GPU has tons of SIMD cores, and because it has a huge memory-bandwidth advantage, you would really expect the GPU to crush the CPU on truly high-scale workloads.
Yihao led the development of this work, designing a data structure called the Hash-Indexed Sorted Array (HISA). This GPU-based data structure is used to store relations in GPU VRAM. To repeat: the whole computation is resident in GPU VRAM, which differentiates this work from prior work that uses disk-backed storage to scale to many terabytes. HISA is a multi-tiered data structure that layers indexing (necessary for efficient range querying) on top of an underlying sorted data array (friendly to the GPU).
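As a rough illustration of that layering, here is a minimal CPU-side sketch in C++ (my own simplification, not HISA’s actual implementation; see the paper for the real design): tuples live in one sorted array, and a hash index maps each value of the first column to the contiguous range of tuples sharing it, so a range query is one hash lookup plus a dense scan.

```cpp
#include <algorithm>
#include <cstdint>
#include <cstdio>
#include <unordered_map>
#include <utility>
#include <vector>

using Tuple = std::pair<uint32_t, uint32_t>;

struct HashIndexedSortedArray {
  std::vector<Tuple> data;  // sorted tuple storage (the scan-friendly layer)
  std::unordered_map<uint32_t, std::pair<size_t, size_t>> index;  // key -> [begin, end)

  explicit HashIndexedSortedArray(std::vector<Tuple> tuples) : data(std::move(tuples)) {
    std::sort(data.begin(), data.end());
    for (size_t i = 0; i < data.size(); ++i) {
      auto [it, fresh] = index.try_emplace(data[i].first, i, i + 1);
      if (!fresh) it->second.second = i + 1;  // extend this key's run
    }
  }

  // Range query: the [begin, end) run of tuples whose first column is `key`.
  std::pair<size_t, size_t> lookup(uint32_t key) const {
    auto it = index.find(key);
    return it == index.end() ? std::make_pair(size_t{0}, size_t{0}) : it->second;
  }
};

int main() {
  HashIndexedSortedArray rel({{2, 9}, {1, 5}, {2, 7}, {3, 1}});
  auto [b, e] = rel.lookup(2);
  for (size_t i = b; i < e; ++i)  // prints (2, 7) and (2, 9)
    std::printf("(%u, %u)\n", rel.data[i].first, rel.data[i].second);
}
```

The real structure is built and queried with data-parallel primitives in VRAM, but the division of labor is the same: the hash layer answers “where does this key’s data start?”, and the sorted layer keeps that data dense for scanning.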
The results are very promising, showing up to a 45x speedup on context-sensitive points-to analysis (of httpd) versus an optimally-configured Soufflé (16 cores of an EPYC Milan server chip). We saw promising results with other workloads too (transitive closure, etc.). The Milan is not a Threadripper, and this is a task dominated by memory bandwidth, so one would expect the gap to lessen with a better CPU and better CPU memory bandwidth; but in our unscientific experiments on a variety of machines and GPUs, we found that this engine solidly outperformed the CPU in every instance. The issue, of course, is that your workload has to fit in VRAM: if your workload falls out of VRAM, our ideas just don’t apply. This helps put our results into perspective. We’re not claiming to replace every Datalog or database engine (though other Datalog engines typically crash when they run out of RAM too), and a long-term goal might be to take the great potential for speedups we’re seeing and integrate it into a system backed by persistent storage.
Column-Oriented GPU Datalog (AAAI ‘25)
Paper: https://thomas.gilray.org/pdf/column-datalog-gpu.pdf
GitHub: https://github.com/harp-lab/vflog
Our submission to ASPLOS had been the result of several rejected paper submissions, and by the time we submitted the work, we were ready to move on to improvements to our GPU Datalog engines. So over last summer, Yihao implemented a number of radical architectural improvements, taking a column-oriented approach. The result of this effort was his new implementation, VFLog: a state-of-the-art GPU Datalog engine structured as a set of C++ relational algebra kernels (ultimately implemented in CUDA).
The novelty of this work is our application of a column-oriented approach (also taken by, e.g., DuckDB) to a modern, state-of-the-art Datalog engine (on the GPU). Column-oriented layouts have been hugely popular in conventional (CPU-based) analytic database engines due to their cache-friendliness, among other things. We expected that taking this approach in the context of GPU Datalog might yield an even faster implementation than our ASPLOS ‘25 paper. We were happy to see that it did, showing ~2.5x improvements (vs. ASPLOS ‘25) on large experiments with relatively modest memory overhead. We found that our column-oriented approach dovetailed nicely with the GPU’s caching and facilitated coalesced memory access.
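To see why a column-oriented (struct-of-arrays) layout suits the GPU, consider this generic C++ sketch (an illustration of the layout trade-off, not VFLog’s actual data structures): an operator that touches only one attribute reads a dense array in the columnar layout, whereas the row layout strides past the attributes it doesn’t need.

```cpp
#include <cstdint>
#include <cstdio>
#include <vector>

// Row-oriented (array-of-structs): one tuple's attributes sit together.
struct RowTuple { uint32_t src, dst, weight; };

// Column-oriented (struct-of-arrays): one attribute's values sit together.
struct ColumnRelation { std::vector<uint32_t> src, dst, weight; };

// Selection on `dst` over columns: iteration i reads word i of one dense
// array. On a GPU, thread i would issue the load for word i, so a warp's
// 32 loads coalesce into a few wide memory transactions.
size_t count_dst_eq(const ColumnRelation& r, uint32_t key) {
  size_t hits = 0;
  for (uint32_t d : r.dst) hits += (d == key);
  return hits;
}

// The same selection over rows strides by sizeof(RowTuple), dragging the
// unused src/weight attributes through every cache line (or, on a GPU,
// through every memory transaction).
size_t count_dst_eq(const std::vector<RowTuple>& r, uint32_t key) {
  size_t hits = 0;
  for (const RowTuple& t : r) hits += (t.dst == key);
  return hits;
}

int main() {
  ColumnRelation cols{{1, 2, 3}, {9, 9, 4}, {7, 7, 7}};
  std::vector<RowTuple> rows{{1, 9, 7}, {2, 9, 7}, {3, 4, 7}};
  std::printf("%zu %zu\n", count_dst_eq(cols, 9), count_dst_eq(rows, 9));  // 2 2
}
```

This is the same cache-friendliness argument the analytic database community has long made for the CPU; the GPU simply magnifies the penalty for strided access.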
We did a wider range of experiments in this work (appearing at AAAI ‘25, with a talk by Yihao on Thursday, Feb 27th), covering the usual graph and program analysis queries but also incorporating (subsets of) benchmarks from KRR workloads (e.g., the LUBM benchmark). A potential criticism of this work might be that the speedups are due only to the increased memory throughput of modern datacenter GPUs. This is a fair worry, and one we wanted to study: Yihao also ran his column-oriented system on the CPU (Table 3, a benchmark of LUBM queries) and found that our technique still performed better than Nemo, VLog, and RDFox. Of course, the H100 is still faster: <0.01s versus 0.15s (CPU).
Datalog with First-Class Facts (VLDB ‘25)
Paper: https://arxiv.org/pdf/2411.14330v1
GitHub: https://github.com/harp-lab/slog-lang1/
Orthogonal to this work on GPU Datalog has been our work on Slog, a truly distributed Datalog engine that supports chain-forward programming with S-expressions. Slog can run on multiple cores of a beefy laptop or server, but it is designed to run on leadership-class supercomputers with a high-performance interconnect offering low-latency all-to-all communication.
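To give a flavor of that communication pattern, here is a hand-written sketch of hash-partitioned fact exchange over MPI (a generic pattern, not Slog’s actual runtime code): each rank buckets its newly-derived facts by the rank that owns them, exchanges counts, then exchanges the facts themselves with an all-to-all.

```cpp
#include <mpi.h>
#include <cstdint>
#include <cstdio>
#include <vector>

// Route 64-bit encoded facts to their owning rank (here owner = fact % nprocs;
// a real engine hashes the join key). This is a generic sketch: Slog's
// runtime is far more elaborate (typed tuples, buckets, comm/compute overlap).
std::vector<uint64_t> exchange(const std::vector<uint64_t>& facts, MPI_Comm comm) {
  int nprocs;
  MPI_Comm_size(comm, &nprocs);

  // Bucket each fact by its owning rank.
  std::vector<std::vector<uint64_t>> buckets(nprocs);
  for (uint64_t f : facts) buckets[f % nprocs].push_back(f);

  std::vector<int> sendcounts(nprocs), sdispls(nprocs), recvcounts(nprocs), rdispls(nprocs);
  std::vector<uint64_t> sendbuf;
  for (int p = 0; p < nprocs; ++p) {
    sdispls[p] = static_cast<int>(sendbuf.size());
    sendcounts[p] = static_cast<int>(buckets[p].size());
    sendbuf.insert(sendbuf.end(), buckets[p].begin(), buckets[p].end());
  }

  // Exchange counts, then the facts themselves: the low-latency all-to-all
  // that the interconnect needs to be good at.
  MPI_Alltoall(sendcounts.data(), 1, MPI_INT, recvcounts.data(), 1, MPI_INT, comm);
  int total = 0;
  for (int p = 0; p < nprocs; ++p) { rdispls[p] = total; total += recvcounts[p]; }

  std::vector<uint64_t> recvbuf(total);
  MPI_Alltoallv(sendbuf.data(), sendcounts.data(), sdispls.data(), MPI_UINT64_T,
                recvbuf.data(), recvcounts.data(), rdispls.data(), MPI_UINT64_T, comm);
  return recvbuf;
}

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);
  std::vector<uint64_t> owned = exchange({1, 2, 3, 4}, MPI_COMM_WORLD);
  std::printf("this rank owns %zu facts\n", owned.size());
  MPI_Finalize();
}
```

In an iterated Datalog workload an exchange like this happens every round, which is why all-to-all latency (and not just bandwidth) ends up on the critical path.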
Slog is a reimagining of Datalog to support “first-class” facts. In Slog, all facts have an identity, canonically represented by an S-expression. Slog rules allow us to build S-expressions in a safe way (such that everything is well-founded, and may be computed bottom-up). For example, we might want to write a rule defining the free variables of an expression:
;; BAD
[(free x (lambda (y) e)) <-- (free x e) (=/= x y)]
The issue with this rule is that y is not bound by anything in the body. Obviously, we might want to universally quantify over it, but that leads to nontermination (no finite fixed-point, etc.). Instead, we reject this program as unsafe in Slog, but we do permit a grounded version of the program:
;; GOOD
[(free x ?(lambda (y) e)) <-- (free x e) (=/= x y)]
The ? clause in the above Slog is syntactic sugar. It says: when the fact (lambda (y) e) exists, and also everything else in the body, then materialize the head. The ? clause is elaborated into:
;; without syntactic sugar...
[(free x id_h0) <-- (= id_h0 (lambda (y) e)) (free x e) (=/= x y)]
This grounds the execution of free in something finite (our program, the input database). This fact, that facts have an identity and algebraic structure which may be introspected upon, is the only change Slog makes to vanilla (textbook) Datalog. Formally, we explain Slog’s semantics via unique existential quantification (Section 2), such that it forms a more restricted version of existential Datalog (which is, in general, undecidable).
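To give a rough flavor of that formalization (my own paraphrase, not the paper’s exact notation): where a rule head in existential Datalog asserts that some witness exists, a Datalog∃! head asserts a unique witness, namely the identity of the fact built by a constructor from the rule’s ground variables:

\[
\forall \vec{x}.\; \varphi(\vec{x}) \rightarrow \exists y.\; H(y, \vec{x})
\qquad \text{(existential Datalog: some witness } y\text{)}
\]
\[
\forall \vec{x}.\; \varphi(\vec{x}) \rightarrow \exists!\, y.\; y = \mathsf{c}(\vec{x}) \wedge H(y, \vec{x})
\qquad \text{(Datalog}^{\exists!}\text{: the unique identity of the fact } \mathsf{c}(\vec{x})\text{)}
\]

Uniqueness is what keeps bottom-up evaluation sensible: re-deriving the same S-expression yields the same identity rather than a fresh one.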
Intuitively, we can see Slog as the most trivial extension of Datalog to S-expressions, enabling chain-forward programming via matching over (and producing) S-expressions. This naturally enables defunctionalization à la Reynolds:
;; fib(0) = 1
(fib ?(demand-fib 0) 1)
;; fib(1) = 1
(fib ?(demand-fib 1) 1)
;; fib(n) = fib(n-1) + fib(n-2)
[(fib ?(demand-fib n) nsum) <-- (> n 1)
(fib !(demand-fib {- n 1}) n1)
(fib !(demand-fib {- n 2}) n2)
(= nsum {+ n1 n2})]
The first two rules should be familiar by now; the last requires some explaining. In the comments I define the function in Haskell style by destruction on the first argument. In Slog, these translate to (disjoint, in this case) rules. Each base case is syntactically a fact, but the ? clause means that it is really a rule under the hood (“if (demand-fib 0) exists, then materialize (fib ... 1)”).
The last case is the recursive case. The issue is that we need to first call fib on n-1, take its result, and then continue by calling fib on n-2. To start off, Slog handles builtins without much of an issue: {+ n1 n2} is syntactic sugar for a fresh variable v plus an additional constraint (+ n1 n2 v); the + table is built in and handled the way you’d expect (our MPI implementation handles simple chains of built-ins over integers). The more confusing part is the ! clause, which says: if you can ground n, then materialize (demand-fib {- n 1}). Usage of the ! clause splits the rule into multiple subordinate rules, materializing a continuation that waits for a return value (the second column of fib) in response to the generation of (demand-fib {- n 1}). In sum, this rule is split into several subordinate rules, which (a) are triggered by (demand-fib n) (when (> n 1)), (b) materialize the call to (demand-fib {- n 1}), (c) wait for the return value n1 from (fib (demand-fib {- n 1}) n1), then (d) materialize (demand-fib {- n 2}), wait for the value n2 from (fib (demand-fib {- n 2}) n2), and finally compute nsum and populate the result of fib.
In the end, we support a language that looks shockingly close to Scheme, but is grounded and chain-forward. This is not a new observation, however: while we had been working to publish this work, several folks (especially Pacak, Erdweg, etc.) independently discovered Datalog’s potential as a compilation target for functional programming. We see Slog (and our work on Datalog∃!) as harmonious with that work (e.g., IncA and similar frameworks). However, there are still significant issues with compiling general-purpose functional programs to Slog (or any other Datalog); for example, every opportunity that a traditional CPU-based compiler has to batch small intraprocedural blocks is totally crushed by the fully-CPS compilation methodology I’ve sketched here. We have exposed every subcomputation, which offers a ton of opportunity for parallelism and throughput, but which is in practice a huge imposition, especially in terms of space complexity. I hope to detail some of these issues and explore related ideas throughout the course of this project and blog.
At the end of the day, we compile (both in formalism and implementation) Slog down to Datalog∃!, our core formalism extending Datalog with first-class facts. We sketch a very small formalism of Datalog∃! as an extension of Datalog that can observe a single constructor “down” into an S-expression, and build new S-expressions of a single constructor (in each iteration). We then generalize that tiny language to one that allows you to write patterns involving multiple constructors deep (e.g., (free (lam (x) ...))). Our VLDB paper stops short of defining the semantics for the entire Slog language, however. That’s explicitly not a contribution of that paper, and we leave its potential publication for future work, though we have an arXiv draft here which sketches some of the gory details to a degree of formalization.
Our implementation presents Datalog∃! as a set of C++ APIs implemented on top of MPI. Like other Datalogs, we compile down to high-performance relational algebra kernels which play nicely with indexing, semi-naïve evaluation, etc. Before this NSF PPoSS project, my collaborators (Sidharth Kumar, UIC, and Thomas Gilray, WSU) demonstrated the fundamental scalability of our parallel relational algebra kernels, with promising results up to 32k processes (roughly, the CPU cores of a supercomputer) for simple queries (transitive closure). Around 2019, Thomas Gilray led the development of the initial Slog interpreter and compiler. Around 2020-22, my student Arash Sahebolamri enhanced the Slog language, and we collaboratively submitted various papers about Slog to various conferences. A challenge with that work was the sheer scale of the task: we were introducing new kernels, a new language, a new compiler, all for a complex application domain (massively-scalable, declaratively-specified context-sensitive analysis). After submitting to various venues, we submitted to VLDB and got a “revision” decision. Based on some pointers from my student Yihao, and informed by the IVM reading group I take part in, we were able to rewrite the paper to really refine the key intellectual contributions of the work and crystallize a contribution (Datalog∃!) that unified the various engineering choices of the project. We ended up cutting a significant portion of the Slog language details to focus on the key contributions, and the paper was accepted to appear at VLDB ‘25.
Next Steps and Our Long-Term Vision
We are around 1.5 years into our five-year project. The Slog vision encompasses ever-expanding machine architectures alongside constraint extensions for enhanced reasoning across rich application domains. Keeping a focus on what we can actually publish and contribute to the research community is critical. I really like our style so far: each paper has a separate repository, makes separate assumptions, and presents a unifying arc of results and evaluation (along with open-source artifacts). I hope we will keep that up. This is a bit in tension with developing a grand unified system which can handle any type of machine, language extension, etc. I expect we’re a bit far away from this; I even wonder whether that’s within the scope of the whole project.
For now, we want to keep innovating along three fronts: algorithmic, semantic, and engineering. On the algorithmic front, we want to work with collaborators to study more robust join plans for practical Datalog programs: large Datalog programs are extremely fragile with respect to things like join ordering in practice, and large popular applications have been hand-optimized by knowledgeable engineers. We hope to study robust join planning for iterative relational algebra (over various semirings, say). Next, we want to continue to push the boundary on Datalog expressivity extensions, studying the impact of materialization versus recomputation and applying our ideas to tackle probabilistic Datalog. Last, we are making a lot of progress on engineering: scaling across multiple nodes, each with multiple GPUs. I am hoping we will have good news to report on that front soon.
Acknowledgements
My students:
- Arash Sahebolamri
  - Significant development of the Slog language and compiler
- Yihao Sun
  - Development of GPU Datalog, working on its extensions
My collaborators:
- Thomas Gilray, Washington State
  - Sowmith Kunapaneni, Tom’s student
- Sidharth Kumar, University of Illinois at Chicago
  - Ahmedur Rahman Shovon, Sid’s student