Post 1: Datalog, ChainForward Computation, and Relational Algebra
Our setting is logic programming, a field which attempts to design programming languages whose semantics have a close relationship to formal logic. The reason we might want to do this is that it suits our application domain more precisely than an implementation in a traditional programming language. Thus, using a logic programming language allows us to write more obviouslycorrect code, and perhaps even code that can be extracted cleanly from a certified implementation. Alternatively, if we did it ourselves, we’d have to do what our compiler (interpreter, …) would do anyway, so there’s no sense in doing it manually. Unfortunately, when we see a powerful tool, we are tempted to use it for everything: if our application is not ultimatelysuited to the operationalization strategy of the logic programming engine we’re using, we simply obfuscate the issue in a veneer of formalism and end up with leaky abstractions. This is, I speculate, why logic programming languages have never caught on broadly for generalpurpose programming. In this blog, I will detail the various tradeoffs and implementation paradigms for modern logic programming engines, starting from Datalog and with a focus on program analysis.
The history of logic programming is rich, and I will not attempt to recount it all. Here I will focus on more restricted, applicationspecific languages, especially Datalog and its derivatives. The specific features of these languages, and the particulars of their implementation, often dovetail with a “right place, right time” effect. For example, Datalog backed by BDDs was a significant step forward in terms of production program analyses. More modern implementations eschew BDDs for more explicit representations, but it remains the case that engineers and computer scientists are on the lookout for logicprogrammingbased approaches to hard problems, especially those which deal intrinsically in the enumeration of large state spaces.
Perhaps one of logic programming’s most exciting motivations is program analysis. Program analysis systems automatically prove properties about, find bugs in, or simply help us understand our programs. These can come in a variety of forms from ondemand ineditor typechecking (runs in microseconds) to wholeprogram, contextsensitive pointsto analysis (days). Program analyses are notoriously hard to specify, and are especially hard to implement in a way that provides a close relationship to the formal specification. Additionally, program analyses often grapple with large state spaces in practice to solve interesting problems, and require some amount of thought regarding highperformance implementation.
ChainForward Computation and Datalog
The central evaluation mechanism in Datalog, and its derivatives, is to saturate a set of rules (e.g., Horn clauses) to a fixed point, to obtain a knowledge database in some domain. We call the computation “chain forward” because the evaluation of such languages is guided by an ordering on knowledge, typically set inclusion (in the case of traditional Datalogs). In these settings, we define an “immediate consequence” operator, which tells us everything which must be known, as a consequence of what we currently know; crucially, this operator is typically monotonic: we do not lose knowledge over time, though we will discuss some interesting departures from this assumption. Applying this immediate consequence operator repeatedly yields a stream of knowledge databases over time. Assuming the immediate consequence operator is monotone, this stream of knowledge databases over time forms an ascending chain according to the ordering on knowledge.
Datalog’s syntax consists of “facts” and “rules.” Facts are “known
statements” and always have the following form: R(c0, ...)
, where
R
is a relation name (identifier) and the arguments are
constants. For example >(3,2)
might be a fact, along with
reaches("n0","n1")
. But facts may not include variables, for
example: reaches("n0",x)
is disallowable as a fact, because x
is
not bounded in any way.
(define relation? symbol?)
(define atom? symbol?) ;; atomic lits are symbols
;; variables must be explicitly tagged, not 'x, but '(var x)
(define (var? x) (match x [`(var ,(? symbol? x)) #t] [_ #f]))
(define (varoratom? x) (or (atom? x) (var? x)))
;; tuples are untagged
(define (tuple? f)
(match f
[`((,(? atom?) ...)) #t]
[_ #f]))
;; facts add a relation name (otherwise how would we know it)
(define (fact? f)
(match f
[`((,(? relation? r) ,(? atom?) ...)) #t]
[_ #f]))
;; notice that ∧ is implicit in the body
(define (rule? r)
(match r
[`((,(? relation? headrel) ,(? varoratom?) ...) <
(,(? relation? bodyrels) ,(? varoratom?) ...) ...)
#t]
[_ #f]))
Let’s look at simplest interesting example: transitive closure. The Datalog program (using Soufflé) to implement transitive closure:
.output path
.decl edge(x:number,y:number)
.decl path(x:number,y:number)
edge(1,2). edge(2,3). edge(3,5). edge(5,4). edge(4,1). edge(4,8).
path(x, y) : edge(x, y).
path(x, y) : path(x, z), edge(z, y).
Running this program in Soufflé (souffle tc.dl
) yields an output
database, the transitivelyclosed graph, in the file path.csv
.
Datalog cannot search (disjunction), saturated conjunctions only
Disjunctions in the head of a rule is disallowed: this is not semantically within reach of Datalog. The following is invalid:
Q(x,...) ∨ R(x,...) ← P(...) ∧ R(...)
Once we have more than one positive literal in a clause, we need a SAT
solver. SAT solvers combine search (“guess new things”) and deduction
(“derive consequences”); Datalog solvers only employ deduction. Both
SAT and Datalog engines share some overlapping ideas; for example,
both use indexing, to accelerate joins (Datalog) and for efficient
unit propagation (SAT). But (>2)SAT is strictly harder than Datalog:
2SAT can be written as Horn clauses (Q ← R
is ¬R ∨ Q
), but 3SAT
and beyond are out of reach. By contrast, ∧
in the head of a rule
presents no serious semantic issue: Q(x,...) ∧ R(x,...) ← ...
can
easily be desugared into two rules: Q(x,...) ← ...
and R(x,...) ←
...
.
SAT solvers have an importance difference from Datalog solvers: they forget information. This is important because, while you can use Datalog to do SAT solving just fine, you will pay a massive performance due to the concomitant overmaterialization induced by enumerating all possibilities: you would be literally materializing O(2ⁿ). Of course, there are languages that embrace such cartoonishlyhuge state spaces by design (answerset programming, for example), and their designs are subsequently guided around these issues; indeed, grounding onthefly is an active research topic within ASP.
As an example, think about how you could write 5clique in Datalog:
clique3(A, B, C) :
edge(A, B), edge(A, C), edge(B, C),
A != B, A != C, B != C.
clique4(A, B, C, D) :
clique3(A, B, C),
edge(A, D), edge(B, D), edge(C, D),
A != D, B != D, C != D.
clique5(A, B, C, D, E) :
clique4(A, B, C, D),
edge(A, E), edge(B, E), edge(C, E), edge(D, E),
A != E, B != E, C != E, D != E.
SemiNaïve Evaluation
One issue with the repeated application of the rules is that if we use an explicit setbased representation of tuples, each iteration we’ll “rediscover” all knowledge from every previous iteration—this translates to additional data load, without commensurate knowledge throughput. In Datalog, the solution is to employ a compilation into an incrementalized IR, ala seminaïve evaluation. For example, the recursive rule in transitive closure becomes:
path(x,z) ← path(x,y) ∧ edge(y,z)

becomes No Δ versions for edge as it is static

Δpath(x,z) ← (Δpath(x,y) ∧ edge(y,z))  path
i.e.,
Δpath(x,z) ∪= (Δpath(x,y) ⋈ edge(y,z))  path
The rule expands into a single rule, because the relation edge
never
changes–thus, tracking a delta version would be
irrelevant. Additionally, we assume that at the end of each iteration,
Δpath
is merged into path
. In the more general case such as:
g(y,x) ∧ p(x,z) ← p(x,y) ∧ g(y,z)
We would need to split the rule into several versions: one to join
Δp
with g
, one to join p
with Δg
, and one to join Δp
with
Δg
. Think about what would happen if we have only Δp ⋈ Δg
: if we
have facts (x,y)
in Δp
and (y,z)
in Δg
, everything works
fine. But what happens if (x,y)
skips an iteration? It would be hard
to ensure that doesn’t happen (though some work certainly explores
this to a degree), and so we diversify our rules to enable us to catch
things in Δp
and Δg
.
Previously, I mentioned the resulting rule would look something like:
Δpath(x,z) ∪= (Δpath(x,y) ⋈ edge(y,z))  path
In fact, all rules within a fixedpoint (more specifically, an SCC of rules) will have the structure:
ΔR(x...) ∪= (ΔR(x,...) ⋈ Q(y,...))  R
There is a crucial tacit point to be explained here: deduplication,
i.e.,  R
is dirt cheap when implemented thoughtfully, and it is
possible to parallelize nicely. There is no explicit scan of R
in
implementing subtraction, rather every rule generates a set of
possiblynew tuples, which are deduplicated in some efficient manner
to add to ΔR
at the end of each iteration, before emptying ΔR
in
preparation for the next iteration.
I will not code up seminaïve evaluation here, but will likely discuss a general approach to forward differentation and incrementalization in subsequent posts.
Relational Algebra
The explicit focus on bound variables has made the above presentation informal, as substitution was never defined. Indeed, substitution is “where computation happens” here in much the same way as in the λcalculus. However, handling binders is a bit tedious, and it turns out there is a better, more systematic way. In the same way that category theory will avoid mentioning concrete points for products A × B, relational algebra will avoid mentioning explicitlynamed points for manipulating relations.
The lack of explicit binders allows us to write an obviouslycorrect (albeit slow) interpreter for a conjunctive query that produces sets of tuples as its output:
;; q: query?
;; db: hashmap from relation name ↦ ℘(Tuple)
;; returns a set of tuples
(define (interpretquery q db)
(match q
[`(literaltuple ,es ...) (set es)]
[`(scan ,R) (hashref db R)]
[`(select from ,q+ where column ,n equals ,k)
(list>set (filter (lambda (x) (equal? (listref x n) k))
(set>list (interpretquery q+ db))))]
[`(,q0 ∪ ,q1) (setunion (interpretquery q0 db)
(interpretquery q1 db))]
[`(,q0 ∩ ,q1) (setintersect (interpretquery q0 db)
(interpretquery q1 db))]
[`(reorder ,q ,order ...)
(let ([ts (set>list (interpretquery q db))])
(set>list
(map (λ (t) (map (λ (i) (listref t i)) order))
ts)))]
[`(project ,q to first ,n)
(list>set (map (λ (t) (take t n)) (set>list (interpretquery q db))))]
[`(,q0 ⋈ ,q1 on first ,N)
(list>set
(foldl
(λ (t0 tups)
(foldl (λ (t1 tups)
(if (equal? (take t0 N) (take t1 N))
(setadd tups (append (take t0 N) (drop t0 N) (drop t1 N)))
tups))
The implementation is mostly standard: notice that ⋈ degenerates to
the cartesian product in the case that N=0 (since (take t 0)
returns
the empty list for all t
). We include only binary joins, kary joins
are typically decomposed into chains or trees of binary joins: R ⋈ Q
⋈ S
can be construed as either (R ⋈ Q) ⋈ S
or R ⋈ (Q ⋈ S)
(3 or
more joins would allow us to consider trees, too). Unfortunately, the
wrong choice could force the enumeration to explode: if a subordinate
join produces lots of junk which is later filtered out by the outer
join, we may end up doing more work than an alternative join
plan. Join planning–the process of optimally deciding on a join
decomposition–is a challenging aspect of the implementation; for some
programs, any particular static join plan is be inefficient, as
relation sizes dynamically evolve, the optimal join plan may shift as
well. While traditional databases put a lot of effort into query
planning, Datalog has taken a slightly different approach, with only
scant recent work exploring dynamic plans.
With this implementation, we can write query to find us all of the
onehop transitive edges: the query we want is edge(x,y), path(y,z)
,
which operationalizes to a join between edge and path; since we want
to line up the y
s we reorder edge
, then reorder the join result
(to gather x
and z
), then project the first two elements.
(interpretquery
'(project (reorder ((reorder (scan edge) 1 0) ⋈ (scan path) on first 1) 1 2 0)
to first 2)
(hash 'edge (set '(a b) '(b c) '(c d) '(b e))
'path (set '(a b) '(b c) '(c d) '(b e))))
;; yields (set '(a e) '(b d) '(a c))
Assuming we’re willing to write programs in terms of relational
algebra, we can iteratively generate new tuples. An RArule?
extends
a specific relation (recall we’re taking a pointfree style) with the
result of a query. We can interpret rules by evaluating the query and
unioning the result set into the necessary relation. A program is a
set of rules, which we can evaluate by iteratively (to a fixed point)
applying all rules until we see no changes in the database.
(define (RArule? r)
(match r
[`(,(? relation? R) ∪= ,(? RAquery? q)) #t]
[_ #f]))
(define (interpretrule r db)
(matchdefine `(,R ∪= ,q) r)
(hashset db R (setunion (hashref db R (set)) (interpretquery q db))))
(define (interpretprogram rules)
(define (nextiteration db)
(foldl (λ (r db) (interpretrule r db)) db rules))
(define (dosomemore db)
(if (equal? db (nextiteration db))
db
(dosomemore (nextiteration db))))
(dosomemore (hash)))
Now we can write small programs:
(interpretprogram
'((edge ∪= (literaltuple a b))
(edge ∪= (literaltuple b c))
(edge ∪= (literaltuple c d))
(path ∪= (scan edge))
(path ∪= (project (reorder ((reorder (scan edge) 1 0) ⋈ (scan path) on first 1) 1 2 0)
to first 2))))
Next time, we’ll plan to talk a bit about how to systematically compile our highlevel rules into the RA plans, and discuss the various tradeoffs in doing so.
The full code
Is here in this gist.
;; Relational algebra
#lang racket
(define relation? symbol?)
(define atom? symbol?) ;; atomic lits are symbols
;; variables must be explicitly tagged, not 'x, but '(var x)
(define (var? x) (match x [`(var ,(? symbol? x)) #t] [_ #f]))
(define (varoratom? x) (or (atom? x) (var? x)))
;; tuples are untagged
(define (tuple? f)
(match f
[`((,(? atom?) ...)) #t]
[_ #f]))
;; facts add a relation name (otherwise how would we know it)
(define (fact? f)
(match f
[`((,(? relation? r) ,(? atom?) ...)) #t]
[_ #f]))
;; notice that ∧ is implicit in the body
(define (rule? r)
(match r
[`((,(? relation? headrel) ,(? varoratom?) ...) <
(,(? relation? bodyrels) ,(? varoratom?) ...) ...)
#t]
[_ #f]))
;; should all be true
(rule? '((p) <))
(rule? '((r) < (q)))
(rule? '((path x y) < (edge x x)))
(rule? '((path x y) < (edge x y) (path y z)))
;; RA queries
(define (RAquery? q)
(match q
[`(literaltuple ,(? atom? es) ...) #t]
;; scan a relation
[`(scan ,(? relation? R)) #t]
[`(select from ,(? RAquery?)
where column ,(? nonnegativeinteger?)
equals ,(? atom?))
#t]
;; natural join on the first N columns, values must be equal
[`(,(? RAquery? q0) ⋈ ,(? RAquery? q1) on first ,(? nonnegativeinteger? N)) #t]
[`(,(? RAquery? q0) ∪ ,(? RAquery? q0)) #t]
[`(,(? RAquery? q0) ∩ ,(? RAquery? q0)) #t]
;; reorder tuples
[`(reorder ,(? RAquery? q) ,(? nonnegativeinteger?) ...) #t]
;; project the first n elements of tuples
[`(project ,(? RAquery? q) to first ,(? nonnegativeinteger? n)) #t]
;; need closedworld assumption
[`( ,(? RAquery? q)) #t]
[_ #f]))
;; q: query?
;; db: hashmap from relation name ↦ ℘(Tuple)
;; returns a set of tuples
(define (interpretquery q db)
(match q
[`(literaltuple ,es ...) (set es)]
[`(scan ,R) (hashref db R)]
[`(select from ,q+ where column ,n equals ,k)
(list>set (filter (lambda (x) (equal? (listref x n) k))
(set>list (interpretquery q+ db))))]
[`(,q0 ∪ ,q1) (setunion (interpretquery q0 db)
(interpretquery q1 db))]
[`(,q0 ∩ ,q1) (setintersect (interpretquery q0 db)
(interpretquery q1 db))]
[`(reorder ,q ,order ...)
(let ([ts (set>list (interpretquery q db))])
(set>list
(map (λ (t) (map (λ (i) (listref t i)) order))
ts)))]
[`(project ,q to first ,n)
(list>set (map (λ (t) (take t n)) (set>list (interpretquery q db))))]
[`(,q0 ⋈ ,q1 on first ,N)
(list>set
(foldl
(λ (t0 tups)
(foldl (λ (t1 tups)
(if (equal? (take t0 N) (take t1 N))
(setadd tups (append (take t0 N) (drop t0 N) (drop t1 N)))
tups))
tups
(set>list (interpretquery q1 db))))
(set)
(set>list (interpretquery q0 db))))]))
(define (RArule? r)
(match r
[`(,(? relation? R) ∪= ,(? RAquery? q)) #t]
[_ #f]))
(define (interpretrule r db)
(matchdefine `(,R ∪= ,q) r)
(hashset db R (setunion (hashref db R (set)) (interpretquery q db))))
(define (interpretprogram rules)
(define (nextiteration db)
(foldl (λ (r db) (interpretrule r db)) db rules))
(define (dosomemore db)
(if (equal? db (nextiteration db))
db
(dosomemore (nextiteration db))))
(dosomemore (hash)))
;; relations: hash from relation names to their associated arities
;; herbrandbase is a list of atoms
(define (herbranduniverse relations herbrandbase)
;; generate a cartesian product of arity n with atoms from herbrandbase
(define (h n)
(if (= n 0)
'(())
(apply append (map (λ (lst) (map (λ (x) (cons x lst)) herbrandbase)) (h ( n 1))))))
(foldl
(λ (R acc) (hashset acc R (h (hashref relations R))))
(hash)
(hashkeys relations)))
(interpretquery
'(project (reorder ((reorder (scan edge) 1 0) ⋈ (scan path) on first 1) 1 2 0)
to first 2)
(hash 'edge (set '(a b) '(b c) '(c d) '(b e))
'path (set '(a b) '(b c) '(c d) '(b e))))
(interpretprogram
'((edge ∪= (literaltuple a b))
(edge ∪= (literaltuple b c))
(edge ∪= (literaltuple c d))
(path ∪= (scan edge))
(path ∪= (project (reorder ((reorder (scan edge) 1 0) ⋈ (scan path) on first 1) 1 2 0)
to first 2))))