Our setting is logic programming, a field which attempts to design programming languages whose semantics have a close relationship to formal logic. The reason we might want to do this is that it suits our application domain more precisely than an implementation in a traditional programming language. Thus, using a logic programming language allows us to write more obviously-correct code, and perhaps even code that can be extracted cleanly from a certified implementation. Alternatively, if we did it ourselves, we’d have to do what our compiler (interpreter, …) would do anyway, so there’s no sense in doing it manually. Unfortunately, when we see a powerful tool, we are tempted to use it for everything: if our application is not ultimately-suited to the operationalization strategy of the logic programming engine we’re using, we simply obfuscate the issue in a veneer of formalism and end up with leaky abstractions. This is, I speculate, why logic programming languages have never caught on broadly for general-purpose programming. In this blog, I will detail the various trade-offs and implementation paradigms for modern logic programming engines, starting from Datalog and with a focus on program analysis.

The history of logic programming is rich, and I will not attempt to recount it all. Here I will focus on more restricted, application-specific languages, especially Datalog and its derivatives. The specific features of these languages, and the particulars of their implementation, often dovetail with a “right place, right time” effect. For example, Datalog backed by BDDs was a significant step forward in terms of production program analyses. More modern implementations eschew BDDs for more explicit representations, but it remains the case that engineers and computer scientists are on the lookout for logic-programming-based approaches to hard problems, especially those which deal intrinsically in the enumeration of large state spaces.

Perhaps one of logic programming’s most exciting motivations is program analysis. Program analysis systems automatically prove properties about, find bugs in, or simply help us understand our programs. These can come in a variety of forms from on-demand in-editor type-checking (runs in microseconds) to whole-program, context-sensitive points-to analysis (days). Program analyses are notoriously hard to specify, and are especially hard to implement in a way that provides a close relationship to the formal specification. Additionally, program analyses often grapple with large state spaces in practice to solve interesting problems, and require some amount of thought regarding high-performance implementation.

Chain-Forward Computation and Datalog

The central evaluation mechanism in Datalog, and its derivatives, is to saturate a set of rules (e.g., Horn clauses) to a fixed point, to obtain a knowledge database in some domain. We call the computation “chain forward” because the evaluation of such languages is guided by an ordering on knowledge, typically set inclusion (in the case of traditional Datalogs). In these settings, we define an “immediate consequence” operator, which tells us everything which must be known, as a consequence of what we currently know; crucially, this operator is typically monotonic: we do not lose knowledge over time, though we will discuss some interesting departures from this assumption. Applying this immediate consequence operator repeatedly yields a stream of knowledge databases over time. Assuming the immediate consequence operator is monotone, this stream of knowledge databases over time forms an ascending chain according to the ordering on knowledge.

Datalog’s syntax consists of “facts” and “rules.” Facts are “known statements” and always have the following form: R(c0, ...), where R is a relation name (identifier) and the arguments are constants. For example >(3,2) might be a fact, along with reaches("n0","n1"). But facts may not include variables, for example: reaches("n0",x) is disallowable as a fact, because x is not bounded in any way.

(define relation? symbol?)
(define atom? symbol?) ;; atomic lits are symbols

;; variables must be explicitly tagged, not 'x, but '(var x)
(define (var? x) (match x [`(var ,(? symbol? x)) #t] [_ #f]))
(define (var-or-atom? x) (or (atom? x) (var? x)))

;; tuples are untagged
(define (tuple? f)
  (match f
    [`((,(? atom?) ...)) #t]
    [_ #f]))

;; facts add a relation name (otherwise how would we know it)
(define (fact? f)
  (match f
    [`((,(? relation? r) ,(? atom?) ...)) #t]
    [_ #f]))

;; notice that ∧ is implicit in the body
(define (rule? r)
  (match r
    [`((,(? relation? head-rel) ,(? var-or-atom?) ...) <-- 
       (,(? relation? body-rels) ,(? var-or-atom?) ...) ...)
     #t]
    [_ #f]))

Let’s look at simplest interesting example: transitive closure. The Datalog program (using Soufflé) to implement transitive closure:

.output path
.decl edge(x:number,y:number)
.decl path(x:number,y:number)
edge(1,2). edge(2,3). edge(3,5). edge(5,4). edge(4,1). edge(4,8).
path(x, y) :- edge(x, y).
path(x, y) :- path(x, z), edge(z, y).

Running this program in Soufflé (souffle tc.dl) yields an output database, the transitively-closed graph, in the file path.csv.

Datalog cannot search (disjunction), saturated conjunctions only

Disjunctions in the head of a rule is disallowed: this is not semantically within reach of Datalog. The following is invalid:

Q(x,...) ∨ R(x,...) ← P(...) ∧ R(...)

Once we have more than one positive literal in a clause, we need a SAT solver. SAT solvers combine search (“guess new things”) and deduction (“derive consequences”); Datalog solvers only employ deduction. Both SAT and Datalog engines share some overlapping ideas; for example, both use indexing, to accelerate joins (Datalog) and for efficient unit propagation (SAT). But (>2)-SAT is strictly harder than Datalog: 2-SAT can be written as Horn clauses (Q ← R is ¬R ∨ Q), but 3-SAT and beyond are out of reach. By contrast, in the head of a rule presents no serious semantic issue: Q(x,...) ∧ R(x,...) ← ... can easily be desugared into two rules: Q(x,...) ← ... and R(x,...) ← ....

SAT solvers have an importance difference from Datalog solvers: they forget information. This is important because, while you can use Datalog to do SAT solving just fine, you will pay a massive performance due to the concomitant over-materialization induced by enumerating all possibilities: you would be literally materializing O(2ⁿ). Of course, there are languages that embrace such cartoonishly-huge state spaces by design (answer-set programming, for example), and their designs are subsequently guided around these issues; indeed, grounding on-the-fly is an active research topic within ASP.

As an example, think about how you could write 5-clique in Datalog:

clique3(A, B, C) :-
    edge(A, B), edge(A, C), edge(B, C),
    A != B, A != C, B != C.
clique4(A, B, C, D) :-
    clique3(A, B, C),
    edge(A, D), edge(B, D), edge(C, D),
    A != D, B != D, C != D.
clique5(A, B, C, D, E) :-
    clique4(A, B, C, D),
    edge(A, E), edge(B, E), edge(C, E), edge(D, E),
    A != E, B != E, C != E, D != E.

Semi-Naïve Evaluation

One issue with the repeated application of the rules is that if we use an explicit set-based representation of tuples, each iteration we’ll “rediscover” all knowledge from every previous iteration—this translates to additional data load, without commensurate knowledge throughput. In Datalog, the solution is to employ a compilation into an incrementalized IR, ala semi-naïve evaluation. For example, the recursive rule in transitive closure becomes:

path(x,z) ← path(x,y) ∧ edge(y,z)
          | 
   becomes| No Δ versions for edge as it is static
          |
Δpath(x,z) ← (Δpath(x,y) ∧ edge(y,z)) - path
i.e., 
Δpath(x,z) ∪= (Δpath(x,y) ⋈ edge(y,z)) - path

The rule expands into a single rule, because the relation edge never changes–thus, tracking a delta version would be irrelevant. Additionally, we assume that at the end of each iteration, Δpath is merged into path. In the more general case such as:

g(y,x) ∧ p(x,z) ← p(x,y) ∧ g(y,z)

We would need to split the rule into several versions: one to join Δp with g, one to join p with Δg, and one to join Δp with Δg. Think about what would happen if we have only Δp ⋈ Δg: if we have facts (x,y) in Δp and (y,z) in Δg, everything works fine. But what happens if (x,y) skips an iteration? It would be hard to ensure that doesn’t happen (though some work certainly explores this to a degree), and so we diversify our rules to enable us to catch things in Δp and Δg.

Previously, I mentioned the resulting rule would look something like:

Δpath(x,z) ∪= (Δpath(x,y) ⋈ edge(y,z)) - path

In fact, all rules within a fixedpoint (more specifically, an SCC of rules) will have the structure:

ΔR(x...) ∪= (ΔR(x,...) ⋈ Q(y,...)) - R

There is a crucial tacit point to be explained here: deduplication, i.e., - R is dirt cheap when implemented thoughtfully, and it is possible to parallelize nicely. There is no explicit scan of R in implementing subtraction, rather every rule generates a set of possibly-new tuples, which are deduplicated in some efficient manner to add to ΔR at the end of each iteration, before emptying ΔR in preparation for the next iteration.

I will not code up semi-naïve evaluation here, but will likely discuss a general approach to forward differentation and incrementalization in subsequent posts.

Relational Algebra

The explicit focus on bound variables has made the above presentation informal, as substitution was never defined. Indeed, substitution is “where computation happens” here in much the same way as in the λ-calculus. However, handling binders is a bit tedious, and it turns out there is a better, more systematic way. In the same way that category theory will avoid mentioning concrete points for products A × B, relational algebra will avoid mentioning explicitly-named points for manipulating relations.

The lack of explicit binders allows us to write an obviously-correct (albeit slow) interpreter for a conjunctive query that produces sets of tuples as its output:

;; q: query?
;; db: hashmap from relation name ↦ ℘(Tuple)
;; returns a set of tuples
(define (interpret-query q db)
  (match q
    [`(literal-tuple ,es ...) (set es)]
    [`(scan ,R) (hash-ref db R)]
    [`(select from ,q+ where column ,n equals ,k)
     (list->set (filter (lambda (x) (equal? (list-ref x n) k))
                        (set->list (interpret-query q+ db))))]
    [`(,q0 ∪ ,q1) (set-union (interpret-query q0 db)
                             (interpret-query q1 db))]
    [`(,q0 ∩ ,q1) (set-intersect (interpret-query q0 db)
                                 (interpret-query q1 db))]
    [`(reorder ,q ,order ...)
     (let ([ts (set->list (interpret-query q db))])
       (set->list
        (map (λ (t) (map (λ (i) (list-ref t i)) order))
             ts)))]
    [`(project ,q to first ,n)
     (list->set (map (λ (t) (take t n)) (set->list (interpret-query q db))))]
    [`(,q0 ⋈ ,q1 on first ,N)
     (list->set
      (foldl
       (λ (t0 tups)
         (foldl (λ (t1 tups)
                  (if (equal? (take t0 N) (take t1 N))
                      (set-add tups (append (take t0 N) (drop t0 N) (drop t1 N)))
                      tups))

The implementation is mostly standard: notice that ⋈ degenerates to the cartesian product in the case that N=0 (since (take t 0) returns the empty list for all t). We include only binary joins, k-ary joins are typically decomposed into chains or trees of binary joins: R ⋈ Q ⋈ S can be construed as either (R ⋈ Q) ⋈ S or R ⋈ (Q ⋈ S) (3 or more joins would allow us to consider trees, too). Unfortunately, the wrong choice could force the enumeration to explode: if a subordinate join produces lots of junk which is later filtered out by the outer join, we may end up doing more work than an alternative join plan. Join planning–-the process of optimally deciding on a join decomposition–is a challenging aspect of the implementation; for some programs, any particular static join plan is be inefficient, as relation sizes dynamically evolve, the optimal join plan may shift as well. While traditional databases put a lot of effort into query planning, Datalog has taken a slightly different approach, with only scant recent work exploring dynamic plans.

With this implementation, we can write query to find us all of the one-hop transitive edges: the query we want is edge(x,y), path(y,z), which operationalizes to a join between edge and path; since we want to line up the ys we reorder edge, then reorder the join result (to gather x and z), then project the first two elements.

(interpret-query
 '(project (reorder ((reorder (scan edge) 1 0) ⋈ (scan path) on first 1) 1 2 0)
           to first 2)
 (hash 'edge (set '(a b) '(b c) '(c d) '(b e))
       'path (set '(a b) '(b c) '(c d) '(b e))))
;; yields (set '(a e) '(b d) '(a c))

Assuming we’re willing to write programs in terms of relational algebra, we can iteratively generate new tuples. An RA-rule? extends a specific relation (recall we’re taking a point-free style) with the result of a query. We can interpret rules by evaluating the query and unioning the result set into the necessary relation. A program is a set of rules, which we can evaluate by iteratively (to a fixed point) applying all rules until we see no changes in the database.

(define (RA-rule? r)
  (match r
    [`(,(? relation? R) ∪= ,(? RA-query? q)) #t]
    [_ #f]))

(define (interpret-rule r db)
  (match-define `(,R ∪= ,q) r)
  (hash-set db R (set-union (hash-ref db R (set)) (interpret-query q db))))

(define (interpret-program rules)
  (define (next-iteration db)
    (foldl (λ (r db) (interpret-rule r db)) db rules))
  (define (do-some-more db)
    (if (equal? db (next-iteration db))
        db
        (do-some-more (next-iteration db))))
  (do-some-more (hash)))

Now we can write small programs:

(interpret-program
 '((edge ∪= (literal-tuple a b))
   (edge ∪= (literal-tuple b c))
   (edge ∪= (literal-tuple c d))
   (path ∪= (scan edge))
   (path ∪= (project (reorder ((reorder (scan edge) 1 0) ⋈ (scan path) on first 1) 1 2 0)
                     to first 2))))

Next time, we’ll plan to talk a bit about how to systematically compile our high-level rules into the RA plans, and discuss the various trade-offs in doing so.

The full code

Is here in this gist.

;; Relational algebra
#lang racket

(define relation? symbol?)
(define atom? symbol?) ;; atomic lits are symbols

;; variables must be explicitly tagged, not 'x, but '(var x)
(define (var? x) (match x [`(var ,(? symbol? x)) #t] [_ #f]))
(define (var-or-atom? x) (or (atom? x) (var? x)))

;; tuples are untagged
(define (tuple? f)
  (match f
    [`((,(? atom?) ...)) #t]
    [_ #f]))

;; facts add a relation name (otherwise how would we know it)
(define (fact? f)
  (match f
    [`((,(? relation? r) ,(? atom?) ...)) #t]
    [_ #f]))

;; notice that ∧ is implicit in the body
(define (rule? r)
  (match r
    [`((,(? relation? head-rel) ,(? var-or-atom?) ...) <-- 
       (,(? relation? body-rels) ,(? var-or-atom?) ...) ...)
     #t]
    [_ #f]))

;; should all be true
(rule? '((p) <--))
(rule? '((r) <-- (q)))
(rule? '((path x y) <-- (edge x x)))
(rule? '((path x y) <-- (edge x y) (path y z)))

;; RA queries
(define (RA-query? q)
  (match q
    [`(literal-tuple ,(? atom? es) ...) #t]
    ;; scan a relation
    [`(scan ,(? relation? R)) #t]
    [`(select from ,(? RA-query?)
              where column ,(? nonnegative-integer?)
              equals ,(? atom?))
     #t]
    ;; natural join on the first N columns, values must be equal
    [`(,(? RA-query? q0) ⋈ ,(? RA-query? q1) on first ,(? nonnegative-integer? N)) #t]
    [`(,(? RA-query? q0) ∪ ,(? RA-query? q0)) #t]
    [`(,(? RA-query? q0) ∩ ,(? RA-query? q0)) #t]
    ;; reorder tuples
    [`(reorder ,(? RA-query? q) ,(? nonnegative-integer?) ...) #t]
    ;; project the first n elements of tuples
    [`(project ,(? RA-query? q) to first ,(? nonnegative-integer? n)) #t]
    ;; need closed-world assumption
    [`(- ,(? RA-query? q)) #t]
    [_ #f]))

;; q: query?
;; db: hashmap from relation name ↦ ℘(Tuple)
;; returns a set of tuples
(define (interpret-query q db)
  (match q
    [`(literal-tuple ,es ...) (set es)]
    [`(scan ,R) (hash-ref db R)]
    [`(select from ,q+ where column ,n equals ,k)
     (list->set (filter (lambda (x) (equal? (list-ref x n) k))
                        (set->list (interpret-query q+ db))))]
    [`(,q0 ∪ ,q1) (set-union (interpret-query q0 db)
                             (interpret-query q1 db))]
    [`(,q0 ∩ ,q1) (set-intersect (interpret-query q0 db)
                                 (interpret-query q1 db))]
    [`(reorder ,q ,order ...)
     (let ([ts (set->list (interpret-query q db))])
       (set->list
        (map (λ (t) (map (λ (i) (list-ref t i)) order))
             ts)))]
    [`(project ,q to first ,n)
     (list->set (map (λ (t) (take t n)) (set->list (interpret-query q db))))]
    [`(,q0 ⋈ ,q1 on first ,N)
     (list->set
      (foldl
       (λ (t0 tups)
         (foldl (λ (t1 tups)
                  (if (equal? (take t0 N) (take t1 N))
                      (set-add tups (append (take t0 N) (drop t0 N) (drop t1 N)))
                      tups))
                tups
                (set->list (interpret-query q1 db))))
       (set)
       (set->list (interpret-query q0 db))))]))

(define (RA-rule? r)
  (match r
    [`(,(? relation? R) ∪= ,(? RA-query? q)) #t]
    [_ #f]))

(define (interpret-rule r db)
  (match-define `(,R ∪= ,q) r)
  (hash-set db R (set-union (hash-ref db R (set)) (interpret-query q db))))

(define (interpret-program rules)
  (define (next-iteration db)
    (foldl (λ (r db) (interpret-rule r db)) db rules))
  (define (do-some-more db)
    (if (equal? db (next-iteration db))
        db
        (do-some-more (next-iteration db))))
  (do-some-more (hash)))

;; relations: hash from relation names to their associated arities
;; herbrand-base is a list of atoms 
(define (herbrand-universe relations herbrand-base)
  ;; generate a cartesian product of arity n with atoms from herbrand-base
  (define (h n)
    (if (= n 0)
        '(())
        (apply append (map (λ (lst) (map (λ (x) (cons x lst)) herbrand-base)) (h (- n 1))))))
  (foldl
   (λ (R acc) (hash-set acc R (h (hash-ref relations R))))
   (hash)
   (hash-keys relations)))

(interpret-query
 '(project (reorder ((reorder (scan edge) 1 0) ⋈ (scan path) on first 1) 1 2 0)
           to first 2)
 (hash 'edge (set '(a b) '(b c) '(c d) '(b e))
       'path (set '(a b) '(b c) '(c d) '(b e))))

(interpret-program
 '((edge ∪= (literal-tuple a b))
   (edge ∪= (literal-tuple b c))
   (edge ∪= (literal-tuple c d))
   (path ∪= (scan edge))
   (path ∪= (project (reorder ((reorder (scan edge) 1 0) ⋈ (scan path) on first 1) 1 2 0)
                     to first 2))))

<
Previous Post
Modern Deduction Blog 0: Introduction
>
Blog Archive
Archive of all previous blog posts