Project 2: Implementing Datalog

Note: you are encouraged to work in groups of up to three.

In this project, you will implement the logic programming language Datalog. Datalog is a restricted version of SAT to Horn clauses. For example, the following is a short Datalog program:

path(x,y) :- edge(x,y)
path(x,z) :- path(x,y) edge(y,z)

edge(Syracuse,Rochester)
edge(Rochester,Ithaca)
edge(Rochester,Buffalo)

The program has two rules: the first says that any pair x,y forming an edge is also a path. Conceptually, we can think of edge as an “input” relation; opertationally, the first rule copies edge to path. The second rule iteratively builds up path. It says: whenever there is a path(x,y) such that there is also an edge(y,z) (notice: y matches), there must be path(x,z) as well. Executing the program materializes (via saturation) the relation path.

Your Assignment

You will build an interpreter for Datalog, which parses .dl files and executes them to a fixed point. Your .sdl (simple DL) files will be a simplified format intended to be easy for you to parse, but roughly similar to the input format for the state-of-the-art Souffle engine–we encourage you to use Souffle to help understand Datalog, and we have Souffle versions of the test files. Your program will take a single command-line argument, which is the .sdl file which will be executed.

python3 yourdatalog.py transitive_closure.sdl

Your program’s output should be a collection of .tsv (tab-separated-values) files corresponding to the various relations of your output.

Every group must implement the “positive fragment” of Datalog sketched below. Additionally, you must implement one of two options:

Stratified negation
Semi-naive evaluation

Simple Datalog

We define a simple version of Datalog, which I call “simple” Datalog (.sdl). This is meant to be simpler to parse. In .sdl programs, there are only two types of lines allowed (other than empty lines): (a) facts and (b) rules.

Facts look like this:

edge(1,2)
foo(5,2,1)
bar(2,5,1,2)

Notice that the only type of allowed “atomic” Data in simple Datalog is the integers (e.g., there are no strings, etc… allowed). Also notice the precise formating: it is always relation_name(n0,...) with the relation named, followed by (, followed by a number of comma-separated integers (you can assume they are positive). This should be easy to parse in many languages via a combination of regular expressions, splitting, etc… Last, observe that facts do not allow any variables: only integers are allowed as arguments to the relation.

Rules are more general than facts, and define inductively-computed relations:

path(x,y) :- edge(x,y)
path(x,z) :- path(x,y) path(y,z)
foo(x,5) :- bar(x,1,y) edge(x,y)

The :- is a common (but antiquated) symbol frequently used in logic programming–it means ←. So these rules could be read alternatively as:

path(x,y) ← edge(x,y)
path(x,z) ← path(x,y) ∧ path(y,z)
foo(x,5) ← bar(x,1,y) ∧ edge(x,y)

These are Horn clauses. A Horn clause is a clause with at most one positive literal. To understand this, let’s unfold the definition of ← (notice it’s backwards) to obtain:

path(x,y) ∨ ¬edge(x,y)
path(x,z) ∨ ¬path(x,y) ∨ ¬path(y,z)
foo(x,5) ∨ ¬bar(x,1,y) ∨ ¬edge(x,y)

Now we can see clearly: Horn clauses which are set up for unit propagation. Datalog is essentially the restriction of propositional SAT to Horn clauses. We call the positive atom the “head” of the rule and the negative atoms the “body.” In general, the evaluation strategy of Datalog proceeds to a fixed-point, executing each rule until no more rules produce any useful information. While it is possible to simply use a SAT solver, Datalog’s restriction to Horn clauses enables a much more efficient evaluation strategy (semi-naive evaluation) and the ability to exploit a host of other niceties.

Syntax Summary

In sum, you will accept as input a .sdl file specifying the problem. Non-empty input lines in .sdl files will have one of two formats:

Facts. Relations (alphanumeric, including _) applied to a comma-separated list of numbers: f(1,2,3) but not f(x,1).
Rules. A head of a rule, followed by :-, followed by one or more bodies. The head and body clauses are of the form f(x,...), specifying either (a) an integer literal or (b) a variable name. Variables used in the head must appear in the body (i.e., they must be grounded).

Positive Datalog and Stratified Negation

Notice that vanilla Datalog does not include negation. It is fairly straightforward to add negation, provided that its computation may be stratified. For example, look at the following:

symm(x,y) :- edge(x,y) edge(y,x)
non_symm(x,y) :- edge(x,y) not(symm(y,x))

This program is not vanilla Datalog, but it is possible to understand what it means intuitively: first compute symm, then compute non_sym by iterating over edge and including pairs (x,y) for which there is no entry for symm(y,x) in the relation symm.

This is called “stratified” negation. A Datalog program that uses stratified negation is essentially equivalent to a chain of Datalog programs, each which fully materializes a subordinate relation (symm in this case) and then consults it using not.

If you would like to implement stratified negation, first implement the positive Datalog fragment (i.e., vanilla Datalog). Then, think about how to stratify the program so that whenever you need to invoke not on a relation, the relation is materialized via a previous stratum.

If you do this phase, provide at least two nontrivial tests that your implementation works. You may take some liberty with the parsing and concrete syntax, as long as you say what you are doing.

Note that not all programs can be stratified. For example, the following is an error:

path(x,y) :- edge(x,y)
path(x,z) :- not(path(x,y)) edge(y,z)

The issue is that x and y are not grounded. You will need to carefully consider this in your implementation.

Semi-Naive Evaluation

Consider the transitive closure program from earlier:

path(x,y) :- edge(x,y)
path(x,z) :- path(x,y) edge(y,z)
edge(1,2)
edge(2,3)
edge(2,4)
...

The “naive” semantics is to run each rule at every iteration, building up the set path. We call this strategy “naive” because as the size of path increases, it starts to become very slow: every element of path must be considered, including every previously-discovered path. This is an asymptotically-larger amount of extra work (in general–precise problem dynamics vary of course), and will result in serious slowdown in general.

The solution is to realize that we only need to track a delta of newly-discovered paths, since those are the only new ones that could lead us to find new paths. In the above example, we would use our delta version for path (rather than the “full” relation), allowing us to skip previously-examined paths.

Making this precise and implementing it will involve some open-ended amount of work on your part, but I will sketch pieces of the solution.

Testcases

A number of testcases are given in the test subdirectory. The instructor will also use a small corpus of their own tests to test your code and check its correctness–but they will not be harder than the ones distributed to students, and you may be given the opportunity to fix broken code.

Each testcase in the .sdl format also has a corresponding .dl format. For the purposes of this project, the notion of “correctness” is correctness vs. Souffle. My test scripts run the Souffle version of the program, run your your interpreter on the corresponding .sdl file, and then use the csvdiff tool on the appropriate output relation.

Building your Datalog Interpreter

As long as you can parse input files and produce TSV output files, you are given wide liberty in terms of your implementation. You may use, e.g., Python, Rust, C++, Racket, Haskell, OCaml, etc… I have a reference implementation in Racket.

Your interpreter may take either the naive or semi-naive evaluation strategy. You may also consider compiling to relational algebra kernels (i.e., for loops). This is also possible, it is up to you. I encourage you to write a dirt-simple naive interpreter. Once you finish with that for the positive fragment, try adding semi-naive evaluation. Or, similarly, explore how stratification would work.

Submitting your Program

Please send me your implementation, in whatever language, along with instructions on how to run it. Your program should take a single input file in the .sdl format and produce TSVs as output files (one for each relation in the program).

Please also include a short document that describes the high-level design you have taken, and briefly summarizes what each group member did.

Grading

[40%] – Correct implementation of parsing for Simple Datalog .sdl files
[50%] – Correct implementation of the positive Datalog fragment, any implementation methodology (including naive).
[10%] – Choose one of the following: -> Semi-naive evaluation -> Stratified negation

I would like to make it so that even if you only implement the naive, positive fragment of Datalog, you can still get a 90%. However, I seriously encourage you to work towards implementing semi-naive evaluation or stratified negation. I will consider partial effort on these parts assuming you have a correctly-functioning naive interpreter first (you can’t have nothing working).

Best of luck and try to have fun!