Project 2: Implementing Datalog
Project due date:
Note: you are encouraged to work in groups of up to three.
In this project, you will implement the logic programming language Datalog. Datalog is a restricted version of SAT to Horn clauses. For example, the following is a short Datalog program:
path(x,y) :- edge(x,y)
path(x,z) :- path(x,y) edge(y,z)
edge(Syracuse,Rochester)
edge(Rochester,Ithaca)
edge(Rochester,Buffalo)
The program has two rules: the first says that any pair x,y
forming
an edge
is also a path
. Conceptually, we can think of edge
as an
“input” relation; opertationally, the first rule copies edge
to
path
. The second rule iteratively builds up path
. It says:
whenever there is a path(x,y)
such that there is also an edge(y,z)
(notice: y matches), there must be path(x,z)
as well. Executing
the program materializes (via saturation) the relation path
.
Your Assignment
You will build an interpreter for Datalog, which parses .dl
files
and executes them to a fixed point. Your .sdl (simple DL) files will
be a simplified format intended to be easy for you to parse, but
roughly similar to the input format for the state-of-the-art
Souffle engine–we encourage you to
use Souffle to help understand Datalog, and we have Souffle versions
of the test files. Your program will take a single command-line
argument, which is the .sdl
file which will be executed.
python3 yourdatalog.py transitive_closure.sdl
Your program’s output should be a collection of .tsv (tab-separated-values) files corresponding to the various relations of your output.
Every group must implement the “positive fragment” of Datalog sketched below. Additionally, you must implement one of two options:
- Stratified negation
- Semi-naive evaluation
Simple Datalog
We define a simple version of Datalog, which I call “simple” Datalog
(.sdl
). This is meant to be simpler to parse. In .sdl
programs,
there are only two types of lines allowed (other than empty lines):
(a) facts and (b) rules.
Facts look like this:
edge(1,2)
foo(5,2,1)
bar(2,5,1,2)
Notice that the only type of allowed “atomic” Data in simple Datalog
is the integers (e.g., there are no strings, etc… allowed). Also
notice the precise formating: it is always relation_name(n0,...)
with the relation named, followed by (
, followed by a number of
comma-separated integers (you can assume they are positive). This
should be easy to parse in many languages via a combination of regular
expressions, splitting, etc… Last, observe that facts do not allow
any variables: only integers are allowed as arguments to the relation.
Rules are more general than facts, and define inductively-computed relations:
path(x,y) :- edge(x,y)
path(x,z) :- path(x,y) path(y,z)
foo(x,5) :- bar(x,1,y) edge(x,y)
The :-
is a common (but antiquated) symbol frequently used in logic
programming–it means ←. So these rules could be read alternatively
as:
path(x,y) ← edge(x,y)
path(x,z) ← path(x,y) ∧ path(y,z)
foo(x,5) ← bar(x,1,y) ∧ edge(x,y)
These are Horn clauses. A Horn clause is a clause with at most one positive literal. To understand this, let’s unfold the definition of ← (notice it’s backwards) to obtain:
path(x,y) ∨ ¬edge(x,y)
path(x,z) ∨ ¬path(x,y) ∨ ¬path(y,z)
foo(x,5) ∨ ¬bar(x,1,y) ∨ ¬edge(x,y)
Now we can see clearly: Horn clauses which are set up for unit propagation. Datalog is essentially the restriction of propositional SAT to Horn clauses. We call the positive atom the “head” of the rule and the negative atoms the “body.” In general, the evaluation strategy of Datalog proceeds to a fixed-point, executing each rule until no more rules produce any useful information. While it is possible to simply use a SAT solver, Datalog’s restriction to Horn clauses enables a much more efficient evaluation strategy (semi-naive evaluation) and the ability to exploit a host of other niceties.
Syntax Summary
In sum, you will accept as input a .sdl
file specifying the
problem. Non-empty input lines in .sdl
files will have one of two
formats:
- Facts. Relations (alphanumeric, including
_
) applied to a comma-separated list of numbers:f(1,2,3)
but notf(x,1)
. - Rules. A head of a rule, followed by
:-
, followed by one or more bodies. The head and body clauses are of the formf(x,...)
, specifying either (a) an integer literal or (b) a variable name. Variables used in the head must appear in the body (i.e., they must be grounded).
Positive Datalog and Stratified Negation
Notice that vanilla Datalog does not include negation. It is fairly straightforward to add negation, provided that its computation may be stratified. For example, look at the following:
symm(x,y) :- edge(x,y) edge(y,x)
non_symm(x,y) :- edge(x,y) not(symm(y,x))
This program is not vanilla Datalog, but it is possible to
understand what it means intuitively: first compute symm
, then
compute non_sym
by iterating over edge
and including pairs (x,y)
for which there is no entry for symm(y,x)
in the relation symm
.
This is called “stratified” negation. A Datalog program that uses
stratified negation is essentially equivalent to a chain of Datalog
programs, each which fully materializes a subordinate relation (symm
in this case) and then consults it using not
.
If you would like to implement stratified negation, first implement
the positive Datalog fragment (i.e., vanilla Datalog). Then, think
about how to stratify the program so that whenever you need to invoke
not
on a relation, the relation is materialized via a previous
stratum.
If you do this phase, provide at least two nontrivial tests that your implementation works. You may take some liberty with the parsing and concrete syntax, as long as you say what you are doing.
Note that not all programs can be stratified. For example, the following is an error:
path(x,y) :- edge(x,y)
path(x,z) :- not(path(x,y)) edge(y,z)
The issue is that x
and y
are not grounded. You will need to
carefully consider this in your implementation.
Semi-Naive Evaluation
Consider the transitive closure program from earlier:
path(x,y) :- edge(x,y)
path(x,z) :- path(x,y) edge(y,z)
edge(1,2)
edge(2,3)
edge(2,4)
...
The “naive” semantics is to run each rule at every iteration, building
up the set path
. We call this strategy “naive” because as the size
of path
increases, it starts to become very slow: every element of
path
must be considered, including every previously-discovered
path. This is an asymptotically-larger amount of extra work (in
general–precise problem dynamics vary of course), and will result in
serious slowdown in general.
The solution is to realize that we only need to track a delta of
newly-discovered paths, since those are the only new ones that could
lead us to find new paths. In the above example, we would use our
delta version for path
(rather than the “full” relation), allowing
us to skip previously-examined paths.
Making this precise and implementing it will involve some open-ended amount of work on your part, but I will sketch pieces of the solution.
Testcases
A number of testcases are given in the test
subdirectory. The
instructor will also use a small corpus of their own tests to test
your code and check its correctness–but they will not be harder than
the ones distributed to students, and you may be given the opportunity
to fix broken code.
Each testcase in the .sdl
format also has a corresponding .dl
format. For the purposes of this project, the notion of “correctness”
is correctness vs. Souffle. My test scripts run the Souffle version of
the program, run your your interpreter on the corresponding .sdl
file, and then use the csvdiff
tool on the appropriate output
relation.
Building your Datalog Interpreter
As long as you can parse input files and produce TSV output files, you are given wide liberty in terms of your implementation. You may use, e.g., Python, Rust, C++, Racket, Haskell, OCaml, etc… I have a reference implementation in Racket.
Your interpreter may take either the naive or semi-naive evaluation strategy. You may also consider compiling to relational algebra kernels (i.e., for loops). This is also possible, it is up to you. I encourage you to write a dirt-simple naive interpreter. Once you finish with that for the positive fragment, try adding semi-naive evaluation. Or, similarly, explore how stratification would work.
Submitting your Program
Please send me your implementation, in whatever language, along with
instructions on how to run it. Your program should take a single input
file in the .sdl
format and produce TSVs as output files (one for
each relation in the program).
Please also include a short document that describes the high-level design you have taken, and briefly summarizes what each group member did.
Grading
-
[40%] – Correct implementation of parsing for Simple Datalog
.sdl
files -
[50%] – Correct implementation of the positive Datalog fragment, any implementation methodology (including naive).
-
[10%] – Choose one of the following: -> Semi-naive evaluation -> Stratified negation
I would like to make it so that even if you only implement the naive, positive fragment of Datalog, you can still get a 90%. However, I seriously encourage you to work towards implementing semi-naive evaluation or stratified negation. I will consider partial effort on these parts assuming you have a correctly-functioning naive interpreter first (you can’t have nothing working).
Best of luck and try to have fun!