Goto considered helpful
Let’s re-examine our first example: \(F = \{\{A,B,X,Y,Z\}, \{A,C,X,Y,Z\}\}\). We wish to save room in such a way that we can still manipulate \(F\) easily, which rules out general compression techniques. In fact, saving room whilst preserving running times is not enough; we need to improve running times as well! To solve our problems, we must use both time and space more efficiently than bit fields.
With this in mind, we notice the two sets in \(F\) have much in common, and represent the entire family \(F\) as the following function:
Input: a set \(S\)
-
I7: if \(A \notin S\) return 0; else goto I6;
-
I6: if \(B \notin S\) goto I5; else
-
if any of \(C,...,W\) are in \(S\) return 0; else goto I4;
-
-
I5: if \(C \notin S\) return 0; else
-
if any of \(D,...,W\) are in \(S\) return 0; else goto I4;
-
-
I4: if \(X \notin S\) return 0; else goto I3;
-
I3: if \(Y \notin S\) return 0; else goto I2;
-
I2: if \(Z \notin S\) return 0; else return 1;
A set \(S\) is in \(F\) if and only if this function returns 1. We can represent this program as a directed acyclic graph:
We interpret this dag as a program as follows. Start at the top node. The node labelled \(i\) means we test if the \(i\)th letter of the alphabet is in the set \(S\); if not, we follow the dotted line, otherwise we follow the solid line.
If the destination node has label \(j\), we return 0 if any letter strictly between \(i\) and \(j\) lies in \(S\). Then we repeat this process on the destination node. If we reach the node \(\top\) then we return 1, and similarly if we reach \(\bot\) we return 0.
In some sense we have the best possible program if we must examine membership of \(S\) in alphabetical order. There are no unnecessary tests; we return as soon as we are certain whether \(S\) is in our family \(F\). Contrast this with a hash table of bit fields, where even computing a hash requires looking at every bit.
Hence we have replaced 2 bit fields of size 26 with a single directed acyclic graph of 8 nodes. It may appear this is not worth the trouble but after several join operations for example, we’ll have sets with many elements in common, suggesting this approach will eventually win. For instance, \(F \sqcup G\) benefits more from our scheme because all 4 of its sets contain \(A, X, Y, Z\), and all but one contain \(C\).
Additionally, even without join operations and their ilk, certain problems naturally involve families of sets that compress well when stored in the above form.
How do we generalize and formalize these ideas so we can systematically construct dags like the above from a given family? Even if we can, is this representation suited for the our problems?
By the way, there are a couple of technicalities which this carefully chosen example avoids. In general, the start node may have a label higher than 1, and also, we might have a node smaller than 26 pointing to \(\bot\). After the next section, it should be clear how to convert these cases into appropriate if statements.