When two players execute a large computation they need to agree with each other on a labelling of the individual operations so that they know where each received result belongs. In VIFF we call these labels program counters. We will try to explain the design of these counters in this document and list some ideas for alternative implementations.
The basic setup in VIFF is a set of players who communicate over reliable point-to-point links, e.g., TCP or SSL connections. It is important to remember that these links guarantee that all transmitted messages arrive unchanged at the destination and that they arrive in the order sent.
The very first version of VIFF network data was numbered in the most naive way possible: using a single counter for each player. This worked fine most of the time, but once in a while a test would fail to give the correct result. It was only when one ran thousands of multiplications that the bug appeared, but its cause was quite simple. Consider a program like this where we assume that the shares a, b, c, and d have already been correctly defined:
x = mul(a, b) y = mul(c, d)
Back then, the mul() function was implemented like this:
def mul(share_a, share_b): inc_pc() result = gather_shares([share_a, share_b]) result.addCallback(lambda (a, b): a * b) result.addCallback(shamir_share) result.addCallback(recombine, threshold=2*threshold) return result
where inc_pc() took care of incrementing the global program counter. This simple implementation worked 99.99% of the time with three players connected on a LAN, but once in a while it would fail to calculate the correct results.
In our example program, shamir_share() is called twice: once when a and b are ready, and once when c and d are ready. Most of the time a and b are ready first on all players, and so they all agree on the program counter value for this call to shamir_share(). But when we have bad luck, one player sees c and d arrive first and so the two calls to shamir_share() are switched for that player.
The problem is the asynchronous nature of Twisted: all players agree on the execution tree, but depending on the exact timing they might reach the nodes in the tree in a different order. The tree looks like this in our little example:
x y | | mul mul / \ / \ a b c d
and the two mul() can be called in either order since they do not depend on each other.
The solution used now in VIFF has two ingredients. First, callbacks that depend on the program counter (like func:shamir_share in our example above) are not added with addCallback() but instead with the special schedule_callback() method. This method saves the program counter in effect at the time of the its call, and ensures that the saved program counter is temporarily made active when the callback is called.
Secondly, the program counter is a list of counters. This is necessary to ensure that we can allocate new fresh counters at any point in the execution tree. The execution tree is never explicitly constructed in VIFF, so a simple static numbering is not possible.
The program counter starts at the value . It is changed in two cases:
when a callback is scheduled using schedule_callback(), a new sub-program counter is allocated. A sub-program counter is simply a program counter with another digit. Because of the asynchronous network, the callback will be invoked at an unknown later time. When invoked, it sees the sub-program counter. This ensures that that the parties agree on any network traffic produced in the callback.
When a piece of code like this:
def cb(ignored): print "callback:", self.program_counter d = Deferred() print "main:", self.program_counter self.schedule_callback(d, cb) print "main:", self.program_counter d.callback(None)
is executed, one will see output like this:
main:  main:  callback: [0, 0]
some functions depend on a unique program counter. These functions simply increase the last digit in the current program counter:
self.program_counter[-1] += 1
We have come up with some alternative solutions, which are detailed below. More good ideas are of course welcome!
An attractive alternative is to label data sent over the net based on its history. The idea is that we associate a label H(x) with each variable x. The history is defined when new variables are defined — if x = a * b, then we can set H(x) = ("mul", H(a), H(b)). To avoid growing the history without bounds we can hash it with a cryptographic hash function to bring it down to a fixed size.
The problem with this idea is that we sometimes need to assign a history to a variable that depends on no other variables. An example of this is the result of a call to prss_share() which takes no useful arguments. A possible solution would be to add some dummy arguments on which the history could be based, even though they wont be used by the method. So if you would normally call the function hat() with no arguments to get a Rabbit object, you have to change your code from this:
rabbit = hat()
rabbit = hat(dummy=locals())
where the call to locals() gives you access to the local variables. If the use of locals() could be hidden this might be an acceptable solution.
Using the history of the variables has the big advantage that we label each piece of transmitted data with just information that is relevant to it: namely its position in the tree formed by the calculation and not its position in the execution tree formed by the implementation in VIFF. This is conceptually cleaner than the current solution.
Another idea is to solve the labelling problem by having some external tool transform the program into one with explicit labels. Each send and each receive operation needs to be labelled and the labels much match pair-wise.
It is not entirely clear how this should work in the presence of loops and if it is possible to implement this in an easier way than the current program counters.