Compilation and Backend-Independent Vectorization for Multi-Party Computation (Extended Version)

Benjamin Levy and Ben Sherman T
{levyb3, shermb}@rpi.edu
Rensselaer Polytechnic Institute
Troy, New York

Muhammad Ishaq
istraqm@purdue.edu
Purdue University
West Lafayette, Indiana

Lindsey Kennard
fireelemental.ne@gmail.com
STR
Boston, Massachusetts

Ana Milanova
milanova@cs.rpi.edu
Rensselaer Polytechnic Institute
Troy, New York

Vassilis Zikas
vzikas@purdue.edu
Purdue University
West Lafayette, Indiana

ABSTRACT

Research on MPC programming technology largely falls at the two ends of the classical compiler: (1) work on front-end language design (e.g., Wysteria, Viaduct) and (2) work on back-end protocol implementation (e.g., ABY, MOTION).

In this work, we formalize the MPC Source intermediate language and advance the idea of what we call backend-independent optimizations, in a close analogy to machine-independent optimizations in the classical compiler. We present a compiler framework that takes a Python-like routine and produces MOTION code. We focus on a specific backend-independent optimization: novel SIMD-vectorization on MPC Source, which we show leads to significant speedup in circuit generation time and running time, as well as significant reduction in communication and number of gates over the corresponding iterative schedule.

1 INTRODUCTION

Multi-party computation (MPC) allows $N$ parties $p_1, \ldots, p_N$ to perform a computation on their private inputs securely. Informally, security means that the secure computation protocol computes the correct output (correctness) and it does not leak any information about the individual party inputs, other than what can be deduced from the output (privacy).

MPC theory dates back to the early 1980-ies [35, 20, 6, 14]. Long the realm of theoretical cryptography, MPC has seen significant advances in programming technology in recent years. These advances bring MPC closer to practical and wider applicability — MPC technology has been employed in real-world scenarios such as auctions [10], biometric identification [8], and privacy-preserving machine learning [28, 27]. The goal is to improve technology so that programmers can write secure and efficient programs without commanding extensive knowledge of cryptographic primitives.

The problem, therefore, is to build a high-level programming language and a compiler, and there has been significant advance in this space, e.g., [5, 9, 37, 21, 25, 1, 11] among other work. Current research largely falls at the two ends of the classical compiler: (1) work on front-end language design and (2) work on back-end protocol implementation. Work on language design focuses on high-level constructs necessary to express multiple parties, computation by different parties, and information flow from one party to another [31, 1]. On the other end, work on protocol implementation focuses on cryptographic foundations and their efficient circuit-level implementation [15, 4, 11], e.g., implementation of operations (e.g., MUL, ADD) using different sharing protocols (Boolean or Arithmetic GMW [20] or Yao’s garbled circuits [35]), as well as efficient share conversion from one representation to another.

Earlier compilers did both back-end and front-end translation as their aim was to demonstrate applicability of MPC on real-world programming problems. As the field advanced, works have focused more closely on front-end language design (e.g., Wysteria [31] and Viaduct [1]) or back-end “circuit-level” design and implementation (e.g., MOTION [11]).

In this work we focus on an intermediate language and what we call backend-independent optimizations, in a close analogy to machine-independent optimizations in the classical compiler. The following figure summarizes our key idea:

![Diagram of backend-independent optimization](image)

We formalize the MPC Source [22] intermediate representation and emphasize optimization over MPC Source. As in classical compilers, we envision different front ends (e.g., our front end IMP Source, other) compiling into MPC Source. MPC Source is particularly suitable for optimizations such as protocol mixing [13, 22, 18] and SIMD-vectorization, which takes advantage of amortization at the circuit level. The MPC Source IR exposes the linear structure of MPC programs, which simplifies program analysis; this is in contrast to source, which has if-then-else constructs. In the same time, MPC Source is sufficiently “high-level” to support analysis and optimizations that take into account control and data flow in a specific program. MPC Source facilitates cost modeling. Again as in classical compilers, we envision translation of MPC Source (optimized or unoptimized) into MOTION, SPDZ, or other back-end code.
1.1 Our Contribution

In this paper, we develop a compiler framework that takes a Python-like routine and produces MOTION code: we describe (a) the IMP Source language, its syntax and semantic restrictions, (b) translation into MPC Source, (c) a specific backend-independent optimization: novel SIMD-vectorization on MPC Source, and (d) translation from MPC Source into MOTION code.

We focus on the MOTION framework as our back-end for several reasons. First, it is the state of the art in terms of performance [11]. Second, it provides an API over efficient implementation for a wide variety of cryptographic operations in three different protocols — Arithmetic GMW, Boolean GMW, and BMR — which allows for protocol mixing [13, 22, 18], a known backend-independent optimization. Third, MOTION provides API for SIMD-level operations, which amortize cost and lead to significant improvement in memory footprint and throughput [15, 4, 11]. It enables MPC Source-level vectorization, a key focus of this paper.

Our second contribution is an analytical model for cost estimation of amortized schedules. Originally, we hoped that optimal scheduling (under our model, which essentially minimizes the length of the schedule) was tractable, as the problem appeared simpler than the classical scheduling problem. Unfortunately, we show that optimal scheduling is NP-hard via a reduction to the Shortest Common Supersequence (SCS) problem. Cost modeling is important as it drives not only vectorization but optimizations such as protocol mixing and scheduling as well [13].

Our most important contribution is the implementation and evaluation of the compiler framework. We demonstrate expressiveness of the source language by running the compiler on 16 programs with interleaved if- and for-statements; these include classical MPC benchmarks such as PSI and Biometric matching, as well as kMeans, Histogram, and other examples from the literature. Our compiler takes the routine and generates non-vectorized MOTION code (from MPC Source on the picture above). It then optimizes MPC Source and generates vectorized MOTION code (from Optimized MPC Source). We then run the two versions using Boolean GMW and the BMR protocols. (MOTION, which is designed for protocol mixing, supports Arithmetic GMW, however, it does not implement Comparison (CMP) and Multiplexing (MUX) as they would be rather inefficient.) In our experiments vectorized code exhibits 24x improvement on average in circuit generation time for Boolean GMW (20x for BMR), 7x reduction in communication (2x), 97x reduction in number of gates (91x), 4x improvement in setup time in the LAN setting (23x) and 21x improvement in online time, again in the LAN setting (18x).

The compiler can be found at https://github.com/milana2/ParallelizationForMPC where https://github.com/milana2/ParallelizationForMPC/tree/master/gh-pages shows experiments on small input sizes and details the different stages of the compiler.

Our results emphasize the importance of backend-independent optimizations — vectorization (described in this work) and protocol mixing (tackled in previous works [13, 22, 18]) are two optimizations readily available at the level of MPC Source. We believe that our work can lead to future work on backend-independent compilation and optimization, ushering new MPC optimizations and combinations of optimizations in the vein of standard compilers, and thus bringing MPC programming technology closer to practice and wider applicability.

1.2 Outline

The rest of the paper is organized as follows. §2 presents an overview of the compiler. §3 describes our model for cost estimation and argues NP-hardness of optimal scheduling. §4 details the front-end phases of the compiler, §5 focuses on backend-independent vectorization, and §6 describes translation into MOTION. §7 presents the experimental evaluation. §8 discusses related work and §9 concludes.

All our code, including benchmark Python-like code, compilation phases, and generated MOTION code is available on Github. We do not provide the link only to preserve anonymity, however, we will gladly make it available upon request from reviewers. The Github setup generates graphs and intermediate code for each program along each compiler phase and runs MOTION on small inputs to generate tables of data (the experiments in the paper run on real LAN and WAN). We plan to release the link and code, which we believe will be useful to researchers.

2 OVERVIEW

2.1 Source

As a running example, consider Biometric matching, a standard MPC benchmark. An intuitive (and naive) implementation is as shown in Listing 1(a). Array C is the feature vector of D features that we wish to match and S is the database of N size-D vectors that we match against.

Our compiler takes essentially standard IMP [29] syntax and imposes certain semantic restrictions. (We detail the restrictions in the following sections.) The programmer writes an iterative program and annotates certain inputs and outputs as shared. In the example arrays C and S are shared, meaning that they store shares, however, the array sizes D and N respectively are plaintext. The code iterates over the entries in the database and computes the Euclidean distance of the current entry S[i] and C (its square actually). The program returns the index of the vector that gives the best match plus the corresponding sum of squares.

2.2 MPC Source and Cost of Schedule

Our compiler generates an intermediate representation, MPC Source. MPC Source has a linear SSA form. MPC Source for Biometric Matching is shown and described in detail in Listing 1(b).

We turn to our analytical model to compute the cost of the iterative program. Assume cost β for a local MPC operation (e.g., ADD in Arithmetic sharing) and cost α for a remote MPC operation (e.g., MUX, CMP, etc.). Assuming that ADD is β and SUM, CMP and MUX are α, the MPC Source in Listing 1(b) gives rise to an iterative schedule with cost N*(2α + β) + N(3α).

A key contribution is the vectorizing transformation. We can compute all N*D subtraction operations (line 9 in (b)) in a single SIMD instruction; similarly we can compute all multiplication operations (line 10) in a single SIMD instruction. And while computation of an individual sum remains sequential, we can compute the N sums in parallel.
def biometric(C: shared[list[int]], D: int, 
S: shared[list[int]], N: int) -> 
shared[tuple[int]]: 
min_sum : int = MAX_INT 
min_idx : int = 0 
for i in range(0, N): 
    min_sum2 = PHI(min_sum1, min_sum4) 
    min_idx2 = PHI(min_idx1, min_idx4) 
    sum2 = 0 
    for j in range(0, D): 
        sum3 = PHI(sum2, sum4) 
        p = MUL(d[d]) 
        sum4 = ADD(sum3, p) 
        t = CMP(sum3, min_sum2) 
        min_sum3 = sum3 
        min_idx3 = i 
        min_sum4 = MUX(t, min_sum3, min_sum2) 
        min_idx4 = MUX(t, min_idx3, min_idx2) 
    return (min_sum2, min_idx2)

1  def biometric(C: shared[list[int]], D: int, 
2     S: shared[list[int]], N: int) -> 
3     shared[tuple[int]]: 
4     min_sum : int = MAX_INT 
5     min_idx : int = 0 
6     for i in range(0, N): 
7         min_sum2 = PHI(min_sum1, min_sum4) 
8         min_idx2 = PHI(min_idx1, min_idx4) 
9         sum2 = 0 
10        for j in range(0, D): 
11            sum3 = PHI(sum2, sum4) 
12            p = MUL(d[d]) 
13            sum4 = ADD(sum3, p) 
14            t = CMP(sum3, min_sum2) 
15            min_sum3 = sum3 
16            min_idx3 = i 
17            min_sum4 = MUX(t, min_sum3, min_sum2) 
18            min_idx4 = MUX(t, min_idx3, min_idx2) 
19        return (min_sum2, min_idx2)

(a) IMP Source (b) MPC Source (c) Optimized MPC Source

Table 1: Biometric Matching: From (a) IMP Source to (b) MPC Source: First, MPC Source is an SSA form. Second, it is linear. The conditional in lines 13-14 in IMP Source turns into the linear code in lines 12-16 in MPC Source. The test turns into the CMP operation t = CMP(sum3, min_sum2), followed by the true-branch sequence, followed by the MUX operations. The first MUX operation selects the value of min_sum when t is true, then min_sum gets the value of the second multiplexer argument, min_sum2, otherwise it takes the value of the third argument, min_sum3. Third, MPC Source is a special form of SSA. The SSA φ-nodes at the if-then-else (lines 13-15) turn into MUX operations, while the φ-nodes at for-loops turn into pseudo PHI nodes with a straightforward semantics. From (b) MPC Source to (c) Optimized MPC Source: The compiler determines that SUB and MUL in “naive” MPC Source (lines 9 and 10 in (b)) can be fully vectorized into the SIMD SUB and MUL in optimized MPC Source (lines 9 and 10 in (c)). Notation p[i, j] denotes a 2-dimensional array with fully vectorized dimensions. The computation of sum (line 11 in (b)) is sequential across the j-dimension, but it is parallel across the i-dimension. The loop in lines 12-16 in (c) illustrates; here p[i, j] refers to the j-th column in p. Unfortunately, CMP and MUX remain sequential.

2.3 Vectorized MPC Source and Cost of Schedule

Our compiler produces the vectorized program shown and described in Listing 1(c). Note that this is still our intermediate representation, Optimized MPC Source. Subsequently, the compiler turns this code into MOTION variables, loops and SIMD primitives, which MOTION then uses to generate the circuit.

In MPC back ends, executing n operations “at once” in a single SIMD operation costs a lot less than executing those n operations one by one. This is particularly important when there is communication (i.e., in remote), since many 1-bit values are sent at once rather than sequentially. We elaborate on the cost model in Section §3 but for now consider that each operation has a fixed portion (does benefit from amortization) and a variable portion (does not benefit from amortization): \( \alpha = \alpha_{\text{fix}} + \alpha_{\text{var}} \). This gives rise to the following formula for amortized cost: \( f(n) = \alpha_{\text{fix}} + n \alpha_{\text{var}} \), as opposed to unamortized cost \( g(n) = n \alpha_{\text{fix}} + n \alpha_{\text{var}} \). (We extend the same reasoning to β-instructions.)

Thus, the fixed cost of the vectorized program amounts to \( 2\alpha_{\text{fix}} + D\beta_{\text{fix}} + N(3\alpha_{\text{fix}}) \). (The variable cost is the same in both the vectorized and non-vectorized programs.) The first term in the sum corresponds to the vectorized subtraction and multiplication (lines 9-10 in (c)), the second term corresponds to the for-loop on \( j \) (lines 12-16) and the third one corresponds to the remaining for-loops on \( i \) (lines 19-25). Clearly, \( 2\alpha_{\text{fix}} + D\beta_{\text{fix}} + N(3\alpha_{\text{fix}}) << ND(2\alpha_{\text{fix}} + \beta_{\text{fix}}) + N3\alpha_{\text{fix}} \). Empirically, we observe that \( \alpha_{\text{var}} \approx 0 \) and orders of magnitude improvement (e.g., 12x in online time in GMW). Additionally, the unvectorized version runs out of memory for \( N = 256 \), while the vectorized one runs with the standard maximal input size \( N = 4096 \).
3 ANALYTICAL MODEL

This section presents a model to reason about the cost of execution of MPC programs, including accounting for amortization. We define the assumptions and setting in §3.1. We proceed to define the scheduling problem in §3.2, which we expected to be able to solve optimally. §3.3 shows that the problem is NP-hard via a reduction to the Shortest Common Supersequence (SCS) problem. Despite the negative general result, we expect the formulation in terms of SCS to be useful as sequences are short and few in practice.

3.1 Scheduling in MPC

For this treatment we make the following simplifying assumptions:

(1) All statements in the program execute using the same protocol (sharing). That is, there is no share conversion.

(2) There are two tiers of MPC instructions, local and remote. A local instruction (e.g., ADD in Arithmetic, XOR in Boolean) has cost $\alpha$ and a remote instruction (e.g., MUX, MUL, SEH, etc.) has cost $\alpha$, where $\alpha > \beta$. We assume that all remote instructions have the same cost.

(3) In MPC frameworks, executing $n$ operations “at once” in a single SIMD operation costs a lot less than executing those $n$ operations one by one. Following Amdahl’s law, we write $\alpha = \frac{1}{p}\alpha + (1-p)\alpha$, where $p$ is the fraction of execution time that benefits from amortization and $(1-p)$ is the fraction that does not, and $s$ is the available resource. Thus, $n\alpha = \frac{2}{s}\alpha_s + n(1-p)\alpha$. For the purpose of the model we assume that $s$ is large enough and the term $\frac{2}{s}\alpha_s$ amounts to a fixed cost incurred regardless of whether $n$ is 10,000 or just 1. (This models the cost of preparing and sending a packet from party A to party B for example.) Therefore, amortized execution of $n$ operations is $f(n) = \alpha_s + n\alpha_s$. We have $\alpha_s << n\alpha_s$.

(4) MPC instructions scheduled in parallel benefit from amortization only if they are the same instruction. Given our previous assumption, 2 MUL instructions can be amortized in a single SIMD instruction that costs $\alpha_s + 2\alpha_s$, however a MUL and a MUX instruction still cost $2\alpha_s + 2\alpha_s$ even when scheduled “in parallel”.

3.2 Problem Statement

As mentioned earlier, at the lowest level, we have two types of MPC instructions (also called gates or operations in similar works)

1) local/non-interactive (e.g., an addition instruction $A$) and

2) remote/interactive (e.g., a multiplication instruction $M$).

Given a serial schedule (a linear graph) of an MPC program i.e. a sequence of instructions $S := \{S_1, \ldots, S_n\}$, where $S_i \in \{A, M\}$, $1 \leq i \leq n$, and a def-use dependency graph $G(V, E)$ corresponding to $S$, our task is to construct a parallel schedule (another linear graph) $P := \{P_1, \ldots, P_m\}$ observing the following conditions:

(1) All $P_i$’s consist of MPC instructions of the same kind, e.g., all MUL, MUX, ADD, etc.

(2) Def-use dependencies of the graph $G(V, E)$ are preserved i.e. if instructions $S_i, S_j, i < j$ form a def-use i.e. an edge exists from $S_i$ to $S_j$ in $G$, then they can only be mapped to $P_i, P_j$ such that $i' < j'$.

Correctness. Correctness of $P$ is guaranteed by definition. Preserving def-use dependencies means the computed function remains the same in both $S$ and $P$.

The cost of schedule $S$ is

$$\text{cost}(S) = \sum_{i=1}^{n} \text{cost}(S_i) = L_{\alpha} \alpha_{\text{fix}} + L_\beta \beta_{\text{fix}} + L_{\alpha} \alpha_{\text{var}} + L_\beta \beta_{\text{var}}$$

where $L_{\alpha}$ is the number of $\alpha$-instructions and $L_\beta$ is the number of $\beta$ ones. (We used this formula to compute the cost of the unrolled MPC Source program in §2.) The cost of schedule $P$ is more interesting:

$$\text{cost}(P) = \sum_{i=1}^{m} \text{cost}(P_i) = \sum_{i=1}^{m} \text{cost}(P_i)$$

Each $P_i$ may contain multiple instructions, and $\text{cost}(P_i)$ is amortized. Thus, according to our model $\text{cost}(P_i) = \alpha_{\text{fix}} + |P_i|\alpha_{\text{var}}$ if $P_i$ stores $|P_i|$ $\alpha$-instructions, or $\text{cost}(P_i) = \beta_{\text{fix}} + |P_i|\beta_{\text{var}}$ if it stores $\beta$-instructions. (Similarly, we used this formula to compute the cost of the Optimized MPC Source program in §2.)

Our goal is to construct a parallel schedule $P$ that reduces the program cost (when compared to cost of $S$), possibly an optimal schedule. Originally we hoped that the problem is simpler and computation of the optimal schedule is tractable. Unfortunately, the optimal schedule turns out to be NP-hard via a reduction to the Shortest Common Supersequence problem.

3.3 Scheduling is NP-hard

To prove that optimal scheduling is an NP-Hard problem, we consider the following convenient representation. An MPC program is represented as a set of sequences $\{s_1, \ldots, s_n\}$ of operations. In each sequence $s_i$ operations depend on previous operations via a def-use i.e. $s_i[j]$, $j > 1$ depends on $s_i[j - 1]$.

As an example, consider the MPC program consisting of the following three sequences, all made up of two distinct $\alpha$-instructions $M_1$ and $M_2$, e.g., $M_1$ is MUL and $M_2$ is MUX. The right arrow indicates a def-use dependence, meaning that the source node must execute before the target node:

(1) $M_1 \rightarrow M_2 \rightarrow M_1$
(2) $M_1 \rightarrow M_1 \rightarrow M_1$
(3) $M_2 \rightarrow M_1 \rightarrow M_2$

The problem is to find a schedule $P$ with minimal cost. For example, a schedule with minimal cost for the sequences above is $M_1(1), M_1(2) ; M_1(2) ; M_2(1), M_2(2), M_2(3) ; M_1(1), M_1(2), M_1(3) ; M_2(3)$.

The parentheses above indicate the sequence where the instruction comes from: (1), (2), or (3). Cost of schedule $P$ is computed using Eq. (2) above and it amounts to $5\alpha_{\text{fix}} + 9\alpha_{\text{var}}$.

The problem of finding a schedule $P$ with a minimal cost $\text{cost}(P)$ is shown to be NP-Hard problem, as it can be reduced to the problem of finding a shortest common supersequence, a known NP-Hard problem [34]. The shortest common supersequence problem is as follows: given two or more sequences find the shortest sequence...
that contains all of the original sequences. This can be solved in $O(n^k)$ time, where $n$ is the cardinality of the longest sequence and $k$ is the number of sequences. We can see that the optimal schedule is the shortest schedule, since the shortest schedule minimizes the fixed cost while the variable cost remains the same.

To formalize the reduction, suppose $P$ is a schedule with minimal cost (computed by a black-box algorithm). Clearly $P$ is a supersequence of each sequence $s_i$, i.e., $P$ is a common supersequence of $s_1, \ldots, s_n$. It is also a shortest common supersequence. The cost of $\text{cost}(P) = L^* + N\alpha_{\text{var}}$, where $L^*$ is the length of $P$ and $N$ is the total number of instructions across all sequences. Now suppose, there exist a shorter common supersequence $P'$ of length $L'$, $\text{cost}(P') < \text{cost}(P)$ since $L'\alpha_{\text{var}} + N\alpha_{\text{var}} < L^* + N\alpha_{\text{var}}$, contradicting the assumption that $P$ has the lowest cost.

4 COMPILER FRONT END

Fig. 1 presents an overview of our compiler. We start the section with a brief description of the top-level syntax and semantic restrictions in §4.1. We proceed to describe the front end of the compiler: §4.2 describes translation from IMP Source to SSA, and §4.3 describes translation form SSA into MPC Source.

We describe analysis on MPC Source and backend-independent vectorization in §5 and translation into MOTION code in §6.

4.1 Syntax and Semantic Restrictions

Source syntax is essentially standard IMP syntax but with for-loops:

```plaintext
e := e op e | x | const | A[e] expression
s := s; s | x = e | A[e] = e | assignment stmt
for i in range(l) : s | for stmt
if e : s else : s if stmt
```

The syntax allows for array accesses, arbitrarily nested loops, and if-then-else control flow. Expressions are typed $\langle q \tau \rangle$. Where qualifier $q$ and type $\tau$ are:

```plaintext
r := int | bool | list[int] | list[bool] base types
q := shared | plain qualifiers
```

The system of types is mostly standard, and in our experience, a sweet spot between readability and expressivity. The shared qualifier denotes shared values, i.e., ones shared among the parties and computed upon under secure computation protocols; the plain qualifier denotes plaintext values. Subtyping is $\text{plain} <: \text{shared}$, meaning that we can convert a plaintext value into a shared one, but not vice versa. Subtyping on qualified types is again as expected, it is covariant in the qualifier and invariant in the type:

$\langle q_1 \tau_1 \rangle <: \langle q_2 \tau_2 \rangle \iff q_1 <: q_2$ and $\tau_1 = \tau_2$.

Our compiler imposes certain semantic restrictions that enforces throughout the various phases of compilation. We note that in some cases, the restrictions can be easily lifted and we plan to do so in future iterations of the work.

1. Loops are of the form $0 <= i < l$ and bounds are fixed at compile time. It is a standard restriction in MPC that the bounds must be known at circuit-generation time.
2. Arrays are one-dimensional. $N$-dimensional arrays are linearized and accessed in row-major order and at this point the programmer is responsible for linearization and access. (This restriction can be easily lifted.)
3. Array subscripts are plaintext values as specified by the rule:
   
   \[ \Gamma \vdash e : \langle \text{plain int} \rangle \quad \Gamma \vdash A : \langle \text{list[t]} \rangle \quad \tau \in \{\text{int, bool}\} \]
   
   \[ \Gamma \vdash A[e] : \langle \tau \rangle \]

   The subscript $e$ is a function of the indices of the enclosing loops. For read access, the compiler allows an arbitrary such function. However, it restricts write access to canonical writes, i.e., $A[i,j,k] = ...$ where $i, j$ and $k$ loop over the three dimensions of $A$. Write access such as for example $A[i,j+2] = ...$ is not allowed. (This is again a restriction we imposed for convenience in our current implementation; we plan to extend the compiler with arbitrary write access.)
4. The final restriction involves $\text{MUX}$ as expressed by the rule:

   \[ \Gamma \vdash e_1 : \langle q_1 \text{ bool} \rangle \quad \Gamma \vdash e_2 : \langle q_2 \tau \rangle \quad \Gamma \vdash e_3 : \langle q_3 \tau \rangle \quad \tau \in \{\text{int, bool}\} \]
   
   \[ \Gamma \vdash \text{MUX}(e_1, e_2, e_3) : \langle q_1 \lor q_2 \lor q_3 \tau \rangle \]

The arguments of $\text{MUX}$ are restricted to base types. (Again, this is just a restriction of our current implementation that will be lifted.)

This causes an infeasibility as we could not write

```plaintext
1 if e: A[i] = val
```

Instead we had to write

```plaintext
1 if not(e): val = A[i]
```

We note that while the programmer writes annotations at the level of IMP Source (as in Listing 1(a)), the annotations propagate through the transformations; annotations are inferred and checked with a taint analysis at the level of MPC Source. The programmer annotates only inputs.

For the rest of this section we write $i, j, k$ to denote the loop nest: $i$ is the outermost loop, $j$ is immediately nested in $i$, and so on until $k$ and we use $I, J, K$ to denote the corresponding upper bounds. We write $A[i, j, k]$ to denote canonical access to an array element. In the program, canonical access is achieved via the standard row-major order formula: $\langle I + J \ast K + i + j \ast k \rangle$. To simplify the presentation we describe our algorithms in terms of three-element tuples $i, j, k$, however, discussion easily generalizes to arbitrarily large loop nests.

4.2 From IMP Source to SSA

Our compiler translates from Source to SSA as follows:

Parsing: Use Python’s ast module to parse the input source code to a Python AST.

Syntax checking: Ensure that the AST matches the restricted subset defined in Section §4.1. This step outputs an instance of the restricted_ast.function class, which represents our restricted subset of the Python AST.

3-address CFG conversion: Convert the restricted-syntactic AST to a three-address control-flow graph. To do this, first, add an empty basic block to the CFG and mark it as current. Next, for each statement in the restricted AST’s function body, process the statement.
Statements can either be for-loops, if-statements, or assignments (as in §4.1). Rules for processing each kind of statement are given below:

1. For-loops: Create new basic blocks for the loop condition (the condition-block), the loop body (the body-block), and the code after the loop (the after-block). Insert a jump from the end of the current block to the condition-block. Then, mark the condition-block as the current block. Insert a for-instruction at the end of the current block with the loop counter variable and bounds from the AST. Next, add an edge from the current block to the after-block labeled "FALSE" and an edge from the current block to the body-block labeled "TRUE". Then, set the body-block to be the current block and process all statements in the AST's loop body. Finally, insert a jump to the condition-block and set the after-block as current.

2. If-statements: Create new basic blocks for the "then" statements of the if-statement (the then-block), the "else" statements of the if-statement (the else-block), and the code after the if-statement (the after-block). At the end of the current block, insert a conditional jump to the then-block or else-block depending on the if-condition in the AST. Next, mark the then-block as current, process all then-statements, and add a jump to the after-block. Similarly, mark the else-block as current, process all else-statements, and add a jump to the after-block. Finally, set the after-block to be the current block, and give it a merge condition property equal to the condition of the if-statement.

3. Assignments: In the restricted-syntax AST, the left-hand side of assignments can be a variable or an array subscript. If it is an array subscript, e.g., A[i] = x, change the statement to A = Update(A, i, x). If the statement is not already three-address code, for each sub-expression in the right-hand side of the assignment, insert an assignment to a temporary variable.

SSA conversion: Convert the 3-address CFG to SSA with Cytron’s algorithm.

4.3 From SSA to MPC Source

Once the compiler converts the code to SSA, it transforms \( \phi \)-nodes that correspond to if-statements into MUX nodes. From the 3-address CFG conversion step, \( \phi \)-nodes corresponding to if-statements will be in a basic block with the merge condition property. For example, if \( X3 = \phi(X1,X2) \) is in a block with merge condition C, the compiler transforms it into \( X3 = \text{MUX}(C, X1, X2) \). Next, the compiler runs the dead code elimination algorithm from Cytron’s SSA paper.

Next, the control-flow graph is linearized into MPC Source, which has loops but no if-then-else-statements. This means that both branches of all if-statements are executed, and the MUX nodes determine whether to use results from the then-block or from the else-block. The compiler linearizes the control-flow graph with a variation of breadth-first search. Blocks with the "merge condition" property are only considered the second time they are visited, since that will be after both branches of the if-statement are visited. (The Python AST naturally gives rise to a translation where each conditional has exactly two targets, and each "merge condition" block has exactly two incoming edges, a TRUE and a FALSE edge. Thus, each \( \phi \)-node has exactly two multiplexer arguments, which dovetails into MUX. This is in contrast with Cytron’s algorithm which operates at the level of the CFG and allows for \( \phi \)-nodes with multiple arguments.) Each time the compiler visits a block, it adds the block’s statements to the MPC source. If the block ends in a for-instruction, the compiler recursively converts the body and code after the loop to MPC source and adds the for-loop and code after the loop to the main MPC source. If the block does not end in a for-instruction, the compiler recursively converts all successor branches to MPC source and appends these to the main MPC source.

Now, the remaining \( \phi \)-nodes in MPC source are the loop header nodes. We call these nodes \( \text{pseudo} \) \( \phi \)-nodes and we write \( \text{PHI} \) in MPC Source. A pseudo \( \phi \)-node \( X1 = \text{PHI}(X0,X2) \) in a loop header is evaluated during circuit generation. If it is the 0-th iteration, then the \( \phi \)-node evaluates to \( X0 \), otherwise, it evaluates to \( X2 \).

5 Backend-Independent Vectorization

This section describes our vectorization algorithm. While vectorization is a longstanding problem, and we build upon existing work on scalar expansion and classical loop vectorization [3], our algorithm is unique as it works on the MPC Source SSA-form representation. We posit that vectorization over MPC Source is a new problem that warrants a new look, in part because of MPC’s unique linear structure and in part because vectorization meshes in with other
 §5.1 describes our dependence analysis and §5.2 describes scalar expansion, which lifts scalars (and arrays) to the corresponding loop dimensionality to create opportunities for vectorization. §5.3 describes our core vectorization algorithm and §5.4 argues correctness of the vectorization transformation. §5.5 extends vectorization with array writes.

5.1 Dependence Analysis

We build a dependence graph where the nodes are the MPC Source statements and the edges represent the def-use relations.

Def-use Edges. We distinguish the following def-use edges:

- same-level edge \( X \rightarrow Y \) where \( X \) and \( Y \) are in the same loop nest, say \( i, j, k \). E.g., the def-use edge 9 to 10 in the Biometric MPC Source in Listing 1 is a same-level edge. A same-level edge can be a back-edge in which case a \( \phi \) node is the target of the edge. E.g., 15 to 4 in Biometric is a same-level back-edge.

- outer-to-inner \( X \rightarrow Y \) where \( X \) is in an outer loop nest, say \( i \), and \( Y \) is in an inner one, say \( i, j, k \). E.g., 1 to 4 in Biometric forms is an outer-to-inner edge.

- inner-to-out-outer \( X \rightarrow Y \) where \( X \) is a \( \phi \)-node in an inner loop nest, \( i, j, k \), and \( Y \) is in the enclosing loop nest \( i, j \), E.g., the def-use from 8 to 12 gives rise to an inner-to-out-outer edge. An inner-to-out-outer edge can be a back-edge as well, in which case both \( X \) and \( Y \) are \( \phi \)-nodes with the source \( X \) in a loop nested into \( Y \)'s loop (not necessarily immediately).

- mixed forward edge \( X \rightarrow Y \). \( X \) is in some loop \( i, j, k \) and \( Y \) is in a loop nested into \( i, j, k' \). We transform mixed forward edges as follows. Let \( x \) be the variable defined at \( X \). We add a variable and assignment \( x' = x \) immediately after the \( i, j, k \) loop. Then we replace the use of \( x \) at \( Y \) with \( x' \). This transforms the mixed forward edge into an "inner-to-outer" forward edge followed by an outer-to-inner forward edge. Thus, Basic Vectorization handles one of "same-level", "inner-to-outer", or "outer-to-inner" def-use edges.

Closures. We define \( \text{closure}(n) \) where \( n \) is a PHI-node. Intuitively, it computes the set of nodes (i.e., statements) that form a dependence cycle with \( n \). The closure of \( n \) is defined as follows:

- \( n \) is in \( \text{closure}(n) \)
- \( X \) is in \( \text{closure}(n) \) if there is a same-level path from \( n \) to \( X \), and \( X \rightarrow n \) is a same-level back-edge.
- \( Y \) is in \( \text{closure}(n) \) if there is a same-level path from \( n \) to \( Y \) and there is a same-level path from \( Y \) to some \( X \) in \( \text{closure}(n) \).

5.2 Scalar Expansion

An important component of our algorithm is the scalar expansion to the corresponding loop dimensionality, which is necessary to expose opportunities for vectorization. In the Biometric example, \( d = S[i \cdot D + j] - C[j] \) equiv. to \( d = S[i,j] - C[j] \), which gave rise to \( N \times D \) subtractions in the sequential schedule, is lifted.

The argument arrays \( S \) and \( C \) are lifted and the scalar \( d \) is lifted: \( d[i,j] = S[i,j] - C[j] \). The algorithm then detects that the statement can be vectorized.

Arrays. Conceptually, we treat all variables as arrays. There are three kinds of arrays.

- Scalars: These are scalar variables we expand into arrays for the purposes of vectorization. For those, all writes are canonical writes and all reads are canonical reads. We will raise dimension when a scalar gives rise to an outer-to-inner dependence edge (e.g., \( \text{sum}!2 \) in line 6 of the MPC source code will be raised to a 1-dimensional array since \( \text{sum}!2 \) is used in the inner \( j \)-loop). We will drop dimension when a scalar gives rise to an inner-to-out-outer dependence edge (e.g., \( \text{sum}!2 \) for which the lifted inner loop computes \( D \) values, but the outer loop only needs the last one.)

- Read-only input arrays: There are no writes, while we may have non-canonical reads, \( f(i,j,k) \). Vectorization adds raise dimension operations at the beginning of the function to lift these arrays to the dimensionality of the loop where they are used, possibly reshaping the arrays. If there are multiple "views" of the input array, there would be multiple raise dimension statements to create each one of these views. The invariant is that at reads in loops, the reads of "views" of the original input array are canonical.

- Read-write output arrays: Writes are canonical (by restriction) but reads can be non-canonical. Dependences limit vectorization when non-canonical read access refers to array writes in previous iterations, thus creating loop-carried dependences. We may apply both raise and drop dimension, however, they respect the fixed dimensionality of the output array. The array cannot be raised to a dimension lower than its canonical (fixed) dimensionality and it cannot be dropped to lower dimension. In addition, non-canonical reads may require lifting (i.e., reshaping) of the array after the most recent write rather than in the beginning of the program in order to reduce a non-canonical read to a canonical one.

Raise dimension. The \( \text{raise} \_ \text{dim} \) function expands a scalar (or array). There are two conceptual versions of \( \text{raise} \_ \text{dim} \). One reshapes an arbitrary access into a canonical read access in the corresponding loop. It takes the original array, the access pattern function \( f(i,j,k) \) in loop nest \( i, j, k \) and the loop bounds \( ((i : I), (j : J), (k : K)) \):

\[
\text{raise} \_ \text{dim}(A, f(i,j,k), ((i : I), (j : J), (k : K)))
\]

It produces a new 3-dimensional array \( A' \) by iterating over \( i, j, k \) and setting each element of \( A' \) as follows:

\[
A'[i,j,k] = A[f(i,j,k)]
\]

The end result is that uses of \( A[f(i,j,k)] \) in loop nest \( i,j,k \) are replaced with canonical read-accesses to \( A'[i,j,k] \) that can be vectorized. In the running Biometric example, \( C' = \text{raise} \_ \text{dim}(C, (i : N, j : D)) \) lifts the 1-dimensional array \( C \) into a 2-dimensional array. The \( i,j \) loop now accesses \( C' \) in the canonical way, \( C'[i,j] \). Similarly, \( S' = \text{raise} \_ \text{dim}(S, i \cdot D + j, (i : N, j : D)) \) tries to lift \( S \), but the operation turns into a no-op because \( S \) is already a 2-dimensional array and the read access is canonical.

The other version of \( \text{raise} \_ \text{dim} \) lifts a lower-dimension array into a higher-dimension for access in a nested loop. It is necessary when processing outer-to-inner dependences. Here \( A \) is an \( i \)-array and
We present our algorithm followed by an example on Biometric.

5.3 Basic Vectorization

We present our algorithm followed by an example on Biometric.

5.3.1 Algorithm

There are two key phases of the algorithm. Phase 1 inserts raise dimension and drop dimension operations according to def-uses. E.g., if there is an inner-to-outer dependence, it inserts raise dimension, and similarly, if there is an outer-to-inner dependence, it inserts drop dimension. After this phase operations work on arrays of the corresponding dimensionality and we optimistically vectorize all arrays.

Phase 2 proceeds from the inner-most towards the outer-most loop. For each loop it anchors dependence cycles (closures) around pseudo PHI nodes then removes vectorization from the dimension of that loop. There are two important points in this phase. First, it may break a loop into smaller loops which would discover opportunities for vectorization in intermediate statements in the loop. Second, it handles writes array. It creates opportunities for vectorization in the presence of write arrays, even though Cytron’s SSA adds a back-edge to the array PHI-node, thus killing vectorization of statements that read and write that array.

The excerpts in red color in the pseudo code below highlight the extension with array writes. We advise the reader to omit the extension for now and consider just read-only arrays. We explain the extension in §5.5. (As many of our benchmarks include write arrays, it plays an important role.)

Phases 3 cleans up local arrays of references (this is an optional phase and our current implementation does not include it) and Phase 4 explicitly turns operations into MOTION SIMD operations.

5.3.2 Example: Biometric

We now demonstrate the workings of the basic algorithm on our running example. Recall the MPC Source for Biometric:

```c
for each MPC stmt : x = Op(y_1, y_2) in loop i, j, k do
for each closure cl do
  for each PHI-node n in loop i_1,...,i_d do
    compute closure(n)
  end for
  { Phase 2: Recreating for-loops for cycles; vectorizable stmts hoisted up. }
  for each dimension d from highest to 0 do
    for each PHI-node n in loop i_1,...,i_d do
      if n is vectorizable then
        compute closure(n)
        create for i_d in ... loop
        add PHI-nodes in cl to header block
        add target-less PHI-node for A if cl updates array A
        add statements in cl to loop in some order of dependences
        } Dimension is not vectorizable: }
        change l_d to l_d in all statements in loop
        treat for-loop as monolith node for def-uses: e.g., some def-use edges become same-level.
    end for
    for each target-less PHI-node A!1 = PHI(A!0,A!k) do
      in vectorizable stmts, replace use of A!1 with A!0
      discard PHI-node if not used in any cl, replacing A!1 with A!0 or A!k appropriately
    end for
  end for
  { Phase 3: Remove unnecessary dimensionality. }
  } A dimension i is dead on exit from stmt x[i...i...] = ... if all stmts with targets outside of the enclosing for i ... MOTION loop end at target (use) x' = drop_dim(x, i). }
  } for each stmt and dimension x[i...i...] = ... do
    if i is a dead dimension on exit from stmt x[i...i...] = ..., remove i from x (all defs and uses)
  end for
  } Now clean up drop_dim and raise_dim }
  case-def-use edge stmt'(def of y_n) → stmt(def of x) of
    same-level: y'_n is y_n
    outer-to-inner: add y'_n[i, j, k] = raise_dim(y_n) at stmt'
    (more precisely, right after stmt')
    inner-to-outer: add y'_n[i, j, k] = drop_dim(y_n) at stmt'
    (more precisely, in loop of stmt right after loop of stmt')
  end for
  { Optimistically vectorize all. I means vectorized dimension. }
```
Compilation and Backend-Independent Vectorization for Multi-Party Computation Conference’17, July 2017, Washington, DC, USA

Phase 1 of Vectorization Algorithm. The transformation preserves the dependence edges. It raises the dimensions of scalars and optimistically vectorizes all operations. The next phase discovers loop-carried dependencies and removes affected vectorization.

In the code below statements (e.g., min_sum!3 = sum!3) are implicitly vectorized. The example illustrates the two different versions of raise_dim. E.g., raise_dim(C, j, (i:N,j:D)) reshapes the read-only input array. drop_dim(sum!3) removes the j dimension of sum!3.

```
sum!3 = PHI(sum!2, sum!4)
d = SUB(S((i:D) + j),C[j])
p = MUL(d,d)
sum!4 = ADD(sum!3,p)
t = CMP(sum!3,min_sum!2)
min_sum!3 = sum!3
min_idx!2 = i
min_sum!4 = MUX(t, min_sum!3, min_sum!2)
min_idx!4 = MUX(t, min_idx!3, min_idx!2)
return (min_sum!2, min_idx!2)
```

Phase 2 of Vectorization Algorithm. This phase analyzes statements from the innermost loop to the outermost. The key point is to discover loop-carried dependencies and re-introduce loops whenever dependencies make this necessary.

Starting at the inner phi-node sum!3 = PHI(...), the algorithm first computes its closure. The closure amounts to the phi-node itself and the addition node sum!4 = ADD(sum!3,p!3), accounting for the loop-carried dependency of the computation of sum. The algorithm replaces this closure with a for-loop on j removing vectorization on j. Note that the SUB and MUL computations remain outside of the loop as they do not depend on PHI-nodes that are part of cycles.

The dependences are from p[I,J] = MUL(d[I,J],d[I,J]) to the monolithic for-loop and from the for-loop to sum!3^= drop_dim(sum!3).

Lower case index, e.g., i, indicates non-vectorized dimension, while uppercase index, e.g., I indicates vectorized dimension.

After processing the inner loop code becomes:

```
min_sum!1 = MAX_INT
min_sum!1^ = raise_dim(min_sum!1, (i:N))
min_idx!1 = 0
S^ = raise_dim(S, ((i + D) + j), (i:N,j:D))
C^ = raise_dim(C, j, (i:N,j:D))
for i in range(0,N):
  min_sum!2[i] = PHI(min_sum!1^ [i], min_sum!4[i])
  min_idx!2 = PHI(min_idx!1^ [i], min_idx!4[i])
  sum!2 = [0,...,0]
  sum!2^ = raise_dim(sum!2, (i:D))
  d[I,J] = SUB(S^ [I,J],C^ [I,J])
p[I,J] = MUL(d[I,J],d[I,J])
for j in range(0,D):
  sum!3[I,J] = PHI(sum!2^ [I,J], sum!4[I,J−1])
  sum!4[I,J] = ADD(sum!3[I,J],p[I,J])
  sum!3^ = drop_dim(sum!3)
  t[I] = CMP(sum!3^ [I],min_sum!2[I])
  min_sum!3 = sum!3^ +
  min_idx!3 = i
  min_sum!4[I,J] = MUX(t[I], min_sum!3[I], min_sum!2[I])
  min_idx!4[I,J] = MUX(t[I], min_idx!3[I], min_idx!2[I])
  min_sum!2^ = drop_dim(min_sum!2)
  min_idx!2^ = drop_dim(min_idx!2)
return (min_sum!2^, min_idx!2^)
```

When processing the outer loop two closures arise, one for min_sum!2[I] = PHI(...) and one for min_idx!2[I] = PHI(...). Since the two closures do not intersect, we have two distinct for-loops on i:

```
min_sum!1 = MAX_INT
min_sum!1^ = raise_dim(min_sum!1, (i:N))
min_idx!1 = 0
S^ = raise_dim(S, ((i + D) + j), (i:N,j:D))
C^ = raise_dim(C, j, (i:N,j:D))
for i in range(0,N):
  min_sum!2[i] = PHI(min_sum!1^ [i], min_sum!4[i])
  min_idx!2 = PHI(min_idx!1^ [i], min_idx!4[i])
  sum!2 = [0,...,0]
  sum!2^ = raise_dim(sum!2, (i:D))
  d[I,J] = SUB(S^ [I,J],C^ [I,J])
p[I,J] = MUL(d[I,J],d[I,J])
for j in range(0,D):
  sum!3[I,J] = PHI(sum!2^ [I,J], sum!4[I,J−1])
  sum!4[I,J] = ADD(sum!3[I,J],p[I,J])
  sum!3^ = drop_dim(sum!3)
  t[I] = CMP(sum!3^ [I],min_sum!2[I])
  min_sum!3 = sum!3^ +
  min_idx!3 = i
  min_sum!4[I,J] = MUX(t[I], min_sum!3[I], min_sum!2[I])
  min_idx!4[I,J] = MUX(t[I], min_idx!3[I], min_idx!2[I])
  min_sum!2^ = drop_dim(min_sum!2)
  min_idx!2^ = drop_dim(min_idx!2)
return (min_sum!2^, min_idx!2^)
```
Phase 3 of Vectorization Algorithm. This phase removes redundant dimensionality. It starts by removing redundant dimensions in MOTION loops followed by removal of redundant drop dimension statements. It then does (extended) constant propagation to "bypass" raise statements, followed by copy propagation and dead code elimination.

The code becomes closer to what we started with:

```
sum2 = 0
for i in range(0, N):
    min_idx2[i] = PHI(min_idx4[i], min_idx3[i], min_idx2[i])
    min_sum4[i] = MUX(t[i], min_idx3[i], min_idx2[i])
    for j in range(0, D):
        // j is redundant for sum3 and sum4
        if j in range(0, D):
            sum3[i] = PHI(sum2[i], sum4[i])
        sum4[i] = ADD(sum3[i], p[i])

// drop_dim is redundant, removing
// then copy propagation and dead code elimination
min_idx3 = [0, 1, 2, ..., N - 1] // i.e., min_idx3[i] = [i, (i - N]
```

Phase 4 of Basic Vectorization. This phase adds SIMD operations:

```
sum2 = [0, ..., 0]
d[j] = SUB_SIMD(sum3[i], C[i], j)
p[i] = MUL_SIMD(d[j], d[j])
```

5.4 Correctness Argument

We build a correctness argument that loosely follows Abstract Interpretation. First we define the MPC Source Syntax. The domain of MPC Source programs expressible in the syntax (with certain semantic restrictions) is the abstract domain $A$. We then define the linearization of an MPC Source program as an interpretation over the syntax. The linearization, which is a schedule (as in §2), is the concrete domain $C$. Since we reason over def-use graphs in $A$ we define a partial order relation over elements of $A$ in terms of def-use relations. We define a partial order over elements of $C$ as well, in terms of def-use relations in the concrete domain $C$. We prove two theorems that state (informally) that the schedule corresponding to the original program computes the same result as the schedule corresponding to the vectorized program.

**MPC Source Syntax.** Fig. 2 states the syntax and linearization semantics of MPC Source. Although notation is heavy, the linearization simply produces schedules as discussed in §3; the iterative MPC Source gives rise to what we call sequential schedule where loops are unrolled and MPC Source with vectorized dimensions gives rise to what we called parallel schedule. For simplicity, we consider only scalars and read-only arrays, however, the treatment extends to write arrays as well. $x[i, j, k]$ denotes the value of scalar variable $x$ at loop nest $i, j, k$. Upper case $f$ denotes a vectorized dimension and lower case $i, k$ denote iterative dimensions. Our compiler imposes semantic restrictions over the syntax: (1) $x$ is treated as a 3-dimensional array and (2) $x[i, j, k]$ must be enclosed into for-loops on non-vectorized dimensions $i$ and $k$.

```
for i in range(l):
    ...
for k in range(K):
    ...
```
We extend the definition to def-use edges in the obvious way: a
written for loop unrolls the loop.

\[ y(s) = y(s_1) \land y(s_2) \]

\[ y(x[i,j,k] = \text{op SIMD}(y_1[i,j,k], y_2[i,j,k])) = \]
\[ x[i,0,k] = y_1[i,0,k] \land y_2[i,0,k] \]
\[ x[i,1,k] = y_1[i,1,k] \land y_2[i,1,k] \]
\[ x[i,J−1,k] = y_1[i,J−1,k] \land y_2[i,J−1,k] \]
\[ \text{operation} \]
\[ \text{constant} \]
\[ \text{pseudo PHI} \]
\[ \text{drop dimension(s)} \]
\[ \text{loop} \]
\[ y(\text{for i in range}(l): s) = \]
\[ y(s)[0/1] \land y(s)[1/1] \land \ldots \land y(s)[J−1/1] \]

Figure 2: MPC Source Syntax and Semantics. \( y \) defines the semantics of MPC source which is a linearization of MPC Source. A
SIMD operation parallelizes operations across the vectorized
\[ \gamma(s_1) = \gamma(s_2) \]

for loop unrolls the loop. \( ; \) denotes sequential execution. Iterative MPC Source trivially extends to non-vectorized dimensions
over the enclosing loops.

Partial Orders. For each MPC Source program \( a \) we compute the def-use edges in the standard way: if statement \( s_1 \in a \) defines
variable \( x \), e.g., \( x[i,j,k] = \ldots \), and statement \( s_2 \in a \) uses \( x \), e.g.,
\[ x[i,j,k] = \text{PHI}(x'[i,j,k], x_2[i,j,k]) \]
\[ x[i,j] = \text{drop_dim}(x'[i,j,k], k) \]
\[ \gamma(\text{for i in range}(l): s) = \]
\[ y(s)[0/1] \land y(s)[1/1] \land \ldots \land y(s)[J−1/1] \]

\[ y(a_0) \text{ is carried out by } y(a_1) \text{ as well, computing the same result.} \]
\[ \text{Theorem 1 states that a def-use edge in MPC Source } a_0 \text{ gives}
\]
Consider a loop $j$ enclosed in some fixed $i$. Only if an update (definition) $A_m[f(i,j,k)] = ...$, and use: ...

if $\exists i, j, j', k, k'$, s.t. $0 \leq i < I$, $0 \leq j, j' < J$, $0 \leq k, k' < K$, $j < j'$,

and $f(i, j, k) = f'(i, j', k')$ then

dep = True

end if

end for

if dep == False then

remove back edge into A’s $\phi$-node in loop $j$.

end if

end for

Consider a loop $j$ enclosed in some fixed $i$. Only if an update (definition) $A_m[f(i,j,k)] = ...$ at some iteration $j$ references the same array element as a use $... = A_m[f'(i,j,k)]$ at some later iteration $j'$, we may have a loop-carried dependence for $A$ due to this def-use pair. (In contrast, Cytron’s algorithm inserts a loop-carried dependency every time there is an array update.) The algorithm above examines all def-use pairs in loop $j$, including defs and uses in nested loops, searching for values $i, j, j', k, k'$ that satisfy $f(i, j, k) = f'(i, j', k')$. If such values exist for some def-use pair, then there is a potential loop-carried dependence on $A$; otherwise there is not and we can remove the spurious backward edge thus “freeing up” statements for vectorization.

Consider the earlier example. There is a single loop, $i$. Clearly, there is no pair $i$ and $i'$, where $i < i'$ that make $i = i'$ due to the def-use pairs of $A[6-7]$ and $6-8$. Therefore, we remove the back edge from 6 to the phi-node 2. Analogously, we remove the back edges from 7 to 3 and from 8 to 4. However, there are many values $i < i'$ that make $i = i' - 1$ and the back edge from 9 to 5 remains (def-use pairs for $D$). As a result of removing these spurious edges, Vectorization will find that statement 6 is vectorizable. Statements 7, 8 and 9 will correctly appear in the FOR loop.

Note however, that this step renders some array phi-nodes targetless. We handle target-less phi-nodes with a minor extension of Vectorization (Phase 2). First, we merge closures that update the same array. This simplifies handling of array $\phi$-nodes: if each closure is turned into a separate loop each will need to have its own array phi-node to account for the update and this would complicate the analysis. Second, we add the target-less node of array $A$ back to the closure that updates $A$ — the intuition is, even if there is no loop-carried dependence from writes to reads on $A$, $A$ is written and the write (i.e., update) cannot be vectorized; therefore, the updated array has to carry to the next iteration of the loop. Third, in cases when the phi-node remains target-less, i.e., cases when the array write can be vectorized, we have to properly remove the phi-node replacing uses of the left-hand side of the phi-node with its arguments.

5.5.2 Restricting Array Writes

For now, we restrict array updates to canonical updates. Assume (for simplicity) a two-dimensional array $A[i][j]$. A canonical update is the following:

```python
for i in range(I):
    for j in range(J):
        A[i][j] = ...
        ...
```

The update $A[i][j]$ can be nested into an inner loop and there may be multiple updates, i.e., writes to $A[i][j]$. However, update such as $A[i][j] = ...$ or $A[i][j][1] = ...$, etc., is not allowed. Additionally, while there could be several different loops that perform canonical updates, they must be of the same dimensionality; i.e., an update of higher or lower dimension, e.g., $A[i][j][k] = ...$ is not allowed. We compute the canonical dimensionality of each write array by examining the array writes in the original program and rejecting programs that violate the canonical write restriction. This restriction simplifies reasoning in this early stage of the compiler; we will look to relax the restriction in future work.

Another restriction/assumption is that we assume the output array is given as input with initial values, and it is of size consistent with its canonical dimensionality.

Reads through an arbitrary formula, such as $A[i-1]$ for example, are allowed; currently, the projection function returns dummy values if the read formula is out of bounds; we assume the programmer ensures that the program still computes correct output in this case.

5.5.3 Changes to Basic Vectorization

In addition to the changes for the handling of target-less phi-nodes, Basic Vectorization has to handle def-use edges $X \rightarrow Y$ where $X$ defines and $Y$ uses an array variable. The definition can be an update $A_2 = \text{update}(A_1, i...)$, a pseudo $\phi$-node $A_2 = \text{PHI}(A_0, A_1)$, etc. Note that $\phi$-nodes for arrays have no subscript operations the way there are subscript operations in analysis-introduced arrays representing scalars. While there are variations, the most intuitive implementation will perform Basic Vectorization Phase 1 as is, inserting raise_dim and drop_dim in the same way. However, the implementation of raise dimension and drop dimension will be adapted because the dimension cannot be raised or dropped to a dimension lower than the canonical one. Consider a def-use edge $X \rightarrow Y$ for an array $A$.

1. same-level $X \rightarrow Y$. Do nothing, propagate the array, which happens to be of the right dimension.
2. inner-to-outer $X \rightarrow Y$ triggers the addition of drop_dim. However, the dimensionality cannot be dropped below the canonical dimensionality of the array. E.g., if the dimensionality of the loop enclosure $X$ is already at the canonical one, then drop_dim has no effect.
3. outer-to-inner $X \rightarrow Y$ triggers raise_dim. Again, if the dimensionality of the loop enclosure of $Y$ is smaller or same as the canonical dimensionality of the array, then it has no effect, otherwise, if dimensionality is greater than the canonical dimensionality, raise_dim(...) (at $X$) is the same as in Basic Vectorization.
4. "mixed" $X \rightarrow Y$. We assume that the mixed edge is transformed into an inner-to-outer followed by outer-to-inner edge before we perform vectorization, just as with Basic vectorization.

If the use of the array is a read $A[f(i,j,k)]$ different than a canonical read $A[i,j,k]$, then we need to add a reshape operation as all arrays are $A[i,j,k]$. It can be added after raise_dim/drop_dim or incorporated in these operations. The bulk of the change is in Phase 2 of Vectorization as outlined earlier.
5.5.4 Examples with Array Writes

Example 1. First, the canonical dimensionality of all A,B,C and D is 1. After Phase 1 of Vectorization the Aiken’s array write example will be (roughly) as follows:

\[
\begin{align*}
1 & \quad \text{for } i \in \text{range}(N): \\
2 & \quad A_1 = \text{PHI} \left( A_0 \cdot A_2 \right) \\
3 & \quad B_1 = \text{PHI} \left( B_0 \cdot B_2 \right) \\
4 & \quad C_1 = \text{PHI} \left( C_0 \cdot C_2 \right) \\
5 & \quad D_1 = \text{PHI} \left( D_0 \cdot D_2 \right) \\
6 & \quad A_2 = \text{update} \left( A_1, i, B_1[I] \cdot 10 \right) \\
7 & \quad B_2 = \text{update} \left( B_1, i, A_2[I] \cdot D_1[I-1] \right) \\
8 & \quad C_2 = \text{update} \left( C_1, i, A_2[I] \cdot D_1[I-1] \right) \\
9 & \quad D_2 = \text{update} \left( D_1, i, B_2[I] \cdot C_2[I] \right) \\
\end{align*}
\]

Note that since all def-uses are same-level (i.e., reads and writes of the array elements) no raise dimension or drop dimension happens.

Phase 2 computes the closure of \(cl = \{5,7,8,9\}\) while 6 is vectorizable. Recall that 2,3, and 4 are target-less phi-nodes. Since the closure \(cl\) includes updates to \(B\) and \(C\), the corresponding phi-nodes are added back to the closure and the def-use edges are added back to the target-less nodes. The uses of \(A_1\) and \(B_1\) in the vectorized statement turn into uses of \(A_0\) and \(B_0\) respectively; this is done for all original target-less phi-node. (But note that \(A_0\) is irrelevant; the update writes into array \(A_2\) in parallel.) Finally, the target-less phi-node for \(A\) is discarded.

\[
\begin{align*}
1 & \quad A_2 = \text{update} \left( A_0, i, ADD \cdot SIMD(B_0[I],10) \right) \\
2 & \quad \text{equiv. to } A_2[I] = ADD \cdot SIMD(B_0[I],10) \\
3 & \quad \text{for } i \in \text{range}(N): \text{// MOTION loop} \\
4 & \quad B_1 = \text{PHI} \left( B_0 \cdot B_2 \right) \\
5 & \quad C_1 = \text{PHI} \left( C_0 \cdot C_2 \right) \\
6 & \quad D_1 = \text{PHI} \left( D_0 \cdot D_2 \right) \\
7 & \quad B_2 = \text{update} \left( B_1, i, A_2[I] \cdot D_1[I-1] \right) \\
8 & \quad \text{equiv. to } B_2[I] = A_2[I] \cdot D_1[I-1] \\
9 & \quad C_2 = \text{update} \left( C_1, i, A_2[I] \cdot D_1[I-1] \right) \\
10 & \quad D_2 = \text{update} \left( D_1, i, B_2[I] \cdot C_2[I] \right) \\
\end{align*}
\]

Example 2. Now consider the MPC Source of Histogram:

\[
\begin{align*}
1 & \quad \text{for } i \in \text{range}(0, \text{num_bins}): \\
2 & \quad \text{res1} = \text{PHI} \left( \text{res}, \text{res2} \right) \\
3 & \quad \text{for } j \in \text{range}(0, \text{N}): \\
4 & \quad \text{res2} = \text{PHI} \left( \text{res1}, \text{res3} \right) \\
5 & \quad \text{tmp1} = \left( A[I] == 1 \right) \\
6 & \quad \text{tmp2} = \left( \text{res2}[I] \cdot B[I] \right) \\
7 & \quad \text{tmp3} = \text{MUX} \left( \text{tmp1}, \text{res2}[I], \text{tmp2} \right) \\
8 & \quad \text{res3} = \text{Update} \left( \text{res2}, i, \text{tmp3} \right) \\
9 & \quad \text{return} \text{res1} \\
\end{align*}
\]

The canonical dimensionality of \(res\) is 1. Also, the phi-node \(res1\) is a target-less phi-node (the implication being that the inner loop can be vectorized across \(i\)). After Phase 1, Vectorization produces the following code (statements are implicitly vectorized along \(i\) and \(j\)). In a vectorized update statement, we can ignore the incoming array, \(res2\) in this case. The update writes (in parallel) all locations of the 2-dimensional array, in this case it sets up each \(res3[I,j] = \text{tmp3}[I,j]\).

\[
\begin{align*}
1 & \quad A_1 = \text{raise} \cdot\text{dim} \left( A, j, \left( (\text{num_bins}), (j,N) \right) \right) \\
2 & \quad B_1 = \text{raise} \cdot\text{dim} \left( B, j, \left( (\text{num_bins}), (j,N) \right) \right) \\
3 & \quad \text{for } i \in \text{range}(0, \text{num_bins}): \\
4 & \quad \text{res1} = \text{PHI} \left( \text{res}, \text{res2}^2 \right) \# \text{target-less phi-node} \\
5 & \quad \text{res1}^\nu = \text{raise} \cdot\text{dim} \left( \text{res1}, (j,N) \right) \\
6 & \quad \text{for } j \in \text{range}(0, \text{N}): \\
7 & \quad \text{res2} = \text{PHI} \left( \text{res1}^\nu, \text{res3} \right) \\
8 & \quad \text{tmp1} = \left( A[I] == 1 \right) \\
9 & \quad \text{tmp2} = \left( \text{res2} + \text{B1} \right) \\
10 & \quad \text{tmp3} = \text{MUX} \left( \text{tmp1}, \text{res2}, \text{tmp2} \right) \\
11 & \quad \text{res3} = \text{Update} \left( \text{res2}, (I,J), \text{tmp3} \right) \\
12 & \quad \text{res1}^\nu = \text{drop} \cdot\text{dim} \left( \text{res2} \right) \\
13 & \quad \text{return} \text{res1}^\nu \\
\end{align*}
\]

Processing the inner loop in Phase 2 vectorizes \(\text{tmp1} = \left( A[I] == 1 \right)\) along the \(j\) dimension but leaves the rest of the statements in a MOTION loop. Processing the outer loop is interesting. This is because the PHI node is a target-less node, and therefore, there are no closures! Several things happen. (1) Everything can be vectorized along the \(i\) dimension. (2) We remove the target-less PHI node, however, we must update uses of \(res1\) appropriately: the use at \(\text{raise} \cdot\text{dim}\) goes to the first argument of the PHI function and the use at \(\text{drop} \cdot\text{dim}\) goes to the second argument.

\[
\begin{align*}
1 & \quad A_1 = \text{raise} \cdot\text{dim} \left( A, j, \left( (\text{num_bins}), (j,N) \right) \right) \\
2 & \quad B_1 = \text{raise} \cdot\text{dim} \left( B, j, \left( (\text{num_bins}), (j,N) \right) \right) \\
3 & \quad \text{for } i \in \text{range}(0, \text{N}): \\
4 & \quad \text{res2} = \text{PHI} \left( \text{res1}^\nu, \text{res3} \right) \\
5 & \quad \text{tmp2}[I,I] = \left( \text{res2}[I,I] + \text{B1}[I,I] \right) \\
6 & \quad \text{tmp3}[I,J] = \text{MUX} \left( \text{tmp1}[I,J], \text{res2}[I,J], \text{tmp2}[I,J] \right) \\
7 & \quad \text{res3} = \text{Update} \left( \text{res2}, (I,J), \text{tmp3} \right) \\
8 & \quad \text{return} \text{res2}^2 \\
\end{align*}
\]

6 COMPILER BACK END

MOTION code generation requires that variables are marked as plain or shared following the type system in §4.1. We require that all inputs are marked as either shared or plaintext, however, we infer qualifiers for the rest of the variables. §6.1 briefly describes the taint analysis and §6.2 describes MOTION code generation.

6.1 Taint Analysis

The taint analysis works on MPC Source, which lacks if-then-else control flow. This significantly simplifies treatment as there is no need to handle conditionals and implicit flow. Specifically, the compiler uses the following rules, which are standard in positive-negative qualifier systems (here shared is the positive qualifier and plain is the negative one):

(1) Loop counters are always plain.
6.2 From (Optimized) MPC Source to MOTION

MOTION supports FOR loops and SIMD operations, so translation from MPC source to MOTION C++ code is relatively straightforward.

Variable declarations: Our generated C++ uses the following variable-naming scheme: shared variables are named the same as in the MPC Source with the ! replaced with an underscore (e.g. sum!2 would be translated to sum_2). Plaintext variables follow the same naming convention as shared variables but are prefixed with _MPC_PLAINTEXT_. The shared representation of constants are named _MPC_CONSTANT_ followed by the literal constant (e.g. the shared constant 0 would be named _MPC_CONSTANT_0).

The generated MOTION code begins with the declaration of all variables used in the function, including loop counters. If a variable is a vectorized array, it is initialized to a correctly-sized array of empty MOTION shares. Additionally, each plaintext variable and parameter has a shared counterpart declared. Next, all constant values which are used as part of shared expressions are initialized as a shared input from party 0. Finally, plaintext parameters are converted used as shared inputs from party 0 to initialize their shared counterparts.

Code generation: Once the function preamble is complete, the MPC Source is translated into C++ one statement at a time. The linear structure of MPC Source enables this approach to translation. If there is no vectorization present in a statement, translation to C++ is straightforward: outside of MUX statements and array updates, non-vectorized assignments, expressions, and returns directly translate into their C++ equivalents. Non-vectorized MUX statements are converted to MOTION’s MUX member function on the condition variable. Array updates are translated into two C++ assignments: one to update the value in the original array and one to assign the new array as shown in Listing 2.

MPC FOR loops are converted to C++ FOR loops which iterate the loop counter over the specified range. Pseudo PHI nodes are broken into two components: the “FALSE” branch which assigns the initial value of the PHI node and the “TRUE” branch which assigns the PHI node’s back-edge. The assignment of the “FALSE” branch occurs right before the PHI node’s enclosing loop. As these assignments may rely on the loop counter, the loop counter is initialized before these statements. Inside of the PHI node’s enclosing loop, a C++ if statement is inserted to only assign the true branch of the PHI node after the first iteration. Listing 3 illustrates this translation.

Vectorization and SIMD operations: Vectorization is handled with utility functions to manage accessing and updating slices of arrays. All SIMD values are stored in non-vectorized form as 1-dimensional std::vectors in row-major order. Whenever a SIMD value is used in an expression, the utility function vectorized_access() takes the multi-dimentional representation of a SIMD value, along with the size of each dimension and the requested slice’s indices, and converts that slice to a MOTION SIMD value. Because MOTION supports SIMD operations using the same C++ operators as non-SIMD operations, we do not need to perform any other transformations to the expression. Therefore, once vectorized accesses are inserted the translation of an expression containing SIMD values is identical to that of expressions without SIMD values.

Similarly, the vectorized_assign() function assigns a potentially SIMD value to a slice of a vectorized array. This operation cannot be done with a simple subscript as SIMD assignments will update a range of values in the underlying array representation.

Updating SIMD arrays is also implemented differently from updating non-vectorized arrays. Instead of separating the array update from the assignment of the new array, these steps are combined with the vectorized_update() utility function. This function operates identically to vectorized_assign(), however it additionally returns the array after the assignment occurs. This value is then used for the assignment to the new variable. Listing 4 illustrates vectorized_assign() and vectorized_update() on the Biometric example.

Reshaping and raising dimensions: Raising the dimensions of a scalar or array uses the lift() utility function which takes a lambda for the raised expression and the dimensions of the output. This function is also used for the scalar expansion of values which have been lifted out of FOR loops as described in §5.2. This function evaluates the expression for each permutation of indices along the dimensions and returns the resulting array in row-major order. The lambda accepts an array of integers representing the index along each of the dimensions being raised, and the translation of the expression which is being raised replaces each of the dimension index variables with the relevant subscript of this array. There is also a special case of the lift() function which occurs when we are raising an array. In this case, instead of concatenating the array for each index, we extend the array along all dimensions being raised which are not present in the array already. For example, when raising an array with dimensions \( N \times M \) to an array with dimensions \( N \times M \times D \), the input array will simply be extended along the \( D \) dimension: \( A'[n, m, d] = A[n, m] \) for every \( d \). If the input array is already correctly sized it will be returned as-is.

Dropping dimensions use the drop_dim() and drop_dim_monoreturn() utility functions. They function identically but the latter returns a scalar for the case when the final dimension of an array is dropped. These functions take the non-vectorized representation of an array, along with the dimensions of that array, and return the array with the final dimension dropped.
Variables are often initialized to a common constant value (e.g. 0), generated vectorized MOTION code often includes multiple copies and from SIMD values which our utility functions perform, our loop counters are only used as plaintext. This prevents unnecessary increase in the number of input gates when i.e., when the counter flows to a shared computation. This is to be used, however we only generate this conversion when necessary, still be converted to a shared value on each iteration that they are a shared input for each initialization constant. Loop counters must this approach decreases the number of input gates by only creating the ciphertext value as a shared input from party 0. Due to this limitation, our translation to MOTION code attempts to minimize the number of conversions from plaintext values to shares are performed by providing support publicly-known constants for these protocols. So all conversions from plaintext to shares in MOTION do not implement all operations for other protocols. MOTION does not support publicly-known constants for these protocols, so all conversions from plaintext values to shares are performed by providing the plaintext value as a shared input from party 0. Due to this limitation, our translation to MOTION code attempts to minimize the number of conversions from a plaintext value. This is accomplished by creating a shared copy of each plaintext variable and updating that copy in lock-step with the plaintext variable. Since variables are often initialized to a common constant value (e.g. 0), this approach decreases the number of input gates by only creating a shared input for each initialization constant. Loop counters must still be converted to a shared value on each iteration that they are used, however we only generate this conversion when necessary, i.e., when the counter flows to a shared computation. This is to prevent unnecessary increase in the number of input gates when loop counters are only used as plaintext.

Due to the SSA translation phase as well as the conversions to SIMD values which our utility functions perform, our generated vectorized MOTION code often includes multiple copies of arrays and scalar values. These copies do not incur a runtime cost as the arrays simply hold pointers to the underlying shares, so no new shares or gates are created as a result of this copying. Cost in MPC programs is dominated by shares and computation on shares.

### Table 2: MOTION Translation: Array Updates

<table>
<thead>
<tr>
<th>IMP Source</th>
<th>MPC Source</th>
<th>MOTION Code</th>
</tr>
</thead>
<tbody>
<tr>
<td>A[i] = val</td>
<td>A_1[i] = val;</td>
<td>A_1[i] = val;</td>
</tr>
<tr>
<td>update(A1, i, val)</td>
<td>A_2 = A_1;</td>
<td>A_2 = A_1;</td>
</tr>
</tbody>
</table>

### Table 3: MOTION Translation: FOR loop with Phi nodes

<table>
<thead>
<tr>
<th>MPC Source</th>
<th>MOTION Code</th>
</tr>
</thead>
<tbody>
<tr>
<td>tmp = PHI(arr[i], val0)</td>
<td>PHI(arr[i], val0)</td>
</tr>
<tr>
<td>_MPC_PLAINTEXT_i = 0;</td>
<td>_MPC_PLAINTEXT_i = 0;</td>
</tr>
<tr>
<td>_MPC_PLAINTEXT_i += 1;</td>
<td>_MPC_PLAINTEXT_i += 1;</td>
</tr>
<tr>
<td>for (; _MPC_PLAINTEXT_i &lt; _MPC_PLAINTEXT_N; _MPC_PLAINTEXT_i++)</td>
<td>for (; _MPC_PLAINTEXT_i &lt; _MPC_PLAINTEXT_N; _MPC_PLAINTEXT_i++)</td>
</tr>
<tr>
<td>if (_MPC_PLAINTEXT_i != 0)</td>
<td>if (_MPC_PLAINTEXT_i != 0)</td>
</tr>
<tr>
<td>if (_MPC_PLAINTEXT_i != 0)</td>
<td></td>
</tr>
<tr>
<td>tmp = val0;</td>
<td>tmp = val0;</td>
</tr>
<tr>
<td>...</td>
<td>...</td>
</tr>
</tbody>
</table>

### Table 4: MOTION Translation: Assignment to SIMD value

<table>
<thead>
<tr>
<th>MPC Source</th>
<th>MOTION Code</th>
</tr>
</thead>
<tbody>
<tr>
<td>vectorized_assign(sum_4, _MPC_PLAINTEXT_N), [true],</td>
<td>vectorized_assign(sum_4, _MPC_PLAINTEXT_N), [true],</td>
</tr>
<tr>
<td>vectorized_access(sum_3, _MPC_PLAINTEXT_N), [true],</td>
<td>vectorized_access(sum_3, _MPC_PLAINTEXT_N), [true],</td>
</tr>
<tr>
<td>vectorized_access(p, _MPC_PLAINTEXT_A, _MPC_PLAINTEXT_B), [true, false],</td>
<td>vectorized_access(p, _MPC_PLAINTEXT_A, _MPC_PLAINTEXT_B), [true, false],</td>
</tr>
<tr>
<td>raise_dimensions(i, j, (i:N, j:M))</td>
<td>raise_dimensions(i, j, (i:N, j:M))</td>
</tr>
<tr>
<td>lift(std::function([&amp;](const std::vector<a href="">std::uint32_t</a> &amp;idxs) {return idxs[0] + idxs[1];}),</td>
<td></td>
</tr>
<tr>
<td>_MPC_plain_text_N, _MPC_plain_text_M))</td>
<td></td>
</tr>
</tbody>
</table>

### Table 5: MOTION Translation: Raising dimensions

We tested our framework with several benchmarks. For the multiparty computation (MPC), we restricted our evaluation to 2 party computation (2PC) setting because it requires fewer computing resources. We stress that there is no such inherent restriction in our framework. We use hardware resources provided by CloudLab[17] and consider two network settings, namely Local Area Network (LAN) and Wide Area Network (WAN). In the LAN setting, we use 16-core AMD 7302P 3.0GHz processors and 128GB of RAM. The connection between these machines had 10Gbps bandwidth and sub-millisecond latency. This setting reflects typical LAN use.
case considering that 10Gbps LAN is increasingly common in business networks and is now available even in some home networks. In the WAN setting we again used a c6525-25g machine (located in Utah, US) for the first party and a c228g1 machine (located in Wisconsin, US) for the second. The c228g1 machine is equipped with two Intel E5-2630 8-core 2.4GHz processors and 128GB of RAM. We measured the connection bandwidth between these machines to be 560Mbps and average round trip time (RTT) to be 38ms. At the time of this writing, all major internet providers in the US offer 1Gbps connections to home consumers, therefore this setting reasonably reflects the typical WAN use case.

We run all experiments 5 times and report average values of various metrics. Note that standard deviation — shown as error bar evaluation:

7.2 Benchmarks

In a preliminary experiment we ran various metrics. Note that standard deviation — shown as error bar evaluation:

7.2 Benchmarks

In the following, we say both for an experiment in which we execute both non-vectorized and vectorized protocols and vec for the vectorized only experiment. In a preliminary experiment we ran N=2, N=4, all the way to N=4096; the non-vectorized version ran out of memory at the value both and we fixed this value for these experiments. The vectorized one completed for N=4096 for nearly all benchmarks (numbers are not shown for space reasons; however, they times are largely consistent, e.g., if both is N = 2^k and it completes in X seconds for vectorized, then N = 2^{12} completes in (12−k)X seconds). We used the following benchmarks in our evaluation:

(1) Biometric Matching: Server has a database S of N records, each record’s dimension is D. Client submits a query C, client and server compute the closest record to C in an MPC. We use N=128 for both and N=4096 for vec. D is fixed at 4.

(2) Convex Hull: Given a polygon of N vertices (split between Alice and Bob), convex hull is computed in an MPC. It is adapted from [19]. We use N=32 for both experiment and N=256 for vec experiment.

(3) Count 102: Alice has a string of length N of symbols, Bob has a regular expression of the form 1(0^*)2, together they compute number of substrings that match the regular expression. It is adapted from [19]. We use N=1024 for both and N=4096 for vec.

(4) Count 10: Same as Count 102 except now the regular expression is of the form 1(0^+). Parameters are same as above.

(5) Cryptonets Max Pooling: Given a matrix of rows × cols elements that are split between Alice and Bob, they compute the max pooling subroutine of the cryptonet benchmark[16]. We use rows=64, cols=64 for both experiment.

(6) Database Join: given two databases with A and B 2-element records, compute cross join. We use A=B=32 for both and A=B=64 for vec.

(7) Database Variance given a database of len records, compute variance. We use len=512 for both and len=4096 for vec.

(8) Histogram: Given N 5-star ratings, compute their histogram, taken from [22, 19]. We use N=512 for both and N=4096 for vec.

(9) Inner Product: given two vectors, each of N elements, compute their inner product. We use N=512 for both and N=4096 for vec.

(10) k-means Iteration: performs the iteration subroutine of k-means database clustering operation [23, 33]. Here len1 is the size of input data, and len2 is the number of clusters. We use len1=32, len2=5 for both and len1=256, len2=8 for vec.

(11) Longest 102: Similar to Count 102 except that it computes the largest substring matching the regular expression. We use same parameters as Count 102, adapted from [19].

(12) Max Distance b/w Symbols Alice has a string of N symbols and Bob has some symbol b. The MPC computes the maximum distance between bs in the string. We adapted it from [19]. We use N=1024 for both and N=2048 for vec.

(13) Minimal Points Given a set of N points (split between Alice and Bob), a set of minimal points is computed i.e. there is no other point that has both a lower x and y coordinate, adapted from [19]. We use N=32 for both and N=64 for vec.

(14) MNIST ReLU given an input of outer × inner elements, executes the MNIST ReLU subroutine. We use inner=512 for both and inner=2048 for vec. outer is fixed at 16.

(15) Private Set Intersection (PSI) Alice holds set S1 with size SA, Bob holds set S2 with size SB, together they compute intersection of their sets. We use SA=SB=128 for both and SA=SB=1024 for vec.

7.3 Results and Analysis

A detailed summary of the effects of vectorization on various benchmarks is presented in Table 6. We show circuit evaluation times in Fig. 3. In terms of amenability to vectorization, we divide benchmarks into 3 categories: 1) High: these include convex hull, cryptonets max pooling, minimal points and private set intersection. These benchmarks are highly parallelizable and see 25x to 70x speedup in BMR, and 30x to 55x in GMW protocol. 2) Medium: these include biometric matching, DB Variance, histogram, inner product, k-means iteration and MNIST ReLU. These benchmarks have non-parallelizable phases e.g. the summing phase of inner product and biometric matching. Still, most computation is parallelizable and it results in speedup from 5x to 25x in BMR, and 2x to 25x in GMW protocol. 3) Low: these include the Database Join and the regular expression benchmarks (count 102, count 10, longest 102 and max distance between symbols). There is less parallelizable computation in these programs, thus the speedup is lower. We see a speedup from 1.1x to 2x in BMR. In GMW, DB Join, Count 102 and Count 10s see speedup from 1.1x to 1.3x. However, longest102 and max distance between symbols suffer a slowdown of 0.5x. There is opportunity for vectorization in these benchmarks according to our analytical model, particularly, a there is a large EQ that is vectorized, although a large portion of the loop cannot be vectorized. We observed that transformation to vectorized code increased multiplicative depth and, the negative effect of increased depth is more noticeable in a round-based protocol like GMW. The cause of the increase is not clear — we conjecture that MOTION performs optimizations over the non-vectorized loop body that decreases depth; also, EQ is relatively inexpensive in Boolean GMW and BMR.
We look closely at the Biometric Matching benchmark: circuit evaluation in Fig. 4, communication size in Fig. 5 and circuit generation time in Fig. 6. For input size beyond N=128 the memory usage exceeds available memory and prevents circuit generation. Consequently, non-vectorized bars are missing beyond this threshold. Notice that vectorization improves all metrics. Comparing performance improvement between BMR and GMW, we see more speedup for BMR (23x vs 10x), GMW gets more communication size reduction (10x vs 2.5x) and circuit generation sees a speedup of 35x and 45x for BMR and GMW respectively.

Since our vectorization framework is network agnostic, it produces the same circuit for both LAN and WAN. This means that the performance improvement between BMR and GMW, we see more speedup for BMR (23x vs 10x), GMW gets more communication size reduction (10x vs 2.5x) and circuit generation sees a speedup of 35x and 45x for BMR and GMW respectively.

Table 6: Vectorized vs Non-Vectorized Comparison, times in seconds (in LAN setting where applicable), Communication in MiB, Numbers in 1000s, values rounded to nearest integer, benchmark names ending in V are vectorized.

<table>
<thead>
<tr>
<th>Benchmark</th>
<th>GMW</th>
<th>BMR</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td>Online</td>
<td>Setup</td>
</tr>
<tr>
<td>Biometric Matching</td>
<td>146</td>
<td>16</td>
</tr>
<tr>
<td>Biometric Matching (V)</td>
<td>12</td>
<td>4</td>
</tr>
<tr>
<td>Convex Hull</td>
<td>48</td>
<td>6</td>
</tr>
<tr>
<td>Convex Hull (V)</td>
<td>0</td>
<td>1</td>
</tr>
<tr>
<td>Count 102</td>
<td>79</td>
<td>6</td>
</tr>
<tr>
<td>Count 102 (V)</td>
<td>71</td>
<td>5</td>
</tr>
<tr>
<td>Count 10s</td>
<td>79</td>
<td>6</td>
</tr>
<tr>
<td>Count 10s (V)</td>
<td>71</td>
<td>4</td>
</tr>
<tr>
<td>Cryptonets (Max Pooling)</td>
<td>50</td>
<td>11</td>
</tr>
<tr>
<td>Cryptonets (Max Pooling) (V)</td>
<td>1</td>
<td>1</td>
</tr>
<tr>
<td>Database Join</td>
<td>70</td>
<td>8</td>
</tr>
<tr>
<td>Database Join (V)</td>
<td>54</td>
<td>6</td>
</tr>
<tr>
<td>Database Variance</td>
<td>166</td>
<td>18</td>
</tr>
<tr>
<td>Database Variance (V)</td>
<td>37</td>
<td>6</td>
</tr>
<tr>
<td>Histogram</td>
<td>94</td>
<td>10</td>
</tr>
<tr>
<td>Histogram (V)</td>
<td>33</td>
<td>5</td>
</tr>
<tr>
<td>Inner Product</td>
<td>127</td>
<td>15</td>
</tr>
<tr>
<td>Inner Product (V)</td>
<td>16</td>
<td>5</td>
</tr>
<tr>
<td>k-means</td>
<td>108</td>
<td>12</td>
</tr>
<tr>
<td>k-means (V)</td>
<td>6</td>
<td>3</td>
</tr>
<tr>
<td>Longest 102</td>
<td>93</td>
<td>7</td>
</tr>
<tr>
<td>Longest 102 (V)</td>
<td>169</td>
<td>6</td>
</tr>
<tr>
<td>Max. Dist. b/w Symbols</td>
<td>71</td>
<td>8</td>
</tr>
<tr>
<td>Max. Dist. b/w Symbols (V)</td>
<td>166</td>
<td>7</td>
</tr>
<tr>
<td>Minimal Points</td>
<td>35</td>
<td>5</td>
</tr>
<tr>
<td>Minimal Points (V)</td>
<td>0</td>
<td>1</td>
</tr>
<tr>
<td>MNIST ReLU</td>
<td>132</td>
<td>31</td>
</tr>
<tr>
<td>MNIST ReLU (V)</td>
<td>3</td>
<td>3</td>
</tr>
<tr>
<td>Private Set Intersection</td>
<td>95</td>
<td>9</td>
</tr>
<tr>
<td>Private Set Intersection (V)</td>
<td>1</td>
<td>2</td>
</tr>
</tbody>
</table>

compared to ADD and MUL, which also de-emphasizes the benefit of vectorization. We propose a simple heuristic although we do leave all the benchmarks in the table: if the transformation increases circuit depth beyond some threshold (e.g. more than 10% of the original circuit), we can reject the transformation. Note that in some settings it may still be desirable to vectorize e.g. in data constrained environments. As shown in Fig. 8, vectorization results in reduced communication (fewer bits are transferred).

We present evaluation for communication size in Fig. 8, circuit generation time in Fig. 9, number of gates in Fig. 10 and online time and setup time in Fig. 11 and Fig. 12 respectively.
The number of gates and communication size remain the same. Moreover, time for circuit generation, which is a local operation, also remains unchanged. Setup and Online times, however, increase due to lower bandwidth and higher latency of the WAN. Indeed, this is what we observe in Fig. 7.

7.4 Comparison with MOTION-native Inner Product

Finally, we compared our automatically generated routine for Inner Product with the manually SIMD-ified MOTION-native routine in the distribution. We were surprised that we were an order of
magnitude slower in Boolean GMW as our circuit ran a significantly larger number of communication rounds. Upon investigation, it turned out that the vectorized multiplications were essentially the same, however, our addition loop incurred significant cost (ADD is non-local and expensive in Boolean GMW). The MOTION-native loop ran result += mult_unsimdified[i]; while our loop generated
and ran result[i] = result[i-1] + mult_unsimdified[i]; recall that the scalar expansion is an artifact of our vectorization. We rewrote the accumulation (manually, for testing purposes) and that led to the comparable speedup!

MOTION’s compiler performs analysis that informs circuit generation and the example illustrates the power of the analysis. In the above example, MOTION does the standard divide-and-conquer accumulation. The poor performance of our loop is not a limitation of our approach, but it could be a limitation of our current implementation. Recall that the next phase of Vectorization, in §5, which we have not implemented yet, gets rid of redundant dimensions and generates if (i!=0) { result_prev = result; } result = result_prev + mult_unsimdified[i]. And while it is unrealistic to expect that MOTION’s static analysis will detect the associative accumulation in the scalar expansion code, it is realistic to expect that it will in the above only slightly more complex code. It still might lead to confusion in the static analysis as analysis on the AST is difficult (unfortunately, we ran out of time and did not experiment further).

We conjecture that MPC Source, a straight-forward representation, will not only enable detection of general associative loops, but also allow for program synthesis to increase opportunities for divide-and-conquer parallelization [19]; we leave this as future work.

8 RELATED WORK

MPC languages and compilers. Languages and compilers for secure computation have seen significant attention and advances in recent years. The early MPC compilers Fairplay [5], and Sharemind [9]
were followed by PICCO [37], Obliv-C [36], TinyGarble [32], Wystiria [31], and others. A new generation of MPC compilers includes SPDZ/SCALE-MAMBA/MP-SPDZ [25] and the ABY/HyCC/MOTION [15, 13, 11] frameworks. These two families are the state-of-the-art and are actively developed. Another recent development is Viaduct, a functional language and compiler that supports a range of secure computation frameworks, including MPC and ZKP. Hastings et al. present a review of compiler frameworks [21].

While each of these languages and compilers brings in new ideas and advances, none addresses the problem of “circuit independent” intermediate representation and optimization. We envision a classical compiler structure: (1) a Wystiria, Viaduct, Obliv-C, or IMP Source front end, including rich type systems and AST-level semantic analysis, compile into the MPC Source IR, (2) MPC Source-level optimizations take place, followed by (3) back-end compilers into circuits. Our focus is at the intermediate level.

Many works focus on the implementation of MPC protocols exposing an API to the programmer. For example, the ABY/MOTION line of compiler frameworks provides a library of MPC primitives; the programmer writes MPC programs in C++ on top of the library. These back ends implement different protocols and allow for mixing, but notably, they leave it to the programmer to assign different protocols to different parts of the computation and perform share conversion accordingly. In addition, MOTION provides SIMD primitives, which allows for efficient execution of MPC operations, but again, using SIMD primitives is the responsibility of the programmer. There is interest in frameworks for automatic mixing, e.g., [13, 22, 18].

Other works, e.g., Obliv-C [36], Wystiria [31] and Viaduct [1] focus on higher-level language design, particularly information-flow systems that restrict flow between secure and insecure parts of the program. OblivVM [26] has similar goals to ours — it optimizes oblivious (i.e., secure) source code into an efficient garbled circuit. Our work complements OblivVM, in the sense that while OblivVM relies on programmer annotations such as map-reduce constructs, we automatically detect opportunities for optimization at an intermediate level of representation. We have implemented vectorization, however, divide-and-conquer is a natural next step. Ozdemir et al. [30] develop CirC, an IR with backends into zero-knowledge proof primitives as well as SMT primitives. Our work focuses on MPC, which CirC does not support yet. MPC Source is a higher level of representation than CirC, e.g., it does not unroll loops, thus presenting a smaller-size input program suitable for vectorization and scalable program analysis. Overall, we view our work and CirC as orthogonal — we are focusing on analysis and optimization while CirC focuses on infrastructure.

HyCC [13] is a compiler from C Source into ABY circuits. It does source-to-source compilation with the key goal to decompose the program into modules and then assign protocols to modules. In contrast, we focus on MPC Source-level optimizations, specifically vectorization, although we envision future optimizations as well. We formalize MPC Source and reasoning about transformations, which we conjecture is more tractable than reasoning over the higher-level AST. On the other hand, HyCC does inter-procedural optimizations, while our analysis is intra-procedural.

We will explore context-sensitive inter-procedural analysis over MPC Source in future work. HyCC, similarly to Buscher [12] uses an of-the-shelf source-to-source polyhedral compiler to perform vectorization at the level of source code. The disadvantage of using an of-the-shelf source-to-source compiler is that it solves a more general problem than what MPC presents and may forgo opportunities for optimization — concretely, it is well-known that vectorization and polyhedral compilation do not work well with conditional [7, 24]. In contrast, we consider vectorization at the level of MPC Source which linearizes conditionals; we are able to handle programs with interleaved if-statements and for-loops and achieve significant speedup.

**Classical HPC compilers.** Automatic vectorization is a longstanding problem in high-performance computing (HPC). There are thousands of works in this area reflecting over 40 years of research. We presented a vectorization algorithm for MPC Source, essentially extending classical loop vectorization [3]. In HPC vectorization, conditional control flow presents a challenge — one cannot estimate the cost of a schedule or vectorize branches in a straightforward manner — in contrast to MPC Source vectorization. We view Karrenberg’s work on Whole function vectorization [24] as most closely related to ours — it linearizes the program and vectorizes both branches of a conditional applying masking to avoid execution of the branch-not-taken code, and selection (similar to MUX) to select the correct value based on the result of the condition at runtime. The problem is that masking and selection, or more generally, handling control predicates [7, 24], can lead to slowdown.

We argue that vectorization over linear MPC Source is a different problem, one that warrants a new look, while drawing from results in HPC. Since both branches of the conditional and the multiplexer always execute, not only can we apply aggressive vectorization on linear code, but (perhaps more importantly) we can also build analytical models that accurately predict execution time. These models in turn would drive optimizations such as vectorization, protocol mixing, and others. Vectorization meshes in with those additional optimizations in non-trivial ways.

Furthermore, extensions of classical loop vectorization with array writes, arbitrary indexing, including non-affine indexing, and interaction with SSA are non-trivial and present novel challenges and opportunities for contribution. Polyhedral parallelization [7] considers a higher-level source (typically AST) representation, while our work takes advantage of linear MPC Source and SSA form. The work by Karrenberg [24] is rare in that space, in the sense that it considers vectorization over SSA form, which has similarities to MPC Source. We consider different array representation, notion of dependence, and reasoning about dependence, which we conjecture is more suitable for MPC Source.

**9 CONCLUSION AND FUTURE WORK**

We presented a formalization of the MPC Source intermediate language followed by a specific back-end optimization at the level of MPC Source: novel SIMD-vectorization. We demonstrated that vectorization has significant impact on performance. We are excited about the opportunities for future work — integration with protocol mixing, divide-and-conquer reasoning and parallelization, as well as
as inter-procedural context-sensitive analysis at the level of MPC
Source will improve MPC programmability and efficiency.

REFERENCES

Viaduct: an extensible, optimizing compiler for secure distributed programs.
In Stephen N. Freund and Eran Yahav, editors, PLDI ’21: 42nd ACM SIGPLAN
International Conference on Programming Language Design and Implementa-
In Richard L. Wexelblat, editor, Proceedings of the ACM SIGPLAN’88 Conference
on Programming Language Design and Implementation (PLDI), Atlanta, Georgia,
Wang, editors, Thomas Schneider. Hycc: Compilation of hybrid protocols for practical secure
computation. In Proceedings of the 32nd Annual International Conference on Approximation
Wang, editors, Thomas Schneider. Hycc: Compilation of hybrid protocols for practical secure
computation. In Proceedings of the 32nd Annual International Conference on Approximation
In Richard L. Wexelblat, editor, Proceedings of the ACM SIGPLAN’88 Conference
on Programming Language Design and Implementation (PLDI), Atlanta, Georgia,
The polyhedral model is more widely applicable than you think.
In Rajiv Gupta, editor, Compiler Construction, 19th International Conference, CC
2010, Held as Part of the Joint European Conferences on Theory and Practice of
[8] Marina Blanton and Paolo Gasti. Secure and efficient protocols for iris and fin-
In Proceedings of the 33rd Annual Conference on Computer and Communications Security,
[10] Payman Mohassel and Yupeng Zhang. Secureml: A system for scalable privacy-
preserving machine learning. In 2017 ACM Conference on Computer and Communications
[18] Niklas Büscher, Daniel Demmler, Stefan Katzenbeisser, David Kretzmer, and Michael Backes.
[22] Niklas Büscher, Daniel Demmler, Stefan Katzenbeisser, David Kretzmer, and Michael Backes.
[27] Niklas Büscher, Daniel Demmler, Stefan Katzenbeisser, David Kretzmer, and Michael Backes.