
This section examines the aggregation step in federated learning. As mentioned in Section 2.2, the server broadcasts the model to the participating devices. The devices apply local updates to the global model, and the updated models are then sent back to the server to be aggregated. As one can observe in Section 2.2.1, the aggregation methods use a weighted arithmetic average, described by Equation 7, to aggregate the updates.

Suppose that a client $i$ sends an update which is a statistical outlier. This can happen because the client has outlier data, or because the client is sending corrupted updates to the server as a result of an adversarial attack, as discussed in Section 2.3. Such an outlier update can have a significant impact on the aggregation, since the statistical mean is not a robust method. For example, consider the case

\[
n_i w_i^{t+1} \gg \sum_{j \neq i} n_j w_j^{t+1}. \tag{11}
\]

Equation 11 shows that the aggregated model $w^{t+1}$ will be strongly influenced by the update of client $i$, which makes the global model vulnerable to corrupted updates.
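To make this sensitivity concrete, the following is a minimal sketch in Python, using made-up client counts and two-dimensional updates (none of these numbers come from [27]), of how a single corrupted update drags the weighted average of Equation 7 away from the honest updates:

```python
import numpy as np

# Hypothetical client sample counts n_k and weight-updates w_k^{t+1}.
n = np.array([100, 100, 100])
updates = np.array([[1.0, 1.0],      # honest update
                    [1.1, 0.9],      # honest update
                    [50.0, -40.0]])  # corrupted outlier update

alphas = n / n.sum()           # alpha_k = n_k / n as in Equation 7
aggregated = alphas @ updates  # weighted arithmetic average

print(aggregated)                # ~[17.4, -12.7]: far from the honest cluster
print(updates[:2].mean(axis=0))  # ~[1.05, 0.95]: where the honest clients are
```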

The paper Robust Aggregation for Federated Learning [27] remedies the problem described above by using the geometric median instead of the arithmetic average. The solution is presented in Section 2.4.3. The goal of Krishna et al. [27] was to develop an aggregation procedure for federated learning which is both robust against corrupted updates sent by client devices and privacy-preserving. Privacy preservation can be expensive, since the system can require more than just sending the updates to the server [27]. This may lead to more communication overhead, and may require a more complex implementation that differs from already existing solutions. Krishna et al. therefore required that the aggregation should be communication-efficient and practical, i.e., require minimal engineering overhead relative to existing systems.

2.4.1 Secure Average Oracle

Bonawitz et al. [17] were the first to implement a Secure Average Oracle, which is a failure-robust protocol for secure aggregation of high-dimensional data. This aggregation method computes the weighted arithmetic average

\[
\sum_{k=1}^{m} \alpha_k w_k \tag{12}
\]

such that client $k$'s weight-update $w_k$ is not revealed to any other client or to the server. In Equation 12, $m$ is the number of selected clients, and $\alpha_k$ is the weight of client $k$, i.e., how significant the client is when aggregating the updates. Thus, the Secure Average Oracle aggregates the updates in a privacy-preserving manner, and is typically implemented using cryptographic protocols. Krishna et al. decided to use the Secure Average Oracle as a module in their robust aggregation method. By using a small number of calls to a Secure Average Oracle, the goals stated in the last paragraph of Section 2.4 would be fulfilled. By definition, this would make the aggregation method privacy-preserving, since the weight-updates are not shared with the other clients or with the server. Moreover, communication efficiency would be achieved if the number of calls to the Secure Average Oracle is small. Since Krishna et al. reused the Secure Average Oracle, which was already introduced by Bonawitz et al. in 2017, practicality was also achieved, as it requires minimal divergence from existing solutions [27].

This aggregation scheme requires multiple calls to a Secure Average Oracle, making the aggregation iterative. Multiple calls are required because a single call to the Secure Average Oracle only returns the mean as described in Equation 7. In Section 2.4 it was established that this mean is not robust, so the robustness has to come from iterating the secure averaging with adjusted weights.
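To give an intuition for how such an oracle can hide individual updates, the toy sketch below uses pairwise additive masks that cancel in the server's sum. This is only the core idea and not the actual protocol: Bonawitz et al. [17] additionally handle key agreement, dropouts, and finite-field arithmetic.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical weighted updates alpha_k * w_k from m = 3 clients.
values = [np.array([1.0, 2.0]), np.array([0.5, 0.5]), np.array([2.0, 1.0])]
m, d = len(values), 2

# Clients i < j agree on a random mask s_ij; client i adds it and client j
# subtracts it, so every mask cancels in the sum over all clients.
masks = {(i, j): rng.normal(size=d) for i in range(m) for j in range(i + 1, m)}

masked_inputs = []
for k, v in enumerate(values):
    masked = v.copy()
    for (i, j), s in masks.items():
        if k == i:
            masked += s
        elif k == j:
            masked -= s
    masked_inputs.append(masked)  # the server only ever sees these

# The server learns the aggregate, but no individual value.
print(sum(masked_inputs))  # == sum(values) = [3.5, 3.5] up to float error
```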

2.4.2 Corruption Model

Before aggregation, each device computes an update. If the device is corrupted, the update may be arbitrary. If not, the update is in line with stochastic gradient descent on the local data. During aggregation, all devices behave nominally, i.e., they give their update to the Secure Average Oracle, even if it is corrupted [27]. In other words, corrupted and non-corrupted devices are treated equally during the iterative aggregation. This is because the Secure Average Oracle does not know whether a device has been corrupted, or whether it is operating on outlier data.

The term corrupted may be ambiguous, since a device can be subject to several types of corruption, and each corruption type depends on the capability of the adversary. Krishna et al. allowed the following corruption types in their robust aggregation algorithm (a code sketch contrasting the adversarial types follows the list):

• Non-adversarial:

This corruption type may arise from bugs in the hardware or software of the client device, bugs in the pipeline, etc. As the name suggests, no adversary is involved, and as long as the device behaves as expected during aggregation, this type of corruption is allowed.

• Static data poisoning:

Static data poisoning is when the adversary can modify the training data of a client, but the modification is independent of the training of the federated model. This is allowed under the corruption model, and is explained in greater detail in Section 2.3.1.

• Adaptive data poisoning:

Adaptive data poisoning extends static data poisoning: the adversary can still modify the training data, but can additionally read the federated model. The adversary can thus inspect the current model and then modify the training data in the way that hurts the global model the most.

• Update poisoning:

Update poisoning is an extension of both static and adaptive data poisoning. The adversary is able to write the update of the model directly, without having to change the data to obtain the desired update. This is the most general setting that is allowed in this corruption model, and it is explained in greater detail in Section 2.3.1.
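The sketch below contrasts the three adversarial corruption types in code. The function names and the concrete attacks are hypothetical illustrations, not taken from [27]; they only show which information each adversary may read and write:

```python
import numpy as np

def static_data_poisoning(X, y):
    # May rewrite the local training data, but independently of the
    # federated model (e.g., a one-time label flip).
    return X, -y

def adaptive_data_poisoning(X, y, global_model):
    # May additionally read the current federated model and craft the data
    # against it, e.g., flip exactly the labels it predicts correctly.
    preds = np.sign(X @ global_model)
    return X, np.where(preds == y, -y, y)

def update_poisoning(honest_update):
    # May write the update vector directly, bypassing the data entirely.
    return -10.0 * honest_update
```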

The corruption model described above does not allow an adversary to modify the aggregation method itself; this is only possible in the Byzantine adversarial setting [27]. Thus, in each round of the iterative aggregation, the adversary cannot tamper with the calculated average. Byzantine robustness is therefore not compatible with the Secure Average Oracle, which assumes that all devices behave nominally during aggregation. Nevertheless, static and adaptive data poisoning, as well as update poisoning, are allowed in this corruption model [27].

2.4.3 Robust Aggregation using the Geometric Median

As shown in Section 2.4, the statistical mean is not robust against outliers. Therefore, Krishna et al. had to find a new way to aggregate the updates from the clients, since the Secure Average Oracle does not address this problem. A substitute for the statistical mean is the statistical median. This method is robust to outliers, since all the values are sorted and the middle value is returned as the median. In 1991, Lopuhaa and Rousseeuw [28] found that the geometric median is robust. The geometric median is a method for finding the multivariate median of a set of points in a Euclidean space, and is defined as the minimizer of

\[
\sum_{k=1}^{m} \alpha_k \| z - w_k \|_2, \tag{13}
\]

where $w_k \in \mathbb{R}^d$ are the weight-updates from the different clients, and $\alpha_k > 0$ defines how much each client is weighted. Therefore, a $z \in \mathbb{R}^d$ has to be found such that the weighted sum of the $\ell_2$-distances from $w_k$ to $z$ is minimized.

minimized. By comparing Equation 7 to Equation 13, one can observe that the first equation computes the average of the sum of thel2-distancessquared, while the second equation has no quadratic term. As shown in Figure 11, the mean tends to move towards the outlier, whereas the geometric median does not do that. For this reason, geometric median is a popular method in robust machine learning, starting with the seminal work of Nemirovski and Yudin in 1983 [29] [27].

Figure 11: The arithmetic mean (AM) converges towards the outlier, while the geometric median (GM) does not [27].

The geometric median can be computed algorithmically with the smoothed Weiszfeld algorithm, a smoothed variant of Weiszfeld's algorithm from 1937 [30]:

\[
z = \frac{\sum_{k=1}^{m} \beta_k w_k}{\sum_{k=1}^{m} \beta_k}, \quad \text{where} \quad \beta_k = \frac{\alpha_k}{\max\{\nu, \|z - w_k\|_2\}}, \tag{14}
\]

where $\nu$ is a threshold value for the max-operator. In each iteration, the Secure Average Oracle receives the weight vectors $w_k$ from each client $k$ and calculates an estimate of the geometric median $z$, where the initial client weights are $\beta_k = \alpha_k$. Then $z$ is broadcast to the clients, and the clients calculate $\beta_k$. In other words, the clients adjust how much they should be weighted during the calculation of the geometric median. This can be seen in Equation 14, since the initial weight $\alpha_k$ is divided by $\max\{\nu, \|z - w_k\|_2\}$. Suppose two clients $i$ and $j$ with their respective weight-updates $w_i$ and $w_j$ and equal initial weights $\alpha_i = \alpha_j$. If

\[
\|z - w_i\|_2 > \|z - w_j\|_2, \tag{15}
\]

i.e., the $\ell_2$-distance of client $i$'s update from the estimated median is larger than that of client $j$, then

\[
\frac{\alpha_i}{\max\{\nu, \|z - w_i\|_2\}} < \frac{\alpha_j}{\max\{\nu, \|z - w_j\|_2\}} \implies \beta_i < \beta_j. \tag{16}
\]

Assume that $\nu < \|z - w_j\|_2$. Then $\beta_i < \beta_j$, which means that client $i$ is less significant than client $j$. This is a consequence of client $i$'s update being further away from the geometric median, and thus potentially being an outlier. This process is also illustrated in Figure 12. Each iteration is implemented with one call to the Secure Average Oracle. Krishna et al. [27] showed that 3-5 iterations, i.e., calls to the Secure Average Oracle, suffice to weight the clients appropriately. Therefore, the method is communication-efficient.

Figure 12: Three rounds of Robust Federated Aggregation, where the corrupted devices are scaled down. In round 1, each client is weighted equally. For each round that passes, the corrupted clients are weighted less and less. In round 3, one can observe that the corrupted clients are mostly disregarded [27].
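A minimal sketch of the smoothed Weiszfeld iteration of Equation 14 is given below. It is written as plain centralized Python for illustration; in RFA, each computation of $z$ would instead be one call to the Secure Average Oracle, and the example updates are made up:

```python
import numpy as np

def smoothed_weiszfeld(updates, alphas, nu=1e-6, num_iters=5):
    """Estimate the geometric median of the rows of `updates` (Equation 14)."""
    betas = alphas.copy()                       # initially beta_k = alpha_k
    for _ in range(num_iters):                  # 3-5 iterations suffice [27]
        z = betas @ updates / betas.sum()       # one oracle call in RFA
        dists = np.linalg.norm(updates - z, axis=1)
        betas = alphas / np.maximum(nu, dists)  # re-weight: outliers shrink
    return z

# Three honest updates near [1, 1] and one corrupted update: the estimate
# moves towards the honest cluster instead of the outlier.
w = np.array([[1.0, 1.0], [1.1, 0.9], [0.9, 1.1], [50.0, -40.0]])
print(smoothed_weiszfeld(w, alphas=np.full(4, 0.25)))
```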

By replacing the arithmetic mean in the Federated Averaging algorithm with the geometric median, the Robust Federated Aggregation (RFA) algorithm is obtained. The algorithm is shown in Algorithm 3.

Algorithm 3: Robust Federated Aggregation (RFA) [27]

initialize $N$ as total number of iterations;
$\beta_k \leftarrow \alpha_k$ for each client $k$;
foreach iteration $i$ from 1 to $N$ do
    $z \leftarrow \sum_{k=1}^{m} \beta_k w_k^{t+1} \,/\, \sum_{k=1}^{m} \beta_k$;  (one call to the Secure Average Oracle)
    $\beta_k \leftarrow \alpha_k / \max\{\nu, \|z - w_k^{t+1}\|_2\}$;
end
$w^{t+1} \leftarrow z$, i.e., $w^{t+1} \leftarrow \text{GeometricMedian}(w_k^{t+1})$
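In code, the change from Federated Averaging to RFA is confined to the server's aggregation step. A minimal sketch, reusing `smoothed_weiszfeld` from the sketch above and assuming a hypothetical `local_update` function for the client-side training:

```python
import numpy as np

def rfa_round(global_w, clients, alphas, local_update):
    # Each selected client computes a local update of the broadcast model...
    updates = np.stack([local_update(global_w, c) for c in clients])
    # ...and the server replaces the weighted arithmetic mean of Federated
    # Averaging with the (iteratively estimated) geometric median.
    return smoothed_weiszfeld(updates, alphas)
```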

The goal of the Robust Federated Aggregation (RFA) algorithm, and of all the other federated aggregation algorithms, is to minimize

\[
F(w) = \sum_{k=1}^{K} \alpha_k F_k(w), \quad \text{where} \quad F_k(w) = \mathbb{E}_{(x, y) \sim P_k}\left[L(w, x, y)\right], \tag{17}
\]

where $(x, y)$ are covariate-response pairs and $L$ is a loss function. In other words, the goal is to minimize the weighted average of the per-device objectives $F_k$. To show the theoretical convergence rate bound of the RFA algorithm, Krishna et al. [27] assumed that

\[
L(w, x, y) = \mathbb{E}_{x, y \sim P}\left[(y - w^T \phi(x))^2\right], \tag{18}
\]

where $\phi(x)$ is a feature representation of the covariate $x$. This is the least-squares objective, which is both convex and quadratic. Further, each $(x, y)$ is drawn from the same distribution $P$ for all devices, and each $\alpha_k = 1/K$. Hence, the data has to be IID to prove the theoretical bound of the RFA algorithm, but the algorithm will still converge for other loss functions $L$ and for non-IID data [27]. Another assumption is that $F$ is a $\mu$-strongly convex, $L$-smooth function, which are classical assumptions in convex optimization when proving theoretical bounds [31]. Moreover, the fraction of corrupted devices has to satisfy $\rho < 1/2$, which means that the majority of the devices have to be non-corrupted. This assumption corresponds to the properties of the geometric median, which has a breakdown point of 0.5; thus, up to half of the points may be arbitrary. Other assumptions by Krishna et al. were that the feature representation $\phi(x)$ is bounded, and that the noise variance is $\sigma^2$, where $\sigma$ is the standard deviation of the probability distribution $P$ [27].
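For concreteness, the objective of Equation 17 under the least-squares loss of Equation 18 can be written as the short sketch below, where `phi` is a hypothetical feature map and empirical averages stand in for the expectations:

```python
import numpy as np

def least_squares_loss(w, X, y, phi=lambda x: x):
    # Empirical version of Equation 18: average of (y - w^T phi(x))^2.
    feats = np.array([phi(x) for x in X])
    return np.mean((y - feats @ w) ** 2)

def global_objective(w, device_data, alphas):
    # Equation 17: weighted average of the per-device objectives F_k.
    return sum(a * least_squares_loss(w, X, y)
               for a, (X, y) in zip(alphas, device_data))
```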