Faster Optimization-Based Meta-Learning Adaptation Phase

Neural networks require a large amount of annotated data to learn. Meta-learning algorithms propose a way to decrease the number of training samples to only a few. One of the most prominent optimization-based meta-learning algorithms is Model-Agnostic Meta-Learning (MAML). However, the key procedure of adaptation to new tasks in MAML is quite slow. In this work we propose an improvement to the MAML meta-learning algorithm. We introduce Lambda patterns, which restrict which weights are updated in the network during the adaptation phase, making it possible to skip certain gradient computations. The fastest pattern is selected given an allowed quality degradation threshold parameter. In certain cases, quality improvement is possible by a careful pattern selection. The experiments conducted have shown that via Lambda adaptation pattern selection, it is possible to significantly improve the MAML method in the following areas: adaptation time has been decreased by a factor of 3 with minimal accuracy loss; accuracy for one-step adaptation has been substantially improved.


INTRODUCTION
The neural network accuracy for image classification has significantly improved thanks to deep convolutional neural networks. However, a very large number of images is required for such networks to train successfully. For instance, all of the ResNet [1] neural network configurations from ResNet-18 to ResNet-152 (18 and 152 layers deep correspondingly) are trained on the ImageNet dataset [2], which contains 1,281,167 images and 1,000 classes (about 1,200 samples per class). Obviously, for many practically significant tasks it is impossible to collect and label a dataset that large. Thus, learning deep convolutional networks from scratch might yield poor results. Because of that, on smaller datasets an approach called transfer learning is typically used instead. That is, an ImageNet-pretrained network of a particular architecture is taken and then further fine-tuned on the target (smaller) dataset [1; 3; 4]. However, training on a few examples per class is still a challenge. This contrasts with how we humans learn, where even a single example given to a child might be enough. Also, it is hard to estimate the quality of a certain ImageNet-pretrained network on the target dataset. Hence, we get a model selection problem: if model A is better than model B on ImageNet, will it be better on our small dataset? A promising approach to resolving both of these problems is meta-learning, or its benchmark known as few-shot learning. Meta-learning trains the network on a set of different tasks, which are randomly sampled from the whole space of tasks. By training the network in such a way, it is assumed that the network will learn features that are relevant to all of the tasks and not only to a single one, i.e., it will learn more general features.
In this work we focus on one of the most prominent optimization-based meta-learning methods, called MAML [5]. This method has become a keystone, and as will be shown in the literature review section, many of the newer methods build on its ideas. Training with the MAML method is split into the so-called adaptation and meta-gradient update phases.
The subject of study is the class of optimization-based meta-learning algorithms.
It has been shown that the adaptation phase of MAML is quite slow to perform [6], and in general, high neural network execution speed is a major problem for applications [7]. In this work we introduce gradient update patterns, i.e., a selective update of the neural network weights during the adaptation phase.
The purpose of this work is to show that by carefully selecting the newly proposed gradient update pattern, it is possible to: 1) increase the execution speed of the MAML adaptation phase; 2) significantly improve MAML performance in the case when only one adaptation step is used. The test results will be shown on the publicly available few-shot learning dataset CIFAR-FS [8].

PROBLEM STATEMENT
The goal of meta-learning is to train a neural network Φ(θ) that is capable of adapting to new, previously unknown tasks given a small number of examples. Meta-learning is also referred to as the learning-to-learn problem. The training procedure is defined using the concept of tasks, which are sampled from the whole task space ρ(T) of the problem domain. A task is a tuple T = {S, Q}, consisting of the so-called Support Set S = {X_S, y_S} and Query Set Q = {X_Q, y_Q} [5; 9-11]. In the literature, the Query Set is also sometimes referred to as the Target Set. The Support Set {X_S, y_S} is used to adapt (or train) the network to the new task. The set S is small. X_S are the network inputs, y_S are the expected predictions. The number of examples per class is denoted as K and written as K-shot. K is typically in the range from 1 to 20, although no hard upper bound is defined. X_Q, y_Q are the query inputs and expected outputs correspondingly. The number of classes N the network should distinguish between is denoted as N-way.
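To make the task construction concrete, the following sketch shows how a K-shot N-way task could be assembled from a pool of labelled images; the helper name sample_task, the dictionary layout of images_by_class and the query size of 15 are illustrative assumptions rather than part of the original formulation.

```python
import random
import numpy as np

def sample_task(images_by_class, n_way=5, k_shot=1, k_query=15):
    """Sample one task T = {S, Q}: N classes, K support and K_Q query images per class."""
    classes = random.sample(list(images_by_class), n_way)
    xs, ys, xq, yq = [], [], [], []
    for label, cls in enumerate(classes):
        picks = random.sample(images_by_class[cls], k_shot + k_query)
        xs += picks[:k_shot]; ys += [label] * k_shot      # Support Set (X_S, y_S)
        xq += picks[k_shot:]; yq += [label] * k_query     # Query Set (X_Q, y_Q)
    return (np.stack(xs), np.array(ys)), (np.stack(xq), np.array(yq))
```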
We have given the general training setup; next we define it in more detail for optimization-based meta-learning in image classification, on which this paper is focused. Optimization-based meta-learning is defined in two steps: 1) the adaptation step, which computes adaptation weights in the form of a function θ'(θ) that minimizes the task-specific error L(y_S, Φ(θ', X_S)); 2) the meta-gradient update, which updates the meta-weights θ. The idea behind such a training procedure is that by finding good weights θ, it will be possible to adapt to new, previously unseen tasks with few training examples in the adaptation procedure. For classification, the loss function used is typically cross-entropy (1):

L(y, Φ(θ, X)) = −Σ_i log Φ_{y_i}(θ, X_i),    (1)

where Φ_{y_i}(θ, X_i) is the predicted probability of the correct class y_i for input X_i. We define the algorithm-specific parts in the Materials and Methods section. In this work we set the goal of improving adaptation step execution time and accuracy.

REVIEW OF THE LITERATURE
Meta-learning approaches are mainly divided into three broad categories [12]: metric-based, model-based and optimization-based. Representatives of each group differ in the neural network design and training procedure. In this work we focus on classification methods, yet applications exist in virtually every field of machine learning [5; 13-15], such as NLP, reinforcement learning, face verification, etc.
Next, we describe each category of meta-learning methods. 1) In metric-based methods the goal is to define a neural network architecture that produces an embedding into a metric space and a similarity measure (metric), so that the distance between embeddings of the same class is smaller than that between different classes. Examples of such methods include Siamese Networks [16], Matching Networks [17] and Prototypical Networks [9]. 2) In model-based methods the network architecture is designed so that the model has explicit memory cells, which help the network adapt quickly, for instance, Memory-Augmented Neural Networks [18]. 3) In optimization-based learning the network architecture is not changed, which means that conventional architectures for image classification can be used. One of the quintessential methods in this category is MAML [5], which defines the training procedure as a second-order optimization problem. The method's applicability has been shown in regression, classification and reinforcement learning. Two popular datasets were considered for image classification: Omniglot [19] and miniImageNet [10; 17], where MAML beat many of the previous methods by a margin. After MAML was introduced, many works proposed its modifications. Reptile [20] simplified the MAML training scheme; MAML++ [11] gave practical recommendations on improving MAML training stability. It has been noted that while MAML++ introduced more parameters to the network, the total training time decreased thanks to the proposed performance optimizations. The authors of Meta-SGD [21] note that by learning not only the network weights but also a separate update coefficient for each weight, it is possible to achieve higher accuracy. However, the network training time and memory consumption increase significantly, as twice the number of parameters must be optimized.
In contrast to previous works, in this paper we focus on improving the network adaptation time rather than the training time. We assume that after the initial training, the network may have to be adapted to multiple tasks in an online fashion. Thus, minimizing adaptation time is an important problem. The results obtained in this paper are applicable to many of the optimization-based algorithms, including but not limited to the ones mentioned above.

MATERIALS AND METHODS
In this work we propose a modification to the MAML algorithm. As described in the problem statement section above, this class of algorithms is defined in terms of the adaptation and meta-gradient update phases. The algorithm starts by randomly sampling a training task T_i ~ ρ(T). To sample a task T_i means to 1) randomly select N classes from all the classes available in the dataset split (training, validation or test, depending on which accuracy we want to compute); 2) randomly select K images per each of the N classes for the Support Set and K_Q images per class for the Query Set. The first phase of the algorithm is adaptation, where MAML minimizes loss function (1) on the Support Set by performing several stochastic gradient descent steps. To do that, the algorithm iteratively builds task-specific weights θ_i^(j)(θ) via formula (2), where α is the adaptation step size; note that θ_i^(0) ≡ θ:

θ_i^(j) = θ_i^(j−1) − α·∇_{θ_i^(j−1)} L(y_S, Φ(θ_i^(j−1), X_S)).    (2)
The second phase is the meta-gradient update (3), where β is the meta-learning rate and P is the number of adaptation steps:

θ ← θ − β·∇_θ Σ_{T_i} L(y_Q, Φ(θ_i^(P), X_Q)).    (3)

In essence, in (3) the algorithm updates the meta-weights θ using the loss function (1) computed on the Query Set for the networks Φ with adapted weights θ_i^(P), summed over several tasks T_i; i.e., in this step the algorithm backpropagates through the losses of all the task-specific adaptations. Throughout the paper we use 4 tasks for the meta-update step. Note that in (2) the task-specific weights θ_i^(j) are computed on the Support Set, while in (3) the Query Set is used for the loss computation. Also, in contrast to the conventional neural network training procedure, the loss function is computed twice: first, to compute the adaptation weights θ_i^(P) in (2); second, to compute the resulting adaptation loss in (3). Furthermore, in (2) the gradient is taken with respect to the task-specific weights θ_i^(j−1) from the previous step, while in (3) the gradient is taken with respect to the meta-weights θ. Thus, as can be seen from formulas (2) and (3), the method requires Hessian computation during the meta-gradient update; hence, this is a second-order optimization method. The whole training procedure is given in Algorithm 1. More detailed information can be found in the original paper [5].
Algorithm 1. MAML adaptation procedure
1: Randomly sample tasks T_i from the task space ρ(T)
2: For each task T_i:
3:   For iteration j = 1, …, P:
4:     Adapt the network via formula (2) using the Support Set S_i
5:   End for
6: End for
7: Update the meta-weights θ via (3) using the Query Sets Q_i and the task-specific weights θ_i^(P)
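As an illustration of Algorithm 1 and formulas (2)-(3), below is a minimal PyTorch-style sketch of the adaptation and meta-update phases. The function names, the use of torch.func.functional_call (PyTorch 2.x) and the default hyperparameters are our own assumptions for this sketch; the authors' implementation may be organized differently.

```python
import torch
import torch.nn.functional as F
from torch.func import functional_call  # PyTorch >= 2.0

def adapt(model, theta, xs, ys, alpha=0.01, steps=10):
    """Inner loop, formula (2): P gradient steps on the Support Set (X_S, y_S)."""
    for _ in range(steps):
        loss = F.cross_entropy(functional_call(model, theta, (xs,)), ys)
        # create_graph=True keeps the graph so the meta-update can differentiate through it
        grads = torch.autograd.grad(loss, list(theta.values()), create_graph=True)
        theta = {name: w - alpha * g for (name, w), g in zip(theta.items(), grads)}
    return theta

def meta_update(model, meta_opt, task_batch, alpha=0.01, steps=10):
    """Outer loop, formula (3): backpropagate through all task-specific adaptations."""
    meta_opt.zero_grad()
    theta0 = dict(model.named_parameters())
    meta_loss = 0.0
    for (xs, ys), (xq, yq) in task_batch:  # e.g., 4 tasks per meta-update step
        theta_p = adapt(model, theta0, xs, ys, alpha, steps)
        meta_loss = meta_loss + F.cross_entropy(functional_call(model, theta_p, (xq,)), yq)
    (meta_loss / len(task_batch)).backward()  # second-order gradients reach the meta-weights
    meta_opt.step()
```

Here meta_opt would be an Adam optimizer over model.parameters() with the meta-learning rate β.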

Next, we define our modified adaptation procedure. Given a convolutional neural network that has B layers, we define an adaptation pattern (4), where Λ_l is an indicator variable as defined in (5) that marks the layers of the network to be updated during backpropagation:

Λ = {Λ_1, Λ_2, …, Λ_B},    (4)

Λ_l = 1 if layer l is updated during the adaptation phase, and Λ_l = 0 otherwise, l = 1, …, B.    (5)

We say that a pattern is full if ∀l: Λ_l = 1; in this case our adaptation phase is equivalent to the one proposed in MAML. We consider all possible patterns Λ except ∀l: Λ_l = 0, for which no weights can be updated and thus no adaptation is possible. We assume that updating only certain weights might be useful, because neural networks tend to learn features that differ in complexity: the closer the layer is to the input, the simpler the features are [22]. Also, the authors of Meta-SGD [21] have shown that by learning weight-specific learning rates the resulting quality was superior to the original MAML algorithm. However, the Meta-SGD approach was much slower to train, as both the weights and the learning rates have to be learned during the training procedure. The training time in our approach is unaffected. In contrast to previous works, we propose to update only certain weights, thus essentially freezing some layers. This allows us to decrease the number of gradient computations required during the adaptation phase, as shown in Fig. 1 for a convolutional network that contains 4 convolutional blocks and a single fully-connected (linear) layer. In Fig. 1 the backpropagation pass goes in the direction opposite to the arrows (the forward pass). The architecture is taken as an example and can be arbitrary in practice. For the example pattern Λ = {0,1,0,1,1}, we can see that for Convolutional Block 4 and the Linear layer both the gradient is computed and the weights are updated. For Convolutional Block 3 gradients are computed, but its weights are not updated, since the earlier Convolutional Block 2 requires a weight update. However, for Convolutional Block 1 no gradient computation or weight update is performed. Given the Λ pattern described above, the updated adaptation formula is as follows (6):

θ_i^(j) = θ_i^(j−1) − α·Λ ⊙ ∇_{θ_i^(j−1)} L(y_S, Φ(θ_i^(j−1), X_S)),    (6)

where ⊙ denotes layer-wise masking of the gradient by the pattern Λ.
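A sketch of the masked adaptation step (6), under the same assumptions as the sketch above, is given below; layer_of is an assumed helper mapping a parameter name to its block index, and requesting gradients only for the selected parameters lets autograd skip the computations for a frozen prefix of the network, as in Fig. 1.

```python
import torch
import torch.nn.functional as F
from torch.func import functional_call  # PyTorch >= 2.0

def adapt_with_pattern(model, theta, xs, ys, pattern, layer_of, alpha=0.01, steps=3):
    """Formula (6) sketch: only weights of blocks with Lambda_l = 1 are updated."""
    trainable = [name for name in theta if pattern[layer_of(name)] == 1]
    for _ in range(steps):
        loss = F.cross_entropy(functional_call(model, theta, (xs,)), ys)
        grads = torch.autograd.grad(loss, [theta[n] for n in trainable], create_graph=True)
        updates = dict(zip(trainable, grads))
        # Frozen layers (Lambda_l = 0) keep their meta-weights unchanged
        theta = {n: w - alpha * updates[n] if n in updates else w for n, w in theta.items()}
    return theta

# Example: the pattern selected later in the paper, Lambda* = {1, 0, 1, 1, 1},
# freezes Convolutional Block 2 during adaptation.
```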

EXPERIMENTS
To conduct the experiments, we have reimplemented the MAML algorithm. The following paragraphs describe the details.
The authors of MAML defined a convolutional neural network architecture and used it for the miniImageNet experiments. This network is commonly referred to as "CNN4" in the later meta-learning literature. It has 4 convolutional blocks followed by a linear layer. Each block has a convolutional layer with a kernel size of 3 and padding of 1, followed by Batch Normalization [23], ReLU activation and Max Pooling with a kernel size of 2. The number of filters in the convolutional layers is a configurable parameter; the authors used 32, which we follow. The number of outputs of the linear layer is defined by N for the N-way classification problem. Training is performed with the Adam [24] gradient descent method as the meta-optimizer with a learning rate of β = 10^-3, and α = 0.01 as the adaptation step size. Each model has been trained for 600 epochs. While the authors used a meta-batch size of 2 for the 5-shot and 4 for the 1-shot experiments to reduce training memory consumption, we stick to 4, as it leads to slightly better performance on the CIFAR-FS [8] dataset in our experiments. Also, the dataset memory footprint is small, so we do not have to reduce memory consumption by using a smaller batch size. Each epoch has 100 randomly sampled tasks. For the gradient update, N·K samples are taken for the K-shot N-way classification problem during training, and 15 samples per class are used for evaluation, following [10].
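For reference, a sketch of the CNN4 backbone as we understand the description above is given below; details such as the in-place ReLU and the flattening before the linear layer follow common re-implementations and are assumptions rather than the authors' exact code.

```python
import torch.nn as nn

def conv_block(in_ch, out_ch=32):
    # Conv 3x3 (padding 1) -> BatchNorm -> ReLU -> MaxPool 2x2, as described above
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
        nn.MaxPool2d(kernel_size=2),
    )

class CNN4(nn.Module):
    """Four conv blocks plus a linear head; 32x32 CIFAR-FS inputs give 2x2 feature maps."""
    def __init__(self, n_way, filters=32, in_channels=3):
        super().__init__()
        self.features = nn.Sequential(
            conv_block(in_channels, filters),
            conv_block(filters, filters),
            conv_block(filters, filters),
            conv_block(filters, filters),
        )
        self.classifier = nn.Linear(filters * 2 * 2, n_way)  # N outputs for N-way

    def forward(self, x):
        return self.classifier(self.features(x).flatten(1))
```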
In addition, we have modified the network adaptation procedure so that it updates only the weights defined by the pattern Λ, as defined in (4)-(6).
For the experiments we have used the recent CIFAR-FS [8] dataset. It has been constructed from the well-known classification dataset CIFAR-100 [25]. It contains images of different kinds of mammals, reptiles, flowers, man-made things, etc. The images are in color and have a size of 32×32 pixels. Originally, this dataset was not intended to be used in a few-shot learning setting. In [8] it has been suggested to split the 100 classes into train, validation and test sets. If this were conventional (non-few-shot) neural network training, we would expect all of the 100 classes to be represented in each of the sets, and only the images themselves would be split. However, in the few-shot learning case different disjoint classes are taken. Thus, 64 training, 16 validation and 20 test classes have been selected. The exact classes that go into each split are important for testing the resulting accuracy and are defined in [8]. By using different classes for training and testing, the adaptation to new classes can be better estimated. After such training the model is expected to quickly adapt to new, unseen classes. We have taken the CIFAR-FS dataset for our experiments because it has not been analyzed by the MAML authors and is also faster to compute on than miniImageNet.
All of the training procedures and time measurements were done with our own MAML implementation and executed on an NVIDIA GTX 1050 Ti GPU.

RESULTS
Given the network configuration described in the experiments section, we have implemented the MAML algorithm. CIFAR-FS accuracy and adaptation timings are presented in Table 1. In (6) we have proposed a modified adaptation scheme, where only a part of the weights is updated during the adaptation procedure. To begin with, we consider only trivial patterns Λ, where a single network layer is updated during the adaptation procedure. We show the accuracy on the test set in Fig. 2, where panels a and b correspond to the 1-shot 5-way and 5-shot 5-way configurations. To see the impact of the number of adaptation steps, we also show the accuracies for P = 10 (default) and for 1, 3 and 5 adaptation steps. As can be seen, the model accuracy differs significantly between the configurations. For 1-shot 5-way, updating only one of the first three convolutional layers has no effect: the accuracy remains at the level of random guessing (20%). However, training either convolutional layer 4 or the last linear layer improves the model accuracy. Note that the number of parameters differs between layers. In Table 2 we show the number of parameters for each layer. Note that the final layer has a different number of parameters depending on the number N of output classes. It can be seen that the first convolutional layer and the final linear (fully-connected) layer have fewer parameters than the inner convolutional blocks. This can explain the fact that learning only the linear layer has worse performance. For 5-shot 5-way we see that only convolutional layers 3 and 4 have a positive impact on the performance when adapted alone. Interestingly, the number of adaptation steps has a significant impact on the performance when only convolutional layer 3 is enabled. As we will see later, this impact is higher than when the full network is updated during the adaptation.
In Fig. 3, a and b, we depict a similar experiment for the 1-shot 2-way and 5-shot 2-way configurations correspondingly. Note that the random guessing baseline for these configurations is now at 50%, so the lower bound for accuracy is higher than in Fig. 2. Here we see the opposite trend, where updating the first layers also has a positive impact on the resulting accuracy. In contrast to the previous experiment, updating Convolutional Block 4 alone does not provide the best results in either case. As one of the goals of our work is to improve the model adaptation speed, we have also timed the experiments for the trivial patterns Λ. In Fig. 4 and 5 we show the model adaptation time corresponding to all four configurations depicted in Fig. 2 and 3. As we can see, in both cases there is a similar trend: the closer the updated layer is to the end of the network, the smaller the adaptation time. This follows our earlier observation that by skipping some gradient computations (as shown in Fig. 1), adaptation time can be reduced.
As can be seen from Fig. 4 and 5, the number of adaptation steps has a significant impact on the adaptation speed. In Fig. 6 we show the model accuracy for each of the four scenarios, and in Fig. 7 we depict the corresponding timings, both with respect to the number of adaptation steps. As before, the experiments have been conducted for P = 1, 3, 5 and 10 adaptation steps. The results between these reference points have been linearly interpolated. The presented accuracies and timings are averaged over all 31 possible patterns Λ. Note that throughout the article we exclude the pattern ∀l: Λ_l = 0, as no weights can be changed for such a pattern and therefore no adaptation is possible. As can be seen, while the adaptation time grows linearly with the number of adaptation steps, the accuracy growth plateaus at around 5 adaptation steps. In fact, for the full pattern Λ, increasing the number of adaptation steps from 5 to 10 gives less than a 0.3% improvement in accuracy. In typical practical scenarios such an improvement is insignificant. Thus, we suggest that performing 10 adaptation steps is redundant. Next, we search for such a pattern Λ and number of adaptation steps that the resulting accuracy drops by no more than 7% relative to the full-pattern accuracy. We consider such a quality degradation threshold reasonable for practical applications; the approach we propose can be applied with an arbitrary quality degradation threshold. We show such patterns in Table 3. Based on this table, we suggest using Λ* = {1,0,1,1,1}, which offers a factor of 3.0 speed improvement with an insignificant quality loss. It can be seen that the pattern Λ = {0,1,1,1,1} also satisfies the specified criteria and has a slightly higher (factor of 3.1) speed improvement; however, it has a significantly lower performance for both of the 2-way configurations, degrading by 2.5% and 3.2% relative to the best selected pattern Λ*. We consider such a degradation not worth the speed-up. The fact that enabling the first CNN layer is significant for the 2-way learning accuracy closely follows the description of Fig. 3 presented above. Also, to avoid confusion: in Fig. 2-5 only one layer was updated during the adaptation phase (thus Σ_l Λ_l = 1), whereas the best selected pattern Λ* has all except one layer updated. Finally, we pose the question of whether updating only a part of the weights in the neural network can improve the method's accuracy. We have discovered that in the extreme case of learning with a single adaptation step (P = 1), there is a significant improvement in 5-way adaptation performance when updating with a partial pattern Λ. The performance for the full pattern, as well as a partial one, is shown in Table 4. We have also performed a search over all cases in which our approach gives better results than the original with P = 1. The results are shown in Table 5.
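The pattern selection described above can be summarized as a simple exhaustive search: evaluate every non-zero pattern Λ for several values of P, keep the combinations whose accuracy stays within the allowed relative degradation, and return the fastest one. In the sketch below, evaluate and time_adaptation are assumed helpers that return validation accuracy and mean adaptation time for a given (pattern, P) pair.

```python
from itertools import product

def select_pattern(evaluate, time_adaptation, n_blocks=5,
                   steps_grid=(1, 3, 5, 10), max_rel_drop=0.07):
    """Return the fastest (pattern, steps) whose accuracy loss stays within the threshold."""
    full = (1,) * n_blocks
    baseline = evaluate(full, max(steps_grid))       # full pattern with P = 10 as reference
    best, best_time = None, float("inf")
    for pattern in product((0, 1), repeat=n_blocks):
        if sum(pattern) == 0:                        # no weights updated: no adaptation possible
            continue
        for steps in steps_grid:
            if evaluate(pattern, steps) < (1.0 - max_rel_drop) * baseline:
                continue                             # accuracy drop exceeds the allowed threshold
            t = time_adaptation(pattern, steps)
            if t < best_time:
                best, best_time = (pattern, steps), t
    return best
```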

DISCUSSION
In [22] it has been shown that each trained convolutional layer of a neural network has a different role. The first layer tends to learn simple features, like edges, lines or color gradients. The second layer increases in complexity and captures simple shapes, e.g., circles, corners or stripes, while the last layers learn high-level features, such as eyes, faces, text-like objects, etc. The exact features learned obviously depend on the training dataset, but this general structure is retained. In the few-shot classification scenario the tasks differ by the types of objects that the model has to classify (e.g., horse, vehicle, frog, etc.). As we have described in the experiments section, the train and test sets contain different, disjoint classes. Thus, it might be reasonable to expect that only the last layers of the network should be changed to adapt to the new tasks and classes. This is exactly what we see in the case of 5-way classification, as shown in Fig. 2. However, such a statement contradicts the experimental results of Fig. 3. By examining the original CIFAR-100 dataset, we can see that image labels (classes) form larger coarse groups. For instance, the coarse class (or superclass) "aquatic mammals" contains "beaver", "dolphin", "otter", "seal" and "whale". Other examples of superclasses include "fish", "large carnivores", "household electrical devices", etc. The training itself is performed on the finer classes. From the examples we have picked, it becomes obvious that instances of different classes have a significant variation in color. Images of aquatic mammals and fish typically contain blue and gray colors, while large carnivores might have more yellow and green. Thus, in the case of 2-way classification it is more probable that both classes will be picked from a single superclass or from several similar superclasses than in the case of 5-way classification. Consequently, we suggest that updating the first layer of the neural network in a 2-way few-shot learning scenario adjusts the feature distribution to the one expected by the following neural network layers. We see this as analogous to how the human eye works: it adjusts the amount of light reaching the retina by expanding or contracting the pupil, so that it becomes easier to see the details.
From Table 3 we see that keeping the inner layers frozen is the most fruitful way to improve the performance, with little to no quality loss. A substantial increase in adaptation speed has been achieved with a target quality loss set to 7% relative to the original pattern Λ = {1,1,1,1,1} and P = 10 adaptation steps. The actual quality loss turns out to be even smaller, as we have skipped the slightly faster but worse pattern Λ = {0,1,1,1,1}. Thereby, with the best Λ* = {1,0,1,1,1} and P = 3 adaptation steps, we achieve a factor of 3.0 speed improvement. Our quality losses are the following: 0.78% for 1-shot 2-way, 1.97% for 5-shot 2-way, 4.86% for 1-shot 5-way and 0.71% for 5-shot 5-way. Even smaller quality losses can be achieved by consulting Table 3. Note that these are relative quality losses; computed in absolute terms, they are even smaller. Thus, we state that we have achieved a significant adaptation time reduction with a small enough quality loss.
We also discuss a way to improve the algorithm's quality by selecting a pattern Λ. In the extreme case of a single adaptation step, avoiding updates to the inner layers has helped to improve the overall model quality, as shown in Table 4. For each of the few-shot learning configurations we have also been able to find a pattern that improves the model performance for P = 1 adaptation step (Table 5). It is curious that no such behavior is observed when P > 1. To the best of our knowledge such behavior has not been previously observed and should be further investigated.

CONCLUSIONS
MAML is an optimization-based few-shot learning method that is able to train an arbitrary neural network using only a few samples per class. Many algorithms follow the learning scheme proposed in MAML. In this work we address the problems of 1) long adaptation time and 2) poor performance when a single adaptation step is used.
The scientific novelty of the obtained results is that a method of reducing the number of gradient computations during the MAML adaptation phase has been introduced via the newly proposed Λ patterns. By selecting an appropriate adaptation pattern, we have significantly improved the method in the following areas: 1) the long MAML adaptation time has been decreased by a factor of 3 with minimal accuracy loss; 2) accuracy in cases when only a single adaptation step is used has been substantially improved.
The practical significance of the obtained results is that the improved adaptation time of the widespread MAML algorithm enables its application on less powerful devices and, in general, decreases the time needed for the algorithm to adapt to new tasks.
Prospects for further research are to investigate a more robust automatic pattern selection scheme for an arbitrary training dataset and network configuration.

Figure 1 - Λ pattern backpropagation scheme. Backpropagation is performed in the order reverse to the arrows. In red: gradients are computed and network weights are updated; in yellow: gradients are computed, but no network weight update is performed; in green: both gradient computation and weight update are skipped

Figure 2 - Adaptation accuracy for trivial Λ patterns, i.e., only a single layer is updated during adaptation: a - 1-shot 5-way, b - 5-shot 5-way
Figure 3 - Adaptation accuracy for trivial Λ patterns, i.e., only a single layer is updated during adaptation: a - 1-shot 2-way, b - 5-shot 2-way

Figure 4, Figure 5 - Adaptation time for the trivial Λ patterns (configurations as in Fig. 2 and 3)
Figure 6 - Accuracy averaged over all patterns Λ for different K-shot N-way problems with respect to the number of adaptation steps P

Figure 7 - Adaptation time averaged over all patterns Λ for different K-shot N-way problems with respect to the number of adaptation steps P

Table 1 - Accuracies and adaptation timings on the CIFAR-FS dataset

Table 2 - Number of parameters for each layer

Table 3 - Adaptation speedup depending on the pattern Λ and the number of adaptation steps. Patterns with an accuracy degradation of less than 7% (relative to the full pattern Λ with 10 adaptation steps) are shown.