A MODEL AND TRAINING ALGORITHM OF SMALL-SIZED OBJECT DETECTION SYSTEM FOR A COMPACT AERIAL DRONE



ABBREVIATIONS
CNN is a convolutional neural network;
ELM is an extreme learning machine;
IoU is Intersection over Union;
SHFN is a single-hidden-layer feedforward network;
PCA is Principal Component Analysis;
SSD is Single Shot Detector;
VGG is Visual Geometry Group;
XOR is the exclusive OR logical operation;
YOLO is You Only Look Once.
uniform_random is the function that generates a random number from the uniform distribution over the assigned range;
step_size is the size of the range of the search for new solutions neighboring $s_{current}$;
$s_{current}$ is the current solution of the simulated annealing algorithm;
$w_r$ is the weight vector linking the input layer with the r-th hidden node;
$\varsigma$ is any small non-negative number;
$\eta_0$ is the initial value of the learning rate;
$\eta_{final}$ is the final value of the learning rate;
$\eta_t$ is the current value of the learning rate;
$\phi(x)$ is an activation function, which can be any bounded non-constant piecewise continuous function;
$\hat{n}$ is the number of matched predicted boxes.

INTRODUCTION
Aerial drones are widely used in search, rescue, remote inspection and robotic aerial security tasks. One way to increase the functional efficiency of an aerial drone is to integrate artificial intelligence technologies for on-board analysis of sensor data. The surveillance camera is the most informative sensor, and the object detection function is in demand. Developing accurate detectors of visual objects of interest is a promising direction, but the limited computing resources and the weight constraints of the drone complicate the task. Resource restrictions make it impossible to implement on a compact drone visual data analysis models adapted to the full range of possible observation conditions and to the variety of modifications of objects of interest. This creates the need for computationally efficient models and for algorithms that adapt them to the new operating conditions inherent in a specific application area.
In terms of computational efficiency and generalizing power, convolutional neural networks lead among the models for visual information analysis. However, training and retraining convolutional networks requires significant computing resources and a large number of labeled training samples. The computational load of retraining a model to adapt it to new operating conditions can be reduced by transfer learning techniques, which copy the first layers of a network trained on ImageNet or another large dataset [1,2]. However, the layers of high-level feature representation need to be learned from scratch, and it is difficult to estimate in advance the required number of neurons in each convolutional layer. Therefore, applying the principles of growing neural gas is a promising approach to unsupervised learning of the high-level layers and to automatic determination of the necessary number of neurons. The output layers of the detector model then require fine-tuning, which is typically implemented as one of the modifications of the error backpropagation algorithm [2,3]. However, this algorithm is characterized by a low convergence rate and a tendency to get stuck in local minima of the loss function. Alternative metaheuristic search optimization algorithms exist, but their effectiveness in fine-tuning networks has been scarcely explored [3].
That is why research aimed at developing a model and an effective training algorithm for an object detector under conditions of limited computing resources and training data is relevant.
Let a dataset with Z classes, characterized by small-sized objects on aerial photos, be given. The total number of ground-truth bounding boxes per class does not exceed 200 samples. Moreover, the structure of the vector of model parameters g of the object detector is known (1), together with the constraints on these parameters. It is necessary to find the optimal values of the parameters g (1) that provide the maximum value of the complex criterion J.
When the object detector functions in its inference mode, it is necessary to provide high confidence of localization and classification of objects of interest on test images.

REVIEW OF THE LITERATURE
The works [4,5] proposed visual feature extraction models based on Haar-like filters, histograms of oriented gradients, local binary patterns, histograms of visual words and other local information descriptors. These models ignore high-level contextual information, which decreases the effectiveness of small-sized object detection when the training set is limited. In addition, non-hierarchical feature representation models are computationally laborious when observations vary widely [5].
In image analysis problems, numerous hierarchical feature representation models based on CNNs are widely used [6]. VGG-16, VGG-19, ResNet-50, GoogleNet, MobileNet and SqueezeNet are the most popular of them [5,7]. These networks differ in the number of layers and in the presence of residual connections and multiscale filters in the layers. It is known that models trained on the ImageNet dataset accumulate important information for the analysis of visual images [5,8]. Although many target domains are far from the ImageNet context, the first few layers of trained networks can be reused. In addition, narrowing the scope of application, for example by reducing the number of classes, makes it possible to reduce the resource requirements of the object detector.
It was proposed in [8,9] to fine-tune an object detector based on the VGG-16 convolutional network using mini-batch stochastic gradient descent. However, successful training would require a large training set and a few days of computation on a graphics processing unit. In [10], it was proposed to scan the normalized high-level feature map with a sliding window and to perform classification analysis at each window position. In [11], it was proposed to carry out classification analysis of a high-level feature representation using information-extreme decision rules. The main idea of this approach is to transform the input space of primary features into a binary Hamming space in which radial-basis decision rules are built. This approach is computationally efficient because it uses only simple comparisons with thresholds and Hamming distance calculation based on logical XOR and bit counting. However, neither a way to adapt the high-level layers of the feature extractor to the application domain nor a fast algorithm for optimizing the feature thresholds was proposed. Random-forest-based feature induction and boosted similarity sensitive coding are two promising approaches to speeding up threshold optimization for binary feature encoding, but their integration aspects have not been investigated [12].
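The threshold comparison and XOR-based Hamming distance mentioned above can be illustrated with a minimal sketch. The function names, the threshold values and the container radius are hypothetical and chosen only for illustration; they are not taken from [11].

```python
def encode(features, thresholds):
    """Pack threshold comparisons into an int: bit i is 1 when features[i] > thresholds[i]."""
    code = 0
    for i, (value, threshold) in enumerate(zip(features, thresholds)):
        if value > threshold:
            code |= 1 << i
    return code


def hamming_distance(a, b):
    """XOR leaves a 1 in every differing bit position; counting the ones gives the distance."""
    return bin(a ^ b).count("1")


def in_container(code, support_code, radius):
    """Radial-basis rule: a point belongs to the class if it lies within the radius."""
    return hamming_distance(code, support_code) <= radius


# Hypothetical thresholds, support vector and sample (assumptions for the example).
thresholds = [0.5, 0.5, 0.5, 0.5]
support = encode([0.9, 0.1, 0.8, 0.2], thresholds)
sample = encode([0.7, 0.6, 0.9, 0.1], thresholds)
belongs = in_container(sample, support, radius=1)
```

Storing the binary code in a single integer keeps both the comparison and the distance computation down to a few bitwise operations, which is the property that makes this family of decision rules attractive for on-board use.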
The works [7,13] proposed unsupervised learning of convolutional layers based on an autoencoder or a restricted Boltzmann machine, which require a large amount of resources to obtain an acceptable result. In [14,15], it is proposed to combine the principles of neural gas and sparse coding for learning convolutional filters from unlabeled datasets. This approach has a soft competitive learning scheme, which facilitates robust convergence to close-to-optimal distributions of the neurons over the data. Embedding sparse coding algorithms makes it possible to improve the noise immunity and generalization ability of the feature representation. However, the number of neurons is not known in advance and is assigned at the discretion of the developer.
Using growing neural gas principles to determine the required number of neurons automatically is a promising approach to training the high-level layers of a convolutional network [15]. However, inserting new neurons with a fixed insertion period distorts the learned structures and destabilizes the learning process. It was shown in [16] that learning can be stabilized by replacing the neuron insertion period with a threshold on the maximum distance between a neuron and every training datapoint assigned to it. Still, the neuron update mechanisms have not yet been revised to adapt the learning process to the sparse representation of observations.
The problem of object detection based on the feature maps of a convolutional network is solved using the detection layers of YOLO, Faster R-CNN and SSD [17,18]. An important part of these layers is the bounding box regression model, which provides precise object localization in the image. However, training such layers with stochastic gradient descent is ineffective when the training dataset and computing resources are limited. One promising way to implement the bounding box regression model is the ELM, which is characterized by rapid training that obtains the least squares solution of the regression problem [19]. To eliminate the overfitting that occurs when the number of hidden layer nodes is large, incremental learning that successively adds hidden nodes is worth investigating. The feature extractor can then be fine-tuned with metaheuristic algorithms as an alternative to gradient descent. Among them, it is worth highlighting the simulated annealing algorithm, which is characterized by better convergence and a lower probability of getting stuck in a "bad" local optimum [20]. However, its use for fine-tuning convolutional filters remains insufficiently studied.

MATERIALS AND ALGORITHMS
To develop a data analysis model under the limited training set and computing resources of a compact aerial drone, it is essential to make maximum use of all available a priori information. The transfer learning technique is one example of using a priori information accumulated in a trained network for future reuse [1,2]. This technique allows the lower layers of the model to be borrowed from a pre-trained deep network and the top layers to be adapted to the requirements of the particular domain. However, when small objects are detected, the zone of interest contains little information, and the context, which helps to eliminate uncertainty, becomes increasingly important. Fig. 1 depicts the proposed architecture for detecting small objects, based on a combination of transfer learning and contextual information obtained by concatenating the feature maps of different layers of the artificial neural network. Upscaling is used to give every channel of the feature map a uniform shape. Concatenation with upscaling is considered a single upscale-concatenation layer.
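The upscale-concatenation layer can be sketched as follows. This is a minimal illustration that assumes nearest-neighbour upscaling, square channels and integer scale factors; the interpolation method is not specified in the text, and the function names are hypothetical.

```python
def upscale_nearest(channel, factor):
    """Nearest-neighbour upscaling of a 2D channel (list of rows) by an integer factor."""
    out = []
    for row in channel:
        wide = [value for value in row for _ in range(factor)]
        out.extend([list(wide) for _ in range(factor)])
    return out


def upscale_concat(feature_maps):
    """feature_maps: list of maps, each a list of square 2D channels.
    Every channel is upscaled to the largest spatial size, then all channels are stacked."""
    target = max(len(fm[0]) for fm in feature_maps)
    stacked = []
    for fm in feature_maps:
        size = len(fm[0])
        if target % size != 0:
            raise ValueError("spatial sizes must differ by an integer factor")
        factor = target // size
        stacked.extend(upscale_nearest(channel, factor) for channel in fm)
    return stacked


# Hypothetical toy maps: one coarse 2x2 channel and two fine 4x4 channels.
coarse = [[[1, 2], [3, 4]]]
fine = [[[0] * 4 for _ in range(4)], [[9] * 4 for _ in range(4)]]
merged = upscale_concat([coarse, fine])
```

After the call, `merged` holds three 4x4 channels, so each pixel of the output map carries features from both the coarse and the fine layer.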

Figure 1 - Generalized architecture of the detector

Inception, Xception, VGG and Fire, among others, are popular choices of modules for constructing deep convolutional networks. These modules have different microarchitectures, which implies different computational complexity and learning efficiency. We propose to adopt the lower layers from a pre-trained SqueezeNet network, which consists of Fire modules and is characterized by high computational efficiency. The upper layers of the network can then be built from simple VGG modules, which afford significant flexibility with regard to different learning techniques.

At the first stage of the training algorithm, we propose unsupervised pre-training of the high-level layers of the network to maximize the use of unlabeled domain training samples. To ensure the noise immunity and informativeness of the feature representation, it is proposed to calculate the activation of each feature map pixel using the orthogonal matching pursuit algorithm [12].
It is proposed to carry out unsupervised training of the high-level layers of the neural network using growing sparse coding neural gas, which is based on the principles of growing neural gas and sparse coding. The dataset for training the high-level filters of the convolutional network is formed by partitioning input images or feature maps into patches. These patches are reshaped into 1D vectors and fed to the growing sparse coding neural gas algorithm, the basic stages of which are given below:
1) the counter of learning vectors is initialized, t := 0;
2) two initial nodes (neurons) $w_a$ and $w_b$ are assigned by random choice from the training dataset. Nodes $w_a$ and $w_b$ are connected by an edge whose age is zero. These nodes are considered non-fixed;
3) the next vector x is selected and set to unit length (L2-normalization);
4) each basis vector is set to unit length (L2-normalization);
5) the measure of similarity of the input vector x to the basis vectors is calculated [...];
[...]
9) if the similarity of the winner $w_{s_0}$ to x (computed from $w_{s_0}^T x$) is not less than the threshold $\nu$, we proceed to step 12. Otherwise, we add a new non-fixed neuron $w_r$ at the point that coincides with the input vector, $w_r = x$; in addition, a new edge connecting $w_r$ and $w_{s_0}$ is added, and we proceed to step 13;
10) node $w_{s_0}$ and its topological neighbors (the nodes connected with it by an edge) are displaced in the direction of input vector x according to Oja's rule [14] by the formulas $\Delta w_{s_0} = \varepsilon_b \eta_t y_0 (x - y_0 w_{s_0})$, $y_0 := w_{s_0}^T x$, [...];
11) if the similarity of $w_{s_0}$ to x is not less than the threshold $\nu$, we label neuron $w_{s_0}$ as fixed;
12) if $w_{s_0}$ and $w_{s_1}$ are connected by an edge, its age is reset to zero; otherwise, a new edge with zero age is formed between $w_{s_0}$ and $w_{s_1}$;
13) all edges in the graph with an age of more than $a_{max}$ are removed. If some nodes are left without incident edges (become isolated), these nodes are also removed;
14) if $t < t_{max}$, the step counter is incremented, t := t + 1, and we proceed to step 3; otherwise, we proceed to step 15;
15) if all neurons are fixed, the algorithm stops; otherwise, we proceed to step 3 and a new learning epoch begins (repetition of the training dataset).
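The displacement of the winner toward the input vector in step 10 can be sketched as follows. This is a minimal illustration that assumes the common form of Oja's rule, Δw = rate · y · (x − y·w) with y = wᵀx; the single `rate` argument stands for the product of the update-force constant and the current learning rate, and all function names are hypothetical.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit length (steps 3 and 4)."""
    norm = math.sqrt(sum(c * c for c in vector))
    return [c / norm for c in vector]


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


def oja_update(w, x, rate):
    """One Oja step (assumed form): w <- w + rate * y * (x - y * w), where y = w.x."""
    y = dot(w, x)
    return [wi + rate * y * (xi - y * wi) for wi, xi in zip(w, x)]


# Toy winner and input (assumptions for the example).
x = l2_normalize([1.0, 1.0])
w = [1.0, 0.0]
w_new = l2_normalize(oja_update(w, x, rate=0.1))
similarity_before = dot(w, x)
similarity_after = dot(w_new, x)
```

In this toy case the updated and re-normalized winner is closer to x than before, which is the behaviour the fixing test in step 11 relies on.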
Concatenating the feature maps of different layers of the artificial neural network leads to a high-dimensional feature representation. To counter that, we propose to use one of the simplest techniques, Principal Component Analysis. This allows removal of the features from the low levels of the network that are insensitive to the specific domain context.
In addition, it is proposed to carry out classification analysis of the feature map in the framework of boosting and the so-called information-extreme technology. This makes it possible to synthesize a classifier with low computational complexity and relatively high accuracy when the training set size is limited [11].
The boosted information-extreme classifier, which evaluates whether the j-th datapoint $x_j$ (a pixel of the feature map) with $N_1$ features belongs to one of the Z classes, performs feature encoding using boosted trees and decision rules constructed in the radial basis of the binary Hamming space. In this case, there is the training set $D = \{(x_j, y_j) \mid j = \overline{1, n}\}$. The training of the boosted information-extreme classifier is performed according to the following steps.
1. Initialize weights $w_j = 1/n$.
2. For k = 1 … K do
3. Bootstrap $D_k$ from $D$ using probability distribution [...].
4. Train decision tree $T_k$ on $D_k$ using the entropy criterion to measure the quality of a split.
5. Binary encoding of datapoint $x_j$ from $D$ using concatenation of decision paths from $T_1, \ldots, T_k$. A datapoint $x_j$ is classified in the leaf node by boosted trees. Each decision node receives a unique identifier. If a test is satisfied in a node, then the corresponding bit is asserted. Finally, the encodings for each tree are combined by concatenation (or, more generally, by hashing the feature IDs onto a smaller-dimensional space) [12]. [...] output of the step. Hence the equality $n = \sum_{z} n_z$ is met.
6. Build information-extreme decision rules in the radial basis of the binary Hamming space and compute the optimal information criterion [...], where $E_z$ is computed as the normed modification of S. Kullback's information measure [11] [...]. In order to increase learning efficiency, it is common to reduce the multi-class classification problem to a series of two-class problems by the "one-against-all" principle. In this case, to avoid the problems of class imbalance due to the majority of negative datapoints in [...]

The network with R hidden nodes can approximate these n samples with zero error when all the parameters are allowed to be adjusted freely, i.e., there exist $\beta_r$, $w_r$ and $b_r$ [...]. The above n equations can be compactly rewritten as the matrix equation $H\beta = Y$.
To address such network training issues as redundant hidden nodes and a slow convergence rate, it is proposed to use the orthogonal incremental ELM as the regression model. It avoids redundant nodes and obtains the least squares solution of the equation $H\beta = Y$ by incorporating the Gram-Schmidt orthogonalization method into the well-known incremental extreme learning machine. Rigorous proofs of convergence of the orthogonal incremental extreme learning machine are given by Li Ying [19]. The training of the orthogonal incremental ELM is performed according to the following steps.
1. Set the maximum number of iterations $L_{max}$ and the expected learning accuracy $E_0$.
2. For L = 1 … $L_{max}$ do
3. Increase the number of hidden nodes by one: r = r + 1.
4. Randomly generate one hidden node and calculate its output vector $h_r$.
5. [...] If $\|v_r\| \geq \varepsilon$, calculate the output weight for the new hidden node $\beta_r = v_r^T E / (v_r^T v_r)$ and calculate the new residual error $E = E - \beta_r v_r$ [...].

For training the classifier and the regression model, or for fine-tuning the feature extractor, we collect the training set using a matching strategy that determines which default bounding boxes correspond to ground truth bounding boxes. We consider aerial survey conditions with a downward-oriented camera and high altitude (above 100 m); therefore, multiple anchor boxes per feature map pixel are not used. Instead, each default bounding box is defined as a feature map pixel reprojected onto the input image. Each ground truth box is matched to the default box with the best Jaccard overlap, and default boxes are matched to any ground truth box with IoU higher than 0.4. The regression model is trained only on positive samples (matched default boxes).
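The matching strategy described above can be sketched in a few lines. This is a minimal illustration: the box format (x_min, y_min, x_max, y_max), the 16-pixel grid of default boxes and the function names are assumptions made for the example; only the two matching rules and the 0.4 threshold come from the text.

```python
def iou(a, b):
    """Jaccard overlap of two boxes given as (x_min, y_min, x_max, y_max)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def match_boxes(default_boxes, gt_boxes, threshold=0.4):
    """Return {default_index: gt_index} for the positive (matched) default boxes."""
    matches = {}
    if not default_boxes or not gt_boxes:
        return matches
    # Rule 1: every ground truth box claims the default box with the best overlap.
    for g, gt in enumerate(gt_boxes):
        best = max(range(len(default_boxes)), key=lambda d: iou(default_boxes[d], gt))
        matches[best] = g
    # Rule 2: any other default box is positive if its best IoU exceeds the threshold.
    for d, box in enumerate(default_boxes):
        if d in matches:
            continue
        overlaps = [iou(box, gt) for gt in gt_boxes]
        g = max(range(len(gt_boxes)), key=overlaps.__getitem__)
        if overlaps[g] > threshold:
            matches[d] = g
    return matches


# Hypothetical 2x2 grid of default boxes (16-pixel cells) and two ground truth boxes.
defaults = [(0, 0, 16, 16), (16, 0, 32, 16), (0, 16, 16, 32), (16, 16, 32, 32)]
ground_truth = [(2, 2, 14, 14), (20, 20, 26, 26)]
positives = match_boxes(defaults, ground_truth)
```

Rule 1 guarantees that a very small object still receives one positive default box even when its IoU with every cell is below the threshold, as happens for the second ground truth box in this toy case.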
The complex criterion J (2) of the learning efficiency of the object detector should take into account both the effectiveness of classification analysis $J_{Cls}$ and the effectiveness of bounding box prediction $J_{Loc}$. It is proposed to calculate the criterion of object classification effectiveness from the formula $J_{Cls} = \frac{1}{Z} \sum_{z=1}^{Z} E_z$.
The effectiveness of bounding box prediction is proposed to be calculated from the formula $J_{Loc} = \frac{1}{\hat{n}} \sum_{i=1}^{\hat{n}} IoU_i$.

At the last stage of the training algorithm, the high-level layers of the feature extractor must be fine-tuned after unsupervised learning in order to take into account the significant imbalance between objects of interest and background patches. It is proposed to use simulated annealing as the metaheuristic search optimization algorithm. The efficiency of the simulated annealing algorithm depends on the implementation of the create_neighbor_solution procedure, which forms a new solution $s_i$ on the i-th iteration of the algorithm. Fig. 2 shows pseudocode of the simulated annealing algorithm, which runs for epochs_max iterations; on each of them the function f() is calculated by passing the labeled training dataset through the detection model and computing the complex criterion (2) [3,20].

Figure 2 - Pseudocode of the simulated annealing algorithm

An analysis of the pseudocode in Fig. 2 shows that the current solution $s_{current}$, relative to which new best solutions $s_{best}$ are sought, is updated either when a new solution increases criterion (2) or randomly according to the Gibbs distribution. The initial search point, formed by the create_initial_solution procedure, can be either randomly generated or the result of preliminary training by another algorithm. To generate new solutions in the create_neighbor_solution procedure, it is proposed to use the simplest non-adaptive algorithm, which can be represented by the formula [3]: $s_i = s_{current} + \text{uniform\_random}(-1, 1) \cdot \text{step\_size}$.

The non-maximum suppression algorithm is used to filter out redundant detections of one and the same image object [17,18].
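A greedy form of non-maximum suppression can be sketched as follows. This is a minimal illustration; the box format (x_min, y_min, x_max, y_max), the IoU threshold of 0.5 and the function names are assumptions, since the text does not state the settings used.

```python
def iou(a, b):
    """Jaccard overlap of two boxes given as (x_min, y_min, x_max, y_max)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Keep the highest-scoring box of every group of strongly overlapping boxes.
    Returns the indices of the kept boxes in order of decreasing score."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    kept = []
    for i in order:
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in kept):
            kept.append(i)
    return kept


# Two overlapping detections of one object and one separate detection (toy data).
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
kept = non_max_suppression(boxes, scores)
```

In the toy data the first two boxes overlap with an IoU of about 0.68, so only the higher-scoring one survives together with the isolated third box.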
Thus, we propose a model and object detector training algorithm based on the fusion of different techniques with the aim of maximizing obtained domain context information under the small labelled training set and limited computational resources constraints.

EXPERIMENTS
To train the object detector, 200 images from the Inria Aerial Image Labeling Dataset were used [21]. Each image has a resolution of 5,000×5,000 pixels. 5,00 unlabeled 224×224 pixel images were generated through random cropping with rotation for unsupervised learning. In addition, 200 labeled 224×224 pixel images were generated for supervised learning. The labeled training set was augmented to 1000 instances by adding noise, changing contrast, rotating and cropping.
A large number of vehicles in urban areas are present in the Inria Aerial Image Labeling Dataset. Vehicles were selected as the objects of interest, and the urban area was considered the usage domain. The set of classes contained Z=3 classes: the first class corresponded to cars, the second to trucks, and the third to the background. The size of objects in random images varied from 7×7 to 10×10 pixels.
In accordance with the transfer learning technique, the first 7 Fire modules of the pre-trained SqueezeNet convolutional neural network were adopted. As a result, each input image was encoded into a feature map of 13×13×384 dimensions.
The subsequent layer, which consists of filters with 3×3 kernels and stride=1, is trained without supervision on the unlabeled datasets from the usage domain. The output feature map is formed by concatenating the feature maps of the Fire 6 and Fire 7 modules and of the last convolutional layer.
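Forming the training vectors for such a layer can be sketched as follows. This is a minimal illustration that assumes no padding at the map border and uses a toy map with 2 channels instead of 384; the function name is hypothetical.

```python
def extract_patches(channels, kernel=3, stride=1):
    """channels: list of 2D lists (one per channel) with equal height and width.
    Returns one flat vector per kernel position, channel after channel."""
    height, width = len(channels[0]), len(channels[0][0])
    patches = []
    for top in range(0, height - kernel + 1, stride):
        for left in range(0, width - kernel + 1, stride):
            vector = []
            for channel in channels:
                for row in channel[top:top + kernel]:
                    vector.extend(row[left:left + kernel])
            patches.append(vector)
    return patches


# Toy 13x13 feature map with 2 channels (the real map has 384 channels).
feature_map = [[[r * 13 + c for c in range(13)] for r in range(13)] for _ in range(2)]
patches = extract_patches(feature_map)
```

Without padding, a 13×13 map yields 11×11 = 121 patch vectors per image, each of length 9 times the number of channels.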
It is proposed first to train the detector using the last layer of the feature extractor pre-trained without supervision via growing sparse coding neural gas, without fine-tuning. In this case, a fixed value of the training dataset reconstruction ν=0.8 is used during training. In addition, in the information-extreme classifier of the feature map pixels, the number of nodes in the decision trees is constrained to 16. The depth of each tree is set to 6.
To improve the results of machine learning of the detector, the informativeness of the feature description is increased by fine-tuning the convolutional layers trained without supervision. The following parameters of the simulated annealing algorithm were used: c=0.98, $T_0$=10, epochs_max=6000, step_size=0.001. Each fine-tuning step involves re-training of the regressor and the classifier. To maximize the model's generalizing ability and minimize computational complexity, we sequentially tuned the hyperparameter ν, which is responsible for the density of neuron distribution, with a step of 0.1.
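How these parameters enter a simulated annealing loop can be sketched as follows. This is a minimal illustration and not the pseudocode of the paper: it assumes that c is a geometric cooling factor applied to the temperature after every epoch, that neighbors are drawn as s + uniform_random(-1, 1) · step_size per coordinate, and that worse solutions are accepted with probability exp(Δ/T). The toy objective stands in for criterion (2), which in the real system requires a pass of the labeled training set through the detector.

```python
import math
import random


def simulated_annealing(f, initial, c=0.98, t0=10.0, epochs_max=6000,
                        step_size=0.001, seed=0):
    """Maximize f starting from `initial`; returns (best solution, best value)."""
    rng = random.Random(seed)
    current, f_current = list(initial), f(initial)
    best, f_best = list(current), f_current
    temperature = t0
    for _ in range(epochs_max):
        # Non-adaptive neighbor: perturb every coordinate within +/- step_size.
        candidate = [v + rng.uniform(-1.0, 1.0) * step_size for v in current]
        f_candidate = f(candidate)
        if f_candidate > f_best:
            best, f_best = list(candidate), f_candidate
        delta = f_candidate - f_current
        # Accept improvements always, worse moves with Gibbs probability exp(delta / T).
        if delta > 0 or rng.random() < math.exp(delta / temperature):
            current, f_current = candidate, f_candidate
        # Assumed geometric cooling; the floor avoids division by zero in long runs.
        temperature = max(temperature * c, 1e-12)
    return best, f_best


def toy_criterion(s):
    """Hypothetical stand-in for criterion (2): a smooth function with one maximum."""
    return -((s[0] - 0.3) ** 2 + (s[1] + 0.2) ** 2)


solution, value = simulated_annealing(toy_criterion, [0.0, 0.0],
                                      epochs_max=2000, step_size=0.05)
```

Under the geometric-cooling assumption, the temperature falls by a factor of about 0.98^100 ≈ 0.13 every hundred epochs, so the search behaves like a random walk only at the very beginning and like hill climbing for most of the run.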
Prior unsupervised learning of the upper convolutional layers on unlabelled samples from the intended usage domain is aimed at increasing the efficiency of subsequent supervised machine learning. It is therefore worth considering how the parameters of the growing sparse coding neural gas algorithm used in unsupervised learning influence the results of supervised learning. Table 1 presents the machine learning results and the quantity $N_c$ of generated convolutional filters (neurons) as a function of the parameter ν, which characterises the accuracy of coverage of the training set by the convolutional filters. Analysis of Table 1 demonstrates that the quantity of neurons and the values of the partial and general optimisation criteria (2) increase as hyperparameter ν grows. For ν ≤ 0.8, model accuracy on the test set increases with ν, whereas a further increase in this parameter degrades the quality of the results due to overfitting. The quantity of principal components for each value of ν was selected in accordance with the Kaiser criterion: only the principal components with eigenvalues exceeding 1 were kept.
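The Kaiser criterion itself reduces to a simple filter over the eigenvalues. The sketch below is a minimal illustration with hypothetical eigenvalues; it assumes they come from a correlation matrix of standardized features, which is the usual setting in which the threshold of 1 is meaningful.

```python
def kaiser_select(eigenvalues):
    """Indices of the principal components whose eigenvalue exceeds 1."""
    return [i for i, value in enumerate(eigenvalues) if value > 1.0]


def retained_variance(eigenvalues, selected):
    """Share of total variance carried by the selected components."""
    return sum(eigenvalues[i] for i in selected) / sum(eigenvalues)


# Hypothetical eigenvalues, for illustration only.
eigenvalues = [3.2, 1.4, 0.9, 0.5]
selected = kaiser_select(eigenvalues)
share = retained_variance(eigenvalues, selected)
```

Here the first two components are kept and together carry about 77% of the total variance.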
Fig. 3 provides the graphical representations of the dependency of learning efficiency information criterion (4) on container radii of each class.These can be used to evaluate the accuracy and noise immunity of the synthesized classification decision rules.
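A curve of this kind can be sketched by sweeping the container radius and evaluating an information criterion at each value. The sketch below is illustrative only: the criterion is one normed form of the Kullback measure found in the information-extreme literature, and its exact expression in [11] is not reproduced in this section, so the formula, the value of the small constant and the toy Hamming distances are all assumptions.

```python
import math


def normed_kullback(alpha, beta, varsigma=1e-3):
    """Assumed normed Kullback-type criterion: 1 for error-free rules, 0 when alpha + beta = 1."""
    err = alpha + beta
    scale = math.log2((2 + varsigma) / varsigma)
    return (1 - err) * math.log2((2 - err + varsigma) / (err + varsigma)) / scale


def criterion_by_radius(own_distances, other_distances, max_radius):
    """Criterion value for every integer container radius from 0 to max_radius."""
    curve = []
    for d in range(max_radius + 1):
        # Own-class points outside the container are missed (false negatives).
        beta = sum(dist > d for dist in own_distances) / len(own_distances)
        # Points of other classes inside the container are false positives.
        alpha = sum(dist <= d for dist in other_distances) / len(other_distances)
        curve.append(normed_kullback(alpha, beta))
    return curve


# Hypothetical Hamming distances to the support vector of one class.
own = [0, 1, 1, 2, 3]
other = [2, 4, 5, 5, 6]
curve = criterion_by_radius(own, other, max_radius=6)
best_radius = max(range(len(curve)), key=curve.__getitem__)
```

The radius at which the curve peaks is the one that best separates the class from the others; a broad plateau around the peak would indicate decision rules that tolerate noise in the binary codes.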
Analysis of Fig. 4 shows that prior unsupervised learning based on the growing sparse coding neural gas algorithm improves the final outcome of supervised learning with the simulated annealing algorithm. With prior unsupervised learning, the global maximum of criterion (2) is reached more than 10 times faster. In addition, testing of the resulting model on the training set indicates that prior unsupervised learning reduces overfitting effects when the labelled training set is limited.
The information criterion (2) calculated for the training set was $J_{train}$=0.921, which corresponds to 96% of objects correctly detected on the test dataset when prior unsupervised learning was applied. Without unsupervised learning, the results were substantially different: $J_{train}$=0.3011 and 85% of objects correctly detected on the test dataset.
Thus, the proposed algorithm of prior unsupervised training of the upper layers increases the learning efficiency criteria and the percentage of objects detected in the test images. In addition, it reduces the overfitting effect and speeds up convergence to the global maximum in the subsequent supervised learning when the labelled training set is limited. However, the relationship between the learning efficiency criterion and the parameters of the simulated annealing algorithm was not considered in the present study. Further research will therefore focus on improving the detector model and on developing algorithms for tuning parameters in the course of machine learning.

Figure 4 - Dependency of criterion (2) on the number of learning epochs: 1 - before prior unsupervised learning; 2 - after prior unsupervised learning

ABSTRACT
Context. A computationally simple model and an efficient training algorithm for an on-board system for detecting objects on the terrain have been developed. The object of the study is the process of detecting small-sized objects on aerial photographs under conditions of limited computing resources and uncertainty caused by the small size of the labeled training set. The subject of the study is the model and the method of training the model for detecting small-sized objects on aerial photographs.

$B_k$ is a set of ground-truth bounding boxes which correspond to objects of interest on the k-th image;
$b_r$ is the bias of the r-th hidden node;
$b_z$ is a support vector of data distribution in class $X_z^o$;
$\{d\}$ is a set of concentric radii of the hyperspherical container in binary Hamming space;
$E_z$ is a training efficiency criterion of the decision rule for the $X_z^o$ class;
$H$ is the hidden layer output matrix of the SHFN;
$I_k$ is the k-th RGB image;
$IoU_i$ is the IoU between the ground truth box and the appropriate i-th predicted box;
$K_1$ is the size of training sets;
$K_2$ is the size of test sets;
$M$ is the number of regression model outputs;
$N$ is the dimension of an instance;
$N_2$ is the number of induced binary features;
$n$ is the size of the dataset;
$n_z$ is the volume of the training set of the $X_z^o$ class;
$o_j$ is the output of the network with respect to the j-th input vector $x_j$;
[...]
$\Delta w_{s_0}$ is the correction vector of weight coefficients of the neuron-winner;
$\Delta w_{s_n}$ is the correction vector of weight coefficients of the topological neighbors of the neuron-winner;
$x_j$ is the j-th input vector;
$y_j$ is the label of the j-th instance;
$\alpha_z$ is the false-positive rate of classification decisions regarding belonging of input vectors to the $X_z^o$ class;
$\beta_r$ is the weight vector linking the output layer with the r-th hidden node;
$\beta_z$ is the false-negative rate of classification decisions regarding belonging of input vectors to the $X_z^o$ class;
$\varepsilon_b$ is the constant of the update force of weight coefficients of the neuron-winner;
$\varepsilon_n$ is the constant of the update force of weight coefficients of topological neighbors of the neuron-winner.

[...] datasets, a synthetic class is an alternative to the $X_z^o$ class. The synthetic class is represented by $n_z$ datapoints of the remaining classes, which are the closest to support vector $b_z$.
7. Test the obtained information-extreme rules on dataset $D$ and compute the error rate for each sample from $D$. Under the inference mode, the decision on belonging of datapoint b to one class from set { [...] loop.

Another important task in detecting precise object boundaries, hampered by subsampling, is bounding box prediction. We propose to implement bounding box prediction on the basis of a regression model based on a [...]

Figure 3 - Dependency of classifier learning efficiency information criterion on class container radius: a - class $X_1^o$; b - class $X_2^o$

DISCUSSION