SCOPING ADVERSARIAL ATTACK FOR IMPROVING ITS QUALITY



ABBREVIATIONS
W - network's weight matrix;
x - network's input vector, image pixel brightness vector;
y - unit vector, which defines object attribution to one class or another;
z - network's output vector;
ẑ - desired (target) neural network's output;
α - algorithm parameter;
$\sigma_k(\cdot)$ - softmax activation function of the k-th output layer neuron, which computes the layer output from its input.

INTRODUCTION
An increasing number of tasks are being solved with neural-network-based solutions. Neural networks have held the dominant position in image recognition since 2012, when AlexNet took first place in the ImageNet Large Scale Visual Recognition Challenge by a large margin [1]. The growing use of neural networks in security, video surveillance systems, self-driving cars and robots creates a growing need to analyze neural networks and their vulnerabilities.
Pretrained neural networks are used by many companies to reach their goals. The wide spread of neural networks with similar architectures trained on publicly available datasets exposes a set of security vulnerabilities. Given a modification of the input data small enough to be imperceptible to a human, it is possible to make the neural network misclassify the data or even output a specific class. These are known as non-targeted and targeted adversarial attacks, respectively.
By exploring attack algorithms against trained neural networks and understanding the attack preconditions and the structural image changes they cause, it will be possible to make neural networks more efficient and more robust to input data perturbations.
In most cases the process of neural network inference can be treated as a black box: even though the architecture and training dataset are known, giving a clear interpretation of the underlying weights is a difficult problem. Lately a range of research papers attempting to track and formalize the network inference model through its input data, using optimization and statistics theories, has been published (e.g. [2][3][4]). Despite that, development of a toolchain for understanding and diagnosing machine learning remains a priority research problem.
The object of study is the neural network attack process.
The subject of study is the types of adversarial attacks, the consequences caused by them and the possible reasons for their existence.
The purpose of the work is to: 1) develop different attack algorithms on a single-layer trained neural network given the results of a preliminary analysis of the network's weights; 2) estimate the image quality loss after the modification; 3) compare the attack results obtained by the developed algorithms with different adversarial attack types previously published in research papers.

PROBLEM STATEMENT
2. Find such a perturbation , which for a given value { } where ˆ((
3. Find such a perturbation , which for a deliberately chosen x , such that

REVIEW OF THE LITERATURE
The term "adversarial attack" was first introduced in [3], where it was shown that a minor change to an image's pixels leads to a misclassification during neural network inference. What makes the problem more severe is that in many cases these changes cannot be noticed by a human. As shown by the authors, such a perturbation is not a random artifact of the network's training: the very same modified image can be successfully transferred to a different network trained on a different but similar dataset. To achieve that goal, a box-constrained L-BFGS algorithm was introduced. Another algorithm, FGSM (Fast Gradient Sign Method), was developed as an enhancement, where a first-order approximation of the loss function is used to generate adversarial examples [6]. I-FGM is an iterative algorithm that builds on the previously introduced ideas and uses the gradient of the loss function. These algorithms have a simpler interpretation and perform the attack faster. In 2017 an international neural network defense and attack competition was held by the Google Brain team. The competition's winners developed an algorithm to attack neural networks with a known architecture. As was shown, the generated images were able to mislead not only the target network, but also other networks trained on other datasets of a similar kind, or networks with another architecture. In all cases the attacks were performed on a bitmap to keep the perturbations intact.
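To make the first-order idea behind FGSM concrete, the following minimal sketch applies one sign-of-gradient step to a softmax regression. It is an illustration under assumptions, not the implementation from [6]: the function names, the weight layout `W` of shape (pixels, classes) and the step size `eps` are hypothetical choices made here.

```python
import numpy as np

def softmax(u):
    # Numerically stable softmax of a 1-D vector of logits.
    e = np.exp(u - np.max(u))
    return e / e.sum()

def fgsm_step(x, y_onehot, W, b, eps):
    """One FGSM-style step against the model z = softmax(x @ W + b).

    Assumed layout (hypothetical): x has shape (I,), W has shape (I, K).
    """
    z = softmax(x @ W + b)
    # Gradient of the cross-entropy loss with respect to the input x.
    grad_x = W @ (z - y_onehot)
    # Move every pixel by eps in the direction that increases the loss,
    # then clamp back into the valid brightness range [0, 1].
    return np.clip(x + eps * np.sign(grad_x), 0.0, 1.0)
```

Because the cross-entropy of a linear model is convex in the input, a step of this kind cannot decrease the loss, which is what makes the single-step variant effective on simple classifiers.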
Many modern computer vision systems used in security-critical domains rely on deep neural networks behind the scenes. That is why many research papers target a real-world attack scenario. For instance, paper [9] offers the road sign attack algorithm RP2. By placing stickers and drawing graffiti on real signs, it was possible to force the network to misclassify a "Stop" sign as a "Speed limit" sign. Moreover, the perturbations were shown to be robust to changes in the angle and distance to the camera.
Other algorithms for generating reliable physical adversarial perturbations are shown in papers [10, 11] and others.
An efficient black-box attack algorithm, ZOO, was presented in [12]. This type of attack can be conducted without any knowledge of the neural network's inner workings; the only requirement is access to its inference engine: one should be able to send input data and get back a class probability distribution. Stochastic coordinate descent is used to optimize the target function. The coordinate to be updated next is chosen using the computed gradient and Hessian. Gradient components are calculated with a finite difference method. To further improve speed, the attack is performed hierarchically: small-scale images are attacked first and are enlarged over time.
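The finite-difference estimates mentioned above can be sketched as follows. This is a generic illustration of coordinate-wise derivative estimation using only queries to a black-box function, not the ZOO implementation from [12]; the function name and the step `h` are assumptions.

```python
import numpy as np

def estimate_coordinate_derivatives(f, x, i, h=1e-3):
    """Estimate df/dx_i and d2f/dx_i^2 using only calls to the black box f."""
    e = np.zeros_like(x, dtype=float)
    e[i] = h
    f_plus, f_0, f_minus = f(x + e), f(x), f(x - e)
    # Symmetric difference for the gradient component.
    grad_i = (f_plus - f_minus) / (2.0 * h)
    # Second-order central difference for the diagonal Hessian entry.
    hess_ii = (f_plus - 2.0 * f_0 + f_minus) / (h * h)
    return grad_i, hess_ii
```

Each estimate costs a fixed number of model queries per coordinate, which is why attacking small images first reduces the total query count.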
An algorithm to generate efficient targeted adversarial images using optimization methods was proposed in [13] (the C&W attack).
Implementations of the described algorithms, as well as some others, can be found in the open-source library cleverhans [14].
Different contemporary methods of generating adversarial examples in the field of deep learning were explored and summarized in [15], where a classification of attacks by characteristics, target goal and features was presented. A broad review of research in the field of machine learning attacks, including an analysis of the precursors of adversarial examples and of defense methods against them, was done in [16].
In this work an attempt has been made to build simplified yet efficient adversarial attack methods on a logistic regression trained on the MNIST dataset. We explore two types of white-box attacks (based on the assumption of full knowledge of the network weights and architecture): targeted and non-targeted. This choice has been motivated by the following factors: 1) the simplicity of the network's configuration and the interpretability of its weights as the importance of image pixels for recognizing a picture as a sample of one class or another; 2) the ability of adversarial examples to efficiently hijack neural networks different from the one for which they were generated (as noted in [17]).

MATERIALS AND METHODS
The MNIST dataset is used for neural network training. Among its advantages are its small size and the ability to make accurate predictions even with simple neural networks. The dataset consists of handwritten digits 0-9 of size 28 × 28. Each digit is a normalized grayscale image. The training subset consists of 60,000 examples, the testing subset of 10,000. Both contain samples of digits handwritten by distinct people. Let's unroll the images into one-dimensional vectors and treat each pixel as a separate input feature. As a preprocessing step the pixel intensities are normalized into the [0, 1] range by dividing their values by 255 (0 stands for black pixels, 1 for white ones).
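The preprocessing described above amounts to a reshape and a division. A minimal sketch, assuming the images are already loaded as an array of shape (N, 28, 28) with 8-bit values (the loading step and the function name are not part of the source):

```python
import numpy as np

def preprocess(images_uint8):
    """Unroll (N, 28, 28) uint8 images into (N, 784) floats in [0, 1]."""
    flat = images_uint8.reshape(len(images_uint8), -1).astype(np.float64)
    # 0 stays black, 255 becomes 1.0 (white).
    return flat / 255.0
```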
A single-layer neural network is built with an input layer of I = 784 neurons (by the number of pixels in the unrolled image) and K = 10 neurons in the output layer (by the number of classes).
It should be noted that the softmax activation function was not chosen at random in the mathematical statement of the neural network training problem. As is known, softmax serves the goal of transforming an arbitrary real-valued vector into a probability distribution over the inferred classes. For example, a network can identify a picture as 8 with a probability of 0.9 and as 6 with a probability of 0.1. As noted above, the key advantage of logistic regression, which will be used further down to build an attack algorithm, is the interpretability of a weight matrix element $W_{ik}$ as the importance or contribution of the i-th image pixel towards classification as the k-th class. Precisely, if $W_{ik} > 0$, it is expected that an increase of the pixel brightness by some $\delta > 0$ will lead to a higher confidence in classifying the image as an example of the k-th class, and if $W_{ik} < 0$, the same increase will lead to a decay of that probability.
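The expectation stated above can be checked with a short derivation. It assumes the model has the form $z = \sigma(u)$ with $u_k = \sum_i W_{ik} x_i + b_k$, where b is the biases vector; $\delta_{km}$ is the Kronecker delta, introduced here only for the derivation.

```latex
z_k = \sigma_k(u) = \frac{e^{u_k}}{\sum_{j=1}^{K} e^{u_j}},
\qquad
u_k = \sum_{i=1}^{I} W_{ik} x_i + b_k,

\frac{\partial z_k}{\partial x_i}
  = \sum_{m=1}^{K} z_k \left( \delta_{km} - z_m \right) W_{im}
  = z_k \left( W_{ik} - \sum_{m=1}^{K} z_m W_{im} \right).
```

So brightening pixel i raises the probability of class k exactly when $W_{ik}$ exceeds the probability-weighted average of that pixel's weights over all classes. The sign of $W_{ik}$ alone is a good guide when the other classes' weights for that pixel are small, and differences between class weights are the more precise quantity.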
A representation of all inferred classes in the form of pixel importance maps for classifying an image as an instance of each class is shown in Fig. 1.
The presented illustrations can be thought of as a generic neural network representation of each digit.
As shown by a detailed analysis of the weight matrix: if its element $W_{ik}$ has a large enough positive value (relative to other matrix elements), then the "whiteness" of the i-th image pixel is important for classifying a digit as an instance of the k-th class; in the opposite case, when $W_{ik}$ is large enough in absolute value but negative, black regions are important for the k-th class. A value of $W_{ik}$ close to zero means that the color of that pixel has no importance for classification as the k-th class. Following this logic, we can use element-wise multiplication to get the pixels equally important for classifying an image as an instance of both classes, and element-wise subtraction to get the regions whose pixel brightness is more important for one class than for the other. Fig. 2 shows an example of element-wise multiplication for digits 0 and 8, where light regions are equally important for both classes.
The result of the subtraction 8 − 0 is shown in Fig. 3. Obviously, subtraction matrices are more significant for performing a targeted attack. By increasing pixel brightness in regions with a large difference (light regions on the figure), one can increase the probability of classifying an image as an instance of the minuend and decrease the probability of classifying it as an example of the subtrahend.
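The two weight-map operations described above can be sketched in a few lines. The weight layout (one column of 784 pixel weights per class) and the function names are assumptions made for illustration.

```python
import numpy as np

def shared_importance_map(W, class_a, class_b, side=28):
    """Element-wise product: large positive values mark pixels that
    push both classes in the same direction."""
    return (W[:, class_a] * W[:, class_b]).reshape(side, side)

def difference_map(W, minuend, subtrahend, side=28):
    """Element-wise difference: positive values mark pixels whose
    brightness favors the minuend class over the subtrahend class."""
    return (W[:, minuend] - W[:, subtrahend]).reshape(side, side)
```

For example, `difference_map(W, 8, 0)` would give the kind of 8 − 0 map used for a 0 → 8 attack, under the stated layout assumption.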
To estimate image quality loss, the $L_\infty$ norm was used in [8], that is, the largest deviation of pixel brightness over the entire image. For images whose pixel values are bounded in the range [0, 255], deviations of up to 15 points were permitted. However, such a metric allows generating images that are nearly unrecognizable when compared to the source, which is not our goal. So a metric that is highly correlated with human perception is needed. The best results can be obtained by using one of the following metrics: MAE (Mean Absolute Error), PSNR (Peak Signal-to-Noise Ratio), SSIM (Structural Similarity Index). The first two are easy to compute and frequently used, but they do not take features of human vision into account. The SSIM metric was introduced as an improvement over MAE and PSNR and is tuned to match the human visual perception system, as shown in [18]. SSIM values lie in the range [−1, 1]. The maximum value signifies that the images are identical. This is the metric used to estimate the quality of the algorithm.
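For orientation, the three metrics can be sketched as below. MAE and PSNR follow their standard definitions. The SSIM function is a deliberately simplified single-window variant computed over the whole image, whereas the metric in [18] averages the same expression over local windows, so its values will differ from those reported in this paper. The function names and the constants 0.01 and 0.03 (the commonly used defaults) are assumptions.

```python
import numpy as np

def mae(a, b):
    return float(np.mean(np.abs(a - b)))

def psnr(a, b, peak=1.0):
    mse = np.mean((a - b) ** 2)
    return float("inf") if mse == 0 else float(10.0 * np.log10(peak ** 2 / mse))

def global_ssim(a, b, peak=1.0):
    """Single-window SSIM over the whole image (a simplification)."""
    c1, c2 = (0.01 * peak) ** 2, (0.03 * peak) ** 2
    mu_a, mu_b = a.mean(), b.mean()
    var_a, var_b = a.var(), b.var()
    cov = np.mean((a - mu_a) * (b - mu_b))
    numerator = (2 * mu_a * mu_b + c1) * (2 * cov + c2)
    denominator = (mu_a ** 2 + mu_b ** 2 + c1) * (var_a + var_b + c2)
    return float(numerator / denominator)
```

Images are assumed to be float arrays in [0, 1], hence `peak=1.0`.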
Here image is the image obtained on the previous algorithm step, or the source image if this is the first step; source_weights is the trained network weight matrix for the original image label; target_weights is the trained network weight matrix for the class towards which we want to change the prediction; min_difference is the minimal difference between the classes' weight matrices for a pixel to be an attack target; step is the pixel brightness change on the current iteration; $\alpha \in \{0; 0.5; 1\}$ defines the algorithm modification.
Note that, together with α, step and min_difference, the number of algorithm steps max_steps can also be treated as an algorithm parameter. Recommended values for the described parameters follow.
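A possible reading of the scoped attack step, built only from the parameter descriptions above, is sketched below. The combination rule `d = alpha * target_weights - (1 - alpha) * source_weights` is an assumption reconstructed from those descriptions and may differ from the authors' exact formula; in particular, for α = 0.5 it equals half of the weight difference, so the scale of min_difference may not match the values quoted in this paper. The function names, the `targeted` flag and the bias vector `b` passed to the loop are hypothetical.

```python
import numpy as np

def attack_step(image, source_weights, target_weights,
                alpha=0.5, min_difference=0.0, step=0.01):
    # Assumed combination rule (not confirmed by the source text).
    d = alpha * target_weights - (1.0 - alpha) * source_weights
    # Scope the attack: only pixels with a large enough |d| are changed.
    scope = np.abs(d) >= min_difference
    # Brighten pixels where d > 0, darken where d < 0, clamp to [0, 1].
    return np.clip(image + step * np.sign(d) * scope, 0.0, 1.0)

def run_attack(image, W, b, source, target, max_steps=10,
               targeted=True, **params):
    """Repeat attack_step until the linear classifier changes its answer.

    Returns the modified image and the number of steps used, or None
    for the step count if the attack failed within max_steps.
    """
    x = image.copy()
    for n in range(1, max_steps + 1):
        x = attack_step(x, W[:, source], W[:, target], **params)
        predicted = int(np.argmax(x @ W + b))
        success = predicted == target if targeted else predicted != source
        if success:
            return x, n
    return x, None
```

With `alpha=1.0` only the target class weights drive the update, and with `alpha=0.0` only the source class weights do, which matches the targeted and non-targeted roles given to these values in the text.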
As remarked above, by the constraints on the target class j, adversarial attacks are divided into two subtypes:
- if the goal is to assign class j to an image instead of class k, then such an attack is called targeted. This type of attack can be accomplished via the described algorithm with

EXPERIMENTS
Using the developed algorithm, a 0 → 8 attack is performed with the goal of forcing the neural network to misclassify 0 as 8. A generalized pixel importance matrix is used for that (Fig. 3). The adversarial attack results are shown in Fig. 4. Algorithm parameters: α = 0.5, min_difference = 0.0, step = 0.01, max_steps = 10. As can be seen, the attack has succeeded: the digit has been classified as 8 with a probability of 44%, while the source and modified images are virtually the same, which is confirmed by the SSIM value of 0.934.
It should be noted that if all image pixels are attacked taking into account only the sign of the difference (as is done in [6]), the attack will still be successful. However, in such a case the image noise is easily visible. Moreover, for an attacked image that is classified as 8 with a probability of 0.396, the SSIM value drops to 0.892, which is significantly lower.
Let's visualize the influence of the min_difference attack parameter on the attack result. The algorithm step has been increased to make the effect more pronounced. In our case 99% of pixel weights lie in the range [−1; 1], so the absolute difference is within [0; 2]; that is the range for the min_difference parameter. Fig. 5 shows the results of the targeted attack 0 → 8 with algorithm parameters: α = 0.5, min_difference = 1.2, step = 0.1, max_steps = 10. As it turns out, it was enough to modify the brightness of only 4 source image pixels for the attack to succeed. An SSIM value of 0.954 was achieved on the 8th algorithm step. Experiments akin to the one presented above have been performed on a set of images, all successfully. However, it has been noted that the algorithm has an interesting feature: in some cases it leads the attack not directly to the target class, but through some intermediate one instead. For example, for the targeted attack 6 → 1, during the first algorithm steps 6 is misclassified as 2. When such an intermediate class appears during the attack, perceptual image quality is degraded (for the example above we got SSIM = 0.635).
The roots of the problem have been investigated using PCA (Principal Component Analysis). After visualizing a scatter plot of points of classes 2, 6 and 1, it was noticed that the intercluster distance for digits 2 and 6 is much lower than the one for 6 and 1. And as shown in Fig. 6 (for α = 0.5), the difference vector between the target and source classes passes through the region of twos. When performing a targeted attack, such issues can be avoided by using another algorithm modification with parameter α = 1. With this modification, for a targeted attack we strive to maximize the output of the target neuron compared to the others. On the other hand, to accomplish a non-targeted attack we minimize the source neuron output with respect to the other ones, which can be achieved with an additional algorithm modification with α = 0.
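A two-dimensional PCA projection of the kind used for such a scatter plot can be sketched with a singular value decomposition. This is a generic illustration, not the authors' code; the function name and return values are assumptions.

```python
import numpy as np

def pca_project(X, n_components=2):
    """Project the rows of X onto the top principal components.

    Returns the projected points, the component directions and the
    mean, so that new points (for example, intermediate attack images)
    can be mapped onto the same plane with (x - mean) @ components.T.
    """
    mean = X.mean(axis=0)
    centered = X - mean
    # Rows of Vt are the principal directions, sorted by variance.
    _, _, Vt = np.linalg.svd(centered, full_matrices=False)
    components = Vt[:n_components]
    return centered @ components.T, components, mean
```

Projecting each intermediate image of an attack with the returned `components` and `mean` would trace its trajectory among the class clusters.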
Much better modified image quality can be obtained by virtue of such algorithm modifications (for example, for the same 6 → 1 attack the SSIM score has risen to 0.780). Fig. 6 shows the trajectories of the source image 6 as it is modified by the original algorithm and both its variations (targeted and non-targeted).

RESULTS
A generalized attack analysis was performed next. By launching a targeted attack for each pair of source and target classes, a heatmap has been drawn (Fig. 7). Source classes are shown on the left, target classes at the bottom.
Heatmap elements are SSIM values averaged across all the attacks for a given (source, target) pair. The mean quality over the whole test dataset is 0.76; such images will still be correctly classified by a human after the attack. The top scores were achieved for attacks on similar digits, e.g. 8 → 9, 9 → 8, 0 → 8, 3 → 8. The worst quality degradation was for the attack 1 → 0. This can be explained by the fact that the vital region for zero is the black hole in the middle, which usually overlaps with the white bar of the digit one. It should be noted that the 0 → 1 attack requires far fewer image modifications than the one in the opposite direction, which is confirmed by comparing SSIM values (higher by 0.21). As the attacks were built on real test set samples rather than on the generic digit silhouettes learnt by the neural network, the heatmap SSIM values lack symmetry. The training set quality loss heatmap has a similar feature. Lower average image quality loss can be attained by employing a stricter parameter selection algorithm. While Fig. 7 shows losses computed for a low, empirically chosen value min_difference = 0.5, selecting the best value from the range [0.5, 1.2] increased the SSIM score to 0.87.
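The averaging behind such a heatmap can be sketched as follows, assuming one SSIM score has been recorded per attacked image together with its source and target class. The function name and argument layout are hypothetical.

```python
import numpy as np

def mean_score_heatmap(sources, targets, scores, n_classes=10):
    """Average the scores for every (source, target) pair.

    Cells with no recorded attack (for example, source == target)
    are returned as NaN.
    """
    sources = np.asarray(sources)
    targets = np.asarray(targets)
    total = np.zeros((n_classes, n_classes))
    count = np.zeros((n_classes, n_classes))
    # np.add.at accumulates correctly when an index pair repeats.
    np.add.at(total, (sources, targets), np.asarray(scores, dtype=float))
    np.add.at(count, (sources, targets), 1.0)
    return np.where(count > 0, total / np.maximum(count, 1.0), np.nan)
```

Rows index the source class and columns the target class, so an asymmetric result such as the 0 → 1 versus 1 → 0 difference shows up as different values in mirrored cells.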
A higher average SSIM score of 0.93 has been reached for the non-targeted attack, which means that the source and modified images are nearly impossible to distinguish with the naked eye. As previously, digits 0 and 8 were the easiest to attack, and 1 proved to be the most problematic (see the boxplot in Fig. 8). By analyzing non-targeted attack results, one can conclude that if the deviation of image quality loss is high across different images of a certain class (i.e. some images are easy to attack while others are not), then the algorithm is inefficient (as it can only change digits that look similar to several classes); if, conversely, the deviation is small, then the algorithm is efficient.
Plots of SSIM against the min_difference parameter value lead to the conclusion that for each class image quality tends to rise as min_difference increases up to 0.9; this trend is especially noticeable for the class of nines. Specifically, the loss of image quality perceived by a human is substantially lower when a few pixels are changed strongly than when all image pixels are slightly modified. Taking this feature into account is what makes our algorithm stand out among the other methods known in the literature.
Among the fast gradient methods, the most efficient algorithm is I-FGM with the $L_2$ norm loss [7]. An attack for each (source, target) pair has been conducted following the above-described procedure for the case without min_difference selection. The algorithm was successful on all test images but achieved a lower SSIM score of 0.83. Lastly, the question of adversarial image transfer has been considered. Here we want to perform a so-called black-box attack, where we have no knowledge of the network's weights or architecture, and the only allowed operation is to query the neural network prediction engine by submitting images. The attack is performed using the above-described logistic regression architecture; then an attempt is made to transfer each image to an unknown 5-layer neural network.
For reproducibility of the results, the neural network architecture is given below, yet this knowledge was not used in any way during the attack phase. The neural network has a 5-layer fully-connected architecture: 4 hidden layers with sizes 200, 100, 60 and 30, one output layer with 10 neurons and an input of 784. As a means of regularization, Batch Normalization is applied after the first layer and Dropout after the second one. ReLU is used as the activation function for all layers but the last one, where Softmax is used instead. After 100 epochs of training with the Adam optimization algorithm, the training set accuracy reached 98.65% and the test set accuracy 98.51%.
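The layer layout described above can be sketched as a plain forward pass. This is an inference-only illustration with randomly initialized weights: Batch Normalization is omitted, Dropout is an identity at inference time, and the initialization scheme and function names are assumptions rather than details from the source.

```python
import numpy as np

# Input, four hidden layers and the output layer, as listed in the text.
LAYER_SIZES = [784, 200, 100, 60, 30, 10]

def init_params(rng, sizes=LAYER_SIZES):
    # He-style random initialization (an assumption made for this sketch).
    return [(rng.normal(0.0, np.sqrt(2.0 / n_in), size=(n_in, n_out)),
             np.zeros(n_out))
            for n_in, n_out in zip(sizes[:-1], sizes[1:])]

def forward(params, x):
    """Forward pass: ReLU on hidden layers, softmax on the output."""
    h = x
    for W, b in params[:-1]:
        h = np.maximum(h @ W + b, 0.0)
    W, b = params[-1]
    u = h @ W + b
    u = u - u.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(u)
    return e / e.sum(axis=-1, keepdims=True)
```

With these sizes the model has 185,300 weights and biases, compared with 7,850 for the single-layer classifier.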
Figure 9 shows a generalized heatmap of attack transfer success probabilities. With the above-described procedure, on average 33% of images were successfully transferred. Interestingly, in many cases images that were difficult to attack on the original network have seen higher transfer rates than the ones that needed only minor image changes. For instance, it was possible to successfully transfer 87% of the 1 → 0 attack images, which were among the most difficult ones, but only 14% of the 9 → 7 attack images.
Let's follow the 4 → 9 attack procedure. For each step, the predicted digit with its probability is shown for the attacked single-layer classifier (SL) and the 5-layer fully-connected network (FC5) (Fig. 10). It should be observed that after enough changes were made to fool the original network, 4 more steps were necessary to deceive the 5-layer one. This means that the two networks have similar decision boundaries, yet each one is shifted with respect to the other.
Considering the above, a generalized targeted attack with neural network transfer has been conducted once again. This time, after a successful attack on the source network, 5 more algorithm steps were made (where the number 5 was chosen empirically), which makes it possible to transfer 91% of adversarial images with a minimal image quality loss.
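The "attack, then continue for a few more steps" procedure can be sketched independently of the concrete attack. Here `step_fn`, `predict_source` and `predict_other` are hypothetical callables standing for one attack step, the attacked classifier and the unknown network, respectively; none of these names come from the source.

```python
import numpy as np

def attack_with_extra_steps(image, step_fn, predict_source, target,
                            max_steps=10, extra_steps=5):
    """Attack the source model, then push the image further past its
    decision boundary to improve the chance of transfer."""
    x = image.copy()
    for _ in range(max_steps):
        x = step_fn(x)
        if predict_source(x) == target:
            for _ in range(extra_steps):
                x = step_fn(x)
            return x, True
    return x, False

def transfer_rate(adversarial_images, predict_other, target):
    """Fraction of adversarial images the other model labels as target."""
    hits = [predict_other(x) == target for x in adversarial_images]
    return float(np.mean(hits)) if hits else 0.0
```

The extra steps trade some image quality for a margin beyond the source model's boundary, which is the effect observed in the 4 → 9 example.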

DISCUSSION
Hence, another way of analyzing the safety of neural networks (logistic regression in particular) against input data attacks has been presented. Simplified targeted and non-targeted logistic regression attack algorithms for the handwritten digit classification problem on the MNIST dataset have been built. A visual "importance" interpretation of each image pixel for classification as an instance of a class has been given. An analysis has been performed that made it possible to determine the classes most vulnerable to the attack, as well as the images for which the class predicted by the neural network can be changed unnoticeably to a human. The proposed algorithm makes it possible to conduct a successful adversarial attack by modifying only a few image pixels, which minimizes image data loss.
The analysis of image quality loss based on the Structural Similarity Index (SSIM) for targeted and non-targeted neural network attacks shows that the developed algorithm provides better image quality of the attack in comparison to other gradient methods.
Adversarial examples built with the developed algorithm have been successfully transferred to a different 5-layer neural network of an unknown architecture. The high chance of adversarial image transfer to a network with a vastly different architecture makes the algorithm applicable for attacking restricted-access systems.

CONCLUSIONS
A logistic regression adversarial attack algorithm for the MNIST handwritten digit classification task has been proposed. Targeted and non-targeted neural network attacks can be performed using one of the two algorithm modifications. The relevance is explained by the growing use of neural networks in the field of public safety and by the critical need to explore neural network attack methods and their precursors.
The scientific novelty of the obtained results is that, for the first time, an adversarial attack algorithm has been built upon the attack scoping idea. The presented fast and efficient attack algorithm is able to attack both the whole image and separate image regions, which makes it more flexible. Image information loss can be minimized by modifying only a couple of pixels. The practical significance of the obtained results is that early diagnostics of neural network vulnerabilities can be performed using the proposed algorithms and the image quality loss analysis system, which is a pivotal step towards safer practical use of neural networks.
Prospects for further research are to study the transfer of physical adversarial attacks on neural networks using an ordinary pen, based on the pixel importance for the selected class.
Conclusions. Adversarial examples built on the idea of restricting the attack area, as well as the input data analysis technique, are easily generalized to other recognition tasks, which makes the technique applicable to the analysis of a number of

ABBREVIATIONS
FGSM - Fast Gradient Sign Method;
L-BFGS - Limited-memory BFGS (Broyden-Fletcher-Goldfarb-Shanno) algorithm;
MAE - Mean Absolute Error;
MNIST - Modified National Institute of Standards and Technology;
PSNR - Peak Signal-to-Noise Ratio;
RP2 - Robust Physical Perturbations;
Softmax - Softened (via exponent) max function;
SSIM - Structural Similarity Index.

NOMENCLATURE
b - biases vector;
I - feature space dimensionality, neuron count in the network's input layer;
image - source image;
K - number of classes, neuron count in the output layer;
M - training batch size, the number of images used in a single optimization method step during training;
max_steps - number of attack steps;
min_difference - minimal difference between classes' weights allowed for a pixel attack;
S - source class;
source_weights - trained network's weight matrix for the class predicted by the neural network for the source image;

Considering such a "probabilistic" classification, the cross-entropy loss $\gamma(W, b; z, y)$ has been selected during neural network training as an error metric between the computed outputs z and the desired outputs y. By performing gradient descent to minimize the function $G(W, b)$, an accuracy of 97.0% on the training set and 92.7% on the test set has been achieved in 12 epochs (roughly 40 seconds of training time).
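A minimal sketch of one gradient descent step on the mean cross-entropy of a softmax regression is given below. The batch layout (rows are images, Y holds one-hot labels), the learning rate and the function names are assumptions; the source does not specify its optimizer settings.

```python
import numpy as np

def softmax_rows(U):
    # Row-wise numerically stable softmax.
    U = U - U.max(axis=1, keepdims=True)
    E = np.exp(U)
    return E / E.sum(axis=1, keepdims=True)

def cross_entropy(W, b, X, Y):
    """Mean cross-entropy between outputs softmax(XW + b) and labels Y."""
    Z = softmax_rows(X @ W + b)
    return float(-np.mean(np.sum(Y * np.log(Z + 1e-12), axis=1)))

def gradient_step(W, b, X, Y, lr=0.1):
    """One gradient descent step; returns the updated (W, b)."""
    Z = softmax_rows(X @ W + b)
    G = (Z - Y) / len(X)  # gradient of the mean loss w.r.t. the logits
    return W - lr * (X.T @ G), b - lr * G.sum(axis=0)
```

Repeating `gradient_step` over batches of M images for several epochs gives the kind of training loop described in the text.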

In Fig. 3 it is easily seen that light tones signify pixel importance for classifying 8, and dark ones for classifying 0.

Figure 1 - Image pixel weights for each dataset digit

Let's state the neural network attack problem. The output is presented as a probability distribution z over handwritten digit classes. Consider that the network predicts an image as an instance of class $S \in \{0, 1, 2, \ldots, 9\}$. By changing some pixels' brightness, we want to change the neural network prediction to $T \in \{0, 1, 2, \ldots, 9\}$, $T \neq S$. In so doing, we enforce image correctness by clamping brightness values into the range [0, 1].

for each point i in the image, find the corresponding weights $W_S$ and $W_T$ in the weight matrices for the source class S and the target class T; let (1 )

that, it is said that the algorithm has succeeded in attacking an image only if it has been able to change the neural network's predicted class into digit j in a finite number of steps, and has failed in all other cases;
- if the goal is to reassign the classification of an image of class k to any different class $j \neq k$, then the attack is called non-targeted. The algorithm parameter α = 0 can be used to perform such an attack. The result is a success if an incorrect classification has been achieved in a finite number of steps.

Figure 7 - SSIM metric values heatmap for each (source, target) attack pair

Figure 8 - Low deviation of the non-targeted attack SSIM score: the algorithm is good at attacking diverse images of a single class