METHOD OF DATA DEPERSONALIZATION IN PROTECTED AUTOMATED INFORMATION SYSTEMS

Context. The problem of data depersonalization in information systems is considered. Modern approaches to data depersonalization are analyzed, and the need for a new method that increases the security and reliability of the processed data is identified and substantiated. The object of the study is a model of data depersonalization that reduces the cost of protecting information systems. Objective. The goal of the work is to analyze modern depersonalization methods and to create a method that eliminates their identified shortcomings and raises the level of confidentiality by hashing critical data together with a private key. Method. A method of personal data depersonalization is proposed, based on the method of introducing identifiers with hashing of critical data and a private key, which increases the confidentiality of information processed in information systems. Three component methods are proposed: a method for selecting, from primary documents, the key critical attributes that uniquely identify the subject of personal data; a method for generating the initial sets, which divides the source data into two disjoint subsets; and a method for generating a hash identifier from a unique sequence and a private key, which depersonalizes the information and enhances its confidentiality. Results. The developed method is implemented in software and studied in solving depersonalization problems. Conclusions. The experiments confirmed the efficiency of the proposed method and allow it to be recommended for implementation in automated information systems that process personal data. Further research may address hardware-based data depersonalization, which would increase the processing speed and confidentiality of data in information systems.


ABBREVIATIONS
PD is personal data; ISPD is a personal data information system.

NOMENCLATURE
D is a personal data table; M is the total number of attributes; N is the number of table rows; $A_1$, $A_2$ are datasets; K is the number of key attributes; F is a hash function; $a_{ik}$ are the data of the table rows; P is an original message; f is a multi-round non-key reshuffle; Theta, Rho, Pi, Chi, Iota are hash functions; are arrays; x is an amount; i is a counter; Z is the hashing result; r is an array defining the count of bits of reshuffle for each state; PK is a private key.

INTRODUCTION
Modern automated systems process a large amount of personal data of various security classes. In accordance with the Convention for the Protection of Individuals with regard to Automatic Processing of Personal Data (Strasbourg, January 28, 1981) (with changes and additions), the operator must ensure the confidentiality of the data being processed, which leads to significant material costs [1][2][3]. The cost of protecting one workplace of an automated personal data processing system can exceed 1000 US dollars, and an automated system can have several hundred workstations. Many companies also face the known problem of collecting and storing consents to the processing of personal data, which require handwritten completion or an electronic signature. To solve this problem, depersonalization methods can be used [4].
The object of the research is the process of transforming confidential personal data into an anonymous, non-confidential sequence.
The process of converting confidential personal data into an impersonal non-confidential sequence usually takes a lot of time, has low resistance to attacks and has limitations in processing large amounts of personal data with frequent changes.
The subject of the research is the methods of depersonalizing personal data.
Known methods of data depersonalization [5][6][7][8] have the following shortcomings: they are slow; the relationships between the attributes of depersonalized data and the corresponding personal data attributes are partially preserved in the records; and when the values of individual attributes are changed, only the composition of the data changes, without depersonalization. Therefore, to increase the speed and confidentiality of data depersonalization, a method that eliminates these shortcomings must be developed.
The purpose of the work is to increase the speed and quality of the process of depersonalization of data processed in automated information systems.

PROBLEM STATEMENT
Let us assume that the raw data is given in the form of preliminary values $D = (d_1, d_2, \ldots, d_M)$, where M is the total attribute count and N is the table row count. The attributes $d$ may be key or non-key; the number of key attributes is $K$ $(0 < K < M)$. When the data for hashing is formed, a private key PK of 512 bits is used. For a given sequence of data, depersonalization can be represented as the task of splitting the data into two sets, $A_1$ and $A_2$, where $A_1$ contains the confidential data and $A_2$ the anonymous information, such that finding the unique data block $(a_{i1}, a_{i2}, \ldots, a_{iK}, PK)$ from any $d_0$ is impossible, while $d_0$ in turn makes it possible to establish the interrelation of the elements of the first and second sets.

REVIEW OF THE LITERATURE
In the analysis of modern methods of PD depersonalization, the following methods were studied: the method of identifiers implementation, the method of change of composition or semantics, the method of decomposition, and the mixing method.
1) The method of identifiers implementation replaces personal data values with identifiers and creates a table (guide) of correspondence between the identifiers and the initial data. The disadvantages of this method are:
a) In a request and in the response to it, the representation of the PD attributes that were replaced with identifiers is changed.
b) The relations between the attributes of depersonalized data and the corresponding PD attributes are preserved in the records.
c) It is applicable only to a small number of PD attributes and a small volume of the PD array.
2) The method of change of composition or semantics changes the composition or semantics of personal data through statistical processing, transformation, compilation or replacement of some information [9]. This method has the following disadvantages:
a) It is ineffective for PD depersonalization, because when PD attributes are extracted it is necessary to consider the possibility of depersonalization using these attributes.
b) A basic replacement of the values of separate attributes can only change the PD composition, without depersonalization.
c) The relations between the attributes of depersonalized data and the corresponding personal data attributes are partially preserved in the records.
d) It is applicable when the processing tasks do not require personalization of the depersonalized data; if personalization is needed, the method can be used on small data arrays.
3) The method of decomposition divides an array of personal data into several sub-arrays that are then stored separately. Its basic disadvantages are:
a) It preserves the relations between the attributes of depersonalized data and the corresponding PD attributes in the records of each storage.
b) It is applicable to large arrays of PD.
c) Resistance to attacks depends on the complexity of establishing the relations between tables.
4) The mixing method reshuffles separate values or groups of values of personal data attributes in an array of personal data. This method has the following disadvantages:
a) It does not preserve the relations between the attributes of depersonalized data and the corresponding personal data attributes in the records.
b) Resistance to attacks increases with the size of the array of personal data.
c) It is inapplicable to large arrays of personal data with frequent changes in data.
The algorithms for implementing the method of identifiers implementation are represented by functions, some of which use various cryptographic approaches to generate an identifier that links the cross-reference table and the depersonalized database. For example, O. A. Vishnyakova and D. N. Lavrov [9] obtain a unique and relevant identifier of an individual by applying a one-way cryptographic function to the following attributes: the surname, name, patronymic and date of birth of the individual. There is also a patent, proposed by E. S. Volokitina [10], for a method of identifying a subject of personal data that uses a SIM card as the linking identifier. The method has been successfully implemented in educational organizations. The featured algorithm successfully solves the security problem when anonymous data is processed. However, the use of an additional identifier complicates the processing and increases costs.
Algorithms for implementing the method of change of composition or semantics are presented by I. Y. Kuchin [11], who proposes encoding the identifying attributes with a specially developed algorithm. A distinctive feature of the work is the analytical justification of the choice of the composition of the identifying group and the provision of a given degree of anonymity within an anonymous database. This method has been introduced in the healthcare field; however, it ensures security only when personal data is stored, not in other information processing modes.
Algorithms for implementing the mixing method are presented in works that propose mixing algorithms aimed at the storage of PD or its transmission over open communication channels. For example, K. O. Bondarenko and V. A. Kozlov [12] have presented a method of mixing data inside segments with sequential mixing of rows and sensitive attributes, in which the algorithm uses lookup tables generated by the cryptographic gamma method. On the one hand, the use of cryptography guarantees the resistance of the algorithm even during a processing session; on the other hand, it complicates adding, deleting and searching data and increases the cost of protection. These shortcomings are obstacles to the implementation of the method.
Other research areas mainly involve cryptographic methods, which can be classed as depersonalization only conditionally: they make it impossible to identify an individual from the processed data, but they are not formally included in the set of methods established by Roskomnadzor, or they use such methods only partially. For example, the work of Y. V. Trifonova and R. F. Zharinov [13] suggests using the built-in cryptographic tools of the CryptDB database management system. As an example of the partial use of the identifier method, one can cite the work of I. Azhmukhamedov, R. Y. Demina and I. V. Safarov [14], in which the cross-reference table is encrypted and subsequently blocked.
To generate the hash of a sequence, a method based on the concept of a cryptographic sponge is used, which involves two primary stages [15][16].
1) Absorbing. The initial message P is subjected to the multi-round reshuffles f; all blocks of the message from which the hash will be produced are accumulated and processed [17].
2) Squeezing. The value Z obtained as the result of the reshuffles is output; the hash value is produced and output until the necessary hash length is reached [18].
In the absorbing phase, the initial state is first set to the zero vector with a size of up to 1600 bits. Next, a fragment $p_0$ of the initial message is XORed with the fragment of the initial state of size $r$; the remaining part of the state, of capacity $c$, remains unchanged.
The result is processed by the function f, a multi-round non-key pseudo-random reshuffle, and the procedure repeats until the blocks of the initial message are exhausted [19]. Next comes the squeezing phase, in which a hash of arbitrary length can be extracted. The flow chart of the hashing algorithm is shown in Fig. 1.
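The arbitrary output length of the squeezing phase can be illustrated with a short sketch. It assumes SHAKE256 from the Python standard library as a stand-in sponge function; the paper does not state that this particular function is used, and the message below is a placeholder.

```python
import hashlib

message = b"example message"  # placeholder input, not data from the paper

# SHAKE256 is a sponge-based function of the SHA-3 family (assumed stand-in).
# The output length is chosen only when the result is squeezed out.
short_out = hashlib.shake_256(message).digest(16)  # squeeze 16 bytes
long_out = hashlib.shake_256(message).digest(64)   # squeeze 64 bytes

# Squeezing further only extends the output stream:
# the shorter result is a prefix of the longer one.
assert long_out[:16] == short_out
```

The absorbing phase is identical in both calls; only the amount of output taken from the state differs.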
The function F() in this algorithm executes 24 rounds; one round consists of five functions, Theta, Rho, Pi, Chi and Iota, which process the inner state in sequence.
The function Theta is represented by the next expressions (1):
The function Iota is represented by the next expression (4):
Step 3: after the data has been processed by the subfunctions, the round count is checked. If the counter $i$ has reached 24, the array A is output. Otherwise $i$ is incremented by 1 and the operations are repeated until the condition is true.
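The round structure and the two sponge phases can be sketched in Python as follows. This is a minimal sketch that assumes the standard SHA3-512 parameters of FIPS 202 (rate r = 576 bits, capacity c = 1024 bits, 64-byte output), which the text does not state explicitly. The names `rol64`, `keccak_f1600`, `permute` and `sha3_512_sketch` are illustrative, and the step definitions follow the public Keccak specification rather than the expressions of the paper.

```python
import hashlib

MASK = (1 << 64) - 1


def rol64(a, n):
    """Rotate a 64-bit lane left by n bits."""
    n %= 64
    return ((a << n) | (a >> (64 - n))) & MASK


def keccak_f1600(A):
    """Apply 24 rounds to the 5x5 array A of 64-bit lanes, indexed A[x][y]."""
    lfsr = 1  # shift register that generates the Iota round constants
    for _ in range(24):
        # Theta: xor each lane with the parities of two neighbouring columns.
        C = [A[x][0] ^ A[x][1] ^ A[x][2] ^ A[x][3] ^ A[x][4] for x in range(5)]
        for x in range(5):
            d = C[(x + 4) % 5] ^ rol64(C[(x + 1) % 5], 1)
            for y in range(5):
                A[x][y] ^= d
        # Rho and Pi: rotate every lane and move it to a new position.
        x, y = 1, 0
        current = A[x][y]
        for t in range(24):
            x, y = y, (2 * x + 3 * y) % 5
            current, A[x][y] = A[x][y], rol64(current, (t + 1) * (t + 2) // 2)
        # Chi: nonlinear combination of three lanes along each row.
        for y in range(5):
            row = [A[x][y] for x in range(5)]
            for x in range(5):
                A[x][y] = row[x] ^ (~row[(x + 1) % 5] & row[(x + 2) % 5])
        # Iota: xor the round constant into lane (0, 0).
        for j in range(7):
            lfsr = ((lfsr << 1) ^ ((lfsr >> 7) * 0x71)) % 256
            if lfsr & 2:
                A[0][0] ^= 1 << ((1 << j) - 1)
    return A


def permute(state):
    """Run the permutation on a 200-byte state (little-endian lanes)."""
    A = [[int.from_bytes(state[8 * (x + 5 * y): 8 * (x + 5 * y) + 8], "little")
          for y in range(5)] for x in range(5)]
    A = keccak_f1600(A)
    out = bytearray(200)
    for x in range(5):
        for y in range(5):
            out[8 * (x + 5 * y): 8 * (x + 5 * y) + 8] = A[x][y].to_bytes(8, "little")
    return out


def sha3_512_sketch(data: bytes) -> bytes:
    rate = 72               # r = 576 bits; the remaining c = 1024 bits are the capacity
    state = bytearray(200)  # 1600-bit zero initial state
    # SHA-3 padding: suffix byte 0x06, zeros, and a final 0x80 bit.
    padded = bytearray(data) + b"\x06" + b"\x00" * (-(len(data) + 1) % rate)
    padded[-1] |= 0x80
    # Absorbing: xor each block into the first r bits, then permute.
    for off in range(0, len(padded), rate):
        for i in range(rate):
            state[i] ^= padded[off + i]
        state = permute(state)
    # Squeezing: 64 bytes fit into one block of the rate.
    return bytes(state[:64])


# The sketch is compared against the standard library implementation.
assert sha3_512_sketch(b"abc") == hashlib.sha3_512(b"abc").digest()
```

In practice `hashlib.sha3_512` would be used directly; the explicit version only shows where the five step functions and the round counter sit inside the sponge.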

MATERIALS AND METHODS
To eliminate the drawbacks mentioned above, a personal data depersonalization method was developed, based on the method of identifiers implementation with hashing of critical data and a private key [20]. The raw data is a personal data table $D = (d_1, d_2, \ldots, d_M)$, where M is the total number of attributes and N is the number of table rows; the attributes $d$ may be key or non-key. At the first step, the critical data and the data that clearly identify the personal data subject are determined by expert judgment. The corresponding attributes are defined as key ones.
At the second step, the initial array D is split, according to the chosen key attributes, into two non-intersecting sub-arrays $A_1$ and $A_2$. An additional attribute $d_0$ is added to each of the sub-arrays; its value is later used to match the depersonalized data with the personal data subject. As a result, the number of key values is equal to $K$ $(0 < K < M)$. The sub-array $A_2$ stores depersonalized data that is of no interest to an intruder, so it does not require protection and is stored in the clear.
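The split of the second step can be sketched as follows. The function `split_table`, the attribute names and the sample rows are hypothetical, and the identifier is supplied as a callable `make_id`, for which a simple counter is used here as a placeholder.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Row = Dict[str, str]


def split_table(
    rows: List[Row],
    key_attrs: Sequence[str],
    make_id: Callable[[Row], str],
) -> Tuple[List[Row], List[Row]]:
    """Split table D into A1 (key attributes) and A2 (the rest), linked by d0."""
    a1: List[Row] = []
    a2: List[Row] = []
    for row in rows:
        # Enforce 0 < K < M: some, but not all, attributes are key ones.
        if not 0 < len(key_attrs) < len(row):
            raise ValueError("the number of key attributes K must satisfy 0 < K < M")
        d0 = make_id(row)
        a1.append({"d0": d0, **{k: row[k] for k in key_attrs}})
        a2.append({"d0": d0, **{k: v for k, v in row.items() if k not in key_attrs}})
    return a1, a2


# Hypothetical three-attribute table; the attribute names are illustrative only.
table = [
    {"name": "Alice", "passport": "AB123456", "diagnosis": "flu"},
    {"name": "Bob", "passport": "CD654321", "diagnosis": "sprain"},
]
# Placeholder identifier: a row counter stands in for the real value of d0.
counter = iter(range(1, 1_000_000))
a1, a2 = split_table(table, ["name", "passport"], lambda row: f"id-{next(counter)}")
```

Here `a1` holds the confidential key attributes and `a2` the data stored in the clear; the only attribute the two sub-arrays share is `d0`.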
At the third step, $d_0 = F(a_{i1}, a_{i2}, \ldots, a_{iK}, PK)$ is calculated for the set of key values of each row, where F is a unique function unknown to the user and PK is the unique private key. The hash function is chosen as F in this case [21].
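A minimal sketch of the identifier calculation is given below. It assumes that F is SHA3-512 from the Python standard library and that the key values are UTF-8 strings combined with a length prefix before the 512-bit key is appended; the encoding, the function name `hash_identifier` and the sample values are assumptions, not details given in the text.

```python
import hashlib
import secrets
from typing import Sequence


def hash_identifier(key_values: Sequence[str], private_key: bytes) -> str:
    """Compute d0 = F(a_i1, ..., a_iK, PK) with F assumed to be SHA3-512."""
    if len(private_key) != 64:
        raise ValueError("the private key PK must be 512 bits (64 bytes) long")
    h = hashlib.sha3_512()
    for value in key_values:
        encoded = value.encode("utf-8")
        # A length prefix keeps ("ab", "c") distinct from ("a", "bc").
        h.update(len(encoded).to_bytes(4, "big"))
        h.update(encoded)
    h.update(private_key)
    return h.hexdigest()


private_key = secrets.token_bytes(64)  # random 512-bit key, for illustration
d0 = hash_identifier(["Alice", "AB123456", "1990-01-01"], private_key)
```

The same key values and key always give the same `d0`, which is what allows a record to be found again later, while a different key gives an unrelated identifier.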

EXPERIMENTS
For the experiments, a computer program and a database implementing the proposed method were developed, with the initial data of 100 subjects of personal data of a medical institution. The developed software was studied in solving depersonalization problems.
On the basis of the initial sample, the key critical attributes that uniquely identify the subject of personal data stored in a protected information system were identified. Using these data and a private key, a hash identifier is generated for each record; it is the primary key of the subject of personal data in the depersonalized information system.
To search for the necessary record in the depersonalized information system, a developed subprogram for calculating the hash identifier is used; based on the data from the primary documents of the personal data subject, it forms the primary key of the specific record.
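The search by a recomputed identifier can be sketched as a dictionary lookup. The sketch assumes SHA3-512 as the hash function and an in-memory dictionary in place of the database; `hash_identifier`, `find_record`, the fixed demonstration key and the sample record are all hypothetical.

```python
import hashlib
from typing import Dict, Optional, Sequence


def hash_identifier(key_values: Sequence[str], private_key: bytes) -> str:
    """Hash the length-prefixed key values followed by the private key."""
    h = hashlib.sha3_512()
    for value in key_values:
        encoded = value.encode("utf-8")
        h.update(len(encoded).to_bytes(4, "big"))
        h.update(encoded)
    h.update(private_key)
    return h.hexdigest()


def find_record(
    store: Dict[str, dict], key_values: Sequence[str], private_key: bytes
) -> Optional[dict]:
    """Recompute the primary key from primary-document data and look it up."""
    return store.get(hash_identifier(key_values, private_key))


# Fixed key for a repeatable demonstration; a real key must be random and secret.
hospital_key = bytes(range(64))
alice = ["Alice", "AB123456"]
store = {hash_identifier(alice, hospital_key): {"diagnosis": "flu"}}

found = find_record(store, alice, hospital_key)  # correct data and key
missed = find_record(store, alice, bytes(64))    # correct data, wrong key
```

With the correct key the record is returned; with any other key the recomputed identifier does not match a stored primary key and the lookup returns `None`.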
After the formation of data for a depersonalized information system, an analysis was performed for the presence of collisions [22][23].
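A collision check of this kind can be sketched as follows, under the assumption that a collision means two different sets of key values receiving the same identifier. The function `find_collisions` and the sample pairs are hypothetical and do not reproduce the analysis performed in the experiment.

```python
from collections import defaultdict
from typing import Dict, Iterable, Sequence, Set, Tuple


def find_collisions(
    pairs: Iterable[Tuple[Sequence[str], str]]
) -> Dict[str, Set[Tuple[str, ...]]]:
    """Return the identifiers shared by more than one distinct set of key values."""
    inputs_by_id: Dict[str, Set[Tuple[str, ...]]] = defaultdict(set)
    for key_values, identifier in pairs:
        inputs_by_id[identifier].add(tuple(key_values))
    return {i: s for i, s in inputs_by_id.items() if len(s) > 1}


# Illustrative data: the second and third subjects are given the same identifier
# on purpose, and the first subject is listed twice.
pairs = [
    (["Alice", "AB123456"], "id-1"),
    (["Bob", "CD654321"], "id-2"),
    (["Carol", "EF111111"], "id-2"),
    (["Alice", "AB123456"], "id-1"),
]
collisions = find_collisions(pairs)
```

A repeated record of the same subject is not reported, since it maps the same input to the same identifier.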

RESULTS
As an example, let us consider a database of patients of a medical institution (see table 1). The depersonalized database stores the hash identifier and the depersonalized personal data (see table 2). The secure database stores the hash identifier and the critical personal data (see table 3). It is impossible to restore the original data from the hash identifier. To obtain an identifier, the necessary fields of the subject of personal data must be filled in from the primary documents, and the private key must be used, in the developed software.

DISCUSSION
Let us consider the application of this method using the famous characters Alice and Bob [24].
Alice comes to see Doctor Bob. To identify herself, Alice shows Bob the critical PD from her primary documents (passport and medical insurance). Bob enters this data and the key of the hospital into the hash identifier calculator and forms the hash identifier that gives access to Alice's patient file. After diagnosing and prescribing treatment, Bob enters the data into the information system and signs it with his electronic signature.
A curious staff member Eva wanted to know Alice's diagnosis but can't find her card in the information system because she does not know the hash identifier as well as Alice's critical data.
Mallorie found out Alice's critical PD and got the access to the calculator for hash identifier, but she does not know the hospital's key for calculating Alice's identifier.
This method has the following advantages: 1) Data becomes depersonalized, which reduces the cost of ISPD protection.
2) It is impossible to determine the presence of a certain subject in the ISPD from known unique attributes.
3) When a subject applies, the operator uses the subject's PD to gain access to only one record of the ISPD.
4) The context analysis is impossible.

CONCLUSIONS
The topical problem of data depersonalization in the information system was solved by introducing identifiers using hashing of critical data and a private key.
The scientific novelty of the obtained results is that, for the first time, a method was proposed for introducing identifiers using hashing of critical data and a private key. This makes it possible to increase the level of data confidentiality, reduce the requirements for the level of information system security, and increase the speed of data processing by convolving the critical data into a hash identifier.
The practical significance of the obtained results is that software implementing the proposed method has been developed and experiments have been carried out to confirm the adequacy of the proposed mathematical model. The results of the experiment allow us to recommend the proposed method for introduction into automated information systems that process personal data, at the design stage or when optimizing existing systems, which will reduce the cost of protecting the information system.