SYSTEM FOR WEB RESOURCES CONTENT STRUCTURING AND RECOGNIZING WITH THE MACHINE LEARNING ELEMENTS

Context. A large number of web resources of different organizations require their content to be checked for relevance and correctness, in particular regarding the characteristics of the organization, its staff, etc. This calls for a system of automated content analysis and, in turn, for a method and software for structuring and recognizing web resource content. Existing parsing systems do not solve this task, since they contain no machine learning elements. The object of the research is the process of automated analysis of web resource content. Objective. The goal of the work is to create a system for structuring and recognizing web resource content. Method. A system for structuring and recognizing the text content of web resources with machine learning elements is considered. Models of the system's functioning are proposed. An architecture for implementing the software system for structuring and recognizing the text content of web resources is developed. An example is given of implementing the model of the developed system for structuring, recognizing and revealing outdated and incorrect information about personnel on the web resource of an educational institution. Results. The developed software may be used by support services to update and correct the information content. Conclusions. A system for structuring and recognizing web resource content with machine learning elements has been considered. Compared with known systems, the proposed system ensures automated content structuring and recognition of outdated, irrelevant or wrong information. The presented example of structuring and recognizing outdated and incorrect information on the website of an educational institution confirms the effectiveness of the proposed system.


NOMENCLATURE
O_Pr is a parent element and the attribute;
Id_Type is a type identifier that refines the assignment of an object to a group that already has more characteristics in common than the Dsc characteristic;
O_Phr is a characteristic of the analyzed object that allows different linguistic representations in the form of phrases;
O_Frm is the words that define specific characteristics, presented as word forms;
O_Bs is the words and their bases;
IdLg is an identifier of the language implementation;
IdBs is an identifier of a word base;
WBase is the base of a word;
IdFm is an identifier of a word form;
WForm is a form of a word;
O_Id is an identifier of an object;
IdPh is an identifier of features;
Pr Id Bs is an identifier of the parent bases of the object;
KDrO is a set of key characteristics;
IdPg is an identifier of the analyzed web page;
IdO is an identifier of the analyzed object;
Id_EType is a type of sample characteristics.

INTRODUCTION
Structuring and recognizing text information is one of the areas of intelligent information systems. The main component of such systems is machine learning [1].
Machine learning is a process in which a system processes a large number of examples, detects patterns and uses them to predict the characteristics of new data.
Machine learning relieves the programmer of the need to "explain in detail" to the computer how to solve the problem. Instead, the computer learns to find a solution on its own [2].
In this paper, a system for structuring and recognizing web resource content that includes machine learning elements is proposed. This system can be used to identify incorrect and outdated information on websites.
The object of study is the process of automated analysis of web resources content.
The subject of study is the system for web resources content structuring and recognizing.
The purpose of the work is the development of a method and software for web resources content structuring and recognizing using machine learning elements.

PROBLEM STATEMENT
Let us consider a system that searches for, structures and recognizes text information on web resources, records it to a database and compares it.
The object processed by this system is the input data in the form of text information. The structure of such information is represented in the form of tree-like relations.
Therefore, the objects whose content is analyzed for relevance are modeled using the following structures for machine learning:
Thus, the input data for the system for web resources content structuring and recognizing is represented by the set of tuples (1)-(4). As a result of transforming the given structures, we obtain a tree-like structure of the following form:
where O_Dsc is the resulting set of text structures for which the relevance of the content is established.
The relevance of the results obtained from transforming the tree-like structures depends on how completely the available content is represented in the existing databases that serve as the basis for comparison. Thus, during content analysis, the main database must be renewed using machine learning tools. The problem of defining indicators of the completeness of content representation in existing databases is beyond the scope of the research presented in this paper.

REVIEW OF THE LITERATURE
Machine learning can appear in many guises. We now discuss a number of applications, the types of data they deal with, and finally, we formalize the problems in a somewhat more stylized fashion. The latter is key if we want to avoid reinventing the wheel for every new application. Instead, much of the art of machine learning is to reduce a range of fairly disparate problems to a set of fairly narrow prototypes. Much of the science of machine learning is then to solve those problems and provide good guarantees for the solutions [4, 5].
Machine learning algorithms are organized into a taxonomy based on the desired outcome of the algorithm. Common algorithm types include [6]:
- Supervised learning, where the algorithm generates a function that maps inputs to desired outputs. One standard formulation of the supervised learning task is the classification problem: the learner is required to learn (to approximate the behavior of) a function which maps a vector into one of several classes by looking at several input-output examples of the function.
- Unsupervised learning, which models a set of inputs: labeled examples are not available.
- Semi-supervised learning, which combines both labeled and unlabeled examples to generate an appropriate function or classifier.
- Reinforcement learning, where the algorithm learns a policy of how to act given an observation of the world. Every action has some impact on the environment, and the environment provides feedback that guides the learning algorithm.
- Transduction, which is similar to supervised learning but does not explicitly construct a function: instead, it tries to predict new outputs based on training inputs, training outputs, and new inputs.
- Learning to learn, where the algorithm learns its own inductive bias based on previous experience.
The performance and computational analysis of machine learning algorithms is a branch of statistics known as computational learning theory. Machine learning is about designing algorithms that allow a computer to learn. Learning does not necessarily involve consciousness; rather, it is a matter of finding statistical regularities or other patterns in the data. Thus, many machine learning algorithms will barely resemble how a human might approach a learning task. However, learning algorithms can give insight into the relative difficulty of learning in different environments [6].
The components of machine learning tools enable software systems to analyze text information automatically in order to structure it and to detect incorrect and wrong information [5].
It is especially important to develop machine-adapted procedures for structuring the text information published on web resources. Automated analysis and structuring of web resource content makes it possible to solve complex applied problems in economics, ecology, medicine and other fields [7, 8].
Today, there are tools that can solve these problems to some extent.
Papers [9, 10] consider systems that parse the content of an entire website as well as of its particular pages, obtain data from dynamic pages and export information in a format suitable for comparison.
All these systems share one main drawback: they do not employ machine learning.

MATERIALS AND METHODS
The characteristics of a specific object are understood as a tree-like system of meaningful notions (concepts) that describe certain meaningful properties of the object. This system includes both separate typical characteristics, which determine whether the object belongs to a particular group, and generalized characteristics of the object as a whole.
Let the input content of a web resource be represented in the form of the tree-like structures (1)-(4). The main information structures are described in Fig. 1.
To compare the values of the analyzed characteristics with the corresponding reference data, it is necessary to analyze the information presented on the corresponding web pages [11].
The structure in which information about the analyzed objects is described is unknown a priori and may vary depending on the subject, content and technical implementation of the websites. Information on websites should therefore be structured, not just divided into blocks or paragraphs [12]. This requirement can significantly simplify the analysis of content relevance. Here, structuring means representing information in the form of defined structures.
To analyze information about a particular object, the set of key characteristics KDrO is formed. To support the analysis of web page content, the following auxiliary structure AS was created [13, 14]:
During analysis of the HTML code of the next web page of a specialized website, its identifier is set:
Then, we distinguish the elements LSTIt of the corresponding characteristics that fill the corresponding tags.
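As a rough illustration of distinguishing the elements that fill particular tags, the sketch below collects the text of tags whose `class` attribute names a key characteristic. The class names (`name`, `position`), the sample markup, the person named in it and the `TagTextCollector` helper are all hypothetical; the paper does not specify the markup it parses or how LSTIt is built.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag; they must not be pushed on the stack.
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class TagTextCollector(HTMLParser):
    """Collect text inside tags whose class names a key characteristic."""

    def __init__(self, key_characteristics):
        super().__init__()
        self.keys = set(key_characteristics)
        self.stack = []   # one entry per open tag: matched characteristic or None
        self.items = []   # extracted (characteristic, text) pairs

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = (dict(attrs).get("class") or "").split()
        match = next((c for c in classes if c in self.keys), None)
        self.stack.append(match)

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.stack:
            self.stack.pop()

    def handle_data(self, data):
        text = data.strip()
        # Keep text only when the innermost open tag matched a characteristic.
        if text and self.stack and self.stack[-1]:
            self.items.append((self.stack[-1], text))


def extract_items(html, key_characteristics):
    parser = TagTextCollector(key_characteristics)
    parser.feed(html)
    parser.close()
    return parser.items


# Hypothetical staff-page fragment (not taken from the analyzed website).
SAMPLE_HTML = """
<div class="person">
  <span class="name">John Smith</span><br>
  <span class="position">Head of department</span>
</div>
"""

items = extract_items(SAMPLE_HTML, ["name", "position"])
```

The sketch assumes reasonably well-formed markup; a production parser would also need to cope with unclosed tags and with characteristics that are not marked by a class attribute.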
Figure 1 - Structure of the information processing and storage subsystem for the program of detecting irrelevant and inaccurate information on web resources and comparison with the existing database
If a characteristic coincides with the relevant information, then its O_Dsc coincides with the corresponding relevant value. If the values of the corresponding characteristics do not coincide, then the information needs to be updated [15, 16].
Based on the analysis of existing systems for website content analysis, we can formulate the basic requirements for automated algorithms for detecting outdated and incorrect information (Fig. 2):
- setting URL filters, so as not to parse extra pages;
- setting the parsing depth;
- high-quality content downloading;
- multi-stream processing and saving of content in various formats.
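The first two requirements, URL filters and parsing depth, can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the `get_links` callback, the `example.org` addresses and the in-memory link map that stands in for real page downloads are all assumptions made for illustration.

```python
from collections import deque
from urllib.parse import urlparse


def allowed(url, allowed_host, excluded_prefixes):
    """URL filter: stay on one host and skip excluded path prefixes."""
    parts = urlparse(url)
    if parts.netloc != allowed_host:
        return False
    return not any(parts.path.startswith(p) for p in excluded_prefixes)


def crawl(start_url, get_links, allowed_host, excluded_prefixes=(), max_depth=2):
    """Breadth-first traversal limited by parsing depth and URL filters.

    get_links(url) is a caller-supplied function returning the links on a page.
    """
    seen = {start_url}
    order = []
    queue = deque([(start_url, 0)])
    while queue:
        url, depth = queue.popleft()
        order.append(url)
        if depth == max_depth:
            continue  # do not follow links beyond the configured depth
        for link in get_links(url):
            if link not in seen and allowed(link, allowed_host, excluded_prefixes):
                seen.add(link)
                queue.append((link, depth + 1))
    return order


# Hypothetical link map used instead of downloading real pages.
SITE = {
    "https://example.org/": [
        "https://example.org/staff",
        "https://example.org/news/1",
        "https://other.example.com/",
    ],
    "https://example.org/staff": ["https://example.org/staff/dept-a"],
    "https://example.org/staff/dept-a": ["https://example.org/staff/dept-a/p1"],
}

pages = crawl(
    "https://example.org/",
    lambda url: SITE.get(url, []),
    allowed_host="example.org",
    excluded_prefixes=("/news",),
    max_depth=2,
)
```

With these settings the news page and the external host are filtered out, and the page at depth 3 is never visited. Multi-stream downloading is left out of the sketch.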
After parsing, the generated results must be processed so that the information is represented in a form suitable for further use. The exact format depends on how the collected data will be processed in the future. Quite often, an RSS feed is formed from the parsed content using XML (Fig. 3). This makes it convenient to use the data without a rewriting procedure.
Sometimes, the result of parsing is saved into a CSV file, because this text format is very simple to process further, is easily converted to SQL queries and can be opened in Excel. In special cases, the final data needs to be represented in the form of XLS spreadsheets (Fig. 4).
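The CSV route can be sketched as follows, assuming parsed records are held as dictionaries. The field names, the `staff` table name and the sample record are hypothetical; the sketch only shows why CSV output converts easily into SQL statements.

```python
import csv
import io


def to_csv(records, fieldnames):
    """Serialize parsed records (dicts) to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


def csv_to_inserts(csv_text, table):
    """Turn each CSV row into a parameterized INSERT statement.

    Returns (sql, values) pairs; the values are kept separate from the SQL
    text so that a database driver can bind them safely.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    columns = reader.fieldnames or []
    placeholders = ", ".join("?" for _ in columns)
    sql = f"INSERT INTO {table} ({', '.join(columns)}) VALUES ({placeholders})"
    return [(sql, tuple(row[c] for c in columns)) for row in reader]


# Hypothetical parsed record.
records = [{"name": "John Smith", "position": "Head of department"}]
csv_text = to_csv(records, ["name", "position"])
inserts = csv_to_inserts(csv_text, "staff")
```

The `?` placeholder style is the one used by Python's `sqlite3` module; other database drivers may expect a different placeholder syntax.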

EXPERIMENTS
When analyzing text information obtained from web pages, attention must be paid to text normalization. Normalization increases the percentage of cases in which equivalence between the analyzed and reference data is established.
At the normalization stage, we expand some of the abbreviations using the corresponding subject dictionaries. In Table 2, for example, the abbreviation "Can.Tech.Sciences" is expanded to "Candidate of Technical Sciences". We also convert part of the text into a new form. For example, personal initials identified during parsing are transformed as shown in Table 1. Some phrases are also transformed into the more familiar equivalent phrases used in the subject area. For example, "The head of the department" is transformed into "Head of department", as shown in Table 2.
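The dictionary-based part of normalization can be sketched as below. The two dictionary entries come from the examples in the text; the dictionary layout, the whitespace-insensitive matching of abbreviations and the `normalize` function itself are assumptions, and the transformation of initials is omitted because its rule is given only in Table 1.

```python
import re

# Subject dictionaries; keys are stored in a canonical lower-case form.
ABBREVIATIONS = {"can.tech.sciences": "Candidate of Technical Sciences"}
PHRASES = {"the head of the department": "Head of department"}


def normalize(text):
    """Normalize one extracted field value using the subject dictionaries."""
    collapsed = re.sub(r"\s+", " ", text).strip()
    # Abbreviations are matched regardless of spacing after the periods.
    abbreviation_key = collapsed.replace(" ", "").lower()
    if abbreviation_key in ABBREVIATIONS:
        return ABBREVIATIONS[abbreviation_key]
    # Phrases are matched case-insensitively; unknown values pass through.
    return PHRASES.get(collapsed.lower(), collapsed)


degree = normalize("Can.Tech.Sciences")
position = normalize("The head of  the department")
unchanged = normalize("Professor")
```

The sketch works on whole field values; replacing abbreviations inside longer running text would need token-level matching instead of a direct dictionary lookup.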

RESULTS
Let us consider in more detail the developed system for structuring and recognizing outdated and incorrect information, using the example of the website of the Faculty of Computer Information Technologies of Ternopil National Economic University (staff information page) (Fig. 5). The lifecycle of the system includes the following stages:
- Choosing a web page or the entire website.
- Checking the validity and relevance of information.
- Parsing the content of a web page or a website.
- Saving the parsing results into a database.
- Viewing the selected profile.
- Validating the data, with the possibility of updating it in the database.
- Comparing the data obtained from the website with the data from the organization's database.
- If the results are insufficient, supplementing the rules (parsing rules, connected dictionaries).
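The comparison stage of this lifecycle can be illustrated with a minimal sketch that matches records parsed from the website against reference records from the organization's database and reports which ones need updating. The record layout, the status labels, the `compare_records` function and the sample people are hypothetical.

```python
def compare_records(site_records, reference_records, key="name"):
    """Compare parsed website records with reference database records.

    Returns (key value, status, differences) triples, where differences maps
    a field name to the pair (website value, database value).
    """
    reference = {record[key]: record for record in reference_records}
    report = []
    for record in site_records:
        ref = reference.get(record[key])
        if ref is None:
            report.append((record[key], "not in database", {}))
            continue
        diffs = {
            field: (value, ref.get(field))
            for field, value in record.items()
            if field != key and value != ref.get(field)
        }
        status = "needs update" if diffs else "relevant"
        report.append((record[key], status, diffs))
    return report


# Hypothetical data: one outdated position on the website, one matching record.
site = [
    {"name": "John Smith", "position": "Associate professor"},
    {"name": "Jane Doe", "position": "Head of department"},
]
database = [
    {"name": "John Smith", "position": "Professor"},
    {"name": "Jane Doe", "position": "Head of department"},
]

report = compare_records(site, database)
```

Records reported as "not in database" are natural candidates for the last stage, where parsing rules or dictionaries are supplemented.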

DISCUSSION
Thus, unlike existing systems, the created system for web resources content structuring and recognizing contains elements of machine learning. At the same time, the system was built on parsing algorithms similar to those used in known systems: DataCol, the Sjs-parser program system and the Content Downloader software system.
Another advantage of the proposed system is that it is cross-platform: the description language of the content structure is universal, and the implementation is adapted for use with different types of relational database management systems. Multiple applications of the system, one of which is presented above, showed high performance in recognizing, structuring and analyzing content on different types of web resources.
At the same time, the developed system has some disadvantages. The machine learning elements make it possible to expand the main database used for content analysis, which increases the completeness of content coverage. On the other hand, they also reduce the reliability of the recognition results and of the correspondence established between the analyzed content and the reference content.
Further development of this work should be directed toward studying indicators of the completeness of content representation in existing databases and, where machine learning elements are applied, toward determining the relevance of the representation of the characteristics of content structure elements.

CONCLUSION
The problem of structuring and recognizing web resource content with machine learning elements has been considered.
The scientific novelty of the obtained result consists in the creation of mathematical tools for a system for structuring and recognizing web resource content with machine learning elements. Its main difference from known systems is the presence of machine learning components. Owing to this, the proposed system ensures automated content structuring and recognition of outdated, unreliable or wrong information. The presented example confirms the effectiveness of the proposed system.
The practical significance of the obtained result consists in the development of software that may be used by support services to update and correct content information. The software was developed on the basis of algorithms that perform the following functions: setting URL filters, so as not to parse extra pages; setting the parsing depth; high-quality content downloading; multi-stream processing and saving of content in various formats.
Prospects for further research are the extension of the system's functional capabilities through the use of neural network elements for machine learning and the improvement of the quality of web resources content structuring.

Figure 2 - Architecture of software system for structuring and recognizing of text content of web resources with the machine learning

Figure 5 - Example of the structuring and recognizing of outdated and incorrect information on the website of educational institution

Table 1 - Normalization of text information

Table 2 - Normalization of text information