Publications of the staff of the V.N. Sukachev Institute of Forest SB RAS

    A nonparametric algorithm for automatic classification of large multivariate statistical data sets and its application
/ I. V. Zenkov, A. V. Lapko, V. A. Lapko [et al.] // Comput. Opt. - 2021. - Vol. 45, Is. 2. - P. 253-+, DOI 10.18287/2412-6179-CO-801. - Cited References: 13. - The research was funded by RFBR, Krasnoyarsk Territory and Krasnoyarsk Regional Fund of Science, project number 20-41-240001. - ISSN 0134-2452. - ISSN 2412-6179
Subject heading: Optics

Abstract: A nonparametric algorithm for automatic classification of large statistical data sets is proposed. The algorithm is based on a procedure for optimal discretization of the range of values of a random variable. A class is a compact group of observations of a random variable corresponding to a unimodal fragment of the probability density. The considered algorithm of automatic classification is based on the "compression" of the initial information based on the decomposition of a multidimensional space of attributes. As a result, a large statistical sample is transformed into a data array composed of the centers of multidimensional sampling intervals and the corresponding frequencies of random variables. To substantiate the optimal discretization procedure, we use the results of a study of the asymptotic properties of a kernel-type regression estimate of the probability density. An optimal number of sampling intervals for the range of values of one- and two-dimensional random variables is determined from the condition of the minimum root-mean square deviation of the regression probability density estimate. The results obtained are generalized to the discretization of the range of values of a multidimensional random variable. The optimal discretization formula contains a component that is characterized by a nonlinear functional of the probability density. An analytical dependence of the detected component on the antikurtosis coefficient of a one-dimensional random variable is established. For independent components of a multidimensional random variable, a methodology is developed for calculating estimates of the optimal number of sampling intervals for random variables and their lengths. On this basis, a nonparametric algorithm for the automatic classification is developed. It is based on a sequential procedure for checking the proximity of the centers of multidimensional sampling intervals and relationships between frequencies of the membership of the random variables from the original sample of these intervals. To further increase the computational efficiency of the proposed automatic classification algorithm, a multithreaded method of its software implementation is used. The practical significance of the developed algorithms is confirmed by the results of their application in processing remote sensing data.

WOS

Document holders:
Siberian Fed Univ, Svobodny Av 79, Krasnoyarsk 660041, Russia.
Inst Computat Modelling SB RAS, Akademgorodok 50, Krasnoyarsk 660036, Russia.
Sukachev Inst Forest SB RAS, Akademgorodok 50, Krasnoyarsk 660036, Russia.
Reshetnev Siberian State Univ Sci & Technol, Krasnoyarsky Rabochy Av 31, Krasnoyarsk 660037, Russia.
Fed Res Ctr Informat & Computat Technol, Krasnoyarsk Branch, Mira Av 53, Krasnoyarsk 660049, Russia.

Additional access points:
Zenkov, I. V.; Lapko, A. V.; Lapko, V. A.; Im, S. T.; Tuboltsev, V. P.; Avdeenok, V. L.; Russian Foundation for Basic Research (RFBR); Krasnoyarsk Regional Fund of Science [20-41-240001]; Krasnoyarsk Territory