Authors: Dmitry Romanov, Lyudmila Gadasina
Title of host publication: Proceedings - 2024 International Russian Smart Industry Conference, SmartIndustry Conference, 2024
Abstract: Clustering datasets that mix quantitative and categorical variables is difficult with classical methods of cluster analysis. The study proposes clustering such multivariate data with a decision tree. The method requires setting the number of clusters, an upper limit on the proportion of observations that may fall into any one cluster, and a target variable. The target variable can be chosen by expert judgment or by experimenting with different candidates. The method was tested on advertisements for residential real estate for sale in St. Petersburg, first on a dataset with only quantitative variables and then on a dataset with mixed quantitative and categorical variables. The results were compared with those of hierarchical clustering under different distance metrics. The proposed method does not require data standardization, runs faster than hierarchical clustering, and yields clustering results that are easier to interpret.
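The abstract does not specify the algorithm, so the following is only a minimal sketch of one way a decision tree can act as a clustering device, not the authors' implementation. It assumes a regression tree grown leaf by leaf on the chosen target variable, with threshold splits on quantitative features and equality splits on categorical ones, and it treats each final leaf as a cluster. The names `sse`, `best_split`, `tree_clusters` and `max_share`, the rule of splitting oversized leaves first, and the toy apartment data are all hypothetical choices made for illustration.

```python
from statistics import fmean


def sse(values):
    """Sum of squared deviations from the mean (0.0 for an empty list)."""
    if not values:
        return 0.0
    mean = fmean(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(rows, idx, target):
    """Best single split of one leaf: (gain, left_idx, right_idx, rule) or None."""
    parent = sse([target[i] for i in idx])
    best = None
    for f in range(len(rows[0])):
        values = sorted({rows[i][f] for i in idx})
        numeric = all(isinstance(v, (int, float)) for v in values)
        # Quantitative feature: "<= v" thresholds; categorical feature: "== v" tests.
        candidates = values[:-1] if numeric else values
        for v in candidates:
            left, right = [], []
            for i in idx:
                goes_left = rows[i][f] <= v if numeric else rows[i][f] == v
                (left if goes_left else right).append(i)
            if not left or not right:
                continue
            gain = parent - sse([target[i] for i in left]) - sse([target[i] for i in right])
            if best is None or gain > best[0]:
                best = (gain, left, right, (f, "<=" if numeric else "==", v))
    return best


def tree_clusters(rows, target, n_clusters, max_share):
    """Grow leaves until n_clusters exist; leaves above max_share are split first.

    Assumption: max_share is a priority, not a guarantee, in this sketch.
    """
    n = len(rows)
    leaves = [list(range(n))]
    while len(leaves) < n_clusters:
        oversized = [leaf for leaf in leaves if len(leaf) > max_share * n]
        options = []
        for leaf in oversized or leaves:
            split = best_split(rows, leaf, target)
            if split is not None:
                options.append((split[0], leaf, split))
        if not options:
            break  # no leaf can be split any further
        _, leaf, (_, left, right, _) = max(options, key=lambda o: o[0])
        leaves.remove(leaf)
        leaves += [left, right]
    labels = [0] * n
    for k, leaf in enumerate(leaves):
        for i in leaf:
            labels[i] = k
    return labels


# Invented toy data: (area in m2, rooms, district), with price as the target.
rows = [
    (30.0, 1, "north"), (32.0, 1, "north"),
    (55.0, 2, "center"), (58.0, 2, "center"),
    (90.0, 3, "center"), (95.0, 3, "south"),
]
price = [4.0, 4.2, 9.0, 9.5, 15.0, 14.0]

labels = tree_clusters(rows, price, n_clusters=3, max_share=0.5)
print(labels)
```

Because every split is a rule on a single raw feature, no standardization or distance metric is needed, and each cluster can be read off as a short chain of conditions, which is consistent with the interpretability advantage the abstract claims.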