Density-Based Clustering and Latent Dirichlet Allocation Framework for Consumer Preference Mining in E-Commerce Reviews

DOI: 10.23977/autml.2026.070120

Author(s)

Weiyi Zhu 1, Ming Yin 1, Yan Zhang 1, Yanan Peng 1

Affiliation(s)

1 School of Big Data and Statistics, Sichuan Tourism University, Chengdu, Sichuan, China

Corresponding Author

Yanan Peng

ABSTRACT

Consumer preference inference from large-scale electronic commerce reviews remains a fundamentally challenging task due to the high dimensionality, sparsity, and noise characteristics inherent in user-generated textual content. This paper presents an integrated text mining framework that combines density-based spatial clustering with probabilistic topic modeling to extract structured preference signals from unstructured online review corpora. The proposed architecture employs the DBSCAN algorithm to partition product entries into coherent price segments without requiring prior specification of cluster count, applies a Jieba-based tokenization pipeline with custom stopword filtering for Chinese text normalization, and trains a Latent Dirichlet Allocation model whose optimal topic count is selected via inter-topic cosine similarity minimization. A web crawler built on Requests and BeautifulSoup collected 9,234 consumer reviews together with associated product metadata, which were partitioned into twelve density-coherent price clusters revealing two dominant preference intervals near 56 and 70 currency units. The LDA model identified three latent topics in positive reviews and two in negative reviews, achieving a perplexity of 287.4 and a topic coherence of 0.524, representing an 18.7% improvement over comparable LSA and NMF baselines. Sentiment-aware classification reached 92.6% accuracy with an F1 score of 91.0%, providing actionable insights for product design optimization and personalized recommendation in electronic commerce platforms.
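The two less routine steps named in the abstract, density-based grouping of prices and choosing the topic count by minimizing inter-topic cosine similarity, can be illustrated with a minimal sketch. It is written in pure Python so that it stays self-contained and is not the authors' implementation. The function names, the `eps` and `min_pts` values, the toy prices, and the toy topic-word matrices are all assumptions made for illustration; the abstract does not report the parameters or libraries used for these steps.

```python
from itertools import combinations
from math import sqrt


def dbscan_1d(prices, eps, min_pts):
    """Label each price with a cluster id (0, 1, ...) or -1 for noise."""
    n = len(prices)
    # Each neighbourhood includes the point itself, as in standard DBSCAN.
    neighbours = [
        [j for j in range(n) if abs(prices[i] - prices[j]) <= eps]
        for i in range(n)
    ]
    labels = [None] * n  # None means "not visited yet"
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbours[i]) < min_pts:
            labels[i] = -1  # provisional noise; may later become a border point
            continue
        cluster += 1  # i is a core point, so it seeds a new cluster
        labels[i] = cluster
        queue = list(neighbours[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reclaimed as a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbours[j]) >= min_pts:
                queue.extend(neighbours[j])  # j is also core: keep expanding
    return labels


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mean_topic_similarity(topic_word):
    """Average pairwise cosine similarity between topic-word rows (needs >= 2 topics)."""
    pairs = list(combinations(topic_word, 2))
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)


def pick_topic_count(candidates):
    """candidates maps k to a topic-word matrix; return the k whose topics overlap least."""
    return min(candidates, key=lambda k: mean_topic_similarity(candidates[k]))


# Hypothetical toy data: two dense price bands plus one outlier.
prices = [55.0, 56.0, 56.5, 57.0, 69.0, 70.0, 70.5, 71.0, 120.0]
print(dbscan_1d(prices, eps=1.5, min_pts=3))

# Hypothetical topic-word distributions for three candidate topic counts.
candidates = {
    2: [[0.4, 0.3, 0.3], [0.3, 0.4, 0.3]],
    3: [[0.8, 0.1, 0.1], [0.1, 0.8, 0.1], [0.1, 0.1, 0.8]],
    4: [[0.8, 0.1, 0.1], [0.1, 0.8, 0.1], [0.1, 0.1, 0.8], [0.7, 0.2, 0.1]],
}
print(pick_topic_count(candidates))
```

In practice the topic-word rows would come from an LDA model trained once per candidate topic count, and the price vector from the crawled product metadata. The quadratic neighbour search is adequate for a sketch but would likely be replaced by an indexed implementation on a corpus of the size reported here.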

KEYWORDS

Latent Dirichlet Allocation, Density-Based Spatial Clustering, Chinese Text Tokenization, Consumer Preference Mining, Probabilistic Topic Modeling, E-Commerce Review Analysis

CITE THIS PAPER

Weiyi Zhu, Ming Yin, Yan Zhang, Yanan Peng. Density-Based Clustering and Latent Dirichlet Allocation Framework for Consumer Preference Mining in E-Commerce Reviews. Automation and Machine Learning (2026). Vol. 7, No. 1, 162-172. DOI: http://dx.doi.org/10.23977/autml.2026.070120.

All published work is licensed under a Creative Commons Attribution 4.0 International License.
