Application of the K-means and Random Forest algorithms to segment potential students of the master's program in Statistics with a specialization in Data Science and Artificial Intelligence at ESPOCH.
| Main Author: | |
|---|---|
| Format: | masterThesis |
| Language: | spa |
| Published: | 2025 |
| Subjects: | |
| Online Access: | http://dspace.unach.edu.ec/handle/51000/14997 |
| Summary: | This research segments potential students of the Master's program in Statistics with a specialization in Data Science and Artificial Intelligence at ESPOCH using unsupervised and supervised machine learning techniques: K-means and Random Forest, respectively. Initially, 700 surveys of 19 questions each were collected, then cleaned and mapped to ensure accuracy and consistency. Nominal variables were One-Hot encoded, ordinal variables were assigned numerical values, and the variables were standardized to balance their scales. The elbow method indicated three as the optimal number of clusters. After less relevant variables and outliers were removed, the K-means algorithm was applied and achieved a Silhouette Score of 0.5886, which the author interprets as strong cohesion and clear separation between clusters. Principal Component Analysis (PCA) was used to visualize the resulting clusters. The Davies-Bouldin Index was 0.6491 and the Calinski-Harabasz Index was 798.4427, further supporting the quality of the segmentation; the within-cluster sum of squares (WCSS) was 4034.7774, which the author takes as confirming appropriate cluster compactness. After validation, three well-defined clusters were identified, detailed profiles were developed for each segment, and specific digital marketing strategies were proposed for "Young Technologists in Projection," "Professional Leaders in Academic Transition," and "Educators in Professional Evolution." A Random Forest model was trained and tuned with cross-validation and Grid Search, and it identified the most influential variables in the segmentation. Finally, an automated pipeline was created to process new surveys and assign them to their corresponding clusters. |
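The summary relies on three clustering quantities: the within-cluster sum of squares (WCSS) behind the elbow method, the K-means assignment itself, and the Silhouette Score. The following is a minimal, dependency-free sketch of how these can be computed, not the thesis code. The function names (`kmeans`, `farthest_point_init`, `silhouette_score`), the farthest-point initialization (chosen here only to keep the run deterministic), and the toy two-dimensional data are all assumptions made for illustration; the record does not state which library or initialization the author used, and the sketch does not reproduce the reported values.

```python
import math


def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def farthest_point_init(points, k):
    """Deterministic seeding (an assumption of this sketch): start from the
    first point, then repeatedly add the point farthest from its nearest seed."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(sq_dist(p, c) for c in centroids))
        )
    return centroids


def kmeans(points, k, max_iter=100):
    """Lloyd's algorithm. Returns (labels, centroids, wcss)."""
    centroids = farthest_point_init(points, k)
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(k), key=lambda j: sq_dist(p, centroids[j])) for p in points
        ]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = tuple(sum(x) / len(members) for x in zip(*members))
    wcss = sum(sq_dist(p, centroids[lab]) for p, lab in zip(points, labels))
    return labels, centroids, wcss


def silhouette_score(points, labels):
    """Mean of (b - a) / max(a, b) over all points, where a is the mean
    distance to the point's own cluster and b the mean distance to the
    nearest other cluster. Singleton clusters contribute 0."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")
    total = 0.0
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            continue
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        b = min(
            sum(math.dist(p, q) for q in other) / len(other)
            for other_lab, other in clusters.items()
            if other_lab != lab
        )
        total += (b - a) / max(a, b)
    return total / len(points)


# Toy data (hypothetical, not the survey data): three compact groups of five points.
offsets = [(-1, 0), (1, 0), (0, -1), (0, 1), (0, 0)]
centers = [(0, 0), (10, 0), (0, 10)]
points = [(cx + dx, cy + dy) for cx, cy in centers for dx, dy in offsets]

# Elbow method: WCSS for each candidate k; look for the k where the drop levels off.
elbow = {k: kmeans(points, k)[2] for k in range(1, 6)}

labels, centroids, wcss = kmeans(points, 3)
score = silhouette_score(points, labels)
print("WCSS by k:", elbow)
print("k=3 WCSS:", wcss, "silhouette:", round(score, 4))
```

On this toy data the WCSS falls sharply up to k = 3 and only slightly afterwards, which is the pattern the elbow method looks for. On real survey data, encoding and standardization would need to come first, as the summary describes.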