Methods for analyzing information in corpora of scientific articles with classification algorithms and NLTK libraries on the ECUCIENCIA Scientific Platform
| Main author: | |
|---|---|
| Format: | masterThesis |
| Language: | spa |
| Published: | 2020 |
| Subjects: | |
| Online access: | http://repositorio.utc.edu.ec/handle/27000/7234 |
| Summary: | The ECUCIENCIA web platform of the Technical University of Cotopaxi stores the scientific output of the university's teaching researchers. The system reports some metrics for each article, but it considers only the title, abstract, and keywords, which is insufficient given the richness of the full content of each PDF document. Relevant information about research lines and related scientific documents can be extracted from the word frequencies of each document. To address this, a method for analyzing information in a corpus of scientific articles was established, using data-processing algorithms from the Python libraries NLTK, NumPy, Matplotlib, PyPDF2, scikit-learn, and SciPy. The Scrum methodology was used to develop the module, and the results were validated with statistical methods. Data were obtained by simple random sampling, and analysis of the corpus of articles in the selected sample yielded relevant information and visualizations of significant data based on the Euclidean, correlation, Chebyshev, and cosine distances, the Jaccard coefficient, and the Dice index. Validation by one-way analysis of variance gave F = 17.621, above the critical value of 2.412, with a probability below 0.05, indicating that the frequency variables of the articles are statistically significant in representing metrics for the article corpus. |
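The summary names the measures computed from word frequencies but does not describe the pipeline itself. The following is only a minimal sketch of how such measures can be derived from two documents, written with the Python standard library so that it runs without extra dependencies. The `DOCS` texts, the tokenizer, and all function names are hypothetical illustrations and are not taken from the thesis. In practice, text would first be extracted from the PDFs, and `scipy.spatial.distance` offers functions for the same family of measures (note that its `jaccard` and `dice` return dissimilarities on boolean vectors, whereas the sketch below returns similarity coefficients on vocabularies).

```python
import math
import re
from collections import Counter

# Hypothetical stand-ins for text already extracted from two article PDFs.
DOCS = {
    "article_a": "Soil quality and crop yield in Andean soil.",
    "article_b": "Crop yield models for Andean agriculture.",
}


def word_counts(text):
    """Lowercase the text and count alphabetic tokens (assumed tokenizer)."""
    return Counter(re.findall(r"[a-záéíóúñü]+", text.lower()))


def frequency_vectors(counts_a, counts_b):
    """Align two Counters on a shared, sorted vocabulary."""
    vocab = sorted(set(counts_a) | set(counts_b))
    return [counts_a[w] for w in vocab], [counts_b[w] for w in vocab]


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def chebyshev(u, v):
    return max(abs(a - b) for a, b in zip(u, v))


def cosine_distance(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norms


def correlation_distance(u, v):
    """1 - Pearson correlation; undefined if either vector is constant."""
    mean_u, mean_v = sum(u) / len(u), sum(v) / len(v)
    return cosine_distance([a - mean_u for a in u], [b - mean_v for b in v])


def jaccard_coefficient(counts_a, counts_b):
    """Shared vocabulary size divided by combined vocabulary size."""
    a, b = set(counts_a), set(counts_b)
    return len(a & b) / len(a | b)


def dice_index(counts_a, counts_b):
    """Twice the shared vocabulary size over the sum of both sizes."""
    a, b = set(counts_a), set(counts_b)
    return 2 * len(a & b) / (len(a) + len(b))


counts_a = word_counts(DOCS["article_a"])
counts_b = word_counts(DOCS["article_b"])
u, v = frequency_vectors(counts_a, counts_b)

print("euclidean:", euclidean(u, v))
print("chebyshev:", chebyshev(u, v))
print("cosine distance:", cosine_distance(u, v))
print("correlation distance:", correlation_distance(u, v))
print("jaccard coefficient:", jaccard_coefficient(counts_a, counts_b))
print("dice index:", dice_index(counts_a, counts_b))
```

The distance functions operate on count vectors aligned to a shared vocabulary, while the Jaccard and Dice measures here compare only which words occur. A real corpus would also need stop-word removal and stemming or lemmatization (steps for which NLTK is commonly used) before the counts are meaningful.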