Results 1 - 20 of 127
1.
BMC Bioinformatics ; 25(1): 217, 2024 Jun 18.
Article in English | MEDLINE | ID: mdl-38890569

ABSTRACT

BACKGROUND: Tandem repeats (TRs) are sequences in genomic DNA that are repeated in tandem and are present in all organisms. One subcategory of TRs is satellite repeats, which are divided into macrosatellites, minisatellites, and microsatellites; the last two are of particular interest because their instability makes them useful for identifying polymorphisms between organisms. Currently, most mining tools focus on Simple Sequence Repeat (SSR) mining, and only a few can identify SSRs in coding regions. RESULTS: We developed a microsatellite mining software called SATIN (Micro and Mini SATellite IdentificatioN tool), based on a new sliding window algorithm and written in C and Python. It represents a new approach to SSR mining by addressing the limitations of existing tools, particularly in coding region SSR mining. SATIN is available at https://github.com/labgm/SATIN.git . It was shown to be the second fastest for perfect and compound SSR mining. It can identify SSRs in coding regions as well as SSRs with motif sizes larger than 6. Besides SSR mining, SATIN can also analyze SSR polymorphism in coding regions across predetermined groups and identify SSRs that are differentially abundant among them on a per-gene basis. To validate the tool, we analyzed SSRs from two groups of Escherichia coli (K12 and O157) and compared the results with 5 known SSRs from coding regions. SATIN identified all 5 SSRs among 237 genes with at least one SSR. CONCLUSIONS: SATIN is a novel microsatellite search software that uses an innovative sliding window technique, based on a numerical list for repeat region search, to identify perfect and composite SSRs while generating comprehensible and analyzable outputs. It accepts files in FASTA or GenBank format as input for microsatellite mining and can identify SSRs in coding regions for GenBank files. We expect SATIN to help identify potential SSRs to be used as genetic markers.
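
To make the idea of perfect SSR mining concrete, the following is a minimal Python sketch of a naive left-to-right scan for perfect tandem repeats. It is not SATIN's algorithm (the abstract does not describe the numerical-list sliding window in detail); the function name `find_perfect_ssrs` and its parameters are hypothetical and chosen only for illustration.

```python
def find_perfect_ssrs(seq, min_motif=1, max_motif=6, min_repeats=3):
    """Return (start, motif, repeat_count) tuples for perfect tandem repeats.

    Illustrative naive scan, not SATIN's method. Raise max_motif to look
    for motifs longer than 6 bases.
    """
    seq = seq.upper()
    hits = []
    i = 0
    while i < len(seq):
        best = None
        for k in range(min_motif, max_motif + 1):
            motif = seq[i:i + k]
            if len(motif) < k:
                break  # not enough sequence left for a motif of this size
            # Skip motifs that are themselves repeats of a shorter unit (e.g. "ATAT").
            if any(k % d == 0 and motif == motif[:d] * (k // d) for d in range(1, k)):
                continue
            # Count how many consecutive copies of the motif start at position i.
            count = 1
            while seq[i + count * k:i + (count + 1) * k] == motif:
                count += 1
            covered = count * k
            if count >= min_repeats and (best is None or covered > best[2] * len(best[1])):
                best = (i, motif, count)
        if best:
            hits.append(best)
            i += len(best[1]) * best[2]  # jump past the repeat just reported
        else:
            i += 1
    return hits
```

For example, `find_perfect_ssrs("GGATATATATCC")` reports the `AT` motif repeated four times starting at index 2. A production tool would also need to handle compound SSRs, ambiguous bases, and coding-region annotation, none of which this sketch attempts.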


Subjects
Data Mining , Microsatellite Repeats , Polymorphism, Genetic , Software , Microsatellite Repeats/genetics , Data Mining/methods , Algorithms , Open Reading Frames/genetics , DNA, Satellite/genetics
2.
J Pediatr (Rio J) ; 100(5): 512-518, 2024.
Article in English | MEDLINE | ID: mdl-38670169

ABSTRACT

OBJECTIVE: To determine reference intervals (RI) for fasting blood insulin (FBI) in Brazilian adolescents, 12 to 17 years old, by direct and indirect approaches, and to validate the indirectly determined RI. METHODS: Two databases were used for RI determination. Database 1 (DB1), used to obtain the RI through an a posteriori direct method, consisted of prospectively selected healthy individuals. Database 2 (DB2), used for the indirect method (Bhattacharya method), was retrospectively mined from an outpatient laboratory information system (LIS). RESULTS: From DB1, 29345 individuals were enrolled (57.65 % female), and seven age-range and sex partitions were statistically determined according to mean FBI values: females aged 12 and 13 years, 14 years, 15 years, and 16 and 17 years; and males aged 12, 13 and 14 years, 15 years, and 16 and 17 years. From DB2, 5465 adolescents (67.5 % female) were selected and grouped according to the DB1 partitions. The mean FBI level was significantly higher in DB2 in all groups. The RI upper limit (URL) determined by the Bhattacharya method was slightly lower than the 90 % CI URL directly obtained from DB1, except for the group of females aged 12 and 13 years. High agreement rates for diagnosing elevated FBI in all DB1 groups validated the indirect RI presented. CONCLUSION: The present study demonstrates that the Bhattacharya indirect method for determining FBI RI in adolescents can overcome some of the difficulties and challenges of the direct approach.


Subjects
Data Mining , Fasting , Insulin , Humans , Adolescent , Female , Male , Reference Values , Brazil , Child , Insulin/blood , Fasting/blood , Data Mining/methods , Retrospective Studies , Databases, Factual
3.
Prim Care Diabetes ; 18(3): 327-332, 2024 06.
Article in English | MEDLINE | ID: mdl-38616442

ABSTRACT

AIMS: Machine learning models can use image and text data to predict the number of years since diabetes diagnosis; such a model can be applied to new patients to predict, approximately, how long they may have lived with diabetes unknowingly. We aimed to develop a model to predict self-reported diabetes duration. METHODS: We used the Brazilian Multilabel Ophthalmological Dataset. The unit of analysis was the fundus image and its metadata, regardless of the patient. We included people aged 40+ years and fundus images without diabetic retinopathy. Fundus images and metadata (sex, age, comorbidities, and insulin use) were passed to the MedCLIP model to extract the embedding representation. The embedding representation was passed to an Extra Tree Classifier to predict the category of years with self-reported diabetes: 0-4, 5-9, 10-14, or 15+. RESULTS: There were 988 images from 563 people (mean age = 67 years; 64 % were women). Overall, the F1 score was 57 %. The group with 15+ years of self-reported diabetes had the highest precision (64 %) and F1 score (63 %), while the highest recall (69 %) was observed in the group with 0-4 years. The proportion of correctly classified observations was 55 % for the group with 0-4 years, 51 % for 5-9 years, 58 % for 10-14 years, and 64 % for 15+ years with self-reported diabetes. CONCLUSIONS: The machine learning model had acceptable accuracy and F1 score, and correctly classified more than half of the patients according to diabetes duration. Using large foundational models to extract image and text embeddings seems to be a feasible and efficient approach to predicting years living with self-reported diabetes.
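
The per-group precision, recall, and F1 figures reported above are standard multi-class metrics. As a rough illustration of how such figures are derived from a confusion matrix, here is a minimal Python sketch; the function name `per_class_metrics` and the toy counts are hypothetical and are not taken from the study.

```python
def per_class_metrics(confusion, labels):
    """Compute precision, recall, and F1 for each class.

    confusion[i][j] is the number of items whose true class is labels[i]
    and whose predicted class is labels[j].
    """
    metrics = {}
    for idx, label in enumerate(labels):
        tp = confusion[idx][idx]
        fn = sum(confusion[idx]) - tp                  # true class idx, predicted elsewhere
        fp = sum(row[idx] for row in confusion) - tp   # predicted idx, true class elsewhere
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0  # harmonic mean
        metrics[label] = {"precision": precision, "recall": recall, "f1": f1}
    return metrics


# Toy counts for two duration groups (made up for illustration only).
toy_labels = ["0-4", "5-9"]
toy_confusion = [[3, 1],
                 [2, 4]]
toy_metrics = per_class_metrics(toy_confusion, toy_labels)
```

Because F1 is the harmonic mean of precision and recall, a group can have the highest recall without having the highest F1, which is consistent with the pattern the abstract describes for the 0-4 and 15+ groups.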


Subjects
Diabetes Mellitus , Fundus Oculi , Machine Learning , Predictive Value of Tests , Self Report , Humans , Female , Male , Aged , Middle Aged , Time Factors , Diabetes Mellitus/diagnosis , Diabetes Mellitus/epidemiology , Brazil/epidemiology , Adult , Databases, Factual , Diabetic Retinopathy/diagnosis , Diabetic Retinopathy/epidemiology , Data Mining/methods , Reproducibility of Results , Image Interpretation, Computer-Assisted
4.
Work ; 78(2): 399-410, 2024.
Article in English | MEDLINE | ID: mdl-38277324

ABSTRACT

BACKGROUND: Occupational accidents in the plumbing activity in the construction sector in developing countries lead to high rates of work absenteeism, which heavily influences the productivity of enterprises. OBJECTIVE: To propose a model based on the Plan, Do, Check, and Act cycle and data mining for the prevention of occupational accidents in the plumbing activity in the construction sector. METHODS: This cross-sectional study was conducted with a total of 200 male technical workers in plumbing. It considers biological, biomechanical, chemical, and physical risk factors. Three data mining algorithms were compared for classifying the occurrence of occupational accidents: Logistic Regression, Naive Bayes, and Decision Trees. The model was validated on 20% of the data collected, maintaining the same proportion between accidents and non-accidents. The model was applied to data collected from the last 17 years of occupational accidents in the plumbing activity in a Colombian construction company. RESULTS: The results showed that, in 90.5% of the cases, the decision tree classifier (J48) correctly identified the possible cases of occupational accidents when the biological, chemical, and biomechanical risk factors were used as training variables in the model. CONCLUSION: The results of this study are promising in that the model is efficient in predicting the occurrence of an occupational accident in the plumbing activity in the construction sector. For the accidents identified and the associated causes, a plan of measures to mitigate the risk of occupational accidents is proposed.
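
Validating on 20% of the data while keeping the same proportion of accidents and non-accidents is usually called a stratified holdout split. The sketch below shows one way to do this in plain Python; it is an assumption about the general technique, not the authors' code, and the function name `stratified_holdout` and the example class counts are hypothetical.

```python
import random
from collections import defaultdict


def stratified_holdout(labels, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx), preserving class proportions in both parts."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)  # fixed seed so the split is reproducible
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)  # per-class share of the test set
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Hypothetical labels: 1 = accident, 0 = no accident (counts are made up).
example_labels = [1] * 50 + [0] * 150
train, test = stratified_holdout(example_labels)
```

With these made-up counts, the test set holds 40 records, 10 of them accidents, so the 1:3 ratio of the full data is kept in both parts.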


Subjects
Accidents, Occupational , Construction Industry , Data Mining , Humans , Data Mining/methods , Cross-Sectional Studies , Accidents, Occupational/prevention & control , Accidents, Occupational/statistics & numerical data , Male , Adult , Colombia/epidemiology , Risk Factors , Bayes Theorem , Decision Trees , Logistic Models , Algorithms
5.
Article in Spanish | LILACS, CUMED | ID: biblio-1536340

ABSTRACT

Introduction: In Cuba and in the rest of the world, cardiovascular diseases are recognized as a major and growing public health problem that causes high mortality. Objective: To design a predictive model to estimate the risk of cardiovascular disease based on artificial intelligence techniques. Methods: The data source was a prospective cohort of 1633 patients followed for 10 years. The data mining tool Weka was used, and attribute selection techniques were employed to obtain a smaller subset of significant variables. To generate the models, the rule algorithm JRip and the meta-algorithm Attribute Selected Classifier were applied, using J48 and Multilayer Perceptron as classifiers. The resulting models were compared, and the metrics most commonly used for imbalanced classes were applied. Results: The most significant attribute was a history of arterial hypertension, followed by high- and low-density lipoprotein cholesterol, high-sensitivity C-reactive protein, and systolic blood pressure; all the prediction rules were derived from these attributes. The algorithms were effective in generating the model. The best performance was obtained with the Multilayer Perceptron, with a true positive rate of 95.2 percent and an area under the ROC curve of 0.987 in cross-validation. Conclusions: A predictive model was designed using artificial intelligence techniques; it is a valuable resource for the prevention of cardiovascular diseases in primary health care.


Subjects
Humans , Male , Female , Primary Health Care , Artificial Intelligence , Prospective Studies , Data Mining/methods , Forecasting/methods , Heart Disease Risk Factors , Cuba
6.
Int J Occup Saf Ergon ; 29(3): 1088-1100, 2023 Sep.
Article in English | MEDLINE | ID: mdl-35980110

ABSTRACT

Accident investigation reports provide useful knowledge that helps companies propose preventive and mitigative measures. However, accident report databases are normally large and complex, contain errors, and have missing and/or redundant data. In this article, we propose text mining and natural language processing techniques to investigate low-quality accident reports. We adopted machine learning (ML) to detect and investigate inconsistencies in accident reports. The methodology was applied to 626 documents collected from an actual hydroelectric power company. The initial ML performances indicated data divergences and concerns related to the report structure. The accident database was then restructured into a more suitable form, confirming the supposition about the quality of the reports investigated. The proposed approach can be used as a diagnostic tool to improve the design of accident investigation reports so that they provide a more useful source of knowledge to support decisions in the safety context.


Subjects
Accidents , Data Mining , Humans , Data Mining/methods , Machine Learning , Natural Language Processing , Databases, Factual
7.
BMC Bioinformatics ; 23(1): 558, 2022 Dec 23.
Article in English | MEDLINE | ID: mdl-36564712

ABSTRACT

BACKGROUND: In order to detect threats to public health and to be well-prepared for endemic and pandemic illness outbreaks, countries usually rely on event-based surveillance (EBS) and indicator-based surveillance systems. Event-based surveillance systems are key components of early warning systems and focus on fast capturing of data to detect threat signals through channels other than traditional surveillance. In this study, we develop Natural Language Processing tools that can be used within EBS systems. In particular, we focus on information extraction techniques that enable digital surveillance to monitor Internet data and social media. RESULTS: We created an annotated Spanish corpus from ProMED-mail health reports regarding disease outbreaks in Latin America. The corpus has been used to train algorithms for two information extraction tasks: named entity recognition and relation extraction. The algorithms, based on deep learning and rules, have been applied to recognize diseases, hosts, and geographical locations where a disease is occurring, among other entities and relations. In addition, an in-depth analysis of micro-average F1 metrics shows the suitability of our approaches for both tasks. CONCLUSIONS: The annotated corpus and algorithms presented could leverage the development of automated tools for extracting information from news and health reports written in Spanish. Moreover, this framework could be useful within EBS systems to support the early detection of Latin American disease outbreaks.


Subjects
Disease Outbreaks , Public Health , Humans , Latin America/epidemiology , Natural Language Processing , Data Mining/methods
8.
PeerJ ; 10: e13351, 2022.
Article in English | MEDLINE | ID: mdl-35539017

ABSTRACT

Antimicrobial resistance is a significant public health problem worldwide. In recent years, the scientific community has been intensifying efforts to combat this problem; many experiments have been conducted and many articles published in this area. However, the growing volume of biological literature increases the difficulty of the biocuration process because of the cost and time required. Modern text mining tools that adopt artificial intelligence technology can help research advance. In this article, we propose a text mining model capable of identifying, ranking, and prioritizing scientific articles in the context of antimicrobial resistance. We retrieved scientific articles from the PubMed database, adopted machine learning techniques to generate the vector representation of the retrieved scientific articles, and identified their similarity with the context. As a result of this process, we obtained a dataset labeled "Relevant" and "Irrelevant" and used this dataset to implement a supervised learning algorithm to classify new records. The model's overall performance reached 90% accuracy, and the F-measure (the harmonic mean of precision and recall) reached 82% for the positive class and 93% for the negative class, showing quality in the identification of scientific articles relevant to the context. The dataset, scripts and models are available at https://github.com/engbiopct/TextMiningAMR.
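
The step of representing articles as vectors and scoring their similarity to a context can be illustrated with cosine similarity over simple term counts. This is only a sketch under stated assumptions: the abstract does not say which vector representation or similarity measure the authors used, and the helper names `term_counts`, `cosine_similarity`, and `rank_by_similarity` are hypothetical.

```python
import math
import re
from collections import Counter


def term_counts(text):
    """Bag-of-words vector: lowercase alphabetic tokens mapped to their counts."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_by_similarity(context, documents):
    """Return (score, document) pairs, most similar to the context first."""
    ctx = term_counts(context)
    scored = [(cosine_similarity(ctx, term_counts(d)), d) for d in documents]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)
```

Documents that share no terms with the context score 0, and higher scores indicate more shared vocabulary. A threshold on such a score is one possible way to derive "Relevant" and "Irrelevant" labels for later supervised training, though the abstract does not state how the labels were assigned.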


Subjects
Anti-Infective Agents , Artificial Intelligence , Data Mining/methods , Algorithms , Machine Learning
9.
Big Data ; 10(4): 279-297, 2022 08.
Article in English | MEDLINE | ID: mdl-35394342

ABSTRACT

The amount of available data is continuously growing. This phenomenon has given rise to a new concept, big data. The most prominent technologies related to big data are cloud computing (infrastructure) and Not Only SQL (NoSQL; data storage). In addition, for data analysis, machine learning algorithms such as decision trees, support vector machines, artificial neural networks, and clustering techniques present promising results. In a biological context, big data has many applications because of the large number of biological databases available. Some limitations of biological big data are related to the inherent features of these data, such as high degrees of complexity and heterogeneity, since biological systems provide information from an atomic level to interactions between organisms or their environment. Such characteristics make most bioinformatic-based applications difficult to build, configure, and maintain. Although the rise of big data is relatively recent, it has contributed to a better understanding of the underlying mechanisms of life. The main goal of this article is to provide a concise and reliable survey of the application of big data-related technologies in biology. As such, some fundamental concepts of information technology, including storage resources, analysis, and data sharing, are described along with their relation to biological data.


Subjects
Big Data , Data Mining , Cloud Computing , Data Mining/methods , Machine Learning , Neural Networks, Computer
10.
Sci. agric ; 79(01): 1-15, 2022. maps, tables, illus., graphs
Article in English | VETINDEX | ID: biblio-1498016

ABSTRACT

Lettuce (Lactuca sativa) is the main leafy vegetable produced in Brazil. Since its production is widespread all over the country, lettuce traceability and quality assurance are hampered. In this study, we propose a new method to identify the geographical origin of Brazilian lettuce. The method uses a powerful data mining technique called support vector machines (SVM), applied to the elemental composition and soil properties of the samples analyzed. We investigated lettuce produced in São Paulo and Pernambuco, two states in the southeastern and northeastern regions of Brazil, respectively. We investigated the efficiency of the SVM model by comparing its results with those achieved by traditional linear discriminant analysis (LDA). The SVM models outperformed the LDA models in the two scenarios investigated, achieving an average of 98 % prediction accuracy in discriminating lettuce from the two states. A feature evaluation formula, called the F-score, was used to measure the discriminative power of the variables analyzed. The soil exchangeable cation capacity, soil contents of low-crystallized Al, and Zn content in lettuce samples were the most relevant components for differentiation. Our results reinforce the potential of data mining and machine learning techniques to support traceability strategies and authentication of leafy vegetables.
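
The abstract does not give the F-score formula. One widely used definition for ranking features before SVM training divides the separation of the two class means from the overall mean by the sum of the within-class sample variances. The sketch below assumes that definition; the function name `f_score` and the sample values are hypothetical.

```python
from statistics import mean, variance


def f_score(pos, neg):
    """F-score of a single feature for a two-class problem.

    pos and neg hold the feature's values in each class (at least two values
    per class). Larger scores mean the feature separates the classes better.
    Assumed definition; the abstract does not state the formula used.
    """
    overall = mean(pos + neg)
    between = (mean(pos) - overall) ** 2 + (mean(neg) - overall) ** 2
    within = variance(pos) + variance(neg)  # sample variances (n - 1 denominator)
    return between / within if within else float("inf")


# Made-up values of one element's concentration in samples from two states.
state_a = [1.0, 2.0, 3.0]
state_b = [4.0, 5.0, 6.0]
score = f_score(state_a, state_b)
```

Computing this score for each variable separately and sorting the results gives a simple ranking of discriminative power. It treats features independently, so it cannot capture information shared between variables.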


Subjects
Lactuca/growth & development , Soil Analysis , Data Mining/methods , Soil Chemistry/analysis , Food Supply