Elia Giuseppe Ceroni (University of Siena)
When: Nov 10th, 2021 – 11:00 – 11:45 AM
Where: Google Meet link
In recent years, a number of Copy Number Variant (CNV) detection tools based on Whole Exome Sequencing (WES) data have become available, yet there is still no guidance on their use in clinical settings because of their low sensitivity and specificity. Most of them currently rely on the Read Count method, which assumes a random probability distribution for mapping depth and flags departures of the data from that distribution as variations. The aim of this work is to improve upon these methods with a Machine Learning (ML)-based approach. Two major problems arise when developing an ML-based CNV detection tool: the lack of training data and the absence of a benchmark. We addressed the former by training on labelled exons from the X chromosome, which allowed us to use a supervised learning framework. For the latter, we used one of the standards from the 1000 Genomes (1000G) Project. We estimated the importance of our features and tested the minimal requirements needed to obtain well-formed classifiers. We found that while the models performed remarkably well on the X chromosome, even in some of its more difficult regions, they called a large number of false positives on autosomes. To reduce these, we added white noise and also trained an autoencoder for outlier detection. Finally, to account for the sequential nature of the observations, we employed a two-state Hidden Markov Model (HMM) on NA12878, using the ML model's predictions as observations of the hidden states. Then, using the Viterbi algorithm with the estimated HMM, we found the most probable sequence of states that explains the ML model's observations.
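As an illustration of the final step, here is a minimal sketch of Viterbi decoding on a two-state HMM whose observations are per-exon classifier calls (0 = normal, 1 = CNV). It is not the speaker's implementation: the start, transition and emission probabilities, as well as the names `START`, `TRANS`, `EMIT` and `viterbi`, are assumptions chosen only for the example, whereas the talk estimates the HMM from data.

```python
import math

# Hypothetical two-state HMM: state 0 = "normal", state 1 = "CNV".
# All probabilities below are illustrative assumptions, not estimated values.
STATES = (0, 1)
START = (0.99, 0.01)       # prior over the state of the first exon
TRANS = ((0.99, 0.01),     # TRANS[p][s] = P(next state s | current state p)
         (0.10, 0.90))
EMIT = ((0.90, 0.10),      # EMIT[s][o] = P(classifier call o | true state s)
        (0.20, 0.80))


def viterbi(observations):
    """Return the most probable hidden-state path for a list of 0/1 calls."""
    if not observations:
        return []
    # Log-probability of the best path ending in each state at the first exon.
    score = [math.log(START[s]) + math.log(EMIT[s][observations[0]])
             for s in STATES]
    backpointers = []
    for obs in observations[1:]:
        new_score, pointers = [], []
        for s in STATES:
            # Best predecessor state for landing in state s at this exon.
            prev = max(STATES, key=lambda p: score[p] + math.log(TRANS[p][s]))
            pointers.append(prev)
            new_score.append(score[prev] + math.log(TRANS[prev][s])
                             + math.log(EMIT[s][obs]))
        score = new_score
        backpointers.append(pointers)
    # Trace back from the best final state to recover the full path.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


# Classifier calls along consecutive exons: one isolated call, then a run of four.
calls = [0, 0, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0]
print(viterbi(calls))
```

With these assumed parameters, the isolated positive call is decoded as normal while the run of four consecutive positive calls is kept as a CNV segment, which is the kind of smoothing that helps suppress scattered false positives.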