Presenter: Ren Yi
Title: Learning from Data-Rich Problems: case studies on genetic variant calling and transcription factor binding site prediction.
Highly skewed data are common in genomics, and the skew creates technical challenges when building machine learning models for data-poor problems. For instance, next-generation sequencing can sample the whole genome (whole genome sequencing, WGS) or only the whole exome, the 1-2% of the genome that codes for proteins (whole exome sequencing, WES). Machine learning approaches to variant calling achieve high accuracy on WGS data, but because WES provides fewer training examples, models trained on WES data alone achieve lower accuracy. For transcription factor binding site (TFBS) prediction, compendium databases such as ENCODE have accumulated a large collection of TF ChIP-seq data, but only for a small number of well-studied cell types and organisms. TFBS profiles remain largely unknown in rare cell types, and it is infeasible to perform ChIP-seq experiments on all TFs across all cell types and organisms. In this talk, I will present two case studies, genetic variant calling and TFBS prediction, to explore how data supplementation methods can be used effectively to improve learning on data-poor problems.