Short Course Description
Data-driven science is a ubiquitous paradigm in modern biology: large amounts of data are collected on a biological system of interest, from which one must extract qualitative scientific insights or build accurate quantitative models.
This course is an overview of advanced machine learning algorithms commonly used in modern computational biology research. The goals are threefold: i) learn the mathematical principles underlying these algorithms; ii) learn how and when to use them for scientific purposes; iii) understand their limitations. The algorithms will be illustrated on various biological systems (brain recordings, single-cell data, protein sequences, molecules, etc.). The tentative syllabus below is subject to change.
Topic 1: Linear models and extensions (GLM, GAM, LASSO) for Tabular Data.
Topic 2: Decision trees and extensions (Decision rules; boosted trees) for Tabular Data.
Topic 3: Interpretable Machine Learning with Model-agnostic Explanations.
a) Partial Dependence and Accumulated Local Effects plots.
b) Permutation Feature Importance.
c) LIME and SHapley Additive exPlanations (SHAP).
Topic 4: Deep Learning Architectures for Unstructured Data.
a) Convolutional Neural Networks.
b) Graph Neural Networks.
c) Transformers.
d) Saliency maps.
Topic 5: Data visualization with low-dimensional embeddings (PCA, t-SNE, UMAP).
Topic 6: Meaningful feature extraction with Matrix Factorization Algorithms (K-means; Non-negative Matrix Factorization; Sparse PCA; Sparse Dictionary Learning).
Topic 7: Deep Generative Models.
a) Autoregressive Generative Models.
b) Variational Inference & Variational Autoencoders.
c) Denoising Diffusion Generative Models.
Topic 8: Developing and Troubleshooting Deep Learning models.
Language: The course will be given in English.
Prerequisites: Introduction to Machine Learning (0368-3235) or Introduction to Statistical Learning (0365.3130) or another equivalent course. No prior knowledge of biology is required.
Evaluation: Evaluation will be based on home assignments (theoretical exercises and case studies on biological data, 40%), an oral presentation of a research article (50%), and a written report on another article (10%).
Optional Reading Material:
a) An Introduction to Statistical Learning by James, Witten, Hastie, Tibshirani & Taylor. https://www.statlearning.com
b) Interpretable Machine Learning by Molnar. https://christophm.github.io/interpretable-ml-book/
c) Understanding Deep Learning by Prince. https://udlbook.github.io/udlbook/
Full Syllabus