From Biological Data to Bioinformatics Mastery
Master Bioinformatics, Transcriptomics, Protein Design & AI for Modern Biological Research
About the Program
The Omics Nexus Summer School 2026 is an intensive training program designed for students, researchers, and professionals who want to build expertise in computational biology, transcriptomics, machine learning, and AI-driven biological discovery.
Participants will work with real-world biological datasets and industry-standard tools while developing practical skills that are directly applicable to research, graduate studies, and careers in bioinformatics.
Foundations for Bioinformatics
From Raw Data to Biological Insight
Build a strong computational foundation by learning how to navigate Linux environments, write scripts in Python and R, and handle real biological datasets. Develop the core skills researchers rely on to process, analyze, and visualize biological data efficiently and reproducibly.
Topics Covered
- Introduction to the Linux operating system and command-line navigation
- Writing and executing Bash scripts for automation
- Python programming fundamentals for biological data
- Reading, parsing, and managing biological file formats (FASTA, FASTQ, BED, GFF/GTF, VCF, SAM/BAM)
- Combining Bash and Python for data processing pipelines
- R programming fundamentals and data structures
- Data wrangling and exploratory data analysis using the tidyverse
- Data visualization and interpretation using ggplot2
- Best practices for reproducible computational workflows
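As a taste of the file-format topic above, here is a minimal sketch of FASTA parsing in plain Python. It is an illustration only, not the course's own material: the function names `read_fasta` and `gc_content` and the toy records are hypothetical, and real projects would often reach for an established library instead.

```python
from io import StringIO


def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA-formatted text stream."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)  # sequence lines may wrap across lines
    if header is not None:
        yield header, "".join(chunks)


def gc_content(seq):
    """Return the fraction of G and C bases in a nucleotide sequence."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


# Hypothetical toy input standing in for a real FASTA file.
EXAMPLE = StringIO(">seq1 toy record\nATGC\nGGCC\n>seq2\nATAT\n")
records = list(read_fasta(EXAMPLE))
for name, seq in records:
    print(name, len(seq), round(gc_content(seq), 2))
```

Because `read_fasta` accepts any iterable of lines, the same function works on an open file handle, which keeps it easy to call from a Bash-driven pipeline.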
Hands-On Projects
- Independent Linux and Python-based processing of real biological data files
- End-to-end data handling and visualization workflow spanning Bash, Python, and R
- Research-style presentation of computational workflows and findings
Outcomes
- Work confidently in Linux-based environments
- Write basic scripts in Python and R
- Handle and parse common biological file formats
- Create publication-quality visualizations to identify and interpret meaningful patterns in biological data
Advanced Transcriptomics & AI
Bulk & Single-Cell RNA-Seq • Machine Learning
Master the analysis of modern transcriptomic datasets by exploring bulk and single-cell RNA sequencing alongside machine learning approaches. Learn how researchers uncover cellular heterogeneity, identify biomarkers, and build predictive models from gene expression data.
Topics Covered
- Introduction to bulk RNA-seq and single-cell RNA-seq
- Understanding Cell Ranger outputs and transcriptomic data structures
- End-to-end scRNA-seq analysis using Scanpy
- Quality control, normalization, and feature selection
- Cell clustering, dimensionality reduction, and cell-type annotation
- Marker gene discovery and biological interpretation
- Fundamentals of machine learning for biological data
- Gene expression preprocessing and feature selection
- Building disease-classification models from transcriptomic data
- Model evaluation, biomarker identification, and result interpretation
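To make the normalization topic above concrete, the sketch below shows library-size normalization followed by a log transform, written in plain Python so the arithmetic is visible. It is a simplified illustration, not Scanpy's implementation: the function name `normalize_log1p`, the `target_sum` default, and the toy count matrix are assumptions. In Scanpy the comparable steps are typically `sc.pp.normalize_total` followed by `sc.pp.log1p`.

```python
import math


def normalize_log1p(counts, target_sum=10_000):
    """Scale each cell's counts to sum to target_sum, then apply log(1 + x).

    counts is a list of cells, each a list of raw gene counts.
    """
    normalized = []
    for cell in counts:
        total = sum(cell)
        if total == 0:
            # An empty cell stays all zeros; QC would normally remove it earlier.
            normalized.append([0.0] * len(cell))
            continue
        scale = target_sum / total
        normalized.append([math.log1p(c * scale) for c in cell])
    return normalized


# Hypothetical toy matrix: 3 cells x 3 genes.
toy_counts = [
    [10, 0, 90],
    [1, 1, 2],
    [0, 0, 0],
]
norm = normalize_log1p(toy_counts)
for row in norm:
    print([round(v, 3) for v in row])
```

Scaling by library size makes cells with different sequencing depth comparable, and the log transform compresses the large dynamic range of expression values before feature selection and clustering.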
Hands-On Projects
- Independent analysis of a real-world scRNA-seq dataset
- Development of a disease-classification model using bulk RNA-seq data
- Research-style presentation of findings and biological insights
Outcomes
- Analyze single-cell transcriptomic datasets using industry-standard tools
- Identify biologically relevant cellular populations
- Build machine learning models for disease prediction
- Translate complex gene expression data into actionable biological insights
Teaching Machines to Speak Protein
Protein Language Models • AI-Driven Design
A deep dive into protein language models (PLMs), AI-driven protein design, and hands-on PLM training, from foundational theory to a fully trained model, all built around a real cancer drug target: the EGFR kinase.
Topics Covered
- Amino acids as alphabet tokens and the sequence-to-structure-to-function mapping
- Protein language models (ESM-2) and learning via masked residue prediction
- Embeddings that encode secondary structure, evolutionary signals, and functional sites
- Forward problem (structure prediction) vs. inverse problem (sequence design)
- Design strategies: fixed-backbone design (ProteinMPNN), de novo generation (RFdiffusion), and hallucination
- ML architectures: transformer self-attention, GNN message passing, and diffusion denoising
- Data retrieval from open-source protein databases and sequence preprocessing
- Model selection, configuration, training, and fine-tuning in PyTorch
- Evaluation metrics (pLDDT, iPAE, perplexity, log-likelihood, RMSD)
- Self-consistency validation (designed vs. natural sequences)
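Two of the evaluation metrics listed above, log-likelihood and perplexity, are closely related, and a short sketch may help. It assumes the model has already assigned a probability to the correct residue at each position; the function names and toy probabilities are hypothetical. For a masked model such as ESM-2, these per-residue probabilities would usually come from masking each position in turn, in which case the result is often called pseudo-perplexity.

```python
import math


def sequence_log_likelihood(token_probs):
    """Sum of natural-log probabilities assigned to the true residues."""
    return sum(math.log(p) for p in token_probs)


def perplexity(token_probs):
    """exp of the negative mean log-likelihood; lower means more confident."""
    if not token_probs:
        raise ValueError("need at least one residue probability")
    mean_ll = sequence_log_likelihood(token_probs) / len(token_probs)
    return math.exp(-mean_ll)


# Hypothetical per-residue probabilities for a 5-residue toy sequence.
uniform_guess = [1 / 20] * 5   # model knows nothing: uniform over 20 amino acids
confident = [0.9, 0.8, 0.95, 0.7, 0.85]

print(round(perplexity(uniform_guess), 2))
print(round(perplexity(confident), 2))
```

A perplexity near 20 means the model is doing no better than guessing uniformly among the 20 standard amino acids, while values approaching 1 mean it predicts each residue almost with certainty.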
Hands-On Projects
- Fetch the EGFR kinase sequence, run ColabFold for 3D structure prediction, and visualize pLDDT
- Run ProteinMPNN on the backbone, score the designed sequences with ESM-2, and validate with RMSD
- Build a sequence dataset from open databases, train and fine-tune a transformer-based PLM, and evaluate it with perplexity and log-likelihood
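The RMSD validation step mentioned in the projects above boils down to a simple formula, sketched here in plain Python. This version assumes the two structures are already superimposed and have matching atoms in the same order (for example, C-alpha atoms); production tools usually perform an optimal alignment first. The function name `rmsd` and the toy coordinates are hypothetical.

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two equal-length lists of (x, y, z).

    Assumes the structures are already superimposed; no alignment is done here.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and equal in length")
    squared = sum(
        (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
        for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b)
    )
    return math.sqrt(squared / len(coords_a))


# Hypothetical toy coordinates in angstroms: the second structure is the first
# shifted by 2 along z, so every atom deviates by exactly 2.
reference = [(0, 0, 0), (4, 0, 0)]
refolded = [(0, 0, 2), (4, 0, 2)]
print(rmsd(reference, refolded))
```

In a self-consistency check, a low RMSD between the original backbone and the structure predicted for a designed sequence suggests the design folds back into the intended shape.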
Outcomes
Students leave with a complete end-to-end skillset:
- A clear mental model of protein language and AI design
- A notebook producing their first AI-designed protein candidate against EGFR kinase
- A trained PLM generating novel sequences validated by fold self-consistency