Knowledge discovery & machine learning

High-throughput biology platforms produce huge amounts of data, which require adapted methods and algorithms to be investigated and interpreted. Most omics fields currently address this big data issue in the context of biological complexity: non-linearity, structured data (temporal, network, etc.), an extremely high ratio of dimensionality to population size, etc. However, beyond these major machine learning trends raised by general omics questions, proteomics carries its own particular challenges, such as the number of imprecise or missing measurements, the level and particular nature of the noise, the strongly violated assumption of independent and identically distributed variables, etc.

 

To address these methodological challenges, EDyP researchers with expertise in machine learning, statistics, data mining and signal processing are investigating the following methods:

  • Kernel machines and Riemannian geometry
  • Imprecise probabilities, data fusion under strong uncertainty
  • Large matrix factorization, regression with structured penalties, and blind source separation
  • Multiple hypothesis testing on non-independent variables (along with the bio-analysis experts)
  • Bayesian methods and graphical models
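
As an illustration of multiple hypothesis testing on non-independent variables, the sketch below implements the Benjamini-Yekutieli step-up procedure, which controls the false discovery rate under arbitrary dependence between tests. It is only one possible choice, picked here for illustration; it is not necessarily the procedure used by the group, and the function name and the default level alpha are assumptions.

```python
def benjamini_yekutieli(p_values, alpha=0.05):
    """Return a list of booleans: True where the hypothesis is rejected.

    Illustrative sketch of the Benjamini-Yekutieli procedure (hypothetical
    helper, not taken from any EDyP package).
    """
    m = len(p_values)
    if m == 0:
        return []
    # Harmonic correction factor c(m) = 1 + 1/2 + ... + 1/m.
    c_m = sum(1.0 / i for i in range(1, m + 1))
    # Indices of the p-values sorted in ascending order.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Largest rank k such that p_(k) <= k * alpha / (m * c(m)).
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * alpha / (m * c_m):
            k_max = rank
    # Reject the k_max hypotheses with the smallest p-values.
    rejected = [False] * m
    for idx in order[:k_max]:
        rejected[idx] = True
    return rejected
```

Compared with the Benjamini-Hochberg procedure, the only difference is the factor c(m), which makes the thresholds more conservative in exchange for validity under any dependence structure.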

Moreover, we collaborate with research groups from the neighboring Laboratoire d’Informatique de Grenoble (http://www.liglab.fr/) on the following questions:

  • Co-training and co-clustering on large multipartite graphs (LIG / AMA group)
  • Visual analytics and interactive data mining (LIG / IIHM group)
  • Symbolic artificial intelligence and optimization (LIG / STEAMER group)

Finally, the ultimate goal of our research is to address the following three challenges of proteomics:

  • Quantitative label-free proteomics: Extracting the relative quantitation signal and defining the corresponding statistical analysis pipeline for biomarker discovery
  • Protein inference problem: Finding the subset of proteins that most likely explains the observed peptide spectra
  • Peptide identification: Combining de novo sequencing and database search for optimal peptide identification
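
To make the protein inference problem concrete, here is a minimal sketch that treats it as a set cover: each candidate protein "explains" the peptides it can produce, and a greedy heuristic selects a small set of proteins covering the identified peptides. This parsimony heuristic is a simplification chosen for illustration. It ignores identification scores and probabilities, so it is not the "most likely" solution in a statistical sense, and the function name and input format are assumptions.

```python
def greedy_protein_inference(protein_to_peptides, observed_peptides):
    """Greedy set cover: return (selected proteins, unexplained peptides).

    protein_to_peptides: dict mapping a protein name to the peptides it
    can produce (hypothetical input format).
    observed_peptides: iterable of peptides identified from the spectra.
    """
    remaining = set(observed_peptides)
    selected = []
    while remaining:
        # Pick the protein explaining the most still-unexplained peptides;
        # iterating over sorted names breaks ties alphabetically.
        best = max(
            sorted(protein_to_peptides),
            key=lambda prot: len(remaining & set(protein_to_peptides[prot])),
        )
        gain = remaining & set(protein_to_peptides[best])
        if not gain:
            break  # the leftover peptides match no candidate protein
        selected.append(best)
        remaining -= gain
    return selected, remaining


# Toy example with made-up protein and peptide names.
candidates = {"P1": {"a", "b", "c"}, "P2": {"c", "d"}, "P3": {"d"}}
proteins, unexplained = greedy_protein_inference(candidates, {"a", "b", "c", "d", "x"})
```

In the toy example, peptide "d" is shared by P2 and P3, which is precisely the ambiguity that makes protein inference difficult: the greedy rule keeps only one of them, whereas a probabilistic model would weigh the evidence for each.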

Involvement in EDyP projects

This research direction was promoted by the bioinformatics work package of ProFI: the CNRS supported ProFI with an additional permanent researcher position dedicated to knowledge discovery.
Most of our R package development is conducted through a Prime-XS (http://www.primexs.eu/) collaboration with the Cambridge Centre for Proteomics (http://www.bio.cam.ac.uk/proteomics/).
Beyond these two infrastructure projects, experts in knowledge discovery and machine learning are involved in numerous other biological or methodological projects (along with the bio-analysis experts).

Prospectom: Interactive visual analytics via statistical machine learning and KB/DB integration on mass spectrometry and omics data.

Coordinators: Gilles Bisson (LIG/AMA) and Thomas Burger (EDyP)
EDyP correspondent: Thomas Burger
Funded by: Mission pour l’Interdisciplinarité du CNRS

Prospectom is an initiative, renewed annually, that aims to bring together French scientists from proteomics and systems biology on the one hand, and computer science (machine learning, artificial intelligence and visual analytics) on the other. It promotes pairwise collaborations by funding interdisciplinary internships, as well as an annual workshop (during the fall semester) in Grenoble, co-organized by EDyP and LIG (http://prospectom.liglab.fr). In practice, Prospectom promotes interdisciplinary research focused on one of the following three axes:

  • Knowledge discovery from spectra to protein
  • Multi-omics integration and regulation network inference
  • Visual analytics for multi-scale, dynamic and big data