Julie Bourbeillon | L'Institut Agro Rennes-Angers

Associate Professor of Computer Science

Department of Applied Mathematics and Computer Science

Co-head of the Department of Applied Mathematics and Computer Science

Research Unit: UMR IRHS (Institut de recherche en horticulture et semences)

Bio

Career

Since 2009: Associate Professor in Computer Science, Institut Agro Rennes-Angers – Angers Campus
2008-2009: Teaching and Research Associate, Bordeaux 2 University
2007-2008: Post-doctoral fellow, Bordeaux-Sud-Ouest Inria Research Centre
2004-2007: Research Fellow (doctoral student) - LIG - Grenoble Informatics Laboratory and TIMC-IMAG - Techniques of Medical Engineering and Complexity - Computer Science, Mathematics and Applications of Grenoble
May - August 2002: IT support engineer - LOGICA
2000 - 2002: Apprentice IT engineer - IBM Global Services, Finance Department

Training

2007: Doctoral thesis - Université Grenoble 1 (Joseph Fourier) - "Towards a task-oriented information synthesis - Application to the design and evaluation of Tissue MicroArrays"
2004: Master of Science and Engineering for Health and Medicines - Université Grenoble 1 (Joseph Fourier)
2002: Engineer from the Institut National Agronomique Paris-Grignon - IT project manager specialization

Teaching

I am responsible for all computer science teaching at the Angers site. This teaching is part of:

The engineering programs in horticulture and landscape design
The Master’s program in Plant Biology (BV), jointly accredited by Institut Agro and the Universities of Angers and Nantes
The VAAME doctoral school

In addition, I regularly mentor students in various contexts (specific UC projects, internships, apprenticeships, etc.), and I frequently participate in various juries, from undergraduate (L1) to master’s level (M2).

Undergraduate-Level Teaching

General principles

I was the main driver behind integrating digital skills into the new competency framework for engineers in horticulture and landscape, with competencies inspired by the PiX framework. This led me to identify two major subject areas for my teaching:

General Computer Literacy:
- Understanding the underlying concepts of computing (hardware, operating systems, networks, ethical and legal contexts),
- Managing one's digital work environment (installation, configuration, securing, and maintaining personal workstations),
- Learning or reinforcing the use of common tools (office software, email, publishing on the internet).
Scientific Computing for Data Science:
- Data collection (acquiring data through various measurement and observation methods, both quantitative and qualitative; conducting information search),
- Data management (organizing data in office environments, setting up databases, developing data management plans),
- Data analysis (representing processes as algorithms, writing programs to solve problems or automate tasks, using specialized software to analyze specific data, such as images).

Beyond these needs common to all scientists, scientific computing can be tailored to specific requirements in the horticulture and landscape fields, such as spatial data analysis, sensor data analysis (time series and images, for example for phenotyping and envirotyping), and bioinformatics (particularly -omics data).

These two themes are addressed with increasing complexity throughout the years of training. For instance, data management is introduced through spreadsheets with 1st-year students and relational databases with 3rd-year students.

1st-year: "General Computer Literacy"

In the first year of the post-secondary program, the teaching focuses on the use of Information and Communication Technologies, both within the school and from a professional perspective. The module includes:

Digital work environment: Installation and configuration of common software, setting up a "work" computing environment, and familiarization with common tools, particularly institutional ones (email, Moodle learning platform).
Work on PiX digital competencies.
Collaborative writing of a wiki on the theme of general computer literacy.

I also contribute to the "Communication" course by working on the creation of presentation materials and supporting the use of tools like word processors in connection with the writing of internship reports.

1st-year: "Introduction to scientific computing"

In the first year, contributions to other modules introduce initial concepts of scientific computing:

Introduction to programming through activities in electricity and electronics using Arduino, in collaboration with the physics instructor.
Initial concepts of data management in the context of Campus Biodiversity Analysis, in collaboration with ecology instructors.

2nd-year: "Introduction to programming in Python"

In the second year, the course on Introduction to Programming aims to provide students with tools for automating data processing tasks or numerically solving problems for which no analytical solution necessarily exists. The module is thus centered on the concept of problem-solving from a computational perspective. Various steps are addressed:

Introduction to the principles of problem analysis,
Fundamentals of algorithmics as a method for problem-solving,
Basics of a programming language to express algorithms in a form understandable by the computer,
Iterative error analysis: hypothesizing the cause of a malfunction, proposing solutions, testing...

Application problems are contributed by other disciplines such as ecology, economics, physics, chemistry, and more.

3rd-year Horticulture: "Digital and Statistical Tools for Phenotyping"

For students specializing in Horticulture, the "Mineral Nutrition and Adaptation of Plant Material for Agroecology" course includes an experiment aimed at evaluating the effects of mineral deficiencies on plant development. This experiment serves as the application field for the module "Digital and Statistical Tools for Phenotyping."

Sensors (cameras, environmental monitors) installed in the greenhouse module allow continuous crop monitoring, providing an introduction to high-throughput phenotyping through image analysis and key concepts of current agronomic scientific methodology. These include:

Designing experimental plans,
Drafting Data Management Plans,
Creating scientific posters.

3rd-year: "Databases"

In the third year, the optional "Databases" module covers information systems and the role of databases within these systems when computerized. The content includes:

The process of computerization for businesses, including possible infrastructures and architectures,
Essential principles of designing relational databases,
Fundamentals for transitioning from a theoretical model to implementation, with practical application in SQL using the phpMyAdmin interface on a MariaDB server.

Master's Level Teaching

General principles

Master's level courses aim to professionalize students or provide more specialized skills related to their field of expertise.

Horticulture 1st-year Master: "Digital Horticulture"

I lead an elective module on "Digital Horticulture" for 1st-year Master students specializing in Horticulture. The objective is to introduce the concept of digital agriculture as applied to the horticultural field, particularly in controlled-environment cultivation (greenhouses).

This introduction includes an overview of methods (automation, robotics, etc.) used to optimize production, as well as the technical applications deployed in the industry.

A small group project provides hands-on experience, allowing students to explore digital horticulture through one of its methods (data analysis, image analysis, programming of automation or robotics, modeling, etc.) and applications (irrigation and input management, Integrated Pest Management (IPM), crop monitoring, environmental control, production control, etc.).

Plant Biology 1st-year Master: "Programming in R"

This module aims to introduce basic programming concepts (variables, conditionals, loops, etc.) and some statistical notions for biologists. A sequence of exercises in R programming gradually introduces the necessary concepts using a real dataset as a common thread throughout the sessions, from data design to interpretation.

Key topics include:

File import,
Dataframe manipulation,
Data visualization,
Application of basic statistical methods,
Structuring R code,
Using specific functions (e.g., apply, merge, etc.).

Doctoral Level Teaching

General Principles

At the doctoral level, the courses offered focus on computer science, particularly programming, and are designed for biology students. Initially intended for doctoral students of the ED VAAME program, in practice they are open to non-computer science students from across the Pays de la Loire region.

Python for Biology Level 1 (Beginners)

In biology, the generation of data (genetic sequences, proteins, etc.) is becoming increasingly rapid and extensive. Analyzing this data requires the use of computational resources and environments that can be challenging for biologists to master. Python is one of the most widely used programming languages in the scientific world and is well suited to biology and bioinformatics.

This training gradually introduces programming, algorithms, and biological applications. It is based on numerous examples. The exercises, of increasing difficulty, are applied to the processing of biological data.

Research

Since joining Institut Agro in Angers, using knowledge engineering and ontologies as tools to complement other IT approaches in agricultural research and plant biology has been the guiding principle of my research. This perspective has led me to use a wide range of IT methods, from information retrieval to natural language processing, machine learning and data mining. I have also built a variety of collaborations across a wide spectrum of applications, from the molecular scale to the field or even the territory. In the course of my work, I have focused on applications that could be positioned throughout the research data lifecycle.

I carry out this work within the ImHorPhen team at IRHS. These activities have been or are being conducted as part of various research projects.

Research Topics

Data Integration and Visualisation

Recent years have been marked by two major changes in how biological research is conducted. On one hand, high-throughput techniques have expanded the scale at which experiments are performed, gradually moving from the molecular level to the phenotype and even population levels. On the other hand, the resulting datasets are increasingly shared via public repositories, which now host vast amounts of data (e.g., GenBank exceeded 250 million sequences in December 2024).

In this new context, biologists face the challenges of the "5Vs" of big data: Volume, Velocity, Variety, Value, and Veracity. They must work with large, disparate datasets that are not interconnected, even though some may be similar and analyzing them jointly could be beneficial.

This objective raises several questions that are addressed by my research project.

Data Integration

The integration of biological data, particularly biomedical data, is a major research topic in the emerging field of data science. Initially, data integration was addressed as the integration of data sources. Advanced approaches provide users with a unified view based on mapping mechanisms across multiple data sources: federated databases, data warehouses that consolidate information extracted from the various sources, or networks of semantically linked data built with ontologies.

This consolidation of sources typically requires mapping and transforming data between sources to address differences in vocabularies, units, and more. These approaches work well for data of similar types (e.g., a collection of transcriptomic datasets) or for "knowledge" derived from experimental data. For example, experimental data might include colorimetric measurements of 10 apples from two trees, leading to the "knowledge" that "apples from trees A and B have different colors."

However, integration at this "knowledge" level is insufficient; it must also extend to experimental data, considering different experimental modalities and data types (e.g., transcriptomic data and physiological traits). This requires focusing on the "Variety" aspect of the "5Vs" of big data challenges, which is a key focus of my work.

The challenge of using diverse datasets arises even on a small scale and can be categorized based on the type of data: similar or heterogeneous datasets. Similar datasets are often combined via meta-analysis, but integrating heterogeneous data is an active area of research. Furthermore, most available tools focus on biomedical data, with relatively few targeting plants.

This underscores the importance of addressing a wide range of data types in the plant domain, including genotyping data, transcriptomic data, biochemical composition, physical attributes, sensory data, and phenotypic data—an important objective of my research.

Dimensionality Reduction

Integrated datasets can quickly become impossible to process manually due to their sheer size or complexity. Simple tools like spreadsheets are often insufficient for handling such data, even though they are commonly used by biologists to get an initial overview. Therefore, it is essential to provide user-friendly ways for biologists to explore their data. Reducing the size of the matrices to be handled and providing summaries of the datasets are key solutions to this issue.

Reducing the number of variables is a common statistical practice. However, reducing the number of individuals (data points) is rare because traditional statistical methods are designed for situations where there are more individuals than variables. The reduction approaches I am developing involve grouping similar individuals based on knowledge stored in ontologies.

There are numerous similarity measures between vectors of values, as well as many semantic similarity measures in ontologies. Most existing approaches rely solely on the topology of the graph. Unlike these methods, the approach I am developing incorporates additional information in the ontology to describe the similarity between concepts and calculate similarities between individuals.

Additionally, I represent each group with an archetypal individual that "summarizes" its group.
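To make the notion of a purely topology-based semantic similarity concrete, here is a minimal Python sketch of the Wu-Palmer measure, a classic similarity that depends only on the depth of concepts in the ontology tree. The toy colour ontology and the function names are hypothetical and chosen for illustration; this shows the kind of baseline mentioned above, not the enriched approach itself.

```python
# Hypothetical toy ontology: each concept maps to its parent (None for the root).
PARENT = {
    "colour": None,
    "warm": "colour",
    "cold": "colour",
    "red": "warm",
    "orange": "warm",
    "blue": "cold",
}


def ancestors(concept):
    """Return the path from a concept up to the root, concept included."""
    path = []
    while concept is not None:
        path.append(concept)
        concept = PARENT[concept]
    return path


def depth(concept):
    """Depth of a concept, counting the root as depth 1."""
    return len(ancestors(concept))


def wu_palmer(a, b):
    """Wu-Palmer similarity: 2 * depth(lcs) / (depth(a) + depth(b))."""
    ancestors_of_b = set(ancestors(b))
    # Walking up from a, the first concept that is also an ancestor of b
    # is the least common subsumer (deepest shared concept) in a tree.
    lcs = next(c for c in ancestors(a) if c in ancestors_of_b)
    return 2 * depth(lcs) / (depth(a) + depth(b))


print(wu_palmer("red", "orange"))  # siblings under "warm"
print(wu_palmer("red", "blue"))    # only share the root
```

With this measure, two concepts are judged similar solely because they sit close together in the hierarchy; any extra knowledge attached to the concepts is ignored.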

Data Visualisation

Heterogeneous datasets resulting from the integration process are often too complex to be easily interpreted using tabular representations. Graphical representations provide a valuable solution to this issue. Over the past 25 years, this field has seen tremendous progress, evolving from static 2D representations to interactive 3D displays and even early attempts at virtual reality.

However, current popular visualization software has several limitations when it comes to visualizing complex, heterogeneous experimental datasets:

Information overload: Most tools are prone to this issue. I address it through my data summarization approach, which reduces the number of elements to display.
Limited data categories: Existing tools often cater to specific data types, which may not be suitable for the contexts we consider. The tools I develop aim to be more generic, capable of handling a broader range of data types.
Compatibility issues: While existing software supports standard exchange formats and connects with established databases to retrieve various datasets, introducing personal datasets from biologists requires them to conform to the tool's expected format. This often involves cumbersome preprocessing, especially for integrating datasets of different types.

The approaches I develop rely on simple tabular data formats that biologists are already familiar with, simplifying integration and usability.

Digital twins of horticultural production systems

Providing high-quality, nutritious food for a growing population amid climate change, resource scarcity, and sustainability constraints presents significant challenges. Farming under cover offers a potential solution: a controlled environment helps reduce water and biocide use, although it often depends heavily on the availability of fossil fuels at acceptable prices. To optimize greenhouse operations and align them with agroecological principles, the development of digital twins (virtual models updated with real-time data from sensors) is emerging as a key innovation. Unlike traditional automation tools, digital twins can simulate and predict plant responses under varying conditions, thereby allowing for smarter, more adaptive management.

Digital twins in agriculture are still scarce, and only a few of them focus on greenhouse systems. A structured approach is proposed to develop Predictive, Prescriptive, and Autonomous Digital Twins, which can respectively optimize yields, suggest actionable crop plans, and autonomously detect and correct anomalies. These systems must integrate both plant and infrastructure models, requiring new methods to track plant stresses and adapt existing physiological models beyond ideal growth conditions.

Ultimately, the goal is to build an intelligent, ergonomic, and economically viable system that simultaneously manages crop health and optimizes resource use by dynamically linking plant models with greenhouse control systems.

Projects

PAYTAL (2011-2015)

The "PAYTAL" project was a multidisciplinary initiative (economics, remote sensing, data mining, knowledge engineering) aimed at shedding light on the role of landscapes in urban sprawl mechanisms. I worked on extracting knowledge from texts within a corpus I compiled from the Landscape Atlases of French departments and regions.

This work led to the development of:

An ontology of landscape perception, and
An annotation of the different landscape units (geographically homogeneous areas from a landscape perspective) covered by the atlases, using terms from the ontology.

These data were used in the urban sprawl models developed by the project's economists and to analyze the subjectivity of the document authors.

Verger de Demain (2011-2015)

Led by the IFPC (French Institute of Cider Production), this project brought together fruit growers, agricultural chambers, agricultural and agronomic training organizations, and research institutions around experiments conducted on fruit growers' plots. These experiments focused on new orchard management practices for cider apple production that reduce inputs while remaining technically and economically viable.

I contributed by setting up a database to monitor the experiments on the plots.

AI-Fruit (2012-2016)

The "AI-Fruit" project aimed to deepen knowledge about the determinants of apple quality and develop non-destructive methods for assessing this quality. A computational component included the development of tools for analyzing and integrating experimental data collected during the project.

The discussions held during the project helped refine the concept of semantic queries for describing data processing needs, which I developed in my thesis, and adapt it to experiments conducted on apple trees. However, no concrete implementation could be achieved.

GRIOTE (2014-2018)

The "GRIOTE" project aimed to unite bioinformatics stakeholders in the Pays de la Loire region around collaborative projects. I contributed by supervising interns and a doctoral student working on incorporating antisense transcripts into the construction of co-expression networks.

CRB FraPeR and Apiaceae (2014-2016) - ANANdb (2015-2016)

A significant need for the IRHS teams was the development of data management tools to improve traceability, sharing, and reuse. This became a unifying project for the bioinformatics team. Funding from various projects ("ANANdb," "AI-Fruit," "CRB FraPeR and Apiaceae," "GRIOTE") enabled the recruitment of interns, contractual engineers, and apprentices.

I developed a terminology management module and constructed domain ontologies in collaboration with biologists. The goal is to establish a controlled vocabulary, inspired by relevant reference ontologies (Plant Ontology, Gene Ontology, Crop Ontology), to annotate metadata associated with samples.

EUCLEG (2017-2021)

The European project EUCLEG aimed to improve the diversification, productivity, yield stability, and protein quality of legumes. The SMS and ImHorPhen teams at IRHS collaborated on the characterization of seedlings. The SMS team conducted the experimental phase (germination and image acquisition of the seedlings), while I, on behalf of ImHorPhen, measured the seedlings through image analysis.

This work allowed me to gain expertise in image processing and led to the co-supervision of three interns with D. Rousseau, a professor of physics at the University of Angers. Currently, the tool I developed consists of Python scripts that use image analysis approaches based on mathematical morphology and machine learning with random forests.

DIVIS (2018-2021)

The DIVIS project (Biological Data Integration and Visualization) aimed to explore innovative and user-friendly approaches for integrating and visualizing large volumes of heterogeneous data. The developed tool processes large matrices containing biological datasets and:

Normalizes the datasets,
Groups similar samples using knowledge stored in ontologies designed for this purpose,
Represents each group with an archetypal "average" individual to create data "summaries,"
Builds a graphical representation of these summaries, enabling biologists to navigate and gain a better understanding of the underlying datasets.
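The first three steps above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the DIVIS implementation: a plain Euclidean distance stands in for the ontology-based grouping, the greedy threshold grouping is a deliberately simple placeholder, and all function names and sample values are hypothetical.

```python
from math import dist  # Euclidean distance, available from Python 3.8


def min_max_normalise(rows):
    """Rescale each column to [0, 1]; constant columns become 0."""
    columns = list(zip(*rows))
    lows = [min(c) for c in columns]
    spans = [max(c) - min(c) for c in columns]
    return [
        [(v - lo) / sp if sp else 0.0 for v, lo, sp in zip(row, lows, spans)]
        for row in rows
    ]


def group_samples(rows, threshold):
    """Greedy grouping: a sample joins the first group whose first member
    lies within the threshold, otherwise it starts a new group."""
    groups = []  # each group is a list of row indices
    for i, row in enumerate(rows):
        for group in groups:
            if dist(row, rows[group[0]]) <= threshold:
                group.append(i)
                break
        else:
            groups.append([i])
    return groups


def archetype(rows, group):
    """'Average' individual: column-wise mean of the group's members."""
    members = [rows[i] for i in group]
    return [sum(col) / len(col) for col in zip(*members)]


# Hypothetical measurements: four samples, two quantitative variables.
samples = [[1.0, 10.0], [1.2, 11.0], [5.0, 50.0], [5.1, 49.0]]
normalised = min_max_normalise(samples)
groups = group_samples(normalised, threshold=0.2)
summaries = [archetype(normalised, g) for g in groups]
print(groups)
print(summaries)
```

The summaries, one row per group, are what a visualisation step would then display instead of the full matrix.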

In 2018, an M2 Bioinformatics internship explored the first two steps. In 2019, the graphical visualization dimension was studied in connection with transcriptomic data through CorGI, a web application for bi-clustering developed by the bioinformatics team.

Subsequently, the entire process was applied to rose data. In 2021, additional work focused on developing tools for cluster analysis inspired by the catdes function of the FactoMineR package, resulting in the QuaDS software.

DIGITOM (2024-2027)

In the context of climate change and reduced reliance on phytosanitary products, it is essential to develop resilient agricultural production systems. To achieve this, we can leverage models at different scales (molecule, cell, plant, etc.). However, to better represent a system, it is necessary to interconnect all existing models across their various spatiotemporal scales, giving rise to the digital twin paradigm. Through a digital representation of phenomena, the goal is to create tools to aid in the design of systems or even their management by incorporating real-world data collected from the field as simulation parameters.

The collective imagination often associates digital twins with a perfect representation of reality, accounting for all parameters at every scale, from the molecule to the field. However, striving for such an in silico clone leads to an exponential increase in the number of variables and the complexity of algorithms. Simplifications are therefore necessary to make the system usable in terms of cost and computation time.

The question that arises is how to best interconnect a large number of models and rationalize the simplifications at the scale of the digital twin. To address this, the project's objective is to build a formal representation of knowledge at different scales, such as an ontology, which can be explored automatically.

In a second phase, this approach should help identify statistically significant concepts and determine their relevance by comparing simulated data with real-world data. These real-world data will pertain to tomato production in a semi-closed greenhouse, a production of economic interest for which numerous models are already available.

ERICA (2025-2026)

As a preliminary to the DIGITOM project, we sought to characterize the semi-closed greenhouse where the experiments would take place, particularly in terms of climate. The characterization of the greenhouse using sensors moved weekly revealed physical and human limitations.

The project aims to develop a robot to perform spatiotemporal characterizations by moving environmental sensors throughout the greenhouse's volume (3D acquisition) at an optimized frequency. Cameras mounted on the robot will enable testing for image acquisition for phenotyping across space and time.

Research Outputs

The publications resulting from my work are presented below, either as a list of selected notable publications or a complete list.

I am also involved in the development of several software tools, listed below, with their source code made available.

Publications

Notable Publications

[Per+23] Alix Pernet et al., "Construction of a semantic distance for inferring structure of the variability between 19th century Rosa cultivars", in: Acta Horticulturae 1384 (Dec. 2023), p. 477-484, issn: 2406-6168, doi: 10.17660/actahortic.2023.1384.60.
[Bar+22] Thibault Barrit et al., "A new in vitro monitoring system reveals a specific influence of Arabidopsis nitrogen nutrition on its susceptibility to Alternaria brassicicola at the seedling stage", in: Plant Methods 18.1 (Dec. 2022), issn: 1746-4811, doi: 10.1186/s13007-022-00962-3.
[Eid+22] Rayan Eid et al., "DIVIS: a semantic DIstance to improve the VISualisation of heterogeneous phenotypic datasets", in: BioData Mining 15.1 (Apr. 2022), issn: 1756-0381, doi: 10.1186/s13040-022-00293-y.
[Bou+21] Julie Bourbeillon et al., "Characterising the Landscape in the Analysis of Urbanisation Factors: Methodology and Illustration for the Urban Area of Angers", in: Economie et Statistique / Economics and Statistics 528–529 (Dec. 2021), p. 109-128, issn: 0336-1454, doi: 10.24187/ecostat.2021.528d.2062.
[Rou+15] Céline Rousseau et al., "Phenoplant: a web resource for the exploration of large chlorophyll fluorescence image datasets", in: Plant Methods 11.1 (Apr. 2015), issn: 1746-4811, doi: 10.1186/s13007-015-0068-4.
[San+14] Pierre Santagostini et al., "Assessment of the visual quality of ornamental plants: Comparison of three methodologies in the case of the rosebush", in: Scientia Horticulturae 168 (Mar. 2014), p. 17-26, issn: 0304-4238, doi: 10.1016/j.scienta.2014.01.011.
[Bou+10] Julie Bourbeillon et al., "Minimum information about a protein affinity reagent (MIAPAR)", in: Nature Biotechnology 28.7 (July 2010), p. 650-653, issn: 1546-1696, doi: 10.1038/nbt0710-650.
[Glo+10] David E. Gloriam et al., "A Community Standard Format for the Representation of Protein Affinity Reagents", in: Molecular & Cellular Proteomics 9.1 (Jan. 2010), p. 1-10, issn: 1535-9476, doi: 10.1074/mcp.m900185-mcp200.
[BGG09] Julie Bourbeillon, Catherine Garbay and Françoise Giroud, "Mass data exploration in oncology: An information synthesis approach", in: Journal of Biomedical Informatics 42.4 (Aug. 2009), p. 612-623, issn: 1532-0464, doi: 10.1016/j.jbi.2009.02.007.

Complete list

My publications on HAL

Software tools

ELVIS

A significant need for the IRHS teams was the implementation of data management tools to improve traceability, sharing, and reuse. The bioinformatics team within the unit is developing tools to address this need, and I contribute to these developments. For instance, ELVIS (Experiment and Laboratory on Vegetal Information System) integrates the common databases and server layer for various data management/processing tools developed by the team. ELVIS consists of a PostgreSQL database and a web service layer for data access, developed in Python. ELVIS is organized into a set of thematic modules, and several business applications developed by the team are built upon ELVIS.

The ELVIS project page on ForgeMIA

PREMS

PREMS is the business application focused on laboratory management that is built on ELVIS. PREMS consists of a set of components, including the management of projects, samples, and experimental results.

The PREMS project page on ForgeMIA

ELTerm

ELTerm is the terminology management application based on ELVIS.

In ELVIS, the content of many fields is controlled by lists of possible values, which are generally derived from terminologies:

recognized domain terminologies, possibly derived from publicly available taxonomies or ontologies (Plant Ontology, Crop Ontology, etc.)
specific terminologies that we can consider disseminating

We therefore store a set of terminologies, each covering a theme: morphology of organisms, development stages, growth conditions, etc. The general principle of what we want to store is similar to what is found in standard XML representations of terminologies such as TermBase Exchange, but in the form of a database. ELTerm provides a set of graphical interfaces allowing users to manipulate the terminologies stored in ELVIS.

The ELTerm project page on ForgeMIA

DIVIS

Thanks to the wider spread of high-throughput experimental techniques, biologists are accumulating large numbers of datasets, which often mix quantitative and qualitative variables and are not always complete, in particular when they concern phenotypic traits. To gain a first insight into these datasets and reduce the size of the data matrices, scientists often rely on multivariate analysis techniques. However, such approaches are not always easy to apply, in particular with mixed datasets. Moreover, displaying large numbers of individuals leads to cluttered visualisations that are difficult to interpret.

We developed a new methodology to overcome these limits. Its main feature is a new semantic distance tailored for both quantitative and qualitative variables, which allows for a realistic representation of the relationships between individuals (phenotypic descriptions in our case). This semantic distance is based on ontologies which are engineered to represent real-life knowledge regarding the underlying variables. For easier handling by biologists, we incorporated its use into a complete tool, from raw data file to visualisation. Following the distance calculation, the next steps performed by the tool consist of (i) grouping similar individuals, (ii) representing each group by emblematic individuals we call archetypes and (iii) building sparse visualisations based on these archetypes. Our approach is implemented as a Python pipeline and applied to a rosebush dataset including passport and phenotypic data.
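To illustrate what a distance over mixed, incomplete data has to handle, here is a minimal Python sketch of a Gower-style distance, a standard generic measure used here as a stand-in. It is not the ontology-based semantic distance of DIVIS; the function name, the variables and the rose descriptions are hypothetical.

```python
def gower_distance(a, b, ranges):
    """Gower-style distance between two individuals with mixed variables.

    a, b:   lists of values (float for quantitative variables, str for
            qualitative ones, None when the value is missing).
    ranges: per-variable range for quantitative variables, None for
            qualitative variables.
    """
    total, used = 0.0, 0
    for x, y, rng in zip(a, b, ranges):
        if x is None or y is None:
            continue  # skip variables missing in either individual
        if rng is None:
            # Qualitative variable: simple 0/1 mismatch.
            total += 0.0 if x == y else 1.0
        else:
            # Quantitative variable: absolute difference scaled by the range.
            total += abs(x - y) / rng if rng else 0.0
        used += 1
    if used == 0:
        raise ValueError("no variable observed in both individuals")
    return total / used


# Hypothetical descriptions: (petal count, flower colour, plant height in cm).
rose_a = [30.0, "pink", 120.0]
rose_b = [40.0, "pink", None]   # height not recorded
rose_c = [80.0, "white", 60.0]
ranges = [50.0, None, 100.0]

print(gower_distance(rose_a, rose_b, ranges))
print(gower_distance(rose_a, rose_c, ranges))
```

In this sketch every qualitative mismatch counts the same, so "pink" is as far from "white" as from any other colour; a semantic distance is precisely a way to grade such mismatches using knowledge about the variables.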

The DIVIS project page on ForgeMIA

QuaDS

As part of the DIVIS project, we needed to characterize groups of individuals according to the values of the variables in the dataset. Such a method has been developed by F. Husson et al. as the catdes() function of the FactoMineR R package. However, we were not completely satisfied with the output of this function, regarding both the result data table and the visualisation. Therefore, we developed our own Python implementation, with extras...
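For readers unfamiliar with this kind of group characterization, the sketch below computes a v-test style statistic for one quantitative variable: the group mean is compared with the overall mean, scaled by the standard error expected if the group had been drawn at random from the dataset. It is a hedged illustration only. The function name and data are hypothetical, the variance convention (n - 1 denominator) is an assumption, and nothing here is taken from the QuaDS or FactoMineR source code.

```python
from math import sqrt


def v_test(values, in_group):
    """v-test style statistic of one quantitative variable for one group.

    values:   the variable for all N individuals.
    in_group: booleans of the same length, True for group members.
    """
    n = len(values)
    group = [v for v, g in zip(values, in_group) if g]
    n_k = len(group)
    if not 0 < n_k < n:
        raise ValueError("the group must be a non-empty strict subset")
    overall_mean = sum(values) / n
    # Variance of the whole dataset (n - 1 denominator: an assumption).
    variance = sum((v - overall_mean) ** 2 for v in values) / (n - 1)
    # Standard error of the mean of n_k individuals drawn without replacement.
    std_error = sqrt(variance / n_k * (n - n_k) / (n - 1))
    if std_error == 0:
        return 0.0  # constant variable: no group can stand out
    return (sum(group) / n_k - overall_mean) / std_error


# Hypothetical plant heights and a cluster containing the three tallest plants.
heights = [10.0, 12.0, 11.0, 30.0, 32.0, 31.0]
cluster = [False, False, False, True, True, True]
print(v_test(heights, cluster))
```

A large positive value suggests the variable is higher in the group than in the dataset as a whole, a large negative value the opposite; an absolute value above roughly 1.96 is the usual 5% threshold when the statistic is compared with a standard normal distribution.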

The QuaDS project page on ForgeMIA

QuaDS on HAL