Plant specialized metabolites mediate interactions between plants and the environment and have significant agronomic and pharmaceutical value. Most genes involved in specialized metabolism (SM) remain unknown because of the large number of metabolites and the difficulty of differentiating SM genes from general metabolism (GM) genes. Plant models such as Arabidopsis thaliana have extensive, experimentally derived annotations, whereas many non-model species do not. Here we employed a machine learning strategy, transfer learning, in which knowledge from A. thaliana is transferred to predict gene functions in cultivated tomato, a species with fewer experimentally annotated genes. The first tomato SM/GM prediction model, built using only tomato data, performed well (F-measure = 0.74, compared with 0.5 for random and 1.0 for perfect predictions), but manual curation of 88 SM/GM genes showed that many mispredicted entries were likely misannotated. When SM/GM prediction models built with A. thaliana data were used to filter out genes for which the A. thaliana-based predictions disagreed with the tomato annotations, the new tomato model trained on the filtered data improved significantly (F-measure = 0.92). Our study demonstrates that SM/GM genes can be better predicted by leveraging cross-species information. Additionally, our findings provide an example of transfer learning in genomics, in which knowledge is transferred from an information-rich species to an information-poor one.
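The filtering step described above can be illustrated with a minimal sketch. It is not the authors' pipeline: the function names, the gene identifiers, and the toy labels are all hypothetical. The F-measure is computed here for a single positive class, which is an assumption, since the abstract does not state how the score was averaged.

```python
from typing import Dict, Sequence


def f_measure(true_labels: Sequence[str],
              predicted_labels: Sequence[str],
              positive: str = "SM") -> float:
    """Harmonic mean of precision and recall for the `positive` class."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        # With no true positives the score is zero (or undefined); return 0.0.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def filter_by_source_model(target_annotations: Dict[str, str],
                           source_predictions: Dict[str, str]) -> Dict[str, str]:
    """Keep target-species genes whose annotation agrees with the label
    predicted by a model trained on the source species.

    Genes with no source-model prediction are dropped (an assumption of
    this sketch, not something stated in the abstract).
    """
    return {
        gene: label
        for gene, label in target_annotations.items()
        if source_predictions.get(gene) == label
    }


# Toy, made-up example: hypothetical gene IDs and labels, not real data.
tomato_annotations = {
    "gene_1": "SM", "gene_2": "GM", "gene_3": "SM", "gene_4": "GM",
}
arabidopsis_model_calls = {
    "gene_1": "SM", "gene_2": "GM", "gene_3": "GM", "gene_4": "GM",
}

# Genes where the two sources disagree (here gene_3) are removed before
# a new target-species model would be trained on the remaining genes.
filtered = filter_by_source_model(tomato_annotations, arabidopsis_model_calls)
print(sorted(filtered))
```

In this sketch the retained genes would then serve as the training set for a new tomato model, whose predictions could be scored with `f_measure` against held-out annotations.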
Recent years have witnessed stagnating yield gains in major staple crops, leaving plant biologists and breeders with an urgent challenge: to dramatically increase crop yield to meet growing food demand. Systems models have started to show their capacity to guide crop improvement for greater biomass and grain yield. Here we argue that systems models, phenomics and genomics together form three pillars of future breeding for high-yielding photosynthetically efficient crops (HYPEC). Briefly, systems models can be used to guide identification of breeding targets for a particular cultivar and to define optimal physiological and architectural parameters for a particular crop to achieve high yield under defined environments. Phenomics can support high-throughput collection of architectural, physiological, biochemical and molecular parameters, which can be used for both model validation and model parameterization. Genomic techniques can accelerate crop breeding by enabling more efficient mapping between genotypic and phenotypic variation, and can guide genome engineering or editing for model-designed traits. In this paper, we elaborate on these roles and how they can work synergistically to support future HYPEC breeding.