Quantification remains a challenge in non-targeted analysis (NTA) with liquid chromatography-high resolution mass spectrometry (LC-HRMS) because analytical standards are often unavailable. Quantification via structure-based predicted ionization efficiency (IE) has been shown to provide the highest accuracy in estimating concentration. However, confident analyte identification is difficult, as multiple candidate structures may be plausible. This uncertainty limits the reliability of structure-based IE prediction models, since quantification can be severely compromised when chemicals are wrongly (tentatively) identified or when candidate structures are lacking. Here we investigate the possibility of using cumulative neutral losses from fragmentation spectra (i.e., MS2) to predict the logIE. Two models were developed. The first model was based on molecular fingerprints and was applied to structurally identified analytes. PubChem fingerprints performed best, with a root-mean-square error (RMSE) of 0.72 logIE for the test set. The second model was based on the MS2 spectrum, expressed as cumulative neutral losses. This approach is applicable to analytes with unknown structures and showed promising results, with an RMSE of 0.79 logIE for the test set and 0.62 logIE for chromatographic features extracted from LC-HRMS data of tea extracts spiked with pesticides. The prediction models were compiled in a Julia package, which is publicly available on GitHub and may be used as part of a quantification workflow to estimate concentrations of identified and unidentified compounds in NTA. Scientific contribution: This study expands the possibilities of standard-free quantification for HRMS. It aims to provide reliable IE prediction for known substances through robust fingerprint calculation and, more importantly, IE prediction for unknown substances using their MS2 fragmentation pattern. These workflows employ minimal method-specific variables, highlighting the tool's generalizability.
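To make the MS2-based input concrete, the following minimal Python sketch turns a fragmentation spectrum into a neutral-loss feature vector. It assumes that a cumulative neutral loss is the mass difference between the precursor ion and each fragment ion, and that the losses are encoded as a binary vector over fixed-width bins. The function names, the bin width, and the mass range are illustrative assumptions and are not taken from the published Julia package.

```python
def cumulative_neutral_losses(precursor_mz, fragment_mzs, min_loss=0.0):
    """Return the sorted mass differences between the precursor and each fragment.

    Assumption: a "cumulative neutral loss" is precursor m/z minus fragment m/z.
    """
    losses = [precursor_mz - mz for mz in fragment_mzs]
    return sorted(loss for loss in losses if loss > min_loss)


def binned_loss_vector(losses, bin_width=1.0, max_loss=100.0):
    """Encode losses as a binary vector; bin i covers [i*bin_width, (i+1)*bin_width)."""
    n_bins = int(max_loss / bin_width)
    vector = [0] * n_bins
    for loss in losses:
        idx = int(loss / bin_width)
        if 0 <= idx < n_bins:
            vector[idx] = 1
    return vector


# Hypothetical spectrum: one precursor and three fragment ions.
precursor = 200.0
fragments = [182.0, 154.5, 100.25]

losses = cumulative_neutral_losses(precursor, fragments)
vector = binned_loss_vector(losses, bin_width=1.0, max_loss=120.0)
print(losses)
print([i for i, bit in enumerate(vector) if bit])  # indices of occupied bins
```

A vector of this kind could then be passed to any regression model trained against measured logIE values. The choice of binning is a design decision that trades mass resolution against feature sparsity.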
Background: Chemical reactions form densely connected networks, and exploring these networks is essential for designing efficient and sustainable synthetic routes. As reaction data from literature, patents, and high-throughput experimentation continue to grow, so does the need for tools that can navigate and mine these large-scale datasets. Graph-based representations capture the topology of reaction space, yet few open-source tools exist for building and querying such networks. To address this, we developed NOCTIS, an open-source toolkit for constructing and analyzing reaction data as graphs.
Results: NOCTIS is an open-source Python package for building Networks of Organic Chemistry (NOCs) from reaction strings. It supports graph-based analysis, parallel processing of large datasets, and export to common Python formats (e.g., NetworkX, pandas). Built on Neo4j technology, it features a modular, extensible architecture with open-source dependencies. We also provide a companion plugin for exhaustive route enumeration. It traverses graph-encoded reactions to assemble all valid synthetic routes, helping prevent redundant exploration and supporting knowledge reuse in synthesis planning. The underlying algorithm is documented in detail along with its current limitations. Using the MIT USPTO-480k dataset (Adv Neural Inf Process Syst 30, 2017), we demonstrate the plugin's route mining capabilities, analyze network connectivity, and assess synthetic trees.
Conclusion: Built on LinChemIn (J Chem Inf Model 64(6):1765-1771, 2024), NOCTIS serves as an open and extensible toolkit for network-based reaction analysis and route mining, laying the groundwork for data-driven route design at scale. Future work will extend query capabilities and improve the efficiency of route extraction.
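The idea of assembling routes by traversing graph-encoded reactions can be illustrated with a small, self-contained Python sketch. It is not the NOCTIS API or its plugin algorithm. It assumes reactions are given as strings of the form `reactants>agents>products`, treats any molecule with no producing reaction as a starting material, and discards branches that would revisit a molecule already on the current path. All function names are hypothetical.

```python
from itertools import product


def parse_reaction(rxn):
    """Split 'A.B>>C' into (reactants, products); agents between the '>' are ignored."""
    left, _agents, right = rxn.split(">")
    return tuple(left.split(".")), tuple(right.split("."))


def build_network(reactions):
    """Map each product molecule to the list of reactions that produce it."""
    producers = {}
    for rxn in reactions:
        _reactants, products = parse_reaction(rxn)
        for molecule in products:
            producers.setdefault(molecule, []).append(rxn)
    return producers


def enumerate_routes(target, producers, visited=frozenset()):
    """Return every route to `target` as a frozenset of reaction strings."""
    if target in visited:
        return []  # cycle: no valid route through this branch
    if target not in producers:
        return [frozenset()]  # assumed starting material, nothing to add
    routes = set()
    for rxn in producers[target]:
        reactants, _products = parse_reaction(rxn)
        sub_routes = [
            enumerate_routes(r, producers, visited | {target}) for r in reactants
        ]
        # One route per combination of sub-routes, one for each reactant.
        for combo in product(*sub_routes):
            routes.add(frozenset({rxn}).union(*combo))
    return list(routes)


# Toy network with placeholder molecule names instead of SMILES.
reactions = ["A.B>>C", "D>>C", "E>>A", "C.F>>G"]
producers = build_network(reactions)
routes = enumerate_routes("G", producers)
for route in routes:
    print(sorted(route))
```

Representing a route as a set of reactions makes duplicates easy to detect, at the cost of losing the tree layout. A graph database or NetworkX object would keep that structure explicit.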
Advancing protein design is crucial for breakthroughs in medicine and biotechnology. Traditional approaches for protein sequence representation often rely solely on the 20 canonical amino acids, limiting the representation of non-canonical amino acids and residues that undergo post-translational modifications. This work explores discrete diffusion models for generating novel protein sequences using the all-atom chemical representation SELFIES. By encoding the atomic composition of each amino acid in the protein, this approach expands the design possibilities beyond standard sequence representations. Using a modified ByteNet architecture within the discrete diffusion D3PM framework, we evaluate the impact of this all-atom representation on protein quality, diversity, and novelty, compared to conventional amino acid-based models. To this end, we develop a comprehensive assessment pipeline to determine whether generated SELFIES sequences translate into valid proteins containing both canonical and non-canonical amino acids. Additionally, we examine the influence on generation performance of two noise schedules within the diffusion process: uniform (random replacement of tokens) and absorbing (progressive masking). While models trained on the all-atom representation struggle to consistently generate fully valid proteins, the successfully generated proteins show improved novelty and diversity compared to their amino acid-based model counterparts. Furthermore, the all-atom representation achieves structural foldability results comparable to those of amino acid-based models. Lastly, our results highlight the absorbing noise schedule as the more effective of the two for both representations. Data and code are available at https://github.com/Intelligent-molecular-systems/All-Atom-Protein-Sequence-Generation.
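The difference between the two noise schedules can be sketched in a few lines of Python. This is an illustrative forward-corruption step only, not the authors' implementation. It assumes a fixed per-step corruption probability `beta`, a small hypothetical SELFIES-like vocabulary, and a `[MASK]` token as the absorbing state.

```python
import random

MASK = "[MASK]"  # assumed name for the absorbing state


def uniform_step(tokens, vocab, beta, rng):
    """With probability beta, resample a token uniformly from the vocabulary."""
    return [rng.choice(vocab) if rng.random() < beta else t for t in tokens]


def absorbing_step(tokens, beta, rng):
    """With probability beta, replace a token by MASK; masked tokens stay masked."""
    return [MASK if (t == MASK or rng.random() < beta) else t for t in tokens]


def corrupt(tokens, n_steps, step_fn):
    """Apply the same corruption step n_steps times."""
    for _ in range(n_steps):
        tokens = step_fn(tokens)
    return tokens


# Hypothetical vocabulary and input sequence.
vocab = ["[C]", "[N]", "[O]", "[=O]"]
tokens = ["[N]", "[C]", "[C]", "[=O]", "[O]"]
rng = random.Random(0)

noisy_uniform = corrupt(tokens, 10, lambda t: uniform_step(t, vocab, 0.3, rng))
noisy_absorbing = corrupt(tokens, 10, lambda t: absorbing_step(t, 0.3, rng))
print(noisy_uniform)
print(noisy_absorbing)
```

Under the uniform schedule every corrupted sequence still consists of ordinary tokens, so the model cannot tell which positions were changed. Under the absorbing schedule the corrupted positions are explicitly marked.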
Bioactive peptides are an important class of natural products with great functional versatility. Chemical modifications can improve their pharmacology, yet their structural diversity presents unique challenges for computational modeling. Furthermore, data for standard peptides (composed of the 20 canonical amino acids) are more abundant than for modified ones. Thus, we set out to determine whether predictive models fitted to standard data are reliable when applied to modified peptides. To do this, we first considered two critical aspects of the modeling problem, namely, the choice of similarity function for guiding dataset partitioning and the choice of molecular representation. Similarity-based dataset partitioning is an evaluation technique that divides the dataset into train and test subsets, such that the molecules in the test set are different from those used to fit the model.
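A minimal Python sketch of similarity-based partitioning is given below. It assumes a Jaccard (Tanimoto) similarity over sets of sequence 2-mers and a single-linkage grouping at a fixed threshold, and it assigns whole groups to the test set until a target fraction is reached. The similarity function, the threshold, the representation, and the function names are illustrative choices, not those evaluated in the study.

```python
def kmer_set(sequence, k=2):
    """Represent a peptide by the set of its overlapping k-mers (an assumed featurization)."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def jaccard(a, b):
    """Jaccard (Tanimoto) similarity between two sets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def similarity_clusters(features, threshold):
    """Connected components of the graph linking pairs with similarity >= threshold."""
    parent = list(range(len(features)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i in range(len(features)):
        for j in range(i + 1, len(features)):
            if jaccard(features[i], features[j]) >= threshold:
                parent[find(i)] = find(j)
    clusters = {}
    for i in range(len(features)):
        clusters.setdefault(find(i), []).append(i)
    return list(clusters.values())


def partition(features, threshold, test_fraction):
    """Assign whole clusters to the test set, smallest first, up to test_fraction."""
    clusters = sorted(similarity_clusters(features, threshold), key=len)
    target = test_fraction * len(features)
    train, test = [], []
    for cluster in clusters:
        if len(test) + len(cluster) <= target:
            test.extend(cluster)
        else:
            train.extend(cluster)
    return train, test


# Hypothetical peptide sequences.
peptides = ["ACDE", "ACDF", "ACDG", "WYWY", "KLKL", "KLKM"]
features = [kmer_set(p) for p in peptides]
train, test = partition(features, threshold=0.5, test_fraction=0.5)
print([peptides[i] for i in train])
print([peptides[i] for i in test])
```

Because clusters are never split, no test molecule reaches the similarity threshold with any training molecule. This is the property that makes the evaluation a test of generalization rather than of memorization.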
Nipah Virus (NiV) came into the limelight following an outbreak in Kerala, India. NiV infection can cause severe respiratory and neurological problems, with a fatality rate of 40–70%. It is a public health concern and has the potential to cause a global pandemic. In the absence of treatment, containment has been restricted to isolation and surveillance. WHO’s ‘R&D Blueprint list of priority diseases’ (2018) indicates an urgent need for accelerated research and development to address NiV. In the quest for druglike NiV inhibitors (NVIs), a thorough literature search followed by systematic data curation was conducted. The curated NVIs were then rigorously analysed to prioritise compounds. Our efforts led to the creation of the Nipah Virus Inhibitor Knowledgebase (NVIK), a well-curated, structured knowledgebase of 220 NVIs with 142 unique small-molecule inhibitors. The reported IC50/EC50 values for some of these inhibitors are in the nanomolar range, as low as 0.47 nM. Of the 142 unique small-molecule inhibitors, 124 (87.32%) cleared the PAINS filter. Clustering analysis identified more than 90% of the NVIs as singletons, signifying their diverse structural features. This diverse chemical space can be utilized in numerous ways to develop druglike anti-Nipah molecules. Further, we prioritised the top 10 NVIs based on robustness of assays, physicochemical properties and toxicity profiles. All NVI-related information, including structures, physicochemical properties, similarity analysis with FDA-approved drugs and other chemical libraries, and predicted ADMET profiles, is freely accessible at https://datascience.imtech.res.in/anshu/nipah/. NVIK also allows the community to submit new inhibitors as they are reported, further enriching the NVI landscape.
Scientific contribution
The NVIK is a dedicated resource for NiV drug discovery containing manually curated NVIs. The NVIs are structurally mapped against known chemical space to assess their structural diversity and to recommend strategies for chemical library expansion. In addition, NVIK uses a combined evidence-based strategy to prioritise these inhibitors.