Tokenization plays a critical role in preparing SMILES strings for molecular foundation models. Poor token units can fragment chemically meaningful substructures, inflate sequence length, and hinder model learning and interpretability. Existing approaches such as SMILES Pair Encoding (SPE) and Atom Pair Encoding (APE) compress token sequences but often ignore domain-specific chemistry or fail to generalize to larger or more diverse molecules. We propose a domain-aware method for SMILES compression that combines frequency-guided substring mining using a prefix trie with an optional entropy-based refinement step using a token transition graph (TTG). On a corpus of 100,000 PubChem molecules, the Trie+TTG method reduces token sequences by more than 50% compared to APE while preserving chemically coherent substructures. The method generalizes effectively to large, out-of-distribution molecules, achieving compression rates of up to 90% with minimal sensitivity to molecule size. To assess downstream utility, we evaluate latent-space structure using unsupervised clustering and perform QSAR regression on ESOL. Trie+TTG produces more separable molecular representations and stronger predictive performance than Trie-only and APE. In addition, on peptide corpora, our method substantially outperforms SPE and the PeptideCLM tokenizer in compression and entropy metrics. These results show that combining trie-based mining with TTG refinement yields compact, stable, and chemically meaningful tokenizations suitable for modern molecular representation learning.

Scientific contributions: We present a trie-based framework that compresses SMILES sequences into shorter, chemically coherent units while guaranteeing lossless reconstruction. By incorporating a token transition graph for entropy-guided refinement, our method selects contextually stable merges that improve both compression efficiency and generalization. Unlike prior approaches such as APE and SPE, our tokenizer combines frequency and context awareness, yielding more compact, interpretable, and transferable molecular representations.
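The abstract names two ingredients, frequency-guided substring mining with a prefix trie and entropy computed over a token transition graph, without giving their details. The following is only a minimal sketch of those two ideas under stated assumptions: it works on raw characters rather than chemically valid atom tokens (so `Cl` or bracket atoms could be split), and the names `build_trie`, `mine_vocab`, `tokenize`, `transition_entropy`, `max_len`, and `min_count` are hypothetical and not taken from the paper. How the authors turn entropy into a merge filter is not stated in the abstract, so the sketch stops at computing it.

```python
import math
from collections import Counter, defaultdict


class TrieNode:
    __slots__ = ("children", "count")

    def __init__(self):
        self.children = {}
        self.count = 0


def build_trie(corpus, max_len=6):
    """Count every substring of up to max_len characters in a prefix trie."""
    root = TrieNode()
    for s in corpus:
        for i in range(len(s)):
            node = root
            for ch in s[i:i + max_len]:
                node = node.children.setdefault(ch, TrieNode())
                node.count += 1
    return root


def mine_vocab(root, min_count=2):
    """Collect multi-character substrings seen at least min_count times."""
    vocab = set()
    stack = [("", root)]
    while stack:
        prefix, node = stack.pop()
        for ch, child in node.children.items():
            # A child's count never exceeds its parent's, so pruning is safe.
            if child.count >= min_count:
                tok = prefix + ch
                if len(tok) > 1:
                    vocab.add(tok)
                stack.append((tok, child))
    return vocab


def tokenize(s, vocab, max_len=6):
    """Greedy longest-match segmentation; single characters are the fallback."""
    tokens, i = [], 0
    while i < len(s):
        for n in range(min(max_len, len(s) - i), 0, -1):
            piece = s[i:i + n]
            if n == 1 or piece in vocab:
                tokens.append(piece)
                i += n
                break
    return tokens


def transition_entropy(token_seqs):
    """Shannon entropy (bits) of the next-token distribution of each token."""
    nxt = defaultdict(Counter)
    for seq in token_seqs:
        for a, b in zip(seq, seq[1:]):
            nxt[a][b] += 1
    entropy = {}
    for a, counts in nxt.items():
        total = sum(counts.values())
        entropy[a] = -sum(c / total * math.log2(c / total) for c in counts.values())
    return entropy


# Toy corpus (illustrative only, not the PubChem data used in the paper).
corpus = ["CC(=O)O", "CC(=O)OC", "CCC(=O)O", "c1ccccc1C(=O)O"]
vocab = mine_vocab(build_trie(corpus), min_count=3)
seqs = [tokenize(s, vocab) for s in corpus]
entropy = transition_entropy(seqs)
```

Because every token is a literal substring of its input, concatenating the tokens reproduces the original SMILES string, which is the lossless-reconstruction property the abstract refers to. A token whose outgoing transitions have low entropy appears in a predictable context, which is one plausible reading of "contextually stable".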
Sridhar Radhakrishnan, Krish Mody, Arvind Venkatesh, Ananth Venkatesh. Optimizing SMILES token sequences via trie-based refinement and transition graph filtering. Journal of Cheminformatics. Pub Date : 2026-01-02 DOI: 10.1186/s13321-025-01143-9
Pub Date : 2025-12-29 DOI: 10.1186/s13321-025-01120-2
Yuzhu Li, Daiju Yang, Qingyi Shi, Weidong Zhang, Qingyan Sun
Molecular volume, surface area, and polar surface area are important descriptors for characterizing and predicting the properties of lead compounds. Existing computational tools for calculating these parameters often have complex workflows and are poorly suited to high-throughput use. CalVSP is open-source software for computing molecular volume, molecular surface area, and polar surface area. It implements a grid-based algorithm that dynamically optimizes grid spacing using quantum chemical reference data to ensure precise calculations. CalVSP was tested on 9489 3D molecular structures; compared with the quantum chemical data, the mean absolute percentage error was 1.25% (95% CI: 1.23–1.27%) for molecular volume and 1.33% (95% CI: 1.31–1.35%) for molecular surface area. For polar surface area, the mean absolute percentage error was 4.59% (95% CI: 4.16–5.04%) across the 388 tested structures. Written in C, CalVSP is lightweight and easy to use. It can be integrated with other molecular property prediction tools to improve computational accuracy and to support large-scale molecular calculations.
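To make the idea of a grid-based volume calculation concrete, here is a minimal sketch that treats a molecule as a union of atom-centred spheres and counts grid cells whose centre falls inside any sphere. This is a generic voxel-counting approach, not CalVSP's actual algorithm: the abstract does not describe how CalVSP assigns radii or optimizes the spacing, so the function names `grid_volume` and `percent_error`, the fixed `spacing` parameter, and the example radii are all assumptions made for illustration.

```python
import math


def grid_volume(atoms, spacing=0.2):
    """Estimate the volume of a union of spheres on a regular grid.

    atoms: list of (x, y, z, radius) tuples, all in the same length unit.
    A cell counts as occupied when its centre lies inside any sphere.
    """
    lo = [min(a[k] - a[3] for a in atoms) for k in range(3)]
    hi = [max(a[k] + a[3] for a in atoms) for k in range(3)]
    n = [max(1, math.ceil((hi[k] - lo[k]) / spacing)) for k in range(3)]
    inside = 0
    for i in range(n[0]):
        x = lo[0] + (i + 0.5) * spacing
        for j in range(n[1]):
            y = lo[1] + (j + 0.5) * spacing
            for k in range(n[2]):
                z = lo[2] + (k + 0.5) * spacing
                if any((x - ax) ** 2 + (y - ay) ** 2 + (z - az) ** 2 <= r * r
                       for ax, ay, az, r in atoms):
                    inside += 1
    return inside * spacing ** 3


def percent_error(estimate, reference):
    """Absolute percentage error of one estimate against a reference value."""
    return abs(estimate - reference) / abs(reference) * 100.0


# Single sphere of radius 1.7 (roughly the van der Waals radius of carbon, in angstroms).
single = grid_volume([(0.0, 0.0, 0.0, 1.7)], spacing=0.1)
exact_single = 4.0 / 3.0 * math.pi * 1.7 ** 3

# Two overlapping unit spheres one unit apart; the overlap must not be counted twice.
pair = grid_volume([(0.0, 0.0, 0.0, 1.0), (1.0, 0.0, 0.0, 1.0)], spacing=0.1)
exact_pair = 2 * (4.0 / 3.0 * math.pi) - math.pi * 5.0 / 12.0
```

Smaller spacing lowers the discretization error at a cubic cost in the number of cells, which is the trade-off that any scheme for tuning grid spacing has to balance. Averaging `percent_error` over a set of molecules gives the mean absolute percentage error metric quoted in the abstract.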