The physical and chemical properties of coffee beans change drastically during the roasting process. The quality of coffee beans is directly determined by their color, texture, and brightness, and the roasting level is essential for food safety, commercial value, and product quality. Traditional classification techniques are time-consuming, expensive, and prone to human error. These limitations have generated interest in deep learning. Vision Transformer (ViT) models capture global context, and the features they extract enable detailed, hierarchical analysis of images. This study aimed to classify the roasting levels of coffee beans by extracting image features with ViT models and classifying those features with machine learning (ML) models. Two publicly available datasets were used: one containing grayscale images of coffee beans (Dataset1) and one containing RGB images (Dataset2). Random Forest (RF), Decision Tree, and K-Nearest Neighbors models were used to classify the features extracted by the BEiT, FlexiViT, MobileViT, and DeiT models. An Adaptive ViT feature fusion, which combines features from the ViT models, was also evaluated with these ML models. The experimental results showed that the Adaptive ViT-RF achieved accuracies of 0.825 on Dataset1 and 0.992 on Dataset2. SHapley Additive exPlanations (SHAP) analysis was performed to examine the decision mechanism of the model. It showed that high feature values were influential for the Green and Light classes, whereas low feature values were influential for the Dark and Medium classes. These results indicate that the Adaptive ViT-RF approach accurately classifies the roast level of coffee beans.
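The fuse-then-classify idea described above can be illustrated with a minimal sketch. It is not the study's implementation: the abstract does not specify how the Adaptive ViT fusion works, so the sketch assumes a simple weighted concatenation of L2-normalized feature vectors. The embeddings are synthetic stand-ins for ViT features, the two "backbones", their dimensions, and the weights are hypothetical, and a plain-Python K-Nearest Neighbors classifier is used only because it needs no external libraries.

```python
import math
import random
from collections import Counter

CLASSES = ["Green", "Light", "Medium", "Dark"]

def l2_normalize(vec):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def fuse(feature_sets, weights):
    """Assumed fusion rule: weight each normalized vector, then concatenate."""
    fused = []
    for feats, w in zip(feature_sets, weights):
        fused.extend(w * x for x in l2_normalize(feats))
    return fused

def fake_embedding(rng, class_index, dim):
    """Synthetic stand-in for a ViT feature vector: a class-specific
    pattern of active positions plus Gaussian noise."""
    return [
        (1.0 if j % len(CLASSES) == class_index else 0.0) + rng.gauss(0, 0.1)
        for j in range(dim)
    ]

def make_sample(rng, class_index):
    """Fuse features from two hypothetical backbones (dims 8 and 12)."""
    feats_a = fake_embedding(rng, class_index, 8)
    feats_b = fake_embedding(rng, class_index, 12)
    return fuse([feats_a, feats_b], weights=[0.6, 0.4])

def knn_predict(train_x, train_y, query, k=3):
    """Majority vote among the k training vectors closest to the query."""
    nearest = sorted(zip(train_x, train_y),
                     key=lambda pair: math.dist(pair[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

rng = random.Random(0)  # fixed seed so the sketch is reproducible

# Build small synthetic train and test sets, one group per roast class.
train_x, train_y, test_x, test_y = [], [], [], []
for idx, name in enumerate(CLASSES):
    for _ in range(20):
        train_x.append(make_sample(rng, idx))
        train_y.append(name)
    for _ in range(5):
        test_x.append(make_sample(rng, idx))
        test_y.append(name)

# Accuracy = fraction of test samples whose predicted class is correct.
correct = sum(knn_predict(train_x, train_y, x) == y
              for x, y in zip(test_x, test_y))
accuracy = correct / len(test_y)
print(f"accuracy on synthetic fused features: {accuracy:.3f}")
```

In a real pipeline, `fake_embedding` would be replaced by features taken from pretrained ViT backbones, and the classifier by a Random Forest or another of the ML models named above. The accuracy printed here describes only the synthetic data and says nothing about the reported results.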