This paper presents Rosetta-XAI, a comprehensive software framework for evaluating and explaining Large Language Model (LLM) behavior in cross-language code conversion tasks. The system implements a four-stage automated pipeline: (1) code generation by LLMs accessed through the Ollama API inference service, (2) regex-based extraction of code blocks from markdown responses, (3) language-specific syntax and compilation validation with temporary artifact management, and (4) execution with timeout protections and CSV-based checkpoint recovery. The framework supports evaluation of 15 specialized code LLMs (1.3B–34B parameters), including DeepSeek Coder, Code Llama, CodeGemma, and Granite Code, across 17 Rosetta Code programming tasks, yielding 42 bidirectional conversion pairs among seven languages (C, C++, Go, Java, JavaScript, Python, Rust). Beyond traditional pass@1 accuracy metrics, the system incorporates explainability analysis through Shapley Value Sampling and Feature Ablation techniques implemented via Captum and PyTorch, enabling researchers to quantify token-level feature importance during translation. All pipeline components include XAI-enhanced variants that support follow-up question analysis for interpretability studies. Built in Python, with pandas for metrics aggregation and subprocess management for multi-language execution, the modular architecture separates extraction, validation, and execution concerns. Results are systematically organized into structured directories tracking accepted code, compilation failures, syntax errors, and execution outputs, with comprehensive metrics exported to CSV files for reproducible research and comparative model analysis.
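Stage (2) of the pipeline, regex-based extraction of code blocks from markdown responses, can be sketched as follows. This is a minimal illustration, not the framework's actual implementation: the pattern, function name, and sample response are assumptions for demonstration purposes.

```python
import re

# Illustrative pattern for fenced markdown code blocks: an optional
# language tag on the opening fence, then the body up to the closing fence.
FENCE_RE = re.compile(r"```(\w+)?\n(.*?)```", re.DOTALL)

def extract_code_blocks(markdown, language=None):
    """Return the bodies of all fenced code blocks in an LLM response,
    optionally filtered by the language tag on the opening fence."""
    blocks = []
    for tag, body in FENCE_RE.findall(markdown):
        if language is None or tag.lower() == language.lower():
            blocks.append(body.strip())
    return blocks

# Hypothetical LLM response containing one Python block.
response = "Here is the translation:\n```python\nprint('hello')\n```\nDone."
print(extract_code_blocks(response, "python"))  # ["print('hello')"]
```

Filtering on the fence's language tag lets the downstream validation stage route each extracted block to the correct compiler or interpreter.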