{"title":"超压缩:通过超函数压缩模型","authors":"Fenglei Fan, Juntong Fan, Dayang Wang, Jingbo Zhang, Zelin Dong, Shijun Zhang, Ge Wang, Tieyong Zeng","doi":"arxiv-2409.00592","DOIUrl":null,"url":null,"abstract":"The rapid growth of large models' size has far outpaced that of GPU memory.\nTo bridge this gap, inspired by the succinct relationship between genotype and\nphenotype, we turn the model compression problem into the issue of parameter\nrepresentation to propose the so-called hyper-compression. The\nhyper-compression uses a hyperfunction to represent the parameters of the\ntarget network, and notably, here the hyperfunction is designed per ergodic\ntheory that relates to a problem: if a low-dimensional dynamic system can fill\nthe high-dimensional space eventually. Empirically, the proposed\nhyper-compression enjoys the following merits: 1) \\textbf{P}referable\ncompression ratio; 2) \\textbf{N}o post-hoc retraining; 3) \\textbf{A}ffordable\ninference time; and 4) \\textbf{S}hort compression time. It compresses LLaMA2-7B\nin an hour and achieves close-to-int4-quantization performance, without\nretraining and with a performance drop of less than 1\\%. Our work has the\npotential to invigorate the field of model compression, towards a harmony\nbetween the scaling law and the stagnation of hardware upgradation.","PeriodicalId":501168,"journal":{"name":"arXiv - CS - Emerging Technologies","volume":"34 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Hyper-Compression: Model Compression via Hyperfunction\",\"authors\":\"Fenglei Fan, Juntong Fan, Dayang Wang, Jingbo Zhang, Zelin Dong, Shijun Zhang, Ge Wang, Tieyong Zeng\",\"doi\":\"arxiv-2409.00592\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"The rapid growth of large models' size has far outpaced that of GPU memory.\\nTo bridge this gap, inspired by the succinct relationship between genotype and\\nphenotype, we turn the model compression problem into the issue of parameter\\nrepresentation to propose the so-called hyper-compression. The\\nhyper-compression uses a hyperfunction to represent the parameters of the\\ntarget network, and notably, here the hyperfunction is designed per ergodic\\ntheory that relates to a problem: if a low-dimensional dynamic system can fill\\nthe high-dimensional space eventually. Empirically, the proposed\\nhyper-compression enjoys the following merits: 1) \\\\textbf{P}referable\\ncompression ratio; 2) \\\\textbf{N}o post-hoc retraining; 3) \\\\textbf{A}ffordable\\ninference time; and 4) \\\\textbf{S}hort compression time. It compresses LLaMA2-7B\\nin an hour and achieves close-to-int4-quantization performance, without\\nretraining and with a performance drop of less than 1\\\\%. 
Our work has the\\npotential to invigorate the field of model compression, towards a harmony\\nbetween the scaling law and the stagnation of hardware upgradation.\",\"PeriodicalId\":501168,\"journal\":{\"name\":\"arXiv - CS - Emerging Technologies\",\"volume\":\"34 1\",\"pages\":\"\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2024-09-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"arXiv - CS - Emerging Technologies\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/arxiv-2409.00592\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Emerging Technologies","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2409.00592","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Hyper-Compression: Model Compression via Hyperfunction
The rapid growth in the size of large models has far outpaced the growth of GPU memory. To bridge this gap, inspired by the succinct relationship between genotype and phenotype, we recast the model compression problem as a problem of parameter representation and propose the so-called hyper-compression. Hyper-compression uses a hyperfunction to represent the parameters of the target network; notably, the hyperfunction is designed based on ergodic theory, which concerns whether a low-dimensional dynamical system can eventually fill a high-dimensional space. Empirically, the proposed
hyper-compression enjoys the following merits: 1) \textbf{P}referable
compression ratio; 2) \textbf{N}o post-hoc retraining; 3) \textbf{A}ffordable
inference time; and 4) \textbf{S}hort compression time. It compresses LLaMA2-7B in an hour and achieves performance close to that of int4 quantization, without retraining and with a performance drop of less than 1\%. Our work has the potential to invigorate the field of model compression, moving towards a harmony between the scaling law and the stagnation of hardware upgrades.
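
To make the hyperfunction idea concrete, below is a minimal, hypothetical Python sketch. It assumes the hyperfunction is a one-dimensional irrational winding of the unit torus, h(t) = (t*a_1 mod 1, ..., t*a_d mod 1), whose trajectory is dense in [0,1)^d when the directions a_i are rationally independent; this density is the ergodic-theory property the abstract alludes to. Each group of d weights (rescaled into [0,1)) is then replaced by the single scalar t whose trajectory point best matches it. The function names, the choice of winding directions, and the brute-force grid search over t are illustrative assumptions, not the authors' exact algorithm.

```python
import numpy as np

def hyperfunction(t, alphas):
    """Map a scalar t to a point on an irrational winding of the unit torus.

    For rationally independent alphas, this trajectory eventually comes
    arbitrarily close to every point of [0, 1)^d (ergodicity).
    """
    return (t * alphas) % 1.0

def compress_group(w, alphas, t_grid):
    """Replace a d-dim weight group (rescaled to [0, 1)) by the best scalar t."""
    # Trajectory points for all candidate t values: shape (len(t_grid), d)
    traj = (t_grid[:, None] * alphas[None, :]) % 1.0
    errs = np.abs(traj - w[None, :]).max(axis=1)  # worst-case error per candidate
    best = np.argmin(errs)
    return t_grid[best], errs[best]

def decompress_group(t, alphas):
    """Recover the approximate weight group from its stored scalar."""
    return hyperfunction(t, alphas)

# --- toy usage: compress one group of d = 4 weights into a single scalar ---
rng = np.random.default_rng(0)
d = 4
alphas = np.sqrt(np.array([2.0, 3.0, 5.0, 7.0]))  # rationally independent directions
w = rng.random(d)                                  # weights already scaled to [0, 1)
t_grid = np.linspace(0.0, 100.0, 200_000)          # candidate scalars
t, err = compress_group(w, alphas, t_grid)
print("original :", w)
print("recovered:", decompress_group(t, alphas))
print(f"stored 1 scalar instead of {d} weights, max abs error = {err:.4f}")
```

Under these assumptions, storing one float per group of d weights yields roughly a d-fold reduction before any entropy coding, and decompression is a single multiply-and-mod per weight, which suggests why the retraining-free compression and the affordable inference time claimed in the abstract are plausible.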