ByteNet: Rethinking Multimedia File Fragment Classification Through Visual Perspectives

IF 9.7 | CAS Region 1 (Computer Science) | JCR Q1, Computer Science, Information Systems | IEEE Transactions on Multimedia, Vol. 27, pp. 1305-1319 | Published: 2024-12-23 | DOI: 10.1109/TMM.2024.3521830
Wenyang Liu, Kejun Wu, Tianyi Liu, Yi Wang, Kim-Hui Yap, Lap-Pui Chau
Citations: 0

Abstract

Multimedia file fragment classification (MFFC) aims to identify file fragment types, e.g., image/video, audio, and text, without system metadata. It is of vital importance in multimedia storage and communication. Existing MFFC methods typically treat fragments as 1D byte sequences and emphasize the relations between separate bytes (interbytes) for classification. However, the more informative relations inside bytes (intrabytes) are overlooked and seldom investigated. By looking inside bytes, the bit-level details of file fragments can be accessed, enabling more accurate classification. Motivated by this, we first propose Byte2Image, a novel visual representation model that incorporates previously overlooked intrabyte information into file fragments and reinterprets these fragments as 2D grayscale images. This model involves a sliding byte window to reveal the intrabyte information and a rowwise stacking of intrabyte n-grams for embedding fragments into a 2D space. Thus, complex interbyte and intrabyte correlations can be mined simultaneously using powerful vision networks. Additionally, we propose an end-to-end dual-branch network, ByteNet, to enhance robust correlation mining and feature representation. ByteNet makes full use of the raw 1D byte sequence and the converted 2D image through a shallow byte branch feature extraction (BBFE) network and a deep image branch feature extraction (IBFE) network. In particular, the BBFE, composed of a single fully-connected layer, adaptively recognizes the co-occurrence of several specific bytes within the raw byte sequence, while the IBFE, built on a vision Transformer, effectively mines the complex interbyte and intrabyte correlations from the converted image. Experiments on two representative benchmarks, covering 14 cases, validate that our proposed method outperforms state-of-the-art approaches on different cases by up to 12.2%.
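A minimal sketch of the byte-to-image idea the abstract describes: slide a window over the fragment, unpack each windowed byte n-gram into its bits to expose intrabyte structure, and stack the bit rows into a 2D grayscale image. The window size and scaling below are illustrative assumptions; the paper's exact parameters are not given in the abstract.

```python
import numpy as np

def byte2image(fragment: bytes, window: int = 2) -> np.ndarray:
    """Reinterpret a 1D byte fragment as a 2D grayscale image.

    Each sliding window of `window` consecutive bytes (an intrabyte n-gram)
    is unpacked into individual bits, and the bit rows are stacked rowwise.
    Window size is an illustrative assumption, not the paper's value.
    """
    arr = np.frombuffer(fragment, dtype=np.uint8)
    rows = []
    for i in range(len(arr) - window + 1):
        ngram = arr[i:i + window]
        rows.append(np.unpackbits(ngram))     # 8 * window bits per row
    img = np.stack(rows).astype(np.uint8) * 255  # scale {0,1} bits to grayscale
    return img

img = byte2image(bytes(range(16)), window=2)
print(img.shape)  # (15, 16): 15 sliding positions, 2 bytes * 8 bits each
```

The resulting 2D array can then be fed to a vision network (the IBFE branch in the paper), while the raw byte sequence goes to the shallow byte branch.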
Source Journal

IEEE Transactions on Multimedia (Engineering & Technology – Telecommunications)
CiteScore: 11.70
Self-citation rate: 11.00%
Articles per year: 576
Review time: 5.5 months
About the journal: The IEEE Transactions on Multimedia delves into diverse aspects of multimedia technology and applications, covering circuits, networking, signal processing, systems, software, and systems integration. The scope aligns with the Fields of Interest of the sponsors, ensuring a comprehensive exploration of research in multimedia.