arXiv (Cornell University): Latest Papers


Can Authorship Attribution Models Distinguish Speakers in Speech Transcripts?

arXiv (Cornell University)

Pub Date : 2023-11-13 DOI: 10.48550/arxiv.2311.07564

Aggazzotti, Cristina, Andrews, Nicholas, Smith, Elizabeth Allyn

Authorship verification is the problem of determining if two distinct writing samples share the same author and is typically concerned with the attribution of written text. In this paper, we explore the attribution of transcribed speech, which poses novel challenges. The main challenge is that many stylistic features, such as punctuation and capitalization, are not available or reliable. Therefore, we expect a priori that transcribed speech is a more challenging domain for attribution. On the other hand, other stylistic features, such as speech disfluencies, may enable more successful attribution but, being specific to speech, require special-purpose models. To better understand the challenges of this setting, we contribute the first systematic study of speaker attribution based solely on transcribed speech. Specifically, we propose a new benchmark for speaker attribution focused on conversational speech transcripts. To control for spurious associations of speakers with topic, we employ both conversation prompts and speakers participating in the same conversation to construct challenging verification trials of varying difficulty. We establish the state of the art on this new benchmark by comparing a suite of neural and non-neural baselines, finding that although written text attribution models achieve surprisingly good performance in certain settings, they struggle in the hardest settings we consider.
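To make the verification task concrete, the following is a minimal sketch of scoring one trial (two transcripts) with a character n-gram cosine similarity. This is an illustrative non-neural scorer chosen for simplicity, not one of the baselines the paper evaluates; the function names and the n-gram length are assumptions. It ignores punctuation-free, lowercased text, mirroring the point that transcripts lack reliable punctuation and capitalization.

```python
from collections import Counter
from math import sqrt


def char_ngrams(text: str, n: int = 3) -> Counter:
    """Count overlapping character n-grams after lowercasing and collapsing whitespace."""
    normalized = " ".join(text.lower().split())
    return Counter(normalized[i:i + n] for i in range(len(normalized) - n + 1))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = sqrt(sum(c * c for c in a.values()))
    norm_b = sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # one sample is too short to yield any n-gram
    return dot / (norm_a * norm_b)


def same_speaker_score(sample_a: str, sample_b: str, n: int = 3) -> float:
    """Higher scores suggest (but do not prove) that both transcripts share a speaker."""
    return cosine(char_ngrams(sample_a, n), char_ngrams(sample_b, n))
```

A trial would be accepted as "same speaker" when the score exceeds a threshold tuned on held-out trials; the threshold is not specified here because it depends on the data.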


Citations: 0

Testing importance sampling on a quantum annealer for strong coupling SU(3) gauge theory

arXiv (Cornell University)

Pub Date : 2023-11-13 DOI: 10.48550/arxiv.2311.07209

Kim, Jangho, Luu, Thomas, Unger, Wolfgang

$SU(N_c)$ gauge theories in the strong coupling limit can be described by integer variables representing monomers, dimers and baryon loops. We demonstrate how the D-Wave quantum annealer can perform importance sampling on $U(N_c)$ gauge theory in the strong coupling formulation of this theory. In addition to causing a sign problem in importance sampling, baryon loops induce a complex QUBO matrix which cannot be optimized by the D-Wave annealer. Instead we show that simulating the sign-problem-free quenched action on the D-Wave is sufficient when combined with a sign reweighting method. As the first test on $SU(3)$ gauge theory, we simulate on a $2 \times 2$ lattice and compare the results with its analytic solutions.
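The sign reweighting step mentioned in the abstract can be illustrated with the standard estimator $\langle O \rangle = \langle O\,\sigma \rangle_q / \langle \sigma \rangle_q$, where the averages are taken over configurations sampled from the sign-free (quenched) weight and $\sigma$ is the sign of each configuration. The sketch below assumes this textbook form and real signs of $\pm 1$; the function name and input layout are hypothetical and not taken from the paper.

```python
def sign_reweighted_mean(observables, signs):
    """Estimate <O> from quenched samples as <O * sign>_q / <sign>_q.

    observables: measured values O_i, one per sampled configuration
    signs:       the sign (+1 or -1) of each configuration's full weight
    """
    if len(observables) != len(signs):
        raise ValueError("observables and signs must have the same length")
    n = len(signs)
    if n == 0:
        raise ValueError("at least one sample is required")
    mean_sign = sum(signs) / n
    if mean_sign == 0:
        # A vanishing average sign means the estimator is undefined.
        raise ZeroDivisionError("average sign is zero; reweighting fails")
    mean_signed_obs = sum(o * s for o, s in zip(observables, signs)) / n
    return mean_signed_obs / mean_sign
```

When every sign is +1 the estimator reduces to the plain sample mean; as the average sign approaches zero the statistical error grows, which is the practical face of the sign problem.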


Citations: 0

MonoDiffusion: Self-Supervised Monocular Depth Estimation Using Diffusion Model

arXiv (Cornell University)

Pub Date : 2023-11-13 DOI: 10.48550/arxiv.2311.07198

Shao, Shuwei, Pei, Zhongcai, Chen, Weihai, Sun, Dingchi, Chen, Peter C. Y., Li, Zhengguo

Over the past few years, self-supervised monocular depth estimation that does not depend on ground-truth during the training phase has received widespread attention. Most efforts focus on designing different types of network architectures and loss functions or handling edge cases, e.g., occlusion and dynamic objects. In this work, we introduce a novel self-supervised depth estimation framework, dubbed MonoDiffusion, by formulating it as an iterative denoising process. Because the depth ground-truth is unavailable in the training phase, we develop a pseudo ground-truth diffusion process to assist the diffusion in MonoDiffusion. The pseudo ground-truth diffusion gradually adds noise to the depth map generated by a pre-trained teacher model. Moreover, the teacher model allows applying a distillation loss to guide the denoised depth. Further, we develop a masked visual condition mechanism to enhance the denoising ability of the model. Extensive experiments are conducted on the KITTI and Make3D datasets and the proposed MonoDiffusion outperforms prior state-of-the-art competitors. The source code will be available at https://github.com/ShuweiShao/MonoDiffusion.
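The idea of gradually adding noise to a teacher-generated depth map can be sketched with the usual DDPM-style forward process, $x_t = \sqrt{\bar\alpha_t}\,x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon$. The abstract does not state the noise schedule or the data layout, so the linear schedule, its endpoints, and the flat list of depth values below are all assumptions made for illustration; this is not the MonoDiffusion implementation.

```python
import math
import random


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Assumed linear noise schedule (endpoints are common defaults, not from the paper)."""
    if num_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_alphas(betas):
    """alpha_bar_t = product over s <= t of (1 - beta_s)."""
    alpha_bars, product = [], 1.0
    for beta in betas:
        product *= 1.0 - beta
        alpha_bars.append(product)
    return alpha_bars


def noise_depth(pseudo_depth, t, alpha_bars, rng):
    """Sample x_t from the pseudo ground-truth depth x_0 (a flat list of floats)."""
    a = alpha_bars[t]
    signal_scale = math.sqrt(a)
    noise_scale = math.sqrt(1.0 - a)
    return [signal_scale * d + noise_scale * rng.gauss(0.0, 1.0) for d in pseudo_depth]


# Usage: noise a tiny three-pixel "depth map" at step 10 of 1000.
alpha_bars = cumulative_alphas(linear_beta_schedule(1000))
noisy = noise_depth([0.5, 1.0, 2.0], t=10, alpha_bars=alpha_bars, rng=random.Random(0))
```

At small `t` the output stays close to the teacher's depth, and at large `t` it approaches pure Gaussian noise, which is what lets a denoiser be trained without real ground-truth depth.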


Citations: 0

Time-Frequency Localization Characteristics of the Delay-Doppler Plane Orthogonal Pulse

arXiv (Cornell University)

Pub Date : 2023-11-13 DOI: 10.48550/arxiv.2311.07238

Shafie, Akram, Yuan, Jinhong, Yang, Nan, Lin, Hai

The orthogonal delay-Doppler (DD) division multiplexing (ODDM) modulation has recently been proposed as a promising solution for ensuring reliable communications in high mobility scenarios. In this work, we investigate the time-frequency (TF) localization characteristics of the DD plane orthogonal pulse (DDOP), which is the prototype pulse of ODDM modulation. The TF localization characteristics examine how concentrated or spread out the energy of a pulse is in the joint TF domain. We first derive the TF localization metric, TF area (TFA), for the DDOP. Based on this result, we provide insights into the energy spread of the DDOP in the joint TF domain. Then, we delve into the potential advantages of the DDOP due to its energy spread, particularly in terms of leveraging both time and frequency diversities, and enabling high-resolution sensing. Furthermore, we determine the TFA for the recently proposed generalized design of the DDOP. Finally, we validate our analysis based on numerical results and show that the energy spread for the generalized design of the DDOP in the joint TF domain exhibits a step-wise increase as the duration of sub-pulses increases.


Citations: 0

High Rectification Ratio at Room Temperature in Rhenium(I) Compound

arXiv (Cornell University)

Pub Date : 2023-11-13 DOI: 10.48550/arxiv.2311.07258

Rajbangshi, Subas, Pal, Nila, Rahman, Robinur, Nesterov, Vladimir N., Roy, Lisa, Ghosh, Shishir, Mondal, Prakash Chandra

Electrical current rectification is an interesting electronic feature, popularly known as a diode. Achieving a high rectification ratio in a molecular junction has been a long-standing goal in molecular electronics. The present work describes mimicking electrical current rectification with a pi-stacked rhenium(I) compound sandwiched between two electrical contacts. Among the two mononuclear rhenium compounds studied here, [Re(CO)4(PPh3){(N)-saccharinate}] (1) and [Re(CO)3(phen){(N)-saccharinate}] (2), the latter shows a high rectification ratio of ~4000 at 2.0 V at room temperature, induced by strong pi-pi interactions. Alternating current (AC)-based electrical measurements, ensuring AC-to-DC electrical signal conversion at a frequency f of 1 kHz, show that 2 can act as an excellent half-wave rectifier. The asymmetric charge injection barrier height at the electrode/Re(I) interfaces of the devices with a stacking configuration of p++-Si/Re compound31nm(2)/ITO gives rise to unidirectional flow of electrical current. The charge transport mechanism is governed by thermally activated hopping, and charge carrier propagation is explained through an energy profile considering the Fermi levels of the two electrodes and the energies of the frontier molecular orbitals, HOMO and LUMO, confirming that rectification is of molecular origin. The present work paves the way to combine different organometallic compounds as circuit elements in nanoelectronic devices to achieve numerous exciting electronic features.
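For readers unfamiliar with the figure of merit, the rectification ratio at a bias $V$ is commonly computed as $|I(+V)| / |I(-V)|$ from a measured current-voltage sweep. The sketch below assumes that convention (some papers use the inverse, depending on which polarity conducts) and a hypothetical `iv_curve` mapping from bias in volts to current in amperes.

```python
from typing import Mapping


def rectification_ratio(iv_curve: Mapping[float, float], bias: float) -> float:
    """Return |I(+bias)| / |I(-bias)| for one current-voltage sweep.

    iv_curve maps applied bias (V) to measured current (A) and must contain
    entries for both +bias and -bias.
    """
    forward = abs(iv_curve[bias])
    reverse = abs(iv_curve[-bias])
    if reverse == 0:
        raise ZeroDivisionError("reverse current is zero; ratio is undefined")
    return forward / reverse


# Illustrative numbers only (not measured data): currents chosen so that the
# ratio matches the order of magnitude quoted in the abstract.
example_sweep = {2.0: 4.0e-3, -2.0: -1.0e-6}
ratio = rectification_ratio(example_sweep, bias=2.0)
```

A ratio near 1 indicates a symmetric, resistor-like junction, while values in the thousands indicate strongly diode-like behaviour.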


Citations: 0

FIRST: A Million-Entry Dataset for Text-Driven Fashion Synthesis and Design

arXiv (Cornell University)

Pub Date : 2023-11-13 DOI: 10.48550/arxiv.2311.07414

Huang, Zhen, Li, Yihao, Pei, Dong, Zhou, Jiapeng, Ning, Xuliang, Han, Jianlin, Han, Xiaoguang, Chen, Xuejun

Text-driven fashion synthesis and design is an extremely valuable part of artificial intelligence generative content (AIGC), which has the potential to propel a tremendous revolution in the traditional fashion industry. To advance the research on text-driven fashion synthesis and design, we introduce a new dataset comprising a million high-resolution fashion images with rich structured textual (FIRST) descriptions. FIRST covers a wide range of attire categories, and each image-paired textual description is organized at multiple hierarchical levels. Experiments on prevalent generative models trained over FIRST show the necessity of FIRST. We invite the community to further develop more intelligent fashion synthesis and design systems that make fashion design more creative and imaginative based on our dataset. The dataset will be released soon.


Citations: 1

Machine Learning For Beamline Steering

arXiv (Cornell University)

Pub Date : 2023-11-13 DOI: 10.48550/arxiv.2311.07519

Kante, Isaac

Beam steering is the process of calibrating the angle and position at which a particle accelerator's electron beam is incident upon the x-ray target with respect to the rotation axis of the collimator. Beam steering is an essential task for light sources. In the case under study, the LINAC To Undulator (LTU) section of the beamline is difficult to aim. Each use of the accelerator requires re-calibration of the magnets in this section. This involves a substantial amount of time and effort from human operators, while reducing the scientific throughput of the light source. We investigate the use of deep neural networks to assist in this task. The deep learning models are trained on archival data and then validated on simulation data. The performance of the deep learning model is contrasted against that of trained human operators.


Citations: 0

Lattice relaxation, electronic structure and continuum model for twisted bilayer MoTe$_2$

arXiv (Cornell University)

Pub Date : 2023-11-13 DOI: 10.48550/arxiv.2311.07533

Mao, Ning, Xu, Cheng, Li, Jiangxu, Bao, Ting, Liu, Peitao, Xu, Yong, Felser, Claudia, Fu, Liang, Zhang, Yang

We investigate the lattice relaxation effect on moiré band structures in twisted bilayer MoTe$_2$ with two approaches: (a) large-scale plane-wave basis first-principles calculations down to $2.88^{\circ}$, (b) transfer learning structure relaxation + local-basis first-principles calculations down to $1.1^{\circ}$. Two types of van der Waals corrections have been examined: the D2 method of Grimme and the density-dependent energy correction. We note that the density-dependent energy correction yields a continuous evolution of bandwidth with twist angle. Including the second harmonic of the intralayer potential/interlayer tunneling and the strain-induced gauge field, we develop a more complete continuum model with a single set of parameters for a wide range of twist angles, providing a useful starting point for many-body simulations.


Citations: 0

CASTER: A Computer-Vision-Assisted Wireless Channel Simulator for Gesture Recognition

arXiv (Cornell University)

Pub Date : 2023-11-13 DOI: 10.48550/arxiv.2311.07169

Ren, Zhenyu, Li, Guoliang, Ji, Chenqing, Yu, Chao, Wang, Shuai, Wang, Rui

In this paper, a computer-vision-assisted simulation method is proposed to address the issue of training dataset acquisition for wireless hand gesture recognition. In the existing literature, in order to classify gestures via wireless channel estimation, massive training samples must be measured in a consistent environment, which consumes significant effort. In the proposed CASTER simulator, however, the training dataset can be simulated from existing videos. In particular, a gesture is represented by a sequence of snapshots, and the channel impulse response of each snapshot is calculated by tracing the rays scattered off a primitive-based hand model. Moreover, the CASTER simulator relies on existing videos to extract the motion data of gestures. Thus, the massive measurements of the wireless channel can be eliminated. The experiments demonstrate a 90.8% average classification accuracy of simulation-to-reality inference.


Citations: 0

SponTTS: modeling and transferring spontaneous style for TTS

arXiv (Cornell University)

Pub Date : 2023-11-13 DOI: 10.48550/arxiv.2311.07179

Li, Hanzhao, Zhu, Xinfa, Xue, Liumeng, Song, Yang, Chen, Yunlin, Xie, Lei

Spontaneous speaking style exhibits notable differences from other speaking styles due to various spontaneous phenomena (e.g., filled pauses, prolongation) and substantial prosody variation (e.g., diverse pitch and duration variation, occasional non-verbal speech like smile), posing challenges to modeling and prediction of spontaneous style. Moreover, the limitation of high-quality spontaneous data constrains spontaneous speech generation for speakers without spontaneous data. To address these problems, we propose SponTTS, a two-stage approach based on bottleneck (BN) features to model and transfer spontaneous style for TTS. In the first stage, we adopt a Conditional Variational Autoencoder (CVAE) to capture spontaneous prosody from a BN feature and involve the spontaneous phenomena by the constraint of spontaneous phenomena embedding prediction loss. Besides, we introduce a flow-based predictor to predict a latent spontaneous style representation from the text, which enriches the prosody and context-specific spontaneous phenomena during inference. In the second stage, we adopt a VITS-like module to transfer the spontaneous style learned in the first stage to target speakers. Experiments demonstrate that SponTTS is effective in modeling spontaneous style and transferring the style to the target speakers, generating spontaneous speech with high naturalness, expressiveness, and speaker similarity. The zero-shot spontaneous style TTS test further verifies the generalization and robustness of SponTTS in generating spontaneous speech for unseen speakers.


Citations: 0
