Speech and audio in window systems: when will they happen?

B. Arons, C. Schmandt, Michael Hawley, Lester Ludwig, P. Zellweger
ACM SIGGRAPH 89 Panel Proceedings. Published 1989-07-01. DOI: 10.1145/77276.77285 (https://doi.org/10.1145/77276.77285)

Abstract

Good afternoon. Boy, I can't see anything out there. I assume you all can see me -- that's why these lights are here. My name is Chris Schmandt from the Media Lab at MIT. I'm co-chairing this panel with Barry Arons, who is sitting over here. It's actually quite a pleasure to co-chair this panel with Barry. We've been working together off and on for more years than I care to remember. This panel has a long, ridiculous name. Basically it's about audio and window systems and workstations. I'm wearing two hats here. I'm going to spend a minute or two introducing the panel and then I'm going to spend some time talking about my own segment of the panel. We're going to try to be a panel as opposed to a series of five mini-papers that never get published. In other words, we're going to try to keep our presentations relatively short, then segue into a series of prepared questions that the panelists are going to answer amongst themselves. Then we'll open the floor up for questions.
In some ways this is a very incestuous crew. We've all known each other for quite a while. We have different slants and we're actually going to try to focus on those slants a little bit. So if we disagree with each other, that doesn't necessarily mean we really hate each other. We're all friends.
Where this panel is coming from is a surge of interest in audio, and multimedia in general, in computer workstations. The Macintosh has had audio for quite a while -- you may or may not choose to call that a workstation. The NeXT computer sort of surprised people by having fairly powerful DSP and audio in and out. You'll get a demo of that later if you haven't seen it. The Sun SPARCstation has come out with some primitive digital record and playback capabilities. On the other hand, there's been interest in voice in computer workstations for years and years, and what we've seen so far is that voice really hasn't had very much success. There have been a number of products that have come and gone.
What has become popular has been centralized service -- specifically voice mail. Voice mail is tied in more to a PBX -- and the interface is more like a telephone than it is like the mouse and window system of the computer workstation interface. Obviously, window systems are here to stay. We're not suggesting that audio is going to replace the graphical paradigm, but rather that it will have to interact with it.
On the other hand, everybody has a telephone. People had telephones on their desks before they had workstations, and we talk all the time at work. Voice really is a fundamental component of the way we talk, the way we interact with each other.
What we're seeing in terms of the technologies showing up in these workstations is higher bit rate coding. Gone are the days of unintelligible low bit rate linear predictive coding or something like that -- except for specialized applications. Speech recognition is here, but it's in its infancy. Text-to-speech -- it's around, it's difficult to understand. You can learn to understand it. Telephony is obviously part of this set-up if we're dealing with audio. We don't know whether it's going to be analogue or digital. Is it going to be plain old telephone or is it going to be ISDN? Those are some of the issues that we're going to be talking about in this session.
As I say, we're going to try to keep each of the speakers to a relatively short period -- and now I can put on my other hat. (puts toy plastic headset on -- laughter) Some people ask me whether speech recognition is a toy or not. Yes, it is. It's sort of a fun toy. Speech technologies are in general fun. I was originally hoping to be able to play this out to the audience. But I don't think it's going to work well enough. This is actually a kid's toy -- $50 at Toys R Us. Speaker Independent Isolated Word Speech Recognizer -- "yes", "no", "true", and "false". It will take you on tours about dinosaurs and things like that.
From my point of view, the key for what we can do with voice has to do with understanding its advantages and disadvantages and the concomitant user interface requirements, leading us to design reasonable applications for it.
Voice has some advantages. It's very useful when your hands and eyes are busy; you're looking at a screen, you have your fingers on the mouse. Sometimes it's intuitive; we learn to talk at a very early age. People talk to their computers even if the computers don't have speech recognition. (laughter) Usually it's expletives -- especially with UNIX. (laughter) Voice really dominates human-to-human communication. No matter what we're doing with E-Mail and FAX, the bottom line is we just still have to spend a certain amount of time physically speaking to each other. Telephones are everywhere. If I can turn an ordinary pay phone into a computer terminal, suddenly I have access from all over the place. From my own work, this suggests a heavy focus on telecommunications. The kinds of systems that I'm building are really designed to use voice in a communications kind of environment.
On the other hand, there are many, many disadvantages of voice. It's very slow. 200 words per minute -- 150 to 250 words per minute. That's less than a 300 baud modem, and who uses those any more? Speech is serial. You have to listen to things in sequence. It's a time-varying signal by definition. And it requires attention. You have to listen to what's going on, as opposed to simply scrolling it by and stopping it occasionally. My way of characterizing this is to say that speech is "bulky". Yes, it takes up space on the file system, but most importantly you can't "grep" it, you can't do keyword searches on it. It's hard to file, it's just hard to get any kind of handle on it. It takes time.
Finally, speech broadcasts. If my workstation is talking to me and you're sitting in my office, you're going to hear what it says, which is very different from if it appears as text.
In fact, if it appears as text, and I'm sitting in front of the screen with these kinds of tiny bitmap fonts that we tend to use, I'm probably not even going to be able to read it -- much less you.
This has some user interface implications. One is that it suggests that we would like, where possible, to have graphical access to sounds. I'm going to show a video in just a second, showing you an interface to audio built under the X Window System, designed to give you some kind of a graphical context, so you can mouse around and perhaps use some visual cues to keep track of where you are in the sound. If you could roll the first piece of one-inch, please. This is a sound widget.
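The comparison of speaking rate with a 300 baud modem, and the remark that speech is "bulky" on the file system, can be checked with a rough calculation. The sketch below is only illustrative. The 150-250 words-per-minute range and the 300 baud figure come from the talk; the six characters per word, the ten bits per character on the serial line, the one-bit-per-symbol modem, and the 8 kHz, 8-bit audio format are assumptions made here, not figures given by the speaker.

```python
# Rough arithmetic behind "speech is slower than a 300 baud modem" and
# "speech is bulky". Only the words-per-minute range and the 300 baud
# figure come from the talk; every other constant is an assumption.

CHARS_PER_WORD = 6            # assumed: about five letters plus a space
BITS_PER_CHAR_ON_WIRE = 10    # assumed: 8 data bits + start bit + stop bit
MODEM_BITS_PER_SEC = 300      # assumed: one bit per symbol at 300 baud
AUDIO_BYTES_PER_SEC = 8000    # assumed: 8 kHz sampling, 8 bits per sample


def speech_chars_per_sec(words_per_minute: float) -> float:
    """Text-equivalent rate of speech, in characters per second."""
    return words_per_minute * CHARS_PER_WORD / 60.0


def modem_chars_per_sec() -> float:
    """Character throughput of the modem under the framing assumed above."""
    return MODEM_BITS_PER_SEC / BITS_PER_CHAR_ON_WIRE


def bulk_ratio(words_per_minute: float) -> float:
    """Bytes of recorded audio per byte of the equivalent text."""
    return AUDIO_BYTES_PER_SEC / speech_chars_per_sec(words_per_minute)


if __name__ == "__main__":
    print(f"modem: {modem_chars_per_sec():.0f} chars/s")
    for wpm in (150, 200, 250):
        print(
            f"{wpm} wpm: {speech_chars_per_sec(wpm):.0f} chars/s, "
            f"audio is about {bulk_ratio(wpm):.0f}x the size of the text"
        )
```

Under these assumptions even the fastest speaking rate in the quoted range stays below the modem's character rate, and stored audio is a few hundred times larger than the text it carries. Different choices for word length or audio format shift the numbers but not the direction of the comparison.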