Audio Computer! More promising than Apple Vision Pro? (UIs from AUI to ZUI)

Gary Bartos
29 min read · May 16, 2024


I want AI that’s demonstrably useful. AI is hot (again), and this time it looks like we’re stuck with it. So I want “AI” to live up to the promise for once.

Just last week a TED talk introduced an “audio computer.” Consider me enthused about AI (again)!

An exploded view of three components of the iyo One audio computer.
The iyo One and its bits

My enthusiasm is high enough that I’ve decided to write about the variety of user interfaces — at some length — to help explain why I think the audio computer and its audio user interface (AUI), as Rugolo describes them, could be better than Siri or Google Assistant.

The TED Talk

Jason Rugolo of iyo has demonstrated AI that appears to offer more than just ChatGPT / Claude / Gemini, while running on more convenient hardware. In particular, the team at iyo appears to have solved at least one tricky problem using far smaller hardware than I’ve seen before. More on that problem below.

Whatever the everyday experience of this AI product-to-be may be when it’s released, the AUI will definitely be worth trying.

Rugolo smiles a lot in the video, including when the audio interface says something silly.

So yeah, I pre-ordered an iyo One using my company bank account. Work expenses can be fun, right?

More Revolutionary Than Apple Vision Pro?

Apple Vision Pro intrigued me as well. It has features I want and the sensors I need to implement a number of applications. In 2023 and before, many of us kept up with the trickle of rumors about Apple’s smart glasses project. We wondered what the Apple smart glasses would do, what they would look like, and how much they would cost.

The answers so far, as I understand them: visionOS devices could do a lot some day, but the Vision Pro is way too big, way too heavy, and way too expensive.

Snap Spectacles: Closer to What People Actually Want, and Would Wear

When my team conducted tests and interviews with blind and visually impaired folks who tried apps running on Snap Spectacles augmented reality smart glasses, the conclusions were clear:

  • None of our target customers want bulky glasses. Preferably, smart glasses would have the same form factor as good quality sunglasses.
  • Battery power has to last the better part of a day.
  • Smart glasses need to come pre-loaded with a number of utility apps, all of which are properly accessible.
  • The price for smart glasses should be around $1000, less than one-third the base price of the Vision Pro.

Bummer.

Apple may have something cool by version 3, assuming the Vision Pro product line isn’t killed off before then.

Why You Should Be Enthused about an Audio Computer

If you’ve watched the video above and didn’t think something like “That’s cool!” then maybe you should stop reading here. But I hope some of you manage to wade through my long explanation of why the tech is cool.

First, a negative.

Holding up a phone is (weirdly) fatiguing

Siri or Google Assistant might work well when you utter commands using your headphones or earbuds. But if you’re like me, then you struggle sometimes to get a useful response to simple questions. So then you take your iPhone or Android phone out of your pocket, face-unlock the phone, type your query, scroll through search engine results, etc., and I’m getting tired just typing that out.

Even if Apple uses ChatGPT, that doesn’t solve the problem of ChatGPT yielding garbage answers unless I use a very specific prompt, which is really not much better than having to enunciate unusually clearly to get speech-to-text to recognize my words. For now I’d like to accept that Rugolo & company have implemented a better conversational AI somehow, and that they haven’t simply created a prompt wrapper around an existing AI service. (WiFi is required, which suggests at least some processing takes place off device.)

Many assistive tech apps for blind and visually impaired people require the user to hold the phone and move it around so that objects of interest are in view of the rear-facing camera. Holding one’s arm up for extended periods of time can be fatiguing. Just try holding up your hand in the air for five minutes.

A wearable camera such as a pair of smart glasses eliminates the problem of cognitive load and body fatigue. AirPods and the like provide hands-free operation, but only via smartphone assistants that, to date, have been useful for just certain tasks.

Smart Glasses Haven’t Hit the Right Price/Utility Point

Smart glasses are the sort of hands-free cameras we can expect to wear in the future. Currently they’re overpriced for most budgets and underpowered for many needs. Like a lot of fresh tech, smart glasses can be fun for a week or two, after which many pairs just collect dust at the bottom of a drawer, or end up on eBay.

If smart glasses don’t have open APIs to allow 3rd parties to develop apps, then the user is stuck with whatever features and apps the smart glasses maker deems useful.

Good Conversational AI could cut short the Chatbot Acrapolypse

Multi-turn conversational AI is tough. ChatGPT, Claude, and Gemini can pull off multi-turn conversation, sorta, but the resources required to support large language models (LLMs) are mind scrambling. “Mind scrambling” is the many-millions-of-dollars-per-month step beyond “mind boggling.”

Cathy Pearl’s book Designing Voice User Interfaces explains conversational AI in simple and clear language. The book was written before ChatGPT was released, but design lessons from Pearl are timeless.

Beyond the limited turn-taking with Siri and Google Assistant, our everyday experiences with conversational AI are with chatbots.

We’re stuck with the ever more numerous and invariably disappointing chatbots of websites and customer support phone lines.

Go ahead and sigh with me. Take a deep breath. Think of Corgi puppies.

Can confirm: totally accurate.

The idea of a chatbot seems cool: an interactive agent that answers your questions within a well-defined context, such as billing support for a utility company, or finding out what legal forms need to be signed to be added to an elderly parent’s account. But in practice the main beneficiary of chatbots would appear to be the companies that want to make customer support even cheaper. In my experience so far, chatbots are not at all appropriate for users who want to TALK TO A REPRESENTATIVE, get questions answered, and not get stuck in a query-response loop designed to steer them away from a human in customer support.

When someone demonstrates a wearable, hands-free “speech-to-speech” interface with minimal latency — a short time span from your question to the device’s response — then that’s something to pay attention to. Very close attention. And iyo’s audio computer might provide that.

And if a consumer can buy such a device for less than U.S. $1000, then surely some development team led by a conversational AI specialist could make a better chatbot. Right? Please?

A Reference Librarian Sitting on Your Shoulder

Librarians are cool. They’re good at answering questions. It’d be great to have a personal librarian available at all times. A librarian sitting on your shoulder, whispering to you much of what human culture has learned over millennia, would be most excellent.

However, an adult librarian is too heavy to carry on a shoulder. They’re not parrots. (If there’s a librarian parrot somewhere in the world, contact me immediately.) Also, the cost of feeding and caring for a librarian is relatively high. So we settle for parrots on our shoulders and librarians in libraries and coffee shops.

But how about a reference librarian, amanuensis, and polymath who speaks quietly to you, and who weighs no more than a pair of headphones?

That’s an intriguing notion, and worth the $69 down payment / bet to “reserve” the iyo device, especially if it proves capable of combining librarian-like knowledge with multi-turn conversational AI.

A Problem Otherwise Hard to Solve: The Cocktail Party Effect

Have you heard of the cocktail party effect? People with sufficient hearing can typically pick out a single voice in a crowd — in a noisy cocktail party, for example — and then listen to what the person with that voice is saying. Conversation surfing, listening to one person then the next, is a fun sport.

Think for a moment how a computer would need to be programmed to pick a single voice out of a crowd, and then to keep that voice isolated as voice and noise levels change over time. Sure, that involves microphones, maybe even two or more microphones, but . . . then what?

This is a notoriously hard problem to solve. It wasn’t that long ago that a research project was lauded for its ability to distinguish between a singer singing and a piano playing. The claim in the MIT Technology Review article title that the problem was “solved” in 2015 was an overreach. The UK researchers were separating voices in songs. They were not letting a casual user make an arbitrary, on-the-fly choice to pick out any single speaking voice from the hubbub of many other voices in a room with other sources of noise, such as silverware clanking, a jazz trio playing on the stage, and an eager water waiter asking you if you want more water than the three liters you’ve already drunk out of politeness.

Omnidirectional microphone arrays can provide a hardware workaround for voice isolation: choose a direction to listen, use input from the shotgun-like microphone pointing in that direction, and process input from the other microphones. Then it’s possible to isolate the desired sound, and to suppress sound sources from other directions. But good microphone arrays aren’t that cheap, and few people will wear such an array.

(I’d wear such a microphone array in public, but not if I wanted my family to join me for dinner ever again.)
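To make the idea concrete, here is a minimal delay-and-sum sketch in Python: the textbook version of “pick a direction and add up the microphones,” with made-up array geometry, a made-up sample rate, and random noise standing in for real recordings. It is not anything iyo has published; it only shows why steering an array toward one talker is computationally cheap once you have the hardware.

import numpy as np

def delay_and_sum(signals, mic_positions, steer_azimuth_rad, fs, c=343.0):
    # signals:       (num_mics, num_samples) time-aligned recordings
    # mic_positions: (num_mics, 2) x/y coordinates in meters
    # Unit vector pointing toward the desired talker.
    direction = np.array([np.cos(steer_azimuth_rad), np.sin(steer_azimuth_rad)])
    # Relative arrival lag of a plane wave from that direction at each mic.
    lags = -(mic_positions @ direction) / c
    lags -= lags.min()                      # lag of each mic vs. the earliest one
    num_mics, num_samples = signals.shape
    output = np.zeros(num_samples)
    for m in range(num_mics):
        shift = int(round(lags[m] * fs))    # lag in whole samples
        # Advance each channel so the steered direction adds up coherently.
        output[:num_samples - shift] += signals[m, shift:]
    return output / num_mics

# Toy usage: four mics spaced 1 cm apart, steering 30 degrees off axis.
fs = 16_000
mics = np.array([[i * 0.01, 0.0] for i in range(4)])
noisy = np.random.randn(4, fs)              # stand-in for one second of real input
focused = delay_and_sum(noisy, mics, np.deg2rad(30), fs)

Sounds arriving from the steered direction add in phase; sounds from other directions add incoherently and are attenuated. The hard part iyo is claiming to handle is everything around this: deciding which direction (or which voice) to steer toward, and doing it continuously as people and noise move.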

Anyway: the iyo device is the first one I’ve seen to specifically highlight the ability to isolate a single voice from ambient noise, thus providing a solution to the cocktail party effect in at least some situations. The video shows a camera feed, and it’s not clear to me how a camera feed would be used in conjunction with earables (i.e. wearables for your ears), or how recognition of a specific person would make it not just feasible, but apparently straightforward, to ask that the speech of a named person be identified, and then translated.

I can imagine carrying around a backpack of multiple laptops running in a sort of data processing cluster, and burning my back as the GPUs puke out heat while processing streaming data. But maybe the iyo is using a combination of ASICs and Snapdragon chips to preprocess data before sending the data into the cloud?

Binaural Recording Headphones and Pass-Through Sound can provide Augmented Reality or Virtual Reality

Sennheiser Ambeo binaural recording headsets make it possible to record spatial sound. I have the Ambeo headset, and it’s allowed me to record all kinds of cool stuff in spatial audio: birds chirping, people passing by on a walking path, the sound of scissors and clippers as a barber cuts my hair.

Listening to a binaural recording you made while you’re standing in the same place where the recording was made — at the center of a living room, for example — is a freaky experience. Your brain convinces you that the sound of your family member walking by is a sound that is being generated now, in real time, and not a recording you made ten minutes ago. The spatial audio playback mixes with the very subtle soundscape of the room to recreate the sound as you heard it while you were recording.

Sennheisers, and later devices such as AirPods, can mix ambient sounds with audio playback. You can vary the mix between “pass-through” sound and playback sound. Thus you can continue to hear your surroundings, but also listen to music or to a podcast.

Bone conduction headphones such as Aftershokz allow a similar experience: the sound from the headphones passes through your bones and jiggles the mechanical parts of your ear — those weird little bones. Your ears remain unblocked, and you can hear what’s around you.

Whether you’re using Sennheisers, AirPods, Aftershokz, or some other headset that blends ambient sound and recorded sound, there’s still a potential problem: you can have only one locus of attention. If your locus of attention is on the podcast, or your current jam, then it’s not on the sound of the e-bike rapidly approaching.

Pay attention to the music and the dancing, but also think about calculus, the smell of your wet dog, and the baseball your child just threw at you. Good luck!

If you keep the sound level of audio playback low enough, and maintain a mix that favors more pass-through sound, you can remain alert to ambient sounds. You can keep your locus of attention on music — maybe not on the words of a podcast — and allow the security guards in your brain to alert you to sudden sounds, unexpected motion, and weird vibrations.
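For what it’s worth, the basic blend itself is tiny; here’s a toy Python sketch of a “transparency mix” that favors the pass-through side, with invented gains and synthetic signals. Real headsets do far more (leveling, EQ, occlusion compensation), so treat this as the cartoon version.

import numpy as np

def transparency_mix(ambient, playback, passthrough_ratio=0.7):
    # Linear blend of two equal-length mono buffers, favoring the ambient side,
    # clipped to the usual [-1, 1] float audio range.
    mixed = passthrough_ratio * ambient + (1.0 - passthrough_ratio) * playback
    return np.clip(mixed, -1.0, 1.0)

fs = 48_000
t = np.arange(fs) / fs
ambient = 0.3 * np.random.randn(fs)           # stand-in for street noise
playback = 0.5 * np.sin(2 * np.pi * 440 * t)  # stand-in for your current jam
out = transparency_mix(ambient, playback)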

As I’ll describe later: spatial audio, if generated on the fly in response to prompts, could make some wonderful applications possible.

Why I’m Enthused

Interface design has been a focus of my work since the mid 1990s. Simplicity, elegance, coolness, effectiveness, power — that’s what an interface / interaction designer wants to achieve.

To quote a phrase I thought up for a previous company: an interface should make clear your product is “embarrassingly better” than that of the competition. That’s a callback to the phrase “insanely great” from Steve Jobs.

Jaws should drop when you demonstrate an interface for the first time. Preferably, someone will cry with joy the first time you present an interface they didn’t know they wanted.

An interface should be elegant magic. Then the interface should fade into the background and become transparent, ever-present magic, leaving the user with new capabilities.

To weave that magic, we need different interfaces for different use cases.

We Need More Interface Options

It’s refreshing to break out of the moldy mold of WIMP (Windows, Icons, Menus, Pointer) interfaces and the desktop metaphor. Apple and Microsoft were inspired by (that is, copied) the desktop interface created by Xerox in the 1970s.

The desktop metaphor — files, folders, the trash basket, and so on — may have helped adoption in office environments of the 1970s through 1990s. Yee-haw. Witness my excitement and joy.

Now we’re stuck with a rather clunky metaphor. WIMP/desktop interfaces have become familiar — arguably “intuitive” — but such interfaces are hardly optimal for many tasks. Smart phone apps, game interfaces, and a very few web interfaces make clear that other graphical user interfaces are not only possible, but preferable.

What other user interface types are there? I’m spoiling answers to one of my favorite questions to ask interviewees, but only six of you out there have read this far — hello, Katherine! — so let’s proceed. The goal of my broad and shallow overview of user interface types is to provide a context in which to understand and appreciate the Audio User Interface (AUI) of the iyo earable.

Ooh, the Many __UIs: Types of User Interfaces

Some years ago I wrote a series of presentations to justify the design of a novel user interface, and I like you well enough not to subject you to that. Aside from finding links for those of you who like to follow links, I’m pulling everything below from memory, and from glances at my bookshelf. Google any subject further, as desired.

NUI: Natural User Interface

Touch & gesture interfaces are common now that we all own smart phones. These are sometimes called natural user interfaces, although few gestures that take place on smooth shiny glass screens could be considered natural. Pushing a virtual button is natural enough, swiping is easy to learn and remember, but multi-finger taps and weird pattern tracing (e.g. a “Z” swipe) have no analogue in our everyday experience.

Unusual gestures are hard to remember. Magicians have to study hard to make the right gestures. I don’t want to become an expert magician just to find that nutrition app on my phone.

The book Brave NUI World by Wigdor and Wixon is not quite the book I’d hoped it would be. I appreciate what the authors have done, but there’s a Microsoft-y feel to the book: a lack of distinctive personality, with any quirky edges sanded off, leaving an HR-approved blandness.

It would be grand and science fiction-y to walk around the world and point, poke, pull, and rotate objects to make computers and robots do fun things. But such interaction belongs to tangible user interfaces (TUIs).

TUI: Tangible User Interface

What if you could perform magic with a wooden wand? Or a pencil? Or just with your hand? Or what if placing your breakfast banana in a certain position on the kitchen counter indicates (for some reason) that your recipe app should add bananas to your grocery shopping list?

A tangible user interface allows you to configure or program a digital device like a computer using physical objects.

Many museums and schools in the U.S. saw a burst of TUI creativity when the original Microsoft Kinect 3D sensor (based on PrimeSense tech) became available for about $140. The Kinect could be used to detect the movements of users’ hands (or some other moving object). Dragging one’s hand through sand would cause changes to a multi-color topographic map projected onto the sand.

A pair of hands reaching into a box of lumped sand. A multi-color topographic map is projected onto the sand, making it look like a 3D landscape with mountains and water.
From https://www.wikiwand.com/en/Tangible_user_interface

A delightful example of a tangible / tactile user interface for programming is Code Jumper from the American Printing House for the Blind. Code Jumper is a bit like the Scratch programming language made into blocks suitable for young hands. It was cool to see Code Jumper at the 2020 CSUN convention for assistive tech.

https://www.aph.org/product/code-jumper/

Components of an APH Code Jumper kit. Lots of little plastic shapes that can be connected by cables to create a program that runs.
https://codejumper.com/about.php

The book Designing Interactions by the late Bill Moggridge of IDEO gives a number of examples of tangible user interfaces. And if I’ve misremembered, and TUIs are covered in one of the other books on my bookshelf, go get Designing Interactions anyway. Moggridge’s book is a platinum mine of design ideas, interviews, and inspirations. There’s a follow-up book, too, called Designing Media.

ZUI: Zooming User Interface

Our smart phones implement a limited type of zooming user interface, known to us largely through pinch & zoom gestures.

But the ZUI has a long history, and a potentially much broader set of features. A narrow slice of ZUI capabilities is described in the Wikipedia article.

The most consciousness-altering book on the subject of ZUIs is The Humane Interface by Jef Raskin. Raskin was the Apple employee who started the Macintosh project. From the Wikipedia entry:

Raskin started the Macintosh project in 1979 to implement some of these ideas. He later hired his former student Bill Atkinson from UCSD to Apple, along with Andy Hertzfeld and Burrell Smith from the Apple Service Department, which was located in the same building as the Publications Department.

Raskin is a deity in the pantheon of user interfaces gods.

If The Humane Interface by Raskin doesn’t break your design brain a little bit, then maybe you weren’t properly broken by The Design of Everyday Things by Donald Norman, namesake for the Norman Door.

I’ve designed and implemented a ZUI with all sorts of gee whiz features. That might be a subject for a future post.

VUI: Voice User Interface

Phone chatbots are voice user interfaces. And it’d be great if more phone-based chatbots responded quickly to our yelling “Representative! Operator! Representative! Human!” so that we can talk to a human, and stop talking to the chatbot.

I’ve already mentioned the book Designing Voice User Interfaces: Principles of Conversational Experiences by Cathy Pearl. And there’s a video! Shouldn’t more chatbot designers worship at the altar of Pearl’s book daily?

There may be newer books than Pearl’s 2016 book, but as a reference book on VUIs and audio/speech interactions Pearl’s book remains great. Find a copy!

CLI: Command Line Interface (or just “command line”)

If you’ve used a command line recently, by which I mean in the past half century, then the command line prompt probably looks like this:

C:\>

For many of us, the command line looks like this:

firstnamesurname@overly_long_device_name currentFolderName %

A command line prompt waits for your command. You enter a command that you hope you’ve recalled correctly. If you don’t quite remember a command, maybe you swear a bit and think about recognition vs recall. Then you google to figure out what switches must be passed to the “help” command to make it actually helpful.

Or maybe you’re typing out Git commands at the command line. So brave!

Command line interfaces are fast, provided you have the commands memorized, and that you’re an accurate typist. And that you value your time in a certain way.

REPL: Read, Evaluate, Print, Loop

LISP, Python, Julia, and other languages support a REPL. It’s not just a simple command line, but an interactive way to run code.

Do you want to know the value of 2 raised to the 127th power? Then a REPL is a nice choice. Type something like “2 ^ 127” at the prompt and you see the answer printed out immediately (as far as humans can tell). You don’t have to write a code file and then compile it.
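In a Python REPL, for example (where exponentiation is spelled ** rather than ^), the whole round trip looks like this:

>>> 2 ** 127
170141183460469231731687303715884105728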

As Jef Raskin suggested, a REPL can be a great complement to a GUI. Again: go read The Humane Interface by Raskin.

API: Application Programming Interface

An API is an interface?

Yes. The word “interface” is in the name for good reason.

If you’re writing an API of functions that other programmers call from their code, then you’ve created an interface. An API is an interface between your mental model and another developer’s brain. You hope their mental model matches yours. How you architect your code, how much documentation & help is required to understand your API, and (especially) the naming conventions you use determine how easy or hard it is to use your code.

When a developer uses your API by (figuratively) pushing a button labeled “Go!” on the black box encompassing your million lines of code, does the result match the developer’s expectations? Does the software go?

If the software doesn’t go, is the confused developer chided to “Read The Free Manual” (which stinks), take a video crash course in telesurgery, and also deduce your mental state back in 2017, all to understand when and how to use the function

execute_current_context(&ctx_unpointer, numThings, &&*^%?criticalRawData)?

(Over time, many languages evolve toward a version of LISP, but with an extended character set that looks like censored profanity. And for good reason: the languages become less usable, and profanity is appropriate.)

This is where Strunk & White and Universal Principles of Design are your guides. Name your functions meaningfully and elegantly.

And most definitely read The Philosophy of Software Design by John Ousterhout. Design your interfaces to have deep classes of relatively few functions. Avoid designing interfaces of broad, shallow classes. (For “class”, substitute the word “library” or “function”.)
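As a toy illustration of the deep-versus-shallow idea (my invented example, not Ousterhout’s): a small cache class whose entire public surface is get and set, with the persistence details hidden inside, versus a shallow design that would force every caller to sequence the load/parse/serialize/flush steps themselves.

import json
from pathlib import Path

class FileBackedCache:
    # Deep module: two public methods; loading and saving are handled internally.

    def __init__(self, path):
        self._path = Path(path)
        self._data = json.loads(self._path.read_text()) if self._path.exists() else {}

    def get(self, key, default=None):
        return self._data.get(key, default)

    def set(self, key, value):
        self._data[key] = value
        self._path.write_text(json.dumps(self._data))  # persist on every write

# A shallow design would instead expose load_file(), parse_json(), validate(),
# serialize(), flush_to_disk(), ... and push the correct ordering onto every caller.

cache = FileBackedCache("settings.json")
cache.set("volume", 7)
print(cache.get("volume"))  # -> 7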

Check out Ousterhout’s talk at Google:

You should still read Ousterhout’s book after watching this video. Read all the books.

Multi-Modal Interfaces (Multi-Sensory Interfaces)

There’s a long and not always successful history of interfaces meant to provide feedback using some combination of visual display, sound, haptics, and sometimes even heat, static resistance, and so on.

Some years ago I had a haptic mouse that would vibrate when it passed over a web page element. It was cool for a few days.

Let’s stick to graphics + sound for now.

Elizabeth Wenzel is quoted in Dick Lyon’s book Human and Machine Hearing:

The function of the ears is to point the eyes.

Elegant!

Wenzel has worked on the integration of vision and spatial audio for virtual reality and augmented reality. Despite her minimal web presence, her work has been highly influential — at least it’s been highly influential on me.

When virtual reality or augmented reality combine visual feedback and spatial audio, the experience can be properly immersive.

And that leads us at last to the iyo audio computer’s AUI, and to the applications for which an AUI may be particularly well suited.

AUI: Audio User Interface (iyo’s terminology)

I accept Rugolo’s claim that an audio interface is natural. I accept this not because the idea goes down easy, but specifically because a speech-to-speech interface is something I’ve wanted for a while. Lots of us have.

“Computer, when will we arrive at Ceti Tau? . . . Can we go faster? . . . Are we there yet? Are we there yet?”

After many years of sci fi wishful thinking, maybe we’ll have a good audio interface at last. We’ll be able to talk with our computers in multi-turn conversations, rather than talk at our computers with precise enunciation and constrained diction.

A general-purpose AUI coupled to one or more large language models, and an ability to isolate individual voices or other sound sources, and (one hopes) simultaneous access to streaming video, would be cool even for a system closed to outside developers.

As with our smart phones, though, the utility of an audio computer with an AUI would be greatly enhanced if the world of 3rd party developers can write applications to run on the device. We could focus on applications, and treat the AUI as a black box with an open or semi-open API. My guess is that we’ll have to wait a bit for such an API. Providing an API to developers is expensive, and would take resources away from early development.

Audio Developer Programs and (Semi-)Open APIs

Bose had a virtual reality developer program that I joined, but they nixed the program during the pandemic, just before I planned to start development. So now I have a pair of Bose Frames collecting dust. (They’re cool, but kinda bulky, and don’t have prescription lenses, so I’ll wear my Aftershokz or cheaper wired earbuds instead.)

Although 3rd party apps created by independent developers can encourage wider adoption of a piece of tech, creating a developer program requires money, support staff, time, and a commitment to keep support going indefinitely. To summarize: money, money, money.

Apple did not offer developers a means to create native apps for the first generation iPhone. Even now there are some rough edges to app development, approval, and distribution for smart phones, but Apple and Google have both demonstrated how this can be done.

Creating the App Store or Google Play: wow. That’s an undertaking.

Given that Rugolo previously worked at Google, it’s not hard to imagine that Google would be first in line to acquire iyo if/when the technology sells well enough. Maybe the iyo earables will be a first generation proof of concept, after which the core technology will be licensed out, or built into smart phone peripherals.

As an Axios article points out, Humane’s AI pin hasn’t done well. Even Picard is reluctant to put it on.

The iyo staff number a scant two dozen (as of late April 2024), and they have a lot of work ahead of them. Adding a developer program to support a public API would mean paying for the equivalent of another whole team. Such a team could number as many as 8 people, or 4 people doing the work of 8. Even then, it would likely make sense to support such a team only after the software and hardware have stabilized in version 2 or 3.

But I’m so hopeful. Someone please give them an additional $10 million — why not a cool $100 million? — to hire a team that can develop an open API in parallel. Sure, that new team might just get pulled into the development rush, but still, it’d be cool to have that API.

And now, here’s another TMI diversion into tech to help explain what the iyo creators may have done to solve the cocktail party problem with minimal hardware.

The Head-Related Transfer Function (HRTF)

The HRTF concerns how sound in the world is perceived by you, a listener, whose ears, shoulders, and head all affect the timing and frequency characteristics of the sound you hear. A different listener’s head, shoulders, ears, and brain are different, and hence their HRTF is different.

Measuring an individual person’s HRTF requires a crazy complicated speaker array as shown in the iyo video: lots of speakers in a large cage sort of structure, a real-world Thunderdome of sound: two ears listen, one HRTF leaves.

If you can measure an individual’s HRTF, then you can create a stereo signal that mimics real-world spatial audio. You could generate sound and locate that sound in space to improve immersion.

Binaural recording headsets such as the Sennheiser Ambeo record the sound in your ears — the microphones tuck right inside — so the sound is recorded as you hear it. Then, on playback, the audio experience is eerily like real-world, real-time sound, with all the subtle background noises that convince your brain you’re hearing sound in the space where you’re standing or sitting.

Generating a personal HRTF is tricky. Not everyone has a Microphone Thunderdome at home. Thus, an average HRTF that works for most people was the simple solution: generate the HRTF once, and it’ll do a good job of generating spatial audio that works for many if not most people.

Historically, a general-purpose “average” HRTF has been generated using a model of a human torso called a KEMAR simulator. Or, if I recall correctly, the simulator is used to generate a head-related impulse response (HRIR), and then the HRIR gets mathed into an HRTF (a Fourier transform turns the impulse response into a transfer function).

The simulator looks a bit agitated.

The simulation dummy consists of a head and shoulders, with a facial expression suggesting it is in the midst of saying something.
“You talkin’ to me? You talkin’ to me?!?”

Digital Audio Workstation (DAW) software such as Reaper supports plugins that can apply the KEMAR HRTF to a recording. The result is spatial audio, “3D sound,” for playback on stereo headphones. (Sennheiser’s plugin is good.) Just don’t ask me to remember all the incantations and dances required to make that happen in Reaper. Maybe it’s easier now.
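Under the hood, applying an HRIR pair to a mono source is mostly just convolution: filter the signal once with the left-ear impulse response and once with the right-ear impulse response for the chosen direction. Here is a minimal Python sketch; the two “HRIRs” are fabricated placeholders (a slightly delayed, quieter impulse for the far ear), where a real renderer would load measured KEMAR data for the desired azimuth and elevation.

import numpy as np

def render_binaural(mono, hrir_left, hrir_right):
    # Convolve the mono source with each ear's impulse response and
    # return a (num_samples, 2) stereo buffer.
    left = np.convolve(mono, hrir_left)[: len(mono)]
    right = np.convolve(mono, hrir_right)[: len(mono)]
    return np.stack([left, right], axis=1)

fs = 44_100
t = np.arange(fs) / fs
mono = 0.5 * np.sin(2 * np.pi * 330 * t)       # one second of a test tone

# Fake HRIRs: the far ear gets a delayed, quieter impulse, the crudest
# possible stand-in for interaural time and level differences.
hrir_left = np.zeros(64);  hrir_left[0] = 1.0
hrir_right = np.zeros(64); hrir_right[8] = 0.6

stereo = render_binaural(mono, hrir_left, hrir_right)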

I haven’t tried Apple’s spatial audio since Apple has quite enough of my money and mindshare already. But from Apple’s online help page, spatial audio seems straightforward to set up. Maybe it’s good and it generates a good HRTF.

I’d be surprised if the 3D sensor in one of the honkin’ big iPhones generates a 3D cloud of points that serves as a good enough model of your head to generate a high-quality HRTF, or the equivalent. But I’ve been surprised before.

Spatial Audio from Recording/Playback Integration?

If software can process the audio signal to isolate individual speakers in a crowded restaurant, as demonstrated in the iyo video, then maybe the mass of hardware and software to record and play binaural / spatial sound is no longer necessary. Perhaps a generalized HRTF model, two microphones, and sufficient processing power can all be integrated into the iyo One headset to provide a seamless, immersive audio experience of spatial sound.

Seamless: that’s one of the power words in interface design.

As an engineer I want to know how it works. As a user I just want it to work.

Uses of Spatial Audio

In the video, timbre is mentioned briefly. Human voices and musical instruments have distinctive timbres: highly individual mixes of frequencies, amplitudes, and other sonic complexities that make the voice or musical tones distinctive. An oboe, a bullfrog, and your cousin Yuko may all sing the same note, but they’ll sound very different.
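For a rough numeric feel of what “distinctive timbre” means, here is a toy Python sketch: two synthetic tones share the same 220 Hz fundamental but have different (invented) harmonic mixes, and a simple summary statistic, the spectral centroid, already separates them.

import numpy as np

fs = 22_050
t = np.arange(fs) / fs
f0 = 220.0

def harmonic_tone(amplitudes):
    # Sum of harmonics of f0 with the given per-harmonic amplitudes.
    return sum(a * np.sin(2 * np.pi * f0 * (k + 1) * t)
               for k, a in enumerate(amplitudes))

def spectral_centroid(signal):
    # Amplitude-weighted mean frequency of the signal's spectrum, in Hz.
    spectrum = np.abs(np.fft.rfft(signal))
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / fs)
    return float(np.sum(freqs * spectrum) / np.sum(spectrum))

mellow = harmonic_tone([1.0, 0.3, 0.1])           # energy mostly at the fundamental
reedy = harmonic_tone([0.4, 0.8, 0.9, 0.7, 0.5])  # energy spread across harmonics

print(spectral_centroid(mellow))   # lower centroid: a darker timbre
print(spectral_centroid(reedy))    # higher centroid: a brighter, reedier timbre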

Try this sometime: if you hear a colleague or family member cough, ask yourself whether you can identify the person just by the cough. As long as you’re sufficiently familiar with someone’s voice, you’re likely to be able to identify them by their cough alone. That’s the case even if you can’t see the person, and even if you and the cougher are in a room with many other people. I think that counts as timbre identification, but I’m not a timbrist by trade.

If spatial audio is generated on the fly for multiple sources, and if each source has a distinctive timbre, then it’s possible to create an expressive, totally virtual sound environment. The user can mentally track virtual objects that have distinctive timbres. It’d be a sort of virtual cocktail party.

In other contexts, a cocktail party effect-supporting UI could represent real-world sounds, such as a herd of Corgi puppies running about.

Spatial Audio Playback vs. Real-Time Spatial Audio Generation

It’s already possible to create recordings of spatial audio, and to play back the spatial audio on any old pair of stereo headphones or earbuds. In just the past few years, improved head-tracking headphones (for your head) have become available.

Head tracking makes it possible to present a virtual sound source so that the sound source seems to remain in place as you turn your head about. Immersion improves as the stability improves for the 3D location of a virtual sound: that T. Rex’s guttural “time for lunch” mutterings should remain over there, near that tree just five meters away, even as you move about and turn your head. Carefully.
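The bookkeeping behind that stability is conceptually simple: keep the virtual source fixed in world coordinates and recompute its direction relative to the head every time the yaw sensor updates. Here’s a 2D Python sketch with made-up positions; the hard engineering is in low-latency tracking and in the HRTF rendering that then uses this angle.

import numpy as np

def source_azimuth_relative_to_head(source_xy, listener_xy, head_yaw_deg):
    # Angle of the source from the listener's nose direction, in degrees.
    # 0 = straight ahead, positive = to the listener's left (counterclockwise).
    dx, dy = np.subtract(source_xy, listener_xy)
    world_azimuth = np.degrees(np.arctan2(dy, dx))  # angle in the world frame
    relative = world_azimuth - head_yaw_deg         # subtract where the head points
    return (relative + 180.0) % 360.0 - 180.0       # wrap to [-180, 180)

trex = (5.0, 0.0)        # the T. rex by that tree five meters away, due east
listener = (0.0, 0.0)

for yaw in (0.0, 45.0, 90.0):                       # the listener turns left
    print(yaw, source_azimuth_relative_to_head(trex, listener, yaw))
# 0 -> 0.0 (dead ahead), 45 -> -45.0, 90 -> -90.0 (now off the right ear)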

If you combine spatial audio with head tracking, real-time generation of spatial sound with virtual sound sources, and a speech-to-speech interface, well, you’ve got not only your shoulder librarian, but also an audio augmented reality experience. This AR audio would remain centered on you as you walk around the world. So…

You’re a swimmer deep in the ocean when a humpback whale swims by on your left. But you’re actually sitting on your couch, immersed in the experience.

A dogfight between World War II fighter planes takes place overhead as you eat breakfast. (Note: creating spatial audio so that it appears to come from overhead is hard. Spatial audio is generally much better in the nominally horizontal plane that passes through our ears.)

You walk through an orchestra pit and hear the individual instruments near and far. After a few steps, you identify that annoying sound: the oboe player has been chewing gum loudly while counting off the bars until it’s time to play again.

Think up your own immersive applications.

And I probably should have filed patents years ago for some of the applications of spatial audio I worked on, but maybe there’s a chance yet to implement those applications if iyo hasn’t already done so.

The Shoulder Librarian Again

The shoulder librarian really is the key feature, isn’t it? If I can ask my shoulder librarian just about any question and get a reasonable answer most of the time, I’d accept the occasional AI hallucination. I’d live without augmented reality audio that lets me swim with whales. I’d wait patiently for API access.

Large language models can be terribly disappointing if you expect them to deliver results that are accurate for 95% or more of your queries. You don’t want to study to become a prompt engineer just to use the LLM. Maybe I’m stretching myself by assuming prompt engineering will exist as a profession by the time you finish reading this article.

But if you allow for your shoulder librarian to be a bit quirky, with a hint of the crazy cat person who gives a weird answer sometimes, then that’s just fine. Who doesn’t have warm, cozy feelings for their local librarians and their quirks?

Applications for an AUI

So here’s what I want from my future shoulder librarian, the audio computer and its various features and services.

B2B applications for hands-free augmentation in factories

Smart glasses and augmented reality headsets such as Hololens, Google Glass, and others switched business models from B2C (business to consumer) to B2B (business to business). iyo might try the same.

For the typical customer, $3000 for a head-worn device is a significant outlay of cash. For a business such as a factory mass producing expensive products, a lost minute of production can cost $40,000. If twelve units of a $1000 device purchased for engineers and production line workers save just a few minutes a day per person, then Pat in accounting might like that.

A number of companies working in augmented reality and virtual reality have sought contracts with major manufacturers. The disadvantage of something like smart glasses, though, is that they partially obstruct your vision. And augmented reality may be useful only during training, such as presenting the internals of an engine in an exploded 3D wireframe view so that the trainee understands how the engine parts fit together.

On a production line, with workplace safety concerns and legal liability for accidents, it can be hard to imagine someone wearing smart glasses for an 8-hour shift. Partially obstructed vision, a locus of attention on unreal 3D images, noise levels that already hamper one’s ability to think, and the need to adapt to people who require corrective lenses — these are all problems that can impact the everyday utility of smart glasses for someone working on a production line.

A speech-to-speech AUI that can work in the presence of noise (!), that can filter out some ambient noise, and that doesn’t obstruct one’s vision at all, addresses some concerns about safety and utility.

Imagine a line worker who could ask questions of a factory or warehouse management system via speech-to-speech interface that can isolate the worker’s voice from a noisy background.

  • “How many units of model A are coming down the line?” (How should I prepare myself for the next batch of products?)
  • “The product barcode indicates that this is product B, but it has trim [outer components] that suggest product C. Is this really product B?”
  • “Did the shipment from supplier S arrive at the dock on time?”
  • “For the robot installation in front of me now, which way do the X, Y, and Z axes point?” (It can be tedious to track down such simple info. Are the coordinate axes spray painted on the ground? Did someone tape a note on the side of the controller cabinet?)
  • “What’s for lunch in the cafeteria today?” (In a large building, the cafeteria can be a long walk away.)

There’s a potential issue with someone speaking loud enough to hear their own voice — something we prefer to do, when we can, but that leads to hoarseness after working in a loud factory. For audio-only conversations, we also need feedback to confirm what we’ve asked, and an easy way to correct the intelligent agent’s (mis)understanding of what we’ve said.

B2B applications in offices

Cubicle farms. Hot swap desks. Factory offices. For office workers it’s easy to access information on a Windows or Mac computer. Easy, that is, if you ignore the many unpleasantries and inefficiencies of the WIMP/desktop model of windowing operating systems.

Given that Chris Stentorian in the neighboring cubicle talks much of the day on the phone at a volume loud enough to keep all of the hungover college interns awake, and to simultaneously cause you to lose your place reading documentation, it wouldn’t be a big deal if you interacted with an intelligent agent via speech-to-speech interface. And perhaps Chris’s voice would be filtered out.

Some queries in the office:

  • “Open the local user bin folder.” (Please don’t make me google the location again.)
  • “Find that document about non-linear optics Jean and I thought was cool when we read it last year.” (In-document search works better on some OSes than others.)
  • “Tell me who from accounting could best answer a question about the cover for this TPS report.”
  • “How does our P&L [profit and loss statement] look today?”
  • “Call the phone company and keep yelling ‘Representative!’ or whatever it takes to get a human on the phone. Then let me know when you’ve confirmed you have a real person on the line.”
  • “How are our bookings for the quarter relative to the projections we made last quarter? Should I be polishing up my resume?”
  • “At what time could the entire Tiger Tornado Tactical Team meet in person this week?” (Don’t make me use Outlook or write a scheduling poll. I beg you.)
  • “Please summarize in a polite, professional way the following criticism: Pat, the new feature is not only weeks late, which has held everyone else up, but the customer wanted a blue Execute button for their app running on Windows 11, not . . . [sigh] . . . 11 buttons labeled Windows 1 through Windows 11, each of which — and I quote your closing comment on the Jira ticket here — ‘executes the color blue.’ Let’s meet after lunch to talk.”
  • “Remind me which meetings I have tomorrow.” [Asked while walking down the hall.]

B2B applications in the field

Salespeople, field service engineers, and others travel to customer sites, sometimes having to take a combination of planes and cars to do so. Travel, meals, and hotels can be a significant expense. Cutting costs without inconveniencing the traveler would be great. More importantly, a traveler should be able to get work done without taking on undue stress, and without having to devote mental energy to squeezing efficiencies from every costed expense.

Queries to the AUI:

  • [While driving] “Call the customer and let them know that there’s an accident on the highway, and that I’d like to know if we can reschedule our 1 o’clock to 3 o’clock. Alternately, we can reschedule for tomorrow morning. Tell me when you’ve got this sorted out.”
  • “Where is the closest restaurant with halal dishes that’s open after 8pm?”
  • “Translate for me what the receptionist just said. . . . Okay, now tell me how to say ‘I don’t mind waiting here until Ms. Demir is free from her meeting.’ ”
  • “Given the announcement this morning of competitor C’s new product, how should I change my pitch to emphasize my company’s superior tech support?” [Asked while walking through the parking lot to the customer’s lobby, fifteen minutes before the pitch.]
  • “In local culture, is it a faux pas to talk about one’s family during a business dinner?”

B2C Applications — Use Cases for Everyday Life

Keeping the limitations of LLMs in mind, here are a few things I might ask while doing everyday stuff.

  • [While walking] “Is the local library still open? Will I make it in time at my current pace?”
  • “Is it legal to turn right on a red light in this city?”
  • “How fast can a Corgi run?”
  • “Is there an analytic expression for the Euler spiral?”
  • “Did the author of the book ‘Reading in the Brain’ write any books after that?” (N.B. Check out the video “How the Brain Learns to Read” with Dehaene.)
  • “What are some jokes about apples that my niece in kindergarten might find funny?”

What It’s All About

Out of everything we can imagine the iyo audio computer could do, what will it be able to do upon release? When might other desirable features and services be released?

Is this the personal intelligent agent we’ve been waiting for?

Play That Funky Music

And the iyo One will play my music, right?


Gary Bartos

Founder of Echobatix, engineer, inventor of assistive technology for people with disabilities. Keen on accessible gaming. echobatix@gmail.com