Francis Goodling writes:
When you cut [remove a frame from a movie reel], you also lose a frame’s worth of the magnetic strip that holds the soundtrack. And while a missing 1/24 of a second is undetectable to the eye, it turns out that 1/24 of a second in lost sound is impossible to miss: there is a tic in the music, a skip in the background noise, or a word that has a bite taken out of it. You can’t see the lost frame, but you can hear it. At 24 frames per second, the eye loses track and registers seamless animation, but the ear is counting time.
Interesting. In Delivering data differently, Gwynn, Jonathan, and I talked about the comparative advantages and disadvantages of looking vs. listening as modes of understanding data. With visualization you can consider a lot more data at once: this is a result of visualization occurring over space, whereas sounds develop over time. Also, our brains can just process a lot more information through visual than sonic channels. Two advantages of sonic perception that we did consider were:
1. To gather visual information you have to look; you can’t do much visual processing in the background. In contrast, you can notice sound without paying attention. This suggests a role for sonic information transfer for diagnostic methods where the goal is not to draw inferences, perceive complex patterns, or synthesize large amounts of data but rather to be alerted to sudden changes.
2. Sound and music are emotionally engaging in a way that visual images typically are not. Sounds can be soothing, annoying, or all sorts of other things. Perhaps there is some way to harness this emotional connection for the purpose of learning from data.
Goodling’s above-quoted remark suggests a third distinction:
3. With vision, our brain fills in gaps so it can be difficult to notice small changes; consider those “find 7 differences between these two pictures” puzzles they used to have in the kids’ pages in the newspaper. In contrast, sonic gaps or discordances jump out at you; recall that famous Bugs Bunny cartoon with the wrong note on the piano. It’s similar to how a block of wood can look smooth, but if you run a finger along it you’ll feel the imperfections (and maybe even get a splinter)!
When I was first thinking about general data perception, I was thinking of other senses as replacements for sight, thus trying to come up with sonic or haptic alternatives to scatterplots, time series plots, and so on. My current thinking is to set aside the things that vision can do best, and instead focus on the applications where the other senses are effective in ways that vision is not.
Perhaps there is something to be gained by merging the two senses. I haven’t explored these tools, but there are some that turn data into sound (for example https://twotone.io/). Maybe the sound of data will include some features that visualization misses?
I see that you address this (to an extent) in your linked paper. More for me to learn about.
An aside, but there is some really fascinating (to me) neuroscience on this topic. For instance you may have noticed you can perceive information more accurately if you combine senses (e.g., watching someone’s lips while listening to them speak). When your senses disagree, like when watching a ventriloquism act, you tend to put more weight on the more reliable sense. In this case, vision is better for sensing location than hearing is, so you mistake the puppet for the speaker. There are opposite examples though, like hearing is much more accurate for counting rapid clicks than vision is for counting rapid flashes.
The coolest part is this is all basically Bayesian. The posterior estimate is a function of the two pieces of evidence, weighted by their reliability. If you degrade the visual quality, you lean more heavily on the auditory signal, and it all follows very closely what Bayes’ theorem would predict. Here’s one of the many papers on this: https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0000943
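As a rough illustration of the reliability weighting described above, here is a minimal sketch. It assumes each sense gives an independent Gaussian estimate of the same quantity and that the prior is flat; those assumptions, the function name `combine_cues`, and the numbers in the usage lines are illustrative and not taken from the linked paper.

```python
def combine_cues(mu_v, sigma_v, mu_a, sigma_a):
    """Combine a visual and an auditory estimate of the same quantity.

    Assumes independent Gaussian cues and a flat prior, so the posterior
    mean is a precision-weighted average of the two estimates.
    Returns (posterior mean, posterior sd, weight on the visual cue).
    """
    precision_v = 1.0 / sigma_v ** 2   # reliability of the visual cue
    precision_a = 1.0 / sigma_a ** 2   # reliability of the auditory cue
    w_v = precision_v / (precision_v + precision_a)
    mu = w_v * mu_v + (1.0 - w_v) * mu_a
    sigma = (1.0 / (precision_v + precision_a)) ** 0.5
    return mu, sigma, w_v


# Ventriloquism-style setup: vision locates the source at 0, hearing at 10.
# Sharp vision (sd 1) vs. blurry hearing (sd 3): the estimate stays near 0.
print(combine_cues(0.0, 1.0, 10.0, 3.0))
# Degrade vision (sd 3) and the estimate moves toward the auditory location.
print(combine_cues(0.0, 3.0, 10.0, 1.0))
```

Under these assumptions the combined estimate is never less precise than the better single cue, which matches the lip-reading example: the two senses together do better than either alone.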
Data perception/correction might be more general than sight vs. sound. When a tourist guide is not a native English speaker and proudly says in English to the tourists, “This morning we will bypass all the famous sites of this lovely city,” we instantly (?) recognize that the guide means to say, “This morning we will pass by all the famous sites of this lovely city.”
Of course, it works the other way when, for example, a native English speaker attempts to speak a tonal language such as Chinese.
You can think of the auditory system as being sensitive to ~4 bytes of instantaneous information (16 bits of pressure on 2 eardrums), whereas the visual system is sensitive to much much more (lots of pixels for each eye, each with its own color, I don’t know a rule-of-thumb bit estimate, but I imagine VR folks do). So it makes sense we can sample sonic space more frequently than visual space (~48K times per second as opposed to 24).
A big real-world advantage of sound involves occlusion. It is easy to hide yourself visually (stand behind a tree) but hard to do sonically (because of your breathing, the twig you step on). I can’t think of a data analysis analogue of that, but it would be cool if it could be done.
I’ve often wondered if there was a way to use sound to perceive very large mathematical sequences — if we could somehow normalize them we might be able to hear hundreds of thousands of sequential elements as audio.
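One way this could be tried, as a sketch only: rescale the sequence linearly to the range [-1, 1] and treat each element as one audio sample. The linear rescaling, the 48 kHz rate, and the function names are assumptions for illustration; whether the result sounds like anything meaningful depends entirely on the sequence.

```python
import io
import struct
import wave


def normalize(seq):
    """Linearly rescale a sequence to [-1, 1] (all zeros if it is constant)."""
    lo, hi = min(seq), max(seq)
    if hi == lo:
        return [0.0] * len(seq)
    return [2 * (x - lo) / (hi - lo) - 1 for x in seq]


def sequence_to_wav_bytes(seq, sample_rate=48000):
    """Encode one sequence element per sample as 16-bit mono WAV data."""
    samples = normalize(seq)
    pcm = struct.pack(
        "<%dh" % len(samples),
        *(int(round(s * 32767)) for s in samples),
    )
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(1)        # mono
        w.setsampwidth(2)        # 2 bytes = 16 bits per sample
        w.setframerate(sample_rate)
        w.writeframes(pcm)
    return buf.getvalue()


# Example: a made-up sequence (squares mod 1000) of 480,000 elements.
data = sequence_to_wav_bytes([(n * n) % 1000 for n in range(480000)])
# Writing `data` to a .wav file gives 10 seconds of audio at 48 kHz.
```

At this rate, hundreds of thousands of elements go by in a few seconds, so any structure would be heard as timbre or rhythm rather than as individual values.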
As an estimate of the visual bitrate, you could consider that video streams become “high quality” at several tens of megabits per second; however, you don’t perceive all of it as equally detailed, etc. The fovea is much more detailed than the periphery. Still, clearly megabits/second is the right order of magnitude.
Agreed — what’s interesting is that 16*2*48000 = 1.5Mbps/sec so they’re in the same ballpark. The visual system has a definite edge but they’re closer than you might guess.
Whoops: mbps/sec is redundant, like ATM machine.
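For anyone who wants to check the arithmetic, a tiny sketch using the figures from the comments above (16 bits per sample, 2 ears, 48,000 samples per second); the function name is just for illustration.

```python
def raw_bitrate(bits_per_sample, channels, samples_per_second):
    """Uncompressed bit rate in bits per second."""
    return bits_per_sample * channels * samples_per_second


audio_bps = raw_bitrate(16, 2, 48_000)
print(audio_bps)        # 1536000 bits per second
print(audio_bps / 1e6)  # 1.536, i.e. about 1.5 Mbps
```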
I don’t think so. A LOT of good quality music recordings can be well described by Ogg Vorbis or Ogg Opus files in the vicinity of say 60-80kbps. Although raw bitstreams might be 1.5Mbps our perception of them is probably closer to ~100kbps whereas visual bitstreams are probably closer to 10Mbps order of magnitude.
Interesting back of the envelope calculation!
I think your overall point is correct, but the details are drawn from engineering compromises rather than physiological capabilities.
People can reliably distinguish sound phenomena at time differences of around 9μs (https://asa.scitation.org/doi/10.1121/1.1908493), and visual images can be perceived at around 1/200 of a second (don’t have a citation handy for this one).
One area where auditory perception may strictly dominate visual perception is in perceiving ratios. My ability to discriminate intervals reliably is good to within a quarter-tone or so over three octaves or so, which equates to an implied maximum relative error of half of a quarter-tone, or 2^(1/48) - 1 = ~1.5%.
Without the aid of axis tick marks or grid lines, I doubt I’d be able to reliably estimate the ratio of two printed bars in a plot as finely as 6:1 vs 5.65:1. If, however, the ratios are presented as two simultaneous pitches (e.g., C3 and G5 vs C3 and F#5), I can distinguish between ratios that similar effortlessly and instantly.
I am a trained musician, which means that I am not a representative sample, but it’d be interesting to see how visual vs intervallic ratio perceptions compare in the general population.
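The numbers in this comment can be checked with a short sketch. It assumes twelve-tone equal temperament (each semitone is a frequency factor of 2^(1/12)); the note-name parser and function names are illustrative helpers, not part of the original comment.

```python
SEMITONE = {"C": 0, "C#": 1, "D": 2, "D#": 3, "E": 4, "F": 5,
            "F#": 6, "G": 7, "G#": 8, "A": 9, "A#": 10, "B": 11}


def note_number(note):
    """Semitone index of a note such as 'C3' or 'F#5' (single-digit octave)."""
    name, octave = note[:-1], int(note[-1])
    return 12 * (octave + 1) + SEMITONE[name]


def frequency_ratio(low, high):
    """Frequency ratio between two notes, assuming equal temperament."""
    return 2 ** ((note_number(high) - note_number(low)) / 12)


print(round(frequency_ratio("C3", "G5"), 3))   # 5.993, i.e. roughly 6:1
print(round(frequency_ratio("C3", "F#5"), 3))  # 5.657, i.e. roughly 5.65:1
print(round(2 ** (1 / 48) - 1, 4))             # 0.0145, half a quarter-tone
```

So the two bar-chart ratios in the comment differ by about 6%, which is one semitone, while the stated discrimination threshold is about a quarter of that.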
I suspect that anatomy also plays a big role. Our eyes sample a 2D image repeatedly over time, but our ears actually have dedicated portions for specific frequencies; removing a frame (or perhaps adding a ‘wrong’ frame) causes a change in the frequency of the signal.
There are also a bunch of people talking about combining information; if any of them are reading this, check out the “cocktail party problem”.
“but our ears actually have dedicated portions for specific frequencies; removing a frame (or perhaps adding a ‘wrong’ frame) causes a change in the frequency of the signal.”
This is, I think, exactly correct. If a “smarter” removal of a frame of sound were performed, e.g. where (at a minimum) any discontinuities in the signal were smoothed and (even better) frequencies were matched for the 1/24 second before and 1/24 second after the removed frame, it’d be much harder to notice.
That is, cutting a frame introduces information (in the form of unexpected frequencies and frequency changes) that we’re good at detecting.
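To make the “smarter removal” idea concrete, here is a minimal sketch of only the first part (smoothing the discontinuity), using a short linear crossfade across the join. The test tone, the 10 ms fade length, and the function names are assumptions for illustration; it does not attempt the frequency matching mentioned above.

```python
import math


def sine(freq, sample_rate, n):
    """n samples of a unit-amplitude sine wave."""
    return [math.sin(2 * math.pi * freq * i / sample_rate) for i in range(n)]


def hard_cut(signal, start, length):
    """Remove `length` samples at `start`, splicing the ends together."""
    return signal[:start] + signal[start + length:]


def crossfade_cut(signal, start, length, fade):
    """Remove `length` samples at `start`, blending `fade` samples at the join."""
    out = signal[:start]
    for i in range(fade):
        w = (i + 1) / (fade + 1)                 # ramps from near 0 to near 1
        before = signal[start + i]               # audio leading into the cut
        after = signal[start + length + i]       # audio following the cut
        out.append((1 - w) * before + w * after)
    out.extend(signal[start + length + fade:])
    return out


def max_jump(signal):
    """Largest sample-to-sample change: a crude proxy for an audible click."""
    return max(abs(b - a) for a, b in zip(signal, signal[1:]))


rate = 48000
frame = rate // 24                               # one film frame = 2000 samples
tone = sine(440, rate, rate)                     # one second of a 440 Hz tone
print(max_jump(tone))                            # baseline for the intact tone
print(max_jump(hard_cut(tone, 10000, frame)))    # large jump at the splice
print(max_jump(crossfade_cut(tone, 10000, frame, 480)))  # close to baseline
```

The hard cut produces a jump many times larger than anything in the intact tone, which is the kind of newly introduced high-frequency content described above; the crossfade removes that jump, though a listener might still notice a brief dip in loudness where the two phases partly cancel.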
The visual inverse of this is a favorite of mine. If you track a subject’s eyes and light up (with a computer controlled spotlight) just that section of a room the subject is actually attending to, the subject “sees” a really really bright room lighted evenly everywhere, whereas the experimenter “sees” a dark room with a random spot of light bouncing around. (My understanding is that this experiment has been done with these results. But I don’t have a reference.)
Thanks for sharing this article.
I found the section on sonifying the progress of data-fitting algorithms especially interesting. But I would encourage you to expand your horizon beyond traditional “musical sounds.” There is a sonification of a stochastic process that you are already familiar with: the Geiger counter. I think it holds lessons for what might work well in this kind of situation.
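A Geiger-counter-style sonification can be sketched as clicks at the event times of a Poisson process, where the listener hears the rate rather than any pitch. The sketch below assumes a constant rate and uses exponential waiting times between clicks; the function name and the example numbers are illustrative.

```python
import random


def click_times(rate_hz, duration_s, seed=None):
    """Times (in seconds) of clicks from a constant-rate Poisson process."""
    rng = random.Random(seed)
    t, times = 0.0, []
    while True:
        t += rng.expovariate(rate_hz)   # exponential gap, mean 1 / rate_hz
        if t >= duration_s:
            return times
        times.append(t)


quiet = click_times(2, 10, seed=1)      # about 20 sparse clicks in 10 seconds
busy = click_times(50, 10, seed=1)      # about 500 clicks: a steady crackle
print(len(quiet), len(busy))
```

One hypothetical use in the data-fitting setting would be to tie `rate_hz` to some running quantity, such as the size of each update in an iterative fit, so that the crackle thins out as the algorithm settles.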
> consider those “find 7 differences between these two pictures” puzzles they used to have in the kids’ pages in the newspaper.
They still have them:
https://comicskingdom.com//slylock-fox-and-comics-for-kids/2023-01-17
Andrew –
I don’t understand this:
> 1. To gather visual information you have to look; you can’t do much visual processing in the background. In contrast, you can notice sound without paying attention.
Seems to me that the difference between looking and seeing is much like the difference between listening and hearing. The brain processes much visually that’s in the background without bringing it to conscious attention in the same way that it processes much that we hear without bringing it to conscious attention. Although I would guess that the gap between the amount of data it processes and what it brings to attention might be smaller with hearing. Is that something that you know to be quantified?
Anyway, this all reminds me of this:
https://www.youtube.com/watch?v=G-lN8vWm3m0
@Joshua:
1. Many moons ago there was a study published (apologies for not being able to supply a reference) which appeared to show that people watching TV move less than people who are asleep.
2. When it comes to survival value, audition is infinitely more important than vision.
3. Strangely enough, whether you listen to a tune or merely recall it in your mind, the same brain circuitry is involved (fans of Timmy Leary should take note.)
4. Other phenomena (e.g. the cocktail party effect &c.) have been well and replicably demonstrated.
5. The reason why the Portsmouth Sinfonia’s offerings were nonetheless recognizable [yes indeed!] was because they always adhered to the rhythm of the original composition.
6. I can make it longer if you like the style …
“2. When it comes to survival value, audition is infinitely more important than vision.”
That’s an interesting suggestion because Joshua’s video shows that in the McGurk effect the brain prioritizes vision. Auditory perception has the obvious survival benefit of 360/360 detection. However, it doesn’t help you avoid a leaping lion. When push comes to shove, visual provides more information and more reliable information. There are plenty of threats in the wild that are silent.
So I don’t know if sound is *more* important than vision for survival. Seems to me like the two, as well as other senses, are tuned to work well together, each taking precedence in situations where it’s more valuable. If I was forced to choose, I’d rather be deaf than blind, although neither appeals to me much.
Vision is the most expensive of all our senses (hence the Stroop effect) ever since we crawled out of the oceans, as well as the most dispensable. Allowing for the fact that Our Man in the Savannah was already in possession of language (which, next to cooking and music, makes us human after all – we survive because we are social animals), it is highly unlikely that he would have been strolling around in the hope of a Gestalt psychologist yelling at him “Here be Lions!”. Cf. Ivo Kohler, Stephen Kuffler, Hubel & Wiesel, R.L. Gregory, J.J. Gibson, David Marr, and many others who did the footwork but never wrote books.
BTW, I’m looking forward to the day when the last PowerPoint presenter has been strangled with the intestines of the last webdesigner – never mind pop videos. (For the sake of balance, I could through in ‘ambient music’ which makes Gregorian chants seem bristling with vivacity.)
bleedin’ autocomplete -> should read “throw in”; now there’s ‘visual’ for you …
I’m not convinced but thanks.