Is This Thing On? Privacy and Your Smartphone Sensors

Smartphone Snooping Without Microphone Access

Can your smartphone sensors still enable apps to eavesdrop on your conversations even after the app has been denied microphone access? It does seem possible. We dug into this question through two research papers, “AccEar: Accelerometer Acoustic Eavesdropping with Unconstrained Vocabulary” and “Side Eye: Characterizing the Limits of POV Acoustic Eavesdropping from Smartphone Cameras with Rolling Shutters and Movable Lenses”, which show how other smartphone sensors can let applications recognize speech without accessing the microphone at all.

In the first paper, “AccEar: Accelerometer Acoustic Eavesdropping with Unconstrained Vocabulary”, a team of researchers from Shandong University in China and George Mason University in Virginia found a way to capture speech played on Android device speakers by sampling the accelerometer sensor. There are important caveats: they are not capturing sound from the environment here; they are capturing speakerphone output or other speech played on the device’s own speaker. But given this restriction, they make a convincing case that they can recover speech on many different Android devices.

How Accelerometers Measure Sound on Mobile Phones

How is the accelerometer on a mobile phone turned into a microphone? Well, accelerometers measure the forces on the mobile device. For example, they can recognize gestures like shaking your phone back and forth. It turns out that sound waves shake things back and forth very quickly, like thousands of times per second. That movement is essentially how a speaker works – we shake a membrane back and forth very quickly, and it makes pressure waves in the air, which we hear as sound. So, playing a sound on your phone’s speakers can be detected as tiny, fast phone shakes. And the accelerometer can detect this shaking – sort of.

Many Android phones allow applications to check the accelerometer up to 500 times per second. This frequency is way faster than would be necessary to detect gestures – most apps only need to sample it around 30 times per second, max. But the phones allow 500 samples per second anyway. Ordinarily, this would not be nearly enough samples to recognize words – you would want several thousand samples per second on a good microphone to do that. But the research team created a sophisticated combination of signal processing and machine learning to get pretty good results anyway.
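To see why 500 samples per second caps what can be heard, recall the Nyquist limit: a signal sampled at rate fs can only faithfully represent frequencies up to fs/2, and any tone above that folds (aliases) back down into the lower band. A minimal sketch of that folding (the function name and values here are illustrative, not from the paper):

```python
FS = 500          # max accelerometer rate many Android phones allow
NYQUIST = FS / 2  # highest faithfully recoverable frequency: 250 Hz

def alias_frequency(f, fs=FS):
    """Frequency (Hz) at which a pure tone of f Hz appears when sampled at fs Hz."""
    f = f % fs
    return f if f <= fs / 2 else fs - f

# A 440 Hz tone (concert A) folds down to 60 Hz at 500 samples/s:
print(alias_frequency(440))  # → 60
# Anything at or below 250 Hz keeps its frequency:
print(alias_frequency(180))  # → 180
```

This is why the raw accelerometer signal only contains the low end of the speech spectrum, and why the researchers needed the reconstruction machinery described next.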

The AccEar tool samples the accelerometer data and builds spectrograms of the signal. Because of the low sampling frequency, only very low frequencies are present in the spectrogram. To rebuild the higher frequencies and get something like full-frequency speech out of the accelerometer signal, they train a machine learning model called a Conditional Generative Adversarial Network (cGAN) on pairs of low-frequency and full-frequency speech samples, so that it can infer the high-frequency components from the observed low frequencies. The output of the cGAN is compared to the full-frequency captured signal, and training continues until the cGAN is as good as it can get at inferring the high-frequency components of speech from the lower-frequency (20–250 Hz) signal. Basically, the cGAN is trained to make moderately good educated guesses, based on fuzzy low-frequency audio, about what the full-frequency signal is, assuming that the signal is speech. It turns out to be pretty good at these guesses.
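A spectrogram is just a sequence of short-time frequency spectra. As a toy illustration of what AccEar’s front end has to work with (this is a naive DFT sketch, not the paper’s actual pipeline; frame sizes and the test tone are my own choices):

```python
import math, cmath

FS = 500  # accelerometer sampling rate (Hz)

def dft_magnitudes(frame):
    """Naive DFT magnitude spectrum of one analysis frame (bins 0..n/2)."""
    n = len(frame)
    return [abs(sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n)
                    for t in range(n))) / n
            for k in range(n // 2 + 1)]

def spectrogram(signal, frame_len=50, hop=25):
    """List of per-frame magnitude spectra – a toy spectrogram."""
    return [dft_magnitudes(signal[i:i + frame_len])
            for i in range(0, len(signal) - frame_len + 1, hop)]

# One second of a 100 Hz tone sampled at 500 Hz.
tone = [math.sin(2 * math.pi * 100 * t / FS) for t in range(500)]
spec = spectrogram(tone)

bin_hz = FS / 50  # 10 Hz per bin with 50-sample frames; bins span 0–250 Hz,
                  # so nothing above Nyquist appears in the spectrogram at all
peak_bin = max(range(len(spec[0])), key=lambda k: spec[0][k])
print(peak_bin * bin_hz)  # → 100.0
```

The cGAN’s job is then to hallucinate plausible content for the missing bins above 250 Hz, conditioned on the low bins that are present.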

Setting the Stage for a Realistic Attack Scenario Using the cGAN Model

The speech reconstruction method uses a combination of spectral analysis on the low-frequency signal and a trained cGAN that reconstructs the full signal by inferring the high-frequency components. The paper presents several objective and subjective metrics on the quality of the recovered speech signal, and the algorithm performs well. It excels on the primary objective metric, the Mel-Cepstral Distortion (MCD). According to the paper, the MCD measures the difference between the original signal and the reconstructed one; lower is better, and anything below 8 means the speech is probably recognizable. All the different voices they tried produced MCD values between 3 and 8, which indicates a technically feasible attack. The researchers also had human judges try to recognize words in the reconstructed audio stream. Judges were typically able to accurately recognize 80% to 90% of the words in the cGAN-reconstructed audio. These results show that this is a realistic attack scenario.
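MCD is a standard objective metric in speech synthesis: a common formulation sums squared differences between mel-cepstral coefficient vectors frame by frame, scaled into decibels. A sketch of that standard form (not necessarily the exact variant the paper uses):

```python
import math

def mcd_frame(c_ref, c_rec):
    """Mel-Cepstral Distortion (dB) between two mel-cepstral vectors.

    Standard form: (10 / ln 10) * sqrt(2 * sum of squared coefficient
    differences), typically computed over coefficients 1..D (excluding
    the 0th, energy, coefficient).
    """
    sq = sum((a - b) ** 2 for a, b in zip(c_ref, c_rec))
    return (10.0 / math.log(10)) * math.sqrt(2.0 * sq)

def mcd(ref_frames, rec_frames):
    """Average frame-level MCD over an utterance."""
    return sum(mcd_frame(r, s)
               for r, s in zip(ref_frames, rec_frames)) / len(ref_frames)

# Identical cepstra give 0 dB distortion; diverging cepstra score higher.
print(mcd_frame([1.0, 2.0], [1.0, 2.0]))  # → 0.0
```

On this scale, the paper’s reported 3–8 dB range sits below the ~8 dB intelligibility threshold it cites.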

Given the success of this eavesdropping attack that just samples the accelerometer, which on Android 11 and below requires no special permissions at all, what are some countermeasures? First of all, Android 12 and later restricts the ability of applications to sample the accelerometer at such high frequencies. However, Google’s restriction limits sampling without explicit permission to 200 samples per second, which the authors of this paper believe is too high and will still allow partial reconstruction of speech. The authors’ preferred solution is to require explicit permission from the user for any application that samples at greater than 50 Hz. Another countermeasure is to avoid using your phone’s speakers for anything, since this attack was only demonstrated on speech audio played through the phone’s own speakers. It is not known whether it would be effective on speech that occurred near the device while it rested on a sounding-board-like surface.
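The proposed countermeasure amounts to a rate-limiting policy in the sensor framework. A hypothetical sketch of the rule the authors advocate (the names, numbers, and structure here are mine for illustration, not Android’s actual API):

```python
RATE_LIMIT_HZ = 50  # authors' proposed cap without explicit user permission

def allowed_rate(requested_hz, has_user_permission):
    """Clamp an app's requested accelerometer sampling rate.

    Apps holding the (hypothetical) high-rate permission get what they
    ask for; everyone else is capped at RATE_LIMIT_HZ, well below the
    rates needed for speech reconstruction.
    """
    if has_user_permission:
        return requested_hz
    return min(requested_hz, RATE_LIMIT_HZ)

# An eavesdropping app requesting 500 Hz without permission gets 50 Hz:
print(allowed_rate(500, False))  # → 50
```

For comparison, Android 12’s actual policy caps unprivileged sampling at 200 Hz, which the authors argue still leaks partial speech.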

Extracting Audio from a Smartphone Camera

The second eavesdropping paper is called “Side Eye: Characterizing the Limits of POV Acoustic Eavesdropping from Smartphone Cameras with Rolling Shutters and Movable Lenses” by researchers from the University of Michigan, University of Florida and Northeastern University. This research is focused on extracting audio signals from a smartphone’s camera, specifically by looking at changes in the smartphone camera’s video stream caused by the impact of audio waves on the various components of the camera.

In other words, sound makes tiny mechanical camera components wiggle, which shows up as tiny changes in the video produced by the camera. In particular, sound waves wiggling the camera’s CMOS sensor in tiny amounts while it is going row-by-row capturing a frame of video, and wiggling the springs and wires of the camera’s movable lens (which supports image stabilization and autofocus capabilities), can introduce artifacts into captured video that can be used to infer some of the frequency components of the sound that caused the wiggling. It is interesting to note that the subject of the video can be almost anything – pointing the camera at a nearly featureless ceiling or floor is all it takes for this attack to work.
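The reason a 30-frames-per-second camera can carry audio at all is the rolling shutter: each sensor row is read out at a slightly different instant, so every frame contributes many time samples rather than one. A rough sketch of the idea (the readout fraction is an assumed placeholder, not a measured value):

```python
def row_timestamps(fps, rows, readout_fraction=0.7):
    """Approximate capture time (s) of each row within the first frame.

    readout_fraction is the assumed share of the frame period spent
    reading rows out; real values vary by sensor.
    """
    frame_period = 1.0 / fps
    readout = frame_period * readout_fraction
    return [r * readout / rows for r in range(rows)]

# 1080 rows at 30 fps: over 30,000 row-readout instants per second in raw
# terms, though noise and processing leave far less usable audio bandwidth.
ts = row_timestamps(fps=30, rows=1080)
print(len(ts))  # → 1080
```

The lens-suspension wobble modulates where each of those rows lands in the image, which is the second signal path the paper exploits.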

While the AccEar experiment used the built-in speaker on the smartphone to generate audio signals, Side Eye uses an external audio source, typically a speaker placed on the same surface as the phone. The other limitation is that the mobile app running the Side Eye type of attack has to be capturing video while the audio is playing – typically this requires camera permissions to be granted, and may also turn on the device’s indicator light showing that the camera is active.

Training a Neural Network to Perform Speech Recognition from Camera Movement

Unlike the AccEar research, which basically trained a GAN to try to recreate the full signal from the low frequency signal, and then tested whether people could understand the reconstructed signal, the Side Eye team trained a transformer neural network called HuBERT (Hidden-unit Bidirectional Encoder Representation from Transformers) to try to perform limited-vocabulary speech recognition from the sampled signal directly. This had limited success – Side Eye can recognize spoken numbers correctly between 30% and 70% of the time, depending on how loud the speaker was. With just numbers, it is a limited vocabulary with less accuracy than the AccEar team had.

Given that the Side Eye team estimates their sampling rate is roughly equivalent to 600 samples per second, and AccEar maxed out around 500 samples per second, you might have expected Side Eye to be more successful. If I had to speculate about why it was not, I would guess that the AccEar team had a smarter post-processing setup: they attempted to regenerate the audio from the signal rather than train a machine learning system to recognize the speech itself. It seems possible that the input process from Side Eye, combined with the signal regeneration techniques from AccEar, could result in unlimited-vocabulary speech recognition at higher rates than even AccEar achieved, because of the higher sampling frequency. This approach is suggested as a future research direction by the Side Eye team as well.

Defending Against Sensor-Based Eavesdropping

Disrupting the Side Eye attack is challenging and will require changes to both the video sensors and the lenses. Instead of video sensors that capture a row of light at a time, manufacturers will need to adopt sensors that capture the entire image simultaneously, known as global shutter sensors. Alternately, they might randomize the order in which rows are captured.

Fixing the information leakage from lens assemblies is more difficult, but the best option is probably to put the lens assemblies in a mechanically isolated compartment that does not vibrate with the rest of the phone.

One area I would like to see explored further in the video-to-speech eavesdropping space is whether you can improve the sample rate by using more cameras. Recent Apple smartphones have three cameras on the back and one facing the user, and it is possible to sample all of them at the same time. If they each have distinct responses to structurally-conducted audio signals, it might be possible to improve the effective sample rate by combining the information received from each video stream.
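The multi-camera idea amounts to interleaved sampling: if two cameras’ row readouts are offset in time, merging their streams by timestamp increases the density of samples. A hypothetical sketch (the rates and offsets are invented for illustration):

```python
def merge_streams(streams):
    """Merge (timestamp, sample) streams from several cameras into one
    time-ordered sequence. If the cameras' readout clocks are offset
    from one another, the merged stream has a higher effective rate
    than any single camera."""
    return sorted(s for stream in streams for s in stream)

# Two hypothetical cameras, each ~600 samples/s, offset by half a period,
# interleave into an effective ~1200 samples/s stream:
cam_a = [(t / 600.0, 0.0) for t in range(3)]
cam_b = [(t / 600.0 + 1 / 1200.0, 0.0) for t in range(3)]
merged = merge_streams([cam_a, cam_b])
print(len(merged))  # → 6
```

In practice the cameras would not be perfectly offset or identically responsive, so real gains would depend on calibration, but the principle is the same one used in time-interleaved ADCs.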

The two papers reviewed here show how smartphone apps can be used to eavesdrop on speech even when they do not have access to the microphone on the device. Both methods have significant limitations on how they can be used – one only works when the audio is played on the phone’s speaker, the other only works by capturing video while the phone is on the same surface as a speaker playing the voice. However, both end up being quite unexpected ways for mobile devices to be used for eavesdropping.

The creativity and technical sophistication coming out of the academic community in this space makes it clear that highly sensitive speech does not belong anywhere near a smartphone. Until the vendors get their sensors under control (if they ever do), there will be multiple modes that an attacker with access to a phone can use to spy on nearby conversations, and users having high consequence conversations should isolate their smartphones accordingly.