Engaging, comforting and charismatic: benefits of using voices of friends and family in interfaces
Imagine hearing a reminder from your smart speaker or phone, in the voice of your close friend, asking you to go and exercise.
We set out to explore how people feel about interfaces that use the voices of their friends and relatives, and how to design such interfaces in a beneficial and secure manner.
Voices have visceral and subconscious effects on us. In 2015, I visited a dementia day-centre at a hospital in Singapore. There were audio devices on the tables that played voice recordings from patients’ families or friends, and sometimes played reminders to the patients. This seemed to help the patients reconnect and feel comforted.
Beyond voice recordings, today’s artificial intelligence (AI) speech synthesis technologies enable us to clone people’s voices and dynamically generate realistic speech with ease. Voices can be cloned from just 5 seconds of audio, whether of our own voices or of other voices we have access to. Voices of friends and family are exciting alternatives to the voices commonly used by voice interfaces and virtual assistants.
However, we also need to be wary of the misuse of these technologies, especially for voice impersonation attacks. For example, in 2020, scammers tricked a bank manager into transferring $35 million to them using a cloned company director’s voice.
Acquaintances, famous figures and mentors also have voices that are familiar to us, but friends and relatives likely have closer relationships with us, and their voices are more familiar still. In this work, we asked two key questions:
- How do people feel about the voices of friends and family in voice interfaces?
- What are the key design considerations for such interfaces? We have to consider security and prevent misuse.
We conducted surveys and interviews. Then, we built a prototype, KinVoice, that enables users to set reminders and receive them in kin voices. Lastly, we let users test KinVoice in their homes for two weeks and held co-design activities. We summarise this in an overview video.
Our prototype issued reminders in AI-generated voices that were synthesised from sample voices of family members and friends using the Real-Time Voice Cloning tool (Jemine, 2019).
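To make the reminder flow a little more concrete, here is a minimal sketch in Python. It is not the KinVoice implementation: the names `Reminder`, `speak_in_kin_voice`, `due_reminders` and `deliver` are all hypothetical, and the synthesis step is a stub that only returns a text description instead of calling the Real-Time Voice Cloning tool.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Reminder:
    text: str                # what the reminder says
    kin: str                 # whose voice should speak it (hypothetical label)
    due: datetime            # when the reminder should play
    delivered: bool = False  # set to True once the reminder has been played


def speak_in_kin_voice(reminder: Reminder) -> str:
    """Stub for speech synthesis (an assumption, not a real API).

    A real system would pass the text and a voice sample of the kin to a
    voice-cloning model; here we only describe what would be played.
    """
    return f"[{reminder.kin}'s voice] {reminder.text}"


def due_reminders(reminders, now):
    """Return undelivered reminders whose due time has passed, oldest first."""
    pending = [r for r in reminders if not r.delivered and r.due <= now]
    return sorted(pending, key=lambda r: r.due)


def deliver(reminders, now):
    """Play every due reminder once and mark it as delivered."""
    played = []
    for reminder in due_reminders(reminders, now):
        played.append(speak_in_kin_voice(reminder))
        reminder.delivered = True
    return played
```

In this sketch, each reminder carries the name of the kin whose voice should speak it, so the same scheduling logic would work regardless of which synthesis tool sits behind the stub.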
User Perceptions (How do people feel about the voices?)
- Voices of friends and family promoted the feeling of connection (co-presence), social presence and telepresence.
- The voices were persuasive, credible, and charismatic.
- The voices were likeable, safe, and eerie (drew attention to the interface).
Our work brings a new understanding of how users perceive kin voices (voices of friends and family). A similar user study approach could be used by voice researchers and designers.
Design Considerations (What are the key design considerations?)
- Should we limit realism? Our findings suggest that synthesised kin voices should not be overly realistic to prevent misuse but should still be recognisable and familiar.
- Voice user interfaces (VUIs) with kin voices are more beneficial for personalised tasks that match the kin’s role and the content of the user’s usual interactions with that kin than for general tasks.
- Consider mixing real and AI-generated voices.
- We recommend leveraging the user perceptions resulting from the close familiarity of kin voices for use in various applications: virtual therapists and companions, notifications and motivation, and shared and social settings (e.g. to form closer relationships, maintain social connection).
Our studies and discussions contribute insights and guide future research on voice interactions and voice design.
I am fascinated by voice design and the implicit influence that it could have on us. If you have read Dune (the 1965 novel) or watched the 2021 film adaptation, you will have been introduced to the sci-fi idea of “The Voice”, which can subconsciously influence people to move. Perhaps there is something to this that is worth further investigation. I believe that we can use AI-generated voices for much good. We already see applications of AI-generated voices and characters in learning and well-being. Imagine how different kinds of voices could be useful in other domains such as robotics, virtual avatars, the metaverse, and more. With the maturing of AI-generated media, it is even more relevant for us to think about this today. Let us tread carefully yet optimistically.
This blog post summarises our paper presented at the ACM Conference on Computer-Supported Cooperative Work and Social Computing (CSCW 2021). It is published in the journal Proceedings of the ACM on Human-Computer Interaction (PACM-HCI):
Sam W. T. Chan, Tamil Selvan Gunasekaran, Yun Suen Pai, Haimo Zhang, and Suranga Nanayakkara. 2021. KinVoices: Using Voices of Friends and Family in Voice Interfaces. Proc. ACM Hum.-Comput. Interact. 5, CSCW2, Article 446 (October 2021), 25 pages. https://doi.org/10.1145/3479590
Special thanks to the instructors, organisers and peers of the “Experiments in AI-Generated Media” (MIT Media Lab) online course for the discourse on “deepfakes for good”. I highly recommend the course if you are interested in this area.