We leveraged the interactive demonstrations that our probe afforded to engage eleven community members (7M, 4F; aged 18–68) in the co-creation of media retrieval use-cases that are more appropriate in oral contexts. This co-creation took place during two workshops across two days: one focused on current information-seeking practices, the other on uncovering use-cases that support and extend these practices with more useful content than the photos on the current probe. Before, between, and after the workshops we experimented with different content-generation approaches with the same participants and refined the ways of collecting audio annotation data to drive the IR system.
5.5.2 Current Information Practices & Languages.
We split this workshop across two groups to fit more comfortably inside the house: the first with four younger participants (aged 18–22) and the second with seven older participants (aged 30–68). Younger and older generations had different responsibilities in the fields and in the homes, and so were available to participate at different times. The generational divide across groups also expressed itself in smartphone and feature-phone usage (and non-usage), as well as in fluency and literacy in Hindi and Marathi. We asked polylingual participants to translate for one participant in the second group who did not speak Hindi.
We structured what ended up being lively discussions around five scenarios/topics, designed to cover a broad range of everyday experiences. To arrive at these, we utilised interpretative research strategies [68] and drew on our observations and lived experiences from previous research phases – documented through field notes and research diaries – to generate 18 potential scenarios/topics; we captured these on post-it notes and, following discussion between research team members, distilled them down to five.
In the selling cotton scenario we asked participants to walk us through their reasoning for deciding when to sell their cotton harvest. In phase 2 we had observed that one family stored many bags of cotton in the back room of their house, hoping for higher prices later in the year. Some participants were not involved in this process and deferred to and trusted other family members in their decisions. Those involved in the process said that the internet was not a helpful resource: internet prices were characterised as ‘fake’ – higher than what buyers in the area actually offer. Instead they call buyers and middlemen in the area, pooling together where possible so the buyer will collect the cotton harvest from many families in a single vehicle. But this arrangement often falls through on either side, in which case the buyer will not come to collect the crop. Other families drop the cotton off themselves at a depot in the nearby town to command a higher price, but must cover the cost of transport.
On the topic of seeking health information, community members mentioned that they visit physicians in a nearby town and generally do not consult online information. They also reported on their experiences of COVID-19 and how healthcare workers came to administer vaccinations. They received digital copies of English-language vaccination certificates, and those with smartphones helped retrieve certificates for those without access.
On the topic of seeking help and information to manage crop diseases and pests, participants consulted agricultural shops in the nearby town and preferred to go in person rather than call. Sometimes they show a photo they have taken of the issue, and the people working in the shop usually make appropriate recommendations.
Across these three scenarios, and especially when information was sought outside the Banjara thanda, participants reported using Marathi, which led us to the topic of language preferences and perceptions. Participants valued, and indeed cared deeply about, their language, but were also pragmatic about speaking different languages, seeing this as necessary for living in a society with many different people. Within the community, however, they always speak Gormati with each other. Elders mentioned that Gormati is a strong and stable language, but acknowledged that their children learn more and more Marathi in order to speak with outsiders; in the elders’ view, the children should stick to Gormati. Younger participants preferred Ahirani and Marathi songs, claiming that songs in those languages are more melodic than Gormati ones. Older participants did not share this view.
This led us to the topic of multimedia. Here, older participants generally relied on those with smartphones to facilitate access, for instance by asking children to play a song from YouTube. Young people demonstrated how they use voice recognition, keyword search, and code-switching on YouTube to search for “Gor Banjara Song”. They explained that you need to use the English alphabet to find content on the internet, and they also adjusted their querying style, from the fluid Gormati queries we observed while evaluating the IR system to keywords. Top results are of high production value, with well-designed title cards that help identify and differentiate songs.
Participants mentioned that they would like to see more Gormati videos on farming, recipes, songs, comedy, and religion, and to record and share their own songs as well as videos with recipes or demonstrations of effective farming practices. Younger participants already create video content, but often delete it from their phones to conserve space and choose not to upload it because it does not match the production value of the videos they like to watch online.
5.5.3 Community-Generated Media Content.
Between the workshops we experimented with generating the type of media content participants mentioned earlier: filming community members making chapati, cooking lentils, weeding, and ploughing. Participants in the videos narrated what they were doing and, at our encouragement, repeated their demonstrations and narrations. For instance, when demonstrating how to make chapati, the person in the video made multiple chapatis and demonstrated and narrated each step several times: portioning and shaping the dough, cooking and flipping the chapati on the stove, and manipulating the cooked chapati to make it more pliable. We also encouraged participants (mothers and older farmers as well as their younger adult children) to make their own videos on project phones. We found that women in particular wanted to record and share songs. Unlike in phase 2, where we precluded sung content because it would be technically too difficult for our recogniser to cope with, we encouraged these recordings and later explored ways for community members to contribute spoken, rather than sung, annotation data for the IR system.
5.5.4 Design Workshop.
We met with the same participants the next day to think about use-cases for spoken language technologies, as embodied and exemplified by the current probe, that support and extend their current practices. This time we had arranged to meet with the older group first, so that we could feed back their insights to the more digitally savvy, younger group.
With the first group, we started by showing some of the videos they had recorded earlier and discussed how these might be of interest both within and outside of the community. We asked them to imagine how they would find those videos if they were living in a different Banjara community hundreds of kilometres away. They mentioned that they would ask their children to help and that songs would be of particular interest to them. In the media gallery of one of the two phones we had lent, we tried to find one of the songs that participants had recorded, but initially could not locate it among similar-looking thumbnails. We then checked the second phone, before finally finding the video through a more systematic check of the first device. We used this difficulty as an opportunity [28] to show our IVR probe again, demonstrate how it can make it easier to find content from spoken descriptions, and showcase how it had been extended with new photos since the previous day. We also explained that the IVR probe could be changed in future to include videos and song content, but that it can only understand spoken Gormati. We then asked participants to imagine that many songs were on the IVR probe and to tell us how they would find a particular song. After some discussion, the group said that they could either describe the song in words or say the first line of its lyrics.
With the second group we began by discussing the elder group’s interest in making videos of their farming, their cooking, and their songs, and asked whether such videos would be of any value to them. The group said that if they knew the people in the videos they would look at them; they suggested they might laugh initially, but that if the content was useful they could see others watching them. We then asked them to imagine another nearby community creating such videos, and what those videos might show. They expressed interest in seeing how different communities create fertiliser from cow manure, or stock ponds with fish. They also mentioned how their fathers are very skilled at particular aspects of farming, such as inter-cropping and keeping an ox-drawn plough straight; videos of these skills could be shared within the community and with other communities. They mentioned that videos could also be shared via WhatsApp, which led us to enquire how the messaging app is used in the community. Within their group, participants tended to use WhatsApp to forward images and videos and to send very short messages (e.g., ‘hi’, ‘what’s up’) transliterated using an English keyboard. This was the only evidence we saw of transliteration practices.
5.5.5 Revised Data Collection Methodology.
On our last day in the community we worked with data-collectors from phase 2 as well as community members who had shown interest in creating video content. We loaned phones to an older farmer and to two sisters-in-law who wanted to perform and share their songs. Other people either had their own phones, could borrow one from a family member, or could use one of the phones we had lent. The community-generated video content from earlier in our visits already contained some spoken narrations, but not enough for the ranker model of our IR system, and sung content would need to rely entirely on spoken annotation. Trialling this with participants, we learned that young people appreciated being able to listen to the original narrations of the content videos, as they found it harder to record annotations if they lacked the knowledge and/or confidence to describe what was being demonstrated in the video. They found it easier to start by ‘repeating’ what was already said, but also found ways of integrating their own knowledge and experience of the topic once they started speaking. These recordings were therefore not the verbatim repetitions used by systems such as ‘ReSpeak’ [78] to develop transcriptions for written languages. However, for our use-case, annotations featuring repetitions with variations are a more useful training resource for the IR system ranker than verbatim ones.
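To illustrate why such variation helps, note that a retrieval ranker is typically trained on positive (query, item) pairs: each distinct annotation of a video contributes a new pair, whereas verbatim repetitions collapse under deduplication. The sketch below is a minimal, hypothetical illustration in which strings stand in for spoken annotations; the deployed system operates on Gormati audio, and its ranker internals are not shown here.

```python
# Minimal, hypothetical sketch: strings stand in for spoken annotations.
# Each distinct annotation of a video yields one positive (query, video)
# training pair for a ranker; verbatim repetitions collapse when
# deduplicated, so varied annotations provide more training signal.

def unique_positive_pairs(video_id: str, annotations: list[str]) -> set[tuple[str, str]]:
    """Deduplicated positive training pairs for one video."""
    return {(annotation, video_id) for annotation in annotations}

verbatim = ["knead the dough"] * 3
varied = ["knead the dough",
          "press and fold the dough",
          "work the dough until it is soft"]

assert len(unique_positive_pairs("chapati.mp4", verbatim)) == 1
assert len(unique_positive_pairs("chapati.mp4", varied)) == 3
```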
We asked phase 2 data-collectors about their experiences using the digital storytelling software, and they mentioned that it was frustrating to only be able to export the entire digital story slideshow, even if they only wanted to share one new audio annotation. We also wanted to explore a different data collection methodology, given that phase 2 audio annotations were unevenly distributed across photos (see fig. 4). We suggested that they could try using WhatsApp voice messaging for this purpose, and set up a group between devices to demonstrate this. We shared a farming video to the group, and participants found it easier to respond to that video with a voice message containing their spoken annotation. This refined method also leverages participants’ familiarity with the platform (see [36]). Following this, we demonstrated and agreed on the following (ongoing) data collection process (a processing sketch follows the list):
• A video (e.g., on farming, cooking, songs) is shared to the WhatsApp group;
• Participants record and send audio annotations for that video to the same WhatsApp group;
• All audio received is assumed to relate to that video;
• After enough (10–20) annotations have been received, a new video is shared and the process repeats; and
• Researchers are included in the WhatsApp group to collect video and audio annotation data and to encourage use.
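As referenced above, the following sketch illustrates one way the resulting WhatsApp feed could be processed downstream, pairing each voice note with the most recently shared video, per the assumption that all received audio relates to that video. It is a sketch under stated assumptions rather than our actual pipeline: the Android-style export format, the file extensions, and the pair_annotations helper are hypothetical.

```python
import re
from pathlib import Path

# Hypothetical sketch, not the deployed pipeline: pair each voice note in an
# exported WhatsApp chat log with the most recently shared video. Export
# formats vary by platform and locale; we assume Android-style lines such as:
#   12/03/23, 14:02 - Name: VID-20230312-WA0001.mp4 (file attached)
ATTACHMENT = re.compile(r"(\S+\.(mp4|opus|m4a))\s*\(file attached\)")
VIDEO_EXTS = {"mp4"}
AUDIO_EXTS = {"opus", "m4a"}

def pair_annotations(chat_log: Path) -> dict[str, list[str]]:
    """Map each shared video to the audio annotations sent after it."""
    pairs: dict[str, list[str]] = {}
    current_video = None
    for line in chat_log.read_text(encoding="utf-8").splitlines():
        match = ATTACHMENT.search(line)
        if not match:
            continue  # ordinary text message: ignore
        filename, ext = match.groups()
        if ext in VIDEO_EXTS:
            current_video = filename  # subsequent audio annotates this video
            pairs[current_video] = []
        elif ext in AUDIO_EXTS and current_video is not None:
            pairs[current_video].append(filename)
    return pairs

if __name__ == "__main__":
    for video, annotations in pair_annotations(Path("chat.txt")).items():
        print(f"{video}: {len(annotations)} annotation(s)")
```

Attributing every voice note to the most recently shared video keeps the process simple for participants, at the cost of mis-pairing any audio sent shortly before a new video; waiting for enough annotations before sharing the next video (the fourth point above) mitigates this.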
Finally, we established formal participant consent: to be included in the WhatsApp group, to participate in the process, and for us to use the video content for a community repository and the spoken annotations to improve the IR system. To date, community members have contributed ten further videos with 48 minutes of audio annotations.