Digital archiving of text and sound data of morphosyntactic microvariation of southern Bantu languages


This 2020 IRC project aims to build an online archive of local languages spoken in South Africa in order to contribute to the typology of Bantu languages. As an outcome, we launched the “Bantu Microvariation Digital Archive” in March 2021. The archive is a collection of text, audio, and plots of data from six southern Bantu languages: Northern Sotho, Siswati, Southern Ndebele, Sesotho, Tshivenḓa, and Xitsonga.

Northern Sotho is spoken in South Africa by about 4.7 million speakers. Sesotho is spoken in South Africa and Lesotho by about 5.6 million speakers. Siswati is spoken in South Africa, Eswatini, Lesotho and Mozambique by about 2.3 million speakers. Southern Ndebele is spoken in South Africa by about 1.1 million speakers. Tshivenḓa is spoken in South Africa and Zimbabwe by about 1.3 million speakers. Xitsonga is spoken in South Africa, Zimbabwe and Mozambique by about 12 million speakers.

The dataset is based on linguistic descriptions collected through joint research with a local research institution, the MER Mathivha Centre for African Languages, Arts and Culture, conducted at the University of Venda, South Africa, in March 2020. All files are annotated at the word level. The original recordings, which are available upon request, are 16-bit, 44.1 kHz audio. If you are interested in them, please fill in the access request form on the resource page.

*The research collaboration with the MER Mathivha Centre for African Languages, Arts and Culture is financially supported by a grant from the JSPS Core-to-Core Program (B. Asia-Africa Science Platforms), “Establishment of a Research Network for Exploring the Linguistic Diversity and Linguistic Dynamism in Africa (ReNeLDA)”.

(Written by Daisuke SHINAGAWA and Mayumi ADACHI)
