When one hears the term machine learning, what immediately comes to mind is something complicated and hard to understand. We sat down with Kathleen Siminyu, a machine learning fellow at Mozilla Foundation, to better understand her journey and what the job entails. This is what she had to say.

Tell us about yourself.

My name is Kathleen Siminyu and I am a machine learning fellow at Mozilla Foundation. Professionally, I am a Natural Language Processing researcher. At Mozilla, I am working on the Common Voice dataset. Common Voice is essentially a platform that enables language communities to build datasets, and one of the languages on it is Kiswahili, which is what I work on.

How did you get started in your career path?

For my undergraduate degree I studied math and computer science at JKUAT, and while I was in my 4th year, I decided to venture into a field that utilized both. I did my research and came across data science, which encompasses both math and computer science. My 4th year project was on data science, which helped me after I finished my degree as I was able to include it in my portfolio.

Other than the degree, I started doing online courses related to data science on platforms like edX and Coursera. I did this to increase my knowledge as well as boost my CV, because at the end of the day I was not a trained data scientist.

My first job was at Africa’s Talking and I did a lot of learning on the job. At first, my role was more about providing metrics, like how much airtime was sold and things like that, rather than data science. However, I managed to automate most of these processes, which meant that I had more time to focus on my passion, which is data engineering.

During this time, I came to realize that there was a need for African language tooling and resources, and that IT was not the place where I would be able to follow that interest. This meant that I had to venture back into academia. Around this time, I found research communities that were building NLP for African languages. This is what has really contributed to my learning journey, due to the fact that in Africa there are very few academic institutions that offer degrees in data science and artificial intelligence. However, there are grassroots communities that are nurturing this talent, and I am a product of such.

What is the Kiswahili Common Voice dataset and what are its benefits?

The Kiswahili Common Voice dataset is essentially a dataset for speech recognition, also known as speech-to-text. It is basically a task which involves turning audio into text. One of its uses is in captioning for TV and video, and on some conferencing platforms like Zoom.

The dataset itself begins with us collecting text, which is then broken down into sentences and sent to people. At Mozilla, we crowdsource the audio aspect as well: when you go onto our platform and sign up as a contributor, you start receiving sentences and can record yourself saying those sentences out loud.

So a dataset for speech recognition or transcription is essentially text accompanied by audio of what is in the text. That is the data that you would feed into your machine learning algorithm or model for it to start learning how to transcribe Kiswahili speech. This is because it is then able to map each word to its respective sound.
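Editor's illustration: the text-plus-audio pairing described above can be sketched in a few lines of Python. This is only a minimal sketch under stated assumptions, not how Common Voice is actually implemented. The names `Clip`, `split_into_sentences` and `record`, the file paths, and the sample sentences are all hypothetical, and no real audio is captured.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Clip:
    """One training example: a sentence and the recording of it being read."""
    sentence: str
    audio_path: str  # hypothetical location of the contributor's recording
    speaker_id: str  # hypothetical contributor identifier


def split_into_sentences(text: str) -> list[str]:
    # Naive sentence splitting on full stops; a real pipeline would be more careful.
    return [part.strip() for part in text.split(".") if part.strip()]


def record(sentence: str, speaker_id: str, index: int) -> Clip:
    # Stand-in for a contributor reading the sentence aloud.
    # Only the pairing of text and audio file is modelled here.
    return Clip(
        sentence=sentence,
        audio_path=f"clips/{speaker_id}_{index}.mp3",
        speaker_id=speaker_id,
    )


# Collected text is broken into sentences, and each sentence is paired with a recording.
collected_text = "Habari ya asubuhi. Karibu Nairobi."
sentences = split_into_sentences(collected_text)
dataset = [record(s, "speaker01", i) for i, s in enumerate(sentences)]

for clip in dataset:
    print(clip.audio_path, "->", clip.sentence)
```

Each `Clip` is one (audio, text) pair of the kind a speech recognition model would learn from.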

It is important because the datasets can be used to develop end-user products. For example, the transcriptions can be used on platforms like Zoom or Google Meet.

Other than Luhya and Kiswahili, which other languages have you worked on?

Earlier on in my research days, I worked on a task known as machine translation. This is what tools like Google Translate do: you can type in English and get a French translation. During this time, I worked on several Kenyan languages such as Kamba, Kikuyu and Luo. When it comes to speech recognition, I started with the Luhya language and I am currently working on Kiswahili.

What are your future plans?

When I am done with Kiswahili, I would like to continue with my work on Kenyan languages. During my work I have come to realize that the translations usually pivot on English, say a Kamba-English or Luhya-English translation. I am thinking of changing this pivot to Kiswahili, such that we can have a Kiswahili-Kamba or Kiswahili-Luhya translation model. This is more so because we have over 200 million Kiswahili speakers worldwide.