Pakistani researchers lay the foundation for building Urdu speech recognition system
An Urdu speech recognition program has become more likely now that the most fundamental tool needed for it has been developed at Lahore’s Information Technology University.
Linguistic technology expert Dr Agha Ali Raza and his team at ITU’s Center for Speech and Language Technologies (CSaLT) laboratory have released for public use a corpus of Urdu sentences that covers all the distinct sounds, called phonemes by linguists, used in everyday speech.
This corpus, comprising 708 sentences that cover all 63 phonemes, will soon be available for download at the CSaLT website.
Those interested in developing Urdu speech recognition software will now have access to the most basic ingredient needed for the purpose.
They will just need a repository of words used in everyday speech to proceed with developing the application, says Dr. Raza.
“Speech recognition is a two-step process. The corpus will give the computer application access to all possible phonemes used in formation of meaningful Urdu words from everyday speech,” he says.
Though there are 63 distinct phonemes in Urdu, in everyday speech these don’t correspond to 63 distinct sounds. Dr. Raza explains that the sound made for a phoneme may vary from one utterance to another depending on the phonemes used before and after it in a word.
Thus, he says, for every phoneme there are 63 x 63 possible (tri-phoneme) sounds, one for each combination of preceding and following phoneme. The corpus of sentences covers all these possible sounds.
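The arithmetic behind this figure can be sketched in a few lines of Python. This is only an illustration: the symbols p0 to p62 are placeholders rather than the actual Urdu phoneme inventory, and the helper names are hypothetical.

```python
from itertools import product

NUM_PHONEMES = 63  # figure quoted in the article

# Placeholder symbols p0..p62 stand in for the real Urdu phoneme inventory.
phonemes = [f"p{i}" for i in range(NUM_PHONEMES)]


def contexts_for(phoneme, inventory):
    """All (left, phoneme, right) triples for one centre phoneme."""
    return [(left, phoneme, right) for left, right in product(inventory, repeat=2)]


def triphones_in(sequence):
    """Triples of adjacent phonemes actually observed in a phoneme sequence."""
    return {tuple(sequence[i:i + 3]) for i in range(len(sequence) - 2)}


per_phoneme = len(contexts_for("p0", phonemes))  # 63 * 63 contexts per phoneme
total = per_phoneme * NUM_PHONEMES               # 63 ** 3 triples overall

print(per_phoneme, total)
```

A coverage check for a corpus would collect `triphones_in(...)` over every sentence and compare the result with the set of triples of interest.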
In the first step, words from the corpus will allow the application to train itself in the sounds of various Urdu words.
The separate repository of words will come into play in the second stage, allowing the application to choose the most appropriate words for the output sentences.
“This will enhance accuracy of the software,” Dr. Raza says.
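The two-step idea can be illustrated with a toy Python sketch. Everything in it is assumed for illustration: the romanised words, the scores and the function name are hypothetical, and the candidate list merely stands in for the output of a first stage trained on the corpus.

```python
# Hypothetical toy repository: phoneme sequences mapped to (romanised) words.
LEXICON = {
    ("k", "i", "t", "a", "b"): "kitab",
    ("k", "a", "m"): "kam",
}


def recognise_word(candidates, lexicon):
    """Second stage: return the best-scoring candidate that is a real word.

    `candidates` is a list of (phoneme_tuple, score) pairs standing in for
    the output of the first stage (a model trained on the sentence corpus).
    """
    valid = [(lexicon[seq], score) for seq, score in candidates if seq in lexicon]
    if not valid:
        return None  # nothing in the repository matches
    return max(valid, key=lambda pair: pair[1])[0]


candidates = [
    (("k", "i", "t", "a", "p"), 0.48),  # highest score, but not in the repository
    (("k", "i", "t", "a", "b"), 0.45),
    (("k", "a", "m"), 0.07),
]
best = recognise_word(candidates, LEXICON)
print(best)
```

Here the top-scoring sequence is discarded because it is not a known word, which is how a word repository can rule out meaningless output.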
Thus, the accuracy of speech recognition software depends on the written or oral sources from which words and sentences are drawn for the corpus, and on the repository maintained separately for ruling out meaningless words.
Dr. Raza and his team have relied on written material obtained from newspaper and magazine articles in Urdu to generate this corpus.
In another corpus under development at the CSaLT lab, he is using oral material recorded with permission from several thousand telephonic conversations.
“This will provide us with a more comprehensive corpus of words used in everyday speech,” he says. He expects that work on this bigger project will be completed this year.
One way speech recognition software available for international languages like English keeps improving its accuracy is by adding words spoken by users to the repository for future use.
Dr. Raza says he started work on the corpus under the supervision of Dr. Sarmad Hussain as part of his master’s thesis at the Lahore-based National University of Computer and Emerging Sciences (FAST).
He then proceeded to Carnegie Mellon University, where he completed his PhD in Language Technologies.
He and Dr. Hussain were assisted by a colleague, Huda Sarfaraz, and two linguists, Inamullah and Zahid Sarfaraz, in compiling the list of 708 sentences for the corpus.
“We hope that release of this corpus will also prove beneficial for regional languages in the country and languages lacking ample linguistic resources all over the world.
Those interested in working on those languages can follow our technique to develop similar corpora of sentences in those languages,” he says.
“The technique used in development of this corpus will work for any language for which written material is available,” he says.
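The article does not describe the selection technique itself. One common way to build a compact sentence list with broad sound coverage from written material is greedy selection, sketched below in Python. This is an assumption made for illustration, not necessarily the method used at CSaLT, and the sentence ids and units are toy data.

```python
def greedy_select(sentences, target_units):
    """Pick sentences until every target unit is covered or no sentence helps.

    `sentences` maps a sentence id to the set of units (for example phonemes
    or tri-phonemes) that the sentence contains.
    """
    uncovered = set(target_units)
    chosen = []
    while uncovered and sentences:
        # Take the sentence that covers the most still-uncovered units.
        best_id = max(sentences, key=lambda sid: len(sentences[sid] & uncovered))
        gain = sentences[best_id] & uncovered
        if not gain:
            break  # the remaining units do not occur in any sentence
        chosen.append(best_id)
        uncovered -= gain
    return chosen, uncovered


# Toy data: four "sentences" described only by the units they contain.
toy = {
    "s1": {"a", "b", "c"},
    "s2": {"c", "d"},
    "s3": {"a", "d", "e"},
    "s4": {"b"},
}
chosen, missing = greedy_select(toy, {"a", "b", "c", "d", "e", "f"})
print(chosen, missing)
```

The second return value lists units that no available sentence contains, which signals that more source text is needed.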
The other corpus under development at the lab will pave the way for working on regional languages with little or no written material.
“Recorded speech in those languages will be sufficient to provide material for speech recognition,” he says.
Linguist Tariq Rahman welcomes the development and says that such initiatives for promotion and preservation of the country’s languages have been due for a long time now.
“These resources have the potential of raising the profile of a language. The closer a language is to domains of power the higher the probability of its survival,” he says.
He adds that people tend to prioritise learning languages that offer material rewards such as better employment opportunities.
He hopes similar work will soon be started for ‘other national languages of the country’.
Rahman highlights that there may be a need for maintaining multiple databases of words and sentences to account for different varieties of Urdu spoken in the country.
There are subtle differences in Urdu in different regions of the country. Urdu of Punjab is different from that of Sindh and both have features that distinguish them from its variety in other regions like Khyber Pakhtunkhwa, he adds.
Dr. Raza says this variety has been taken into consideration in compilation of the database of oral communications.
He says the software being used to collect data has been made available all over the country.
“Initially, the users were concentrated in the Punjab but now we have reached out to all regions of the country,” he adds.
Dr. Raza says speech recognition programmes facilitate access to smartphone- and internet-based services.
Those unable to use these services because of visual impairment, amputation, paralysis or illiteracy can do so with the help of a speech recognition feature, he says.
It also allows use of smartphones and the internet in situations where it would otherwise be inconvenient, such as while driving a car.
“Now, all of this can be done in our local languages as well,” he says.
The corpus is available for download at: http://csalt.itu.edu.pk/
This article, originally published in MIT Technology Review, has been reproduced with permission.
Comments (25) Closed
A dire need of the time. Good luck to team.
Feeling excited
I am certified application developer I will definitely touch this as a new project.
Thats brilliant work.
Great, love to see more research in this area.
Great effort
Urdu a fascinating king language spoken in 60% of the world. Here in US I often speak Urdu mixed with English,and amazingly the guys fully understand. Try it.
Nice work keep it up
Great initiative, proud of you guys. Please keep up the good work.
@layman ... I love this Hilarious comment. 60%... WOW
@layman You are a "layman". Which World are we talking about? Even 6% would be way too high.
world had these technologies since 40-50 years. we are too late and to little.
@Aftab given 40-50 years ago there were no PCs, so I find it difficult to believe this technology was available. Try being a bit positive and supportive instead of being negative. By the way what is your contribution to advancement?
Although Urdu is not my mother language, but I am proud of the team, who have achieved this success. Please keep it up for our future generations.
Congratulations!
What a landmark achievement and a great news for Pakistan.
FAST, Dr Agha Ali Raza and his team have made my day.
Keep it up the good work!
One hope other Software Developers will take the task to expand this speech recognition program to other centuries old domestic languages or dialect, Pakistan's assets, to preserve and enhance them such as Punjabi, Sindhi, Pashto, Balochi and other languages.
Great initiative!
@layman Where this "60%" figure comes from?
Khuda us musafir ki himat bdhai----Jo manzil ko thukra dai manzil samajh ker.Pakistanis should take full responsibility for developing Urdu.Preparing an Urdu-Hindi dictionary is a feather in the cap.They should also try to develop the dictionary of technical terms.The same technical words should be used in all the major provincial languages of Pakistan.There could also be found a way in this regard to cooperate with Arabic,Persian,Turkish,Indonesian and Hindi.Computer departments and foreign language universities of Pakistan should cooperate in this matter.Pakistanis can do it.
My appreciation for the researchers. Brilliant work. Wish i can give you state recognition, so that younger scientists get motivated to do authentic work.
The should be done for Sindhi and other national languages of Pakistan. Sindhi Language Authority and Sindh University and Sindh Govt Should pay some heed to language recognization tools and take necessary steps as well.
@layman spoken in 60% of the world? where did you take this number from? Even in Pakistan Urdu is merely a mother tongue of less than 5% people. even in rural Pakistan people can't speak urdu.
Awesome! Thumbs up
@Azam A Siddiqui, As a machine learning and pattern recognition researcher, I want to update you all that since few years a work on Urdu data processing has been started. OCRs have been developed for Synthetic Urdu text. Work on handwritten Urdu dataset and Urdu scene text recognition has been initiated and experimented but the research is still going on. You can google about these research.
Awesome guys. Just wonderful and superb