Pakistani researchers lay the foundation for building Urdu speech recognition system

An Urdu speech recognition program has become a more realistic prospect now that the most fundamental tool needed for it has been developed at Lahore’s Information Technology University.

Linguistic technology expert Dr Agha Ali Raza and his team at ITU’s Center for Speech and Language Technologies (CSaLT) laboratory have released for public use a corpus of Urdu sentences that covers all the distinct sounds, called phonemes by linguists, used in everyday speech.

This corpus, comprising 708 sentences that cover all 63 phonemes, will soon be available for download at the CSaLT website.

Those interested in developing Urdu speech recognition software will now have access to the most basic ingredient needed for the purpose.

They will just need a repository of words used in everyday speech to proceed with developing the application, says Dr. Raza.

“Speech recognition is a two-step process. The corpus will give the computer application access to all possible phonemes used in formation of meaningful Urdu words from everyday speech,” he says.

Though there are 63 distinct phonemes in Urdu, in everyday speech these don’t correspond to just 63 distinct sounds. Dr. Raza explains that the sound made for a phoneme may vary from one utterance to another depending on the phonemes used before and after it in a word.

Thus, he says, for every phoneme x, there will be 63 x 63 (3,969) possible tri-phoneme sounds, one for each combination of preceding and following phoneme. The corpus of sentences covers all these possible sounds.
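The arithmetic can be sketched in a few lines of Python. This is an illustration only: the phoneme labels are placeholders rather than the actual Urdu inventory, and the helper name `triphones` is hypothetical. It counts the possible left and right contexts for one phoneme and lists the tri-phoneme windows in a short sequence.

```python
# Illustrative sketch; phoneme labels are placeholders, not the real Urdu inventory.
NUM_PHONEMES = 63  # figure quoted in the article

# Any of the 63 phonemes may come before, and any of the 63 after, a given phoneme.
contexts_per_phoneme = NUM_PHONEMES * NUM_PHONEMES


def triphones(phonemes):
    """Return every (previous, current, next) window in a phoneme sequence."""
    return [tuple(phonemes[i - 1:i + 2]) for i in range(1, len(phonemes) - 1)]


if __name__ == "__main__":
    print(contexts_per_phoneme)
    print(triphones(["k", "i", "t", "a", "b"]))
```

Collecting these windows over every sentence in a corpus, and comparing the resulting set with the contexts one wants covered, is one simple way to check coverage.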

In the first step, words from the corpus will allow the application to train itself in the sounds of various Urdu words.

The separate repository of words will come into play in the second stage, allowing the application to choose the most appropriate words for the output sentences.

“This will enhance accuracy of the software,” Dr. Raza says.
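As a rough illustration of the second stage, the sketch below filters candidate words against a separate word repository. Everything in it is assumed for the example: the candidate lists, their scores, the tiny `LEXICON` set and the function name `pick_words` are hypothetical and do not describe the CSaLT implementation.

```python
# Hypothetical sketch of the second stage: keep only candidates that appear in a
# separate repository of real words. All data below is made up for illustration.
LEXICON = {"kitab", "qalam", "pani"}


def pick_words(candidates_per_slot, lexicon):
    """For each slot, return the highest-scoring candidate found in the lexicon.

    candidates_per_slot is a list of lists of (word, score) pairs, as a
    first-stage recogniser might produce. Slots with no valid candidate give None.
    """
    chosen = []
    for candidates in candidates_per_slot:
        valid = [(word, score) for word, score in candidates if word in lexicon]
        best = max(valid, key=lambda pair: pair[1])[0] if valid else None
        chosen.append(best)
    return chosen


if __name__ == "__main__":
    slots = [
        [("kitab", 0.6), ("kitap", 0.7)],  # "kitap" scores higher but is not a word
        [("panee", 0.5), ("pani", 0.4)],
    ]
    print(pick_words(slots, LEXICON))
```

Real systems typically weigh word probabilities rather than applying a hard filter; the hard filter is used here only because it mirrors the article's description of ruling out meaningless words.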

Thus, the accuracy of speech recognition software depends on the written or oral sources from which words and sentences are drawn for the corpus, and on the separately maintained repository used to rule out meaningless words.

Dr. Raza and his team have relied upon written material obtained from newspaper and magazine articles in Urdu to generate this corpus.

In another corpus under development at the CSaLT lab, he is using oral material recorded with permission from several thousand telephonic conversations.

“This will provide us with a more comprehensive corpus of words used in everyday speech,” he says. He expects that work on this bigger project will be completed this year.

One way speech recognition software for international languages like English keeps improving its accuracy is by adding words spoken by users to the repository for future use.

Dr. Raza says he started work on the corpus under the supervision of Dr. Sarmad Hussain as part of his master’s thesis at the Lahore-based National University of Computer and Emerging Sciences (FAST).

He then proceeded to Carnegie Mellon University, where he completed his PhD in Language Technologies.

He and Dr. Hussain were assisted by a colleague, Huda Sarfaraz, and two linguists, Inamullah and Zahid Sarfaraz, in compiling the list of 708 sentences for the corpus.

“We hope that release of this corpus will also prove beneficial for regional languages in the country and languages lacking ample linguistic resources all over the world.

“Those interested in working on those languages can follow our technique to develop similar corpora of sentences in those languages,” he says.

“The technique used in development of this corpus will work for any language for which written material is available,” he says.

The other corpus under development at the lab will pave the way for working on regional languages with little or no written material.

“Recorded speech in those languages will be sufficient to provide material for speech recognition,” he says.

Linguist Tariq Rahman welcomes the development and says that such initiatives for promotion and preservation of the country’s languages have been due for a long time now.

“These resources have the potential of raising the profile of a language. The closer a language is to domains of power the higher the probability of its survival,” he says.

He adds that people tend to prioritise learning languages that offer material rewards like better employment opportunities.

He hopes similar work will soon be started for ‘other national languages of the country’.

Rahman highlights that there may be a need for maintaining multiple databases of words and sentences to account for different varieties of Urdu spoken in the country.

There are subtle differences in Urdu across different regions of the country. The Urdu of Punjab is different from that of Sindh, and both have features that distinguish them from the varieties spoken in other regions like Khyber Pakhtunkhwa, he adds.

Dr. Raza says this variety has been taken into consideration in compiling the database of oral communications.

He says the software being used to collect data has been made available all over the country.

“Initially, the users were concentrated in the Punjab but now we have reached out to all regions of the country,” he adds.

Dr. Raza says speech recognition programmes facilitate access to smartphone- and internet-based services.

Those unable to use these services because of visual impairment, amputation, paralysis or illiteracy can do so with the help of speech recognition, he says.

It also allows the use of smartphones and the internet in situations where it would otherwise be inconvenient, such as while driving a car.

“Now, all of this can be done in our local languages as well,” he says.

The corpus is available for download at: http://csalt.itu.edu.pk/

This article, originally published in MIT Technology Review, has been reproduced with permission.

Comments (25) Closed

Yaser Arafat

Jan 23, 2017 03:41pm

A dire need of the time. Good luck to team.

Sana

Jan 23, 2017 03:43pm

Feeling excited

Philosopher(from japan)

Jan 23, 2017 04:04pm

I am certified application developer I will definitely touch this as a new project.

Asad Kazmi

Jan 23, 2017 04:16pm

Thats brilliant work.

hassan

Jan 23, 2017 04:50pm

Great, love to see more research in this area.

Zubair

Jan 23, 2017 05:22pm

Great effort

layman

Jan 23, 2017 05:41pm

Urdu a fascinating king language spoken in 60% of the world. Here in US I often speak Urdu mixed with English,and amazingly the guys fully understand. Try it.

Mohaisin Sharif

Jan 23, 2017 06:25pm

Nice work keep it up

Jami Khan

Jan 23, 2017 06:38pm

Great initiative, proud of you guys. Please keep up the good work.

Hello Man

Jan 23, 2017 06:50pm

@layman ... I love this Hilarious comment. 60%... WOW

Javed

Jan 23, 2017 06:59pm

@layman You are a "layman". Which World are we talking about? Even 6% would be way too high.

Aftab

Jan 23, 2017 07:41pm

The world has had these technologies for 40-50 years. We are too late and too little.

Ghaznavi

Jan 23, 2017 08:03pm

@Aftab given 40-50 years ago there were no PCs, so I find it difficult to believe this technology was available. Try being a bit positive and supportive instead of being negative. By the way what is your contribution to advancement?

KHALID YOUSAFZAI (UK)

Jan 23, 2017 08:24pm

Although Urdu is not my mother language, but I am proud of the team, who have achieved this success. Please keep it up for our future generations.

Salman

Jan 23, 2017 09:19pm

Congratulations! What a landmark achievement and a great news for Pakistan. FAST, Dr Agha Ali Raza and his team have made my day. Keep it up the good work!

Salman

Jan 23, 2017 09:32pm

One hope other Software Developers will take the task to expand this speech recognition program to other centuries old domestic languages or dialect, Pakistan's assets, to preserve and enhance them such as Punjabi, Sindhi, Pashto, Balochi and other languages.

Syyab Rahi

Jan 23, 2017 11:50pm

Great initiative!

Guest from USA

Jan 24, 2017 04:28am

@layman Where this "60%" figure comes from?

Azam A Siddiqui

Jan 24, 2017 05:46am

Khuda us musafir ki himat bdhai, Jo manzil ko thukra dai manzil samajh ker. Pakistanis should take full responsibility for developing Urdu. Preparing an Urdu-Hindi dictionary is a feather in the cap. They should also try to develop the dictionary of technical terms. The same technical words should be used in all the major provincial languages of Pakistan. There could also be found a way in this regard to cooperate with Arabic, Persian, Turkish, Indonesian and Hindi. Computer departments and foreign language universities of Pakistan should cooperate in this matter. Pakistanis can do it.

AJK

Jan 24, 2017 09:26am

My appreciation for the researchers. Brilliant work. Wish i can give you state recognition, so that younger scientists get motivated to do authentic work.

Idealist

Jan 24, 2017 11:58am

This should be done for Sindhi and other national languages of Pakistan. Sindhi Language Authority, Sindh University and the Sindh Govt should pay some heed to language recognition tools and take the necessary steps as well.

Idealist

Jan 24, 2017 11:56am

@layman spoken in 60% of the world? where did you take this number from? Even in Pakistan Urdu is merely a mother tongue of less than 5% people. even in rural Pakistan people can't speak urdu.

Muhammad Jamal Maqsood

Jan 24, 2017 12:21pm

Awesome! Thumbs up

Saad.Bin.Ahmed

Jan 24, 2017 12:54pm

@Azam A Siddiqui, As a machine learning and pattern recognition researcher, I want to update you all that since few years a work on Urdu data processing has been started. OCRs have been developed for Synthetic Urdu text. Work on handwritten Urdu dataset and Urdu scene text recognition has been initiated and experimented but the research is still going on. You can google about these research.

Pak soul

Jan 24, 2017 07:59pm

Awesome guys. Just wonderful and superb
