Tafseer Ahmed, a 28-year-old software engineer and lecturer in Computer Science at the University of Karachi, has developed an Urdu Text Miner. The software comprises three modules: a summarizer, a keyword detector and a categorizer.
The software is also available as a website (.urduopensource.org/miner), while a desktop version is also available. Readers can test the software with Unicode-based Urdu text taken from the BBC Urdu website.
With the dramatic increase in the use of computers in daily life, an enormous amount of digital text is created every day, and it becomes almost impossible to sift through it all for useful information. Text mining techniques are used to overcome this problem. IBM has already launched a Text Miner package that caters to English readers and includes eight tools, among them summarization, categorization and clustering tools.
The objective of developing the Urdu Text Miner is to provide a similar tool for the Urdu language. Since Urdu content is on the rise, especially after the launch of the BBC Urdu website, this software is bound to become a useful tool for extracting required information from Urdu websites.
The software application uses algorithms for summarization, categorization and key-phrase detection. A list of stop (or noise) words is required for these algorithms to function. These are the words used most commonly in everyday speech, such as common verbs, articles, nouns, pronouns and conjunctions, which carry little information about what a document is about. Examples are “the” and “is” in English, and ka, ki, kay, mein and aur in Urdu.
A large amount of Urdu text was collected and processed to develop the list of stop words, and a number of Urdu text files were taken from the BBC Urdu website for this task. The algorithms also need stemming, which reduces different forms of a word to a common root: for instance, go and goes, runs and running, and effect and effectiveness share the same roots.
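The stop-word step described above can be illustrated with a minimal Python sketch. It assumes that stop words are simply the most frequent tokens in a collected corpus and that tokens are separated by whitespace; the article does not say how the actual list was compiled, and the function names `build_stop_words` and `remove_stop_words` are hypothetical.

```python
from collections import Counter


def build_stop_words(documents, top_n=5):
    """Return the top_n most frequent whitespace-separated tokens.

    Assumption: raw corpus frequency is a good enough signal for
    stop words; the Urdu Text Miner may use a different criterion.
    """
    counts = Counter()
    for text in documents:
        counts.update(text.split())
    return [word for word, _ in counts.most_common(top_n)]


def remove_stop_words(text, stop_words):
    """Drop every token that appears in the stop-word list."""
    stop = set(stop_words)
    return [word for word in text.split() if word not in stop]
```

With a real corpus, the resulting list would normally be reviewed by hand, since frequent topical words can slip in alongside genuine function words such as ka and aur.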
Prior to this project, no stemming algorithm existed for Urdu text. A stemming engine has therefore been developed for the first time, and it works as an integral part of the Urdu Text Miner.
The summarizer takes any Urdu document encoded in Unicode as input and summarizes it to the desired percentage of its size, anywhere from 1 to 99 per cent. It does so by selecting the important sentences from the document.
It uses statistical methods to detect the keywords of the document, and then gives a “weight” to each sentence according to these keywords. The sentences with the largest weights are selected as the summary.
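The weighting scheme just described can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the program's actual code: word frequency stands in for the unspecified "statistical methods", the percentage is applied to the number of sentences, and `split_sentences` and `summarize` are hypothetical names.

```python
import re
from collections import Counter

# Sentence terminators: Urdu full stop, Latin . ! ?, Arabic question mark.
TERMINATORS = "۔.!?؟"
# Terminators plus Arabic and Latin commas, stripped from word edges.
PUNCTUATION = TERMINATORS + "،,"


def split_sentences(text):
    """Split after a terminator that is followed by whitespace."""
    parts = re.split(r"(?<=[" + TERMINATORS + r"])\s+", text)
    return [p.strip() for p in parts if p.strip()]


def summarize(text, percent, stop_words=()):
    """Keep roughly `percent` per cent of the sentences, highest weight first."""
    stop = set(stop_words)
    sentences = split_sentences(text)
    tokenized = [
        [w.strip(PUNCTUATION) for w in sentence.split()]
        for sentence in sentences
    ]
    # Assumption: a word's keyword score is its frequency in the document.
    freq = Counter(
        w for words in tokenized for w in words if w and w not in stop
    )
    # A sentence's weight is the sum of the scores of its words.
    weights = [
        sum(freq[w] for w in words if w and w not in stop)
        for words in tokenized
    ]
    keep = max(1, round(len(sentences) * percent / 100))
    ranked = sorted(range(len(sentences)), key=lambda i: weights[i], reverse=True)
    chosen = sorted(ranked[:keep])  # restore original document order
    return " ".join(sentences[i] for i in chosen)
```

Returning the chosen sentences in their original order keeps the summary readable, which is a design choice of this sketch rather than something the article specifies.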
The keyword detection tool selects single-word or multi-word keywords from a given document. It applies statistical analysis to detect keywords of up to three words, and decides whether a multi-word keyword, its single-word components, or both should be selected as keywords.
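One way to picture this decision is the following Python sketch. It assumes a simple rule that the article does not confirm: a phrase of one to three words is a candidate if it occurs at least `min_count` times and neither starts nor ends with a stop word, and a shorter phrase is dropped when it never occurs outside a longer candidate. The names `ngrams` and `detect_keywords` are hypothetical.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all runs of n consecutive tokens as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def detect_keywords(tokens, stop_words=(), min_count=2, max_len=3):
    """Pick frequent phrases of up to max_len words as keywords."""
    stop = set(stop_words)
    counts = Counter()
    for n in range(1, max_len + 1):
        for gram in ngrams(tokens, n):
            # Skip phrases that begin or end with a stop word.
            if gram[0] in stop or gram[-1] in stop:
                continue
            counts[gram] += 1

    candidates = {g: c for g, c in counts.items() if c >= min_count}
    keywords = dict(candidates)

    # Assumed rule: if a component is exactly as frequent as the longer
    # phrase, it only ever appears inside that phrase, so keep the phrase
    # alone. If it is more frequent, it also occurs on its own: keep both.
    for gram in sorted(candidates, key=len, reverse=True):
        for n in range(1, len(gram)):
            for part in ngrams(gram, n):
                if candidates.get(part) == candidates[gram]:
                    keywords.pop(part, None)

    return sorted(" ".join(gram) for gram in keywords)
```

When a multi-word phrase falls below `min_count` but one of its words does not, only the single word is returned, which covers the third case mentioned above.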
The categorization tool, which is not available on the website, takes documents and their categories (or classes) as input and generates a categorizer based on the keywords of these training documents. The categorizer then classifies a new document by matching it against all existing categories and selecting the best match.
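The train-then-match idea can be sketched in Python as follows. The sketch assumes that each category is represented by its most frequent non-stop words and that the best match is the category whose keywords occur most often in the new document; the real tool's scoring is not described in the article, and `train_categorizer` and `categorize` are hypothetical names.

```python
from collections import Counter, defaultdict


def train_categorizer(labeled_docs, stop_words=(), top_k=10):
    """Build one keyword set per category from (text, category) pairs."""
    stop = set(stop_words)
    counts = defaultdict(Counter)
    for text, category in labeled_docs:
        counts[category].update(w for w in text.split() if w not in stop)
    # Assumption: the top_k most frequent words serve as category keywords.
    return {
        category: {word for word, _ in counter.most_common(top_k)}
        for category, counter in counts.items()
    }


def categorize(text, profiles, stop_words=()):
    """Return the category whose keywords occur most often in the text."""
    stop = set(stop_words)
    words = Counter(w for w in text.split() if w not in stop)

    def score(category):
        return sum(words[w] for w in profiles[category])

    # Ties go to the category that was seen first during training.
    return max(profiles, key=score)
```

A short usage example:

```python
training = [
    ("cricket match team score", "sports"),
    ("bank market price", "business"),
]
profiles = train_categorizer(training)
category = categorize("team win the match", profiles)
```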
In 2001, Tafseer Ahmed developed the first English-to-Urdu translation software, “Urdu Mashini Mutarjim.” Despite being the first of its kind, the demo version failed to attract the attention of Pakistani entrepreneurs and authorities, and it remains at the initial demo stage, with a vocabulary of about 1,000 words. The second phase, which involves full development, is both cost- and workforce-intensive and requires sound investment.
But for the apathy and indifference towards local innovators that are the hallmark of research bodies and industrial organizations in this country, this useful software would have set new trends in Pakistan.
The writer is a science journalist and editor, Global Science, Karachi