Funded Projects

Urdu Search Engine

Principal Investigator’s Organization (PIO):
Al-Khwarizmi Institute of Computer Science, University of Engineering and Technology, Lahore
Principal Investigator (PI):
Dr. Ghalib A Shah
Project Details:
Start Date 01-Aug-2016
Duration 29 months
Budget PKR 33.22 million
Status Ongoing Project
Progress Report N/A
Publications N/A
Thematic Area Education
Project Website
Executive Summary

Research indicates that indigenously developed search engines are more successful in communities accessing localized content, primarily because they offer language- and culture-specific services. For example, up to 2012 Google held only 8%, 22% and 31% of the search market in South Korea, China and Japan respectively, considerably smaller shares than those of the locally developed search engines. As the availability and usage of online content in local languages increase, it is becoming essential to provide Pakistani users with efficient access to relevant content through electronic devices, including low-end mobile phones, and to learn how our communities access this content in order to drive national content development policies and the commercial content development market.
This project will develop the ‘Urdu Search Engine’ to address these national and linguistic needs and to incubate much-needed expertise in this area of research and development. The project will work on three aspects: high-performance distributed computing, content search optimization and local content management. Online content will be crawled and can optionally be filtered according to user opt-in requests, which will require developing both language identification and filtering algorithms. Once the information is sifted, the indexing scheme will be tuned for efficient retrieval. The content will also be summarized for quicker access through computers and mobile phones; it will be stored initially on Amazon Web Services and later on local compute infrastructure. Presentation of the Urdu content will be tuned for access through various devices.
Though based on open source search technology, the work still presents multi-faceted challenges. Automatic language identification is needed to ensure that Urdu content is appropriately tagged after crawling and not mixed with content in Persian, Arabic, Pushto and other languages that share common vocabulary.
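To make the language identification challenge concrete, the following is a minimal sketch of one possible heuristic, not the project's actual method. It assumes that a crawled page can be tagged by counting letters that are typical of Urdu or Pushto but absent from the standard Arabic and Persian alphabets. The marker sets, the function name guess_language and the min_ratio threshold are all hypothetical choices made for illustration; a production system would more likely rely on statistical models trained on labelled text.

```python
# Illustrative heuristic only: the marker sets below are assumptions chosen
# for this sketch and are not exhaustive.

# Letters typical of Urdu: tteh, ddal, rreh, noon ghunna,
# heh doachashmee, yeh barree.
URDU_MARKERS = set("\u0679\u0688\u0691\u06BA\u06BE\u06D2")

# Letters typical of Pushto: teh with ring, dal with ring, reh with ring,
# seen with dots below and above, reh with dots below and above,
# noon with ring, hah with hamza above, hah with three dots above.
PUSHTO_MARKERS = set("\u067C\u0689\u0693\u069A\u0696\u06BC\u0681\u0685")


def guess_language(text: str, min_ratio: float = 0.02) -> str:
    """Guess a coarse language tag for text written in Arabic script.

    Returns "urdu", "pushto", "other-arabic-script" or "unknown".
    min_ratio is a hypothetical threshold: the minimum share of marker
    letters needed before a specific language is claimed.
    """
    # Keep only characters from the Unicode Arabic block (U+0600..U+06FF).
    chars = [ch for ch in text if "\u0600" <= ch <= "\u06FF"]
    if not chars:
        return "unknown"

    urdu = sum(ch in URDU_MARKERS for ch in chars)
    pushto = sum(ch in PUSHTO_MARKERS for ch in chars)

    # Too few marker letters: Arabic, Persian or another language.
    if max(urdu, pushto) / len(chars) < min_ratio:
        return "other-arabic-script"
    return "urdu" if urdu >= pushto else "pushto"
```

A script-based check like this cannot separate Persian from Arabic, and it will mislabel other languages that share the Urdu letters, which is part of why the summary treats language identification as a research problem rather than a solved step.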
Further, search should be linguistically intelligent: ignoring Urdu stop words, tokenizing properly and searching through the different morphologically relevant forms of Urdu keywords. In addition, the resulting pages need to be ranked and ordered according to the optional user-initiated filtering. Finally, both summarizing the results for mobile phone access and determining users’ choices and linguistically acceptable presentation forms require detailed analysis before implementation.
To build the initial user base for the search engine, a marketing campaign will be organized. As the user base grows, online contextual advertising and other services will also be introduced to generate revenue for the sustainability and growth of the project. There will also be an opportunity to make user search trend data available for commercial use and policy development. In addition, the language technology for language identification, text summarization, content filtering, etc. can be commercialized independently.
The ability to get relevant Urdu information from online sources, with access through mobile phones, provides a great opportunity to the general public across Pakistan. This also opens further research opportunities to provide similar services in other Pakistani languages. The data will be crucial to spark better online marketing at more affordable rates and to drive policy on online content development and its presentation. Thus, the project presents both social and economic promise at a national scale.
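As an illustration of the linguistically intelligent search described in this summary, the sketch below shows how a query might be tokenized and stripped of stop words before it reaches the index. It is a simplified example under stated assumptions, not the project's implementation: the stop-word list is a deliberately tiny hypothetical sample, the names tokenize and content_terms are invented for this sketch, and morphological matching of keyword forms is not attempted.

```python
import re

# Hypothetical, deliberately tiny stop-word sample written as Unicode escapes.
# A real list would be curated by linguists and be far longer.
URDU_STOP_WORDS = {
    "\u06A9\u0627",        # ka
    "\u06A9\u06CC",        # ki
    "\u06A9\u06D2",        # ke
    "\u0645\u06CC\u06BA",  # mein
    "\u06C1\u06D2",        # hai
    "\u0627\u0648\u0631",  # aur
    "\u0633\u06D2",        # se
    "\u06A9\u0648",        # ko
}

# Split on whitespace, Arabic-script punctuation (comma U+060C,
# semicolon U+061B, question mark U+061F, Urdu full stop U+06D4)
# and common Latin punctuation.
SPLIT_RE = re.compile(r"[\s\u060C\u061B\u061F\u06D4.,!?;:]+")


def tokenize(text: str) -> list[str]:
    """Split text into word tokens, dropping empty strings."""
    return [token for token in SPLIT_RE.split(text) if token]


def content_terms(query: str) -> list[str]:
    """Return the query tokens that are not in the sample stop-word list."""
    return [token for token in tokenize(query) if token not in URDU_STOP_WORDS]
```

For a query such as "Lahore ka mausam" written in Urdu script, content_terms would keep the words for "Lahore" and "mausam" (weather) and drop the postposition "ka". Whitespace splitting is only a starting point, since Urdu text often omits or misplaces spaces between words, which is one reason the summary calls for proper tokenization.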