Although natural language processing tools have been developed for many languages, they have been applied to Arabic only to a limited extent, owing to ambiguities and complexities in the way it is written and the large number of regional dialects commonly used in informal communication. This technology comprises two software packages, MADAMIRA (Morphological Analysis and Disambiguation for Dialectal Arabic) and AIDA (Automatic Identification and Glossing of Dialectal Arabic), that work in tandem to perform the linguistic analyses of raw Arabic text required for natural language processing, assign dialect labels at both the sentence and token levels, and generate an English translation. Used together, these technologies may be used to identify and translate a variety of written or spoken Arabic dialects into English for commercial and educational purposes.
A flexible software package that extracts and clarifies the linguistic information needed for accurate Arabic language processing
Developed in conjunction, the MADAMIRA and AIDA software suites generate linguistic information such as tokenization, diacritization, and part-of-speech tags from written Arabic words in a variety of dialects. By combining machine-learning algorithms with established morphological analyzers, the software reduces ambiguity while limiting bias. The accuracy of these programs has been demonstrated at the Center for Computational Learning Systems at Columbia University. Furthermore, because AIDA is Java-based, it can be deployed as a web-based interface or packaged for offline processing, including on mobile devices. Together, these software packages allow for the rapid translation of various Arabic dialects into English with high accuracy and a low level of ambiguity.
Lead Inventor:
Owen Rambow, Ph.D.
Applications:
- Text-to-speech
- Named entity recognition
- Automatic speech recognition
- Identification and translation of Arabic dialects for educational and commercial applications
- Development of a dialectal Arabic dictionary
- Development of a voice-recognition translation device for spoken Arabic
Advantages:
- Single software package capable of performing several natural language processing tasks
- Unbiased natural language processing for Arabic, providing flexibility for developers building applications requiring NLP
- High accuracy in predicting linguistic features such as part-of-speech tags, lemmas, diacritization, and tokenization
- Output can be configured to the user's preference
- Includes Egyptian, Iraqi, Levantine, and Moroccan dialects
- Can be used on any computing platform capable of running Java, including mobile devices
- Can be used as a stand-alone program or in a client-server mode
Tech Ventures Reference: IR CU14012, CU14014
Related Publications:
- H. Elfardy, M. Diab. AIDA: Automatic Identification & Glossing of Dialectal Arabic. Poster Session, Proceedings of European Association for Machine Translation (EAMT 2012), Trento, Italy, 2012.
- H. Elfardy, M. El Badrashiny, M. Diab. Sentence-Level Dialect Identification in Arabic. Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics, ACL 2013, Sofia, Bulgaria, 2013.
- H. Elfardy, M. Diab. Token Level Identification of Linguistic Code Switching. Proceedings of COLING, Mumbai, India, 2012.
- M. Diab, P. Dasigi. CODACT: Towards Identifying Orthographic Variants in Dialectal Arabic. Proceedings of IJCNLP 2011, November, Chiang Mai, Thailand, 2011.