The Arabic morphological transducer and finite state tools, such as tokenizer and guesser, are developed using finite state technology.
Mohammed Attia
admin@attiaspace.com

Research Scientist, Oct 2013
George Washington University.

Lexicographer, Nov 2012 - Oct 2013
Oxford University Press.

Research Fellow, Feb - Oct 2012
The British University in Dubai,
United Arab Emirates.

Post-doctoral Researcher, Sept 2009 - Dec 2011,
School of Computing, Dublin City University,
Dublin, Ireland.

Lecturer in Linguistics, Sept 2008 - Aug 2009,
Al-Azhar University, Cairo, Egypt.

Ph.D. in Computational Linguistics, Apr 2004 - May 2008,
School of Languages, Linguistics and Cultures,
The University of Manchester, UK.

Translator and Web Developer, Jan 1996 - Apr 2004
Harf Information Technology,
Egypt.

Mohammed A. Attia
Research outcomes:
 
1. AraComLex Lexical Database for Modern Standard Arabic
 
 
I developed a large-scale, corpus-driven lexical database for Modern Standard Arabic containing 30,000 lemmas, following modern lexicographic practices. I also developed a web application (a dictionary writing system) for curating the database: [View here]

 
 
2. AraComLex Open-Source Morphological Analyser for Modern Standard Arabic
 
 
I developed AraComLex (Arabic Computer Lexicon), an open-source, large-scale finite state morphological transducer for processing Arabic texts, containing more than 30,000 lemmas. Its competitive edge over Buckwalter's morphology is that it tries to specialize purely in MSA, avoiding the noise from Classical Arabic and the incorrect word-clitic combinations that are rampant in Buckwalter's morphology. AraComLex is compatible with the open-source finite state compiler Foma. All you need to do is download Foma, download AraComLex from Sourceforge.net, and read the README file to learn how to compile it. You can compile the transducer under Windows, Linux or Mac OS X.

 
 
3. Arabic LFG Rule-based Parser for Modern Standard Arabic
 
 
I developed the first Arabic rule-based parser for Modern Standard Arabic to be freely available on the internet, using XLE. The parser outputs a phrase structure tree (c-structure) and a dependency structure (f-structure). It is hosted by Bergen University in Norway, along with parsers for English, German, Malagasy, Norwegian and Welsh. Test the parser here

 
 
4. Arabic Morphology Patterns
 
 
I developed a database of 490 templatic patterns for Arabic (الأوزان الصرفية في اللغة العربية) that has been successfully used in detecting unknown words in a statistical parser and in lexical profiling tasks. [Download from Sourceforge.net]

 
 
5. Arabic Subcategorization Frames in the LFG Parser
 
 
I manually developed a list of subcategorization frames to be used in the Arabic LFG parser, containing 2901 lemma-frame types. [Download from Sourceforge.net]

 
 
6. Arabic Subcategorization Frames in the Arabic Treebank
 
 
I automatically extracted the list of subcategorization frames (following the LFG syntactic theory) from the Arabic Treebank, containing 7746 lemma-frame types for verbs, nouns and adjectives. [Download from Sourceforge.net]

 
 
7. Arabic Wordlist for Spellchecking
 
 
I developed an Arabic word list for spell checking containing 9 million Arabic words. The words are automatically generated from the AraComLex open-source finite state transducer and from a one billion word corpus. The entire list is validated against the Microsoft Word spell checker. [Download from Sourceforge.net]

 
 
8. Named Entities and Multiword Expressions
 
 
I developed the largest lexical database for named entities and multiword expressions to date, using automatic methods to process a large corpus of over one billion words. The resources comprise 34,658 Arabic multiword expressions [Download from Sourceforge.net] and 45,202 Arabic named entities [Download from Sourceforge.net].

 
 
9. Word Count of Modern Standard Arabic
 
 
I developed a word count of Modern Standard Arabic from a 1 billion word corpus, sorted according to frequency counts. [Download from Sourceforge.net]
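
As an illustration only, a frequency-sorted word count of this kind can be sketched in Python. The whitespace tokenization, the function name `word_frequencies` and the toy sample are assumptions made for the example; they do not describe how the released list was actually produced.

```python
from collections import Counter


def word_frequencies(lines):
    """Count whitespace-separated tokens; return (word, count) pairs, most frequent first."""
    counts = Counter()
    for line in lines:
        # Assumption: plain whitespace splitting stands in for a real Arabic tokenizer.
        counts.update(line.split())
    # Sort by descending count, then by the word itself so ties are deterministic.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


# Toy two-line "corpus" (hypothetical data).
sample = ["كتاب قلم كتاب", "قلم كتاب بيت"]
ranked = word_frequencies(sample)
for word, count in ranked:
    print(word, count)
```

For a corpus of a billion words the lines would be streamed from disk rather than held in a list; the counting logic stays the same.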

 
 
10. Arabic Broken Plurals
 
 
A list of Arabic Broken Plurals automatically extracted from a large contemporary corpus, provided with morphological patterns for both the singular forms and the plural forms. It contains 2562 broken plural forms. [Download from Sourceforge.net]

 
 
11. Arabic Unknown Words - Weighted
 
 
This is a list of unknown words, that is, words that are not included in the Buckwalter Morphological Analyser version 2.0. It includes about 18,000 new lemmatized words, weighted and ordered so that the most lexicographically relevant words are likely to rise to the top and the least relevant words are pushed down the list. [Download from Sourceforge.net]

 
 
12. Obsolete Arabic Words
 
 
This is a list of obsolete words, that is, words in the Buckwalter Morphological Analyser database that are outdated or not in contemporary use. The list was developed according to a frequency threshold on the web and in the Arabic Gigaword corpus. It contains about 8,400 words that have fallen out of current use, with a margin of error of 1%. [Download from Sourceforge.net]
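
The general idea of flagging words by a frequency threshold can be sketched as follows. This is a minimal sketch under stated assumptions: the function name `find_obsolete`, the rule that a word must fall below the threshold in both sources, the threshold value and the toy counts are all hypothetical, since the text above does not say how the two frequency sources were combined.

```python
def find_obsolete(lexicon, web_freq, corpus_freq, threshold):
    """Return lexicon words whose count is below `threshold` in both frequency sources."""
    obsolete = []
    for word in lexicon:
        # Assumption: a word missing from a frequency table counts as frequency 0.
        on_web = web_freq.get(word, 0)
        in_corpus = corpus_freq.get(word, 0)
        if on_web < threshold and in_corpus < threshold:
            obsolete.append(word)
    return obsolete


# Hypothetical toy data; real input would be lemma lists and large frequency tables.
lexicon = ["w1", "w2", "w3"]
web_freq = {"w1": 5000, "w2": 2, "w3": 0}
corpus_freq = {"w1": 300, "w2": 40}

flagged = find_obsolete(lexicon, web_freq, corpus_freq, threshold=10)
print(flagged)
```

Here `w2` is kept because it is still frequent in the corpus table, while `w3` is flagged because it is rare in both.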

 
Ph.D. thesis:
Title: Handling Arabic Morphological and Syntactic Ambiguity within the LFG Framework with a View to Machine Translation.
Description: This research investigates different methodologies for managing morphological and syntactic ambiguity in Arabic. I built an Arabic parser using XLE (Xerox Linguistic Environment), which allows writing grammar rules and notations that follow the LFG formalism. I also formulated a description of the main syntactic structures in Arabic within the LFG framework.
Mohammed Attia. (2008) 'Handling Arabic Morphological and Syntactic Ambiguity within the LFG Framework with a View to Machine Translation'. PhD Thesis. School of Languages, Linguistics and Cultures, the University of Manchester. [pdf version]

Publications

Dictionaries:

Oxford Arabic Dictionary
  • Tressy Arts, Radia Benzehra, Mohammed Attia, et al. 2014. Oxford Arabic Dictionary. Oxford Arabic Dictionary, ISBN 978-0-19-958033-0. August 2014 (estimated)

Books:

Ambiguity In Arabic Computational Morphology And Syntax

Book Chapters:

Systems and Frameworks for Computational Morphology Advances in Natural Language Processing
  1. Mohammed Attia, Pavel Pecina, Lamia Tounsi, Antonio Toral, Josef van Genabith. 2011. A Lexical Database for Modern Standard Arabic Interoperable with a Finite State Morphological Transducer. In Mahlow, Cerstin; Piotrowski, Michael (Eds.) Systems and Frameworks for Computational Morphology. Second International Workshop, SFCM 2011, Zurich, Switzerland, August 26, 2011, Proceedings. Series: Communications in Computer and Information Science, Vol. 100. 1st Edition. [pdf version]
  2. Mohammed Attia. (2006) 'Accommodating Multiword Expressions in an Arabic LFG Grammar'. In T. Salakoski et al. (Eds.): Advances in Natural Language Processing. FinTAL 2006, Lecture Notes in Computer Science. Vol. 4139, pp. 87 - 98, 2006. Springer-Verlag Berlin Heidelberg 2006. [pdf version]

Journal Papers:

  • Mohammed Attia, Pavel Pecina, Antonio Toral, Josef van Genabith. 2013. A Corpus-Based Finite-State Morphological Toolkit for Contemporary Arabic. Journal of Logic and Computation 2013; doi: 10.1093/logcom/exs070. Oxford University Press. [pdf version]

Theses:

  • Mohammed Attia. 2008. Handling Arabic morphological and syntactic ambiguity within the LFG framework with a view to machine translation. Ph.D. Thesis. School of Languages, Linguistics and Cultures, the University of Manchester, UK. [pdf version]
  • Mohammed Attia. 2002. Implications of the agreement features in machine translation. Master's Thesis. Faculty of Languages and Translation, Al-Azhar University, Cairo, Egypt. [pdf version]

Conference Papers:

  1. Mona Diab, Mohamed AlBadrashiny, Maryam Aminian, Mohammed Attia, Heba Elfardy, Nizar Habash and Abdelati Hawwari. (2014) Towards Compiling a large scale three-way Egyptian Arabic Dictionary. The 9th edition of the Language Resources and Evaluation (LREC) Conference, 26-31 May, Reykjavik, Iceland. [pdf version]
  2. Attia, Mohammed and Josef van Genabith. 2013. A Jellyfish Dictionary for Arabic. eLex2013 Conference (Electronic Lexicography in the 21st Century), Tallinn, Estonia. [pdf version]
  3. Attia, Mohammed, Pavel Pecina, Younes Samih, Khaled Shaalan, Josef van Genabith. 2012. Improved Spelling Error Detection and Correction for Arabic. COLING 2012, Mumbai, India. [pdf version]
  4. Attia, Mohammed, Younes Samih, Khaled Shaalan, Josef van Genabith. 2012. The Floating Arabic Dictionary: An Automatic Method for Updating a Lexical Database. COLING 2012, Mumbai, India. [pdf version]
  5. Khaled Shaalan, Younes Samih, Mohammed Attia, Pavel Pecina, and Josef van Genabith. 2012. Arabic Word Generation and Modelling for Spell Checking. Language Resources and Evaluation (LREC). Istanbul, Turkey. Pages: 719-725. [pdf version]
  6. Mohammed Attia, Khaled Shaalan, Lamia Tounsi, and Josef van Genabith. 2012. Automatic Extraction and Evaluation of Arabic LFG Resources. Language Resources and Evaluation (LREC). Istanbul, Turkey. Pages 1947-1954. [pdf version]
  7. Mohammed Attia, Pavel Pecina, Lamia Tounsi, Antonio Toral, Josef van Genabith. 2011. Lexical Profiling for Arabic. Electronic Lexicography in the 21st Century. Bled, Slovenia. [pdf version]
  8. Mohammed Attia, Pavel Pecina, Lamia Tounsi, Antonio Toral, Josef van Genabith. 2011. An Open-Source Finite State Morphological Transducer for Modern Standard Arabic. International Workshop on Finite State Methods and Natural Language Processing (FSMNLP). Blois, France. [pdf version]
  9. Mohammed Attia, Antonio Toral, Lamia Tounsi, Pavel Pecina, Josef van Genabith. 2010. Construction of Language Resources for Enhancing Future Information Technologies. Poster presented at the Globe Forum Dublin 2010. The Convention Centre Dublin. Ireland. [pdf version]
  10. Mohammed Attia, Antonio Toral, Lamia Tounsi, Pavel Pecina and Josef van Genabith. 2010. Automatic Extraction of Arabic Multiword Expressions. COLING 2010 Workshop on Multiword Expressions: from Theory to Applications. Beijing, China. [pdf version]
  11. Mohammed Attia, Jennifer Foster, Deirdre Hogan, Joseph Le Roux, Lamia Tounsi and Josef van Genabith. 2010. 'Handling Unknown Words in Statistical Latent-Variable Parsing Models for Arabic, English and French'. First Workshop on Statistical Parsing of Morphologically Rich Languages (SPMRL 2010), NAACL HLT. Los Angeles, CA. [pdf version]
  12. Mohammed Attia, Antonio Toral, Lamia Tounsi, Monica Monachini and Josef van Genabith. 2010. 'An automatically built Named Entity lexicon for Arabic'. LREC 2010. Valletta, Malta. [pdf version]
  13. Lamia Tounsi, Mohammed Attia and Josef van Genabith. 2009. 'Parsing Arabic Using Treebank-Based LFG Resources'. LFG09: 14th International LFG Conference, Trinity College, Cambridge, UK. [pdf version]
  14. Lamia Tounsi, Mohammed Attia and Josef van Genabith. 2009. 'Automatic Treebank-Based Acquisition of Arabic LFG Dependency Structures.' EACL-Workshop on Computational Approaches to Semitic Languages, Athens, Greece. [pdf version]
  15. Mohammed Attia. (2008) 'A Unified Analysis of Copula Constructions in LFG'. LFG08: 13th International LFG Conference, University of Sydney, Australia. [pdf version]
  16. Mohammed Attia. (2007) 'Arabic Tokenization System'. ACL-Workshop on Computational Approaches to Semitic Languages, Prague. [pdf version]
  17. Mohammed Attia. (2006) 'An Ambiguity-Controlled Morphological Analyzer for Modern Standard Arabic Modelling Finite State Networks'. The Challenge of Arabic for NLP/MT Conference, October 2006. The British Computer Society, London. [pdf version]
  18. Mohammed Attia. (2005) 'Developing a Robust Arabic Morphological Transducer Using Finite State Technology'. 8th Annual CLUK Research Colloquium, Manchester. [pdf version]

Technical Reports:

  • Mohammed Attia. (2010) 'Automatic Lexical Resource Acquisition for Constructing an LMF-Compatible Lexicon of Modern Standard Arabic'. The NCLT Seminar Series, DCU, Dublin, Ireland. [pdf version]
  • Mohammed Attia. (2008) 'Alternate Agreement in Arabic'. The ParGram Spring Meeting, Istanbul, Turkey. [pdf version]
  • Mohammed Attia. 2005. Functional Control and Long Distance Dependencies in Arabic. Parallel Grammar (ParGram) Meeting, Gotemba, Japan 2005. [pdf version]
  • Mohammed Attia. 2004. Report on the Introduction of Arabic to ParGram. The ParGram Fall Meeting 2004, The National Centre for Language Technology, School of Computing, Dublin City University, Ireland. [pdf version]

Presentations:

  • Mohammed Attia. (2012) 'Arabic Language: Nature and Challenges'. A presentation at the British University in Dubai, UAE, May 29, 2012. [Slides available]
  • Mohammed Attia. (2010) 'Automatic Lexical Resource Acquisition for Constructing an LMF-Compatible Lexicon of Modern Standard Arabic'. A presentation at the NCLT, Dublin City University, Ireland. [Slides available]
  • Mohammed Attia. (2008) 'From Arabic Handcrafted Grammar to Statistical Parsing'. A presentation at the NCLT, Dublin City University, Ireland. [Slides available]
  • Mohammed Attia. (2008) 'Alternate Agreement in Arabic'. Presented on my behalf in the ParGram Spring Meeting, Istanbul, Turkey. [Slides available]
  • Mohammed Attia. (2006) 'Issues in Arabic Grammar: from Tokenization to Transfer'. A presentation at the ParGram Meeting, Oxford, UK. [Slides available]
  • Mohammed Attia. (2005) 'Functional and Anaphoric Control in Arabic'. A presentation at ParGram Fall Meeting, Gotemba, Japan. [Slides available]
  • Mohammed Attia. (2005) 'Accommodating Multiword Expressions in an LFG Grammar'. A presentation at ParGram Fall Meeting, Gotemba, Japan. [Slides available]
  • Mohammed Attia. (2005) 'Developing a Robust Arabic Morphological Transducer/Tokenizer, and Integration with XLE'. Presented on my behalf in the ParGram Spring Meeting, Parc, Palo Alto, USA. [Slides available]
  • Mohammed Attia. (2004) 'Report on the Introduction of Arabic to ParGram'. Presented at ParGram Fall Meeting, Dublin, Ireland. [pdf version]

E-Books:

  • Mohammed Attia. (2003) 'Implications of the Agreement Features in Machine Translation'. M.A. Thesis.
  • Mohammed Attia. (2004) 'Common English Proverbs'. E-Books.
  • Mohammed Attia. (2007) 'Common English Expressions'. E-Books.
  • Mohammed Attia. (2008) 'Handling Arabic Morphological and Syntactic Ambiguity within the LFG Framework with a View to Machine Translation'. PhD Thesis. School of Languages, Linguistics and Cultures, the University of Manchester. [pdf version]
  • Mohammed Attia. (2009) 'The Translation Manual'. E-Books.
  • Mohammed Attia. (2009) 'The Translation Terminology Aid'. E-Books.
  • Mohammed Attia. (2009) 'Pigeon: A Collection of Poems'. E-Books.
  • Mohammed Attia. (2009) 'Basic English Words: A Vocabulary Bootstrap for Beginning Learners'. E-Books.
  • Mohammed Attia. (2009) 'Arabic Grammar Summary: A Digest of Badawi et al. 2004 "Modern Written Arabic, A Comprehensive Grammar"'. E-Books.