Title: Abstractive Text Summarization for Morphologically Rich Languages
Advisor: Tunga Güngör
Abstract: The exponential growth in the number of documents available on the Web has turned finding the relevant piece of information into a challenging, tedious, and time-consuming activity. Accordingly, automatic text summarization has become an important field of study and has gained significant attention from researchers. Recent progress in deep learning has shifted research in text summarization from extractive methods towards more abstractive approaches. However, the research and the available resources are mostly limited to English, which hinders progress in other languages, especially those that differ in structure and characteristics, such as the morphologically rich languages (MRLs). In this thesis, we focus on abstractive text summarization for two MRLs, Turkish and Hungarian, and address their main challenges. First, we tackle the resource scarcity problem by curating two large-scale datasets, TR-News for Turkish and HU-News for Hungarian, which are intended for text summarization but are also suitable for other tasks such as topic classification, title generation, and key phrase extraction. Then, we utilize the morphological properties of these languages and adapt them to summarization, showing improvements over the existing models. Later, we make use of pretrained multilingual sequence-to-sequence models and provide state-of-the-art models for the abstractive text summarization and title generation tasks. Work on the evaluation of text summarization for MRLs is very limited. Thus, we show through a case study in Turkish how preprocessing can drastically influence the evaluation results. Finally, we propose morphosyntactic methods for text summarization evaluation and curate a human judgement dataset. We show that morphosyntactic tokenization during evaluation increases correlation with human judgements. All the work and the curated datasets are made publicly available.