Turkish Preprocessing Operations Using Deep Learning Approaches
The first step in nearly all natural language processing (NLP) applications is preprocessing the text. Preprocessing operations include tokenization (segmenting the text into tokens), sentence splitting (dividing the text into sentences), normalization (converting the text into a canonical form), and the like. In this project, you will design and implement deep learning algorithms for preprocessing Turkish text. First, a literature review will be conducted and similar systems for English will be analyzed (e.g., UDPipe, Stanza). Then, a deep learning model will be built for each preprocessing operation. The models will be adapted to Turkish based on the characteristics of the language (e.g., using embeddings for suffixes). Finally, the system will be evaluated on Turkish corpora, most likely the Turkish treebanks in the Universal Dependencies (UD) framework. A conference or journal paper will be written towards the end of the project.
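As a rough illustration of how such systems frame the task: neural tokenizers like those in UDPipe and Stanza typically cast tokenization and sentence splitting as character-level sequence labeling. The sketch below (a simplified assumption, not the project's prescribed design; the three-label scheme O/T/S is illustrative) shows how gold-tokenized sentences could be converted into per-character training labels for such a tagger.

```python
def char_labels(sentences):
    """Convert gold-tokenized sentences into (character, label) pairs.

    Labels (an illustrative scheme, not a fixed standard):
        "T" = a token boundary falls after this character,
        "S" = a sentence boundary falls after this character,
        "O" = otherwise (inside a token, or a space).
    A character-level neural tagger (e.g., a BiLSTM) would then be
    trained to predict these labels from the raw text.
    """
    chars, labels = [], []
    for tokens in sentences:
        for i, tok in enumerate(tokens):
            for j, ch in enumerate(tok):
                chars.append(ch)
                if j < len(tok) - 1:
                    labels.append("O")   # inside a token
                elif i < len(tokens) - 1:
                    labels.append("T")   # token ends here
                else:
                    labels.append("S")   # sentence ends here
            if i < len(tokens) - 1:
                chars.append(" ")
                labels.append("O")       # spaces carry no boundary label
    return chars, labels


# Example with a short (hypothetical) Turkish sentence, "Ali geldi .":
chars, labels = char_labels([["Ali", "geldi", "."]])
```

Turkish-specific adaptations (such as suffix embeddings) would enter at the feature/embedding layer of the tagger, not in this label-extraction step.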
This is a two-semester project.
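Since the planned evaluation uses UD treebanks, the project will need to read the CoNLL-U format (one token per line with ten tab-separated fields, `#` comment lines, blank lines between sentences). A minimal reader sketch follows; the embedded sample sentence is made up for illustration, and only the ID, FORM, and UPOS fields are kept here.

```python
def read_conllu(lines):
    """Parse CoNLL-U lines into sentences of (id, form, upos) tuples.

    Multiword-token ranges (IDs like "3-4") and empty nodes (IDs like
    "5.1") are skipped in this simplified sketch.
    """
    sentences, current = [], []
    for line in lines:
        line = line.rstrip("\n")
        if not line:                       # blank line ends a sentence
            if current:
                sentences.append(current)
                current = []
        elif not line.startswith("#"):     # skip comment lines
            fields = line.split("\t")
            if "-" not in fields[0] and "." not in fields[0]:
                # fields: ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, ...
                current.append((fields[0], fields[1], fields[3]))
    if current:                            # file may lack a final blank line
        sentences.append(current)
    return sentences


# A tiny illustrative sample in CoNLL-U layout (not from a real treebank):
sample = (
    "# sent_id = 1\n"
    "1\tAli\tAli\tPROPN\t_\t_\t2\tnsubj\t_\t_\n"
    "2\tgeldi\tgel\tVERB\t_\t_\t0\troot\t_\t_\n"
    "3\t.\t.\tPUNCT\t_\t_\t2\tpunct\t_\t_\n"
)
sents = read_conllu(sample.splitlines())
```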