This project investigates machine translation with LLMs. Llama 3 is used both to create synthetic data and to estimate the quality of the machine translations through prompt-based MQM (Multidimensional Quality Metrics). CTranslate2 is used to produce the machine translations.
The 'Load_corpus' file contains the code used to load the corpus for the translation task from the file extracted from INT, to draw a sample from it, and to write that sample to a CSV file.
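The load, sample, and export steps described above can be sketched as follows. This is a minimal illustration, not the project's actual code: it assumes the extracted file is tab-separated with one Dutch-English pair per line, and the function names, the column names (`nl`, `en`), and the seed are all hypothetical.

```python
import csv
import io
import random


def read_tsv_pairs(handle):
    """Read tab-separated sentence pairs, skipping malformed lines."""
    pairs = []
    for line in handle:
        parts = line.rstrip("\n").split("\t")
        if len(parts) == 2 and all(parts):
            pairs.append((parts[0], parts[1]))
    return pairs


def sample_pairs(pairs, k, seed=0):
    """Return up to k randomly chosen pairs; the fixed seed makes it reproducible."""
    rng = random.Random(seed)
    return rng.sample(pairs, min(k, len(pairs)))


def write_csv(pairs, handle):
    """Write the pairs to CSV with a header row (assumed column names)."""
    writer = csv.writer(handle)
    writer.writerow(["nl", "en"])
    writer.writerows(pairs)


# Small in-memory stand-in for the extracted corpus file.
raw = io.StringIO("Goedemorgen.\tGood morning.\nDank je.\tThank you.\nbroken line\n")
corpus = read_tsv_pairs(raw)
sample = sample_pairs(corpus, k=1)
out = io.StringIO()
write_csv(sample, out)
```

Fixing the seed keeps the sample stable across runs, which helps when later scores need to be compared against the same sentences.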
The 'prompting for synthetic data' file contains the prompts used to elicit synthetic parallel sentences.
The 'Dutch_dataset' file contains all sentences, both corpus and synthetic, in the source language, Dutch.
The 'English_dataset' file contains all translations, both corpus and synthetic, in the target language, English.
The 'complete_dataset.csv' file contains a sample of parallel sentences from the INT (Instituut voor de Nederlandse Taal), with Dutch as the source language and English as the target language.
The 'augmented_dataset.csv' file combines the corpus parallel sentences with the AI-generated parallel sentences. This combined dataset is used for the machine translation task.
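One way to build such a combined file is sketched below. The `origin` column and the removal of exact duplicates are assumptions added for illustration; the README does not say whether the project does either, and the function and column names are hypothetical.

```python
import csv
import io


def combine(corpus_pairs, synthetic_pairs):
    """Merge both sets, tagging each row with its origin and dropping exact duplicates."""
    seen = set()
    rows = []
    for origin, pairs in (("corpus", corpus_pairs), ("synthetic", synthetic_pairs)):
        for nl, en in pairs:
            if (nl, en) in seen:
                continue  # keep only the first occurrence of a pair
            seen.add((nl, en))
            rows.append({"nl": nl, "en": en, "origin": origin})
    return rows


corpus_pairs = [("Dank je.", "Thank you.")]
synthetic_pairs = [("Het regent.", "It is raining."), ("Dank je.", "Thank you.")]
augmented = combine(corpus_pairs, synthetic_pairs)

# Write to an in-memory buffer; a real run would open a file instead.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["nl", "en", "origin"])
writer.writeheader()
writer.writerows(augmented)
```

Keeping an origin label makes it possible to report scores separately for corpus and synthetic sentences later on.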
The 'analysis' file contains the code used for the analysis of the linguistic phenomena.
The 'Translation' file contains the code that uses CTranslate2 to translate the source sentences into English.
The 'EvalBLEUChrF' file contains the code used to evaluate the translated sentences with BLEU and ChrF scores.
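To make the ChrF idea concrete, here is a simplified, pure-Python sketch of a sentence-level character n-gram F-score. It is not the project's evaluation code and not the implementation of any particular library: it ignores word n-grams and smoothing details, and `simple_chrf` and `char_ngrams` are hypothetical names. The defaults (n-grams up to length 6, beta of 2) follow the commonly used ChrF settings.

```python
from collections import Counter


def char_ngrams(text, n):
    """Count character n-grams after removing spaces."""
    chars = text.replace(" ", "")
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def simple_chrf(hypothesis, reference, max_n=6, beta=2.0):
    """Simplified sentence-level ChrF on a 0-100 scale."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp = char_ngrams(hypothesis, n)
        ref = char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # sentence too short for this n-gram order
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    # F-beta: beta > 1 weights recall more heavily than precision.
    return 100 * (1 + beta**2) * p * r / (beta**2 * p + r)


score = simple_chrf("The cat sat.", "The cat sat down.")
```

Because it works on characters rather than whole words, this kind of score still gives partial credit for near matches such as inflected forms, which BLEU treats as complete misses.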
The 'two_translation_scores.csv' file contains the BLEU and ChrF scores for the CTranslate2 translations of the source sentences.
The 'prompting for QE' file contains the prompts used for quality estimation (QE) of the machine translations.
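As a rough illustration of what prompt-based MQM involves, the sketch below builds a prompt for one sentence pair and turns a reply into a penalty score. Everything here is hypothetical: the prompt wording is not the project's actual prompt, the reply is a made-up example rather than real model output, and the severity weights are an assumption, since MQM scoring schemes vary.

```python
import re

# Assumed severity weights; MQM scoring schemes differ on these values.
WEIGHTS = {"minor": 1, "major": 5, "critical": 10}

# Hypothetical prompt wording, for illustration only.
PROMPT_TEMPLATE = (
    "You are a translation quality annotator.\n"
    "Source (Dutch): {source}\n"
    "Translation (English): {translation}\n"
    "List each error on its own line as '<category>: <severity>', "
    "where severity is minor, major, or critical. "
    "Write 'no errors' if the translation is correct."
)


def build_prompt(source, translation):
    """Fill the template with one source sentence and its translation."""
    return PROMPT_TEMPLATE.format(source=source, translation=translation)


def mqm_penalty(response):
    """Sum the weights of all severities listed in the model's reply."""
    severities = re.findall(r":\s*(minor|major|critical)\b", response.lower())
    return sum(WEIGHTS[s] for s in severities)


prompt = build_prompt("Het regent.", "It rains cats.")
example_reply = "accuracy/mistranslation: major\nfluency/grammar: minor"  # made-up reply
penalty = mqm_penalty(example_reply)
```

A higher penalty means a worse translation. Asking for a fixed line format keeps the reply easy to parse, though real model output may need more forgiving handling.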
The author of this project is Hannah Goossens. Special acknowledgement goes to Lucia Donatelli, who taught the course for which this project was done.