# Translation project LaD




## Name
Machine Translation using LLMs

## Description
This project investigates machine translation with LLMs. Llama 3 is used to create synthetic data and to estimate the quality of the machine translations through prompt-based MQM (Multidimensional Quality Metrics). CTranslate2 is used to produce the machine translations.

The 'Load_corpus' file contains the code that loads the corpus for the translation task from the file extracted from INT, draws a sample from it, and converts that sample into a CSV file.
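
The sampling and CSV steps could look roughly like the sketch below. It is only an illustration, not the code in 'Load_corpus': the function names, the column names `dutch` and `english`, and the fixed seed are assumptions, and reading the INT export itself is left out because its format is not described here.

```python
import csv
import random


def sample_pairs(pairs, k, seed=0):
    """Return k randomly chosen (dutch, english) pairs, reproducibly."""
    rng = random.Random(seed)  # fixed seed (an assumption) so the sample can be redrawn
    return rng.sample(pairs, min(k, len(pairs)))


def write_pairs_csv(pairs, path):
    """Write sentence pairs to a CSV file with a header row."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["dutch", "english"])  # hypothetical column names
        writer.writerows(pairs)
```

Calling `write_pairs_csv(sample_pairs(pairs, 500), "sample.csv")` would then produce a 500-row sample, where `pairs` is a list of sentence pairs already read from the corpus file and the sample size is just an example value.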

The 'prompting for synthetic data' file contains the prompts used to elicit synthetic parallel sentences.

The 'Dutch_dataset' file contains all sentences in the source language, Dutch, both from the corpus and synthetic.

The 'English_dataset' file contains all translations in the target language, English, both from the corpus and synthetic.

The 'complete_dataset.csv' file contains a sample of the data from the INT (Instituut voor de Nederlandse Taal), consisting of parallel sentences with Dutch as the source language and English as the target language.

The 'augmented_dataset.csv' file contains the corpus parallel sentences combined with the AI-generated parallel sentences. This combined set is used for the machine translation task.
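
As a rough illustration of how corpus and synthetic pairs might be merged into one table, the sketch below tags each row with where it came from. The `origin` column, the other column names, and the function names are assumptions made for this example; the actual layout of 'augmented_dataset.csv' may differ.

```python
import csv
import io


def combine_datasets(corpus_rows, synthetic_rows):
    """Merge (dutch, english) pairs from both sources and tag each row's origin."""
    combined = []
    for origin, rows in (("corpus", corpus_rows), ("synthetic", synthetic_rows)):
        for dutch, english in rows:
            combined.append({"dutch": dutch, "english": english, "origin": origin})
    return combined


def to_csv_text(rows):
    """Serialise the combined rows as CSV text (hypothetical column layout)."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["dutch", "english", "origin"])
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```

Keeping an origin tag makes it possible to evaluate corpus and synthetic sentences separately later on, should that be of interest.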

The 'analysis' file contains the code used for the analysis of the linguistic phenomena.

The 'Translation' file contains the code that uses CTranslate2 to translate the Dutch source sentences into English.
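
For orientation, a batched translation loop with CTranslate2 might be sketched as follows. This is not the code in 'Translation': the helper names, the batch size, and the `tokenize`/`detokenize` callables are assumptions, since the model and tokenizer used in this project are not named here. Only `ctranslate2.Translator`, `translate_batch` and the `hypotheses` attribute of each result come from the CTranslate2 Python API.

```python
def load_translator(model_dir):
    """Load a converted CTranslate2 model directory (needs `pip install ctranslate2`)."""
    import ctranslate2  # imported lazily so the helper below can be used without it

    return ctranslate2.Translator(model_dir, device="cpu")


def translate_sentences(translator, tokenize, detokenize, sentences, batch_size=32):
    """Translate sentences in batches.

    `tokenize` maps a string to a list of subword tokens and `detokenize`
    maps a list of tokens back to a string; both depend on the model in use.
    """
    outputs = []
    for start in range(0, len(sentences), batch_size):
        batch = sentences[start:start + batch_size]
        results = translator.translate_batch([tokenize(s) for s in batch])
        # Each result holds one or more hypotheses; keep the best one.
        outputs.extend(detokenize(result.hypotheses[0]) for result in results)
    return outputs
```

Passing the tokenizer in as two callables keeps the loop independent of whichever subword model belongs to the translation model.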

The 'EvalBLEUChrF' file contains the code used to evaluate the translated sentences with BLEU and chrF scores.
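
To give an idea of what a character-based score measures, here is a simplified, sentence-level chrF-style function. It is a teaching sketch under stated assumptions (whitespace ignored, character n-grams up to order 6, beta of 2, plain averaging over orders) and will not reproduce the exact numbers of a maintained implementation such as sacreBLEU. It is not the code in 'EvalBLEUChrF'.

```python
from collections import Counter


def char_ngrams(text, n):
    """Count character n-grams; this sketch drops spaces first."""
    chars = text.replace(" ", "")
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def simple_chrf(hypothesis, reference, max_n=6, beta=2.0):
    """Simplified chrF: F-beta over averaged character n-gram precision and recall."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp, ref = char_ngrams(hypothesis, n), char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # one of the sentences is too short for this n-gram order
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    return 100 * (1 + beta ** 2) * p * r / (beta ** 2 * p + r)
```

With beta set to 2, recall counts more heavily than precision, so a translation that misses reference content is penalised more than one that adds extra characters.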

The 'two_translation_scores.csv' file contains the BLEU and chrF scores for the CTranslate2 translations of the source sentences.

The 'prompting for QE' file contains the prompts used for quality estimation of the machine translations.
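
A prompt-based MQM setup generally has two parts: a prompt that asks the model to list errors with a severity, and a parser that turns the answer into a penalty. The sketch below shows one possible shape. The prompt wording, the answer format, and the severity weights are all hypothetical and are not taken from the 'prompting for QE' file.

```python
import re

# Hypothetical severity weights; the weighting used in this project is not documented here.
SEVERITY_WEIGHTS = {"minor": 1, "major": 5, "critical": 10}

# Hypothetical prompt wording, for illustration only.
PROMPT_TEMPLATE = (
    "You are an annotator for machine translation quality.\n"
    "Dutch source: {source}\n"
    "English translation: {translation}\n"
    "List each translation error on its own line as "
    "'<severity>: <category> - <span>', where severity is minor, major or critical. "
    "Write 'no-error' if there are none."
)


def build_prompt(source, translation):
    """Fill the template with one source sentence and its machine translation."""
    return PROMPT_TEMPLATE.format(source=source, translation=translation)


def mqm_penalty(response):
    """Sum the severity weights of all error lines found in a model response."""
    total = 0
    for line in response.splitlines():
        match = re.match(r"\s*(minor|major|critical)\s*:", line, flags=re.IGNORECASE)
        if match:
            total += SEVERITY_WEIGHTS[match.group(1).lower()]
    return total
```

A higher penalty means a worse translation, and a response without any recognised error line yields a penalty of 0.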


## Authors and acknowledgment
The author of this project is Hannah Goossens. Special thanks go to Lucia Donatelli, who taught the course for which this project was carried out.