Behind the excitement about the sensational potential of new AI to improve lives, there is a little-discussed killer that can drown AI: poor-quality datasets. And it transpires there's a flood of them out there.
New research has revealed that much of the translated content on the internet is poor quality and not suitable for training GPT and similar models. Artificial intelligence, especially in the context of language translation, is used to speed up processes, save money, and help businesses and organisations grow internationally.
That’s a big problem for companies that want to scrape the Web for data to train their machine learning engines. These engines must be trained on clean, high-quality data, otherwise they generate results that are wrong.
What is the big problem with poor quality AI training data, and how do you solve it?
Researchers have discovered a surprising prevalence of machine-generated translations across the web, particularly in languages with fewer resources. This "multi-way parallel" content, translated into multiple languages simultaneously, constitutes a significant portion of total web content.
Historically, there was a heavy reliance on human translators, but there has been a significant shift towards machine translation engines.
This is especially prominent for lower-resource languages and dialects, which have relatively little data available for training conversational AI systems. By contrast, English, Spanish, and Chinese are higher-resource languages.
Discover how Human Translation compares with Machine Translation.
To analyse this phenomenon, a research team created a massive dataset called Multi-Way ccMatrix, containing 6.4 billion unique sentences in 90 languages. By examining patterns of "multi-way parallelism" (sets of sentences translated into three or more languages), they found that the quality of these translations tends to be low, especially those involving many languages.
This highlights the importance of high-quality translations in training datasets to improve overall translation quality.
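To make the idea of multi-way parallelism concrete, here is a minimal Python sketch. It is not the researchers' actual pipeline. It assumes you already have translation pairs, each linking two (language, sentence) items, and it groups linked sentences into clusters and counts how many distinct languages each cluster covers. The function names and the threshold of three languages are illustrative assumptions based on the definition above.

```python
from collections import defaultdict


def group_translations(pairs):
    """Group (language, sentence) items linked by translation pairs.

    Uses a simple union-find so that items connected through any chain
    of pairs end up in the same cluster.
    """
    parent = {}

    def find(item):
        parent.setdefault(item, item)
        while parent[item] != item:
            parent[item] = parent[parent[item]]  # path halving
            item = parent[item]
        return item

    for left, right in pairs:
        root_left, root_right = find(left), find(right)
        if root_left != root_right:
            parent[root_left] = root_right

    clusters = defaultdict(set)
    for item in list(parent):
        clusters[find(item)].add(item)
    return list(clusters.values())


def parallelism(cluster):
    """Number of distinct languages in one translation cluster."""
    return len({language for language, _ in cluster})


def multi_way_clusters(pairs, min_languages=3):
    """Keep only clusters that span at least `min_languages` languages."""
    return [c for c in group_translations(pairs) if parallelism(c) >= min_languages]
```

For example, if "Hello" in English is paired with both a French and a German sentence, those three items form one cluster with a parallelism of three, while a sentence that exists only as an English and Spanish pair is left out.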
The study also revealed a bias towards shorter, more predictable sentences in multi-way parallel data. These sentences often originated from low-quality articles, suggesting a trend of low-quality English content being translated en masse into many lower-resource languages via machine translation.
These findings raise significant concerns for training LLMs, particularly when using web-scraped data containing low-quality machine translations. The researchers emphasise the crucial role of data quality in LLM training, highlighting the potential risks of less fluent models with more hallucinations if trained on such data.
Translation is a big investment for a business, especially when you commission expert human linguists with credentials specific to your industry or vertical. These verified translations are perfect for training machine learning models due to their high quality. Using these to train an AI engine specific to your company is a prudent step since it improves the results.
Clean data in, clean data out. Moreover, it helps you to future-proof your translations, so they keep giving you a return on investment. That saves you time and money. After all, why keep translating the same content when an AI translation engine can do this for you?
Future-proofing translated content has been Guildhawk's focus since we were established in 2001. Our partners love receiving amazingly accurate translations, but they do not want to keep translating them each year. That’s why we started training machine learning models with high-quality translated data that has undergone a rigorous vetting process by professional linguists.
This is what powers our AI-translation GAI platform. Now, our partners ask us to build translation engines specific to their business in order to guarantee quality and accuracy of results. Our training strategy for GAI relies exclusively on using data that has undergone a rigorous vetting process by professional linguists.
It is imperative to implement a diligent multilingual data labelling system to confirm that only approved data is added to a training repository. At Guildhawk, we have been organising and labelling data for almost a decade. We maintain the strict rule that unverified data has no place in our system, as the use of such data could lead to poor machine translation results.
Our labelling system is not an extra chore carried out only when we need it; it is part of an ongoing process. This system allows us to label data on-the-go, enabling us to sort data by domain and accuracy level, among other factors. The result? We have a sensational data lake that is high-quality and generates translation results that are accurate and authentic, particularly for specific verticals.
The key to mitigating the negative impacts of low-quality data on LLM training lies in detecting and filtering out machine-generated content. The strategy applied by Guildhawk is clear: Never compromise on quality, and always prioritise accurate, verified data for machine translation training.
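As a rough illustration of how labels like these can gate what reaches a training set, here is a short Python sketch. It is not a description of Guildhawk's actual system. The `Segment` fields, the 1 to 5 accuracy scale, and the `select_training_data` function are all hypothetical names chosen for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    """One translated segment with its labels (hypothetical schema)."""
    source: str
    target: str
    domain: str      # e.g. "legal" or "medical"
    accuracy: int    # assumed reviewer score from 1 (poor) to 5 (excellent)
    verified: bool   # True only once a professional linguist has approved it


def select_training_data(segments, domain, min_accuracy=4):
    """Return verified segments for one domain at or above an accuracy level.

    Unverified segments are always excluded, whatever their score.
    """
    return [
        s for s in segments
        if s.verified and s.domain == domain and s.accuracy >= min_accuracy
    ]
```

Because the labels travel with each segment, the same data can be filtered again for a different domain or a stricter accuracy level without any relabelling.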