
LLMs Running Out of Data? MetaSynth's AI Agents Generate Diverse Training Data

This is a Plain English Papers summary of a research paper on MetaSynth.


The Looming Data Crisis for LLM Training


Language models face an impending data shortage. By 2028, we'll likely exhaust the available stock of public human text data if current development trends continue. Recent models like Llama 3 already use ten times more data than compute-optimal estimates from just two years ago. This data scarcity threatens future model scaling and capabilities.

Synthetic data generated by language models offers a potential solution, but diversity remains a critical challenge. The researchers behind MetaSynth identify two main factors limiting diversity in synthetically generated data: the choice of seed instances used to initialize generation, and the reliance on template-based prompts.

Template-based generation methods produce texts with repetitive sentence structures and recurring phrases. For example, financial texts often begin with predictable patterns like "In today's ever-changing financial landscape" or contain generic buzzwords, reducing their utility for model training.
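To make this limitation concrete, here is a minimal, hypothetical sketch (not code from the paper) of template-based generation. The template text and slot values are invented for illustration; the point is that a fixed template with a few slots can only ever yield a small set of structurally identical prompts.

```python
import random

# Hypothetical illustration: a template-based generator fills fixed
# slots, so every output shares the same sentence skeleton.
TEMPLATE = "Write a {length} article about {topic} for {audience}."

def make_prompts(n, seed=0):
    """Generate n prompts by sampling slot values into one fixed template."""
    rng = random.Random(seed)
    lengths = ["short", "detailed"]
    topics = ["bond yields", "index funds", "credit risk"]
    audiences = ["retail investors", "analysts"]
    return [
        TEMPLATE.format(
            length=rng.choice(lengths),
            topic=rng.choice(topics),
            audience=rng.choice(audiences),
        )
        for _ in range(n)
    ]

prompts = make_prompts(100)
# Every prompt begins and ends identically; only three slots vary,
# so there are at most 2 * 3 * 2 = 12 distinct prompts among the 100.
print(len(set(prompts)))
```

Documents generated from such prompts inherit the template's structure, which is exactly the repetition the diversity metrics below are designed to detect.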

To address these limitations, the researchers propose MetaSynth, a method that uses meta-prompting to orchestrate multiple "expert" LLM agents to collaboratively generate diverse synthetic data. This approach significantly improves domain adaptation capabilities while requiring surprisingly little synthetic data.

Previous Approaches to Synthetic Data Generation


Traditional synthetic data generation relies on template-based methods that offer limited variation. Approaches like Self-prompting, Attrprompt, CLINGEN, and Explore-Instruct use predefined templates with placeholders populated dynamically. While efficient, these methods produce data with structural similarities that limit their effectiveness for model training.

Meta-prompting approaches, where an LLM writes prompts to solve problems, have shown promise in generating more diverse outputs. Prior work has demonstrated that optimized meta-prompts can significantly improve the quality and downstream effectiveness of synthetic data.

Agent-based methods, where multiple LLM instances work collaboratively, represent another promising direction. Similar approaches have been successful in dialogue generation, including work that tackles related challenges in low-resource settings.

Domain adaptation represents a key use case for synthetic data. When abundant generic pre-training data exists, but domain-specific data is limited, synthetic data can help tailor an LLM to specialized domains efficiently without requiring expensive domain-specific data collection or compromising general capabilities.

How MetaSynth Works: Agentic Scaffolds for Diverse Data


MetaSynth revolutionizes synthetic data generation by decoupling the process into two distinct stages:

  1. Creating a diverse scaffolding of domain-relevant content
  2. Building instruction-response pairs based on this content

This approach uses a supervisor language model that orchestrates multiple expert LLM agents to collaboratively generate domain-specific content. The process begins with either seed keywords or seed documents that serve as a foundation for the generated content.

The supervisor agent creates specialized prompts for content generation, directing expert agents to produce diverse document scaffolds. These scaffolds contain rich domain-specific knowledge that forms the foundation for subsequent instruction-response pairs.

This multi-agent approach mirrors collaborative frameworks in which agent coordination improves task performance. With MetaSynth, this collaborative generation produces more diverse and contextually rich content than traditional template-based approaches.

The same scaffolds can be repurposed to build high-quality instruction-response pairs, creating a versatile dataset suitable for domain adaptation. This two-stage process ensures both content diversity and instruction quality, addressing key limitations of previous approaches.
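The two-stage flow described above can be sketched as follows. This is a rough sketch, not the paper's implementation: `call_llm`, the agent role names, and the prompt wording are all illustrative placeholders, and a real system would replace `call_llm` with an actual LLM API client.

```python
def call_llm(system, user):
    """Stand-in for a real LLM call; replace with an actual API client."""
    return f"[{system}] response to: {user}"

def supervisor_make_prompts(seed_keywords):
    # Stage 1a: the supervisor authors a specialized generation prompt per
    # seed (meta-prompting: an LLM writes the prompts, not a fixed template).
    return [
        call_llm(
            "supervisor",
            f"Write a unique document-generation prompt about '{kw}' "
            f"for a domain-expert agent.",
        )
        for kw in seed_keywords
    ]

def generate_corpus(seed_keywords):
    # Stage 1b: expert agents expand each prompt into a document scaffold.
    scaffolds = [
        call_llm("domain-expert", p)
        for p in supervisor_make_prompts(seed_keywords)
    ]
    # Stage 2: the same scaffolds are reused to build instruction-response
    # pairs, giving both pre-training documents and instruction data.
    pairs = [
        (call_llm("instruction-writer", s), call_llm("responder", s))
        for s in scaffolds
    ]
    return scaffolds, pairs

scaffolds, pairs = generate_corpus(["bond yields", "credit risk"])
print(len(scaffolds), len(pairs))  # 2 2
```

Decoupling the scaffold stage from the instruction stage is what lets one generation run serve both continued pre-training and instruction tuning.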

Experimental Design: Testing Domain Adaptation


To evaluate MetaSynth's effectiveness, the researchers focused on adapting Mistral-7B-v0.3 to two specialized domains: Finance and Biomedicine. They generated 25 million tokens of synthetic data using various approaches and compared performance across domain-specific and general tasks.

The experiments included several comparative conditions:

  • Template-based synthetic data generation (baseline)
  • MetaSynth with seed keywords
  • MetaSynth with seed documents
  • Real documents from Common Crawl and Wikipedia
  • Various mixtures of real and synthetic data

They evaluated diversity using seven automated metrics:

  1. Compression Ratio (lower is better)
  2. Task2Vec Diversity Coefficient (higher is better)
  3. Remote Clique (higher is better)
  4. Chamfer Distance (higher is better)
  5. 1-Gram Diversity (higher is better)
  6. 4-Gram Diversity (higher is better)
  7. Mean Information Flow (higher is better)
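Two of these metrics are simple enough to compute directly. The sketch below uses common definitions (the paper's exact formulations may differ): compression ratio via zlib, and n-gram diversity as the distinct-to-total n-gram ratio.

```python
import zlib

def compression_ratio(texts):
    """Uncompressed size / zlib-compressed size of the concatenated corpus.
    Repetitive corpora compress well, so LOWER values mean MORE diversity."""
    data = "\n".join(texts).encode("utf-8")
    return len(data) / len(zlib.compress(data))

def ngram_diversity(texts, n):
    """Distinct n-grams divided by total n-grams across the corpus.
    HIGHER values mean fewer repeated phrasings."""
    grams = []
    for t in texts:
        toks = t.split()
        grams.extend(zip(*(toks[i:] for i in range(n))))
    return len(set(grams)) / max(len(grams), 1)

repetitive = ["the market is up today"] * 50
varied = [f"document {i} covers topic {i * i} in depth" for i in range(50)]

# The repeated corpus compresses far better (higher ratio = less diverse)
# and reuses the same 4-grams (lower n-gram diversity).
print(compression_ratio(repetitive) > compression_ratio(varied))   # True
print(ngram_diversity(varied, 4) > ngram_diversity(repetitive, 4)) # True
```

The remaining metrics (Task2Vec coefficient, Remote Clique, Chamfer Distance, Mean Information Flow) require embedding models and are not reproduced here.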

For domain-specific performance, they used specialized benchmarks:

  • Finance: ConvFinQA, NER, FPB, Headline, FiQA_SA
  • Biomedicine: PubMedQA, USMLE, MQP, RCT, ChemProt

They also tested general capabilities using common NLP benchmarks like ARC, BoolQ, HellaSwag, and MMLU to ensure domain adaptation didn't compromise overall performance, a concern often raised when training on synthetic data.

MetaSynth Outperforms Traditional Approaches


The results show that MetaSynth generates significantly more diverse content than template-based approaches. Across all diversity metrics, MetaSynth synthetic documents approach the diversity of real corpora like Wikipedia and Common Crawl.

| Setting | Compression Ratio ↓ | Task2Vec Div. Coeff. ↑ | Remote Clique ↑ | Chamfer Distance ↑ | 1-GD ↑ | 4-GD ↑ | MIF ↑ |
|---|---|---|---|---|---|---|---|
| Template Prompting | 3.6674 | 0.1576 | 0.1964 | 0.0897 | 0.0198 | 0.9224 | 8.5614 |
| Common Crawl | 2.7380 (-25.34%) | 0.2120 (+34.52%) | 0.3036 (+54.58%) | 0.2359 (+162.99%) | 0.0621 (+213.64%) | 1.6080 (+74.33%) | 8.1263 (-5.08%) |
| Synth. Docs (Seed Keywords) | 3.4443 (-6.08%) | 0.1757 (+11.49%) | 0.2191 (+11.56%) | 0.1351 (+50.61%) | 0.0345 (+74.24%) | 1.1749 (+27.37%) | 9.0016 (+5.14%) |
| Synth. Docs (Seed Documents) | 3.1495 (-14.12%) | 0.1788 (+13.45%) | 0.2047 (+4.23%) | 0.1383 (+54.18%) | 0.0390 (+96.97%) | 1.3468 (+46.01%) | 8.9150 (+4.13%) |
| Wikipedia | 2.6088 (-24.82%) | 0.1892 (+20.05%) | 0.2868 (+46.03%) | 0.2416 (+169.34%) | 0.1046 (+428.28%) | 1.6997 (+84.27%) | 8.3149 (-2.88%) |

Table 1: Diversity metrics for the Finance domain across data generation approaches. Lower Compression Ratio and higher values for the other metrics indicate greater diversity. 1-GD/4-GD = 1-/4-gram diversity; MIF = Mean Information Flow; percentages are relative to Template Prompting.

When used for domain adaptation, MetaSynth shows substantial performance improvements:

Finance

| CPT Setting | Token Mix | ConvFinQA | NER | FPB | Headline | FiQA_SA | Average |
|---|---|---|---|---|---|---|---|
| Mistral-7B Base (No CPT) | 0 M | 38.9 | 58.14 | 65.09 | 79.26 | 75.62 | 63.40 |
| Real Docs + Template-Prompting Docs | 12.5M:12.5M | 48.79 | 52.64 | 64.24 | 76.00 | 74.47 | 63.23 |
| Real Docs | 25 M | 46.51 | 55.59 | 65.07 | 78.30 | 76.09 | 64.31 |
| Real Docs + MetaSynth Docs | 12.5M:12.5M | 48.59 | 53.69 | 67.82 | 80.14 | 75.66 | 65.18 |
| Real Docs + MetaSynth Docs-Instructions-Responses | 12.5M:12.5M | 43.29 | 53.77 | 62.06 | 79.75 | 71.57 | 62.09 |
| Real Docs + MetaSynth Docs-Instructions-Responses | 8.33M:16.7M | 43.22 | 52.26 | 65.66 | 79.73 | 72.50 | 62.67 |
| Real Docs + MetaSynth Instructions-Responses | 12.5M:12.5M | 47.51 | 52.08 | 63.16 | 79.53 | 72.56 | 62.97 |
| Real Docs + MetaSynth Instructions-Responses | 8.33M:16.7M | 44.43 | 49.34 | 63.05 | 79.68 | 75.27 | 62.35 |
| MetaSynth Docs | 25 M | 42.28 | 48.72 | 67.37 | 79.67 | 73.65 | 62.34 |
| MetaSynth Docs-Instructions-Responses | 25 M | 49.30 | 54.64 | 66.43 | 83.46 | 76.13 | 65.99 |

Biomedicine

| CPT Setting | Token Mix | PubMedQA | USMLE | MQP | RCT | ChemProt | Average |
|---|---|---|---|---|---|---|---|
| Mistral-7B (No CPT) | 0 M | 58.20 | 35.27 | 67.86 | 62.55 | 40.80 | 52.94 |
| Real Docs + Template-Prompting Docs | 12.5M:12.5M | 56.40 | 38.41 | 67.38 | 59.80 | 30.40 | 50.48 |
| Real Docs | 25 M | 59.70 | 36.37 | 62.29 | 63.70 | 28.90 | 50.19 |
| Real Docs + MetaSynth Docs | 12.5M:12.5M | 60.70 | 37.31 | 64.26 | 67.50 | 45.00 | 54.95 |
| Real Docs + MetaSynth Docs-Instructions-Responses | 12.5M:12.5M | 60.30 | 37.16 | 74.75 | 71.85 | 38.40 | 56.49 |
| Real Docs + MetaSynth Docs-Instructions-Responses | 8.33M:16.7M | 59.50 | 36.61 | 76.06 | 71.05 | 42.20 | 57.08 |
| Real Docs + MetaSynth Instructions-Responses | 12.5M:12.5M | 62.90 | 35.98 | 71.80 | 71.40 | 39.60 | 56.34 |
| Real Docs + MetaSynth Instructions-Responses | 8.33M:16.7M | 60.20 | 36.44 | 73.77 | 71.75 | 42.10 | 56.85 |
| MetaSynth Docs | 25 M | 60.20 | 37.23 | 70.16 | 68.15 | 40.40 | 55.23 |
| MetaSynth Docs-Instructions-Responses | 25 M | 61.80 | 36.60 | 77.87 | 74.45 | 50.40 | 60.22 |

Table 2: Performance on domain-specific tasks across different continuous pre-training settings. Higher scores are better. CPT = Continuous Pre-Training.

Most impressively, using only MetaSynth-generated Documents-Instructions-Responses (without any real data) produced the best overall results: relative improvements over the base model of 4.08% in Finance and a remarkable 13.75% in Biomedicine.

Crucially, these domain improvements came without sacrificing general capabilities:

| Setting | ARC-ch | ARC-easy | BoolQ | HellaSwag | MMLU | OBQA | PIQA | SIQA | Winogrande | Avg |
|---|---|---|---|---|---|---|---|---|---|---|
| Base Model | | | | | | | | | | |
| Mistral-7B | 52.1 | 78.4 | 82.0 | 80.4 | 59.1 | 44.2 | 82.3 | 45.9 | 73.4 | 66.4 |
| Finance | | | | | | | | | | |
| Real Docs + Template Prompting Docs | 53.8 | 78.4 | 78.0 | 80.7 | 59.0 | 45.6 | 81.9 | 48.1 | 71.4 | 66.3 |
| Real Docs + MetaSynth Docs | 55.9 | 77.3 | 84.3 | 80.7 | 58.5 | 44.4 | 81.1 | 49.6 | 71.7 | 67.1 |
| MetaSynth Docs-Instr-Responses | 50.9 | 75.1 | 84.1 | 79.4 | 56.3 | 43.0 | 80.7 | 48.1 | 69.3 | 65.2 |
| Biomedicine | | | | | | | | | | |
| Real Docs + Template Prompting Docs | 54.9 | 79.5 | 80.8 | 81.1 | 58.1 | 45.4 | 82.6 | 46.9 | 71.7 | 66.8 |
| Real Docs + MetaSynth Docs | 53.4 | 76.1 | 83.5 | 80.6 | 58.0 | 44.6 | 81.0 | 46.9 | 70.2 | 66.0 |
| MetaSynth Docs-Instr-Responses | 54.2 | 75.2 | 83.2 | 79.1 | 57.5 | 43.2 | 81.0 | 47.5 | 70.8 | 65.8 |

Table 3: General evaluation across domains and settings. Models adapted with MetaSynth maintain strong performance on general NLP tasks.

The most successful MetaSynth configuration maintained general capabilities while achieving substantial domain-specific improvements, demonstrating that high-quality synthetic data can enable effective domain adaptation without compromising broader capabilities.

Limitations and Future Work


Despite its promising results, MetaSynth has several limitations. The computational cost of using multiple LLM agents for data generation is substantial, potentially limiting accessibility for researchers with limited computing resources. Each document generation requires multiple API calls or model runs, making the process more expensive than template-based approaches.

The current implementation focuses specifically on Finance and Biomedicine domains. Extending MetaSynth to other specialized domains would require creating new expert agent configurations and potentially adapting the system architecture.

There's also room for improvement in the diversity metrics. While MetaSynth significantly outperforms template-based approaches, it still doesn't fully match the diversity of human-written corpora like Wikipedia and Common Crawl across all metrics.

Future research could:

  • Scale the approach to more domains
  • Reduce computational requirements through model distillation or more efficient agent architectures
  • Further improve diversity metrics through enhanced collaborative generation techniques
  • Explore multi-modal synthetic data generation by extending the framework to include images, audio, or other modalities
Conclusion: Key Takeaways for Synthetic Data Generation


MetaSynth represents a significant advance in synthetic data generation for language models. By using a meta-prompting approach where an LLM orchestrates multiple expert agents, it produces more diverse and effective synthetic data than traditional template-based methods.

The key findings demonstrate that:

  1. Just 25 million tokens of MetaSynth synthetic data enables effective domain adaptation to specialized fields like Finance and Biomedicine
  2. MetaSynth-generated data approaches the diversity of real corpora across multiple automated metrics
  3. Domain adaptation with MetaSynth doesn't compromise general capabilities
  4. The agent-based scaffolding approach produces more contextually rich and diverse content than template prompting

These results suggest that high-quality synthetic data could help address the looming data scarcity problem in LLM development. As available human-written text becomes exhausted, methods like MetaSynth that generate diverse, high-quality synthetic data will become increasingly valuable for maintaining progress in language model capabilities.

The success of MetaSynth also highlights the importance of diversity in synthetic data. Models trained on more diverse data show better performance across specialized domains, suggesting that diversity—not just quantity—is crucial for effective model training and adaptation.

