20-Nov-24

Prompt

What are some examples of proprietary / closed-source training datasets that are commonly used to train LLMs?

Output

Proprietary or closed-source datasets used to train large language models (LLMs) are, by their nature, not typically disclosed in full detail. However, certain companies and organizations are known to use proprietary sources such as:

  1. Microsoft's Proprietary Codebase: Microsoft maintains a vast collection of proprietary software and code that it uses internally for research. For example, a study comparing the performance of OpenAI's Codex model on open-source versus proprietary code used Microsoft's non-public C# and C++ codebase as part of the evaluation.

  2. OpenAI's Proprietary Datasets: While OpenAI's GPT models (such as GPT-3.5 and GPT-4) are trained on large-scale datasets that include publicly available data from the internet, OpenAI also uses proprietary resources that are not fully disclosed. These may include curated collections of text from books, academic papers, and other sources that are not openly available.

  3. Google's Internal Data: Google has access to a vast array of proprietary data sources, including internal documentation, codebases, and other resources that can be used to train the models behind products like Bard (now Gemini). These datasets are generally not made public for privacy and competitive reasons.

  4. Facebook/Meta's Proprietary Data: In addition to its openly released models like LLaMA, Meta (formerly Facebook) also uses proprietary datasets for internal projects and models that are not released to the public. These could include internal social media data or other restricted sources.

While the specific contents of proprietary datasets are generally kept confidential, they often consist of:

  - Internal codebases (e.g., C# or C++ code from Microsoft)
  - Proprietary APIs and usage patterns
  - Curated text corpora from licensed books, articles, or other non-public sources
  - Internal communications or documentation (e.g., from large tech companies like Google or Meta)

These proprietary datasets are closely guarded due to their commercial value and the competitive advantage they provide in training advanced AI models.