
Are there collections of prompts and outputs out there?

Generation Details

Date: Nov 2024
LLM: GPT-4
Access Method: Web UI

Prompt

As an experiment, I'm gathering some of my prompts and outputs and saving them. As time goes on, I'm developing quite a collection. This made me wonder: are there other collections of prompts and outputs out there? Do any purport to reflect the state of LLM development at a given time?

Open source LLM output datasets

There are several existing open-source datasets of LLM outputs or LLM-generated dialogues that could serve as examples for open-sourcing your own collection. These datasets typically focus on conversations, prompts, and responses that are designed to help train or fine-tune language models. Here are a few examples:

  1. ShareGPT: Real conversations that users had with ChatGPT. Filtered versions, such as the one used to train Vicuna, contain user-assistant conversation pairs and have been used to teach models conversational context and assistant-like behavior.
  2. LaMini-Instruction: Roughly 2.58 million instruction-response pairs distilled from GPT-family models using seed instructions drawn from existing datasets such as FLAN. It aims to improve a model's ability to follow human-style instructions.
  3. WizardLM / Evol-Instruct: Evol-Instruct uses an LLM to iteratively rewrite seed instructions into progressively more complex ones; the WizardLM models were fine-tuned on the resulting instruction-response pairs to strengthen instruction following.
  4. UltraChat: A large-scale dialogue dataset of about 1.57 million dialogues, generated by having two LLM instances converse, one playing the user and the other the responder, with the goal of more natural multi-turn responses.
  5. Open-Assistant Conversations (OASST1): A crowdsourced corpus of assistant-style conversation trees written and ranked by human volunteers, covering many topics and languages and widely used for instruction tuning. Note that, unlike the datasets above, OASST1 is human-generated rather than model-generated. (A loading sketch follows this list.)
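
These collections are hosted on the Hugging Face Hub, so you can inspect one directly. Below is a minimal sketch, assuming the `datasets` library is installed (`pip install datasets`); `OpenAssistant/oasst1` is the published Hub identifier for OASST1, and the `role`/`text` fields follow its schema:

```python
from datasets import load_dataset

# Download the OASST1 corpus from the Hugging Face Hub.
ds = load_dataset("OpenAssistant/oasst1", split="train")

# Each row is one message node in a conversation tree.
example = ds[0]
print(example["role"])  # "prompter" (the human) or "assistant"
print(example["text"])  # the message content
```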

These datasets illustrate the kinds of data already available and show how a collection like yours could be useful to the broader community, especially for training or fine-tuning models. Their maintainers also emphasize data quality, filtering of harmful content, and labeling to make the data suitable for training.

If you wish to open-source your dataset, it might be helpful to take inspiration from these existing datasets:

  • Structure the Data: Organize records with clear fields, such as the prompt, the response, and metadata like timestamps or model version (a minimal sketch follows this list).
  • Annotate for Safety: Implement checks for bias, ethical concerns, and sensitive content before releasing. Existing datasets often go through rigorous cleaning to ensure they’re fit for public release.
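
As a minimal sketch of that structure, the snippet below writes prompt/response records to a JSON Lines file, one JSON object per line, a format many of the datasets above use. The field names (`prompt`, `response`, `model`, `date`) are illustrative assumptions, not a required schema:

```python
import json

# Illustrative records; the field names are assumptions, not a standard schema.
records = [
    {
        "prompt": "Are there collections of prompts and outputs out there?",
        "response": "There are several existing open-source datasets...",
        "model": "GPT-4",      # model version metadata
        "date": "2024-11-16",  # timestamp metadata
    },
]

# JSON Lines: one record per line, easy to stream and to append to.
with open("prompts_and_outputs.jsonl", "w", encoding="utf-8") as f:
    for record in records:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
```

Keeping each record on its own line means the collection can grow incrementally without rewriting the file.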

Overall, there is significant interest in datasets generated from LLM interactions, as they help improve models' conversational depth and utility. Your collection could contribute to this growing area if it is properly anonymized and documented. If you want to explore these datasets further, platforms like Hugging Face host many of them, and you can also publish your own dataset there (sketched below) if it meets the standards for open-source contributions.
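
If you do publish on Hugging Face, the `datasets` library can push a structured collection straight to the Hub. A minimal sketch under assumptions: `your-username/prompt-output-collection` is a placeholder repository name, and you must authenticate first (for example with `huggingface-cli login`):

```python
from datasets import Dataset

# Build a Dataset from in-memory records (same shape as the JSONL sketch above).
ds = Dataset.from_list([
    {"prompt": "...", "response": "...", "model": "GPT-4", "date": "2024-11-16"},
])

# Upload to the Hugging Face Hub; requires prior authentication.
ds.push_to_hub("your-username/prompt-output-collection")
```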

The above text was generated by a large language model (LLM) and its accuracy has not been validated. This page is part of 'LLMs-on-LLMs,' a GitHub repository by Daniel Rosehill which explores how curious humans can use LLMs to better their understanding of LLMs and AI. However, the information should not be regarded as authoritative, and given the fast pace of evolution in LLM technology it will eventually become outdated. This footer was added on 16-Nov-2024.