Columbia Technology Ventures

Prompt generation training for large language models

This technology is a method for selecting training examples for Large Language Models (LLMs), enabling more accurate and relevant models.

Unmet Need: Method to offset the imbalance and limited availability of training examples

Large Language Models (LLMs) are powerful, highly adaptable machine learning models whose behavior depends heavily on their training datasets. The amount of available training data influences model parameters and the accuracy of output predictions. A major concern is that limited training data can leave imbalances in data coverage. This becomes a problem when an LLM is tasked with making predictions in an area with few training examples to learn from, degrading the accuracy and performance of the LLM.

The Technology: Method for improving LLM accuracy by optimizing training examples

This technology describes a method for improving LLM accuracy by increasing the relevance of the training examples used. The method measures how closely each candidate training example matches a given query by computing the cosine similarity between the two. By selecting the most relevant examples for the training set, the method aims to improve the accuracy of the LLM's predictions.
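As an illustration only, the sketch below shows how cosine similarity between a query embedding and candidate example embeddings could be used to rank and select the most relevant examples. The embedding source, the function names, and the selection size k are assumptions for the sketch and are not part of the disclosed method.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def select_relevant_examples(query_emb: np.ndarray,
                             example_embs: list[np.ndarray],
                             examples: list[str],
                             k: int = 5) -> list[str]:
    """Rank candidate training examples by cosine similarity to the
    query embedding and return the k most similar ones.

    Hypothetical helper for illustration: embeddings are assumed to be
    produced by some external embedding model.
    """
    scores = [cosine_similarity(query_emb, e) for e in example_embs]
    ranked = sorted(zip(scores, examples), key=lambda pair: pair[0], reverse=True)
    return [example for _, example in ranked[:k]]
```

In this sketch, the highest-scoring examples would then be supplied to the LLM as its training (or prompt) examples, so that the model learns from data most similar to the query it must answer.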

Applications:

  • Increase efficiency in training Large Language Models (LLMs) similar to ChatGPT
  • Method for measuring similarity of inputs and training sets in other life science research applications
  • Mathematical tool for building new LLMs
  • Method to measure similarity between query and output

Advantages:

  • More relevant training examples
  • Increased accuracy of LLM output, especially for queries with previously imbalanced training examples

Lead Inventor:

Vishal Misra, Ph.D.

Patent Information:

Patent Pending

Related Publications:

Tech Ventures Reference: