This technology is a framework that uses code-generating large language models to compose vision-and-language models and produce a result for any image or video input query.
Current approaches to computer vision tasks such as visual queries rely on end-to-end models that, unlike humans, do not separate visual processing from reasoning. End-to-end models are fundamentally uninterpretable and become increasingly untenable as they grow more data intensive. Previous attempts to build modular systems for complex tasks have proven difficult to optimize and ultimately unsuccessful.
This technology provides a framework that leverages code-generating large language models to compose vision-and-language models based on any query. For each query, the framework generates a custom Python program that takes images or video as input and outputs the result of the task. The platform is more interpretable, flexible, and adaptable than past methods. This technology delivers state-of-the-art performance and has the potential to merge advances in computer vision and language, increasing scientific capabilities beyond what any single model can achieve.
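The query-to-program pattern can be pictured with a short sketch. The ImagePatch class, its find and simple_query methods, and the example query below are illustrative assumptions rather than the licensed technology's actual interface; they only show how a generated Python program might compose off-the-shelf vision models to answer a visual query.

```python
# Hypothetical sketch: the ImagePatch API and the example query are
# assumptions for illustration, not part of any released interface.
from typing import List


class ImagePatch:
    """Illustrative wrapper that composes pretrained vision models."""

    def __init__(self, image):
        self.image = image

    def find(self, object_name: str) -> List["ImagePatch"]:
        """Return patches containing the named object
        (e.g., via an open-vocabulary object detector)."""
        raise NotImplementedError

    def simple_query(self, question: str) -> str:
        """Answer a basic question about this patch
        (e.g., via a visual question answering model)."""
        raise NotImplementedError


def execute_command(image) -> str:
    # A program of this form would be generated by the code-writing
    # LLM for the query "How many muffins can each kid have?" and then
    # executed on the input image to produce the result.
    patch = ImagePatch(image)
    muffin_patches = patch.find("muffin")
    kid_patches = patch.find("kid")
    if not kid_patches:
        # Fall back to a direct visual question if no kids are detected.
        return patch.simple_query("How many muffins can each kid have?")
    return str(len(muffin_patches) // len(kid_patches))
```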
This technology has been validated with videos and query inputs.
IR CU23274
Licensing Contact: Greg Maskel