Columbia Technology Ventures

A code-generating model to output accurate results for complex visual input tasks

This technology is a framework that uses code-generating large language models to compose vision-and-language models and produce a result for any image or video input query.

Unmet Need: Visual query model that uses both visual processing and reasoning

Current approaches to computer vision tasks such as visual queries rely on end-to-end models, which, unlike humans, do not differentiate between visual processing and reasoning. End-to-end models are fundamentally uninterpretable and become increasingly untenable as they grow more data intensive. Prior attempts to build modular systems for complex tasks have been difficult to optimize and ultimately unsuccessful.

The Technology: An interpretable, high-performance vision-and-language model for reasoning

This technology is a framework that leverages code-generating large language models to compose vision-and-language models based on any query. For each query, the system generates a custom Python program that takes images or video as input and outputs the result of the task. The platform is more interpretable, flexible, and adaptable than past methods. It delivers state-of-the-art performance and has the potential to merge advancements in computer vision and language, increasing scientific capabilities beyond any single model.
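The per-query workflow described above can be illustrated with a minimal conceptual sketch. Here, the code-generating model is stubbed out with a hard-coded program, and the vision modules (`find`, `count`) are hypothetical stand-ins operating on a toy "image"; none of these names come from the actual system.

```python
def find(image, object_name):
    # Stub vision module: return the regions matching `object_name`.
    # A real module would run an object detector on the image.
    return [region for region in image if region == object_name]

def count(regions):
    # Stub reasoning module: count the detected regions.
    return len(regions)

def generate_program(query):
    # Stand-in for the code-generating language model: map a query to a
    # short Python program that composes the vision modules. A real
    # system would prompt an LLM with the module API and the query text.
    return (
        "def execute(image):\n"
        "    muffins = find(image, 'muffin')\n"
        "    return count(muffins)\n"
    )

def answer(query, image):
    # Generate a query-specific program, then run it on the visual input.
    program = generate_program(query)
    scope = {"find": find, "count": count}
    exec(program, scope)
    return scope["execute"](image)

# Toy "image" represented as a list of object labels.
image = ["muffin", "kid", "muffin", "muffin"]
print(answer("How many muffins are there?", image))  # prints 3
```

Because the generated program is ordinary Python, every intermediate step (which objects were found, how they were counted) can be inspected, which is the source of the framework's interpretability.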

This technology has been validated on video and query inputs.

Applications:

  • Improved video retrieval on streaming sites
  • Accessibility for the blind and visually impaired
  • Expanded search engine capabilities and performance
  • More informative images on social media and e-commerce

Advantages:

  • Improves image recognition
  • Improves video recognition
  • Compatible with images, unlike prior methods

Lead Inventor:

Carl Vondrick, Ph.D.

Related Publications:

Tech Ventures Reference: