Columbia Technology Ventures

Detecting AI-generated text via machine-based rewriting

This technology is a method for discerning artificial intelligence (AI)-generated text from human-written text using large language models.

Unmet Need: Distinguishing machine learning-assisted text generation from human-authored writing

With the growing abundance of artificial intelligence (AI)-generated text, there is a need to distinguish human writing from machine learning-assisted writing. Discerning machine-generated text from human-written text is challenging because machine-learning algorithms are designed to mimic human syntax and writing style. Distinguishing AI-generated text from human-written text is a technological challenge that affects many fields, including journalism, creative writing, and academic writing.

The Technology: Detecting AI-generated text via machine-based rewriting

This technology identifies artificial intelligence (AI)-generated text by prompting a large language model to rewrite portions of a piece of text. It exploits an observed tendency: when a large language model is asked to rewrite text, it modifies human-written text more heavily than AI-generated text. The method works on the symbolic word output of the large language model to reduce reliance on deep neural network features, boosting its reliability, generalizability, and adaptability. In addition, this technology remains robust even when the text generator is aware of the detection mechanism.
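
The rewriting idea above can be illustrated with a minimal Python sketch. It is not the inventors' implementation: the `rewrite` callable stands in for a call to any large language model, and the word-level similarity measure, the function names, and the 0.2 threshold are all assumptions chosen for illustration.

```python
from difflib import SequenceMatcher
from typing import Callable


def rewrite_change_ratio(original: str, rewritten: str) -> float:
    """Return the fraction of word-level content changed by a rewrite.

    0.0 means the rewrite is identical to the original; values near 1.0
    mean almost nothing was kept. The metric is an illustrative assumption.
    """
    original_words = original.split()
    rewritten_words = rewritten.split()
    similarity = SequenceMatcher(None, original_words, rewritten_words).ratio()
    return 1.0 - similarity


def looks_ai_generated(
    text: str,
    rewrite: Callable[[str], str],  # hypothetical wrapper around an LLM rewrite prompt
    threshold: float = 0.2,         # assumed cutoff; would need tuning on labeled data
) -> bool:
    """Flag text as likely AI-generated when the rewrite changes little of it."""
    return rewrite_change_ratio(text, rewrite(text)) < threshold
```

In use, `rewrite` would send the text to a language model with an instruction such as "rewrite this passage" and return the model's answer. Because the decision depends only on the words that come back, the sketch needs no access to the model's internal features.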

This technology has demonstrated improved detection on several established paragraph-level detection benchmarks, with F1 score gains of up to 29 points.
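
For readers unfamiliar with the metric, the following Python sketch shows how an F1 score is computed from binary labels. The labels are toy values invented for illustration, not benchmark results, and the reading of "points" as F1 expressed on a 0 to 100 scale is an assumption.

```python
from typing import Sequence


def f1_score(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Compute F1 for binary labels, where 1 marks AI-generated text."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy labels only: two AI-generated and two human-written passages.
toy_truth = [1, 1, 0, 0]
toy_predictions = [1, 0, 0, 0]
toy_f1_points = 100 * f1_score(toy_truth, toy_predictions)
```

Under that reading, a gain of 29 points would correspond to a rise of 0.29 in the raw F1 value.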

Applications:

  • AI-generated text detection
  • Essay plagiarism/ChatGPT checkers
  • Development of a human-vs-machine written metric
  • Detection and removal of bots on social media

Advantages:

  • Eliminates the need for deep neural network features, boosting robustness, generalizability, and adaptability
  • Integrates with the latest LLMs
  • Doesn’t require the original generating model (i.e., model A can detect the output of model B)
  • Generalizes across different datasets and domains and robustly detects text generated by different language models

Lead Inventor:

Carl Vondrick, Ph.D.

Patent Information:

Patent Pending

Related Publications:

Tech Ventures Reference: