To request an academic license to download and use this software, click "Express Licensing" above, create an account or log in if you have not already done so, then return to this page and click "Express Licensing" again. For a commercial license, please contact techventures@columbia.edu.
Narrative structure (i.e., storytelling) is prevalent in textual information channels such as news, personal blogs, and literature. This technology applies machine learning to literary analysis by automatically identifying the speakers of quoted speech in natural-language stories. The method was developed using a corpus of over 3,000 instances of quoted speech from six works of 19th- and 20th-century literature. The text is first preprocessed to find quotes and candidate characters. Each quote is then classified into a syntactic category, and features specific to that category are extracted from the text for training. The result is an algorithm that attributes instances of quoted speech to their respective speakers in narrative discourse.
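The preprocessing step described above (finding quotes and candidate characters) can be illustrated with a minimal sketch. This is not the licensed implementation: the function names `find_quotes` and `find_mentions` are hypothetical, the quote detector assumes straight double quotes, and the candidate characters are assumed to come from a known list of names rather than from automatic detection.

```python
import re

def find_quotes(text):
    """Return (start, end, quote_text) for each span in straight double quotes."""
    return [(m.start(), m.end(), m.group(1))
            for m in re.finditer(r'"([^"]*)"', text)]

def find_mentions(text, names):
    """Return (start, name) for each whole-word mention of a known name.

    Assumption: `names` is supplied by the caller; the source does not say
    how candidate characters are actually detected.
    """
    mentions = []
    for name in names:
        for m in re.finditer(r'\b' + re.escape(name) + r'\b', text):
            mentions.append((m.start(), name))
    return sorted(mentions)

sample = '"Come in," said Alice. Bob looked up. "Who is it?"'
quotes = find_quotes(sample)
mentions = find_mentions(sample, ["Alice", "Bob"])

for start, end, quote_text in quotes:
    print(start, end, quote_text)
for pos, name in mentions:
    print(pos, name)
```

Real literary text would also need handling for curly quotes, quotes that span paragraphs, and references to characters by pronoun or title, none of which this sketch attempts.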
This method connects machine learning and literary analysis by automating two tasks: reading a text and identifying the speaker of each quotation. To take advantage of dialogue chains and the frequent use of expression verbs, a pattern-matching algorithm was implemented to assign each quote to one of five syntactic categories. One-third of the selected corpus was used to develop the algorithm, while the remainder was used for training and testing. Results showed that the method attributed a quote to the correct speaker 83% of the time without any a priori information, exceeding the "nearest character" baseline. Ongoing work is aimed at social network extraction and at methods for extracting segments of indirect (unquoted) speech and their speakers.
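For comparison, the "nearest character" baseline mentioned above can be sketched as follows. This is only an illustration of the general idea, under stated assumptions: the function name `nearest_character` is hypothetical, distance is measured in character offsets, and mentions that fall inside the quote itself are skipped. The source does not specify how its baseline measures distance or breaks ties.

```python
def nearest_character(quote_span, mentions):
    """Attribute a quote to the closest character mention outside the quote.

    quote_span: (start, end) character offsets of the quote.
    mentions:   list of (offset, name) pairs for candidate characters.
    Returns the chosen name, or None if there is no usable mention.
    """
    start, end = quote_span
    best_name, best_dist = None, None
    for pos, name in mentions:
        if start <= pos < end:
            continue  # assumption: ignore names spoken inside the quote
        dist = start - pos if pos < start else pos - end
        if best_dist is None or dist < best_dist:
            best_name, best_dist = name, dist
    return best_name

# Hypothetical sample data: offsets for the text
# '"Come in," said Alice. Bob looked up. "Who is it?"'
sample_mentions = [(16, "Alice"), (23, "Bob")]
first = nearest_character((0, 10), sample_mentions)
second = nearest_character((38, 50), sample_mentions)
print(first, second)
```

A purely distance-based rule like this has no knowledge of expression verbs or dialogue chains, which is the kind of information the category-specific features described above are meant to capture.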
Patent Pending
IR Proxy66