A Project Gutenberg Poetry Corpus

What:
Talk
When:
Tuesday, August 14, 3:45 PM to 5:00 PM (1 hour 15 minutes)

In this paper, I present the Gutenberg Poetry Corpus: a corpus of over three million lines of poetry (in annotated JSON format) automatically curated from Project Gutenberg. Project Gutenberg, a collection of machine-readable texts in the public domain, began in the early 1970s with a hand-typed copy of the US Declaration of Independence. More recently driven by the volunteer efforts of a decentralized group of proofreaders, Project Gutenberg now consists of more than 54,000 texts, mostly English-language literature from the 18th and 19th centuries. Researchers in the humanities and in computational linguistics have made use of Project Gutenberg for decades, and its use in data-driven computational creativity has grown in recent years. I describe the methodology used to automatically filter and identify lines of poetry in the larger Gutenberg corpus, discuss the potential of this corpus for research and creative work, and then present a series of my own experiments that use this corpus as their primary source material.

Presenter
New York University
Teacher
