A Project Gutenberg Poetry Corpus

3:45 PM, Tuesday, August 14, 2018 (1 hour 15 minutes)

In this paper, I present the Gutenberg Poetry Corpus: a corpus of over three million lines of poetry (in annotated JSON format) automatically curated from Project Gutenberg. Project Gutenberg, a collection of machine-readable texts in the public domain, began in the early 1970s with a hand-typed copy of the US Declaration of Independence. Driven more recently by the volunteer efforts of a decentralized group of proofreaders, Project Gutenberg now consists of more than 54,000 texts, mostly English-language literature from the 18th and 19th centuries. Researchers in the humanities and in computational linguistics have made use of Project Gutenberg for decades, and its use in data-driven computational creativity has grown in recent years. I describe the methodology used to automatically filter and identify lines of poetry from the larger Gutenberg corpus, discuss the potential of this corpus for research and creative work, and then present a series of my own experiments that use this corpus as their primary source material.

New York University