Datasets of 30 English novels for pattern mining and text mining

Today, I want to announce that I have just made public datasets of 30 English novels by 10 authors of the XIX century. These datasets can be used for testing algorithms for sequential pattern mining and sequential rule mining, as well as for some text mining applications such as authorship attribution (guessing the author of an anonymous text) and sequence prediction.

All the datasets are public domain texts that were prepared and converted to a format suitable for text analysis by Jean-Marc Pokou et al. (2016), so that they can be used with the SPMF library.

These books were written by 10 different English novelists from the XIX century. The total numbers of words and sentences (words/sentences) in the corpus of each author are as follows:
Catharine Traill (276,829/ 6,588),
Emerson Hough (295,166/ 15,643),
Henry Adams (447,337/ 14,356),
Herman Melville (208,662/ 8,203),
Jacob Abbott (179,874/ 5,804),
Louisa May Alcott (220,775/ 7,769),
Lydia Maria Child (369,222/ 15,159),
Margaret Fuller (347,303/ 11,254),
Stephen Crane (214,368/ 12,177),
Thornton W. Burgess (55,916/ 2,950).
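
As a quick sanity check on these figures, the following minimal Python sketch computes the average sentence length for each author and the corpus totals. The numbers are copied from the list above; the dictionary and the function name are only illustrative and are not part of the datasets or of SPMF.

```python
# (words, sentences) per author, copied from the list above.
corpus = {
    "Catharine Traill": (276_829, 6_588),
    "Emerson Hough": (295_166, 15_643),
    "Henry Adams": (447_337, 14_356),
    "Herman Melville": (208_662, 8_203),
    "Jacob Abbott": (179_874, 5_804),
    "Louisa May Alcott": (220_775, 7_769),
    "Lydia Maria Child": (369_222, 15_159),
    "Margaret Fuller": (347_303, 11_254),
    "Stephen Crane": (214_368, 12_177),
    "Thornton W. Burgess": (55_916, 2_950),
}


def mean_sentence_length(words, sentences):
    """Average number of words per sentence."""
    return words / sentences


# Print the authors from the shortest to the longest average sentence.
for author, (words, sentences) in sorted(
    corpus.items(), key=lambda kv: mean_sentence_length(*kv[1])
):
    print(f"{author:20s} {mean_sentence_length(words, sentences):5.1f} words/sentence")

total_words = sum(words for words, _ in corpus.values())
total_sentences = sum(sentences for _, sentences in corpus.values())
print(f"Total: {total_words:,} words in {total_sentences:,} sentences")
```

Average sentence length varies quite a lot from one author to another, which gives a first intuition of why stylistic features can help to tell authors apart.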

The list of books is:

Author: datasets (books) in SPMF format

Catharine Traill
– A Tale of The Rice Lake Plains
– Lost in the Backwoods
– The Backwoods of Canada

Emerson Hough
– The Girl at the Halfway House
– The Law of the Land
– The Man Next Door

Henry Adams
– Democracy, an American novel
– Mont-Saint-Michel and Chartres
– The Education of Henry Adams

Herman Melville
– I and My Chimney
– Israel Potter
– The Confidence-Man: His Masquerade

Jacob Abbott
– Alexander the Great
– History of Julius Caesar
– Queen Elizabeth

Louisa May Alcott
– Eight Cousins
– Rose in Bloom
– The Mysterious Key and What It Opened

Lydia Maria Child
– A Romance of the Republic
– Isaac T. Hopper

Margaret Fuller
– Life Without and Life Within
– Summer on the Lakes, in 1843
– Woman in the Nineteenth Century

Stephen Crane
– Active Service
– Last Words
– The Third Violet

Thornton W. Burgess
– The Adventures of Buster Bear
– The Adventures of Chatterer the Red Squirrel
– The Adventures of Grandfather Frog

Each dataset has two versions: (1) sequences of words and (2) sequences of part-of-speech (POS) tags (obtained using the Stanford NLP Tagger).
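
To give an idea of how such a file can be read outside of SPMF, here is a minimal Python sketch. It assumes the usual SPMF layout for sequence databases converted from text: header lines starting with "@" (including lines of the form "@ITEM=id=label" that map integer ids to words or POS tags), followed by one sequence per line where "-1" ends an itemset and "-2" ends the sequence. Please check the downloaded files against this assumption. The function name and the sample lines are made up for illustration.

```python
def parse_spmf_sequences(lines):
    """Parse lines in the assumed SPMF sequence format.

    Returns a list of sequences, each one a list of labels (words or POS tags).
    """
    labels = {}      # integer id -> word or POS tag, taken from "@ITEM=" lines
    sequences = []
    for raw in lines:
        line = raw.strip()
        # Skip empty lines and comment lines.
        if not line or line.startswith(("#", "%")):
            continue
        if line.startswith("@"):
            # Metadata: only the item-name mappings are kept.
            if line.startswith("@ITEM="):
                item_id, label = line[len("@ITEM="):].split("=", 1)
                labels[int(item_id)] = label
            continue
        # A sequence: drop the -1 / -2 separators and keep the items in order.
        items = [int(tok) for tok in line.split() if tok not in ("-1", "-2")]
        # Fall back to the raw id when no label is known for an item.
        sequences.append([labels.get(i, str(i)) for i in items])
    return sequences


# Made-up sample (not taken from the actual datasets): two short POS sequences.
sample = [
    "@CONVERTED_FROM_TEXT",
    "@ITEM=1=DT",
    "@ITEM=2=NN",
    "@ITEM=3=VBZ",
    "1 -1 2 -1 3 -1 -2",
    "1 -1 2 -1 -2",
]

for sequence in parse_spmf_sequences(sample):
    print(" ".join(sequence))
```

Since the function accepts any iterable of lines, an open file object can be passed to it directly. The sketch flattens itemsets, which is fine when each itemset contains a single word or tag.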

Here are the links to download the books:

If you use the above book datasets, you may want to cite this paper:

Pokou, J. M., Fournier-Viger, P., Moghrabi, C. (2016). Authorship Attribution Using Small Sets of Frequent Part-of-Speech Skip-grams. Proc. 29th Intern. Florida Artificial Intelligence Research Society Conference (FLAIRS 29), AAAI Press, pp. 86-91.

In that paper, we discovered skip-grams (sequential patterns) and n-grams (consecutive sequential patterns) of part-of-speech tags, and used them to guess the authors of books.
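
To make the difference between the two kinds of patterns concrete, here is a small Python sketch that enumerates the n-grams and the skip-grams of a sequence of POS tags. It uses the common "k-skip-n-gram" definition (n tags in their original order, with at most k tags skipped in total), which is an assumption and may differ in details from the definition and from the SPMF algorithms used in the paper. The function names and the example sentence are only illustrative.

```python
from itertools import combinations


def ngrams(tokens, n):
    """All runs of n consecutive tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def skipgrams(tokens, n, max_skip):
    """All n-token subsequences that keep the original order and skip
    at most max_skip tokens in total (k-skip-n-grams)."""
    result = []
    for start in range(len(tokens)):
        # The remaining n-1 positions must fall inside this window, so that
        # no more than max_skip tokens are left out between the chosen ones.
        window_end = min(len(tokens), start + n + max_skip)
        for rest in combinations(range(start + 1, window_end), n - 1):
            result.append(tuple(tokens[i] for i in (start,) + rest))
    return result


# POS tags of a made-up four-word sentence.
tags = ["DT", "JJ", "NN", "VBZ"]

print(ngrams(tags, 2))        # consecutive pairs only
print(skipgrams(tags, 2, 1))  # pairs with at most one tag skipped
```

With max_skip set to 0, the skip-grams are exactly the n-grams. Allowing gaps produces more patterns, which can then be counted per author to find the frequent ones.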

More datasets can also be found on the dataset webpage of the SPMF software.

Philippe Fournier-Viger is a computer science professor and founder of the SPMF open-source data mining library, which offers more than 170 algorithms for analyzing data, implemented in Java.
