Datasets of 30 English novels for pattern mining and text mining

Today, I want to announce that I have just made public datasets of 30 English novels by 10 authors of the 19th century. These datasets can be used for testing algorithms for sequential pattern mining and sequential rule mining, as well as for some text mining applications such as authorship attribution (guessing the author of an anonymous text) and sequence prediction.

All the datasets are derived from public domain texts that were prepared and converted to a format suitable for text analysis by Jean-Marc Pokou et al. (2016), so that they can be used with the SPMF library.

Each dataset has two versions: (1) sequences of words and (2) sequences of part-of-speech (POS) tags.
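To make the idea of "sequences of words" or "sequences of POS tags" more concrete, here is a minimal sketch of how such a file could be read in Python. It assumes the usual SPMF sequence layout (positive integers are items, -1 ends an itemset, -2 ends a sequence) and, for the version with item names, header lines of the form `@ITEM=<id>=<name>`. The function name `read_spmf_sequences` and the sample lines are hypothetical and only for illustration; the actual dataset files should be checked against the SPMF documentation.

```python
from typing import Dict, Iterable, List, Tuple


def read_spmf_sequences(
    lines: Iterable[str],
) -> Tuple[Dict[int, str], List[List[List[int]]]]:
    """Parse lines in the assumed SPMF sequence layout.

    Returns (item names, sequences), where each sequence is a list of
    itemsets and each itemset is a list of integer item ids.
    """
    names: Dict[int, str] = {}
    sequences: List[List[List[int]]] = []
    for raw in lines:
        line = raw.strip()
        # Skip blank lines and comment lines (assumed prefixes: # and %).
        if not line or line.startswith("#") or line.startswith("%"):
            continue
        # Metadata lines start with @; only @ITEM=<id>=<name> is used here.
        if line.startswith("@"):
            if line.startswith("@ITEM="):
                item_id, _, name = line[len("@ITEM="):].partition("=")
                names[int(item_id)] = name
            continue
        sequence: List[List[int]] = []
        itemset: List[int] = []
        for token in line.split():
            value = int(token)
            if value == -1:      # end of the current itemset
                sequence.append(itemset)
                itemset = []
            elif value == -2:    # end of the sequence
                break
            else:
                itemset.append(value)
        sequences.append(sequence)
    return names, sequences


# Hypothetical sample: two short "sentences" encoded as sequences of words.
sample = [
    "@CONVERTED_FROM_TEXT",
    "@ITEM=1=the",
    "@ITEM=2=whale",
    "@ITEM=3=swims",
    "1 -1 2 -1 3 -1 -2",
    "1 -1 2 -1 -2",
]
names, sequences = read_spmf_sequences(sample)
decoded = [[names[i] for itemset in seq for i in itemset] for seq in sequences]
print(decoded)  # [['the', 'whale', 'swims'], ['the', 'whale']]
```

In this sketch each sentence becomes one sequence and each word (or POS tag) becomes an itemset containing a single item, which is one plausible way to encode text for sequential pattern mining.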

The authors and the total number of words/sentences in the corpus of each author are as follows: Catharine Traill (276,829 / 6,588), Emerson Hough (295,166 / 15,643), Henry Adams (447,337 / 14,356), Herman Melville (208,662 / 8,203), Jacob Abbott (179,874 / 5,804), Louisa May Alcott (220,775 / 7,769), Lydia Maria Child (369,222 / 15,159), Margaret Fuller (347,303 / 11,254), Stephen Crane (214,368 / 12,177), and Thornton W. Burgess (55,916 / 2,950).

Each book listed below is provided in three forms, and each form is available in a words version and a POS version:
(1) Dataset in SPMF format
(2) Dataset in SPMF format with item names (can be used with the GUI of SPMF)
(3) Original book as text

Catharine Traill
- A Tale of the Rice Lake Plains
- Lost in the Backwoods
- The Backwoods of Canada

Emerson Hough
- The Girl at the Halfway House
- The Law of the Land
- The Man Next Door

Henry Adams
- Democracy, an American Novel
- Mont-Saint-Michel and Chartres
- The Education of Henry Adams

Herman Melville
- I and My Chimney
- Israel Potter
- The Confidence-Man: His Masquerade

Jacob Abbott
- Alexander the Great
- History of Julius Caesar
- Queen Elizabeth

Louisa May Alcott
- Eight Cousins
- Rose in Bloom
- The Mysterious Key and What It Opened

Lydia Maria Child
- A Romance of the Republic
- Isaac T. Hopper
- Philothea

Margaret Fuller
- Life Without and Life Within
- Summer on the Lakes, in 1843
- Woman in the Nineteenth Century

Stephen Crane
- Active Service
- Last Words
- The Third Violet

Thornton W. Burgess
- The Adventures of Buster Bear
- The Adventures of Chatterer the Red Squirrel
- The Adventures of Grandfather Frog

All 30 books above are also available together as a single dataset, in the same three forms (words / POS).

If you use the above book datasets, you may want to cite this paper:

Pokou, J. M., Fournier-Viger, P., Moghrabi, C. (2016). Authorship Attribution Using Small Sets of Frequent Part-of-Speech Skip-grams. Proc. 29th Intern. Florida Artificial Intelligence Research Society Conference (FLAIRS 29), AAAI Press, pp. 86-91.

In that paper, we discovered skip-grams (sequential patterns) and n-grams (consecutive sequential patterns) of part-of-speech tags and used them to guess the authors of books.
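The difference between n-grams and skip-grams can be illustrated with a small sketch. It is not the implementation from the paper: the function names, the `max_gap` parameter, and the gap-based definition of a skip-gram (at most `max_gap` tags skipped between two consecutive selected tags) are assumptions made for this example, and the brute-force enumeration is only reasonable for short sequences such as a single sentence.

```python
from itertools import combinations
from typing import List, Sequence, Tuple


def ngrams(tags: Sequence[str], n: int) -> List[Tuple[str, ...]]:
    """Return all runs of n consecutive tags, in order of appearance."""
    return [tuple(tags[i:i + n]) for i in range(len(tags) - n + 1)]


def skip_grams(tags: Sequence[str], n: int, max_gap: int) -> List[Tuple[str, ...]]:
    """Return the distinct n-tag subsequences in which at most max_gap
    tags are skipped between two consecutive selected tags (assumed
    definition for this sketch)."""
    found = set()
    for positions in combinations(range(len(tags)), n):
        gaps_ok = all(b - a - 1 <= max_gap
                      for a, b in zip(positions, positions[1:]))
        if gaps_ok:
            found.add(tuple(tags[p] for p in positions))
    return sorted(found)


# Hypothetical POS tag sequence for a four-word sentence.
tags = ["DT", "JJ", "NN", "VBZ"]

bigrams = ngrams(tags, 2)
skip_bigrams = skip_grams(tags, 2, max_gap=1)

print(bigrams)
# [('DT', 'JJ'), ('JJ', 'NN'), ('NN', 'VBZ')]
print(skip_bigrams)
# [('DT', 'JJ'), ('DT', 'NN'), ('JJ', 'NN'), ('JJ', 'VBZ'), ('NN', 'VBZ')]
```

With `max_gap=0` the skip-grams reduce to the n-grams, which matches the description of n-grams as consecutive sequential patterns.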

More datasets can also be found on the dataset webpage of the SPMF software.


Philippe Fournier-Viger is a computer science professor and founder of the SPMF open-source data mining library, which offers more than 170 algorithms for analyzing data, implemented in Java.
