Today, I want to announce that I have just made public datasets built from 30 English novels by 10 authors of the XIX century. These datasets can be used for testing algorithms for sequential pattern mining and sequential rule mining, as well as for some text mining applications such as authorship attribution (guessing the author of an anonymous text) and sequence prediction.
All the datasets are public domain texts that were prepared and converted to a format suitable for text analysis by Jean-Marc Pokou et al. (2016), so that they can be used with the SPMF library.
These books were written by 10 different English novelists from the XIX century. The total number of words and sentences in the corpus of each author is as follows (words / sentences):
Catharine Traill (276,829 / 6,588),
Emerson Hough (295,166 / 15,643),
Henry Adams (447,337 / 14,356),
Herman Melville (208,662 / 8,203),
Jacob Abbott (179,874 / 5,804),
Louisa May Alcott (220,775 / 7,769),
Lydia Maria Child (369,222 / 15,159),
Margaret Fuller (347,303 / 11,254),
Stephen Crane (214,368 / 12,177),
Thornton W. Burgess (55,916 / 2,950).
The list of books is:
Author | Datasets (books) in SPMF format |
Catharine Traill | – A Tale of The Rice Lake Plains – Lost in the Backwoods – The Backwoods of Canada |
Emerson Hough | – The Girl at the Halfway House – The Law of the Land – The Man Next Door |
Henry Adams | – Democracy, an American novel – Mont-Saint-Michel and Chartres – The Education of Henry Adams |
Herman Melville | – I and My Chimney – Israel Potter – The Confidence-Man: His Masquerade |
Jacob Abbott | – Alexander the Great – History of Julius Caesar – Queen Elizabeth |
Louisa May Alcott | – Eight Cousins – Rose in Bloom – The Mysterious Key and What It Opened |
Lydia Maria Child | – A Romance of the Republic – Isaac T. Hopper – Philothea |
Margaret Fuller | – Life Without and Life Within – Summer on the Lakes, in 1843 – Woman in the Nineteenth Century |
Stephen Crane | – Active Service – Last Words – The Third Violet |
Thornton W. Burgess | – The Adventures of Buster Bear – The Adventures of Chatterer the Red Squirrel – The Adventures of Grandfather Frog |
Each dataset has two versions: (1) sequences of words and (2) sequences of part-of-speech (POS) tags (obtained using the Stanford NLP Tagger).
Here are the links to download the books:
- Books in SPMF format (sequences of words)
- Books in SPMF format (sequences of part-of-speech tags)
- Books in SPMF format with item names (sequences of words)
- Books in SPMF format with item names (sequences of part-of-speech tags)
- Original books as text files
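To give an idea of how these files can be read outside of SPMF, here is a minimal Python sketch. It assumes the usual SPMF sequence layout: one sequence per line, items written as positive integers, `-1` closing an itemset and `-2` closing the sequence, and, in the versions with item names, header lines of the form `@ITEM=<id>=<name>`. The function name `parse_spmf_sequences` and the tiny sample are made up for illustration and are not taken from the actual book files, so please check the downloaded files before relying on it.

```python
from typing import Dict, Iterable, List, Tuple


def parse_spmf_sequences(
    lines: Iterable[str],
) -> Tuple[Dict[int, str], List[List[List[int]]]]:
    """Parse SPMF-style sequence lines (hypothetical helper, not part of SPMF).

    Returns (item_names, sequences), where each sequence is a list of
    itemsets and each itemset is a list of integer item ids.
    """
    item_names: Dict[int, str] = {}
    sequences: List[List[List[int]]] = []
    for raw in lines:
        line = raw.strip()
        # Skip blank lines and comment lines.
        if not line or line.startswith("#") or line.startswith("%"):
            continue
        if line.startswith("@"):
            # Assumed metadata form: @ITEM=<id>=<name>; other @ lines are ignored.
            if line.startswith("@ITEM="):
                item_id, _, name = line[len("@ITEM="):].partition("=")
                item_names[int(item_id)] = name
            continue
        sequence: List[List[int]] = []
        itemset: List[int] = []
        for token in line.split():
            value = int(token)
            if value == -2:      # end of the sequence
                break
            if value == -1:      # end of the current itemset
                sequence.append(itemset)
                itemset = []
            else:
                itemset.append(value)
        if itemset:              # tolerate a missing -1 before -2
            sequence.append(itemset)
        sequences.append(sequence)
    return item_names, sequences


# Invented sample, only to show the shape of the data.
sample = [
    "@CONVERTED_FROM_TEXT",
    "@ITEM=1=the",
    "@ITEM=2=cat",
    "@ITEM=3=sat",
    "1 -1 2 -1 3 -1 -2",
    "1 -1 3 -1 -2",
]
names, seqs = parse_spmf_sequences(sample)
# Map item ids back to words (or POS tags) for each sequence.
decoded = [[names[i] for itemset in seq for i in itemset] for seq in seqs]
print(decoded)
```

For the versions without item names, the same function should still work; the returned dictionary of names will simply be empty and the sequences will contain only integer ids.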
If you use the above book datasets, you may want to cite this paper:
Pokou, J. M., Fournier-Viger, P., Moghrabi, C. (2016). Authorship Attribution Using Small Sets of Frequent Part-of-Speech Skip-grams. Proc. 29th Intern. Florida Artificial Intelligence Research Society Conference (FLAIRS 29), AAAI Press, pp. 86-91.
In that paper, we discovered skip-grams (sequential patterns) and n-grams (consecutive sequential patterns) of part-of-speech tags and used them to guess the authors of books.
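To illustrate the difference between the two kinds of patterns, here is a small Python sketch that counts n-grams and skip-grams in one sentence of POS tags. This is only an illustration and not the mining procedure of the paper: the functions `ngrams` and `skipgrams`, the `max_gap` parameter (maximum number of tags skipped between two consecutive elements of a pattern) and the sample sentence are all assumptions made for this example.

```python
from collections import Counter
from itertools import combinations
from typing import List, Sequence, Tuple


def ngrams(tags: Sequence[str], n: int) -> List[Tuple[str, ...]]:
    """All consecutive subsequences of length n."""
    return [tuple(tags[i:i + n]) for i in range(len(tags) - n + 1)]


def skipgrams(tags: Sequence[str], n: int, max_gap: int) -> List[Tuple[str, ...]]:
    """All ordered subsequences of length n in which at most max_gap tags
    are skipped between two consecutive chosen positions.

    Brute force over position combinations: fine for one sentence,
    too slow for a whole book.
    """
    result = []
    for positions in combinations(range(len(tags)), n):
        gaps_ok = all(b - a - 1 <= max_gap
                      for a, b in zip(positions, positions[1:]))
        if gaps_ok:
            result.append(tuple(tags[p] for p in positions))
    return result


# Invented example sentence, already converted to POS tags.
sentence = ["DT", "NN", "VBD", "IN", "DT", "NN"]

bigram_counts = Counter(ngrams(sentence, 2))
skip_bigram_counts = Counter(skipgrams(sentence, 2, max_gap=1))

print(bigram_counts.most_common(3))
print(skip_bigram_counts.most_common(3))
```

With `max_gap=0` the skip-grams are exactly the n-grams, which is why n-grams can be seen as the special case of consecutive sequential patterns. Counting such patterns over all the sentences of an author, and keeping the frequent ones, gives a simple signature of that author's style.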
More datasets can also be found on the dataset webpage of the SPMF software.
—
Philippe Fournier-Viger is a computer science professor and founder of the SPMF open-source data mining library, which offers more than 170 algorithms for analyzing data, implemented in Java.