Q&A, Information Retrieval

Name Description Language Author Link
Deep QA A library with implementations of several Q&A and reading comprehension models that can be used with a number of the Q&A datasets listed below Keras/TensorFlow Allen AI View
Basic RNN Q&A model to be used with the bAbI dataset Keras (TF) François Chollet View
Name Description Language Company Link
Q&A Maker Build, train and publish a simple question and answer bot based on FAQ URLs, structured documents or editorial content - Microsoft View
Name Description Dataset Size Source Link
Wiki Q&A Bing query logs are the question source. Each question is linked to sentences from the summary section of the corresponding Wikipedia page. The TSV file marks which sentences are correct answers. 3047 Qs, 19158 options, 1473 answer sentences ParlAI (Facebook Research) View
Web Questions Dataset contains questions and a list of answers from Freebase 5809 Questions ParlAI (Facebook Research) View
Simple Questions Dataset contains questions and links to valid answers on Freebase 108 442 Q & A pairs ParlAI (Facebook Research) View
Stackoverflow - Python Full text of questions and answers from Stack Overflow that are tagged with the python tag 607 282 Qs, 987 122 As Kaggle View
Stackoverflow - R Full text of questions and answers from Stack Overflow that are tagged with the R tag 147 076 Qs, 198 593 As Kaggle View
AI2 Science Questions v2 (May 2017) The AI2 Science Questions dataset consists of questions used in student assessments in the United States across elementary and middle school grade levels 5394 Q & A pairs Allen AI View
AI2 Science Questions Mercury The AI2 Science Questions Mercury dataset consists of questions used in student assessments across elementary and middle school grade levels 5141 Q & A pairs Allen AI View
AI2 Biology How/Why Corpus This dataset consists of 185 "how" and 193 "why" biology questions with the corresponding answers 378 Q & A pairs Allen AI View
AI2 4th Grade Science Exams Training Set 108 real science exam questions 108 Q & A pairs Allen AI View
News Q & A Documents are CNN news articles. Questions are written by human users in natural language. Answers may be multiword passages of the source text 120K Q & A pairs Microsoft View
Graph Questions Factoid questions with logical forms and ground-truth answer (constructed from Freebase) 5166 logical form question pairs associated with answers from the knowledge base UC Santa Barbara, IBM Research, View
Cornell Natural Language Visual Reasoning (NLVR) corpus Image-sentence pairs for visual question answering, for the binary prediction task of judging whether a statement is true for an image 92,244 sentence-image pairs with 3,962 unique sentences. Images are generated synthetically Facebook AI Research, Cornell University View
Name Description Language Author Link
Q&A NLP Oxford + DeepMind NLP course 2017 lecture on question answering models TensorFlow Oxford University View
Retrieval Based models Generate answers based on a repository of pre-defined responses (tutorial is based on the Ubuntu Dialog Corpus) TensorFlow Denny Britz View
Name Description Authors Journal Link
End-To-End Memory Networks Neural network with a recurrent attention model for language modelling Sainbayar Sukhbaatar, Arthur Szlam, Jason Weston, Rob Fergus NIPS 2015 View

Machine Comprehension

Name Description Language Author Link
Bi Directional Attention Flow Automatic generation of answers to questions based on a context paragraph TensorFlow Allen AI View
Name Description Dataset Size Source Link
Textbook Question Answering (TQA) 1,076 lessons from various science textbooks, each with a set of questions that address the concepts taught in that lesson. 6 674 Allen AI View
Explanations for Science Questions The data contains explanation sentences supporting science questions. 691 Allen AI View
MCTest This dataset consists of a set of stories and associated questions. 660 stories, 4 questions per story Microsoft View
Stanford - Wikipedia (SQuAD) Stanford Question Answering Dataset (SQuAD) consists of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text from the passage. 98 169 questions on 536 articles Stanford University View
The Movie Dialog dataset The MDD contains data on question answering, recommendations, and discussions on movies between two entities. The ID for a given dialog starts at 1 and increases. Each ID consists of one turn for each speaker (an "exchange"), and the two turns are tab-separated (\t). When the IDs in a file reset to 1, a new conversation has started QA for various tasks ranging in size from 10,000 to 100,000 ParlAI (Facebook Research) View
Identifying key phrases in a text The dataset contains question-answer pairs and an excerpt. Rows where the excerpt speaks directly to the question are marked. 8262 Crowd Flower View
Children's Book Test (CBT) Questions are created from chapters in the book by enumerating 21 consecutive sentences, where the first 20 sentences form the context and the 21st sentence becomes the query. Train – 364 543, Test – 8 000, Valid – 10 000 ParlAI (Facebook Research) View
MS Marco Questions are sampled from real anonymized user queries. Context passages, from which answers in the dataset are derived, are extracted from real web documents using the Bing search engine. Answers to the queries are human generated. 100K Queries, 1M passages, 200K+ documents Microsoft View
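
Two of the layouts described in the table above, the Movie Dialog exchange numbering and the Children's Book Test context/query windows, can be illustrated with a minimal Python sketch. It is not taken from ParlAI or from either dataset's tooling. The function names, the sample lines, the assumption that a single space separates the ID from the first turn, and the `stride` parameter are all hypothetical; check them against the actual files before relying on them.

```python
from typing import Iterable, List, Sequence, Tuple

Exchange = Tuple[str, str]  # (first speaker's turn, second speaker's turn)


def split_dialogs(lines: Iterable[str]) -> List[List[Exchange]]:
    """Group numbered exchange lines into conversations.

    Assumed line layout (not confirmed by the source): "<id> <turn A><TAB><turn B>".
    """
    dialogs: List[List[Exchange]] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        id_text, _, rest = line.partition(" ")
        # An ID of 1 means the numbering reset, so a new conversation starts.
        if int(id_text) == 1 or not dialogs:
            dialogs.append([])
        turn_a, _, turn_b = rest.partition("\t")
        dialogs[-1].append((turn_a, turn_b))
    return dialogs


def context_query_windows(
    sentences: Sequence[str], context_size: int = 20, stride: int = 1
) -> List[Tuple[List[str], str]]:
    """Slide a window of context_size + 1 sentences over a chapter.

    The first context_size sentences are the context and the last one is the
    query. The stride between windows is an assumption, not stated in the source.
    """
    window = context_size + 1
    examples = []
    for start in range(0, len(sentences) - window + 1, stride):
        chunk = list(sentences[start:start + window])
        examples.append((chunk[:context_size], chunk[context_size]))
    return examples


# Made-up sample lines in the assumed Movie Dialog layout.
sample_lines = [
    "1 hello\thi there\n",
    "2 seen any good films?\tyes, a few\n",
    "1 what should I watch?\ta comedy\n",
]
dialogs = split_dialogs(sample_lines)

# 23 placeholder sentences standing in for a book chapter.
sentences = [f"Sentence {i}." for i in range(1, 24)]
examples = context_query_windows(sentences)
```

With the sample input, `dialogs` holds two conversations (the third line restarts at ID 1), and `examples` holds three windows, each with 20 context sentences and one query sentence.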

Language Modelling

Name Description Language Author Link
Character Level RNN Character level language modelling - the model takes a text file as input and trains an RNN that learns to predict the next character in a sequence. The RNN can then be used to generate text character by character that will look like the original training data Torch Andrej Karpathy View
Character Level RNN Another character level model for text generation TensorFlow Sherjil Ozair View
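
The character-level models above learn to predict the next character in a sequence. As a rough illustration of what that training signal looks like, the sketch below turns a text into (input, target) index sequences in which each target is the input shifted by one character. It covers only the data preparation, not the RNN, and it is not code from either listed repository; the function names and the non-overlapping chunking are assumptions made for the example.

```python
from typing import Dict, List, Tuple


def build_vocab(text: str) -> Dict[str, int]:
    """Map each distinct character to an integer index (sorted for stability)."""
    return {ch: i for i, ch in enumerate(sorted(set(text)))}


def next_char_pairs(text: str, seq_len: int) -> List[Tuple[List[int], List[int]]]:
    """Cut the text into non-overlapping chunks of seq_len characters.

    For every chunk, the target sequence is the same span shifted one
    character to the right, so position i of the target is the character
    that follows position i of the input.
    """
    vocab = build_vocab(text)
    ids = [vocab[ch] for ch in text]
    pairs = []
    for start in range(0, len(ids) - seq_len, seq_len):
        inputs = ids[start:start + seq_len]
        targets = ids[start + 1:start + seq_len + 1]
        pairs.append((inputs, targets))
    return pairs


text = "hello world"
pairs = next_char_pairs(text, seq_len=5)
for inputs, targets in pairs:
    print(inputs, "->", targets)
```

A real training pipeline would typically feed these index sequences to the model as one-hot vectors or embeddings; that step depends on the framework and is left out here.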
Name Author Dataset Size (lines/words/char) Source Link
A Modern History, from the Time of Luther to the Fall of Napoleon John Lord 23096, 213530, 1337687 Project Gutenberg View
An Elementary Study of Chemistry William McPherson 17285, 122713, 782694 Project Gutenberg View
The Eve of the French Revolution Edward J. Lowell 12981, 132608, 792764 Project Gutenberg View
Popular Scientific Lectures Ernst Mach 13775, 110488, 689861 Project Gutenberg View
The Student's Elements of Geology Sir Charles Lyell 27224, 233400, 1461829 Project Gutenberg View
Name Description Dataset Size (lines/words/char) Source Link
Gutenberg Ebook List A plain text file which lists all the ebooks which have been posted to the Project Gutenberg collection with the Author name and additional metadata in some cases 54917 Project Gutenberg View
Hate speech identification Each short text was reviewed by 3 contributors and labeled as hate speech, an offensive statement, or neither. 14,442 Crowd Flower View
Google Books ngrams Describes the usage patterns of some small sets of words/phrases. Dataset provides information such as ‘In year x the word was found in y books and was used a total of z times’ 2.2 TB file Google View
Name Description Language Author Link
RNNs Character Level RNNs Torch Andrej Karpathy View

Language Learning and Machine Translation

Name Description Language Author Link
OpenNMT Neural Machine Translation with pretrained models available for several language pairs. Includes extensions to allow other sequence generation tasks such as summarization and image-to-text generation Torch/Lua Harvard NLP View
Seq2Seq Language models for translation from one language to another TensorFlow Google View
Name Description Dataset Size Author Link
The EMILLE Corpus: Monolingual, Parallel, Annotated The monolingual corpus contains written and spoken data for 14 South Asian languages. The parallel corpus contains English text and its translations into various languages Monolingual - 92,799,000 words, Parallel – 200,000 words of English text and its translations Lancaster University View
Name Description Language Author Link
Seq2Seq Language models for translation from one language to another TensorFlow Google View

Language Scoring

Name Description Dataset Size Source Link
The Hewlett Foundation: Automated Essay Scoring 8 essay sets written by students from grade 7 to grade 10. The length of the essays varies from 150 to 550 words each. Train - 12975, Test – 4255, Valid - 4218 Kaggle View
The Hewlett Foundation: Short Answer Scoring Dataset comprises essays written by students primarily in Grade 10. The average length is 50 words per response. All responses were hand-graded and double-scored. 17043 Essays Kaggle View
English grammar-checking The data set is a collection of text extracts from 9,919 published journal articles (mainly from physics and mathematics) with data before and after language editing Train – 331,687, Dev - 41,069 Textmining View
English Spell-checking Python code for a spelling corrector, plus a large plain-text dataset to test it 1144300 words for testing the code Kaggle View
Grammatical error correction (Shared Task) 1,414 essays written by students at NUS and hand-corrected by professors. For each grammatical error instance, the start and end character offsets of the erroneous text span are marked, and the error type and the correction string are provided 1,414 essays NUS Computing View
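
The grammatical error correction entry above describes annotations as character offsets plus an error type and a correction string. A minimal Python sketch of applying such annotations to recover the corrected text is shown below. It assumes that offsets refer to the original text, that the end offset is exclusive, and that spans do not overlap; none of these is confirmed by the source. The `Correction` type, the example sentence, and the error-type label are hypothetical.

```python
from typing import List, NamedTuple


class Correction(NamedTuple):
    start: int        # offset of the first erroneous character
    end: int          # offset just past the erroneous span (assumed exclusive)
    error_type: str   # illustrative label, not the dataset's tag set
    replacement: str  # correction string


def apply_corrections(text: str, corrections: List[Correction]) -> str:
    """Apply span corrections from right to left.

    Working backwards keeps the offsets of earlier spans valid even when a
    replacement is longer or shorter than the text it replaces.
    """
    for c in sorted(corrections, key=lambda c: c.start, reverse=True):
        text = text[:c.start] + c.replacement + text[c.end:]
    return text


# Made-up example sentence and annotations.
essay = "He go to school yesterday and buyed a book."
corrections = [
    Correction(3, 5, "verb-form", "went"),
    Correction(30, 35, "verb-form", "bought"),
]
corrected = apply_corrections(essay, corrections)
print(corrected)  # He went to school yesterday and bought a book.
```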

Coming Soon

Challenge problems and datasets from partners including IBM and the Ed Tech community