The CL-SciSumm Corpus 2017

On behalf of our team, I’m pleased to announce the release of SciSumm14, an annotated corpus for scientific summarization.

CL-SciSumm 2017 is an open repository containing a corpus of ACL Computational Linguistics research papers and their annotations, contributed to the public by the Web IR / NLP Group at the National University of Singapore (WING-NUS). The corpus is offered as part of the SciSumm Shared Task.

The purpose of releasing this corpus is to highlight the challenges and relevance of the scientific summarization problem, to support research in automatic scientific document summarization, and to provide evaluation resources that advance the state of the art. The corpus offers a “community” summary of a reference paper based on its collection of citing sentences, called citances. Furthermore, each citance is mapped to the text it refers to in the reference paper and tagged with the information facet it represents.

This corpus is expected to be of interest to a broad community, including those working in computational linguistics and NLP, text summarization, discourse structure in scholarly discourse, paraphrase, textual entailment, and text simplification.



Dr. Kokil Jaidka (alumnus, Wee Kim Wee School of Communication and Information, Nanyang Technological University)

Dr. Min-Yen Kan (Dept. of Computer Science, School of Computing, National University of Singapore)

Muthu Kumar Chandrasekaran (Dept. of Computer Science, School of Computing, National University of Singapore)


1. Created by randomly sampling ten documents from the ACL Anthology corpus and selecting their citing papers. It is available for download at

2. Organized into “topic” folders. Each “topic” corresponds to a Reference Paper (RP), and its folder contains ten or more Citing Papers (CPs) that all cite the RP. In each CP, the text spans (i.e., citances) that pertain to a particular citation of the RP have been identified.

3. Most text files were created from the PDF files obtained above using Adobe Acrobat. The remaining files were converted using the open-source GATE 8.0 software. For more details, see the README at

4. Inter-annotator agreement was used to assess the homogeneity and quality of the coding of citances and references, and disagreements were resolved through discussion.

5. The ACL ids and the titles of selected reference papers (out of 50 total) are given below:


ACL Anthology ID    Title of the paper

H89-2014            Augmenting a Hidden Markov Model for Phrase-Dependent Word Tagging
C94-2154            The Correct and Efficient Implementation of Appropriate Specifications for Typed Feature Structures
E03-1020            Discovering Corpus-Specific Word Senses
C90-2039            Strategic Lazy Incremental Copy Graph Unification
J00-3003            Dialogue Act Modeling for Automatic Tagging and Recognition of Conversational Speech
P98-1081            Improving Data Driven Wordclass Tagging by System Combination
N01-1011            A Decision Tree of Bigrams is an Accurate Predictor of Word Sense
H05-1115            Using Random Walks for Question-focused Sentence Retrieval
J98-2005            Estimation of Probabilistic Context-Free Grammars
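
The ids listed above follow the older ACL Anthology pattern: a venue letter (for example, J for the Computational Linguistics journal and P for the ACL conference), a two-digit year, a hyphen, and a four-digit number. For anyone scripting over the corpus, here is a minimal Python sketch that splits such an id into its parts. The names `parse_acl_id`, `AclId`, and `pivot` are hypothetical and not part of the corpus, and the century cut-off is an assumption chosen only so that the ids above resolve sensibly.

```python
import re
from typing import NamedTuple

# Old-style ACL Anthology id: one letter, two-digit year, hyphen, four digits.
ACL_ID = re.compile(r"^([A-Z])(\d{2})-(\d{4})$")


class AclId(NamedTuple):
    venue: str   # single venue letter, e.g. "J" or "P"
    year: int    # four-digit year inferred from the two-digit year
    number: str  # four-digit paper number, kept as a string


def parse_acl_id(acl_id: str, pivot: int = 60) -> AclId:
    """Split an id such as 'H89-2014' into venue, year, and number.

    Assumption: two-digit years >= pivot belong to the 1900s and the
    rest to the 2000s. The pivot value is illustrative.
    """
    match = ACL_ID.match(acl_id.strip())
    if match is None:
        raise ValueError(f"not an old-style ACL Anthology id: {acl_id!r}")
    venue, yy, number = match.groups()
    century = 1900 if int(yy) >= pivot else 2000
    return AclId(venue, century + int(yy), number)


# The reference paper ids listed in this post.
REFERENCE_PAPERS = [
    "H89-2014", "C94-2154", "E03-1020", "C90-2039", "J00-3003",
    "P98-1081", "N01-1011", "H05-1115", "J98-2005",
]

for paper_id in REFERENCE_PAPERS:
    print(paper_id, parse_acl_id(paper_id))
```

If the topic folders turn out to be named after these ids (the README is the place to confirm that), the same helper could be used to sort or group topics by venue or year.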

