The 2nd BIRNDL workshop and the 3rd Microsoft Research Asia CL-SciSumm Shared Task at SIGIR 2017 will focus on the problems in Big Science – the explosion in the production of scientific literature and the growth of scientific enterprise (Schatz, 2014). This document comprises the full workshop proposal and call for papers.
This workshop is a follow-up to the successful NLPIR4DL workshops and Shared Tasks at JCDL 2016, TAC 2014 and ACL-IJCNLP 2009. The workshop and CL-SciSumm Shared Task generated a lot of interest and participation at JCDL 2016. All proponents heavily favoured an NLP venue for last year and this year, and are keen on participating again. Since 2012, some other workshop series around the same theme have been hosted at major ACL/IR conferences. This underscores the timeliness of our proposal as compared to other long-standing workshops, and the community’s growing interest in this topic.
The objective of our workshop is to provide a forum to enable the progression of research in the search and retrieval of scientific literature. Scientific literature is of interest to a wide variety of users, including academics, policymakers, practitioners, scholars and researchers with specific medical or industrial expertise. It is indexed in large digital repositories, such as the ACL Anthology, ArXiv, Web of Science, ACM Digital Library, IEEE database and Google Scholar, which allow access to digital papers and their metadata (including citations) and are visited daily by millions of users who search for and download papers, even as the size of the repositories grows by thousands every day. The large scale of scholarly publications poses a challenge for scholars in their search for relevant literature, as they are inundated with thousands of results. In the case of evidence-based medical research, the information overload problem could gravely impact researchers’ efficiency and critical decision-making abilities; as such, Big Science is a potent challenge with dire consequences. Some key aspects of this information overload problem are summarized below:
- Lack of automated assistance – Scholars conducting a literature survey must track individual papers over time, which can amount to hundreds or thousands of related papers published per year. Digital libraries do not provide a ready reference of research trends, directions or key concepts to assist them, and need greater automation to manage this signal-rich big data.
- Lack of research focus – Digital libraries require semantic search, question-answering and automated recommendation and reviewing systems to manage and retrieve answers from scholarly databases. The state-of-the-art methods in NLP and IR cannot easily adapt to the context of scientific literature, with its specialized scientific document format, argumentation patterns and technical terminology. Even within a single discipline, scientific papers differ in their format – single papers contain text, figures, tables, images, references to related papers and resources.
- Lack of a standardized corpus – There is a need for established, standardized baselines or evaluation metrics and test collections to evaluate tools and technologies developed for digital libraries. A key requirement is a standardized reference corpus, which can be used for comparative objective benchmarking, research reproducibility and assessment. Through this workshop and its Shared Task, we are providing a corpus of over 500 Computational Linguistics research papers, inter-linked through a citation network, which we plan to double in the coming year.
This workshop will address these challenges in a community forum, and provide resources to evaluate advanced tools for mining and accessing scientific publications. Full document text analysis can help to design semantic search, translation and summarization systems; citation and social network analyses can help digital libraries to visualize scientific trends, bibliometrics and relationships and influences of works and authors. All these approaches can be supplemented with the metadata supplied by digital libraries, inclusive of usage data, such as download counts. Further discussion and progression on this topic would be beneficial to the community.
This workshop will be relevant to scholars in Computational Linguistics, Natural Language Processing and Information Retrieval; it will also be important for all stakeholders in the publication pipeline: implementers, publishers and policymakers, in their efforts to disseminate the right published works to their audience. Finally, the approaches developed herein could aid individuals, universities and funding bodies to better assess the impact of scientific literature.
The 3rd CL-SciSumm Shared Task
The 3rd Computational Linguistics (CL) Scientific Summarization Shared Task is sponsored by Microsoft Research Asia and will be conducted as a part of the 3rd NLPIR4DL workshop. This is the first medium-scale shared task on scientific document summarization in the computational linguistics (CL) domain. It follows up on the successful CL-SciSumm Shared Task as a part of the BIRNDL workshop (JCDL 2016) and the CL Pilot Task conducted as a part of the BiomedSumm Track at the Text Analysis Conference 2014 (TAC 2014). In CL-SciSumm 2016, fifteen teams from six countries signed up, and ten teams ultimately submitted and presented their results.
The Shared Task comprises three sub-tasks in automatic research paper summarization on a new corpus of research papers, as described below.
Given: A topic consisting of a Reference Paper (RP) and 10 or more Citing Papers (CPs) that all contain citations to the RP. In each CP, the text spans (i.e., citances) have been identified that pertain to a particular citation to the RP.
- Task 1a: For each citance, identify the spans of text (cited text spans) in the RP that most accurately reflect the citance. These are of the granularity of a sentence fragment, a full sentence, or several consecutive sentences (no more than 5).
- Task 1b: For each cited text span, identify what facet of the paper it belongs to, from a predefined set of facets.
- Task 2 (optional bonus task): Finally, generate a structured summary of the RP from the cited text spans of the RP. The length of the summary should not exceed 250 words.
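The length limits stated in the tasks above can be checked mechanically before submission. The following is a minimal sketch, not an official validator: it assumes that a cited text span is represented as a list of integer sentence IDs in the RP (a sentence fragment being represented by the sentence that contains it) and that words are counted by whitespace tokenization. The function and constant names are hypothetical.

```python
from typing import List

MAX_SPAN_SENTENCES = 5    # Task 1a: no more than 5 consecutive sentences
MAX_SUMMARY_WORDS = 250   # Task 2: summary of at most 250 words


def is_valid_cited_span(sentence_ids: List[int]) -> bool:
    """True if the IDs form one run of 1 to 5 consecutive sentences.

    Assumption: sentences of the Reference Paper are numbered with
    consecutive integers.
    """
    if not 1 <= len(sentence_ids) <= MAX_SPAN_SENTENCES:
        return False
    ordered = sorted(sentence_ids)
    # Every neighbouring pair must differ by exactly one sentence.
    return all(b - a == 1 for a, b in zip(ordered, ordered[1:]))


def is_valid_summary(summary: str) -> bool:
    """True if the whitespace-tokenized summary has at most 250 words."""
    return len(summary.split()) <= MAX_SUMMARY_WORDS
```

A different tokenizer would give slightly different word counts, so the 250-word check here should be read as approximate.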
Evaluation: Task 1 will be scored by overlap of text spans, measured by the number of sentences in the system output vs. the gold standard. Task 2 will be scored using the ROUGE family of metrics between (i) the system output and the gold standard summary from the reference spans and (ii) the system output and the abstract of the reference paper.
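The sentence-overlap scoring for Task 1 can be illustrated with a small sketch. It assumes that both the system output and the gold standard are given as collections of RP sentence IDs, and that the overlap is summarized as precision, recall and F1. The helper name `sentence_overlap_scores` and the choice of F1 as the summary statistic are assumptions for illustration, not the official evaluation script.

```python
from typing import Iterable, Tuple


def sentence_overlap_scores(
    system_ids: Iterable[int], gold_ids: Iterable[int]
) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) over overlapping sentence IDs.

    Hypothetical helper: spans are assumed to be given as sentence IDs
    of the Reference Paper.
    """
    system = set(system_ids)
    gold = set(gold_ids)
    overlap = len(system & gold)  # sentences found in both spans
    precision = overlap / len(system) if system else 0.0
    recall = overlap / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


if __name__ == "__main__":
    # System picked sentences 12-14; the gold span is sentences 13-16.
    p, r, f = sentence_overlap_scores([12, 13, 14], [13, 14, 15, 16])
    print(f"precision={p:.3f} recall={r:.3f} f1={f:.3f}")
```

In this example two of the three system sentences appear in the gold span, and two of the four gold sentences are recovered.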
The CL-SciSumm corpus, comprising a training corpus of twenty topics and a test corpus of ten topics, was provided to the participants. The topics comprised ACL Computational Linguistics research papers, their citing papers and three output summaries each. The three output summaries comprised: the traditional self-summary of the paper (the abstract), the community summary (the collection of citation sentences ‘citances’) and a human summary written by a trained annotator. Within the corpus, each citance is also mapped to its referenced text in the reference paper and tagged with the information facet it represents. For the 2017 Shared Task, we plan to further enrich this dataset with the AAN metafeatures and other meta-descriptors developed by researchers at DERI, National University of Ireland. More details of the Shared Task are here.
Our goal is to encourage the application of insights from bibliometrics, scientometrics and informetrics to digital libraries. We invite stimulating submissions on topics including full-text analysis, multimedia and multilingual analysis and alignment as well as citation-based NLP, information retrieval and information seeking. Other interests include (but are not limited to):
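One way to picture a corpus topic as described above is as a small record structure. This is a sketch only: the class names, field names and the in-memory layout are hypothetical and do not reflect the actual file format of the released corpus. The facet is kept as a free string because the predefined facet set is not listed here.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class CitanceAnnotation:
    """One citance, mapped to the RP text it refers to (hypothetical layout)."""
    citing_paper_id: str
    citance_text: str
    reference_sentence_ids: List[int]  # referenced text in the reference paper
    facet: str                         # information facet the citance represents


@dataclass
class Topic:
    """A reference paper with its annotated citances and summaries."""
    reference_paper_id: str
    abstract: str        # the traditional self-summary of the paper
    human_summary: str   # written by a trained annotator
    annotations: List[CitanceAnnotation] = field(default_factory=list)

    def community_summary(self) -> str:
        """The community summary: the collection of citances."""
        return " ".join(a.citance_text for a in self.annotations)
```

With this layout, two of the three summaries are stored directly, while the community summary is derived from the citances.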
- Navigation, searching and browsing and niche search in large scientific paper datasets and similar resources; new information access methods for scientific papers
- Network analysis and citation analysis of authors, experts and collaborators; citation function/motivation analysis; novel bibliometric metrics; topical modeling analysis; information retrieval for scholarly text; early influencer detection; modeling the referencing behavior across disciplines
- Summarization of scientific articles; automatic creation of reviews and automatic qualitative assessment of submissions; question-answering for the web, large digital libraries and repositories of scholarly publications
- Recommendation for scholarly papers, reviewers, citations and publication venues
- Knowledge discovery and analysis of the ancestry of ideas; large scale linking of various entities, e.g. articles with articles by similarity
- Translation, multilingual and multimedia analysis and alignment of scholarly works; analyses of writing style in scholarly publications
- Metadata and controlled vocabularies for resource description and discovery; automatic metadata discovery, such as language identification
- Disambiguation issues in scholarly DLs using NLP or IR techniques; data cleaning and data quality
Positive Psychology Centre
University of Pennsylvania, USA
Kokil is a postdoctoral researcher in Computer Science in the World Wellbeing Project at the University of Pennsylvania. She has been the lead coordinator of all aspects of the CL-SciSumm Shared Task since 2014, and she also co-organized the BIRNDL workshop. Her expertise is in multi-document summarization, natural language processing and applied linguistics. Her PhD dissertation involved the development of a literature review framework for the summarization of research papers. Currently, she is conducting social media analyses and user language modeling for behavioral profiling and health outcomes.
Ph.D. Candidate, School of Computing, National University of Singapore
Muthu Kumar is broadly interested in natural language processing and its applications to information retrieval; specifically, in retrieving and organising information from asynchronous conversation media such as scholarly publications and discussion and debate forums. He was on the organizing committee of the CL-SciSumm 2016 Shared Task, the CL-SciSumm 2014 Pilot Task and the BIRNDL workshop. He believes that scholarly research needs to be summarised to avoid redundant or outdated research and to ensure faster progress on pressing problems. His current Ph.D. research addresses a similarly motivated problem in Massive Open Online Course (MOOC) discussion forums: recommending salient student discussions in which instructors, given their limited bandwidth, should intervene.
The list below comprises the confirmed committee members who have agreed to review submissions to the workshop, should it be accepted.
Akiko Aizawa, National Institute of Informatics, Japan
Colin Batchelor, Royal Society of Chemistry, Cambridge, UK
Jöran Beel, University of Konstanz, Germany
Cornelia Caragea, University of North Texas, USA
Jason S Chang, National Tsing Hua University, Taiwan
John Conroy, IDA Center for Computing Sciences, USA
C Lee Giles, Penn State University, USA
Bela Gipp, University of Konstanz, Germany
Nazli Goharian, Georgetown University, USA
Sujatha Das Gollapalli, Institute for Infocomm Research, A*STAR, Singapore
Pawan Goyal, Indian Institute of Technology, Kharagpur, India
Rahul Jha, Microsoft, USA
Noriko Kando, National Institute of Informatics, Japan
Dain Kaplan, Tokyo Institute of Technology, Japan
Roman Kern, Graz University of Technology, Austria
Anna Korhonen, University of Cambridge, UK
John Lawrence, University of Dundee, UK
Elizabeth Liddy, Syracuse University, USA
Chin-Yew Lin, Microsoft Research, USA
Xiaozhong Liu, Indiana University, Bloomington, USA
Kathy McKeown, Columbia University, USA
Prasenjit Mitra, Penn State University / Qatar Computing Research Institute, USA/Qatar
Marie-Francine Moens, KU Leuven, Belgium
Preslav Nakov, Qatar Computing Research Institute, Qatar
Doug Oard, University of Maryland, College Park, USA
Manabu Okumura, Tokyo Institute of Technology, Japan
Arzucan Ozgur, Bogazici University, Turkey
Cecile Paris, CSIRO, Australia
Kazunari Sugiyama, National University of Singapore, Singapore
Simone Teufel, University of Cambridge, UK
Mike Thelwall, University of Wolverhampton, UK
Lucy Vanderwende, Microsoft Research, USA
Vasudeva Varma, International Institute of Information Technology, Hyderabad, India
Andre Vellino, University of Toronto, Canada
Anita de Waard, Elsevier Labs, USA
Alex Wade, Microsoft Research, USA
Stephen Wan, CSIRO ICT Centre, Australia
Supporters are members of the community who support the goals of the workshop and would like to see it happen, but are unable to serve as reviewers in the programme committee.
- Bonnie Dorr
- Oren Etzioni
- Marti Hearst
- Min-Yen Kan
- Diane Litman
- Ani Nenkova
- Dragomir R. Radev
- Chris Reed
Tentative Schedule of Events
We plan to follow the same schedule as our workshop at JCDL. Full papers will be presented in the morning session. The poster session for the participants in the CL-SciSumm Shared Task will be held during lunch. System papers will be presented in the afternoon session, and the workshop will wind up with a fishbowl-style interactive discussion between participants and the organizers to determine future directions and plans. Based on the precedent set in 2016, we expect an increase in interest due both to the topic’s recent popularity and to the shared task. Recently, several major conferences have had similarly themed workshops, but the novelty and short learning curve of our Shared Task have helped us to quickly gain traction and interest in the community.
Participation and Selection Process
Papers submitted to the workshop will follow the standard, double-blind peer-review process by our programme committee. Selected papers will be presented at the workshop and included in the workshop proceedings.
Shared task participants will need to register earlier to obtain our training dataset. Registered participants who submit their system for final evaluation will be invited (at their own cost) to attend the workshop. Although systems will be ranked by their performance on our evaluation metric, all participants will be given the opportunity to present their system’s details through a white paper, accompanied by a demonstration and/or poster session during the workshop.
We anticipate 60-80 attendees at our workshop. This estimate is based on attendance at the BIRNDL workshop, where we had about 50 attendees at the relatively smaller venue of JCDL. Of these, 30 attended the full workshop and over 15 had travelled to the conference specifically to attend our workshop. In the first NLPIR4DL, we had about 30 unique attendees, of whom 15 attended the full workshop.
Previous NLPIR4DL workshops
The 3rd NLPIR4DL workshop is a follow-up to three earlier events: the successful BIRNDL workshop and CL-SciSumm Shared Task, co-located with JCDL 2016, where 11 full papers and 10 system papers were presented (acceptance rate: 30%); the CL-SciSumm Pilot Task at TAC 2014, where the results from 3 system papers were presented; and the NLPIR4DL workshop co-located with IJCNLP-ACL 2009, with 11 full papers (acceptance rate: 21%). Despite not getting selected as an ACL workshop last year, we had a successful 2016 – the Shared Task generated a lot of interest and participation. All proponents heavily favoured an NLP venue for last year and this year, and are keen on participating again.
- Workshop on Mining Scientific Publications (WOSP) at JCDL 2014, 2015 and 2016
- Workshop on Scholarly Big Data at IJCAI 2016, AAAI 2014 and 2016
- 3rd Workshop on Argumentation Mining at ACL 2016
- Bibliometric-enhanced Information Retrieval (BIR) in 2014, 2015 and 2016
Our efforts are synergistic and complementary to the above workshops. We have co-organized BIRNDL with the organizers of BIR, and a number of our PC members served on the Scholarly Big Data workshop committee as well. We believe that the increasing number of workshops on this topic indicates the timeliness of our proposal as compared to other long-standing workshops, and the community’s growing interest in solving related problems.
Schatz, G. (2014). The faces of Big Science. Nature Reviews Molecular Cell Biology, 15(6), 423-426.