M4S: Mining Big Text Data for Semantics
The Mining Big Text Data for Semantics (M4S) workshop, to be held on October 17, 2016 in Kobe, Japan, during the 15th International Semantic Web Conference, aims to explore combinations of statistical and formal semantics-based approaches that marry the analytic depth and precision of the latter with the scalability, recall, and speed of the former.
M4S focuses on two application domains, namely healthcare and finance. In both, large amounts of textual documents, which are still the predominant means of communication, coexist with extensive models in formal knowledge representation languages. Taking healthcare as an example, textual documents remain the means of communication when scholars, industrial practitioners, and authorities publish their research findings, clinical trial reports, recommendations, GxP protocols, and guidelines. At the same time, gigantic ontologies are widely available as the outcomes of community-wide collaborations. In the finance domain, new pieces of data are produced on a second or even millisecond timescale. Unambiguously defining the nuances of these data and bringing them under the regulatory oversight of authorities becomes essential.
The workshop intends to foster discussions and seek answers to the following research and development questions:
- Theoretical questions
- How can distributional semantics and formal semantics work seamlessly together?
- What is the optimal way of combining e.g. large-scale curated knowledge models with associations mined from large text corpora?
- Which characteristics of formal knowledge models are needed such that they can be used in combination with distributional semantics?
- Application questions
- How do certain NLP tasks benefit from a combination of distributional and formal semantics?
- Specifically, how can such a combination be used fruitfully in the healthcare and finance domains?
Call for Participation
Objectives
There has been growing interest in recent years in probabilistic and statistical methods for mining and analysing textual data, fuelled by the explosive growth of computing power and highly efficient algorithms. By quantifying the statistical co-occurrence of words across extremely large corpora, such methods can identify various patterns in natural language and thus allow for accurate predictions in many NLP tasks.
At the same time, large-scale curated knowledge models and ontologies have been developed by international collaborations and successfully applied to semantics-based content processing. Thus, there is increasing interest in complementing formal semantics with statistical methods.
The Mining Big Text Data for Semantics (M4S) workshop aims to explore combinations of statistical and knowledge-based approaches that marry the analytic depth and precision of the latter with the scalability, recall, and speed of the former.
The M4S workshop is firmly grounded in two application domains, namely healthcare and finance. In both, large amounts of textual documents, which are still the predominant means of communication, coexist with extensive models in formal knowledge representation languages.
The workshop consists of four sessions: a keynote speech, paper presentations (3–5 papers), a panel discussion, and a demo & networking session.
Motivation
The semantics of natural language, vaguely defined as it is, has been assigned a variety of canonical forms over the past two decades. Formal, description-logic-based mathematical theories have encountered challenges in applications eager to jump on the Big Data bandwagon, wherein “precision” gives way to “speed” and “scale”. This trend is particularly evident when one tries to make sense of large uncurated text corpora: the sheer size of such text data and their informality render approaches based on formal semantics inefficient. Two typical application domains are healthcare and finance.
In both domains, there is a strong call for gleaning the best of both worlds to tackle “speed” and “ambiguity” at the same time. Taking healthcare as an example, textual documents remain the means of communication when scholars, industrial practitioners, and authorities publish their research findings, clinical trial reports, recommendations, GxP protocols and guidelines. However, gigantic ontologies are also widely available as the outcomes of community-wide collaborations. Existing efforts at combining these two worlds still lack scalability, because they usually involve rewriting free-text search queries with ontology concepts or populating ontologies with extracted instance entities using predefined linguistic patterns.
In the finance domain, new pieces of data are produced on a second or even millisecond timescale, and subtle fluctuations in the data can have a much larger impact on the global financial market. Clearly defining the nuances of these data and bringing them under the regulatory oversight of authorities becomes essential. For instance, the Data Transparency Act mandates that every listed company publish its data in a machine-understandable format. The community, however, has failed to provide satisfactory tools to translate between textual business data and formal reporting models, which has led to strong resistance from businesses and setbacks in the Act's implementation.
Recently, among other statistical and probabilistic tools, bag-of-words text mining has demonstrated its advantages in processing text corpora: it ignores syntactic information and computationally expensive tasks such as part-of-speech tagging and sentence parsing, focusing instead on simple word proximity. This makes it possible to process very large text corpora. The resulting data-driven distributional “semantics” seems to outperform many conventional approaches in detecting “analogies” among words.
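The proximity-based, distributional approach described above can be sketched in a few lines of Python. The toy corpus, the window choice (a whole document), and the raw-count vectors below are illustrative simplifications of our own; real systems use large corpora, sliding context windows, and weighting or embedding schemes.

```python
from collections import Counter
from itertools import combinations
import math

# Toy corpus: each document is treated as a bag of words.
corpus = [
    "the patient received the drug",
    "the patient received the treatment",
    "the market reacted to the report",
    "the market reacted to the news",
]

# Count word co-occurrences; for brevity the co-occurrence "window"
# is simply the whole document.
cooc = Counter()
for doc in corpus:
    for w1, w2 in combinations(sorted(set(doc.split())), 2):
        cooc[(w1, w2)] += 1
        cooc[(w2, w1)] += 1

vocab = sorted({w for doc in corpus for w in doc.split()})

def vector(word):
    """Distributional vector: co-occurrence counts with every vocabulary word."""
    return [cooc[(word, other)] for other in vocab]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Words that share contexts end up with similar vectors:
# "drug" is closer to "treatment" than to "market".
print(cosine(vector("drug"), vector("treatment")))
print(cosine(vector("drug"), vector("market")))
```

Note that no grammar, parsing, or hand-built knowledge model is involved: the similarity of "drug" and "treatment" emerges purely from shared contexts, which is precisely the strength — and the semantic shallowness — that the workshop seeks to pair with formal models.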
The M4S workshop sees a timely opportunity to bring these two communities even closer together to interrogate the interplay between formal and distributional semantics in the context of Big Text Data.
Topics
Topics of interest include but are not limited to:
- Ontology learning/mining from large text corpora
- Relation mining, extraction and validation
- Event extraction
- Entity disambiguation and resolution
- Latent topic modelling
- Incorporating imperfections from text mining into Semantic Web models
- Tools and systems
- Ontology enhanced distributional language models
- Reasoning with both distributional and formal models
- General application areas
- Full-text search: increasing precision and recall of searches using semantics
- Question answering
- Translation aids and multilingual systems
- Semantics-based information extraction, e.g. for
- structured / faceted search
- business analytics and visualisation
- Utilisation in finance and healthcare
- Requirements and use cases for semantic applications in health care and finance
- Deployed systems, experiences and lessons learnt, specifically, ontology learning from PubMed, EDGAR, OpenFDA, etc.
Important Dates
- Submission: 15 Jul 2016 (extended)
- Notification: 15 Aug 2016
- Camera ready: 20 Aug 2016
Submission Guideline
M4S invites three types of submissions:
- Technical papers: maximum 14 pages
- Short position papers: maximum 6 pages
- System demo: a 2-page summary of system features
Submitted papers will be peer-reviewed by at least two workshop Programme Committee members. Accepted papers will be presented at the workshop.
All papers should be written in English following the Springer conference proceedings guidelines (LNCS guidelines). Technical papers should not exceed 14 pages, including bibliography and figures. Short position papers should be no more than 6 pages and should clearly state “position paper” in the title. All system demo submissions should be accompanied by a two-page description of the key features and core technologies of the system. Preferably, a link to a live demo should be made available at the time of submission.
Papers will be submitted in PDF format through EasyChair. If you experience any problems during the submission, please contact the workshop co-Chairs at m4s@easychair.org.
Workshop Organizers
Co-Chairs
- Bo Hu: Fujitsu, United Kingdom
- Hans Friedrich Witschel: Fachhochschule Nordwestschweiz, Switzerland
- Daqing He: University of Pittsburgh, USA
Program Committee
- Panos Alexopoulos: TextKernel, Netherlands
- Ghislain Atemezing: Mondeca, France
- Christian Biemann: TU Darmstadt, Germany
- Victor de la Torre: Fujitsu Labs, Spain
- Ronald Denaux: Expert System, Spain
- Jana Diesner: UIUC, USA
- Sergio Fernandez: Redlink, Austria
- Alessio Ferrari: ISTI CNR, Italy
- Nuria Garcia-Santa: Expert System, Spain
- Andreas Holzinger: TU Graz, Austria
- Daqing He: University of Pittsburgh, USA
- Gerhard Heyer: University of Leipzig, Germany
- Bo Hu: Fujitsu, United Kingdom
- Terunobu Kume: Fujitsu, Japan
- Yu-ru Lin: University of Pittsburgh, USA
- Nuno Lopes: IBM, Ireland
- Fumihito Nishino: Fujitsu, Japan
- Pierre-Yves Vandenbussche: Fujitsu, Ireland
- Elena Montiel Ponsoda: Universidad Politécnica de Madrid, Spain
- Simone Paolo Ponzetto: University of Mannheim, Germany
- Angus Roberts: University of Sheffield, UK
- Barbara Thönssen: FHNW, Switzerland
- Boris Villazon Terrazas: Fujitsu, Spain
- Hans Friedrich Witschel: FHNW, Switzerland
Contact
If you have any questions, please contact one of the workshop organisers via email:
- Bo Hu: bo (dot) hu (at) uk (dot) fujitsu (dot) com
- Hans Friedrich Witschel: hansfriedrich (dot) witschel (at) fhnw (dot) ch
- Daqing He: dah44 (at) pitt (dot) edu