Language Resource Infrastructures

Contemporary language technology research and development relies more than ever on the deployment of appropriate resources and tools. The paradigm shift of the late eighties now spans almost all areas of language technology: from speech recognition and synthesis, to technologies that extract information from unstructured textual or multimedia content, to current approaches to developing machine translation. Despite the strong dependence of research and technological progress on language data and language-related tools, the respective landscape has been scattered, unorganised and highly fragmented.

In recognition of the need for and importance of language resources for technology development and evaluation, the LDC (Linguistic Data Consortium) was created in the US in 1992, ELRA (European Language Resources Association) in the EU in 1995, and the GSK (Gengo-Shigen-Kyokai, literally “Language Resources Association”) in Japan in 2003, followed later by the NICT (National Institute of Information and Communications Technology) Language Infrastructure Group, all with similar objectives but different governance models. At the national level, the Nederlandse Taalunie (Dutch Language Union) founded the TST-Centrale (HLT Agency) in 2004 as a resource centre for managing data collections, while similar structures have been set up in other EU member states.

Turning to the language resources needed for the development and evaluation of specific applications, corresponding data pooling and processing activities have taken place in projects and fora such as EUROMATRIX (Statistical and Hybrid Machine Translation between all European Languages), TC-STAR (Speech-to-Speech Translation), CLEF (Cross Lingual Information Retrieval Evaluation Forum), SENSEVAL, the CoNLL Shared Tasks, Technolangue-Cesta, etc. The picture is completed by data collections and tools maintained by individual organisations and researchers, available either as downloadable software or as services over the web.

However, the availability, accessibility and visibility of resources and tools, as well as their re-use, re-purposing, interlinking and direct deployment in complex systems and service architectures, are seriously hampered by a lack of interoperability at multiple levels. In fact, the field of language resources and technologies has suffered from problems with practically all three types of interoperability defined by the European Interoperability Framework: organisational, semantic and technical.

In the last few years, the EU and the member states have increased the coordination of their research policies and activities by establishing infrastructures that address the observed fragmentation and the problems it entails. These infrastructures promote the availability and sharing of language resources, the adoption of standards, and the exchange of best practices in resource construction. Contemporary IT and the evolving web (from 2.0 and soon to 3.0), together with its associated technologies, now make it possible to create distributed infrastructures where researchers and developers can easily find language resources and run tools and services. This introduces a new paradigm that helps cut costs and improve efficiency, and it offers a way to produce the resources and technologies needed for the numerous applications with multilingual, multimedia and multimodal dimensions.

First instances of the new paradigm in Europe include the recently launched:

the META-SHARE infrastructure, which facilitates the sharing and exchange of language resources (LRs) in the language technology (LT) community. The data and tools included in this research infrastructure (RI) are both open and access-restricted, free and for-a-fee, and the user group includes both academic and industry users (cf. Accomplishments section below).

the PANACEA factory for building language resources, which offers a platform for creating language processing web services and workflows for limited volumes of content (similarly, cf. section below), as well as

the CLARIN infrastructure, which aims to offer researchers in the Humanities and Social Sciences turn-key online access to existing resources and language processing services.

Outside Europe, OLAC, the Open Language Archives Community, is an international partnership of institutions and individuals creating a worldwide virtual library of LRs; its major contribution is a metadata set for LR description. The Advanced Language Information Forum (ALAGIN) in Japan brings together representatives of industry, academia, research institutions, and the government with the aim of researching and developing technologies for crossing the language barrier. The distribution of tools and LRs that facilitate these technologies is part of its mission.

More recently, as interdisciplinary research has come into focus, a number of initiatives have emerged that try to make research data available across barriers. The Research Data Alliance implements the technology, practice, and connections that make data work across barriers. It aims to help researchers and data scientists from different disciplines define common issues, goals and activities through dedicated Interest Groups. Similarly, the World Data System strives to form a worldwide "community of excellence" for multidisciplinary scientific data, ensuring the provision of quality-assessed data and data services to the international science community and other stakeholders. In all these initiatives, interoperability, consensus-based best practices and standardisation are the main prerequisites.

Institute for Language and Speech Processing