Big Data and Scalable Data Analytics

Over the last decade, various sectors have produced a huge and rapidly increasing quantity of data at extremely high volumes and rates. The digitization of production processes has transformed organizations, regardless of their size, into producers or consumers of big data. Previously isolated structured data silos are becoming widely available on the Web for advanced data management activities, such as processing, searching and querying, interlinking and integration. At the same time, scientific procedures and experimentation have produced huge volumes of data that are processed and used for extracting new knowledge. Finally, social networks have grown to record vast amounts of data reflecting the attitudes and interests of people and society at large. These changes have given birth to the Big Data challenge, which dominates the discussion in most sectors of computer science research (and industry). Big Data affects most aspects of everyday life, and producing scalable data analytics that extract knowledge from the huge volume and speed of Big Data poses new challenges and opportunities.

IMIS is the only Institute in Greece specializing in Information Systems and data management. Most of its personnel have a strong background in data management and databases, and the majority of its projects (and research papers) deal with core data management issues. We develop novel algorithms and methods to model, index and query data produced in the contexts of Linked Open Data and the Data Web, scientific databases, social media, the World Wide Web, and sensor networks. Each of the main areas of research and development is described in more detail in the following subsections. Our goal is to produce high-quality research results and develop novel tools and technologies that can be widely exploited by the research community and industry.

Scalable Analytics for Social Data

The increasing use of online social networks and microblogging platforms, such as Facebook and Twitter, and the content generated by millions of users have led to the development of a multitude of innovative applications based on the analysis of big data from social networks. Selecting an appropriate sample for each application's data analysis is critical to the quality of the analysis results. In this direction, technologies supporting efficient collection and modeling of social network data are key components of an innovative information management platform for social networks. Such a platform fills the gap between the restrictive and cumbersome interfaces offered by online social networks and the need of applications for easy collection and flexible sample selection.

At IMIS, we are developing a novel platform for managing information obtained from online social networks (OSNs), and we are incorporating into it the results of relevant research activities. The main characteristics of the platform include:

  • The ability to interactively define multiple “campaigns” for the large-scale collection of thematically focused information from online social networks.
  • The modeling of the collected data over multiple axes, such as relevance conditions, user activity, time sequence, spatial placement, etc.
  • The creation of “data views” based on complex criteria (thematic, time, space, etc.) for the definition of appropriate samples for each OSN data analysis application.
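
As an illustration of the “data view” idea, the following is a minimal sketch, not the platform's actual interface. The `Post` record, the `data_view` function and all of their field and parameter names are hypothetical, introduced here only to show how thematic, temporal and spatial criteria could be combined to select a sample from collected data.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable, List, Optional, Sequence, Tuple


@dataclass(frozen=True)
class Post:
    """Hypothetical record for one item collected by a campaign."""
    campaign: str
    text: str
    created_at: datetime
    lat: Optional[float] = None
    lon: Optional[float] = None


# Bounding box given as (min_lat, min_lon, max_lat, max_lon).
BBox = Tuple[float, float, float, float]


def data_view(
    posts: Iterable[Post],
    campaign: Optional[str] = None,
    keywords: Optional[Sequence[str]] = None,
    start: Optional[datetime] = None,
    end: Optional[datetime] = None,
    bbox: Optional[BBox] = None,
) -> List[Post]:
    """Return the posts that satisfy every criterion that was supplied."""
    selected = []
    for post in posts:
        # Campaign criterion: keep only posts collected by this campaign.
        if campaign is not None and post.campaign != campaign:
            continue
        # Thematic criterion: at least one keyword occurs in the text.
        if keywords and not any(k.lower() in post.text.lower() for k in keywords):
            continue
        # Temporal criterion: half-open interval [start, end).
        if start is not None and post.created_at < start:
            continue
        if end is not None and post.created_at >= end:
            continue
        # Spatial criterion: posts without coordinates never match a box.
        if bbox is not None:
            if post.lat is None or post.lon is None:
                continue
            min_lat, min_lon, max_lat, max_lon = bbox
            if not (min_lat <= post.lat <= max_lat and min_lon <= post.lon <= max_lon):
                continue
        selected.append(post)
    return selected
```

Under these assumptions, a call such as `data_view(posts, keywords=["polls"], bbox=(37.0, 23.0, 38.5, 24.5))` would return only the geotagged posts inside the box that mention the keyword. A production system would more likely push such criteria down to indexed storage rather than scan the posts in memory.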

Scalable Analytics for Biomedical Data

IMIS, in collaboration with the DNA Intelligent Analysis (DIANA) group of B.S.R.C. «Alexander Fleming», has designed and implemented an advanced IT infrastructure for genomic data management, oriented to processing, analysis and visualization of miRNA-related data and their interactions with genes.

miRNAs are small RNA molecules that are involved in many biological processes, regulating the expression of more than a third of human genes. miRNAs can completely silence proteins. They do so by binding to complementary sequences on mRNA transcripts, called targets. Knowledge of miRNA targets (i.e., which mRNA transcripts are targeted by a miRNA) is important for therapeutic uses (e.g., for cancer and heart disease). For example, based on such knowledge, biologists can shut off genes by delivering artificial miRNA molecules into cells. Since there is a lack of high-throughput experimental methods for identifying miRNA targets, computational methods to predict targets have become increasingly important.
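
To make the notion of complementary binding concrete, here is a minimal sketch of the simplest computational check: scanning an mRNA sequence for sites that are perfectly complementary to the miRNA “seed”. The choice of nucleotides 2 to 8 as the seed, the function names and the example sequences are assumptions made for illustration. This is not the method used by the DIANA tools, and real target prediction combines further signals.

```python
from typing import List

# Watson-Crick pairing for RNA bases.
COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}


def reverse_complement(rna: str) -> str:
    """Return the sequence that pairs with `rna`, read 5' to 3'."""
    return "".join(COMPLEMENT[base] for base in reversed(rna))


def seed_match_sites(mirna: str, mrna: str,
                     seed_start: int = 1, seed_end: int = 8) -> List[int]:
    """Return 0-based positions in `mrna` that perfectly pair with the seed.

    The seed is taken as mirna[seed_start:seed_end], i.e. nucleotides 2-8
    with the default arguments (an assumption of this sketch).
    """
    seed = mirna[seed_start:seed_end]
    site = reverse_complement(seed)
    positions = []
    i = mrna.find(site)
    while i != -1:
        positions.append(i)
        # Continue from the next base so overlapping sites are also found.
        i = mrna.find(site, i + 1)
    return positions


# Toy example sequences (illustrative only).
mirna = "UGAGGUAGUAGGUUGUAUAGUU"
mrna = "AAACUACCUCGGG"
print(seed_match_sites(mirna, mrna))  # prints [3]
```

Exact seed matching of this kind yields many false positives on its own, which is one reason prediction methods also weigh features such as binding energy and sequence conservation.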

In 2012, we finalized the design and development of an integrated database to store and organize data from all DIANA applications in a single place. A key feature is that we are able to reproduce experimental datasets by taking into account versioning information and tracking the evolution of miRNA metadata. Moreover, we designed and implemented several new features for all the existing tools, and we have also launched mirPub, a database with a Web application on top, which provides a powerful and intuitive interface to researchers and miRNA database curators for searching publications related to miRNAs. mirPub provides the most complete set of miRNA-related publications, considering both miRNA name variations and miRNA data evolution, and combining text mining techniques with manual curation. Moreover, mirPub’s interface provides informative graphs that intuitively present the history of miRNA data evolution, tag clouds summarizing the correlation of publications to particular diseases, cell types or tissues, and access to TarBase data to give an overview of the genes related to the publications.

Also, in collaboration with B.S.R.C. “Alexander Fleming” and GRNET, we started the integration of miRNA data-intensive processing modules into virtual/cloud facilities. The aim is to provide the academic community with a research infrastructure (RI) for experimental purposes, covering all aspects of the analysis of large volumes of miRNA data and their interactions with genes. A set of VMs has been set up providing scripts to perform various operations for miRNA analysis. Examples include MRE (miRNA recognition elements) discovery, determining miRNA-to-gene binding energy (RNAhybrid), conservation profiling, score aggregation, etc. Master VMs can then be allocated to orchestrate these Job Workers so that a sequence of operations is performed by executing the scripts. To orchestrate the operations, an API based on HTTP calls has been developed.

Big Data Research Infrastructures

Research Infrastructures are the necessary facilities, resources, and services that enable the research community to conduct research in any scientific and technological domain. For example, in the domain of earth sciences and astronomy, the objective of research infrastructures is to facilitate the collection of observations and relevant data at large scale, and to make them available to researchers at a national, European or worldwide level. The publications of researchers are also made accessible through Open Access research infrastructures, benefiting the whole research community and society at large.

There are several challenges to meet when designing and operating research infrastructures. Management of research data has to be addressed throughout the life cycle of the data, and research infrastructures should help researchers register, discover, access and reuse data. Hence, metadata have to be available for all datasets, regardless of whether the dataset itself is made available or not. Data should be deposited in trusted repositories that provide long-term access. If possible, research data should also be made available to peers for reviewing purposes and made openly available when related articles are published.

Our activities at IMIS are focused on providing efficient systems for accessing large amounts of scientific data by means of query languages flexible enough to address the needs of a range of scientific and technological domains. We also conduct research and development on the modeling of information collected from scientific activities and the extraction of knowledge.

Scalable Data Analytics for Monitoring

One of the most profound challenges in the scalable analytics field is to create tools for real-time monitoring of activities such as transport, water consumption, energy production, etc. In all these cases, numerous data streams produce huge volumes of data that must be analyzed in real time. IMIS has already started research in this direction, focusing on two case studies:

Movement monitoring. Tracing and analyzing the movement of people and vehicles is a very important problem with numerous applications, such as fleet management, location-based recommendations, city monitoring, etc. IMIS is already working in this area by creating data analytics tools for both vehicles and people. Issues like pattern and event detection in real time, visualization, efficient storage and querying are at the core of IMIS work. This work is already under way in existing projects.

Monitoring the use of resources. Monitoring the production and consumption of resources, especially water and energy, is a driving need for the area of scalable data analytics. The detailed monitoring of energy production from small installations of renewable energy sources and of water consumption at the household level is an existing need that calls for advanced analytics tools.

IMIS will create tools for analyzing the consumption of resources, which is an important issue for both the energy and water sectors. We are researching and developing analysis and recommendation services built on real-time resource consumption data. The services will automatically handle all data collection, management and knowledge extraction duties. They will analyze resource consumption data along various residential and consumer dimensions, identify behavioral patterns, and provide personalized recommendations.
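
One simple kind of analysis on a stream of consumption readings is to flag values that depart sharply from a household's recent baseline. The sketch below is only an illustration under stated assumptions: the function name, the trailing-window baseline and the threshold of three standard deviations are all hypothetical choices, not a description of the services under development.

```python
from collections import deque
from statistics import mean, pstdev
from typing import Iterable, List, Tuple


def flag_unusual_readings(readings: Iterable[float],
                          window: int = 24,
                          threshold: float = 3.0) -> List[Tuple[int, float]]:
    """Return (index, value) pairs that deviate from the trailing baseline.

    A reading is flagged when it differs from the mean of the previous
    `window` readings by more than `threshold` standard deviations.
    Readings are processed one at a time, as they would arrive in a stream.
    """
    history = deque(maxlen=window)  # keeps only the most recent readings
    flagged = []
    for i, value in enumerate(readings):
        if len(history) == window:
            baseline = mean(history)
            spread = pstdev(history)
            # A zero spread gives no scale to compare against, so skip it.
            if spread > 0 and abs(value - baseline) > threshold * spread:
                flagged.append((i, value))
        history.append(value)
    return flagged


# Hourly water readings in litres (made-up values for illustration).
hourly_litres = [10, 11, 9, 10, 11, 9, 10, 50, 10]
print(flag_unusual_readings(hourly_litres, window=6))  # prints [(7, 50)]
```

Because only the last `window` readings are kept, memory use stays constant per household, which matters when many streams are monitored at once. Identifying behavioral patterns or producing recommendations would need richer models than this baseline check.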

Emerging Research Directions

IMIS has an excellent record in basic research on issues related to data management, analytics and scalability. IMIS will continue to work on those subjects, giving emphasis to parallelization and real-time analysis, which are core research challenges in Big Data. In addition to the areas described above, a (non-exhaustive) list of basic research problems that IMIS is going to focus on follows:

  • New indexing methods for non-relational data, like set-values, sequences, trajectories, etc.
  • Novel query evaluation techniques for complex queries, like similarity joins, target prediction in genomic series, RDF querying, etc.
  • Novel and scalable data mining techniques, like moving object cluster detection, creation of recommendation models based on social and spatial data, trend detection in Twitter feeds, etc.
  • Event processing and detection on streaming data.
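
To illustrate how indexing set-valued data and similarity joins fit together, the following minimal sketch uses an inverted index to find candidate pairs of sets and then verifies them with Jaccard similarity. It is a textbook-style baseline under assumed names (`similarity_self_join`, `jaccard`), not one of the novel techniques referred to above.

```python
from collections import defaultdict
from typing import Dict, Hashable, List, Set, Tuple


def jaccard(a: Set[Hashable], b: Set[Hashable]) -> float:
    """Jaccard similarity: size of the intersection over size of the union."""
    return len(a & b) / len(a | b)


def similarity_self_join(records: Dict[str, Set[Hashable]],
                         threshold: float) -> List[Tuple[str, str, float]]:
    """Return all pairs of records whose Jaccard similarity >= threshold."""
    # Inverted index: element -> ids of the records that contain it.
    index = defaultdict(set)
    for rid, items in records.items():
        for item in items:
            index[item].add(rid)

    results = []
    for rid, items in records.items():
        # Only records sharing at least one element can be similar,
        # so the index prunes every pair with an empty intersection.
        candidates = set()
        for item in items:
            candidates |= index[item]
        for other in candidates:
            if rid < other:  # report each unordered pair once
                sim = jaccard(items, records[other])
                if sim >= threshold:
                    results.append((rid, other, sim))
    return sorted(results)


records = {"a": {1, 2, 3}, "b": {2, 3, 4}, "c": {7, 8}}
print(similarity_self_join(records, threshold=0.5))  # prints [('a', 'b', 0.5)]
```

The candidate step still compares every pair that shares any element, which becomes costly when some elements are very frequent. Scalable approaches typically add stronger pruning, such as prefix or length filtering, before verification.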