Welcome to Vidyasirimedhi Institute of Science and Technology.

Big Data Analytics

Scalable Data Systems Lab

Founding Members

Sarana Nutanong
Principal Investigator

Data Engineering
Scalable Machine Learning

Research Topics
High Dimensional Data Management
Vehicular Sensor Network Analytics
Scalable Machine Learning

Thanawin Rakthanmanon
Collaborating Investigator
Kasetsart University

Data Mining
Time Series Data Analysis

Research Topics
Time Series Classification and Clustering
Scalable Algorithms for Search and Indexing
Concept Drift Detection

Ekapol Chuangsuwanich
Collaborating Investigator
Chulalongkorn University

Deep Learning
Signal Processing

Research Topics
Deep Learning
Automatic Speech Recognition

Team Members
Raheem Sarwar: Stylometric Analytical Query Processing
PhD Students:
Nattapol Trijakwanich: Scalable Data Mining
Krissanee Kamthawee: Extreme Multi-class Multi-label Classification
Sasikarn Khwanmuang: Machine Learning Systems
Bundit Boonyarit: Data-Intensive Scientific Discovery (Molecular Dynamics Simulations)
Benchakarn Leelakittisin: Healthcare Data Management

Research Projects

Multi-Author Authorship Attribution

Authorship Identification
Aims at identifying the true author of an anonymous document from a set of candidate authors
Single-label (author) classification problem
Applicable to single-author documents

Authorship Identification for Multi-Author Documents (AIMD)
Given a corpus of multi-author documents labeled with their authors, identify the authors of an anonymous multi-author document from the set of authors who appear in that corpus.
Applicable to single-author/multi-author documents
Multi-label classification problem

Ref: Raheem Sarwar, Chenyun Yu, Sarana Nutanong, Norawit Urailertprasert, Nattapol Vannaboot, Thanawin Rakthanmanon: A Scalable Framework for Stylometric Analysis of Multi-author Documents. DASFAA (1) 2018: 813-829
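The multi-label formulation above can be illustrated with a small sketch. This is not the framework from the cited paper; it is a minimal nearest-neighbour example, assuming character trigram counts as the stylometric features and a similarity-weighted vote over the k most similar labeled documents. The names `char_ngrams`, `cosine`, `predict_authors`, and the parameters `k` and `threshold` are hypothetical.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n=3):
    """Count character n-grams (an assumed, very simple stylometric feature)."""
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine(a, b):
    """Cosine similarity between two n-gram count vectors."""
    dot = sum(a[g] * b[g] for g in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def predict_authors(query, corpus, k=2, threshold=0.5):
    """Return a set of authors (multi-label output) for an anonymous document.

    corpus is a list of (text, set_of_authors) pairs. Each author of the k
    most similar documents receives that document's similarity as a vote;
    authors holding at least `threshold` of the total similarity are returned.
    """
    q = char_ngrams(query)
    scored = [(cosine(q, char_ngrams(text)), authors) for text, authors in corpus]
    top = sorted(scored, key=lambda item: item[0], reverse=True)[:k]
    total = sum(score for score, _ in top)
    if total == 0:
        return set()
    votes = defaultdict(float)
    for score, authors in top:
        for author in authors:
            votes[author] += score
    return {author for author, vote in votes.items() if vote / total >= threshold}


corpus = [
    ("the quick brown fox jumps over the lazy dog", {"alice", "bob"}),
    ("the quick brown fox sleeps under the lazy dog", {"alice"}),
    ("zzz yyy xxx www vvv uuu", {"carol"}),
]
print(predict_authors("the quick brown fox jumps", corpus))
print(predict_authors("zzz yyy xxx", corpus))
```

Because the output is a set rather than a single label, the same function covers single-author documents (a set of size one) and multi-author documents.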

C2Net: A Network-Efficient Approach to Collision Counting LSH Similarity Join

Approximate similarity join based on locality-sensitive hashing (LSH) provides a good solution for reducing the processing cost with a predictable loss of accuracy.
The network cost is the bottleneck in a distributed processing environment.
Focusing on collision counting LSH-based similarity join on MapReduce, we propose a network-efficient solution called C2Net, which improves the utilization of MapReduce combiners.

Ref: Hangyu Li, Sarana Nutanong, Hong Xu, Chenyun Yu, Foryu Ha: C2Net: A Network-Efficient Approach to Collision Counting LSH Similarity Join. IEEE Transactions on Knowledge and Data Engineering, 2018 (Early Access)
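To show why combiners matter for collision counting, here is a minimal single-process simulation. It is not the C2Net algorithm; it only assumes the generic setup in which each mapper sees some LSH buckets, emits one record per colliding pair, and a reducer sums collisions per pair and keeps pairs that reach a threshold. The names `bucket_pairs`, `combine`, `collision_join`, and `min_collisions` are hypothetical.

```python
from collections import Counter
from itertools import combinations


def bucket_pairs(buckets):
    """Map step: emit ((a, b), 1) for every two points sharing an LSH bucket."""
    for members in buckets:
        for a, b in combinations(sorted(members), 2):
            yield (a, b), 1


def combine(records):
    """Combiner: sum collision counts locally before they cross the network."""
    local = Counter()
    for pair, count in records:
        local[pair] += count
    return list(local.items())


def collision_join(mapper_inputs, min_collisions, use_combiner=True):
    """Return (similar pairs, number of records sent over the shuffle)."""
    shuffled = []
    for buckets in mapper_inputs:
        records = list(bucket_pairs(buckets))
        shuffled.extend(combine(records) if use_combiner else records)
    # Reduce step: total collisions per pair across all mappers.
    totals = Counter()
    for pair, count in shuffled:
        totals[pair] += count
    pairs = {pair for pair, count in totals.items() if count >= min_collisions}
    return pairs, len(shuffled)


# Each inner list holds the buckets (from different hash tables) seen by one mapper.
mapper_inputs = [
    [{"p1", "p2", "p3"}, {"p1", "p2"}],
    [{"p1", "p2"}, {"p2", "p3"}],
]
with_combiner = collision_join(mapper_inputs, min_collisions=2, use_combiner=True)
without_combiner = collision_join(mapper_inputs, min_collisions=2, use_combiner=False)
print(with_combiner, without_combiner)
```

Both runs return the same pairs, but the combiner run shuffles fewer records. The saving grows when the same pair collides many times on one mapper, which is the situation a combiner-friendly data layout tries to create.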

A Hardware-Accelerated Solution for Join Operations

The join query is one of the most fundamental database query types for relational database management systems and has a high cost in comparison to other query types.
We propose a novel solution to accelerate processing of sort-merge join queries with a low match rate.

Ref: Zimeng Zhou, Chenyun Yu, Sarana Nutanong, Yufei Cui, Chenchen Fu, Chun Jason Xue: A Hardware-Accelerated Solution for Hierarchical Index-Based Merge-Join. IEEE Transactions on Knowledge and Data Engineering, 2018 (Early Access)
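A software-only sketch can show why a low match rate leaves room for acceleration: most of a plain merge is spent stepping over keys that never match. The example below is not the hardware-accelerated method from the paper. It assumes two sorted lists of unique keys and uses binary search from Python's `bisect` module as a stand-in for an index that lets the merge skip non-matching runs. The function name `skipping_merge_join` is hypothetical.

```python
from bisect import bisect_left


def skipping_merge_join(left, right):
    """Join two sorted lists of unique keys, jumping over non-matching runs."""
    matches = []
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] == right[j]:
            matches.append(left[i])
            i += 1
            j += 1
        elif left[i] < right[j]:
            # Jump to the first left key that is >= the current right key.
            i = bisect_left(left, right[j], i + 1)
        else:
            # Jump to the first right key that is >= the current left key.
            j = bisect_left(right, left[i], j + 1)
    return matches


left = list(range(0, 1000, 2))   # 500 even keys
right = [3, 500, 998, 1001]      # only two of these match
print(skipping_merge_join(left, right))
```

With these inputs the loop runs only a handful of iterations instead of scanning all 500 keys on the left side.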

A Quality-oriented Data Collection Scheme in Vehicular Sensor Networks

The communication overhead of collecting data from all vehicles at a high frequency could be prohibitively expensive.
We propose a Quality-oriented Data Collection (QDC) scheme, which aims to meet the accuracy and real-time requirements of intelligent transportation system (ITS) applications while reducing the communication overhead caused by the huge number of update packets.

Ref: Wendi Nie, Victor C. S. Lee, Dusit Niyato, Yaoxin Duan, Kai Liu, Sarana Nutanong: A Quality-oriented Data Collection Scheme in Vehicular Sensor Networks. IEEE Transactions on Vehicular Technology, 2018 (Early Access)

Multivariate Time Series Data Management

Identifying similar time series is a core subroutine for many data mining and data analysis problems.
Existing efficient solutions fail to scale as the number of dimensions increases.
We propose an efficient approximation method based on locality-sensitive hashing.
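As a rough illustration of the idea, the sketch below hashes multivariate time series with random hyperplanes, so that series pointing in similar directions tend to land in the same bucket. This is an assumed, generic LSH family for cosine similarity, not necessarily the one used in the project. It also assumes equal-length series that are simply flattened into one vector. All function names are hypothetical.

```python
import random


def flatten(series):
    """Concatenate all dimensions of all time steps into one vector."""
    return [value for step in series for value in step]


def make_hyperplanes(length, num_bits, seed=0):
    """Draw num_bits random Gaussian hyperplanes of the given length."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(length)] for _ in range(num_bits)]


def signature(series, hyperplanes):
    """One bit per hyperplane: which side of the plane the series falls on."""
    vec = flatten(series)
    return tuple(
        int(sum(w * x for w, x in zip(plane, vec)) >= 0) for plane in hyperplanes
    )


def build_index(named_series, hyperplanes):
    """Group series names by signature; a bucket holds candidate neighbours."""
    index = {}
    for name, series in named_series.items():
        index.setdefault(signature(series, hyperplanes), []).append(name)
    return index


a = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]           # 3 time steps, 2 dimensions
b = [[2 * v for v in step] for step in a]          # same shape, doubled amplitude
c = [[-v for v in step] for step in a]             # mirrored series
planes = make_hyperplanes(length=6, num_bits=8)
index = build_index({"a": a, "b": b, "c": c}, planes)
print(index)
```

Only series that share a bucket need an exact comparison, so the cost of a lookup depends on bucket size rather than on the size of the whole collection.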