Scalable Data Systems Lab
Founding Members

Sarana Nutanong
Principal Investigator
VISTEC
Data Engineering
Scalable Machine Learning
Research Topics
High Dimensional Data Management
Vehicular Sensor Network Analytics
Scalable Machine Learning

Thanawin Rakthanmanon
Collaborating Investigator
Kasetsart University
Data Mining
Time Series Data Analysis
Research Topics
Time Series Classification and Clustering
Scalable Algorithms for Search and Indexing
Concept Drift Detection

Ekapol Chuangsuwanich
Collaborating Investigator
Chulalongkorn University
Deep Learning
Signal Processing
Research Topics
Deep Learning
Automatic Speech Recognition
Team Members
Postdoc:
• Raheem Sarwar: Stylometric Analytical Query Processing
PhD Students:
• Nattapol Trijakwanich: Scalable Data Mining
• Krissanee Kamthawee: Extreme Multi-class Multi-label Classification
• Sasikarn Khwanmuang: Machine Learning Systems
• Bundit Boonyarit: Data-Intensive Scientific Discovery (Molecular Dynamics Simulations)
• Benchakarn Leelakittisin: Healthcare Data Management

Research Projects
Multi-Author Authorship Attribution

Authorship Identification
Aims at identifying the true author of an anonymous document from a set of candidate authors
Single-label (author) classification problem
Applicable to single-author documents
Authorship Identification for Multi-Author Documents (AIMD)
Given a corpus of multi-author documents labeled with their authors, identify the authors of an anonymous multi-author document from the set of authors who appear in that corpus.
Applicable to both single-author and multi-author documents
Multi-label classification problem
Ref: Raheem Sarwar, Chenyun Yu, Sarana Nutanong, Norawit Urailertprasert, Nattapol Vannaboot, Thanawin Rakthanmanon: A Scalable Framework for Stylometric Analysis of Multi-author Documents. DASFAA (1) 2018: 813-829
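The difference between the single-label and multi-label settings above can be illustrated with a minimal sketch. It is not the framework from the referenced paper: the per-author scores, the function names, and the 0.5 threshold are all hypothetical and stand in for whatever stylometric model produces a score per candidate author.

```python
from typing import Dict, List


def predict_single_author(scores: Dict[str, float]) -> str:
    """Single-label setting: return the one candidate with the highest score."""
    return max(scores, key=scores.get)


def predict_author_set(scores: Dict[str, float], threshold: float = 0.5) -> List[str]:
    """Multi-label setting: return every candidate whose score reaches the threshold."""
    return sorted(author for author, s in scores.items() if s >= threshold)


# Hypothetical similarity scores for one anonymous document (made-up values).
scores = {"author_a": 0.81, "author_b": 0.64, "author_c": 0.12}

single = predict_single_author(scores)  # exactly one author
multi = predict_author_set(scores)      # zero or more authors
print(single, multi)
```

In the single-label case the answer is always exactly one author, whereas the multi-label case can return any subset of the candidates, which is what a multi-author document requires.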
C2Net: A Network-Efficient Approach to Collision Counting LSH Similarity Join

Approximate similarity join based on locality-sensitive hashing (LSH) reduces processing cost at the price of a predictable loss of accuracy.
In a distributed processing environment, however, the network cost is the bottleneck.
Focusing on collision counting LSH-based similarity join on MapReduce, we propose a network-efficient solution called C2Net, which improves the utilization of MapReduce combiners.
Ref: Hangyu Li, Sarana Nutanong, Hong Xu, Chenyun Yu, Foryu Ha: C2Net: A Network-Efficient Approach to Collision Counting LSH Similarity Join. IEEE Transactions on Knowledge and Data Engineering, 2018 (Early Access)
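To see why combiners matter for collision counting, the following is a toy single-process simulation, not the C2Net algorithm itself. It assumes that each mapper handles a few hash tables, emits one record per pair of points sharing a bucket, and that a pair is reported when it collides in at least `min_collisions` tables. The data, the function names, and the partitioning of tables across mappers are hypothetical.

```python
from collections import Counter
from itertools import combinations
from typing import Dict, List, Set, Tuple

Pair = Tuple[str, str]
Record = Tuple[Pair, int]


def collision_pairs(table: Dict[int, List[str]]) -> List[Record]:
    """Map step: emit ((p, q), 1) for every two points that share a bucket."""
    out: List[Record] = []
    for members in table.values():
        for p, q in combinations(sorted(members), 2):
            out.append(((p, q), 1))
    return out


def combine(records: List[Record]) -> List[Record]:
    """Combiner: sum the counts per pair on the mapper, before the shuffle."""
    local: Counter = Counter()
    for pair, n in records:
        local[pair] += n
    return list(local.items())


def reduce_counts(shuffled: List[Record], min_collisions: int) -> Set[Pair]:
    """Reduce step: keep the pairs whose total collision count is large enough."""
    total: Counter = Counter()
    for pair, n in shuffled:
        total[pair] += n
    return {pair for pair, n in total.items() if n >= min_collisions}


# Two hypothetical mappers, each holding two hash tables (bucket id -> points).
mapper_inputs = [
    [{0: ["a", "b"], 1: ["c"]}, {0: ["a", "b", "c"]}],
    [{0: ["a", "b"], 1: ["c"]}, {0: ["a"], 1: ["b", "c"]}],
]

plain: List[Record] = []     # records shuffled without a combiner
combined: List[Record] = []  # records shuffled with a combiner
for tables in mapper_inputs:
    records = [r for table in tables for r in collision_pairs(table)]
    plain.extend(records)
    combined.extend(combine(records))

result_plain = reduce_counts(plain, min_collisions=3)
result_combined = reduce_counts(combined, min_collisions=3)
print(len(plain), len(combined), result_combined)
```

Both runs report the same pairs, but the combined run sends fewer records over the (simulated) network. The saving grows when the same pair collides many times on the same mapper, which is the situation a combiner-friendly data layout tries to create.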
A Hardware-Accelerated Solution for Join Operations
The join query is one of the most fundamental database query types for relational database management systems and has a high cost in comparison to other query types.
We propose a novel solution to accelerate processing of sort-merge join queries with a low match rate.
Ref: Zimeng Zhou, Chenyun Yu, Sarana Nutanong, Yufei Cui, Chenchen Fu, Chun Jason Xue: A Hardware-Accelerated Solution for Hierarchical Index-Based Merge-Join. IEEE Transactions on Knowledge and Data Engineering, 2018 (Early Access)
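A software analogy may help show why a low match rate leaves room for acceleration. The sketch below is not the hardware design from the referenced paper: it is a plain merge join over two sorted key lists in which binary search (standing in for an index) jumps over runs of keys that cannot match. The function name and sample data are hypothetical.

```python
from bisect import bisect_left
from typing import List, Tuple


def skipping_merge_join(left: List[int], right: List[int]) -> List[Tuple[int, int]]:
    """Join two sorted key lists and return (left_index, right_index) matches.

    When the current keys differ, the side with the smaller key jumps
    directly to the first position that could match the other side.
    """
    out: List[Tuple[int, int]] = []
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] < right[j]:
            i = bisect_left(left, right[j], i + 1)   # skip non-matching run
        elif left[i] > right[j]:
            j = bisect_left(right, left[i], j + 1)   # skip non-matching run
        else:
            key = left[i]
            i_end = i
            while i_end < len(left) and left[i_end] == key:
                i_end += 1
            j_end = j
            while j_end < len(right) and right[j_end] == key:
                j_end += 1
            # Emit the cross product of the two runs of equal keys.
            out.extend((a, b) for a in range(i, i_end) for b in range(j, j_end))
            i, j = i_end, j_end
    return out


left_keys = [1, 3, 5, 7, 9, 200, 201]
right_keys = [2, 4, 200, 200, 500]
matches = skipping_merge_join(left_keys, right_keys)
print(matches)
```

With few matching keys, most of the work in a plain merge join is stepping past keys one at a time; skipping replaces those steps with a handful of index lookups.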

A Quality-oriented Data Collection Scheme in Vehicular Sensor Networks
The communication overhead of collecting data from all vehicles at a high frequency could be prohibitively expensive.
We propose a Quality-oriented Data Collection (QDC) scheme which aims to effectively support the accuracy and real-time requirements stipulated by intelligent transportation system (ITS) applications, while reducing the communication overhead caused by the huge number of update packets.
Ref: Wendi Nie, Victor C. S. Lee, Dusit Niyato, Yaoxin Duan, Kai Liu, Sarana Nutanong: A Quality-oriented Data Collection Scheme in Vehicular Sensor Networks. IEEE Transactions on Vehicular Technology, 2018 (Early Access)

Multivariate Time Series Data Management
Identifying similar time series is a core subroutine for many data mining and data analysis problems.
Existing efficient solutions fail to scale as the number of dimensions increases.
We propose an efficient approximation method via locality-sensitive hashing.
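As a rough illustration of hashing multivariate time series, the sketch below uses sign random projections, one common LSH family. The text does not say which hash family the project uses, so this choice, the function names, and the sample data are assumptions. Each series (a list of time steps, each with several dimensions) is flattened and reduced to a short bit signature; similar series tend to agree on most bits.

```python
import random
from typing import List, Sequence, Tuple

Series = Sequence[Sequence[float]]  # time steps x dimensions


def make_hyperplanes(num_bits: int, dim: int, seed: int = 0) -> List[List[float]]:
    """Draw one random Gaussian hyperplane per signature bit."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_bits)]


def flatten(series: Series) -> List[float]:
    """Concatenate all time steps into a single vector."""
    return [value for step in series for value in step]


def signature(series: Series, planes: List[List[float]]) -> Tuple[int, ...]:
    """One bit per hyperplane: which side of the plane the series falls on."""
    flat = flatten(series)
    return tuple(
        1 if sum(w * x for w, x in zip(plane, flat)) >= 0 else 0
        for plane in planes
    )


def hamming(a: Sequence[int], b: Sequence[int]) -> int:
    """Number of differing bits; a small distance suggests similar series."""
    return sum(x != y for x, y in zip(a, b))


# A hypothetical series with 4 time steps and 3 dimensions (made-up values).
series_a = [[0.1, 1.0, -0.5], [0.2, 0.9, -0.4], [0.4, 0.7, -0.2], [0.5, 0.6, 0.0]]
series_b = [[2 * v for v in step] for step in series_a]  # same shape, rescaled

planes = make_hyperplanes(num_bits=16, dim=12)
sig_a = signature(series_a, planes)
sig_b = signature(series_b, planes)
print(sig_a, hamming(sig_a, sig_b))
```

The cost of computing a signature grows linearly with the number of dimensions, and comparing two signatures costs the same regardless of the original dimensionality. Because this family depends only on the direction of the flattened vector, a positively rescaled copy of a series receives the same signature.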
