Feature Driven Clustering Big Data Covid-19 Analytics

Govindraju G.N, B.K.Raghavendra, Raghavendra S, Santosh Kumar J.

Published: 2021-09-01

Abstract

The growth of the internet, IoT and web services, especially blogs, has increased the demand for BigData, which in turn requires robust and highly efficient analytics systems that can serve distributed data accurately and in real time. Frameworks that distribute data for storage and parallelized computation have been the key driving force behind BigData analytics; however, such systems often lack optimal data pre-processing, feature-sensitive computation and, more importantly, feature learning, which makes most at-hand solutions inferior, especially in terms of time and accuracy. The proposed model hypothesizes that an analytics solution for BigData must be able to process very large, heterogeneous, structured, unstructured and semi-structured multi-dimensional features to yield time-efficient and accurate analytical outputs. To perform the analytical task, the proposed model first employs tokenization, followed by Word2Vec-based semantic feature extraction using the CBOW (continuous bag-of-words) and N-Skip-Gram methods. Unlike other clustering models, we propose an improved multi-objective genetic algorithm (IMOGA) that serves two purposes: first to improve the centroids, and second to optimize the clusters. The model uses Euclidean distance information to perform centroid optimization, while the Silhouette coefficient is applied for cluster validation and optimization. The combination of tokenization, Word2Vec word embedding (feature extraction) and IMOGA K-Means clustering, run in parallel on the Spark distributed data framework, exhibited better performance in terms of execution time and clustering. The proposed model was found to be more effective with Skip-Gram Word2Vec feature extraction. Simulation results on a publicly available COVID-19 dataset showed better performance than existing K-Means-based MapReduce distributed data frameworks.
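
The abstract names two quantities that drive the clustering step: Euclidean distance for centroid optimization and the Silhouette coefficient for cluster validation. The sketch below illustrates only those two standard measures in plain Python; it is not the authors' IMOGA implementation, and the function names (`euclidean`, `centroid`, `silhouette`) are hypothetical. It assumes small in-memory vectors and at least two clusters, and leaves out Word2Vec, the genetic search, and Spark entirely.

```python
import math
from typing import List, Sequence


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def centroid(points: Sequence[Sequence[float]]) -> List[float]:
    """Component-wise mean of a non-empty set of vectors."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]


def silhouette(points: Sequence[Sequence[float]], labels: Sequence[int]) -> float:
    """Mean Silhouette coefficient; assumes at least two distinct labels."""
    clusters = set(labels)
    scores = []
    for i, p in enumerate(points):
        # a: mean distance to the other members of the point's own cluster
        same = [euclidean(p, q) for j, q in enumerate(points)
                if labels[j] == labels[i] and j != i]
        if not same:
            scores.append(0.0)  # common convention for singleton clusters
            continue
        a = sum(same) / len(same)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(euclidean(p, q) for j, q in enumerate(points) if labels[j] == c)
            / sum(1 for lab in labels if lab == c)
            for c in clusters if c != labels[i]
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)
```

A score near 1 indicates compact, well-separated clusters, while a score near or below 0 indicates overlapping or misassigned points, which is why a measure of this kind can serve as a fitness signal when comparing candidate clusterings.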

Issue

Vol. 12 No. 9 (2021)

Section

Articles

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.
