[1] XU Yang, WANG Jiabin, PENG Kai. Parallel Implementation Method of t-SNE Algorithm Combined With PCA[J]. Journal of Huaqiao University (Natural Science), 2022, 43(5): 685-692. [doi:10.11830/ISSN.1000-5013.202110006]

Parallel Implementation Method of t-SNE Algorithm Combined With PCA

Journal of Huaqiao University (Natural Science) [ISSN: 1000-5013 / CN: 35-1079/N]

Volume:
43
Issue:
No. 5, 2022
Pages:
685-692
Publication date:
2022-09-13

Article Info

Title:
Parallel Implementation Method of t-SNE Algorithm Combined With PCA
Article ID:
1000-5013(2022)05-0685-08
Author(s):
XU Yang, WANG Jiabin, PENG Kai
College of Engineering, Huaqiao University, Quanzhou 362021, China
Keywords:
high-dimensional data; Spark platform; dimensionality reduction; visualization; t-SNE algorithm
CLC number:
TP391
DOI:
10.11830/ISSN.1000-5013.202110006
Document code:
A
Abstract:
To improve the processing speed and accuracy for high-dimensional nonlinear data in a big data environment, a t-distributed stochastic neighbor embedding (t-SNE) algorithm combined with principal component analysis (PCA) is proposed. First, the original data are preprocessed with PCA to remove noise points. Then, within the t-SNE algorithm, a K-nearest-neighbor (K-NN) graph is constructed to represent the similarity relationships among the data in the high-dimensional space. Finally, the computation is parallelized on the Spark platform, and experiments are carried out on the BREAST CANCER, MNIST, and CIFAR-10 data sets. The results show that the proposed algorithm effectively maps high-dimensional data to a low-dimensional space, improves efficiency and accuracy, and can be applied to dimensionality reduction of large-scale high-dimensional data.
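
The K-NN graph step and its partition-wise parallelization can be illustrated with a minimal sketch. This is not the authors' Spark implementation: it assumes a brute-force neighbor search, uses a standard-library thread pool as a stand-in for Spark partitions (the shared `all_points` list plays the role of a broadcast variable), and uses a single fixed bandwidth `sigma`, whereas t-SNE normally tunes the bandwidth per point from a perplexity setting. The PCA preprocessing step is omitted, and all function names are hypothetical.

```python
import math
from concurrent.futures import ThreadPoolExecutor


def knn_for_partition(partition, all_points, k):
    """Brute-force k nearest neighbors for the (index, point) pairs of one partition."""
    result = {}
    for i, p in partition:
        # Distance from point i to every other point in the shared data set.
        dists = [(j, math.dist(p, q)) for j, q in enumerate(all_points) if j != i]
        dists.sort(key=lambda pair: pair[1])
        result[i] = dists[:k]
    return result


def parallel_knn_graph(points, k, num_partitions=2):
    """Split the points into partitions and search each partition concurrently."""
    indexed = list(enumerate(points))
    partitions = [indexed[s::num_partitions] for s in range(num_partitions)]
    graph = {}
    with ThreadPoolExecutor(max_workers=num_partitions) as pool:
        partials = pool.map(lambda part: knn_for_partition(part, points, k), partitions)
        for partial in partials:
            graph.update(partial)
    return graph


def neighbor_similarities(graph, sigma=1.0):
    """Gaussian affinities over K-NN edges only, normalized per point.

    Assumption: one fixed sigma for all points (no perplexity search).
    """
    sims = {}
    for i, nbrs in graph.items():
        d_min_sq = min(d * d for _, d in nbrs)
        # Subtracting the smallest squared distance keeps exp() from underflowing.
        weights = [math.exp(-(d * d - d_min_sq) / (2 * sigma * sigma)) for _, d in nbrs]
        total = sum(weights)
        sims[i] = {j: w / total for (j, _), w in zip(nbrs, weights)}
    return sims


points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0), (10.0, 0.0)]
graph = parallel_knn_graph(points, k=2, num_partitions=3)
similarities = neighbor_similarities(graph, sigma=1.0)
```

Restricting the similarities to K-NN edges keeps the affinity structure sparse, storing k entries per point instead of one per pair of points. In a Spark setting the per-partition function would typically be applied with a partition-level map over an RDD, but that mapping is an assumption here rather than something the abstract specifies.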

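Memo

Received: 2021-10-06
Corresponding author: WANG Jiabin (1974-), male, associate professor, whose research focuses on the Internet of Things, cloud computing, and big data. Email: fatwang@hqu.edu.cn.
Funding: Youth Science Fund Project of the National Natural Science Foundation of China (61505059)
Last Update: 2022-09-20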