-
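The core idea of Bayesian classification [11] is to optimize data classification by minimizing the misclassification loss under a known probability distribution. By Bayes' theorem, the class probability conditioned on the sample features, P(K|X), can be expressed as: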
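$$P(K \mid X) = \frac{P(K)\,P(X \mid K)}{P(X)}$$ (1)
where K denotes the class; X denotes the sample features; P(K) is the prior probability, which can be obtained in advance; P(X|K) is the class-conditional probability of the sample features; and P(X) is a normalizing coefficient that does not depend on the class. Equation (1) thus converts the computation of the class probability into the computation of the prior probability and the feature-dependent factor.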
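As a small numerical illustration of Eq. (1), the sketch below computes P(K|X) for two classes. The class names and probability values are made up for the example and do not come from the paper; P(X) is obtained by summing P(K)P(X|K) over the classes.

```python
def posterior(prior, likelihood):
    """Return P(K|X) for every class K, given P(K) and P(X|K)."""
    # P(X) is the sum over all classes of P(K) * P(X|K).
    evidence = sum(prior[k] * likelihood[k] for k in prior)
    return {k: prior[k] * likelihood[k] / evidence for k in prior}


# Hypothetical values, chosen only to illustrate the formula.
prior = {"normal": 0.9, "abnormal": 0.1}        # P(K)
likelihood = {"normal": 0.05, "abnormal": 0.6}  # P(X|K) for one observed X
post = posterior(prior, likelihood)
print(post)
```

Although the prior favors the "normal" class, the observed features are far more likely under "abnormal", so the posterior for "abnormal" ends up larger.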
-
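In optical fiber networks, abnormal data usually exhibit certain characteristics, and the forms that the abnormal data take are correlated, so identifying abnormal information by means of prior probabilities has certain advantages. Abnormal information in optical fiber networks is mostly caused by erroneous codes, data collisions and similar faults, and these anomalies occur essentially independently of one another. A naive Bayes strategy [12] is therefore adopted for partitioning, which offers strong stability and high accuracy. When the class probability is computed, the classes are the control variables k of the algorithm. Because there is more than one control variable, subscripts are used to distinguish them: the m control variables are k1, k2, k3, ..., km (written Ki in the formulas), and the corresponding feature components are x1, x2, x3, ..., xn. Let the sample be X = {x1, x2, x3, ..., xn}. For class Ki, the Bayesian classification probability can then be expressed as: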
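$$P(K_i \mid X) = \frac{P(K_i)\,P(X \mid K_i)}{P(X)}$$ (2)
where P(X) is a normalizing coefficient that is the same for every class. Maximizing the numerator of Eq. (2) therefore also maximizes P(Ki|X). In typical optical fiber network transmission, the probability of a control variable, P(Ki), is determined by the hardware, so from the standpoint of the probability distribution it is usually treated as a constant as well. The computation therefore reduces to evaluating P(X|Ki), which, because the features are assumed independent, can be written as: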
$$P(X \mid K_i) = \prod\limits_{k = 1}^{n} P(x_k \mid K_i)$$ (3)
From the sample data, P(X|Ki) in Eq. (3) can be evaluated for each value of X. The sample is assigned to the class Ki for which this value is largest, which satisfies the maximization requirement.
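The decision rule in Eqs. (2) and (3) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `classify`, the class labels, the feature values and all probabilities are hypothetical. Since P(X) is common to every class, only the numerator P(Ki) ∏ P(xk|Ki) is compared.

```python
import math


def classify(sample, priors, cond_probs):
    """Return the class maximizing P(K_i) * prod_k P(x_k | K_i), plus all scores."""
    scores = {}
    for label, prior in priors.items():
        # Eq. (3): product of the per-feature conditional probabilities.
        likelihood = math.prod(
            cond_probs[label][j][value] for j, value in enumerate(sample)
        )
        # Numerator of Eq. (2); P(X) is omitted because it is class-independent.
        scores[label] = prior * likelihood
    return max(scores, key=scores.get), scores


# Hypothetical classes and probability tables, for illustration only.
priors = {"code_error": 0.5, "data_collision": 0.5}
cond_probs = {
    "code_error": [{"high": 0.8, "low": 0.2}, {"burst": 0.3, "steady": 0.7}],
    "data_collision": [{"high": 0.4, "low": 0.6}, {"burst": 0.9, "steady": 0.1}],
}
label, scores = classify(("high", "steady"), priors, cond_probs)
print(label, scores)
```

For long feature vectors the product of many small probabilities can underflow; summing logarithms instead of multiplying is a common remedy, at the cost of requiring every conditional probability to be nonzero.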
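Regarding the prior probability of the samples [13]: if a given component value never appears in any sample of the training set, its estimated probability is 0, and the whole product in Eq. (3) becomes 0 as a result. Laplace smoothing [14] is therefore used to correct the prior probability, which prevents non-feature data from occupying the categories of feature data. Let N be the number of classes in the training set D and Ni the number of possible values of the i-th feature. The corrected estimate is then: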
$$P(X \mid K_i) = \frac{\left| D_c \right| + 1}{\left| D_c \right| + N_i}$$ (4)
As Eq. (4) shows, as the total number of samples grows, the influence of the prior introduced by the correction becomes smaller and smaller, and the estimate approaches the true probability.
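The effect of Laplace smoothing can be illustrated with a short sketch. It uses the common textbook form of the estimate, (count of class samples with the given feature value + 1) / (number of class samples + Ni). This is an assumption about the intended meaning, since the symbols as printed in Eq. (4) do not distinguish the two counts. The function name and the data are hypothetical.

```python
from collections import Counter


def smoothed_conditional(values_in_class, value, n_values):
    """Laplace-smoothed estimate of P(x_i = value | class).

    values_in_class: observed values of feature i among samples of one class.
    n_values: number of possible values of feature i (N_i).
    """
    count = Counter(values_in_class)[value]  # 0 if the value was never seen
    return (count + 1) / (len(values_in_class) + n_values)


# Hypothetical observations: "medium" never occurs in this class.
observed = ["high", "high", "low", "high"]
levels = ["high", "medium", "low"]
probs = {v: smoothed_conditional(observed, v, len(levels)) for v in levels}
print(probs)
```

The unseen value "medium" receives a small nonzero probability instead of 0, so it no longer zeroes out the product in Eq. (3), and the estimates over all possible values still sum to 1.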
-
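After the abnormal information has been partitioned by means of Bayes' theorem, data mining is performed on the partitioned sample data. The mining process consists of four main stages: feature data extraction, data preprocessing, partition classification, and model construction. First, a trend analysis of information types is carried out on the completed partitions, so that the data formats and types of the different kinds of abnormal information are classified. Next, the abnormal information is probabilized, superimposing its probability attributes on the probability coefficient. Finally, a Bayesian topology [15] is used to convert the probabilized [16] data distribution into data feature vectors, which form the boundary conditions of the data mining.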
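Let the data set be A, the mining characteristic parameter be B, the classification coefficient of the abnormal information be n, and the probability coefficient be l. The data mining rule then satisfies:
$$\int \bar{P} = \left\{ P(a_n) \sum\limits_{i \in n}^{B_i} l\left( A_i, B^n \right) \forall \right\}$$ (5)
To balance the trade-off between mining accuracy and mining speed, Bayesian partitioning is used to divide the initial massive volume of optical fiber network data into partitions. The emphasis of the mining then differs from partition to partition, and the probability coefficient takes different values for different types of abnormal information (the coefficient can be understood as a weight on each data point). Mining depth and speed can thereby be configured optimally, avoiding ineffective mining and preserving mining speed.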
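Let the data set in an arbitrary Bayesian partition be X, which can be expanded into an n×n matrix corresponding to the sample data set of Section 1. The probability relation for mining the partitioned data is then:
$$X = \sum\limits_{n \in N} \begin{pmatrix} x_{11} & x_{12} & \cdots & x_{1n} \\ x_{21} & x_{22} & \cdots & x_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{nn} \end{pmatrix}$$ (6)
Iterating over the data set in each Bayesian partition according to the above steps quickly yields the complete set of abnormal data.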
-
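To improve both the identification accuracy and the convergence speed for abnormal information in optical fiber networks, Bayesian partitioning is applied to partition the data before mining, so that the identification probability for each type of abnormal information can be adjusted according to the attributes of its partition. The flow of the mining algorithm is shown in Fig. 1, and the implementation steps are as follows: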
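(1) Classify the types and data formats of abnormal information in the optical fiber network, and set different prior probabilities P(K) according to how often each kind of abnormal information has occurred in the past;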
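(2) Set the sample data set within a partition, X = {x1, x2, x3, ..., xn}, and set m control variables k1, k2, k3, ..., km according to the characteristics of the abnormal information;
(3) Loop over the control variables, evaluating the probability of the data set under each one, and output the Bayesian classification probability P(Ki|X) when the maximization condition is met;
(4) Train on the sample data D, set the number of classes N to be processed and the corresponding values Ni, and use them to correct the original Bayesian classification probabilities. As the amount of data grows, the corrected estimate approaches the true probability, which improves partition accuracy; the partition of all data is then finalized;
(5) With the partitions fixed, apply the data mining rule with the Bayesian partitions as its boundary conditions, and classify the abnormal information in each partition by probability according to Eq. (5), mining the n classes in data set A;
(6) Using the probability relation of partitioned data mining as the convergence condition, iterate segment by segment over all partitions, and output the abnormal-information results once all data in the optical fiber network have been traversed.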
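The overall flow of steps (2) to (6) can be sketched roughly as below. This is only a schematic under strong assumptions, not the authors' algorithm: the helper names `bayes_partition` and `mine_partitions`, the scoring function, the threshold test and all values are hypothetical stand-ins for the partition classifier, the probability coefficient l and the mining rule of Eq. (5).

```python
from collections import defaultdict


def bayes_partition(samples, classify):
    """Group samples by the class label that `classify` assigns to each one."""
    partitions = defaultdict(list)
    for sample in samples:
        partitions[classify(sample)].append(sample)
    return dict(partitions)


def mine_partitions(partitions, coefficients, score, threshold):
    """Visit every partition and keep samples whose weighted score reaches the threshold."""
    anomalies = []
    for label, members in partitions.items():
        weight = coefficients[label]  # per-partition probability coefficient (assumed)
        for sample in members:
            if weight * score(sample) >= threshold:
                anomalies.append((label, sample))
    return anomalies


# Toy stand-ins: a threshold classifier and an identity score (hypothetical).
samples = [1, 4, 6, 10]
partitions = bayes_partition(samples, lambda x: "A" if x < 5 else "B")
anomalies = mine_partitions(
    partitions, {"A": 1.0, "B": 0.5}, score=lambda x: x, threshold=4
)
print(partitions, anomalies)
```

Because each partition carries its own coefficient, the same raw score can be flagged in one partition and ignored in another, which is the adjustment by partition attributes that the steps describe.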
Optical fiber network anomaly analysis algorithm based on Bayesian partition data mining
-
Abstract: The rapid and accurate identification of abnormal information in optical fiber network communication is the key to ensuring communication stability, and with the surge in optical fiber network communication data it has become a research hotspot in recent years. Considering the trade-off between the accuracy and the convergence speed of abnormal-information identification algorithms, an identification algorithm based on Bayesian partition data mining is proposed. Firstly, Bayes' theorem is used to classify the features of the data samples, and the prior probability is corrected through maximization analysis; secondly, the mining characteristic parameters and probability coefficients are set according to the different types of abnormal information; finally, targeted data mining is carried out on the sample data according to the Bayesian partitions. The experiment takes the communication state data of an optical fiber local area network as samples, compares the recognition results of this algorithm with those of an artificial neural network algorithm and a genetic algorithm, and calculates the recognition accuracy, convergence speed and stability of the three algorithms. The average recognition accuracy of this algorithm is 93.83%, with no significant decrease as the amount of data increases. Its convergence speed is similar to that of the genetic algorithm, with an average of 3.25 s. The mean missed detection rate and false detection rate are 0.10% and 0.54%, respectively. The results show that the recognition accuracy and convergence speed of the algorithm are improved, that its stability is good, and that the balance between the missed detection rate and the false detection rate can be fine-tuned through parameter control, giving the algorithm good application value.
-
[1] Ramezani M, Yaghmaee F. A review on human action analysis in videos for retrieval applications [J]. Artificial Intelligence Review, 2016, 46(4): 485-514. doi: 10.1007/s10462-016-9473-y
[2] Wang Hui, Zhang Cuiyu. Differences between network data mining algorithm based on improved genetic algorithm [J]. Computer Simulation, 2015, 32(5): 311-314. (in Chinese)
[3] Kuang Y, Guo Y, Xiong L, et al. Packaging and temperature compensation of fiber Bragg grating for strain sensing: A survey [J]. Photonic Sensors, 2018, 8(4): 320-331. doi: 10.1007/s13320-018-0504-y
[4] Jia Q. Location and monitoring of fiber optic line faults [J]. China New Telecommunications, 2017, 19(1): 74-74.
[5] Yeung S, Russakovsky O, Jin N, et al. Every moment counts: dense detailed labeling of actions in complex videos [J]. International Journal of Computer Vision, 2018, 126(24): 375-389.
[6] Chen Yang, Zhao Shanghong, Wang Xiang, et al. BER analysis of high-altitude OFDM-FSO modulation system under exponentiated weibull atmospheric turbulence model [J]. Laser & Infrared, 2018, 48(7): 832-837.
[7] Chen Y, Li L J. Very fast decision tree classification algorithm based on Red-Black tree for data stream with continuous attributes [J]. Journal of Nanjing University of Posts and Telecommunications (Natural Science Edition), 2017, 37(2): 86-90.
[8] Liu Y, Wang C R. An improved big data clustering method based on sampling fusion [J]. Microelectronics & Computer, 2017, 34(4): 17-21.
[9] Gu X Q, Jiang Y Z, Wang S T. Zero-order TSK-type fuzzy system for imbalanced data classification [J]. Acta Automatica Sinica, 2017, 43(10): 1773-1788.
[10] Lee J, Lee S, Hwang I. Hybrid system modeling and estimation for arrival time prediction in terminal airspace [J]. Journal of Guidance Control & Dynamics, 2016, 39(4): 903-910.
[11] Sun B C, Li J Z, Zhang W T. Fiber Bragg grating sensor [J]. Optical Fiber Sensing and Structural Health Monitoring Technology, 2019, 26(4): 77-148.
[12] Huang X, Wang Z, Li Y, et al. Design of fuzzy state feedback controller for robust stabilization of uncertain fractional-order chaotic systems [J]. Journal of the Franklin Institute, 2015, 351(12): 5480-5493.
[13] Shang F, Yi J, Xiong A, et al. A node localization algorithm based on multi-granularity regional division and the lagrange multiplier method in wireless sensor networks [J]. Sensors, 2016, 16(11): 1934. doi: 10.3390/s16111934
[14] Pan Q K, Sang H Y, Duan J H, et al. An improved fruit fly optimization algorithm for continuous function optimization problems [J]. Knowledge-Based Systems, 2014, 62(1): 69-83.
[15] Guan L, Hu G J, Wang Zh. Research on network security situational awareness technology based on big data [J]. Netinfo Security, 2016, 1(9): 45-50.
[16] Guo H, Liu H, Wu C, et al. Logistic discrimination based on G-mean and F-measure for imbalanced problem [J]. Journal of Intelligent and Fuzzy Systems, 2016, 31(3): 1155-1166. doi: 10.3233/IFS-162150