Citation: Song Qianli, Lai Hua. Monitoring method of public opinion in minor languages using deep learning[J]. Infrared and Laser Engineering, 2021, 50(S2): 20210298. DOI: 10.3788/IRLA20210298

Monitoring method of public opinion in minor languages using deep learning

Abstract: In public opinion monitoring for minor languages, annotated corpora are hard to obtain, so deep learning models train poorly. It is difficult to accurately extract the core opinion sentences from news published by the public and the media, which in turn weakens further public opinion analysis. To make the research problem concrete, Vietnamese is taken as an example, and a Chinese-Vietnamese cross-lingual method for extracting news opinion sentences that incorporates shared topic features is proposed. The method draws on abundant annotated Chinese corpora to address the scarcity of minor-language resources, and uses the topic information that can be shared between bilingual comparable corpora to improve extraction and, through it, public opinion monitoring. Specifically, Latent Dirichlet Allocation (LDA) topics are extracted from Chinese-Vietnamese comparable news to construct shared topic features; a bilingual word embedding model is trained with the help of a shared topic dictionary and a sentiment dictionary so that Chinese and Vietnamese share a semantic space representation; and the features are integrated into the word vectors, so that semantic information is combined with topic, sentiment, and position information to improve extraction. Experiments on a Chinese-Vietnamese comparable news dataset show that incorporating shared topic features improves the extraction of opinion sentences from minor-language news, with an F1 value of 0.721, which supports public opinion monitoring in minor languages.
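
The abstract does not state how the topic, sentiment, and position features are integrated into the word vectors. One simple reading is concatenation, and the sketch below illustrates only that reading. The function name `fuse_features`, the binary topic and sentiment flags, the relative-position feature, and the toy vocabulary are all assumptions made for illustration, not the authors' implementation.

```python
from typing import Dict, List, Set


def fuse_features(
    tokens: List[str],
    sentence_index: int,
    num_sentences: int,
    embeddings: Dict[str, List[float]],
    topic_words: Set[str],
    sentiment_words: Set[str],
    dim: int,
) -> List[List[float]]:
    """Append topic, sentiment, and position features to each word vector.

    Hypothetical helper: concatenation is an assumed fusion strategy.
    """
    # Relative position of the sentence within the article, in [0, 1].
    position = sentence_index / max(num_sentences - 1, 1)
    fused = []
    for tok in tokens:
        # Unknown words fall back to a zero vector of the embedding size.
        vec = embeddings.get(tok, [0.0] * dim)
        extra = [
            1.0 if tok in topic_words else 0.0,      # shared-topic flag
            1.0 if tok in sentiment_words else 0.0,  # sentiment-lexicon flag
            position,                                # sentence position
        ]
        fused.append(list(vec) + extra)
    return fused


# Toy, made-up bilingual embeddings and lexicons (assumptions, not real data).
embeddings = {"kinh_tế": [0.2, 0.7], "经济": [0.21, 0.69], "tốt": [0.9, 0.1]}
topic_words = {"kinh_tế", "经济"}
sentiment_words = {"tốt", "好"}

rows = fuse_features(
    ["kinh_tế", "tốt", "hơn"], 0, 5,
    embeddings, topic_words, sentiment_words, dim=2,
)
for row in rows:
    print(row)
```

Each output row is the original word vector followed by three extra dimensions, so a downstream sentence classifier would see semantic, topic, sentiment, and position signals together.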

     
