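Over the past few years the lab has published a number of strong papers. This page gives an overview of them, including a short summary of each paper, a link to the PDF, and, for some papers, a link to the open-source code on GitHub.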

COLING 2018 International Conference

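Summary: Answer selection is an important and challenging task. Domains with plenty of labeled training data have seen significant progress, but collecting richly annotated data is slow and expensive, which makes it hard to apply answer selection models to a new domain with limited labeled data. This paper proposes the Knowledge-aware Attentive Network (KAN), a transfer learning framework for cross-domain answer selection that uses a knowledge base as a bridge for transferring knowledge from the source domain to the target domains. A dedicated knowledge module integrates knowledge-based representation learning into the answer selection model. The learned knowledge-based representations are shared by the source and target domains, so the model both exploits large amounts of cross-domain data and benefits from a regularization effect that yields more general text representations for tasks in new domains. The model is evaluated with SQuAD-T as the source domain and three other datasets (Yahoo QA, TREC QA, and InsuranceQA) as target domains. The results show that KAN is broadly applicable and general, and clearly outperforms the best existing models on cross-domain answer selection.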

Abstract
Answer selection is an important but challenging task. Significant progress has been made in domains where a large amount of labeled training data is available. However, obtaining rich annotated data is a time-consuming and expensive process, creating a substantial barrier for applying answer selection models to a new domain that has limited labeled data. In this paper, we propose Knowledge-aware Attentive Network (KAN), a transfer learning framework for cross-domain answer selection, which uses the knowledge base as a bridge to enable knowledge transfer from the source domain to the target domains. Specifically, we design a knowledge module to integrate the knowledge-based representational learning into answer selection models. The learned knowledge-based representations are shared by source and target domains, which not only leverages large amounts of cross-domain data, but also benefits from a regularization effect that leads to more general representations to help tasks in new domains. To verify the effectiveness of our model, we use the SQuAD-T dataset as the source domain and three other datasets (i.e., Yahoo QA, TREC QA and InsuranceQA) as the target domains. The experimental results demonstrate that KAN has remarkable applicability and generality, and consistently outperforms the strong competitors by a noticeable margin for cross-domain answer selection.

 

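Summary: Distantly supervised relation extraction greatly reduces the human effort needed to extract relational facts from unstructured text. However, it suffers from noisy labels, which can seriously hurt extraction performance. At the same time, the useful information expressed in knowledge graphs is still underused by state-of-the-art distantly supervised relation extraction methods. To address these challenges, this paper proposes a new cooperative denoising framework. It consists of two base networks, one using the text corpus and one using the knowledge graph, plus a cooperative module that handles instances with varying noise through adaptive bi-directional knowledge distillation and dynamic ensembling. Experiments on a real-world dataset show that the method effectively reduces noisy labels and improves substantially on state-of-the-art methods.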

Abstract
Distantly supervised relation extraction greatly reduces human efforts in extracting relational facts from unstructured texts. However, it suffers from the noisy labeling problem, which can degrade its performance. Meanwhile, the useful information expressed in knowledge graphs is still underutilized in the state-of-the-art methods for distantly supervised relation extraction. In light of these challenges, we propose CORD, a novel COopeRative Denoising framework, which consists of two base networks leveraging a text corpus and a knowledge graph, respectively, and a cooperative module involving their mutual learning by the adaptive bi-directional knowledge distillation and dynamic ensemble with noisy-varying instances. Experimental results on a real-world dataset demonstrate that the proposed method reduces the noisy labels and achieves substantial improvement over the state-of-the-art methods.

 

 

ACM SIGIR 2018 International Conference

  • Ontology Evaluation with Path-based Text-aware Entropy Computation
  • Authors: Ying Shen*, Daoyuan Chen*, Min Yang, Yaliang Li, Nan Du, Kai Lei (* indicates equal contribution)
  • Paper link: https://dl.acm.org/citation.cfm?id=3210067

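Summary: As knowledge exchange grows in importance, ontologies have become a key technology for building shared knowledge models for semantics-driven applications such as knowledge interchange and semantic integration. Entropy has been used with considerable success to measure the predictability and redundancy of knowledge bases, ontologies in particular. However, current entropy-based ontology evaluation considers only single-point connectivity rather than path connectivity, assigns equal weights to every entity and path, and assumes that vertices are static. To address these shortcomings, this paper proposes PTEC, a Path-based Text-aware Entropy Computation method that takes into account the path information between vertices and the textual information within each path in order to compute the connectivity paths of the whole network and the differing weights between nodes. Information obtained from structure-based and text-based embeddings is multiplied by the connectivity matrix used in the entropy computation. Three real-world ontologies are evaluated in terms of ontology statistics (data quantity), entropy evaluation (data quality), and a case study (ontology structure and text visualization), and these three aspects corroborate the reliability of the method. The results show that PTEC evaluates ontologies effectively, especially in the medical domain.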

ABSTRACT
With the rising importance of knowledge exchange, ontologies have become a key technology in the development of shared knowledge models for semantic-driven applications, such as knowledge interchange and semantic integration. Significant progress has been made in the use of entropy to measure the predictability and redundancy of knowledge bases, particularly ontologies. However, the current entropy applications used to evaluate ontologies consider only single-point connectivity rather than path connectivity, assign equal weights to each entity and path, and assume that vertices are static. To address these deficiencies, the present study proposes a Path-based Text-aware Entropy Computation method, PTEC, by considering the path information between different vertices and the textual information within the path to calculate the connectivity path of the whole network and the different weights between various nodes. Information obtained from structure-based embedding and text-based embedding is multiplied by the connectivity matrix of the entropy computation. An experimental evaluation of three real-world ontologies is performed based on ontology statistical information (data quantity), entropy evaluation (data quality), and a case study (ontology structure and text visualization). These aspects mutually demonstrate the reliability of our method. Experimental results demonstrate that PTEC can effectively evaluate ontologies, particularly those in the medical field.

 

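Summary: Ranking question-answer pairs has attracted growing attention in recent years because of its wide range of applications, such as information retrieval and question answering (QA). Deep neural networks have made significant progress on this task. However, background information and hidden relations beyond the immediate context, which play a crucial role in human text comprehension, have received little attention in the deep neural networks that currently achieve the best results. This paper proposes KABLSTM, a knowledge-aware attentive bidirectional long short-term memory network that uses external knowledge from knowledge graphs (KG) to enrich the representation learning of QA sentences. Specifically, it introduces a context-knowledge interactive learning architecture in which a context-guided attentive convolutional neural network (CNN) integrates knowledge embeddings into the sentence representations. In addition, a knowledge-aware attention mechanism attends to the important information shared between the parts of a QA pair. KABLSTM is evaluated on two widely used benchmark QA datasets, WikiQA and TREC QA. The results show that KABLSTM is consistently stronger than competing models and achieves the best reported results.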

ABSTRACT
Ranking question answer pairs has attracted increasing attention recently due to its broad applications such as information retrieval and question answering (QA). Significant progress has been made by deep neural networks. However, background information and hidden relations beyond the context, which play crucial roles in human text comprehension, have received little attention in recent deep neural networks that achieve the state of the art in ranking QA pairs. In this paper, we propose KABLSTM, a Knowledge-aware Attentive Bidirectional Long Short-Term Memory, which leverages external knowledge from knowledge graphs (KG) to enrich the representational learning of QA sentences. Specifically, we develop a context-knowledge interactive learning architecture, in which a context-guided attentive convolutional neural network (CNN) is designed to integrate knowledge embeddings into sentence representations. Besides, a knowledge-aware attention mechanism is presented to attend to interrelations between the segments of QA pairs. KABLSTM is evaluated on two widely-used benchmark QA datasets: WikiQA and TREC QA. Experimental results demonstrate that KABLSTM has robust superiority over competitors and sets a new state of the art.

 

IEEE Global Communications Conference 2018

  • Distributed Information-agnostic Flow Scheduling in Data Centers based on Wait-Time
  • Authors: Kai Lei, Keke Li, Jie Xing, Bo Jin, Yi Wang
  • Paper link: not yet available

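Summary: Existing flow scheduling schemes for data center networks mainly aim to minimize the flow completion time (FCT) of short flows and do not optimize the FCT of delay-sensitive long flows, such as VR video streams or interactive AI question-answering flows. In addition, information-aware schemes (such as L2DCT and D2TCP) are hard to deploy in practice because they need flow information, such as flow size, in advance. The information-agnostic scheme PIAS does not need flow sizes ahead of time, but it relies on a centralized server, so it scales poorly when the network is large.

To address these limitations, this paper proposes DIAS, a distributed information-agnostic flow scheduling method that optimizes the FCT of both short flows and delay-sensitive long flows. In DIAS, packets are forwarded according to their priority, and a packet's priority is determined by how long it has waited in the sender's buffer: the longer the wait, the lower the priority. Unlike PIAS, DIAS does not use a centralized server to collect traffic load information. Instead, each switch attaches its load information to the ACK packets returned to the sender, and this information is used to adjust the thresholds that determine packet priority. Experiments in the ns-3 simulator show that DIAS reduces flow completion time by 54.7% and 50.1% compared with DCTCP and L2DCT, respectively. Compared with PIAS, DIAS also gives delay-sensitive long flows shorter completion times, and therefore performs better than PIAS.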
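The wait-time rule in the summary above can be illustrated with a small Python sketch. It is not the DIAS implementation: the threshold values, the number of priority levels, the function names, and above all the load-based adjustment rule are hypothetical choices made only to show the idea that a longer wait in the sender buffer maps to a lower priority.

```python
from bisect import bisect_right

# Hypothetical wait-time thresholds in microseconds (not taken from the paper).
DEFAULT_THRESHOLDS_US = [100, 1_000, 10_000]


def priority_for_wait(wait_us, thresholds=DEFAULT_THRESHOLDS_US):
    """Map a packet's wait time to a priority level.

    Level 0 is the highest priority. Each threshold the wait time has
    reached or passed pushes the packet one level lower.
    """
    return bisect_right(thresholds, wait_us)


def adjust_thresholds(thresholds, load, target_load=0.5, step=0.1):
    """Illustrative threshold update driven by load feedback from ACKs.

    Assumption: thresholds shrink when the reported load is above the
    target and grow otherwise. The real adjustment rule is not given here.
    """
    factor = 1 - step if load > target_load else 1 + step
    return [t * factor for t in thresholds]


if __name__ == "__main__":
    for wait in (50, 500, 5_000, 50_000):
        print(wait, "us -> priority", priority_for_wait(wait))
```

With these placeholder thresholds, a packet that has waited 50 microseconds stays at the highest level, while one that has waited 50,000 microseconds falls to the lowest of the four levels.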

 

The 2013 International Joint Conference on Neural Networks (IJCNN)

  • Massively parallel learning of Bayesian networks with MapReduce for factor relationship analysis
  • Authors: Chen Wei, Wang Tengjiao, Yang Dongqing, Lei Kai, Liu Yueqin
  • Paper link: https://ieeexplore.ieee.org/document/6706814

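Summary: The Bayesian network (BN) is one of the most popular models in data mining. Most BN structure learning algorithms are designed for centralized datasets, where all the data is gathered on a single computer node, and they are often too costly or impractical for learning BN structures from large-scale data. Through a simple interface with two functions, map and reduce, MapReduce makes it easier to parallelize many practical tasks, such as data processing for search engines and machine learning. This paper presents a parallel algorithm for learning BN structures from large-scale datasets on a MapReduce cluster. It discusses the benefits of using MapReduce for BN structure learning and demonstrates the performance of the approach on a real-world task from financial analysis: learning the relationships between financial factors.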

ABSTRACT

Bayesian Network (BN) is one of the most popular models in data mining technologies. Most of the algorithms of BN structure learning are developed for the centralized datasets, where all the data are gathered into a single computer node. They are often too costly or impractical for learning BN structures from large scale data. Through a simple interface with two functions, map and reduce, MapReduce facilitates parallel implementation of many real-world tasks such as data processing for search engines and machine learning. In this paper, we present a parallel algorithm for BN structure learning from large-scale datasets by using a MapReduce cluster. We discuss the benefits of using MapReduce for BN structure learning, and demonstrate the performance of this approach by applying it to a real-world financial factor relationships learning task from the domain of financial analysis.
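To make the map and reduce roles concrete, here is a minimal single-process Python sketch of one step that BN structure learning typically needs: counting how often each child value occurs with each parent configuration, then scoring a candidate parent set by log-likelihood. The function names, the toy records, and the choice of a plain log-likelihood score are assumptions for illustration; the paper's actual algorithm and scoring function are not described in the abstract.

```python
import math
from collections import defaultdict


def map_record(record, child, parents):
    """Map step: emit ((parent configuration, child value), 1) for one record."""
    key = (tuple(record[p] for p in parents), record[child])
    yield key, 1


def reduce_counts(pairs):
    """Reduce step: sum the counts emitted for each key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


def log_likelihood(counts):
    """Log-likelihood of the child given its parents, from aggregated counts."""
    parent_totals = defaultdict(int)
    for (parent_cfg, _child_value), n in counts.items():
        parent_totals[parent_cfg] += n
    return sum(
        n * math.log(n / parent_totals[parent_cfg])
        for (parent_cfg, _child_value), n in counts.items()
    )


# Toy records with made-up financial factors.
records = [
    {"rate": "high", "price": "down"},
    {"rate": "high", "price": "down"},
    {"rate": "low", "price": "up"},
    {"rate": "low", "price": "down"},
]

pairs = [kv for r in records for kv in map_record(r, child="price", parents=["rate"])]
counts = reduce_counts(pairs)
score = log_likelihood(counts)
```

On a real cluster the map calls would run in parallel over data partitions and the framework would group keys before the reduce step; the sketch only shows why the counting decomposes cleanly into those two functions.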

 

2014 IEEE 33rd International Performance Computing and Communications Conference (IPCCC)

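Summary: Sina Weibo has become an increasingly important social media platform in China for sharing the latest news, marketing new products, and discussing controversial issues. Its growing importance to society makes it valuable to understand the "what", "when", and "who" of the hot topics that millions of active users continuously post and search for. This paper develops a systematic approach to characterizing the temporal distribution of hot topics searched by Sina Weibo users over a four-month period, and to uncovering correlated hot topics that are not only posted by the same users but also appear in similar sets of tweets. It analyzes real-time Sina Weibo tweet streams and studies the volume correlations and time gaps between user searches and tweeting activity on hot topics. It also examines the correlation between hot topic searches on social media and on search engines, in order to understand hot topics and user behavior across platforms. Given the challenge of analyzing massive amounts of tweet data, the paper uses the Hadoop MapReduce framework to process millions of tweets from the collected datasets and quantifies the performance benefit of MapReduce for analyzing tweet streams. To the authors' knowledge, this is the first attempt to characterize the temporal search patterns of hot topics on Sina Weibo and to study their correlation with tweet streams and search engine statistics.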

Abstract

Sina Weibo has become an increasingly critical social media in China for sharing latest news, marketing new products, and discussing controversial issues. The rising importance of Sina Weibo on the society makes it very important to understand “what”, “when”, “who” on hot topics that are being continuously tweeted and searched by millions of active users. In this paper, we develop a systematic approach to characterize temporal distribution of hot topics searched by Sina Weibo users over a four-month time-span and to uncover correlated hot topics that are not only tweeted by the same users, but also appear in the similar set of tweet messages. We analyze real-time Sina Weibo tweet data streams and study volume correlations and temporal gaps between user searches and tweeting activities on hot topics. In addition, we examine the correlations between hot topic searches on social media and on search engines to understand hot topics and user behaviors across different platforms. Given the challenges of analyzing massive amount of tweet data, we explore Hadoop MapReduce framework to effectively process millions of tweets from the collected data-sets, and quantify the performance benefits of MapReduce on analyzing tweet streams. To the best of our knowledge, this paper is the first effort to characterize temporal search patterns of hot topics on Sina Weibo and to study their correlations with tweeting data streams as well as search engine statistics.

 

2015 IEEE International Conference on Communications (ICC)

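Summary: New social media such as Twitter and Sina Weibo have become increasingly popular channels for spreading influence, challenging traditional media such as television and newspapers. The most influential verified users on Sina Weibo, known as big-V accounts, often attract millions of followers and fans. They form massive "celebrity-centric" social networks that play a key role in spreading breaking news, the latest events, and controversial opinions on social issues. Given the importance of these accounts, it is important to understand their social networks and user influence and to profile their followers' behavior. To this end, the paper monitors a selected group of influential Sina Weibo users and collects their tweet streams together with their followers' retweets and comments on those tweets. The analysis reveals when followers comment on these users' tweets and what they say, and finds distinct temporal patterns and word diversity in the comments. Building on these insights into follower characteristics, the paper develops simple and intuitive algorithms that classify followers as spammers or normal fans. Experiments show that the proposed algorithms reach an average accuracy of 95.20% in detecting spammers among the followers who commented on the tweets of these influential accounts.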

Abstract

New social media such as Twitter and Sina Weibo have become increasingly popular channels for spreading influence, challenging traditional media such as TVs and newspapers. The most influential and verified users, also called big-V accounts on Sina Weibo, often attract millions of followers and fans, creating massive “celebrity-centric” social networks on the social media, which play a key role in disseminating breaking news, latest events, and controversial opinions on social issues. Given the importance of these accounts, it is crucial to understand social networks and user influence of these accounts and profile their followers’ behaviors. Towards this end, this paper monitors a selected group of influential users on Sina Weibo and collects their tweet streams as well as retweeting and commenting activities on these tweets from their followers. Our analysis on tweet data streams from Sina Weibo reveals when and what the followers comment on the tweets of these influential users, and discovers different temporal patterns and word diversity in the comments. Based on the insight gained from follower characteristics, we further develop simple and intuitive algorithms for classifying the followers into spammers and normal fans. Our experimental results demonstrate that the proposed algorithms are able to achieve an average accuracy of 95.20% in detecting spammers from the followers who have commented on the tweets of these influential accounts.

 

2016 IEEE International Conference on Communications (ICC)

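Summary: The popularity of Weibo has greatly enriched people's lives by letting online users share their feelings through comments. However, a growing number of spam comments are also being posted on users' blogs on this platform. To detect spam comments in Chinese microblogs effectively, this paper uses semantic analysis to build a Self-Extensible Spam Dictionary that expands itself automatically when new words appear frequently on the microblogs. Semantic analysis also provides additional features that help detect spam comments. The paper further proposes a Proportion-Weight Filter (PWF) model that detects two kinds of spam comments (advertisements and vulgar comments) by filtering on the spam weight and spam proportion of Weibo comments according to the Self-Extensible Spam Dictionary. Experiments show an average detection accuracy of 87.9% when advertisement and vulgar spam comments are detected together. For advertisement spam alone, the average accuracy reaches 96.2%, which is better than the machine learning methods used for comparison. Statistical analysis of the results confirms that the proposed methods identify spam comments effectively and with relatively high accuracy.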

Abstract

The high popularity of Weibo has greatly enriched people’s lives, allowing online users to share their feelings through posting comments. However, more and more spam comments are also being posted in users’ blogs on this social media. In this paper, in order to effectively detect spam comments in Chinese micro-blogs, we introduce semantic analysis to construct a Self-Extensible Spam Dictionary which automatically expands itself when new words emerge on the micro-blogs frequently. The use of semantic analysis can provide us with additional features which are beneficial to detecting spam comments. A Proportion-Weight Filter (PWF) model is also proposed to detect two kinds of spam comments (AD and vulgar comments), by filtering the spam-weight and the spam-proportion of the Weibo comments based on our Self-Extensible Spam Dictionary criteria. Our experimental results demonstrate that when detecting a combination of both AD and vulgar spam comments, we can achieve an average detection accuracy of 87.9%. Particularly for AD spam comments detection, we can achieve an average accuracy of 96.2%, which is preferable compared to when using machine learning methods. The statistical analysis of the results verifies that our proposed methods can identify the spam comments effectively and to relatively high degrees of accuracy.
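The two quantities the PWF model filters on can be pictured with the following Python sketch. The abstract does not define them, so the definitions here are assumptions: spam weight is taken as the sum of dictionary weights of the matched words, and spam proportion as the fraction of tokens found in the dictionary. The dictionary entries, the thresholds, and the rule that both thresholds must be met are likewise hypothetical, and real Chinese comments would need word segmentation before this step.

```python
def spam_scores(tokens, spam_dict):
    """Return (spam weight, spam proportion) for a tokenized comment.

    Assumed definitions: weight is the summed dictionary weight of matched
    tokens; proportion is matched tokens divided by all tokens.
    """
    hits = [t for t in tokens if t in spam_dict]
    weight = sum(spam_dict[t] for t in hits)
    proportion = len(hits) / len(tokens) if tokens else 0.0
    return weight, proportion


def is_spam(tokens, spam_dict, min_weight=1.0, min_proportion=0.3):
    """Flag a comment when both scores reach their (hypothetical) thresholds."""
    weight, proportion = spam_scores(tokens, spam_dict)
    return weight >= min_weight and proportion >= min_proportion


# Hypothetical dictionary mapping spam words to weights.
example_dict = {"discount": 0.8, "buy": 0.6}

ad_comment = ["buy", "now", "big", "discount"]
normal_comment = ["nice", "photo", "today"]
```

Using both a weight and a proportion means a long, ordinary comment that happens to contain one spam word is less likely to be flagged than a short comment made up mostly of such words.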

 

ACM-ICN ’14 Proceedings of the 1st ACM Conference on Information-Centric Networking

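Summary: This paper designs and implements an NDN-based scalable control panel for a media streaming system. The system builds on Hippo [1], an earlier IP-based P2P media streaming system that includes a group of control servers, such as the tracker, for managing P2P functionality. Scalability becomes one of the hardest problems when the number of users of a P2P system grows very large. The control layer of NDN-Hippo is designed on the same principle as SNC [2]. The implementation takes a two-step approach: first the control layer of Hippo is ported to an NDN-based system, and the media traffic layer is ported afterwards. By separating the control layer from the media layer, the demo shows that some of the tracker's management functions can be realized neatly and naturally in NDN, and that the scalability of the NDN version of Hippo is greatly improved.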

Abstract

An NDN-based scalable control panel for a media streaming system was designed and implemented in this paper. The system is developed based on a previous IP-based P2P media streaming system named Hippo [1], which contains a group of control servers to manipulate P2P functionalities, such as the tracker, etc. System scalability becomes one of the most difficult problems when the user population of the P2P system grows very large. We took advantage of the same principle as SNC [2] to design the NDN-Hippo’s control layer. As for implementation, we took a two-step approach: first porting the control layer of Hippo to an NDN-based system, then porting the media traffic layer. By separating control and media layers, our demo demonstrates that not only some management functions of tracker can be smartly and instinctively achieved in NDN, but also the scalability of NDN version of Hippo has been greatly improved.

 

2015 IEEE Global Communications Conference (GLOBECOM)

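Summary: The forwarding strategy is the key feature that lets Named Data Networking (NDN) achieve dynamic, adaptive, and intelligent forwarding, but work in this area is still at an early stage. This paper treats the choice of forwarding interface among multiple alternatives in NDN as a multiple attribute decision making (MADM) problem and proposes a maximizing deviation based probabilistic forwarding (MDPF) strategy that selects the forwarding interface probabilistically. Because several network metrics, such as interface status and the number of pending Interests, are considered together, the availability of each candidate interface is estimated more accurately, which leads to more efficient content delivery. MDPF is also easy to extend, since any suitable metric can be added to enhance or customize it. The proposal is implemented in ndnSIM and compared with the BestRoute and PI-based strategies under various topologies and scenarios. The results show that MDPF responds faster and is more sensitive to network changes, and achieves higher throughput, a lower drop rate, and better load balancing.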

Abstract

Forwarding strategy is the key feature of Named Data Networking (NDN) to realize dynamic, adaptive and intelligent forwarding, but work in this area is still at a very preliminary stage. In this paper, selecting which forwarding interface among multiple alternatives in NDN is defined as a multiple attribute decision making (MADM) problem and a maximizing deviation based probabilistic forwarding (MDPF) strategy is proposed to select forwarding interface on probability. Since multiple network metrics such as interface status, pending Interest numbers are considered together, each alternative interface’s availability is obtained more accurately. Thus, better content delivery efficiency can be achieved. In addition, MDPF provides good extensibility, as any appropriate metric can be added to enhance or customize it. We implement the proposal in ndnSIM and compare it with BestRoute and PI-based strategies under various topologies and scenarios. Experimental results show that MDPF strategy is more responsive and sensitive to network changes, and can realize higher throughput, lower drop rate as well as better load balance.
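The idea of weighting metrics by maximizing deviation and then choosing an interface by probability can be sketched in Python as follows. This is an illustration under assumptions, not the MDPF algorithm from the paper: metric values are assumed to be already normalized to [0, 1] with larger meaning better, the weights use the standard maximizing deviation formula (each metric's weight is proportional to the total pairwise deviation of the alternatives on that metric), and turning weighted scores into probabilities by simple normalization is a hypothetical choice. All function names are made up for the example.

```python
import random


def max_deviation_weights(matrix):
    """Weight each metric by how much the interfaces differ on it.

    matrix[i][j] is the normalized value of metric j for interface i.
    """
    n_metrics = len(matrix[0])
    deviations = [
        sum(abs(a[j] - b[j]) for a in matrix for b in matrix)
        for j in range(n_metrics)
    ]
    total = sum(deviations)
    if total == 0:
        # All interfaces look identical: fall back to equal weights.
        return [1.0 / n_metrics] * n_metrics
    return [d / total for d in deviations]


def forwarding_probabilities(matrix):
    """Turn weighted scores into selection probabilities (assumed rule)."""
    weights = max_deviation_weights(matrix)
    scores = [sum(w * x for w, x in zip(weights, row)) for row in matrix]
    total = sum(scores)
    if total == 0:
        return [1.0 / len(matrix)] * len(matrix)
    return [s / total for s in scores]


def pick_interface(matrix, rng=random):
    """Draw one interface index according to the probabilities."""
    probs = forwarding_probabilities(matrix)
    return rng.choices(range(len(matrix)), weights=probs, k=1)[0]


# Three interfaces, two hypothetical metrics:
# column 0 = interface status, column 1 = spare capacity for pending Interests.
example = [
    [1.0, 0.5],
    [1.0, 0.0],
    [0.0, 0.5],
]
```

For the example matrix the metric weights come out as 2/3 and 1/3, and the selection probabilities are roughly 0.5, 0.4, and 0.1. A metric on which all interfaces agree gets zero weight, because it cannot help tell them apart.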