## Publications

20 results found

Yu W, Lin X, Zhang W, et al., 2017, Dynamical SimRank search on time-varying networks, Departmental Technical Report: 17/9, Publisher: Department of Computing, Imperial College London, 17/9

SimRank is an appealing pair-wise similarity measure based on graph structure. It iteratively follows the intuition that two nodes are assessed as similar if they are pointed to by similar nodes. Many real graphs are large, and links are constantly subject to minor changes. In this article, we study the efficient dynamical computation of all-pairs SimRanks on time-varying graphs. Existing methods for the dynamical SimRank computation (e.g., L-TSF and READS) mainly focus on top-k search with respect to a given query. For all-pairs dynamical SimRank search, Li et al.'s approach was proposed for this problem. It first factorizes the graph via a singular value decomposition (SVD), and then incrementally maintains such a factorization in response to link updates at the expense of exactness. As a result, all pairs of SimRanks are updated approximately, yielding O(r^4*n^2) time and O(r^2*n^2) memory in a graph with n nodes, where r is the target rank of the low-rank SVD. Our solution to the dynamical computation of SimRank comprises of five ingredients: (1) We first consider edge update that does not accompany new node insertions. We show that the SimRank update ∆S in response to every link update is expressible as a rank-one Sylvester matrix equation. This provides an incremental method requiring O(Kn^2) time and O(n^2) memory in the worst case to update n2pairs of similarities for K iterations. (2) To speed up the computation further, we propose a lossless pruning strategy that captures the "affected areas" of ∆S to eliminate unnecessary retrieval. This reduces the time of the incremental SimRank to O(K(m+|AFF|)), where m is the number of edges in the old graph, and |AFF| (<=n^2) is the size of "affected areas" in ∆S, and in practice, |AFF|<.

Yu W, Mccann, 2015, Effectively Positioning Water Loss Event in Smart Water Networks, 2nd International Electronic Conference on Sensors and Applications, Publisher: MDPI, ISSN: 1424-8220

Yu W, McCann J, 2015, Co-Simmate: Quick Retrieving All Pairwise Co-Simrank Relevance, The 53rd Annual Meeting of the Association for Computational Linguistics

Co-Simrank is a useful Simrank-like measureof similarity based on graph structure.The existing method iteratively computeseach pair of Co-Simrank score from a dotproduct of two Pagerank vectors, entailingO(log(1/ǫ)n3) time to compute all pairsof Co-Simranks in a graph with n nodes,to attain a desired accuracy ǫ. In this study,we devise a model, Co-Simmate, to speedup the retrieval of all pairs of Co-Simranksto O(log2(log(1/ǫ))n3) time. Moreover,we show the optimality of Co-Simmateamong other hop-(uk) variations, and integrateit with a matrix decomposition basedmethod on singular graphs to attain higherefficiency. The viable experiments verifythe superiority of Co-Simmate to others.

Yu W, McCann J, 2015, High Quality Graph-Based Similarity Retrieval on Large Graphs, The 38th ACM SIGIR International Conference, Publisher: ACM

Yu W, Lin X, Zhang W,
et al., 2015, Fast All-Pairs SimRank Assessment on Large Graphs and Bipartite Domains, *IEEE Transactions on Knowledge and Data Engineering*, Vol: 27, Pages: 1810-1823, ISSN: 1041-4347

SimRank is a powerful model for assessing vertex-pair similarities in a graph. It follows the concept that two vertices are similar if they are referenced by similar vertices. The prior work [18] exploits partial sums memoization to compute SimRankin O(Kmn) time on a graph of n vertices and m edges, for K iterations. However, the computations among different partial sums may have duplicate redundancy. Besides, to guarantee a given accuracy ϵ, the existing SimRank needs K = ⌈logC ϵ⌉iterations, where C is a damping factor, but the geometric rate of convergence is slow if a high accuracy is expected. In this paper, (1) a novel clustering strategy is proposed to eliminate duplicate computations occurring in partial sums, and an efﬁcient algorithm is then devised to accelerate SimRank computation to O(Kd′n2) time, where d′ is typically much smaller than m n . (2) A new differential SimRank equation is proposed, which can represent the SimRank matrix as an exponential sum of transition matrices, as opposed to the geometric sum of the conventional counterpart. This leads to a further speedup in the convergence rate of SimRank iterations. (3) In bipartite domains, a novel ﬁner-grained partial max clustering method is developed to speed up the computation of the Minimax SimRank variation from O(Kmn) to O(Km′n) time, where m′ (≤ m) is the number of edges ina reduced graph after edge clustering, which can be typically much smaller than m. Using real and synthetic data, we empirically verify that (1) our approach of partial sums sharing outperforms the best known algorithm by up to one order of magnitude; (2) the revised notion of SimRank further achieves a 5X speedup on large graphs while also fairly preserving the relative order of original SimRank scores; (3) our ﬁner-grained partial max memoization for the Minimax SimRank variation in bipartite domains is 0.5–1.2 orders of magnitude faster than the baselines.

Yu W, McCann JA, 2015, High quality SimRank-based similarity search, Departmental Technical Report: 15/3, Publisher: Department of Computing, Imperial College London, 15/3

SimRank is an influential link-based similarity measure thathas been used in many fields of Web search and sociometry.The best-of-breed method by Kusumoto et al. [7], however,does not always deliver high-quality results, since it fails toaccurately obtain its diagonal correction matrix D. Besides,SimRank is also limited by an unwanted“connectivity trait”:increasing the number of paths between nodes a and b oftenincurs a decrease in score s(a, b). The best-known solution,SimRank++ [1], cannot resolve this problem, since a revisedscore will be zero if a and b have no common in-neighbors.In this paper, we consider high-quality similarity search.Our scheme, SR#, is efficient and semantically meaningful:(1) We first formulate the exact D, and devise a “varied-D”method to accurately compute SimRank in linear memory.Moreover, by grouping computation, we also reduce the timeof [7] from quadratic to linear in the number of iterations.(2) We design a “kernel-based”model to improve the qualityof SimRank, and circumvent the “connectivity trait” issue.(3) We give mathematical insights to the semantic differencebetween SimRank and its variant, and correct an argumentin [7]: “if D is replaced by a scaled identity matrix (1−γ)I,top-K rankings will not be affected much”. The experimentsconfirm that SR# can accurately extract high-quality scores,and is much faster than the state-of-the-art competitors.

Yu W, McCann J, 2015, Efficient partial-pairs SimRank search on large graphs, *Proceedings of the VLDB Endowment International Conference on Very Large Data Bases*, Vol: 8, Pages: 569-580, ISSN: 2150-8097

The assessment of node-to-node similarities based on graph topology arises in a myriad of applications, e.g., web search. SimRank is a notable measure of this type, with the intuition that "two nodes are similar if their in-neighbors are similar". While most existing work retrieving SimRank only considers all-pairs SimRank s(*, *) and single-source SimRank s(*, j) (scores between every node and query j), there are appealing applications for partial-pairs SimRank, e.g., similarity join. Given two node subsets A and B in a graph, partial-pairs SimRank assessment aims to retrieve only {s(a, b)}∀aεA,∀bεB. However, the best-known solution appears not self-contained since it hinges on the premise that the SimRank scores with node-pairs in an h-go cover set must be given beforehand.This paper focuses on efficient assessment of partial-pairs SimRank in a self-contained manner. (1) We devise a novel "seed germination" model that computes partial-pairs SimRank in O(k|E| min{|A|, |B|}) time and O(|E| + k|V|) memory for k iterations on a graph of |V| nodes and |E| edges. (2) We further eliminate unnecessary edge access to improve the time of partial-pairs SimRank to O(m min{|A|, |B|}), where m ≤ min{k|E|, Δ2k}, and Δ is the maximum degree. (3) We show that our partial-pairs SimRank model also can handle the computations of all-pairs and single-source SimRanks. (4) We empirically verify that our algorithms are (a) 38x faster than the best-known competitors, and (b) memory-efficient, allowing scores to be assessed accurately on graphs with tens of millions of links.

Yu W, McCann J, 2014, Sig-SR: SimRank Search over Singular Graphs, The 37th Annual ACM SIGIR Conference, Publisher: Association for Computing Machinery, Pages: 859-862

SimRank is an attractive structural-context measure of similaritybetween two objects in a graph. It recursively followsthe intuition that “two objects are similar if they are referencedby similar objects”. The best known matrix-basedmethod [1] for calculating SimRank, however, implies anassumption that the graph is non-singular, i.e., its adjacencymatrix is invertible. In reality, non-singular graphs are veryrare; such an assumption in [1] is too restrictive in practice.In this paper, we provide a treatment of [1], by supportingsimilarity assessment on non-invertible adjacency matrices.Assume that a singular graph G has n nodes, with r (< n)being the rank of its adjacency matrix. (1) We show thatSimRank matrix S on G has an elegant structure: S can berepresented as a rank r matrix plus a scaled identity matrix.(2) By virtue of this, an efficient algorithm over singulargraphs, Sig-SR, is proposed for calculating all-pairs SimRankin O(r(n2 + Kr2)) time for K iterations. In contrast, theonly known matrix-based algorithm that supports singulargraphs [2] needs O(r4n2) time. The experimental results onreal and synthetic datasets demonstrate the superiority ofSig-SR on singular graphs against its baselines.

Lv B, Yu W, Wang L, et al., 2014, Efficient Processing Node Proximity via Random Walk with Restart, The16th Asia-Pacific Web Conference, Publisher: Springer, Pages: 542-549, ISSN: 0302-9743

Yu W, Lin X, Zhang W, 2014, Fast incremental SimRank on link-evolving graphs, 2014 IEEE 30th International Conference on Data Engineering (ICDE) (2014), Publisher: IEEE, Pages: 304-315

SimRank is an arresting measure of node-pair similarity based on hyperlinks. It iteratively follows the concept that 2 nodes are similar if they are referenced by similar nodes. Real graphs are often large, and links constantly evolve with small changes over time. This paper considers fast incremental computations of SimRank on link-evolving graphs. The prior approach [12] to this issue factorizes the graph via a singular value decomposition (SVD) first, and then incrementally maintains this factorization for link updates at the expense of exactness. Consequently, all node-pair similarities are estimated in O(r4n2) time on a graph of n nodes, where r is the target rank of the low-rank approximation, which is not negligibly small in practice. In this paper, we propose a novel fast incremental paradigm. (1) We characterize the SimRank update matrix ΔS, in response to every link update, via a rank-one Sylvester matrix equation. By virtue of this, we devise a fast incremental algorithm computing similarities of n2 node-pairs in O(Kn2) time for K iterations. (2) We also propose an effective pruning technique capturing the “affected areas” of ΔS to skip unnecessary computations, without loss of exactness. This can further accelerate the incremental SimRank computation to O(K(nd+|AFF|)) time, where d is the average in-degree of the old graph, and |AFF| (≤ n2) is the size of “affected areas” in ΔS, and in practice, |AFF| ≪ n2. Our empirical evaluations verify that our algorithm (a) outperforms the best known link-update algorithm [12], and (b) runs much faster than its batch counterpart when link updates are small.

Yu W, McCann J, 2014, Efficient PartialPairs SimRank search on large graphs, Departmental Technical Report: 14/11, Publisher: Department of Computing, Imperial College London, 14/11

The assessment of node-to-node similarities based on graphtopology arises in a myriad of applications, e.g., web search.SimRank is a notable measure of this type, with the intuitionthat “two nodes are similar if their in-neighbors are similar”.While most existing work retrieving SimRank only considersall-pairs SimRank s(⋆, ⋆) and single-source SimRank s(⋆, j)(scores between every node and query j), there are appealingapplications for partial-pairs SimRank, e.g., similarity join.Given two node subsets A and B in a graph, partial-pairsSimRank assessment aims to retrieve only {s(a, b)}∀a∈A,∀b∈B.However, the best-known solution [17] is not self-containedsince it hinges on the premise that the SimRank scores withnode-pairs in an h-go cover set must be given beforehand.This paper focuses on efficient assessment of partial-pairsSimRank in a self-contained manner. (1) We devise a novel“seed germination” model that computes partial-pairs Sim-Rank in O(k|E|min{|A|, |B|}) time and O(|E|+k|V |) memoryfor k iterations on a graph of |V | nodes and |E| edges.(2) We further eliminate unnecessary edge access to improvethe time of partial-pairs SimRank to O(mmin{|A|, |B|}),where m ≤ min{k|E|, 2k}, and is the maximum degree.(3) We show that our partial-pairs SimRank model also canhandle the computations of all-pairs and single-source Sim-Ranks, as well as partial-pairs SimRank* (a related notionof SimRank). (4) We empirically verify that our algorithmsare (a) 38x faster than the best-known competitors, and (b)memory-efficient, allowing scores to be assessed accuratelyon graphs with tens of millions of links.

Yu W, Lin X, Zhang W,
et al., 2013, More is Simpler: Effectively and Efficiently Assessing Node-Pair Similarities Based on Hyperlinks, *Proceedings of the VLDB Endowment International Conference on Very Large Data Bases*, Vol: 7, Pages: 13-24, ISSN: 2150-8097

Similarity assessment is one of the core tasks in hyperlinkanalysis. Recently, with the proliferation of applications,e.g., web search and collaborative filtering, SimRank hasbeen a well-studied measure of similarity between two nodesin a graph. It recursively follows the philosophy that “twonodes are similar if they are referenced (have incoming edges)from similar nodes”, which can be viewed as an aggregationof similarities based on incoming paths. Despite its popularity,SimRank has an undesirable property, i.e., “zerosimilarity”:It only accommodates paths with equal lengthfrom a common “center” node. Thus, a large portion ofother paths are fully ignored. This paper attempts to remedythis issue. (1) We propose and rigorously justify SimRank*,a revised version of SimRank, which resolves suchcounter-intuitive “zero-similarity” issues while inheriting meritsof the basic SimRank philosophy. (2) We show thatthe series form of SimRank* can be reduced to a fairlysuccinct and elegant closed form, which looks even simplerthan SimRank, yet enriches semantics without sufferingfrom increased computational cost. This leads to a fixedpointiterative paradigm of SimRank* in O(Knm) time ona graph of n nodes and m edges for K iterations, which iscomparable to SimRank. (3) To further optimize SimRank*computation, we leverage a novel clustering strategy viaedge concentration. Due to its NP-hardness, we devise an ef-ficient and effective heuristic to speed up SimRank* computationto O(Knm˜ ) time, where ˜m is generally much smallerthan m. (4) Using real and synthetic data, we empiricallyverify the rich semantics of SimRank*, and demonstrate itshigh computation efficiency.

Yu W, Lin X, 2013, IRWR:Incremental Random Walk with Restart, The 36th international ACM SIGIR conference on Research and development in information retrieval, ACM SIGIR 2013, Publisher: ACM, Pages: 1017-1020

Yu W, Lin X, Zhang W, 2013, Towards efficient SimRank computation on large networks, IEEE 29th International Conference on Data Engineering (ICDE), Publisher: IEEE, Pages: 601-612, ISSN: 1063-6382

SimRank has been a powerful model for assessing the similarity of pairs of vertices in a graph. It is based on the concept that two vertices are similar if they are referenced by similar vertices. Due to its self-referentiality, fast SimRank computation on large graphs poses significant challenges. The state-of-the-art work [17] exploits partial sums memorization for computing SimRank in O(Kmn) time on a graph with n vertices and m edges, where K is the number of iterations. Partial sums memorizing can reduce repeated calculations by caching part of similarity summations for later reuse. However, we observe that computations among different partial sums may have duplicate redundancy. Besides, for a desired accuracy ϵ, the existing SimRank model requires K = [logC ϵ] iterations [17], where C is a damping factor. Nevertheless, such a geometric rate of convergence is slow in practice if a high accuracy is desirable. In this paper, we address these gaps. (1) We propose an adaptive clustering strategy to eliminate partial sums redundancy (i.e., duplicate computations occurring in partial sums), and devise an efficient algorithm for speeding up the computation of SimRank to 0(Kd'n2) time, where d' is typically much smaller than the average in-degree of a graph. (2) We also present a new notion of SimRank that is based on a differential equation and can be represented as an exponential sum of transition matrices, as opposed to the geometric sum of the conventional counterpart. This leads to a further speedup in the convergence rate of SimRank iterations. (3) Using real and synthetic data, we empirically verify that our approach of partial sums sharing outperforms the best known algorithm by up to one order of magnitude, and that our revised notion of SimRank further achieves a 5X speedup on large graphs while also fairly preserving the relative order of original SimRank scores.

Yu W, Lin X, Zhang W, et al., 2012, SimFusion+: Extending SimFusion towards efficient estimation on large and dynamic networks, SIGIR’12, Publisher: ACM, Pages: 365-374

Yu W, Le J, Lin X, et al., 2012, On the Efficiency of Estimating Penetrating Rank on Large Graphs, 24th International Conference Scientific and Statistical Database Management, Publisher: Springer, Pages: 231-249, ISSN: 0302-9743

P-Rank (Penetrating Rank) has been suggested as a useful measure of structural similarity that takes account of both incoming and outgoing edges in ubiquitous networks. Existing work often utilizes memoization to compute P-Rank similarity in an iterative fashion, which requires cubic time in the worst case. Besides, previous methods mainly focus on the deterministic computation of P-Rank, but lack the probabilistic framework that scales well for large graphs. In this paper, we propose two efficient algorithms for computing P-Rank on large graphs. The first observation is that a large body of objects in a real graph usually share similar neighborhood structures. By merging such objects with an explicit low-rank factorization, we devise a deterministic algorithm to compute P-Rank in quadratic time. The second observation is that by converting the iterative form of P-Rank into a matrix power series form, we can leverage the random sampling approach to probabilistically compute P-Rank in linear time with provable accuracy guarantees. The empirical results on both real and synthetic datasets show that our approaches achieve high time efficiency with controlled error and outperform the baseline algorithms by at least one order of magnitude.

Li X, Yu W, Yang B, et al., 2011, ASAP : Towards Accurate, Stable and Accelerative Penetrating-Rank Estimation on Large Graphs, 12th International Conference, WAIM 2011, Publisher: Springer, Pages: 415-429, ISSN: 0302-9743

Pervasive web applications increasingly require a measure of similarity among objects. Penetrating-Rank (P-Rank) has been one of the promising link-based similarity metrics as it provides a comprehensive way of jointly encoding both incoming and outgoing links into computation for emerging applications. In this paper, we investigate P-Rank efficiency problem that encompasses its accuracy, stability and computational time. (1) We provide an accuracy estimate for iteratively computing P-Rank. A symmetric problem is to find the iteration number K needed for achieving a given accuracy ε. (2) We also analyze the stability of P-Rank, by showing that small choices of the damping factors would make P-Rank more stable and well-conditioned. (3) For undirected graphs, we also explicitly characterize the P-Rank solution in terms of matrices. This results in a novel non-iterative algorithm, termed ASAP , for efficiently computing P-Rank, which improves the CPU time from O(n 4) to O( n 3 ). Using real and synthetic data, we empirically verify the effectiveness and efficiency of our approaches.

Yu W, Zhang W, Lin X,
et al., 2010, A space and time efficient algorithm for SimRank computation, *World Wide Web*, Vol: 15, Pages: 327-353, ISSN: 1386-145X

Yu W, Lin X, Le J, 2010, Taming Computational Complexity: Efficient and Parallel SimRank Optimizations on Undirected Graphs, The 11th International Conference on Web-Age Information Management, Publisher: Springer, Pages: 280-296, ISSN: 0302-9743

SimRank has been considered as one of the promising link-based ranking algorithms to evaluate similarities of web documents in many modern search engines. In this paper, we investigate the optimization problem of SimRank similarity computation on undirected web graphs. We ﬁrst present a novel algorithm to estimate the SimRank between vertices in O(n3+ Kn2) time, where n is the number of vertices, and K is the number of iterations. In comparison, the most efﬁcient implementation of SimRank algorithm in [1] takes O(K n3 ) time in the worst case. To efﬁciently handle large-scale computations, we also propose a parallel implementation of the SimRank algorithm onmultiple processors. The experimental evaluations on both synthetic and real-life data sets demonstrate the better computational time and parallel efﬁciency of our proposed techniques.

Yu W, Lin X, Le J, 2010, A Space and Time Efficient Algorithm for SimRank Computation, 2010 12th International Asia-Pacific Web Conference (APWEB), Publisher: IEEE, Pages: 164-170

SimRank has been proposed to rank web documents based on a graph model on hyperlinks. The existing techniques for conducting SimRank computation adopt an iteration computation paradigm. The most efficient technique has the time complexity O(n3) with the space requirement O(n2) in the worst case for each iteration where n is the number of nodes (web documents). In this paper, we propose novel optimization techniques such that each iteration takes the time O(min{n · m,nr}) and requires space O (n + m) where m is the number of edges in a web-graph model and r ≤ log2 7. We also show that our algorithm accelerates the convergence rate of the existing techniques. Moreover, our algorithm not only reduces the time and space complexity of the existing techniques but is also I/O efficient. We conduct extensive experiments on both synthetic and real data sets to demonstrate the efficiency and effectiveness of our iteration techniques.

This data is extracted from the Web of Science and reproduced under a licence from Thomson Reuters. You may not copy or re-distribute this data in whole or in part without the written consent of the Science business of Thomson Reuters.