In real-world graphs such as social networks, Semantic Web and biological networks, each vertex usually contains rich information, which can be modeled by a set of tokens or elements. In this paper, we study a subgraph matching with set similarity (SMS2) query over a large graph database, which retrieves subgraphs that are structurally isomorphic to the query graph, and meanwhile satisfy the condition of vertex pair matching with the (dynamic) weighted set similarity. To efficiently process the SMS2 query, this paper designs a novel lattice-based index for data graph, and lightweight signatures for both query vertices and data vertices. Based on the index and signatures, we propose an efficient two-phase pruning strategy including set similarity pruning and structure-based pruning, which exploits the unique features of both (dynamic) weighted set similarity and graph topology. We also propose an efficient dominating-set-based subgraph matching algorithm guided by a dominating set selection algorithm to achieve better query performance. Extensive experiments on both real and synthetic datasets demonstrate that our method outperforms state-of-the-art methods by an order of magnitude.

Exact subgraph matching query requires that all the vertices's and edges are matched exactly. The Ullman's subgraph isomorphism method algorithm do not utilize any index structure, thus they are usually costly for large graphs. Tree Pi indexes graph databases using frequent subtrees as indexing structures. GADDI is a structure distance based subgraph matching algorithm in a large graph. Chao et AL. investigated the S Path algorithm, which utilizes shortest paths around the vertex as basic index units. Cheng et AL proposed a new two-step R- join algorithm to efficiently find matching graph patterns from a large graph. Zou et al. proposed a distancebased multi-way join algorithm for answering pattern match queries over a large graph. Shang et al.proposed QuickSI algorithm for subgraph isomorphism optimized by choosing an search order based on some features of graphs. SING is a novel indexing system for subgraph isomorphism in a large scale graph. GraphQL is a query language for graph databases which supports graphs as the basic unit of information. Sun et al. utilized graph exploration and parallel computing to process subgraph matching query on a billion node graph. Recently, an efficient and robust subgraph isomorphism algorithm TurboISO was proposed. RINQ are graph alignment algorithms for biological networks, which can be used to solve isomorphism problems. However, a query graph is much smaller than the data graph in subgraph isomorphism problems, while the two graphs usually have similar size in graph alignment problems. To solve subgraph isomorphism problems, graph alignment algorithms introduce additional cost as they should first find candidate subgraphs of similar size from the large data graph. In addition, existing exact subgraph matching and graph alignment algorithms do not consider weighted set similarity on vertices, which will cause high postprocessing cost of set similarity computation.

Due to the inefficiency of existing methods, we propose an efficient SMS2 query processing approach in this paper. Our approach adopts a “filter-and-refine” framework, which exploits unique features of both graph topology and dynamic weighted set similarity. In the filtering phase, we build a lattice-based index over frequent patterns of element sets of vertices in data graph G. Then, data vertices are encoded into signatures, and organized into signature buckets. We design an efficient two-phase pruning strategy based on the lattice-based index and signature buckets to reduce the SMS2 search space. In the refinement phase, we propose a dominating set 2(DS)-based subgraph matching algorithm to find subgraph matches with set similarity (defined in Definition 1). A dominating set selection method is proposed to select a cost-efficient dominating set of the query graph. In summary, we make the following contributions: 1) We design a novel strategy to efficiently process SMS2 queries. An inverted pattern lattice based indexing and a structural signature-based locality sensitive hashing are first constructed offline. During the online phase, a set of pruning techniques facilitated by the offline data structures are introduced and integrated together to greatly reduce thesearch space of SMS2 queries. 2) We propose set similarity pruning techniques that utilize a novel inverted pattern lattice over the element sets of data vertices to evaluate dynamic weighted set similarity. It introduces an upper bound on the dynamic weighted similarity measure to apply the anti-monotone principle to achieve high pruning power 3) We propose structure-based pruning techniques that explore a novel structural-signature-based data structure, where the signature is designed to capture the set and neighborhood information. An aggregate dominance principle is devised to guide the pruning 4) Instead of directly querying and verifying the candidates of all the vertices in the query graph, we design an efficient algorithm to perform subgraph matching based on the dominating set of query graph. When filling up the remaining (non-dominating) vertices of the graph, a distance preservation principle is devised to prune candidate vertices that do not preserve the distance to dominating vertices. 5) Last but not least, we demonstrate through extensive experiments that our approach can effectively and efficiently answer the SMS2 queries in a large graph database.

Comment is Only Available for registered users!
Create Account or
Login Now!