AutoG: a visual query autocompletion framework for graph databases

Composing queries is evidently a tedious task. This is particularly true of graph queries, as they are typically complex and prone to errors, compounded by the fact that graph schemas can be missing or too loose to be helpful for query formulation. Despite the great success of query formulation aids, in particular automatic query completion, graph query autocompletion has received much less research attention. In this paper, we propose a novel framework for subgraph query autocompletion (called AutoG). Given an initial query q and a user's preference as input, AutoG returns ranked query suggestions Q' as output. Users may choose a query from Q' and iteratively apply AutoG to compose their queries. The novelties of AutoG are as follows: First, we formalize query composition. Second, we propose to increment a query with logical units called c-prime features, which are (i) frequent subgraphs and (ii) constructed from smaller c-prime features in no more than c ways. Third, we propose algorithms to rank candidate suggestions. Fourth, we propose a novel index called feature Dag (FDag) to optimize the ranking. We study the query suggestion quality with simulations and real users and conduct an extensive performance evaluation. The results show that the query suggestions are useful (saving roughly 40% of users' mouse clicks), and AutoG returns suggestions quickly under a large variety of parameter settings.


Introduction
The prevalence of graph-structured data in modern real-world applications such as biological and chemical databases (e.g., PubChem) and co-purchase networks (e.g., Amazon.com) has led to a rejuvenation of research on graph data management and analytics. Several novel graph data management platforms have emerged from academia, industrial research laboratories and start-up companies. Several database query languages have been proposed for textually querying graph databases (e.g., SPARQL and Cypher). Unfortunately, formulating a graph query using any of these query languages often demands considerable cognitive effort and requires "programming" skill at least similar to programming in SQL. Yet, in a wide spectrum of graph applications, consumers need to query graph data but are not proficient query writers. For example, chemists are not often expected to learn the complex syntax of a graph query language in order to formulate meaningful queries over a chemical compound database such as PubChem or eMolecule. Hence, it is important to devise intuitive techniques that can alleviate the burden of query formulation and thus increase the usability of graph databases.
A popular approach to make query formulation user-friendly is to provide a visual query interface (gui) for interactively constructing queries. In recent times, there has been increasing effort to create such user-friendly guis from academia [18] and industry (e.g., PubChem and eMolecule) to ease the burden of query formulation. Given a partially constructed visual subgraph query, it is always desirable to suggest top-k possible query fragments that the user may potentially add to his/her intermediate query in the subsequent steps. Such suggestions can enhance user experience on graph databases and facilitate exploratory search [17], where a non-expert user may learn, discover and investigate information from a graph data source through a sequence of queries and answers.
Example 1 Consider the visual subgraph query interface in Fig. 1 for querying PubChem. Suppose Mike wishes to search for compounds containing the chlorobenzene substructure. The partial subgraph query constructed by him is depicted in the Visual Graph Editor panel. It will be indeed helpful to Mike if the query system can suggest top-k possible query fragments (subgraphs) that he may add to his query in the next step. An example of such top-4 suggestions is shown at the bottom panel. Observe that each suggestion is composed by adding small increments to the query graph in the middle panel (indicated in gray). Mike may select the fourth suggestion by clicking on it, thus saving the mouse clicks needed to manually formulate the new nodes and edges. He may then continue formulating the final query graph in subsequent steps by leveraging the query suggestion capability iteratively. A further animated example and a prototype system can be found at https://goo.gl/Xr9MRY.
In the literature, such suggestions that assist query formulation are often referred to as query autocompletion. Techniques for query autocompletion have been proposed for web search and XML search [14]. For instance, search engine companies use their proprietary algorithms for providing keyword suggestions during query formulation. However, a corresponding capability for graph query engines is in its infancy. In fact, to the best of our knowledge, except for a recent demo for edge suggestions [19], the autocompletion of subgraph queries has not been studied before. (Ideally, the user interface may automatically show useful suggestions to users. The current gui in Fig. 1 provides an "Autocomplete" button for users to fetch the top-k suggestions, allowing an explicit comparison of the experiences with and without suggestions in user tests.)
There are two key challenges of autocompleting subgraph queries. Firstly, in web search, the natural logical increments (i.e., tokens) of queries are keywords. However, the notion of "increments" of subgraph queries has not yet been defined. Furthermore, subgraph queries are structures, not a sequence of tokens. That is, there are many ways to compose the queries. Secondly, there can be potentially many candidate query suggestions. Consequently, it is paramount to return a ranked list of query suggestions at interactive time.
To address the aforementioned challenges, we propose a novel autocompletion framework for subgraph queries (namely AutoG [44]), whose simplified workflow is shown in Fig. 2. In a nutshell, AutoG allows users to submit an initial query q and a preference u and it returns ranked suggestions according to u.
To tackle the first challenge, we propose a novel notion of c-prime features as logical increments of subgraph queries. In this paper, we illustrate c-prime features with frequent subgraphs, but they can be any structural features that capture the structural characteristics of data graphs (e.g., [10,13,42]) that users may be interested in. c-prime features are the first structural features defined with feature composability, i.e., the number of ways that a feature can be composed from other smaller features. In short, a c-prime feature is a feature whose composability is no more than c. When possible increments are many, query autocompletion can be inefficient. Our main idea is that to optimize query autocompletion time, AutoG omits non-c-prime features because they may be formed from c-prime features/queries anyway. As shown in Fig. 2, in our proposed framework, a user submits an initial query q and his/her intent on ranking query suggestions. Then, the query graph is represented by c-prime features.
To overcome the second challenge mentioned above, we formalize the query autocompletion problem as a novel ranked subgraph query suggestion problem (Rsq). The goal of the Rsq problem is to efficiently determine and rank candidate query suggestions with respect to a user preference. The rest of the paper is organized as follows. Section 2 introduces preliminary concepts and formally states the autocompletion problem for subgraph queries. Section 3 proposes the c-prime features. The autocompletion framework is presented in Sect. 4. Section 5 presents an indexed approach to autocompletion. Section 6 proposes an automorphism-based pruning technique. Experimental study is reported in Sect. 7. Related work is discussed in Sect. 8. Section 9 concludes this paper. "Appendices" contain all proofs, additional experiments on performance and technical details of index construction.

Preliminaries
In this section, we first provide the preliminaries and describe the problem being studied. Then, we elaborate the query composition operation (or simply composition) assumed by this paper, which combines two graphs to form another graph.

Subgraph queries and background
We consider a graph database D as a set of data graphs {g_1, g_2, ..., g_n}. Each graph is a triple g = (V, E, l), where V and E are the vertex and edge sets of g, respectively, and l is the label function of g. The size of a graph is defined by |E|. The query formalism adopted by this paper is subgraph isomorphism, defined as follows.
Definition 2 (Subgraph query) Given a graph database D = {g_1, g_2, ..., g_n} and a query graph q, the answer set of q is D_q = {g | q ⊆_λ g, g ∈ D}.
Multiple subgraph-isomorphic embeddings of g may exist in g', denoted as λ^0_{g,g'}, λ^1_{g,g'}, ..., λ^m_{g,g'}. Therefore, subgraph isomorphism of g and g' can be viewed as a relation between g and g', where each record is an embedding of g in g'. Furthermore, in our technical discussions, we may use graph isomorphism of g_i and g_j, which is g_i ⊆_λ g_j with |g_i.V| = |g_j.V|, and graph automorphism of g, which is a graph isomorphism of g to itself. To keep the presentation intuitive, we may describe automorphism/graph isomorphism as a mapping of a node configuration (domain) to another node configuration (image).
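To make embeddings and automorphisms concrete, the following sketch enumerates subgraph-isomorphic embeddings of a small labeled pattern in a target graph by brute force. The graph encoding (a label map plus a set of undirected edges) and the helper name `embeddings` are our own illustration, not AutoG's implementation; a practical system would use an algorithm such as VF2 instead.

```python
from itertools import permutations

def embeddings(pattern, target):
    """Enumerate subgraph-isomorphic embeddings of `pattern` in `target`.

    A graph is a pair (labels, edges): `labels` maps node -> label and
    `edges` is a set of frozensets {u, v}.  Brute force over all injective
    node mappings; feasible only for tiny query graphs.
    """
    p_labels, p_edges = pattern
    t_labels, t_edges = target
    p_nodes = list(p_labels)
    found = []
    for image in permutations(t_labels, len(p_nodes)):
        m = dict(zip(p_nodes, image))
        # labels must agree under the mapping
        if any(p_labels[v] != t_labels[m[v]] for v in p_nodes):
            continue
        # every pattern edge must map onto a target edge
        if all(frozenset((m[u], m[v])) in t_edges for u, v in map(tuple, p_edges)):
            found.append(m)
    return found
```

Applying `embeddings` to a graph and itself (with equal node counts) enumerates its automorphisms, the A(g) relation used later in the paper.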

Query composition
This subsection formalizes the query composition used in our technical discussions. We recall that queries are complex structures, and larger queries may be constructed from smaller queries in many ways. To facilitate the discussions, we define how a large query is constructed from two smaller queries by specifying how a common subgraph connects them.

Fig. 3 Query graph composition
Definition 3 (Common subgraphs (CS)) Given two graphs g_1 and g_2, a common subgraph of g_1 and g_2 is a connected subgraph containing at least one edge that is a subgraph of both g_1 and g_2 (denoted as cs(g_1, g_2), or simply cs when g_1 and g_2 are clear from the context), i.e., cs ⊆_λ1 g_1 and cs ⊆_λ2 g_2. We define CS(g_1, g_2) to be the set of common subgraphs of g_1 and g_2.
A subtlety is that in the literature, maximal common subgraphs are extensively studied. However, we present common subgraphs because in query composition, large query graphs may not necessarily be formed via the maximal common subgraphs of small graphs.

Definition 4 (Query composition) compose is a function that takes two graphs, g_1 and g_2, and the corresponding embeddings, λ_1 and λ_2, of a common subgraph cs as input, and returns the graph g that is composed by g_1 and g_2 via λ_1 and λ_2 of cs, respectively, denoted as g = compose(g_1, g_2, cs, λ_1, λ_2).
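As an illustration of Definition 4, the sketch below glues two graphs by identifying the images of a common subgraph under its two embeddings. The function name `compose`, the graph encoding (label map plus edge set) and the renaming scheme are assumptions made for this sketch only.

```python
def compose(g1, g2, cs_nodes, lam1, lam2):
    """Compose g1 and g2 via embeddings lam1/lam2 of a common subgraph.

    `cs_nodes` lists the common subgraph's nodes; lam1/lam2 map each of
    them into g1/g2.  Nodes of g2 matched by lam2 are identified with
    their g1 counterparts; the remaining g2 nodes are copied under fresh
    names so they cannot clash with g1's node names.
    """
    l1, e1 = g1
    l2, e2 = g2
    # g2 node -> node of the composed graph
    rename = {lam2[v]: lam1[v] for v in cs_nodes}
    for v in l2:
        if v not in rename:
            rename[v] = ("g2", v)  # fresh name for non-shared nodes
    labels = dict(l1)
    labels.update({rename[v]: l2[v] for v in l2})
    edges = set(e1) | {frozenset(rename[u] for u in e) for e in e2}
    return labels, edges
```

Different choices of embeddings lam1/lam2 of the same cs yield structurally different composed graphs, which is exactly why composability (Sect. 3) counts embedding pairs.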

Query composition modes
In this subsection, we present the possible query composition modes supported by AutoG and the one that is assumed by our technical presentations.

Connected versus disconnected subgraphs
In this paper, we assume that both the query graphs and the query suggestions (formed by a composition of the query graphs and increments) are connected graphs. Hence, the common subgraphs are connected graphs, too. In the case of disconnected query graphs, we may simply pass each connected component to AutoG individually. A minor relaxation of the connectivity requirement on existing queries and increments would support disconnected query suggestions.
Edge increments versus subgraph increments

A simple query autocompletion mode is to generate edge increments to the current query (e.g., [19]). However, edge suggestions have at least two drawbacks: (1) the query formulation process may take many steps, and (2) users can express limited structural information regarding their desired queries in each step. On the other hand, as shown in Sect. 1, we propose to increment the query through our proposed features, which can be subgraphs. In our experimental investigation on the query suggestions, the average number of edges of query increments is always more than one.
For simplicity of presentation, we assume connected subgraphs suggestions unless otherwise specified.Putting these together, the problem being studied can be described as follows.
Problem statement Given an existing query q, a ranking function util, a user preference u and a parameter k, compute a query suggestion set Q_k = {q_1, ..., q_k}, where each q_i is composed by adding an increment to q and Q_k is the top-k set of suggestions w.r.t. the ranking function util and the user preference u.

c-Prime features
Structural features of graphs have been extensively studied recently, for the purpose of optimizing the performance of structural queries, among other things. Their intuition is to determine a set of subgraphs that carry various structural characteristics of the data graphs D (e.g., discriminative frequent subgraphs [42] and action-aware frequent subgraphs [21]). Data graphs are then indexed by the features. Given a query q, it is decomposed into a set of features F_q. A data graph that does not contain F_q cannot be an answer of q and hence can be pruned. Previous work shows that this approach can effectively prune non-answers. However, this approach does not consider query autocompletion.
In this section, we propose c-prime features. They are defined by the number of ways they can be composed from smaller features. Intuitively, c-prime features are features that can be formed from smaller features in only a few ways. c-prime features have not been proposed before because they are designed for suggesting query increments, not for filtering data graphs. c-prime features are orthogonal to existing features, i.e., users may integrate their existing features with c-prime features for their specific applications.
Fig. 4 Frequent features (partial) with their composabilities

The design rationales of c-prime features are that (i) some features are important to query autocompletion because their absence leads to fewer possible suggestions, and (ii) some other features are less important because they can be constructed incrementally from small ones in numerous ways and can be suggested by query autocompletion anyway.
To discuss c-prime features, we start with frequent features [41]. Frequent features are adopted because, without prior knowledge, we may assume each data graph in the database D has the same chance of being retrieved by users' queries. Hence, frequent subgraphs of D have a higher chance of appearing in users' queries.
We present the formal definition of frequent features in Definition 12 in "Appendix 1". We provide an example of frequent features below.
Example 3 Figure 4 shows a set of frequent features extracted from PubChem by gSpan [41]. In the figure, the vertices C, N and O represent the chemical elements carbon, nitrogen and oxygen, respectively. The edges between the elements (i.e., C-C and C=C) signify the single and double bonds between two elements. Given a graph g and a frequent feature set F of a database D, we may decompose g into a set of features F_g: {f_1, ..., f_n}, where every f_i ∈ F_g satisfies f_i ∈ F and f_i ⊆_λ g. Similarly, we may decompose a query into a set of features and their embeddings.
Definition 5 A query q of AutoG is represented as a binary tuple (F_q, λ), where F_q is a set of features of q and λ takes a feature f_q ∈ F_q as input and returns the embedding of f_q in q.
Example 4 Suppose f_18 is a query. A possible F_q is {f_4, f_6, f_7, f_10, f_13, f_18}. One may easily derive embeddings of the double bond C=C (f_4), the single bond C-C (f_7) and other features in f_18.
From the example, we can see that queries can be considered as compositions of features. However, how features are composed together to form queries requires some elaboration. Graph query composition here is structural. Feature embeddings are required to specify how large structures are formed (see Definition 4). In Definition 6, we define feature composability as a measurement of the number of embeddings of feature pair compositions that form the feature f.

Definition 6 (Feature composability) The composability of a frequent feature f with respect to the feature set F, denoted as c(f, F), is

c(f, F) = Σ (|Λ_{cs,f_i}| × |Λ_{cs,f_j}|) / |A(cs)|,

where the sum ranges over the triples (f_i, f_j, cs) with cs ∈ CS(f_i, f_j) and f_i, f_j ∈ F such that compose(f_i, f_j, cs, λ_i, λ_j) = f for some embeddings λ_i ∈ Λ_{cs,f_i} and λ_j ∈ Λ_{cs,f_j}; Λ_{cs,f_i} (resp. Λ_{cs,f_j}) denotes the set of embeddings of cs in f_i (resp. f_j); the equality "=" denotes graph isomorphism; and A(cs) denotes the automorphism relation of cs.
In Definition 6, the numerator of the composability is the number of distinct feature embeddings that form the feature. The denominator |A(cs)| is needed because the queries that are constructed from features via automorphic common subgraphs are structurally equivalent.
Given this background, we are ready to propose c-prime features, which are features that have a composability smaller than or equal to c.

Definition 7 (c-prime feature) A feature f is a c-prime feature if and only if c(f, F) ≤ c. A feature is a non-c-prime feature if and only if it is not a c-prime feature.
Assuming that each composition is equally likely, non-c-prime features have a high chance of being formed from c-prime features.Therefore, non-c-prime features have a higher chance of being recovered from query autocompletion.On the other hand, c-prime features may not be suggested as they may not be composed from other features.Ignoring c-prime features leads to less comprehensive query suggestions.
Example 5 Consider the features shown in Fig. 4 again. We annotate each feature with its composability at its lower right-hand side. When c is set to 4, the c-prime features (4-prime features) are {f_3, f_4, ..., f_16}. While larger features may still be constructed from 4-prime features, they may require multiple construction steps. When c is set to 16, only f_28 is not a c-prime feature, because f_28 can be constructed from f_7 and f_16 in many ways. In the case where the dataset has many features and their possible compositions are voluminous, query autocompletion may become too costly. Hence, the non-16-prime features may be omitted from query autocompletion, as these features are more probably suggested anyway. In other words, users may not lose query suggestion candidates even when omitting non-16-prime features.
In addition, c-prime features have anti-monotonicity and downward-closure properties, similar to those of frequent features. The details are provided in "Appendix 1".
Example 6 Consider Fig. 4 again. f_13 can be constructed from f_4 and f_13 via f_4. The cs is f_4. The cs has two embeddings in f_13 and f_4, respectively. |A(cs)| is 2. The same counts are obtained from composing f_7 and f_13 via f_7. The composability of f_13 is 2×2/2 + 2×2/2 = 4. Thus, f_13 is a 4-prime feature. f_13 is a subgraph of f_18. After some counting, we note that f_18 is a 9-prime feature. By the anti-monotonicity property, f_13 is certainly a 9-prime feature. Since f_18 is not a 4-prime feature, by the downward-closure property, any supergraphs of f_18 are not 4-prime features.
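The counting in Example 6 can be reproduced mechanically. The sketch below performs only the summation of Definition 6 over pre-enumerated composition triples; enumerating the triples themselves requires subgraph-isomorphism machinery and is omitted here. The function name and input encoding are illustrative assumptions.

```python
def composability(compositions):
    """Sum |Λ_i| * |Λ_j| / |A(cs)| over the distinct ways a feature can
    be composed (Definition 6).  Each entry of `compositions` is a triple
    (n_embeddings_in_fi, n_embeddings_in_fj, n_automorphisms_of_cs) for
    one composing pair (f_i, f_j) via one common subgraph cs.
    """
    return sum(ni * nj / aut for ni, nj, aut in compositions)

# Example 6: f_13 is formed via cs = f_4 (2 embeddings on each side,
# |A(cs)| = 2) and via cs = f_7 with the same counts, so
# c(f_13, F) = 2*2/2 + 2*2/2 = 4, i.e., f_13 is a 4-prime feature.
```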

Integration of c-prime features with other features
We remark that this section illustrates c-prime features with frequent features (as their underlying features). However, depending on users' applications, AutoG may integrate other features into c-prime features. For instance, the structure search of PubChem provides some templates of query structures for users to compose their queries, based on domain knowledge of chemical applications. In other applications such as web searches [32], the query templates can be automatically derived from query logs by using machine learning or data mining methods. When AutoG adopts such approaches, it adds the templates into the feature set of D. AutoG simply determines whether they are c-prime as before.

Autocompletion framework for subgraph queries-AutoG
This section presents the major steps in AutoG, namely (i) decomposing users' queries, (ii) determining the candidate query suggestions and (iii) ranking the suggestions with respect to users' preferences. An overview of the framework is shown in Fig. 2.

Query decomposition
Following the subgraph query processing methods in the literature (e.g., [10,42]), AutoG assumes the (c-prime) features of data graphs are mined offline. At runtime, it decomposes a query into a feature set. However, AutoG requires the embeddings (intuitively, the locations) of the features in the query, which show how they are connected.

Algorithm 1 Query Decomposition

Input: a query q, feature set F (determined by gSpan [41] offline) and user preference component γ
Output: a set of embeddings M_q of the c-prime features in q
1: Let F_q be the c-prime features of q
2: Let M_Fq be the embeddings of each f ∈ F_q in q // e.g., using VF2
3: Let E be the edge set of q, and M_q be an empty set
4: Initialize e.w = 1 for all e ∈ E, and m_f.unused = true for all m_f ∈ M_Fq
5: while m_f = find(E, M_Fq) do
6:     M_q ← M_q ∪ {m_f}
7: return M_q
8: function find(E, M_Fq)
9:     determine the m_f ∈ M_Fq with the largest util(m_f), where m_f.unused = true and
10:    m_f covers at least one uncovered edge
11:    if no such m_f exists then return null
12:    e.w ← e.w × γ for all e ∈ m_f.E
13:    m_f.unused ← false
14:    return m_f
15: function util(m_f)
16:    return the sum of e.w over all e ∈ m_f.E

It is evident that the decomposition of a query is not unique. Further, the design rationales of query decomposition in AutoG are competing ones: Firstly, the larger the features, the more structural semantics they preserve; secondly, the larger the features, the higher the chance that the features overlap. The overlapping features contain redundant information and hence should be avoided. Therefore, a user-specified parameter γ is introduced to specify the desirable degree of overlapping features. The larger the value of γ, the more likely the overlapping of decomposed features.
Example 7 Consider the graphs shown in Fig. 3. Suppose that f_18 is a user query. A possible decomposition is {f_10, f_13}, of which the embeddings are overlapping.

Algorithm 1 is a greedy algorithm for decomposing a query q with respect to the user-specified parameter γ. Initially, all the edges of q are uncovered. Algorithm 1 iteratively determines an embedding of a c-prime feature to cover q until no more uncovered edges can be covered. The output is the set M_q of embeddings of the c-prime features in q.
More specifically, first, we determine the c-prime features F_q of q by invoking a feature extraction program [41]. We determine the embeddings M_Fq (called feature-query embeddings) of the extracted features by invoking VF2 (Line 2). We initialize the result feature-query embeddings M_q to be an empty set and E to be the edge set of q to be covered (Line 3). We assume each edge e of q has a weight w, which reflects the extent to which e has been covered by the features in M_q. Each feature-query embedding m_f has a flag (called unused) to indicate whether the embedding is in M_q. Second, in Lines 5-6, we iteratively add the next m_f with the largest utility (defined by util) to M_q. The weights of the edges of the added m_f are decayed by γ (Line 12).
We make two remarks on find (Lines 8-14). In general, there are multiple decompositions of a query. find favors larger features, where the feature size is determined by the sum of edge weights (implemented by util in Lines 15-16). Larger features are used because they require more human effort to compose, i.e., they may preserve more of the user's intention. Hence, when a user has drawn exactly a large feature in his/her initial query, find leads AutoG to consider it as a whole, as opposed to small feature(s). However, large features may overlap. Thus, if an edge is covered by an m_f added to M_q, the weight of the covered edge is reduced by a factor of γ (Line 12).
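The greedy loop of Algorithm 1 can be sketched compactly in code. The sketch below abstracts each feature-query embedding as the set of query edges it covers; the identifiers and the stopping test are illustrative, not AutoG's actual implementation.

```python
def decompose(edges, feature_embeddings, gamma):
    """Greedy query decomposition (a sketch of Algorithm 1).

    `edges` is the query's edge set and `feature_embeddings` maps an
    embedding id to the set of query edges it covers.  Each round picks
    the unused embedding with the largest utility (sum of the current
    weights of its edges) that still covers an uncovered edge, then
    decays those edges' weights by gamma.
    """
    w = {e: 1.0 for e in edges}          # Line 4: all weights start at 1
    covered = set()
    unused = set(feature_embeddings)
    result = []
    while True:
        # find (Lines 8-14): unused embeddings covering a new edge
        candidates = [m for m in unused if feature_embeddings[m] - covered]
        if not candidates:
            break                        # Line 11: no such m_f exists
        best = max(candidates,           # Line 9: largest util(m_f)
                   key=lambda m: sum(w[e] for e in feature_embeddings[m]))
        result.append(best)              # Line 6
        unused.discard(best)             # Line 13
        covered |= feature_embeddings[best]
        for e in feature_embeddings[best]:
            w[e] *= gamma                # Line 12: decay covered edges
    return result
```

With γ close to 1, covered edges keep most of their weight, so overlapping embeddings remain attractive; with γ close to 0, overlaps are strongly penalized, matching the role of γ described above.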
It should be remarked that queries may contain infrequent edges, which are not in F, and will not be handled by AutoG.The analogy is that in web searches, infrequent keywords are not suggested; similarly, in AutoG, infrequent logical units are not suggested.By definition, infrequent edges lead to small answer sets.In this case, users may need less assistance from AutoG.

Complexity analysis of query decomposition
The time complexity of Algorithm 1 is O(|F_q| × T_subiso + |E|² × |M_Fq|), where (a) the first term is the time for determining the embeddings of F_q in q and T_subiso is the time for a subgraph isomorphism call, and (b) the second term is for scanning the |M_Fq| embeddings to cover O(|E|) edges in the find function, which is invoked O(|E|) times.
We provide several observations on the two terms in the above-mentioned complexity. A worst-case exponential-time subgraph isomorphism algorithm is needed to determine the embeddings of F_q in q (Lines 1-2). In this paper, we use a practical subgraph isomorphism algorithm called VF2. Moreover, users typically draw small queries via a visual graph editor, e.g., a graph containing fewer than 24 edges. Hence, the size and the number of features of the query (F_q) are small. As a result, T_subiso is small in practice, and VF2 is invoked only a few times. Regarding the complexity for scanning the embeddings, the terms |E| and |M_Fq| are small, again, due to small query sizes. Finally, the calculations of Lines 5-16 do not incur large constants in the asymptotic complexity. Hence, Algorithm 1 decomposes queries efficiently.

Generation of candidate suggestions
After the query decomposition step, the query q is represented by a set of c-prime features and their embeddings in q. The next step is to generate candidate query suggestions. In this subsection, we present connected feature increments, which are the most technically intriguing query composition mode, as discussed in Sect. 2.3.
Query increments can be added to the current query in multiple ways. Specifically, given a set of c-prime features, the number of compositions is, in the worst case, exponential in the query and feature sizes. However, in practice, many possible composed queries do not retrieve any data graphs and hence may not make sense. Such queries are also known as empty queries. Further, it is known that deciding the emptiness of a subgraph query is NP-hard.
This subsection formalizes a necessary condition for non-empty query compositions. We illustrate how to efficiently prune empty queries using the necessary condition; the unpruned queries are considered candidate suggestions.

4.2a Baseline
We present the condition with node labels for presentation simplicity; it can be readily extended to support edge labels. Consider a graph g = (V, E, l). Denote Σ to be the label set, where Σ = {l(v) | v ∈ V}. For each node v ∈ V, we determine a vector Σ_v of the counts of its neighboring nodes' labels, where Σ_v[l'] is the number of neighbors of v having label l'. The nodes of the graphs can thus be represented by such vectors and hence as data points in a |Σ|-dimensional space. Denote S to be the skyline of the data point representations of the nodes of the data graphs. Given a query q, if it is non-empty, q does not contain a node whose vector dominates the points in S. This condition is formalized in Proposition 1. Its proof can be established by contradiction.

Proposition 1 A query q is a non-empty query (also referred to as a candidate suggestion) only if no node v ∈ q.V has a vector Σ_v that dominates the points of the skyline S.
Suppose that there are |S| data points on the skyline. The check of Proposition 1 requires O(|q.V| × |S| × |Σ|) comparisons. As discussed, numerous possible queries may be generated, and they are checked against S at runtime. Thus, we relax the check for efficiency.

4.2b Relaxed necessary condition
For each pair of labels l_1 and l_2 in Σ, we determine the maximum number d_{l_1,l_2} of neighbors labeled l_2 over all data-graph nodes labeled l_1. The necessary condition for non-empty queries is then expressed in terms of d_{l_1,l_2}, as shown in Proposition 2. The number of comparisons is reduced to O(|q.V| × |Σ|). It has been validated by our experiments that this simplification is both efficient and effective.
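A minimal sketch of the relaxed check, assuming graphs are given as a label map plus adjacency lists; the function names and the d-table encoding are our own illustration.

```python
from collections import defaultdict

def neighbor_label_maxima(graphs):
    """Build d[l1][l2]: over all data-graph nodes labeled l1, the maximum
    number of neighbors labeled l2 (the table behind Proposition 2)."""
    d = defaultdict(lambda: defaultdict(int))
    for labels, adj in graphs:
        for v, nbrs in adj.items():
            counts = defaultdict(int)
            for u in nbrs:
                counts[labels[u]] += 1
            for l2, c in counts.items():
                d[labels[v]][l2] = max(d[labels[v]][l2], c)
    return d

def may_be_nonempty(query, d):
    """Relaxed necessary condition: every query node's neighbor-label
    counts must stay within the maxima observed in the data graphs."""
    labels, adj = query
    for v, nbrs in adj.items():
        counts = defaultdict(int)
        for u in nbrs:
            counts[labels[u]] += 1
        for l2, c in counts.items():
            if c > d[labels[v]][l2]:
                return False  # v demands more l2-neighbors than any data node has
    return True
```

A composed suggestion failing `may_be_nonempty` is necessarily empty and can be pruned; passing the check does not guarantee non-emptiness, as the condition is only necessary.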

Ranking candidate suggestions
There can be many candidate suggestions. Since users may only be able to interpret a small subset of them, AutoG returns the top-k suggestions w.r.t. a ranking function and a user preference component. When users formulate their queries, they may rank candidate suggestions differently because of different query formulation scenarios: for example, expert users may use AutoG to speed up their manual query formulation, whereas novice users may prefer diversified suggestions for exploring a database. In this section, we model the preferences between different criteria with a ranking function and a user preference component.

Ranking function and user preference component
This subsection presents a ranking function for possibly novice users who prefer query suggestions that (i) return more answer graphs and (ii) are structurally diversified. The first preference simply reflects users' intent to retrieve more answers, whereas the second one recognizes the importance of avoiding similar suggestions (e.g., [15,34,37]). These two preferences can be quantified as the following objective functions:

1. sel(q): the selectivity of q on D, defined as |D_q|/|D|.
2. dist(q_i, q_j): the "intra-dissimilarity" between a pair of suggestions q_i and q_j. The total pairwise distance of suggestions reflects how diversified a set of suggestions is.
For illustration purposes, we adopt the maximum common edge subgraph (mces) for dist (see Definition 8 [9,22]). mces is adopted because adding edges (as opposed to nodes) to an existing query appears to be a natural logical step of composing queries. The distance definition is denoted as BS.
Definition 8 (BS) Given two graphs g_1 and g_2, the graph distance based on the maximum common edge subgraph (mces) is defined as follows:

BS(g_1, g_2) = 1 − |mces(g_1, g_2).E| / max(|g_1.E|, |g_2.E|),

where mces(g_1, g_2) is a subgraph of g_1 with as many edges as possible that is isomorphic to a subgraph of g_2.
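For tiny graphs, the BS distance can be computed by brute force. The sketch below finds the mces size by trying edge subsets of g_1 in decreasing size and testing subgraph isomorphism into g_2; the normalization follows the reconstructed formula above, and the helper names are illustrative. Real systems use far more efficient mces algorithms.

```python
from itertools import combinations, permutations

def _sub_iso(edges_a, edges_b, labels_a, labels_b):
    """Is the (small) edge set edges_a isomorphic to a subgraph of b?"""
    nodes_a = sorted({v for e in edges_a for v in e})
    for image in permutations(labels_b, len(nodes_a)):
        m = dict(zip(nodes_a, image))
        if any(labels_a[v] != labels_b[m[v]] for v in nodes_a):
            continue
        if all(frozenset(m[v] for v in e) in edges_b for e in edges_a):
            return True
    return False

def bs_distance(g1, g2):
    """BS(g1, g2) = 1 - |mces(g1, g2).E| / max(|g1.E|, |g2.E|).

    The mces size is found by exhaustive search over edge subsets of g1,
    so only tiny graphs are feasible.
    """
    l1, e1 = g1
    l2, e2 = g2
    best = 0
    for k in range(min(len(e1), len(e2)), 0, -1):
        if any(_sub_iso(set(sub), e2, l1, l2)
               for sub in combinations(e1, k)):
            best = k  # largest common edge subgraph found
            break
    return 1 - best / max(len(e1), len(e2))
```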
Example 8 Take f_18 and f_22 in Fig. 4 as an example of computing dist.

The dist function has a few nice properties [9,22]. It is a metric. It is a reflexive and symmetric function, which can be observed from its definition. Other graph distance functions can be adopted to implement dist. We study the suggestion quality with other popular graph distance metrics [22,36,38] in "Appendix 3".

Definition 9 (User intent value of query suggestions) Given a set of query suggestions Q: {q_1, q_2, ..., q_k} and a user preference component α, the user intent value of Q (util) is the normalized weighted sum of the two objective functions:

util(Q) = α × (Σ_{q_i ∈ Q} sel(q_i)) / k + (1 − α) × (Σ_{q_i, q_j ∈ Q, i<j} dist(q_i, q_j)) / (k(k−1)/2).

The two objective functions can be competing: for example, it can be observed in practice that the sel of smaller queries is often larger, as more data graphs contain smaller queries; in contrast, smaller queries may have smaller structural differences between them and, consequently, dist returns smaller values and their diversity is relatively low. With the util function, we are ready to formulate the ranking problem of query suggestions and analyze its hardness.

Definition 10 (Rsq problem) Given a query q, a set of candidate query suggestions Q', the ranking function util, a user preference component α and a user-specified constraint k, the ranked subgraph query suggestion (Rsq) problem is to determine the top-k suggestions Q_k ⊆ Q' such that util(Q_k) is maximized.

The Rsq problem is an NP-hard problem, which can be established by a reduction from the maximum independent set (Mis) problem. The proof is presented in "Appendix 2".
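Since the ranking problem is NP-hard, a natural heuristic is greedy selection: repeatedly add the candidate with the largest marginal gain. The sketch below uses a simple α-weighted sum of selectivity and pairwise distance to already-chosen suggestions as the gain; the exact normalization of util and all identifiers here are assumptions for illustration.

```python
def rank_suggestions(candidates, sel, dist, k, alpha):
    """Greedy top-k selection for ranked query suggestions.

    `sel(q)` returns a selectivity in [0, 1]; `dist(q, p)` a symmetric
    distance in [0, 1]; `alpha` trades selectivity (alpha -> 1) against
    diversity (alpha -> 0).  A heuristic sketch, not an exact solver.
    """
    chosen = []
    rest = list(candidates)
    while rest and len(chosen) < k:
        best = max(rest, key=lambda q: alpha * sel(q)
                   + (1 - alpha) * sum(dist(q, p) for p in chosen))
        chosen.append(best)
        rest.remove(best)
    return chosen
```

Note how diversity changes the outcome: a candidate with low selectivity can still be chosen if it is structurally far from the suggestions picked so far.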

Template values of user preference α
The user preference α in Definition 9 expresses the relative importance of selectivities and suggestion diversities.To help users to set α, we may derive a set of templates of predefined preference components with intuitive semantics (such as selectivity-oriented and diversity-oriented suggestions) from the underlying dataset.Users may start with a predefined template and subsequently refine α after reviewing some query suggestions returned.
Alternatively, AutoG starts with a template of user preference. Based on the suggestions adopted by a user, AutoG may learn whether he/she prefers selectivities or diversities. The details of learning parameters from users' feedback, however, are beyond the scope of this paper.

Integration of other ranking functions
It should be remarked that the ranking function util presented in this subsection is for illustration purposes. That is, other objective functions can be readily plugged into the AutoG framework. Take the structure search of PubChem as a concrete example. We may include application-specific semantics in an additional objective function. Suppose F_T is the set of query templates provided by PubChem and the templates in F_T are c-prime features, as discussed at the end of Sect. 3. Suppose users favor suggestions that contain the query templates in F_T. This can be achieved by introducing a function app(q') that returns the number of features in F_T contained in q'. AutoG adds app to util and sets its preference as it does for the other objective functions.
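For illustration, plugging an extra objective such as app into util might look as follows (a sketch; `contains` is a hypothetical containment test standing in for subgraph isomorphism, and `beta` is the preference weight assigned to the new component):

```python
def util_with_app(base_util_value, Q, templates, contains, beta):
    """Sketch of extending util with an application-specific objective.
    app counts, over the suggestions in Q, how many templates of F_T
    they contain; contains(q, t) is a hypothetical helper (the framework
    would use subgraph containment tests here)."""
    app = sum(1 for q in Q for t in templates if contains(q, t))
    norm = max(1, len(Q) * len(templates))   # normalize app to [0, 1]
    return base_util_value + beta * (app / norm)
```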

Efficient selectivity and diversity computation
Next, we present efficient algorithms for determining sel and dist, which enable efficient ranking of candidate suggestions.

Candidate answer selectivity estimation
We first recall some standard notation. We denote by D_q = eval(q, D) the query evaluation of q on D, where D_q is the query result. The selectivity of q (denoted as sel(q, D)) is |eval(q, D)|/|D|.
Recall that eval(q, D) is NP-hard due to subgraph isomorphism tests. Hence, we propose to leverage feature-based query processing to efficiently estimate eval. The benefits of this approach are twofold: (1) c-prime features can be seamlessly integrated into existing feature-based approaches, so this estimation does not incur much overhead; and (2) it is known that feature-based approaches (e.g., [42]) can efficiently determine candidate answer sets of subgraph queries that are close to the actual answer sets.
In a nutshell, each feature is associated with the set of IDs of the graphs that contain the feature. Given the features F_q of a query, the candidate set is obtained by intersecting the ID sets associated with each feature in F_q. The numerous intersections of large ID sets may be costly, especially when ranking suggestions online. Hence, we estimate the selectivity by adopting systematic sampling after a uniformly random permutation of the graph IDs [8].
W.l.o.g., we assume two sets A and B with |A| < |B|. Then, |A| is the population size. The real selectivity, |A ∩ B|, is the number of success states in the population. n = |A|/m is the number of draws, where m is the user-specified sampling interval. The number of observed successes is denoted as k. The error of our estimation method can be analyzed with the hypergeometric distribution, which describes the probability of k successes in n draws without replacement. The probability that the observed number of successes exactly equals k is given by

P(X = k) = C(K, k) · C(N − K, n − k) / C(N, n),

where N = |A| is the population size and K = |A ∩ B| is the number of success states.

Fig. 5 Trimming compositions for mces computation
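The sampling-based estimator and the hypergeometric error model can be sketched as follows (a minimal illustration with plain Python sets; the helper names are ours):

```python
import random
from math import comb

def estimate_intersection(A, B, m, seed=0):
    """Estimate |A ∩ B| by systematic sampling with interval m over a
    uniformly random permutation of the smaller ID set, as described above."""
    small, large = (A, B) if len(A) <= len(B) else (B, A)
    ids = list(small)
    random.Random(seed).shuffle(ids)   # random permutation (done offline once)
    sample = ids[::m]                  # systematic sample: n = |A|/m draws
    large_set = set(large)
    hits = sum(1 for g in sample if g in large_set)  # observed successes k
    return hits * m                    # scale back to the population

def hypergeom_pmf(N, K, n, k):
    """P(exactly k successes in n draws without replacement), for a
    population of size N containing K success states."""
    return comb(K, k) * comb(N - K, n - k) / comb(N, n)
```

For example, with |A| = 1000, a true overlap of 500 and m = 4, the estimator draws 250 IDs and scales the hit count back by m.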

Diversity approximation
The second component of util is the structural diversity of the query suggestions in Q. dist makes the overall ranking function util submodular, so that greedy algorithms are a natural heuristic for it. (Please refer to "Appendix 3" for the experiments on ranking using different graph distance functions.)
To efficiently implement the dist function of two suggestions, our main idea is to trim the common parts from them before calling the exponential-time algorithm for mces. This is efficient because (i) the query suggestions are composed by adding different increments to the same existing query graph, and (ii) some auxiliary structures between possible composable features can be computed offline. For brevity, we omit the tedious pseudo-code and illustrate the major steps with the following example.
Example 9 Consider a current query q which is simply a feature f_57, and two possible compositions (shown in Fig. 5) that construct query suggestions from f_57 by adding either f_19 (denoted as q_1) or f_28 (denoted as q_2). Note that the existing query q and the increments f_19 and f_28 are, in this example, features, and their compositions can be enumerated offline. Suppose we compute the mces of q_1 and q_2. Some parts of q_1 and q_2 are trivially common, and it is unnecessary to perform the costly mces computation on them. Thus, we reduce q_1 and q_2 to the trimmed subgraphs q̂_1 and q̂_2 for computing the non-trivial mces of q_1 and q_2. Further, some intermediate results are indexed offline. Specifically, the major offline steps are as follows.
1. Denote cs_1 (resp. cs_2) to be the common subgraph between f_57 and f_19 (resp. f_57 and f_28).
2. The subgraph s computed by f_57 − cs_1 − cs_2 is trivially a part of the mces of q_1 and q_2.
3. q̂_1 is obtained by q_1 − s. Similarly, q̂_2 is q_2 − s.
4. The embedding of cs_1 (resp. cs_2) in f_57 is computed offline. It also specifies its location in q̂_1 (resp. q̂_2), which minimizes the search for the mces. In particular, the nodes 0, 1 and 2 of q̂_1 must map to the nodes 0, 1 and 2 of q̂_2.
5. An mces algorithm determines the mces of q̂_1 and q̂_2 offline, and it returns s_{1,2}.
6. In s_{1,2}, the nodes {0,1,2} are from cs_1 and cs_2 (i.e., the existing query q). The non-trivial mces is C-C (1, 8 (or 9)), which contains one edge.
When candidate suggestions are ranked online, the size of the mces between q_1 and q_2 is obtained by simply adding the sizes of the existing query q (i.e., f_57) and the non-trivial mces. We provide two remarks on the above mces computations. (i) While the query is provided by users online, features and their compositions can be enumerated offline. Therefore, the sizes of the non-trivial mces between compositions can be indexed offline. (ii) The above optimization significantly speeds up the online mces computation because the query suggestions contain the same existing query graph.
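The online computation described above can be sketched as follows (a toy illustration; `eta` is a plain dict standing in for the offline index of non-trivial mces sizes, and the mces-based distance formula dist = |q1| + |q2| − 2·|mces| is our assumption for concreteness):

```python
def mces_size_online(q_size, eta, c1, c2):
    """Online |mces(q1, q2)| for two suggestions built on the same query q:
    the existing query is trivially common, and the non-trivial mces of the
    trimmed parts is looked up in the offline index eta."""
    return q_size + eta[frozenset((c1, c2))]

def dist_online(q_size, inc1, inc2, eta, c1, c2):
    """One standard mces-based distance (an assumption for illustration):
    dist = |q1| + |q2| - 2 * |mces(q1, q2)|, with |q_i| = |q| + |inc_i|."""
    mces = mces_size_online(q_size, eta, c1, c2)
    return (q_size + inc1) + (q_size + inc2) - 2 * mces
```

In the example above, only the one-edge non-trivial mces is computed (offline) on the trimmed subgraphs; the online step is a dictionary lookup and an addition.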

Greedy ranking algorithm
In this subsection, we propose a ranking algorithm for a set of candidate suggestions with respect to the ranking function util and the user preference α. Note that the util function has two monotone submodular components. It can be easily established that the util function is also submodular. A desirable property of submodular functions is that greedy algorithms work well and guarantee a 1/2·opt approximation ratio [6].
We propose a two-level greedy algorithm (Algorithm 2) to rank the candidate suggestions. A two-level algorithm is proposed because the candidate suggestions can be many, and computing the diversity between every possible pair of them involves dist, which is time-consuming. In a nutshell, at the first level, for each feature f embedded at a specific location λ of the query, the algorithm greedily determines its top-k suggestions (denoted as Q^{f,λ}_k) that increment f at location λ. At the second level, it greedily determines the overall top-k suggestions from the Q^{f,λ}_k's, for all (f, λ)'s. Hence, it avoids computing dist for all possible pairs of candidate suggestions. The pseudo-code of the greedy algorithm is presented in Algorithm 2. We elaborate its details below.
Greedy_local: First, Greedy_local (Lines 14-15) determines the possible suggestions composed by adding a feature to the feature-query embedding (f, λ) of q, denoted as Q_C. For efficiency purposes, we further restrict that q is increased by at most δ edges (Line 15). Line 16 computes the pairwise dist between the suggestions in Q_C. In each iteration step (Lines 17-20), it adds the composed suggestion q_c that makes the util function of Q^{f,λ}_k the largest.

Greedy_global: Greedy_global calls Greedy_local |M_q| times (Line 2). In each Greedy_local call (Line 3), we obtain candidate suggestions (Line 4) and take the union with the top-k suggestions from each possible (f, λ) obtained so far (denoted as Q). Line 5 computes the pairwise dist between the suggestions in Q online. In each iteration step (Lines 6-10), it adds the composed suggestion q' that makes the util function of Q_k the largest. This step is repeated until it obtains k suggestions in Q_k.
Remarks. The greedy algorithm involves a trade-off between efficiency and suggestion quality. Greedy_local obtains 1/2·opt suggestions w.r.t. a specific (f, λ), whose computation can be further optimized offline. Since M_q is only available online and the time for computing the user intent value of a query set, T_util, can be potentially long, Greedy_global is run only on Q instead of the Q_C's of all possible (f, λ)'s. Alternatively, when the query suggestion time is already acceptable, one may tune AutoG to produce more suggestions for ranking as follows: (i) one may include more features, e.g., by lowering the minimum support and/or increasing the composability of c-prime features; (ii) AutoG can be tuned to allow more overlapping features in query decomposition by using the parameter γ; (iii) one may set the maximum query increment size δ to a large value.
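The two-level structure can be sketched in plain Python (`score` stands in for the util set function; all names are ours):

```python
def greedy_topk(candidates, k, score):
    """Greedily add the candidate that maximizes the (submodular) set
    score; this is the 1/2-approximation greedy mentioned above."""
    chosen, pool = [], list(candidates)
    while pool and len(chosen) < k:
        best = max(pool, key=lambda c: score(chosen + [c]))
        chosen.append(best)
        pool.remove(best)
    return chosen

def two_level_rank(per_embedding_candidates, k, score):
    """Greedy_local per feature embedding (f, λ), then Greedy_global
    over the union of the local winners."""
    union = []
    for cands in per_embedding_candidates:          # one list per (f, λ)
        union.extend(greedy_topk(cands, k, score))  # Greedy_local
    return greedy_topk(union, k, score)             # Greedy_global
```

The global level thus ranks at most k·|M_q| suggestions rather than all candidates, trading a constant-factor loss in quality for far fewer dist computations.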

Complexity analysis of the greedy ranking algorithm
This section analyzes the asymptotic complexity of the greedy ranking algorithm (Algorithm 2) presented in Sect. 4.3.3. We remark that when determining the user intent value of query suggestions for ranking, the sampling method for selectivity estimation (presented in Sect. 4.3.2) is significantly more efficient than the worst-case exponential-time computation of the query suggestions' diversity, which involves computing the structural difference between graphs (dist) multiple times. For a succinct analysis, we assume that there is an oracle that efficiently provides the selectivity estimation of a query and omit it in the analysis.

Greedy_local:
The time complexity of the local greedy algorithm is O(|Q_C| · T_compose + k · |Q_C| · T_util).

1. The first term is due to the addition of a feature to (f, λ), where λ is the embedding of f in the current query. |Q_C| is the number of possible suggestions at (f, λ), which is of modest value. T_compose is the time for adding a feature in F to (f, λ) of q. Such an addition requires checking whether the feature contains a common subgraph with f, which requires subgraph isomorphism calls.
2. The second term is due to the time for computing the composed suggestion q_c that makes the util of Q^{f,λ}_k the largest. Denote by T_util the time for computing the user intent value of a set of suggestions. Recall that in each iteration of the greedy algorithm, we augment Q^{f,λ}_k with one suggestion; there are k iterations, and each examines at most |Q_C| candidates.

Greedy_global:
The time complexity of the overall greedy algorithm is O(|M_q| · T_local + k · k|M_q| · T_util), where T_local denotes the cost of one Greedy_local call.

1. The first term is due to the calls of the local greedy algorithm on the feature embeddings. Greedy_global calls Greedy_local once for each embedding, and there are |M_q| embeddings in total. Recall that humans often draw small queries; hence, |M_q| is small in practice.
2. The second term is due to the ranking of Q to obtain the overall top-k suggestions Q_k. We remark that each Greedy_local call returns k suggestions; hence, the number of query suggestions ranked by Greedy_global is at most k|M_q|.

Based on the above analysis, we observe that the algorithm calls two worst-case exponential-time subroutines with graphs of small or modest sizes, namely (i) the subgraph isomorphism calls when determining the possible common subgraphs of feature pairs, and (ii) the structural difference between query suggestions for computing their diversities. In Sect. 4.3.2, we have proposed an optimization for computing suggestion diversities. Moreover, in the next section, we propose an index to further optimize them.

Algorithm 2 Ranking Candidate Suggestions
Input: a query q represented by M_q, a user preference component α, the number of suggestions requested k, and the max. increment size δ
Output: the top-k query suggestions Q_k

 1: function greedy_global(q, α, k)
 2:   for all (f, λ) ∈ M_q do
 3:     Q^{f,λ}_k = greedy_local(f, λ, α, k)
 4:     Q = Q ∪ Q^{f,λ}_k
 5:   compute dist between each pair of suggestions in Q
 6:   for all i = 1 . . . k do  // global ranking
 7:     …
 8:     q_max = argmax(q', util(q', Q_k)), where q' ∈ Q
 9:     Q_k = Q_k ∪ {q_max}
10:   return Q_k

14: function greedy_local(f, λ, α, k)
      Q_C is the possible suggestions composed by adding a feature to the feature-query embedding (f, λ) of q, where
15:   each q_c ∈ Q_C satisfies |q_c| − |q| ≤ δ
16:   compute dist between each pair of suggestions in Q_C
17:   for all i = 1 . . . k do
18:     q_c = argmax(q', util(q', Q^{f,λ}_k)), where q' ∈ Q_C
19:     Q^{f,λ}_k = Q^{f,λ}_k ∪ {q_c}
20:   return Q^{f,λ}_k

Indexed autocompletion for subgraph queries (AutoGI)
To optimize the autocompletion framework presented in Sect. 4, we present a novel index, called the feature Dag (FDag) index, and its associated algorithms. It is the first structure that indexes features and records their structural information for query autocompletion, including subgraph isomorphisms between features, features' automorphisms and auxiliary structural differences between query compositions. We present the definition of FDag and its operations, and postpone the details of the index construction to "Appendix 4".

Feature DAG (FDAG) index
Prior to the definition of FDag, we present its design rationales:

1. Greedy_local (Algorithm 2, Lines 14-15) involves adding a feature f_j ∈ F to a query q, via a feature f_i embedded in q, where f_j and f_i have a common subgraph cs, i.e., cs ⊆_{λ_i} f_i and cs ⊆_{λ_j} f_j. Numerous suggestions may potentially be generated (i.e., |Q_C| can be large), and determining all possible common subgraphs of two features is costly. Hence, FDag indexes all subgraph isomorphic embeddings between features.
2. Suggestions that are formed via common subgraphs that are automorphic to each other are structurally equivalent. FDag indexes the automorphic embeddings of each feature, so that automorphic suggestions are generated only once.
3. All possible pairwise feature compositions are enumerated and indexed, so that adding an increment to a feature (T_compose) is done with an FDag lookup.
4. As motivated in Sect. 4.3.2, determining structural differences between graphs is potentially costly (i.e., T_util can be large). Thus, FDag indexes the auxiliary structures for determining structural differences between compositions (illustrated with Example 9), so that T_util is significantly reduced.
5. The ranking function presented in Sect. 4.3.1 involves selectivity estimations. FDag indexes the graph IDs of each feature with a predefined sampling interval m.
The feature Dag (FDag) index is then formally presented in Definition 11. We provide the algorithmic details for constructing FDag in "Appendix 4".

Definition 11
FDag is a Dag (V, E, M, anc, des, A, ζ, η, D), where:

1. V is a set of index nodes. Each node v represents a feature, denoted as f_v. For presentation simplicity, we often use f_v to refer to the index node;
2. E is a set of edges, where an edge (v_i, v_j) denotes that f_{v_i} ⊆_λ f_{v_j}. Further, M is a function that takes an edge (v_i, v_j) as input and returns the subgraph isomorphism embeddings of f_{v_i} in f_{v_j}, often denoted as M_{f_i,f_j}; anc and des are functions that take an index node v as input and return its ancestor and descendant nodes, respectively;
3. A takes a feature f_v as input and returns the automorphism embeddings of f_v, often denoted as A_{f_v};
4. ζ is a function that takes a c-prime feature f_v as input and returns a set of composition records C as output, where each record in C is a 6-ary tuple whose components include cs, λ_v, λ_{v_j} and F_l, where cs is the common subgraph of f_v and f_{v_j}, λ_v (resp. λ_{v_j}) specifies the embedding of cs in f_v (resp. f_{v_j}), and F_l is a set of features embedded in the composed graph;
5. η is a function that takes a pair of compositions as input and returns the auxiliary structural difference between the pair; and
6. D takes an index node f_v as input and outputs a sample of the IDs of the graphs that contain f_v. We often denote the graph IDs of f_v as D_{f_v}.
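A skeleton of the index in Python, with field names mirroring Definition 11 (the concrete container types are illustrative assumptions, not the paper's implementation, and anc/des are shown for direct neighbors only):

```python
from dataclasses import dataclass

@dataclass
class FDag:
    """Skeleton of the feature Dag index of Definition 11."""
    nodes: dict   # v -> feature graph f_v
    edges: set    # (v_i, v_j): f_{v_i} is subgraph-isomorphic to f_{v_j}
    M: dict       # (v_i, v_j) -> embeddings of f_{v_i} in f_{v_j}
    A: dict       # v -> automorphism embeddings of f_v
    zeta: dict    # v -> composition records of the c-prime feature f_v
    eta: dict     # (composition pair) -> auxiliary structural difference
    D: dict       # v -> sampled IDs of data graphs containing f_v

    def anc(self, v):
        # direct ancestors only; the transitive closure is omitted here
        return {u for (u, w) in self.edges if w == v}

    def des(self, v):
        return {w for (u, w) in self.edges if u == v}
```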
FDags can be large in the worst case. However, in practice, their sizes are far from the worst-case ones. (i) The number of index nodes and edges is modest (see Table 8). (ii) A feature pair may have exponentially many subgraph isomorphism embeddings between them; in practice, the average number of such embeddings per feature pair is around 2. (iii) Regarding automorphisms, our experiments show that about 12% of the features have multiple automorphisms. (iv) Each feature pair has around 120 possible compositions on average (indexed in ζ).
Example 10 We constructed the FDag for PubChem and illustrate a partial sketch of it in Fig. 6. Suppose f_10 is the current query. f_13 is a possible increment to f_10 that forms f_22, via f_4. The query suggestion f_22 can be efficiently determined by using FDag as follows.
f_4 is located from anc(f_10). f_13 is in des(f_4). f_22 is retrieved from the compositions of f_10 (i.e., ζ_{f_10}) via an FDag lookup. A trivial query composition is needed only when the composed suggestion is not a feature. The composed suggestion (i.e., f_22) is then ranked against the other possible candidate suggestions (formed by other compositions of f_10). This is efficient because the intermediate results of the structural differences between the compositions of f_10 are recorded in η.

Autocompletion by using FDAG
We end this section by highlighting how FDag optimizes the online ranking (Algorithm 2). Determining the set of possible candidate suggestions Q_C (Line 14) that have large util values (Line 18) is computationally costly. In Line 14 of Greedy_local, the query increments that can be composed with the feature f (at the location λ of q) are fixed. Therefore, FDag indexes all possible compositions C of f. Given a composition c = (f, f', cs, λ_1, λ_2) and a suggestion q', the constraint |q'| − |q| ≤ δ can then be easily checked (Line 15). Then, the util function has the sel and dist components (Lines 16-18). sel can be efficiently estimated from D of FDag. Regarding the dists between q' and the other suggestions constructed from (f, λ), some intermediate results have been indexed in η of FDag. Further, since q' differs from q by at most δ edges, dist can be efficiently derived from η. Hence, T_util is reduced.

Pruning redundant compositions via graph automorphism
A unique problem of subgraph query composition is that compositions can be structurally identical and therefore redundant. These redundant compositions adversely affect the performance of AutoG in two ways. First, generating them is useless work. Second, they are mixed with the useful compositions, and the subsequent ranking of suggestions is then required to eliminate them. In this section, we detail an automorphism-based optimization that prunes them by indexing the automorphisms of features, A, in FDag.
Example 11 (Redundant compositions) Figure 7 shows an example of redundant compositions.Here, the cs is f 4 .
The intuition behind redundant compositions is that the graph increment (e.g., the feature f_48 in Example 11) may "rotate" and be combined to form the same query suggestion. Such "rotations" can be captured by the automorphism relation of the graph increment. Recall that an automorphism of a graph G is an isomorphism of G to itself. Determining A_G is computationally hard in general; no polynomial-time algorithm is known. However, when the vertex degrees are bounded by a constant, there is a polynomial-time algorithm [25] for the graph automorphism problem. Since the sizes of features are often small (e.g., the largest feature of PubChem contains 14 vertices when we set the minimum support of features to 10%, so its degree is no larger than 13), determining the automorphisms of each feature is efficient offline. They are indexed in A of FDag (see Definition 11).
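Since features are small, their automorphisms can even be enumerated by brute force offline. A minimal sketch over a vertex-labeled undirected graph (the representation is ours):

```python
from itertools import permutations

def automorphisms(nodes, edges, label):
    """Enumerate the automorphisms of a small vertex-labeled undirected
    graph by brute force (feasible offline since features are small)."""
    edge_set = {frozenset(e) for e in edges}
    autos = []
    for perm in permutations(nodes):
        mapping = dict(zip(nodes, perm))
        if any(label[v] != label[mapping[v]] for v in nodes):
            continue  # labels must be preserved
        if all(frozenset((mapping[u], mapping[v])) in edge_set
               for u, v in edges):
            autos.append(mapping)  # edges must map to edges
    return autos
```

A labeled triangle has 6 automorphisms, while a labeled path of three nodes has only 2 (the identity and the reversal).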
We elaborate the major steps of the pruning with (f_i, f_j, cs), where cs is a common subgraph of f_i and f_j. cs may have multiple embeddings in f_i and f_j, respectively. Denote such embeddings as M_{cs,f_i}, which are all the possible subgraph isomorphic embeddings of cs in f_i. Without any background knowledge of the existing query q other than f_i ⊆_λ q, we perform two pruning steps exploiting the automorphisms of cs and f_j.

1. We prune redundant compositions resulting from the "rotation" of cs. Suppose there is an automorphic embedding λ ∈ A_cs that relates two node configurations of cs, V_0 and V_1. Then, cs can be embedded in a subgraph s_i of f_i in multiple ways M_{cs,f_i}. Suppose two embeddings λ^0_{cs,f_i} and λ^1_{cs,f_i} of M_{cs,f_i} map the two node configurations of cs (V_0 and V_1) to the same node configuration of f_i. Then, we keep only one of these embeddings. This step reduces the number of compositions generated from M_{cs,f_i}.
2. We prune redundant compositions resulting from the "rotation" of f_j. Suppose there are two embeddings λ_0 and λ_1 of cs into the subgraph s_j of f_j. Denote by V_0 and V_1 the node configurations of f_j that the nodes of cs are mapped to, as specified by λ_0 and λ_1, respectively. If V_0 and V_1 are automorphic, we keep only one such embedding and prune the other.
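Step 1 can be sketched as keeping one representative embedding per automorphism orbit (a toy illustration; embeddings are dicts from cs-nodes to f_i-nodes, and `cs_automorphisms` must include the identity):

```python
def prune_redundant_embeddings(embeddings, cs_automorphisms):
    """Keep one embedding of cs into f_i per automorphism orbit: two
    embeddings are redundant when one equals the other composed with an
    automorphism ('rotation') of cs."""
    kept, seen = [], set()
    for emb in embeddings:
        # canonical key: minimum over the orbit {emb ∘ a : a ∈ A_cs}
        key = min(tuple(sorted((v, emb[a[v]]) for v in emb))
                  for a in cs_automorphisms)
        if key not in seen:
            seen.add(key)
            kept.append(emb)
    return kept
```

Because the automorphisms form a group, two embeddings in the same orbit always produce the same canonical key, so exactly one representative survives.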

Experimental evaluation
This section presents an experimental evaluation of the proposed AutoG framework.We first investigated the suggestion quality via user tests and simulations and then conducted an extensive performance evaluation on popular real and synthetic datasets.In particular, we studied the usefulness of AutoG, the overall performance of AutoG, the effectiveness of the optimizations and the effect of the parameters of AutoG.

Software
We implemented a prototype of AutoG. The interface is shown in Fig. 1. The prototype was mainly implemented in C++, using VF2 [12] for subgraph isomorphism tests and McGregor's algorithm [27] (with minor adaptations) for determining the mces. We used the gSpan implementation from [41] for frequent subgraph mining.

Hardware
We conducted all the experiments on a machine with a 2.67GHz processor and 64GB memory running the Linux OS.All the indexes were built offline once and loaded from a hard disk and were then fully memory resident for online query suggestions.
Datasets We used the datasets and query sets provided by iGraph [16] and followed their default settings. We used two popular benchmark real datasets: (i) PubChem [31], a real chemical compound dataset consisting of 1 million graphs, and (ii) Aids [30] (the AIDS antiviral dataset), which consists of 10,000 graphs. For the synthetic datasets, we used synthetic.10K.E30.D5.L20 and synthetic.10K.E30.D5.L80 (hereafter referred to as Syn-1 and Syn-2), both of which consist of 10,000 graphs. Table 1 shows some characteristics of the datasets: the number of graphs (|D|), the average numbers of vertices and edges (avg(|V|) and avg(|E|)), and the numbers of vertex and edge labels (|l(V)| and |l(E)|).

Query sets
The query sets were taken from [16], with the query size ranging from 4 to 24.Each query set of a particular size contained 100 queries.
For time measurements, we reported the elapsed wall-clock times. It is known that there are large variations in subgraph query times [40]. To avoid the reported times being governed by a few long (or short) queries, we discarded the runtimes that were beyond two standard deviations from the mean. The reported runtime was the average of the remaining runtimes.
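The outlier rule can be sketched as follows (a straightforward implementation of the two-standard-deviation cut):

```python
from statistics import mean, pstdev

def trimmed_avg(times, z=2.0):
    """Average runtime after discarding measurements more than z standard
    deviations from the mean, as described above."""
    mu, sd = mean(times), pstdev(times)
    kept = [t for t in times if abs(t - mu) <= z * sd]
    return mean(kept) if kept else mu
```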

Mining of c-prime features
We ran gSpan to obtain sufficient features for AutoG to compose suggestions.In particular, we set the default minimum support value (minSup) for the real datasets to 10%.We set minSup to 5% for synthetic datasets simply because their frequent subgraphs are relatively scarce.The maximum feature size max L was set to 10 for all datasets.Some statistics of the features are summarized in Table 2.
Default AutoG settings The default maximum increment size (i.e., δ) was set to 5, which is large enough to provide a large number of candidate suggestions. We set the default composability c to infinity unless we specifically studied its effects. There are five parameters (i.e., m, γ, α, k and |q|) of online AutoG processing. In the default setting, we set m to 4, γ and α to 0.5, k to 10 and |q| to 8, unless otherwise specified.

Suggestion quality
Simulations We investigated the qualities of the suggestions via simulations under a large variety of parameter settings.
For each target query, we started with a random 2-edge subgraph. In each step, we called AutoG and then chose a suggestion with the largest size. If no suggestion was useful, we augmented the query by a random edge toward the target query. Each target query set contained 100 queries, which are publicly available [43]. To study suggestion quality, we employed several popular metrics, listed in Table 3. We report some representative results from PubChem in Tables 4, 5 and 6. Table 4 shows the quality metrics of Q8 with various δ. We remark that the same trends were observed on Q12 and Q16. Table 5 shows the quality metrics for various target query sizes. Hit shows that AutoG suggestions were almost always used, and #AutoG shows that the suggestions were used in multiple iterations of query formulation; the numbers of edges added by AutoG were around 30%; and TPM shows that AutoG saved roughly 42% of the mouse clicks in query formulation. Table 6 shows the suggestion quality when we varied k: the qualities increased with k. This is not surprising, because as more suggestions are returned, there is a higher chance that some of them are adopted.
Table 7 shows the quality metrics of Q8 with various α. The results show that the suggestion qualities increased as the value of α increased and were stable when α ≥ 0.1. That is, when the selectivities of the suggestions carried adequate weight in the ranking (e.g., α ≥ 0.1), AutoG produced high-quality suggestions. To obtain helpful suggestions, α may be set to a value greater than 0.1, so that both selectivities and diversities are involved in the suggestion ranking.
User test Next, we conducted a user test with 10 volunteers. Each user was given 2 queries each with high, medium and low TPM values from the simulation. We randomly shuffled these 6 queries. The users were asked to formulate the target queries via the visual aid shown in Fig. 1. They expressed their level of agreement with the statement "AutoG is useful when I draw the query" via a symmetric 5-level agree-disagree Likert scale, where 1 means "strongly disagree" and 5 means "strongly agree". Our results showed that the correlation coefficient between the TPMs and the users' ratings is 0.96 and the p value is 0.002. Therefore, TPM is a good quality indicator for AutoG. The average ratings of the queries with high, medium and low TPM values are 4.55 (between "strongly agree" and "agree"), 2.95 ("neither agree nor disagree") and 1.65 (between "disagree" and "strongly disagree"), respectively.

Index construction performance
Next, we report the performance of building FDag in Tables 8, 9 and 10. The major steps are (i) building the structure of the FDag, (ii) enumerating all feature pair compositions and (iii) precomputing for the mces distances. We elaborate the results below. Table 8 reports the numbers of vertices and edges of FDag. It can be observed that the FDag structures are sparse, and the construction times were small.

Optimization for composition enumeration We enumerated all the possible feature pair compositions as detailed in "Appendix 4". The results are reported in Table 9. A fact from "larger fs" is that a large portion of the graphs formed by small features contained new large features. "Before opt" reports the total numbers of possible feature pair compositions. It can be observed that while the numbers of features |V_{FDag}| were modest, the numbers of possible compositions generated (using the default δ value) were large (e.g., millions of compositions for Aids, PubChem and Syn-2). "After autom. opt" shows the numbers of compositions after applying the automorphism-based optimization presented in Sect. 6. The optimization determined that, on average, over 10% of the compositions were redundant and pruned them; in particular, 30% of the compositions for the Aids dataset were pruned. "After nec. cond." reports the numbers of useless compositions pruned by the necessary condition introduced in Sect. 4.2. It further reduced the compositions remaining after the automorphism-based optimization by 13% and 45% for the Aids and PubChem datasets, respectively. This optimization prunes few compositions of the synthetic datasets, since those graphs were randomly generated and such patterns could not be found.

Graph distance precomputation
The precomputation of some auxiliary structures for the graph distance between each pair of compositions of each node in FDag is reported in Table 10. We report the numbers of composition pairs and the precomputation times with the technique proposed in Sect. 4.3.2. The precomputed results are used in online processing.
Varying δ The δ value determines the number of candidate suggestions to be ranked; the larger δ is, the more suggestions are to be ranked. In Fig. 8, we show the effectiveness of the automorphism-based optimization as δ was varied. The results show that the effectiveness was stable as δ was varied on all datasets. Next, we further investigated the effects of δ on the pruning of empty suggestions using the necessary condition. Figure 9 shows that this optimization was more effective on the real datasets when the δ values were larger. This reflects that when one uses large increments, the resulting suggestions deviate more from those that could retrieve some data graphs. The synthetic datasets were skipped, as this optimization was mainly effective on the real datasets.

Online autocomplete performance
We conducted a detailed evaluation of online AutoG processing. We report the average response time (art) of AutoG under the default setting in Fig. 10. For the synthetic datasets, we obtained short arts, as their feature pair compositions were relatively few. The arts of Aids and PubChem were slightly shorter than 1 s and 4 s, respectively. Thus, the AutoG time is generally short. We remark that the default value of c is set to infinity, leading to the longest arts. As c decreases, the art decreases, too.
Varying m We varied m from 1 to 16 to study its impact on the overall art. The larger m is, the smaller the sample sizes for estimating the selectivities in the ranking function. The results are presented in Fig. 11. m has negligible effects on the Aids, Syn-1 and Syn-2 datasets, as selectivity estimation was not the performance bottleneck there. However, the candidate answer sets of queries on PubChem were large, and selectivity estimation contributed notable computation time to query suggestions. Hence, as m increased, the art of PubChem decreased.
Varying γ and α We varied γ from 0 to 1. Figure 12 shows the effects of γ on the arts. The art was always less than 4.2 s. We also noticed that the art increased slightly as γ approached 1. The higher the value of γ, the more overlapping was allowed and the more features were returned by query decomposition. With more features of the queries, AutoG took more time to rank the possible query suggestions. We verified that, since α is only a weight in the util function, it did not have a noticeable impact on the art, as shown in Fig. 13.
Varying k We varied k from 10 to 50 and report the arts for each dataset in Fig. 14. The results show that the times were less than 1.5 s for Aids, Syn-1 and Syn-2, and shorter than 4.7 s for PubChem. As expected, the overall art increased with k, because AutoG determined and compared more possible suggestions when k was large.
Varying query sizes Figure 15 shows the art as the query size increased. For queries of sizes smaller than 16, AutoG finished within 10 s. For queries of size 24, the art can be as long as 40 s (e.g., for PubChem). This is due to the NP-hardness of mces in global diversification (i.e., the runtimes are exponential in the query sizes). However, it should be remarked that the result sets of such large queries are almost always small (e.g., roughly 25 graphs on average for Q24 of PubChem), which is often humanly manageable; thus, the need for AutoG is arguably smaller than for queries of smaller sizes.
Varying c When the value of c was decreased, the number of features in FDag was reduced, too. The numbers of c-prime features for varying c are reported in Table 11. Figures 16 and 17 show the effects on the suggestions in terms of art, user intent value, and their selectivities and diversities on PubChem, respectively. As expected, when the values of c were small, both selectivities and diversities were low. c was easy to set, because these quantities saturated when c reached a few hundred. In the default settings, c was infinity. However, if we sought a smaller art, we could set c to a smaller value. In summary, we observed from Figs. 10 to 15 that the proposed AutoG framework can interactively determine suggestions under a large variety of parameter settings. We have conducted additional experiments to investigate the detailed performance of AutoG under different settings and other implementations; the results are presented in "Appendix 3".

Related work
There have been some innovative works on query autocompletion for keywords (e.g., [2,29,39]), which we omit for brevity. This section focuses on representative works related to graph databases and their usability.

Graph features Various graph features have been proposed to index graphs (e.g., [10,13,35,42,45]) and to enhance query processing by filtering non-answers efficiently. In comparison, c-prime features are defined with composability, and how they are connected to form larger graphs is indexed. Hence, they assist users in composing their queries. Mottin et al. [28] studied the problem of graph query reformulation. The reformulated queries maximally cover the results of the current query. This approach assumes that all query results are relevant. When queries are small (e.g., queries of size 8 for PubChem), the number of answers is 30K graphs on average, according to our simulation, and users may not be interested in all of them. In contrast, AutoG ranks suggestions based on their selectivities and diversities. Li et al. [14] proposed to extend keyword search autocompletion to XML queries. In [23], structures are associated with the query keywords. However, keyword searches are inherently difficult (if possible at all) for expressing structural queries. LotusX provides position-aware autocompletion capability for XML [24]. An autocompletion learning editor for XML provides intelligent autocompletion [1]. In contrast, this paper focuses on subgraph queries (structural search) for graphs.

Visual query composition
To alleviate the burdens of structural query composition, visual aids (or GUIs) have been studied, especially in the context of XML queries [7,11,33]. For example, graphical constructs of XML queries (XML-GL) have been proposed [11]. QURSED provides a query editor for building reports [33]. XML Query By Example (XQBE) provides tools to express graphical constructs of complicated XML queries [7]. One possible reason why GUIs have received significant research attention is that XML data are structured: their queries are tedious to compose, and they are naturally visualized as pictures. The same arguments apply to graph databases [4,18], but their data and query languages are even more complex. GQBE presents a system that allows users to query knowledge graphs by example [20]. The support of interactive simple feedback [3] at opportune times [5] via a GUI differs from query autocompletion.
Exploratory search Exploratory search is known to be useful for enhancing interactions between users and search systems (e.g., [26]). The idea of exploratory search can obviously be applied to graph data. Query autocompletion is consistent with exploratory search in that it allows users to construct their queries incrementally and explore their intermediate query results.

Conclusion
This paper presents a subgraph query autocompletion framework, namely AutoG, that provides query suggestions to users as they formulate their queries. The logical units of query increments are c-prime features, which can be composed from smaller features in no more than c ways. Existing structural features of graphs can be adapted to the concept of c-prime features. We have proposed query decomposition, candidate suggestion generation and ranking. We have proposed an index called FDag and optimization methods. We conducted extensive experiments on both real and synthetic datasets. The results showed that AutoG saves about 40% of users' mouse clicks in query formulation, that the response time of suggestions is short, and that the optimizations are effective under a large variety of settings. In future work, we are investigating AutoG for users with different domain knowledge and adopting machine learning techniques on query logs to determine query templates and parameters for AutoG.
Fig. 18 An illustration of the query suggestions generated from an Mis instance

For each vertex v in V, we have a query suggestion q_v. We construct Q_V such that each query q_v in Q_V has exactly |Q_V| edges. The structure of each query is a star with a common first edge (v_a, v_b), and the other edges encode the following (also illustrated with Fig. 18): if (v_i, v_j) is an edge of E, an edge (v_b, v_{i,j}) is introduced to both q_{v_i} and q_{v_j}, so that it appears in the mces between q_{v_i} and q_{v_j}; otherwise, the mces between q_{v_i} and q_{v_j} is {(v_a, v_b)} only.
The maximum independent set is of size at most |V|/2. Therefore, we invoke Rsq on Q_V, where k ranges from 1 to |V|/2 and α is set to 0; that is, only the diversity component of the ranking function is considered.
Case (1) Suppose Q_V is a solution of Rsq. If for some q_i, q_j ∈ Q_V, mces(q_i, q_j) is not just {(v_a, v_b)}, then there does not exist a Q'_V such that |Q'_V| = |Q_V| and, for all q_i, q_j ∈ Q'_V, mces(q_i, q_j) is just {(v_a, v_b)}. This is because such a Q'_V would have been ranked higher according to util (i.e., Q'_V is more diversified than Q_V). By the reduction above, there is an edge (v_i, v_j) in E, and the corresponding V' is not an Is.

Case (2) Suppose Q_V is a solution of Rsq and, for all q_i, q_j ∈ Q_V, mces(q_i, q_j) is {(v_a, v_b)}. By the reduction above, there is no edge between v_i and v_j, for all v_i, v_j. Thus, the corresponding V' is an Is.

Putting these together, let Q_V be the largest set returned by invoking Rsq for k ranging from 1 to |V|/2 whose corresponding V' is an Is. Suppose some Q'_V returned by Rsq were larger than Q_V. Then Q'_V belongs to Case (1), and by Case (1), no Is of size |Q'_V| can be obtained. Therefore, V' is the maximum independent set.
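The reduction above can be sketched in a few lines of code. This is an illustrative sketch, not the paper's implementation: star queries are modeled as plain edge sets, the shared first edge is written ("a", "b"), and the set intersection of two star queries stands in for their mces (which is exact for stars built this way); all function names are hypothetical.

```python
from itertools import combinations

def build_queries(vertices, edges):
    """For each vertex v, build a star query q_v as a set of edges.
    All queries share the first edge (a, b); for every graph edge
    (i, j), a leaf edge (b, x_ij) is added to BOTH q_i and q_j."""
    queries = {v: {("a", "b")} for v in vertices}
    for i, j in edges:
        leaf = ("b", f"x{min(i, j)}_{max(i, j)}")
        queries[i].add(leaf)
        queries[j].add(leaf)
    return queries

def overlap(q1, q2):
    # For star queries built this way, the common edge set plays
    # the role of mces(q1, q2).
    return q1 & q2

def is_independent_choice(queries, chosen):
    """True iff every pair of chosen queries overlaps only in (a, b),
    i.e. iff the chosen vertices form an independent set."""
    return all(overlap(queries[u], queries[v]) == {("a", "b")}
               for u, v in combinations(chosen, 2))
```

For the path graph 1-2-3, the non-adjacent pair {1, 3} is an independent choice while the adjacent pair {1, 2} is not, mirroring Cases (1) and (2) of the proof.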

Appendix 3: Additional experiments

Suggestion qualities with different underlying definitions in AutoG
Quality metrics under other mces distance metrics For illustration purposes, the paper adopts the maximum common edge subgraph (mces) for dist (see BS in Definition 8). It is possible to plug other edge-based distance metrics into the AutoG framework (without modifications) to represent the "intra-dissimilarity" between a pair of suggestions. In this experiment, we report the quality metrics of the suggestions when AutoG uses two other mces distance metrics.
The two distance metrics are presented in [22,36,38], denoted as WSKR and FV. The first distance metric (WSKR) normalizes by the size of the union of the two graphs, instead of the size of the larger graph, to account for variations in graph sizes [38]. The second distance metric (FV) is based on the maximum common subgraph and the minimum common supergraph of the two graphs; it therefore takes into consideration both the superfluous and the missing structural information of the two graphs [36], and we normalize it accordingly. AutoG achieved similar stable qualities under these different distance metrics and parameter settings. For presentation brevity, we report the results of suggestion qualities in terms of #AutoG and TPM only. Tables 12, 13 and 14 show that regardless of the distance adopted, the suggestions were used in multiple iterations of query formulation. BS performed slightly better than FV and WSKR. Tables 15, 16 and 17 show similar trends for BS, FV and WSKR, while BS almost always performed the best. Therefore, users may pick the distance metric that is most intuitive for their applications.
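The three edge-based distances can be sketched as follows. This is a hedged sketch, not the paper's exact formulas: the normalizations are inferred from the textual descriptions of BS (Definition 8), WSKR [38] and FV [36], with edge counts as graph sizes, union size taken as |E1| + |E2| - |mces|, and the minimum-common-supergraph size approximated analogously from the maximum common subgraph; the function names are hypothetical.

```python
def dist_bs(mces_size, e1, e2):
    # Baseline (BS): normalize the common-structure size by the
    # larger of the two graphs (sizes are edge counts).
    return 1.0 - mces_size / max(e1, e2)

def dist_wskr(mces_size, e1, e2):
    # WSKR-style: normalize by the size of the union of the two
    # edge sets, |E1| + |E2| - |mces|, to better reflect size skew.
    return 1.0 - mces_size / (e1 + e2 - mces_size)

def dist_fv(mcs_size, e1, e2):
    # FV-style: gap between the minimum common supergraph and the
    # maximum common subgraph, normalized by the supergraph size;
    # |min common supergraph| is approximated as |E1| + |E2| - |mcs|.
    sup = e1 + e2 - mcs_size
    return (sup - mcs_size) / sup
```

All three are 0 for identical graphs and grow toward 1 as the shared structure shrinks, which is why they can be swapped into dist without changing the rest of the framework.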

Suggestion qualities of the c-prime features on top of gIndex
In the paper, c-prime features are defined (Definition 7) with frequent features (Definition 12). As discussed, c-prime features are orthogonal to other features. As a proof of concept of integrating other features into AutoG, we implemented c-prime features on top of the discriminative features proposed in gIndex, the seminal work on using features for subgraph query performance.
In a nutshell, we counted the composability of the discriminative features and indexed those that are c-prime. We used the implementation from iGraph [16] and adopted the default parameter values for gIndex. In particular, the support threshold was set to 10%, the maximum feature size maxL was 10, and the discriminative ratio γ_min was 2. We implemented the same size-increasing function as in [42]. Under this setting, we obtained 2370 discriminative features from the PubChem dataset. We then constructed the FDag index as before.
We compared the suggestion qualities of AutoG using the proposed c-prime features and c-prime features on top of discriminative features, simply denoted as gIndex. We report the suggestion qualities from simulations in Tables 18, 19 and 20. The results showed that such suggestions were still somewhat useful. As expected, when compared to the results in Tables 4, 5 and 6, the proposed c-prime features gave clearly higher-quality suggestions than those obtained with discriminative features. More specifically, when we varied δ, the average values of Hit (%), #AutoG, AutoG |E| and TPM (%) of our proposed c-prime features were 99%, 2.8, 3.6, and 48%, whereas those of gIndex were 46%, 0.5, 1.2, and 19%. When we varied |q|, those quality metrics of the proposed features were 99%, 3.2, 5.0, and 44%, whereas those of gIndex were 76%, 1.1, 3.0, and 27%. Similarly, when we varied k, those quality metrics of the proposed features were 93%, 1.9, 2.7, and 36%, whereas those of gIndex were 49%, 0.5, 1.4, and 21%. The reason is simple: the design goal of gIndex is to efficiently prune non-answer graphs of a query, not query autocompletion.

Online performance breakdowns
Next, we present a detailed performance study of each major step of the online processing.

Query decomposition
We report that the query decomposition phase always took less than a few milliseconds, for all queries and datasets under all the aforementioned parameter settings.

Local ranking
The runtimes of the local ranking phase under various parameter settings are presented in Figs. 19, 20, 21, 22, 23 and 24. Comparing Fig. 19 to Fig. 10, we observed that local ranking was the bottleneck of online processing. The reason is that local ranking compared many pairs of candidate suggestions, even though they were indexed by FDag. Consistent results can be observed from the experiments in which the parameters m, γ, α and top-k were varied (e.g., Figs. 20, 11), with the following exception: under the default setting, the runtimes of the local part increased slightly with k, as shown in Fig. 23. We also note from Fig. 24 that the art of the online processing for local ranking increased sub-linearly with |q|.

Global ranking
The runtimes of the global ranking phase under a large variety of parameter settings are reported in Figs. 25, 26, 27, 28, 29 and 30. When varying m, γ and α, the times were quite stable. We noted from the experimental results that the global diversification took less than 60 ms. As |q| increased, the runtimes grew for two reasons: (1) the queries may be decomposed into large features, and computing the mces of large features online is known to be costly; and (2) large queries may be decomposed into more features, which in turn resulted in more mces calls.
From the last experiment, we found that the art of online processing was determined by either the local or the global ranking, which are in turn dependent on |q|. We further illustrate the relation between the two by reporting the performance breakdown on all datasets (see Figs. 31, 32, 33, 34). For Aids and PubChem, we observed that the runtimes of global ranking increased much faster than those of local ranking, due to the costly online mces computations. For Syn-1 and Syn-2, the arts of the two phases exhibited linear trends.

Accuracy of selectivity estimation We classified the estimated queries into three categories: "empty" denotes queries that were estimated to be non-empty but are in fact empty; "large error" denotes estimated queries whose errors were larger than 100%; and "small error" denotes the remaining queries. Figure 35 shows the percentages of each of the three categories under different values of the sampling step m. About 80% of the queries were estimated correctly. Figure 36 reports that the mean estimation errors of "small error" were all below 1.2%. Thus, the selectivity estimation was accurate.
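The three-way classification used in the accuracy experiment can be sketched as follows, assuming each query is represented by a pair of (estimated, actual) result-set sizes; the function name and the handling of queries correctly estimated as empty (counted under "small error") are our assumptions.

```python
def classify_estimates(queries):
    """Classify (estimated, actual) result-set sizes into the three
    categories of the experiment: 'empty' (estimated non-empty but
    actually empty), 'large error' (relative error > 100%), and
    'small error' (everything else)."""
    counts = {"empty": 0, "large error": 0, "small error": 0}
    for est, actual in queries:
        if actual == 0 and est > 0:
            counts["empty"] += 1
        elif actual > 0 and abs(est - actual) / actual > 1.0:
            counts["large error"] += 1
        else:
            counts["small error"] += 1
    return counts
```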
Effectiveness of structural trimming for mces Fig. 37 reports the average speedups due to the trimming technique introduced in Sect. 4.3.2 on all datasets. There were at least three orders of magnitude of speedup. We remark that some large queries could not finish without trimming; these queries are excluded. Recall that mces was a performance bottleneck. With the trimming techniques, we report in Fig. 38 that the costs of mces were around 30% (respectively, 60%) of the global ranking costs for the synthetic datasets (respectively, the real datasets).

Appendix 4: The FDAG construction
In this appendix, we present the construction algorithm of FDag (shown in Algorithm 3). The details of Algorithm 3 are as follows. First, we sort F in ascending order of edge numbers (Line 2). Then, we process the features one by one (Line 3). We create a node v_f in FDag for each f ∈ F, and compute the automorphism relation of f (A_f). For index nodes v_f and v_{f'}, we perform a subgraph isomorphism test between f and f' (Line 9). If subgraph isomorphism relations exist, we add the edge (v_f, v_{f'}) to FDag and associate the subgraph embeddings with the edge (Lines 10-11). Finally, we generate anc and des from the FDag structure (Line 12). Next, Line 13 enumerates the possible compositions of each feature pair and determines the intermediate results for computing the structural difference between compositions (as presented in Algorithm 4).
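A minimal sketch of the construction idea (not Algorithm 3 itself) follows: features are sorted by edge count, and a DAG edge is added whenever a smaller feature is contained in a larger one. For brevity, features are modeled as edge sets and containment is a set-inclusion check, standing in for the subgraph isomorphism tests and embedding bookkeeping of the real algorithm; all names are hypothetical.

```python
def build_fdag(features):
    """Sketch of the FDag construction: sort features by edge count,
    then add a DAG edge (f, f2) whenever the smaller feature f is
    (as a simplification) an edge-subset of the larger feature f2.
    A real subgraph-isomorphism test replaces the subset check."""
    feats = sorted(features, key=lambda f: len(f["edges"]))
    dag = {f["id"]: [] for f in feats}
    for i, small in enumerate(feats):
        for big in feats[i + 1:]:
            if set(small["edges"]) <= set(big["edges"]):
                dag[small["id"]].append(big["id"])  # small ⊑ big
    return dag
```

Sorting first guarantees that every edge points from a smaller feature to a larger one, so the result is acyclic by construction, matching the ancestor/descendant (anc/des) functions derived afterwards.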
Feature pair composition enumeration Algorithm 4 requires further elaboration. It takes FDag and the maximum increment size as input, and outputs a set of feature pair compositions and some auxiliary data (i.e., f_{ij} in Line 5 and F_l in Line 12). We highlight some important steps before the details. (i) Line 5 shows that a query composition (f_i, f_j, cs, λ_i, λ_j) forms f_{ij}, and f_{ij} itself can be a feature. We record it in ζ; we count the occurrences of f_{ij} in all ζ's to obtain f_{ij}'s composability. (ii) While f_{ij} may not be a feature, it may contain some features other than f_i and f_j. We record such features in F_l. It is known that in feature-based query processing, the more features (F_q) the query has, the more accurate the candidate query answer set (D_q) is, because D_q = ∩_{f∈F_q} D_f. (iii) The parameter δ is the threshold on the size increment in a query suggestion. The number of candidate suggestions increases exponentially with δ. Therefore, to ensure an interactive response, we may set a modest δ (e.g., its default is 5 in our experiments).
The details of Algorithm 4 are as follows. It starts the enumeration from a feature (cs) and composes larger query graphs from two descendants (f_i and f_j) of cs (Lines 1-2). Line 3 checks whether the increment is smaller than δ. In Line 4, we iterate through each embedding of cs in f_i and f_j, respectively. In Line 5, we compose a larger graph (denoted as f_{ij}) from f_i and f_j via cs. If f_{ij} is a feature, then Algorithm 4 has just detected a possible way to compose f_{ij}. In Lines 6-7, we employ the techniques in Sect. 4.2 to prune the empty queries.
Lines 8-15 compute further information to optimize the ranking procedure (Sect. 4.3). Lines 8-9 check whether f_{ij} is a feature. If yes, f_{ij} is recorded, for determining the composability of f_{ij}. Otherwise, Lines 10-12 determine whether any features (other than f_i and f_j) are embedded in f_{ij}; specifically, F_l = {f | cs ⊆_λ f ∧ f ⊆_λ f_{ij}}, where f ∈ F. F_l is the set of features contained in f_{ij}. As presented in Sect. 4.3.2, the selectivity of a query is estimated by the intersection of the candidate answers of the features of the query. When F_l and f_{ij} are used in such estimation, it is more accurate than the estimation with only f_i and f_j. Lines 13-15 compute the auxiliary structures that optimize the online mces distance computation (Sect. 4.3.2).
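The enumeration idea of Algorithm 4 can be sketched under strong simplifications: features are edge sets, the "descendants" of cs are its proper supersets, and each pair is glued by a single edge-set union rather than iterating over the embeddings λ_i, λ_j; the composability counter plays the role of ζ. All names are hypothetical.

```python
from itertools import combinations
from collections import Counter

def enum_compositions(features, delta):
    """Sketch of feature pair composition enumeration: for each
    feature cs, take pairs of features containing cs (modeled as
    proper edge supersets), glue them over cs by unioning their
    edges, and count how many distinct ways each composed graph
    arises (its composability)."""
    ways = Counter()
    feats = {f["id"]: set(f["edges"]) for f in features}
    for cs in feats.values():
        desc = [fid for fid, e in feats.items() if cs < e]  # proper supersets
        for a, b in combinations(desc, 2):
            composed = frozenset(feats[a] | feats[b])
            # respect the size-increment threshold delta
            if len(composed) - len(cs) <= delta:
                ways[composed] += 1
    return ways
```

A composed graph reachable through many (cs, f_i, f_j) triples gets a high count, which is exactly the composability that the c-prime definition thresholds at c.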

Workflow of AutoG (figure residue): (1) submit query q and user intent u; (2) decompose query q into a set of embeddings M_q of the c-prime features F_q in q, where F_q ⊆ F (a feature set mined offline); (3) generate all candidate suggestions Q_C = {q'_1, ..., q'_n} using M_q over FDag; (4) rank the top-k candidate suggestions; (5) review Q_k = {q'_1, ..., q'_k}.

2. In each iteration, the suggestion from Q_C that gives the largest user intent value is added, which takes O(|Q_C| × T_util) time; the iteration is repeated k times. We remark that computing the user intent value of Q_{f,λ,k} requires computing the structural diversity of Q_{f,λ,k}, which involves the dist function; we adopt the technique presented at the end of Sect. 4.3.2 to optimize this step. 3. The last term is the time for outputting the top-k suggestions Q_{f,λ,k} of the feature embedding (f, λ).

Fig. 6 Major structures of the (partial) FDag of PubChem

Fig. 7 Example of redundant compositions

FDag structure The columns |V_FDag| and |E_FDag| in Table

Algorithm 4 Enumerate Feature Pair Compositions. Input: FDag I and the maximum increment size δ. Output: the ζ function of I.

2. ..., which is of modest number. In each iteration, the suggestion from Q' that makes the value of util of Q_k the largest (computed in T_util time) is added to Q_k. The iteration is repeated k times. 3. Finally, the last term is the time for outputting the top-k suggestions Q_k.
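The greedy iteration described above can be sketched generically. This is an assumption-laden sketch, not the paper's code: util is any set-scoring function supplied by the caller, and the scan-all-candidates-per-iteration structure makes the O(k × |Q_C| × T_util) cost explicit.

```python
def greedy_topk(candidates, util, k):
    """Sketch of the greedy ranking step: repeatedly add the candidate
    that maximizes util of the partial result set Q_k. Each of the k
    iterations scans all remaining candidates, so the cost is
    O(k * |candidates| * T_util)."""
    qk, remaining = [], list(candidates)
    for _ in range(min(k, len(remaining))):
        best = max(remaining, key=lambda c: util(qk + [c]))
        qk.append(best)
        remaining.remove(best)
    return qk
```

For illustration, with util as a plain sum, greedy_topk([3, 1, 4, 2], lambda s: sum(s), 2) returns [4, 3]; in AutoG, util would instead combine selectivity with the dist-based diversity of the partial set.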

Table 1
Some characteristics of the datasets

Table 2
Statistics of the feature sets

Table 3
Quality metrics and their meanings

Table 8
FDag structure construction

Table 10
Graph distance precomputation

Table 11
Number of c-prime features by varying c