
1 Introduction

A large amount of semantic data is available in the LOD Cloud. These datasets can be reached with SPARQL queries via SPARQL endpoints, but non-expert users have trouble finding these endpoints: they need to know the URL of each endpoint and which endpoint stores the data they want to use. Once the user has found an endpoint, another problem arises: the user does not know the structure of the dataset, yet this structure is needed to write a SPARQL query. And even when a non-expert user has found an endpoint and knows the structure of the dataset, the question remains how to write the SPARQL query more easily.

Federated systems give a solution to the first two problems. A federated system knows the URLs of the endpoints, and it knows which endpoint can answer a given SPARQL query. A federated system consists of several components. It has a query parser that splits the user's query into subqueries. It then searches for an endpoint for each subquery; this phase is the source selection phase. Finally, it assembles the answer from the partial results that come from the endpoints. Each component has its challenges. In source selection, for example, a central question is how to decide which endpoint can answer a subquery. Two techniques are known for this: the catalog technique and the ASK technique. The catalog technique stores information about the endpoints, for example all predicates that are available on each endpoint; this information can be collected with a simple SPARQL query. The catalog needs to be up-to-date, so the system has to query the endpoints to refresh the information, and the question is how and when to do this. Our solution uses the recommendation system: it has predicate information about the endpoints, and we use this information to update the catalog.
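For illustration, such a predicate catalog entry could be collected as follows. This is a minimal sketch with Apache Jena; the class name, the LIMIT and the use of the public DBpedia endpoint are our assumptions, not part of the paper's system.

```java
import org.apache.jena.query.QueryExecution;
import org.apache.jena.query.QueryExecutionFactory;
import org.apache.jena.query.ResultSet;

import java.util.HashSet;
import java.util.Set;

public class CatalogBuilder {
    // Collects the distinct predicates of one endpoint for the catalog.
    public static Set<String> collectPredicates(String endpointUrl) {
        String query = "SELECT DISTINCT ?p WHERE { ?s ?p ?o } LIMIT 10000";
        Set<String> predicates = new HashSet<>();
        try (QueryExecution qe =
                 QueryExecutionFactory.sparqlService(endpointUrl, query)) {
            ResultSet rs = qe.execSelect();
            while (rs.hasNext()) {
                predicates.add(rs.next().getResource("p").getURI());
            }
        }
        return predicates;
    }

    public static void main(String[] args) {
        System.out.println(collectPredicates("http://dbpedia.org/sparql"));
    }
}
```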

The ASK technique does not use any stored information about the datasets of the endpoints. Instead, it queries the endpoints every time with simple SPARQL ASK queries. An ASK query returns a plain true/false value that indicates whether the endpoint can answer the subquery. The data used to choose the endpoints are always up-to-date, because the queries run on the live endpoints. The disadvantage of the ASK technique is that it queries every endpoint every time. Our solution is closer to the catalog technique: we reduce the number of candidate endpoints with the information that comes from the recommendations. When we get a recommendation, we already know which endpoint can answer the subquery. In this paper we concentrate only on the source selection problem.
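A single probe of this kind is a one-line query per triple pattern. The following is a minimal sketch with Apache Jena; the helper name and the example pattern are our assumptions.

```java
import org.apache.jena.query.QueryExecution;
import org.apache.jena.query.QueryExecutionFactory;

public class AskProbe {
    // Returns true if the endpoint has at least one match for the pattern,
    // e.g. pattern = "?s <http://xmlns.com/foaf/0.1/name> ?o".
    // A sketch only; handling of unreachable endpoints is omitted.
    public static boolean canAnswer(String endpointUrl, String pattern) {
        String ask = "ASK { " + pattern + " }";
        try (QueryExecution qe =
                 QueryExecutionFactory.sparqlService(endpointUrl, ask)) {
            return qe.execAsk();
        }
    }
}
```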

The data stored in the LOD Cloud have an important property: a piece of data can link to data stored in other datasets. This property holds for semantic data in general. We can therefore write queries that use several endpoints and connect their data; these queries are called federated queries. Since SPARQL 1.1 this is possible with the SERVICE keyword, which takes an endpoint URL and the triple patterns to be evaluated on that endpoint; shared variables connect the triple patterns within the query. As mentioned earlier, we need to know the URL of each endpoint and the structure of its dataset, and these requirements make query writing complicated. The systems presented above solve these problems as well: when we write a SPARQL 1.0 query (without SERVICE), the federated system decides which endpoints answer each subquery and then merges the partial results into the final answer. A further problem is that we do not know the datasets of the endpoints. Recommendation systems give a solution to these problems. Some recommendation systems work like autocompletion in an SQL environment: this technique offers possible values that match a typed prefix. For example, when we start writing a property, the system offers the properties that begin with the typed prefix. In this case, the user already needs to know some prefix before the system can offer anything.
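For illustration, a federated query of this kind could look as follows; the second endpoint URL is a made-up placeholder. With Apache Jena ARQ the SERVICE clauses are resolved on the client side, so the query can be run against an empty local model. This is a sketch, not the setup used in the paper.

```java
import org.apache.jena.query.QueryExecution;
import org.apache.jena.query.QueryExecutionFactory;
import org.apache.jena.query.QueryFactory;
import org.apache.jena.query.ResultSetFormatter;
import org.apache.jena.rdf.model.ModelFactory;

public class FederatedExample {
    public static void main(String[] args) {
        // Hypothetical query: birth places from DBpedia joined via ?place
        // with labels from a second (placeholder) endpoint.
        String federated =
            "PREFIX dbo: <http://dbpedia.org/ontology/>\n" +
            "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n" +
            "SELECT ?person ?label WHERE {\n" +
            "  SERVICE <http://dbpedia.org/sparql> {\n" +
            "    ?person dbo:birthPlace ?place .\n" +
            "  }\n" +
            "  SERVICE <http://example.org/sparql> {\n" +
            "    ?place rdfs:label ?label .\n" +
            "  }\n" +
            "} LIMIT 10";
        try (QueryExecution qe = QueryExecutionFactory.create(
                 QueryFactory.create(federated),
                 ModelFactory.createDefaultModel())) {
            ResultSetFormatter.out(qe.execSelect());
        }
    }
}
```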

Another solution offers triple patterns to the user based on the partially written query. In this case, the user only needs to choose from the offered triple patterns to build a SPARQL query. In our previous work [4] the recommended triple patterns come from a federated system, so the final query is a federated query. The system obtains the necessary information from the recommendations, and when the query is finished it can reduce the number of endpoints needed.

The system we presented in our previous paper first queries the possible types from the endpoints. The user can choose from this list the types he wants to use. After that, the system asks the endpoints about these types: it queries the possible predicates of each type. This works with simple SPARQL queries, so the endpoints can answer them easily. A predicate typically has an rdfs:range property, which gives more information about a dataset: it specifies the range of the predicate. Since the range is itself a type, we can produce further recommendations from this information, because we can query the predicates of that type as well. This extra recommendation gives the user more help in writing the SPARQL query.

A simple SPARQL query has the following structure: \(SELECT + variables + WHERE + conditions\). The variable part contains the variables we want to ask for. The condition part contains triple patterns that filter the results; every result row must fulfill the conditions. The recommendation system offers these triple patterns to the user; the federated system uses them to build the subqueries and to answer the final query.
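As a small illustration of this structure, the following sketch parses a query with Jena and prints its two parts; the query text itself is only an example.

```java
import org.apache.jena.query.Query;
import org.apache.jena.query.QueryFactory;

public class QueryParts {
    public static void main(String[] args) {
        String text =
            "PREFIX foaf: <http://xmlns.com/foaf/0.1/>\n" +
            "SELECT ?name WHERE { ?p a foaf:Person . ?p foaf:name ?name }";
        Query q = QueryFactory.create(text);
        System.out.println("variables:  " + q.getProjectVars());  // [?name]
        System.out.println("conditions: " + q.getQueryPattern()); // the two triple patterns
    }
}
```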

In this paper, we show that the federated system can be configured with the help of the recommendation system. We receive the recommended triples and use their predicates for the evaluation. The evaluation needs less time when only the necessary endpoints are selected.

The rest of the paper is organized as follows. Section 2 introduces the cost model of the evaluation in federated systems and our solution. Section 3 surveys related work on federated systems and SPARQL GUIs. Section 4 presents the challenges addressed in this paper. Section 5 presents the extended recommendation algorithm and the cost of this technique. Section 6 evaluates the algorithm and the techniques with some statistics. Finally, Sect. 7 summarizes our results.

2 Preliminaries

In this section we introduce some basic concepts. We present the source selection strategies and a cost model that represents the number of requests necessary to evaluate a federated query.

2.1 Source Selection

As mentioned earlier, the query parser divides the query into subqueries, and the source selection searches for the endpoints that can answer these subqueries. This is an important part of a federation system. There are two possible methods to choose endpoints: the catalog technique and the ASK technique. The catalog technique stores information about the endpoints and uses this information to choose among them. If some of this information is out of date, the system cannot answer the SPARQL query; therefore, the system needs to update its information about the endpoints. The ASK technique runs an ASK query to learn whether an endpoint can answer the subquery, and it queries all the endpoints for every subquery. This technique uses only the URLs of the SPARQL endpoints. Its disadvantage is that it issues a lot of queries; its advantage is that it does not need to store information about the endpoints.

2.2 Query Cost Model

We now present a cost model for federation systems. The model represents the number of queries needed for the evaluation. First we present a basic model; later we present the specific models for the source selection techniques. The model contains a configuration part and a source selection part. The configuration part is the number of queries needed to configure the federated system. The source selection part is the number of queries needed to select the endpoints and to evaluate the subqueries. Let n be the number of endpoints.

$$\begin{aligned} Cost_{MaxEval}(n) = Cost_{conf}(n) + Cost_{subQuery}(n) \end{aligned}$$
(1)
$$\begin{aligned} Cost_{conf}(n) = (\sum _{i=1}^{n} Cost_{subConf}) * count(triple) \end{aligned}$$
(2)
$$\begin{aligned} Cost_{subQuery}(n) = (\sum _{i=1}^{n} Cost_{EPQuery}) * count(subQuery) \end{aligned}$$
(3)

Equation 1 shows the maximal number of requests needed to evaluate a federated query. The cost depends on how many endpoints are configured in the system (n), because in the worst case all endpoints have to be configured and all endpoints have to be queried during the evaluation. The first part of the cost (Eq. 2) is the configuration part: the system collects information about each endpoint (\(Cost_{subConf}\)), which is later used by the source selection. This part also depends on the number of configured endpoints; in the worst case the system uses, and therefore configures, all of them. The second part of the cost (Eq. 3) is the querying part, which covers the evaluation of the subqueries (\(Cost_{EPQuery}\)). The worst case occurs when every subquery runs on every endpoint, but the system usually does not use all endpoints, because the source selection decreases their number.
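As a small worked example with illustrative numbers (our assumption, not a measurement from the paper): let \(n = 5\) endpoints, \(count(triple) = 3\), \(count(subQuery) = 3\), and unit costs \(Cost_{subConf} = Cost_{EPQuery} = 1\). Then

$$\begin{aligned} Cost_{conf}(5) = 5 \cdot 3 = 15, \quad Cost_{subQuery}(5) = 5 \cdot 3 = 15, \quad Cost_{MaxEval}(5) = 30. \end{aligned}$$

If the source selection reduces the candidates to \(m = 2\) endpoints, the querying part drops from 15 to \(2 \cdot 3 = 6\) requests.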

3 Related Work

3.1 Federated Systems

Rakhmawati et al. [10] presented federated systems in a survey. They showed the components of a federated system and the available evaluation strategies, and they also presented the source selection techniques, which they call ASK Query and Data catalogue.

Verborgh et al. [15] noted that the availability of SPARQL endpoints is low and offered a client-side solution for this problem. Their idea is that the client does not send a complex query to the endpoint, but uses a pipeline of simpler queries instead. This is similar to the recommendation technique, where the system runs simple queries for the triple patterns.

Buil-Aranda et al. [1] analyzed federated SPARQL evaluation strategies. They found that some federated systems do not answer all queries correctly, because the endpoints have result size limitations. They also compared the evaluation strategies.

In this paper we use two federated systems that employ the two mentioned source selection strategies. One is DARQ [9] (distributed ARQ), an extension of Jena [6] ARQ that supports federated SPARQL queries. Its data source selection is based on preconfigured properties.

The other system we use is FedX [14], an optimization technique for federated SPARQL queries. It does not use any stored information about the endpoints; it uses only ASK queries for source selection.

3.2 Autocomplete, Recommendation

Hoefler [5] noted that SPARQL is difficult for non-expert users: they do not know the URLs of the endpoints or which data are stored on them. He observed that users do know spreadsheet applications such as Excel, so his idea is to present the semantic data in tabular form.

Campinas [2] also noted that using the Semantic Web is difficult because writing a SPARQL query is complicated. He implemented a data-based autocompletion that recommends items that can be predicates, classes or named graphs. The aim is an easy-to-use library.

Kramer et al. [7] presented a SPARQL index technique for autocompletion. It works on the query being written: when the user types a ‘?’ or a ‘<’ symbol, the system recommends variables or IRIs based on earlier queries. In our case, the recommendation uses the federated system and the LOD Cloud instead.

Lehmann and Bühmann [8] presented a technique for constructing SPARQL queries. Their solution is based on question answering and positive-example learning techniques. The system makes recommendations based on the user's selection, and the selection is the basis of the next iteration. The recommendation runs until the query is found or turns out not to be learnable.

Rietveld et al. implemented Yasgui [11], a user-friendly, web-based SPARQL client. It uses a proxy to reach the endpoints. It has an autocomplete function for prefixes, namespaces and properties from multiple endpoints, but it uses only one endpoint to evaluate the query.

In our previous work [4] we presented a recommendation technique built on a federated system. It recommends only the prefixes and the properties of an rdf:type. In this paper, we extend this technique with a more usable recommendation.

Saleem et al. presented HiBISCuS [12], a hypergraph-based approach to source selection for federated systems. They tested their system with DARQ and FedX, too.

4 Challenge

Our goal is to reduce the number of endpoints necessary for the evaluation. The reduction is based on the recommendations that come from the recommendation system. The main idea is that the recommendations carry enough information about the endpoints to answer the query. Every time the system receives new recommended triples, it also receives predicate information, and this information is always up-to-date. Another advantage is that when an endpoint does not respond to a recommendation request, the system knows that this endpoint is not needed for the evaluation. The recommendation technique thus reduces the number of endpoints used for the evaluation. In this paper we do not aim to create a new evaluation strategy or join order technique; we use the available federated systems and produce a better configuration for them with the recommendation system. The challenge is how to use this information for a better configuration. The configuration of a federated system is static; with the recommendation we can change the configuration for a faster evaluation.

Another aim is to extend the recommendation technique. In our previous work we presented a recommendation technique that uses only rdf:type and the predicates of a type. We can use other predicates to obtain new recommendations; the usable predicate is rdfs:range.

5 Technique

First, we extend the technique presented earlier. In the previous paper we used only the prefix, the rdf:type and its predicates. We now extend this recommendation with rdfs:range.

5.1 Recommendation Technique

The algorithm presented earlier first collects the available types from the endpoints. It can do this with a simple SPARQL query.
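Such a type-collection query could look as follows; a minimal Jena sketch, where the LIMIT is our illustrative assumption.

```java
import org.apache.jena.query.QueryExecution;
import org.apache.jena.query.QueryExecutionFactory;

public class TypeCollector {
    // Collects the available rdf:types of one endpoint.
    public static void printTypes(String endpointUrl) {
        String query = "SELECT DISTINCT ?t WHERE { ?s a ?t } LIMIT 100";
        try (QueryExecution qe =
                 QueryExecutionFactory.sparqlService(endpointUrl, query)) {
            qe.execSelect().forEachRemaining(
                row -> System.out.println(row.getResource("t")));
        }
    }
}
```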

It then builds recommendations from the types. A recommendation is a triple where the subject is a variable not yet used in the query, the predicate is rdf:type, and the object is the new type. This recommendation yields additional information, because the new type carries rdf:type predicate information, and we store this information about the endpoints.

Another function of the system is the \(predicate\ recommendation\), which builds new triple patterns based on the query being written. The input of the function is a variable and the type of that variable. The federated system asks all the endpoints about this type: we request some entities of this type and query the predicates of each entity. We store these predicates in a map to eliminate duplicates. Finally, we prepare recommendations from the new predicates: the subject of a recommended triple is the input variable, the predicate is the new predicate, and the object is a fresh variable generated from the input variable and the predicate name. The new predicate also carries information about the endpoint, which we store. We obtain the predicates with simple queries. If we know a predicate, we can ask for its range (IdentityRangeQuery), and if the object is a variable, we can produce a new recommendation for that variable: we ask for the range and apply the technique described above.
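The two probe queries behind this function are short. A minimal sketch of what they could look like with Jena follows; the LIMIT values and the method names are our illustrative assumptions.

```java
import org.apache.jena.query.QueryExecution;
import org.apache.jena.query.QueryExecutionFactory;

import java.util.HashSet;
import java.util.Set;

public class PredicateRecommender {
    // Predicates used by entities of a given type on one endpoint.
    public static Set<String> predicatesOfType(String endpoint, String type) {
        String q = "SELECT DISTINCT ?p WHERE { ?s a <" + type + "> . "
                 + "?s ?p ?o } LIMIT 100";
        Set<String> out = new HashSet<>();
        try (QueryExecution qe =
                 QueryExecutionFactory.sparqlService(endpoint, q)) {
            qe.execSelect().forEachRemaining(
                row -> out.add(row.getResource("p").getURI()));
        }
        return out;
    }

    // The range of a predicate (the IdentityRangeQuery of the text).
    public static Set<String> rangeOf(String endpoint, String predicate) {
        String q = "SELECT ?r WHERE { <" + predicate + "> "
                 + "<http://www.w3.org/2000/01/rdf-schema#range> ?r }";
        Set<String> out = new HashSet<>();
        try (QueryExecution qe =
                 QueryExecutionFactory.sparqlService(endpoint, q)) {
            qe.execSelect().forEachRemaining(
                row -> out.add(row.getResource("r").getURI()));
        }
        return out;
    }
}
```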

Fig. 1. SPARQL writing with recommendation and information recovery.

We walk through the algorithm with an example in Fig. 1. On the left side we see the SPARQL query; in the first step it is empty. The system recommends some types (in the example: Person, Car) and thereby obtains some predicate (rdf:type) information. We accept the Person type and get new recommendations (dbpedia-owl:birthPlace and rdfs:label), and the system stores the new predicates (dbpedia-owl:birthPlace, rdfs:label). We select dbpedia-owl:birthPlace, and with the help of the range predicate the system learns the type of the ?place variable: it is dbpedia:Place, so the system recommends new predicates for ?place. The system updates the predicate information of the endpoint.

5.2 Recommended Information Is Enough

Algorithm 1 selects the necessary endpoints. The EPQ variable represents the set of necessary endpoints (see line 2). Let UserSelect be a list of the triple patterns the user has chosen (see line 3). For each selected triple pattern we check which endpoint can answer it (see line 6); if an endpoint can answer the triple pattern, it is necessary for the evaluation.

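A minimal Java reconstruction of Algorithm 1 as described above: the names EPQ and UserSelect follow the text, while the interfaces and the answer check are our assumptions (the check could be, for example, the ASK probe sketched in Sect. 1).

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class SourceSelection {
    interface TriplePattern { String asText(); }

    interface Endpoint {
        String url();
        // True if this endpoint can answer the pattern, e.g. via an ASK
        // probe or via the predicate information from the recommendations.
        boolean answers(TriplePattern tp);
    }

    // Every endpoint that can answer at least one user-selected
    // triple pattern is kept for the evaluation.
    public static Set<Endpoint> selectEndpoints(List<Endpoint> endpoints,
                                                List<TriplePattern> userSelect) {
        Set<Endpoint> epq = new HashSet<>();      // line 2: EPQ := {}
        for (TriplePattern tp : userSelect) {     // line 3: selected patterns
            for (Endpoint e : endpoints) {
                if (e.answers(tp)) {              // line 6: answer check
                    epq.add(e);
                }
            }
        }
        return epq;
    }
}
```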

Remark 1

(Recommended Information Is Enough.) Let EPQ be the set of endpoints that are necessary for the evaluation, let Endpoints be the set of all endpoints that the system knows, and let UserSelect be the list of triple patterns that the user chose. Then \(\forall \) tp \(\in \) UserSelect, \(\forall \) e \(\in \) Endpoints: answer(e, tp) \(\Rightarrow \) e \(\in \) EPQ.

Proof

For the evaluation, the federated system needs every triple pattern (TP) to belong to an endpoint; a TP belongs to an endpoint if the endpoint can answer it. We prove the statement indirectly. Let e be an endpoint that can answer a selected TP but does not belong to EPQ. That would mean the TP does not come from that endpoint, although the user selected it. However, every recommendation comes from the endpoints, so the TP comes from e; hence e can answer the TP and it was pushed into EPQ, which is a contradiction.

5.3 Query Cost of Federated Techniques

In the previous section we presented the information recovery from the recommendations. In Sect. 2.2 we showed the number of requests of the query evaluation. We now describe the cost of the two source selection techniques.

Query Cost of the ASK Technique. The ASK technique issues queries every time to every endpoint; this is equal to the basic \(Cost_{conf}\). The \(Cost_{subQuery}\) is smaller, because the system does not run the subqueries on every endpoint: it uses only the endpoints selected in the configuration phase. Let m be the number of selected endpoints, with \(m \le n\).

$$\begin{aligned} Cost_{askSubquery}(m) = (\sum _{i=1}^{m} Cost_{EPQuery}) * count(subQuery) \end{aligned}$$
(4)

Cost of the Catalog Technique. The catalog technique uses preconfigured information about the endpoints. The configuration runs independently of the evaluation, so its cost is 0 in the evaluation phase. The system then uses only the endpoints selected by the configuration, similarly to the ASK technique. The disadvantage of this technique is that the configuration may not be up-to-date, so it is necessary to collect information about the endpoints repeatedly.

Cost of the Recommended Configuration. The recommendation technique uses the ASK technique during query creation, which keeps the configuration up-to-date. The advantage of this technique is that it reduces the number of endpoints, because it selects the endpoints that were used while the query was being created.

Cost Comparison. Let \(recommend(n) \le n\), where recommend(n) is the number of endpoints after the reduction based on the recommendations.

$$\begin{aligned} Cost_{catalog}(recommend(n)) \le Cost_{catalog}(n) \end{aligned}$$
(5)
$$\begin{aligned} Cost_{ASK}(recommend(n)) \le Cost_{ASK}(n) \end{aligned}$$
(6)

When we use the catalog technique, the system reads the configuration and asks only those endpoints that can answer the subquery. When we use the recommendation technique, we do not change the federated system itself; we change only its configuration through the recommendations. This means that with the recommendation the catalog technique reads fewer endpoint configurations; without it, the technique checks all endpoint configurations, although from the point of view of requests the two cases are the same. Equation 5 expresses that the cost with the recommendation is at most the cost without it.

When we use the ASK technique, the system asks all the endpoints with ASK queries, and in the query phase it queries only the necessary endpoints, where the predicates are. The cost of the ASK technique is higher than that of the catalog technique, because it also queries the endpoints in the configuration phase. When we combine the recommendation technique with the ASK technique, the number of requests becomes smaller, because the configuration phase uses fewer endpoints: only those that come from the recommendations. Equation 6 expresses that the cost with the recommendation is at most the cost without it.

6 Experiments

We evaluated our techniques with two related systems: DARQ [9], because it uses statistics about the endpoints, and FedX [14], because it uses the ASK technique. We ran both from Java code on a virtual machine with an Intel Xeon X5650 CPU (4 cores, 2.67 GHz) and 4 GB memory. We installed two Virtuoso endpoints [3] (ver. 06.01) on two computers with Intel Core i5 650 CPUs (4 cores, 3.2 GHz) and 4 GB memory. Both Virtuoso instances store FedBench datasets: the first stores the NyTimes and Jamendo datasets, the second the DBpedia, LinkedMDB and Geonames datasets. We configured the Virtuoso servers so that the datasets can be reached separately. We used the FedBench [13] Cross-Domain queries to evaluate our model, running every query once for warm-up and five times for the statistics.

When we used FedX, we configured all the endpoints for comparison. When we used DARQ, we configured all the endpoints with the RDFStats generator.

Figure 2 shows the runtime with and without the recommendation. The left chart shows queries 1 and 2, the middle chart queries 3 and 5, and the right chart queries 4 and 6. Query 7 ran out of memory in both cases. DARQ does not support UNION and unbounded queries. In the federated benchmark, query 1 contains a UNION operator; in this case the result is the same with and without the recommendation, and there is no answer. Query 6 has an unbound variable, yet we get an answer, and a better result with the recommendation.

Fig. 2. Runtime with DARQ (Color figure online)

Fig. 3. Runtime with FedX (Color figure online)

Figure 3 presents the runtime of the FedX system. The runtime of every query was better with the recommendation except for query 6, which has the same runtime in both cases, because the recommendation selects all the endpoints for the evaluation.

7 Conclusions

Source selection is an important part of a federated system. There are two source selection strategies. One stores information about the endpoints and uses this information to select them; its problem is that the catalog needs to be up-to-date, so the system has to refresh it with queries. The other is the ASK technique, where the system runs ASK queries to learn which endpoint can answer a triple pattern; it asks all the endpoints for every triple pattern. We reduced the number of endpoints with a recommendation technique.

The recommendation technique is a user-friendly way to write SPARQL more easily, similar to the autocomplete function of SQL. While the user writes the SPARQL query, the system runs simple queries on the endpoints to produce recommendations. The recommendations are triple patterns, which the user can use in the query. In this paper, we extended our recommendation technique with a new extension that uses the rdfs:range predicates. We used the recommendation system to collect information about the endpoints and used this information to set the configuration of the federated system. We presented the algorithm for obtaining this information and compared our results on two federated systems, DARQ and FedX. We have seen that the recommendation can help the federated system to evaluate the query.

In the future we would like to build a library for further applications, and we would like to create a web page where this technique works, so that non-expert users can use it.