Abstract
While some public transit data publishers only provide a data dump – which only few reusers can afford to integrate within their applications – others provide a use case limiting origin-destination route planning api. The Linked Connections framework instead introduces a hypermedia api, over which the extendable base route planning algorithm “Connections Scan Algorithm” can be implemented. We compare the cpu usage and query execution time of a traditional server-side route planner with the cpu time and query execution time of a Linked Connections interface by evaluating query mixes with increasing load. We found that, at the expense of a higher bandwidth consumption, more queries can be answered using the same hardware with the Linked Connections server interface than with an origin-destination api, thanks to an average cache hit rate of 78%. The findings from this research show a cost-efficient way of publishing transport data that can bring federated public transit route planning at the fingertips of anyone.
You have full access to this open access chapter, Download conference paper PDF
Similar content being viewed by others
Keywords
1 Introduction
The way travelers want route planning advice is diverse: from finding journeys that are accessible with a certain disability [2], to taking into account whether the traveler owns a (foldable) bike, a car or a public transit subscription, or even calculating journeys with the nicest pictures on social network sites [8]. However, when presented with a traditional route planning http api taking origin-destination queries, developers of e.g. traveling tools are left with no flexibility to calculate journeys other than the functions provided by the server. As a consequence, developers that can afford a larger server infrastructure, integrate data dumps of the timetables (and their real-time updates), into their own system. This way, they are in full control of the algorithm, allowing them to calculate journeys in their own manner, across data sources from multiple authorities.
When publishing departure-arrival pairs (connections) in chronologically ordered pages – as demonstrated in earlier work [4] – route planning can be executed at data integration time by the user agent rather than by the data publishing infrastructure. This way, the Linked Connections \({{\textit{(}}\textsc {lc}{} \textit{)}}\) framework Footnote 1 allows for a richer web publishing and querying ecosystem within public transit route planners. It lowers the cost for reusers to start prototyping – also federated and multimodal – route planners over multiple data sources. Furthermore, other sources may give more information about a certain specific vehicle that may be of interest to this user agent. It is not exceptional that route planners also take into account properties such as fares [5], wheelchair accessibility [2] or criminality statistics. In this paper, we test our hypothesis that this way of publishing is more lightweight: how does our information system scale under more data reuse compared to origin-destination apis?
The paper is structured in a traditional way, first describing state of the art, then introducing more details on the Linked Connections framework, to then describe the evaluation’s design, made entirely reproducible, and discuss our open-source implementation. Finally, we elaborate on the results and conclusion.
2 State of the Art
The General Transit Feed Specification (gtfs)Footnote 2 is a framework for exchanging data from a public transit agency to third parties. It is – at the time of writing – the de-facto standard for describing and exchanging transit schedules. It describes the headers of several CSV files combined in a ZIP-file. Using a calendar and calendar exceptions, both periodic as aperiodic transit schedules can be described. gtfs also describes the geographic shape of trips, fare zones, accessibility, information about an agency, and so forth. We provided URIs to the terms in the gtfs specification through the Linked gtfs Footnote 3 vocabulary. Reusing these gtfs files, route planners exist in software as a service platforms such as Navitia.io, in end-user applications such as CityMapper, Ally or Google Maps, or as open-source software, such as Open Trip Planner or Bliksemlabs RRRR. It is also possible a that an agency, such as the Dutch railway company, the SNCB or the Deutsche Bahn, expose a route planner over http themselves.
Data dumps and query apis can be seen as two extremes on the Linked Data Fragments (ldf)Footnote 4 axis. Triple Pattern Fragments – similarly to Linked Connections [4] – already defined a way to publish an rdf dataset only through its triple patterns [10]. This moves the evaluation of all Basic Graph Patterns – and by extension sparql queries – to the client, reducing cpu load on the server.
In route planners, not sparql queries but Earliest Arrival Time queries (eat) [9] need to be able to be solved. This is a question involving a time of departure, a departure stop and an arrival stop, expecting the earliest possible arrival time at the destination. More complex route planning questions exist, such as the Minimum Expected Arrival Time \({{\textit{(}}\textsc {meat}{} \textit{)}}\) [6] or multi-criteria profile queries [1, 5, 11]. The Connection Scan Algorithm \({{\textit{(}}\textsc {csa}{} \textit{)}}\) is an approach for planning that models the timetable data as a directed acyclic graph [6, 9]. By topologically sorting the graph by departure time, the shortest path algorithm only needs to scan through connections within a limited time window to solve an eat query. csa can be extended to solving problems where it also keeps the number of transfers limited, as well as calculating routes with uncertainty [6]. The ideas behind csa can scale up to large networks by using multi-overlay networks [9].
3 Linked Connections
In this section, we design our information system for public transport data, called the Linked Connections framework. As our datasets need to be interoperable with other data published on the Web, we choose http as our uniform interface. We allow cross origin resource sharing and enable cache header to indicate when we expect a document to change. In order to document our identifiers using this uniform interface as well, rdf is chosen as a knowledge representation model. We further only define the hypermedia controls needed in each document as well as the vocabularies to be used.
To solve the eat problem using csa, timetable information within a certain time window needs to be retrievable. Other route planning questions need to select data within the same time window, and solving these is expected to have similar resultsFootnote 5 Instead of exposing an origin-destination api or a data dump, a Linked Connections server paginates the list of connections in departure time intervals and publishes these pages over http. Each page contains a link to the next and previous one. In order to find the first page a route planning needs, the document of the entry point given to the client contains a hypermedia description on how to discover a certain departure time. Both hypermedia controls are expressed using the Hydra vocabularyFootnote 6.
The base entities that we need to describe are connections, which we documented using the lc Linked Data vocabularyFootnote 7. Each connection entity – the smallest building block of time schedules – provides links to an arrival stop and a departure stop, and optionally to a trip. It contains two literals: a departure time and an arrival time. Linked gtfs can be used to extend a connection with public transit specific properties such as a headsign, drop off type or fare information. Furthermore, it also contains concepts to describe transfers and their minimum change time. For instance, when transferring from one railway platform to another, Linked gtfs can indicate that the minimum change time from one stop to another is a certain number of seconds.
In the proof of concept built for this paper available at http://linkedconnections.org, we implemented this hypermedia control as a redirect from the entry point to the page containing connections for the current time. Then, on every page, the description can be found of how to get to a page describing another time range. In order to limit the amount of possible documents, we only enable pages for each XFootnote 8 minutes, and do not allow pages describing overlapping time intervals. When a time interval is requested for which a page does not exist, the user-agent will be redirected to a page containing connections departing at the requested time. The same page also describes how to get to the next or previous page. This way, the client can be certain about which page to ask next, instead of constructing a new query for a new time interval.
A naive implementation of federated route planning on top of this interface is also provided, as a client can be configured with more than one entry point. The client then performs the same procedure multiple times in parallel. The connections streams are merge sorted as they are downloaded. A Linked Data solution is to ensure that the client knows how to link a stop from one agency to a stop from another.
4 Evaluation Design
We implemented different components in JavaScript for the Node.js platform. We chose JavaScript as it allows us to use both components both on a command-line environment as well as in the browser.
- CSA.js:
-
A library that calculates a minimum spanning tree and a journey, given a stream of input connections and a queryFootnote 9
- Server.js:
-
Publishes streams of connections in JSON-LD, using a MongoDB to retrieve the connections itselfFootnote 10
- Client.js:
-
Downloads connections from a configurable set of servers and executes queriesFootnote 11
- gtfs2lc:
-
A tool to convert existing timetables as open data to the Linked Connections vocabularyFootnote 12
We set up 2 different servers, each connected to the same MongoDB database which stores all connections of the Belgian railway system for October 2015: 1. a Linked Connections server with an NGINX proxy cache in front, which adds caching headers configured to cache each resource for one minuteFootnote 13 and compresses the body of the response using gzip; 2. a route planning server which exposes an origin-destination route planning interface instead using the csa.js libraryFootnote 14.
These tools are combined into different set-ups:
- Client-side route planning.:
-
The first experiment executes the query mixes by using the Linked Connections client. Client caching is disabled, making this simulate the lc without cache set-up, where every request could originate from an end-user’s device.
- Client-side route planning with client-side cache.:
-
The second experiment does the same as the first experiment, except that it uses a client side cache, and simulates the lc with cache set-up.
- Server-side route planning.:
-
The third experiment launches the same queries against the full route planning server. The query server code is used which relies on the same CSA.js library as the client used in the previous two experiments.
In order to have an upper and lower bound of a real world scenario, the first approach assumes every request comes from a unique user-agent which cannot share a cache, and has caching disabled, while the second approach assumes one user-agent does all requests for all end-users and has caching enabled. Different real world caching opportunities – such as user agents for multiple end-users in a software as a service model, or shared peer to peer neighborhood caching [7] – will result in a scalability in-between these two scenarios.
We also need a good set of queries to test our three set-ups with. The iRail projectFootnote 15 provides a route planning api to apps with over 100 k installations on Android and iPhone [3]. The api receives up to 400 k route planning queries per month. As the query logs of the iRail project are published as open data, we are able to create real query mixes from them with different loads. The first mix contains half the query load of iRail, during 15 min on a peak hour on the first of October. The mix is generated by taking the normal query load of iRail during the same 15 min, randomly ordering them, and taking the half of all lines. Our second mix is the normal query load of iRail, which we selected out of the query logs, and for each query, we calculated on which second after the benchmark script starts, the query should be executed. The third mix is double the load of the second mix, by taking the next 15 min, and subtracting 15 min from the requested departure time, and merging it with the second mix. The same approach can be applied with limited changes for the subsequent query mixes, which are taken from the next days’ rush hours. Our last query mix is 16 times the query load of iRail on the 1st of October 2015. The resulting query mixes can be found at https://github.com/linkedconnections/benchmark-belgianrail.
These query mixes are then used to reenact a real-world query load for three different experiments on the three different architectures. We ran the experiments on a quad core Intel(R) Core(TM) i5-3340M cpu @ 2.70 GHz with 8 GB of RAM. We launched the components in a single thread as our goal is to see how fast cpu usage increases when trying to answer more queries. The results will thus not reflect the full capacity of this machine, as in production, one would run the applications on different worker threads.
Our experiments can be reproduced using the code at https://github.com/linkedconnections/benchmark-belgianrail. There are three metrics we gather with these scripts: the cpu time used of the server instance (http caches excluded), the bandwidth used per connection, and the query execution time per lc connection. For the latter two, we use a per-connection result to remove the influence of route complexity, as the time complexity of our algorithm is O(n) with n the total number of connections. We thus study the average bandwidth and query execution time needed to process one connection per route planning task.
5 Results
Figure 1 depicts the percentage of the time over 15 min the server thread was active on a cpu. The server cpu time was measured with the command pidstat from the sysstat package and it indicates the server load. The faster this load increases, the quicker extra cpus would be needed to answer more queries.
When the query load is half of the real iRail load on October 1st 2015, we can see the lowest server load is the lc set-up with client cache. About double of the load is needed by the lc without client cache set-up, and even more is needed by the query server. When doubling the query load, we can notice the load lower slightly for lc with cache, and the other two raise slightly. Continuing this until 16 times the iRail query load, we can see that the load of the query server raises until almost 100%, while lc without cache raises until 30%, while with client cache, 12% is the measured server load.
In Fig. 2, we can see the query response time per connection of 90% of the queries. When the query load is half of the real iRail load on October 1st 2015, we notice that the fastest solution is the query server, followed by the lc with cache. When doubling the query load, the average query execution time is lower in all cases, resulting in the same ranking. When doubling the query load once more, we see that lc with client cache is now the fastest solution. When doubling the query load 12 times, also the lc without client cache becomes faster. The trend continues until the query server takes remarkably longer to answer 90% of the queries than the Linked Connections solutions at 16 times the query load.
The average bandwidth consumption per connection in bytes shows the price of the decreased server load as the bandwidth consumption of the lc solutions are three orders of magnitude bigger: lc is 270B, lc with cache is 64B and query-server is 0.8B. The query server only gives one response per route planning question that is small. lc without client cache has a bandwidth that is three orders of magnitude bigger than the query server. The lc with a client cache has an average bandwidth consumption per connection that is remarkably lower. On the basis of these numbers we may conclude the average cache hit-rate is about 78%.
6 Conclusion
This paper introduced the Linked Connections framework for publishing queryable public transit data. We measured and compared the raise in query execution time and cpu usage between the traditional approach and Linked Connections. We could have hypothesized that we would find the origin-destination api approach to give faster response times, with a higher server cpu load that will increase even faster when the number of queries increase. We indeed achieved a better cost-efficiency: when the query-interface becomes saturated under an increasing query load, the lightweight lc interface only reached 1/4th of its capacity, meaning that the same load can be served with a smaller machine, or that a larger amount of queries can be solved using the same server. As the server load increases, the lc solution even give faster query results.
These result are strong arguments in favor of publishing timetable data in cacheable fragments instead of exposing origin-destination query interfaces when publishing data for maximum reuse is envisioned. The price of this decreased server load is however paid by the bandwidth that is needed, which is three orders of magnitude bigger. When route planning advice needs to be calculated while on for example a mobile phone network, network latency, which was not taken into account during the tests, may become a problem when the cache of the device is empty. An application’s server can however be configured with a private origin-destination api, which in its turn is a consumer of a Linked Connections dataset, taking the best from both worlds.
Our goal was to enable a more flexible public transport route planning ecosystem. While even personalized routing is now possible, we also lowered the cost of hosting the data, and enabled in-browser scripts to execute the public transit routing algorithm. Furthermore, the query execution times of queries solved by the Linked Connections framework are competitive. Until now, public transit route planning was a specialized domain where all processing happened in memory on one machine. We hope that this is a start for a new ecosystem of public transit route planners.
Notes
- 1.
- 2.
- 3.
Base URI: http://vocab.gtfs.org/terms#.
- 4.
- 5.
- 6.
- 7.
- 8.
X is a configurable amount.
- 9.
- 10.
- 11.
- 12.
- 13.
The Belgian railway company estimates delays on its network each minute.
- 14.
The route planning server uses the CSA.js library to expose a route planning api on top of data stored in a MongoDB. The code is available at https://github.com/linkedconnections/query-server.
- 15.
References
Bast, H., Carlsson, E., Eigenwillig, A., Geisberger, R., Harrelson, C., Raychev, V., Viger, F.: Fast routing in very large public transportation networks using transfer patterns. In: Berg, M., Meyer, U. (eds.) ESA 2010. LNCS, vol. 6346, pp. 290–301. Springer, Heidelberg (2010). doi:10.1007/978-3-642-15775-2_25
Colpaert, P., Ballieu, S., Verborgh, R., Mannens, E.: The impact of an extra feature on the scalability of linked connections. In: COLD Workshop at ISWC2016 (2016)
Colpaert, P., Chua, A., Verborgh, R., Mannens, E., Van de Walle, R., Vande Moere, A.: What public transit API logs tell us about travel flows. In: WWW 2016: 25th International World Wide Web Conference (2016)
Colpaert, P., Llaves, A., Verborgh, R., Corcho, O., Mannens, E., Van de Walle, R.: Intermodal public transit routing using linked connections. In: Proceedings of the 14th International Semantic Web Conference: Posters and Demos, October 2015
Delling, D., Pajor, T., Werneck, R.F.F.: Round-based public transit routing. In: Proceedings of the 14th Meeting on Algorithm Engineering and Experiments (ALENEX 2012) (2012)
Dibbelt, J., Pajor, T., Strasser, B., Wagner, D.: Intriguingly simple and fast transit routing. In: Bonifaci, V., Demetrescu, C., Marchetti-Spaccamela, A. (eds.) SEA 2013. LNCS, vol. 7933, pp. 43–54. Springer, Heidelberg (2013). doi:10.1007/978-3-642-38527-8_6
Folz, P., Skaf-Molli, H., Molli, P.: CyCLaDEs: a decentralized cache for triple pattern fragments. In: Sack, H., Blomqvist, E., d’Aquin, M., Ghidini, C., Ponzetto, S.P., Lange, C. (eds.) ESWC 2016. LNCS, vol. 9678, pp. 455–469. Springer, Cham (2016). doi:10.1007/978-3-319-34129-3_28
Quercia, D., Schifanella, R., Aiello, L.M.: The shortest path to happiness: recommending beautiful, quiet, and happy routes in the city. In: Proceedings of the 25th ACM Conference on Hypertext and social Media, pp. 116–125. ACM (2014)
Strasser, B., Wagner, D.: Connection scan accelerated. In: Proceedings of the Meeting on Algorithm Engineering & Expermiments, pp. 125–137. Society for Industrial and Applied Mathematics (2014)
Verborgh, R., Vander Sande, M., Hartig, O., Van Herwegen, J., De Vocht, L., De Meester, B., Haesendonck, G., Colpaert, P.: Triple pattern fragments: a low-cost knowledge graph interface for the Web. J. Web Semant. 37, 184–206 (2016)
Witt, S.: Trip-based public transit routing. In: Bansal, N., Finocchi, I. (eds.) ESA 2015. LNCS, vol. 9294, pp. 1025–1036. Springer, Heidelberg (2015). doi:10.1007/978-3-662-48350-3_85
Author information
Authors and Affiliations
Corresponding author
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2017 Springer International Publishing AG
About this paper
Cite this paper
Colpaert, P., Verborgh, R., Mannens, E. (2017). Public Transit Route Planning Through Lightweight Linked Data Interfaces. In: Cabot, J., De Virgilio, R., Torlone, R. (eds) Web Engineering. ICWE 2017. Lecture Notes in Computer Science(), vol 10360. Springer, Cham. https://doi.org/10.1007/978-3-319-60131-1_26
Download citation
DOI: https://doi.org/10.1007/978-3-319-60131-1_26
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-60130-4
Online ISBN: 978-3-319-60131-1
eBook Packages: Computer ScienceComputer Science (R0)