Comparing Hypotheses About Sequential Data: A Bayesian Approach and Its Applications

Florian Lemmerich²²,
Philipp Singer^22,23,24,25,
Martin Becker²³,
Lisette Espin-Noboa²²,
Dimitar Dimitrov²²,
Denis Helic²⁴,
Andreas Hotho²³ &
Markus Strohmaier^22,25

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 10536)

Included in the following conference series:

Joint European Conference on Machine Learning and Knowledge Discovery in Databases

Abstract

Sequential data can be found in many settings, e.g., as sequences of visited websites or as location sequences of travellers. To improve the understanding of the underlying mechanisms that generate such sequences, the HypTrails approach provides a novel data analysis method. Based on first-order Markov chain models and Bayesian hypothesis testing, it allows for comparing a set of hypotheses, i.e., beliefs about transitions between states, with respect to their plausibility given the observed data. HypTrails has been successfully employed to study phenomena in the online and the offline world. In this talk, we give an introduction to HypTrails and showcase selected real-world applications on urban mobility and reading behavior on Wikipedia.

This work summarizes a previous publication presenting the HypTrails approach [5] and three selected papers [1,2,3] that utilize it.

1 Introduction

Today, large collections of data are available in the form of sequences of transitions between discrete states. For example, people move between different locations in a city, users navigate between web pages on the World Wide Web, or users listen to sequences of songs on a music streaming platform. Analyzing such datasets can improve the understanding of behavior in these application domains. In typical machine learning and data mining approaches, the parameters of a model (e.g., a Markov chain) are learned automatically in order to capture the data generation process and make predictions. However, it is then often difficult to interpret the learned parameters or to relate them to basic intuitions and existing theories about the data, especially if many parameters are involved. In a recently introduced line of research, we therefore aim to establish an alternative approach: we develop a method that captures beliefs about the generation of sequential data as Bayesian priors over parameters and then compares such hypotheses with respect to their plausibility given observed data. In this work, we showcase our general approach [5], which we call HypTrails, and present practical applications in various domains [1,2,3], namely sequences of visited locations derived from photos uploaded to Flickr, taxi trails in Manhattan, and navigation of readers in Wikipedia.

2 Bayesian Hypotheses Comparison in Sequential Data

For comparing hypotheses about the transition behavior in sequence data, we follow a Bayesian approach. As an underlying model, we utilize first-order Markov chain models. Such models assume a memory-less transition process between discrete states. That means that the probability of the next visited state depends only on the current one. The parameters of this model, i.e., the transition probabilities \(p_{ij}\) between the states, can be written as a single matrix.
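To make the model concrete, the following minimal sketch counts the observed transitions in a set of sequences and row-normalizes them into a matrix of estimated transition probabilities \(p_{ij}\). The function names and the toy sequences are illustrative assumptions and not part of HypTrails; states are assumed to be encoded as integers starting at 0.

```python
import numpy as np

def transition_counts(sequences, n_states):
    """Count observed transitions i -> j over a list of state sequences."""
    counts = np.zeros((n_states, n_states), dtype=float)
    for seq in sequences:
        # Pair each state with its successor (first-order dependence only).
        for i, j in zip(seq[:-1], seq[1:]):
            counts[i, j] += 1
    return counts

def mle_transition_matrix(counts):
    """Row-normalize counts; rows without any observation stay all-zero."""
    row_sums = counts.sum(axis=1, keepdims=True)
    return np.divide(counts, row_sums,
                     out=np.zeros_like(counts), where=row_sums > 0)

# Toy data (hypothetical): two short trails over three states.
sequences = [[0, 1, 2, 1], [1, 2, 0]]
counts = transition_counts(sequences, n_states=3)
P = mle_transition_matrix(counts)
```

The count matrix, rather than the normalized one, is what the Bayesian comparison in the remainder of this section operates on.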

In HypTrails, we want to compare a set of hypotheses \(H_1, \ldots , H_n\) with respect to how well they can explain the generation of the observed data. Each hypothesis captures a belief about the transitions between the states as derived from theory in the application domain, from other related datasets, or from human intuition. To specify a hypothesis, the user expresses a belief matrix, in which a high value in a cell (i, j) reflects a belief that transitions from state i to state j are more common. With HypTrails, these belief matrices are then automatically transformed into Bayesian Dirichlet priors over the model parameters (i.e., the transition probabilities in the Markov chain). This transformation can be performed for different concentration parameters \(\kappa \). A higher value of \(\kappa \) generates a prior that corresponds to a stronger belief in the hypothesis. For each hypothesis \(H_i\) and each concentration parameter \(\kappa \), we can then compute the marginal likelihood \(P(D|H_i)\) of the data given the hypothesis. Given our model, the marginal likelihood can be computed efficiently in closed form. The higher the marginal likelihood of a hypothesis, the more plausible it appears with respect to the observed data. For quantifying the support of one hypothesis over another, we utilize Bayes factors, a Bayesian alternative to frequentist p-values, which can be interpreted directly with lookup tables [4]. For a set of hypotheses, the marginal likelihoods induce an ordering of the hypotheses with respect to their plausibility given the data. However, the plausibility of hypotheses is only ever assessed relative to each other. Therefore, a simple hypothesis is often used as a baseline, e.g., the uniform hypothesis that assumes all transitions to be equally likely.

To compare hypotheses, all priors should be derived using the same belief strength \(\kappa \). To make comparisons across different belief strengths, HypTrails results are typically visualized as line plots, in which each line corresponds to one hypothesis. The x-axis specifies different values of the concentration parameter \(\kappa \), and the y-axis describes the marginal likelihood of a hypothesis, cf. Fig. 1.
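The computation described in this section can be sketched as follows. For a first-order Markov chain with independent Dirichlet priors per row, with pseudo-counts \(\alpha_{ij}\) and observed transition counts \(n_{ij}\), the marginal likelihood is \(P(D|H) = \prod_i \frac{\Gamma (\sum_j \alpha_{ij})}{\prod_j \Gamma (\alpha_{ij})} \cdot \frac{\prod_j \Gamma (n_{ij} + \alpha_{ij})}{\Gamma (\sum_j (n_{ij} + \alpha_{ij}))}\). Note that the mapping from belief matrix to prior used here (row-normalized beliefs scaled by \(\kappa \) on top of one pseudo-count per cell) is a simplifying assumption for illustration and not necessarily the elicitation procedure of [5]; the function names, the toy counts, and the "cycle" hypothesis are likewise hypothetical.

```python
import math
import numpy as np

_lgamma = np.vectorize(math.lgamma)  # element-wise log-gamma

def dirichlet_prior(belief, kappa):
    """Belief matrix -> Dirichlet pseudo-counts (simplified assumption):
    row-normalize, scale by kappa, add a uniform proto-prior of 1 per cell."""
    belief = np.asarray(belief, dtype=float)
    row_sums = belief.sum(axis=1, keepdims=True)
    normalized = np.divide(belief, row_sums,
                           out=np.zeros_like(belief), where=row_sums > 0)
    return 1.0 + kappa * normalized

def log_evidence(counts, alpha):
    """Log marginal likelihood of the transition counts under row-wise
    Dirichlet priors (closed form of the Dirichlet-multinomial model)."""
    counts = np.asarray(counts, dtype=float)
    alpha = np.asarray(alpha, dtype=float)
    per_row = (_lgamma(alpha.sum(axis=1))
               - _lgamma(alpha).sum(axis=1)
               + _lgamma(counts + alpha).sum(axis=1)
               - _lgamma((counts + alpha).sum(axis=1)))
    return float(per_row.sum())

# Toy transition counts (hypothetical) that mostly follow 0 -> 1 -> 2 -> 0.
counts = np.array([[0, 8, 2], [1, 0, 9], [7, 3, 0]], dtype=float)
hypotheses = {
    "uniform": np.ones((3, 3)),                       # baseline
    "cycle": np.array([[0, 1, 0], [0, 0, 1], [1, 0, 0]], dtype=float),
}
kappas = [1, 10, 100]

# One evidence curve per hypothesis, evaluated at the same kappa values.
curves = {name: [log_evidence(counts, dirichlet_prior(b, k)) for k in kappas]
          for name, b in hypotheses.items()}

# Log Bayes factor of "cycle" over the uniform baseline at each kappa.
log_bayes_factors = [c - u for c, u in zip(curves["cycle"], curves["uniform"])]
for k, lbf in zip(kappas, log_bayes_factors):
    print(f"kappa={k}: log Bayes factor (cycle vs. uniform) = {lbf:.2f}")
```

Working in log space avoids numerical underflow for realistic dataset sizes; a positive log Bayes factor favors the first hypothesis, and the values in `curves` are what would be drawn as one line per hypothesis over \(\kappa \).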

3 Applications

Next, we outline three real-world applications of this technique.

3.1 Urban Mobility in Flickr

In a first study, we focused on geo-temporal trails derived from Flickr. In particular, we crawled all photos on Flickr with geo-spatial information (i.e., latitude and longitude) from 2010 to 2014 for four major cities (Berlin, London, Los Angeles, and New York). We used a map grid to construct a discrete state space of locations. Then, for each user who uploaded pictures of a city, we created a sequence of locations based on the picture locations. On these sequences, we evaluated a variety of hypotheses such as a proximity hypothesis (the next location is near the current one), a point-of-interest hypothesis (the next location will be at a tourist attraction or transportation hub), a center hypothesis (the next location will be close to the city center), and combinations of them. The resulting rankings are mostly consistent across cities. Combinations of proximity and point-of-interest hypotheses are overall most plausible. Figure 1 shows example results for Berlin.
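As an illustration of how such a hypothesis can be written down, the sketch below builds a belief matrix for a proximity hypothesis on a rectangular map grid. The Gaussian decay, the bandwidth `sigma`, the zeroed diagonal, and the function name are assumptions made for this example; they are not taken from the study itself.

```python
import numpy as np

def proximity_belief(n_rows, n_cols, sigma=1.0):
    """Belief matrix over grid cells in which nearby cells get higher belief.

    Cells are numbered row by row; `sigma` (in cell widths) is a
    hypothetical bandwidth controlling how fast belief decays with distance.
    """
    rows, cols = np.divmod(np.arange(n_rows * n_cols), n_cols)
    coords = np.stack([rows, cols], axis=1).astype(float)
    # Squared Euclidean distance between all pairs of cell centers.
    diff = coords[:, None, :] - coords[None, :, :]
    dist_sq = (diff ** 2).sum(axis=2)
    belief = np.exp(-dist_sq / (2.0 * sigma ** 2))
    # Assumption: no belief is placed on staying in the same cell.
    np.fill_diagonal(belief, 0.0)
    return belief

belief = proximity_belief(2, 3)  # 2 x 3 grid -> 6 states
```

Combined hypotheses, such as proximity together with points of interest, could be formed in the same spirit by multiplying or mixing belief matrices element-wise before deriving the prior.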

3.2 Taxi Usage in Manhattan

In a second study, we again investigated trails of urban mobility. In particular, we studied a dataset of taxi trails in Manhattan. In this study, we used tracts (small administrative units) as the state space of locations. Using additional information on these tracts extracted from census data and data from the FourSquare API, we investigated more than 60 hypotheses such as “taxis drive to tracts with similar ethnic distribution” or “taxis will drive to popular locations w.r.t. check-ins”. We also performed spatio-temporal clustering of the sequence data and applied HypTrails to the individual clusters to find behavioral traits that are typical for certain times and places. For instance, we discovered a group of taxi rides to locations with a high density of party venues on weekend nights.

3.3 Link Usage in Wikipedia

In another work, we studied transitions between articles in the online encyclopedia Wikipedia. In particular, we were interested in which links on a Wikipedia page are used frequently. For that purpose, we applied HypTrails to a recently published dataset of all transitions between Wikipedia pages for one month, using the set of all articles as the state space. We considered hypotheses based on visual features of the links (e.g., “links in the lead paragraph get clicked more often” or “links in the main text get clicked more often”), hypotheses based on text similarity between articles, and hypotheses based on the structure of the link network of Wikipedia articles. As a result, hypotheses which assume that people prefer links at the top and on the left-hand side, and hypotheses that express a belief in more frequent usage of links towards the periphery of the article network, are most plausible.

4 Conclusion

In this work, we gave a short introduction to the HypTrails approach, which allows comparing the plausibility of hypotheses about the generation of sequential datasets. Additionally, we described three real-world applications of this technique for studying urban mobility and reading behavior in Wikipedia.

References

1. Becker, M., Singer, P., Lemmerich, F., Hotho, A., Helic, D., Strohmaier, M.: Photowalking the city: comparing hypotheses about urban photo trails on Flickr. In: International Conference on Social Informatics (SocInfo), pp. 227–244 (2015)
2. Dimitrov, D., Singer, P., Lemmerich, F., Strohmaier, M.: What makes a link successful on Wikipedia? In: International World Wide Web Conference, pp. 917–926 (2017)
3. Espín Noboa, L., Lemmerich, F., Singer, P., Strohmaier, M.: Discovering and characterizing mobility patterns in urban spaces: a study of Manhattan taxi data. In: International Workshop on Location and the Web, pp. 537–542 (2016)
4. Kass, R.E., Raftery, A.E.: Bayes factors. J. Am. Stat. Assoc. 90(430), 773–795 (1995)
5. Singer, P., Helic, D., Hotho, A., Strohmaier, M.: HypTrails: a Bayesian approach for comparing hypotheses about human trails on the web. In: International World Wide Web Conference, pp. 1003–1013 (2015)

Author information

Authors and Affiliations

GESIS - Leibniz Institute for the Social Sciences, Mannheim, Germany
Florian Lemmerich, Philipp Singer, Lisette Espin-Noboa, Dimitar Dimitrov & Markus Strohmaier
University of Würzburg, Würzburg, Germany
Philipp Singer, Martin Becker & Andreas Hotho
Graz University of Technology, Graz, Austria
Philipp Singer & Denis Helic
RWTH Aachen, Aachen, Germany
Philipp Singer & Markus Strohmaier

Corresponding author

Correspondence to Florian Lemmerich.

Editor information

Editors and Affiliations

Google Research, Google Inc., Zurich, Switzerland
Yasemin Altun
NASA Ames Research Center, Mountain View, USA
Kamalika Das
Oath, Sunnyvale, USA
Taneli Mielikäinen
Department of Computer Science, University of Bari Aldo Moro, Bari, Italy
Donato Malerba
Institute of Computing Science, Poznan University of Technology, Poznan, Poland
Jerzy Stefanowski
Laboratoire d’ Informatique (LIX), École Polytechnique, Palaiseau, France
Jesse Read
Department of Computer Science, Stanford University, Stanford, USA
Marinka Žitnik
Università degli Studi di Bari Aldo Moro, Bari, Italy
Michelangelo Ceci
Jožef Stefan Institute, Ljubljana, Slovenia
Sašo Džeroski

Cite this paper

Lemmerich, F. et al. (2017). Comparing Hypotheses About Sequential Data: A Bayesian Approach and Its Applications. In: Altun, Y., et al. Machine Learning and Knowledge Discovery in Databases. ECML PKDD 2017. Lecture Notes in Computer Science, vol 10536. Springer, Cham. https://doi.org/10.1007/978-3-319-71273-4_30

DOI: https://doi.org/10.1007/978-3-319-71273-4_30
Published: 30 December 2017
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-71272-7
Online ISBN: 978-3-319-71273-4
eBook Packages: Computer Science, Computer Science (R0)
