Feasibility of estimating travel demand using geolocations of social media data

Abstract

Travel demand estimation, as represented by an origin–destination (OD) matrix, is essential for urban planning and management. Compared to data typically used in travel demand estimation, the key strengths of social media data are that they are low-cost, abundant, available in real-time, and free of geographical partition. However, the data also have significant limitations: population and behavioural biases, and lack of important information such as trip purpose and social demographics. This study systematically explores the feasibility of using geolocations of Twitter data for travel demand estimation by examining the effects of data sparsity, spatial scale, sampling methods, and sample size. We show that Twitter data are suitable for modelling the overall travel demand for an average weekday but not for commuting travel demand, due to the low reliability of identifying home and workplace. Collecting more detailed, long-term individual data from user timelines for a small number of individuals produces more accurate results than short-term data for a much larger population within a region. We developed a novel approach using geotagged tweets as attraction generators as opposed to the commonly adopted trip generators. This significantly increases usable data, resulting in better representation of travel demand. This study demonstrates that Twitter can be a viable option for estimating travel demand, though careful consideration must be given to sampling method, estimation model, and sample size.

Introduction

Travel demand estimation is essential for urban planning and management of transportation networks. The time series of visits to various locations by individuals are aggregated to study the flows of people between different zones/regions. Based on the spatio-temporal scale of the aggregation, an origin–destination (OD) matrix can be constructed with the origins and destinations of all trips. These OD matrices are particularly important for representing travel demand (Calabrese et al. 2011). Traditionally, the estimation of OD matrices relies on input data from household travel surveys, censuses, and traffic surveys that feature representative populations and detailed information about travel mode and trip purposes. However, data collection frequency, methods, and data availability vary across countries (and across cities within a country), making it difficult to interpret the results. For example, in the UK and the Netherlands the travel surveys are done annually, but that is an exception. Other places do not do them regularly, if at all. Portugal had one travel survey carried out for two metro areas in 2017, but nothing more since. Otherwise, mobility is derived from the census data (carried out every 10 years), but that offers a different resolution since it is not based on travel diaries. On top of these issues, the costs of these surveys are increasing, while the response rates are decreasing over time (Yue et al. 2014), making it hard to keep the travel demand models up to date. Emerging data sources associated with mobile/smart phones are increasingly leveraged to overcome these drawbacks.

In the last decade, the emerging data sources have significantly improved our understanding of travel behaviour (Gonzalez et al. 2008; Song et al. 2010; Barbosa et al. 2018) and have brought new opportunities for travel demand modelling (Anda et al. 2017). Common emerging data sources are call detail records (CDR) (Calabrese et al. 2011), smart card data, GPS-enabled devices, and geotagged social media, e.g., Twitter (Lee et al. 2019; Hasnat and Hasan 2018).

Alongside the development of information and communication technologies (ICT), interest in online social media services, e.g. Twitter, has grown among the transportation research community (Rashidi et al. 2017). A tweet typically contains multiple components that can be useful for transport research, including text, hashtag, location, and timestamp. When users choose to have their location reported when sending out tweets, these are called geotagged tweets. Despite geotagged tweets accounting for a small proportion (1–3%) of all tweets (Morstatter et al. 2013), these check-ins provide precise location information and have increasingly been used for estimating mobility and travel demand either at the global (e.g. Hawelka et al. 2014) or regional level (e.g. Yang et al. 2015).

In the estimation of travel demand, two forms of data are often used: longitudinal and lateral. A longitudinal data set is characterised by long-term (more than 24 h) and continuous observations focusing on a group of participants. A lateral data set is often collected based on a particular area, such as a city or a country, during a short to medium time period, and it usually covers a larger population. Thus, the data offer either broader or longer coverage, but rarely both.

Geotagged tweets can be obtained in three ways: (1) Purchase the complete set of public tweets from Twitter Firehose (Twitter 2019c); (2) Access the Streaming API to get a maximum of 1% of the public tweets (Twitter 2019a); (3) Access the user timeline by user name/ID to get a maximum of 3200 historical tweets that are set by the user as publicly accessible (Twitter 2019b). Different collection channels of geotagged tweets correspond to different data forms. Sampling methods (1) and (2) collect geotagged tweets generated within a specified region, while sampling method (3) collects data from user timelines without any spatial boundaries.

Geotagged tweets collected from Twitter Firehose and Streaming API are often limited to a geographical bounding box yielding a lateral data set. It covers a large number of Twitter users but takes time to accumulate enough samples for each individual, and movements outside or across the bounding box are not captured (Liao et al. 2019). Alternatively, by accessing User Timeline API, all publicly available historical tweets by a specific user can be collected to form a longitudinal record of individual trajectories without any geographical boundaries. Longitudinal geotagged tweets are collected without being constrained to a specific area, but typically with a smaller number of individuals, albeit a much larger overall sample size (one to two orders of magnitude more samples per user).

Most studies use geotagged tweets in the lateral form, focusing on a specified area in line with the spatial scale of policy-making and urban planning. For example, one study modifies a classic movement model by integrating locations posted on Foursquare (which Twitter integrates) for origin–destination estimation in Austin, Texas (Jin et al. 2014). Longitudinal data can also be scaled up to large numbers of Twitter users to study the OD flows between global cities (Lenormand et al. 2015).

One recent literature review shows that experts are optimistic about the usefulness of such data sources for modelling travel behaviour (Rashidi et al. 2017). Compared with the other data sources, geotagged tweets have several strengths: long collection duration, large number of studied individuals, large spatial coverage, ease of access, low cost, and accurate location information. The low cost of retrieving geotagged tweets makes them especially appealing compared to other data sources (Rashidi et al. 2017). The data source is free to access, and it provides precise location information with a spatial resolution of around 10 m compared with 100–200 m for call detail records (CDR) (Jurdak et al. 2015). Moreover, it is relatively scale free, i.e. analyses can be done with any desired time frame and spatial boundaries based on the research question at hand (Liao and Yeh 2018).

Despite the wide applications, rigorous cross-validation of the use of emerging data sources, such as geotagged social media data, to approximate the travel demand, and their robustness across spatial and temporal scales is still lacking. The main criticism of Twitter data pertains to two aspects: a biased population representation, and low and irregular sampling. Geotagged tweets can capture movements over multiple years and include overseas visits, but the data are “sparse”, thus the picture of actual movements is incomplete (Liao et al. 2019). There have been studies comparing multiple data sources to identify/adjust the biases (e.g. Wesolowski et al. 2013; Tasse et al. 2017) and to validate against “ground truth” (e.g. Lee et al. 2019). It is worth noting, however, that the “ground truth” is also an incomplete picture of reality, as it is, at best, based on the knowledge from well-recognised but limited data collection and established modelling techniques.

This study attempts to comprehensively examine the validity of using geotagged Twitter data for travel demand estimation by comparing Twitter data sets with established data sources. We first compare the empirical trip records with respect to the commuting travel demand and the overall travel demand for an average weekday. We then create gravity models based on Twitter data to estimate the overall travel demand at both the national (long-distance travel above 100 km) and city level. Finally, we compare Twitter-based OD matrices and trip distance distributions with those from the other established sources using spatially weighted structural similarity index and Kullback–Leibler divergence, respectively.

The main contributions of this study lie in the quantification of the feasibility of using geolocations of Twitter data for estimating commuting demand and the overall travel demand, given different sample sizes, sampling methods of Twitter data, and spatial scales. In addition, we develop a novel approach using geotagged tweets as attraction generators as opposed to the commonly adopted trip generators. This significantly increases usable data, resulting in better representation of travel demand and the promise for using Twitter data at a finer spatiotemporal resolution.

The remainder of this paper is organised as follows. “Related work” section reviews work related to travel demand estimates using social media data and outlines the objectives of the present study. “Data description” section describes the data, and “Methodology” section describes the methods used. The results are presented in “Results” section, and “Discussion” section discusses the findings. “Conclusion” section concludes and identifies future research needs.

Related work

Modelling travel demand

Travel demand estimation first requires extracting activities and trips. Twitter data have proven useful here for both conventional four-step modelling and activity-based modelling, because activities and trips can be inferred from them. There has been increased interest in developing methods to infer this information using social media check-in data, such as Twitter data. One recent study has demonstrated that Twitter data can be integrated with a household travel survey to improve the quality of OD matrices (Cheng et al. 2020). Constructing activity-based models requires trip purpose, departure time, and socioeconomic attributes of travellers, among other attributes. The content of geotagged tweets is often used with text mining to extract those attributes, e.g., the activity purposes such as work and leisure and the socio-economic profile of Twitter users (Hasan and Ukkusuri 2014; Abbasi et al. 2015; Maghrebi et al. 2015).

The methodology of four-step travel demand modelling (McNally 2007) consists of trip generation and trip distribution as the first two steps. It starts from the definition of a trip, which is the connection between two consecutive stays generated by the same individual. This individual refers to a phone user when using CDR data (Calabrese et al. 2011), or a survey participant from a one-day travel diary. When it comes to geotagged social media data, a trip is generally defined in the literature as the connection between two consecutive geotagged tweets generated by the same Twitter user. However, due to the sparsity and incomplete trajectory of geotagged tweets, the time interval between two consecutive geotagged tweets can be extremely long (from a few hours to several weeks/months), while the air distance can be close to zero. Therefore, in this context, “displacement” is a more appropriate term than the traditional sense of the trip. Despite a displacement in geotagged tweets being different from a record in a travel diary, existing literature often uses these two terms interchangeably.

Trip generation

Trip generation involves the estimation of the number of trips produced by and attracted to each zone, either using empirical data directly, or modelled results based on zonal demographics and land use information.

Social media data such as displacements in Twitter data need to be processed to become trips. Gao et al. (2014), Kheiri et al. (2015), and Lee et al. (2019) propose a displacement conversion that filters out displacements with time intervals longer than a selected threshold, e.g. 4 h, 12 h, or 24 h. However, this threshold is arbitrary, and applying it massively reduces the available data.

Instead of geotagged displacements, one can model destination choices to estimate zonal attractiveness. Hasnat et al. (2019) applied Twitter data together with census tract data for modelling travellers’ destination choice behaviour, which suggests that Twitter data can be utilised effectively for modelling destination choices that reflect the attractions of zones. Molloy and Moeckel (2017) develop a long-distance destination choice model using Foursquare check-ins whose results suggest that check-ins from social media platforms can improve destination choice models, particularly for leisure travel.

Trip distribution

Trips are further aggregated to OD zones depending on the spatial scale. The step of trip distribution assigns the trips produced by each zone to each of the other zones to which these trips are attracted (Anda et al. 2017). There are many models to assign the number of trips between each pair of OD zones. In a study by Yang et al. (2015) of the Chicago metropolitan region, daily check-ins from Foursquare are used to estimate the productions and attractions in each traffic analysis zone as inputs to gravity models for estimating trip distribution. By further calibrating against the OD matrix from other data sources such as CDRs, they demonstrate how to use gravity models with check-in data to estimate the OD matrix. Kheiri et al. (2015) use the radiation model, rank-based model, and population-weighted opportunities model to distribute the trips generated with Foursquare check-ins to estimate the OD matrix.
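To make the role of productions and attractions in trip distribution concrete, the following is a minimal sketch of a production-constrained gravity model. It is not the formulation of any of the cited studies: the exponential deterrence function, the parameter `beta`, the exclusion of intrazonal trips, and the function name `gravity_od` are all assumptions made for illustration.

```python
import math


def gravity_od(productions, attractions, distances, beta=0.1):
    """Production-constrained gravity model (illustrative sketch).

    productions[i]: trips produced by zone i
    attractions[j]: attractiveness of zone j
    distances[i][j]: distance between zones i and j
    beta: assumed decay parameter of an exponential deterrence function
    """
    n = len(productions)
    od = [[0.0] * n for _ in range(n)]
    for i in range(n):
        # Unnormalised pull of each destination j as seen from origin i.
        # Intrazonal trips (i == j) are excluded here by assumption.
        pull = [
            attractions[j] * math.exp(-beta * distances[i][j]) if j != i else 0.0
            for j in range(n)
        ]
        total = sum(pull)
        if total == 0:
            continue
        # Distribute the trips produced by zone i in proportion to the pull,
        # so that each row of the OD matrix sums to productions[i].
        for j in range(n):
            od[i][j] = productions[i] * pull[j] / total
    return od


productions = [100.0, 50.0, 20.0]
attractions = [10.0, 30.0, 60.0]
distances = [[0, 5, 10], [5, 0, 8], [10, 8, 0]]
od = gravity_od(productions, attractions, distances)
```

In practice the deterrence function and its parameters would be calibrated against an observed OD matrix or trip distance distribution rather than fixed in advance.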

Commuting travel demand estimation

Estimating the OD matrix according to trip purpose points toward more specific applications. Commuting flows account for a large share of total trips, therefore they attract more attention. For example, Zagatti et al. (2018) use CDRs to estimate an OD matrix of commuting flows. For social media data, some data sources have trip purposes (activity types), such as Foursquare, while Twitter data do not directly provide this information. With a small share of check-ins at home/workplace from Foursquare when compared with the actual daily mobility, Yang et al. (2015) focus on non-commuting trips. To construct OD matrices of commuting flows with geotagged tweets or CDRs, one needs to detect home/workplace when the trip purpose is not explicitly given. Schneider et al. (2013) assume that the most visited location during weekends and 7 pm–8 am on weekdays is the home location and the second most visited location during 8 am–8 pm on weekdays is identified as one’s workplace. Combining such temporal rules and visiting frequency, this method has been widely used to identify the home/workplace through social media data (Wang et al. 2018; Osorio-Arjona and García-Palomares 2019), sometimes together with land-use information (Osorio-Arjona and García-Palomares 2019).

Efforts that infer the home/workplace from geotagged tweets must consider the behavioural bias of people geotagging consciously and intentionally in uncommon places to communicate and show where they have been (Tasse et al. 2017). Home and workplace are at the opposite extreme, i.e., they are the most common places that people visit on a daily basis. A preliminary comparison between Twitter data and the national travel survey suggests that home and workplace have a low probability of being reported, so the validity of estimating commuting OD matrices from geotagged tweets requires further scrutiny.

Validation against other data sources

Researchers have devoted efforts to validating geotagged tweets with other data sources. A study focusing on the U.S. found that densely populated regions and males were over-represented among Twitter users (Mislove et al. 2011). In addition, there are two possible types of behavioural distortion for Twitter users who geotag: only tweeting at specified locations or times, and geotagging only certain or all of the tweets.

When cross-validating against data with higher temporal resolution such as CDR (Lenormand et al. 2014), good agreement is generally found regarding, for instance, trip distance distribution. When validating geotagged tweets against travel surveys, studies show that geotagged social media data capture the displacement distribution, length, duration, and start time of trips reasonably well for the purpose of inferring individual travel behaviour (Zhang et al. 2017; Liao et al. 2019). Validations using CDR need careful interpretation, as CDR and geotagged tweets are both passive data collection methods that share some similar shortcomings.

Good agreement on fundamental indicators of individual travel behaviour does not necessarily guarantee a good proxy for the travel demand at the population level. Some studies comparing geotagged tweets with traffic data (Ribeiro et al. 2014) and travel-demand data (Lee et al. 2015, 2019; Yang et al. 2015) have generally achieved good results. However, as pointed out recently by Lee et al. (2019), the sparsity of geotagged tweets leads to sparse OD matrices, so Twitter data cannot replace other travel demand forecasting methods for state-wide travel models.

Study objectives

The work comparing geotagged tweets with other data sources for travel demand estimation still lacks systematic rigour in at least four areas: (1) Commuting travel demand. The basic temporal technique to identify home or workplace has been widely applied for deriving commuting trips. Our preliminary results from previous analyses suggest that identifying home and workplace locations through geotagged tweets gives mixed results and the reliability of the method requires further scrutiny; (2) Spatial scale. Most studies look at pre-selected regions without exploring the effects of spatial scales on travel demand estimation, whereas we hypothesise that the feasibility of using Twitter data for travel demand estimation can depend on the scale; (3) Sampling methods. The existing literature is not clear on how different sampling methods (region-based vs. user-based) affect the validity of using geotagged tweets to estimate travel demand; (4) Sample size. It remains unclear how the sparsity of Twitter data affects the validity of using it for travel demand estimation.

To fill these gaps in the literature, we systematically examine the validity of using geotagged tweets collected within a specified region, and from user timelines, to approximate the OD matrix at different spatial scales. We compare these Twitter-based OD matrices with the Swedish national travel survey and output from Swedish Transport Administration (Trafikverket) traffic models. Specifically, we attempt to answer the following questions:

Are Twitter data a feasible source for representing commuting travel demand?
Can geolocations of Twitter data be used to create models for travel demand estimation?
How do spatial scale, sampling method, and sample size of Twitter data affect its representativeness for travel demand?

Data description

This study focuses on Sweden as a whole and on Greater Gothenburg, located in western Sweden. Sweden is a European country with a population of 10.2 million in 2019 and a GDP per capita of 54.6 kUSD in 2018 (Statistics Sweden). Gothenburg is its second largest city, and Greater Gothenburg, its metropolitan area, has a population of around 1 million.

Specifically, four datasets are used in this study. Two are Twitter datasets collected using different sampling methods: lateral geotagged tweets (Twitter LT) and longitudinal geotagged tweets (Twitter LD). The other two are the datasets with which the Twitter data are compared: the Swedish National Travel Survey, and OD matrices from the Sampers model, a traffic simulation model with the travel demand module embedded, developed by the Swedish Transport Administration. The traffic zones used by Sampers are illustrated in Fig. 1 for two spatial scales: Greater Gothenburg (city level) and Sweden (national level). Detailed descriptions of each dataset are presented in this section.

Twitter data

Lateral geotagged tweets (Twitter LT)

We purchased data from Gnip, a Twitter subsidiary, during a 6-month period (20 December 2015–20 June 2016) within the geographical bounding box of Sweden (Jeuken 2017; Liao et al. 2019). Gnip sells complete historical tweets in bulk and provides access to the Firehose API.

Longitudinal geotagged tweets (Twitter LD)

We identify 7773 top geotag users from Twitter LT who geotagged their tweets most frequently during that 6-month period. We extract those top users’ historical tweets using Twitter User Timeline API, without applying a spatial boundary limit. This method has a maximum number of tweets that can be collected from a specified user, producing varied time spans and varied tweet numbers, as not all users reached the 3200-tweet maximum.

Preprocessing and statistics of Twitter data

All the geotagged tweets are preprocessed to reduce potential artefacts that could bias the travel demand estimation. First, we only keep tweets that were generated from mobile devices. Second, users whose geotagged tweets all come from a single place are removed, because such accounts are typically bots, e.g., for job posting or weather updates (Ek and Wennerberg 2020). Third, Twitter users can cross-post geotagged tweets from other social media platforms, in which case the location of a place is posted instead of the tweet’s precise geolocation, for example, the centre of Sweden or the centre of Gothenburg. These geotagged tweets without precise GPS coordinates are also removed. Finally, two filters are implemented for Twitter LD only. The top geotag Twitter users who have fewer than 50 geotagged tweets in total are removed. In addition, given the long time span of a Twitter timeline, a user might have migrated from one country to another. To avoid confusion, we only keep the latest time period of the geotagged tweets during which a Twitter user is assumed to live in Sweden. For the national level, all the Twitter LT and LD are used, while for the city level, only those geotagged tweets within the boundary of Greater Gothenburg are used.

We derive the home and workplace locations from Twitter LD given the larger numbers of geotagged tweets per user. The home location is identified as the most-visited location on weekends and between 7pm and 8am on weekdays, whereas the most visited non-home location between 8am and 8pm on weekdays is identified as the user’s workplace (Schneider et al. 2013; Wang et al. 2018; Osorio-Arjona and García-Palomares 2019).
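The temporal rules above can be sketched as follows. This is an illustration under stated assumptions, not the implementation used in the study: each visit is assumed to be already assigned to a discrete location identifier (how nearby coordinates are grouped into locations is not specified here), the handling of the boundary hours is a choice made for this sketch, and the names `is_home_time`, `is_work_time`, and `infer_home_work` are hypothetical.

```python
from collections import Counter
from datetime import datetime


def is_home_time(ts):
    # Weekends, or 7pm to 8am on weekdays.
    if ts.weekday() >= 5:
        return True
    return ts.hour >= 19 or ts.hour < 8


def is_work_time(ts):
    # Weekdays, 8am to 8pm. Note that 7pm to 8pm on weekdays falls
    # inside both windows, exactly as the rule is stated in the text.
    return ts.weekday() < 5 and 8 <= ts.hour < 20


def infer_home_work(visits):
    """visits: list of (datetime, location_id) for one user (assumed format)."""
    home_counts = Counter(loc for ts, loc in visits if is_home_time(ts))
    home = home_counts.most_common(1)[0][0] if home_counts else None
    # Workplace: most visited non-home location during working hours.
    work_counts = Counter(
        loc for ts, loc in visits if is_work_time(ts) and loc != home
    )
    work = work_counts.most_common(1)[0][0] if work_counts else None
    return home, work


visits = [
    (datetime(2016, 1, 4, 22, 0), "A"),  # Monday evening
    (datetime(2016, 1, 5, 7, 0), "A"),   # Tuesday early morning
    (datetime(2016, 1, 2, 14, 0), "A"),  # Saturday
    (datetime(2016, 1, 4, 9, 0), "A"),   # Monday, working hours, at home location
    (datetime(2016, 1, 4, 10, 0), "B"),  # Monday, working hours
    (datetime(2016, 1, 5, 15, 0), "B"),  # Tuesday, working hours
    (datetime(2016, 1, 6, 12, 0), "C"),  # Wednesday, working hours
]
home, work = infer_home_work(visits)
```

Because the rule depends only on visit counts within time windows, a user who rarely geotags at home or at work can easily be assigned the wrong locations, which is the reliability concern raised in the text.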

Following the practice in the literature to account for the fact that Twitter users are not representative of the overall population, we assign weights to individual Twitter users in Twitter LD. The weight is the ratio of Twitter users to the true population in the municipality (Wang et al. 2018). The trips of the Twitter users in Twitter LD are aggregated and multiplied by their individual weights to derive a population-level travel demand estimation. The Twitter users’ distributions are found to correlate with the census (Kendall’s tau = 0.65, $p<0.001$). However, top Twitter users tend to be over-represented in big cities, especially the top three cities in Sweden: Stockholm (Twitter = 18% vs. Census = 9.4%), Gothenburg (6.9% vs. 5.6%), and Malmö (4.5% vs. 3.3%).

The basic statistics of Twitter LT and LD are summarised in Table 1. Compared with Twitter LT, Twitter LD, which is collected from user timelines without any spatial bounding box, covers a longer time span and contains more geotagged tweets both per user and in total, but covers a smaller population. The distribution of the number of total geotagged tweets per user is shown in Fig. 2.

Table 1 Statistics of Twitter data used in this study

Swedish national travel survey (Survey)

The survey data come from the Swedish National Travel Survey (one-day travel diary) for the years 2011 to 2016 (Official Statistics of Sweden 2016). It consists of a total of 171,553 trips from 38,258 participants covering 2189 record days, with detailed information on each trip’s origin and destination, distance, and travel time, and on each participant’s home/workplace. The spatial resolution is the municipality level.

Model-based travel demand estimations (Sampers)

The Swedish Transport Administration uses the Sampers model to calculate changes in traffic volumes under different scenarios. Both the city level and the national level have their own traffic analysis zones that follow the census boundaries and homogeneous socioeconomic characteristics. These spatial zones are used for creating OD matrices with Twitter data so that we can compare Twitter with Sampers’ model output.

Sampers calculates travel demand based on studies of travel habits derived from travel surveys, looking at where, how, and how often people want to travel, which forms the OD matrices. The model output represents the total travel demand for an average weekday. We used the latest OD matrices (2014) from Sampers for Greater Gothenburg and for the whole of Sweden. At the national level, we focus on the long-distance trips of the Sampers model ($\ge$ 100 km).

Methodology

In order to examine the feasibility of using Twitter data for travel demand estimation, we use an analytic framework to compare Twitter with the other established data sources, as shown in Fig. 3. In practice, transport planners collect empirical trip data from a small sample of the population and create a model to simulate the travel demand of the overall population for further application, such as traffic flows modelling. Therefore, we divide the comparison into two focuses: empirical trip records (“Trip records” section) and model output (“Travel demand model construction” section).

We first compare the empirical trip records obtained from Twitter with those from travel survey data with respect to the overall travel demand for an average weekday (“Processing weekday trips” section) and commuting travel demand (“Processing commuting trips” section). In this part of the validation, we also examine the stability of the similarity between Twitter and the travel survey over time. After the analysis of the empirical trips, we create gravity models, based on Twitter data collected with two sampling methods, to simulate the overall travel demand at both the national (long-distance travel above 100 km) and city level. For trip generation (“Trip generation” section), we use two methods: trips converted from displacements by applying a time threshold (Model A) and the density-based approach proposed in this study (Model B). Both are followed by the gravity model for trip distribution (“Trip distribution” section). Model B is proposed as an alternative to Model A to address the sparsity of Twitter data. Finally, we evaluate the results (“Evaluation of Twitter OD matrices” section) by comparing the Twitter-based trips and model outcomes with those from the national travel survey (Survey) and the Sampers model. The techniques used for the comparison include visualisation, a similarity measure (“Spatially weighted structural similarity index” section), and trip distance distributions.

Trip records

Processing weekday trips

We define geotagged displacements by connecting every two consecutive geotagged tweets generated by the same user. To convert these displacements into trips, a time threshold can be used to filter out those displacements that have a time interval longer than a predefined threshold (Gao et al. 2014; Kheiri et al. 2015; Lee et al. 2019). We select 270 min, i.e., the 99th percentile of travel time between municipalities from Survey, as the time threshold for the national-level trip generation. For the city-level trip generation, the time threshold of 140 min is selected which is the 99th percentile of travel time within the corresponding county where most parts of Greater Gothenburg are located.
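The conversion from displacements to trips described above can be sketched as follows. The input format, in which each tweet is a tuple of user, timestamp, and zone, and the function name `displacements_to_trips` are assumptions for illustration. The sketch keeps displacements that start and end in the same zone; whether to drop them is a separate choice that is not specified here.

```python
from datetime import datetime, timedelta


def displacements_to_trips(tweets, max_gap_minutes=270):
    """Keep consecutive-tweet displacements whose time gap is within the threshold.

    tweets: list of (user_id, datetime, zone) tuples (assumed format)
    max_gap_minutes: 270 for the national level, 140 for the city level
    Returns a list of (user_id, origin_zone, destination_zone).
    """
    by_user = {}
    for user, ts, zone in tweets:
        by_user.setdefault(user, []).append((ts, zone))

    max_gap = timedelta(minutes=max_gap_minutes)
    trips = []
    for user, records in by_user.items():
        # Order each user's tweets in time, then pair consecutive records.
        records.sort(key=lambda r: r[0])
        for (t0, z0), (t1, z1) in zip(records, records[1:]):
            if t1 - t0 <= max_gap:
                trips.append((user, z0, z1))
    return trips


tweets = [
    ("u1", datetime(2016, 3, 1, 8, 0), "A"),
    ("u1", datetime(2016, 3, 1, 9, 0), "B"),   # 60 min after the previous tweet
    ("u1", datetime(2016, 3, 1, 20, 0), "C"),  # 660 min after the previous tweet
    ("u2", datetime(2016, 3, 1, 10, 0), "A"),  # single tweet, no displacement
]
national_trips = displacements_to_trips(tweets, max_gap_minutes=270)
```

In this toy input only the first displacement of user `u1` survives the filter, which illustrates how quickly a time threshold reduces the usable data when tweets are sparse.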

Survey contains complete sets of trip records at the municipality level. By directly aggregating the weekday records, we get the OD matrix of the overall trip records for an average weekday. The low cost of collecting Twitter data makes it easier to keep them updated over time. However, their actual use also depends on the stability of the similarity between Twitter trips and Survey trips over time. Instead of aggregating the records available, we look into the similarity of OD matrices from 2011 to 2016 at the national level by aggregating the records yearly.

Processing commuting trips

To construct commuting flows with Twitter LD, we define trips connecting home (origin) and workplace (destination), and aggregate those trips at the municipality level. This gives the national commuting OD matrix based on Twitter LD.

To evaluate a Twitter-based commuting OD matrix, we need to construct an equivalent Survey-based OD matrix. Survey has the home and workplace of each participant at the municipality level, and each participant is assigned an individual weight that reflects how representative his/her socio-demographic profile is of the overall Swedish population, with respect to the time period of participating in the survey, region, age, and gender. Specifically, the weight is the ratio between the population and the number of survey respondents in the respective stratum. By linking home and workplace as a commuting trip for a given individual, multiplied by his/her individual weight, we aggregate all the commuting trips and construct the national commuting OD matrix.
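The weighted aggregation can be sketched as follows. The input format, the zone labels, the weights, and the function names `stratum_weight` and `commuting_od` are hypothetical and serve only to illustrate the calculation described above.

```python
from collections import defaultdict


def stratum_weight(population, respondents):
    # Weight of one respondent: how many people in the stratum he/she represents.
    return population / respondents


def commuting_od(persons):
    """Aggregate weighted home-to-workplace links into a commuting OD matrix.

    persons: iterable of (home_zone, work_zone, weight) tuples (assumed format)
    Returns a dict mapping (home_zone, work_zone) to the weighted commuter count.
    """
    od = defaultdict(float)
    for home, work, weight in persons:
        # Individuals without a known home or workplace contribute no commute.
        if home is None or work is None:
            continue
        od[(home, work)] += weight
    return dict(od)


persons = [
    ("A", "B", stratum_weight(5000, 20)),  # represents 250 people
    ("A", "B", stratum_weight(3000, 10)),  # represents 300 people
    ("B", "A", stratum_weight(1200, 10)),  # represents 120 people
    ("C", None, stratum_weight(900, 10)),  # no workplace recorded
]
od = commuting_od(persons)
```

The same aggregation applies to Twitter LD once home and workplace have been inferred for each user, with the Twitter user weights taking the place of the survey weights.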

Travel demand model construction

This section introduces the method of taking empirical trips to create modelled output of travel demand (OD matrix). The method consists of two steps, trip generation (“Trip generation” section) and trip distribution (“Trip distribution” section).