Motivation

Location-based social networks/media such as Foursquare have gained a lot of attention in recent years and have turned into platforms for exploring the urban environment and obtaining recommendations. Even though the recommendations provided by (the proprietary algorithm of) Foursquare certainly consider a variety of features, a large number of research studies - including my own! - use the number of check-ins to a venue as a proxy for its quality. However, the "age" of a venue is strongly correlated with the number of check-ins it has generated. Hence, the fact that venue X has double the number of check-ins of venue Y does not necessarily mean that its quality is better as well; it might just be that X is older than Y.

In this post, we explore the effect of the age of a venue on its popularity, as captured by the number of check-ins it has accumulated. While using the number of check-ins as a proxy for the quality of a venue might not be critical in a research study (especially if the goal is to test a methodology or an algorithm), it can have serious implications in a commercial system that does not account for the time effect.

Our exploratory results indicate a strong connection between the age of a venue and the number of check-ins it has obtained. This should not come as a surprise, since similar phenomena have been empirically observed in a variety of settings (e.g., the citations accumulated by a scientific paper) and are expected from theoretical models that try to explain such observations (e.g., the preferential attachment model for network formation, where older nodes tend to have higher degrees). Hence, decomposing the effect of time on aggregated, observed counts is important, yet challenging (this post unfortunately doesn't solve this problem but merely showcases it).


Methods

To showcase the effect of time on a venue's popularity, we collected data using Foursquare's public venue API for food venues in New York City. We focused on a specific commercial type of venue, i.e., restaurants, in order to avoid any complications from analyzing venues in different contexts. In particular, we collected the creation time of the venue in the system, the aggregate number of check-ins to the venue at the time of data collection and the rating of the venue by Foursquare users (if any).

Using these data we perform the following analysis:

  • Calculate the pairwise correlations between the age, aggregate check-ins and rating of a venue
  • Analyze the age of venues that exhibit similar aggregate check-in counts, i.e., venues that belong to the same check-in class
  • Analyze the distribution of check-ins for venues of similar age, i.e., venues that were created during the same epoch in the system

We use three distinct check-in classes: (i) venues with fewer than 500 check-ins in total, (ii) venues with between 4,000 and 6,000 check-ins and (iii) venues with more than 10,000 check-ins. These classes also provide us with sets of comparable size (947, 913 and 599 venues respectively). For the age of a venue, we rescale time using the time of crawling $t_{max}$ and the creation time of the oldest venue in our dataset $t_{min}$. The absolute creation time $t$ is mapped to a scaled version $\tau$ as: $\tau=\frac{t-t_{min}}{t_{max}-t_{min}}$. Hence, $0\le \tau\le1$, and small values of $\tau$ represent older venues.
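To make the rescaling and the class definitions concrete, here is a minimal sketch in Python. The function names, the toy records and the Unix-timestamp representation are all assumptions made for illustration (they are not taken from the actual crawl), and we assume the 4,000 to 6,000 range is inclusive at both ends.

```python
def rescale_age(created_at, t_min, t_max):
    """Map an absolute creation time to tau in [0, 1]; small tau = older venue."""
    return (created_at - t_min) / (t_max - t_min)


def checkin_class(checkins):
    """Return 'a', 'b' or 'c' for the three classes in the text, else None."""
    if checkins < 500:
        return "a"
    if 4000 <= checkins <= 6000:  # assumed inclusive bounds
        return "b"
    if checkins > 10000:
        return "c"
    return None  # venue falls between classes and is left out


# Hypothetical records: (creation time as a Unix timestamp, total check-ins).
venues = [(1_250_000_000, 12_500), (1_300_000_000, 5_100), (1_390_000_000, 120)]
t_min = min(created for created, _ in venues)  # oldest venue in the dataset
t_max = 1_400_000_000                          # assumed time of crawling

for created, checkins in venues:
    tau = rescale_age(created, t_min, t_max)
    print(f"tau={tau:.3f} class={checkin_class(checkins)}")
```

Note that venues whose counts fall in the gaps between classes (e.g., 2,000 check-ins) are simply not assigned to any class in this sketch.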


Results

Pairwise correlations

We begin by examining the Pearson correlation coefficient $r$ between the number of check-ins and the rescaled age of a venue. Our results indicate that there is a medium-level ($r = -0.23$) and significant ($p$-value $\le 0.05$) correlation. This essentially means that an important fraction of the check-ins that a venue has accumulated can be attributed simply to its age - the correlation is negative since higher $\tau$ corresponds to a newer venue.
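For readers who want to reproduce this kind of check on their own data, here is a minimal sketch using only the Python standard library. The post does not say how the p-value was obtained, so the permutation test below is an assumption swapped in for illustration, and the `tau` and `checkins` lists are made-up toy values rather than the real dataset.

```python
import math
import random


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def permutation_p_value(xs, ys, n_perm=2000, seed=0):
    """Two-sided p-value: share of shuffles with |r| at least as large as observed."""
    rng = random.Random(seed)
    observed = abs(pearson_r(xs, ys))
    shuffled = list(ys)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # break any link between xs and ys
        if abs(pearson_r(xs, shuffled)) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)  # add-one correction avoids p = 0


# Toy data (hypothetical): rescaled age and total check-ins per venue.
tau = [0.05, 0.1, 0.2, 0.35, 0.5, 0.6, 0.75, 0.9]
checkins = [12000, 9000, 9500, 4000, 5200, 1500, 800, 300]

r = pearson_r(tau, checkins)
p = permutation_p_value(tau, checkins)
print(f"r={r:.2f}, permutation p-value={p:.4f}")
```

With check-in counts as heavy-tailed as the ones discussed later in this post, a rank-based coefficient (e.g., Spearman) might be a reasonable robustness check, but that is beyond what was done here.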

One can argue that most of these platforms offer readily available ratings for venues, which are provided by the users themselves and capture the quality of the establishment better. While this is true, our dataset provides some evidence that the rating can also be correlated with the age of the venue. In particular, we calculated the Pearson correlation between the rating of a venue and its rescaled age $\tau$. We find that in this case the correlation is smaller (compared to the number of check-ins), i.e., $r=-0.12$, but it is still significant ($p$-value $\le 0.05$). Of course, this again does not mean that the rating of a venue improves merely with time. It might actually be a piece of evidence (and only that!) that businesses improve themselves through the feedback they get from their customers as time progresses.

Let's now turn our attention back to the effect of time on the number of check-ins a venue has accumulated.

Age distribution for venues in the same check-in class

We define a check-in class to be a set of venues that have accumulated a similar number of check-ins over their lifetime (irrespective of their creation time). As mentioned above, we define three classes: venues with (a) fewer than 500 check-ins, (b) between 4,000 and 6,000 check-ins and (c) more than 10,000 check-ins. For every class, we compute the distribution of the rescaled age of its venues. Figure 1 presents the empirical CDFs ($F_a$, $F_b$ and $F_c$ respectively), which clearly show that the distribution of venues with a high number of check-ins is shifted to the left, i.e., towards older ages. To confirm this, we perform the one-sided Kolmogorov-Smirnov (KS) test:

$H_0: F_X = F_Y$
$H_1: F_X > F_Y$

where $F_X > F_Y$ represents the alternative hypothesis that the CDF of $X$ lies above that of $Y$. The KS test with $X=a$ and $Y=b$ gives a p-value less than $0.01$, and the same is true when $X = b$ and $Y = c$. This means that we can reject $H_0$ in both cases, that is, $F_a > F_b > F_c$ at the significance level $\alpha = 0.01$.
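The one-sided two-sample KS test can be sketched in a few lines of standard-library Python. This is an illustrative sketch, not the code used for the results above: the statistic is $D^+ = \max_t (F_X(t) - F_Y(t))$, and the p-value uses the usual large-sample approximation $\exp(-2 (D^+)^2 \frac{nm}{n+m})$, which is an assumption since the post does not state which implementation was used. The two samples are hypothetical values of $\tau$.

```python
import math


def ks_one_sided(x, y):
    """Return D+ = max_t (F_x(t) - F_y(t)) and its asymptotic p-value.

    A small p-value is evidence that the empirical CDF of x lies above
    that of y somewhere, i.e., x tends to take smaller values than y.
    """
    xs, ys = sorted(x), sorted(y)
    n, m = len(xs), len(ys)
    i = j = 0
    d_plus = 0.0
    for t in sorted(set(xs) | set(ys)):
        while i < n and xs[i] <= t:  # i = number of x values <= t
            i += 1
        while j < m and ys[j] <= t:  # j = number of y values <= t
            j += 1
        d_plus = max(d_plus, i / n - j / m)
    p_value = math.exp(-2.0 * d_plus**2 * n * m / (n + m))
    return d_plus, p_value


# Hypothetical rescaled ages for two groups of venues.
tau_x = [0.05, 0.10, 0.15, 0.22, 0.30, 0.41]
tau_y = [0.35, 0.48, 0.55, 0.63, 0.77, 0.90]

d, p = ks_one_sided(tau_x, tau_y)
print(f"D+={d:.3f}, asymptotic p-value={p:.4f}")
```

For small samples the asymptotic formula is only a rough guide, and an exact or permutation-based p-value would be preferable.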

Check-in distribution for venues born during the same epoch

Next we focus on venues that have been created in the system around the same epoch, and we examine the distribution of the number of check-ins in every epoch. We consider 3 different epochs, namely, $\tau < 0.1$, $\tau \approx 0.5$ and $\tau > 0.8$ (with the corresponding distributions denoted $f_{0.1}$, $f_{0.5}$ and $f_{0.8}$). We also calculate the check-in distribution $f$ for all the venues, that is, irrespective of their birth time. As we see in Figure 2, the distributions are different for each epoch. In particular, as we move to earlier epochs (red curve), it becomes more likely to find venues with both large and small numbers of check-ins. However, venues created in later epochs tend to have accumulated only a small number of check-ins, and the "exceptional" establishments with many check-ins are rarer.

Furthermore, the distribution of the aggregate number of check-ins to venues is typically described by a power-law distribution. However, this need not be the case for the check-in distribution of each epoch. In particular, we perform the following statistical test (using the poweRlaw package in R, which relies on a statistical bootstrap):

$H_0$: The power-law distribution cannot be ruled out
$H_1$: The power-law distribution can be ruled out

The p-value of the test for $f_{0.1}$ was 0.03, which means that $H_0$ can be rejected at the significance level $\alpha = 0.05$. For $f_{0.5}$ and $f_{0.8}$ the p-values were 0.59 and 0.32 respectively, and hence the power-law cannot be ruled out. Furthermore, when performing this test for the aggregate distribution $f$, the p-value is 0.21, which again means that we cannot rule out the power-law distribution. The take-away here is that the age of a venue in the system clearly affects the distribution of the check-ins. If we focus on venues that were created earlier in the system, the typical heavy-tailed distribution for the number of check-ins disappears.
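To give a feel for what such a bootstrap test does, here is a simplified sketch in standard-library Python. It is not the poweRlaw procedure itself: we assume a continuous power law (check-in counts are really discrete), we keep the lower cutoff `x_min` fixed instead of re-estimating it, and the sample is synthetic. The idea is to fit the exponent by maximum likelihood, measure the KS distance between data and fit, and then see how often samples drawn from the fitted power law are at least that far from their own fit.

```python
import math
import random


def fit_alpha(xs, x_min):
    """Maximum-likelihood exponent of a continuous power law on x >= x_min."""
    return 1.0 + len(xs) / sum(math.log(x / x_min) for x in xs)


def ks_distance(xs, x_min, alpha):
    """Largest gap between the empirical CDF and the fitted power-law CDF."""
    xs = sorted(xs)
    n = len(xs)
    dist = 0.0
    for i, x in enumerate(xs):
        model = 1.0 - (x / x_min) ** (1.0 - alpha)
        dist = max(dist, abs((i + 1) / n - model), abs(i / n - model))
    return dist


def bootstrap_p_value(xs, x_min, n_boot=200, seed=0):
    """Return (alpha, p); a small p suggests the power law can be ruled out."""
    rng = random.Random(seed)
    tail = [x for x in xs if x >= x_min]
    alpha = fit_alpha(tail, x_min)
    observed = ks_distance(tail, x_min, alpha)
    hits = 0
    for _ in range(n_boot):
        # Inverse-transform sampling from the fitted power law.
        synth = [x_min * (1.0 - rng.random()) ** (-1.0 / (alpha - 1.0)) for _ in tail]
        if ks_distance(synth, x_min, fit_alpha(synth, x_min)) >= observed:
            hits += 1
    return alpha, hits / n_boot


# Synthetic "check-in counts" drawn from a power law with exponent 2.5.
rng = random.Random(42)
sample = [10.0 * (1.0 - rng.random()) ** (-1.0 / 1.5) for _ in range(500)]

alpha_hat, p_value = bootstrap_p_value(sample, x_min=10.0)
print(f"alpha={alpha_hat:.2f}, bootstrap p-value={p_value:.2f}")
```

As in the test above, a large p-value only means that the power law cannot be ruled out; it does not show that a power law is the best description of the data.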


Comments

If you have any comments/thoughts (or simply want access to the data I used), please e-mail me (kpele A-T pitt.edu) and I will post them here with the appropriate credits!