Each review in the Tripadvisor dataset contains two different dates: the review date and the visit date.
We analyze the size of the difference between the review date and the visit date, that is, the time elapsed between the visit and the moment the review is written. To this end, we use only reviews for which both dates are available. Since the exact day of the visit is unavailable, we set the visit date to the first day of the corresponding month. As a consequence, the computed difference can overestimate the actual elapsed time by up to one month.
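The computation described above can be sketched with pandas as follows. This is only a minimal illustration, not the code used for the analysis: the column names `review_date` and `visit_date`, and the assumption that the visit date is stored as a `YYYY-MM` string, are hypothetical.

```python
import pandas as pd


def date_delta_days(reviews: pd.DataFrame) -> pd.Series:
    """Days elapsed between visit and review, for reviews that have both dates.

    Assumed (hypothetical) schema:
      review_date -- full date, e.g. "2019-05-24"
      visit_date  -- year and month only, e.g. "2019-05"
    """
    # Keep only reviews for which both dates are available.
    both = reviews.dropna(subset=["review_date", "visit_date"])
    review = pd.to_datetime(both["review_date"])
    # Parsing a year-month string yields the first day of that month.
    visit = pd.to_datetime(both["visit_date"], format="%Y-%m")
    return (review - visit).dt.days


# Toy data for illustration only (not taken from the dataset).
sample = pd.DataFrame(
    {
        "review_date": ["2019-05-24", "2019-07-03", None],
        "visit_date": ["2019-05", "2019-06", "2019-06"],
    }
)
deltas = date_delta_days(sample)
```

In this toy sample the third row is dropped because its review date is missing, and the two remaining deltas are measured from the first day of the visit month.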
The following graph shows the distribution of the date delta, computed for each review as the difference in days between the review date and the visit date.
The following graph shows the cumulative distribution function of the date delta.
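An empirical cumulative distribution function of this kind can be obtained by sorting the deltas and assigning each one the fraction of observations that do not exceed it. The sketch below assumes the deltas are available as a plain sequence of day counts; the function name `empirical_cdf` and the toy values are illustrative only.

```python
import numpy as np


def empirical_cdf(deltas):
    """Return (x, y): sorted deltas and the fraction of values <= each of them."""
    x = np.sort(np.asarray(deltas))
    y = np.arange(1, len(x) + 1) / len(x)
    return x, y


# Toy values for illustration only.
x, y = empirical_cdf([3, 1, 2, 4])

# Fraction of (toy) reviews written within 2 days of the visit date.
share_within_2 = float(np.mean(np.asarray([3, 1, 2, 4]) <= 2))
```

Plotting `y` against `x` (for example as a step plot) gives the cumulative curve, and the share of reviews below any threshold can be read off it directly.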
Descriptive statistics for the distribution (all values in days, except count):
variable | value |
---|---|
count | 281338.000000 |
mean | 40.902345 |
std | 62.461652 |
min | 1.000000 |
25% | 13.000000 |
50% | 23.000000 |
75% | 34.000000 |
80% | 40.000000 |
90% | 83.000000 |
max | 497.000000 |
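A summary with the same rows can be produced with pandas `Series.describe`, passing the extra percentiles explicitly. Whether the table was generated this way is an assumption; the helper name `summarize` and the toy input are hypothetical.

```python
import pandas as pd


def summarize(deltas: pd.Series) -> pd.Series:
    """Count, mean, std, min, max and the 25/50/75/80/90th percentiles."""
    return deltas.describe(percentiles=[0.25, 0.5, 0.75, 0.8, 0.9])


# Toy input for illustration only: the integers 1..100.
stats = summarize(pd.Series(range(1, 101)))
```

The resulting index is `count`, `mean`, `std`, `min`, `25%`, `50%`, `75%`, `80%`, `90%`, `max`, matching the layout of the table.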
The table shows that for 50% of the reviews the difference is at most 23 days, for 75% of the reviews the dates differ by at most 34 days, and the remaining 25% of reviews belong to the "long tail" of the distribution, which extends up to 497 days.
This result indicates that the error introduced by using the review date instead of the visit date is limited to about a month for the majority (roughly 75%) of reviews.