I use traffic flow data from California Freeway Performance Measure Systems (PeMS). The dataset spans on years and contains huge number of stations (that correspond to different highways/streets etc.). I select around two years of timespan from October 2020 to June 2022. I will show three different stations with varying occupancies in this post.
The data contains various statistics such as total traffic flow of 5 minute intervals, average speed and occupancy. I will be only interested in total traffic flow of 15 minute intervals. The reasoning behind 15 minutes is discussed in many papers like Lippi. Briefly, 15 minute is a good trade-off between reducing the noise with keeping the data relatable and useful.
Here is a one week look to the data:
It is clear that weekdays and weekends look different. This lead people to use only weekends in their projects especially for capturing seasonal (daily) patterns.
Using this data as an example, I will show a way to find missing dates in huge datasets.
In PeMS data there were some dates with no observations, however their dates were also not included. Thus we need to find the dates that do not appear in the data to find the missing observations.
library(lubridate)
# create all dates along the data.
= seq(ymd_hm('2020-04-01 00:00'), ymd_hm('2022-06-04 23:55 '), by = '5 mins')
alldates
# create a checker
= data.table(Timestamp = alldates,
bigdates isempty = 0)
# merge the checker with the data of interest
= merge(bigdates,
bigdates %>% filter(Timestamp < "2022-06-03"),
data all = TRUE,
by = "Timestamp")
# the NA values in the merge dataset are the non-existing ones.
which(is.na(bigdates$Station)), ]$Timestamp bigdates[
I did not include the results, but there were some instances of missing data. I handled them with two methods:
Here is a custom (and weird) function for the second case. Even though it was simple to explain, working with time data may be tricky. In my case UTC was not stable in different runs. Also, I did not want to work with indices because of some other problems that I don’t recall.
<- function(data, year = 2022, month = 2, mday = 5, hour, minute, year1 = 2022, month1 = 1, mday1 = 29, year2 = 2022, month2 = 1, mday2 = 22, year3 = 2022, month3 = 1, mday3 = 15) {
missing_replacer_3week = rbind(data[year(Timestamp) %in% year1 & month(Timestamp) == month1 & mday(Timestamp) == mday1 & hour(Timestamp) == hour & minute(Timestamp) == minute, 8:13],
xxx year(Timestamp) %in% year2 & month(Timestamp) == month2 & mday(Timestamp) == mday2 & hour(Timestamp) == hour & minute(Timestamp) == minute, 8:13],
data[year(Timestamp) %in% year3 & month(Timestamp) == month3 & mday(Timestamp) == mday3 & hour(Timestamp) == hour & minute(Timestamp) == minute, 8:13])
data[= data.table(t(apply(xxx, 2, mean)))
xxx year(Timestamp) == year & month(Timestamp) == month & mday(Timestamp) == mday & hour(Timestamp) == hour & minute(Timestamp) == minute, 8:13] =
data[
xxxyear(Timestamp) == year & month(Timestamp) == month & mday(Timestamp) == mday & hour(Timestamp) == hour & minute(Timestamp) == minute, 3:7] =
data[year(Timestamp) %in% year1 & month(Timestamp) == month1 & mday(Timestamp) == mday1 & hour(Timestamp) == hour & minute(Timestamp) == minute, 3:7]
data[return(data)
}
This post was a short introduction. After preparing the data, I started working on similar trajectories.