This post is about finding and analysing similar trajectories using Traffic-flow data as a real world example. We aim to find meaningful relationship between different trajectories in our data. Then we plan to use them for forecasting in the next blogpost.
I gave brief information about the data and did some cleaning/preperation in the previous post. Now I will talk about distance functions and use them for capturing patterns/similarities in our data.
Before that, I split data to 2 main parts training and testing. Also I split each part to two smaller intersecting parts which I call reference and query. There is no need to explain train/test splitting, the reference and query parts are because of the way searching for historical similarity works, which I will tell right after showing the split.
Note that, query and reference parts are intersecting with each other. When searching for similar trajectories, we need to use data from past otherwise the first entries of the query part would have nowhere to look. As an example, if we were only using October 2021 to February 2022 in training, first weeks of October 2021 would only be searching the same days and maybe weeks for similar patterns. With query/reference split I create a year long pool to search for similar patterns. This also lets me to capture a possible yearly pattern.
Also, I see that the flow data changes drastically right before 2021. This makes 4+12 months of split a viable option, using a bigger set may create problems.
For capturing trajectory similarities, we first need trajectories. I arrange the time-series data to overlapping small windows (vectors/trajectories). The lenght(size) of these windows are of great interest. I will be calling it windowsize in this and following posts.
Windowsize should be large enough to capture an ongoing pattern but small enough to be relateble. As an example, 4 hour long trajectory gives strong clues about the time of the day, trend of the traffic and that specific days average flow. Using 24 hour would may give better information about that day but it would miss time of the day and other critical information.
When windowed data arrangement is done, I use distance functions to find closest windows for each window. This functions are usually metrics, but it is not necessary. I first talk about the basic metrics. Then I may talk about custom made distance functions that fit better for my purpose. I leave explaining the details of each distance to below chart from Shirkhorshidi et. al., I will only talk about some insights about the measures I will use.
Euclidean (or \(\ell^2\)) distance is an obvious choice for measuring distance. It is widespread and easy to implement/understand.
In our case, euclidean distance gives easy to understand results. If two trajectories are similar according to euclidean distance, we can confidently say they are indeed similar. However it has some of shortcomings. Here are some of them with possible solutions.
A single steep anomaly may disrupt the whole measure (especially right before/after rush hours)
Smoothen the outlier values
Ignore some of the elements in the trajectory
Past values may affect the result by a lot
Weight the latter elements more (weighted euclidean)
Similar to Euclidean
Very easy to implement
Very fast
not expected to work as well as euclidean (however it performed ok, check next post)
may catch different patterns compared to above three.
may miss many important patterns too
will add.
Inspecting the closest neighbors gives a more or less uniform distribution.
Here are two example visuals showing the nearest (only the first one) neighbor of first 500 observations from the query(train) set. (500 observations are about 5 days long)
Also increasing 5 days to 1 month, we see similar results.
In both examples there is uniform-like patterns. I checked different statistics like
These results suggests verifies our choice of using long time-span. Thus in the next post, I will be using the nearest neighbors for making forecasts on the full size data without any extra touch.