A problem that I frequently run across is missing data. Sometimes we can safely ignore missing data and other times we have to drop the missing observations. Sometimes we may aim to estimate the values of the missing data using interpolation methods. Interpolation is a great method of not losing data, but to use it you have to be fairly careful about your assumptions. If you can make the "missing at random" (MaR) assumption, then you may have a lot of leeway to salvage your missing data.
I do a lot of work with environmental datasets, so many times my data have a strong cyclical component. One way to interpolate missing values in cyclical data is to use the "STLplus" package in R. STL decomposition or Seasonal Trend Loess is a method that splits your data into two major components, a seasonal component and a trend component. STL then performs the local-linear loess regression on both components to estimate changes in the trend and seasonality over time. The method is very flexible, allowing you to choose the degree of the polynomial estimating the regression on both components, and produces convincing results (Cleveland et al., 1990).
STL has had a couple of implementations in R, including the fantastic STLplus written by Ryan Hafen. The package that primarily has aimed to better handle the treatment of missing data. On top of this package I have written a few functions (STLinterp) to properly cross validate the procedure, which I demonstrate here.
To begin to show how to use the STL method, we can have a look at a real problem I had. In a dataset I have many different time series of observation wells across India. These observation wells are very important to understand how water scarcity has changed over time, but the data quality is sometimes not the best, with occasional periods of missingness. The data show strong seasonality from the yearly monsoon. For an example here, I've chosen the time series for one district, Araria, a district in the state of Bihar in Northern India.
pacman:::p_load(tidyverse,dplyr,devtools,stlplus)
araria <- read_csv("araria.csv")
araria %>% head()
#> # A tibble: 6 x 3
#> state.dist time med.welldepth
#> <chr> <dbl> <dbl>
#> 1 Bihar, Araria 9 -2.58
#> 2 Bihar, Araria 10 -2.86
#> 3 Bihar, Araria 11 -1.44
#> 4 Bihar, Araria 12 -1.51
#> 5 Bihar, Araria 13 -4.23
#> 6 Bihar, Araria 14 -4.92
If we plot the data, we can see there are gaps in the data. The gaps appear to be randomly distributed, which provides evidence to satisfy our MaR assumption
araria %>% ggplot(aes(time,med.welldepth))+
geom_line(size=.75,color ="steelblue")+
geom_point(size=2,shape=1,color="steelblue")+
labs(title="Well Depth in Araria, Bihar", x= "Season", y="Median Well Depth")+
theme_classic()
We can then use the basic functionality of STLplus to estimate the missing values. This function requires a series of parameters that modify the underlying loess regressions. These parameters are things such as degree of the polynomial used in loess, the bandwidth or window chosen for the local linear regression and more. These parameters have non-trivial implications on the resulting STL decomposition, and can be hard to know in advance.
#example how to use STL plus to fill in missing values
araria_ts <- ts(araria$med.welldepth, frequency=4)
# which na?
missing_indicies<- which(is.na(araria_ts))
#perform stl, reconstruct missing
stlobj <- stlplus(araria_ts, s.window=4, s.degree=2, t.window=4,
t.degree=2,fc.window=4)
reconstruct <- seasonal(stlobj) + trend(stlobj)
#replace NAs in the dataset with interpolated values
araria_reconstructed <- araria %>%
mutate(med.welldepth =replace(med.welldepth,
is.na(med.welldepth), reconstruct[missing_indicies]))
STLplus provides an opportunity to see the underlying seasonal and trend components that are estimated. This could be potentially of interest if, for example, you are only interested in seeing how the trend differs over time.
#plot seasonal and trend components
araria$season_stl <- seasonal(stlobj)
araria$trend_stl <- trend(stlobj)
araria_long <- araria %>% pivot_longer(cols=ends_with("stl"), "component")
araria_long %>% ggplot(aes(time, value, col=component))+
geom_line(size=.75)+
geom_point(size=2,shape=1)+
labs(title="Estimated Season vs Trend Components with STL",
x= "Season", y="Well Depth")+
scale_color_discrete(name = "Component", labels = c("Seasonal", "Trend"))+
theme_classic()
Once we have estimated our STL components, we can add the seasonal and trend back together and plot how the new interpolated points look relative to the actual data.
#Show reconstructed points
names(araria_reconstructed)[3] <- "recon_welldepth"
araria <- bind_rows(araria, araria_reconstructed[missing_indicies,])
araria %>% ggplot(aes(time,med.welldepth))+
geom_line(size=.75,color ="steelblue")+
geom_point(size=2,shape=1,color="steelblue")+
geom_point(aes(x=time,y=recon_welldepth),color="red", size=2)+
labs(title="Reconstructed points (in Red)", x= "Season", y="Well Depth")+
theme_classic()+
theme(plot.title = element_text(hjust = 0.5))
We can now plot the actual data and the interpolated data together as a single non-disjoint time series.
#see reconstructed time series
araria_reconstructed %>% ggplot(aes(time,recon_welldepth))+
geom_line(size=.75,color ="steelblue")+
geom_point(size=2,shape=1,color="steelblue")+
labs(title="Reconstructed Well Depth for Araria, Bihar",
x= "Season", y="Median Well Depth")+
theme_classic()+
theme(plot.title = element_text(hjust = 0.5))
The above actually looks pretty believable. It's hard to see which points we've interpolated which is a testament to the power of the method and also the danger of interpolation when the ground truth is not actually known. One excellent way to estimate "accuracy" in this context is in held-out data. Basically we simulate extra missing data points and estimate STL. Now we know the difference between the actual and our interpolated points so we can choose a best parameter set.
I've implemented two cross validation methods in STLinterp. The first, Monte Carlo draws a random subsample of the data that is proportion P of the data and will do so K number of times. For example of Araria in the first draw the alogirthm might hold out index 12, 31, 15, 16, 62. In the next draw it might hold out 21, 5, 31, 77, 81. Mean squared error (MSE) is then calculated and averaged across all draws. The second method is the more familiar k-fold cross validation technique where K splits are randomly made of training data. The model is built on k-1 splits and evaluated on the kth split. MSE is then averaged across splits.
In both cases you want to have small testing set and a large training set, and test the accuracy of the model across many runs. For k-fold this means making k nearly the length of the non-missing data (leave one out is k = length(data) -1 ). For the Monte Carlo method, this means making p close to 1 and k large. Experiments have suggested to me that k-fold is more stable for fewer iterations, but the jury is still out as to which is better.
STLinterp makes estimating accuracy easy in both cases. To calculate using Monte Carlo cross validation you can use:
source_url("https://github.com/M-Harrington/STLinterp/blob/master/STLinterp.R?raw=TRUE")
#check quality of prediction
params <-c(4,2,4,2,4,1) #s.window , s.degree, t.window , t.degree, fc.window, fc.degree respectively
mc_cross_val(araria_ts, grid_row = params, p=.95,k=1000)
#> [1] 5.50449
Note that MSE usually decreases as p increases (or k increases in k-fold) because the model has more points to train on. Likewise if we want to preform k-fold cross validation we can use:
kfold_cross_val(araria_ts,grid_row=params,k = 65)
#> [1] 1.37312
The real benefit of using STLinterp, however, is to test many different hyperparameter sets simultaneously through a grid search. You can get the entire grid and all of the scores using value = "grid":
#but maybe want to test a whole bunch of hypotheses
grid_results <- STLinterp(araria_ts, s.window = c(5,6,8),
t.window = c(5,10,15),t.degree = c(1,2), s.degree = c(1,2), k=65,
type="kfold", value = "grid")
grid_results %>% head()
#> s.window s.degree t.window t.degree fc.window fc.degree CVscore
#> 1 5 1 5 1 NA NA 0.6679875
#> 2 6 1 5 1 NA NA 0.6197527
#> 3 8 1 5 1 NA NA 0.6083444
#> 4 5 2 5 1 NA NA 1.7325874
#> 5 6 2 5 1 NA NA 0.8141298
#> 6 8 2 5 1 NA NA 0.6785703
Or you can only get back the best parameter set using value="best".
#or return just the best parameter set
STLinterp(araria_ts, s.window = c(5,6,8),
t.window = c(5,10,15),t.degree = c(1,2), s.degree = c(1,2), k=65,
type="kfold", value = "best")
#> s.window s.degree t.window t.degree fc.window fc.degree CVscore
#> 2 6 1 5 1 NA NA 0.609603
Lastly there occasionally a "bug" where the algorithm cannot perform on particularly sparse data (with long runs of missingness). This can frequently be "solved" with increasing window sizes for the trend and seasonal components, but this method is likely going to perform significantly worse with less data to estimate and is likely worth trying another method at that point.
So thank you for reading, let me know if the method ends up helping you or if you have any suggestions for improvements to STLinterp, or any questions about the techniques used above!
Bibliography
Cleveland, R. B., Cleveland, W. S., McRae, J. E., & Terpenning, I. (1990). STL: A seasonal-trend decomposition. Journal of official statistics, 6(1), 3-73.
Comentários