
\chapter{Results and discussion}
This chapter presents case studies of Bruny Island and ISO New England using the proposed forecasting models.

\section{Bruny Island}
Bruny Island, shown in Figure \ref{fig:bruny_map}, is located approximately two kilometres off the coast of south-east Tasmania and has a permanent resident population of approximately 800 people.
The island is a popular holiday destination, with Easter periods typically experiencing an influx of up to 500 cars in a single day.
The island is supplied by two feeders, depicted in Figure \ref{fig:bruny_network}, with one feeder supplying a small portion of the island to the north and the other supplying the main portion of the island to the south.
This case study deals only with the feeder supplying the main portion of the island to the south.

During the morning and afternoon peaks of holiday periods, the submarine feeder reaches its capacity and a diesel generator located on the island is used to reduce the feeder load.
The substantial increase in load over the Easter holiday period for multiple years can be seen in Figure \ref{fig:bruny_easter}.

To avoid the use of the generator, the CONSORT project installed a set of residential batteries on the island for the purposes of peak-shifting.
These batteries are coordinated by the network-aware coordination (NAC) algorithm.
In order to peak-shift while making efficient use of the batteries, the NAC requires an accurate forecast of load with a 24-hour horizon and 30-minute resolution.

The proposed forecasting models were evaluated on historical data, with the details of the implementations and the results presented in the following sections.
A transformer-based model was implemented for online forecasting as part of the NAC, with details given in section \ref{consort-eval}.

\begin{figure}[htbp]
	\centerline{\includegraphics[width=.45\textwidth]{images/bruny_single_line.pdf}}
	\caption{Single line schematic of the distribution network on Bruny Island.}
	\label{fig:bruny_network}
\end{figure}

\begin{figure}[htbp]
	\centerline{\includegraphics[width=.9\textwidth]{images/easter_bruny.pdf}}
	\caption{Easter load on Bruny Island, 2008 through 2017.
		Periods of unusable missing or bad data can also be seen in this graph.}
	\label{fig:bruny_easter}
\end{figure}

\subsection{Data analysis}
\label{bruny-data-analysis}

In section \ref{patterns-profiles}, the general properties of load profiles and how they are influenced by exogenous factors were discussed. These general properties are now briefly investigated for the Bruny Island feeder.
\subsubsection{Available data}
The following data is available:
\begin{itemize}
	\item Apparent power measured at recloser R1 (Figure \ref{fig:bruny_network}) between January 1, 2007 and June 25, 2018
	\item Ambient temperature, solar irradiance, humidity, and wind speed at Lenah Valley between October 1, 2009 and June 25, 2018
	\item Vehicle movement data in 10-minute resolution between June 4, 2015 and March 24, 2018
\end{itemize}	
Vehicle movement data is provided as a set of observations every 10 minutes, recording the number of vehicles arriving on the island and the number of vehicles departing the island.
By cumulatively summing the net arrivals (arrivals minus departures), the number of vehicles on the island relative to the start of the record can be established.
There is a significant amount of bad or missing data throughout these datasets; this has been handled by limiting the use of data where there are too many missing or bad values.
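
The relative vehicle count described above can be written as a running sum. As a sketch only, assume $a_i$ and $b_i$ denote the numbers of vehicles arriving and departing in the $i$-th 10-minute interval, and $v_t$ the count after interval $t$ (these symbols are introduced here for illustration and are not used elsewhere in this work):
\[
v_t = v_0 + \sum_{i=1}^{t} \left( a_i - b_i \right).
\]
The initial count $v_0$ is not observed and can be taken as zero, which is why only a relative number of vehicles is obtained.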

\subsubsection{Analysis}
Figure \ref{fig:load-profiles} shows apparent power draw on Bruny Island over a winter week, a summer week, and over an entire year.
These two weeks were selected to avoid special days such as holidays and are representative of typical weeks.
Several differences can be seen between the summer and winter weeks.
In summer, the midday load is about the same as the overnight load, whereas in winter the two differ markedly.
Summer afternoon peaks are smaller than the morning peaks, whereas in winter they are similar.
At a high level, it is clear that the load is generally larger in winter -- likely as a result of increased residential heating.
An intuitive reaction to this might be to use different forecasting models for summer and winter, but a delineation must then be decided on for when to switch between the two models.
As one of the aims of the forecasting system is to be practical to implement, it is desirable to have a single system that is able to forecast both summer and winter.
By supplying temperature as an exogenous input this should be possible.

\begin{figure}[htbp]
	\centering
	\subfigure[Load over a single summer week in 2018.]{
		\includegraphics[width=0.8\textwidth]{images/simple-week-summer.pdf}
		\label{fig:simple-week-summer}}
	\vfil
	\subfigure[Load over a single winter week in 2017.]{
		\includegraphics[width=0.8\textwidth]{images/simple-week-winter.pdf}
		\label{fig:simple-week-winter}}
	\vfil
	\subfigure[Peak daily load and peak daily temperature over all of 2017.]{
		\includegraphics[width=0.8\textwidth]{images/max-load-max-temp.pdf}
		\label{fig:max-load-max-temp}}
	\caption{Bruny Island load profiles.}
	\label{fig:load-profiles}
\end{figure}

It was mentioned in section \ref{patterns-profiles} that weekends and weekdays tend to have different load profiles, but this is not immediately obvious from figure \ref{fig:load-profiles}.
Figure \ref{fig:average-profiles} shows this difference between days of the week.
What is immediately obvious is that the differences between days of the week are not restricted to 24-hour bounds: the different load profile shapes gently merge into each other over the course of an afternoon or morning.
% This is an especially important observation for a forecasting system that needs to be able to perform forecasts not just at a single time each day, but at any time.
It can be seen that the Friday profile morphs into the Saturday profile during the afternoon, perhaps as people arrive on the island for the weekend, and the Sunday profile morphs into a weekday profile over the afternoon, perhaps as people depart the island.

Figures \ref{fig:average-arriving} and \ref{fig:average-departing} further support the view that the changes in load profiles are a result of people arriving on or leaving the island.
It can be seen that many vehicles tend to arrive on the island on Friday afternoon, leading to the change in load profile between Friday and Saturday.
Likewise, many vehicles leave the island on Sunday afternoon, shifting the load profile from Sunday to a weekday.

This brief investigation serves to highlight some of the challenges that the load forecasting system must handle.

\begin{figure}[htbp]
	\centering
	\subfigure[Average load profiles for each day of the week.]{
		\includegraphics[width=.8\textwidth]{images/average-profiles.pdf}
		\label{fig:average-profiles}}
	\vfil
	\subfigure[Average number of cars arriving on the island for each day of the week.]{
		\includegraphics[width=.8\textwidth]{images/average-arriving}
		\label{fig:average-arriving}}
	\vfil
	\subfigure[Average number of cars leaving the island for each day of the week.]{
		\includegraphics[width=.8\textwidth]{images/average-departing}
		\label{fig:average-departing}}
	\caption{Bruny Island average load profiles.}
	\label{fig:average-load-profiles}
\end{figure}

\subsection{Forecasting tasks}
\label{forecasting-tasks}
The forecasters were evaluated on several tasks, shown in table \ref{table:offline-results-bruny-lscape}, and discussed in section \ref{bruny_offline_results}.
The hourly task has the forecaster predict the future 24 hours of load in one-hour resolution given the previous 24 hours of load and the future 24 hours of date/time, holiday, and temperature data.
The half-hourly task is the same as hourly, except using half-hour resolution data.
The hourly long task has the forecaster predict the future 24 hours of load in one-hour resolution given the previous 168 hours (one week) of load and the future 24 hours of date/time, holiday, and temperature data.
The hourly ($c=3$) task is the same as the hourly task but with the loss function modified, as described in section \ref{train-reg}, with $c=3$.

The tasks listed as being ``with similar profiles'' include five similar profiles ($k=5$, as described in the following section, \nameref{data-model-config}).
The tasks that do not list similar profiles use none ($k=0$).

\subsection{Data and model configuration}
\label{data-model-config}
The available data was split into a training set (October 1 2009 through February 15 2014) and a testing set (February 16 2014 through June 25 2018).
For the hourly task, the network was supplied with data in hourly resolution from the previous and future 24 hours:
\begin{itemize}
	\item The previous 24 hours of load, $\vl = [ l_1 \: l_2 \: \ldots \: l_{24}]^\top$, with $l_{24}$ being the most recent load.
	\item Temperature for the future 24 hours, $\vt = [t_1 \: t_2 \: \ldots \: t_{24}]^\top$, with $t_1$ being the soonest point in the future.
	\item Day of the week as an integer from 0 to 6 (local time) over the future 24 hours, $\vd = [d_1 \: d_2 \: \ldots \: d_{24}]^\top$, with $d_1$ being the soonest point in the future.
	\item Minutes since midnight (local time) for the future 24 hours, $\vm = [m_1 \: m_2 \: \ldots \: m_{24}]^\top$, with $m_1$ being the soonest point in the future.
	\item Boolean 1 or 0 indicating whether it is a holiday at each hour over the next 24 hours, $\vh = [h_1 \: h_2 \: \ldots \: h_{24}]^\top$, with $h_1$ being the soonest point in the future.
	\item Holiday type as an integer from 0 to $n$ at each hour over the future 24 hours, where there are $n$ different holidays in the year, $\vg = [g_1 \: g_2 \: \ldots \: g_{24}]^\top$, with $g_1$ being the soonest point in the future.
	\item Month of the year from 0 to 11 (local time) over the future 24 hours, $\vn = [n_1 \: n_2 \: \ldots \: n_{24}]^\top$, with $n_1$ being the soonest point in the future.
\end{itemize}

Additionally, similar profile data was constructed for $k$ similar days, as described in section \ref{simperiod}.
The weights used are given in table \ref{table:offline-parameters}.
The extremely large weights are used to ensure that similar profiles are always selected from the same month and day of month/day of week unless there are none available.
The similar day data is aligned with the same points in time as the $\vt$ vector.
The order in which the similar days were added to the $\mX$ input matrix was randomized during both training and testing/inference.
The similar profile data is:

\begin{itemize}
	\item 24 hours of similar profile load, $\vl_{sk} = [ l_{sk1} \: l_{sk2} \: \ldots \: l_{sk24}]^\top$.
	\item 24 hours of similar profile temperature (corresponding to the same points in time as $\vl_{sk}$), $\vt_{sk} = [ t_{sk1} \: t_{sk2} \: \ldots \: t_{sk24}]^\top$.
\end{itemize}

This data was concatenated to form the input matrix given in equation \ref{eq:input-matrix}.

\begin{equation} \label{eq:input-matrix}
\mX = [ \vl \: \vt \: \vd \: \vm \: \vh \: \vg \: \vn \: \vl_{s1} \: \vt_{s1} \: \ldots \: \vl_{sk} \: \vt_{sk}] \in \mathbb{R}^{24 \times (7 + 2k)}
\end{equation}

The half-hourly configuration is the same as the hourly configuration, except all vectors contain 48 elements (representing 24 hours of data) and so $\mX \in \mathbb{R}^{48 \times (7 + 2k)}$.

The hourly long configuration modifies only the $\vl$ vector, such that $\vl = [ l_1 \: l_2 \: \ldots \: l_{168}]^\top$ represents one week of hourly-resolution data, with $l_{168}$ being the most recent observation.
The other vectors were zero padded to match this length.
With $\vz \in \mathbb{R}^{144}$ being a zero vector, and $\hat{\vv} = [\vz^\top \vv^\top]^\top$ for any vector $\vv$, the input $\mX$ is given by equation \ref{hourly-long-x}.
\begin{equation}
\label{hourly-long-x}
\mX = [ \vl \: \hat{\vt} \: \hat{\vd} \: \hat{\vm} \: \hat{\vh} \: \hat{\vg} \: \hat{\vn} \: \hat{\vl_{s1}} \: \hat{\vt_{s1}} \: \ldots \: \hat{\vl_{sk}} \: \hat{\vt_{sk}}] \in \mathbb{R}^{168 \times (7 + 2k)}
\end{equation}
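
As a concrete check of these dimensions, offered as a worked example rather than an additional configuration, consider the tasks that use similar profiles, for which $k=5$. The seven base vectors plus two vectors per similar profile give
\[
7 + 2k = 7 + 2 \times 5 = 17
\]
columns, so $\mX \in \mathbb{R}^{24 \times 17}$ for the hourly configuration, $\mX \in \mathbb{R}^{48 \times 17}$ for the half-hourly configuration, and $\mX \in \mathbb{R}^{168 \times 17}$ for the hourly long configuration. In the hourly long case the zero vector has $168 - 24 = 144$ elements, which pads each 24-element vector to the length of the one-week load vector. With $k=0$ only the seven base columns remain.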

The forecasting models were configured with the parameters given in table \ref{table:offline-parameters} and trained for $10^5$ iterations.
All the models presented in section \ref{ana-meth} are evaluated.
The ``Transformer'' model is the transformer with teacher forcing enabled, and the ``Transformer no TF'' model is the transformer with teacher forcing disabled.

\begin{table}[htbp]
	\caption{Offline evaluation model parameters.}
	\begin{center}
	\begin{tabular}{lclc}
	\multicolumn{1}{c}{\textbf{Model/Method}} & \textbf{Parameter} & \textbf{Description}                 & \textbf{Value} \\ \cline{1-4} 
	
	\multirow{7}{*}{\shortstack[l]{SARIMAX}}
	& $p$                & AR model order     					& 2              \\
	& $d$                & Number of differences   		        & 1             \\
	& $q$                & MA model order           			& 1              \\
	& $P$                & Seasonal AR model order              & 1            \\
	& $D$                & Number of seasonal differences       & 1              \\
	& $Q$                & Seasonal MA model order              & 1             \\
	& $s$                & Period of seasonality		        & 24             \\ \cline{1-4} 
	
	\multirow{6}{*}{\shortstack[l]{Sequence to\\Sequence}}       
	& $L$                & Number of encoder and decoder layers & 2              \\
	& $d$                & Hidden dimension                     & 16             \\
	& $D$                & Dropout fraction                     & 0            \\
	& $c$                & Loss function modifier               & 0              \\
	& $l$                & Learning rate 			            & 0.001             \\
	& -                  & Training batch size                  & 16             \\ \cline{1-4} 
	
	\multirow{7}{*}{\shortstack[l]{Transformer}}
	& $L$                & Number of encoder and decoder layers & 2              \\
	& $d$                & Hidden dimension                     & 16             \\
	& $h$                & Number of attention heads            & 2              \\
	& $D$                & Dropout fraction                     & 0.2            \\
	& $c$                & Loss function modifier               & 0              \\
	& $l$                & Learning rate 			            & 0.001              \\
	& -                  & Training batch size                  & 16             \\ \cline{1-4} 
	
	\multirow{7}{*}{\shortstack[l]{Universal Transformer}}
	& $L$                & Number of encoder and decoder layers & 2              \\
	& $d$                & Hidden dimension                     & 16             \\
	& $h$                & Number of attention heads            & 2              \\
	& $D$                & Dropout fraction                     & 0.2            \\
	& $c$                & Loss function modifier               & 0              \\
	& $l$                & Learning rate 			            & 0.001              \\
	& -                  & Training batch size                  & 16             \\ \cline{1-4} 
	
	\multirow{7}{*}{\shortstack[l]{Similar\\Profile\\Selection}}
	& -                  & Maximum future temperature weight    & 10             \\
	& -                  & Minimum future temperature weight    & 20             \\
	& -                  & Maximum past load weight             & 30             \\
	& -                  & Holiday type weight                  & 1e9            \\
	& -                  & Day of week weight                   & 1e6            \\
	& -                  & Day of month weight                  & 1e6            \\
	& -                  & Month of year weight                 & 1e6           
\end{tabular}
		\label{table:offline-parameters}
	\end{center}
\end{table}


\subsection{Offline evaluation results and discussion}
\label{bruny_offline_results}
The forecaster was trained on the training dataset (October 1 2009 through February 15 2014) and tested on the testing dataset (February 16 2014 through June 25 2018).
The training and testing datasets were then switched in order to cross-validate the results.
All presented results are the average of the two cross-validation runs.
Results are shown in table \ref{table:offline-results-bruny-lscape}.

Both the mean absolute percentage error (MAPE) and mean absolute error (MAE) are shown.
These metrics are calculated for all load, for load only over 1 MVA, and for load that is at a first large peak.
A first large peak is a morning or afternoon peak that is over 1 MVA and occurs more than 36 hours after any previous peak that was over 1 MVA.
The first large peak metric is used because such peaks are generally considered difficult to forecast.
The latter two metrics are intended to assess the performance of the system on anomalous holiday periods.
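
For reference, the two error measures take their standard forms. As a sketch, assume $l_i$ is the observed load and $f_i$ the forecast load at point $i$, and $\mathcal{S}$ is the set of points being scored (these symbols are introduced here for illustration only):
\[
\mathrm{MAPE} = \frac{100}{|\mathcal{S}|} \sum_{i \in \mathcal{S}} \left| \frac{l_i - f_i}{l_i} \right|,
\qquad
\mathrm{MAE} = \frac{1}{|\mathcal{S}|} \sum_{i \in \mathcal{S}} \left| l_i - f_i \right| .
\]
Under this reading, the three reported variants differ only in the choice of $\mathcal{S}$: all forecast points, the points at which the load exceeds 1 MVA (taken here to mean the observed load), or the points that fall on a first large peak.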

The SARIMAX model produced relatively poor results compared with the other models (a MAPE of 14.34\% on the hourly task, against at most 8.20\% for the others), and so was not considered beyond the hourly task without similar profiles.
The universal transformer is consistently a poor performer, never producing the best result for any task or metric.
The transformer model is generally outperformed by, or at least very similar to, the transformer without teacher forcing.
Setting these models aside, neither the S2S model nor the transformer without teacher forcing is clearly superior overall.

The S2S model appears to excel at the hourly task, achieving the best MAPE scores above 1 MVA and at first large peak, while also being close to the transformer without teacher forcing on the overall MAPE score.
The S2S model generally produces good results across the board.

The transformer without teacher forcing is generally good at minimizing the overall MAPE score.
It also performs exceedingly well in the hourly long task, achieving the lowest overall MAPE while the other two MAPE metrics are close to those of the hourly with similar profiles and $c=3$ task.
This model is also arguably the best at the half-hourly tasks, achieving the best results on the majority of metrics across both tasks.
The dominance of the transformer-based models on the half-hourly and hourly long tasks is likely a result of the transformer's ability to handle long sequence lengths,
as discussed in section \ref{sec:transformer}.

The hourly with $c=3$ tasks show no consensus on which model is generally superior.
Interestingly, the hourly long task achieves similar results to the hourly with $c=3$ with similar profiles task.
It may be the case that increasing the $c$ hyper-parameter simply adds bias to the model -- further analysis should be undertaken to assess this.

In some cases the MAPE increases after introducing similar profiles; for example, the S2S MAPE on the half-hourly task rises from 7.80\% to 8.44\%.
It could be that in these cases the similar profile data is of poor quality, or perhaps the model is not able to handle the higher dimensionality of the input effectively.
It is difficult to say for sure why this happens.

The MAPE versus horizon plot is shown in figure \ref{fig:bruny_mape} for the hourly task.
Only the sequence to sequence (S2S) and transformer without teacher forcing models are shown.
The cross-validation results, denoted with ``CV'' in the legend, are the results when the model is trained on data from 2014-2018 and evaluated on data from 2009-2014.
The cross-validated results agree to within 0.8\%.
The corresponding distribution of forecast error (though not for the cross-validated results) can be seen in figure \ref{fig:bruny_hist}.
The two distributions show an encouraging resemblance to a normal distribution.

This case study suggests that the sequence to sequence model offers good overall performance, while the transformer without teacher forcing is capable of outperforming the sequence to sequence model, but at the risk of more variability across the anomalous holiday metrics.
The transformer with teacher forcing and universal transformer models do not achieve competitive results, although the universal transformer has notably not been evaluated without teacher forcing, which may improve its performance as it does for the transformer model.
The similar profile method generally improves the performance of the forecasts.
Increasing the $c$ parameter causes the model to increase accuracy when forecasting peak values at the expense of overall accuracy, as intended.

\afterpage{%
	\clearpage% Flush earlier floats (otherwise order might not be correct)
	\thispagestyle{empty}% empty page style (?)
	\begin{landscape}% Landscape page
		\centering % Center table
		\bgroup
		\def\arraystretch{0.95} % Just squeeze it onto the page
		\begin{table}[htbp]
			\caption{Offline evaluation results.}
			\begin{center}
				\begin{tabular}{llcccccc}
					\multicolumn{1}{c}{\textbf{Task}} &
					\multicolumn{1}{c}{\textbf{Model}} & 
					\multicolumn{1}{c}{\textbf{\shortstack[c]{MAPE\\(\%)}}} &
					\multicolumn{1}{c}{\textbf{\shortstack[c]{MAE\\(kVA)}}} & 
					\multicolumn{1}{c}{\textbf{\shortstack[c]{MAPE over\\1 MVA (\%)}}} &
					\multicolumn{1}{c}{\textbf{\shortstack[c]{MAE over\\1 MVA (kVA)}}} &
					\multicolumn{1}{c}{\textbf{\shortstack[c]{MAPE at first\\large peak (\%)}}} &
					\multicolumn{1}{c}{\textbf{\shortstack[c]{MAE at first\\large peak (kVA)}}} \\ \cline{1-8}
					
					\multirow{5}{*}{\shortstack[l]{Hourly}}
					& SARIMAX           &         14.34  &         82.1  &         12.31  &         137.7  &         18.86  &         200.9 	\\ % (2,1,1)(1,1,1,24)/_cv
					& S2S      			&          7.85  &         42.8  &  \textbf{9.87} & \textbf{110.6} & \textbf{11.42} & \textbf{112.6}	\\ % 
					& Transformer       &          7.82  &         43.1  &         10.33  &         115.1  &         13.00  &         140.1 	\\ % 
					& Transformer no TF &  \textbf{7.77} & \textbf{42.6} &         11.45  &         128.9  &         13.24  &         142.0 	\\ % 130 131
					& U. Transformer    &          8.20	 &         46.0  &         11.72  &         131.6  &         14.52  &         155.2 	\\ % 
					\cline{1-8}					
					
					\multirow{4}{*}{\shortstack[l]{Hourly\\with similar\\profiles}}
					& S2S      			&          7.68  &         41.6  &  \textbf{9.77} & \textbf{109.6} & \textbf{11.05} & \textbf{118.7}	\\ % 
					& Transformer       &          7.40  &         40.7  &         10.25  &         115.2  &         12.45  &         133.6 	\\ % 
					& Transformer no TF &  \textbf{7.24} & \textbf{39.6} &         10.32  &         115.8  &         12.47  &         133.7    	\\ % 122 123
					& U. Transformer    &          8.15  &         44.8  &         11.10  &         123.5  &         12.67  &         135.5 	\\ % 
					\cline{1-8}					
					
					\multirow{4}{*}{\shortstack[l]{Half-\\hourly}}
					& S2S      			&  \textbf{7.80} & \textbf{42.7} &          9.88  &         111.3  &         12.30  &         131.5 	\\ % 
					& Transformer       &          9.14  &         50.9  &         13.03  &         148.0  &         15.80  &         170.0 	\\ % 
					& Transformer no TF &          8.24  &         44.0  &  \textbf{9.03} & \textbf{101.5} & \textbf{10.22} & \textbf{110.4}  	\\ % 126 127
					& U. Transformer    &          8.60  &         48.0  &         13.60  &         154.6  &         14.33  &         154.5 	\\ % 
					\cline{1-8}					
					
					\multirow{4}{*}{\shortstack[l]{Half-hourly\\with similar\\profiles}}
					& S2S      			&          8.44  &         45.0  &         10.23  &         116.2  &         11.10  &         119.1 	\\ % 
					& Transformer       &          7.82  &         42.5  &         10.78  &         121.6  & \textbf{10.98} & \textbf{118.5}	\\ % 
					& Transformer no TF &  \textbf{7.41} & \textbf{41.0} & \textbf{10.16} & \textbf{114.1} &         12.54  &         134.8  	\\ % 128 129
					& U. Transformer    &          9.05  &         48.1  &         10.22  &         114.6  &         11.90  &         136.0 	\\ % 
					\cline{1-8}					
					
					\multirow{4}{*}{\shortstack[l]{Hourly long\\with similar\\profiles}}
					& S2S      			&          7.49  &         40.9  &  \textbf{9.43} & \textbf{106.2} &         10.10  &         109.5 	\\ % 
					& Transformer       &          6.97  &         38.0  &         10.82  &         122.5  &         11.30  &         121.8 	\\ % 
					& Transformer no TF &  \textbf{6.68} & \textbf{36.8} &          9.85  &         111.8  &  \textbf{9.54} & \textbf{103.1} 	\\ % 132 133
					& U. Transformer    &          6.74  &         38.5  &         10.35  &         116.1  &         12.82  &         138.2 	\\ % 
					\cline{1-8}					
					
					\multirow{4}{*}{\shortstack[l]{Hourly; $c=3$}}
					& S2S      			&          8.96  &         46.8  &  \textbf{8.61} &  \textbf{95.9} &  \textbf{9.61} & \textbf{103.0}	\\ % 
					& Transformer       &  \textbf{8.14} & \textbf{44.5} &         10.46  &         117.5  &         12.25  &         130.9 	\\ % 
					& Transformer no TF &          8.65  &         45.4  &          9.36 &          105.0  &         10.76  &         115.1  	\\ % 134 135
					& U. Transformer    &          9.29  &         48.5  &         11.05  &         124.5  &         12.97  &         138.9 	\\ % 
					\cline{1-8}					
					
					\multirow{4}{*}{\shortstack[l]{Hourly\\with similar\\profiles; $c=3$}}
					& S2S      			&  \textbf{8.32} & \textbf{43.8} &          9.20  &         103.1  &         10.68  &         114.8 	\\ % 
					& Transformer       &          8.58  &         45.5  &  \textbf{8.90} &  \textbf{99.5} &  \textbf{9.29} &  \textbf{99.9}	\\ % 
					& Transformer no TF &          8.86  &         45.4  &          8.99  &         101.1  &         10.21  &         109.7  	\\ % 136 137
					& U. Transformer    &          9.00  &         48.3  &         10.07  &         112.5  &          9.34  &         100.5 	\\ % 
					\cline{1-8} 
				\end{tabular}
				\label{table:offline-results-bruny-lscape}
			\end{center}
		\end{table}
		\egroup
	\end{landscape}
	\clearpage% Flush page
}


\begin{figure}[htbp]
	\centerline{\includegraphics[width=.9\textwidth]{images/bruny_mape.pdf}}
	\caption{Mean absolute percentage error of each point in the forecasts for Bruny Island, task ``Hourly''.}
	\label{fig:bruny_mape}
\end{figure}

\begin{figure}[htbp]
	\centerline{\includegraphics[width=.9\textwidth]{images/bruny_hist.pdf}}
	\caption{Distribution of forecast error corresponding to Figure \ref{fig:bruny_mape}.}
	\label{fig:bruny_hist}
\end{figure}


\subsection{Training data requirements}
\label{train-req}
So far the models have been trained with 4.5 years of data.
Table \ref{table:training-req} shows how the MAPE changes on the hourly task when the amount of training data is decreased.
In each case, the training data retained was the data closest in time to the test set.
These results indicate that either there are diminishing improvements to MAPE as the amount of training data increases, or the data from further in the past does not help with forecasting data in the future.
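
To put the figures in table \ref{table:training-req} in perspective, the following arithmetic is a reading of the tabulated values rather than an additional experiment. Reducing the training data from 4.5 to 1.5 years raises the MAPE by $8.31 - 7.85 = 0.46$ percentage points for the S2S model and by $8.25 - 7.77 = 0.48$ percentage points for the transformer without teacher forcing. In relative terms:
\[
\frac{8.31 - 7.85}{7.85} \approx 5.9\%,
\qquad
\frac{8.25 - 7.77}{7.77} \approx 6.2\% .
\]
Removing two thirds of the training data therefore costs roughly six percent of relative accuracy for both models.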

\begin{table}[htbp]
	\caption{Training data requirements.}
	\begin{center}
		\begin{tabular}{clc}
			\multicolumn{1}{c}{\textbf{Years of Training Data}} &
			\multicolumn{1}{c}{\textbf{Model}} & 
			\multicolumn{1}{c}{\textbf{\shortstack[c]{MAPE (\%)}}} \\ \cline{1-3}
			
			\multirow{2}{*}{\shortstack[l]{4.5}}       
			& S2S      			&  7.85 \\ %  54  55
			& Transformer no TF &  7.77 \\ % 130 131
			\cline{1-3}
			
			\multirow{2}{*}{\shortstack[l]{3.5}} 
			& S2S      			&  7.95 \\ % 153 168
			& Transformer no TF &  7.66 \\ % 159 171
			\cline{1-3}
			
			\multirow{2}{*}{\shortstack[l]{2.5}} 
			& S2S      			&  7.96 \\ % 154 169
			& Transformer no TF &  7.96 \\ % 160 172
			\cline{1-3}
			
			\multirow{2}{*}{\shortstack[l]{1.5}} 
			& S2S      			&  8.31 \\ % 155 170
			& Transformer no TF &  8.25 \\ % 161 173
			\cline{1-3} 
			
		\end{tabular}
		\label{table:training-req}
	\end{center}
\end{table}


\subsection{Autoregression requirements}
All models presented have been supplied with at least the past 24 hours of load.
If this data is not supplied, the models are still able to forecast load.
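Evaluated on the hourly task with the $\vl$ past load vector removed, the sequence to sequence model achieves a MAPE of 10.68\% and the transformer without teacher forcing model achieves a MAPE of 10.53\%, compared with 7.85\% and 7.77\% respectively when past load is supplied (table \ref{table:offline-results-bruny-lscape}).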
With a careful training regime it may be possible to train these models to robustly handle missing input data.


\subsection{Online CONSORT implementation}
\label{consort-eval}
In July 2018 the load forecasting system was implemented as part of the CONSORT project and was used to assist with dispatch of residential batteries.
The model implemented was a transformer (with teacher forcing enabled), and is somewhat different to the models discussed in the offline evaluation.
Although the implemented model achieved sufficient accuracy while in use, it has since been determined to be a sub-optimal configuration.
Because the data and model configuration differs from that described for the offline evaluation in section \ref{data-model-config}, it is described in full here, with some repetition.

\subsubsection{Data and model configuration}
The following data was available from 2009-2018:
\begin{itemize}
	\item Apparent power at reclosers R1 through R4 (Figure \ref{fig:bruny_network}).
	\item Temperature at Lenah Valley, Tasmania (50 km from Bruny Island).
	\item Apparent power consumption at St Helens, Tasmania.
\end{itemize}

This data was averaged to 30-minute resolution and split into a training set containing data from October 2009 through September 2014, and a testing set containing data from October 2014 through April 2018.

The network was supplied with data from the previous and future 24 hours, for a total input sequence length of 96 (representing 48 hours at 30-minute resolution).
The output sequence length was 48 (24 hours).
The data was configured differently from the configuration described in section \ref{data-model-config}.
The $\mX$ input matrix was constructed with all vectors containing 96 elements, representing the previous 24 and the future 24 hours.
The past load vector $\vl = [ l_1 \: l_2 \: \ldots \: l_{48} \: 0 \: \ldots \: 0]^\top \in \mathbb{R}^{96}$ was populated with the past 48 load observations (representing 24 hours) and zero padded to match the shape of the rest of the data.

The following time series were supplied to the model input:
\begin{itemize}
	\item Apparent power from recloser R1 (Figure \ref{fig:bruny_network}), with future values set to zero.
	\item Temperature.
	\item Day of the week as an integer from 0 to 6 (local time).
	\item Minutes since midnight (local time).
	\item Boolean 1 or 0 indicating whether it is a holiday.
	\item Holiday type.
\end{itemize}

When used for inference, temperature forecasts were obtained from the Bureau of Meteorology.
Additionally, five similar periods were identified using data from R1 by the process described in section \ref{simperiod}.
The data over the similar periods for each of the following time series was provided as input:
\begin{itemize}
	\item Reclosers R1, R2, R3, and R4 (Figure \ref{fig:bruny_network}) (as separate time series).
	\item St Helens recloser.
	\item Lenah Valley temperature.
\end{itemize}

In total, 36 time series were provided as input to the model.
St Helens was included because it was observed to display similar patterns to Bruny Island around holiday periods.
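
The total of 36 follows from the two lists above; the count is sketched here as a check rather than as a statement of the implementation. Six series are supplied directly, and each of the five similar periods contributes six series (R1 to R4, St Helens, and Lenah Valley temperature), giving
\[
6 + 5 \times 6 = 36 .
\]
Since every vector contains 96 elements, this would correspond to $\mX \in \mathbb{R}^{96 \times 36}$, assuming the series are concatenated column-wise as in equation \ref{eq:input-matrix}.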

The forecasting system was configured with the parameters in table \ref{table:parameters}, with the upper section giving transformer model parameters and the lower section giving weights used for similar period selection.
The model was trained for $10^5$ iterations.

\begin{table}[htbp]
	\caption{Case study model parameters.}
	\begin{center}
		\begin{tabular}{clc}
			%			3.\hline
			\textbf{Parameter}&\textbf{Description}&\textbf{Value} \\
			\hline
			$L$ & Number of encoder and decoder layers & 4 \\
			$d$ & Hidden dimension & 32 \\
			$h$ & Number of attention heads & 4 \\
			$D$ & Dropout fraction & 0.2 \\
			$c$ & Loss function modifier & 3 \\
			-   & Training batch size & 16 \\
			\hline
			-   & Maximum future temperature weight & 10 \\
			-   & Minimum future temperature weight & 20 \\
			-   & Maximum past load weight & 30 \\
			-   & Holiday type weight & $10^9$ \\
			-   & Day of week weight & $10^6$ \\
			-   & Day of month weight & $10^6$ \\
			-   & Month of year weight & $10^6$ \\
			
		\end{tabular}
		\label{table:parameters}
	\end{center}
\end{table}

\subsubsection{Online evaluation results}
When implemented live on the Bruny Island distribution network during the July 2018 school holiday period, as part of the CONSORT project, the forecaster was observed to reliably forecast large demand peaks.
This enabled the fleet of distributed batteries to provide effective network support by reducing net demand peaks.
An accurate forecast, issued sufficiently far in advance of the demand peak, was observed to give the batteries adequate time to store energy in the lead-up to the peak and to discharge during it.
In at least one instance over the test period, this was sufficient to avoid using the island's diesel generator at all, when it would otherwise almost certainly have been required.
Data collected during this peak demand period can be seen in Figure \ref{fig:bruny_nac}.
The upper section shows 24 hours of historical load in black, plus the most recent 24-hour horizon forecast in dashed black (recalculated every five minutes) and all old forecasts in grey.
The lower section shows the battery charge rate, where a negative value indicates that the batteries are supporting the grid.

Typically the generator is switched on when load exceeds 1050 kVA.
During the first peak the graph shows the batteries supplying between 50 and 100 kW to the island.
Without this support from the batteries, the generator would likely have been required to operate.
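
The reasoning behind this claim can be sketched as follows, under the simplifying assumptions (not taken from the measured data) that the battery output is delivered at close to unity power factor, so that kW and kVA can be compared directly, and that the underlying demand is unaffected by the batteries.
The symbols $S_{\mathrm{obs}}$, $S_{\mathrm{unsupported}}$ and $P_{\mathrm{batt}}$ are introduced here for illustration only.
If $S_{\mathrm{obs}}$ denotes the feeder load observed with battery support and $P_{\mathrm{batt}}$ the battery discharge, the load without support is approximately
\[
	S_{\mathrm{unsupported}} \approx S_{\mathrm{obs}} + P_{\mathrm{batt}} .
\]
With $P_{\mathrm{batt}}$ between 50 and 100 kW, the 1050 kVA threshold would therefore have been crossed for any observed load above roughly $1050 - 100 = 950$ kVA to $1050 - 50 = 1000$ kVA.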

\begin{figure}[htbp]
	\centerline{\includegraphics[width=.75\textwidth]{images/bruny_nac.pdf}}
	\caption{Results from the forecasting system's implementation in the CONSORT project.}
	\label{fig:bruny_nac}
\end{figure}


\section{ISO New England}
ISO New England is a regional transmission organization operating in the New England area of the United States.
Data from ISO New England has been used in several papers to evaluate the performance of load forecasting systems \cite{Ceperic2013,Chen2010}.
This section has two aims.
The first is to compare the proposed forecasting models to the methods of \citet{Ceperic2013}, discussed in section \ref{SVR}, and \citet{Chen2010}, discussed in section \ref{litrev-ann}.
The second is to validate that the models used for Bruny Island can also be applied to other feeders with no modification.
A week of load from the ISO New England dataset is shown in Figure \ref{fig:week-isone}.

\begin{figure}[htbp]
	\centerline{\includegraphics[width=.9\textwidth]{images/week-isone.pdf}}
	\caption{A week of load for ISO New England.}
	\label{fig:week-isone}
\end{figure}

\citet{Ceperic2013} and \citet{Chen2010} trained their models on data from March 2003 until January 2006 and tested on data from January 1 2006 to December 31 2006.
Publicly available historical electrical load data was obtained from the ISO New England website \cite{isone}.
Publicly available historical temperature data was obtained from the National Centers for Environmental Information \cite{NOAA} for Concord, New Hampshire, United States (latitude 43.204900 degrees, longitude \mbox{-71.502740 degrees}).
This temperature data differs from that used by \citet{Ceperic2013}, as that data could not be accessed.

The following holidays were used, and the same heuristics described in section \ref{simperiod} were applied.
\begin{itemize}
	\item Birthday of Martin Luther King, Jr.;
	\item Washington's Birthday;
	\item Memorial Day;
	\item Independence Day;
	\item Labor Day;
	\item Columbus Day;
	\item Veterans Day;
	\item Thanksgiving;
	\item Easter; and
	\item Christmas/New Year (Dec 21 through Jan 6).
\end{itemize}

The forecasting was performed identically to the ``hourly long'' task described in section \ref{forecasting-tasks}, except with no similar days ($k=0$).
The input matrix $\mX$ was constructed according to equation \ref{hourly-long-x}.
\citet{Ceperic2013} and \citet{Chen2010} turned heuristics about heating and cooling load based on temperature into input features for their models.
This is generally considered good practice \cite{Zinkevich2018}.
However, as one of the primary aims of this thesis is to develop a practical load forecasting system that does not require significant modifications when forecasting for different feeders, these heuristic features were not provided as inputs.

The model configurations are the same as those described in table \ref{table:offline-parameters}, and the models were trained for $10^5$ iterations.
The results are shown in table \ref{table:iso-results}.
The best proposed model achieves a MAPE of 2.14\%, which is worse than the values reported in the other papers.
However, considering the differences in inputs (heuristic weather inputs), that \citet{Chen2010} only performs forecasts once per day, and that the models here are identical to the models used for Bruny Island, these results are encouraging.
The MAPE versus forecast horizon and the distribution of forecast error are shown in Figures \ref{fig:iso_mape} and \ref{fig:iso_hist}.

These results demonstrate that the same models that forecast load at a low level in the distribution network can also be applied to forecasting at a high level in the transmission network without modification.

\begin{table}[htbp]
	\caption{ISO New England results.}
	\begin{center}
		\begin{tabular}{l|c}
			%			3.\hline
			\textbf{Model}&\textbf{MAPE (\%)} \\
			\hline
			Transformer no TF            & 2.14 \\ % 151
			S2S                          & 2.15 \\ % 165
			SSA-SVR \cite{Ceperic2013}   & 1.31 \\
			SIWNN \cite{Chen2010}        & 1.71 \\
			ANN \cite{Chen2010}          & 2.03 \\
			Transformer                  & 3.09 \\ % 166
			U. Transformer               & 3.65 \\ % 167
			
		\end{tabular}
		\label{table:iso-results}
	\end{center}
\end{table}
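
The size of the gap in table \ref{table:iso-results} can be read off directly.
Taking the best proposed model (Transformer no TF, 2.14\%) as the reference, the differences in MAPE, in percentage points, are
\[
	2.14 - 1.31 = 0.83, \qquad 2.14 - 1.71 = 0.43, \qquad 2.14 - 2.03 = 0.11
\]
relative to SSA-SVR, SIWNN, and ANN respectively.
These differences are simple arithmetic on the tabulated values and do not adjust for the differences in inputs and forecast frequency between the studies.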

\begin{figure}[htbp]
	\centerline{\includegraphics[width=.9\textwidth]{images/iso_mape.pdf}}
	\caption{Mean absolute percentage error of each point in the forecasts for ISO New England, task ``Hourly long''.}
	\label{fig:iso_mape}
\end{figure}

\begin{figure}[htbp]
	\centerline{\includegraphics[width=.9\textwidth]{images/iso_hist.pdf}}
	\caption{Distribution of forecast error corresponding to Figure \ref{fig:iso_mape}.}
	\label{fig:iso_hist}
\end{figure}

\section{Implementation requirements}
The training and evaluation times of the models without similar days are presented in table \ref{table:train_infer_times}.
The models were restricted to two cores (four threads) on an AMD Ryzen 7 1700 at 3.5 GHz.
Performance is generally not improved by using a GPU or more CPU cores because of the small batch size, except for the transformer on the hourly long task, where the speed can be doubled with a GPU or more cores.
It is likely that tuning the batch size and learning rate would make it feasible to train on a GPU in far fewer iterations, reducing the overall training time.
Each model requires less than 2 GB of memory.

\begin{table}[htbp]
	\caption{Training and evaluation times.}
	\begin{center}
		\begin{tabular}{llrc}
			\multicolumn{1}{c}{\textbf{Task}} &
			\multicolumn{1}{c}{\textbf{Model}} & 
			\multicolumn{1}{c}{\textbf{\shortstack[c]{Training time}}} &
			\multicolumn{1}{c}{\textbf{\shortstack[c]{Evaluation time (ms)}}} \\ \cline{1-4}
			
			\multirow{2}{*}{\shortstack[l]{Hourly}}
			& S2S      			& 28m  & 17  \\ % 
			& Transformer no TF & 52m  & 31 \\ % 130 131
			\cline{1-4}					
			
			\multirow{2}{*}{\shortstack[l]{Half-hourly}}       
			& S2S      			& 43m  & 26 \\ % 
			& Transformer no TF & 1h 32m  & 55  \\ % 126 127
			\cline{1-4}									
			
			\multirow{2}{*}{\shortstack[l]{Hourly long}}       
			& S2S      			& 52m  & 31  \\ % 
			& Transformer no TF & 3h 5m  & 110 \\ % 132 133
			\cline{1-4}					

		\end{tabular}
		\label{table:train_infer_times}
	\end{center}
\end{table}
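
As a rough illustration of the figures in table \ref{table:train_infer_times}, and assuming that the $10^5$ training iterations stated for the case studies also apply to these timings, the average time per training iteration for the fastest entry (S2S, hourly) and the slowest entry (Transformer no TF, hourly long) is
\[
	\frac{28 \times 60\ \mathrm{s}}{10^5} \approx 17\ \mathrm{ms}, \qquad \frac{185 \times 60\ \mathrm{s}}{10^5} = 111\ \mathrm{ms}.
\]
The ratio of Transformer no TF time to S2S time on the hourly, half-hourly, and hourly long tasks is
\[
	\frac{52}{28} \approx 1.9, \qquad \frac{92}{43} \approx 2.1, \qquad \frac{185}{52} \approx 3.6
\]
for training (in minutes), and
\[
	\frac{31}{17} \approx 1.8, \qquad \frac{55}{26} \approx 2.1, \qquad \frac{110}{31} \approx 3.5
\]
for evaluation (in milliseconds), so the relative cost of the transformer is similar for training and evaluation and is largest on the hourly long task.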