From ec8df65fba1817a205a815cac6dab4366d0da55a Mon Sep 17 00:00:00 2001 From: jaseg Date: Wed, 14 Apr 2021 09:47:03 +0200 Subject: Paper WIP --- paper/safety-reset-paper.tex | 739 +++++++++++++++++++++++++++++++++++++++++-- 1 file changed, 718 insertions(+), 21 deletions(-) (limited to 'paper') diff --git a/paper/safety-reset-paper.tex b/paper/safety-reset-paper.tex index d7be50c..640f84a 100644 --- a/paper/safety-reset-paper.tex +++ b/paper/safety-reset-paper.tex @@ -1,4 +1,4 @@ -\documentclass[nohyperref]{iacrtrans} +\documentclass[runningheads]{llncs} \usepackage[T1]{fontenc} \usepackage[ backend=biber, @@ -12,7 +12,6 @@ \usepackage{amssymb,amsmath} \usepackage{eurosym} \usepackage{wasysym} -\usepackage{amsthm} \usepackage[binary-units]{siunitx} \DeclareSIUnit{\baud}{Bd} @@ -30,14 +29,35 @@ \begin{document} -\title[Ripples in a Pond]{Transmitting Information through Grid Frequency Modulation} +\title{Ripples in a Pond: Transmitting Information through Grid Frequency Modulation} +\titlerunning{Ripples in a Pond: Transmitting Information through Grid Frequency Modulation} \author{Jan Sebastian Götte \and Björn Scheuermann} +\authorrunning{Jan Sebastian Götte \and Björn Scheuermann} \institute{HIIG\\ \email{safetyreset@jaseg.de} \and HU Berlin \\ \email{scheuermann@informatik.hu-berlin.de}} % FIXME keywords -\keywords{hardware security \and energy systems \and signal theory} \maketitle +\keywords{Security, privacy and resilience in critical infrastructures \and Security and privacy in ``internet of +things'' \and Cyber-physical systems \and Hardware security \and Network Security \and Energy systems \and Signal theory} \begin{abstract} + The smart grid is a large, complex and interconnected technological system. 
With remotely controllable load switches + having been rolled out at scale in some countries, a tiny flaw inside the firmware of one of these embedded devices + may allow attackers to remotely trigger large-scale excursions of grid parameters with potentially catastrophic + results. Attaining perfect security from such cyberphysical attacks is a monumental embedded engineering task, and + observations do not indicate that current efforts meet the requirements of this task.%FIXME cite recent RECESSIM work + + In this paper, we approach the smart grid safety issue by implementing an emergency override that can be used to + e.g.\ reset all connected devices to a known-good state and preempt subsequent compromise by cutting + communication links. To yield a fully fail-safe design, our system does not rely on the internet or any other + communication network to work. Instead, our system transmits error-corrected and cryptographically secured commands + by modulating grid frequency using a single large consumer such as a large aluminium smelter. This approach differs + from traditional Powerline Communication (PLC) systems in that it reaches every device within the same synchronous + area. + + Using extensive simulations we have determined that control of a $\SI{25}{\mega\watt}$ load would allow for the + transmission of a cryptographically secured \emph{reset} signal within $15$ minutes. We have produced a + proof-of-concept prototype receiver that demonstrates the feasibility of decoding such signals even on + resource-constrained microcontroller hardware. \end{abstract} \section{Introduction} @@ -79,15 +99,11 @@ This complex standardization landscape and market situation has led to a prolife microcontroller firmware. The complexity and scale of this--often network-connected--firmware makes for a ripe substrate for bugs to surface. 
-A remotely exploitable flaw inside a smart meter's firmware\footnote{ - There are several smart metering architectures that ascribe different roles to the component called \emph{smart - meter}. Not all systems are susceptible to attacks to the same degree, with the German implementation being almost - immune as far as energy availability is concerned. For clarity, we use \emph{smart meter} to describe the entire - system at the customer premises including both the meter and if present a gateway. -} could have consequences ranging from impaired billing functionality to an existential threat to grid -stability\cite{anderson01,anderson02}. In a country where meters commonly include disconnect switches for purposes such -as prepaid tariffs a coördinated attack could at worst cause widespread activation of grid safety systems by repeatedly -connecting and disconnecting megawatts of load capacity in just the wrong moments\cite{wu01}. +A remotely exploitable flaw inside the firmware of a component of a smart metering system could have consequences ranging +from impaired billing functionality to an existential threat to grid stability\cite{anderson01,anderson02}. In a country +where meters commonly include disconnect switches for purposes such as prepaid tariffs, a coördinated attack could at +worst cause widespread activation of grid safety systems by repeatedly connecting and disconnecting megawatts of load +capacity at just the wrong moments\cite{wu01}. Mitigation of these attacks through firmware security measures is unlikely to yield satisfactory results. The enormous complexity of smart meter firmware makes firmware security extremely labor-intensive. The diverse standardization @@ -104,10 +120,10 @@ re-flashes the meter's main microcontroller over the standard JTAG interface. N \emph{changing grid frequency itself}. 
This is fundamentally different in both generation and detection from systems such as traditional PLC that superimpose a signal on grid voltage, but leave grid frequency itself unaffected. -In this thesis, starting from a high level architecture we have carried out extensive simulations of our proposal's -performance under real-world conditions. Based on these simulations we implemented an end-to-end prototype of our -proposed safety reset controller as part of a realistic smart meter demonstrator. Finally we experimentally validated -our results and we will conclude with an outline of further steps towards a practical implementation. +Starting from a high level architecture, we have carried out extensive simulations of our proposal's performance under +real-world conditions. Based on these simulations we implemented an end-to-end prototype of our proposed safety reset +controller as part of a realistic smart meter demonstrator. Finally, we experimentally validated our results and we will +conclude with an outline of further steps towards a practical implementation. This work contains the following contributions: \begin{enumerate} @@ -116,15 +132,696 @@ This work contains the following contributions: implementation. \item We design a communication system based on GFM. \item We carry out extensive simulations of our systems to determine its performance characteristics. - \item We show the simple grid voltage recorder design we used to capture data for our simulations. - \item We introduce a new, simplified method to determine grid frequency from a capture of the grid voltage waveform - that is simple to implement on constrained embedded devices. + \item We show the simple grid frequency recorder design we used to capture data for our simulations. \end{enumerate} \section{Related work} \label{sec_related_work} +% FIXME: Cut down this section from ~6 pages to 2...3 pages. 
+ +\subsection{Security and Privacy in the Smart Grid} + +The smart grid in practice is nothing more or less than an aggregation of embedded control and measurement devices that +are part of a large control system. This implies that all the same security concerns that apply to embedded systems in +general also apply to most components of a smart grid. Where programmers have been struggling for decades now with input +validation\cite{leveson01}, the same potential issue raises security concerns in smart grid scenarios as well\cite{mo01, +lee01}. Only, in smart grid we have two complicating factors present: Many components are embedded systems, and as such +inherently hard to update. Also, the smart grid and its control algorithms act as a large (partially-)distributed +system making problems such as input validation or authentication harder\cite{blaze01} and adding a host of distributed +systems problems on top\cite{lamport01}. + +Given that the electrical grid is essential infrastructure in our modern civilization, these problems amount to +significant issues in practice. Attacks on the electrical grid may have grave consequences\cite{anderson01,lee01} while +the long maintenance cycles of various components make the system slow to adapt. Thus, components for the smart grid +need to be built to a much higher standard of security than most consumer devices to ensure they live up to well-funded +attackers even decades down the road. This requirement intensifies the challenges of embedded security and distributed +systems security among others that are inherent in any modern complex technological system. The safety-critical nature +of the modern smart metering ecosystem in particular was quickly recognized\cite{anderson01}. + +A point we will not consider in much depth in this work is theft of electricity. 
While in publications aimed towards the +general public the introduction of smart metering is always motivated with potential cost savings and ecological +benefits, in industry-internal publications the reduction of electricity theft is often cited as an +incentive\cite{czechowski01}. Likewise, academic publications tend to either focus on other benefits such as generation +efficiency gains through better forecasting or rationalize the consumer-unfriendly aspects of smart metering with social +benefits\cite{mcdaniel01}. They do not usually point out \emph{revenue protection} mechanisms as +incentives\cite{anderson01,anderson02}. + +A serious issue in smart metering setups is customer privacy. Even though the meter ``only'' collects aggregate energy +consumption of a whole household, this data is highly sensitive\cite{markham01}. This counterintuitive fact was +initially overlooked in smart meter deployments leading to outrage, delays and reduced features\cite{cuijpers01}. The +root cause of this problem is that given sufficient timing resolution these aggregate measurements contain ample +entropy. Through disaggregation algorithms individual loads can be identified and through pattern matching even complex +usage patterns can be discerned with alarming accuracy\cite{greveler01}. Similar privacy issues arise in many other +areas of modern life through pervasive tracking and surveillance\cite{zuboff01}. + +Another fundamental challenge in smart grid implementations is the central role of smart electricity meters in the smart +grid ecosystem. Smart meters are used both for highly-granular load measurement and (in some countries) load +switching\cite{zheng01}. Smart electricity meters are effectively consumer devices. They are built down to a certain +price point that is measured by the burden it puts on consumers and that is divided by the relatively small market +served by a single smart meter implementation. 
Such cost requirements can preclude security features such as the use of +a standard hardened software environment on a high-powered embedded system. Landis+Gyr, a large manufacturer that makes +most of its revenue from utility meters, writes in its 2019 annual report that it spent \SI{36}{\percent} of its total +R\&D budget on embedded software (firmware) while spending only \SI{24}{\percent} on hardware +R\&D\cite{landisgyr01,landisgyr02}, indicating a significant tension between firmware security and a smart meter +vendor's bottom line. + +\subsection{The state of the art in embedded security} + +Embedded software security generally is much harder than security of higher-level systems. The two primary factors +affecting this are that on one hand, embedded devices usually run highly customized firmware that (often by necessity) +is rarely updated. On the other hand, embedded devices often lack the advanced security mechanisms such as memory +management units that are found in most higher-power devices. Even well-funded companies continue to have trouble +securing their embedded systems. A spectacular example of this difficulty is the recently-exposed flaw in Apple's iPhone +SoC first-stage ROM bootloader that allows for the full compromise of any iPhone before the iPhone X given physical +access to the device. iPhone 8, one of the affected models, was still being manufactured and sold by Apple until April +2020. In another instance in 2016 researchers found multiple flaws in the secure-world firmware used by Samsung in +their mobile phone SoCs. The flaws they found were both severe architectural flaws such as secret user input being +passed through untrusted userspace processes without any protection and shocking cryptographic flaws such as +CVE-2016-1919\footnote{\url{http://cve.circl.lu/cve/CVE-2016-1919}}\cite{kanonov01}. And Samsung is not the only large +multinational corporation having trouble securing their secure-world firmware implementation. 
In 2014 researchers found +an embarrassing integer overflow flaw in the low-level code handling untrusted input in Qualcomm's QSEE +firmware\cite{rosenberg01}. For an overview of ARM TrustZone including a survey of academic work and past security +vulnerabilities of TrustZone-based firmware see \cite{pinto01}. + +If even companies targeting R\&D budgets that rival some countries' national budgets at mass-market consumer devices +have trouble securing their secure embedded software stacks, what is a much smaller smart meter manufacturer to do? +Especially if national standards mandate complex protocols such as TLS that are tricky to implement +correctly\cite{georgiev01}, the manufacturer is short on options to secure their product. + +\subsection{Attack surface in the smart grid} + +From the previous paragraphs we can conclude that in smart metering technology, market incentives do not currently +provide the conditions for a level of device security that will reliably last the coming decades. Considering this +tension, in this section we will outline the cyberphysical risks that arise from attacks on the smart grid in the +first place. + +The first such attack that might come to mind is one where the attacker compromises components of the grid's centralized +control systems. This type of attack is often cited in popular discourse and to our knowledge is the only type of attack +against a grid that has ever been carried out in practice at scale. Despite their severity, these attacks do not pose a +strictly \emph{scientific} challenge, since they are generic to any industrial control system. Their +causes and countermeasures are generally well understood and the hardest challenge in their prevention is likely to lie +in budgetary constraints. + +Beyond the centralized control systems, the next target for an attacker may be the communication links between those +control systems and other smart grid components. 
While in older systems as well as the last mile to households' smart +meters special-purpose systems such as PLC are still common, in the overall system IP-based technologies have +proliferated much like they did in other industries. Along with this adoption of IP-based communication links comes the +ability to apply generic network security measures from the IP world to the smart grid domain. In this way, a +standardized, IP-based protocol stack unlocks decades of network security improvements at little cost. + +Finally, an attacker might target the endpoint device itself. Smart meters are deployed at a large scale +%%% FIXME << HERE WIP >> + +\subsection{Cyberphysical threats in the smart grid} + +If we model the smart grid as a control system responding to changes in inputs by regulating outputs, on a +very high level we can see two general categories of attacks: Attacks that directly change the state of the outputs, and +attacks that try to influence the outputs indirectly by changing the system's view of its inputs. The former would be an +attack such as shutting down a power plant to decrease generation capacity\cite{lee01}. The latter would be an attack +such as forging grid frequency measurements where they enter a power plant's control systems to provoke the control +systems to +oscillate\cite{kosut01,wu01,kim01}. + + +\paragraph{Control function exploits.} +Control function exploits are attacks on the mathematical control loops used by the centralized control system. One +example of this type of attack are resonance attacks as described in \cite{wu01}. In this kind of attack, inputs from +peripheral sensors indicating grid load to the centralized control system are carefully modified to cause a +disproportionately large oscillation in control system action. This type of attack relies on complex resonance effects +that arise when mechanical generators are electrically coupled. 
These resonances, colloquially called ``modes'', are +well-studied in power system engineering\cite{rogers01,grebe01,entsoe01,crastan03}. Even disregarding modern attack +scenarios, for stability electrical grids are designed with measures in place to dampen any resonances inherent to grid +structure. These resonances are hard to analyze since they require an accurate grid model and they are unlikely to be +noticed under normal operating conditions. + +Mitigation of these attacks can be achieved by ensuring unmodified sensor inputs to the control systems in the first +place. Carefully designing control systems not to exhibit exploitable behavior such as oscillations is also possible but +harder. + +\paragraph{Endpoint exploits.} +The one to us rather interesting attack on smart grid systems is someone exploiting the grid's endpoint devices such as +smart electricity meters. These meters are deployed on a massive scale, with at least one meter per household on +average\footnote{Households rarely share a meter but some households may have a separate meter for detached properties +such as a detached garage or basement.}. Once compromised, restoration to an uncompromised state can be difficult if it +requires physical access to thousands of devices in hard-to-access locations. + +By compromising smart electricity meters, an attacker can forge the distributed energy measurements these devices +perform. In a best-case scenario, this might only affect billing and lead to customers being under- or over-charged if +the attack is not noticed in time. In a less ideal scenario falsified energy measurements reported by these devices +could impede the correct operation of centralized control systems. + +In some countries such as the UK smart meters have one additional function that is highly useful to an attacker: They +contain high-current disconnect switches to disconnect the entire household or business in case electricity bills are +left unpaid for a certain period. 
In countries that use these kinds of systems on a widespread level, the load +disconnect switch is controlled by the smart meter's central microcontroller. This allows anyone compromising this +microcontroller's firmware to actuate the disconnect switch at will. Given control over a large number of +network-connected smart meters, an attacker might thus be able to cause large-scale disruptions of power +consumption\cite{anderson01,temple01}. Combined with an attack method such as the resonance attack from \cite{wu01} +that was mentioned above, this scenario poses a serious threat to grid stability. + +In places where Demand-Side Management (DSM) is common this functionality may be abused in a similar way. In DSM the +smart metering system directly controls power to certain devices such as heaters. The utility can remotely control the +turn-on and turn-off of these devices to smoothen out the load curve. In exchange the customer is billed a lower price +for the energy consumed by these loads. DSM was traditionally done in a federated fashion usually through low-frequency +PLC over the distribution grid\cite{dzung01}. Smart metering systems no longer require large, resource-intensive +transmitters in substations and bear the potential for a rollout of such technology on a much wider scale than before. +This leads to a potentially significant role of DSM systems in the impact calculation of an attack on a smart metering +system. DSM does not control as much load capacity as remote disconnect switches do but the attacks cited in the above +paragraph still fundamentally apply. + +\subsection{Communication Channels on the Grid} + +There is a number of well-established technologies for communication on or along power lines. 
We can distinguish three +basic system categories: Systems using separate wires (such as DSL over landline telephone wiring), wireless radio +systems (such as LTE) and \emph{power line communication} (PLC) systems that reüse the existing mains wiring and +superimpose data transmissions onto the 50 Hz mains sine\cite{gungor01,kabalci01}. + +For our scenario, we will ignore short-range communication systems. There exists a large number of \emph{wideband} +power line communication systems that are popular with consumers for bridging Ethernet segments between parts of an +apartment or house. These systems transmit up to several hundred megabits per second over distances up to several tens +of meters\cite{kabalci01}. Technologically, these wideband PLC systems are very different from \emph{narrowband} +systems used by utilities for load management among other applications and they are not relevant to our analysis. + +\paragraph{Power line communication (PLC).} +In long-distance communications for applications such as load management, PLC systems are attractive since they allow +re-using the existing wiring infrastructure and have been used as early as in the 1930s\cite{hovi01}. Narrowband PLC +systems are a potentially low-cost solution to the problem of transmitting data at small bandwidth over distances of +several hundred meters up to tens of kilometers. + +Narrowband PLC systems transmit on the order of Kilobits per second or slower. A common use of this sort of system are +\emph{ripple control} systems. These systems superimpose a low-frequency signal at some few hundred Hertz carrier +frequency on top of the 50Hz mains sine. This low-frequency signal is used to encode switching commands for +non-essential residential or industrial loads. Ripple control systems provide utilities with the ability to actively +control demand while promising savings in electricity cost to consumers\cite{dzung01}. 
+ +In any PLC system there is a strict trade-off between bandwidth, power and distance. Higher bandwidth requires higher +power and reduces maximum transmission distance. Where ripple control systems usually use few transmitters to cover +the entire grid of a regional distribution utility, higher-bandwidth bidirectional systems used for automatic meter +reading (AMR) in places such as Italy or France require repeaters within a few hundred meters of a transmitter. + +\subsubsection{Landline and wireless IP-based systems.} +Especially in AMR infrastructure the cost-benefit trade-off of power line systems does not +always work out for utilities. A common alternative in these systems is to use the public internet for communication. +Using the public internet has the advantage of low initial investment on the part of the utility company as well as +quick commissioning. Disadvantages compared to a PLC system are potentially higher operational costs due to recurring +fees to network providers as well as lower reliability. Being integrated into power grid infrastructure, a PLC system's +failure modes are highly correlated with the overall grid. Put briefly, if the PLC interface is down, there is a good +chance that power is out, too. In contrast, general internet services exhibit a multitude of failures that are entirely +uncorrelated to power grid stability. For purposes such as meter reading for billing, this level of reliability is +sufficient. However, for systems that need to hold up in crisis situations such as the recovery system we are +contemplating in this paper, the public internet may not provide sufficient reliability. + +\subsubsection{Short-range wireless systems.} +Smart meters contain copious amounts of firmware but still pale in comparison to the complexity of full-scale computers +such as smartphones. 
For short-range communication between a meter and a cellular radio gateway mounted nearby or +between a meter and a meter reading operator in a vehicle on the street a protocol such as Wi-Fi (IEEE 802.11) is too +complex. Absent widely-used standards in this space, proprietary radio protocols grew attractive. These are often based +on some standardized lower-level protocol such as ZigBee (IEEE 802.15.4) but entirely home-grown ones also exist. To the +meter manufacturer a proprietary radio protocol has several advantages. It is easy to implement and requires no external +certification. It can be customized to its specific application. In addition it provides vendor lock-in to customers +sharing infrastructure such as a cellular radio gateway between multiple devices. In other fields a lack of +standardization has led to a proliferation of proprietary protocols and a fragmented protocol landscape. This is a large +problem since the consumer cannot easily integrate products made by different manufacturers into one system. In advanced +metering infrastructure this is unlikely to be a disadvantage since usually there is only one distribution grid +operator for an area. Shared resources such as a cellular radio gateway would most likely only be shared within a +single building and usually they are all operated by the same provider. + +Systems in Europe commonly support Wireless M-Bus, a European standardized protocol\cite{silabs01} that operates on +several ISM bands\footnote{ + Frequency bands that can be used for \emph{Industrial, Scientific and Medical} applications by anyone and that do + not require obtaining a license for transmitter operation. Manufacturers can use whatever protocol they like on + these bands as long as they obtain certification that their transmitters obey certain spectral and power + limitations. +}. 
ZigBee is another popular standard and some vendors additionally support their own proprietary protocols\footnote{ + For an example see \cite{honeywell01}. +}. + +\section{Grid Frequency as a Communication Channel} + +Despite the awesome complexity of large power grids the physics underlying their response to changes in load and +generation is surprisingly simple. Individual machines (loads and generators) can be approximated by a small number of +differential equations and the entire grid can be modelled by aggregating these approximations into a large system of +nonlinear differential equations. Evaluating these systems, it has been found that in large power grids small signal +steady state changes in generation/consumption power balance cause an approximately linear change in +frequency\cite{kundur01,crastan03,entsoe02,entsoe04}. \emph{Small signal} here describes changes in power balance that +are small compared to overall grid power. \emph{Steady state} describes changes over a time frame of multiple waveform +cycles as opposed to transient events that only last a few milliseconds. + +This approximately linear relationship allows the specification of a coefficient with unit \si{\watt\per\hertz} linking +power differential $\Delta P$ and frequency differential $\Delta f$. In this paper we are using the European power +grid as our model system. We are using data provided by ENTSO-E (formerly UCTE), the governing association of European +transmission system operators. In our calculations we use data for the continental European synchronous area, the +largest synchronous area. $\frac{\Delta P}{\Delta f}$, called \emph{Overall Network Power Frequency Characteristic} by +ENTSO-E, is around \SI{25}{\giga\watt\per\hertz}. + +We can derive general design parameters for any system utilizing grid frequency as a communication channel from the +policies of ENTSO-E\cite{entsoe02,entsoe03}. 
Any such system should stay below a modulation amplitude of +\SI{100}{\milli\hertz} which is the threshold defined in the ENTSO-E incidents classification scale for a Scale 0-1 +(from ``Anomaly'' to ``Noteworthy Incident'' scale) frequency degradation incident\cite{entsoe02} in the continental +Europe synchronous area. +% FIXME resolve cut --- + +Grid frequency in Europe's synchronous areas is nominally 50 Hertz, but there are small load-dependent variations from +this nominal value. Any device connected to the power grid (or even just within physical proximity of power wiring) can +reliably and accurately measure grid frequency at low hardware overhead. By intentionally modifying grid frequency, we +can create a very low-bandwidth broadcast communication channel. Grid frequency modulation has only ever been proposed +as a communication channel at very small scales in microgrids before\cite{urtasun01} and to our knowledge has not yet +been considered for large-scale application. + +Advantages of using grid frequency for communication are low receiver hardware complexity as well as the fact that a +single transmitter can cover an entire synchronous area. Though the transmitter has to be very large and powerful the +setup of a single large transmitter faces lower bureaucratic hurdles than integration of hundreds of smaller ones into +hundreds of local systems that each have autonomous governance. + +% FIXME resolve cut --- +\subsection{Interference from Frequency-Coupled Control Systems} + +The ENTSO-E Operations Handbook Policy 1 chapter\cite{entsoe02} defines the activation threshold of primary control to +be \SI{20}{\milli\hertz}. Ideally, a modulation system would stay well below this threshold to avoid fighting the +primary control reserve. Modulation line rate should likely be on the order of a few hundred Millibaud. 
Modulation at +these rates would outpace primary control action which is specified by ENTSO-E as acting within ``a few +seconds'' to \SI{15}{\second}. + +Keeping modulation amplitude below this threshold would help to avoid spuriously triggering these control functions. +The effective \emph{Network Power Frequency Characteristic} of primary control in the European grid is reported by +ENTSO-E at around \SI{20}{\giga\watt\per\hertz}. This works out to an upper bound on modulation power of +\SI{20}{\mega\watt\per\milli\hertz}. + + +\subsection{Transmission Grid Fundamentals for Computer Scientists} +\subsection{Determining Grid Frequency} + +% FIXME resolve cut --- +In commercial power systems Phasor Measurement Units (PMUs, also called \emph{synchrophasors}) are used to precisely +measure parameters of the mains voltage waveform, one of which is grid frequency. PMUs are used as part of SCADA systems +controlling transmission networks to characterize the operational state of the network. + +From a superficial viewpoint measuring grid frequency might seem like a simple problem. Take the mains voltage waveform, +measure time between two rising-edge (or falling-edge) zero-crossings and take the inverse $f = t^{-1}$. In practice, +phasor measurement units are significantly more complex than this. This discrepancy is due to the combination of both +high precision and quick response that is demanded from these units. High precision is necessary since variations of +mains frequency under normal operating conditions are quite small--in the range of \SIrange{5}{10}{\milli\hertz} over +short intervals of time. Relative to the nominal \SI{50}{\hertz} this is a deviation on the order of \SI{100}{ppm}. +Relative to the corresponding period of \SI{20}{\milli\second} this means a time deviation of about \SI{2}{\micro\second} +from cycle to cycle. 
From this it is already obvious why a simplistic measurement cannot yield the required precision +for manageable averaging times: We would need either an ADC sampling rate on the order of megasamples per second or, for a +reconstruction through interpolated readings, an impractically high ADC resolution. + +Detail on the inner workings of commercial phasor measurement units is scarce but given their essential role in SCADA +systems there is a large amount of academic research on such algorithms\cite{narduzzi01,derviskadic01,belega01}. A +popular approach to these systems is to perform a Short-Time Fourier Transform (STFT) on ADC data sampled at a high +sampling rate (e.g.\ \SI{10}{\kilo\hertz}) and then perform analysis on the frequency-domain data to precisely locate the +peak at \SI{50}{\hertz}. A key observation here is that FFT bin size is going to be much larger than the required frequency +resolution. This fundamental limitation follows from the Nyquist criterion\cite{shannon01} +and if we had to process an \emph{arbitrary} signal this would severely limit our practical measurement accuracy\footnote{ +    Some software packages providing FFT or STFT primitives such as scipy\cite{virtanen01} allow the user to +    super-sample FFT output by specifying an FFT width larger than input data length, padding the input data with zeros +    on both sides. Note that in line with the Nyquist theorem this \emph{does not} actually provide finer output +    resolution but instead just amounts to an interpolation between output bins. Depending on the downstream analysis +    algorithm it may still be sensible to use this property of the DFT for interpolation, but in general it will be +    computationally expensive compared to other interpolation methods and in any case it will not yield any better +    frequency resolution aside from a potential numerical advantage\cite{gasior02}. +}. +For this reason all approaches to grid frequency estimation are based on a model of the voltage waveform.
Nominally +this waveform is a perfect sine at $f=\SI{50}{\hertz}$. In practice it is a sine at $f\approx\SI{50}{\hertz}$ +superimposed with some aperiodic noise (e.g.\ irregular spikes from inductive loads being energized) as well as harmonic +distortion that is caused by topologically nearby devices with power factor $\cos \theta \neq 1.0$. Under a continuous +Fourier transform over a long period the frequency spectrum of a signal distorted like this will consist of a low noise floor, +determined mainly by aperiodic noise, on which rides a comb of harmonics as well as some sub-harmonics of $f \approx +f_\text{nom} = \SI{50}{\hertz}$. The main peak at $f \approx f_\text{nom}$ will be very strong with the +harmonics being approximately an order of magnitude weaker in energy and the noise floor being at least another order of +magnitude weaker. See Figure~\ref{mains_voltage_spectrum} for a measured spectrum. This domain knowledge about the +expected frequency spectrum of the signal can be employed in a number of interpolation techniques to reconstruct the +precise frequency of the spectrum's main component despite distortions and the comparatively coarse STFT resolution. + +Published grid frequency estimation algorithms such as \cite{narduzzi01,derviskadic01} are rather sophisticated and use +a combination of techniques to reduce numerical errors in FFT calculation and peak fitting. Given that we do not need +reference standard-grade accuracy for our application we chose to start with a very basic algorithm instead. We use +a general approach to estimate the precise fundamental frequency of an arbitrary signal that was published by +experimental physicists Gasior and Gonzalez at CERN\cite{gasior01}. This approach assumes a general sinusoidal signal +superimposed with harmonics and broadband noise.
Applicable to a wide spectrum of practical signal analysis tasks it is +a reasonable first-order approximation of the much more sophisticated estimation algorithms developed specifically for +power systems. Some algorithms use components such as Kalman filters\cite{narduzzi01} that require a physical model. +As a general algorithm, the approach of \cite{gasior01} does not require this kind of application-specific tuning, eliminating one source +of error. + +The Gasior and Gonzalez algorithm\cite{gasior01} passes the windowed input signal through a DFT, then interpolates the +signal's fundamental frequency by fitting a wavelet such as a Gaussian to the largest peak in the DFT results. The position +parameter of this curve fit, i.e.\ the location of the fitted peak, is an accurate estimate of the signal's fundamental frequency. This algorithm is similar +to the simpler interpolated DFT algorithm used as a reference in much of the synchrophasor estimation +literature\cite{borkowski01}. The three-term variant of the maximum side lobe decay window often used there is a +Blackman window with parameter $\alpha = \frac{1}{4}$. Analysis has shown\cite{belega01} that the interpolated DFT +algorithm is worse than algorithms involving more complex models under some conditions but that there is \emph{no free +lunch} meaning that more complex algorithms perform worse when the input signal deviates from their models. +% FIXME resolve cut --- + +\subsubsection{Our Algorithm} +\subsubsection{Our Hardware} + +\section{Characteristics of Grid Frequency} + +\section{Grid Frequency Modulation} +\subsection{Fundamental Physics} +\subsection{Transmitter Implementation} + +% FIXME resolve cut --- +In its most basic form a transmitter for grid frequency modulation would be a very large controllable load connected to +the power grid at a suitable vantage point.
A spool of wire submerged in a body of cooling liquid such as a small lake +along with a thyristor rectifier bank would likely suffice to perform this function during occasional cybersecurity +incidents. We can however decrease hardware and maintenance investment even further compared to this rather +crude solution by repurposing regular large industrial loads as transmitters in an emergency situation. For some +preliminary exploration we went through a list of energy-intensive industries in Europe\cite{ec01}. The most +electricity-intensive industries in this list are primary aluminum and steel production. In primary production raw ore +is converted into raw metal for further refinement such as casting, rolling or extrusion. In steelmaking iron is +melted in an electric arc furnace. In aluminum smelting aluminum is electrolytically extracted from alumina. Both +processes involve large amounts of electricity with electricity making up \SI{40}{\percent} of production costs. Given +these circumstances a steel mill or aluminum smelter would be a good candidate for a transmitter in a grid frequency +modulation system. + +In aluminum smelting high-voltage mains is transformed, rectified and fed into about 100 series-connected electrolytic +cells forming a \emph{potline}. Inside these pots alumina is dissolved in molten cryolite electrolyte at about +\SI{1000}{\degreeCelsius} and electrolysis is performed using a current of tens or hundreds of kiloamperes. The resulting +pure aluminum settles at the bottom of the cell and is tapped off for further processing. + +Like steelworks, aluminum smelters are operated night and day without interruption. Aside from metallurgical issues the +large thermal mass and enormous heating power requirements do not permit power cycling. Due to the high costs of +production inefficiencies or interruptions the behavior of aluminum smelters under power outages is a +well-characterized phenomenon in the industry.
The recent move away from nuclear power and towards renewable energy has +led to an increase in fluctuations of electricity price throughout the day. These electricity price fluctuations have +provided enough economic incentive to aluminum smelters to develop techniques to modulate smelter power consumption +without affecting cell lifetime or product quality\cite{duessel01,eisma01}. Power outages of tens of minutes up to two +hours reportedly do not cause problems in aluminum potlines and are in fact part of routine operation for purposes such +as electrode changes\cite{eisma01,oye01}. + +The power supply system of an aluminum plant is managed through a highly integrated control system as keeping all cells +of a potline under optimal operating conditions is challenging. Modern power supply systems employ large banks of diodes +or SCRs\footnote{SCRs, also called thyristors, are electronic devices that are often used in high-power switching +applications. They are normally-off devices that act like diodes when a current is fed into their control terminal.} to +rectify low-voltage AC to DC to be fed into the potline\cite{ayoub01}. The potline voltage can be controlled almost +continuously through a combination of a tap changer and a transductor. The individual cell voltages can be controlled by +changing the anode to cathode distance (ACD) by physically lowering or raising the anode. The potline power supply is +connected to the high voltage input and to the potline through isolators and breakers. + +In an aluminum smelter most of the power is sunk into resistive losses and the electrolysis process. As such an +aluminum smelter does not have any significant electromechanical inertia compared to the large rotating machines used +in other industries. Depending on the capabilities of the rectifier controls high slew rates are possible, permitting +modulation at high\footnote{Aluminum smelter rectifiers are \emph{pulse rectifiers}.
This means instead of simply +rectifying the incoming three-phase voltage they use a special configuration of transformer secondaries and in some +cases additional coils to produce a large number of equally spaced phases (e.g.\ six) from a standard three-phase input. +Where a direct-connected three-phase rectifier would draw current in six pulses per mains voltage cycle a pulse +rectifier draws current in more, smaller pulses to increase power factor. For example a 12-pulse rectifier will draw +current in 12 pulses per cycle. In the best case an SCR pulse rectifier switched at zero crossing should allow +\SIrange{0}{100}{\percent} load changes from one rectifier pulse to the next, i.e. within a fraction of a single cycle.} +data rates. +% FIXME resolve cut --- + +\subsection{Parametrizing DSSS Modulation for GFM} + +% FIXME resolve cut/write intro --- +\begin{description} + \item[Modulation amplitude.] Amplitude is proportionally related to modulation power. In a practical setup we might + realize a modulation power up to a few hundred \si{\mega\watt} which would yield a few tens of \si{\milli\hertz} + of frequency amplitude. + \item[Modulation preemphasis and slew-rate control.] Preemphasis might be necessary to ensure an adequate + Signal-to-Noise ratio (SNR) at the receiver. Slew-rate control and other shaping measures might be necessary to + reduce the impact of these sudden load changes on the transmitter's primary function (say, aluminum smelting) + and to prevent disturbances to other grid components. + \item[Modulation frequency.] For a practical implementation a careful study would be necessary to determine the + optimal frequency band for operation. On one hand we need to prevent disturbances to the grid such as the + excitation of local or inter-area modes. 
On the other hand we need to optimize Signal-to-Noise ratio (SNR) + and data rate to achieve optimal latency between transmission start and reset completion and to reduce the + overall burden on both transmitter and grid. + \item[Further modulation parameters.] The modulation itself has numerous parameters that are discussed in Section + \ref{mod_params} below. +\end{description} + +% FIXME resolve cut/write intro --- +% FIXME too many enumerations? +In this section we will explore how we can construct a reliable communication channel from the analog primitive we +have outlined in the previous section. Our load control approach to grid frequency modulation leads to a channel with the +following properties. + +\begin{description} + \item[Slow-changing.] Accurate grid frequency measurements take several periods of the mains sine wave. Faster + sampling rates can be achieved with more complex specialized synchrophasor estimation algorithms but this will + result in a trade-off between sampling rate and accuracy\cite{belega01}. + \item[Analog.] Grid frequency is an analog signal. + \item[Noisy.] While stable over long periods of time thanks to power stations' Load-Frequency Control + systems\cite{entsoe04} there are considerable random short-term variations. Our modulation amplitude is limited + by technical and economic constraints so we have to find a system that will work at poor SNRs. + \item[Polarized.] Grid frequency measurements have an inherent sense of polarity that we can use in our modulation + scheme. +\end{description} + +% FIXME resolve cut --- +Modern power systems are complex electromechanical systems. Each component is controlled by several carefully tuned +feedback loops to ensure voltage, load and frequency regulation. Multiple components are coupled through transmission +lines that themselves exhibit complex dynamic behavior. 
The overall system is generally stable, but may exhibit +instabilities in response to particular small-signal stimuli\cite{kundur01,crastan03}. These instabilities, called \emph{modes}, +occur when due to mis-tuning of parameters or physical constraints the overall system exhibits oscillation at a +particular frequency. \cite{kundur01} separates these modes into four categories: + +\begin{description} +    \item[Local modes] where a single power station oscillates in some parameter, +    \item[Interarea modes] where subsections of the overall grid oscillate with respect to each other due to weak +        coupling between them, +    \item[Control modes] caused by imperfectly tuned control systems and +    \item[Torsional modes] that originate from electromechanical oscillations in the generator itself. +\end{description} + +The oscillation frequencies associated with each of these modes are usually between a few tens of millihertz and a few +hertz\cite{grebe01,entsoe01,crastan03}. It is hard to predict the particular modes of a power system at the scale of the +central European interconnected system. Theoretical analysis and simulation may give rough indications but cannot yield +conclusive results. Due to the obvious danger as well as the high economic impact of inefficiencies, experimental +measurements are infeasible. Modes are highly dependent on the power grid's structure and will change with changes in +the power grid over time. For all of these reasons, a grid frequency modulation system must be designed very +conservatively without relying on the absence (or presence) of modes at particular frequencies. A concrete design +guideline that we can derive from this situation is that the frequency spectrum of any grid frequency modulation system +should not exhibit large peaks and should avoid a concentration of spectral energy in small frequency bands.
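+
+To make this guideline concrete, consider the following idealized comparison. It is a sketch only and assumes a load
+modulated between two levels $\pm A$ using statistically independent, equiprobable chips of duration $T_c$; the symbols
+$A$ and $T_c$ are introduced here for illustration and neither assumption is taken from measurements. Such a random
+binary waveform has the power spectral density
+\begin{equation*}
+    S(f) = A^2 T_c \left(\frac{\sin(\pi f T_c)}{\pi f T_c}\right)^2,
+\end{equation*}
+which is bounded by $A^2 T_c$ and spreads the modulation power over a band with a width on the order of $1/T_c$. A
+periodic square wave of the same amplitude would instead concentrate its power in discrete spectral lines at its
+fundamental and odd harmonics, which is the kind of narrow-band excitation that risks coinciding with a mode.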
+% FIXME resolve cut --- + +\subsection{Parametrizing a ``Safety Reset'' System Based on GFM} +% FIXME resolve cut & write intro --- +% FIXME cut down next 2 sections +\subsubsection{Error-correcting codes} + +To reduce reception error rate we have to layer channel coding on top of the DSSS modulation. The messages we expect to +transmit are at least a few tens of bits long. We are highly constrained in SNR due to limited transmission power and +with lower SNR comes higher BER (Bit Error Rate). At a fixed BER, the probability of receiving a packet without error decays exponentially with +transmission length so for our relatively long transmissions we would realistically get unacceptable error rates. + +Error correcting codes are a very broad field with many options for specialization. Since we are implementing only an +advanced prototype in this work we chose to spend only limited resources on optimization and settled on a basic +Reed-Solomon code. We have no doubt that by applying a more state-of-the-art code we could gain further improvements in +code overhead and decoding speed among others\cite{mackay01}. Since message length in our system limits system response +time but we do not have a fixed target we can tolerate some degree of overhead. Decoding speed is of very low concern +to us because our data rate is extremely low. We derived our implementation by adapting and optimizing an existing open +source decoder that we validated on an open source encoder implementation. We generate test signals using a Python tool +on the host. + +\subsubsection{Cryptographic security} +\label{sec-crypto} +Above the communication base layer elaborated in the previous section we have to layer a cryptographic protocol to +ensure system security. We want to avoid a case where a third party could interfere with our system or even subvert this +safety system itself for an attack.
From a protocol security perspective the system we are looking for can informally +be modelled as consisting of three parties: the trusted \emph{transmitter}, one of a large number of untrusted +\emph{receivers}, and an \emph{attacker}. These three play according to the following rules: + +\begin{description} +    \item[Access.] Both transmitter and attacker can transmit any bit sequence. +    \item[Indistinguishability.] The receiver receives any transmission by either but cannot distinguish between them. +    \item[Kerckhoffs's principle.] Since the protocol design is public and anyone can get access to an electricity meter +        the attacker knows anything any receiver might know\cite{kerckhoff01,kerckhoff02}. +    \item[Priority.] The transmitter is stronger than an attacker and will ``win'' during simultaneous transmission. +    \item[Seeding.] Both transmitter and receiver can be seeded out-of-band with some information on each other such as +        public key fingerprints. +\end{description} + +We are not considering situations where an attacker attempts to jam an ongoing transmission. In practice there are +several avenues to prevent such attempts. Compromised large loads that are being abused by the attacker can be manually +disconnected by the utility. Error-correcting codes can be used to provide resiliency against small-scale disturbances. +Finally, the transmitter can be designed to have high enough power to be able to override any likely attacker. + +With the above rules in mind our goal is to find a cryptographic primitive that has the following properties: +\begin{description} +    \item[Authentication.] The transmitter can produce a message bit sequence that a certain subset of receivers can +        identify as being generated by the transmitter. On reception of this sequence, all addressed receivers perform a +        safety reset. +    \item[Unforgeability.]
The attacker cannot forge a message, i.e.\ find a bit sequence other than one of the +        transmitter's previous messages that a receiver would accept. This implies that the attacker also cannot create +        a new distinct message from a previously transmitted message. +    \item[Brevity.] The message should be short. Our communication channel is outrageously slow compared to anything +        else used in modern telecommunications and every bit counts. +\end{description} + +On a protocol level we also have to ensure \emph{idempotence}. Our system should have at-most-once semantics. This +means for a given message each receiver either performs exactly one safety reset or none at all, even if the message is +re-transmitted by either the transmitter or an attacker. We cannot achieve the ideal exactly-once semantics with pure +protocol gymnastics since we are using a unidirectional lossy communication primitive. A receiver might be offline +(e.g.\ due to a local power outage) and then would not hear the transmission even if our broadcast primitive was +reliable. Since there is no back channel, the transmitter has no way of telling when that happens. The practical impact +of this can be mitigated by the transmitter repeating the message a number of times. + +It follows from the unforgeability requirement that we can trivially reach idempotence at the protocol level by keeping +a database of all previous messages and only accepting new messages. By considering this in our cryptographic design we +can reduce the storage overhead of this ``database''. + +Along with the indistinguishability property the access requirement implies that we need a cryptographic +signature\cite{lamport01}. However, we have relaxed constraints on this signature compared to standard cryptographic +practice\cite{anderson04}. While cryptographic signatures need to work over arbitrary inputs, all we want to ``sign'' +here is the instruction to perform a safety reset.
This is the only message we might ever want to transmit so our +message space has only one element. The information content of our message thus is 0 bit! All the information we want to +transmit is already encoded \emph{in the fact that we are transmitting} and we do not require a further payload to be +transmitted: We can omit the entirety of the message and just transmit whatever ``signature'' we +produce\cite{haller01,rfc1760}. This is useful to conserve transmission bits so our transmission does not take an +exceedingly long time over our extremely slow communication channel. + +We can modify this construction to allow for a small number of bits of information content in our message (say two or +three instead of zero) at no transmission overhead by transmitting the cryptographic signature as usual but simply +omitting the message. The message contains only a few bits of information and we are dealing with minutes of +transmission time so the receiver can reconstruct the message through brute-force. Though this trade-off between +computation and data transmission might seem inelegant it does work for our extremely slow link for up to a few bits of +information. + +There is an important limitation in the rules of our setup above: The attacker can always record the reset bit sequence +the transmitter transmits and replay that same sequence later. Even without cryptography we can trivially prevent an +attacker from violating the at-most-once criterion. If every receiver memorizes all bit sequences that have been +transmitted so far it can detect replays. With this mitigation by replaying an older authentic transmission an attacker +can cause receivers that were offline during the original transmission to reset at a later point. Considering our goal +is to reset them in the first place this should not pose a threat to the system's safety or security. + +A possible scenario would be that an attacker first causes enough havoc for authorities to trigger a safety reset. 
The +attacker would record the trigger transmission. We can assume most meters were reset during the attack. Due to this the +attacker cannot cause a significant number of additional resets immediately afterwards. However, the attacker could +wait several years for a number of new meters to be installed that might not yet have updated firmware that includes the +last transmission. This means the attacker could cause them to reset by replaying the original sequence. + +A possible mitigation for this risk would be to introduce one bit of information into the trigger message that is +ignored by the replay protection mechanism. This \emph{enable} bit would be $1$ for the actual reset trigger message. +After the attack the transmitter would then perform scheduled transmissions of a ``disarm'' message that has this bit +set to $0$. This message informs all new meters and meters that were offline during the original transmission of the +original transmission for replay protection without actually performing any further resets. + +We could use any of several traditional asymmetric cryptographic primitives to produce these signatures. The +comparatively high computational effort required for signature verification would not be an issue. Transmissions take +several minutes anyway and we can afford to spend some tens of seconds even in signature verification. Transmission +length and by proxy system latency would be determined by the length of the signature. For RSA signature length is the +modulus length (i.e. larger than \SI{1000}{bit} for very basic contemporary security). For elliptic curve-based systems +curve length is approximately twice the security level and signature size is twice the curve length because two curve +points need to be encoded\cite{anderson02}. For contemporary security this results in more than 300 bit transmission +length. 
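+Written out, with $\lambda$ denoting the security level in bits and $\ell$ a length in bits (the symbols and the two
+example values of $\lambda$ are illustrative choices, not taken from a particular standard), the rule of thumb above
+gives:
+\begin{equation*}
+    \ell_\text{sig} \approx 2\,\ell_\text{curve} \approx 4\lambda, \qquad
+    \lambda = 80 \Rightarrow \ell_\text{sig} \approx \SI{320}{bit}, \qquad
+    \lambda = 128 \Rightarrow \ell_\text{sig} \approx \SI{512}{bit}.
+\end{equation*}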
We can exploit our unique setting's low message entropy to improve on this by basing our scheme on a +cryptographic hash function used as a one-way pseudo-random function (PRF). Hash-based signature schemes date back to +the very beginnings of cryptographic signatures\cite{anderson04,diffie01,lamport02}. Today, in general applications +schemes based on asymmetric cryptography are preferred but hash-based signature systems have their applications in +certain use cases. One example of such a scheme is the TESLA scheme\cite{perrig01} that is the basis for navigation +message authentication in the European Galileo global navigation satellite system. Here, a system based purely on +asymmetric primitives would result in too much computation and communication overhead\cite{ec05}. In the following +sections we will introduce the foundations of hash-based signatures before deriving our authentication scheme. + +\subsubsection{Lamport signatures} + +In 1979, Lamport\cite{lamport02} introduced a signature scheme that is based only on a one-way function such as a +cryptographic hash function. The basic observation is that by choosing a random secret input to a one-way function and +publishing the output, one can later prove knowledge of the input simply by publishing it. In the following paragraphs +we will describe a construction of a one-time signature scheme based on this observation. The scheme we describe is the +one usually called a ``Lamport Signature'' in modern literature but is slightly different from the variant described in +the 1979 paper. For our purposes we can consider both to be equivalent. + +\paragraph{Setup.} In a Lamport signature, for an $n$-bit hash function $H$ the signer generates a private key $s = +\left(s_{b, i} | b\in\left\{0, 1\right\}, 0\le i