From 713564b8298467f864d5dfef827aad7c51e49f28 Mon Sep 17 00:00:00 2001 From: jaseg Date: Thu, 6 Oct 2022 14:32:44 +0200 Subject: final proof by myself --- paper/safety-reset-paper.tex | 175 +++++++++++++++++++++---------------------- 1 file changed, 87 insertions(+), 88 deletions(-) (limited to 'paper') diff --git a/paper/safety-reset-paper.tex b/paper/safety-reset-paper.tex index b1904e8..5255e95 100644 --- a/paper/safety-reset-paper.tex +++ b/paper/safety-reset-paper.tex @@ -399,9 +399,16 @@ communication for smart meter reading~\cite{ec03,rs48,gungor01,agf16}. The security of IoT devices as well as the smart grid has received extensive attention in the literature~\cite{nbck+19,acsc20,smp18,ykll17,anderson01,anderson02,zlmz+21,kgma21,hcb19,mpdm+10,lzlw+20,chl20,lam21,olkd20,yomu+20}. The challenges of IoT device security and the security of smart meters and other smart grid devices are similar because -smart grid devices are essentially IoT devices in a particularly sensitive location~\cite{acsc20}. In both device types, -the challenge is that securing embedded firmware is difficult, and adding network interfaces and cost constraints only -makes the task harder. +smart grid devices are essentially IoT devices in a particularly sensitive location~\cite{zheng01,ifixit01,acsc20}. In +both device types, the challenge is that securing embedded firmware is difficult, and adding network interfaces and cost +constraints only makes the task harder. + +In some countries, smart meters can have a built-in off-switch that is used to disconnect customers who do not pay their +electricity bill. An attack scenario in which the attacker compromises a large number of such meters has been discussed +by Anderson and Fuloria in~\cite{anderson01}. In meters that do not have such a switch, an attacker can still use their +access to manipulate the meter's energy accounting, leading to financial impact on the utility operating the meter. 
This +scenario has received research attention~\cite{anderson02,mcdaniel01} and comes with the most direct industry +incentives. In~\cite{smp18}, Soltan, Mittal and Poor investigated an attack scenario where an attacker first gains control over a large number of high wattage devices through an IoT security vulnerability, then uses this control to cause rapid load @@ -424,6 +431,13 @@ relatively recent nature of the IoT software ecosystem and the large number of i challenge to Smart Grid security is that due to the fragmentation of markets along national borders, certain devices such as smart meters or DSR implementations exist in large monocultures. +Smart meters are consumer devices built down to a price and manufacturers' firmware security R\&D budgets are limited by +the high degree of market fragmentation that is caused by mutually incompatible national smart metering standards. +Landis+Gyr, a large utility meter manufacturer, state in their 2019 annual report that they invested \SI{36}{\percent} +of their total R\&D budget on embedded software while spending only \SI{24}{\percent} on hardware +R\&D~\cite{landisgyr01,landisgyr02}, which indicates tension between firmware security and the manufacturers' bottom +line. + Compared to IoT and Smart Grid devices, the embedded firmware foundations of modern smartphones have received more attention both from the industry and from academia. Pinto and Santos in~\cite{pinto01} conducted a survey of implementations based on ARM's TrustZone embedded virtualization architecture and found a significant number of reported @@ -459,49 +473,37 @@ mathematical analysis, small-scale simulations and limited practical experiments developed a countermeasure that can be implemented as part of generator control systems and that when activated can suppress forced oscillations of wide-area electromechanical modes. -On the device side of the smart grid, research has concentrated on smart meter security. 
Smart meters are -architecturally similar to IoT devices~\cite{zheng01,ifixit01}, but come with different challenges. Similar to a -high-power IoT device, an attacker could use an off-switch built as part of an attack, a scenario that was investigated -by Anderson and Fuloria in~\cite{anderson01}. Unique to smart meters, an attacker could, however, also use their control -to manipulate the meter's energy accounting, quickly leading to potentially severe financial impact on the meter's -operating utility company. This scenario has received research attention~\cite{anderson02,mcdaniel01} and this is where -industry incentives are the strongest. - -Smart electricity meters are consumer devices built down to a price and manufacturers' firmware security R\&D budgets -are limited by the high degree of market fragmentation that is caused by mutually incompatible national smart metering -standards. Landis+Gyr, a large utility meter manufacturer, state in their 2019 annual report that they invested -\SI{36}{\percent} of their total R\&D budget on embedded software while spending only \SI{24}{\percent} on hardware -R\&D~\cite{landisgyr01,landisgyr02}, which indicates tension between firmware security and the manufacturers's bottom -line. - \subsection{Proposed Countermeasures} -In~\cite{kgma21}, the authors propose an extension to grid control algorithms aimed at increasing the grid's robustness -towards forced oscillations. In~\cite{smp18}, the authors propose that utility operators use a detailed attacker model -to engineer additional safety margins into the grid while minimizing the economic inefficiency of these measures. On the -IoT side, they note that due to the wide implementation diversity, the problem cannot be solved by individual measures -and propose additional fundamental research on IoT device security. +In parallel with research on theoretical attacks, countermeasures to these have also been proposed in academic +literature. 
In~\cite{kgma21}, the authors propose an extension to grid control algorithms aimed at increasing the grid's +robustness towards forced oscillations. In~\cite{smp18}, the authors propose that utility operators use a detailed +attacker model to engineer additional safety margins into the grid while minimizing the economic inefficiency of these +measures. On the IoT side, they note that due to the wide implementation diversity, the problem cannot be solved by +individual measures and propose additional fundamental research on IoT device security. In~\cite{hcb19}, the authors conclude that simple demand attacks where compromised loads suddenly increase demand are -adequately mitigated by existing safety measures, in particular \emph{Under-Frequency Load Shedding} (UFLS). As part of -UFLS, during a contingency the utility will progressively disconnected loads according to set priorities until the -production / generation balance has been restored and a blackout has been averted. UFLS is already deployed in any large -electrical grid. +adequately mitigated by existing safety measures, in particular \emph{Under-Frequency Load Shedding} (UFLS), which forms +the basis of any grid's automatic emergency response. As part of UFLS, during a contingency the utility will +progressively disconnect loads according to set priorities until the generation/consumption balance has been restored +and a blackout has been averted. % FIXME more sources! \section{Grid Frequency as a Communication Channel} -During a large-scale cyber attack, availability of internet and cellular connectivity cannot be relied upon. An attacker -may already have disabled such systems in a separate attack, or they may go down along with parts of the electrical -grid. Powerline communication systems will likely be unaffected by an attack, but at a range of no more than several -tens of kilometers, covering the entire grid would require a large upfront infrastructure investment for transmitters. 
+The countermeasures discussed above are fully automatic. Such systems can provide a good first line of defense, but they +must be complemented by means of manual intervention since not every eventuality can be anticipated. During a +large-scale cyber attack, availability of internet and cellular connectivity cannot be relied upon. An attacker may +already have disabled such systems in a separate attack, or they may go down along with parts of the electrical grid. +Powerline communication systems will likely be unaffected by an attack, but at a range of no more than several tens of +kilometers, covering the entire grid would require a large upfront infrastructure investment for transmitters. -We propose to approach the problem of broadcasting an emergency signal to all grid-connected devices such as smart -meters or IoT appliances within a synchronous area by using grid frequency as a communication channel. Despite the -technological complexity of the grid, the physics underlying its response to changes in load and generation is +We propose to approach the problem of broadcasting an emergency control signal to all grid-connected devices such as +smart meters or IoT appliances within a synchronous area by using grid frequency as a communication channel. Despite +the technological complexity of the grid, the physics underlying its response to changes in load and generation is surprisingly simple. Individual machines (loads and generators) can be approximated by a small number of differential -equations describing their control systems' interaction with the machine's physics, and the entire grid can be modelled +equations describing their control systems' interaction with the machines' physics, and the entire grid can be modelled by aggregating these approximations into a large system of differential equations. 
As a consequence, small signal changes in generation/consumption power balance cause an approximately proportional change in frequency~\cite{kundur01,crastan03,entsoe02,entsoe04}. The slope of this first-order approximation is known as @@ -509,7 +511,7 @@ frequency~\cite{kundur01,crastan03,entsoe02,entsoe04}. The slope of this first-o \SI{25}{\giga\watt\per\hertz} according to the European electricity grid authority, ENTSO-E. If we modulate the power consumption of a large load, this modulation will result in a small change in frequency -according to this characteristic. As long as we stay within the operational limits set by +according to that characteristic. As long as we stay within the operational limits set by ENTSO-E~\cite{entsoe02,entsoe03}, this change will not degrade the operation of other parts of the grid. The advantages of grid frequency modulation are the fact that a single transmitter can cover an entire synchronous area as well as low receiver hardware complexity. @@ -521,29 +523,26 @@ at very small scales in microgrids before~\cite{urtasun01} and has not yet been Compared to traditional channels such as Fiber To The Home (FTTH), 5G or LoraWAN, grid frequency as a communication channel has a resiliency advantage. It can start transmission as soon as a power island with a connected transmitter is -powered up, while communication networks such as FTTH or 5G are still rebooting, or might be waiting for parts of their -centralized infrastructure that are connected to different power islands to come back online. Mesh networks such as -LoraWAN can cover short distances up to $\SI{20}{\kilo\meter}$ without requiring infrastructure to be available, but for -longer distances LoraWAN relies on the public internet for its network backbone. Additionally, systems such as FTTH, 5G -and LoraWAN are built around a point-to-point communication model and usually do not support a global broadcast -primitive. 
During times when a large number of devices must be reached simultaneously this can lead to congestion of -cellular towers and servers. Therefore, during an ongoing cyber attack, grid frequency is promising as a communication -channel because only a single transmitter facility must be operational for it to function, and this single transmitter -can reach all connected devices simultaneously. After a power outage, it can resume operation as soon as electrical -power is restored, even while the public internet and mobile networks are still offline. It is unaffected by -cyber attacks that target telecommunication networks. +powered up, while communication networks such as FTTH or 5G are still rebooting or waiting for their centralized +infrastructure to come back online. Mesh networks such as LoraWAN can cover short distances up to $\SI{20}{\kilo\meter}$ +without requiring infrastructure to be available, but for longer distances LoraWAN relies on the public internet for its +network backbone. Additionally, systems such as FTTH, 5G and LoraWAN are built around a point-to-point communication +model and usually do not support a global broadcast primitive. During times when a large number of devices must be +reached simultaneously this can lead to congestion of cellular towers and servers. Therefore, during an ongoing cyber +attack, grid frequency is promising as a communication channel because only a single transmitter facility must be +operational for it to function, and this single transmitter can reach all connected devices simultaneously. \subsection{Characterizing Grid Frequency} \label{grid-freq-characterization} -Before analyzing grid frequency as a communication channel, we developed a device that allows us to collect ground truth -for our analysis by safely recording the grid voltage waveform. 
Our system consists of an \texttt{STM32F030F4P6} ARM -Cortex M0 microcontroller that records mains voltage using its internal 12-bit ADC and transmits measured values through -a galvanically isolated USB/serial bridge to a host computer. We derive our system's sampling clock from a crystal oven -to avoid frequency measurement noise due to thermal drift of a regular crystal: \SI{1}{ppm} of crystal drift would cause -a grid frequency error of $\SI{50}{\micro\hertz}$. We compared our oven-stabilized clock against a GPS 1 pps reference -and found that over a time span of 20 minutes both stayed stable within 5 ppb of each other, which corresponds to the -drift specification of a typical crystal oven. +To prepare our analysis of grid frequency modulation, we developed a device that allows us to collect measurements of +actual grid frequency behavior through safely recording the grid voltage waveform. Our system consists of an +\texttt{STM32F030F4P6} ARM Cortex M0 microcontroller that records mains voltage using its internal 12-bit ADC and +transmits measured values through a galvanically isolated USB/serial bridge to a host computer. We derive our system's +sampling clock from a crystal oven to avoid frequency measurement noise due to thermal drift of a regular crystal: +\SI{1}{ppm} of crystal drift would cause a grid frequency error of $\SI{50}{\micro\hertz}$. We compared our +oven-stabilized clock against a GPS 1 pps reference and found that over a time span of 20 minutes both stayed stable +within 5 ppb of each other, which corresponds to the drift specification of a typical crystal oven. In utility SCADA systems, Phasor Measurement Units (PMUs) are used to precisely measure grid frequency among other parameters. 
Details on the inner workings of commercial phasor measurement units are scarce but there is a large amount @@ -579,14 +578,14 @@ Using our grid frequency recorder, we performed a two-day measurement series of Figure~\ref{fig_freq_spec} shows the frequency spectrum of grid frequency over this two-day span. In this spectrum, we observe a number of features. Across the frequency range, we observe a broad $1/f$ noise. Above a period of $\SI{10}{\second}$, this $1/f$ noise dips to a flat noise floor. We estimate that this low-noise region is caused by the -self-regulating effect of loads. %FIXME citation Above a $\SI{10}{\second}$ period, primary control is activated and -thus the $1/f$ noise we observe is the result of the interaction between primary control and consumer demand. On top of -this $1/f$ behavior, the spectrum shows several sharp peaks at time intervals with a ``round'' number such as -$\SI{10}{\second}$, $\SI{60}{\second}$ or multiples of $\SI{300}{\second}$. These peaks are due to loads turning on- or -off depending on wall-clock time, and demand forecasting not being able to precisely match the amplitude of these large -changes in load. Besides the narrow peaks caused by this effect we can also observe two wider bumps at -$\SI{7.0}{\second}$ and $\SI{4.7}{\second}$. These bumps closely correlate with continental European synchonous area's -oscillation modes at $\SI{0.15}{\hertz}$ (east-west) and $\SI{0.25}{\hertz}$ (north-south)~\cite{grebe01}. +self-regulating effect of loads. Above a $\SI{10}{\second}$ period, primary control is activated and thus the $1/f$ +noise we observe is the result of the interaction between primary control and consumer demand. On top of this $1/f$ +behavior, the spectrum shows several sharp peaks at time intervals with a ``round'' number such as $\SI{10}{\second}$, +$\SI{60}{\second}$ or multiples of $\SI{300}{\second}$. 
These peaks are due to loads turning on or off depending on +wall-clock time, and demand forecasting not being able to precisely match the amplitude of these large changes in load. +Besides the narrow peaks caused by this effect we can also observe two wider bumps at $\SI{7.0}{\second}$ and +$\SI{4.7}{\second}$. These bumps closely correlate with the continental European synchronous area's oscillation modes at +$\SI{0.15}{\hertz}$ (east-west) and $\SI{0.25}{\hertz}$ (north-south)~\cite{grebe01}. \section{Grid Frequency Modulation} @@ -598,8 +597,8 @@ energy-intensive industries in Europe~\cite{ec01}, we found that an aluminium sm aluminium smelting, aluminium is electrolytically extracted from alumina solution. High-voltage mains power is transformed, rectified and fed into approximately 100 series-connected electrolytic cells forming a \emph{potline}. Inside these pots, alumina is dissolved in molten cryolite electrolyte at approximately \SI{1000}{\degreeCelsius} and -electrolysis is performed using a current of tens or hundreds of Kiloampère. The resulting pure aluminium settles at the -bottom of the cell and is tapped off for further processing. +electrolysis is performed using a current of tens or hundreds of kiloamperes at a few volts per cell. The resulting pure +aluminium settles at the bottom of the cell and is tapped off for further processing. Aluminium smelters are operated around the clock, and due to the high financial stakes their behavior under power outages has been carefully characterized. Power outages of tens of minutes up to two hours reportedly do not cause @@ -609,8 +608,8 @@ prices~\cite{duessel01,eisma01,depree01}. An aluminium plant's power supply is smelter cells under optimal operating conditions. Modern power supply systems employ large banks of diodes or thyristors to rectify low-voltage AC to DC to be fed into the potline~\cite{ayoub01}. Potline voltage is controlled through a combination of a tap changer and a transductor. 
Individual cell voltages are controlled by changing the physical -distance between anode and cathode distance. In this setup, power can be electronically modulated using the thyristor -rectifier. Since the system does not have any mechanical inertia, high modulation rates are possible. +distance between anode and cathode. In this setup, power can be electronically modulated using the thyristor rectifier. +Since the system does not have any mechanical inertia, high modulation rates are possible. In~\cite{depree01}, the authors describe a setup where a large Aluminium smelter in continental Europe is used as primary control reserve for frequency regulation. In this setup, a rise time of $\SI{15}{\second}$ was achieved to meet @@ -631,12 +630,12 @@ continental European synchronous area, we have to consider operation during a bl divides into a number of disconnected power islands. A single transmitter would only be able to reach receivers on the same power island. -Instead, the system can use a number of transmitters that are distributed throughout the network. Piggy-backing -transmitters on existing industrial loads keeps the implementation cost of additional transmitters low. By running -transmitters from gps-synchronized ovenized crystal oscillators or rubidium frequency standards, transmissions can be -precisely synchronized across power islands even after a holdover period of several days. This allows a transmission to -continue un-interrupted while the utility re-joins power island into the larger grid, since the transmissions on both -islands are precisely synchronized. +To alleviate this constraint, the system can use a number of transmitters that are distributed throughout the network. +Piggy-backing transmitters on existing industrial loads keeps the implementation cost of additional transmitters low. 
By +running transmitters from stable, synchronized frequency standards such as GPS-disciplined rubidium standards, +transmissions can be precisely synchronized across power islands even after a holdover period of several days. This +allows a transmission to continue uninterrupted while the utility rejoins power islands into the larger grid, since the +transmissions on both islands are precisely synchronized. As illustrated in Figure~\ref{fig_intro_flowchart}, the transmitters are connected to a command center. For this connection, a redundant set of long-range radio or satellite links can be used, as well as wired connections through the @@ -672,20 +671,20 @@ Direct Sequence Spread Spectrum modulation is a common spread-spectrum technique radio systems, most prominently all global navigation satellite systems (GNSS). As a spread-spectrum technique, DSSS spreads out the signal's energy across a broad spectral range. This decreases the susceptibility of a DSSS signal to narrowband interference. In GNSS, this allows the rejection of other nearby RF sources. In our use case, this makes the -signal immune to the many narrow peaks in the grid frequency's noise spectrum that are caused by UTC-synchronized -control systems (cf.~Fig.~\ref{fig_freq_spec}). In addition to better interference immunity, DSSS has two other -important characteristics: It provides \emph{modulation gain}, i.e.~it allows a trade-off between data rate and receiver -sensitivity, and it allows for Code Division Multiple Access (CDMA). In CDMA, multiple DSSS-modulated signals can be -sent simultaneously through a shared channel with less impact to the resulting signal-to-noise ratio (SNR) than would be -the case for other modulation techniques. +signal immune to the many narrow peaks in the grid frequency's noise spectrum that are caused by control systems +synchronized to wall-clock time (cf.~Fig.~\ref{fig_freq_spec}). 
In addition to better interference immunity, DSSS has two +other important characteristics: It provides \emph{modulation gain}, i.e.~it allows a trade-off between data rate and +receiver sensitivity, and it allows for Code Division Multiple Access (CDMA). In CDMA, multiple DSSS-modulated signals +can be sent simultaneously through a shared channel with less impact to the resulting signal-to-noise ratio (SNR) than +would be the case for other modulation techniques. A DSSS signal is made up from pseudo-random \emph{symbols}, which in turn are made up from individual physical layer bits called \emph{chips}. Chips are encoded in the signal using a lower-layer modulation such as phase-shift keying (e.g.~in GPS) or frequency-shift keying (in this work). In DSSS, a \emph{code} is a library of symbols that are -constructed to have minimal cross-correlation, meaning they are near-orthogonal. A transmitter sends a symbol by +constructed to have minimal cross-correlation, i.e.\ they are near-orthogonal. A transmitter sends a symbol by transmitting its particular pseudo-random chip sequence at a chosen polarity, conveying one bit of information. A receiver demodulates the signal by directly correlating the incoming physical-layer signal with the symbol's chip -pattern, which results in a positive or negative peak depending on symbol polarity when a symbol is received. +pattern, which results in a positive or negative peak when a symbol is received depending on its polarity. By increasing the DSSS sequence length by a factor of $2$, SNR is improved by $\sqrt{2}$ assuming an additive white gaussian noise (AWGN) channel. At the same time, when doubling the sequence length, common DSSS code construction @@ -807,7 +806,7 @@ grid is restored piece by piece with safety reset controllers coming back online transmit the same reset command. 
In our protocol, we handle this situation by memorizing the last valid received command on the device side, and only acting \emph{once} when a new command is received. The transmission of one command thus becomes idempotent, and the utility can repeat the command until sufficiently many devices have received the command and -e.g.\ performed a safety reset. +performed a safety reset. In our protocol, we define two commands, \emph{reset} and \emph{disarm}. We assign \emph{reset} and \emph{disarm} to the $k_i$ in an alternating way. For odd $i$, $k_i$ is a reset command and for even $i$, $k_i$ is a \emph{disarm} command. @@ -900,7 +899,7 @@ sign bit into account, the length of the encoded signature is 20 DSSS symbols. O correction at a 2:1 ratio inflating total message length to 30 DSSS symbols. At the \SI{1}{\second} chip rate we used in other simulations as well this equates to an overall transmission duration of approximately \SI{15}{\minute}. To give the demodulator some time to settle and to produce more realistic conditions of signal reception we padded the modulated -signal unmodulated noise on both ends. +signal with unmodulated noise on both ends. \section{Lessons learned} @@ -915,14 +914,14 @@ with common JTAG programmers. Our initial assumption that a development kit would be easier to program than a commercial meter did not prove to be true. Contrary to our expectations the commercial meter had JTAG enabled allowing us to easily read out its stock -firmware without either reverse-engineering vendor firmware update files nor circumventing code protection measures. +firmware requiring neither reverse-engineering vendor firmware update files nor circumventing code protection measures. The fact that its firmware was only available in its compiled binary form was not much of a hindrance as it proved not to be too complex and all we wanted to know we found with just a few hours of digging in Ghidra\footnote{\url{https://ghidra-sre.org/}}. 
-In the firmware development phase our approach of testing every module individually (e.g. DSSS demodulator, Reed-Solomon -decoder, grid frequency estimation) proved useful particularly for debugging. The modular architecture allowed us to -directly compare our demodulator implementation to our Jupyter/Python prototype, where we found that our C +In the firmware development phase we tested every module such as DSSS demodulator, Reed-Solomon decoder, or grid +frequency estimation individually. This approach proved particularly useful for debugging. The modular architecture +allowed us to directly compare our demodulator implementation to our Jupyter/Python prototype, where we found that our C implementation outperformed the Python prototype. Despite the algorithms' complexity, the microcontroller C implementation has no issues processing data in real-time due to the low sampling rate necessary. @@ -965,7 +964,7 @@ Safety reset controllers can be adapted to most IoT device and smart meter desig other public utilities such as the internet or cellular networks, we believe in their potential as a last line of defense providing resilience under large-scale cyber attacks. The next steps towards a practical implementation will be a practical demonstration of broadcast data transmission through grid frequency modulation using a megawatt-scale -controllable load as well as further optimization of the modulation and data encoding as well as the demodulator +controllable load as well as further optimization of the modulation and data encoding and the demodulator implementation. \subsection{Artifacts} -- cgit