Diffstat (limited to 'paper/safety-reset-paper.tex')
-rw-r--r-- | paper/safety-reset-paper.tex | 1057 |
1 files changed, 387 insertions, 670 deletions
diff --git a/paper/safety-reset-paper.tex b/paper/safety-reset-paper.tex index 640f84a..2493aa4 100644 --- a/paper/safety-reset-paper.tex +++ b/paper/safety-reset-paper.tex @@ -29,11 +29,10 @@ \begin{document} -\title{Ripples in a Pond: Transmitting Information through Grid Frequency Modulation} -\titlerunning{Ripples in a Pond: Transmitting Information through Grid Frequency Modulation} -\author{Jan Sebastian Götte \and Björn Scheuermann} -\authorrunning{Jan Sebastian Götte \and Björn Scheuermann} -\institute{HIIG\\ \email{safetyreset@jaseg.de} \and HU Berlin \\ \email{scheuermann@informatik.hu-berlin.de}} +\title{Ripples in the Pond: Transmitting Information through Grid Frequency Modulation} +\titlerunning{Ripples in the Pond: Transmitting Information through Grid Frequency} +\author{Jan Sebastian Götte \and Liran Katzir \and Björn Scheuermann} +\institute{HIIG\\ \email{safetyreset@jaseg.de} \and Tel Aviv University\\Faculty of Engineering \and HU Berlin \\ \email{scheuermann@informatik.hu-berlin.de}} % FIXME keywords \maketitle \keywords{Security, privacy and resilience in critical infrastructures \and Security and privacy in ``internet of @@ -42,28 +41,29 @@ things'' \and Cyber-physical systems \and Hardware security \and Network Securit \begin{abstract} The smart grid is a large, complex and interconnected technological system. With remotely controllable load switches having been rolled out at scale in some countries, a tiny flaw inside the firmware of one of these embedded devices - may allow attacks to remotely trigger large-scale excursions of grid parameters with potentially catastrophic - results. Attaining perfect security from such cyberphysical attacks is a monumental embedded engineering task---and - observations do not indicate that current efforts meet the requirements of this task.%FIXME cite recent RECESSIM work + may enable attacks to remotely trigger large-scale disruption with potentially catastrophic results. 
Attaining + perfect security against such cyberphysical attacks is a monumental embedded engineering task---and observations do + not indicate that current efforts meet the requirements of this task.%FIXME cite recent RECESSIM work In this paper, we approach the smart grid safety issue by implementing an emergency override that can be used to - e.g.\ reset all connected devices to a known-good state and preempting subsequent compromise by cutting - communication links. To yield a fully fail-safe design, our system does not rely on the internet or any other - communication network to work. Instead, our system transmits error-corrected and cryptographically secured commands - by modulating grid frequency using a single large consumer such as a large aluminium smelter. This approach differs - from traditional Powerline Communication (PLC) systems in that reaches every device within the same synchronous - area. - - Using extensive simulations we have determined that control of a $\SI{25}{\mega\watt}$ load would allow for the - transmission of a crytographically secured \emph{reset} signal within $15$ minutes. We have produced a - proof-of-concept prototype receiver that demonstrates the feasibility of decoding such signals even on - resource-constrained microcontroller hardware. + reset all connected devices to a known-good state and preempt subsequent compromise by cutting communication links. + To yield a fully fail-safe design, our system does not rely on the internet or other conventional communication + network to work. Instead, our system transmits error-corrected and cryptographically secured commands by modulating + grid frequency using a single large consumer such as a large aluminium smelter. 
This approach differs from + traditional Powerline Communication (PLC) systems in that it reaches every device within the same synchronous area as + the signal is embedded into the fundamental grid frequency instead of a superimposed voltage that is quickly + attenuated across long distances. + + Using simulations we have determined that control of a $\SI{25}{\mega\watt}$ load would allow for the transmission + of a cryptographically secured \emph{reset} signal within $15$ minutes. We have produced a proof-of-concept prototype + receiver that demonstrates the feasibility of decoding such signals even on resource-constrained microcontroller + hardware. \end{abstract} \section{Introduction} In the power grid, as in many other engineered systems, we can observe an ongoing diffusion of information systems into -industrial control systems. Automation of these control systems has already been practiced for the better part of a +the domain of industrial control. Automation of these control systems has already been practiced for the better part of a century. Throughout the 20th century this automation was mostly limited to core components of the grid. Generators in power stations are computer-controlled according to electromechanical and economic models. Switching in substations is automated to allow for fast failure recovery. Human operators are still vital to these systems, but their tasks have @@ -71,9 +71,9 @@ shifted from pure operation to engineering, maintenance and surveillance\cite{cr With the turn of the century came a large-scale trend in power systems to move from a model of centralized generation, built around massive large-scale fossil and nuclear power plants, towards a more heterogenous model of smaller-scale -generators working together. In this new model large-scale fossil power plants still serve a major role, but two new -factors come into play. One is the advance of renewable energies. 
The large-scale use of wind and solar power in -particular from a current standpoint seems unavoidable for our continued existence on this planet. For the electrical +generators working together. In this new model large-scale fossil power plants still serve a major role, but new +factors come into play. One such factor is the advance of renewable energies. The large-scale use of wind and solar power in +particular seems unavoidable for continued human life on this planet. For the electrical grid these systems constitute a significant challenge. Fossil-fueled power plants can be controlled in a precise and quick way to match energy consumption. This tracking of consumption with production is vital to the stability of the grid. Renewable energies such as wind and solar power do not provide the same degree of controllability, and they @@ -84,13 +84,15 @@ In distributed generation end-customers that previously only consumed energy hav from small solar installations on their property. Distributed generation is a chance for customers to gain autonomy and shift from a purely passive role to being active participants of the electricity market\cite{crastan03}. -To match this new landscape of decentralized generation and unpredictable renewable resources the utility industry has -had to adapt itself in major ways. One aspect of this adaptation that is particularly visible to ordinary people is the +% FIXME the following paragraph is weird. + +To match this new landscape of unpredictable renewable resources and decentralized generation, the utility industry has +had to adapt itself in major ways. One aspect of this adaptation that is particularly visible to energy consumers is the computerization of end-user energy metering. 
Despite the widespread use of industrial control systems inside the -electrical grid and the far-reaching diffusion of computers into people's everyday lives the energy meter has long been +electrical grid and the far-reaching diffusion of computers into people's everyday lives, the energy meter has long been one of the last remnants of an offline, analog time. Until the 2010s many households were still served through electromechanical Ferraris-style meters that have their origin in the late 19th -century\cite{borlase01,ukgov04,bnetza02}. Today under the umbrella term \emph{Smart Metering} the shift towards fully +century\cite{borlase01,ukgov04,bnetza02}. Today, under the umbrella term \emph{Smart Metering}, the shift towards fully computerized, often networked meters is well underway. The roll out of these \emph{Smart Meters} has not been very smooth overall with some countries severely lagging behind. As a safety-critical technology, smart metering technology is usually standardized on a per-country basis. This leads to an inhomogenous landscape with--in some instances--wildly @@ -99,31 +101,32 @@ This complex standardization landscape and market situation has led to a prolife microcontroller firmware. The complexity and scale of this--often network-connected--firmware makes for a ripe substrate for bugs to surface. -A remotely exploitable flaw inside the firmware of a component of a smart metering ystem could have consequences ranging -from impaired billing functionality to an existential threat to grid stability\cite{anderson01,anderson02}. In a country -where meters commonly include disconnect switches for purposes such as prepaid tariffs a coördinated attack could at -worst cause widespread activation of grid safety systems by repeatedly connecting and disconnecting megawatts of load -capacity in just the wrong moments\cite{wu01}. 
+A remotely exploitable flaw inside the firmware of a component of a smart metering system could have consequences +ranging from impaired billing functionality to an existential threat to grid stability\cite{anderson01,anderson02}. In a +country where meters commonly include disconnect switches for purposes such as prepaid tariffs, a coordinated attack +could at worst cause widespread activation of grid safety systems through oscillations caused by repeated cycling of +megawatts of load capacity at just the wrong frequency\cite{wu01}. Mitigation of these attacks through firmware security measures is unlikely to yield satisfactory results. The enormous complexity of smart meter firmware makes firmware security extremely labor-intensive. The diverse standardization -landscape makes a coördinated, comprehensive response unlikely. +landscape makes a coordinated, comprehensive response unlikely. In this paper, instead of focusing on the very hard task of improving firmware security we introduce a pragmatic -solution to the--in our opinion likely--scenario of a large-scale compromise of smart meter firmware. In our proposal +solution to the--in our opinion likely--scenario of a large-scale compromise of smart meter firmware. In our concept the components of the smart meter that are threatened by remote compromise are equipped with a physically separate -\emph{safety reset controller} that listens for a reset command transmitted through the electrical grid's frequency and -on reception forcibly resets the smart meter's entire firmware to a known-good state. Our safety reset controller +\emph{safety reset controller} that listens for a ``reset'' command transmitted through the electrical grid's frequency +and on reception forcibly resets the smart meter's entire firmware to a known-good state. 
Our safety reset controller receives commands through Direct Sequence Spread Spectrum (DSSS) modulation carried out on grid frequency through a -large controllable load such as an aluminum smelter. After forward error correction and cryptographic verification it +large controllable load such as an aluminium smelter. After forward error correction and cryptographic verification it re-flashes the meter's main microcontroller over the standard JTAG interface. Note that our modulation technique is one \emph{changing grid frequency itself}. This is fundamentally different in both generation and detection from systems -such as traditional PLC that superimpose a signal on grid voltage, but leave grid frequency itself unaffected. +such as traditional PLC that superimpose a signal on grid voltage, but leave the underlying grid frequency itself +unaffected. -Starting from a high level architecture, we have carried out extensive simulations of our proposal's performance under -real-world conditions. Based on these simulations we implemented an end-to-end prototype of our proposed safety reset -controller as part of a realistic smart meter demonstrator. Finally, we experimentally validated our results and we will -conclude with an outline of further steps towards a practical implementation. +Starting from a high level architecture, we have carried out simulations of our concept's performance under real-world +conditions. Based on these simulations we implemented an end-to-end prototype of our proposed safety reset controller as +part of a realistic smart meter demonstrator. Finally, we experimentally validated our results and we will conclude with +an outline of further steps towards a practical implementation. This work contains the following contributions: \begin{enumerate} @@ -132,698 +135,412 @@ This work contains the following contributions: implementation. \item We design a communication system based on GFM. 
\item We carry out extensive simulations of our systems to determine its performance characteristics. - \item We show the simple grid frequency recorder design we used to capture data for our simulations. \end{enumerate} \section{Related work} \label{sec_related_work} -% FIXME: Cut down this section from ~6 pages to 2...3 pages. + +% FIXME: intro here \subsection{Security and Privacy in the Smart Grid} The smart grid in practice is nothing more or less than an aggregation of embedded control and measurement devices that are part of a large control system. This implies that all the same security concerns that apply to embedded systems in -general also apply to most components of a smart grid. Where programmers have been struggling for decades now with input -validation\cite{leveson01}, the same potential issue raises security concerns in smart grid scenarios as well\cite{mo01, -lee01}. Only, in smart grid we have two complicating factors present: Many components are embedded systems, and as such -inherently hard to update. Also, the smart grid and its control algorithms act as a large (partially-)distributed -system making problems such as input validation or authentication harder\cite{blaze01} and adding a host of distributed -systems problems on top\cite{lamport01}. +general also apply to the components of a smart grid. Where programmers have been struggling for decades now with issues +such as input validation\cite{leveson01}, the same potential issue raises security concerns in smart grid scenarios as +well\cite{mo01, lee01}. Only, in smart grid we have two complicating factors present: Many components are embedded +systems, and as such inherently hard to update. Also, the smart grid and its control algorithms act as a large partially +distributed system making problems such as input validation or authentication harder\cite{blaze01} and adding a host of +distributed systems problems on top\cite{lamport01}. 
Given that the electrical grid is essential infrastructure in our modern civilization, these problems amount to -significant issues in practice. Attacks on the electrical grid may have grave consequences\cite{anderson01,lee01} while -the long maintenance cycles of various components make the system slow to adapt. Thus, components for the smart grid -need to be built to a much higher standard of security than most consumer devices to ensure they live up to well-funded -attackers even decades down the road. This requirement intensifies the challenges of embedded security and distributed -systems security among others that are inherent in any modern complex technological system. The safety-critical nature -of the modern smart metering ecosystem in particular was quickly recognized\cite{anderson01}. +significant issues. Attacks on the electrical grid may have grave consequences\cite{anderson01,lee01} while the long +replacement cycles of various components make the system slow to adapt. Thus, components for the smart grid need to be +built to a much higher standard of security than most consumer devices to ensure they live up to well-funded attackers +even decades down the road. This requirement intensifies the challenges of embedded security and distributed systems +security among others that are inherent in any modern complex technological system. The safety-critical nature of the +modern smart metering ecosystem in particular was quickly recognized\cite{anderson01}. A point we will not consider in much depth in this work is theft of electricity. While in publications aimed towards the general public the introduction of smart metering is always motivated with potential cost savings and ecological benefits, in industry-internal publications the reduction of electricity theft is often cited as an incentive\cite{czechowski01}. 
Likewise, academic publications tend to either focus on other benefits such as generation efficiency gains through better forecasting or rationalize the consumer-unfriendly aspects of smart metering with social -benefits\cite{mcdaniel01}. They do not usually point out \emph{revenue protection} mechanisms as +benefits\cite{mcdaniel01}. They do not usually point out revenue protection mechanisms as incentives\cite{anderson01,anderson02}. A serious issue in smart metering setups is customer privacy. Even though the meter ``only'' collects aggregate energy consumption of a whole household, this data is highly sensitive\cite{markham01}. This counterintuitive fact was initially overlooked in smart meter deployments leading to outrage, delays and reduced features\cite{cuijpers01}. The root cause of this problem is that given sufficient timing resolution these aggregate measurements contain ample -entropy. Through disaggregation algorithms individual loads can be identified and through pattern matching even complex -usage patterns can be discerned with alarming accuracy\cite{greveler01}. Similar privacy issues arise in many other -areas of modern life through pervasive tracking and surveillance\cite{zuboff01}. +entropy. Through disaggregation algorithms, individual loads can be identified and through pattern matching even complex +usage patterns can be discerned with alarming accuracy\cite{greveler01} in the same way that similar privacy issues +arise in many other areas of modern life through other kinds of pervasive tracking and surveillance\cite{zuboff01}. Another fundamental challenge in smart grid implementations is the central role of smart electricity meters in the smart -grid ecosystem. Smart meters are used both for highly-granular load measurement and (in some countries) load +grid ecosystem. Smart meters are used both for highly-granular load measurement and in some countries also for load switching\cite{zheng01}. 
Smart electricity meters are effectively consumer devices. They are built down to a certain price point that is measured by the burden it puts on consumers and that is divided by the relatively small market served by a single smart meter implementation. Such cost requirements can preclude security features such as the use of a standard hardened software environment on a high powered embedded system. Landis+Gyr, a large manufacturer that makes most of its revenue from utility meters in their 2019 annual report write that they spent \SI{36}{\percent} of their total -R\&D budget on embedded software (firmware) while spending only \SI{24}{\percent} on hardware -R\&D\cite{landisgyr01,landisgyr02}, indicating a significant tension between firmware security and a smart meter -vendor's bottom line. +R\&D budget on embedded software while spending only \SI{24}{\percent} on hardware R\&D\cite{landisgyr01,landisgyr02}, +indicating a significant tension between firmware security and a smart meter vendor's bottom line. \subsection{The state of the art in embedded security} Embedded software security generally is much harder than security of higher-level systems. The primary two factors affecting this are that on one hand, embedded devices usually run highly customized firmware that (often by necessity) -is rarely updated. On the other hand, embedded devices often lack the advanced security mechanisms such as memory -management units that are found in most higher-power devices. Even well-funded companies continue to have trouble -securing their embedded systems. A spectacular example of this difficulty is the recently-exposed flaw in Apple's iPhone -SoC first-stage ROM bootloader that allows for the full compromise of any iPhone before the iPhone X given physical -access to the device. iPhone 8, one of the affected models, was still being manufactured and sold by Apple until April -2020. 
In another instance in 2016 researchers found multiple flaws in the secure-world firmware used by Samsung in +is rarely updated. On the other hand, embedded devices often lack advanced security mechanisms such as memory management +units that are found in most higher-power devices. Even well-funded companies continue to have trouble securing their +embedded systems. A spectacular example of this difficulty is the 2019 flaw in Apple's iPhone SoC first-stage ROM +bootloader that allows for the full compromise of any iPhone before the iPhone X given physical access to the +device\cite{heise01}. iPhone 8, one of the affected models, was still being manufactured and sold by Apple until April +2020. In another instance in 2016, researchers found multiple flaws in the secure world firmware used by Samsung in their mobile phone SoCs. The flaws they found were both severe architectural flaws such as secret user input being -passed through untrusted userspace processes without any protection and shocking cryptographic flaws such as +passed through untrusted userspace processes without any protection as well as shocking cryptographic flaws such as CVE-2016-1919\footnote{\url{http://cve.circl.lu/cve/CVE-2016-1919}}\cite{kanonov01}. And Samsung is not the only large multinational corporation having trouble securing their secure world firmware implementation. In 2014 researchers found an embarrassing integer overflow flaw in the low-level code handling untrusted input in Qualcomm's QSEE firmware\cite{rosenberg01}. For an overview of ARM TrustZone including a survey of academic work and past security vulnerabilities of TrustZone-based firmware see \cite{pinto01}. -If even companies targeting R\&D budgets that rival some countries' national budgets at mass-market consumer devices -have trouble securing their secure embedded software stacks, what is a much smaller smart meter manufacturer to do? 
-Especially if national standards mandate complex protocols such as TLS that are tricky to implement -correctly\cite{georgiev01}, the manufacturer is short on options to secure their product. +If even companies with R\&D budgets that rival some countries' national budgets +have trouble securing their mass-market secure embedded software stacks, what is a much smaller smart meter manufacturer +to do? Especially if national standards mandate complex protocols such as TLS that are tricky to implement +correctly\cite{georgiev01}, this manufacturer will be short on options to secure their product. \subsection{Attack surface in the smart grid} -From the previous paragraphs we can conclude that in smart metering technology, market incentives do not currently -provide the conditions for a level of device security that will reliably last the coming decades. Considering this -tension, in this paragraph we will outline the cyberphysical risks that arise from attacks on the smart grid in the -first place. +From the incidents we outlined in the previous paragraphs we conclude that in smart metering technology, market +incentives do not currently provide the conditions for a level of device security that will reliably last the coming +decades. Considering this tension, in this paragraph we examine the cyberphysical risks that arise from attacks on the +smart grid in the first place. These risks arise at three different infrastructure levels. -The first such attack that might come to mind is one where the attacker compromises components of the grids centralized -control systems. This type of attack is often cited in popular discourse and to our knowledge is the only type of attack -against a grid that has ever been carried out in practice at scale. Despite their severity, these attacks do not pose a -strictly \emph{scientific} challenge, though since these attacks are generic to any industrial control system. 
Their -causes and countermeasures are generally well-understood and the hardest challenge in their prevention is likely to lie -in budgetary constraints. +The first level is that of attacks on centralized control systems. This type of attack is often cited in popular +discourse and to our knowledge is the only type of attack against an electric grid that has ever been carried out in +practice at scale\cite{lee01}. Despite their severity, these attacks do not pose a strictly \emph{scientific} challenge +since they are generic to any industrial control system. Their causes and countermeasures are generally well-understood +and the hardest challenge in their prevention is likely to be budgetary constraints. Beyond the centralized control systems, the next target for an attacker may be the communication links between those -control systems and other smart grid components. While in older systems as well as the last mile to households' smart -meters special-purpose systems such as PLC are still common, in the overall system IP-based technologies have -proliferated much like they did in other industries. Along with this adoption of IP-based communication links comes the -ability to apply generic network security measures from the IP world to the smart grid domain. In this way, a -standardized, IP-based protocol stack unlocks decades of network security improvements at little cost. +control systems and other smart grid components. While in some countries such as Italy special-purpose systems such as +PLC are common\cite{ec03}, overall, IP-based technologies have proliferated according to the larger trend in computing +towards IP-based communications. This proliferation of IP-based communication links brings along the possibility for +the application of generic network security measures from the IP world to the smart grid domain. In this way, a +standardized, IP-based protocol stack unlocks decades of network security improvements at little cost. 
-Finally, an attacker might target the endpoint device itself. Smart meters are deployed at a large scale -%%% FIXME << HERE WIP >> +Beyond these layers towards the core of the smart grid's control infrastructure, an attacker might also corrupt the +network from the edges and target the endpoint devices themselves. The large-scale deployment of networked smart meters +creates an environment that is favorable to such attacks. +% FIXME cite RECESSIM landis+gyr protocol hacking wiki/youtube \subsection{Cyberphysical threats in the smart grid} -If we model the smart grid as a control system responding to changes in inputs by regulating outputs, on a -very high level we can see two general categories of attacks: Attacks that directly change the state of the outputs, and -attacks that try to influence the outputs indirectly by changing the system's view of its inputs. The former would be an -attack such as shutting down a power plant to decrease generation capacity\cite{lee01}. The latter would be an attack -such as forging grid frequency measurements where they enter a power plant's control systems to provoke the control -systems to -oscillate\cite{kosut01,wu01,kim01}. - - -\paragraph{Control function exploits.} -Control function exploits are attacks on the mathematical control loops used by the centralized control system. One -example of this type of attack are resonance attacks as described in \cite{wu01}. In this kind of attack, inputs from -peripheral sensors indicating grid load to the centralized control system are carefully modified to cause a -disproportionately large oscillation in control system action. This type of attack relies on complex resonance effects -that arise when mechanical generators are electrically coupled. These resonances, colloquially called ``modes'', are -well-studied in power system engineering\cite{rogers01,grebe01,entsoe01,crastan03}. 
Even disregarding modern attack -scenarios, for stability electrical grids are designed with measures in place to dampen any resonances inherent to grid -structure. These resonances are hard to analyze since they require an accurate grid model and they are unlikely to be -noticed under normal operating conditions. - -Mitigation of these attacks can be achieved by ensuring unmodified sensor inputs to the control systems in the first -place. Carefully designing control systems not to exhibit exploitable behavior such as oscillations is also possible but -harder. - -\paragraph{Endpoint exploits.} -The one to us rather interesting attack on smart grid systems is someone exploiting the grid's endpoint devices such as -smart electricity meters. These meters are deployed on a massive scale, with at least one meter per household on -average\footnote{Households rarely share a meter but some households may have a separate meter for detached properties -such as a detached garage or basement.}. Once compromised, restoration to an uncompromised state can be difficult if it -requires physical access to thousands of devices in hard-to-access locations. - -By compromising smart electricity meters, an attacker can forge the distributed energy measurements these devices -perform. In a best-case scenario, this might only affect billing and lead to customers being under- or over-charged if -the attack is not noticed in time. In a less ideal scenario falsified energy measurements reported by these devices -could impede the correct operation of centralized control systems. - -In some countries such as the UK smart meters have one additional function that is highly useful to an attacker: They -contain high-current disconnect switches to disconnect the entire household or business in case electricity bills are -left unpaid for a certain period. 
In countries that use these kinds of systems on a widespread level, the load -disconnect switch is controlled by the smart meter's central microcontroller. This allows anyone compromising this -microcontroller's firmware to actuate the disconnect switch at will. Given control over a large number of -network-connected smart meters, an attacker might thus be able to cause large-scale disruptions of power -consumption\cite{anderson01,temple01}. Combined with an attack method such as the resonance attack from \cite{wu01} -that was mentioned above, this scenario poses a serious threat to grid stability. - -In places where Demand-Side Management (DSM) is common this functionality may be abused in a similar way. In DSM the -smart metering system directly controls power to certain devices such as heaters. The utility can remotely control the -turn-on and turn-off of these devices to smoothen out the load curve. In exchange the customer is billed a lower price -for the energy consumed by these loads. DSM was traditionally done in a federated fashion usually through low-frequency -PLC over the distribution grid\cite{dzung01}. Smart metering systems no longer require large, resource-intensive -transmitters in substations and bear the potential for a rollout of such technology on a much wider scale than before. -This leads to a potentially significant role of DSM systems in the impact calculation of an attack on a smart metering -system. DSM does not control as much load capacity as remote disconnect switches do but the attacks cited in the above -paragraph still fundamentally apply. +Assuming that an attacker has compromised devices on any of these levels of smart grid infrastructure, what could they +do with their newly gained power? The obvious action would be to switch off everything. 
Of all scenarios, +this is both the most likely in practice---it is exactly what happened in the Russian cyberattacks on the Ukrainian +grid\cite{lee01}---and the easiest to mitigate since the vulnerable components are few and centralized. +Mitigations include the installation of fail-safes as well as a defense-in-depth approach to hardening the grid's +cyber-infrastructure. + +Another possible action for an attacker would be to forge energy measurements in an attempt to cause financial mayhem. +Both individual consumers and the utility could be targeted by such an attack. While such an attack might have +localized success, larger-scale discrepancies will likely quickly be caught by monitoring systems. For example, if a +large number of meters in an area systematically under- or over-reported their energy readings, meter readings across +the affected area would no longer add up with those of monitoring devices in other locations in the transmission and +distribution grid. + +In some countries, smart meter functionality goes beyond mere monitoring and also includes remotely controlled +switches. There are two types of these switches: Switches to support \emph{Demand-Side Management} (DSM) and +cut-off switches that are used to punish defaulting customers. Demand-Side Management is when a grid operator can remotely +control the timing of large, non-time-critical loads on the customer's premises\cite{dzung01}. A typical example of this +is a customer using an electric water heater: The heater is outfitted with a large hot water storage tank and is +hooked up to the utility's DSM system. The customer does not care when exactly their water is heated as long +as there is enough of it, and the utility offers them cheaper rates for the electricity used for heating in exchange for +control over its precise timing.
The utility uses this control to even out peaks in the consumption/production +imbalance, remotely enabling DSM systems during off-peak times and disabling them during peak hours. In contrast to +DSM, cut-off switches are switches placed between the grid and the customer's entire household such that the utility +can disconnect non-paying customers without incurring the expense of sending a technician to the customer's premises. +Unlike DSM systems, cut-off switches are not opt-in\cite{anderson01,temple01}. An attack that uses cut-off switches +would obviously immediately cause severe mayhem. Attacks on DSM may have more limited immediate impact as affected +consumers may not notice an interruption for several hours. + +Instead of switching off loads outright, an attack employing DSM switches (and potentially also cut-off switches) could +choose to target the grid's stability. By synchronizing many compromised smart meters to switch on and off a large +amount of load capacity, an attacker might cause the entire electrical grid to oscillate\cite{kosut01,wu01,kim01}. As a +large system of coupled mechanical systems, the electrical grid exhibits a complex frequency-domain behavior. These +resonance effects, colloquially called ``modes'', are well-studied in power system +engineering\cite{rogers01,grebe01,entsoe01,crastan03}. As they can cause issues even under normal operating conditions, +a large effort is invested in damping these resonances. However, fully eliminating them under changing load conditions +may not be achievable. \subsection{Communication Channels on the Grid} -There is a number of well-established technologies for communication on or along power lines.
We can distinguish three -basic system categories: Systems using separate wires (such as DSL over landline telephone wiring), wireless radio -systems (such as LTE) and \emph{power line communication} (PLC) systems that reüse the existing mains wiring and -superimpose data transmissions onto the 50 Hz mains sine\cite{gungor01,kabalci01}. - -For our scenario, we will ignore short-range communication systems. There exists a large number of \emph{wideband} -power line communication systems that are popular with consumers for bridging Ethernet segments between parts of an -apartment or house. These systems transmit up to several hundred megabits per second over distances up to several tens -of meters\cite{kabalci01}. Technologically, these wideband PLC systems are very different from \emph{narrowband} -systems used by utilities for load management among other applications and they are not relevant to our analysis. - -\paragraph{Power line communication (PLC).} -In long-distance communications for applications such as load management, PLC systems are attractive since they allow -re-using the existing wiring infrastructure and have been used as early as in the 1930s\cite{hovi01}. Narrowband PLC -systems are a potentially low-cost solution to the problem of transmitting data at small bandwidth over distances of -several hundred meters up to tens of kilometers. - -Narrowband PLC systems transmit on the order of Kilobits per second or slower. A common use of this sort of system are -\emph{ripple control} systems. These systems superimpose a low-frequency signal at some few hundred Hertz carrier -frequency on top of the 50Hz mains sine. This low-frequency signal is used to encode switching commands for -non-essential residential or industrial loads. Ripple control systems provide utilities with the ability to actively -control demand while promising savings in electricity cost to consumers\cite{dzung01}. 
- -In any PLC system there is a strict trade-off between bandwidth, power and distance. Higher bandwidth requires higher -power and reduces maximum transmission distance. Where ripple control systems usually use few transmitters to cover -the entire grid of a regional distribution utility, higher bandwidth bidirectional systems used for automatic meter -reading (AMR) in places such as Italy or France require repeaters within a few hundred meters of a transmitter. - -\subsubsection{Landline and wireless IP-based systems.} -Especially in automated meter reading (AMR) infrastructure the cost-benefit trade-off of power line systems does not -always work out for utilities. A common alternative in these systems is to use the public internet for communication. -Using the public internet has the advantage of low initial investment on the part of the utility company as well as -quick commissioning. Disadvantages compared to a PLC system are potentially higher operational costs due to recurring -fees to network providers as well as lower reliability. Being integrated into power grid infrastructure, a PLC system's -failure modes are highly correlated with the overall grid. Put briefly, if the PLC interface is down, there is a good -chance that power is out, too. In contrast general internet services exhibit a multitude of failures that are entirely -uncorrelated to power grid stability. For purposes such as meter reading for billing purposes, this stability is -sufficient. However for systems that need to hold up in crisis situations such as the recovery system we are -contemplating in this thesis, the public internet may not provide sufficient reliability. - -\subsubsection{Short-range wireless systems.} -Smart meters contain copious amounts of firmware but still pale in comparison to the complexity of full-scale computers -such as smartphones. 
For short-range communication between a meter and a cellular radio gateway mounted nearby or -between a meter and a meter reading operator in a vehicle on the street a protocol such as Wifi (IEEE 802.11) is too -complex. Absent widely-used standards in this space proprietary radio protocols grew attractive. These are often based -on some standardized lower-level protocol such as ZigBee (IEEE 802.15) but entirely home-grown ones also exist. To the -meter manufacturer a proprietary radio protocol has several advantages. It is easy to implement and requires no external -certification. It can be customized to its specific application. In addition it provides vendor lock-in to customers -sharing infrastructure such as a cellular radio gateway between multiple devices. In other fields a lack of -standardization has led to a proliferation of proprietary protocols and a fragmented protocol landscape. This is a large -problem since the consumer cannot easily integrate products made by different manufacturers into one system. In advanced -metering infrastructure this is unlikely to be a disadvantage since usually there is only one distribution grid -operator for an area. Shared resources such as a cellular radio gateway would most likely only be shared within a -single building and usually they are all operated by the same provider. - -Systems in Europe commonly support Wireless M-Bus, an European standardized protocol\cite{silabs01} that operates on -several ISM bands\footnote{ - Frequency bands that can be used for \emph{Industrial, Scientific and Medical} applications by anyone and that do - not require obtaining a license for transmitter operation. Manufacturers can use whatever protocol they like on - these bands as long as they obtain certification that their transmitters obey certain spectral and power - limitations. -}. 
ZigBee is another popular standard and some vendors additionally support their own proprietary protcols\footnote{ - For an example see \cite{honeywell01}. -}. +A core part of intervening in any such cyberattack is the ability to communicate remedial actions to the devices +under attack. There are a number of well-established technologies for communication on or along power lines. We can +distinguish three basic system categories: Systems using separate wires (such as DSL over landline telephone wiring), +wireless radio systems (such as LTE) and \emph{Power Line Communication} (PLC) systems that reuse the existing mains +wiring and superimpose data transmissions onto the 50 Hz mains sine\cite{gungor01,kabalci01}. + +During a large-scale cyberattack, availability of internet and cellular connectivity cannot be relied upon. An attacker +may already have disabled such systems in a separate attack, or they may go down along with parts of the electrical +grid. Traditional power line communication systems or a utility's proprietary wireless systems would work, but at a range +of no more than several tens of kilometers, reaching all meters in a country would require a large upfront infrastructure +investment. \section{Grid Frequency as a Communication Channel} -Despite the awesome complexity of large power grids the physics underlying their response to changes in load and -generation is surprisingly simple. Individual machines (loads and generators) can be approximated by a small number of -differential equations and the entire grid can be modelled by aggregating these approximations into a large system of -nonlinear differential equations. Evaluating these systems it has been found that in large power grids small signal -steady state changes in generation/consumption power balance cause an approximately linear change in -frequency\cite{kundur01,crastan03,entsoe02,entsoe04}.
\emph{Small signal} here describes changes in power balance that -are small compared to overall grid power. \emph{Steady state} describes changes over a time frame of multiple waveform -cycles as opposed to transient events that only last a few milliseconds. - -This approximately linear relationship allows the specification of a coefficient with unit \si{\watt\per\hertz} linking -power differential $\Delta P$ and frequency differential $\Delta f$. In this thesis we are using the European power -grid as our model system. We are using data provided by ENTSO-E (formerly UCTE), the governing association of European -transmission system operators. In our calculations we use data for the continental European synchronous area, the -largest synchronous area. $\frac{\Delta P}{\Delta f}$, called \emph{Overall Network Power Frequency Characteristic} by -ENTSO-E is around \SI{25}{\giga\watt\per\hertz}. - -We can derive general design parameter for any system utilizing grid frequency as a communication channel from the -policies of ENTSO-E\cite{entsoe02,entsoe03}. Any such system should stay below a modulation amplitude of -\SI{100}{\milli\hertz} which is the threshold defined in the ENTSO-E incidents classification scale for a Scale 0-1 -(from ``Anomaly'' to ``Noteworthy Incident'' scale) frequency degradation incident\cite{entsoe02} in the continental -Europe synchronous area. -% FIXME resolve cut --- - -Grid frequency in Europe's synchronous areas is nominally 50 Hertz, but there are small load-dependent variations from -this nominal value. Any device connected to the power grid (or even just within physical proximity of power wiring) can -reliably and accurately measure grid frequency at low hardware overhead. By intentionally modifying grid frequency, we -can create a very low-bandwidth broadcast communication channel. 
Grid frequency modulation has only ever been proposed -as a communication channel at very small scales in microgrids before\cite{urtasun01} and to our knowledge has not yet -been considered for large-scale application. - -Advantages of using grid frequency for communication are low receiver hardware complexity as well as the fact that a -single transmitter can cover an entire synchronous area. Though the transmitter has to be very large and powerful the -setup of a single large transmitter faces lower bureaucratic hurdles than integration of hundreds of smaller ones into -hundreds of local systems that each have autonomous governance. - -% FIXME resolve cut --- -\subsection{Interference from Frequency-Coupled Control Systems} - -The ENTSO-E Operations Handbook Policy 1 chapter\cite{entsoe02} defines the activation threshold of primary control to -be \SI{20}{\milli\hertz}. Ideally, a modulation system would stay well below this threshold to avoid fighting the -primary control reserve. Modulation line rate should likely be on the order of a few hundred Millibaud. Modulation at -these rates would outpace primary control action which is specified by ENTSO-E as acting within between ``a few -seconds'' and \SI{15}{\second}. - -Keeping modulation amplitude below this threshold would help to avoid spuriously triggering these control functions. -The effective \emph{Network Power Frequency Characteristic} of primary control in the European grid is reported by -ENTSO-E at around \SI{20}{\giga\watt\per\hertz}. This works out to an upper bound on modulation power of -\SI{20}{\mega\watt\per\milli\hertz}. - - -\subsection{Transmission Grid Fundamentals for Computer Scientists} -\subsection{Determining Grid Frequency} - -% FIXME resolve cut --- -In commercial power systems Phasor Measurement Units (PMUs, also called \emph{synchrophasors}) are used to precisely -measure parameters of the mains voltage waveform, one of which is grid frequency. 
PMUs are used as part of SCADA systems -controlling transmission networks to characterize the operational state of the network. - -From a superficial viewpoint measuring grid frequency might seem like a simple problem. Take the mains voltage waveform, -measure time between two rising-edge (or falling-edge) zero-crossings and take the inverse $f = t^{-1}$. In practice, -phasor measurement units are significantly more complex than this. This discrepancy is due to the combination of both -high precision and quick response that is demanded from these units. High precision is necessary since variations of -mains frequency under normal operating conditions are quite small--in the range of \SIrange{5}{10}{\milli\hertz} over -short intervals of time. Relative to the nominal \SI{50}{\hertz} this is a derivation of less than \SI{100}{ppm}. -Relative to the corresponding period of \SI{20}{\milli\second} this means a time derivation of about $2 \mu\text{s}$ -from cycle to cycle. From this it is already obvious why a simplistic measurement cannot yield the required precision -for manageable averaging times: We would need either an ADC sampling rate in the order of megabits per second or for a -reconstruction through interpolated readings an impractically high ADC resolution. - -Detail on the inner workings of commercial phasor measurement units is scarce but given their essential role to SCADA -systems there is a large amount of academic research on such algorithms\cite{narduzzi01,derviskadic01,belega01}. A -popular approach to these systems is to perform a Short-Time Fourier Transform (STFT) on ADC data sampled at high -sampling rate (e.g. \SI{10}{\kilo\hertz}) and then perform analysis on the frequency-domain data to precisely locate the -peak at \SI{50}{\hertz}. A key observation here is that FFT bin size is going to be much larger than required frequency -resolution. 
This fundamental limitation follows from the Nyquist criterion\cite{shannon01} -and if we had to process an \emph{arbitrary} signal this would severely limit our practical measurement accuracy -\footnote{ - Some software packages providing FFT or STFT primitives such as scipy\cite{virtanen01} allow the user to - super-sample FFT output by specifying an FFT width larger than input data length, padding the input data with zeros - on both sides. Note that in line with the Nyquist theorem this \emph{does not} actually provide finer output - resolution but instead just amounts to an interpolation between output bins. Depending on the downstream analysis - algorithm it may still be sensible to use this property of the DFT for interpolation, but in general it will be - computationally expensive compared to other interpolation methods and in any case it will not yield any better - frequency resolution aside from a potential numerical advantage\cite{gasior02}. -}. -For this reason all approaches to grid frequency estimation are based on a model of the voltage waveform. Nominally -this waveform is a perfect sine at $f=\SI{50}{\hertz}$. In practice it is a sine at $f\approx\SI{50}{\hertz}$ -superimposed with some aperiodic noise (e.g. irregular spikes from inductive loads being energized) as well as harmonic -distortion that is caused by topologically nearby devices with power factor $\cos \theta \neq 1.0$. Under a continuous -fourier transform over a long period the frequency spectrum of a signal distorted like this will be a low noise floor -depending mainly on aperiodic noise on which a comb of harmonics as well as some sub-harmonics of $f \approx -f_\text{nom} = \SI{50}{\hertz}$ is riding. The main peak at $f \approx f_\text{nom}$ will be very strong with the -harmonics being approximately an order of magnitude weaker in energy and the noise floor being at least another order of -magnitude weaker. See Figure \ref{mains_voltage_spectrum} for a measured spectrum. 
This domain knowledge about the -expected frequency spectrum of the signal can be employed in a number of interpolation techniques to reconstruct the -precise frequency of the spectrum's main component despite distortions and the comparatively coarse STFT resolution. - -Published grid frequency estimation algorithms such as \cite{narduzzi01,derviskadic01} are rather sophisticated and use -a combination of techniques to reduce numerical errors in FFT calculation and peak fitting. Given that we do not need -reference standard-grade accuracy for our application we chose to start with a very basic algorithm instead. We chose to -use a general approach to estimate the precise fundamental frequency of an arbitrary signal that was published by -experimental physicists Gasior and Gonzalez at CERN\cite{gasior01}. This approach assumes a general sinusoidal signal -superimposed with harmonics and broadband noise. Applicable to a wide spectrum of practical signal analysis tasks it is -a reasonable first-degree approximation of the much more sophisticated estimation algorithms developed specifically for -power systems. Some algorithms use components such as kalman filters\cite{narduzzi01} that require a physical model. -As a general algorithm \cite{gasior01} does not require this kind of application-specific tuning, eliminating one source -of error. - -The Gasior and Gonzalez algorithm\cite{gasior01} passes the windowed input signal through a DFT, then interpolates the -signal's fundamental frequency by fitting a wavelet such as a Gaussian to the largest peak in the DFT results. The bias -parameter of this curve fit is an accurate estimation of the signal's fundamental frequency. This algorithm is similar -to the simpler interpolated DFT algorithm used as a reference in much of the synchrophasor estimation -literature\cite{borkowski01}. The three-term variant of the maximum side lobe decay window often used there is a -Blackman window with parameter $\alpha = \frac{1}{4}$. 
Analysis has shown\cite{belega01} that the interpolated DFT -algorithm is worse than algorithms involving more complex models under some conditions but that there is \emph{no free -lunch} meaning that more complex perform worse when the input signal deviates from their models. -% FIXME resolve cut --- - -\subsubsection{Our Algorithm} -\subsubsection{Our Hardware} - -\section{Characteristics of Grid Frequency} +We propose to approach the problem of broadcasting an emergency signal to all smart meters within a synchronous area by +using grid frequency as a communication channel. Despite the awesome complexity of large power grids, the physics +underlying their response to changes in load and generation is surprisingly simple. Individual machines (loads and +generators) can be approximated by a small number of differential equations and the entire grid can be modelled by +aggregating these approximations into a large system of nonlinear differential equations. As a consequence, small-signal +changes in generation/consumption power balance cause an approximately proportional change in +frequency\cite{kundur01,crastan03,entsoe02,entsoe04}. This \emph{Power Frequency Characteristic} is about +\SI{25}{\giga\watt\per\hertz} for the continental European synchronous area according to European electricity grid +authority ENTSO-E. + +If we modulate the power consumption of a large load such as a multi-megawatt aluminium smelter, this modulation will +result in a small change in frequency according to this characteristic. So long as we stay within the operational limits +set by ENTSO-E\cite{entsoe02,entsoe03}, this change will not degrade the operation of other parts of the grid. The +advantages of grid frequency modulation are that a single transmitter can cover an entire synchronous area and +that receiver hardware complexity is low.
+ +To the best of the authors' knowledge, grid frequency modulation has only ever been proposed as a communication channel +at very small scales in microgrids before\cite{urtasun01} and has not yet been considered for large-scale application. + +\subsection{Characterizing Grid Frequency} + +In utility SCADA systems, Phasor Measurement Units (PMUs, also called \emph{synchrophasors}) are used to precisely +measure grid frequency among other parameters. This task is much more complicated in practice than it might appear at +first glance since a PMU has to make extremely precise measurements, track fast changes in frequency and handle even +distorted input signals. Detail on the inner workings of commercial phasor measurement units is scarce but there is a +large amount of academic research on sophisticated phasor measurement +algorithms\cite{narduzzi01,derviskadic01,belega01}. + +Since we do not need reference standard-grade accuracy for our application we chose to start with a very basic algorithm +based on the short-time Fourier transform (STFT). Our system uses the universal frequency estimation approach of +experimental physicists Gasior and Gonzalez at CERN\cite{gasior01}. The Gasior and Gonzalez algorithm\cite{gasior01} +passes the windowed input signal through a DFT, then interpolates the signal's fundamental frequency by fitting a +wavelet such as a Gaussian to the largest peak in the DFT results. The bias parameter of this curve fit is an accurate +estimation of the signal's fundamental frequency. This algorithm is similar to the simpler interpolated DFT algorithm +used as a reference in much of the phasor measurement literature\cite{borkowski01}. + +To collect ground truth measurements for our analysis of grid frequency as a communication channel, we developed a device +to safely record real mains voltage waveforms.
Our system consists of an \texttt{STM32F030F4P6} ARM Cortex M0 +microcontroller that records mains voltage using its internal 12-bit ADC and transmits measured values through a +galvanically isolated USB/serial bridge to a host computer. We derive our system's sampling clock from a crystal oven to +avoid frequency measurement noise due to thermal drift of a regular crystal: \SI{1}{ppm} of crystal drift would cause a +grid frequency error of $\SI{50}{\micro\hertz}$. We validated the performance of our crystal oven solution by +benchmarking it against a GPS 1pps reference. + +% FIXME measurement results, spectra \section{Grid Frequency Modulation} -\subsection{Fundamental Physics} -\subsection{Transmitter Implementation} - -% FIXME resolve cut --- -In its most basic form a transmitter for grid frequency modulation would be a very large controllable load connected to -the power grid at a suitable vantage point. A spool of wire submerged in a body of cooling liquid such as a small lake -along with a thyristor rectifier bank would likely suffice to perform this function during occasional cybersecurity -incidents. We can however decrease hardware and maintenance investment even further compared to this rather -uncultivated solution by repurposing regular large industrial loads as transmitters in an emergency situation. For some -preliminary exploration we went through a list of energy-intensive industries in Europe\cite{ec01}. The most -electricity-intensive industries in this list are primary aluminum and steel production. In primary production raw ore -is converted into raw metal for further refinement such as casting, rolling or extrusion. In steelmaking iron is -smolten in an electric arc furnace. In aluminum smelting aluminum is electrolytically extracted from alumina. Both -processes involve large amounts of electricity with electricity making up \SI{40}{\percent} of production costs. 
Given -these circumstances a steel mill or aluminum smelter would be good candidates as transmitters in a grid frequency -modulation system. - -In aluminum smelting high-voltage mains is transformed, rectified and fed into about 100 series-connected electrolytic -cells forming a \emph{potline}. Inside these pots alumina is dissolved in molten cryolite electrolyte at about -\SI{1000}{\degreeCelsius} and electrolysis is performed using a current of tens or hundreds of Kiloampère. The resulting -pure aluminum settles at the bottom of the cell and is tapped off for further processing. - -Like steelworks, aluminum smelters are operated night and day without interruption. Aside from metallurgical issues the -large thermal mass and enormous heating power requirements do not permit power cycling. Due to the high costs of -production inefficiencies or interruptions the behavior of aluminum smelters under power outages is a -well-characterized phenomenon in the industry. The recent move away from nuclear power and towards renewable energy has -lead to an increase in fluctuations of electricity price throughout the day. These electricity price fluctuations have -provided enough economic incentive to aluminum smelters to develop techniques to modulate smelter power consumption -without affecting cell lifetime or product quality\cite{duessel01,eisma01}. Power outages of tens of minutes up to two -hours reportedly do not cause problems in aluminum potlines and are in fact part of routine operation for purposes such -as electrode changes\cite{eisma01,oye01}. - -The power supply system of an aluminum plant is managed through a highly-integrated control system as keeping all cells -of a potline under optimal operating conditions is challenging. Modern power supply systems employ large banks of diodes -or SCRs\footnote{SCRs, also called thyristors, are electronic devices that are often used in high-power switching -applications. 
They are normally-off devices that act like diodes when a current is fed into their control terminal.} to -rectify low-voltage AC to DC to be fed into the potline\cite{ayoub01}. The potline voltage can be controlled almost -continuously through a combination of a tap changer and a transductor. The individual cell voltages can be controlled by -changing the anode to cathode distance (ACD) by physically lowering or raising the anode. The potline power supply is -connected to the high voltage input and to the potline through isolators and breakers. - -In an aluminum smelter most of the power is sunk into resistive losses and the electrolysis process. As such an -aluminum smelter does not have any significant electromechanical inertia compared to the large rotating machines used -in other industries. Depending on the capabilities of the rectifier controls high slew rates are possible, permitting -modulation at high\footnote{Aluminum smelter rectifiers are \emph{pulse rectifiers}. This means instead of simply -rectifying the incoming three-phase voltage they use a special configuration of transformer secondaries and in some -cases additional coils to produce a large number of equally spaced phases (e.g.\ six) from a standard three-phase input. -Where a direct-connected three-phase rectifier would draw current in six pulses per mains voltage cycle a pulse -rectifier draws current in more, smaller pulses to increase power factor. For example a 12-pulse rectifier will draw -current in 12 pulses per cycle. In the best case an SCR pulse rectifier switched at zero crossing should allow -\SIrange{0}{100}{\percent} load changes from one rectifier pulse to the next, i.e. within a fraction of a single cycle.} -data rates. -% FIXME resolve cut --- - -\subsection{Parametrizing DSSS Modulation for GFM} - -% FIXME resolve cut/write intro --- -\begin{description} - \item[Modulation amplitude.] Amplitude is proportionally related to modulation power. 
In a practical setup we might - realize a modulation power up to a few hundred \si{\mega\watt} which would yield a few tens of \si{\milli\hertz} - of frequency amplitude. - \item[Modulation preemphasis and slew-rate control.] Preemphasis might be necessary to ensure an adequate - Signal-to-Noise ratio (SNR) at the receiver. Slew-rate control and other shaping measures might be necessary to - reduce the impact of these sudden load changes on the transmitter's primary function (say, aluminum smelting) - and to prevent disturbances to other grid components. - \item[Modulation frequency.] For a practical implementation a careful study would be necessary to determine the - optimal frequency band for operation. On one hand we need to prevent disturbances to the grid such as the - excitation of local or inter-area modes. On the other hand we need to optimize Signal-to-Noise ratio (SNR) - and data rate to achieve optimal latency between transmission start and reset completion and to reduce the - overall burden on both transmitter and grid. - \item[Further modulation parameters.] The modulation itself has numerous parameters that are discussed in Section - \ref{mod_params} below. -\end{description} - -% FIXME resolve cut/write intro --- -% FIXME too many enumerations? -In this section we will explore how we can construct a reliable communication channel from the analog primitive we -have outlined in the previous section. Our load control approach to grid frequency modulation leads to a channel with the -following properties. - -\begin{description} - \item[Slow-changing.] Accurate grid frequency measurements take several periods of the mains sine wave. Faster - sampling rates can be achieved with more complex specialized synchrophasor estimation algorithms but this will - result in a trade-off between sampling rate and accuracy\cite{belega01}. - \item[Analog.] Grid frequency is an analog signal. - \item[Noisy.] 
While stable over long periods of time thanks to power stations' Load-Frequency Control - systems\cite{entsoe04} there are considerable random short-term variations. Our modulation amplitude is limited - by technical and economic constraints so we have to find a system that will work at poor SNRs. - \item[Polarized.] Grid frequency measurements have an inherent sense of polarity that we can use in our modulation - scheme. -\end{description} - -% FIXME resolve cut --- -Modern power systems are complex electromechanical systems. Each component is controlled by several carefully tuned -feedback loops to ensure voltage, load and frequency regulation. Multiple components are coupled through transmission -lines that themselves exhibit complex dynamic behavior. The overall system is generally stable, but may exhibit -instabilities in response to particular small-signal stimuli\cite{kundur01,crastan03}. These instabilities, called \emph{modes}, -occur when due to mis-tuning of parameters or physical constraints the overall system exhibits oscillation at a -particular frequency. \cite{kundur01} separates these modes into four categories: - -\begin{description} - \item[Local modes] where a single power station oscillates in some parameter, - \item[Interarea modes] where subsections of the overall grid oscillate with respect to each other due to weak - coupling between them, - \item[Control modes] caused by imperfectly tuned control systems and - \item[Torsional modes] that originate from electromechanical oscillations in the generator itself. -\end{description} - -The oscillation frequencies associated with each of these modes are usually between a few tens of millihertz and a few -hertz\cite{grebe01,entsoe01,crastan03}. It is hard to predict the particular modes of a power system at the scale of the -central European interconnected system. Theoretical analysis and simulation may give rough indications but cannot yield -conclusive results.
Due to the obvious danger as well as high economical impact due to inefficiencies experimental -measurements are infeasible. Modes are highly dependent on the power grid's structure and will change with changes in -the power grid over time. For all of these reasons, a grid frequency modulation system must be designed very -conservatively without relying on the absence (or presence) of modes at particular frequencies. A concrete design -guideline that we can derive from this situation is that the frequency spectrum of any grid frequency modulation system -should not exhibit large peaks and should avoid a concentration of spectral energy in small frequency bands. -% FIXME resolve cut --- - -\subsection{Parametrizing a "Safety Reset" System Based on GFM} -% FIXME resolve cut & write intro --- -% FIXME cut down next 2 sections -\subsubsection{Error-correcting codes} - -To reduce reception error rate we have to layer channel coding on top of the DSSS modulation. The messages we expect to -transmit are at least a few tens of bits long. We are highly constrained in SNR due to limited transmission power and -with lower SNR comes higher BER (Bit Error Rate). At a fixed BER, packet error rate grows exponentially with -transmission length so for our relatively long transmissions we would realistically get unacceptable error rates. - -Error correcting codes are a very broad field with many options for specialization. Since we are implementing only an -advanced prototype in this thesis we chose to spend only limited resources on optimization and settled on a basic -Reed-Solomon code. We have no doubt that applying a more state-of-the-art code we could gain further improvements in -code overhead and decoding speed among others\cite{mackay01}. Since message length in our system limits system response -time but we do not have a fixed target we can tolerate some degree of overhead. Decoding speed is of very low concern -to us because our data rate is extremely low. 
We derived our implementation by adapting and optimizing an existing open -source decoder that we validated on an open source encoder implementation. We generate test signals using a Python tool -on the host. - -\subsubsection{Cryptographic security} -\label{sec-crypto} -Above the communication base layer elaborated in the previous section we have to layer a cryptographic protocol to -ensure system security. We want to avoid a case where a third party could interfere with our system or even subvert this -safety system itself for an attack. From a protocol security perspective the system we are looking for can informally -be modelled as consisting of three parties: the trusted \emph{transmitter}, one of a large number of untrusted -\emph{receivers}, and an \emph{attacker}. These three play according to the following rules: - -\begin{description} - \item[Access.] Both transmitter and attacker can transmit any bit sequence. - \item[Indistinguishability.] The receiver receives any transmission by either but cannot distinguish between them. - \item[Kerckhoff's principle.] Since the protocol design is public and anyone can get access to an electricity meter - the attacker knows anything any receiver might know\cite{kerckhoff01,kerckhoff02}. - \item[Priority.] The transmitter is stronger than an attacker and will ``win'' during simultaneous transmission. - \item[Seeding.] Both transmitter and receiver can be seeded out-of-band with some information on each other such as - public key fingerprints. -\end{description} - -We are not considering situations where an attacker attempts to jam an ongoing transmission. In practice there are -several avenues to prevent such attempts. Compromised large loads that are being abused by the attacker can be manually -disconnected by the utility. Error-correcting codes can be used to provide resiliency against small-scale disturbances. 
-Finally, the transmitter can be designed to have high enough power to be able to override any likely attacker. - -With the above properties in mind our goal is to find a cryptographic primitive that has the following properties: -\begin{description} - \item[Authentication.] The transmitter can produce a message bit sequence that a certain subset of receivers can - identify as being generated by the transmitter. On reception of this sequence, all addressed receivers perform a - safety reset. - \item[Unforgeability.] The attacker cannot forge a message, i.e.\ find a bit sequence other than one of the - transmitter's previous messages that a receiver would accept. This implies that the attacker also cannot create - a new distinct message from a previously transmitted message. - \item[Brevity.] The message should be short. Our communication channel is outrageously slow compared to anything - else used in modern telecommunications and every bit counts. -\end{description} - -On a protocol level we also have to ensure \emph{idempotence}. Our system should have at-most-once semantics. This -means for a given message each receiver either performs exactly one safety reset or none at all, even if the message is -re-transmitted by either the transmitter or an attacker. We cannot achieve the ideal exactly-once semantics with pure -protocol gymnastics since we are using a unidirectional lossy communication primitive. A receiver might be offline -(e.g.\ due to a local power outage) and then would not hear the transmission even if our broadcast primitive were -reliable. Since there is no back channel, the transmitter has no way of telling when that happens. The practical impact -of this can be mitigated by the transmitter repeating the message a number of times. - -It follows from the unforgeability requirement that we can trivially reach idempotence at the protocol level by keeping -a database of all previous messages and only accepting new messages.
By considering this in our cryptographic design we -can reduce the storage overhead of this ``database''. - -Along with the indistinguishability property the access requirement implies that we need a cryptographic -signature\cite{lamport01}. However, we have relaxed constraints on this signature compared to standard cryptographic -practice\cite{anderson04}. While cryptographic signatures need to work over arbitrary inputs, all we want to ``sign'' -here is the instruction to perform a safety reset. This is the only message we might ever want to transmit so our -message space has only one element. The information content of our message thus is 0 bit! All the information we want to -transmit is already encoded \emph{in the fact that we are transmitting} and we do not require a further payload to be -transmitted: We can omit the entirety of the message and just transmit whatever ``signature'' we -produce\cite{haller01,rfc1760}. This is useful to conserve transmission bits so our transmission does not take an -exceedingly long time over our extremely slow communication channel. - -We can modify this construction to allow for a small number of bits of information content in our message (say two or -three instead of zero) at no transmission overhead by transmitting the cryptographic signature as usual but simply -omitting the message. The message contains only a few bits of information and we are dealing with minutes of -transmission time so the receiver can reconstruct the message through brute-force. Though this trade-off between -computation and data transmission might seem inelegant it does work for our extremely slow link for up to a few bits of -information. - -There is an important limitation in the rules of our setup above: The attacker can always record the reset bit sequence -the transmitter transmits and replay that same sequence later. Even without cryptography we can trivially prevent an -attacker from violating the at-most-once criterion. 
If every receiver memorizes all bit sequences that have been -transmitted so far it can detect replays. With this mitigation by replaying an older authentic transmission an attacker -can cause receivers that were offline during the original transmission to reset at a later point. Considering our goal -is to reset them in the first place this should not pose a threat to the system's safety or security. - -A possible scenario would be that an attacker first causes enough havoc for authorities to trigger a safety reset. The -attacker would record the trigger transmission. We can assume most meters were reset during the attack. Due to this the -attacker cannot cause a significant number of additional resets immediately afterwards. However, the attacker could -wait several years for a number of new meters to be installed that might not yet have updated firmware that includes the -last transmission. This means the attacker could cause them to reset by replaying the original sequence. - -A possible mitigation for this risk would be to introduce one bit of information into the trigger message that is -ignored by the replay protection mechanism. This \emph{enable} bit would be $1$ for the actual reset trigger message. -After the attack the transmitter would then perform scheduled transmissions of a ``disarm'' message that has this bit -set to $0$. This message informs all new meters and meters that were offline during the original transmission of the -original transmission for replay protection without actually performing any further resets. - -We could use any of several traditional asymmetric cryptographic primitives to produce these signatures. The -comparatively high computational effort required for signature verification would not be an issue. Transmissions take -several minutes anyway and we can afford to spend some tens of seconds even in signature verification. Transmission -length and by proxy system latency would be determined by the length of the signature. 
For RSA signature length is the -modulus length (i.e. larger than \SI{1000}{bit} for very basic contemporary security). For elliptic curve-based systems -curve length is approximately twice the security level and signature size is twice the curve length because two curve -points need to be encoded\cite{anderson02}. For contemporary security this results in more than 300 bit transmission -length. We can exploit our unique setting's low message entropy to improve on this by basing our scheme on a -cryptographic hash function used as a one-way pseudo-random function (PRF). Hash-based signature schemes date back to -the very beginnings of cryptographic signatures\cite{anderson04,diffie01,lamport02}. Today, in general applications -schemes based on asymmetric cryptography are preferred but hash-based signature systems have their applications in -certain use cases. One example of such a scheme is the TESLA scheme\cite{perrig01} that is the basis for navigation -message authentication in the European Galileo global navigation satellite system. Here, a system based purely on -asymmetric primitives would result in too much computation and communication overhead\cite{ec05}. In the following -sections we will introduce the foundations of hash-based signatures before deriving our authentication scheme. - -\subsubsection{Lamport signatures} - -1979, Lamport in \cite{lamport02} introduced a signature scheme that is based only on a one-way function such as a -cryptographic hash function. The basic observation is that by choosing a random secret input to a one-way function and -publishing the output, one can later prove knowledge of the input simply by publishing it. In the following paragraphs -we will describe a construction of a one-time signature scheme based on this observation. The scheme we describe is the -one usually called a ``Lamport Signature'' in modern literature but is slightly different from the variant described in -the 1979 paper. 
For our purposes we can consider both to be equivalent. - -\paragraph{Setup.} In a Lamport signature, for an $n$-bit hash function $H$ the signer generates a private key $s = -\left(s_{b, i} | b\in\left\{0, 1\right\}, 0\le i<n\right)$ of $2n$ random strings of length $n$. The signer publishes a -public key $p = \left(p_{b, i} = H\left(s_{b, i}\right), b\in\left\{0, 1\right\}, 0\le i<n\right)$ that is simply the -list of hashes of each of the random strings that make up the private key. - -\paragraph{Signing.} To sign a message $m$, the signer publishes the signature $\sigma = \left(\sigma_i = s_{H(m)_i, -i}\right)$ where $H(m)_i$ is the $i$-th bit of $H$ applied to $m$. That is, for the $i$-th bit of the message's hash -$H(m)$ the signer publishes either of $s_{0, i}$ or $s_{1, i}$ depending on the hash bit's value, keeping the other -entry of $s$ secret. - -\paragraph{Verification.} The verifier can compute $H(m)$ themselves and check that the corresponding entries $\sigma_i = -s_{H(m)_i, i}$ of $\sigma$ correctly evaluate to $p_{H(m)_i, i} = H\left(\sigma_i\right)$ from $p$ under $H$. - -The above scheme is a one-time signature scheme only. After one signature has been published for a given key, the -corresponding key must not be reüsed for other signatures. This is intuitively clear as we are effectively publishing -part of the private key as the signature, and if we were to publish a signature for another message an attacker could -derive additional signatures by ``mixing'' the two published signatures. - -\subsubsection{Winternitz signatures} - -An improvement over the basic Lamport signatures described above is the Winternitz signature scheme detailed in -\cite{merkle01,dods01}. Winternitz signatures reduce public key length as well as signature length for hash length $n$ -from $2n$ to $\mathcal O \left(n/t\right)$ for some choice of parameter $t$ (usually a small number such as 4).
- -\paragraph{Setup.} The signer generates a private key $s = \left(s_i\right)$ consisting of $\lceil\frac{n}{t}\rceil$ random -bit strings. The signer publishes a public key $p = \left(H^{2^t}\left(s_i\right)\right)$ where each element -$H^{2^t}\left(s_i\right)$ is the $2^t$-fold recursive application of $H$ to $s_i$. - -\paragraph{Signing.} The signer splits $m$ padded to a multiple of $t$ bits into $\lceil\frac{n}{t}\rceil$ chunks $m_i$ of -$t$ bit each. The signer publishes the signature $\sigma = \left( \sigma_i = H^{m_i}\left(s_i\right) \right)$. - -\paragraph{Verification.} The verifier can calculate for each $\sigma_i = H^{m_i}\left(s_i\right)$ that $H^{2^t - -m_i}\left(\sigma_i\right) = H^{2^t - m_i}\left(H^{m_i}\left(s_i\right)\right) = H^{2^t - m_i + m_i} \left(s_i\right) = -p_i$. - -To prevent an attacker from forging additional signatures from one signature by calculating $\sigma_i' = -H\left(\sigma_i\right)$ matching $m_i' = m_i + 1$, this scheme is usually paired with a simple checksum as described in -\cite{merkle01}. - -\subsubsection{Using hash-based signatures for trigger authentication} - -Applying these concepts the most basic trigger authentication scheme possible would be to simply generate a random -secret key bit string $s$ and publish $p = H(s)$ for some hash function $H$. To activate the trigger, $\sigma = s$ is -published and receivers verify that $H(\sigma) = p = H(s)$. This simplistic scheme has one main disadvantage: It is a -fundamentally one-time construction. To prevent an attacker from re-triggering a receiver a second time by replaying a -valid trigger $\sigma$ all receivers have to blacklist any ``used'' $\sigma$. Alas, this means we can only ever trigger -a receiver \emph{once}. The good part is that any receiver that missed this trigger can still be triggered later, but -the bad part is that once $s$ is burned we are out of options. 
The trivial solution to this would be to simply provision -each receiver with a whole list of public keys in advance. This however takes $n$ times the amount of space for $n$-fold -retriggerability and for each one we have to memorize separately whether it has been used up. Luckily we can easily -derive a scheme that yields $n$-fold retriggerability and naturally memorizes replay state while using no more space -than the original scheme by taking some inspiration from Winternitz signatures. - -In this improved scheme the secret key $s$ is still a random bit string. The public key is $p = H^n(s)$ for $n$-times -retriggerability. The $i$-th time the trigger is activated, $\sigma_i = H^{n-i}(s)$ is published, and every receiver -can verify that $\sigma_{i-1} = H\left(\sigma_i\right)$ with $\sigma_0 = p$. In case a receiver missed one or more -previous triggers it continues computing $H\left(H\left(\sigma_i\right)\right)$ and -$H\left(H\left(H\left(\sigma_i\right)\right)\right)$ and so on until either reaching the $n$-th recursion -level--indicating an invalid signature--or finding $H^k\left(\sigma_i\right) = \sigma_j$ for some $k \le n$, with $\sigma_j$ being the last -signature this receiver recorded or $p$ in case there is none. - -This scheme provides replay protection since the receiver memorizes the last signature they acted on. Public key length -is equal to the output length of the hash function $H$ used. Even for our embedded systems use case $n$ can realistically be up -to $\mathcal O\left(10^3\right)$, which is enough for our purposes. This use of a hash chain for event authentication is -identical to the one in the S/KEY one-time password system\cite{anderson04,haller01,rfc1760}. -% 1990ies crypto yeah! - -The ``disarm'' message we discussed above for replay protection can be integrated into this scheme by encoding the -``enable'' bit into the least significant bit of $n$ in our $H^n$ construction.
In the chain of valid signatures every -second one would be a disarm signature: Reset and disarm signatures would alternate in this scheme. By skipping a disarm -signature two resets can still be triggered directly after one another. - -In practice it may be useful to have some control over which meters reset. An attack exploiting a particular network -protocol implementation flaw might only affect one series of meters made by one manufacturer. Resetting \emph{all} -meters may be too much in this case. A simple solution for this is to define addressable subsets of meters. ``All -meters'' along with ``meters made by manufacturer $x$'' and ``meters of model $y$'' are good choices for such scopes. On -the cryptographic level the protocol state is simply duplicated for each scope. This incurs memory and computation -overhead linear in the number of scopes but device memory requirements are small at a few bytes only and computation is -of no concern due to the very slow channel so this simple solution is adequate. The transmitter has to either store -copies of all scope's keys or derive these keys from a root key using the scope's identifier. Keys are small and the -transmitter would be using a regular server or hardware security module for key management so either easily feasible. - -A diagram of the key structure in this key management scheme is shown in Figure \ref{fig:sig_key_chain}. The -transmitter key management is shown in Figure \ref{fig:tx_scope_key_illu}. This scheme is simplistic but suffices for -our prototype in Section \ref{sec-prototype} and may even be useful in a practical implementation. During -standardization of a safety reset system the key management system would most likely have to be customized to the -particular application's requirements. Developing an universal solution is outside the scope of this work. 
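The option of deriving per-scope keys from a root key can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the text does not name a derivation function, so HMAC-SHA-512 is our choice here, and the names \texttt{derive\_scope\_secret}, \texttt{scope\_public\_key} and the scope identifier strings are hypothetical.

```python
import hashlib
import hmac


def derive_scope_secret(root_key: bytes, scope_id: str) -> bytes:
    """Derive one scope's secret chain seed from the transmitter's root key.

    Assumption: HMAC-SHA-512 keyed with the root key, applied to the scope
    identifier. The paper only states that keys may be derived from a root
    key using the scope's identifier.
    """
    return hmac.new(root_key, scope_id.encode("utf-8"), hashlib.sha512).digest()


def scope_public_key(secret: bytes, n: int) -> bytes:
    """Compute p = H^n(s), the value provisioned into receivers of a scope."""
    value = secret
    for _ in range(n):
        value = hashlib.sha512(value).digest()
    return value


# Hypothetical root key and scope identifiers, for illustration only.
root = bytes(range(32))
scopes = ["all", "vendor:x", "model:y"]
public_keys = {sc: scope_public_key(derive_scope_secret(root, sc), 16) for sc in scopes}
```

With such a derivation the transmitter only has to protect the single root key, while each receiver stores one public key per scope it belongs to.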
- - -% FIXME resolve cut --- - -\subsection{Simulation Results} + +Given the grid characteristics we measured using our custom waveform recorder and a model of our transmitter, we can +derive parameters for the modulation of our broadcast system. In its most basic form a transmitter for grid frequency +modulation would be a very large controllable load connected to the power grid at a suitable vantage point. A spool of +wire submerged in a body of cooling liquid such as a small lake along with a thyristor rectifier bank would likely +suffice to perform this function during occasional cybersecurity incidents. We can however decrease hardware and +maintenance investment even compared to this rather crude solution by repurposing large industrial loads +as transmitters. Going through a list of energy-intensive industries in Europe\cite{ec01}, we found that an aluminium +smelter would be a good candidate. In aluminium smelting, aluminium is electrolytically extracted from alumina solution. +High-voltage mains power is transformed, rectified and fed into about 100 series-connected electrolytic cells forming a +\emph{potline}. Inside these pots alumina is dissolved in molten cryolite electrolyte at about \SI{1000}{\degreeCelsius} +and electrolysis is performed using a current of tens or hundreds of kiloamperes. The resulting pure aluminium settles at +the bottom of the cell and is tapped off for further processing. + +Aluminium smelters are operated around the clock, and due to the high financial stakes their behavior under power +outages has been carefully characterized by the industry. Power outages of tens of minutes up to two hours reportedly do +not cause problems in aluminium potlines\cite{eisma01,oye01}. Recently, even techniques for intentional power modulation +without affecting cell lifetime or product quality have been developed to take advantage of variable energy +prices\cite{duessel01,eisma01}.
An aluminium plant's power supply is controlled to constantly keep all smelter cells +under optimal operating conditions. Modern power supply systems employ large banks of diodes or SCRs to rectify +low-voltage AC to DC to be fed into the potline\cite{ayoub01}. Potline voltage is controlled through a combination of a +tap changer and a transductor. Individual cell voltages are controlled by changing the physical distance between anode +and cathode. In this setup, power can be modulated fully electronically. Since this system does not have any +mechanical inertia, high modulation rates can reasonably be achieved. + +\subsection{Parametrizing Modulation for GFM} + +Modulating $\SI{25}{\mega\watt}$ of smelter power would yield a frequency shift of $\SI{1}{\milli\hertz}$. At an RMS +frequency noise of around $\SI{10}{\milli\hertz}$ in the band around $\SI{1}{\hertz}$, this results in a challenging SNR. +% FIXME properly calculate frequency noise density, SNR +Under such conditions, the obvious choice for modulation is a spread-spectrum technique. Thus, we approached the setting +using Direct Sequence Spread Spectrum (DSSS) for its simple implementation and good overall performance. DSSS chip timing +should be as fast as the transmitter's physics allow to exploit the low-noise region between +$\SI{0.2}{\hertz}$ and $\SI{2.0}{\hertz}$ in the frequency noise spectrum while avoiding any of the grid's oscillation modes. Going +past $\approx\SI{2}{\hertz}$ would put strain on the receiver's frequency measurement subsystem\cite{belega01}. Using a +spread-spectrum technique allows us to reduce the effect of interference by spurious tones. In addition, spreading our +signal's energy over frequency also reduces the likelihood that we cause the grid to oscillate along any of its modes. + +To test our proposed approach, we wrote a proof-of-concept modulator and demodulator in Python and tested this +proof-of-concept prototype with data captured from our grid frequency sensor.
Our simulations covered a range of +parameters in modulation amplitude, DSSS sequence bit depth, chip duration and detection threshold. +Figure~\ref{fig_ser_nbits} shows symbol error rate (SER) as a function of modulation amplitude with Gold sequences of +several bit depths. As can be seen, realistic modulation amplitudes are in the range around $\SI{1}{\milli\hertz}$. In +the continental European synchronous area, this corresponds to a modulation power of approximately +$\SI{25}{\mega\watt}$. Figure~\ref{fig_ser_thf} shows SER against detection threshold relative to background noise. +Figure~\ref{fig_ser_chip} shows SER against chip duration for a given fixed symbol length. As expected from looking at +our measured grid frequency noise spectrum, performance is best for short chip durations and worsens for longer chip +durations since shorter chip durations move our signals' bandwidth into the lower-noise region from $\SI{0.2}{\hertz}$ +to $\SI{2}{\hertz}$. +%FIXME introduce term "chip" somewhere + +\begin{figure} + \centering + \includegraphics[width=0.6\textwidth]{../notebooks/fig_out/dsss_gold_nbits_overview} + \caption{Symbol Error Rate as a function of modulation amplitude for Gold sequences of several lengths.} + \label{fig_ser_nbits} +\end{figure} + +\begin{figure} + \centering + \hspace*{-1cm}\includegraphics[width=1.2\textwidth]{../notebooks/fig_out/dsss_thf_amplitude_5678} + \caption{SER vs.\ Amplitude and detection threshold. 
Detection threshold is set as a factor of background noise + level.} + \label{fig_ser_thf} +\end{figure} + +\begin{figure} + \centering + \hspace*{-1cm}\includegraphics[width=1.2\textwidth]{../notebooks/fig_out/chip_duration_sensitivity_6} + \vspace*{-1cm} + \caption{SER vs.\ DSSS chip duration.} + \label{fig_ser_chip} +\end{figure} + +\subsection{Parametrizing a proof-of-concept ``Safety Reset'' System Based on GFM} + +Taking these modulation parameters as a starting point, we proceeded to create a proof-of-concept smart meter emergency +reset system. On top of the modulation described in the previous paragraphs we layered simple Reed-Solomon error +correction\cite{mackay01} and some cryptography. The goal of our PoC cryptographic implementation was to allow the +sender of an emergency reset broadcast to authorize a reset command to all listening smart meters. An additional +constraint of our setting is that due to the extremely slow communication channel all messages should be kept as short +as possible. The solution we chose for our PoC is a simplistic hash chain using the approach from the Lamport and +Winternitz One-time Signature (OTS) schemes. Informally, the private key is a random bitstring. The public key is +generated by recursively applying a hash function to this key a number of times. Each smart meter reset command is then +authorized by disclosing successive elements of this chain. Unwinding the hash chain from the public key at the end of +the chain towards the private key at its beginning, at each step a receiver can validate the current command by checking +that it corresponds to the previously unknown input of the current step of the hash chain. Replay attacks are prevented +by recording the most recent valid command. This simple scheme does not afford much functionality but it results in very +short messages and removes the need for computationally expensive public-key cryptography inside the smart meter.
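The hash chain described informally above can be sketched in Python as follows. This is a minimal illustration and not the firmware implementation: truncating SHA-512 to its first 15 bytes (120 bit) is an assumption about how the truncated signature could be formed, and the names \texttt{H}, \texttt{chain} and \texttt{Receiver} are hypothetical.

```python
import hashlib


def H(data: bytes) -> bytes:
    # Assumption: SHA-512 truncated to the first 15 bytes (120 bit).
    return hashlib.sha512(data).digest()[:15]


def chain(secret: bytes, steps: int) -> bytes:
    """Apply H to the secret `steps` times; chain(secret, n) is the public key."""
    value = secret
    for _ in range(steps):
        value = H(value)
    return value


class Receiver:
    """Accepts a command only if it hashes forward to the last accepted value."""

    def __init__(self, public_key: bytes, max_steps: int):
        self.last = public_key      # most recent valid command, initially the public key
        self.max_steps = max_steps  # upper bound on hash iterations per candidate

    def accept(self, sigma: bytes) -> bool:
        value = sigma
        # Hash forward so that a receiver that missed earlier commands still
        # recognises a later element of the chain.
        for _ in range(self.max_steps):
            value = H(value)
            if value == self.last:
                self.last = sigma   # record the command for replay protection
                return True
        return False                # forged, replayed or outdated command
```

A transmitter holding the secret $s$ and chain length $n$ would authorize its $i$-th command by sending \texttt{chain(s, n - i)}. A replayed command is rejected because hashing it forward never reaches the stored value again.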
+% FIXME add more precise/formal description of crypto +% FIXME add description of targeting/scope function? +% FIXME somewhere above describe entire reset system architecture????!!! +% FIXME add description of disarm message (replay protection) + +\subsection{Experimental results} + +\begin{figure} + \centering + \includegraphics[width=0.6\textwidth]{prototype.jpg} + \caption{The completed prototype setup. The board on the left is the safety reset microcontroller. It is connected + to the smart meter in the middle through an adapter board. The top left contains a USB hub with debug interfaces to + the reset microcontroller. The cables on the bottom left are the debug USB cable and the \SI{3.5}{\milli\meter} + audio cable for the simulated mains voltage input.} + \label{fig_proto_pic} +\end{figure} + +For a realistic proof of concept, we decided to implement our signal processing chain from DSSS demodulator through +error correction up to our simple cryptography layer in microcontroller firmware and demonstrate this firmware on actual +smart meter hardware, shown in Figure~\ref{fig_proto_pic}. In our proof of concept a safety reset controller is +connected to the main application microcontroller of a smart meter. The reset controller is tasked with listening for +authenticated reset commands on the voltage waveform, and on reception of such a command resetting the smart meter +application controller by flashing a known-good firmware image to its memory. + +The signal processing chain of our PoC is shown in Figure~\ref{fig_demo_sig_schema}. To interoperate with existing +implementations of SHA-512 and Reed-Solomon decoding, this implementation was written in the C programming language. To +demonstrate an application close to a field implementation, we chose an Easymeter \texttt{Q3DA1002} smart meter as our +reset target. This model is popular in the German market and readily available second-hand.
The meter consists of three +isolated metering ASICs connected to a data logging and display PCB through infrared optical links. To demonstrate the +safety reset's firmware reset functionality, we connected our safety reset microcontroller to the Texas Instruments +\texttt{MSP430} microcontroller on the meter's display and data logging board through the JTAG debug interface that the +board's vendor had conveniently left accessible. We ported part of +\texttt{mspdebug}\footnote{\url{https://dlbeer.co.nz/mspdebug/}} to drive the meter microcontroller's JTAG interface and +wrote a piece of demonstrator code that overwrites the meter's firmware with one that displays an identifying string on +the meter's display after boot-up. + +\begin{figure} + \centering + \includegraphics[width=\textwidth]{prototype_schema} + \caption{The signal processing chain of our demonstrator.} + \label{fig_demo_sig_schema} +\end{figure} + +Since we did not have an aluminium smelter ready, we decided to feed our proof-of-concept reset controller with an +emulated grid voltage sine wave from a computer's headphone jack. Where in a real application this microcontroller might +take ADC readings of input mains voltage divided down by a long resistive divider chain, we instead feed the ADC from a +$\SI{3.5}{\milli\meter}$ audio input. For operational safety, we disconnected the meter microcontroller from its +grid-referenced capacitive dropper power supply and connected it to our reset controller's debug USB power supply. + +We performed several successful experiments using a signature truncated to 120 bit and a 5 bit DSSS sequence. Taking the +sign bit into account, the length of the encoded signature is 20 DSSS symbols. On top of this we used Reed-Solomon error +correction at a 2:1 ratio inflating total message length to 30 DSSS symbols. At the \SI{1}{\second} chip duration that we also used in +our simulations this equates to an overall transmission duration of approximately \SI{15}{\minute}.
To give
+the demodulator some time to settle and to produce more realistic conditions of signal reception, we padded the
+modulated signal with unmodulated noise on both ends.
 \section{Discussion}
+
+For our proof of concept, before settling on the commercial smart meter, we first tried to use an \texttt{EVM430-F6779}
+smart meter evaluation kit made by Texas Instruments. This evaluation kit proved unsuitable for two main reasons.
+First, it shipped with half the case missing and no cover for the terminal blocks. Because of this, some work was
+required to make it electrically safe. Even after mounting it in an electrically safe manner, the safety reset controller
+prototype would also have to be galvanically isolated to not pose an electrical safety risk, since the main MCU is not
+isolated from the grid and the JTAG port is also galvanically coupled. Second, the
+development board is based around a specific microcontroller from TI's \texttt{MSP430} series that is incompatible with
+common JTAG programmers.
+
+Our initial assumption that a development kit would be easier to program than a commercial meter did not prove to be
+true. Contrary to our expectations, the commercial meter had JTAG enabled, allowing us to easily read out its stock
+firmware without having to either reverse-engineer vendor firmware update files or circumvent code protection measures.
+The fact that its firmware was only available in its compiled binary form was not much of a hindrance, as it proved not
+to be too complex and we found out all we wanted to know with just a few hours of digging in
+Ghidra\footnote{\url{https://ghidra-sre.org/}}.
+
+In the firmware development phase, our approach of testing every module individually (e.g. DSSS demodulator,
+Reed-Solomon decoder, grid frequency estimation) proved to be very useful. In particular, debugging benefited greatly
+from being able to run several thousand tests within seconds.
In the case of our DSSS demodulator, this modular testing and simulation
+architecture allowed us to simulate thousands of runs of our implementation on test data and directly compare it to our
+Jupyter/Python prototype. Since we spent more time polishing our embedded C implementation, it turned out to perform
+better than our Python prototype while still exhibiting the same fundamental response to changes to its parameters. One
+significant bug we fixed in the embedded C version was the Python version's tendency towards incorrect decodings even at
+very large amplitudes.
+
+In accordance with our initial estimations, we did not run into any code space or computation bottlenecks from choosing
+floating point emulation instead of porting our algorithms to fixed point calculations. The extremely low sampling
+rate of our system makes even heavyweight processing such as FFT or our brute force dynamic programming approach to
+DSSS demodulation possible well within our performance constraints.
+
+Since we were only building a prototype, we did not optimize firmware code size. At around \SI{64}{\kilo\byte}, the
+compiled code size of our firmware implementation is slightly larger than we would like. The most heavyweight
+components overall are the SHA-512 implementation from libsodium and the FFT from ARM's CMSIS signal processing library.
+The SHA-512 implementation in particular has large potential for size optimization because it is highly optimized for
+speed using extensive manual loop unrolling. Despite being larger than what we initially targeted, this firmware is still
+small compared to the firmware space available in commercially deployed smart meters. We estimate that even without
+additional optimizations, our PoC firmware is already within the realm of firmware size that could be implemented in a
+commercially viable safety reset controller.
+
+\section{Conclusion} \label{sec_conclusion}
+In this paper we have developed an end-to-end design of a reset system to restore smart meters to a safe operating state
+during an ongoing large-scale cyberattack. To allow our system to be triggered even in the middle of a cyberattack, we
+have developed a broadcast data transmission system based on intentional modulation of global grid frequency. We have
+shown the viability of our end-to-end design through simulations. To put these simulations on a solid foundation, we have
+developed a grid frequency measurement methodology comprising a custom-designed hardware device for electrically safe
+data capture and a set of software tools to archive and process captured data. Our simulations show good behavior of our
+broadcast communication system and give an indication that cooperating with a large consumer such as an aluminium smelter
+would be a feasible way to set up a transmitter with low hardware overhead. We have outlined a simple cryptographic
+protocol ready for embedded implementation in resource-constrained systems that allows triggering a safety reset with a
+response time of less than 30 minutes. We have experimentally validated our system using simulated grid frequency data
+in a demonstrator setup based on a commercial microcontroller as our safety reset controller and an off-the-shelf smart
+meter. Source code and electronics CAD designs are available at the public repository listed at the end of this
+document.
+
 \printbibliography[heading=bibintoc] %%% FIXME remove appendix and work into text.