\documentclass[letterpaper,twocolumn,10pt]{article}
\usepackage{usenix}

\usepackage{amssymb,amsmath}
\usepackage{eurosym}
\usepackage{wasysym}

\usepackage[binary-units]{siunitx}
\DeclareSIUnit{\baud}{Bd}
\DeclareSIUnit{\year}{a}
\usepackage{commath}
\usepackage{graphicx,color}
\usepackage{subcaption}
\usepackage{array}
\usepackage{hyperref}
\usepackage{enumitem}

\renewcommand{\floatpagefraction}{.8}
\newcommand{\degree}{\ensuremath{^\circ}}
\newcolumntype{P}[1]{>{\centering\arraybackslash}p{#1}}
\newcommand{\partnum}[1]{\texttt{#1}}

\begin{document}

% https://eepublicdownloads.entsoe.eu/clean-documents/pre2015/publications/entsoe/Operation_Handbook/Policy_1_Appendix%20_final.pdf

\date{}
\title{\large\bf Ripples in the Pond:\\Transmitting Information through Grid Frequency Modulation}
\author{{\rm Jan Sebastian Götte}\\TU Darmstadt \and {\rm Liran Katzir}\\Tel Aviv University\and {\rm Björn Scheuermann}\\TU Darmstadt}
%\institute{TU Darmstadt\\ Communication Networks Lab\\ \email{safetyreset@jaseg.de}
%\and Tel Aviv University\\ Faculty of Engineering\\ \email{lirankat@tau.ac.il}
%\and TU Darmstadt\\ Communication Networks Lab\\ \email{scheuermann@informatik.hu-berlin.de}}
\maketitle
%\keywords{Security, privacy and resilience in critical infrastructures \and Security and privacy in ``internet of
%things'' \and Cyber-physical systems \and Hardware security \and Network Security \and Energy systems \and Signal theory}

\begin{abstract}
    The dependence of the electrical grid on networked control systems is steadily rising. While utilities are defending
    their side of the grid effectively through rigorous IT security measures such as physically separated control
    networks, networked devices on the consumer side, such as smart meters or large IoT-connected appliances like air
    conditioners, are growing in number and are much harder to secure due to their heterogeneity. We consider
    a crisis scenario in which an attacker compromises a large number of consumer-side devices and modulates their
    electrical load to destabilize the grid and cause an electrical outage~\cite{ctap+11,wu01,zlmz+21,kgma21,smp18,hcb19}.
    
    In this paper, we propose a broadcast channel based on the modulation of grid frequency through which utility
    operators can issue commands to devices at the consumer premises both during an attack for mitigation and in its
    wake to aid recovery. Our proposed grid frequency modulation (GFM) channel is independent of other telecommunication
    networks. It is resilient towards localized blackouts and it is operational as soon as power is restored.

    Based on our GFM broadcast channel we propose a ``safety reset'' system to mitigate an ongoing attack by disabling a
    device's network interfaces and resetting its control functions. It can also be used in the wake of an attack to aid
    recovery by shutting down non-essential loads to reduce strain on the grid.

    To validate our proposed design, we conducted simulations based on measured grid frequency behavior. Based on these
    simulations, we performed an experimental validation on simulated grid voltage waveforms using a smart meter
    equipped with a prototype safety reset system based on an inexpensive commodity microcontroller.
\end{abstract}

\section{Introduction}

With the rollout of the smart grid, the IT security of electrical infrastructure has attracted increased attention in
recent years. Smart Grid security has two major components: The security of central SCADA systems, and the security
of equipment at the consumer premises such as smart meters and IoT devices. While there is previous work on both sides,
their interactions have not yet received much attention.

In this paper, we consider the previously proposed scenario where a large number of compromised consumer devices is used
alone or in conjunction with an attack on the grid's central SCADA systems to destabilize the grid by rapidly modulating
the total connected load~\cite{ctap+11,wu01,zlmz+21,kgma21,smp18,hcb19}. Several devices have been identified as likely
targets for such an attack including smart meters with integrated remote disconnect switches~\cite{ctap+11,anderson01},
large IoT-connected appliances~\cite{smp18,hcb19,chl20,olkd20} and electric vehicle
chargers~\cite{kgma21,zlmz+21,olkd20}.  Such attacks are hard to mitigate, and existing literature focuses on hardening
grid control systems~\cite{kgma21,lzlw+20,lam21,zlmz+21} and device firmware~\cite{mpdm+10,smp18,zb20,yomu+20} to prevent
compromise. Despite the infeasibility of perfect firmware security, there is little research on \emph{post-compromise}
mitigation approaches. A core issue with post-attack mitigation is that network connections such as internet and
cellular networks between the utility and devices on consumer premises may not work due to the attack. Thus, mitigation
strategies that involve devices on the consumer premises will need an out-of-band communication channel.

We propose a \emph{safety reset} controller that is controlled through a novel, resilient, grid-wide powerline
communication technique. Our safety reset controller can be fitted into any Smart Meter or IoT device. Its purpose is to
await an out-of-band command to put the device into a safe state (e.g.\ \emph{relay on} or \emph{light on}) that
interrupts attacker control over the device. The safety reset controller is separated from the system's main application
controller and does not have any conventional network connections to reduce attack surface and cost.

To facilitate resilient communication between the grid operator and the safety reset controller, we propose a grid-wide
broadcast channel based on grid frequency modulation (GFM). This channel can be operated by transmission system
operators (TSOs) even during black-start recovery procedures and it bridges the gap between the TSO's private control
network and consumer devices that cannot economically be equipped with other resilient communication techniques such as
satellite transceivers. To demonstrate our proposed channel, we have implemented a system that transmits error-corrected
and cryptographically secured commands through an emulated grid frequency-modulated voltage waveform to an off-the-shelf
smart meter equipped with a prototype safety reset controller based on a small off-the-shelf microcontroller.

The frequency behavior of the electrical grid can be analyzed by examining the grid as a large collection of mechanical
oscillators coupled through the grid via the electromotive force~\cite{rogers01,wcje+12}. The generators and motors that
are electromagnetically coupled through the grid's transmission lines and transformers run synchronously with each
other, with only minor localized variations in their rotation angle. The dynamic behavior of grid frequency is a direct
product of this electromechanical coupling: With increasing load, frequency drops because shafts move slower under
higher torque, and consequently with decreasing load frequency rises. Industrial control systems keep frequency close
to its nominal value over time spans of minutes or hours, but at shorter time frames the combined inertia of all
grid-connected generators and motors is what regulates frequency.

Grid frequency modulation works by quickly modulating the power of a large, grid-connected load or generator. When this
modulation is at low amplitude and high frequency, it is below the thresholds set for the grid's automated control
systems and monitoring systems and it will directly affect frequency according to the grid's inertia. GFM differs from
traditional Powerline Communication (PLC) systems in that it reaches every device within one synchronous area as the
signal is embedded into the fundamental grid frequency. Traditional PLC uses a superimposed voltage, which is quickly
attenuated across long distances. Practically speaking, using GFM a single large transmitter can cover an entire
synchronous area, while in traditional PLC hundreds or thousands of smaller transmitters would be necessary. Unlike
traditional PLC, any large industrial load that allows for fast computer control can act as a GFM transmitter.

\begin{figure}
    \centering
    \includegraphics[width=0.4\textwidth]{flowchart}
    \caption{Structural overview of our concept. 1 - Government authority or utility operations center. 2 - Emergency
    radio link. 3 - Aluminium smelter. 4 - Electrical grid. 5 - Target smart meter.}
    \label{fig_intro_flowchart}
\end{figure}

Figure~\ref{fig_intro_flowchart} shows an overview of our concept, where a large aluminium smelter has been temporarily
re-purposed as a GFM transmitter.  Two scenarios for its application are before or during a cyberattack, to stop an
attack on the electrical grid in its tracks, and after an attack while power is being restored to prevent a repeated
attack. In both scenarios, our concept is independent of telecommunication networks (such as the internet or cellular
networks) as well as broadcast systems (such as cable television or terrestrial broadcast radio) while requiring only
inexpensive signal processing hardware and no external antennas (such as are needed for satellite communication). A grid
frequency-based system can function as long as power is still available, or as soon as power is restored after the
attack. One powerful function this allows is ``flushing out'' an attacker from compromised smart meters after an attack,
before restoring smart meter internet connectivity.

Using simulations we have determined that control of a $\SI{25}{\mega\watt}$ load such as a large aluminium smelter,
load bank or photovoltaic farm would allow for the transmission of a cryptographically secured safety reset signal within
$15$ minutes. We have designed and constructed a proof-of-concept prototype receiver that demonstrates the feasibility
of decoding such signals on a resource-constrained microcontroller.

\subsection{Motivation}

Consumer devices are increasingly becoming \emph{smart}. Large numbers of IoT devices are connected through the public
internet, and in several countries internet-connected Smart Meters can disconnect entire households from the grid in
case of unpaid bills. The increasing proliferation of smart devices on the consumer side presents an opportunity to grid
operators, who rely on forecasts for the cost-optimized control of generation and power flow. The core of the
\emph{Smart Grid} vision is that utilities can now gather detailed data for more accurate consumption forecasts, and in
some cases can even adjust parameters of large devices like water heaters to smooth out load spikes.

However, this increased degree of visibility and control comes with an increased IT security risk. In this paper we
focus on scenarios where an attacker compromises a large number of grid-connected remote-controllable devices. This may
be simple smart home devices such as IoT light bulbs, but it may also include Smart Meters that are outfitted with a
remote disconnect switch as is common in some countries. By rapidly switching large numbers of such devices in a
coordinated manner, the attacker has the opportunity to de-stabilize the electrical grid. % FIXME citation

Previous work on IoT and Smart Grid security has focused on the prevention of attacks through firmware security measures.
While research on prevention is undoubtedly important, we estimate that its practical impact will be limited by the vast
diversity of implementations found in the field combined with the slow update cycles inherent to non-functional firmware
enhancements for consumer devices. We predict that it would be a Sisyphean task to secure sufficiently many devices
to deny an attacker the critical mass needed to cause trouble. For this reason, in this paper we focus on recovery after
an attack.

\subsection{Black-start recovery}

The recovery from a large-scale power outage is a complex operational challenge. Large outages are caused by cascading
failures. Since all consumers and producers that are connected to the electrical grid are physically coupled through the
electromotive force, a fault in one part of the grid affects all devices connected across the grid. To function, the
grid relies on a delicate balance between electricity generation, transmission and consumption.  When this balance is
disturbed, cascading failures can occur. A transmission line shutting off can lead other, nearby lines to overload and
shut off. Due to the electromechanical coupling of all machines connected to the grid, a generator or consumer suddenly
shutting off causes a transient in the grid's frequency. If the frequency goes too far out of bounds, protection devices
take power plants and large industrial loads offline.

The recovery from a large-scale outage requires the grid's operators to bring generators and loads back online one by
one while continuously maintaining balance between generation and consumption to avoid their protection devices shutting
them down again. To coordinate this process, transmission system operators cannot rely on the public internet or
cellular networks, as they may not work during a large-scale power outage. Instead, they maintain private communication
infrastructure using dedicated lines rented from telecommunications providers, fibers run along transmission lines, and
dedicated radio links.

To start from a complete outage, first a number of \emph{black start}-capable power stations that can start by
themselves without any external power are brought online. With their help, other power stations and consumers are
gradually brought online until a part of the grid has been restored to nominal operation. This process can be performed
simultaneously in different parts of the grid. After these \emph{islands} have been restored, they can then be joined to
restore the grid to its normal state.

\subsection{Contents}

Starting from a high level architecture, we have carried out simulations of our concept's performance under real-world
conditions using measured grid frequency data. Based on these simulations we implemented an end-to-end prototype of our
proposed safety reset controller as part of a realistic smart meter demonstrator. Finally, we experimentally validated
our results based on a simulated mains voltage signal and we will conclude with an outline of further steps towards a
practical implementation.

This work contains the following contributions:
\begin{enumerate}[topsep=4pt]
    \item We introduce Grid Frequency Modulation (GFM) as a communication primitive. % FIXME done before in that one paper
    \item We elaborate the fundamental physics underlying GFM and theorize on the constraints of a practical
        implementation.
    \item We design a communication system based on GFM.
    \item We carry out extensive simulations of our system to determine its performance characteristics.
\end{enumerate}

\subsection{Notation}

To a computer scientist there is one confusing aspect to the theory of grid frequency modulation. GFM can be seen as a
frequency modulation (FM) with a baseband signal in the band below approximately $f_m = \SI{5}{\hertz}$ that is
modulated on top of a carrier signal at $f_c = \SI{50}{\hertz}$ in case of the European electrical grid. The frequency
deviation $f_\Delta$ by which the modulated carrier deviates from its nominal value of $f_c$ is very small at only a few
millihertz.

When grid frequency is measured by first digitizing the mains voltage waveform, then de-modulating digitally, the FM's
SNR is very high and is dominated by the ADC's quantization noise and nearby mains voltage noise sources such as
resistive droop due to large inrush current of nearby machines.

Note that the carrier signal at $f_c$ and the modulation signal at $f_m$ are both measured in hertz. To disambiguate
them, in this paper we will use \textbf{bold} letters to refer to the carrier waveform $\mathbf{U}$ or frequency
$\mathbf{f_c}$ as well as its deviation $\mathbf{f_\Delta}$, and we will use normal weight for the actual modulation
signal and its properties such as $f_m$.
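
As a sketch of this convention, the modulated mains voltage can be written in the standard form of a frequency-modulated
signal. The peak amplitude $\hat{U}$ and the normalized baseband signal $x(t)$ with $|x(t)| \le 1$ are symbols that we
assume here for illustration only, and the constant amplitude is an idealization:
\begin{equation*}
    \mathbf{U}(t) = \hat{U} \cos\left(2\pi \mathbf{f_c} t + 2\pi \mathbf{f_\Delta} \int_0^t x(\tau) \,\mathrm{d}\tau\right)
\end{equation*}
Under this assumption the instantaneous grid frequency is $\mathbf{f_c} + \mathbf{f_\Delta}\, x(t)$, while the spectrum
of $x(t)$ itself is confined to the band below approximately $f_m$.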

\section{Related work} 
\label{sec_related_work}

Previous work has analyzed Smart Grid security from numerous angles and made several suggestions towards its
improvement.  Apart from the critical location that Smart Grid devices occupy, they are computer systems like many
others.  Thus, for IT security purposes the Smart Grid is simply an aggregation of embedded control and measurement
devices that are part of a large control system. These devices share the same security concerns that apply to embedded
systems in general. 

\subsection{Smart Meter Security}

Programmers have been struggling for decades with issues such as input validation~\cite{leveson01}, and the same
issues raise security concerns in smart grid scenarios as well~\cite{mo01, lee01}. In the smart grid, however, two
complicating factors are present. First, many components are embedded systems and as such inherently hard to update.
Second, the smart grid and its control algorithms act as a large (partially) distributed system, making problems such as
input validation or authentication harder~\cite{blaze01} and adding a host of distributed systems problems on
top~\cite{lamport01}.

Given that the electrical grid is essential infrastructure, these issues are significant. Attacks on the electrical grid
may have grave consequences~\cite{anderson01,lee01} while the long replacement cycles of various components make the
system slow to adapt. Thus, components for the smart grid need to be built to a higher standard of security than e.g.\
IoT devices to live up to well-funded attackers decades down the road.  Another implication of their long service life
is that their agility w.r.t.\ post-hoc mitigations through firmware updates is limited.

%Another fundamental challenge in smart grid implementations is the central role of smart electricity meters in the
%smart grid ecosystem. Smart meters are used both for highly-granular load measurement and in some countries also for
%load switching~\cite{zheng01}.
Smart electricity meters are consumer devices built down to a price. Firmware security research and development budgets
are limited by the high degree of market fragmentation that is caused by mutually incompatible national smart metering
standards. Landis+Gyr, a large utility meter manufacturer, state in their 2019 annual report that they invested
\SI{36}{\percent} of their total R\&D budget on embedded software while spending only \SI{24}{\percent} on hardware
R\&D~\cite{landisgyr01,landisgyr02}, which indicates tension between firmware security and the manufacturer's bottom
line. 

% FIXME more sources!

\subsection{The state of the art in embedded security}

Embedded software security has proven challenging compared to the security of larger computer systems. On one hand,
embedded devices usually run highly customized firmware that is rarely updated. On the other hand, embedded devices
often lack security mechanisms such as memory management units that are found in higher-power devices. As a result of
these factors, even well-funded companies continue to have trouble securing their embedded systems. An example of this
difficulty is the 2019 flaw in Apple's iPhone SoC first-stage ROM bootloader that allows for the full compromise of any
iPhone older than iPhone X given physical access to the device~\cite{heise01}. iPhone 8, one of the affected models, was
still being manufactured and sold by Apple until April 2020.  In another instance in 2016, researchers found multiple
flaws in Samsung's implementation of ARM TrustZone ``secure world'' firmware that Samsung used for their own mobile
phone SoCs.  The flaws they found were both architectural flaws such as secret user input being passed through untrusted
userspace processes as well as cryptographic flaws such as
CVE-2016-1919\footnote{\url{http://cve.circl.lu/cve/CVE-2016-1919}}~\cite{kanonov01}.  In a similar way, in 2014,
researchers found an integer overflow flaw in the low-level code handling untrusted input in Qualcomm's QSEE
firmware\footnote{For an overview of ARM TrustZone including a survey of academic work and past
security vulnerabilities of TrustZone-based firmware see~\cite{pinto01}.}~\cite{rosenberg01}.

If even companies with R\&D budgets that rival some countries' national budgets have trouble securing the embedded
software stacks of their mass-market consumer devices, what is a much smaller smart meter manufacturer
to do?  Especially if national standards mandate complex protocols such as TLS that are difficult to implement
correctly~\cite{georgiev01}, this manufacturer will be short on options to secure their product.

\subsection{Attack surface in the smart grid}

From the incidents we outlined in the previous paragraphs we conclude that in smart metering technology, market
incentives do not currently provide the conditions for a level of device security that will reliably last for decades
after deployment. Considering this tension, in this paragraph we examine the cyberphysical risks that arise from attacks
on the smart grid in the first place. These risks arise at three different infrastructure levels.

The first level is that of attacks on centralized control systems. This type of attack is often cited in popular
discourse and to our knowledge is the only type of attack against an electric grid that has ever been carried out in
practice at scale~\cite{lee01}. Despite their severity, these attacks do not pose a strictly \emph{scientific} challenge
since they are generic to any industrial control system. Their causes and countermeasures are generally well-understood
and the hardest challenge in their prevention is likely to be budgetary constraints.

Beyond the centralized control systems, the next target for an attacker may be the communication links between those
control systems and other smart grid components. While in some countries such as Italy special-purpose systems such as
PLC are common~\cite{ec03}, overall, IP-based technologies have proliferated according to the larger trend towards
IP-based communications.  This proliferation of IP links brings along the possibility for the application of generic
network security measures from the IP world to the smart grid domain. In this way, a standardized, IP-based protocol
stack unlocks decades of network security improvements at little cost.

Beyond these layers towards the core of the smart grid's control infrastructure, an attacker might also corrupt the
network from the edges and target the endpoint devices themselves. The large scale deployment of networked smart meters
creates an environment that is favorable to such attacks.
% FIXME cite RECESSIM landis+gyr protocol hacking wiki/youtube

\subsection{Cyberphysical threats in the smart grid}

Assuming that an attacker has compromised devices on any of these levels of smart grid infrastructure, what could they
do with their newly gained power? The obvious action would be to switch off everything. Of all scenarios,
this is the most likely in practice, as it is exactly what happened in the Russian cyberattacks on the Ukrainian
grid~\cite{lee01}. It is also the easiest to mitigate since the vulnerable components are few and centralized.
Mitigations include the installation of fail-safes as well as a defense-in-depth approach to hardening the grid's
cyber infrastructure.

Another possible action for an attacker would be to forge energy measurements in an attempt to cause financial mayhem.
Both individual consumers as well as the utility could be targeted by such an attack. While such an attack might have
localized success, larger-scale discrepancies will likely quickly be caught by monitoring systems. For example, if a
large number of meters in an area systematically under- or over-reported their energy readings, meter readings across
the affected area would no longer add up with those of monitoring devices in other locations in the transmission and
distribution grid.

In some countries, smart meter functionality goes beyond mere monitoring devices and also includes remotely controlled
switches. There are two types of these switches: Switches to support \emph{Demand-Side Management} (DSM) and cut-off
switches that are used to punish defaulting customers. Demand-Side Management is when a grid operator can remotely
control the timing of large, non-time-critical loads on the customer's premises~\cite{dzung01}. A typical example of this
is a customer using an electric water heater: The heater is outfitted with a large hot water storage tank and is
hooked up to the utility's DSM system. The customer does not care when exactly their water is heated as long
as there is enough of it, and the utility offers them cheaper rates for the electricity used for heating in exchange for
control over its precise timing.  The utility uses this control to even out peaks in the consumption/production
imbalance, remotely enabling DSM systems during off-peak times and disabling them during peak hours.  In contrast to
DSM, cut-off switches are switches placed in between the grid and the entire customer's household such that the utility
can disconnect non-paying customers without incurring the expense of sending a technician to the customer's premises.
Unlike DSM systems, cut-off switches are not opt-in~\cite{anderson01,temple01}.  An attack that uses cut-off switches
would obviously immediately cause severe mayhem. Attacks on DSM may have more limited immediate impact as affected
consumers may not notice an interruption for several hours.

Instead of switching off loads outright, an attack employing DSM switches (and potentially also cut-off switches) could
choose to target the grid's stability.  By synchronizing many compromised smart meters to switch on and off a large
load capacity, an attacker might cause the entire electrical grid to oscillate~\cite{kosut01,wu01,kim01}. As a large
system of coupled mechanical systems, the electrical grid exhibits a complex frequency-domain behavior.  Resonance
effects, colloquially called ``modes'', are well-studied in power system
engineering~\cite{rogers01,grebe01,entsoe01,crastan03}. As they can cause issues even under normal operating conditions,
a large effort is invested in dampening these resonances. However, fully eliminating them under changing load conditions
may not be achievable.

\subsection{Communication Channels on the Grid}

A core part of intervening in any such cyberattack is the ability to communicate remedial actions to the devices
under attack.  There are a number of well-established technologies for communication on or along power lines. We can
distinguish three basic system categories: systems using separate wires (such as DSL over landline telephone wiring),
wireless radio systems (such as LTE) and \emph{Power Line Communication} (PLC) systems that reuse the existing mains
wiring and superimpose data transmissions onto the \SI{50}{\hertz} mains sine~\cite{gungor01,kabalci01}.

During a large-scale cyberattack, availability of internet and cellular connectivity cannot be relied upon. An attacker
may already have disabled such systems in a separate attack, or they may go down along with parts of the electrical
grid. Traditional powerline communication systems or a utility's proprietary wireless systems would work, but at a range
of no more than several tens of kilometers, reaching all meters in a country would require a large upfront infrastructure
investment.

\section{Grid Frequency as a Communication Channel}

We propose to approach the problem of broadcasting an emergency signal to all smart meters within a synchronous area by
using grid frequency as a communication channel.  Despite the technological complexity of the grid, the physics
underlying its response to changes in load and generation is surprisingly simple. Individual machines (loads and
generators) can be approximated by a small number of differential equations and the entire grid can be modelled by
aggregating these approximations into a large system of nonlinear differential equations. As a consequence, small signal
changes in generation/consumption power balance cause an approximately proportional change in
frequency~\cite{kundur01,crastan03,entsoe02,entsoe04}.  This \emph{Power Frequency Characteristic} is about
\SI{25}{\giga\watt\per\hertz} for the continental European synchronous area according to European electricity grid
authority ENTSO-E.

If we modulate the power consumption of a large load such as a multi-megawatt aluminium smelter, this modulation will
result in a small change in frequency according to this characteristic. As long as we stay within the operational limits
set by ENTSO-E~\cite{entsoe02,entsoe03}, this change will not degrade the operation of other parts of the grid. The
advantages of grid frequency modulation are the fact that a single transmitter can cover an entire synchronous area as
well as low receiver hardware complexity.
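
As a rough worked example of this characteristic, assume a steady-state, small-signal response and write $K$ for the
power frequency characteristic and $\Delta P$ for the magnitude of the modulated power (both symbols are introduced here
for illustration only). A modulation amplitude of \SI{25}{\mega\watt} then corresponds to a frequency deviation of
approximately
\begin{equation*}
    \mathbf{f_\Delta} \approx \frac{\Delta P}{K}
    = \frac{\SI{25}{\mega\watt}}{\SI{25}{\giga\watt\per\hertz}}
    = \SI{1}{\milli\hertz}.
\end{equation*}
This estimate ignores the dynamics of the grid's inertia and control systems, so the deviation that is actually observed
for a fast modulation may differ from this value.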

To the best of the authors' knowledge, grid frequency modulation has only ever been proposed as a communication channel
at very small scales in microgrids before~\cite{urtasun01} and has not yet been considered for large-scale application.

Compared to traditional channels such as DSL, LTE or LoRaWAN, grid frequency as a communication channel has a large
resiliency advantage: If there is power, a grid frequency modulation system is operational. Both DSL and LTE systems not
only require power but also require large amounts of centralized infrastructure to operate. Mesh networks such as
LoRaWAN can cover short distances up to $\SI{20}{\kilo\meter}$ without requiring infrastructure to be available, but for
longer distances LoRaWAN relies on the public internet for its network backbone. Additionally, systems such as DSL, LTE
and LoRaWAN are built around a point-to-point communication model and usually do not support a generic broadcast
primitive. During times when a large number of devices must be reached simultaneously this can lead to congestion of
local cellular towers or gateways.
Therefore, during an ongoing cyberattack, grid frequency is promising as a communication channel as only a single
transmitter facility must be operational for it to function, and this single transmitter can reach all connected devices
simultaneously. After a power outage, it can function as soon as electrical power is restored, even while the public
internet and mobile networks are still offline and it is unaffected by cyberattacks that target telecommunication
networks.

\subsection{Characterizing Grid Frequency}
\label{grid-freq-characterization}

To collect ground truth measurements for our analysis of grid frequency as a communication channel, we developed a
device to safely record mains voltage waveforms.  Our system consists of an \texttt{STM32F030F4P6} ARM Cortex M0
microcontroller that records mains voltage using its internal 12-bit ADC and transmits measured values through a
galvanically isolated USB/serial bridge to a host computer. We derive our system's sampling clock from a crystal oven to
avoid frequency measurement noise due to thermal drift of a regular crystal: \SI{1}{ppm} of crystal drift would cause a
grid frequency error of $\SI{50}{\micro\hertz}$. We compared our oven-stabilized clock against a GPS 1 pps reference and
found that over a time span of 20 minutes both stayed stable within 5 ppb of each other, which corresponds to the drift
specification of a typical crystal oven.
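
The arithmetic behind these figures can be sketched as follows, assuming a nominal grid frequency of
$f_0 = \SI{50}{\hertz}$ and a relative clock error $\delta$ that translates one-to-one into a relative error of the
measured frequency ($f_0$ and $\delta$ are notation introduced here for illustration):
\[
    \Delta f = f_0 \cdot \delta,
    \qquad
    \SI{50}{\hertz} \cdot 10^{-6} = \SI{50}{\micro\hertz},
    \qquad
    \SI{50}{\hertz} \cdot 5 \cdot 10^{-9} = \SI{0.25}{\micro\hertz}
\]
The first value corresponds to \SI{1}{ppm} of drift in a regular crystal, the second to the 5 ppb agreement we observed
against the GPS reference.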

In utility SCADA systems, Phasor Measurement Units (PMUs, also called \emph{synchrophasors}) are used to precisely
measure grid frequency among other parameters.  Details on the inner workings of commercial phasor measurement units are
scarce but there is a large amount of academic research on phasor measurement.  PMUs employ complex signal analysis algorithms
to provide fast and precise measurements even when given a heavily distorted input
signal~\cite{narduzzi01,derviskadic01,belega01}.

In our application, we do not need the same level of precision. For the sake of simplicity, we use the universal
frequency estimation approach of Gasior and Gonzalez~\cite{gasior01}. In this algorithm, the windowed input signal is
processed using a Discrete Fourier Transform (DFT), then the signal's fundamental frequency is interpolated by fitting a
wavelet to the largest peak in the DFT result. The bias parameter of this curve fit is an accurate estimation of the
signal's fundamental frequency. This algorithm is similar to the interpolated DFT algorithm referenced by phasor
measurement literature~\cite{borkowski01}.
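
For illustration, one well-known closed-form variant of such an interpolation is three-point Gaussian interpolation of
the DFT magnitude peak. The following is only a sketch under the assumption of a Gaussian window; the symbols $X[k]$,
$N$, $f_s$ and $\delta$ are notation introduced here, and the curve fit described above may differ in detail. With $k$
the index of the largest DFT magnitude bin, $N$ the number of samples in the window and $f_s$ the sampling rate, the
estimate is
\[
    \hat{f} = \left(k + \delta\right) \frac{f_s}{N},
    \qquad
    \delta = \frac{\ln\left( |X[k+1]| \,/\, |X[k-1]| \right)}
                  {2 \ln\left( |X[k]|^2 \,/\, \left( |X[k+1]| \, |X[k-1]| \right) \right)}
\]
The correction $\delta$ is the vertex of a parabola through the logarithms of the three magnitudes around the peak, so
it is zero when both neighbouring bins are equal and moves towards the larger neighbour otherwise.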

\begin{figure}
    \centering
    \includegraphics[width=0.45\textwidth]{../notebooks/fig_out/freq_meas_spectrum_new}
    \caption{The spectrum of grid frequency variations measured over 24 hours. The raw spectrum is shown in gray, and a
    smoothed spectrum is shown in red. The blue line is inversely proportional to frequency and illustrates the $1/f$
    nature of the spectrum. Distinctive peaks in the spectrum are marked with red crosses, and their locations
    are given on the bottom of the diagram.}
    \label{fig_freq_spec}
\end{figure}

Using our grid frequency recorder, we performed a two-day measurement series of grid frequency.
Figure~\ref{fig_freq_spec} shows the frequency spectrum of grid frequency over this two-day span. In this spectrum, we
observe a number of features. Across the frequency range, we observe a broad $1/f$ noise. Below a period of
$\SI{10}{\second}$, this $1/f$ noise dips to a flat noise floor. We estimate that this low-noise region is caused by the
self-regulating effect of loads. %FIXME citation
Above a $\SI{10}{\second}$ period, primary control is activated and
thus the $1/f$ noise we observe is the result of the interaction between primary control and consumer demand. On top of
this $1/f$ behavior, the spectrum shows several sharp peaks at time intervals with a ``round'' number such as
$\SI{10}{\second}$, $\SI{60}{\second}$ or multiples of $\SI{300}{\second}$. These peaks are due to loads turning on or
off depending on wall-clock time. Besides the narrow peaks caused by this effect, we can also observe two wider bumps at
$\SI{7.0}{\second}$ and $\SI{4.7}{\second}$. These bumps closely correlate with the continental European synchronous
area's oscillation modes at $\SI{0.15}{\hertz}$ (east-west) and $\SI{0.25}{\hertz}$ (north-south)~\cite{grebe01}.

\section{Grid Frequency Modulation}

A transmitter for grid frequency modulation would be a controllable load of several megawatts that is located centrally
within the grid. A baseline implementation would be a spool of wire submerged in a body of cooling liquid (such as a
small lake) which is powered from a thyristor rectifier bank. Compared to this baseline solution, hardware and
maintenance investment can be decreased by repurposing a large industrial load as a transmitter. Going through a
list of energy-intensive industries in Europe~\cite{ec01}, we found that an aluminium smelter would be a good candidate.
In aluminium smelting, aluminium is electrolytically extracted from alumina solution. High-voltage mains power is
transformed, rectified and fed into about 100 series-connected electrolytic cells forming a \emph{potline}. Inside these
pots, alumina is dissolved in molten cryolite electrolyte at about \SI{1000}{\degreeCelsius} and electrolysis is
performed using a current of tens or hundreds of kiloamperes. The resulting pure aluminium settles at the bottom of the
cell and is tapped off for further processing.

Aluminium smelters are operated around the clock, and due to the high financial stakes their behavior under power
outages has been carefully characterized. Power outages of tens of minutes up to two hours reportedly do
not cause problems in aluminium potlines~\cite{eisma01,oye01}. Recently, even techniques for intentional power modulation
without affecting cell lifetime or product quality have been developed to take advantage of variable energy
prices~\cite{duessel01,eisma01,depree01}.  An aluminium plant's power supply is controlled to constantly keep all
smelter cells under optimal operating conditions. Modern power supply systems employ large banks of diodes or thyristors to
rectify low-voltage AC to DC to be fed into the potline~\cite{ayoub01}. Potline voltage is controlled through a
combination of a tap changer and a transductor.  Individual cell voltages are controlled by changing the physical
distance between anode and cathode.  In this setup, power can be electronically modulated using the thyristor
rectifier. Since the system does not have any mechanical inertia, high modulation rates are possible.

In~\cite{depree01}, the authors describe a setup where a large aluminium smelter in continental Europe is used as
primary control reserve for frequency \emph{regulation}. In this setup, a rise time of $\SI{15}{\second}$ was achieved
to meet the $\SI{30}{\second}$ requirement posed by local standards for primary control. In their conclusion, the
authors note that for their system, an energy storage capacity of $\SI{7.7}{\giga\watt\hour}$ is possible if all plants
of a single operator are used. Given the maximum modulation depth of $\SI{100}{\percent}$ for up to one hour that is
mentioned by the authors, this results in an effective modulation power of $\SI{7.7}{\giga\watt}$. Over a longer
timespan of $\SI{48}{\hour}$, they have demonstrated a $\SI{33}{\percent}$ modulation depth which would correspond to a
modulation power of $\SI{2.5}{\giga\watt}$. We conclude that a modulation of part of an aluminium smelter's power
consumption is possible at no significant production impact and at low infrastructure cost. Aluminium smelters are
already connected to the grid in a way that they do not pose a danger to other nearby consumers when they turn off or on
parts of the plant, as this is commonplace during routine maintenance activities.
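
The power figures above follow from a simple back-of-the-envelope calculation. As a sketch, assume that the quoted
energy storage capacity $E$ is used up evenly over a duration $T$ at full modulation depth, and that a reduced
modulation depth $d$ scales the available power linearly ($E$, $T$, $d$ and $P_{\mathrm{mod}}$ are notation introduced
here for illustration):
\[
    P_{\mathrm{mod}} = \frac{E}{T} = \frac{\SI{7.7}{\giga\watt\hour}}{\SI{1}{\hour}} = \SI{7.7}{\giga\watt},
    \qquad
    d \cdot P_{\mathrm{mod}} = 0.33 \cdot \SI{7.7}{\giga\watt} \approx \SI{2.5}{\giga\watt}
\]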

\subsection{Parametrizing Modulation for GFM}

Given the grid characteristics we measured using our custom waveform recorder and using a model of our transmitter, we
can derive parameters for the modulation of our broadcast system.  The overall network power-frequency characteristic of
the continental European synchronous area is about $\SI{25}{\giga\watt\per\hertz}$~\cite{entsoe02}. Thus, the main
challenge for a GFM system will be poor SNR due to low transmission power. A second layer of modulation yielding some
modulation gain beyond the basic amplitude modulation of the transmitter will be necessary to achieve sufficient overall
SNR.

The grid's frequency noise has significant localized peaks that might interfere with this modulation. Further
complicating things are the oscillation modes. A GFM system must be designed to avoid exciting these modes. However,
since these modes are not static, a modulation method that is designed around a specific assumption of their location
would not be future-proof. Given these concerns, the optimal second-level modulation technique for GFM is a
spread-spectrum technique. Spreading signal energy throughout a wide band both minimizes the impact of local noise
spikes and reduces the risk of mode excitation, since spread-spectrum techniques minimize energy in any particular
sub-band.

In this paper, we chose to perform simulations using Direct Sequence Spread Spectrum for its simple implementation and
good overall performance. DSSS chip timing should be as fast as the transmitter's physics allow to exploit the low-noise
region between $\SI{0.2}{\hertz}$ and $\SI{2.0}{\hertz}$ in Figure~\ref{fig_freq_spec}. Going past
$\approx\SI{2}{\hertz}$ would complicate frequency measurement at the receiver side.
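
In DSSS, each data symbol is transmitted as a sequence of $L$ shorter elements called \emph{chips}. As a rough,
idealized estimate of the modulation gain this provides, assume white noise and a spreading sequence of length
$L = 2^n - 1$ generated from $n$-bit shift registers. Neither assumption holds exactly here, since the measured noise is
closer to $1/f$, and $L$, $n$ and $G$ are notation introduced for this sketch. Correlating the received signal against
the known sequence then improves SNR by
\[
    G = 10 \log_{10}\left(L\right)\,\si{\decibel},
    \qquad
    L = 2^n - 1
\]
For $n = 5$ this gives $L = 31$ chips and $G \approx \SI{14.9}{\decibel}$. Each additional bit of sequence depth roughly
doubles $L$ and adds about $\SI{3}{\decibel}$, at the cost of doubling symbol duration.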

We simulated a proof-of-concept modulator and demodulator using data captured from our grid frequency sensor. Our
simulations covered a range of parameters in modulation amplitude, DSSS sequence bit depth, chip duration and detection
threshold. Figure~\ref{fig_ser_nbits} shows our simulation results for symbol error rate (SER) as a function of
modulation amplitude with Gold sequences of several bit depths. From these graphs we conclude that the range of
practical modulation amplitudes starts at approximately $\SI{1}{\milli\hertz}$, which corresponds to a modulation power
of approximately $\SI{25}{\mega\watt}$~\cite{entsoe02}.  Figure~\ref{fig_ser_thf} shows SER against detection threshold
relative to background noise. Figure~\ref{fig_ser_chip} shows SER against chip duration for a given fixed symbol length.
As expected from looking at our measured grid frequency noise spectrum, performance is best for short chip durations and
worsens for longer chip durations since shorter chip durations move our signal's bandwidth into the lower-noise region
from $\SI{0.2}{\hertz}$ to $\SI{2}{\hertz}$.
%FIXME introduce term "chip" somewhere
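
The conversion between modulation amplitude and modulation power used above can be made explicit. This sketch assumes
the linear power-frequency characteristic of about $\SI{25}{\giga\watt\per\hertz}$ and neglects all dynamics; the
symbols $\lambda$, $\Delta f$ and $\Delta P$ are notation introduced here for illustration.
\[
    \Delta P = \lambda \cdot \Delta f
             = \SI{25}{\giga\watt\per\hertz} \cdot \SI{1}{\milli\hertz}
             = \SI{25}{\mega\watt}
\]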

\begin{figure}
    \centering
    \includegraphics[width=0.45\textwidth]{../notebooks/fig_out/dsss_gold_nbits_overview}
    \caption{Symbol Error Rate as a function of modulation amplitude for Gold sequences of several lengths.}
    \label{fig_ser_nbits}
\end{figure}

\begin{figure}
    \centering
    \hspace*{-1cm}\includegraphics[width=0.5\textwidth]{../notebooks/fig_out/dsss_thf_amplitude_5678}
    \caption{SER vs.\ Amplitude and detection threshold. Detection threshold is set as a factor of background noise
    level.}
    \label{fig_ser_thf}
\end{figure}

\begin{figure}
    \centering
    \hspace*{-1cm}\includegraphics[width=0.5\textwidth]{../notebooks/fig_out/chip_duration_sensitivity_6}
    \vspace*{-1cm}
    \caption{SER vs.\ DSSS chip duration.}
    \label{fig_ser_chip}
\end{figure}

\subsection{Parametrizing a proof-of-concept ``Safety Reset'' System Based on GFM}

%FIXME introduce scenario
Taking these modulation parameters as a starting point, we proceeded to create a proof-of-concept smart meter emergency
reset system. On top of the modulation described in the previous paragraphs we layered simple Reed-Solomon error
correction~\cite{mackay01} and some cryptography. The goal of our PoC cryptographic implementation was to allow the
sender of an emergency reset broadcast to authorize a reset command to all listening smart meters. An additional
constraint of our setting is that due to the extremely slow communication channel all messages should be kept as short
as possible. The solution we chose for our PoC is a simplistic hash chain using the approach from the Lamport and
Winternitz One-time Signature (OTS) schemes. Informally, the private key is a random bitstring. The public key is
generated by recursively applying a hash function to this key a number of times. Each smart meter reset command is then
authorized by disclosing subsequent elements of this series. Unwinding the hash chain from the public key at the end of
the chain towards the private key at its beginning, at each step a receiver can validate the current command by checking
that it corresponds to the previously unknown input of the current step of the hash chain. Replay attacks are prevented
by recording the most recent valid command. Key revocation is supported by designating the last key in the chain as a
\emph{revocation key} upon whose reception the client devices advance their local hash ratchet without taking further
action.  This simple scheme does not afford much functionality but it results in very short messages and removes the
need for computationally expensive public key cryptography inside the smart meter.
% FIXME add more precise/formal description of crypto
% FIXME add description of targeting/scope function?
% FIXME somewhere above descirbe entire reset system architecture????!!!
% FIXME add description of disarm message (replay protection)
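
A slightly more formal sketch of this hash chain follows. The notation ($H$, $k_i$, $n$, $c$, $m$) is introduced here
for illustration, and details such as truncation of the hash output and the handling of the revocation key are omitted.
Let $H$ be a cryptographic hash function and $k_0$ the random private key. The chain is defined by
\[
    k_i = H\left(k_{i-1}\right), \qquad i = 1, \ldots, n
\]
and the public key $k_n$ is provisioned in every receiver. The $j$-th command discloses $k_{n-j}$. A receiver that
stores the most recently accepted value $c$, initially $c = k_n$, accepts a received value $m$ if and only if
\[
    H\left(m\right) = c
\]
and then updates $c \leftarrow m$. Since $c$ only ever moves towards the start of the chain, a previously disclosed
value no longer satisfies this check, which is what prevents replay.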

\subsection{Experimental results}

\begin{figure}
    \centering
    \includegraphics[width=0.45\textwidth]{prototype.jpg}
    \caption{The completed prototype setup. The board on the left is the safety reset microcontroller. It is connected
    to the smart meter in the middle through an adapter board. The top left contains a USB hub with debug interfaces to
    the reset microcontroller. The cables on the bottom left are the debug USB cable and the \SI{3.5}{\milli\meter}
    audio cable for the simulated mains voltage input.}
    \label{fig_proto_pic}
\end{figure}

For a realistic proof of concept, we decided to implement our signal processing chain from DSSS demodulator through
error correction up to our simple cryptography layer in microcontroller firmware and demonstrate this firmware on actual
smart meter hardware, shown in Figure~\ref{fig_proto_pic}. In our proof of concept a safety reset controller is
connected to the main application microcontroller of a smart meter. The reset controller is tasked with listening for
authenticated reset commands on the voltage waveform, and on reception of such a command resetting the smart meter
application controller by flashing a known-good firmware image to its memory.

The signal processing chain of our PoC is shown in Figure~\ref{fig_demo_sig_schema}. To interoperate with existing
implementations of SHA-512 and Reed-Solomon decoding, this implementation was written in the C programming language. To
demonstrate an application close to a field implementation, we chose an Easymeter \texttt{Q3DA1002} smart meter as our
reset target. This model is popular in the German market and readily available second-hand. The meter consists of three
isolated metering ASICs connected to a data logging and display PCB through infrared optical links. To demonstrate the
safety reset's firmware reset functionality, we connected our safety reset microcontroller to the Texas Instruments
\texttt{MSP430} microcontroller on the meter's display and data logging board through the JTAG debug interface that the
board's vendor had conveniently left accessible. We ported part of
\texttt{mspdebug}\footnote{\url{https://dlbeer.co.nz/mspdebug/}} to drive the meter microcontroller's JTAG interface and
wrote a piece of demonstrator code that overwrites the meter's firmware with one that displays an identifying string on
the meter's display after boot-up.

\begin{figure}
    \centering
    \includegraphics[width=0.45\textwidth]{prototype_schema}
    \caption{The signal processing chain of our demonstrator.}
    \label{fig_demo_sig_schema}
\end{figure}

To measure grid frequency in our demonstrator, we ported the same code we used in
Section~\ref{grid-freq-characterization} to our demonstrator, again using the voltage measured using the
microcontroller's internal ADC but using a regular crystal instead of a crystal oven for the microcontroller's system
clock.  Since we did not have an aluminium smelter ready, we decided to feed our proof-of-concept reset controller with
an emulated grid voltage sine wave from a computer's headphone jack. Where in a real application this microcontroller
would take ADC readings of input mains voltage divided down by a long resistive divider chain, we instead feed the ADC
from a $\SI{3.5}{\milli\meter}$ audio input. For operational safety, we disconnected the meter microcontroller from its
grid-referenced capacitive dropper power supply and connected it to our reset controller's debug USB power supply.

We performed several successful experiments using a signature truncated to 120 bits and a 5-bit DSSS sequence. Taking the
sign bit into account, the length of the encoded signature is 20 DSSS symbols. On top of this, we used Reed-Solomon error
correction at a 2:1 ratio, inflating total message length to 30 DSSS symbols. At the \SI{1}{\second} chip duration that
we also used in our simulations, this equates to an overall transmission duration of approximately \SI{15}{\minute}. To
give the demodulator some time to settle and to produce more realistic conditions of signal reception, we padded the
modulated signal with unmodulated noise on both ends.
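
The message length and transmission duration can be reproduced with the following arithmetic. As assumptions for this
sketch, each DSSS symbol carries 5 sequence-selection bits plus one sign bit, the 2:1 ratio is read as two data symbols
per parity symbol, and a 5-bit Gold sequence has the standard length of $2^5 - 1 = 31$ chips.
\[
    \frac{120\,\mathrm{bit}}{6\,\mathrm{bit/symbol}} = 20\ \mathrm{symbols},
    \qquad
    20 \cdot \frac{3}{2} = 30\ \mathrm{symbols}
\]
\[
    30 \cdot 31 \cdot \SI{1}{\second} = \SI{930}{\second} \approx \SI{15.5}{\minute}
\]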

\section{Lessons learned}

For our proof of concept, before settling on the commercial smart meter we first tried to use an \texttt{EVM430-F6779}
smart meter evaluation kit made by Texas Instruments. This evaluation kit did not turn out well for two main reasons.
One, it shipped with half the case missing and no cover for the terminal blocks. Because of this some work was required
to get it electrically safe.  Even after mounting it in an electrically safe manner the safety reset controller
prototype would also have to be galvanically isolated to not pose an electrical safety risk since the main MCU is not
isolated from the grid and the JTAG port is also galvanically coupled. The second issue we ran into was that the
development board is based around a specific microcontroller from TI's \texttt{MSP430} series that is incompatible with
common JTAG programmers.

Our initial assumption that a development kit would be easier to program than a commercial meter did not prove to be
true. Contrary to our expectations, the commercial meter had JTAG enabled, allowing us to easily read out its stock
firmware without either reverse-engineering vendor firmware update files or circumventing code protection measures.
The fact that its firmware was only available in its compiled binary form was not much of a hindrance as it proved not
to be too complex and all we wanted to know we found with just a few hours of digging in
Ghidra\footnote{\url{https://ghidra-sre.org/}}.

In the firmware development phase, our approach of testing every module individually (e.g.\ DSSS demodulator,
Reed-Solomon decoder, grid frequency estimation) proved useful, particularly for debugging. The modular architecture
allowed us to directly compare our demodulator implementation to our Jupyter/Python prototype, where we found that our C
implementation outperformed the Python prototype. Despite the algorithm's complexity, the microcontroller C
implementation has no issues processing data in real-time due to the low sampling rate necessary.

\section{Conclusion}
\label{sec_conclusion} 
\subsection{Applicability to IoT devices}

\subsection{Discussion}
During an emergency in the electrical grid, the ability to communicate to large numbers of end-point devices is a
valuable tool for restoring normal operation. When a resilient communication channel is available, loads such as smart
meters and IoT devices can be equipped with a supervisor circuit that allows for a remote ``safety reset'' that puts the
device into a safe operating state. Using this safety reset, an attacker that uses compromised smart meters or IoT
devices to attack grid stability can be interrupted before they can conclude their attack. During recovery from an
outage, a safety reset can be used to reduce stress on the system during a black start by temporarily disabling
non-essential loads such as air conditioners.

In this paper we have developed an end-to-end design for a safety reset system that provides these capabilities.
Our novel broadcast data transmission system is based on intentional modulation of global grid frequency. Our system is
independent of normal communication networks and can operate during a cyberattack. We have shown the practical viability
of our end-to-end design through simulations. Using our purpose-designed grid frequency recorder, we can capture and
process real-time grid frequency data in an electrically safe way. We used data captured this way as the basis for
simulations of our proposed grid frequency modulation communication channel. In these simulations, our system has proven
feasible. From our simulations we conclude that a large consumer such as an aluminium smelter can be modified at a small
cost to act as an on-demand grid frequency modulation transmitter.

We have demonstrated our modulation system in a small-scale practical demonstration.  For this demonstration, we have
developed a simple cryptographic protocol ready for embedded implementation in resource-constrained systems that allows
triggering a safety reset with a response time of less than 30 minutes.  In this demonstration we use simulated grid
frequency data to trigger a commercial microcontroller to perform a firmware reset of an off-the-shelf smart meter. The
next step in our evaluation will be to conduct an experimental evaluation of our modulation scheme in collaboration with
a utility and an operator of a multi-megawatt load.

The safety reset controller does not require any peripherals except for an ADC. Thus we expect code size to be the main
factor affecting per-unit cost in an in-field deployment of our concept. At around \SI{64}{\kilo\byte}, our demonstrator
firmware implementation is viable on low-end microcontrollers. Thus, we expect safety reset controllers to be
commercially viable.

Source code and EDA designs are available at the public repository listed at the end of this document.

\bibliographystyle{plain}
\bibliography{\jobname}

\center{
    \center{This is version \texttt{\input{version.tex}\unskip} of this paper, generated on \today. The git repository
    can be found at:}

    \center{\url{https://git.jaseg.de/safety-reset.git}}
}
\end{document}