Diffstat (limited to 'ma')
-rw-r--r-- | ma/safety_reset.bib | 23
-rw-r--r-- | ma/safety_reset.tex | 1470
2 files changed, 762 insertions, 731 deletions
diff --git a/ma/safety_reset.bib b/ma/safety_reset.bib
index d1f5d2f..0302159 100644
--- a/ma/safety_reset.bib
+++ b/ma/safety_reset.bib
@@ -1619,4 +1619,27 @@ url = {https://www.heise.de/newsticker/meldung/Cambridge-Analytica-Mehrere-Untersuchungen-angekuendigt-moegliche-Billionenstrafe-fuer-Facebook-3998151.html},
}
+@Online{bigclive01,
+ author = {bigclivedotcom},
+ date = {2018-10-26},
+ title = {Inside a smart meter, and the REAL problem with them.},
+ url = {https://www.youtube.com/watch?v=G32NYQpvy8Q},
+ urldate = {2020-06-03},
+}
+
+@Online{eevblog01,
+ author = {Dave Jones},
+ date = {2013-01-08},
+ title = {EEVblog #409 - EDMI - Smart Meter Teardown},
+ url = {https://www.youtube.com/watch?v=dm-yZ1N3xmc},
+ urldate = {2020-06-03},
+}
+
+@Online{ti01,
+ author = {{National Semiconductor}},
+ date = {2006},
+ title = {Clock Conditioner Owner’s Manual},
+ url = {http://www.ti.com/lit/ug/snaa103/snaa103.pdf?ts=1591194443306},
+}
+
@Comment{jabref-meta: databaseType:biblatex;}
diff --git a/ma/safety_reset.tex b/ma/safety_reset.tex
index 7d99b2d..ea11ff9 100644
--- a/ma/safety_reset.tex
+++ b/ma/safety_reset.tex
@@ -181,7 +181,7 @@ receives commands through Direct Sequence Spread Spectrum (DSSS) modulation carr large controllable load such as an aluminium smelter. After forward error correction and cryptographic verification it re-flashes the meter's main microcontroller over the standard JTAG interface. -In this thesis, starting from a high-level architecture we have carried out extensive simulations of our proposal's +In this thesis, starting from a high level architecture we have carried out extensive simulations of our proposal's performance under real-world conditions. Based on these simulations we implemented an end-to-end prototype of our proposed safety reset controller as part of a realistic smart meter demonstrator. Finally we experimentally validated our results and we will conclude with an outline of further steps towards a practical implementation. @@ -397,12 +397,12 @@ often-cited one is utilizing the new high-resolution load data to improve load f generation efficiency. Computerizing the meter also allows for new fee models where electricity cost is no longer fixed over time but adapts to market conditions. Models such as prepayment electricity plans where the customer is automatically disconnected until they pay their bill are significantly aided by a fully electronic system that can be -controlled and monitored remotely\cite{anderson02}. A remotely controllable load switch can also be used to coerce +controlled and monitored remotely\cite{anderson02}.
A remotely controllable disconnect switch can also be used to coerce customers in situations where that was not previously economically possible\footnote{ - The swiss association of electrical utility companies in sec.\ 7.2 par.\ (2)a of their 2010 whitepaper on the - introduction of smart metering\cite{vseaes01} cynically writes that remotely controllable load switches ``lead a new - tenant to swiftly register'' with the utility company. This whitepaper completely vanished from their website some - time after publication, but the internet archive has a copy. + The swiss association of electrical utility companies in Section 7.2 Paragraph (2)a of their 2010 whitepaper on the + introduction of smart metering\cite{vseaes01} cynically writes that remotely controllable disconnect switches ``lead + a new tenant to swiftly register'' with the utility company. This whitepaper completely vanished from their website + some time after publication, but the internet archive has a copy. }. Figure \ref{fig_smgw_schema} shows a schema of a smart metering installation in a typical household\cite{stuber01}. \begin{figure} @@ -473,7 +473,7 @@ seems largely without input on their practicality or socio-technological implica Smart meters usually are built around an off-the-shelf microcontroller (microcontroller unit, MCU). Some meters use specialized smart metering system-on-chips (SoCs)\cite{ifixit01} while others use standard microcontrollers with core -metering functions implemented in external circuitry (cf.\ sec.\ \ref{sec-easymeter} where we detail the meter in our +metering functions implemented in external circuitry (cf.\ Section \ref{sec-easymeter} where we detail the meter in our demonstration setup). Specialized SoCs usually contain a segment LCD driver along with some high-resolution analog-to-digital converters for the actual measurement functions. In many smart meter designs the metering SoC is connected to another full-featured SoC acting as the modem. 
At a casual glance this might seem to be a security measure, @@ -510,9 +510,9 @@ nontechnical losses}\cite{brown01} while cynically claiming \emph{Consumer Empow allow an utility company to remotely disconnect a customer at any time\cite{anderson01}. Whereas before smart metering this required either additional hardware or an expensive site visit by a qualified technician smart meters have ushered in an era of frictionless control\footnote{ Note that in some countries such as the UK non-networked mechanical -prepayment meters did exist. In such systems the user inserts coins into a coin slot that activates a load switch at the -household's main electricity connection. These systems were non-networked and did not allow for remote control. A -disadvantage of such systems compared to modern \emph{smart} systems are the high cost of the coin acceptor and the +prepayment meters did exist. In such systems the user inserts coins into a coin slot that activates a disconnect switch +at the household's main electricity connection. These systems were non-networked and did not allow for remote control. +A disadvantage of such systems compared to modern \emph{smart} systems are the high cost of the coin acceptor and the overhead of site visits required to empty the coin box\cite{anderson02}. }. \subsection{Cryptographic coprocessors} @@ -530,8 +530,8 @@ use of a smartcard-like security module to provide transport encryption and othe services\cite{bsi-tr-03109-2,bsi-tr-03109-2-a}. During our literature review we did not find many references to similar requirements in other national standards, though this does not mean that individual manufacturers do not use smartcards for engineering reasons or due to pressure from utilities. 
The limited documentation on meter internals that we did find -such as \cite{ifixit01} suggests where no such regulation exists manufacturers and utilities likely choose to forego -such advanced measures and instead settle on simple software implementations. +such as \cite{ifixit01,bigclive01,eevblog01} suggests where no such regulation exists manufacturers and utilities likely +choose to forego such advanced measures and instead settle on simple software implementations. \subsection{Physical structure and installation} @@ -545,7 +545,7 @@ wired into the house or apartment's electrical connection. Modern smart meters are usually made with plastic cases. Ferraris meters often used cases stamped from sheet metal with glass windows on them. Smart meters now look much more like other modern electronic devices. A common construction style is to separate the case into front and back halves with both clipped or ultrasonically welded together. Ultrasonic -welding gives a robust, airtight interface that cannot easily be separated and re-connected without leaving visible +welding gives a robust, airtight interface that cannot easily be separated and reconnected without leaving visible traces, which helps with tamper evidence properties. As an industry-standard process common in various consumer goods ultrasonic welding is a cheap and accessible technology\cite{easymeter01,ifixit01}. @@ -613,12 +613,13 @@ smartcard\cite{mahlknecht01} that is entrusted with signing of measurements and encrypted communication channel with its authorities. Security of the system is certified according to a Common Criteria process. -The German specification does not include any support for load switches as they are common in some other countries +The German specification does not include any support for disconnect switches as they are common in some other countries outside of demand-side management. It only does not prohibit the installation of one behind the smart meter -installation. 
This makes it theoretically possible for a utility company to still install a load switch to disconnect a -customer, but this would be a spearate installation from the smart meter. In Germany there are significant barriers that -have to be met before a utility company may cut power to a household\cite{delaw01}. The elision of a load switch means -attacks on German meters will be limited in influence to billing irregularities and attacks using DSM equipment. +installation. This makes it theoretically possible for a utility company to still install a disconnect switch to +disconnect a customer, but this would be a spearate installation from the smart meter. In Germany there are significant +barriers that have to be met before a utility company may cut power to a household\cite{delaw01}. The elision of a +disconnect switch means attacks on German meters will be limited in influence to billing irregularities and attacks +using DSM equipment. % TODO elaborate DSM attacks vs. whole-household attacks in attacks section @@ -630,22 +631,22 @@ a consortium of distribution system operators. They integrate gateway and metrol utility-facing interface is a IEC DLMS/COSEM-based interface over cellular radio such as GPRS or LTE\cite{aubel01}. Like e.g.\ the German standard, the Dutch standard precisely specifies all communication interfaces of the meter\cite{dsmrp3}. Another parallel is that the Dutch standard also does not cover any functionality for remotely -disconnecting a household. This absence of a load switch limits attacks on Dutch smart meters, too to causing billing -irregularities. +disconnecting a household. This absence of a disconnect switch limits attacks on Dutch smart meters, too to causing +billing irregularities. \subsubsection{The UK} The UK is currently undergoing a smart metering rollout. Meters in the UK are nationally standardized to provide both Zigbee ZSE-based and IEC DLMS/COSEM connectivity. 
UK smart metering specifications are shared between electrical and gas meters. Different to other countries' specifications the UK national specifications require electrical meters to have an -integrated load switch and gas meters to have an integrated valve. In Northern Ireland most consumers use prepaid +integrated disconnect switch and gas meters to have an integrated valve. In Northern Ireland most consumers use prepaid electricity contracts\cite{anderson02}. Prepayment and credit functionality are also specified in the UK's national smart metering standard, as is remote firmware update functionality\cite{ukgov02}. Outside communications in these standards is performed through a gateway (there called \emph{communications hub}) that can be shared between several meters \cite{ukgov01,ukgov02,ukgov03,brown01,sato01}. The combination of both gas and electricity metering into one family of standards and the exceptionally large set of \emph{required} features make the UK regulations the maximalist -option among the regulations in this section. The mandatory inclusion of both load switches and remote connectivity up -to remote firmware update make it an interesting attack target\cite{anderson01}. +option among the regulations in this section. The mandatory inclusion of both disconnect switches and remote +connectivity up to remote firmware update make it an interesting attack target\cite{anderson01}. \subsubsection{Italy} @@ -784,15 +785,15 @@ with ``enormous social benefits''\cite{mcdaniel01}. They do not usually point ou A serious issue in smart metering setups is customer privacy. Even though the meter ``only'' collects aggregate energy consumption of a whole household this data is highly sensitive\cite{markham01}. This counterintuitive fact was initially overlooked in smart meter deployments leading to outrage, delays and reduced features\cite{cuijpers01}. 
The root cause -for this is that given sufficient timing resolution these aggregate measurements contain ample entropy. Through -disaggregation individual loads can be identified and through pattern matching even complex usage patterns can be -discerned with alarming accuracy\cite{greveler01}. Similar privacy issues arise in many other areas of modern life -through pervasive tracking and surveillance\cite{zuboff01}. What makes the case of smart metering worse is that even the -fig leaf of consent such practices hide behind does not apply here. If I as a citizen do not consent to Google's privacy -policy Google says I can choose not to use their service. In today's world this may not be a free choice making this -argument totally invalid, but it is at least technically possible. Smart metering on the other hand is mandated by law. -In some countries such as Germany a customer unwilling to accept the accompanying privacy violation cannot legally -evade it\cite{bmwi04}. +of this problem is that given sufficient timing resolution these aggregate measurements contain ample entropy. Through +disaggregation algorithms individual loads can be identified and through pattern matching even complex usage patterns +can be discerned with alarming accuracy\cite{greveler01}. Similar privacy issues arise in many other areas of modern +life through pervasive tracking and surveillance\cite{zuboff01}. What makes the case of smart metering worse is that +even the fig leaf of consent such practices often hide behind does not apply here. If a citizen does not consent to +Google's privacy policy Google says they can choose not to use their service. In today's world this may not be a free +choice thereby invalidating this argument but it is at least technically possible. Smart metering on the other hand is +mandated by law and depending on the law a customer unwilling to accept the accompanying privacy violation may not be +able to evade it\cite{bmwi04}. 
\subsection{Smart grid components as embedded devices} @@ -805,66 +806,68 @@ standard hardened software environment on a high-powerded embedded system (such setup) that would both increase resilience against attacks and simplify updates. Combined with the small market sizes in smart grid deployments\footnote{ Most vendors of smart electricity meters only serve a handful of markets. For the most part, smart meter development - cost lies in the meter's software % TODO cite? + cost lies in the meter's software. % TODO cite? There exist multiple competing standards applicable to various parts of a smart electricity meter. In addition, most countries have their own certification regimen\cite{cenelec01}. This complexity creates a large development burden for new market entrants\cite{perez01}. } -this produces a high cost pressure on the software development process for smart electricity meters. +this results in a high cost pressure on the software development process for smart electricity meters. \subsection{The state of the art in embedded security} Embedded software security generally is much harder than security of higher-level systems. This is due to a combination -of the unique constraints of embedded devices (hard to update, usually small quantity) and their lack of capabilities -(processing power, memory protection functions, user interface devices). Even very well-funded companies continue to -have serious problems securing their embedded systems. A spectacular example of this difficulty is the recently-exposed -flaw in Apple's iPhone SoC first-stage ROM bootloader\footnote{ +of the unique constraints of embedded devices: Among others they are hard to update and usually produced in small +quantities. They also lack capabilities compared to full computers. Processing power is limited and memory protection +functions are spartan. Even well-funded companies continue to have trouble securing their embedded +systems. 
A spectacular example of this difficulty is the recently-exposed flaw in Apple's iPhone SoC first-stage ROM +bootloader\footnote{ Modern system-on-chips integrate one or several CPUs with a multitude of peripherals, from memory and DMA controllers over 3D graphics accelerators down to general-purpose IO modules for controlling things like indicator LEDs. Most SoCs boot from one of several boot devices such as flash memory, ethernet or USB according to a - configuration set e.g. by connecting some SoC pins a certain way or set by device-internal write-only fuse bits. + configuration set by pin-strapping configuration IOs or through write-only fuse bits. Physically, one of the processing cores of the SoC (usually one of the main CPU cores) is connected such that it is - taken out of reset before all other devices, and is tasked with switching on and configuring all other devices of + taken out of reset before all other devices, and is tasked with enabling and configuring all other peripherals of the SoC. In order to run later intialization code or more advanced bootloaders, this core on startup runs a very small piece of code hard-burned into the SoC in the factory. This ROM loader initializes the most basic peripherals such as internal SRAM memory and selects a boot device for the next bootloader stage. - Apple's ROM loader performs some authorization checks, to ensure no unauthorized software is loaded. The present - flaw allows an attacker to circumvent these checks, booting code not authorized by Apple on a USB-connected iPhone, - compromising Apple's chain of trust from ROM loader to userland right at its root. -}, that allows a full compromise of any iPhone before the iPhone X. iPhone 8, one of the affected models, is still being -manufactured and sold by Apple until April 2020. In another instance in 2016 researchers found multiple flaws in the -secure-world firmware used by Samsung in their mobile phone SoCs. 
The flaws they found were both severe architectural -flaws such as secret user input being passed through untrusted userspace processes without any protection and shocking -cryptographic flaws such as CVE-2016-1919\footnote{\url{http://cve.circl.lu/cve/CVE-2016-1919}}\cite{kanonov01}. And -Samsung is not the only large multinational corporation having trouble securing their secure world firmware -implementation. In 2014 researchers found an embarrassing integer overflow flaw in the low-level code handling untrusted -input in Qualcomm's QSEE firmware\cite{rosenberg01}. For an overview of ARM TrustZone including a survey of academic -work and past security vulnerabilities of TrustZone-based firmware see \cite{pinto01}. - -If all of these very large companies have trouble securing parts of their secure embedded software stacks measuring a -mere few hundred bytes in Apple's case or a few kilobytes in Qualcomm's, what is a smart electricity meter manufacturer -to do? For their mass-market phones, these two companies have R\&D budgets that dwarf some countries' national budgets. + Apple's ROM loader measures only a few hundred bytes. It performs authorization checks to ensure only software + authorized by Apple is booted. The present flaw allows an attacker to circumvent these checks and boot their own + code on a USB-connected iPhone. This compromsies Apple's chain of trust from ROM loader to userland right at its + root. Since this is a flaw in the factory-programmed first stage read-only boot code of the SoC it cannot be patched + in the field. +}, that allows a full compromise of any iPhone before the iPhone X. iPhone 8, one of the affected models, was still +being manufactured and sold by Apple until April 2020. In another instance in 2016 researchers found multiple flaws in +the secure-world firmware used by Samsung in their mobile phone SoCs. 
The flaws they found were both severe +architectural flaws such as secret user input being passed through untrusted userspace processes without any protection +and shocking cryptographic flaws such as +CVE-2016-1919\footnote{\url{http://cve.circl.lu/cve/CVE-2016-1919}}\cite{kanonov01}. And Samsung is not the only large +multinational corporation having trouble securing their secure world firmware implementation. In 2014 researchers found +an embarrassing integer overflow flaw in the low-level code handling untrusted input in Qualcomm's QSEE +firmware\cite{rosenberg01}. For an overview of ARM TrustZone including a survey of academic work and past security +vulnerabilities of TrustZone-based firmware see \cite{pinto01}. + +For their mass-market phones these companies have R\&D budgets that dwarf some countries' national budgets. If even +they have trouble securing their secure embedded software stacks, what is a smart meter manufacturer to do? Since thorough formal verification of code is not yet within reach for either large-scale software development or code heavy in side-effects such as embedded firmware or industrial control software\cite{pariente01} the two most effective -measures for embedded security is reducing the amount of code on one hand, and labour-intensively checking and -double-checking this code on the other hand. A smart electricity manufacturer does not have a say in the former since it -is bound by the official regulations it has to comply with, and will likely not have sufficient resources for the -latter. We are left with an impasse: Manufacturers in this field likely do not have the saftey resources to keep up with -complex standards requirements. At the same time they have no option to reduce the scope of their implementation to -alleviate the burden on firmware security. +measures for embedded security are reducing the amount of code on one hand, and labour-intensively reviewing and testing +this code on the other hand. 
A smart meter manufacturer does not have a say in the former since it is bound by the +official regulations it has to comply with, and will likely not have sufficient resources for the latter. We are left +with an impasse: Manufacturers in this field likely do not have the security resources to keep up with complex standards +requirements. At the same time they have no option to reduce the scope of their implementation to alleviate the burden +on firmware security. \subsection{Attack avenues in the smart grid} If we model the smart grid as a control system responding to changes in inputs by regulating outputs, on a very high level we can see two general categories of attacks: Attacks that directly change the state of the outputs, and attacks that try to influence the outputs indirectly by changing the system's view of its inputs. The former would be an attack -such as one that shuts down a power plant to decrease generation capacity\cite{lee01}. The latter would be an attack -such as one that forges grid frequency measurements where they enter a power plant's control systems to provoke -increasing oscillation in the amount of power generated by the plant according to the control systems' -directions\cite{kosut01,wu01,kim01}. +such as shutting down a power plant to decrease generation capacity\cite{lee01}. The latter would be an attack such as +forging grid frequency measurements where they enter a power plant's control systems to provoke the control systems to +oscillate\cite{kosut01,wu01,kim01}. \subsubsection{Communication channel attacks} @@ -874,37 +877,36 @@ substations. Generally, these attacks can be mitigated by securing the aforement cryptography. IP links can be protected using TLS, and more low-level busses can be protected using more lightweight Noise\cite{perrin01}-based protocols. -Cryptographic security transforms an attackers ability to manipulate communication contents into a mere denial of -service attack. 
Thus, in addition to cryptographic security safety under DoS conditions must be ensured to ensure -continued system performance under attacks. This safety property is identical with the safety required to withstand -random outages of components, such as communications link outages due to physical damage from storms, flooding -etc\cite{sato01}. In general attacks at the meter level are hard to weaponize. Meters primarily serve billing purposes. -The use of smart meter data for load forecasting is not yet common practice. Additionally smart meter data will only be -used to refine existing forecasting models based on aggregate data collected at higher vantage points in the -distribution grid. This combination of smart metering data with more trusted aggregate data from sensors within the grid -infrastructure limits the potential impact of a data falsification attack on smart meters. It also allows the utility to -identify potentially corrupt meter readings and thus detect manipulation above a certain threshold. In order for an -attack to have more far-reaching consequences the attacker would need to compromise additional grid -infrastructure\cite{kim01,kosut01}. +Cryptographic security transforms an attackers ability to read and manipulate communication contents into a mere denial +of service attack. Thus, in addition to cryptographic security safety under DoS conditions must be ensured for continued +system performance under attacks. This safety property is identical with the safety required to withstand random outages +of components, such as communication link outages due to physical damage from storms, flooding etc\cite{sato01}. In +general attacks at the meter level are hard to weaponize. Meters primarily serve billing purposes. The use of smart +meter data for load forecasting is not yet common practice. 
Once it is this data will only be used to refine existing +forecasting models that are based on aggregate data collected at higher vantage points in the distribution grid. This +combination of smart metering data with more trusted aggregate data from sensors within the grid infrastructure limits +the potential impact of a data falsification attack on smart meters. It also allows the utility to identify potentially +corrupt meter readings and thus detect manipulation above a certain threshold. In order for an attack to have more +far-reaching consequences the attacker would need to compromise additional grid infrastructure\cite{kim01,kosut01}. \subsubsection{Exploiting centralized control systems} The type of smart grid attack most often cited in popular discourse, and to the author's knowledge the only type that -has so far been conducted in practice, is a direct attack on centralized control systems. In this attack, computer +has so far been carried out in practice, is a direct attack on centralized control systems. In this attack, computer components of control systems are compromised by the same techniques used to compromise any other kind of computer system such as spearfishing, exploiting insecure services running on internet-exposed ports and using one compromised system to compromise other systems on the same ostensably secure internal network. These attacks are very powerful as -they yield the attacker direct control over whatever outputs the control systems are controlling. If an attacker manages -to compromise the right set of control computers, they may even be able to cause a blackout\cite{lee01}. +they yield the attacker direct control over whatever outputs the compromised control systems are controlling. If an +attacker manages to compromise the right set of control computers, they may even be able to cause physical +damage\cite{lee01}. Despite their potentially large impact, these attacks are only moderately interesting from a scientific perspective. 
For -one, their mitigation mostly consists of a straightforward application of security practices well-known for decades. -Though there is room for the implementation of genuinely new, application-specific security systems in this field, the -general state of the art is lacking behind other fields of embedded security. From this background low-hanging fruit -should take priority\cite{heise02}. - -Given political will these systems can readily be fortified. There is only a comparatively small number of them and -having a technician drive to every one of them in turn to install a firmware security update is feasible. +one, their mitigation mostly consists of a straightforward application of decades-old security best practices. Though +there is room for the implementation of genuinely new, power sytems-specific security systems in this field, the general +state of the art is lacking behind other fields of embedded security. From this background low-hanging fruit should take +priority\cite{heise02}. Given political will these systems can readily be fortified. There is only a comparatively +small number of them and having a technician drive to every one of them in turn to install a firmware security update is +feasible. \subsubsection{Control function exploits} @@ -912,11 +914,11 @@ Control function exploits are attacks on the mathematical control loops used by example of this type of attack are resonance attacks as described in \cite{wu01}. In this kind of attack, inputs from peripheral sensors indicating grid load to the centralized control system are carefully modified to cause a disproportionally large oscillation in control system action. This type of attack relies on complex resonance effects -that arise when mechanical generators are electrically coupled. These resonances, coloquially called ``modes'' are +that arise when mechanical generators are electrically coupled. 
These resonances, coloquially called ``modes'', are well-studied in power system engineering\cite{rogers01,grebe01,entsoe01,crastan03}. Even disregarding modern attack scenarios, for stability electrical grids are designed with measures in place to dampen any resonances inherent to grid -structure. Still, requiring an accurate grid model these resonances are hard to analyze and unlikely to be noiticed -under normal operating conditions. +structure. These resonances are hard to analyze since they require an accurate grid model and they are unlikely to be +noticed under normal operating conditions. Mitigation of these attacks can be achieved by ensuring unmodified sensor inputs to the control systems in the first place. Carefully designing control systems not to exhibit exploitable behavior such as oscillations is also possible but @@ -924,34 +926,34 @@ harder. \subsubsection{Endpoint exploits} -One rather interesting attack on smart grid systems is one exploiting the grid's endpoint devices such as smart -electricity meters. These meters are deployed on a massive scale, with at least one meter per household on +The one to us rather interesting attack on smart grid systems is someone exploiting the grid's endpoint devices such as +smart electricity meters. These meters are deployed on a massive scale, with at least one meter per household on average\footnote{Households rarely share a meter but some households may have a separate meter for detached properties -such as a detached garage or basement.}. Once compromised, restoration to an uncompromised state can potentially be -very difficult if it requires physical access to thousands of devices hidden inaccessible in private homes. +such as a detached garage or basement.}. Once compromised, restoration to an uncompromised state can be difficult if it +requires physical access to thousands of devices in hard-to-access locations. 
-By compromising smart electricity meters, an attacker can trivially forge the distributed energy measurements these -devices perform. In a best-case scenario, this might only affect billing and lead to customers being under- or -over-charged if the attack is not noticed in time. In a less ideal scenario falsified energy measurements reported by -these devices could impede the correct operation of centralized control systems. +By compromising smart electricity meters, an attacker can forge the distributed energy measurements these devices +perform. In a best-case scenario, this might only affect billing and lead to customers being under- or over-charged if +the attack is not noticed in time. In a less ideal scenario falsified energy measurements reported by these devices +could impede the correct operation of centralized control systems. In some countries such as the UK smart meters have one additional function that is highly useful to an attacker: They -contain high-current load switches to disconnect the entire household or business in case electricity bills are left -unpaid for a certain period. In countries that use these kinds of systems on a widespread level, the load disconnect -switch is controlled by the smart meter's central microcontroller. This allows anyone compromising this -microcontroller's firmware to actuate the load switch at will. Given control over a large number of network-connected -smart meters, an attacker might thus be able to cause large-scale disruptions of power +contain high-current disconnect switches to disconnect the entire household or business in case electricity bills are +left unpaid for a certain period. In countries that use these kinds of systems on a widespread level, the load +disconnect switch is controlled by the smart meter's central microcontroller. This allows anyone compromising this +microcontroller's firmware to actuate the disconnect switch at will. 
Given control over a large number of
+network-connected smart meters, an attacker might thus be able to cause large-scale disruptions of power
 consumption\cite{anderson01,temple01}. Combined with an attack method such as the resonance attack from \cite{wu01}
-that was mentioned above, this scenario poses a serious danger to grid stability.
+that was mentioned above, this scenario poses a serious threat to grid stability.
 In places where Demand-Side Management (DSM) is common this functionality may be abused in a similar way. In DSM the smart metering system directly controls power to certain devices such as heaters. The utility can remotely control the turn-on and turn-off of these devices to smoothen out the load curve. In exchange the customer is billed a lower price
-for the energy consumed by these loads. DSM was traditionally done with de-centralized systems mostly through
-low-frequency PLC over the distribution grid. Smart metering systems no longer require large, resource-intensive
-transmitters in substations and thus potentially allow the rollout of such technology on a much wider scale than before.
+for the energy consumed by these loads. DSM was traditionally done in a federated fashion usually through low-frequency
+PLC over the distribution grid\cite{dzung01}. Smart metering systems no longer require large, resource-intensive
+transmitters in substations and bear the potential for a rollout of such technology on a much wider scale than before.
 This leads to a potentially significant role of DSM systems in the impact calculation of an attack on a smart metering
-system. DSM does not control as much load capacity as remote disconnect switches do. The attacks cited in the above
+system. DSM does not control as much load capacity as remote disconnect switches do but the attacks cited in the above
 paragraph still fundamentally apply.
 \subsection{Practical threats}
@@ -963,20 +965,20 @@ attacks is by their motivation.
Along this axis we found the following motives:
 \item[Service disruption.] An attack aimed at disrupting service could e.g.\ aim at causing a blackout. It could also take aim in a more subtle way targeting a degradation of parameters such as power quality (voltage, frequency and waveform). It could target a particular customer, geographic area or all parts of the grid.
- Possible motivations range from a bored tennage hacker to actual cyberwar\cite{cleveland01,lee01}.
+ Possible motivations range from a tennage hacker's boredom to actual cyberwar\cite{cleveland01,lee01}.
 \item[Commercial disruption.] Simple commercial motives already motivate a wide variety of attacks on grid infrastructure\cite{czechowski01}. Though generally mostly harmless from a cypersecurity point of view there are instances where these attacks put the lives of both the attacker and bystanders at grave risk\cite{anderson01}. Such attacks generally aim at the meter itself but a more sophisticated attacker might also target the
- utility's backend computer-bureaucracy.
+ utility's backend computer bureaucracy.
 \item[Data extraction.] The smart grid collects large amounts of data on both individual consumers and on an aggregate level. The privacy risk in individual consumer's data is obvious. On the web
- data collection practices from questionable to flat-out illegal have widely proliferated for various purposes up
- to manipulation of elections\cite{heise03}. Assuming criminals in this field would eschew fertile ground such as
- this due to legal or ethical concerns is optimistic. Taking the risk to individual customer's data out of the
- equation even aggregate data is still highly attractive to some. Aggregate real-time electricity usage data is a
- potential source on timely information on things such as national social events (through TV set energy
- consumption\cite{greveler01}) or just plainly the state of the economy.
+ data collection practices ranging from questionable to flat-out illegal have widely proliferated for various
+ purposes including election manipulation\cite{heise03}. Assuming criminals in this field would eschew
+ fertile ground such as this due to legal or ethical concerns is optimistic. Taking the risk to individual
+ customer's data out of the equation even aggregate data is still highly attractive to some. Aggregate real-time
+ electricity usage data is a potential source on timely information on matters such as national social events
+ (through TV set energy consumption\cite{greveler01}) or the state of the economy.
 \end{description}
 A factor to consider in all these cases is that one actor's attacks have the potential to weaken system security
@@ -984,16 +986,16 @@ overall. An attacker might add new backdoors to gain persistence or they might d
 further steps of their attack.
 In this paper we will largely concentrate on attacks of the first type because they both have the most serious
-consequences and the most motivated attackers. Attackers that may want to disrupt service include cyberwar operations of
-enemy nation states. This type of attacker is both highly skilled and highly funded.
+consequences and the most motivated attackers. Attackers that may want to disrupt service include nation state's
+cyberwar operations. This type of attacker is both highly skilled and highly funded.
 \subsection{Conclusion or, why we are doomed}
 We can conclude that a compromise of a large number of smart electricity meters cannot be ruled out. The complexity of network-connected smart meter firmware makes it exceedingly unlikely that it is in fact flawless. Large-scale
-deployments of these devices under some circumstances such as where they are used with load disconnect relays make them
-an attractive target for attackers interested in causing grid instability.
The attacker model for these devices includes
-nation states, who have considerable resources at their disposal.
+deployments of these devices sometimes with disconnect relays make them an attractive target for attackers interested in
+causing grid instability. The attacker model for these devices includes nation states, who have considerable resources
+at their disposal.
 For a reasonable guarantee that no large-scale compromises of hard- and software built today will happen over a span of some decades, we would have to radically simplify its design and limit attack surface. Unfortunately, the complexity of
@@ -1001,35 +1003,37 @@ smart electricity meter implementations mostly stems from the large list of requ
 with. Alas, the standards have already been written, political will has been cast into law and changes that reduce scope or functionality have become exceedingly unlikely at this point.
-A general observation with smart grid systems of any kind is that they comprise a departure from the decentralized
-control structure of yesterday's dumb grid and the advent of centralization at an enormous scale. This modern,
+A general observation with smart grid systems of any kind is that they comprise a departure from the federated
+control structure of yesterday's ``dumb'' grid and the advent of centralization to an enormous scale. This modern,
 centralized infrastructure has been carefully designed to defend against malicious actors and all involved parties have
-an interest in keeping it secure. In decentralized systems scaling attacks is inherently harder than in centralized
-systems\cite{anderson02}. Centralization makes for an attractive attack target. An attacker can employ this centralized
-control to their advantage. From this perspective the centralization of smart metering control sytems--sometimes at a
-national level\cite{anderson01,anderson02}--poses a security risk.
+an interest in keeping it secure but in centralized systems scaling attacks is inherently easier than in decentralized
+systems\cite{anderson02}. An attacker can employ centralized control to their advantage. From this perspective the
+centralization of smart metering control sytems--sometimes up to a national level\cite{anderson01,anderson02}--poses a
+security risk.
 \chapter{Restoring endpoint safety in an age of smart devices}
-As laid out in the previous paragraph we cannot fully rule out a large-scale compromise of smart energy meters at some
-point in the long-term future. We have to rephrase our claim to security. We cannot rule out exploitation: We have to
-limit its impact. Assuming that we cannot strip any functionality from smart meters (it may be required by standards or
-for enormous social benefits\cite{mcdaniel01}). All we can do is to flush out an attacker once they are in, i.e.\
-mitigation instead of prevention.
+As laid out in the previous section we cannot fully rule out a large-scale compromise of smart energy meters at some
+point in the long-term future. Instead we have to rephrase our claim to security. We cannot rule out exploitation: We
+have to limit its impact. Assuming that we cannot strip any functionality from smart meters all we can do is to flush
+out an attacker once they are in. Mitigation replaces prevention.
-In a worst-case scenario an attacker would gain unconstrained code execution (e.g.\ by exploiting a flaw in a network
-protocol implentation). Smart meters use standard microcontrollers that do not have advanced memory protection functions
+In a worst-case scenario an attacker would gain unconstrained code execution e.g.\ by exploiting a flaw in a network
+protocol implentation. Smart meters use standard microcontrollers that do not have advanced memory protection functions
 (cf.\ Section \ref{sm-cpu}). We can assume the attacker has full control over the main microcontroller given any such
-flaw.
With this control they can actuate the load switch if present. They can transmit data through the device's
+flaw. With this control they can actuate the disconnect switch if present. They can transmit data through the device's
 communication interfaces or use the user interface components such as LEDs and the LCD. Using the self-programming
-capabilities of flash microcontrollers an attacker may even gain persistency. Note that in systems separating
+capabilities of flash microcontrollers an attacker could even gain persistency. Note that in systems separating
 cryptographic functions into some form of cryptographic module\footnote{such as systems used in Germany\cite{bsi-tr-03109}.} we can be optimistic and assume the attacker has not yet compromised this cryptographic co-processor. With the meter's core microcontroller under attacker control we cannot use this microcontroller to restore control over the system. We have no way of ensuring the attacker does not simply delete a security mechanism we include in the core
-microcontroller's firmware.
+microcontroller's firmware. Theoretically a secure boot implementation could be used to ensure meters boot into a safe
+state after temporary power loss but we cannot rely on secure boot being present on every smart meter application
+controller. Nowadays secure boot is a standard feature in many SoC aimed at smartphones or smart TVs but it is still
+very uncommon in microcontrollers.
 Our solution to this problem is to add another smaller microcontroller to the smart meter design. This microcontroller will contain a small piece of software that receives cryptographically authenticated commands from utility companies. On
@@ -1043,9 +1047,9 @@ Our solution requires the core mircocontroller's JTAG interface to be activated
 to work the core microcontroller firmware must not be able to permanently disable the JTAG interface by itself.
In microcontrollers that do not yet provide this functionality this is a minor change that could be added to a custom microcontroller variant at low cost. On most microcontrollers keeping JTAG open should not interfere with code readout
-protection\footnote{Readout protection usually forces a device erase before allowing JTAG access.}. Code secrecy should
-be of no concern\cite{schneier01} here but some manufacturers have strong preferences due to a fear of copyright
-infringement.
+protection\footnote{Readout protection usually forces a device to erase its program and data memories before allowing
+JTAG access.}. Code secrecy should be of no concern\cite{schneier01} here but some manufacturers have strong preferences
+due to a fear of copyright infringement.
 \section{The theory of endpoint safety}
 \label{sec_criteria}
@@ -1061,11 +1065,11 @@ Note that our \emph{security} property includes only remote exploitation, and ex
 Even though most smart meters provide some level of physical security, we do not wish to make any assumptions on this. In the following section we will elaborate our attacker model and it will become apparent that sufficient physical security to defend against all attackers in our model would be infeasible, and thus we will design our overall system
-to remain secure even assuming some number of physically compromised devices.
+to remain secure even if we assume some number of physically compromised devices.
 % FIXME expand
 \subsection{Attack characteristics}
-The attacker model these two conditions must hold under is as follows. We assume three angles of attack: Attacks by the
+The attacker model the two above conditions must hold under is as follows. We assume three angles of attack: Attacks by the
 customer themselves, attacks by an insider within the metering systems controlling utility company and lastly attacks from third parties.
Examples for these third parties are hobbyist hackers or outside cyber-criminals on the one hand, but also other companies participating in the smart grid infrastructure besides the utility company such as intermediary
@@ -1074,7 +1078,7 @@ providers of meter-reading services.
 Due to the critical nature of the electrical grid, we have to include hostile state actors in our attacker model. When acting directly, these would be classified as third-party attackers by the above schema, but they can reasonably be expected to be able to assume either of the other two roles as well e.g. through infiltration or bribery. In the
-generalized attacker model in \cite{fraunholz01} the authors give a classification of attackers and provide a nice
+generalized attacker model in \cite{fraunholz01} the authors give a classification of attacker types and provide a nice
 taxonomy of attacker properties. In their threat/capability rating, criminals are still considered to have higher threat rating than state-sponsored attackers. The New York Times reported in 2016 that some states recruit their hacking personnel in part from cyber-criminals. If this report is true, in a worst-case scenario we have to assume a
@@ -1086,46 +1090,46 @@ Based on the above classification of attack angles and our observations on state
 \cite{fraunholz01} to our problem, yielding the following new attacker types:
 \begin{enumerate}
- \item \textbf{Utility company insiders controlled by a state actor}
- We can ignore the other internal threats described in \cite{fraunholz01} since an insider cooperating with a
+ \item \textbf{Utility company insiders controlled by a state actor.}
+ We can ignore the other internal threats described in \cite{fraunholz01} since an insider coöperating with a
 state actor is strictly worse in every respect.
- \item \textbf{State-sponsored external attackers}
- A state actor can directly attack the system through the internet.
- \item \textbf{Customers controlled by a state actor}
+ \item \textbf{State-sponsored external attackers.}
+ A state actor can directly attack the system through the internet and with proper operations security they do
+ not risk exposure or capture.
+ \item \textbf{Customers controlled by a state actor.}
 A state actor can very well compromise some customers for their purposes. They might either physically infiltrate the system posing as legitimate customers, or they might simply deceive or bribe existing customers
- into cooperation.
- \item \textbf{Regular customers}
- Though a hostile state actor might gain control of some number of customers through means such as voluntary
- cooperation, bribery, infiltration, they are limited in attack scale since they do not want to arouse premature
- attention. Though regular customers may not have the motivation, skill or resources of a state-sponsored
- attacker, potentially large numbers of them may try to attack a system out of financial incentives. To allow for
- this possibility, we consider regular customers separate from state actors posing as customers in some way.
+ into coöperation.
+ \item \textbf{Regular customers.}
+ A hostile state actor might gain control of some number of customers through means such as voluntary
+ coöperation, bribery or infiltration but this limits the scale of an attack since an attacker has to avoid
+ arousing premature attention. Though regular customers may not have the motivation, skill or resources of a
+ state-sponsored attacker, potentially large numbers of them may try to attack a system out of financial
+ incentives\cite{anderson01,czechowski01}. To allow for this possibility, we consider regular customers separate
+ from state actors posing as customers.
 \end{enumerate}
 \subsection{Overall structural system security}
-Considering overall security, we first introduce the \emph{reset authority}, a trusted party acting as the single
-authority for issuing reset commands in our system. In practice this trusted party may be part of the utility company,
-part of an external regulatory body or a hybrid setup requiring both to cooperate. We assume this party will be designed
-to be secure against all of the above attacker types. The precise design of this trusted party is out of scope for this
-work but we will list some practical suggestions on how to achieve security below. % FIXME do the list
+Considering overall security, we first introduce the reset authority, a trusted party acting as the single authority for
+issuing reset commands in our system. In practice this trusted party may be part of the utility company, part of an
+external regulatory body or a hybrid setup requiring both to coöperate. We assume this party will be designed to be
+secure against all of the above attacker types. The precise design of this trusted party is out of scope for this work
+but we will list some practical suggestions on how to achieve security below. % FIXME do the list
 % FIXME put up a large box on this limitation
-Using an asymmetric cryptographic design centered around the \emph{reset authority}, we rule out all attacks except for
+Using an asymmetric cryptographic design centered around the reset authority, we rule out all attacks except for
 denial-of-service attacks on our system by any of the four attacker types. All reset commands in our system originate
-from the \emph{reset authority} and are cryptographically secured to provide authentication and tamper detection.
-Under this model, attacks on the electrical grid components between the \emph{reset authority} and the customer device
-degrade into man-in-the-middle attacks.
To ensure the \emph{safety} criterion from Section \ref{sec_criteria} holds we
-must make sure our cryptography is secure against man-in-the-middle attacks and we must try to harden the system against
-denial-of-service attacks by the attacker types listed above. Given our attacker model we cannot fully guard against
-this sort of attack but we can at least choose a commmunication channel that is resilient against denial of service
-attacks under the above model.
-
-Finally, we have to consider the issue of hardware security. We will solve the problem of physical attacks on some small
-number of devices by simply not programming any secret information into these devices. This also simplifies hardware
-production. From consideration in this work we explicitly rule out any form of supply-chain attack as
-out-of-scope.
+from the reset authority and are cryptographically secured to provide authentication and tamper detection. Under this
+model attacks on the electrical grid components between the reset authority and the customer device degrade into denial
+of service attacks. To ensure the \emph{safety} criterion from Section \ref{sec_criteria} holds we must make sure our
+cryptography is secure against man-in-the-middle attacks and we must try to harden the system against denial-of-service
+attacks by the attacker types listed above. Given our attacker model we cannot fully guard against this sort of attack
+but we can at least choose a commmunication channel that is resilient under the above model.
+
+Finally, we have to consider the issue of hardware security. We will solve the problem of physical attacks by simply not
+programming any secret information into devices. This also simplifies hardware production. We consider supply-chain
+attacks out-of-scope for this work.
 % FIXME include considerations on production testing somewhere (is the device working? is the right key programmed?)
 \subsection{Complex microcontroller firmware}
@@ -1135,37 +1139,37 @@ controller firmware. The best method to increase firmware security is to reduce
 interfaces as much as possible and by reducing code complexity as much as possible.
 % FIXME formalize this as something like "Design Goal DG-023-42-1" ?
 If we avoid the complexity of most modern microcontroller firmware we gain another benefit beyond implicitly reduced
-attack surface: If the resulting design is small enough we may attempt formal verification of our security property.
-Though formal verification tools are not yet suitable for highly complex tasks they are already adequate for small
-amounts of code and simple interfaces.
+attack surface: If the resulting design is small enough we may even succeed in formal verification of our security
+properties. Though formal verification tools are not yet suitable for highly complex tasks they are already adequate
+for small amounts of code and simple interfaces.
 \subsection{Modern microcontroller hardware}
-Microcontrollers have gained enormously in both performance/efficiency as well as in peripheral support. Alas, these
-gains have largely been driven by insatiable customer demand for faster, more powerful chips and for a long time
-security has not been considered important outside of some specific niches such as smartcards. Traditionally a
-microcontroller would spend its entire lifetime without ever being exposed to any networks. Though this trend has been
-reversing with the increasing adoption of internet-of-things things
-and more advanced security features have started appearing in general-purpose microcontrollers, most still lack even
-basic functionality found in processors for computers or smartphones.
+Microcontrollers have gained enormously in both performance and efficiency as well as in peripheral support.
Alas, these
+gains have largely been driven by insatiable customer demand for faster, more powerful chips and for the longest time
+security has not been considered important outside of some specific niches such as smartcards. A few years ago a
+microcontroller would spend its entire lifetime without ever being exposed to any networks\cite{anderson02}. Though this
+trend has been reversing with the increasing adoption of internet-of-things things and more advanced security features
+have started appearing in general-purpose microcontrollers, most still lack even basic functionality found in processors
+for computers or smartphones.
-One of the components lacking from most microcontrollers is strong memory protection or even a memory mapping unit as
-it is found in all modern computer processors and SoCs for applications such as smartphones. Without an MPU/MPU some
-mitigations for memory safety violations cannot be implemented. This and the absence of virtualization tools such as
-ARM's TrustZone make hardening microcontroller firmware a big task. It is very important to ensure memory safety in
-microcontroller firmware through tools such as defensive coding, extensive testing and formal verification.
+One of the components lacking from most microcontrollers is strong memory protection or even a memory mapping unit as it
+is found in all modern computer processors and SoCs for applications such as smartphones. Without an MPU or MMU many
+memory safety mitigations cannot be implemented. This and the absence of virtualization tools such as ARM's TrustZone
+make hardening microcontroller firmware a big task. It is very important to ensure memory safety in microcontroller
+firmware through tools such as defensive coding, extensive testing and formal verification.
 In our design we achieve simplicity on two levels: One, we isolate the very complex metering firmware from our reset controller by having both run on separate microcontrollers.
Two, we keep the reset controller firmware itself extremely
-simple to reduce attack surface there.
+simple to reduce attack surface there. Our protocol only has one message type and no state machine.
-\subsection{Regulatory and economical constraints}
-%FIXME
+% \subsection{Regulatory and economical constraints}
+% TODO decide whether to keep this section
 \subsection{Safety vs. security: Opting for restoration instead of prevention}
 By implementing our reset system as a physically separate microcontroller we sidestep most security issues around the
-main application microcontroller. There are some simple measures that can be taken to harden this firmware.
+main application microcontroller. There are some simple measures that can be taken to harden its firmware.
 Implementing industry best practices such as memory protection or stack canaries will harden the system and increase the cost of an attack but it will not yield a system that we can be confident enough in to say it is fully secure. The complexity of the main application controller firmware makes fully securing the system a formidable effort--and one that
@@ -1173,14 +1177,14 @@ would have to be repeated by every meter vendor for every one of their code base
 In contrast to this our reset system does not provide any additional security. Any attack that could occur without it can still occur with it in place. What it provides is a fail-safe mechanism that can quickly immobilize a malicious
-actor even mid-attack. It does this in a way that can be adapted to any meter architecture and any microcontroller
-platform with low effort since it relies on established standard interfaces such as JTAG and SWD. Concentrating
-research and development resources on a single platform like this allows for a system that is more economical to
-implement across device series and across vendors.
+actor mid-attack.
It does this in a way that can be adapted to any meter architecture and any microcontroller platform
+with low effort since it relies on established standard interfaces such as JTAG and SWD. Concentrating research and
+development resources on a single platform like this allows for a system that is more economical to implement across
+device series and across vendors.
-Attack resilience in the power grid can benefit from a safety-focused approach. The greater danger such an attack poses
+Attack resilience in the power grid can benefit from a safety-focused approach. The greater threat such an attack poses
 is not the temporary denial of service of utility metering functions. Even in a highly integrated smart grid as
-envisioned by utility companies their measurement functions are used by utility companies to increase efficiency and
+envisioned by utility companies these measurement functions are used by utility companies to increase efficiency and
 reduce cost but are not necessary for the grid to function at all. % TODO citation
 Thus if we can provide mere \emph{safety} with a fail-safe semantic instead of unattainable perfect \emph{security} we
@@ -1190,45 +1194,41 @@ have gained resilience against a large class of realistic attack scenarios.
 There are several ways our system could be practically implemented. The most basic way is to add a separate microcontroller connected to the meter's main application MCU and optionally other embedded microcontrollers such as modems. This discrete chip could either be placed on the metering board itself or it could be placed on a separate PCB
-connected to the programming interface(s) of the metering board. In certain cases the latter might allow use in
+connected to the programming interface(s) of the metering board. In certain cases the latter might allow its use in
 otherwise unmodified legacy designs.
-The saftey reset controller would be a much simpler MCU than the meter's main application controller. Its software can
-be held simple leading to low program flash and RAM requirements. Since it does not need to address rich periphery such
+The safety reset controller would be a much simpler MCU than the meter's main application controller. Its software can
+be kept simple leading to low program flash and RAM requirements. Since it does not need to address rich periphery such
 as external parallel memory, LCDs etc.\ it can be a physically small, low-pin count device. If the main application controller is supposed to be reset to a full factory image with little or no reduced functionality its firmware image size is certainly too large for the reset controller's embedded flash. Thus a realistic setup would likely use an external SPI flash chip to store this image.
 The most likely interfaces to reset the main application controller and possibly other microcontrollers such as modem
-chips would be the controller's integrated programming port such as JTAG. There exist a variety of programming
-interfaces for microcontrollers but for moderately complex ones JTAG has grown to be by far the most broadly supported
-one. Parallel high-voltage flash programming has come to be uncommon in modern microcontrollers and most chips nowadays
-use some form of a serial interface. Some vendors have their own proprietary serial in-system programming interfaces
-that they use on certain parts instead of or in addition to JTAG. The reasons for this usually are either lower
-complexity in parts that do not require full debugging capabilities as provided by JTAG or the high pin count of JTAG.
+chips would be the controller's integrated programming port such as JTAG. Parallel high-voltage flash programming has
+come to be uncommon in modern microcontrollers and most nowadays use some form of a serial interface.
There exist a
+variety of serial programming and debug interfaces but JTAG has grown to be by far the most broadly supported one and
+has largely displaced vendor-specific debug interfaces except for very small devices.
 The kind of microcontroller that would likely be used as the main application controller in a smart meter application will almost certainly support JTAG. These microcontrollers are high pin-count devices since they need to connect to a large set of peripherals such as the LCD and the large program flash makes it likely for a proper debugging interface to
-be present.
-
-The one remaining issue in this coarse technical outline is what communication interface should be used to transmit the
-trigger command to the reset controller. In the following section we will give an overview on communication interfaces
-established in energy metering applications and evaluate each of them for our purpose.
+be present. The one remaining issue in this coarse technical outline is what communication interface should be used to
+transmit the trigger command to the reset controller. In the following section we will give an overview on communication
+interfaces established in energy metering applications and evaluate each of them for our purpose.
 \section{Communication channels on the grid}
 There is a number of well-established technologies for communication on or along power lines. We can distinguish three basic system categories: Systems using separate wires (such as DSL over landline telephone wiring), wireless radio
-systems (such as LTE) and \emph{powerline communication} (PLC) systems that re-use the existing mains wiring and
-superimpose data transmissions on the 50 Hz mains sine\cite{gungor01,kabalci01}.
+systems (such as LTE) and \emph{powerline communication} (PLC) systems that reüse the existing mains wiring and
+superimpose data transmissions onto the 50 Hz mains sine\cite{gungor01,kabalci01}.
For our scenario, we will ignore short-range communication systems. There exists a large number of \emph{wideband} -powerline communication systems that are popular with consumers for bridging ethernet between parts of an apartment or -house. These systems transmit at up to several hundred megabits over distances up to several tens of -meters\cite{kabalci01}. Technologically, these wideband PLC systems are very different from \emph{narrowband} systems -used by utilities for load management among other applications and they are not relevant to our analysis. +powerline communication systems that are popular with consumers for bridging ethernet segments between parts of an +apartment or house. These systems transmit up to several hundred megabits per second over distances up to several tens +of meters\cite{kabalci01}. Technologically, these wideband PLC systems are very different from \emph{narrowband} +systems used by utilities for load management among other applications and they are not relevant to our analysis. \subsection{Powerline communication (PLC) systems and their use} @@ -1241,46 +1241,43 @@ Narrowband PLC systems transmit on the order of kilobits per second or slower. \emph{ripple control} systems. These systems superimpose a low-frequency signal at some few hundred Hertz carrier frequency on top of the 50Hz mains sine. This low-frequency signal is used to encode switching commands for non-essential residential or industrial loads. Ripple control systems provide utilities with the ability to actively -control demand while promising small savings in electricity cost to consumers\cite{dzung01}. +control demand while promising savings in electricity cost to consumers\cite{dzung01}. -In any PLC system there is a strict tradeoff between bandwidth, power and distance. Higher bandwidth requires higher +In any PLC system there is a strict trade-off between bandwidth, power and distance. 
Higher bandwidth requires higher power and reduces maximum transmission distance. Where ripple control systems usually use few transmitters to cover -the entire grid of a regional distribution utility, higher-bandwidth bidirectional systems used for automatic meter -reading (AMR) in places such as italy or france require repeaters within a few hundred meters of a transmitter. +the entire grid of a regional distribution utility, higher bandwidth bidirectional systems used for automatic meter +reading (AMR) in places such as Italy or France require repeaters within a few hundred meters of a transmitter. \subsection{Landline and wireless IP-based systems} -Especially in automated meter reading (AMR) infrastructure the cost-benefit tradeoff of powerline systems does not +Especially in automated meter reading (AMR) infrastructure the cost-benefit trade-off of powerline systems does not always work out for utilities. A common alternative in these systems is to use the public internet for communication. Using the public internet has the advantage of low initial investment on the part of the utility company as well as quick commissioning. Disadvantages compared to a PLC system are potentially higher operational costs due to recurring fees to network providers as well as lower reliability. Being integrated into power grid infrastructure, a PLC system's failure modes are highly correlated with the overall grid. Put briefly, if the PLC interface is down, there is a good -chance that power is out, too. In contrast to this general internet services exhibit a multitude of failures that are -entirely decorrelated from power grid stability. - -For purposes such as meter reading for billing purposes, this stability is sufficient. However for systems that need to -hold up in crisis situations such as the recovery system we are contemplating in this thesis, the public internet may -not provide sufficient reliability. +chance that power is out, too. 
In contrast general internet services exhibit a multitude of failures that are entirely +decorrelated from power grid stability. For purposes such as meter reading for billing purposes, this stability is +sufficient. However for systems that need to hold up in crisis situations such as the recovery system we are +contemplating in this thesis, the public internet may not provide sufficient reliability. \subsection{Short-range wireless systems} -Smart meters contain copious amonuts of firmware but still pale in comparison to the complexity of full-scale computers +Smart meters contain copious amounts of firmware but still pale in comparison to the complexity of full-scale computers such as smartphones. For short-range communication between a meter and a cellular radio gateway mounted nearby or -between a meter an an meter reading operator in a vehicle on the street a protocol such as Wifi (802.11) might be too -complex in most cases. Absent widely-used standards in this space proprietary radio protocols instead grow very -attractive. These might be based on some standardized lower-level protocol such as ZigBee (802.15) or might be entirely -home-grown. To a meter manufacturer a proprietary radio protocol has several advantages. It is easy to implement and -requires zero external certification. It can be customized to its specific application. In addition it provides some -level of vendor lock-in to customers sharing infrastructure such as a cellular radio gateway between multiple devices. -In other fields where a lack of standardization has led to a proliferation of proprietary protocols such as home -automation this has led to a fragmented protocol landscape. In other fields this is a large problem since consumer -cannot easily integrated products made by different manufacturers into one system. In advanced metering infrastructure -this is unlikely to be a disadvantage since ususally there is only one distribution grid operator for an area. 
-Additionally shared resources such as a cellular radio gateway would most likely only be shared within a single building -and within a single building usually all meters are operated by the same provider. - -Systems in Europe commonly support Wireless M-Bus, an european standardized protocol\cite{silabs01} that operates on +between a meter and a meter reading operator in a vehicle on the street a protocol such as Wifi (IEEE 802.11) is too +complex. Absent widely-used standards in this space proprietary radio protocols grew attractive. These are often based +on some standardized lower-level protocol such as ZigBee (IEEE 802.15) but entirely home-grown ones also exist. To the +meter manufacturer a proprietary radio protocol has several advantages. It is easy to implement and requires no external +certification. It can be customized to its specific application. In addition it provides vendor lock-in to customers +sharing infrastructure such as a cellular radio gateway between multiple devices. In other fields a lack of +standardization has led to a proliferation of proprietary protocols and a fragmented protocol landscape. This is a large +problem since the consumer cannot easily integrate products made by different manufacturers into one system. In advanced +metering infrastructure this is unlikely to be a disadvantage since usually there is only one distribution grid +operator for an area. Shared resources such as a cellular radio gateway would most likely only be shared within a +single building and usually they are all operated by the same provider. + +Systems in Europe commonly support Wireless M-Bus, a European standardized protocol\cite{silabs01} that operates on several ISM bands\footnote{ Frequency bands that can be used for \emph{Industrial, Scientific and Medical} applications by anyone and that do not require obtaining a license for transmitter operation. 
Manufacturers can use whatever protocol they like on @@ -1293,106 +1290,109 @@ several ISM bands\footnote{ \subsection{Frequency modulation as a communication channel} -For our system, we chose grid frequency modulation (henceforth GFM) as a low-bandwidth uni-directional broadcast -communications channel. Compared to traditional PLC GFM requires only a small amount of additional hardware, works +For our system, we chose grid frequency modulation (henceforth GFM) as a low-bandwidth unidirectional broadcast +communication channel. Compared to traditional PLC, GFM requires only a small amount of additional hardware, works reliably throughout the grid and is harder to manipulate by a malicious actor. -Grid frequency in europe's synchronous areas is nominally 50 Hertz, but there are small load-dependent variations from +Grid frequency in Europe's synchronous areas is nominally 50 Hertz, but there are small load-dependent variations from this nominal value. Any device connected to the power grid (or even just within physical proximity of power wiring) can reliably and accurately measure grid frequency at low hardware overhead. By intentionally modifying grid frequency, we can create a very low-bandwidth broadcast communication channel. Grid frequency modulation has only ever been proposed -as a communications channel at very small scales in microgrids before\cite{urtasun01} but to our knowledge has not yet +as a communication channel at very small scales in microgrids before\cite{urtasun01} and to our knowledge has not yet been considered for large-scale application. Advantages of using grid frequency for communication are low receiver hardware complexity as well as the fact that a -single transmitter can cover an entire synchronous area. Though the transmitter has to be very large and powerful, setup -of a single large transmitter faces lower bureaucratic hurdles than integration of hundreds of smaller ones into -hundreds of local systems each with autonomous goverance. 
+single transmitter can cover an entire synchronous area. Though the transmitter has to be very large and powerful the +setup of a single large transmitter faces lower bureaucratic hurdles than integration of hundreds of smaller ones into +hundreds of local systems that each have autonomous governance. \subsubsection{The frequency dependency of grid frequency} Despite the awesome complexity of large power grids the physics underlying their response to changes in load and generation is surprisingly simple. Individual machines (loads and generators) can be approximated by a small number of differential equations and the entire grid can be modelled by aggregating these approximations into a large system of -nonlinear differential equations. Evaluating these systems it has been found that in large power grids small-signal -steady-state changes in generation/consumption power balance cause an approximately linear change in +nonlinear differential equations. Evaluating these systems it has been found that in large power grids small signal +steady state changes in generation/consumption power balance cause an approximately linear change in frequency\cite{kundur01,crastan03,entsoe02,entsoe04}. \emph{Small signal} here describes changes in power balance that -are small compared to overall grid power. \emph{Steady state} describes changes over a timeframe of multiple cycles as -opposed to transient events that only last a few milliseconds. +are small compared to overall grid power. \emph{Steady state} describes changes over a time frame of multiple waveform +cycles as opposed to transient events that only last a few milliseconds. -This approximately linear relationship allows the specification of a coefficient linking $\Delta P$ and $\Delta f$ with -unit \si{\watt\per\hertz}. In this thesis we are using the European power grid as our model system. We are -using data provided by ENTSO-E (formerly UCTE), the governing association of european transmission system operators. 
In -our calculations we use data for the continental european synchronous area, the largest synchronous area. $\frac{\Delta -P}{\Delta f}$, called \emph{Overall Network Power Frequency Characteristic} by ENTSO-E is around -\SI{25}{\giga\watt\per\hertz}. +This approximately linear relationship allows the specification of a coefficient with unit \si{\watt\per\hertz} linking +power differential $\Delta P$ and frequency differential $\Delta f$. In this thesis we are using the European power +grid as our model system. We are using data provided by ENTSO-E (formerly UCTE), the governing association of European +transmission system operators. In our calculations we use data for the continental European synchronous area, the +largest synchronous area. $\frac{\Delta P}{\Delta f}$, called \emph{Overall Network Power Frequency Characteristic} by +ENTSO-E is around \SI{25}{\giga\watt\per\hertz}. -We can derive general design parameter for any system utilizing grid frequency as a communications channel from the +We can derive general design parameter for any system utilizing grid frequency as a communication channel from the policies of ENTSO-E\cite{entsoe02,entsoe03}. Any such system should stay below a modulation amplitude of \SI{100}{\milli\hertz} which is the threshold defined in the ENTSO-E incidents classification scale for a Scale 0-1 -(from "Anomaly" to "Noteworthy Incident" scale) frequency degradation incident\cite{entsoe03} in the continental europe -synchronous area. +(from ``Anomaly'' to ``Noteworthy Incident'' scale) frequency degradation incident\cite{entsoe02} in the continental +Europe synchronous area. \subsubsection{Control systems coupled to grid frequency} -The ENTSO-E Operations Handbook Policy 1 chapter defines the activation threshold of primary control to be -\SI{20}{\milli\hertz}. Ideally a modulation system would stay well below this threshold to avoid fighting the primary -control reserve. 
Modulation line rate should likely be on the order of at most a few hundred millibaud. Modulation at -such high rates would outpace primary control action which is specified by ENTSO-E as acting within between ``a few +The ENTSO-E Operations Handbook Policy 1 chapter\cite{entsoe02} defines the activation threshold of primary control to +be \SI{20}{\milli\hertz}. Ideally, a modulation system would stay well below this threshold to avoid fighting the +primary control reserve. Modulation line rate should likely be on the order of a few hundred millibaud. Modulation at +these rates would outpace primary control action which is specified by ENTSO-E as acting within between ``a few seconds'' and \SI{15}{\second}. -The effective \emph{Network Power Frequency Characteristic} of primary control in the european grid is reported by -ENTSO-E at around \SI{20}{\giga\watt\per\hertz}. Keeping modulation amplitude below this threshold would help to avoid -spuriously triggering these control functions. This works out to an upper bound on modulation power of +Keeping modulation amplitude below this threshold would help to avoid spuriously triggering these control functions. +The effective \emph{Network Power Frequency Characteristic} of primary control in the European grid is reported by +ENTSO-E at around \SI{20}{\giga\watt\per\hertz}. This works out to an upper bound on modulation power of \SI{20}{\mega\watt\per\milli\hertz}. \subsubsection{An outline of practical transmitter implementation} In its most basic form a transmitter for grid frequency modulation would be a very large controllable load connected to -the power grid at a suitable vantage point. A spool of wire submerged in a body of cooling water (such as a small lake -with a fence around it) along with a thyristor rectifier bank would likely suffice to perform this function during -occassional cybersecurity incidents. 
We can however decrease hardware and maintenance investment even further compared -to this rather uncultivated solution by repurposing regular large industrial loads to our transmitter purposes in an -emergency situation. For some preliminary exploration we went through a list of energy-intensive industries in -Europe\cite{ec01}. The most electricity-intensive industries in this list are primary aluminium and steel production. -In primary production raw ore is converted into raw metal for further refinement such as casting, rolling or extrusion. -In steelmaking iron is smolten in an electric arc furnace. In aluminium smelting aluminium is electrolytically extracted -from alumina. Both processes involve large amounts of electricity with electricity making up \SI{40}{\percent} of -production costs. Given these circumstances a steel mill or aluminium smelter would be good candidates as transmitters -in a grid frequency modulation system. - -In aluminium smelting high-voltage mains is transformed, rectified and fed into about 100 series-connected cells forming -a \emph{potline}. Inside the pots alumina is dissolved in molten cryolite electrolyte at about -\SI{1000}{\degreeCelsius} and electrolysis is performed using a current of tens or hundreds of kiloampere. Resulting +the power grid at a suitable vantage point. A spool of wire submerged in a body of cooling liquid such as a small lake +along with a thyristor rectifier bank would likely suffice to perform this function during occasional cybersecurity +incidents. We can however decrease hardware and maintenance investment even further compared to this rather +uncultivated solution by repurposing regular large industrial loads as transmitters in an emergency situation. For some +preliminary exploration we went through a list of energy-intensive industries in Europe\cite{ec01}. The most +electricity-intensive industries in this list are primary aluminium and steel production. 
In primary production raw ore +is converted into raw metal for further refinement such as casting, rolling or extrusion. In steelmaking iron is +melted in an electric arc furnace. In aluminium smelting aluminium is electrolytically extracted from alumina. Both +processes involve large amounts of electricity with electricity making up \SI{40}{\percent} of production costs. Given +these circumstances a steel mill or aluminium smelter would be good candidates as transmitters in a grid frequency +modulation system. + +In aluminium smelting high-voltage mains is transformed, rectified and fed into about 100 series-connected electrolytic +cells forming a \emph{potline}. Inside these pots alumina is dissolved in molten cryolite electrolyte at about +\SI{1000}{\degreeCelsius} and electrolysis is performed using a current of tens or hundreds of kiloamperes. The resulting pure aluminium settles at the bottom of the cell and is tapped off for further processing. Like steelworks, aluminium smelters are operated night and day without interruption. Aside from metallurgical issues the -large thermal mass and enormous heating power requirements do not permit power-cycling. Due to the high costs of -production inefficiencies or interruptions the behavior of aluminium smelters under power outages is a fairly -well-characterized phenomenon in the industry. The recent move away from nuclear power and to renewable energy has lead -to an increase in fluctuations of electricity price throughout the day. These electricity price fluctuations have +large thermal mass and enormous heating power requirements do not permit power cycling. Due to the high costs of +production inefficiencies or interruptions the behavior of aluminium smelters under power outages is a +well-characterized phenomenon in the industry. The recent move away from nuclear power and towards renewable energy has +led to an increase in fluctuations of electricity price throughout the day. 
These electricity price fluctuations have provided enough economic incentive to aluminium smelters to develop techniques to modulate smelter power consumption -without affecting cell lifetime or the output product\cite{duessel01,eisma01}. Power outages of tens of minutes up to -two hours reportedly do not cause problems in aluminium potlines and are in fact part of routine operation for purposes -such as electrode changes\cite{eisma01,oye01}. +without affecting cell lifetime or product quality\cite{duessel01,eisma01}. Power outages of tens of minutes up to two +hours reportedly do not cause problems in aluminium potlines and are in fact part of routine operation for purposes such +as electrode changes\cite{eisma01,oye01}. The power supply system of an aluminium plant is managed through a highly-integrated control system as keeping all cells of a potline under optimal operating conditions is challenging. Modern power supply systems employ large banks of diodes -or SCRs to rectify low-voltage AC to DC to be fed into the potline\cite{ayoub01}. The potline voltage can be controlled -almost continuously through a combination of a tap changer and a transductor. The individual cell voltages can be -controlled by changing the anode to cathode distance (ACD) by physically lowering or raising the anode. The potline -power supply is connected to the high voltage input and to the potline through isolators and breakers. +or SCRs\footnote{SCRs, also called thyristors, are electronic devices that are often used in high-power switching +applications. They are normally-off devices that act like diodes when a current is fed into their control terminal.} to +rectify low-voltage AC to DC to be fed into the potline\cite{ayoub01}. The potline voltage can be controlled almost +continuously through a combination of a tap changer and a transductor. The individual cell voltages can be controlled by +changing the anode to cathode distance (ACD) by physically lowering or raising the anode. 
The potline power supply is +connected to the high voltage input and to the potline through isolators and breakers. In an aluminium smelter most of the power is sunk into resistive losses and the electrolysis process. As such an aluminium smelter does not have any significant electromechanical inertia compared to the large rotating machines used -in other industries. Depending on the capabilities of the rectifier controls high slew rates should be possible, -permitting modulation at high\footnote{Aluminium smelter rectifiers are \emph{pulse rectifiers}. This means instead of -simply rectifying the incoming three-phase voltage they use a special configuration of transformer secondaries and in -some cases additional coils to produce a large number (such as 6) of equally spaced phases. Where -a direct-connected three-phase rectifier would draw current in 6 pulses per cycle a pulse rectifier draws current in -more, smaller pulses to increase power factor. E.g. a 12-pulse rectifier will draw current in 12 pulses per cycle. In -the best case an SCR pulse rectifier switched at zero crossing should allow \SIrange{0}{100}{\percent} load changes from -one rectifier pulse to the next, i.e. within a fraction of a single cycle.} data rates. +in other industries. Depending on the capabilities of the rectifier controls high slew rates are possible, permitting +modulation at high\footnote{Aluminium smelter rectifiers are \emph{pulse rectifiers}. This means instead of simply +rectifying the incoming three-phase voltage they use a special configuration of transformer secondaries and in some +cases additional coils to produce a large number of equally spaced phases (e.g.\ six) from a standard three-phase input. +Where a direct-connected three-phase rectifier would draw current in six pulses per mains voltage cycle a pulse +rectifier draws current in more, smaller pulses to increase power factor. For example a 12-pulse rectifier will draw +current in 12 pulses per cycle. 
In the best case an SCR pulse rectifier switched at zero crossing should allow +\SIrange{0}{100}{\percent} load changes from one rectifier pulse to the next, i.e. within a fraction of a single cycle.} +data rates. % FIXME validate this \subsubsection with an expert @@ -1400,28 +1400,28 @@ one rectifier pulse to the next, i.e. within a fraction of a single cycle.} data Modern power systems are complex electromechanical systems. Each component is controlled by several carefully tuned feedback loops to ensure voltage, load and frequency regulation. Multiple components are coupled through transmission -lines that themselves exhibit complex dynamic behavior. The overall system is generally stable, but may exhbit some -instabilities to particular small-signal stimuli\cite{kundur01,crastan03}. These instabilities, called \emph{modes} -occur when due to mis-tuning of parameters or physical constraints the overall system exhibits oscillation at particular -frequencies. These are separated into four categories in \cite{kundur01}: +lines that themselves exhibit complex dynamic behavior. The overall system is generally stable, but may exhibit +instabilities to particular small-signal stimuli\cite{kundur01,crastan03}. These instabilities, called \emph{modes}, +occur when due to mis-tuning of parameters or physical constraints the overall system exhibits oscillation at a +particular frequency. 
\cite{kundur01} separates these modes into four categories: \begin{description} - \item[Local modes] where a single power station oscillates in some parameter + \item[Local modes] where a single power station oscillates in some parameter, \item[Interarea modes] where subsections of the overall grid oscillate w.r.t.\ each other due to weak coupling - between them - \item[Control modes] caused by imperfectly tuned control systems - \item[Torsional modes] that originate from electromechanical oscillations in the generator itself + between them, + \item[Control modes] caused by imperfectly tuned control systems and + \item[Torsional modes] that originate from electromechanical oscillations in the generator itself. \end{description} The oscillation frequencies associated with each of these modes are usually between a few tens of Millihertz and a few Hertz\cite{grebe01,entsoe01,crastan03}. It is hard to predict the particular modes of a power system at the scale of the -central-european interconnected system. Theoretical analysis and simulation may give rough indications but cannot yield +central European interconnected system. Theoretical analysis and simulation may give rough indications but cannot yield conclusive results. Due to the obvious danger as well as high economical impact due to inefficiencies experimental -measurements are infeasible. Finally, modes are highly dependent on the power grid's structure and will change with -changes in the power grid over time. For all of these reasons, a grid frequency modulation system must be designed very +measurements are infeasible. Modes are highly dependent on the power grid's structure and will change with changes in +the power grid over time. For all of these reasons, a grid frequency modulation system must be designed very conservatively without relying on the absence (or presence) of modes at particular frequencies. 
A concrete design guideline that we can derive from this situation is that the frequency spectrum of any grid frequency modulation system -should not exhibit any notable peaks and should avoid a concentration of spectral energy in certain frequency ranges. +should not exhibit large peaks and should avoid a concentration of spectral energy in small frequency bands. \subsubsection{Overall system parameters} @@ -1430,90 +1430,88 @@ controllable load: \begin{description} \item[Modulation amplitude.] Amplitude is proportionally related to modulation power. In a practical setup we might - realize a modulation power up to a few hundred \si{\mega\watt} which would yield maybe a few tens of - \si{\milli\hertz} of frequency amplitude. + realize a modulation power up to a few hundred \si{\mega\watt} which would yield a few tens of \si{\milli\hertz} + of frequency amplitude. \item[Modulation pre-emphasis and slew-rate control.] Pre-emphasis might be necessary to ensure an adequate Signal-to-Noise ratio (SNR) at the receiver. Slew-rate control and other shaping measures might be necessary to reduce the impact of these sudden load changes on the transmitter's primary function (say, aluminium smelting) - and to prevent disturbances to grid components. - \item[Modulation frequency]. For a practical implementation a careful study would be necessary to determine an - optimal frequency band for operation. On one hand we need to prevent disturbances to the grid such as through - excitation of some local or inter-area modes. On the other hand we need to optimize Signal-to-Noise ratio (SNR) - and data rate to achieve optimal latency between transmission start and successful reception and to reduce the - overall burden on transmitter and grid. - \item[Further modulation parameters.] The modulation itself has numerous parameters that are discussed in sec.\ + and to prevent disturbances to other grid components. + \item[Modulation frequency.] 
For a practical implementation a careful study would be necessary to determine the + optimal frequency band for operation. On one hand we need to prevent disturbances to the grid such as the + excitation of local or inter-area modes. On the other hand we need to optimize Signal-to-Noise ratio (SNR) + and data rate to achieve optimal latency between transmission start and reset completion and to reduce the + overall burden on both transmitter and grid. + \item[Further modulation parameters.] The modulation itself has numerous parameters that are discussed in Section \ref{mod_params} below. \end{description} \section{From grid frequency to a reliable communication channel} +% FIXME add intro text here \subsection{Channel properties} In this section we will explore how we can construct a reliable communication channel from the analog primitive we -outline in the previous section. Our load control approach to grid frequency modulation leads to a channel with the +have outlined in the previous section. Our load control approach to grid frequency modulation leads to a channel with the following properties. \begin{description} - \item[Slow-changing.] Accurate grid frequency measurements need several periods of the mains sine wave. Faster + \item[Slow-changing.] Accurate grid frequency measurements take several periods of the mains sine wave. Faster sampling rates can be achieved with more complex specialized synchrophasor estimation algorithms but this will - result in a tradeoff between sampling rate and accuracy\cite{belega01}. + result in a trade-off between sampling rate and accuracy\cite{belega01}. \item[Analog.] Grid frequency is an analog signal. - \item[Noisy.] While stable over long periods of time thanks to Load-Frequency Control\cite{entsoe04} it shows - considerable random short-term variations. In addition our modulation amplitude is limited by technical and - economic constraints so we have to find a system that will work at poor SNRs. - \item[Polarized.] 
Grid frequency measurements have an inherent sense of \emph{up} (higher frequencies). We can use - this in a polarized modulation scheme to encode information without first transmitting some reference signal to - establish this polarization. + \item[Noisy.] While stable over long periods of time thanks to power stations' Load-Frequency Control + systems\cite{entsoe04} there are considerable random short-term variations. Our modulation amplitude is limited + by technical and economic constraints so we have to find a system that will work at poor SNRs. + \item[Polarized.] Grid frequency measurements have an inherent sense of polarity that we can use in our modulation + scheme. \end{description} \subsection{Modulation and its parameters} \label{mod_params} -In this section we will consider how to select a good set of parameters for a modulation scheme fitting grid frequency +In this section we will analyze what makes for a good set of parameters for a modulation scheme fitting grid frequency modulation. -The sensitivity of the grid to oscillation at particular frequencies described above means we should avoid any -modulation technique that would concentrate a lot of energy in a small bandwidth. Taking this principle to its extreme -provides us with a useful pointer towards techniques that might work well: Spread-spectrum techniques. By employing -spread-spectrum modulation we can produce an almost ideal frequency-domain behavior that spreads the modulation energy -almost flat across the modulation bandwidth\cite{goiser01} while at the same time achieving some modulation gain, -increasing system sensitivity. This modulation gain spread-spectrum techniques yield potentially allows us to use a -weaker stimulus, allowing further reduction of the probability of disturbance to the overall system. Spread-spectrum -techniques also inherently allow us to tune the tradeoff between receiver sensitivity and data rate. 
This tunability is -a highly useful parameter to have for the overall system design. +As described before, the grid's oscillatory modes mean that we should avoid any modulation technique that would +concentrate energy in a small bandwidth. Taking this principle to its extreme provides us with a useful pointer towards +techniques that might work well: Spread-spectrum techniques. By employing spread-spectrum modulation we can produce +close to ideal frequency-domain behavior. Modulation energy is spread almost flatly across the modulation +bandwidth\cite{goiser01}. At the same time we achieve modulation gain which increases system sensitivity. This +modulation gain potentially allows us to use a weaker stimulus, allowing for a further reduction of the probability of +disturbance to the overall system. Spread-spectrum techniques also inherently allow us to trade off receiver sensitivity +for data rate. This tunability is a useful parameter in the overall system design. -Spread spectrum covers a whole family of techniques. In \cite{goiser01} these techniques are divided into the coarse -categories of \emph{Direct Sequence Spread Spectrum}, \emph{Frequency Hopping Spread Spectrum} and \emph{Time Hopping -Spread Spectrum}. +Spread spectrum covers a whole family of techniques that are comprehensively explained in \cite{goiser01}, +which divides them into the coarse categories of \emph{Direct Sequence Spread Spectrum}, +\emph{Frequency Hopping Spread Spectrum} and \emph{Time Hopping Spread Spectrum}. In \cite{goiser01} a BPSK or similar modulation is assumed underlying the spread-spectrum technique. Our grid frequency modulation channel effectively behaves more like a DC-coupled wire than a traditional radio channel: Any change in excitation will cause a proportional change in the receiver's measurement. Using our FFT-based measurement methodology we get a real-valued signed quantity. 
In this way grid frequency modulation is similar to a channel using coherent -modulation. We can transmit not only signal strength, but polarity too. +modulation. We can utilize both signal strength and polarity in our modulation. -For our purposes we can discount both Time and Frequency Hopping Spread Spectrum techniques. Time hopping aids to reduce -interference between multiple transmitters but does not help with SNR any more than Direct Sequence does since all it -does is allowing other transmitters to transmit. Our system is strictly limited to a single transmitter so we do not -gain anything through Time Hopping. +For our purposes we can discount both Time and Frequency Hopping Spread Spectrum techniques. Time hopping helps to +reduce interference between multiple transmitters but does not help with SNR any more than Direct Sequence does since +all it does is allow other transmitters to transmit. Our system is strictly limited to a single transmitter so we do +not gain anything through Time Hopping. Frequency Hopping Spread Spectrum techniques require a carrier. Grid frequency modulation itself is very limited in peak frequency deviation $\Delta f$. Frequency hopping could only be implemented as a second modulation on top of GFM, but this would not yield any benefits while increasing system complexity and decreasing data bandwidth. Direct Sequence Spread Spectrum is the only remaining approach for our application. Direct Sequence Spread Spectrum -works by directly modulating a long pseudorandom bit sequence onto the channel. The receiver must know the same +works by directly modulating a long pseudo-random bit sequence onto the channel. The receiver must know the same pseudo-random bit sequence and continuously calculates the correlation between the received signal and the pseudo-random -template sequence mapped from binary $[0, 1]$ to bipolar $[1, -1]$. 
The pseudorandom sequence has approximately equal -number of $0$ and $1$ bits the correlation between the sequence and uncorrelated noise is small. The positive -contribution of the $+1$ terms of the correlation template approximately cancel out with the $-1$ terms when multiplied -with an uncorrelated signal such as white gaussian noise or another pseudo-random sequence. +template sequence mapped from binary $[0, 1]$ to bipolar $[1, -1]$. The pseudo-random sequence has an approximately equal +number of $0$ and $1$ bits. The positive contribution of the $+1$ terms of the correlation template approximately cancels +out with that of the $-1$ terms when multiplied with an uncorrelated signal such as white Gaussian noise. By using a family of pseudo-random sequences with low cross-correlation channel capacity can be increased. Either the transmitter can encode data in the choice of sequence or multiple transmitters can use the same channel at once. The -longer the pseudo-random sequence the lower its cross-correlation with noise or other pseudorandom sequences of the same -length. Choosing a long sequence we increase modulation gain while decreasing bandwidth. For any given application the -sweet spot will be the shortest sequence that is long enough to yield sufficient SNR for subsequent processing layers -such as channel coding. +longer the pseudo-random sequence, the lower its cross-correlation with noise or other pseudo-random sequences of the +same length. Choosing a long sequence we increase modulation gain while decreasing bandwidth. For any given application +the sweet spot will be the shortest sequence that is long enough to yield sufficient SNR for subsequent processing +layers such as channel coding. Gold codes are a popular family of codes used in many DSSS systems. A set of Gold codes has small cross-correlations. For some value $n$ a set of Gold codes contains $2^n + 1$ sequences of length $2^n - 1$. 
Gold codes are generated from two @@ -1529,53 +1527,52 @@ modulation such as BPSK as it is commonly used in DSSS systems. \subsection{Error-correcting codes} -To make our overall system reliable we have to layer some channel coding on top of our DSSS modulation. The messages we -expect to transmit are at least a few tens of bits long. We are highly constrained in SNR due to limited transmission -power. With lower SNR comes higher BER (bit error rate). Packet error rate grows exponentially with transmission length. -For our relatively long transmissions we would realistically get unacceptable error rates. - -Error correcting codes are a very broad field with many options for specialization. Since we are implementing nothing -more than a prototype in this thesis we chose to not expend resources on optimization too much and settled on a basic -reed-solomon code. The state of the art has advanced considerably since the discovery of reed-solomon -codes\cite{mackay01}. The main areas of improvement are overhead and decoding speed. Since message length in our system -limits system response time but we do not have a fixed target we can tolerate some degree of overhead. Decoding speed -is of very low concern to us because our data rate is extremely low. +To reduce reception error rate we have to layer channel coding on top of the DSSS modulation. The messages we expect to +transmit are at least a few tens of bits long. We are highly constrained in SNR due to limited transmission power and +with lower SNR comes higher BER (Bit Error Rate). At a fixed BER, packet error rate grows exponentially with +transmission length so for our relatively long transmissions we would realistically get unacceptable error rates. -An important concern for our prototype implementation was the availability of reference implementations of our error -correcting code. 
We need a python implementation for test signal generation on a regular computer and we need a small C -or C++ implementation that we can adapt to embedded firmware. LDPC codes are a popular textbook example of -error-correcting codes and we had no particular difficulty finding either. +Error-correcting codes are a very broad field with many options for specialization. Since we are implementing only an +advanced prototype in this thesis we chose to spend only limited resources on optimization and settled on a basic +Reed-Solomon code. We have no doubt that by applying a more state-of-the-art code we could gain further improvements in +code overhead and decoding speed among others\cite{mackay01}. Since message length in our system limits system response +time but we do not have a fixed target we can tolerate some degree of overhead. Decoding speed is of very low concern +to us because our data rate is extremely low. We derived our implementation by adapting and optimizing an existing open +source decoder that we validated against an open source encoder implementation. We generate test signals using a Python tool +on the host. \subsection{Cryptographic security} \label{sec-crypto} +% FIXME intro blurb -Informally the system we are looking for can be modelled as consisting of three parties: the trusted -\emph{transmitter}, one of a large number of untrusted \emph{receivers}, and an \emph{attacker}. These three play -according to the following rules: +From a protocol security perspective the system we are looking for can informally be modelled as consisting of three +parties: the trusted \emph{transmitter}, one of a large number of untrusted \emph{receivers}, and an \emph{attacker}. +These three play according to the following rules: \begin{description} \item[Access.] Both transmitter and attacker can transmit any bit sequence. \item[Indistinguishability.] The receiver receives any transmission by either but cannot distinguish between them. - \item[Kerckhoff's principle.] 
The attacker knows anything any receiver might know\cite{kerckhoff01,kerckhoff02}. + \item[Kerckhoffs's principle.] Since the protocol design is public and anyone can get access to an electricity meter + the attacker knows anything any receiver might know\cite{kerckhoff01,kerckhoff02}. \item[Priority.] The transmitter is stronger than an attacker and will ``win'' during simultaneous transmission. \item[Seeding.] Both transmitter and receiver can be seeded out-of-band with some information on each other such as public key fingerprints. \end{description} We are not considering situations where an attacker attempts to jam an ongoing transmission. In practice there are -several avenues to prevent such attempts. Compromised loads that are being abused by the attacker can be manually +several avenues to prevent such attempts. Compromised large loads that are being abused by the attacker can be manually disconnected by the utility. Error-correcting codes can be used to provide resiliency against small-scale disturbances. Finally, the transmitter can be designed to have high enough power to be able to override any likely attacker. -Our goal is to find a cryptographic primitive that has the following properties: +With the above rules in mind our goal is to find a cryptographic primitive that has the following properties: \begin{description} - \item[Authenticity.] The transmitter can produce a message bit sequence that a subset of receivers can identify as - being generated by the transmitter. On reception of this sequence, all addressed receivers perform a safety - reset. + \item[Authentication.] The transmitter can produce a message bit sequence that a certain subset of receivers can + identify as being generated by the transmitter. On reception of this sequence, all addressed receivers perform a + safety reset. \item[Unforgeability.] 
The attacker cannot forge a message, i.e.\ find a bit sequence other than one of the - transmitter's previous messages that a receiver would accept. This implies that the attacker also cannot modify - an existing message. - \item[Brevity.] The message should be short. Our communications channel is outrageously slow compared to anything + transmitter's previous messages that a receiver would accept. This implies that the attacker also cannot create + a new distinct message from a previously transmitted message. + \item[Brevity.] The message should be short. Our communication channel is outrageously slow compared to anything else used in modern telecommunications and every bit counts. \end{description} @@ -1584,28 +1581,29 @@ means for a given message each receiver either performs exactly one safety reset re-transmitted by either the transmitter or an attacker. We cannot achieve the ideal exactly-once semantics with pure protocol gymnastics since we are using a unidirectional lossy communication primitive. A receiver might be offline (e.g.\ due to a local power outage) and then would not hear the transmission even if our broadcast primitive was -reliable. Since there is no back-channel, the transmitter has no way of telling when that happens. The practical impact -of this can be mitigated by the transmitter by repeating the transmission a number of times. +reliable. Since there is no back channel, the transmitter has no way of telling when that happens. The practical impact +of this can be mitigated by the transmitter repeating the message a number of times. It follows from the unforgeability requirement that we can trivially reach idempotence at the protocol level by keeping -a database of all previous messages and only accepting \emph{new} messages. By considering this in our cryptographic -design we can reduce the storage requirement for this ``database''. +a database of all previous messages and only accepting new messages. 
By considering this in our cryptographic design we +can reduce the storage overhead of this ``database''. Along with the indistinguishability property the access requirement implies that we need a cryptographic -signature\cite{lamport01}. However, we have relaxed constraints on this signature compared to cryptographic practice. -While cryptographic signatures need to work over arbitrary inputs, all we want to ``sign'' here is the instruction to -perform a safety reset. This is the only message we might ever want to transmit so our message space has only one -entry. The information content of our message thus is 0 bit! All the information we want to transmit is already -encoded \emph{in the fact that we are transmitting}. We do not require any further payload to be transmitted. We can -omit the entirety of the message and just transmit whatever ``signature'' we produce. This is useful to conserve -transmission bits so our transmission does not take an exceeedingly long time over our extremely slow -communication channel. +signature\cite{lamport01}. However, we have relaxed constraints on this signature compared to standard cryptographic +practice. While cryptographic signatures need to work over arbitrary inputs, all we want to ``sign'' here is the +instruction to perform a safety reset. This is the only message we might ever want to transmit so our message space has +only one element. The information content of our message thus is 0 bit! All the information we want to transmit is +already encoded \emph{in the fact that we are transmitting} and we do not require a further payload to be transmitted: +We can omit the entirety of the message and just transmit whatever ``signature'' we produce. This is useful to conserve +transmission bits so our transmission does not take an exceedingly long time over our extremely slow communication +channel. 
We can modify this construction to allow for a small number of bits of information content in our message (say two or -three instead of zero) at no transmission overhead. We could transmit the cryptographic signature as usual but simply -omit the message. The message is only a few bits and we are dealing with minutes of transmission time so the receiver -can reconstruct the message through brute-force. Though this tradeoff between computation and data transmission might -seem inelegant it does work for our extremely slow link for very few bits. +three instead of zero) at no transmission overhead by transmitting the cryptographic signature as usual but simply +omitting the message. The message contains only a few bits of information and we are dealing with minutes of +transmission time so the receiver can reconstruct the message through brute-force. Though this trade-off between +computation and data transmission might seem inelegant it does work for our extremely slow link for up to a few bits of +information. There is an important limitation in the rules of our setup above: The attacker can always record the reset bit sequence the transmitter transmits and replay that same sequence later. Even without cryptography we can trivially prevent an @@ -1617,9 +1615,9 @@ is to reset them in the first place this should not pose a threat to the system' A possible scenario would be that an attacker first causes enough havoc for authorities to trigger a safety reset. The attacker would record the trigger transmission. We can assume most meters were reset during the attack. Due to this the attacker cannot cause a significant number of additional resets immediately afterwards. However, the attacker could -wait several years for a number of new meters to be installed. These new meters might not yet have updated firmware -including the lastest transmission. This means the attacker could cause them to reset by replaying the original -sequence. 
+wait several years for a number of new meters to be installed that might not yet have updated firmware that includes the +latest transmission. This means the attacker could cause them to reset by replaying the original sequence. +% TODO mention why firmware has to be updated with last transmission A possible mitigation for this risk would be to introduce one bit of information into the trigger message that is ignored by the replay protection mechanism. This \emph{enable} bit would be $1$ for the actual reset trigger message. @@ -1634,14 +1632,15 @@ length and by proxy system latency would be determined by the length of the sign modulus length (i.e. larger than \SI{1000}{bit} for very basic contemporary security). For elliptic curve-based systems curve length is approximately twice the security level and signature size is twice the curve length because two curve points need to be encoded\cite{anderson02}. For contemporary security this results in more than 300 bit transmission -length. Thanks to our unique setting we can do better than this. We can exploit that our effective message entropy is 0 -bit to derive a more efficient scheme. +length. We can exploit our unique setting's low message entropy to improve on this. + +% FIXME add some intro/background blurb here \subsubsection{Lamport signatures} In 1979, Lamport introduced a signature scheme\cite{lamport02} that is based only on a one-way function such as a cryptographic hash function. The basic observation is that by choosing a random secret input to a one-way function and -publishing the output, one can later prove knowledge of the input by simply publishing it. In the following paragraphs +publishing the output, one can later prove knowledge of the input simply by publishing it. In the following paragraphs we will describe a construction of a one-time signature scheme based on this observation. 
The scheme we describe is the one usually called a ``Lamport Signature'' in modern literature but is slightly different from the variant described in the 1979 paper. For our purposes we can consider both to be equivalent. @@ -1660,7 +1659,7 @@ entry of $P$ secret. k_{H(m)_i}$ of $S$ correctly evaluate to $p_{b, i} = H\left(s_{b, i}\right)$ from $P$ under $H$. The above scheme is a one-time signature scheme only. After one signature has been published for a given key, the -corresponding key must not be re-used for other signatures. This is intutively clear as we are effectively publishing +corresponding key must not be reüsed for other signatures. This is intuitively clear as we are effectively publishing part of the private key as the signature, and if we were to publish a signature for another message an attacker could derive additional signatures by ``mixing'' the two published signatures. @@ -1687,44 +1686,44 @@ H\left(\sigma_i\right)$ matching $m_i' = m_i + 1$, this scheme is usually paired \subsubsection{Using hash-based signatures for trigger authentication} -The most basic possible trigger authentication scheme would be to simply generate a random bit string secret key $s$ and -publish $p = H(s)$ for some hash function $H$. To activate the trigger, $\sigma = s$ is published and receivers verify -that $H(\sigma) = p = H(s)$. This simplistic scheme has one main disadvantage: It is a fundamentally one-time -construction. To prevent an attacker from re-triggering a receiver a second time by replaying a valid trigger $\sigma$ -all receivers have to blacklist any ``used'' $\sigma$. Alas, this means we can only ever trigger a receiver \emph{once}. -The good part is that any receiver that missed this trigger can still be triggered later, but the bad part is that once -$s$ is burned we are out of options. The trivial solution to this would be to simply inform each receiver with a whole -list of public keys in advance. 
This however takes $n$ times the amount of space for $n$-fold retriggerability and we -have to memorize separately for each one whether it has been used up. Luckily we can easily derive a scheme that yields -$n$-fold retriggerability and naturally memorizes replay state while using no more same space than the original scheme -by taking some inspiration from Winternitz signatures above. - -In this scheme the secret key $s$ is still a random bit string. The public key is $p = H^n(s)$ for $n$-times -retriggerability. The $i$-th time the trigger is activated, $\sigma_i = H^n-i(s)$ is published, and every receiver can -verify that $\sigma_{i-1} = H\left(\sigma_i\right)$ with $\sigma_0 = p$. In case a receiver missed one or more previous -triggers it continues computing $H\left(H\left(\sigma_i\right)\right)$ and -$H\left(H\left(H\left(\sigma_i\right)\right)\right)$ until either reaching the $n$-th recursion level (indicating an -invalid signature) or finding $H^n\left(\sigma_i\right) = \sigma_j$ with $sigma_j$ being the last signature this -receiver recorded, or $p$ in case there is none. - -This scheme provides replay protection through receiver memorizing the last signature they activated to. Public key -length is equal to the length of the hash function $H$ used. Even for our embedded systems use case $n$ can -realistically be up to $\mathcal O\left(10^3\right)$, which is easily enough for our purposes. - -The ``disarm'' message we discussed above can be integrated into this scheme by encoding the ``enable'' bit into the -least significant bit of $n$ in our $H^n$ construction. In the chain of valid signatures every second one would be a -disarm signature. Reset and disarm signatures would alternate in this scheme. By skipping a disarm signature two resets -can still be triggered directly after one another. - -In practice it may be useful to have some control over which particular meters reset. 
An attack exploiting a particular -network protocol implementation flaw might only affect one series of meters made by one manufacturer. Resetting -\emph{all} meters may be too much in this case. A simple solution for this is to define adressable subsets of meters. -``All meters'' along with ``meters made by manufacturer $x$'' and ``meters of model $y$'' are good choices for such -scopes. On the cryptographic level the protocol state is simply duplicated for each scope. This incurs memory and -computation overhead linear in the number of scopes. Device memory requirements are small at a few bytes only and -computation is of no concern due to the very slow channel so this simple solution is adequate. The transmitter has to -either store copies of all scope's keys or derive these keys from a root key using the scope's identifier. Keys are -small and the transmitter would be using a regular server or hardware security module so either easily feasible. +Applying these concepts the most basic trigger authentication scheme possible would be to simply generate a random +secret key bit string $s$ and publish $p = H(s)$ for some hash function $H$. To activate the trigger, $\sigma = s$ is +published and receivers verify that $H(\sigma) = p = H(s)$. This simplistic scheme has one main disadvantage: It is a +fundamentally one-time construction. To prevent an attacker from re-triggering a receiver a second time by replaying a +valid trigger $\sigma$ all receivers have to blacklist any ``used'' $\sigma$. Alas, this means we can only ever trigger +a receiver \emph{once}. The good part is that any receiver that missed this trigger can still be triggered later, but +the bad part is that once $s$ is burned we are out of options. The trivial solution to this would be to simply provision +each receiver with a whole list of public keys in advance. 
This however takes $n$ times the amount of space for $n$-fold +retriggerability and for each one we have to memorize separately whether it has been used up. Luckily we can easily +derive a scheme that yields $n$-fold retriggerability and naturally memorizes replay state while using no more space +than the original scheme by taking some inspiration from Winternitz signatures. + +In this improved scheme the secret key $s$ is still a random bit string. The public key is $p = H^n(s)$ for $n$-fold +retriggerability. The $i$-th time the trigger is activated, $\sigma_i = H^{n-i}(s)$ is published, and every receiver +can verify that $\sigma_{i-1} = H\left(\sigma_i\right)$ with $\sigma_0 = p$. In case a receiver missed one or more +previous triggers it continues computing $H\left(H\left(\sigma_i\right)\right)$ and +$H\left(H\left(H\left(\sigma_i\right)\right)\right)$ and so on until either reaching the $n$-th recursion +level--indicating an invalid signature--or finding $H^k\left(\sigma_i\right) = \sigma_j$ for some $k \le n$ with $\sigma_j$ being the last +signature this receiver recorded or $p$ in case there is none. + +This scheme provides replay protection since each receiver memorizes the last signature it acted on. Public key length +is equal to the output length of the hash function $H$ used. Even for our embedded systems use case $n$ can realistically be up +to $\mathcal O\left(10^3\right)$, which is enough for our purposes. + +The ``disarm'' message we discussed above for replay protection can be integrated into this scheme by encoding the +``enable'' bit into the least significant bit of $n$ in our $H^n$ construction. In the chain of valid signatures every +second one would be a disarm signature: Reset and disarm signatures would alternate in this scheme. By skipping a disarm +signature two resets can still be triggered directly after one another. + +In practice it may be useful to have some control over which meters reset. 
An attack exploiting a particular network +protocol implementation flaw might only affect one series of meters made by one manufacturer. Resetting \emph{all} +meters may be too much in this case. A simple solution for this is to define addressable subsets of meters. ``All +meters'' along with ``meters made by manufacturer $x$'' and ``meters of model $y$'' are good choices for such scopes. On +the cryptographic level the protocol state is simply duplicated for each scope. This incurs memory and computation +overhead linear in the number of scopes but device memory requirements are small at a few bytes only and computation is +of no concern due to the very slow channel so this simple solution is adequate. The transmitter has to either store +copies of all scopes' keys or derive these keys from a root key using the scope's identifier. Keys are small and the +transmitter would be using a regular server or hardware security module for key management so either is easily feasible. A diagram of the key structure in this key management scheme is shown in Figure \ref{fig:sig_key_chain}. The transmitter key management is shown in Figure \ref{fig:tx_scope_key_illu}. This scheme is simplistic but suffices for @@ -1732,6 +1731,7 @@ our prototype in Section \ref{sec-prototype} and may even be useful in a practic standardization of a safety reset system the key management system would most likely have to be customized to the particular application's requirements. Developing a universal solution is outside the scope of this work. % FIXME revisit this section - 2020-05-26 + \begin{figure} \centering \begin{minipage}[c]{0.5\textwidth} @@ -1743,7 +1743,7 @@ particular application's requirements. Developing an universal solution is outsi hash function. To generate a new chain a random transmitter key is generated, then hashed $n$ times to generate the corresponding device key. 
A new trigger message can be generated by generating the key at depth $m-1$ where $m$ is the height of the last used trigger, or $n$ initially. Every second trigger message is a - disarm message and every second one a reset message. Depending on which is needed the other one may be skipped. + disarm message and every second one a reset message. Depending on which is needed either one may be skipped. } \label{fig:sig_key_chain} \end{minipage} @@ -1753,15 +1753,15 @@ particular application's requirements. Developing an universal solution is outsi \centering \includegraphics{resources/transmitter_scope_key_illustration} \caption{ - An illustration of a key management system using a shared master key. The transmitter derives one secret key for - each adressable group from the master key. Then public device keys are generated like in Figure + An illustration of a key management system using a common master key. First, the transmitter derives one secret + key for each addressable group from the master key. Then public device keys are generated like in Figure \ref{fig:sig_key_chain}. Finally for each device the manufacturer picks the group public keys matching the device. In this example one device is a series A meter made by manufacturer B so it gets provisioned with the - keys for the ``all devices'', ``manufacturer B'' and ``series A'' keys. The other device is also made by + keys for the ``all devices'', ``manufacturer B'' and ``series A'' groups. The other device is also made by manufacturer B but is a series C device so it gets provisioned with the ``all devices'', ``manufacturer B'' and - ``series C'' public device keys. In this example the transmitter stores (or is able to derive) all six shown - group keys, but each device only needs to store the three applying to it for the three scopes ``all devices'', - ``manufacturer'' and ``series''. + ``series C'' device keys. 
In this example the transmitter stores (or is able to derive) all six shown + group keys, but each device only needs to store the three applying to it--one for each of the three scopes ``all + devices'', ``manufacturer'' and ``series''. } \label{fig:tx_scope_key_illu} \end{figure} @@ -1771,7 +1771,7 @@ particular application's requirements. Developing an universal solution is outsi To validate the practical feasibility of the theoretical concepts we laid out in the previous chapter we decided to build a prototype of a safety reset controller. In this section we describe the reasoning behind the components of this prototype and the engineering that went into its firmware. The prototype consists of a smart meter whose application -microcontroller is reset by a prototype reset controller on an external circuit board. We lay out how we extensively +microcontroller is reset by a microcontroller on an external circuit board. We lay out how we extensively tested all parts of our firmware implementation. We conclude with results of a practical end-to-end experiment exercising every part of our prototype. @@ -1785,49 +1785,48 @@ variable, as opposed to the frequency spectrum of mains voltage $V(t)$ itself). \subsection{Grid frequency estimation} \label{frequency_estimation} -In commercial power systems Phasor Measurement Units (PMUs) are used to precisely measure parameters of a mains voltage -waveform. One of the parameters PMUs measure is mains frequency. PMUs are used as part of SCADA systems controlling -transmission networks to characterize the operational state of the network. - -From a superficial viewpoint measuring mains frequency might seem like a simple problem. Take the mains voltage -waveform, measure time between two rising-edge (or falling-edge) zero-crossings and take the inverse $f = t^{-1}$. In -practice, phasor measurement units are significantly more complex than this. 
This discrepancy is due to the combination -of both high precision and quick response that is demanded from these units. High precision is necessary since -variations of mains frequency under normal operating conditions are quite small--in the range of -\SIrange{5}{10}{\milli\hertz} over short intervals of time. Relative to the nominal \SI{50}{\hertz} this is a derivation -of less than \SI{100}{ppm} Relative to the corresponding \SI{20}{\milli\second} period that means a time derivation of -about $2 \mu\text{s}$ from cycle to cycle. From this it is already obvious why a simplistic measurement cannot yield the -required precision for manageable averaging times--we would need either a ADC sampling rate in the order of megabits or -for a reconstruction through interpolated readings an impractically high ADC resolution. +In commercial power systems Phasor Measurement Units (PMUs, also called \emph{synchrophasors}) are used to precisely +measure parameters of the mains voltage waveform, one of which is grid frequency. PMUs are used as part of SCADA systems +controlling transmission networks to characterize the operational state of the network. + +From a superficial viewpoint measuring grid frequency might seem like a simple problem. Take the mains voltage waveform, +measure time between two rising-edge (or falling-edge) zero-crossings and take the inverse $f = t^{-1}$. In practice, +phasor measurement units are significantly more complex than this. This discrepancy is due to the combination of both +high precision and quick response that is demanded from these units. High precision is necessary since variations of +mains frequency under normal operating conditions are quite small--in the range of \SIrange{5}{10}{\milli\hertz} over +short intervals of time. Relative to the nominal \SI{50}{\hertz} this is a derivation of less than \SI{100}{ppm}. 
+Relative to the corresponding period of \SI{20}{\milli\second} this means a time derivation of about $2 \mu\text{s}$ +from cycle to cycle. From this it is already obvious why a simplistic measurement cannot yield the required precision +for manageable averaging times: We would need either an ADC sampling rate in the order of megabits per second or for a +reconstruction through interpolated readings an impractically high ADC resolution. Detail on the inner workings of commercial phasor measurement units is scarce but given their essential role to SCADA systems there is a large amount of academic research on such algorithms\cite{narduzzi01,derviskadic01,belega01}. A popular approach to these systems is to perform a Short-Time Fourier Transform (STFT) on ADC data sampled at high -sampling rate (e.g. \SI{10}{\kilo\hertz}) and then perform some analysis on the frequency-domain data to precisely -locate the strong peak around \SI{50}{\hertz}. A key observation here is that FFT bin size is going to be much larger -than required frequency resolution. This fundamental limitiation follows from the nyquist criterion %FIXME maybe cite? -and if we had to process an \emph{arbitrary} signal this would highly limit our practical measurement accuracy +sampling rate (e.g. \SI{10}{\kilo\hertz}) and then perform analysis on the frequency-domain data to precisely locate the +peak at \SI{50}{\hertz}. A key observation here is that FFT bin size is going to be much larger than required frequency +resolution. This fundamental limitiation follows from the Nyquist criterion %FIXME cite DSP text +and if we had to process an \emph{arbitrary} signal this would severely limit our practical measurement accuracy \footnote{ Some software packages providing FFT or STFT primitives such as scipy\cite{virtanen01} allow the user to super-sample FFT output by specifying an FFT width larger than input data length, padding the input data with zeros - on both sides. 
Note that in line with Nyquist this \emph{does not} actually provide finer output resolution but - instead just amounts to an interpolation between output bins. Depending on the downstream analysis algorithm it may - still be sensible to use this property of the DFT for interpolation, but in general it will be computationally - expensive compared to other interpolation methods and in any case it will not yield any better frequency resolution - aside from a hypothetical numerical advantage\cite{gasior02}. + on both sides. Note that in line with the Nyquist theorem this \emph{does not} actually provide finer output + resolution but instead just amounts to an interpolation between output bins. Depending on the downstream analysis + algorithm it may still be sensible to use this property of the DFT for interpolation, but in general it will be + computationally expensive compared to other interpolation methods and in any case it will not yield any better + frequency resolution aside from a potential numerical advantage\cite{gasior02}. }. -For this reason all approaches to mains frequency estimation are based on a model of the mains voltage waveform. -Nominally, this waveform would be a perfect sine at $f=\SI{50}{\hertz}$. In practice it is a sine at -$f\approx\SI{50}{\hertz}$ superimposed with some aperiodic noise (e.g. irregular spikes from inductive loads being -energized) as well as harmonic distortion that is caused by grid-topologically nearby devices with power factor -$\cos \theta \neq 1.0$. Under a continous fourier transform over a long period the frequency spectrum of a signal -distorted like this will be a low noise floor depending mainly on aperiodic noise on which a comb of harmonics as well -as some sub-harmonics of $f \approx f_\text{nom} = \SI{50}{\hertz}$ rides. 
The main peak at $f \approx f_\text{nom}$ -will be very strong with the harmonics being approximately an order of magnitude weaker in energy and the noise floor -being at least another order of magnitude weaker. See Figure \ref{mains_voltage_spectrum} for a measured spectrum. This -domain knowledge about the expected frequency spectrum of the signal can be employed in a number of interpolation -techniques to re-construct the precise frequency of the spectrum's main component despite comparatively coarse STFT -resolution and despite numerous distortions. +For this reason all approaches to grid frequency estimation are based on a model of the voltage waveform. Nominally +this waveform is a perfect sine at $f=\SI{50}{\hertz}$. In practice it is a sine at $f\approx\SI{50}{\hertz}$ +superimposed with some aperiodic noise (e.g. irregular spikes from inductive loads being energized) as well as harmonic +distortion that is caused by topologically nearby devices with power factor $\cos \theta \neq 1.0$. Under a continous +fourier transform over a long period the frequency spectrum of a signal distorted like this will be a low noise floor +depending mainly on aperiodic noise on which a comb of harmonics as well as some sub-harmonics of $f \approx +f_\text{nom} = \SI{50}{\hertz}$ is riding. The main peak at $f \approx f_\text{nom}$ will be very strong with the +harmonics being approximately an order of magnitude weaker in energy and the noise floor being at least another order of +magnitude weaker. See Figure \ref{mains_voltage_spectrum} for a measured spectrum. This domain knowledge about the +expected frequency spectrum of the signal can be employed in a number of interpolation techniques to reconstruct the +precise frequency of the spectrum's main component despite distortions and the comparatively coarse STFT resolution. 
Published grid frequency estimation algorithms such as \cite{narduzzi01,derviskadic01} are rather sophisticated and use a combination of techniques to reduce numerical errors in FFT calculation and peak fitting. Given that we do not need @@ -1836,7 +1835,7 @@ use a general approach to estimate the precise fundamental frequency of an arbit experimental physicists Gasior and Gonzalez at CERN\cite{gasior01}. This approach assumes a general sinusoidal signal superimposed with harmonics and broadband noise. Applicable to a wide spectrum of practical signal analysis tasks it is a reasonable first-degree approximation of the much more sophisticated estimation algorithms developed specifically for -power systems. Some algorithms have components such as kalman filters\cite{narduzzi01} that require a phyiscal model. +power systems. Some algorithms use components such as kalman filters\cite{narduzzi01} that require a phyiscal model. As a general algorithm \cite{gasior01} does not require this kind of application-specific tuning, eliminating one source of error. @@ -1855,9 +1854,9 @@ that more complex perform worse when the input signal deviates from their models \label{sec-fsensor} Our safety reset controller will have to measure mains frequency to later demodulate a reset signal transmitted through -it. Since we have decided to do our own frequency measurement system here we can use this frequency measurement setup as -a prototype for the frequency measurement subcomponent of the demodulation system we will later develop. Since we do not -plan to do a large-scale field deployment of our measurement setup we can keep the hardware implementation simple by +it. Since we have decided to do our own frequency measurement system here we can reüse this frequency measurement setup +as a prototype for the frequency measurement component of the demodulation system we will develop later. 
Since we do +not plan to do a large-scale field deployment of our measurement setup we can keep the hardware implementation simple by moving most of the signal processing to a regular computer and concentrating our hardware efforts on raw signal capture. \begin{figure} @@ -1869,14 +1868,14 @@ moving most of the signal processing to a regular computer and concentrating our component/.style = {base, rectangle, text width=40mm}, coord/.style = {coordinate, on chain, on grid, node distance=6mm and 25mm} } - \node[text centered] (input) {Single-Phase Mains Input}; - \node[component] (safety) [below = of input] {Input Protection}; + \node[text centered] (input) {Single phase mains input}; + \node[component] (safety) [below = of input] {Input protection}; \node[coord] (safety-anchor) [below = of safety] {}; - \node[component] (analog) [below = of safety-anchor] {Analog Signal Processing}; + \node[component] (analog) [below = of safety-anchor] {Analog signal processing}; \node[component] (powersupply) [left = of analog] {Power supply}; \node[component] (adc) [below = of analog] {ADC}; \node[component] (micro) [below = of adc] {Microcontroller}; - \node[component] (isol) [below = of micro] {Galvanic Digital Isolation}; + \node[component] (isol) [below = of micro] {Galvanic digital isolation}; \node[coord] (isol-left) [left = 6cm of isol.west] {}; \node[coord] (isol-right) [right = 1cm of isol.east] {}; \node[component] (usb) [below = of isol] {USB interface}; @@ -1896,17 +1895,17 @@ moving most of the signal processing to a regular computer and concentrating our \draw[dashed] (isol.east) -- (isol-right.west); \end{tikzpicture} \end{center} - \caption{Frequency sensor hardware diagram.} + \caption{Frequency sensor hardware block diagram.} \label{fmeas-sens-diag} \end{figure} -An overall block diagram of our system is shown in Figure \ref{fmeas-sens-diag}. The mircrocontroller we chose is an -\texttt{STM32F030F4P6} ARM Cortex-M0 microcontroller made by ST Microelectronics. 
The ADC in Figure -\ref{fmeas-sens-diag} in our design is the integrated 12-bit ADC of this microcontroller, which is sufficient for our -purposes. The USB interface is a simple USB to serial converter IC (\texttt{CH340G}) and the galvanic digital isolation -is accomplished with a pair of high-speed optocouplers on its \texttt{RX} and \texttt{TX} lines. The analog signal -processing is a simple voltage divider using high-power resistors to get the required creepage along with some -high-frequency filter capacitors and an op-amp buffer. The power supply is an off-the-shelf mains-input power module. +An overall block diagram of our system is shown in Figure \ref{fmeas-sens-diag}. The microcontroller we chose is an +\texttt{STM32F030F4P6} ARM Cortex M0 microcontroller made by ST Microelectronics. The ADC in Figure +\ref{fmeas-sens-diag} in our implementation is the integrated 12-bit ADC of this microcontroller, which is sufficient +for our purposes. The USB interface is a simple USB to serial converter IC (\texttt{CH340G}) and the galvanic digital +isolation is accomplished with a pair of high speed optocouplers on its \texttt{RX} and \texttt{TX} lines. The analog +signal processing is a simple voltage divider using high power resistors to get the required creepage along with some +high frequency filter capacitors and an op-amp buffer. The power supply is an off-the-shelf mains-input power module. The system is implemented on a single two-layer PCB that is housed in an off-the-shelf industrial plastic case fitted with a printed label and a few status lights on its front. @@ -1914,28 +1913,28 @@ with a printed label and a few status lights on its front. Our measurement hardware will sample line voltage at some sampling rate $f_S$, e.g.\ \SI{1}{\kilo\hertz}. All downstream processsing is limited in accuracy by the accuracy of $f_S$\footnote{ -We are not considering the effects of clock jitter. 
We are highly oversampling the signal and the FFT done in our -downstream processing will eliminate small jitter effects leaving only frequency stability to worry about. }. We +We are not considering the effect of clock jitter. We are highly oversampling the signal and the FFT done in our +downstream processing will average out small jitter effects leaving only frequency stability to worry about. }. We generate our sampling clock in hardware by clocking the ADC from one of the microcontroller's timer blocks clocked from the microcontroller's system clock. This means our ADC's sampling window will be synchronized cycle-accurate to the microcontroller's system clock. -Our downstream measurement of mains frequency by nature is relative to our sampling frequency $f_S$. In the setup -described above this means we have to make sure our system clock is fairly stable. A frequency derivation of \SI{1}{ppm} -in our system clock causes a proportional grid frequency measurement error of $\Delta f = f_\text{nom} \cdot -10^{-6} = \SI{50}{\micro\hertz}$. In a worst-case where our system is clocked from a particularly bad crystal that exhibits -\SI{100}{ppm} of instabilities over our measurement period we end up with an error of \SI{5}{\milli\hertz}. This is well -within our target measurement range, so we need a more stable clock source. Ideally we want to avoid writing our own -clock conditioning code where we try to change an oscillators operating frequency to match some reference. Clock -conditioning algorithms are highly complex and in our case post-processing of measurement data and simply adding and -offset is simpler and less error-prone. +Our downstream estimation of mains frequency by nature is relative to our sampling frequency $f_S$. In the setup +described above this means we have to make sure our system clock is stable. 
A frequency deviation of \SI{1}{ppm} in our +system clock causes a proportional grid frequency measurement error of $\Delta f = f_\text{nom} \cdot 10^{-6} = +\SI{50}{\micro\hertz}$. In a worst-case scenario where our system is clocked from a particularly bad crystal that +exhibits \SI{100}{ppm} of instabilities over our measurement period we end up with an error of \SI{5}{\milli\hertz}. +This is well within our target measurement range, so we need a more stable clock source. Ideally we want to avoid +writing our own clock conditioning code where we try to change an oscillators operating frequency to match some +reference. Clock conditioning algorithms are complex\cite{ti01} and in our case post processing of measurement data and +simply adding an offset is simpler and less error-prone. Our solution to these problems is to use a crystal oven\footnote{ - A crystal oven is a crystal oscillator thermally coupled closely to a heater and temperature sensor and enclosed in - a thermally isolated case. The heater is controlled to hold the crystal oscillator at a near-constant temperature - some few ten degrees above ambient. Any ambient temperature variations will be absorbed by the temperature control. - This yields a crystal frequency that is almost completely unaffected by ambient temperature variations below the - oven temperature and whose main remaining instability is aging. + A crystal oven is a crystal oscillator closely thermally coupled to a heater and temperature sensor and enclosed in + a thermally isolated case. The heater is controlled to hold the crystal oscillator at a near constant temperature + some tens of degrees Celsius above ambient temperature. Ambient temperature variations will be absorbed by the + temperature control. This yields a crystal frequency that is almost completely unaffected by ambient temperature + variations below the oven temperature and whose main remaining instability is aging. }as our main system clock source. 
Crystal ovens are expensive compared to ordinary crystal oscillators. Since any crystal oven will be much more accurate than a standard room-temperature crystal we chose to reduce cost by using one recycled from old telecommunications equipment. @@ -1956,7 +1955,8 @@ and ADC resolution. \begin{figure} \centering \includegraphics{../lab-windows/fig_out/ocxo_freq_stability} - \caption{OCXO Frequency derivation from nominal \SI{19.440}{\mega\hertz} measured against GPS 1pps.} + \caption{OCXO Frequency derivation from its nominal \SI{19.440}{\mega\hertz} frequency measured against a GPS + receiver's 1pps reference output.} \label{ocxo_freq_stability} \end{figure} @@ -1965,48 +1965,48 @@ and ADC resolution. The firmware uses one of the microcontroller's timers clocked from an external crystal oscillator to produce an \SI{1}{\milli\second} tick that the internal ADC is triggered from for a sample rate of \SI{1}{\kilo sps}. Higher sample rates would be possible but reliable data transmission over the opto-isolated serial interface might prove challenging -and \SI{1}{\kilo sps} corresponds to $20$ samples per cycle at $f_\text{nominal}$. This figure exceeds the nyquist -criterion by a factor of ten and is be plenty for accurate measurements. +and \SI{1}{\kilo sps} already corresponds to $20$ samples per cycle at $f_\text{nominal}$. This figure exceeds the +Nyquist criterion by a factor of ten and is plenty for accurate measurements. -The ADC measurements are read using DMA and written into a circular buffer. Using some DMA controller features this +The ADC measurements are read using DMA and written into a circular buffer. Using DMA controller features this circular buffer is split in back and front halves with one being written to and the other being read at the same time. Buffer contents are moved from the ADC DMA buffer into a packet-based reliable UART interface as they come in. 
The UART -packet interface keeps two ringbuffers: One byte-based ringbuffer for transmission data and one ringbuffer pointer -structure that keeps track of ADC data packet boundaries in the byte-based ringbuffer. Every time a chunk of data is -available from the ADC the data is framed into the byte-based ringbuffer and the packet boundaries are logged in the -packet pointer ringbuffer. If the UART transmitter is idle at this time a DMA-backed transmission of the oldest packet -in the packet ringbuffer is triggered at this point. Data is framed using Consistent Overhead Byte Stuffing +packet interface keeps two ring buffers: One byte-based ring buffer for transmission data and one ring buffer pointer +structure that keeps track of ADC data packet boundaries in the byte-based ring buffer. Every time a chunk of data is +available from the ADC the data is framed into the byte-based ring buffer and the packet boundaries are logged in the +packet pointer ring buffer. If the UART transmitter is idle at this time a DMA-backed transmission of the oldest packet +in the packet ring buffer is triggered at this point. Data is framed using Consistent Overhead Byte Stuffing (COBS)\footnote{ -COBS is a framing technique that allows encoding $n$ bytes of arbitray data into exactly $n+1$ bytes with no embedded -$0$-bytes that can then be delimited using $0$-bytes. COBS is simple to implement and allows both one-pass decoding and +COBS is a framing technique that allows encoding $n$ bytes of arbitrary data into exactly $n+1$ bytes with no embedded +$0$ bytes that can then be delimited using $0$ bytes. COBS is simple to implement and allows both one pass decoding and encoding. The encoder either needs to be able to read up to \SI{256}{\byte} ahead or needs a buffer of \SI{256}{\byte}. COBS is very robust in that it allows self-synchronization. At any point a receiver can reliably synchronize itself -against a COBS data stream by waiting for the next $0$-byte. 
The constant overhead allows precise bandwidth and buffer +against a COBS data stream by waiting for the next $0$ byte. The constant overhead allows precise bandwidth and buffer planning and provides constant, good efficiency close to the theoretical maximum.}\cite{cheshire01} along with a CRC-32 checksum for error checking. When the host receives a new packet with a valid checksum it returns an acknowledgement packet to the sensor. When the sensor receives the acknowledgement, the acknowledged packet is dropped -from the transmission packet ringbuffer. When the host detects an incorrect checksum it simply stays quiet and waits for +from the transmission packet ring buffer. When the host detects an incorrect checksum it simply stays quiet and waits for the sensor to resume with retransmission when the next ADC buffer has been received. The serial interface logic presents most of the complexity of the sensor firmware. This complexity is necessary since we need reliable, error-checked transmission to the host. Though rare, bit errors on a serial interface do happen and -data corruption is unacceptable. The packet-layer queueing on the sensor is necessary since the host is not a realtime +data corruption is unacceptable. The packet layer queueing on the sensor is necessary since the host is not a realtime system and unpredictable latency spikes of several hundred milliseconds are possible. The host in our recording setup is a Raspberry Pi 3 model B running a Python script. The Python script handles serial communication and logs data and errors into an SQLite database file. SQLite has been chosen for its simple yet flexible interface and its good tolerance of system resets due to unexpected power loss. Overall our setup performed adequately -with IO contention on the raspberry PI/linux side causing only 16 skipped sample packets over a 68-hour recording span. 
+with IO contention on the Raspberry PI/Linux side causing only 16 skipped sample packets over a 68 hour recording span. \subsection{Frequency sensor measurement results} Captured raw waveform data has been processed in the Jupyter Lab environment\cite{kluyver01} and grid frequency -estimates are extracted as described in sec. \ref{frequency_estimation} using the Gasior and Gonzalez\cite{gasior01} -technique. Appendix \ref{grid_freq_estimation_notebook} contains the Jupyter notebook we used for frequency -measurement. In Figure \ref{freq_meas_feedback} we fed back to the frequency estimator its own output giving us an -indication of its numerical performance. The result was \SI{1.3}{\milli\hertz} of RMS noise over a \SI{3600}{\second} -simulation time. This indicates performance is good enough for our purposes. In addition to this we validated our -algorithm's performance by applying it to the test waveforms from \cite{wright01}. In this test we got errors of +estimates are extracted as described in Section \ref{frequency_estimation} using the Gasior and Gonzalez\cite{gasior01} +technique. The Jupyter notebook we used for frequency measurement is included with the supplementary materials to this +thesis. In Figure \ref{freq_meas_feedback} we fed back to the frequency estimator its own output giving us an indication +of its numerical performance. The result was \SI{1.3}{\milli\hertz} of RMS noise over a \SI{3600}{\second} simulation +time. This indicates performance is good enough for our purposes. In addition to this we validated our algorithm's +performance by applying it to the test waveforms from \cite{wright01}. In this test we got errors of \SI{4.4}{\milli\hertz} for the \emph{noise} test waveform, \SI{0.027}{\milli\hertz} for the \emph{interharmonics} test waveform and \SI{46}{\milli\hertz} for the \emph{amplitude and phase step} test waveform. Full results can be found in Figure \ref{freq_meas_rocof_reference}. @@ -2022,8 +2022,8 @@ window respectively. 
output. This feedback simulation gives an indication of numerical errors in our estimation algorithm. The top four graphs show a comparison of the original trace (blue) and the re-calculated trace (orange). The bottom trace shows the difference between the two. As we can tell both traces agree very well with an overall RMS - deviation of about \SI{1.3}{\milli\hertz}. The bottom trace shows deviation growing over time. This is very - likely an effect of numerical errors in our ad-hoc waveform generator. + deviation of about \SI{1.3}{\milli\hertz}. The bottom trace shows deviation growing over time. This is an effect + of numerical errors in our ad-hoc waveform generator. } \label{freq_meas_feedback} \end{figure} @@ -2032,8 +2032,8 @@ window respectively. \centering \includegraphics[width=\textwidth]{../lab-windows/fig_out/freq_meas_rocof_reference} \caption{ - Performance of our frequency estimation algorithm against the test suite specified in \cite{wright01}. Shown are - standard deviation and variance measurements as well as time-domain traces of differences. + Performance of our frequency estimation algorithm under the test suite specified in \cite{wright01}. Shown are + standard deviation and variance measurements as well as time-domain traces of absolute differences. } \label{freq_meas_rocof_reference} \end{figure} @@ -2041,8 +2041,8 @@ window respectively. \begin{figure} \centering \includegraphics{../lab-windows/fig_out/freq_meas_trace_24h} - \caption{Trace of grid frequency over a 24 hour window. One clearly visible feature are large positive and negative - transients at full hours. Times shown are UTC. Note that the european continental synchronous area that this + \caption{Trace of grid frequency over a 24 hour time span. One clearly visible feature are large positive and negative + transients at full hours. Times shown are UTC. 
Note that the European continental synchronous area that this sensor is placed in covers several time zones which may result in images of daily load peaks appearing in 1 hour intervals. Figure \ref{freq_meas_trace_mag} contains two magnified intervals from this plot.} \label{freq_meas_trace} @@ -2052,12 +2052,12 @@ window respectively. \begin{subfigure}{\textwidth} \centering \includegraphics{../lab-windows/fig_out/freq_meas_trace_2h_1} - \caption{A 2 hour window around 00:00 UTC.} + \caption{A 2 hour window centered on 00:00 UTC.} \end{subfigure} \begin{subfigure}{\textwidth} \centering \includegraphics{../lab-windows/fig_out/freq_meas_trace_2h_2} - \caption{A 2 hour window around 18:30 UTC.} + \caption{A 2 hour window centered on 18:30 UTC.} \end{subfigure} \caption{Two magnified 2 hour windows of the trace from Figure \ref{freq_meas_trace}.} \label{freq_meas_trace_mag} @@ -2067,10 +2067,10 @@ window respectively. \centering \includegraphics{../lab-windows/fig_out/mains_voltage_spectrum} \caption{Power spectral density of the mains voltage trace in Figure \ref{freq_meas_trace}. Data was captured using - our frequency measurement sensor (\ref{sec-fsensor}) and FFT'ed after applying a blackman window. Vertical lines - indicate \SI{50}{\hertz} and odd harmonics. We can see the expected peak at \SI{50}{\hertz} along with smaller - peaks at odd harmonics. We can also see a number of spurious tones both between harmonics and at low frequencies, as - well as some bands containing high noise energy around \SI{0.1}{\hertz}. This graph demonstrates a high + our frequency measurement sensor (\ref{sec-fsensor}) and FFT-processed after applying a blackman window. The + vertical lines indicate \SI{50}{\hertz} and odd harmonics. We can see the expected peak at \SI{50}{\hertz} along + with smaller peaks at odd harmonics. We can also see a number of spurious tones both between harmonics and at low + frequencies. 
We can also see bands containing high noise energy around \SI{0.1}{\hertz}. This graph shows a high signal-to-noise ratio that is not very demanding on our frequency estimation algorithm. } \label{mains_voltage_spectrum} @@ -2080,25 +2080,25 @@ window respectively. \label{sec-ch-sim} To validate all layers of our communication stack from modulation scheme to cryptography we built a prototype -implementation in python. Implementing all components in a high-level language builds up familiartiy with the concepts -while taking away much of the implementation complexity. For our demonstrator we will not be able to use python since -our target platform is a cheap low-end microcontroller. Our demonstrator firmware will have to be written in a low-level -language such as C or rust. For prototyping these languages lack flexibility compared to python. +implementation in Python. Implementing all components in a high level language builds up familiartiy with the concepts +while taking away much of the implementation complexity. For our demonstrator we will not be able to use Python since +our target platform is an inexpensive low-end microcontroller. Our demonstrator firmware will have to be written in a +low-level language such as C or Rust. For prototyping these languages lack flexibility compared to Python. -To validate our modulation scheme we first performed a series of simulations on our python demodulator prototype +To validate our modulation scheme we first performed a series of simulations on our Python demodulator prototype implementation. To simulate a modulated grid frequency signal we added noise to a synthetic modulation signal. For most simulations we used measured frequency data gathered with our frequency sensor. We only have a limited amount of capture -data. 
Re-using segements of this data as background noise in multiple simulation runs could hypothetically lead to our -simulation results depending on individual features of this particular capture that would be common between all runs. To -estimate the impact of this problem we re-ran some of our simulations with artificial random noise synthesized with a -power spectral density matching that of our capture. To do this, we first measured our capture's PSD, then fitted a +data. Re-using segements of this data as background noise in multiple simulation runs could lead to our simulation +results depending on individual features of this particular capture that would be common between all runs. To estimate +the impact of this problem we re-ran some of our simulations with artificial random noise synthesized with a power +spectral density matching that of our capture. To do this, we first measured our capture's PSD, then fitted a low-resolution spline to the PSD curve in log-log coördinates. We then generated white noise, multiplied the resampled spline with the DFT of the synthetic noise and performed an iDFT on the result. The resulting time-domain signal is our synthetic grid frequency data. Figure \ref{freq_meas_spectrum} shows the PSD of our measured grid frequency signal. The red line indicates the low-resolution log-log spline interpolation used for shaping our artificial noise. Figure \ref{simulated_noise_spectrum} shows the PSD of our simulated signal overlayed with the same spline as a red line and shows time-domain traces of both simulated (blue) and reference signals (orange) at various time scales. Visually both -signals look very similar, suggesting we have found a good synthetic approximation of our measurements. +signals look very similar, suggesting that we have found a good synthetic approximation of our measurements. 
\begin{figure} \centering @@ -2128,19 +2128,19 @@ In our simulations, we manipulated four main variables of our modulation scheme their impact on symbol error rate (SER): \begin{description} - \item[Modulation amplitude.] Higher amplitude should correspond to a lower SER. + \item[Modulation amplitude.] Higher amplitude corresponds to a lower SER. \item[Modulation bit count.] Higher bit count $n$ means longer transmissions but yields higher theoretical decoding gain, and should increase demodulator sensitivity. Ultimately, we want to find a sweet spot of manageable transmission length at good demodulator sensitivity. - \item[Decimation.] or DSSS chip duration. The chip time determines where in the grid frequency spectrum (Figure - \ref{freq_meas_spectrum} our modulated signal is located. Given our noise spectrum (Figure + \item[Decimation or DSSS chip duration.] The chip time determines where in the grid frequency spectrum (Figure + \ref{freq_meas_spectrum}) our modulated signal is located. Given our noise spectrum (Figure \ref{freq_meas_spectrum}) lower chip durations (shifting our signal upwards in the spectrum) should yield lower in-band background noise which should correspond to lower symbol error rates. \item[Demodulation correlator peak threshold factor.] The first step of our prototype demodulation algorithm is to - calculate the correlation between all $2^n+1$ Gold sequences - and to identify peaks corresponding to the input data containing a correctly aligned Gold sequence. The - threshold factor is a factor peaks of what magnitude compared to baseline noise levels are considered in the - following maximum likelihood estimation (MLE) decoding (cf.\ Figure \ref{fig_demo_sig_schema}). + calculate the correlation between all $2^n+1$ Gold sequences and our signal and to identify peaks corresponding + to the input data containing a correctly aligned Gold sequence. 
The threshold factor determines peaks of which + magnitude compared to baseline noise levels are considered in the following maximum likelihood estimation (MLE) + decoding (cf.\ Figure \ref{fig_demo_sig_schema}). \end{description} Our results indicate that symbol error rate is a good proxy of demodulation performance. With decreasing signal-to-noise @@ -2153,16 +2153,16 @@ monotonically to the signal-to-noise margins inside our demodulator prototype. A basic parameter of our DSSS modulation is the length of the Gold codes used. The length of a Gold code is exponential in the code's bit count. Figure \ref{dsss_gold_nbits_overview} shows a plot of the symbol error rate of our demodulator prototype depending on amplitude for each of five, six, seven and eigth-bit Gold sequences. In regions where symbol -error rate is between $0$ and $1$ we can see the expected dependency that a $n+1$ bit Gold sequence at roughly twice -the length yields roughly one half the SER. We can also observe a saturation effect: At low amplitudes, increasing the -correlation length does not seem to yield much of a benefit in SER anymore. In particular there seems to be a level of -about \SI{2.5}{\milli\hertz} signal amplitude where even with asymptotically infinite sequence length our demodulator -would still not be able to produce a good demodulation. This is likely due to numerical errors in our demodulator. Since -Gold codes of more than 7 bit would yield unacceptably long transmission times this does not pose a problem in practice. - -Figure \ref{dsss_gold_nbits_sensitivity} for each bit count shows the minimum signal amplitude where our demodulator +error rate is neither clipping at $0$ nor at $1$ we can see the expected dependency that a $n+1$ bit Gold sequence at +roughly twice the length yields roughly one half the SER. We can also observe a saturation effect: At low amplitudes, +increasing the correlation length does not yield much benefit in SER anymore. 
In particular at a signal amplitude of +\SI{2.5}{\milli\hertz} even with asymptotically infinite sequence length our demodulator would still not be able to +produce a good demodulation. This is likely due to numerical errors in our demodulator. Since Gold codes of more than 7 +bit would yield unacceptably long transmission times this does not pose a problem in practice. + +Figure \ref{dsss_gold_nbits_sensitivity} for each bit count shows the minimum signal amplitude at which our demodulator crossed below $\text{SER}=0.5$. If we have sufficient transmitter power to allocate selecting either a 5 bit or a 6 bit -gold code looks to yield good enough performance at manageable data rates. +Gold code yields sufficient performance at manageable data rates. \begin{figure} \centering @@ -2176,7 +2176,7 @@ gold code looks to yield good enough performance at manageable data rates. simulation uses a decimation of 10, which corresponds to an $1 \text{s}$ chip length at our $10 \text{Hz}$ grid frequency sampling rate. At 5 bit per symbol, one symbol takes $31 \text{s}$ and one bit takes $6.2 \text{s}$ amortized. At 8 bit one symbol takes $255 \text{s} = 4 \text{min} 15 \text{s}$ and one bit takes $31.9 \text{s}$ - amortized. Here, slower transmission speed buys coding gain. All else being the same this allows for a decrease + amortized. Here, slower transmission speed buys coding gain. All else being equal this allows for a decrease in transmission power. } \label{dsss_gold_nbits_overview} @@ -2189,10 +2189,10 @@ gold code looks to yield good enough performance at manageable data rates. \end{minipage} \begin{minipage}[c]{0.45\textwidth} \caption{ - Amplitude at a SER of 0.5\ in mHz depending on symbol length. Here we can observe an increase of sensitivity + Amplitude at an SER of 0.5\ in mHz depending on symbol length. Here we can observe an increase of sensitivity with increasing symbol length, but we can clearly see diminishing returns above 6 bit (63 chips). 
Considering that each bit roughly doubles overall transmission time for a given data length it seems lower bit counts are - preferable if the necessary transmitter power can be realized. + preferable if the required transmitter power can be realized. } \label{dsss_gold_nbits_sensitivity} \end{minipage} @@ -2200,27 +2200,28 @@ gold code looks to yield good enough performance at manageable data rates. \subsection{Sensitivity versus peak detection threshold factor} -One of the high-level parameters of our demodulation algorithm is the \emph{threshold factor}. This parameter is +One of the high level parameters of our demodulation algorithm is the \emph{threshold factor}. This parameter is an implementation detail specific to our algorithm and not general to all possible DSSS demodulation algorithms. After -correlating the input signal against the template Gold sequences our algorithm runs a single-channel discrete wavelet +correlating the input signal against the template Gold sequences our algorithm runs a single channel discrete wavelet transform (DWT) on the correlator output to better discriminate peaks from background noise. The output of this DWT is then normalized against a running average and then fed into a simple threshold detector. The threshold of this detector is our threshold factor. This threshold is the ratio that a correlation peak after DWT has to stand out from long-term average background noise to be considered a peak. -The threshold factor is an empirically-determined parameter Low threshold factors yield many false positives that in the -extreme ultimately overload our MLE estimator's capacity to discard them. Moderate numbers of false positive do not pose -much of a challenge to our MLE since these spurious peaks have a random time distribution and are easily discarded by -our MLE's symbol chain detection. High threshold factors lead the algorithm to completely ignore some valid peaks. 
To -some degree this can be compensated by our later interpolation step for missing peaks but in the extreme will also break -demodulation. In our simulations good values lie in the range from $4.0$ to $5.5$. +The threshold factor is an empirically determined unitless parameter. Low threshold factors yield many false positives +that in the extreme ultimately overload our MLE estimator's capacity to discard them. Moderate numbers of false +positives do not pose much of a challenge to our MLE since these spurious peaks have a random time distribution and are +easily discarded by our MLE's detection of sequences of equally-spaced symbols. High threshold factors lead the +algorithm to completely ignore some valid peaks. To some degree this can be compensated by our later interpolation step +for missing peaks but in the extreme will also break demodulation. In our simulations good values lie in the range from +$4.0$ to $5.5$. Figure \ref{dsss_thf_amplitude_5678} contains plots of demodulator sensitivity like the one in Figure \ref{dsss_gold_nbits_overview}. This time there is one color-coded trace for each threshold factor between $1.5$ and $10.0$ in steps of $0.5$. We can see a clear dependence of demodulation performance on threshold factor with both very -low and very high values breaking the demodulator. The ``runaway'' traces that we can see at low threshold factors are +low and very high values breaking the demodulator. The runaway traces that we can see at low threshold factors are artifacts of an implementation issue with our prototype code. We later fixed this issue in the demonstrator firmware -implementation in Section \ref{sec-demo-fw-impl}. For comparison purposes this issue does not matter. +in Section \ref{sec-demo-fw-impl}. For comparison purposes this issue does not matter. 
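The role of the threshold factor can be illustrated with a small Python sketch of a ratio-based peak detector. It is only a sketch under stated assumptions and not the prototype's or the firmware's actual code. The function name, the trailing averaging window and its length of 64 samples are hypothetical, and the input is assumed to be the already computed DWT output of the correlator.

```python
def detect_peaks(samples, threshold_factor=4.5, avg_window=64):
    """Return indices where a sample stands out from the running average.

    samples: DWT output of the correlator (assumed to be given).
    threshold_factor: required ratio of peak magnitude to baseline.
    avg_window: length of the trailing running average (hypothetical value).
    """
    mags = [abs(x) for x in samples]
    peaks = []
    window_sum = 0.0
    for i, mag in enumerate(mags):
        # Maintain a trailing running sum over the last avg_window samples.
        window_sum += mag
        if i >= avg_window:
            window_sum -= mags[i - avg_window]
        baseline = window_sum / min(i + 1, avg_window)
        # Normalize against the baseline and compare with the threshold.
        if baseline > 0 and mag / baseline > threshold_factor:
            peaks.append(i)
    return peaks
```

With a low `threshold_factor` more noise samples pass the comparison and end up as false positives. With a high value even genuine correlation peaks are dropped, which matches the behavior at both extremes described above.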
\begin{figure} \centering @@ -2258,8 +2259,8 @@ duration is specified in grid frequency sampling periods to ease implementation Figure \ref{chip_duration_sensitivity} shows the dependence of symbol error rate at a fixed good threshold factor on chip duration. The color bars indicate both chip duration translated to seconds real-time and the resulting symbol duration at the given Gold code length. In the lower graphs we show the trace of amplitude at $\text{SER}=0.5$ over chip -duration like we did in Figure \ref{dsss_thf_sensitivity_all_bits} for threshold facotr. In both graphs we can just about -see an optimum for very short chips with a decrease of sensitivity for long chips. This effect is due to longer chips +duration like we did in Figure \ref{dsss_thf_sensitivity_all_bits} for threshold factor. In both graphs we can see a +faint optimum for very short chips with a decrease of sensitivity for long chips. This effect is due to longer chips moving the signal band into noisier spectral regions (cf.\ Figure \ref{freq_meas_spectrum}). \begin{figure} @@ -2284,20 +2285,21 @@ moving the signal band into noisier spectral regions (cf.\ Figure \ref{freq_meas \end{subfigure} \caption{ Dependence of demodulator sensitivity on DSSS chip duration. Due to computational constraints this simulation is - limited to 5 bit and 6 bit DSSS sequences. There is a clearly visible sensitivity maximum at fairly short chip + limited to 5 bit and 6 bit DSSS sequences. There is a clearly visible sensitivity maximum at short chip lengths around $0.2 \text{s}$. Short chip durations shift the entire transmission band up in frequency. In Figure \ref{freq_meas_spectrum} we can see that noise energy is mostly concentrated at lower frequencies, so shifting our signal up in frequency will reduce the amount of noise the decoder sees behind the correlator by shifting the band of interest into a lower-noise spectral region. 
For a practical implementation chip duration - is limited by physical factors such as the maximum modulation slew rate ($\frac{\text{d}P}{\text{d}t}$) and the - maximum Rate-Of-Change-Of-Frequency (ROCOF, $\frac{\text{d}f}{\text{d}t}$) the grid can tolerate. + is limited by physical factors such as the maximum modulation slew rate ($\frac{\text{d}P}{\text{d}t}$) that can + be technically realized and the maximum Rate-Of-Change-Of-Frequency (ROCOF, $\frac{\text{d}f}{\text{d}t}$) that + the grid can tolerate. } \label{chip_duration_sensitivity} \end{figure} In the previous graphs we have used random clips of measured grid frequency noise as noise in our simulations. Comparing between a simulation using measured noise and synthetic noise generated as we outlined in the beginning of Section -\label{sec-ch-sim} we get the plots in Figure \ref{chip_duration_sensitivity_cmp}. We can see that while not perfect our +\ref{sec-ch-sim} we get the plots in Figure \ref{chip_duration_sensitivity_cmp}. We can see that while not perfect our simulated noise is an adequate approximation of reality: Our prototype demodulator shows no significant difference in behavior between measured and simulated noise. Simulated noise causes slightly worse performance for long chips. Overall the results for both are very close in absolute value. @@ -2324,9 +2326,9 @@ the results for both are very close in absolute value. \end{subfigure} \caption{ Chip duration/sensitivity simulation results like in Figure \ref{chip_duration_sensitivity} compared between a - simulation using measured frequency data like previous graphs and one using artificially generated noise. There - is little visible difference indicating that we have found a good model of reality in our noise synthesizer, but - also that real grid frequency behaves like a frequency-shaped gaussian noise process. + simulation using measured frequency data like in the previous graphs and one using artificially generated noise. 
+ There is little visible difference indicating that we have found a good model of reality in our noise + synthesizer, but also that real grid frequency behaves like a frequency-shaped gaussian noise process. } \label{chip_duration_sensitivity_cmp} \end{figure} @@ -2339,9 +2341,9 @@ demonstrator we use JTAG to reset part of a commodity smart meter from an extern reset controller receives its commands over the grid frequency modulation system we outlined in this thesis. To keep implementation cost low the reset controller is fed a simulation of a modulated grid frequency signal through a standard \SI{3.5}{\milli\meter} audio jack\footnote{ - By generously cutting two PCB traces the meter we chose to use can be easily modified to provide strong galvanic - separation between grid and main application microcontroller. With this modification we have to supply power to its - main application MCU externally along with the JTAG interface. + By generously cutting two PCB traces the meter we chose to use can be easily modified to provide galvanic separation + between grid and main application microcontroller. With this modification we have to supply power to its main + application MCU externally along with the JTAG interface but now the modified meter is electrically safe. }. Measurement of actual grid frequency instead would simply require a voltage divider and depending on the setup an analog optoisolator. @@ -2349,23 +2351,23 @@ analog optoisolator. \label{sec-easymeter} For our demonstrator to make sense we wanted to select a realistic reset target. In Germany where this thesis was -written a standards-compliant setup would consist of a fairly dumb smart meter and a smart meter gateway (SMGW) -containing all of the complex bidirectional protocol logic such as wireless or landline IP connectivity. The realistic -target for a setup in this architecture would be the components of an SMGW such as its communications modem or main -application processor. 
In the German architecture the smart meter does not even have to have a bi-directional data link -to the SMGW effectively mitigating any attack vector for remote compromise. +written a standards-compliant setup would consist of a comparatively feature-limited smart meter and a smart meter +gateway (SMGW) containing all of the complex bidirectional protocol logic such as wireless or landline IP connectivity. +The realistic target for a setup in this architecture would be the components of an SMGW such as its communication modem +or main application processor. In the German architecture the smart meter does not even have to have a bi-directional +data link to the SMGW effectively mitigating any attack vector for remote compromise. Despite these considerations we still chose to reset the application MCU inside the smart meter for two reasons. One is that -SMGWs are much harder to come by on the second-hand market. The other is that SMGWs are a particular feature of the -German standardization landscape and in many other countries functions of an SMGW such as wireless protocol handling are +SMGWs are much rarer on the second-hand market. The other is that SMGWs are a particular feature of the German +standardization landscape and in many other countries functions of an SMGW such as wireless protocol handling are integrated into the meter itself (see e.g.\ \cite{honeywell01}). -In the end we settled on an Q3DA1002 three-phase 60A meter made by German manufacturer EasyMeter. This meter is typical +In the end we settled on a Q3DA1002 three phase 60A meter made by German manufacturer EasyMeter. This meter is typical of what would be found in an average German household and can be acquired very inexpensively as new old stock on online marketplaces. The meter consists of a plastic enclosure with a transparent polycarbonate top part and a grey ABS bottom part that are -ultrasonically welded shut. 
In the bottom part of the case a PCB we call the \emph{measurement} board is potted in +The meter consists of a plastic enclosure with a transparent polycarbonate top part and a gray ABS bottom part that are +ultrasonically welded together. In the bottom part of the case a PCB we call the \emph{measurement} board is potted in epoxide resin (see Figure \ref{easymeter_composites}). This PCB contains three separate energy measurement ASICs for the three phases (see Figure \ref{easymeter_detail_xrays}). It also contains a capacitive dropper power supply for the meter circuitry and external modules such as a SMGW. The measurement board through three infrared links (one per phase) @@ -2374,26 +2376,27 @@ measurement logging and aggregation, controls a small segment LCD displaying tot accessible \si{\kilo\watt\hour} impulse LED and serial IR links. The measurement board does not contain any logging or outside communication interfaces. All of that is handled on the -display board by a Texas Instruments MSP430F2350 application MCU. This is a 16-bit RISC MCU with \SI{16}{\kilo\byte} -flash and \SI{2}{\kilo\byte} SRAM\footnote{ - The microcontroller might seem a bit overkill for such a simple application, but most of its \SI{16}{\kilo\byte} - program flash is in fact used. A casual glance with Ghidra shows that a large part of program flash is expended on - keeping multiple redundant copies of energy consumption aggregates including error recovery in case of data - corruption and some effort has even been made to guard against data corruption using simple non-cryptographic - checksums. Another large part of the MCU's firmware handles data transmission over the meter's externally accessible - IR link through Smart Message Language\cite{bsi-tr-03109-1-IVb}. +display board by a Texas Instruments \texttt{MSP430F2350} application MCU. 
This is a 16-bit RISC MCU with +\SI{16}{\kilo\byte} flash and \SI{2}{\kilo\byte} SRAM\footnote{ + At first glance the microcontroller might seem overkill for such a simple application, but most of its + \SI{16}{\kilo\byte} program flash is in fact used. A casual glance with Ghidra shows that a large part of program + flash is expended on keeping multiple redundant copies of energy consumption aggregates including error recovery in + case of data corruption and some effort has even been made to guard against data corruption using simple + non-cryptographic checksums. Another large part of the MCU's firmware handles data transmission over the meter's + externally accessible IR link through Smart Message Language\cite{bsi-tr-03109-1-IVb}. }. There is an I2C EEPROM that is used in conjunction with the microcontroller's internal \SI{256}{\byte} data flash to -keep redundant copies of energy consumption aggregates. On the side of the base board is a 14-pin header containing both -a standard TI MSP430 JTAG pinout and an UART serial link for debugging. Conveniently the JTAG port was left enabled by -fuse in our particular production unit. - -We chose to use this MSP430 series application MCU as our reset target. Though in this particular unit compromise is -impossible due to a lack of bi-directional communication links some of its sister models do contain bidirectional -communication links\cite{easymeter01} making compromise through communication interfaces at least a theoretical -possibility. In other countries meters with a similar architecture to the Q3DA1002 commonly include complex protocol -logic as part of the meter itself\cite{honeywell01,ifixit01}. As an example, the Honeywell REX2 uses a Maxim Integrated -71M6541 main application microcontroller along with a Texas Instruments CC1000 series radio transceiver and is -advertised to support both over-the-air firmware upgrades and a remotely accessible ``service control switch''. 
+keep redundant copies of energy consumption aggregates. On the side of the display board there is a 14-pin header +containing both a standard TI MSP430 JTAG pinout and a UART serial interface for debugging. Conveniently, the JTAG port +was left enabled by fuse in our particular production unit. + +We chose to use this \texttt{MSP430} series application MCU as our reset target. Though in this particular unit remote +compromise is impossible due to a lack of bidirectional communication links some of its sister models do contain +bidirectional communication links\cite{easymeter01} making compromise through communication interfaces an at least +theoretical possibility. In other countries, meters with a similar architecture to the Q3DA1002 include complex protocol +logic as part of the meter itself or have bidirectional links to it\cite{honeywell01,ifixit01,bigclive01,eevblog01}. As +an example, the Honeywell REX2 uses a Maxim Integrated \texttt{71M6541} main application microcontroller along with a +Texas Instruments \texttt{CC1000} series radio transceiver and is advertised to support both over-the-air firmware +upgrade and a remotely accessible disconnect switch. % TODO add pics of the intact easymeter and of the one with the safety reset0r hooked up @@ -2428,7 +2431,7 @@ advertised to support both over-the-air firmware upgrades and a remotely accessi \end{subfigure} \caption{ - Composite images of the circuit boards inside the EasyMeter Q3DA1002 ``smart'' electricity meter used in our + Composite images of the circuit boards inside the EasyMeter Q3DA1002 smart electricity meter used in our demonstration. } \label{easymeter_composites} @@ -2458,18 +2461,19 @@ advertised to support both over-the-air firmware upgrades and a remotely accessi \subsection{Firmware implementation} \label{sec-demo-fw-impl} -We based our safety reset demonstrator firmware on the grid frequency sensor firmware we developed in sec.\ -\ref{sec-fsensor}. 
We implemented DSSS demodulation by translating the python prototype code we developed in sec.\ +We based our safety reset demonstrator firmware on the grid frequency sensor firmware we developed in Section +\ref{sec-fsensor}. We implemented DSSS demodulation by translating the Python prototype code we developed in Section \ref{sec-ch-sim} to embedded C code. After validating the C translation in extensive simulations we integrated our code with a reed-solomon implementation and a libsodium-based implementation of the cryptographic protocol we designed in -sec.\ \ref{sec-crypto}. To reprogram the target MSP430 microcontroller we ported over the low-level bitbang JTAG driver -of \texttt{mspdebug}\footnote{\url{https://github.com/dlbeer/mspdebug}}. See Figure \ref{fig_demo_sig_schema} for a -schematic overview of signal processing in our demonstrator. +Section \ref{sec-crypto}. To reprogram the target \texttt{MSP430} microcontroller we ported the low-level bitbang JTAG +driver of \texttt{mspdebug}\footnote{\url{https://github.com/dlbeer/mspdebug}}. See Figure \ref{fig_demo_sig_schema} for +a schematic overview of signal processing in our demonstrator. -For all computation-heavy high-level modules of our firmware such as the DSSS demodulator or the grid frequency +For all computation-heavy high level modules of our firmware such as the DSSS demodulator or the grid frequency estimator we wrote test fixtures that allow the same code that runs on the microcontroller to be executed on the host for testing. These test fixtures are very simple C programs that load input data from a file or the command line, run -the algorithm and print results on standard output. +the algorithm and print results on standard output. To enable automatic testing of a large parameter set we run these +test fixtures repeatedly from a set of Python scripts sweeping parameters. \begin{figure} \centering @@ -2481,7 +2485,7 @@ the algorithm and print results on standard output. 
\section{Grid frequency modulation emulation} To emulate a modulated grid frequency signal we superimposed a DSSS-modulated signal at the proper amplitude with -synthetic grid frequency noise generated according to the measurements we took in sec. \ref{sec-fsensor}. In this +synthetic grid frequency noise generated according to the measurements we took in Section \ref{sec-fsensor}. In this primitive simulation we do not simulate the precise impulse response of the grid to a DSSS-modulated stimulus signal. Our results still serve to illustrate the possibility of data transmission in this manner this impulse response can be compensated for at the transmitter by selecting appropriate modulation parameters (e.g. chip rate and amplitude) and at @@ -2498,17 +2502,17 @@ extensive testing paid off: The demonstrator setup worked on its first try. \section{Lessons learned} -Before settling on the commercial smart meter we first tried to use an EVM430-F6779 smart meter evaluation kit made by -Texas Instruments. This evaluation kit did not turn out well for two main reasons. One, it shipped with half the case -missing and no cover for the terminal blocks. Because of this some work was required to maintain electrical safety. -Even after mounting it in an electrically safe manner since the main MCU is not isolated from the grid and the JTAG port -is also galvanically coupled the safety reset controller prototype would also have to be galvanically isolated to not -pose an electrical safety risk. The second issue we ran into was that the EVM430-F6779 is based around an MSP430F6779 -microcontroller. This microcontroller is a rather large part within the MSP430 series and uses a particularly new -revision of the CPU core and associated JTAG peripheral that are incompatible with all MSP430 programmers we tried to -use on it. \texttt{mspdebug} does not have support for it and porting TI's own JTAG programmer reference sources did not -yield any results either. 
Finally we tried an USB-based programmer made by TI themselves that turned out to either have -broken firmware or a hardware defect, leading to it frequently re-enumerating on the USB. +Before settling on the commercial smart meter we first tried to use an \texttt{EVM430-F6779} smart meter evaluation kit +made by Texas Instruments. This evaluation kit did not turn out well for two main reasons. One, it shipped with half the +case missing and no cover for the terminal blocks. Because of this some work was required to get it electrically safe. +Even after mounting it in an electrically safe manner the safety reset controller prototype would also have to be +galvanically isolated to not pose an electrical safety risk since the main MCU is not isolated from the grid and the +JTAG port is also galvanically coupled. The second issue we ran into was that the \texttt{EVM430-F6779} is based around +an \texttt{MSP430F6779} microcontroller. This microcontroller is a rather large part within the \texttt{MSP430} series +and uses a new revision of the CPU core and associated JTAG peripheral that are incompatible with all \texttt{MSP430} +programmers we tried to use on it. \texttt{mspdebug} does not have support for it and porting TI's own JTAG programmer +reference sources did not yield any results either. Finally we tried an USB-based programmer made by TI themselves that +turned out to either have broken firmware or a hardware defect, leading to it frequently reënumerating on the USB. Overall our initial assumption that a development kit would certainly be easier to program than a commercial meter did not prove to be true. Contrary to our expectations the commercial meter had JTAG enabled allowing us to easily read out @@ -2518,12 +2522,12 @@ proved not to be too complex and all we wanted to know could be found out with j In the firmware development phase our approach of testing every module individually (e.g. 
DSSS demodulator, Reed-Solomon decoder, grid frequency estimation) proved to be very useful. In particular debugging benefited greatly from being able -to run a couple thousand tests within seconds. In case of our DSSS demodulator this modular testing and simulation -architecture allowed us to simulate many thousand runs of our implementation on test data and directly compare it to our +to run several thousand tests within seconds. In case of our DSSS demodulator this modular testing and simulation +architecture allowed us to simulate thousands of runs of our implementation on test data and directly compare it to our Jupyter/Python prototype (see Figure \ref{fw_proto_comparison}). Since we spent more time polishing our embedded C -implementation it turned out to perform much better than our initial python prototype. At the same time it shows -fundamentally similar response to its parameters. One significant bug we fixed in the embedded C version is the python -version's tendency towards incorrect decodings at even very large amplitudes. +implementation it turned out to perform better than our Python prototype. At the same time it shows fundamentally +similar response to its parameters. One significant bug we fixed in the embedded C version was the Python version's +tendency towards incorrect decodings at even very large amplitudes. \begin{figure} \centering @@ -2539,7 +2543,7 @@ version's tendency towards incorrect decodings at even very large amplitudes. \end{subfigure} \caption{ - Symbol error rate plots versus threshold factor for both our python prototype (above) and our firmware + Symbol error rate plots versus threshold factor for both our Python prototype (above) and our firmware implementation of our demodulation algorithm. Note the slightly different threshold factor color scales. Cf.\ Figure \ref{dsss_thf_amplitude_5678}. } @@ -2547,55 +2551,58 @@ version's tendency towards incorrect decodings at even very large amplitudes. 
\end{figure} In accordance with our initial estimations we did not run into any code space or computation bottlenecks from choosing -floating-point emulation instead of porting over our algorithms to fixed-point calculations. The extremely slow sampling -rate of our systems makes even heavyweight processing such as FFT or our rather brute-force dynamic programming approach -to DSSS demodulation possible well within performance constraints. - -Compiled code size of our firmware implementation is slightly larger than we would like at around \SI{64}{\kilo\byte} -for our firmware image including everything except the target microcontroller firmware image. See appendix -\ref{symbol_size_chart} for a graph illustrating the contribution of various parts of the signal processing toolchain to -this total. Overall the most heavy-weight operations by far are the SHA512 implementation from libsodium and the FFT -from ARM's CMSIS signal processing library. +floating point emulation instead of porting over our algorithms to fixed point calculations. The extremely slow sampling +rate of our systems makes even heavyweight processing such as FFT or our brute-force dynamic programming approach to +DSSS demodulation possible well within our performance constraints. + +Since we are only building a prototype we did not optimize firmware code size at all. The compiled code size of our +firmware implementation is slightly larger than we would like at around \SI{64}{\kilo\byte} for our firmware image +including everything except the target microcontroller firmware image. See appendix \ref{symbol_size_chart} for a graph +illustrating the contribution of various parts of the signal processing toolchain to this total. Overall the most +heavy-weight operations by far are the SHA512 implementation from libsodium and the FFT from ARM's CMSIS signal +processing library. 
Especially the SHA512 implementation has large potential for size optimization because it is highly +optimized for speed using extensive manual loop unrolling. \chapter{Future work} \section{Precise grid characterization} -We based our simulations on a linear relationship between generation/consumption power imbalance and grid frequency. -Our literature study suggests that this is an appropriate first-order approximation\cite{crastan03}. We kept modulation -bandwidth in our simulations inside a \SIrange{1000}{100}{\milli\hertz} frequency band that we reason is most likely to -exibit this linear behavior in practice. At lower frequencies primary control kicks in. With the frequency delta -thresholds specified for primary control systems\cite{entsoe04} this will likely lead to significant non-linear effects. -At higher frequencies grid frequency estimation at the receiver becomes more complex. Higher frequencies also come -close to modes of mechanical oscillation in generators (usually at \SI{5}{\hertz} and above\cite{crastan03}). +We based our simulations on a linear relationship between the generation/consumption power imbalance and grid frequency. +Our literature study suggests that this is an appropriate first order approximation\cite{crastan03}. We kept the +modulation bandwidth in our simulations inside a \SIrange{1000}{100}{\milli\hertz} frequency band that we reason is most +likely to exhibit this linear behavior in practice. At lower frequencies primary control kicks in. With the frequency +delta thresholds specified for primary control systems\cite{entsoe04} this would lead to significant non-linear +effects. At higher frequencies grid frequency estimation at the receiver becomes more complex. Higher frequencies also +come close to modes of mechanical oscillation in generators (usually at \SI{5}{\hertz} and above\cite{crastan03}). An analysis of the above concerns can be performed using dynamic grid simulation models\cite{semerow01,entsoe05}. 
-Presumably out of safety concerns these models are only available under non-disclosure agreements. Integrating +Presumably out of security concerns these models are only available under non-disclosure agreements. Integrating NDA-encumbered results stemming from such a model in an open-source publication such as this one poses a logistical -challenge which is why we decided to leave this topic for a separate future work. After detailed model simulation we -ultimately aim to validate our results experimentally. Assuming linear grid behavior even under very small disturbances -a small-scale experiment is an option. Such a small-scale experiment would require very long integration times. - -Given a frequency characteristic of \SI{30}{\giga\watt\per\hertz} a stimulus of \SI{10}{\kilo\watt} yields $\Delta f = -\SI{0.33}{\micro\hertz}$. At an estimated \SI{20}{\milli\hertz} of RMS noise over a bandwidth of interest this results -in an SNR slightly better than \SI{-50}{\decibel}. The correlation time necessary to offset this with DSSS processing -gain at a chip rate of \SI{1}{\baud} would be in the order of days. With such long correlation times clock stability -starts to become a problem as during correlation transmitter and receiver must maintain close phase alignment w.r.t.\ -one chip period. A $\leq \SI{10}{\degree}$ phase difference requirement over this period of time would translate into -clock stability better than \SI{10}{ppm}. Though certainly not impossible to achieve this does pose an engineering -challenge. - -A possible way to maintain clock alignment is to use grid frequency itself as a reference. Instead of keying the DSSS +challenge which is why we decided to leave this topic for a separate future work. + +After detailed model simulation we ultimately aim to validate our results experimentally. Assuming linear grid behavior +even under very small disturbances a small-scale experiment is an option. 
Such a small-scale experiment would require +very long integration times. Given a frequency characteristic of \SI{30}{\giga\watt\per\hertz} a stimulus of +\SI{10}{\kilo\watt} yields $\Delta f = \SI{0.33}{\micro\hertz}$. At an estimated \SI{20}{\milli\hertz} of RMS noise over +a bandwidth of interest this results in an SNR slightly better than \SI{-50}{\decibel}. The correlation time necessary +to offset this with DSSS processing gain at a chip rate of \SI{1}{\baud} would be in the order of days. With such long +correlation times clock stability starts to become a problem as during correlation transmitter and receiver must +maintain close phase alignment with respect to one chip period. A phase difference requirement of less than +\SI{10}{\degree} over this period of time would translate into clock stability better than \SI{10}{ppm}. Though certainly +not impossible to achieve this does pose an engineering challenge. + +A way to maintain clock alignment might be to use grid frequency itself as a reference. Instead of keying the DSSS modulator/demodulator on a local crystal oscillator, chip timings would be described in fractions of a mains voltage cycle. This would track grid frequency variations synchronously at both ends and would maintain phase alignment even -over long periods of time at cost of a slight increase in system complexity. +over long periods of time at the cost of a slight increase in system complexity. The receiver would then measure differences +between consecutive chips instead of their absolute values. \section{Technical standardization} -The description of a safety reset system provided in this work could be translated into a formalized technical standard -with relatively low effort. Our system is very simple compared to e.g. a full smart meter communication standard and -thus can conceivably be described in a single, concise document. 
The much more complicated side of standardization would -be the standardization of the backend operation including key management, coördination and command authorization. +The description of a safety reset system provided in this work could be translated into a formalized technical standard. +Our system is simple compared to e.g.\ a full smart meter communication standard and thus can conceivably be +described in a single, concise document. The complicated side of standardization would be the standardization of the +backend operation including key management, coördination and command authorization. \section{Regulatory adoption} @@ -2608,31 +2615,31 @@ a safe state without the need of fine-grained control of implementation details A regulatory authority might specify that all smart meters must use a standardized reset controller that on command resets to a minimal firmware image that disables external communication, continues basic billing functions and enables -any disconnect switches. This system would enable the \emph{reset authority} to directly preempt a large-scale attack +any disconnect switches. This system would enable the regulatory authority to directly preempt a large-scale attack irrespective of implementation details of the various smart meter implementations. Cryptographic key management for the safety reset system is not much different to the management of highly privileged -signing keys as they are used in many other systems already. If the safety reset system is implemented with a -regulatory authority as the \emph{reset authority} they would likely be able to find a public entity that is already -managing root keys for other government systems to also manage safety reset keys. Availability and security requirements -of safety reset keys do not differ significantly from those for other types of root keys. +signing keys as they are used in many other systems such as TLS already. 
If the safety reset system is implemented by a +regulatory authority they would likely be able to find a public entity that is already managing root keys for other +government systems to also manage safety reset keys. Availability and security requirements of safety reset keys do not +differ significantly from those for other types of root keys. \section{Zones of trust} In our design, we opted for a safety reset controller in the form of a separate microcontroller entirely separate from whatever application microcontroller the smart meter design is already using. This design nicely separates the meter -into an untrusted application (the core microcontroller) and the trusted reset controller. Since the interface between -the two is simple and logically one-way, it can be validated to a high standard of security. +into an untrusted application on the core microcontroller and the trusted reset controller. Since the interface between +the two is simple and one-way, it can be validated to a high standard of security. Despite these security benefits, the cost of such a separate hardware device might prove high in a mass-market rollout. In this case, one might attempt to integrate the reset controller into the core microcontroller in some way. Primarily, there would be two ways to accomplish this. One is a solution that physically integrates an additional microcontroller core into the main application microcontroller package either as a submodule on the same die or as a separate die in a -multi-chip module (MCM) with the main application microcontroller. A full-custom solution integrating both on a single -die might be a viable path for very large-scale deployments, but will most likely be too expensive in tooling costs -alone to justify its use. 
More likely for a medium- to large-scale deployment (millions of meters) would be a MCM -integrating an off-the-shelf smart metering microcontroller die with the reset controller running on another, much -smaller off-the-shelf microcontroller die. This solution might potentially save some cost compared to a solution using a +multi-chip module (MCM) with the main application microcontroller. A custom solution integrating both on a single die +might be a viable path for very large-scale deployments but will most likely be too expensive in tooling costs alone to +justify its use. More likely for a medium- to large-scale deployment of millions of meters would be an MCM integrating an +off-the-shelf smart metering microcontroller die with the reset controller running on another, much smaller +off-the-shelf microcontroller die. This solution might save some cost compared to a solution using a discrete microcontroller for the reset controller. The more likely approach to reducing cost overhead of the reset controller would be to employ virtualization @@ -2640,14 +2647,14 @@ technologies such as ARM's TrustZone in order to incorporate the reset controlle on the same processor core without compromising the reset controller's security or disturbing the application firmware's operation. -TrustZone is a virtualization technology that provides a hardware-assisted privileged execution domain on at least one -of the microcontroller's cores. In traditional virtualization setups a privileged hypervisor is managing several -unprivileged applications sharing resources between them. Separation between applications in this setup is longitudinal -between adjacent virtual machines. Two applications would both be running in unprivileged mode sharing the same cpu and -the hypervisor would merely schedule them, configure hardware resource access and coördinate communication. 
This -longitudinal virtualization simplifies application development since from the application's perspective the virtual -machine looks very similar to a physical one. In addition, in general this setup reciprocally isolates two applications -with neither one being able to gain control over the other. +TrustZone is a virtualization technology that provides a hardware-assisted privileged execution domain. In traditional +virtualization setups a privileged hypervisor is managing several unprivileged applications that share resources between +them. Separation between applications in this setup is longitudinal between adjacent virtual machines. Two applications +would both be running in unprivileged mode sharing the same CPU and the hypervisor would merely schedule them, configure +hardware resource access and coördinate communication. This longitudinal virtualization simplifies application +development since from the application's perspective the virtual machine looks very similar to a physical one. In +addition, in general this setup can be used to reciprocally isolate two applications with neither one being able to gain +control over the other. In contrast to this, a TrustZone-like system in general does not provide several application virtual machines and longitudinal separation. Instead, it provides lateral separation between two domains: The unprivileged application @@ -2664,20 +2671,21 @@ correctly configure than it is to simply use separate hardware and secure the in In this thesis we have developed an end-to-end design of a reset system to restore smart meters to a safe operating state during an ongoing large-scale cyberattack. We have laid out the fundamentals of smart metering infrastructure and -elaborated the need for an out-of-band method to reset device firmware due to the large attack surface of this complex -firmware. 
To allow our system to be triggered even in the middle of a cyberattack we have developed a broadcast data -transmission system based on intentional modulation of global grid frequency. We have developed the theoretical +elaborated the need for an out of band method to reset a meter's firmware due to the large attack surface of this +complex firmware. To allow our system to be triggered even in the middle of a cyberattack we have developed a broadcast +data transmission system based on intentional modulation of the global grid frequency. We have developed the theoretical foundations of the process based on an established model of inertial grid frequency response to load variations and -shown the veracity of our end-to-end design through extensive simulations. To properly base these simulations we have +shown the viability of our end-to-end design through extensive simulations. To properly base these simulations we have developed a grid frequency measurement methodology consisting of a custom-designed hardware device for electrically safe data capture and a set of software tools to archive and process captured data. Our simulations show good behavior of our broadcast communication system and give an indication that coöperating with a large consumer such as an aluminium -smelter would be a feasible way to set up a transmitter at very low hardware overhead. Based on our broadcast primitive -we have developed a cryptographic protocol ready for embedded implementation in resource-constrained systems that allows -quick (response time less than 30 minutes) triggering of all or a selected subset of devices. Finally, we have -experimentally validated our system using simulated grid frequency data in a demonstrator setup based on a commercial -microcontroller as our safety reset controller and an off-the-shelf smart meter. We have laid out a path for further -research and standardization related to our system. 
+smelter would be a feasible way to set up a transmitter with very low hardware overhead. Based on our broadcast +primitive we have developed a cryptographic protocol ready for embedded implementation in resource-constrained systems +that allows triggering all or a selected subset of devices within a quick response time of less than 30 minutes. +Finally, we have experimentally validated our system using simulated grid frequency data in a demonstrator setup based +on a commercial microcontroller as our safety reset controller and an off-the-shelf smart meter. We have laid out a path +for further research and standardization related to our system. Our code and electronics designs are available at the +public repository listed on the second page of this document. \newpage |