This blog post showcases the performance improvements achieved in the process of booting unlock for Clevis LUKS-bound devices. By removing a single function from the boot process, boot time was shortened by 20% to 47%, depending on the scenario.
Clevis is a software framework that allows booting encrypted LUKS devices without manual intervention. This tool is part of Network-Bound Disk Encryption (NBDE). Clevis is the “client” side, although it is not strictly necessary to work against a server, and can be configured to read keys in different ways. Clevis has a set of “pins” that allow different mechanisms for automatic unlocking:
- tang: real NBDE based in client-server architecture
- tpm2: secure cryptoprocessor on the machine
- sss: for composed configurations (for example, achieving high availability using two or more servers)
Basic scenarios
My initial idea was to identify any possible bottlenecks that could be making the boot process last longer than it needs to be. After all, Clevis is a set of commands that allow storing LUKS metadata and recover it in the boot process. Inspection of the different commands executed by Clevis, together with the time spent by each of them, would allow discovering if any function spends most of the execution time. Then I could study if it can be improved or removed.
For this task, I defined different scenarios:
- Virtual machine, one LUKS-encrypted device: A virtual machine with a “/” device for inspecting boot times
- Virtual machine, multiple LUKS-encrypted devices: A virtual machine with several encrypted devices to analyze a much more complex scenario similar to what is described in the Red Hat Enterprise Linux product documentation
- Real machine, one LUKS-encrypted device: A laptop machine with a “/” device for inspecting boot times on a physical machine
Basic configuration
Clevis can work with different configurations, which represent the different mechanisms to allow automatic encrypted disk unlocking. I identified the following configurations as the most representative:
- tpm2 (no pcr_id)
- tpm2 (pcr_id=7)
- tang
I did not include the sss configuration because it is composed of the basic ones, tpm2 and tang, which are already included.
Timestamp logging
I considered several mechanisms for dumping timestamps of each of the lines executing in Clevis boot:
systemd-analyze blame
: This command dumps information about the time that has been spent from each of the commands executed in boot time fromsystemd
. The kind of output dumped is as follows:
6.460s systemd-cryptsetup@luks\x2d18f5ee6f\x2d2ae9\x2d4807\x2da916\x2d106037930328.service 3.616s fwupd.service 2.632s plymouth-quit-wait.service 1.721s systemd-udev-settle.service …
It gives information about the time spent from each service separately, but no information about particular execution time for each process. So it is a good helper, but it cannot be a definitive way to process information per line.
- Manual dump: creation of a logging function with timestamp (including microseconds) to dump the timestamps before and after the execution of particular functions. In the end, this was the solution I took because, although not being the most reusable one, it serves as a very quick fix to implement a mechanism that allows dumping the timestamps of the main functions used at boot time. The code is available in my GitHub repository.
- This commit allows the dumping of the different timestamps before and after the most significant functions used in Clevis boot are processed. It dumps logs into a file (
/var/run/systemd/clevis.log
) with the different execution times for each of the lines, in the following format:
- This commit allows the dumping of the different timestamps before and after the most significant functions used in Clevis boot are processed. It dumps logs into a file (
Bottleneck identification
After logging with the respective timestamps was complete, I could observe that the following functions/lines could be possible bottlenecks:
clevis_luks_check_valid_key_or_keyfile
- sleep in clevis-luks-askpass
I identified the possible bottlenecks by analyzing the different times spent on the execution of each of the most representative functions that intervene in the boot process via Clevis. Further information on the logs dumped are available in this .xlsx file.
In particular, the tabs TPM2-pcr_id: TPM2-pcr_id:7 and tang contain timestamps of the dumped logs according to the change introduced to obtain such information. In any of these cases, it can be clearly observed that the possible bottlenecks are the places where the most of the execution time is spent.
Time measures (“clevis_luks_check_valid_key_or_keyfile”)
Considering the proposed scenarios, together with the different configurations that apply for each of them, I measured the time used by the system to boot and then analyzed the time saved when avoiding the function clevis_luks_check_valid_key_or_keyfile. To obtain these values, I used the systemd-analyze blame
command and calculated the following time measures:.
Scenario 1: Virtual machine (VM) with one LUKS device, original code
Scenario 1: VM with one LUKS device after clevis_luks_check_valid_key_or_keyfile
bottleneck removal
Scenario 2: VM with several LUKS devices, original code
Time measures: Scenario 2: VM with several LUKS devices, after clevis_luks_check_valid_key_or_keyfile
bottleneck removal
It is apparent how the time spent in boot decreases proportionally to the number of devices unlocked in boot time.
Scenario 3: Real machine with one LUKS device, original code
This table represents the time measures of the boot times used to unlock one LUKS2 device in a real machine, in particular, a Lenovo ThinkPad T14s Gen 1, model 20T1S39D5Q, with original code.
Scenario 3: Real machine with one LUKS device, after clevis_luks_check_valid_key_or_keyfile
bottleneck removal
Time measures (different “sleep” values)
As demonstrated in the Bottleneck identification section above, another possible area of code that consumes much of the time in the boot process is sleep in clevis-luks-askpass
. The sleep initial value is 0.5 seconds and is placed in the clevis-luks-askpass
file, which is the file executed by systemd
. It is introduced to wait for the unlocking process of the devices to unlock to be successful. So, the question here is: Is it really necessary? Can it be changed so that boot process is accelerated?
The short answer is that this sleep is already customized at an optimal value. Changes of its value did not cause improvements on the boot time, so the values will not be covered here. If you are interested, they are included in this .xlsx file.
Boot-time improvements
Here, we’ll show the time improvements obtained in each of the scenarios in bar graphs:
Scenario 1: VM with one LUKS device
The bar graph below represents boot time improvements in seconds for the individual keys due to the validation removal:
Improvement in boot times in seconds and as a percentage for each configuration:
Scenario 2: VM with multiple LUKS device
This bar graph represents boot time improvements in seconds on keys due to the validation removal:
Improvement in boot times in seconds and as a percentage for each configuration:
Scenario 3: real machine with unique LUKS device
This bar graph represents improvements on keys due to the validation removal:
Improvement in boot times in seconds and as a percentage for each configuration:
Actions
After analyzing time measures (See section: Time measures (“clevis_luks_check_valid_key_or_keyfile”) above), it is apparent that the real bottleneck is clevis_luks_check_valid_key_or_keyfile
. So, the action that has been taken is simple, and the call to this function has been removed, but only for the execution path in the boot process. The function will be called in other Clevis-related commands, so its execution has been parameterized so that it is avoided only in the boot process.
Sull'autore
Sergio Arroutbi is an experienced Software Designer skilled in Linux, C++/Golang/Python programming, Shell Scripting, Software Design and Development. He has a Master's Degree focused in FLOSS (Free/Libre/Open Source Software) from Universidad Rey Juan Carlos and a Software Craftsmanship Master's Degree from Universidad Politécnica de Madrid.
Ricerca per canale
Automazione
Novità sull'automazione IT di tecnologie, team e ambienti
Intelligenza artificiale
Aggiornamenti sulle piattaforme che consentono alle aziende di eseguire carichi di lavoro IA ovunque
Hybrid cloud open source
Scopri come affrontare il futuro in modo più agile grazie al cloud ibrido
Sicurezza
Le ultime novità sulle nostre soluzioni per ridurre i rischi nelle tecnologie e negli ambienti
Edge computing
Aggiornamenti sulle piattaforme che semplificano l'operatività edge
Infrastruttura
Le ultime novità sulla piattaforma Linux aziendale leader a livello mondiale
Applicazioni
Approfondimenti sulle nostre soluzioni alle sfide applicative più difficili
Serie originali
Raccontiamo le interessanti storie di leader e creatori di tecnologie pensate per le aziende
Prodotti
- Red Hat Enterprise Linux
- Red Hat OpenShift
- Red Hat Ansible Automation Platform
- Servizi cloud
- Scopri tutti i prodotti
Strumenti
- Formazione e certificazioni
- Il mio account
- Supporto clienti
- Risorse per sviluppatori
- Trova un partner
- Red Hat Ecosystem Catalog
- Calcola il valore delle soluzioni Red Hat
- Documentazione
Prova, acquista, vendi
Comunica
- Contatta l'ufficio vendite
- Contatta l'assistenza clienti
- Contatta un esperto della formazione
- Social media
Informazioni su Red Hat
Red Hat è leader mondiale nella fornitura di soluzioni open source per le aziende, tra cui Linux, Kubernetes, container e soluzioni cloud. Le nostre soluzioni open source, rese sicure per un uso aziendale, consentono di operare su più piattaforme e ambienti, dal datacenter centrale all'edge della rete.
Seleziona la tua lingua
Red Hat legal and privacy links
- Informazioni su Red Hat
- Opportunità di lavoro
- Eventi
- Sedi
- Contattaci
- Blog di Red Hat
- Diversità, equità e inclusione
- Cool Stuff Store
- Red Hat Summit