| Age | Commit message (Collapse) | Author |
|
KVM SVM changes for 6.19:
- Fix a few missing "VMCB dirty" bugs.
- Fix the worst of KVM's lack of EFER.LMSLE emulation.
- Add AVIC support for addressing 4k vCPUs in x2AVIC mode.
- Fix incorrect handling of selective CR0 writes when checking intercepts
during emulation of L2 instructions.
- Fix a currently-benign bug where KVM would clobber SPEC_CTRL[63:32] on
VMRUN and #VMEXIT.
- Fix a bug where KVM corrupt the guest code stream when re-injecting a soft
interrupt if the guest patched the underlying code after the VM-Exit, e.g.
when Linux patches code with a temporary INT3.
- Add KVM_X86_SNP_POLICY_BITS to advertise supported SNP policy bits to
userspace, and extend KVM "support" to all policy bits that don't require
any actual support from KVM.
|
|
KVM x86 misc changes for 6.19:
- Fix an async #PF bug where KVM would clear the completion queue when the
guest transitioned in and out of paging mode, e.g. when handling an SMI and
then returning to paged mode via RSM.
- Fix a bug where TDX would effectively corrupt user-return MSR values if the
TDX Module rejects VP.ENTER and thus doesn't clobber host MSRs as expected.
- Leave the user-return notifier used to restore MSRs registered when
disabling virtualization, and instead pin kvm.ko. Restoring host MSRs via
IPI callback is either pointless (clean reboot) or dangerous (forced reboot)
since KVM has no idea what code it's interrupting.
- Use the checked version of {get,put}_user(), as Linus wants to kill them
off, and they're measurably faster on modern CPUs due to the unchecked
versions containing an LFENCE.
- Fix a long-lurking bug where KVM's lack of catch-up logic for periodic APIC
timers can result in a hard lockup in the host.
- Revert the periodic kvmclock sync logic now that KVM doesn't use a
clocksource that's subject to NPT corrections.
- Clean up KVM's handling of MMIO Stale Data and L1TF, and bury the latter
behind CONFIG_CPU_MITIGATIONS.
- Context switch XCR0, XSS, and PKRU outside of the entry/exit fastpath as
the only reason they were handled in the faspath was to paper of a bug in
the core #MC code that has long since been fixed.
- Add emulator support for AVX MOV instructions to play nice with emulated
devices whose PCI BARs guest drivers like to access with large multi-byte
instructions.
|
|
Move KVM's swapping of PKRU outside of the fastpath loop, as there is no
KVM code anywhere in the fastpath that accesses guest/userspace memory,
i.e. that can consume protection keys.
As documented by commit 1be0e61c1f25 ("KVM, pkeys: save/restore PKRU when
guest/host switches"), KVM just needs to ensure the host's PKRU is loaded
when KVM (or the kernel at-large) may access userspace memory. And at the
time of commit 1be0e61c1f25, KVM didn't have a fastpath, and PKU was
strictly contained to VMX, i.e. there was no reason to swap PKRU outside
of vmx_vcpu_run().
Over time, the "need" to swap PKRU close to VM-Enter was likely falsely
solidified by the association with XFEATUREs in commit 37486135d3a7
("KVM: x86: Fix pkru save/restore when guest CR4.PKE=0, move it to x86.c"),
and XFEATURE swapping was in turn moved close to VM-Enter/VM-Exit as a
KVM hack-a-fix ution for an #MC handler bug by commit 1811d979c716
("x86/kvm: move kvm_load/put_guest_xcr0 into atomic context").
Deferring the PKRU loads shaves ~40 cycles off the fastpath for Intel,
and ~60 cycles for AMD. E.g. using INVD in KVM-Unit-Test's vmexit.c,
with extra hacks to enable CR4.PKE and PKRU=(-1u & ~0x3), latency numbers
for AMD Turin go from ~1560 => ~1500, and for Intel Emerald Rapids, go
from ~810 => ~770.
Reviewed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Reviewed-by: Jon Kohler <jon@nutanix.com>
Link: https://patch.msgid.link/20251118222328.2265758-5-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Handle Machine Checks (#MC) that happen in the guest (by forwarding them
to the host) outside of KVM's fastpath so that as much host state as
possible is re-loaded before invoking the kernel's #MC handler. The only
requirement is that KVM invokes the #MC handler before enabling IRQs (and
even that could _probably_ be relaxed to handling #MCs before enabling
preemption).
Waiting to handle #MCs until "more" host state is loaded hardens KVM
against flaws in the #MC handler, which has historically been quite
brittle. E.g. prior to commit 5567d11c21a1 ("x86/mce: Send #MC singal from
task work"), the #MC code could trigger a schedule() with IRQs and
preemption disabled. That led to a KVM hack-a-fix in commit 1811d979c716
("x86/kvm: move kvm_load/put_guest_xcr0 into atomic context").
Note, except for #MCs on VM-Enter, VMX already handles #MCs outside of the
fastpath.
Reviewed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Reviewed-by: Jon Kohler <jon@nutanix.com>
Link: https://patch.msgid.link/20251118222328.2265758-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Don't update the LBR MSR intercept bitmaps if they're already up-to-date,
as unconditionally updating the intercepts forces KVM to recalculate the
MSR bitmaps for vmcb02 on every nested VMRUN. The redundant updates are
functionally okay; however, they neuter an optimization in Hyper-V
nested virtualization enlightenments and this manifests as a self-test
failure.
In particular, Hyper-V lets L1 mark "nested enlightenments" as clean, i.e.
tell KVM that no changes were made to the MSR bitmap since the last VMRUN.
The hyperv_svm_test KVM selftest intentionally changes the MSR bitmap
"without telling KVM about it" to verify that KVM honors the clean hint,
correctly fails because KVM notices the changed bitmap anyway:
==== Test Assertion Failure ====
x86/hyperv_svm_test.c:120: vmcb->control.exit_code == 0x081
pid=193558 tid=193558 errno=4 - Interrupted system call
1 0x0000000000411361: assert_on_unhandled_exception at processor.c:659
2 0x0000000000406186: _vcpu_run at kvm_util.c:1699
3 (inlined by) vcpu_run at kvm_util.c:1710
4 0x0000000000401f2a: main at hyperv_svm_test.c:175
5 0x000000000041d0d3: __libc_start_call_main at libc-start.o:?
6 0x000000000041f27c: __libc_start_main_impl at ??:?
7 0x00000000004021a0: _start at ??:?
vmcb->control.exit_code == SVM_EXIT_VMMCALL
Do *not* fix this by skipping svm_hv_vmcb_dirty_nested_enlightenments()
when svm_set_intercept_for_msr() performs a no-op change. changes to
the L0 MSR interception bitmap are only triggered by full CPUID updates
and MSR filter updates, both of which should be rare. Changing
svm_set_intercept_for_msr() risks hiding unintended pessimizations
like this one, and is actually more complex than this change.
Fixes: fbe5e5f030c2 ("KVM: nSVM: Always recalculate LBR MSR intercepts in svm_update_lbrv()")
Cc: stable@vger.kernel.org
Signed-off-by: Yosry Ahmed <yosry.ahmed@linux.dev>
Link: https://patch.msgid.link/20251112013017.1836863-1-yosry.ahmed@linux.dev
[Rewritten commit message based on mailing list discussion. - Paolo]
Reviewed-by: Sean Christopherson <seanjc@google.com>
Tested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
When re-injecting a soft interrupt from an INT3, INT0, or (select) INTn
instruction, discard the exception and retry the instruction if the code
stream is changed (e.g. by a different vCPU) between when the CPU
executes the instruction and when KVM decodes the instruction to get the
next RIP.
As effectively predicted by commit 6ef88d6e36c2 ("KVM: SVM: Re-inject
INT3/INTO instead of retrying the instruction"), failure to verify that
the correct INTn instruction was decoded can effectively clobber guest
state due to decoding the wrong instruction and thus specifying the
wrong next RIP.
The bug most often manifests as "Oops: int3" panics on static branch
checks in Linux guests. Enabling or disabling a static branch in Linux
uses the kernel's "text poke" code patching mechanism. To modify code
while other CPUs may be executing that code, Linux (temporarily)
replaces the first byte of the original instruction with an int3 (opcode
0xcc), then patches in the new code stream except for the first byte,
and finally replaces the int3 with the first byte of the new code
stream. If a CPU hits the int3, i.e. executes the code while it's being
modified, then the guest kernel must look up the RIP to determine how to
handle the #BP, e.g. by emulating the new instruction. If the RIP is
incorrect, then this lookup fails and the guest kernel panics.
The bug reproduces almost instantly by hacking the guest kernel to
repeatedly check a static branch[1] while running a drgn script[2] on
the host to constantly swap out the memory containing the guest's TSS.
[1]: https://gist.github.com/osandov/44d17c51c28c0ac998ea0334edf90b5a
[2]: https://gist.github.com/osandov/10e45e45afa29b11e0c7209247afc00b
Fixes: 6ef88d6e36c2 ("KVM: SVM: Re-inject INT3/INTO instead of retrying the instruction")
Cc: stable@vger.kernel.org
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Link: https://patch.msgid.link/1cc6dcdf36e3add7ee7c8d90ad58414eeb6c3d34.1762278762.git.osandov@fb.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
The current scheme for handling LBRV when nested is used is very
complicated, especially when L1 does not enable LBRV (i.e. does not set
LBR_CTL_ENABLE_MASK).
To avoid copying LBRs between VMCB01 and VMCB02 on every nested
transition, the current implementation switches between using VMCB01 or
VMCB02 as the source of truth for the LBRs while L2 is running. If L2
enables LBR, VMCB02 is used as the source of truth. When L2 disables
LBR, the LBRs are copied to VMCB01 and VMCB01 is used as the source of
truth. This introduces significant complexity, and incorrect behavior in
some cases.
For example, on a nested #VMEXIT, the LBRs are only copied from VMCB02
to VMCB01 if LBRV is enabled in VMCB01. This is because L2's writes to
MSR_IA32_DEBUGCTLMSR to enable LBR are intercepted and propagated to
VMCB01 instead of VMCB02. However, LBRV is only enabled in VMCB02 when
L2 is running.
This means that if L2 enables LBR and exits to L1, the LBRs will not be
propagated from VMCB02 to VMCB01, because LBRV is disabled in VMCB01.
There is no meaningful difference in CPUID rate in L2 when copying LBRs
on every nested transition vs. the current approach, so do the simple
and correct thing and always copy LBRs between VMCB01 and VMCB02 on
nested transitions (when LBRV is disabled by L1). Drop the conditional
LBRs copying in __svm_{enable/disable}_lbrv() as it is now unnecessary.
VMCB02 becomes the only source of truth for LBRs when L2 is running,
regardless of LBRV being enabled by L1, drop svm_get_lbr_vmcb() and use
svm->vmcb directly in its place.
Fixes: 1d5a1b5860ed ("KVM: x86: nSVM: correctly virtualize LBR msrs when L2 is running")
Cc: stable@vger.kernel.org
Signed-off-by: Yosry Ahmed <yosry.ahmed@linux.dev>
Link: https://patch.msgid.link/20251108004524.1600006-4-yosry.ahmed@linux.dev
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
svm_update_lbrv() is called when MSR_IA32_DEBUGCTLMSR is updated, and on
nested transitions where LBRV is used. It checks whether LBRV enablement
needs to be changed in the current VMCB, and if it does, it also
recalculate intercepts to LBR MSRs.
However, there are cases where intercepts need to be updated even when
LBRV enablement doesn't. Example scenario:
- L1 has MSR_IA32_DEBUGCTLMSR cleared.
- L1 runs L2 without LBR_CTL_ENABLE (no LBRV).
- L2 sets DEBUGCTLMSR_LBR in MSR_IA32_DEBUGCTLMSR, svm_update_lbrv()
sets LBR_CTL_ENABLE in VMCB02 and disables intercepts to LBR MSRs.
- L2 exits to L1, svm_update_lbrv() is not called on this transition.
- L1 clears MSR_IA32_DEBUGCTLMSR, svm_update_lbrv() finds that
LBR_CTL_ENABLE is already cleared in VMCB01 and does nothing.
- Intercepts remain disabled, L1 reads to LBR MSRs read the host MSRs.
Fix it by always recalculating intercepts in svm_update_lbrv().
Fixes: 1d5a1b5860ed ("KVM: x86: nSVM: correctly virtualize LBR msrs when L2 is running")
Cc: stable@vger.kernel.org
Signed-off-by: Yosry Ahmed <yosry.ahmed@linux.dev>
Link: https://patch.msgid.link/20251108004524.1600006-3-yosry.ahmed@linux.dev
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
The APM lists the DbgCtlMsr field as being tracked by the VMCB_LBR clean
bit. Always clear the bit when MSR_IA32_DEBUGCTLMSR is updated.
The history is complicated, it was correctly cleared for L1 before
commit 1d5a1b5860ed ("KVM: x86: nSVM: correctly virtualize LBR msrs when
L2 is running"). At that point svm_set_msr() started to rely on
svm_update_lbrv() to clear the bit, but when nested virtualization
is enabled the latter does not always clear it even if MSR_IA32_DEBUGCTLMSR
changed. Go back to clearing it directly in svm_set_msr().
Fixes: 1d5a1b5860ed ("KVM: x86: nSVM: correctly virtualize LBR msrs when L2 is running")
Reported-by: Matteo Rizzo <matteorizzo@google.com>
Reported-by: evn@google.com
Co-developed-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Yosry Ahmed <yosry.ahmed@linux.dev>
Link: https://patch.msgid.link/20251108004524.1600006-2-yosry.ahmed@linux.dev
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
When emulating L2 instructions, svm_check_intercept() checks whether a
write to CR0 should trigger a synthesized #VMEXIT with
SVM_EXIT_CR0_SEL_WRITE. However, it does not check whether L1 enabled
the intercept for SVM_EXIT_WRITE_CR0, which has higher priority
according to the APM (24593—Rev. 3.42—March 2024, Table 15-7):
When both selective and non-selective CR0-write intercepts are active at
the same time, the non-selective intercept takes priority. With respect
to exceptions, the priority of this intercept is the same as the generic
CR0-write intercept.
Make sure L1 does NOT intercept SVM_EXIT_WRITE_CR0 before checking if
SVM_EXIT_CR0_SEL_WRITE needs to be injected.
Opportunistically tweak the "not CR0" logic to explicitly bail early so
that it's more obvious that only CR0 has a selective intercept, and that
modifying icpt_info.exit_code is functionally necessary so that the call
to nested_svm_exit_handled() checks the correct exit code.
Fixes: cfec82cb7d31 ("KVM: SVM: Add intercept check for emulated cr accesses")
Cc: stable@vger.kernel.org
Signed-off-by: Yosry Ahmed <yosry.ahmed@linux.dev>
Link: https://patch.msgid.link/20251024192918.3191141-4-yosry.ahmed@linux.dev
[sean: isolate non-CR0 write logic, tweak comments accordingly]
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
When emulating L2 instructions, svm_check_intercept() checks whether a
write to CR0 should trigger a synthesized #VMEXIT with
SVM_EXIT_CR0_SEL_WRITE. For MOV-to-CR0, SVM_EXIT_CR0_SEL_WRITE is only
triggered if any bit other than CR0.MP and CR0.TS is updated. However,
according to the APM (24593—Rev. 3.42—March 2024, Table 15-7):
The LMSW instruction treats the selective CR0-write
intercept as a non-selective intercept (i.e., it intercepts
regardless of the value being written).
Skip checking the changed bits for x86_intercept_lmsw and always inject
SVM_EXIT_CR0_SEL_WRITE.
Fixes: cfec82cb7d31 ("KVM: SVM: Add intercept check for emulated cr accesses")
Cc: stable@vger.kernel.org
Reported-by: Matteo Rizzo <matteorizzo@google.com>
Signed-off-by: Yosry Ahmed <yosry.ahmed@linux.dev>
Link: https://patch.msgid.link/20251024192918.3191141-3-yosry.ahmed@linux.dev
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Add and use a helper, kvm_prepare_unexpected_reason_exit(), to dedup the
code that fills the exit reason and CPU when KVM encounters a VM-Exit that
KVM doesn't know how to handle.
Reviewed-by: yaoyuan@linux.alibaba.com
Reviewed-by: Yao Yuan <yaoyuan@linux.alibaba.com>
Reviewed-by: Yosry Ahmed <yosry.ahmed@linux.dev>
Reviewed-by: Pankaj Gupta <pankaj.gupta@amd.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Acked-by: Kai Huang <kai.huang@intel.com>
Link: https://patch.msgid.link/20251030185004.3372256-1-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Unregister the GALog notifier (used to get notified of wake events for
blocking vCPUs) on kvm-amd.ko exit so that a KVM or IOMMU driver bug that
results in a spurious GALog event "only" results in a spurious IRQ, and
doesn't trigger a use-after-free due to executing unloaded module code.
Fixes: 5881f73757cc ("svm: Introduce AMD IOMMU avic_ga_log_notifier")
Reported-by: Hou Wenlong <houwenlong.hwl@antgroup.com>
Closes: https://lore.kernel.org/all/20250918130320.GA119526@k08j02272.eu95sqa
Link: https://patch.msgid.link/20251016190643.80529-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Setup the per-CPU SVM data structures at the very end of hardware setup so
that svm_hardware_unsetup() can be used in svm_hardware_setup() to unwind
AVIC setup (for the GALog notifier). Alternatively, the error path could
do an explicit, manual unwind, e.g. by adding a helper to free the per-CPU
structures. But the per-CPU allocations have no interactions or
dependencies, i.e. can comfortably live at the end, and so converting to
a manual unwind would introduce churn and code without providing any
immediate advantage.
Link: https://patch.msgid.link/20251016190643.80529-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
With support for 4k vCPUs in x2AVIC, the size of the AVIC Physical ID
table is expanded from a single 4k page to a maximum of 8 contiguous 4k
pages. The actual number of pages allocated depends on the maximum
possible APIC ID in the guest, which is only known by the time the first
vCPU is created. In preparation for supporting a dynamic AVIC Physical
ID table size, move its allocation to vcpu_precreate().
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Naveen N Rao (AMD) <naveen@kernel.org>
Link: https://lore.kernel.org/r/7dc764e0af7f01440bbac3d9215ed174027c2384.1757009416.git.naveen@kernel.org
[sean: drop enable_apicv check from svm_vcpu_precreate()]
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Modern AMD CPUs do not support segment limit checks in 64-bit mode
(i.e. EFER.LMSLE must be zero). Do not allow a guest to set EFER.LMSLE
on a CPU that requires the bit to be zero.
For backwards compatibility, allow EFER.LMSLE to be set on CPUs that
support segment limit checks in 64-bit mode, even though KVM's
implementation of the feature is incomplete (e.g. KVM's emulator does
not enforce segment limits in 64-bit mode).
Fixes: eec4b140c924 ("KVM: SVM: Allow EFER.LMSLE to be set with nested svm")
Signed-off-by: Jim Mattson <jmattson@google.com>
Reviewed-by: Nikunj A Dadhania <nikunj@amd.com>
Reviewed-by: Yosry Ahmed <yosry.ahmed@linux.dev>
Link: https://lore.kernel.org/r/20251001001529.1119031-3-jmattson@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
KVM x86 CET virtualization support for 6.18
Add support for virtualizing Control-flow Enforcement Technology (CET) on
Intel (Shadow Stacks and Indirect Branch Tracking) and AMD (Shadow Stacks).
CET is comprised of two distinct features, Shadow Stacks (SHSTK) and Indirect
Branch Tracking (IBT), that can be utilized by software to help provide
Control-flow integrity (CFI). SHSTK defends against backward-edge attacks
(a.k.a. Return-oriented programming (ROP)), while IBT defends against
forward-edge attacks (a.k.a. similarly CALL/JMP-oriented programming (COP/JOP)).
Attackers commonly use ROP and COP/JOP methodologies to redirect the control-
flow to unauthorized targets in order to execute small snippets of code,
a.k.a. gadgets, of the attackers choice. By chaining together several gadgets,
an attacker can perform arbitrary operations and circumvent the system's
defenses.
SHSTK defends against backward-edge attacks, which execute gadgets by modifying
the stack to branch to the attacker's target via RET, by providing a second
stack that is used exclusively to track control transfer operations. The
shadow stack is separate from the data/normal stack, and can be enabled
independently in user and kernel mode.
When SHSTK is is enabled, CALL instructions push the return address on both the
data and shadow stack. RET then pops the return address from both stacks and
compares the addresses. If the return addresses from the two stacks do not
match, the CPU generates a Control Protection (#CP) exception.
IBT defends against backward-edge attacks, which branch to gadgets by executing
indirect CALL and JMP instructions with attacker controlled register or memory
state, by requiring the target of indirect branches to start with a special
marker instruction, ENDBRANCH. If an indirect branch is executed and the next
instruction is not an ENDBRANCH, the CPU generates a #CP. Note, ENDBRANCH
behaves as a NOP if IBT is disabled or unsupported.
From a virtualization perspective, CET presents several problems. While SHSTK
and IBT have two layers of enabling, a global control in the form of a CR4 bit,
and a per-feature control in user and kernel (supervisor) MSRs (U_CET and S_CET
respectively), the {S,U}_CET MSRs can be context switched via XSAVES/XRSTORS.
Practically speaking, intercepting and emulating XSAVES/XRSTORS is not a viable
option due to complexity, and outright disallowing use of XSTATE to context
switch SHSTK/IBT state would render the features unusable to most guests.
To limit the overall complexity without sacrificing performance or usability,
simply ignore the potential virtualization hole, but ensure that all paths in
KVM treat SHSTK/IBT as usable by the guest if the feature is supported in
hardware, and the guest has access to at least one of SHSTK or IBT. I.e. allow
userspace to advertise one of SHSTK or IBT if both are supported in hardware,
even though doing so would allow a misbehaving guest to use the unadvertised
feature.
Fully emulating SHSTK and IBT would also require significant complexity, e.g.
to track and update branch state for IBT, and shadow stack state for SHSTK.
Given that emulating large swaths of the guest code stream isn't necessary on
modern CPUs, punt on emulating instructions that meaningful impact or consume
SHSTK or IBT. However, instead of doing nothing, explicitly reject emulation
of such instructions so that KVM's emulator can't be abused to circumvent CET.
Disable support for SHSTK and IBT if KVM is configured such that emulation of
arbitrary guest instructions may be required, specifically if Unrestricted
Guest (Intel only) is disabled, or if KVM will emulate a guest.MAXPHYADDR that
is smaller than host.MAXPHYADDR.
Lastly disable SHSTK support if shadow paging is enabled, as the protections
for the shadow stack are novel (shadow stacks require Writable=0,Dirty=1, so
that they can't be directly modified by software), i.e. would require
non-trivial support in the Shadow MMU.
Note, AMD CPUs currently only support SHSTK. Explicitly disable IBT support
so that KVM doesn't over-advertise if AMD CPUs add IBT, and virtualizing IBT
in SVM requires KVM modifications.
|
|
KVM x86 changes for 6.18
- Don't (re)check L1 intercepts when completing userspace I/O to fix a flaw
where a misbehaving usersepace (a.k.a. syzkaller) could swizzle L1's
intercepts and trigger a variety of WARNs in KVM.
- Emulate PERF_CNTR_GLOBAL_STATUS_SET for PerfMonV2 guests, as the MSR is
supposed to exist for v2 PMUs.
- Allow Centaur CPU leaves (base 0xC000_0000) for Zhaoxin CPUs.
- Clean up KVM's vector hashing code for delivering lowest priority IRQs.
- Clean up the fastpath handler code to only handle IPIs and WRMSRs that are
actually "fast", as opposed to handling those that KVM _hopes_ are fast, and
in the process of doing so add fastpath support for TSC_DEADLINE writes on
AMD CPUs.
- Clean up a pile of PMU code in anticipation of adding support for mediated
vPMUs.
- Add support for the immediate forms of RDMSR and WRMSRNS, sans full
emulator support (KVM should never need to emulate the MSRs outside of
forced emulation and other contrived testing scenarios).
- Clean up the MSR APIs in preparation for CET and FRED virtualization, as
well as mediated vPMU support.
- Rejecting a fully in-kernel IRQCHIP if EOIs are protected, i.e. for TDX VMs,
as KVM can't faithfully emulate an I/O APIC for such guests.
- KVM_REQ_MSR_FILTER_CHANGED into a generic RECALC_INTERCEPTS in preparation
for mediated vPMU support, as KVM will need to recalculate MSR intercepts in
response to PMU refreshes for guests with mediated vPMUs.
- Misc cleanups and minor fixes.
|
|
KVM SVM changes for 6.18
- Require a minimum GHCB version of 2 when starting SEV-SNP guests via
KVM_SEV_INIT2 so that invalid GHCB versions result in immediate errors
instead of latent guest failures.
- Add support for Secure TSC for SEV-SNP guests, which prevents the untrusted
host from tampering with the guest's TSC frequency, while still allowing the
the VMM to configure the guest's TSC frequency prior to launch.
- Mitigate the potential for TOCTOU bugs when accessing GHCB fields by
wrapping all accesses via READ_ONCE().
- Validate the XCR0 provided by the guest (via the GHCB) to avoid tracking a
bogous XCR0 value in KVM's software model.
- Save an SEV guest's policy if and only if LAUNCH_START fully succeeds to
avoid leaving behind stale state (thankfully not consumed in KVM).
- Explicitly reject non-positive effective lengths during SNP's LAUNCH_UPDATE
instead of subtly relying on guest_memfd to do the "heavy" lifting.
- Reload the pre-VMRUN TSC_AUX on #VMEXIT for SEV-ES guests, not the host's
desired TSC_AUX, to fix a bug where KVM could clobber a different vCPU's
TSC_AUX due to hardware not matching the value cached in the user-return MSR
infrastructure.
- Enable AVIC by default for Zen4+ if x2AVIC (and other prereqs) is supported,
and clean up the AVIC initialization code along the way.
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/chenhuacai/linux-loongson into HEAD
LoongArch KVM changes for v6.18
1. Add PTW feature detection on new hardware.
2. Add sign extension with kernel MMIO/IOCSR emulation.
3. Improve in-kernel IPI emulation.
4. Improve in-kernel PCH-PIC emulation.
5. Move kvm_iocsr tracepoint out of generic code.
|
|
Remove the explicit clearing of shadow stack CPU capabilities.
Reviewed-by: Chao Gao <chao.gao@intel.com>
Signed-off-by: John Allen <john.allen@amd.com>
Link: https://lore.kernel.org/r/20250919223258.1604852-41-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Synchronize XSS from the GHCB to KVM's internal tracking if the guest
marks XSS as valid on a #VMGEXIT. Like XCR0, KVM needs an up-to-date copy
of XSS in order to compute the required XSTATE size when emulating
CPUID.0xD.0x1 for the guest.
Treat the incoming XSS change as an emulated write, i.e. validatate the
guest-provided value, to avoid letting the guest load garbage into KVM's
tracking. Simply ignore bad values, as either the guest managed to get an
unsupported value into hardware, or the guest is misbehaving and providing
pure garbage. In either case, KVM can't fix the broken guest.
Explicitly allow access to XSS at all times, as KVM needs to ensure its
copy of XSS stays up-to-date. E.g. KVM supports migration of SEV-ES guests
and so needs to allow the host to save/restore XSS, otherwise a guest
that *knows* its XSS hasn't change could get stale/bad CPUID emulation if
the guest doesn't provide XSS in the GHCB on every exit. This creates a
hypothetical problem where a guest could request emulation of RDMSR or
WRMSR on XSS, but arguably that's not even a problem, e.g. it would be
entirely reasonable for a guest to request "emulation" as a way to inform
the hypervisor that its XSS value has been modified.
Note, emulating the change as an MSR write also takes care of side effects,
e.g. marking dynamic CPUID bits as dirty.
Suggested-by: John Allen <john.allen@amd.com>
base-commit: 14298d819d5a6b7180a4089e7d2121ca3551dc6c
Link: https://lore.kernel.org/r/20250919223258.1604852-40-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Pass through XSAVE managed CET MSRs on SVM when KVM supports shadow
stack. These cannot be intercepted without also intercepting XSAVE which
would likely cause unacceptable performance overhead.
MSR_IA32_INT_SSP_TAB is not managed by XSAVE, so it is intercepted.
Reviewed-by: Chao Gao <chao.gao@intel.com>
Signed-off-by: John Allen <john.allen@amd.com>
Link: https://lore.kernel.org/r/20250919223258.1604852-39-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Add shadow stack VMCB fields to dump_vmcb. PL0_SSP, PL1_SSP, PL2_SSP,
PL3_SSP, and U_CET are part of the SEV-ES save area and are encrypted,
but can be decrypted and dumped if the guest policy allows debugging.
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Signed-off-by: John Allen <john.allen@amd.com>
Link: https://lore.kernel.org/r/20250919223258.1604852-38-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Emulate shadow stack MSR access by reading and writing to the
corresponding fields in the VMCB.
Signed-off-by: John Allen <john.allen@amd.com>
[sean: mark VMCB_CET dirty/clean as appropriate]
Link: https://lore.kernel.org/r/20250919223258.1604852-36-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Add support for the LOAD_CET_STATE VM-Enter and VM-Exit controls, the
CET XFEATURE bits in XSS, and advertise support for IBT and SHSTK to
userspace. Explicitly clear IBT and SHSTK onn SVM, as additional work is
needed to enable CET on SVM, e.g. to context switch S_CET and other state.
Disable KVM CET feature if unrestricted_guest is unsupported/disabled as
KVM does not support emulating CET, as running without Unrestricted Guest
can result in KVM emulating large swaths of guest code. While it's highly
unlikely any guest will trigger emulation while also utilizing IBT or
SHSTK, there's zero reason to allow CET without Unrestricted Guest as that
combination should only be possible when explicitly disabling
unrestricted_guest for testing purposes.
Disable CET if VMX_BASIC[bit56] == 0, i.e. if hardware strictly enforces
the presence of an Error Code based on exception vector, as attempting to
inject a #CP with an Error Code (#CP architecturally has an Error Code)
will fail due to the #CP vector historically not having an Error Code.
Clear S_CET and SSP-related VMCS on "reset" to emulate the architectural
of CET MSRs and SSP being reset to 0 after RESET, power-up and INIT. Note,
KVM already clears guest CET state that is managed via XSTATE in
kvm_xstate_reset().
Signed-off-by: Yang Weijiang <weijiang.yang@intel.com>
Signed-off-by: Mathias Krause <minipli@grsecurity.net>
Tested-by: Mathias Krause <minipli@grsecurity.net>
Tested-by: John Allen <john.allen@amd.com>
Tested-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Chao Gao <chao.gao@intel.com>
[sean: move some bits to separate patches, massage changelog]
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Link: https://lore.kernel.org/r/20250919223258.1604852-29-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Initialize allow_smaller_maxphyaddr during hardware setup as soon as KVM
knows whether or not TDP will be utilized. To avoid having to teach KVM's
emulator all about CET, KVM's upcoming CET virtualization support will be
mutually exclusive with allow_smaller_maxphyaddr, i.e. will disable SHSTK
and IBT if allow_smaller_maxphyaddr is enabled.
In general, allow_smaller_maxphyaddr should be initialized as soon as
possible since it's globally visible while its only input is whether or
not EPT/NPT is enabled. I.e. there's effectively zero risk of setting
allow_smaller_maxphyaddr too early, and substantial risk of setting it
too late.
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Link: https://lore.kernel.org/r/20250922184743.1745778-1-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Merge the queue of SVM changes for 6.18 to pick up the KVM-defined GHCB
helpers so that kvm_ghcb_get_xss() can be used to virtualize CET for
SEV-ES+ guests.
|
|
Move "avic" to avic.c so that it's colocated with the other AVIC specific
globals and module params, and so that avic_hardware_setup() is a bit more
self-contained, e.g. similar to sev_hardware_setup().
Deliberately set enable_apicv in svm.c as it's already globally visible
(defined by kvm.ko, not by kvm-amd.ko), and to clearly capture the
dependency on enable_apicv being initialized (svm_hardware_setup() clears
several AVIC-specific hooks when enable_apicv is disabled).
Alternatively, clearing of the hooks (and enable_ipiv) could be moved to
avic_hardware_setup(), but that's not obviously better, e.g. it's helpful
to isolate the setting of enable_apicv when reading code from the generic
x86 side of the world.
No functional change intended.
Acked-by: Naveen N Rao (AMD) <naveen@kernel.org>
Tested-by: Naveen N Rao (AMD) <naveen@kernel.org>
Link: https://lore.kernel.org/r/20250919215934.1590410-7-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Set the "allow_apicv_in_x2apic_without_x2apic_virtualization" flag as part
of avic_hardware_setup() instead of handling in svm_hardware_setup(), and
make x2avic_enabled local to avic.c (setting the flag was the only use in
svm.c).
Tag avic_hardware_setup() with __init as necessary (it should have been
tagged __init long ago).
No functional change intended (aside from the side effects of tagging
avic_hardware_setup() with __init).
Acked-by: Naveen N Rao (AMD) <naveen@kernel.org>
Tested-by: Naveen N Rao (AMD) <naveen@kernel.org>
Link: https://lore.kernel.org/r/20250919215934.1590410-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Move svm_set_x2apic_msr_interception() to avic.c as it's only relevant
when x2AVIC is enabled/supported and only called by AVIC code. In
addition to scoping AVIC code to avic.c, this will allow burying the
global x2avic_enabled variable in avic.
Opportunistically rename the helper to explicitly scope it to "avic".
No functional change intended.
Reviewed-by: Naveen N Rao (AMD) <naveen@kernel.org>
Tested-by: Naveen N Rao (AMD) <naveen@kernel.org>
Link: https://lore.kernel.org/r/20250919215934.1590410-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Make svm_x86_ops globally visible in anticipation of modifying the struct
in avic.c, and clean up the KVM-on-HyperV usage, as declaring _and using_
a local variable in a header that's only defined in one specific .c-file
is all kinds of ugly.
Opportunistically make svm_hv_enable_l2_tlb_flush() local to
svm_onhyperv.c, as the only reason it was visible was due to the
aforementioned shenanigans in svm_onhyperv.h.
Alternatively, svm_x86_ops could be explicitly passed to
svm_hv_hardware_setup() as a parameter. While that approach is slightly
safer, e.g. avoids "hidden" updates, for better or worse, the Intel side
of KVM has already chosen to expose vt_x86_ops (and vt_init_ops). Given
that svm_x86_ops is only truly consumed by kvm_ops_update, the odds of a
"hidden" update causing problems are extremely low. So, absent a strong
reason to rework the VMX/TDX code, make svm_x86_ops visible, as having all
updates use exactly "svm_x86_ops." is advantageous in its own right.
No functional change intended.
Link: https://lore.kernel.org/r/20250919215934.1590410-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Prior to running an SEV-ES guest, set TSC_AUX in the host save area to the
current value in hardware, as tracked by the user return infrastructure,
instead of always loading the host's desired value for the CPU. If the
pCPU is also running a non-SEV-ES vCPU, loading the host's value on #VMEXIT
could clobber the other vCPU's value, e.g. if the SEV-ES vCPU preempted
the non-SEV-ES vCPU, in which case KVM expects the other vCPU's TSC_AUX
value to be resident in hardware.
Note, unlike TDX, which blindly _zeroes_ TSC_AUX on TD-Exit, SEV-ES CPUs
can load an arbitrary value. Stuff the current value in the host save
area instead of refreshing the user return cache so that KVM doesn't need
to track whether or not the vCPU actually enterred the guest and thus
loaded TSC_AUX from the host save area.
Opportunistically tag tsc_aux_uret_slot as read-only after init to guard
against unexpected modifications, and to make it obvious that using the
variable in sev_es_prepare_switch_to_guest() is safe.
Fixes: 916e3e5f26ab ("KVM: SVM: Do not use user return MSR support for virtualized TSC_AUX")
Cc: stable@vger.kernel.org
Suggested-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
Signed-off-by: Hou Wenlong <houwenlong.hwl@antgroup.com>
[sean: handle the SEV-ES case in sev_es_prepare_switch_to_guest()]
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Link: https://lore.kernel.org/r/20250923153738.1875174-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Defer recalculating MSR and instruction intercepts after a CPUID update
via RECALC_INTERCEPTS to converge on RECALC_INTERCEPTS as the "official"
mechanism for triggering recalcs. As a bonus, because KVM does a "recalc"
during vCPU creation, and every functional VMM sets CPUID at least once,
for all intents and purposes this saves at least one recalc.
Tested-by: Xudong Hao <xudong.hao@intel.com>
Link: https://lore.kernel.org/r/20250806195706.1650976-26-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Rework the MSR_FILTER_CHANGED request into a more generic RECALC_INTERCEPTS
request, and expand the responsibilities of vendor code to recalculate all
intercepts that vary based on userspace input, e.g. instruction intercepts
that are tied to guest CPUID.
Providing a generic recalc request will allow the upcoming mediated PMU
support to trigger a recalc when PMU features, e.g. PERF_CAPABILITIES, are
set by userspace, without having to make multiple calls to/from PMU code.
As a bonus, using a request will effectively coalesce recalcs, e.g. will
reduce the number of recalcs for normal usage from 3+ to 1 (vCPU create,
set CPUID, set PERF_CAPABILITIES (Intel only), set filter).
The downside is that MSR filter changes that are done in isolation will do
a small amount of unnecessary work, but that's already a relatively slow
path, and the cost of recalculating instruction intercepts is negligible.
Tested-by: Xudong Hao <xudong.hao@intel.com>
Link: https://lore.kernel.org/r/20250806195706.1650976-25-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Commit 3bbf3565f48c ("svm: Do not intercept CR8 when enable AVIC")
inhibited pre-VMRUN sync of TPR from LAPIC into VMCB::V_TPR in
sync_lapic_to_cr8() when AVIC is active.
AVIC does automatically sync between these two fields, however it does
so only on explicit guest writes to one of these fields, not on a bare
VMRUN.
This meant that when AVIC is enabled host changes to TPR in the LAPIC
state might not get automatically copied into the V_TPR field of VMCB.
This is especially true when it is the userspace setting LAPIC state via
KVM_SET_LAPIC ioctl() since userspace does not have access to the guest
VMCB.
Practice shows that it is the V_TPR that is actually used by the AVIC to
decide whether to issue pending interrupts to the CPU (not TPR in TASKPRI),
so any leftover value in V_TPR will cause serious interrupt delivery issues
in the guest when AVIC is enabled.
Fix this issue by doing pre-VMRUN TPR sync from LAPIC into VMCB::V_TPR
even when AVIC is enabled.
Fixes: 3bbf3565f48c ("svm: Do not intercept CR8 when enable AVIC")
Cc: stable@vger.kernel.org
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Reviewed-by: Naveen N Rao (AMD) <naveen@kernel.org>
Link: https://lore.kernel.org/r/c231be64280b1461e854e1ce3595d70cde3a2e9d.1756139678.git.maciej.szmigiero@oracle.com
[sean: tag for stable@]
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Rename kvm_x86_ops.private_max_mapping_level() to .gmem_max_mapping_level()
in anticipation of extending guest_memfd support to non-private memory.
No functional change intended.
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Acked-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Ackerley Tng <ackerleytng@google.com>
Signed-off-by: Fuad Tabba <tabba@google.com>
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Fuad Tabba <tabba@google.com>
Message-ID: <20250729225455.670324-13-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Fold the remaining line of sev_es_vcpu_reset() into sev_vcpu_create() as
there's no need for a dedicated RESET hook just to init a mutex, and the
mutex should be initialized as early as possible anyways.
No functional change intended.
Reviewed-by: Nikunj A Dadhania <nikunj@amd.com>
Link: https://lore.kernel.org/r/20250819234833.3080255-8-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Move the initialization of SNP guest state from svm_vcpu_reset() into
sev_init_vmcb() to reduce the number of paths that deal with INIT/RESET
for SEV+ vCPUs from 4+ to 1. Plumb in @init_event as necessary.
Opportunistically check for an SNP guest outside of
sev_snp_init_protected_guest_state() so that sev_init_vmcb() is consistent
with respect to checking for SEV-ES+ and SNP+ guests.
No functional change intended.
Reviewed-by: Nikunj A Dadhania <nikunj@amd.com>
Link: https://lore.kernel.org/r/20250819234833.3080255-6-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Add a dedicated sev_vcpu_create() helper to allocate the VMSA page for
SEV-ES+ vCPUs, and to allow for consolidating a variety of related SEV+
code in the near future.
No functional change intended.
Reviewed-by: Nikunj A Dadhania <nikunj@amd.com>
Link: https://lore.kernel.org/r/20250819234833.3080255-5-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Advertise support for the immediate form of MSR instructions to userspace
if the instructions are supported by the underlying CPU, and KVM is using
VMX, i.e. is running on an Intel-compatible CPU.
For SVM, explicitly clear X86_FEATURE_MSR_IMM to ensure KVM doesn't over-
report support if AMD-compatible CPUs ever implement the immediate forms,
as SVM will likely require explicit enablement in KVM.
Signed-off-by: Xin Li (Intel) <xin@zytor.com>
[sean: massage changelog]
Link: https://lore.kernel.org/r/20250805202224.1475590-7-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Rename the WRMSR fastpath API to drop "irqoff", as that information is
redundant (the fastpath always runs with IRQs disabled), and to prepare
for adding a fastpath for the immediate variant of WRMSRNS.
No functional change intended.
Signed-off-by: Xin Li (Intel) <xin@zytor.com>
[sean: split to separate patch, write changelog]
Link: https://lore.kernel.org/r/20250805202224.1475590-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Add a fastpath handler for INVD so that the common fastpath logic can be
trivially tested on both Intel and AMD. Under KVM, INVD is always:
(a) intercepted, (b) available to the guest, and (c) emulated as a nop,
with no side effects. Combined with INVD not having any inputs or outputs,
i.e. no register constraints, INVD is the perfect instruction for
exercising KVM's fastpath as it can be inserted into practically any
guest-side code stream.
Link: https://lore.kernel.org/r/20250805190526.1453366-19-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Skip the WRMSR and HLT fastpaths in SVM's VM-Exit handler if the next RIP
isn't valid, e.g. because KVM is running with nrips=false. SVM must
decode and emulate to skip the instruction if the CPU doesn't provide the
next RIP, and getting the instruction bytes to decode requires reading
guest memory. Reading guest memory through the emulator can fault, i.e.
can sleep, which is disallowed since the fastpath handlers run with IRQs
disabled.
BUG: sleeping function called from invalid context at ./include/linux/uaccess.h:106
in_atomic(): 1, irqs_disabled(): 1, non_block: 0, pid: 32611, name: qemu
preempt_count: 1, expected: 0
INFO: lockdep is turned off.
irq event stamp: 30580
hardirqs last enabled at (30579): [<ffffffffc08b2527>] vcpu_run+0x1787/0x1db0 [kvm]
hardirqs last disabled at (30580): [<ffffffffb4f62e32>] __schedule+0x1e2/0xed0
softirqs last enabled at (30570): [<ffffffffb4247a64>] fpu_swap_kvm_fpstate+0x44/0x210
softirqs last disabled at (30568): [<ffffffffb4247a64>] fpu_swap_kvm_fpstate+0x44/0x210
CPU: 298 UID: 0 PID: 32611 Comm: qemu Tainted: G U 6.16.0-smp--e6c618b51cfe-sleep #782 NONE
Tainted: [U]=USER
Hardware name: Google Astoria-Turin/astoria, BIOS 0.20241223.2-0 01/17/2025
Call Trace:
<TASK>
dump_stack_lvl+0x7d/0xb0
__might_resched+0x271/0x290
__might_fault+0x28/0x80
kvm_vcpu_read_guest_page+0x8d/0xc0 [kvm]
kvm_fetch_guest_virt+0x92/0xc0 [kvm]
__do_insn_fetch_bytes+0xf3/0x1e0 [kvm]
x86_decode_insn+0xd1/0x1010 [kvm]
x86_emulate_instruction+0x105/0x810 [kvm]
__svm_skip_emulated_instruction+0xc4/0x140 [kvm_amd]
handle_fastpath_invd+0xc4/0x1a0 [kvm]
vcpu_run+0x11a1/0x1db0 [kvm]
kvm_arch_vcpu_ioctl_run+0x5cc/0x730 [kvm]
kvm_vcpu_ioctl+0x578/0x6a0 [kvm]
__se_sys_ioctl+0x6d/0xb0
do_syscall_64+0x8a/0x2c0
entry_SYSCALL_64_after_hwframe+0x4b/0x53
RIP: 0033:0x7f479d57a94b
</TASK>
Note, this is essentially a reapply of commit 5c30e8101e8d ("KVM: SVM:
Skip WRMSR fastpath on VM-Exit if next RIP isn't valid"), but with
different justification (KVM now grabs SRCU when skipping the instruction
for other reasons).
Fixes: b439eb8ab578 ("Revert "KVM: SVM: Skip WRMSR fastpath on VM-Exit if next RIP isn't valid"")
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20250805190526.1453366-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
KVM x86 MMU changes for 6.17
- Exempt nested EPT from the the !USER + CR0.WP logic, as EPT doesn't interact
with CR0.WP.
- Move the TDX hardware setup code to tdx.c to better co-locate TDX code
and eliminate a few global symbols.
- Dynamically allocation the shadow MMU's hashed page list, and defer
allocating the hashed list until it's actually needed (the TDP MMU doesn't
use the list).
|
|
KVM x86 misc changes for 6.17
- Prevert the host's DEBUGCTL.FREEZE_IN_SMM (Intel only) when running the
guest. Failure to honor FREEZE_IN_SMM can bleed host state into the guest.
- Explicitly check vmcs12.GUEST_DEBUGCTL on nested VM-Enter (Intel only) to
prevent L1 from running L2 with features that KVM doesn't support, e.g. BTF.
- Intercept SPEC_CTRL on AMD if the MSR shouldn't exist according to the
vCPU's CPUID model.
- Rework the MSR interception code so that the SVM and VMX APIs are more or
less identical.
- Recalculate all MSR intercepts from the "source" on MSR filter changes, and
drop the dedicated "shadow" bitmaps (and their awful "max" size defines).
- WARN and reject loading kvm-amd.ko instead of panicking the kernel if the
nested SVM MSRPM offsets tracker can't handle an MSR.
- Advertise support for LKGS (Load Kernel GS base), a new instruction that's
loosely related to FRED, but is supported and enumerated independently.
- Fix a user-triggerable WARN that syzkaller found by stuffing INIT_RECEIVED,
a.k.a. WFS, and then putting the vCPU into VMX Root Mode (post-VMXON). Use
the same approach KVM uses for dealing with "impossible" emulation when
running a !URG guest, and simply wait until KVM_RUN to detect that the vCPU
has architecturally impossible state.
- Add KVM_X86_DISABLE_EXITS_APERFMPERF to allow disabling interception of
APERF/MPERF reads, so that a "properly" configured VM can "virtualize"
APERF/MPERF (with many caveats).
- Reject KVM_SET_TSC_KHZ if vCPUs have been created, as changing the "default"
frequency is unsupported for VMs with a "secure" TSC, and there's no known
use case for changing the default frequency for other VM types.
|
|
Allow a guest to read the physical IA32_APERF and IA32_MPERF MSRs
without interception.
The IA32_APERF and IA32_MPERF MSRs are not virtualized. Writes are not
handled at all. The MSR values are not zeroed on vCPU creation, saved
on suspend, or restored on resume. No accommodation is made for
processor migration or for sharing a logical processor with other
tasks. No adjustments are made for non-unit TSC multipliers. The MSRs
do not account for time the same way as the comparable PMU events,
whether the PMU is virtualized by the traditional emulation method or
the new mediated pass-through approach.
Nonetheless, in a properly constrained environment, this capability
can be combined with a guest CPUID table that advertises support for
CPUID.6:ECX.APERFMPERF[bit 0] to induce a Linux guest to report the
effective physical CPU frequency in /proc/cpuinfo. Moreover, there is
no performance cost for this capability.
Signed-off-by: Jim Mattson <jmattson@google.com>
Link: https://lore.kernel.org/r/20250530185239.2335185-3-jmattson@google.com
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Link: https://lore.kernel.org/r/20250626001225.744268-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Store each "disabled exit" boolean in a single bit rather than a byte.
No functional change intended.
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Jim Mattson <jmattson@google.com>
Link: https://lore.kernel.org/r/20250530185239.2335185-2-jmattson@google.com
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Link: https://lore.kernel.org/r/20250626001225.744268-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Extract a common function from MSR interception disabling logic and create
disabling and enabling functions based on it. This removes most of the
duplicated code for MSR interception disabling/enabling.
No functional change intended.
Signed-off-by: Chao Gao <chao.gao@intel.com>
Reviewed-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Link: https://lore.kernel.org/r/20250612081947.94081-2-chao.gao@intel.com
[sean: s/enable/set, inline the wrappers]
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Allocate VM structs via kvzalloc(), i.e. try to use a contiguous physical
allocation before falling back to __vmalloc(), to avoid the overhead of
establishing the virtual mappings. For non-debug builds, The SVM and VMX
(and TDX) structures are now just below 7000 bytes in the worst case
scenario (see below), i.e. are order-1 allocations, and will likely remain
that way for quite some time.
Add compile-time assertions in vendor code to ensure the size of the
structures, sans the memslot hash tables, are order-0 allocations, i.e.
are less than 4KiB. There's nothing fundamentally wrong with a larger
kvm_{svm,vmx,tdx} size, but given that the size of the structure (without
the memslots hash tables) is below 2KiB after 18+ years of existence,
more than doubling the size would be quite notable.
Add sanity checks on the memslot hash table sizes, partly to ensure they
aren't resized without accounting for the impact on VM structure size, and
partly to document that the majority of the size of VM structures comes
from the memslots.
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Link: https://lore.kernel.org/r/20250523001138.3182794-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|