|
| 1 | +--- |
| 2 | +title: Domain.build |
| 3 | +description: |
| 4 | + "Prepare the build of a VM: Wait for scrubbing, do NUMA placement, run xenguest." |
| 5 | +--- |
| 6 | + |
| 7 | +## Overview |
| 8 | + |
| 9 | +```mermaid |
| 10 | +flowchart LR |
| 11 | +subgraph xenopsd VM_build[ |
| 12 | + xenopsd thread pool with two VM_build micro#8209;ops: |
| 13 | + During parallel VM_start, many threads run this in parallel! |
| 14 | +] |
| 15 | +direction LR |
| 16 | +build_domain_exn[ |
| 17 | + VM.build_domain_exn |
| 18 | + from thread pool Thread #1 |
| 19 | +] --> Domain.build |
| 20 | +Domain.build --> build_pre |
| 21 | +build_pre --> wait_xen_free_mem |
| 22 | +build_pre -->|if NUMA/Best_effort| numa_placement |
| 23 | +Domain.build --> xenguest[Invoke xenguest] |
| 24 | +click Domain.build "https://github.com/xapi-project/xen-api/blob/master/ocaml/xenopsd/xc/domain.ml#L1111-L1210" _blank |
| 25 | +click build_domain_exn "https://github.com/xapi-project/xen-api/blob/master/ocaml/xenopsd/xc/xenops_server_xen.ml#L2222-L2225" _blank |
| 26 | +click wait_xen_free_mem "https://github.com/xapi-project/xen-api/blob/master/ocaml/xenopsd/xc/domain.ml#L236-L272" _blank |
| 27 | +click numa_placement "https://github.com/xapi-project/xen-api/blob/master/ocaml/xenopsd/xc/domain.ml#L862-L897" _blank |
| 28 | +click build_pre "https://github.com/xapi-project/xen-api/blob/master/ocaml/xenopsd/xc/domain.ml#L899-L964" _blank |
| 29 | +click xenguest "https://github.com/xapi-project/xen-api/blob/master/ocaml/xenopsd/xc/domain.ml#L1139-L1146" _blank |
| 30 | + |
| 31 | +build_domain_exn2[ |
| 32 | + VM.build_domain_exn |
| 33 | + from thread pool Thread #2] --> Domain.build2[Domain.build] |
| 34 | +Domain.build2 --> build_pre2[build_pre] |
| 35 | +build_pre2 --> wait_xen_free_mem2[wait_xen_free_mem] |
| 36 | +build_pre2 -->|if NUMA/Best_effort| numa_placement2[numa_placement] |
| 37 | +Domain.build2 --> xenguest2[Invoke xenguest] |
| 38 | +click Domain.build2 "https://github.com/xapi-project/xen-api/blob/master/ocaml/xenopsd/xc/domain.ml#L1111-L1210" _blank |
| 39 | +click build_domain_exn2 "https://github.com/xapi-project/xen-api/blob/master/ocaml/xenopsd/xc/xenops_server_xen.ml#L2222-L2225" _blank |
| 40 | +click wait_xen_free_mem2 "https://github.com/xapi-project/xen-api/blob/master/ocaml/xenopsd/xc/domain.ml#L236-L272" _blank |
| 41 | +click numa_placement2 "https://github.com/xapi-project/xen-api/blob/master/ocaml/xenopsd/xc/domain.ml#L862-L897" _blank |
| 42 | +click build_pre2 "https://github.com/xapi-project/xen-api/blob/master/ocaml/xenopsd/xc/domain.ml#L899-L964" _blank |
| 43 | +click xenguest2 "https://github.com/xapi-project/xen-api/blob/master/ocaml/xenopsd/xc/domain.ml#L1139-L1146" _blank |
| 44 | +end |
| 45 | +``` |
| 46 | + |
| 47 | +[`VM.build_domain_exn`](https://github.com/xapi-project/xen-api/blob/master/ocaml/xenopsd/xc/xenops_server_xen.ml#L2024-L2248) |
| 48 | +[calls](https://github.com/xapi-project/xen-api/blob/master/ocaml/xenopsd/xc/xenops_server_xen.ml#L2222-L2225) |
| 49 | +[`Domain.build`](https://github.com/xapi-project/xen-api/blob/master/ocaml/xenopsd/xc/domain.ml#L1111-L1210) |
| 50 | +which in turn calls: |
| 51 | +- `build_pre` to prepare the build of a VM: |
| 52 | + - Wait, if necessary, for the Xen memory scrubber to free enough memory. |
| 53 | + - If the `xe` config `numa_placement` is set to `Best_effort`, invoke the NUMA placement algorithm. |
| 54 | +- `xenguest` to invoke the [xenguest](xenguest) program to set up the domain's system memory. |
| 55 | + |
| 56 | +## Domain Build Preparation using build_pre |
| 57 | + |
| 58 | +[`Domain.build`](https://github.com/xapi-project/xen-api/blob/master/ocaml/xenopsd/xc/domain.ml#L1111-L1210) |
| 59 | +[calls](https://github.com/xapi-project/xen-api/blob/master/ocaml/xenopsd/xc/domain.ml#L1137) |
| 60 | +the [function `build_pre`](https://github.com/xapi-project/xen-api/blob/master/ocaml/xenopsd/xc/domain.ml#L899-L964) |
| 61 | +(which is also used for VM restore). It must: |
| 62 | + |
| 63 | +1. [Call](https://github.com/xapi-project/xen-api/blob/master/ocaml/xenopsd/xc/domain.ml#L902-L911) |
| 64 | + [wait_xen_free_mem](https://github.com/xapi-project/xen-api/blob/master/ocaml/xenopsd/xc/domain.ml#L236-L272) |
| 65 | + to wait, if necessary, for the Xen memory scrubber to catch up reclaiming memory (CA-39743) |
| 66 | +2. Call the hypercall to set the timer mode |
| 67 | +3. Call the hypercall to set the number of vCPUs |
| 68 | +4. As described in the [NUMA feature description](../../toolstack/features/NUMA), |
| 69 | + when the `xe` configuration option `numa_placement` is set to `Best_effort`, |
| 70 | + except when the VM has a hard affinity set, invoke the `numa_placement` function: |
| 71 | + |
| 72 | + ```ml |
| 73 | + match !Xenops_server.numa_placement with |
| 74 | + | Any -> |
| 75 | + () |
| 76 | + | Best_effort -> |
| 77 | + log_reraise (Printf.sprintf "NUMA placement") (fun () -> |
| 78 | + if has_hard_affinity then |
| 79 | + D.debug "VM has hard affinity set, skipping NUMA optimization" |
| 80 | + else |
| 81 | + numa_placement domid ~vcpus |
| 82 | + ~memory:(Int64.mul memory.xen_max_mib 1048576L) |
| 83 | + ) |
| 84 | + ``` |
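
Step 1 above can be illustrated with a small polling loop. This is a minimal sketch, not the code of `wait_xen_free_mem`: the record `host_mem`, the `probe` and `sleep` callbacks, the `max_wait_s` limit and the one-second polling interval are all assumptions made for illustration. It shows only the idea of waiting while the scrubber still has memory to hand back, and giving up once nothing is left to scrub or the time limit is reached.

```ocaml
(* Hypothetical snapshot of host memory, in KiB (not a Xenctrl type). *)
type host_mem = {free_kib: int64; scrub_kib: int64}

(* Poll [probe] until [required_kib] is free. Returns [true] on success and
   [false] if no memory is left to scrub or [max_wait_s] polls have passed. *)
let wait_for_free_mem ~probe ~sleep ~max_wait_s required_kib =
  let rec loop waited_s =
    let m = probe () in
    if m.free_kib >= required_kib then
      true (* enough memory is already free *)
    else if m.scrub_kib = 0L then
      false (* nothing is being scrubbed, so waiting cannot help *)
    else if waited_s >= max_wait_s then
      false (* time limit reached *)
    else (
      sleep 1.0 ;
      loop (waited_s + 1)
    )
  in
  loop 0

(* Simulated host: every call to [sleep] scrubs up to 100 KiB. *)
let simulate ~free ~scrub ~required =
  let state = ref {free_kib= free; scrub_kib= scrub} in
  let probe () = !state in
  let sleep (_ : float) =
    let m = !state in
    let step = min 100L m.scrub_kib in
    state :=
      { free_kib= Int64.add m.free_kib step
      ; scrub_kib= Int64.sub m.scrub_kib step }
  in
  wait_for_free_mem ~probe ~sleep ~max_wait_s:10 required

let () =
  (* 100 KiB free plus 300 KiB being scrubbed can satisfy 350 KiB... *)
  assert (simulate ~free:100L ~scrub:300L ~required:350L) ;
  (* ...but never 500 KiB. *)
  assert (not (simulate ~free:100L ~scrub:300L ~required:500L))
```

Passing `probe` and `sleep` as arguments keeps the loop testable without a hypervisor; a real caller would read the host's memory counters and sleep for real.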
| 85 | + |
| 86 | +## NUMA placement |
| 87 | + |
| 88 | +`build_pre` passes the `domid`, the number of `vCPUs` and `xen_max_mib` to the |
| 89 | +[numa_placement](https://github.com/xapi-project/xen-api/blob/master/ocaml/xenopsd/xc/domain.ml#L862-L897) |
| 90 | +function to run the algorithm to find the best NUMA placement. |
| 91 | + |
| 92 | +When the plan yields a NUMA node to use, `numa_placement` calls the Xen hypercalls |
| 93 | +to set the vCPU affinity to this NUMA node. The planning step looks like this: |
| 94 | + |
| 95 | +```ml |
| 96 | + let vm = NUMARequest.make ~memory ~vcpus in |
| 97 | + let nodea = |
| 98 | + match !numa_resources with |
| 99 | + | None -> |
| 100 | + Array.of_list nodes |
| 101 | + | Some a -> |
| 102 | + Array.map2 NUMAResource.min_memory (Array.of_list nodes) a |
| 103 | + in |
| 104 | + numa_resources := Some nodea ; |
| 105 | + Softaffinity.plan ~vm host nodea |
| 106 | +``` |
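
The bookkeeping in the excerpt above can be tried out in isolation. The sketch below is self-contained and rests on assumptions: the `node` record is a hypothetical stand-in for `NUMAResource`, `min_memory` is assumed to keep the smaller free-memory figure of two readings for the same node (which its name suggests but this page does not confirm), and the cache is passed in as a parameter instead of living in a global `numa_resources` reference. `Softaffinity.plan` is not modelled.

```ocaml
(* Hypothetical stand-in for a NUMA node's resources. *)
type node = {id: int; memfree: int64}

(* Assumed behaviour: keep the more pessimistic free-memory reading. *)
let min_memory a b = {a with memfree= min a.memfree b.memfree}

(* Combine fresh readings with the cached ones and update the cache.
   [Array.map2] raises [Invalid_argument] if the node counts differ. *)
let refresh (cache : node array option ref) (nodes : node list) =
  let nodea =
    match !cache with
    | None ->
        Array.of_list nodes
    | Some a ->
        Array.map2 min_memory (Array.of_list nodes) a
  in
  cache := Some nodea ;
  nodea

let () =
  let cache = ref None in
  let first = refresh cache [{id= 0; memfree= 8L}; {id= 1; memfree= 4L}] in
  assert (first.(0).memfree = 8L && first.(1).memfree = 4L) ;
  (* Node 0 now reports less, node 1 reports more: the minimum is kept. *)
  let second = refresh cache [{id= 0; memfree= 6L}; {id= 1; memfree= 5L}] in
  assert (second.(0).memfree = 6L && second.(1).memfree = 4L)
```

With these assumptions, the cached figure for a node never grows between calls.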
| 107 | + |
| 108 | +By using the default `auto_node_affinity` feature of Xen, |
| 109 | +setting the vCPU affinity causes the Xen hypervisor to activate |
| 110 | +NUMA node affinity for memory allocations to be aligned with |
| 111 | +the vCPU affinity of the domain. |
| 112 | + |
| 113 | +Note: This behaviour is controlled by the Xen domain's |
| 114 | +[auto_node_affinity](https://wiki.xenproject.org/wiki/NUMA_node_affinity_in_the_Xen_hypervisor) |
| 115 | +feature flag, which can be overridden in the |
| 116 | +Xen hypervisor if needed for specific VMs. |
| 117 | + |
| 118 | +This can be used, for example, when there might not be enough memory |
| 119 | +on the preferred NUMA node, but other NUMA nodes have |
| 120 | +enough free memory among which the memory allocations should be made. |
| 121 | + |
| 122 | +In terms of future NUMA design, it might be even more favourable to |
| 123 | +have a strategy in `xenguest` where in such cases, the superpages |
| 124 | +of the preferred node are used first and a fallback to neighbouring |
| 125 | +NUMA nodes only happens to the extent necessary. |
| 126 | + |
| 127 | +Likely, the future allocation strategy should be passed to `xenguest` |
| 128 | +using Xenstore like the other platform parameters for the VM. |
| 129 | + |
| 130 | +Summary: Setting the vCPU affinity tells the hypervisor that memory |
| 131 | +allocation for this domain should preferably be done from this NUMA node. |
| 132 | + |
| 133 | +## Invoke the xenguest program |
| 134 | + |
| 135 | +With the preparation in `build_pre` completed, `Domain.build` |
| 136 | +[calls](https://github.com/xapi-project/xen-api/blob/master/ocaml/xenopsd/xc/domain.ml#L1127-L1155) |
| 137 | +the `xenguest` function to invoke the [xenguest](xenguest) program to build the domain. |