Re: [Qemu-devel] [RFC] x86: Allow to set NUMA distance for different NUMA nodes


From: Markus Armbruster
Subject: Re: [Qemu-devel] [RFC] x86: Allow to set NUMA distance for different NUMA nodes
Date: Tue, 07 Mar 2017 08:37:03 +0100
User-agent: Gnus/5.13 (Gnus v5.13) Emacs/25.1 (gnu/linux)

He Chen <address@hidden> writes:

> On Fri, Mar 03, 2017 at 02:10:50PM -0300, Eduardo Habkost wrote:
>> On Fri, Mar 03, 2017 at 04:52:18PM +0000, Daniel P. Berrange wrote:
>> > On Fri, Mar 03, 2017 at 01:47:51PM -0300, Eduardo Habkost wrote:
>> > > On Fri, Mar 03, 2017 at 04:26:12PM +0000, Daniel P. Berrange wrote:
>> > > > On Fri, Mar 03, 2017 at 10:09:22AM -0600, Eric Blake wrote:
>> > > > > On 03/03/2017 07:57 AM, Eduardo Habkost wrote:
>> > > > > 
>> > > > > >> With this patch, when a user wants to create a guest that contains
>> > > > > >> several vNUMA nodes and also wants to set distance among those nodes,
>> > > > > >> the QEMU command would look like:
>> > > > > >>
>> > > > > >> ```
>> > > > > >> -object memory-backend-ram,size=1G,prealloc=yes,host-nodes=0,policy=bind,id=node0 \
>> > > > > >> -numa node,nodeid=0,cpus=0,memdev=node0,distance=10,distance=21,distance=31,distance=41 \
>> > > > > 
>> > > > > > 
>> > > > > > It would be nice to have a more intuitive syntax to represent
>> > > > > > ordered lists in QemuOpts. But this is what we have today.
>> > > > > > 
>> > > > > 
>> > > > > Markus has the discussion on representing arrays via the command line;
>> > > > > particularly since this array is very tightly coupled to the order in
>> > > > > which values are presented, it may be worth having:
>> > > > > 
>> > > > > -numa node,nodeid=0,cpus=0,memdev=node0,distance.0=10,distance.1=21,distance.2=31,distance.3=41
>> > > > > 
>> > > > > with the explicit distance.0= suffixes to distance making it more
>> > > > > obvious that we are dealing with an array.
>> > > > > 
>> > > > > > I think the proposal makes sense. I would like the semantics of the
>> > > > > > new option to be documented at qapi-schema.json and qemu-options.hx.
>> > > > > > 
>> > > > > > I would call the new NumaNodeOptions field "distances", as it is
>> > > > > > a list of distances.
>> > > > > 
>> > > > > Indeed, Markus is trying (with his work on -blockdev for 2.9) to get the
>> > > > > command line to a point where it is identical to the QMP code, by
>> > > > > reusing qapi-schema.json, so we should very much keep that in mind with
>> > > > > whatever we add to -numa in 2.10.
>> > > > > 
>> > > > > 
>> > > > > > but in the future we could support something like:
>> > > > > > 
>> > > > > >   -numa node,nodeid=0,cpus=0,memdev=node0 \
>> > > > > >   -numa node,nodeid=1,cpus=1,memdev=node1 \
>> > > > > >   -numa node,nodeid=2,cpus=2,memdev=node2 \
>> > > > > >   -numa node,nodeid=3,cpus=3,memdev=node3 \
>> > > > > >   -numa distances,distances[0][0]=10,distances[0][1]=21,distances[0][2]=31,distances[0][3]=41,\
>> > > > > >                   distances[1][0]=21,distances[1][1]=10,distances[1][2]=21,distances[1][3]=31,\
>> > > > > >                   distances[2][0]=31,distances[2][1]=21,distances[2][2]=10,distances[2][3]=21,\
>> > > > > >                   distances[3][0]=41,distances[3][1]=31,distances[3][2]=21,distances[3][3]=10
>> > > > > 
>> > > > > Except that [] requires special shell quoting, so the proposal would be
>> > > > > more like:
>> > > > > 
>> > > > > -numa distances.0.0=10,distances.0.1=21
>> > > > > 
>> > > > > Right now, QMP doesn't support 2-D arrays (although this may be a good
>> > > > > reason to introduce support), so that's also something to think about
>> > > > > (not insurmountable, but makes the task more complex).
>> > > > 
>> > > > What I don't like about this syntax is that it is duplicating
>> > > > information. IIUC the NUMA distance information is unidirectional, so
>> > > > specifying the same data for both directions (node 0 -> node 3, and
>> > > > node 3 -> node 0) looks like overkill. Also the self-node distance is
>> > > > defined to always be 10 IIUC, so specifying that is not required. IOW,
>> > > > we could cut down the data we need to provide to just
>> > > > 
>> > > >    -numa distances,nodea=0,nodeb=1,value=20
>> > > >    -numa distances,nodea=0,nodeb=2,value=20
>> > > >    -numa distances,nodea=0,nodeb=3,value=20
>> > > >    -numa distances,nodea=1,nodeb=2,value=20
>> > > >    -numa distances,nodea=1,nodeb=3,value=20
>> > > >    -numa distances,nodea=2,nodeb=3,value=20
>> > > 
>> > > The ACPI spec (I'm looking at revision 5.0) explicitly mentions
>> > > that A->B distance may be different from B->A distance:
>> > > 
>> > > "The entry value is a one-byte unsigned integer. The relative
>> > > distance from System Locality i to System Locality j is the
>> > > i*N + j entry in the matrix, where N is the number of System
>> > > Localities.  Except for the relative distance from a System
>> > > Locality to itself, each relative distance is stored twice in the
>> > > matrix. This provides the capability to describe the scenario
>> > > where the relative distances for the two directions between
>> > > System Localities is different."
>> > 
>> > Ah interesting, learn something new every day ? I've only made
>> > that unidirectional assumption for the last 10 years ;-P
>> > 
>> > > But I agree we could figure out a more compact syntax for more
>> > > common cases where self-node distance is 10 and distance is the
>> > > same both ways.
>> > 
>> > QAPI would need a specialized numeric matrix type, which we could
>> > efficiently map into some CLI syntax, in order to avoid needing to
>> > tickle the rather verbose general purpose list syntax. Probably
>> > not worth the hassle though - rather than just picking shorter
>> > variable names eg
>> > 
>> >   -numa dist,a=0,b=1,val=3
>> > 
>> > instead of
>> > 
>> >   -numa distances,nodea=0,nodeb=1,value=20
>> 
>> Whatever syntax/names we choose, we could have reasonable
>> defaults for omitted values:
>> 
>> * If A->B is set and B->A is omitted, use the same value for both
>>   A->B and B->A
>> * If A->A is omitted, use min(10, configured_distances)
>> 
>> This way, the previous example:
>> 
>>    -numa distances,distances.0.0=10,distances.0.1=21,distances.0.2=31,distances.0.3=41,\
>>                    distances.1.0=21,distances.1.1=10,distances.1.2=21,distances.1.3=31,\
>>                    distances.2.0=31,distances.2.1=21,distances.2.2=10,distances.2.3=21,\
>>                    distances.3.0=41,distances.3.1=31,distances.3.2=21,distances.3.3=10
>> 
>> could be written as:
>> 
>>    -numa distances,distances.0.1=21,distances.0.2=31,distances.0.3=41,\
>>                                     distances.1.2=21,distances.1.3=31,\
>>                                                      distances.2.3=21
>> 
> It seems that the dotted key convention has not been supported yet.

It's being built.  We're going to use an initial version for -blockdev
in 2.9:

    [PULL v3 00/24] block: Command line option -blockdev
    Message-Id: <address@hidden>
    http://repo.or.cz/w/qemu/armbru.git tag pull-block-2017-02-28-v3

PATCH 03 lists user interface differences to QemuOpts, briefly.  PATCH
21 shows how to use it.

It's not a drop-in replacement for QemuOpts.  It's meant to be used
with the QObject input visitor to produce a QAPI type holding the
configuration, like PATCH 21 does.
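
As a rough illustration only (this is not QEMU's parser; parse_dotted is
a made-up name, and the real code additionally turns numeric fragments
into lists and handles escaping), dotted keys can be read as paths into
a nested structure, with all values kept as strings:

```python
def parse_dotted(optstr):
    """Turn 'a.0.1=21,a.0.2=31' into nested dicts, one level per key fragment.

    Simplifying assumptions: no escaped commas, every item has the form
    key=value, and values are left as strings.
    """
    root = {}
    for item in optstr.split(","):
        key, _, value = item.partition("=")
        *parents, leaf = key.split(".")
        node = root
        for frag in parents:
            # Create intermediate levels on demand.
            node = node.setdefault(frag, {})
        node[leaf] = value
    return root


cfg = parse_dotted("distances.0.1=21,distances.0.2=31,distances.1.2=21")
print(cfg)
```

Converting the strings to numbers and checking them against the schema
is the visitor's job in this picture, not the parser's.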

There is no support for -readconfig and -writeconfig, yet.  I guess
that's a show stopper for -numa.  We'll get there, but it'll take time.

> So which syntax do you think is proper for NUMA distance?
> Maybe I will implement something like `-numa dist,a=0,b=1,val=21` first
> then change the syntax to dotted key convention when it gets merged?

I don't have an opinion there, just want to point out that we can mess
around during development, but once we've released an external interface,
we'd better stick to it.  So keep that in mind.
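
For reference, the defaulting rules quoted above (mirror A->B into B->A
when B->A is omitted, and use min(10, configured distances) for an
omitted A->A) could be sketched like this.  This is a sketch under
assumptions, not proposed QEMU code: fill_distances is a hypothetical
name, and how a completely unspecified pair should be handled was not
settled in this thread, so the sketch simply raises KeyError for it.

```python
def fill_distances(n, given):
    """Build an n x n distance matrix from a partial {(a, b): distance} dict."""
    d = dict(given)
    for (a, b), v in given.items():
        # Rule 1: an omitted B->A takes the value of A->B.
        d.setdefault((b, a), v)
    # Rule 2: an omitted A->A is min(10, configured distances).
    local = min([10] + list(given.values()))
    matrix = []
    for a in range(n):
        row = []
        for b in range(n):
            if a == b:
                row.append(d.get((a, a), local))
            else:
                row.append(d[(a, b)])  # assumption: a missing pair is an error
        matrix.append(row)
    return matrix


given = {(0, 1): 21, (0, 2): 31, (0, 3): 41,
         (1, 2): 21, (1, 3): 31, (2, 3): 21}
matrix = fill_distances(4, given)

# Flattened row by row, so the distance from i to j is entry i*N + j,
# as in the ACPI text quoted above.
N = 4
flat = [v for row in matrix for v in row]
print(matrix)
```

With the six values from the compact example, this yields the same
4x4 matrix as the fully spelled-out one.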


