Re: [PATCH 12/14] aspeed: Make aspeed_board_init_flashes public


From: Cédric Le Goater
Subject: Re: [PATCH 12/14] aspeed: Make aspeed_board_init_flashes public
Date: Tue, 5 Jul 2022 14:35:44 +0200
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:91.0) Gecko/20100101 Thunderbird/91.10.0

Hello Alex,

On 6/30/22 11:43, Alex Bennée wrote:

Cédric Le Goater <clg@kaod.org> writes:

On 6/29/22 20:24, Alex Bennée wrote:
Cédric Le Goater <clg@kaod.org> writes:

On 6/29/22 16:14, Alex Bennée wrote:
Cédric Le Goater <clg@kaod.org> writes:

On 6/24/22 18:50, Cédric Le Goater wrote:
On 6/23/22 20:43, Peter Delevoryas wrote:


On Jun 23, 2022, at 8:09 AM, Cédric Le Goater <clg@kaod.org> wrote:

On 6/23/22 12:26, Peter Delevoryas wrote:
Signed-off-by: Peter Delevoryas <pdel@fb.com>

Let's start simple without flash support. We should be able to
load FW blobs in each CPU address space using loader devices.

Actually, I was unable to do this, perhaps because the fb OpenBMC
boot sequence is a little weird. I specifically _needed_ to have
a flash device which maps the firmware in at 0x2000_0000, because
the fb OpenBMC U-Boot SPL jumps to that address to start executing
from flash? I think this is also why fb OpenBMC machines can be so slow.

$ ./build/qemu-system-arm -machine fby35 \
        -device loader,file=fby35.mtd,addr=0,cpu-num=0 -nographic \
        -d int -drive file=fby35.mtd,format=raw,if=mtd
Ideally we should be booting from the flash device directly using
the machine option '-M ast2600-evb,execute-in-place=true' like HW
does. Instructions are fetched using SPI transfers. But the amount
of code generated is tremendous.
Yeah, because there is a potential race when reading from HW, we throw
away TBs after executing them: we have no way of knowing if the code
has changed under our feet. See 873d64ac30 (accel/tcg: re-factor non-RAM
execution code), which cleaned up this handling.
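
For illustration, the resulting behaviour is roughly the following (a
conceptual sketch only: the function names below exist in accel/tcg, but
the exact cflags handling differs between QEMU versions):

    /* Sketch of non-RAM execution; not the exact upstream code. */
    if (get_page_addr_code(env, pc) == -1) {
        /* The PC is backed by MMIO (e.g. the SPI flash alias), not RAM.
         * Generate a one-shot, single-instruction TB ("| 1" sets the
         * instruction count in cflags), run it once, then discard it,
         * because the underlying bytes could change under our feet. */
        tb = tb_gen_code(cpu, pc, cs_base, flags, cflags | CF_LAST_IO | 1);
        cpu_tb_exec(cpu, tb, &tb_exit);
        tb_phys_invalidate(tb, -1);
    }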

See some profiling below for a
run which barely reaches DRAM training in U-Boot.

Some more profiling on both ast2500 and ast2600 machines shows:


* ast2600-evb,execute-in-place=true:

Type       Object          Call site                Wait Time (s)     Count  Average (us)
---------------------------------------------------------------------------------------------
BQL mutex  0x564dc03922e0  accel/tcg/cputlb.c:1365       14.21443  32909927          0.43
This is unavoidable as a HW access needs the BQL held, so we go
through this cycle for every executed instruction.
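
For reference, the contended call site is the MMIO slow path, which takes
the BQL around the device dispatch. Roughly (a sketch based on io_readx()
in accel/tcg/cputlb.c; details vary by version):

    /* Sketch of the MMIO read slow path around accel/tcg/cputlb.c:1365. */
    bool locked = false;
    if (!qemu_mutex_iothread_locked()) {
        qemu_mutex_lock_iothread();   /* the BQL wait measured above */
        locked = true;
    }
    r = memory_region_dispatch_read(mr, mr_offset, &val, op, attrs);
    if (locked) {
        qemu_mutex_unlock_iothread();
    }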
Did I miss why the flash contents are not mapped into the physical
address space? Isn't that how it appears to the processor?


There are two modes:

    if (ASPEED_MACHINE(machine)->mmio_exec) {
        memory_region_init_alias(boot_rom, NULL, "aspeed.boot_rom",
                                 &fl->mmio, 0, size);
        memory_region_add_subregion(get_system_memory(), FIRMWARE_ADDR,
                                    boot_rom);
    } else {
        memory_region_init_rom(boot_rom, NULL, "aspeed.boot_rom",
                               size, &error_abort);
        memory_region_add_subregion(get_system_memory(), FIRMWARE_ADDR,
                                    boot_rom);
        write_boot_rom(drive0, FIRMWARE_ADDR, size, &error_abort);
    }

The default boot mode uses the ROM. No issue.

The "execute-in-place=true" option creates an alias on the region of
the flash contents and each instruction is then fetched from the flash
drive with SPI transactions.
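
For completeness, the mmio_exec flag tested above is wired to the
machine-level "execute-in-place" option. A sketch of the usual QOM
boolean-property pattern in hw/arm/aspeed.c (the getter/setter names
here are assumptions and may differ from the actual code):

    static bool aspeed_get_mmio_exec(Object *obj, Error **errp)
    {
        return ASPEED_MACHINE(obj)->mmio_exec;
    }

    static void aspeed_set_mmio_exec(Object *obj, bool value, Error **errp)
    {
        ASPEED_MACHINE(obj)->mmio_exec = value;
    }

    /* in the machine class init: */
    object_class_property_add_bool(oc, "execute-in-place",
                                   aspeed_get_mmio_exec,
                                   aspeed_set_mmio_exec);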

I wonder if we were keeping the code before?


With old FW images, using an older U-Boot, the machine boots in a couple
of seconds. See the profiling below for a witherspoon-bmc machine using
U-Boot 2016.07.

    qemu-system-arm -M witherspoon-bmc,execute-in-place=true \
        -drive file=./flash-witherspoon-bmc,format=raw,if=mtd \
        -drive file=./flash-witherspoon-bmc2,format=raw,if=mtd \
        -nographic -nodefaults -snapshot -serial mon:stdio -enable-sync-profile
    ...
    U-Boot 2016.07-00040-g8425e96e2e27-dirty (Jun 24 2022 - 23:21:57 +0200)
              Watchdog enabled
    DRAM:  496 MiB
    Flash: 32 MiB
    In:    serial
    Out:   serial
    Err:   serial
    Net:
    (qemu) info sync-profile
    Type       Object          Call site                Wait Time (s)     Count  Average (us)
    ---------------------------------------------------------------------------------------------
    BQL mutex  0x56189610b2e0  accel/tcg/cputlb.c:1365        0.25311  12346237          0.02
    condvar    0x5618970cf220  softmmu/cpus.c:423             0.05506         2      27530.78
    BQL mutex  0x56189610b2e0  util/rcu.c:269                 0.04709         2      23544.26
    condvar    0x561896d0fc78  util/thread-pool.c:90          0.01340        83        161.47
    condvar    0x56189610b240  softmmu/cpus.c:571             0.00005         1         54.93
    condvar    0x56189610b280  softmmu/cpus.c:642             0.00003         1         32.88
    BQL mutex  0x56189610b2e0  util/main-loop.c:318           0.00003        34          0.76
    mutex      0x561896eade00  tcg/region.c:204               0.00002       995          0.02
    rec_mutex  [           2]  util/async.c:682               0.00002       493          0.03
    mutex      [           2]  chardev/char.c:118             0.00001       404          0.03
    ---------------------------------------------------------------------------------------------


However, with recent U-Boot versions, it takes quite a while to reach DRAM
training, close to a minute. See the profiling below for an ast2500-evb
machine using U-Boot 2019.04.

    qemu-system-arm -M ast2500-evb,execute-in-place=true \
        -net nic,macaddr=C0:FF:EE:00:00:03,netdev=net0 \
        -drive file=./flash-ast2500-evb,format=raw,if=mtd \
        -nographic -nodefaults -snapshot -serial mon:stdio -enable-sync-profile
    qemu-system-arm: warning: Aspeed iBT has no chardev backend
    qemu-system-arm: warning: nic ftgmac100.1 has no peer
    QEMU 7.0.50 monitor - type 'help' for more information
    U-Boot 2019.04-00080-g6ca27db3f97b-dirty (Jun 24 2022 - 23:22:03 +0200)
    SOC : AST2500-A1
    RST : Power On
    LPC Mode : SIO:Enable : SuperIO-2e
    Eth : MAC0: RGMII, , MAC1: RGMII,
    Model: AST2500 EVB
    DRAM:  448 MiB (capacity:512 MiB, VGA:64 MiB, ECC:off)
    MMC:   sdhci_slot0@100: 0, sdhci_slot1@200: 1
    Loading Environment from SPI Flash... SF: Detected mx25l25635e with page size 256 Bytes, erase size 64 KiB, total 32 MiB
    *** Warning - bad CRC, using default environment
    In:    serial@1e784000
    Out:   serial@1e784000
    Err:   serial@1e784000
    Net:   eth0: ethernet@1e660000
    Warning: ethernet@1e680000 (eth1) using random MAC address - 4a:e5:9a:4a:c7:c5
    , eth1: ethernet@1e680000
    Hit any key to stop autoboot:  2
    (qemu) info sync-profile
    Type       Object          Call site                Wait Time (s)     Count  Average (us)
    ---------------------------------------------------------------------------------------------
    condvar    0x561f10c9ef88  util/thread-pool.c:90         10.01196        28     357570.00
    BQL mutex  0x561f102362e0  accel/tcg/cputlb.c:1365        0.29496  14248621          0.02
    condvar    0x561f110325a0  softmmu/cpus.c:423             0.02231         2      11152.57
    BQL mutex  0x561f102362e0  util/rcu.c:269                 0.01447         4       3618.60
    condvar    0x561f10236240  softmmu/cpus.c:571             0.00010         1        102.19
    mutex      0x561f10e9f1c0  tcg/region.c:204               0.00007      3052          0.02
    mutex      [           2]  chardev/char.c:118             0.00003      1486          0.02
    condvar    0x561f10236280  softmmu/cpus.c:642             0.00003         1         29.38
    BQL mutex  0x561f102362e0  accel/tcg/cputlb.c:1426        0.00002       973          0.02
    BQL mutex  0x561f102362e0  util/main-loop.c:318           0.00001        34          0.41
    ---------------------------------------------------------------------------------------------

Something in the layout of the FW is making a big difference. One thing
that could be relevant is that the recent versions use a device tree.

There might be no good solution to this issue, but I am failing to
analyze it correctly. Is there a way to collect information on the
usage of Translation Blocks?

You could expand the data we collect in tb_tree_stats and expose it
via info jit.
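
A minimal sketch of that kind of extension, assuming the tb_tree_stats
struct and its iterator in accel/tcg/translate-all.c (the new counter and
its name are hypothetical):

    /* Count single-instruction TBs, which is what MMIO execution
     * produces, and report them via "info jit". */
    struct tb_tree_stats {
        /* ... existing fields ... */
        size_t single_insn_tbs;   /* hypothetical new counter */
    };

    static gboolean tb_tree_stats_iter(gpointer key, gpointer value,
                                       gpointer data)
    {
        const TranslationBlock *tb = value;
        struct tb_tree_stats *tst = data;

        /* ... existing accounting ... */
        if (tb->icount == 1) {
            tst->single_insn_tbs++;
        }
        return false;
    }

    /* in dump_exec_info(), next to the other counters: */
    g_string_append_printf(buf, "single insn TBs     %zu\n",
                           tst.single_insn_tbs);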

The "fast" run, U-Boot 2016.07, gives :


   Translation buffer state:
   gen code size       254880371/1073736704
   TB count            1089
   TB avg target size  16 max=356 bytes
   TB avg host size    278 bytes (expansion ratio: 17.2)
   cross page TB count 0 (0%)
   direct jump count   501 (46%) (2 jumps=372 34%)
   TB hash buckets     1025/8192 (12.51% head buckets used)
   TB hash occupancy   3.32% avg chain occ. Histogram: [0.0,7.5)%|█  ▁  ▁  ▁|[67.5,75.0]%
   TB hash avg chain   1.000 buckets. Histogram: 1|█|1

   Statistics:
   TB flush count      0
   TB invalidate count 0
   TLB full flushes    0
   TLB partial flushes 2
   TLB elided flushes  2338
   JIT cycles          2221788409 (0.926 s at 2.4 GHz)
   translated TBs      738520 (aborted=0 0.0%)
   avg ops/TB          15.7 max=459
   deleted ops/TB      2.72
   avg temps/TB        32.89 max=88
   avg host code/TB    113.7
   avg search data/TB  5.2
   cycles/op           192.0
   cycles/in byte      748.7
   cycles/out byte     26.4
   cycles/search byte     582.8
     gen_interm time   57.6%
     gen_code time     42.4%
   optim./code time    19.4%
   liveness/code time  26.1%
   cpu_restore count   0
     avg cycles        0.0
   and the "slow", U-Boot 2019.04 :

Translation buffer state:
gen code size       368603795/1073736704
TB count            3052
TB avg target size  16 max=360 bytes
TB avg host size    293 bytes (expansion ratio: 17.6)
cross page TB count 0 (0%)
direct jump count   1431 (46%) (2 jumps=1104 36%)
TB hash buckets     2559/8192 (31.24% head buckets used)
TB hash occupancy   9.31% avg chain occ. Histogram: [0,10)%|█ ▃  ▁ ▁ ▁|[90,100]%
TB hash avg chain   1.000 buckets. Histogram: 1|█|1

Statistics:
TB flush count      3

Something triggers 3 complete TB flushes in the slow run (3 x 1000 ~= 3000
TBs). This could be the translation buffer overflowing, but:

fast: gen code size       254880371/1073736704
slow: gen code size       368603795/1073736704

doesn't look like we are getting close to overflowing. Must be something
else.
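
One way to attribute those flushes would be to instrument do_tb_flush()
in accel/tcg/translate-all.c and log each one (a sketch; field and helper
names are from my reading of the tree and may differ by version):

    /* Sketch: log every complete TB flush to see what triggers the
     * three flushes in the slow run. */
    static void do_tb_flush(CPUState *cpu, run_on_cpu_data tb_flush_count)
    {
        fprintf(stderr, "tb_flush #%d: %zu TBs, %zu bytes of code\n",
                tb_flush_count.host_int + 1,
                tcg_nb_tbs(), tcg_code_size());
        /* ... existing flush logic ... */
    }

Alternatively, a breakpoint on tb_flush() in gdb would show the callers
directly.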


Here are images to reproduce the issue. Both machines use the same AST2500 SoC.

* U-Boot 2016.07

  wget https://github.com/openbmc/openbmc/releases/download/2.9.0/obmc-phosphor-image-romulus.static.mtd

  qemu-system-arm -M romulus-bmc \
      -drive file=./obmc-phosphor-image-romulus.static.mtd,format=raw,if=mtd -nographic
  qemu-system-arm -M romulus-bmc,execute-in-place=true \
      -drive file=./obmc-phosphor-image-romulus.static.mtd,format=raw,if=mtd -nographic

* U-Boot 2019.04

  
  https://github.com/legoater/qemu-aspeed-boot/raw/master/images/ast2500-evb/buildroot-2022.05/flash.img

  qemu-system-arm -M ast2500-evb \
      -drive file=./flash.img,format=raw,if=mtd -nographic
  qemu-system-arm -M ast2500-evb,execute-in-place=true \
      -drive file=./flash.img,format=raw,if=mtd -nographic


Thanks,

C.



TB invalidate count 0
TLB full flushes    0
TLB partial flushes 3
TLB elided flushes  2367
JIT cycles          26479044772 (11.033 s at 2.4 GHz)
translated TBs      10552169 (aborted=0 0.0%)
avg ops/TB          15.0 max=464
deleted ops/TB      2.44
avg temps/TB        32.43 max=89
avg host code/TB    99.0
avg search data/TB  5.0
cycles/op           167.7
cycles/in byte      626.8
cycles/out byte     25.4
cycles/search byte     499.4
   gen_interm time   50.4%
   gen_code time     49.6%
optim./code time    19.5%
liveness/code time  27.7%
cpu_restore count   0
   avg cycles        0.0


A lot more TBs.
   C.





