coreutils

Re: performance bug of `wc -m` on macOS


From: Bruno Haible
Subject: Re: performance bug of `wc -m` on macOS
Date: Mon, 21 May 2018 20:02:24 +0200
User-agent: KMail/5.1.3 (Linux/4.4.0-124-generic; KDE/5.18.0; x86_64; ; )

With the proposed function-pointer-factory changes, I'm seeing
this speedup on macOS systems:

                    num     mbc
  Before           0.153   0.229
  After            0.042   0.112
  -------------------------------
  Speedup factor    3.6     2.0

The profiler's output now is:
===============================================================================
--------------------------------------------------------------------------------
Profile data file 'callgrind.out.64367' (creator: callgrind-3.14.0.GIT)
--------------------------------------------------------------------------------
I1 cache: 
D1 cache: 
LL cache: 
Timerange: Basic block 0 - 158573914
Trigger: Program termination
Profiled target:  src/wc -m (PID 64367, part 1)
Events recorded:  Ir
Events shown:     Ir
Event sort order: Ir
Thresholds:       99
Include dirs:     
User annotated:   
Auto-annotation:  off

--------------------------------------------------------------------------------
         Ir 
--------------------------------------------------------------------------------
546,413,341  PROGRAM TOTALS

--------------------------------------------------------------------------------
         Ir  file:function
--------------------------------------------------------------------------------
134,811,404  ../src/wc.c:wc [src/wc]
103,504,100  ../lib/mbrtowc-factory.c:utf8_mbrtowc [src/wc]
 88,000,000  ???:__maskrune [/usr/lib/system/libsystem_c.dylib]
 66,000,000  ../lib/uniwidth/width.c:uc_width [src/wc]
 46,200,000  ???:mbsinit [/usr/lib/system/libsystem_c.dylib]
 21,000,000  ???:_UTF8_mbsinit [/usr/lib/system/libsystem_c.dylib]
 16,000,000  /usr/include/_ctype.h:wc
 12,600,000  ../lib/mbchar.h:wc
 12,200,674  ???:pthread_getspecific [/usr/lib/system/libsystem_pthread.dylib]
 10,000,000  ../lib/wcwidth-factory.c:utf8_wcwidth [src/wc]
  8,000,000  ../lib/streq.h:uc_width
  6,112,126  ???:__vsnprintf_chk [/usr/lib/system/libsystem_c.dylib]
  4,596,013  ???:ImageLoader::trieWalk(unsigned char const*, unsigned char const*, char const*) [/usr/lib/dyld]
  4,000,498  ???:rpl_wcwidth [src/wc]
  2,067,598  ???:ImageLoaderMachOCompressed::rebase(ImageLoader::LinkContext const&, unsigned long) [/usr/lib/dyld]
  1,773,141  ???:ImageLoaderMachO::libPath(unsigned int) const [/usr/lib/dyld]
    768,220  ???:ImageLoaderMachO::findExportedSymbol(char const*, bool, char const*, ImageLoader const**) const'2 [/usr/lib/dyld]
    758,551  ???:_mapStrHash(_NXMapTable*, void const*) [/usr/lib/libobjc.A.dylib]
    683,763  ???:ImageLoader::read_uleb128(unsigned char const*&, unsigned char const*) [/usr/lib/dyld]
    579,216  ???:ImageLoaderMachOCompressed::libReExported(unsigned int) const [/usr/lib/dyld]
    248,565  ???:ImageLoaderMachOCompressed::findShallowExportedSymbol(char const*, ImageLoader const**) const [/usr/lib/dyld]
    204,809  ???:ImageLoaderMachOCompressed::eachBind(ImageLoader::LinkContext const&, unsigned long (ImageLoaderMachOCompressed::*)(ImageLoader::LinkContext const&, unsigned long, unsigned char, char const*, unsigned char, long, long, char const*, ImageLoaderMachOCompressed::LastLookup*, bool)) [/usr/lib/dyld]
    203,803  ???:ImageLoaderMachO::findExportedSymbol(char const*, bool, char const*, ImageLoader const**) const [/usr/lib/dyld]
    200,688  ???:strcmp [/usr/lib/system/libsystem_kernel.dylib]
    179,635  ???:dyld::loadPhase5(char const*, char const*, dyld::LoadContext const&, unsigned int&, std::__1::vector<char const*, std::__1::allocator<char const*> >*) [/usr/lib/dyld]
    173,763  ???:_NXMapMember(_NXMapTable*, void const*, void**) [/usr/lib/libobjc.A.dylib]
    173,367  ???:_pthread_mutex_unlock_slow [/usr/lib/system/libsystem_pthread.dylib]
===============================================================================

Let's dissect the time (measured as instruction counts, Ir), as before:

mbrtowc:
103,504,100  ../lib/mbrtowc-factory.c:utf8_mbrtowc [src/wc]
 46,200,000  ???:mbsinit [/usr/lib/system/libsystem_c.dylib]
 21,000,000  ???:_UTF8_mbsinit [/usr/lib/system/libsystem_c.dylib]
-----------
170,704,100 = 31%

rpl_wcwidth:
  locale_charset: 0%
  uc_width:
 66,000,000  ../lib/uniwidth/width.c:uc_width [src/wc]
 10,000,000  ../lib/wcwidth-factory.c:utf8_wcwidth [src/wc]
  8,000,000  ../lib/streq.h:uc_width
  4,000,498  ???:rpl_wcwidth [src/wc]
-----------
 88,000,498 = 16%
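
For anyone who wants to re-check the two percentages above, here is a minimal
sketch of the arithmetic in C. The helper names ir_sum and ir_share_percent are
made up for this illustration and are not part of coreutils or gnulib; the
numbers to feed in are the Ir counts from the profile.

```c
#include <stddef.h>

/* Hypothetical helper: add up a list of Ir counts taken from the profile. */
static unsigned long long
ir_sum (const unsigned long long *counts, size_t n)
{
  unsigned long long total = 0;
  for (size_t i = 0; i < n; i++)
    total += counts[i];
  return total;
}

/* Hypothetical helper: share of the program total, in percent. */
static double
ir_share_percent (unsigned long long part, unsigned long long program_total)
{
  return 100.0 * (double) part / (double) program_total;
}
```

Summing 103,504,100 + 46,200,000 + 21,000,000 and dividing by the program
total of 546,413,341 gives the 31% for mbrtowc; the four rpl_wcwidth entries
give the 16% the same way.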

No more time is spent in locale_charset!
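
To illustrate why the per-character encoding lookup can disappear, here is a
rough sketch of the function-pointer-factory idea in C. It is not the code from
the proposed patch: the type width_fn, the functions with the _sketch suffix,
and the simplified width rules are all assumptions made for this example.
The point is only that the encoding name is examined once, in the factory, and
the returned function does no further lookup.

```c
#include <string.h>

/* Hypothetical signature for an encoding-specific width function. */
typedef int (*width_fn) (unsigned int uc);

/* Placeholder UTF-8 variant: control characters are non-printable,
   CJK Unified Ideographs count as double width, everything else as 1.
   Real width tables are far more detailed than this. */
static int
utf8_width_sketch (unsigned int uc)
{
  if (uc < 0x20 || uc == 0x7F)
    return -1;
  if (uc >= 0x4E00 && uc <= 0x9FFF)
    return 2;
  return 1;
}

/* Placeholder fallback variant: only printable ASCII is accepted. */
static int
ascii_width_sketch (unsigned int uc)
{
  if (uc < 0x20 || uc >= 0x7F)
    return -1;
  return 1;
}

/* The factory: inspect the encoding name once and hand back the
   matching function, so the caller's loop never repeats the test. */
static width_fn
width_factory_sketch (const char *encoding)
{
  if (strcmp (encoding, "UTF-8") == 0)
    return utf8_width_sketch;
  return ascii_width_sketch;
}
```

A caller would fetch the pointer once before its loop, for example from the
result of locale_charset (), and then call through it for every character.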

And the macOS-compatible rewrite of UTF-8 mbrtowc is about 2.3 times faster
than the macOS implementation.

Bruno



