[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
bug#39413: 26.2; Emacs gets hung
From: |
chiaki-ishikawa-thunderbird-account |
Subject: |
bug#39413: 26.2; Emacs gets hung |
Date: |
Wed, 20 May 2020 13:31:29 +0900 |
User-agent: |
Mozilla/5.0 (Windows NT 10.0; WOW64; rv:68.0) Gecko/20100101 Thunderbird/68.8.0 |
Hi,
I am back.
I have observed another long hung.
This time, I did verify that it is during GC.
The hung lasted close to 8 minutes give or take.
My comment inline.
On 2020/04/30 11:45, chiaki-ishikawa-thunderbird-account wrote:
Hi,
Thank you for your tips.
My comment inline.:
On 2020/04/29 15:55, Eli Zaretskii wrote:
From: "ISHIKAWA,chiaki" <ishikawa@yk.rim.or.jp>
Date: Wed, 29 Apr 2020 06:36:46 +0900
Cc: chiaki-ishikawa-thunderbird-account <chiaki.ishikawa@ubin.jp>,
Noam Postavsky <npostavs@gmail.com>
I am back to this strange guru meditation of emacs.
Emacs would eventually begin working, but the silent period where no
key
or mouse event is acknowledged is rather longish.
Several weeks after I installed GNU Emacs 28.0.50 (that is what
|emacs-version| variable says. I installed from git repo.) , I began
noticing the occasional long pause (no response) again.
I suggest to begin by establishing whether these pauses are caused by
GC. To that end, set garbage-collection-messages to a non-nil value,
and see if Emacs displays a message about GC in progress when these
pauses occur. Also, try to tell how frequent are these pauses and
much time each pause takes.
If you see that every time you experience a pause there's GC running,
we can then proceed to investigating whether the frequency of GC and
the time it takes on your system is something abnormal, and why.
I have set garbage-collection-messages to t.
Let me see what happens when the next apparent long guru meditation
happens.
Since the last message, I did not get a very long hung until this
morning. Maybe 5-10 second long GC only. Those I can tolerate.
I feel "what?" after maybe a five second hung and then realize something
is holding up emacs, and then the hung is over. I tell myself GC was
rather long, but that is it.
But this morning I get a rather long hung, or GC, twice.
Attaching gdb to the emacs process then after a 10 second or so of the
first hung, and noticed that sweep_cons() took three seconds to finish
after I entered "finish" command to gdb. That certainly felt long during
gdb session.
So that was an initial clue that something is wrong.
(Now I suspect some kind of memory area corruption. Before this happened
I loaded a several large log files each about 20MB, and when I tried to
erase a rather large *shell* buffer that contained the log from a large
compilation job, this garbage collection kicked in.)
Anyway, I checked and found gc-elapsed showed a number in 400-500
seconds range. I really wished I wrote down the exact number.
So my take on this is sweep_conses() takes a rather long time and not
letting emacs to take any input. (And there could be other functions
before it.)
The question is how long is "long time". We need some quantitative
measure of this before we can tell whether it's normal or not.
I agree. It is all relative.
I agree "long" is a relative term.
But after the somewhat long pause of maybe 20-30 seconds, I get another
really long pause which prompted me to write this mail.
The second long hung this morning was maybe longer than 8 minutes give
or take a minute or so.
I stopped emacs a few times after the initial five minute or so of the
long hung to check where it spends time occasionally using gdb. But it
definitely is close to 8 minutes. I would have killed emacs and start
over if I had not learned about gc-elapsed and other internal variables.
After GC finally finished and control returned to keyboard, I checked
gc-elapsed.
It now shows:
gc-elapsed is a variable defined in ‘C source code’.
Its value is 1067.998007492
It matches my observation that gc took about 8 minutes give or take. (Of
course, there were a few times I interrupted the operation from
Control-C from GDB and even captured a screen shot or two. But that was
less than two minutes in total.)
Question.: Has there been any change in emacs's alloc.c code in the
last year or two?
Yes, a lot. But I don't see any significant change in frequency or
time duration of GC on my system due to those changes, FWIW.
Maybe I should take a look when my time permits.
I am running emacs under Debian GNU/linux inside VirtualBox which is
assigned 16GB memory.
It is possible that the virtual machine's memory management is
misconfigured, so it slows down Emacs when it needs to access a large
portion of its address space (which happens during GC).
I have not touched any thing (famous last words) for my linux
configuration for the last few years aside from simply upgrading the
packages from time to time.
gc-elapsed is a variable defined in ‘C source code’.
Its value is 68.3795433
The question is how many GC cycles took 68 seconds. What is the value
of gcs-done? Also, is this in a session to which you attached GDB? if
so, then some of the elapsed time was due to Emacs being stopped under
the debugger. We need these values in a session that is never stopped
by a debugger.
I will monitor gcs-done when the next guru meditation happens.
Is there a hook to garbage-collection so that I can print gcs-done and
gc-elapsed maybe after the garbage-collection is run?
I failed to check gcs-done. It shows 1032 after the fact.
But given that I monitored gc using gdb and it took almost 8 minutes
before successive "finish" commands to gdb (after sweep_floats(), etc.)
to come up to the interactive functions, I am quite sure that one
GC-related entry point used approximately 8 min to finish.
And I think that entry is maybe_gc() or maybe_garbage_collect().
(That is why I suspected some type of memory corruption software-wise.
My PC has ECC enabled and so it is rather unlikely memory got corrupted
spontaneously.)
PS: Given that this may have something to do with the paging activity
of the emacs process, the linux version change (from 4 to 5) may have
affected the paging behavior of the program, but it is hard to say. I
now see this linux instance is using 293M SWAP. Compiling mozilla
thunderbird source runs |make -j6| and g++-9 compiles CPP source files
which are very large and so they exert heavy memory pressure. A bad
memory access pattern of emacs, not friendly to locality of access,
would add to page faults. Except, |xosview| program does not show any
paging activity. I am not sure if |xosview| is correct or not.
Building a large program with -j6 also uses CPU, and that could make
CPU-intensive operations, such as GC, slow. Was that build running in
the same VirtualBox as Emacs? If so, how many physical CPU execution
units do you have allocated to VirtualBox?
Yes, the build was running in the same VirtualBox as Emacs.
I have assigned 7 cores to VirtualBox.
(Oh wait!!! My current VirtualBox says it is assigned only 6 cores!?
I thought I did assign 7 cores to this VirtualBox image.
My PC runs Windows 10 natively on AMD Ryzen 7 Eight-Core processor
that has additional 8 hyperthreads.
So there are 16 virtual cores.
I have no idea why only 6 core is assigned to VM image today.
The number has been tweaked when an earlier VBox had an unresolved
issue when AMD Ryzen CPU came out.
[I used to run this image on XEON CPU.]
Maybe this could explain the problem on my HOME PC.However, the same
guru meditation has been observed on office PC with a different Intel
CPU, too.
Again, Emacs is run inside virtualBox on that PC that runs Windows 10
professional natively.))
However, the important point is this. I have not noticed this strange
guru meditation until about 12 months ago or so.
That is why I suspect that there could be an issue with the recent
alloc.c or something, but I think I need to gather more data
to focus in it yet.
Stay tuned.
Thank you again for your help. I will try to gather more
circumstantial evidence so that we can make an educated guess why
Emacs seems to enter guru meditation on my PC.
Chiaki
Since last message, I have re-assigned 7 CPUs to my virtual box image.
At the time of long hung this morning, I have no largish build or
anything. I was only editing a message file or something.
I wanted to clear *shell* buffer with "erase-buffer" when this series of
long hungs hit emacs.
I am attaching two screen dumps.
sweep-conses shows that emacs is spending time in sweep-conses after I
typed "finish" command to gdb.
Emacs process in the background did not respond to X repaint event and
thus only a portion of its screen, most notably echo area is partially
shown.
The window on the right is xosview system resource monitor program.
That shows CPP 2, 3, and 4 are used 100% successively (and that is emacs).
The durration of time covered by the horizontal length of the montor
window is about 15 seconds.
Emacs spent time almost like 8 minutes (I need to subtract a minute or
two for gdb interaction)
like this. Just churning CPU and nothing else.
memory-consolidate-1.PNG is interesting in that this was the first time
I saw something two level below sweep_conses().
A routine malloc_consolidate() is visible.
Maybe the memory consolidation may take time.
A repeated question, which I think has No answer is as follows.:
Is there a hook that gets executed just before GC happens?
Now I think the answer is "No" because GC is invoked when the available
conses are few and we don't want to run arbitrary lisp program whch may
lead to another GC invocation.
Short of such feature, is there a post hook that gets executed AFTER GC
finishes (well in the stacktrace I see maybe_gc() called from call2(),
and so
the hook may have to be placedon maybe_gc() routine if possible.
I am asking the questions about the hook(s) since I want to record some
statistics about memory usage, etc. automatically
before/after GC so that we can have a scientific/measurable data to
discuss what is going on.
If we can have a set of such pre/post data for the last few dozen GC
invocations, we may be able to home into the cause of the program.
Short of lisp function, maybe I can add message print out in alloc.c to
print out such information so that it gets recorded in *Message* buffer.
It is done only when gabage-collection-message is t and there should be
some throttling mechanism to avoid spamming the user.
I now feel both alloc.c and malloc.c may be to blame, most probably in
terms of locality of references that may interfere badly with CPU cache
and linux memory system. However, do not there does not seem to be much
paging activity AND more than 4 GB real memory is more or less
available still. That is visible in xosview's screen in
sweep-conses.png. That lead to me wonder if there is algorithmic slow
down due to non-linear execution time of memory size? But again, this is
hard to believe in this age of home users having GB of memory. Such
algorithmic lapse can't remain undetected for so long.
Hmm, that makes me to throw the net on linux AND virtualbox which I use
to run linux on my home Windows PC.
I have 32 GB of memory on my home PC and assign 16 GB of memory to
VirtualBox image.
Given that I still have 2 GB of free memory on Window side right now (I
just checked with system performance monitor),
all the memory references during GC ought to go into the real memory
without paging on the Window side even, .i.e., the virtual memory is
mapped straight to the real memory on the PC.
OR, do I feel lucky and compile emacs with GCC-9's -O3 flag and see if
it speeds up emacs?
To be honest, I always prefer correctness rather than speed, but 6
minutes GC is not something I can put up with in an interactive program.
Come to think of it if Emacs is going multi-threaded (I mean Emacs-Lisp
interpreter) then low-level GC can become incremental opportunistic
non-blocking GC?
Any pointer re the hook(s) short of rewriting alloc.c is appreciated.
Well, if worst comes to worst, I don't mind rewriting alloc.c to create
an array of fixed size to store the last dozen or so GC-related
information that is updated before and after invocation of a certain GC
entry point.
I can print that out after a long GC to learn the memory situation.
But that sounds a bit drastic. OTOH, it may be the only way to go.
But the last time I worked on alloc.c to debug it to run on non-UNIX
platform was close to 30 years ago. :-)
Thank you in advance for your attention on this matter.
Chiaki
malloc-consolidate-1.PNG
Description: PNG image
sweep-conses.png
Description: PNG image
- bug#39413: 26.2; Emacs gets hung,
chiaki-ishikawa-thunderbird-account <=