From: Fred Hucht
Subject: Re: [Gluster-devel] GlusterFS hangs/fails: Transport endpoint is not connected
Date: Tue, 25 Nov 2008 17:40:15 +0100


Hi,

I installed and ran mcelog and found only DIMM errors on one node, which are not related to our problem.

I don't think that the problems are related to only a few nodes. The problems occur on the nodes where the test job runs, and the queue scheduler always selects different nodes.

Do you think kernel 2.6.25.16 is unstable with respect to GlusterFS?

I am quite sure that it is not a network issue.

We use a local XFS file system on all nodes and a unify volume with the NUFA scheduler, as described in my first mail. The client config for one node is:

volume scns
  type protocol/client
  option transport-type tcp/client
  option remote-host 127.0.0.1
  option remote-subvolume scns
end-volume

volume sc0
  type protocol/client
  option transport-type tcp/client
  option remote-host 127.0.0.1
  option remote-subvolume sc0
end-volume

[same for sc1 - sc87 ]

volume scratch
  type cluster/unify
  subvolumes sc0 sc1 sc2 sc3 sc4 sc5 sc6 sc7 sc8 sc9 sc10 sc11 sc12 sc13 sc14 sc15 sc16 sc17 sc18 sc19 sc20 sc21 sc22 sc23 sc24 sc25 sc26 sc27 sc28 sc29 sc30 sc31 sc32 sc33 sc34 sc35 sc36 sc37 sc38 sc39 sc40 sc41 sc42 sc43 sc44 sc45 sc46 sc47 sc48 sc49 sc50 sc51 sc52 sc53 sc54 sc55 sc56 sc57 sc58 sc59 sc60 sc61 sc62 sc63 sc64 sc65 sc67 sc68 sc69 sc70 sc71 sc72 sc73 sc74 sc75 sc76 sc77 sc78 sc79 sc80 sc81 sc82 sc83 sc84 sc85 sc86 sc87
  option namespace scns
  option scheduler nufa
  option nufa.limits.min-free-disk 15   # skip bricks with less than 15% free space
  option nufa.refresh-interval 10       # re-check brick free space every 10 seconds
  option nufa.local-volume-name sc0     # prefer the local brick for newly created files
end-volume

volume scratch-io-threads
  type performance/io-threads
  option thread-count 4
  subvolumes scratch
end-volume

volume scratch-write-behind
  type performance/write-behind
  option aggregate-size 128kB
  option flush-behind off
  subvolumes scratch-io-threads
end-volume

volume scratch-read-ahead
  type performance/read-ahead
  option page-size 128kB # unit in bytes
  option page-count 2    # cache per file  = (page-count x page-size)
  subvolumes scratch-write-behind
end-volume

volume scratch-io-cache
  type performance/io-cache
  option cache-size 64MB
  option page-size 512kB
  subvolumes scratch-read-ahead
end-volume
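
For reference, the matching server-side volfile on such a node looks roughly like the sketch below, exporting the local XFS data brick (sc0) and the namespace brick (scns) via storage/posix and protocol/server; the brick paths and auth rules here are placeholders, not our exact settings:

volume sc0
  type storage/posix
  option directory /export/scratch        # placeholder path for the local XFS data brick
end-volume

volume scns
  type storage/posix
  option directory /export/scratch-ns     # placeholder path for the namespace brick
end-volume

volume server
  type protocol/server
  option transport-type tcp/server
  subvolumes sc0 scns
  option auth.addr.sc0.allow *            # spelled auth.ip.*.allow on 1.3.x releases
  option auth.addr.scns.allow *
end-volume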

Regards,

     Fred

On 25.11.2008, at 15:27, Joe Landman wrote:
Fred Hucht wrote:
Hi,
Crawling through all /var/log/messages, I found on one of the failing nodes (node68)

Does your setup use local disk? Is it possible that the backing store is failing?

If you run

        mcelog > /tmp/mce.log 2>&1

on the failing node, do you get any output in /tmp/mce.log ?

My current thoughts, in no particular order:

Hardware based (failures would be concentrated on a few specific nodes and repeatable only there):

a) Failing local hard drive: a failing backing store *could* impact the file system; you would see this as NFS working on a remote FS while a file system that partly stores data locally fails. (See the diagnostic sketch after this list.)

b) Network issue: possibly a bad driver, a flaky port, or an overloaded switch backplane. This is IMO less likely, as NFS works. Could you post the output of "ifconfig" so we can look for error indicators in the port state? (Also covered in the sketch after this list.)

Software based:

c) FUSE bugs: I have run into a few in the past, and they have caused errors like this. But umount/mount rarely fixes a hung FUSE process, so this is, again, IMO less likely.

d) GlusterFS bugs: I think the developers would recognize it if it were one. I doubt this at the moment.

e) Kernel bug: We are using 2.6.27.5 right now and are about to update to .7 due to some CERT advisories. We have had stability issues with kernels from 2.6.24 to 2.6.26.x (low x) under intense loads. It wouldn't surprise me if what you are observing is actually just a symptom of a real problem somewhere else in the kernel. That the state gets resolved when you umount/mount suggests this could be the case.
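
To make checking (a) and (b) concrete, something along the following lines on one of the failing nodes would help; the device and interface names are only examples, and smartctl (smartmontools) and ethtool may need to be installed:

        # backing store: scan the kernel log for disk/XFS errors and check SMART health
        dmesg | grep -iE 'xfs|ata|i/o error'
        smartctl -H -A /dev/sda      # /dev/sda stands in for the local data disk

        # network: look at the error/drop counters on the data interface
        ifconfig eth0 | grep -E 'errors|dropped|overruns'
        ethtool -S eth0 | grep -iE 'err|drop'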

Joe




--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics LLC,
email: address@hidden
web  : http://www.scalableinformatics.com
      http://jackrabbit.scalableinformatics.com
phone: +1 734 786 8423 x121
fax  : +1 866 888 3112
cell : +1 734 612 4615

Dr. Fred Hucht <address@hidden>
Institute for Theoretical Physics
University of Duisburg-Essen, 47048 Duisburg, Germany




