Re: [Gluster-devel] Performance Translators' Stability and Usefulness


From: Geoff Kassel
Subject: Re: [Gluster-devel] Performance Translators' Stability and Usefulness
Date: Sun, 5 Jul 2009 02:07:53 +1000
User-agent: KMail/1.9.9

Hi Gordan,

> > When will the Gluster team be able to deliver a stable, mature, and
> > reliable version of GlusterFS?
>
> While I can relate to that sentiment to some extent, I think you're a
> bit overly harsh. Stability has, in my experience, improved quite a lot
> recently.

I'm sorry if I come off frustrated. (It's the result of cleaning up after yet 
another daily GlusterFS crash.) GlusterFS has so much promise - yet I've been 
using it for over two years now, and little has changed for me in terms of 
core stability or reliability.

(If migrating to another solution wouldn't cause considerable, 
business-destroying downtime for my client base, I would have done so quite 
some time ago.)

All I see instead is this constant drive towards new features, with little to 
no sign that functionality which should be complete by now actually is.

Like the return of data integrity bugs in versions as recent as 2.0.1. (As 
reported in the 'Files being cut off in the middle' thread.)

AFR was an early feature of 1.x - it should be rock solid by now, yet we're 
still seeing big bugs in it in release-grade software.

AFR is *the* key feature of GlusterFS in my mind - and the only point (I feel) 
for using it. Yet it's still this unstable after two plus years of 
development?

> > I have been using GlusterFS since the v1.3.x days, and I have yet to see
> > a version since then that doesn't crash at least once a day from just
> > load on even the simplest configurations.
>
> I wouldn't say daily, but occasionally, I have seen lock-ups recently
> during multiple glusterfs resyncs (separate volumes) on the new/target
> machine. I have only seen it once, however, forcefully killing the
> processes fixed it and it didn't re-occur. I have a suspicion that this
> was related to the mounting order. I have seen weirdness happen when
> changing the server order cluster-wide, and when servers rejoin the
> cluster.

Well, I see one to two crashes nightly, when I rotate logs or perform backups 
that are stored on the GlusterFS exported drive. (It's hit and miss which 
processes run to completion on the first go before the crash, which should 
never be an issue with a reliable storage medium.)

The only common factor identifiable is higher-than-average I/O load.

I don't run any performance translators, because they make the situation much 
worse. It's just a straight AFR/posix-locks/dataspace/namespace setup, as 
I've posted quite a few times before.

I've had to institute server scripting to restart GlusterFS and any processes 
that touch replicated files (i.e. nearly everything running on my servers) 
because of these crashes, to try to minimise the downtime to my clients.

(Hence my previous requests for features to allow crashes of GlusterFS servers 
to happen more gracefully for files mid-access. Non-stop I/O, I believe it 
used to be called.)

> Yes, that was bad, 2.0.2 is pretty good. Sure, there is still that
> annoying settle-time bug that consistently fails the first attempt to
> access the file system immediately after mounting (the time gap is
> pretty tight, but if you script it, it is 100% reproducible). But other
> than that I'm finding that all the other issues I had with it have been
> resolved.

After two major data integrity bugs in two major releases in a row, I'm taking 
very much a wait-and-see attitude with any and all GlusterFS releases.

> What exactly do you mean by "regression test"? Regression testing means
> putting in a test case to check for all the bugs that were previously
> discovered and fixed to make sure a further change doesn't re-introduce
> the bug. I haven't seen the test suite, so have no reason to doubt that
> there is regression testing being carried out for each release. Perhaps
> the developers can clarify the situation on the testing?

I meant it in the same sense that you do. I have not seen any framework - 
automated or otherwise - in the repository or release files to run through 
tests for previous and/or foreseeable bugs and corner cases.

A test to compare cryptographic hashes of files before, after, and during 
storage/transfer between GlusterFS clients and backends should surely exist 
if there's any half-serious attempt at regression testing going on.

(As a developer, I would institute this test on general principles from the 
start, since the whole purpose of this system is reliable data storage.)
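
To illustrate the kind of check I mean, here is a minimal sketch in Python. 
It is only an illustration under stated assumptions, not anything taken from 
the GlusterFS repository: the function names sha256_of and check_round_trip 
are made up for this example, and mount_dir is assumed to be a directory on 
a GlusterFS client mount. The demo at the bottom uses a temporary directory 
as a stand-in for that mount.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_round_trip(source, mount_dir):
    """Copy source into mount_dir and compare hashes before and after.

    mount_dir is an assumption: a directory on the file system under test,
    e.g. a GlusterFS client mount. Returns True if the two hashes match.
    """
    source = Path(source)
    before = sha256_of(source)
    stored = Path(mount_dir) / source.name
    shutil.copyfile(source, stored)
    after = sha256_of(stored)
    return before == after


if __name__ == "__main__":
    # Stand-in demo: a temporary directory plays the role of the mount point.
    with tempfile.TemporaryDirectory() as src_dir, \
            tempfile.TemporaryDirectory() as mnt:
        sample = Path(src_dir) / "sample.bin"
        sample.write_bytes(b"x" * (3 * 1024 * 1024 + 17))
        print("match" if check_round_trip(sample, mnt) else "MISMATCH")
```

A fuller test would presumably also hash the copy in each backend's export 
directory, and read the file back from a second client so that the result 
isn't just served from the first client's cache. Catching corruption 
"during" transfer would need the same comparison repeated under concurrent 
load.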

Surely, though, if tests like these existed and were being used, after the 
debacle with 2.0.0, they would have picked up at least the issue reported in 
2.0.1 before release?

That leads me to ask - where are the unit tests that are meant to exist, 
according to http://www.gluster.org/docs/index.php/GlusterFS_QA? If they 
exist, why (apparently) are tests like these still not part of them?

It's nearly a year after the QA process document was written - and months 
after the first big corruption bug. And still no sign of a testing framework. 
It really makes me wonder how accurately the QA process is documented on this 
Wiki page.

We need a lot more transparency in the QA process.

Publishing the test framework in use - as I seem to recall asking before - 
would be a good start.

Geoff.

On Sat, 4 Jul 2009, Gordan Bobic wrote:
> Geoff Kassel wrote:
> >>> Finally - which translators are deemed stable (no known issues -
> >>> memory leaks/bloat, crashes, corruption, etc.)?
> >>
> >> We can definitely vouch for a higher degree of stability of the
> >> releases. Otherwise, I don't think there is any performance translator we
> >> can call completely stable/mature because of the roadmap we have for
> >> constantly upgrading algorithms, functionality, etc.
> >
> > When will the Gluster team be able to deliver a stable, mature, and
> > reliable version of GlusterFS?
>
> While I can relate to that sentiment to some extent, I think you're a
> bit overly harsh. Stability has, in my experience, improved quite a lot
> recently.
>
> > I have been using GlusterFS since the v1.3.x days, and I have yet to see
> > a version since then that doesn't crash at least once a day from just
> > load on even the simplest configurations.
>
> I wouldn't say daily, but occasionally, I have seen lock-ups recently
> during multiple glusterfs resyncs (separate volumes) on the new/target
> machine. I have only seen it once, however, forcefully killing the
> processes fixed it and it didn't re-occur. I have a suspicion that this
> was related to the mounting order. I have seen weirdness happen when
> changing the server order cluster-wide, and when servers rejoin the
> cluster.
>
> > Then there's the data corruption bug of the early 2.0.0 releases, which
> > has kept me (and no doubt others) from upgrading to these releases.
>
> Yes, that was bad, 2.0.2 is pretty good. Sure, there is still that
> annoying settle-time bug that consistently fails the first attempt to
> access the file system immediately after mounting (the time gap is
> pretty tight, but if you script it, it is 100% reproducible). But other
> than that I'm finding that all the other issues I had with it have been
> resolved.
>
> > I have read about the Gluster QA team, but quite frankly, I have yet to
> > see the fruits of this team's work. Letting through a bug of that
> > magnitude in a major release blew a lot of trust I had in the Gluster
> > team's QA process.
> >
> > When will regression tests be used? It's been months now since this bug,
> > and still I don't see any sign of the use of this simple,
> > industry-standard technique to minimise the risk of such issues slipping
> > through again.
>
> What exactly do you mean by "regression test"? Regression testing means
> putting in a test case to check for all the bugs that were previously
> discovered and fixed to make sure a further change doesn't re-introduce
> the bug. I haven't seen the test suite, so have no reason to doubt that
> there is regression testing being carried out for each release. Perhaps
> the developers can clarify the situation on the testing?
>
> Personally, I think of much benefit to testing would be having syslog
> support so that when using glusterfs as the root file system the logs
> can be acquired/redirected for troubleshooting. This is currently not
> possible since at the point where glusterfs starts up there is no
> permanent root file system that logs can be written to.
>
> Gordan



