Re: Interface for SCSI transactions ?

From: Thomas Schmitt
Subject: Re: Interface for SCSI transactions ?
Date: Sun, 02 Oct 2011 13:44:56 +0200


Olaf Buddenhagen wrote:
> I could try to reach a consensus with Samuel, and present it to you as a
> fixed decision 

In the end it will happen that way.

After all, it is you who have the overall interest in the Hurd, whereas
i am mainly striving to get GNU xorriso to work on GNU/Hurd with
capabilities similar to those on GNU/Linux.

So i can only defend the core functionality of my plan, but hardly
its implementation details.

(After spending some hours on this mail:)
I still think that the time is ripe to decide my two questions.
- Really implement a new method of struct device_emulation_ops ?
- Transmit parameters as structs
  (rather than fan out all parameters to the RFC definition) ?

The argumentation below is mainly about pros and cons.
Especially about my reluctance to combine a method of device_emulation_ops
with fanned-out RPC parameters.

The other three combinations would meet no objections from me.
(yes,yes) and (no,no) are my personal favorites.


> I'm sure we could reach a conclusion much quicker in a
> realtime discussion on IRC...

This would be a very asynchronous discussion. My Hurd knowledge is
so low that i often need a few hours of research and pondering
before i can make a statement on a particular topic.

We should collect arguments, and then you make a decision that can
fulfill the needs of communicating with a CD drive, without spoiling
Hurd's overall appearance.

> > 1: Shall the new call be implemented on the kernel side as a new
> > method of struct device_emulation_ops ?

> That's what I was assuming all the way -- regardless whether we go with
> a SG-specific call, or something more generic... More on this later.

I really doubt that a general method in struct device_emulation_ops
would go well with a highly specialized list of RPC parameters.
(No problem for me ... but what about overall Hurd architecture ?)

> > 2: Shall the parameters be bundled in serialized structs as far as
> > technically feasible ? Alternative: Each primitive data field shall be
> > transferred as a separate RPC parameter.

> Actually, we discussed three options here: a) explicit RPC parameters
> for everything; b) transfer all fixed size fields in a single struct
> represented as a flat byte array, and all variable-size fields as
> separate parameters; or c) do custom marshalling (RPC-over-RPC).

I dropped (c) for expected performance problems.
Embedding the fat payload buffer into a serialized array would cause
the implementation to copy all payload data one time more than
absolutely necessary. In sum we have to expect 700 MB, 4.4 GB
and 25 GB, depending on the type of optical media. So one more extra
copy in memory does matter for system load and data throughput.
(This was the conclusion in my mail of 23 Sep 2011 22:33:58 +0200)

(b) in the form "all fixed size fields in a single struct
represented as a flat byte array" still appears un-RPC-ish to me.
But apart from that, (b) is the prototype which i personally would
prefer if the call shall become a method of struct device_emulation_ops.

In my eyes, (a) makes little sense if the function shall be a method of
struct device_emulation_ops.
It would use lots of parameters which will hardly make sense for
devices other than CD drives or certain SCSI devices.
But struct device_emulation_ops covers a lot more device types.
So for me, (a) implies not to touch struct device_emulation_ops.

> > Yes on 2: New struct members could be added in a binary compatible
> > way.

> That's actually not true. 
> [... ] you would have to
> manually match version numbers and do case-specific handling in your
> custom RPC-over-RPC handling code.

Believe me that i have some experience with keeping ABIs stable
while not hampering their evolution. It would work with (b) and (c)
(as well as with my idea of (b) being capable of mixed processing).
Adding a new member to the end of a struct is easy. One only needs a
version indicator among the members of the first version.

> I see no benefit in this...

Although i have examined the structs of 4 SCSI transaction interfaces
(Linux has 2 of them), i am not absolutely sure to have covered all
future needs of such a call.
An opportunity to correct shortcomings would soothe my nerves.

> > But this [(a)] will cause a giant prototype. Very ugly to its users.
> I wouldn't consider it ugly...

Indeed, we could hide it by using structs in userspace.
On the kernel side, i will have to re-pack parameters anyway, to fit
them into the structs of the kernel calls.

> (On the other hand it's still much simpler than implementing custom RPC
> marshalling... :-) )

I have to agree. It would be more straightforward than serializers.
(Does MIG tolerate 20+ parameters ?)

> > My current sketch has 9 members in the input struct and 18 in the
> > output one.

> Well, I supported the idea of using a future-proof structure under the
> assumption that we can simply reuse the one from Linux

It is an augmented form of Linux struct sg_io_hdr in <scsi/sg.h>.
Split into two structs to avoid unnecessary transmission of parameters
which actually need to be transported only one way. Among them is the
fat payload buffer, which for my use case is always one-way.

The only semantic addition is to have result parameters for the
autosense feature. Solaris and FreeBSD have such info, Linux does not.

In userspace, the structs could look like this:


/* Input parameters and data for the transaction */
struct device_transact_scsi_pkt_in {

  int version;                /* Describes the layout of this struct.
                                 Submit 0. */

  int dxfer_direction;        /* Data transfer direction:
                                 _NONE, _TO, or _FROM */

  unsigned char cmd_len;      /* SCSI command length ( <= 16 bytes).
                                 (Might be overridden by SCSI knowledge of
                                 the kernel: ATAPI cannot transfer commands
                                 > 12 bytes.) */
  unsigned char cmdp[16];     /* The array with the SCSI command bytes
                                 to perform */

  unsigned int dxfer_len;     /* Intended byte count of data transfer.
                                 I.e. number of bytes to write or to read. */

  void * write_buffer;        /* Points to caller provided bytes which shall
                                 be written to the drive.
                                 May be NULL with dxfer_direction _NONE or
                                 _FROM. */
  unsigned char mx_sb_len;    /* Tells size of
                                 device_transact_native_out.sbp . */
  unsigned int timeout;       /* Time limit for command execution (millisec) */

  unsigned int flags;         /* unused yet, submit 0 */
};



/* Result and data output of the transaction */
struct device_transact_scsi_pkt_out {

  int version;                /* Describes the layout of this struct.
                                 The members described here are present since
                                 version 0. */

  unsigned int read_count;    /* Actual count of read bytes. */

  void * read_buffer;         /* Points to caller provided memory which shall
                                 receive the bytes read from the drive.
                                 The size of this buffer is announced by
                                 device_transact_native_in.dxfer_len .
                                 May be NULL with dxfer_direction _NONE
                                 or _TO. */

  unsigned char sb_len_wr;    /* Byte count actually available in sbp */

  /* NOTE: This is an embedded array to allow for answer (x,yes).
           If fanned out parameters are chosen, then this can be
               unsigned char * sbp;
           like in Linux and can be transmitted as variable length array. */
  unsigned char sbp[264];     /* Replies the sense data which indicate errors
                                 or noteworthy drive conditions.
                                 One could reduce the size of this field to 14
                                 because one is mainly interested in the
                                 triplet of KEY, ASC, ASCQ which characterizes
                                 SCSI errors.
                                 But MMC-6 allows up to 255+7 bytes in fixed
                                 format (0x70, 0x71) and guarantees that only
                                 these fixed formats are emitted by the
                                 drive. */

  unsigned char status;       /* SCSI status
                                 >>> still to be explored, unused by libburn
                                 >>> growisofs uses it as fallback error
                                 >>> indicator if KERNEL_BROKEN */

  unsigned char masked_status; /* Shifted, masked scsi status
                                 >>> still to be explored, unused by libburn
                                 >>> growisofs reads it to learn about
                                 >>> noteworthy sense data:
                                 >>> if (sg_io.masked_status&CHECK_CONDITION)
                               */

  unsigned int host_status;   /* errors from host adapter
                                 >>> indicate transport errors if not 0 */

  unsigned int driver_status; /* errors from software driver
                                 >>> indicate transport errors if not 0 */

  int resid;                  /* Number of bytes which were not transmitted
                                 despite dxfer_len value of _in struct.
                                 dxfer_len - actual_transferred */

  /* The following members rqs_* give status information about a
     REQUEST SENSE command that was emitted after the actual command
     failed. */
  unsigned char rqs_valid;    /* 1 if information is available about a
                                 REQUEST SENSE command, which fetched sbp.
                                 0 if not. */
  unsigned char rqs_status;   /* Like status, but for the REQUEST SENSE command
                                 which fetched sbp.
                                 Only valid if .rqs_valid . */
  unsigned char rqs_masked_status;      /* see above .masked_status */
  unsigned int rqs_host_status;         /* see above .host_status */
  unsigned int rqs_driver_status;       /* see above .driver_status */
  int rqs_resid;              /* Number of discarded sense bytes.
                                 Only valid if .rqs_valid . */

  unsigned int duration;      /* Time taken by cmd (unit: millisec)
                                 (0xffffffff = time measurement not valid) */

  unsigned int info;          /* Auxiliary information.
                                 unused yet, do not interpret */
};



>  ABI stability is totally irrelevant here: there is no way
> on earth the client code will work without change -- and thus without a
> recompile...

My personal experience with maintaining dynamic libraries differs from
your statement.
It is possible to make API/ABI changes in a way that old clients can
still use the new ABI without recompiling. (And of course they can
compile with the new API without changes in their own source code.)

See for example struct IsoStream_Iface in <libisofs/libisofs.h> which
is now at its 5th version and still works for clients which only
expect version 0.
Of course these clients cannot derive IsoStream classes which implement
the newer features, e.g. filter functionality. But derived classes
of old versions still work with the modern library revisions.
(IsoStream is an abstraction of data file content. Custom IsoStream
 implementations allow the client to inject arbitrary content into
 regular files in the ISO 9660 image.)

> With that in mind, I now think that indeed we should go with an RPC
> definition having only the minimal parameters necessary to expose the
> capabilities of the current driver; and simply introduce a new RPC once
> we actually have a better driver.

It would be less work for me that way.
But i think it is not a good idea to plan for incompatible re-implementation
already now, while we plan for a first implementation.

If ever, i would prefer to plan for compatible re-implementation.
I.e. to throw out unnecessary struct members now and to have a roadmap
for their compatible introduction, if they later turn out to be needed.

> Still, the generic device_transact() approach
> looks *very* ugly after seeing the complexity actually necessary here :-) )

I agree. If it is a generic device_transact() call, then it should
have generic parameters. (Or may i say: "must have" ?)

If we decide for specific parameters, e.g. by exposing them to the RPC
definition, then we can hardly give sense to a generic call.

> > I do not understand yet the "RPC-over-RPC" aspect, resp. why it is so
> > bad.

> That's like loading your car onto a truck and driving the truck,

It is more like the tunnel between France and Britain.
You put your car into a railroad wagon because its own wheels do not
fit the track.

It would be a Hurd equivalent of Unix ioctl(), with nearly all of its
good and bad consequences.

> Well, I must admit that I'm not a big fan of OOP... :-) 

Aggregation and encapsulation are indispensable for any large
software project. (Aggregation stems from old Structured Programming.)
I agree that one can easily get lost in the woods if one overdoes
inheritance and polymorphism.

> But anyways, I'm pretty sure that trying to force vastly different
> method invocations through a fixed common prototype, would be considered
> very ugly in *any* object-oriented design.

It is the idea behind two of the more questionable patterns in OOP:
put different kinds of things into the same black box and get back
the appropriate results by miracle.

The motivation for these potentially messy patterns is re-usability,
which was the big promise of OOP (not always kept, though).

My motivation for proposing a generic RPC is the same. It shall
be reusable with other device classes.
(But i myself am ok with a call that only applies to my device class.)

> Polymorphism doesn't come in here at all BTW.

My eyes can clearly see it in the form of function_code, which selects
the actual implementation class, and of generic parameter structs,
which become meaningful only within a specific implementation class.

Gnumach's struct device_emulation_ops defines an OO interface.
Such interfaces are a way to express polymorphism.

> I don't see any reason why the specific SG transaction call can't be
> hooked into device_emulation_ops() just as well. There is precedence for
> having a device-specific call in the generic device interface:
> device_set_filter() is used only for network interfaces.

Why use the method table at all in this case ?
The call could just jump into the appropriate kernel code and perform
the network specific operation there.

I understand struct device_emulation_ops as an interface definition,
which gets implemented by various device classes.
It makes little sense to burden this interface with a method that can
only be implemented for a single device class, because its parameters
are highly specific.

> Is the sense information really something that needs to be communicated
> between these two parts, rather than being abstracted by the hardware
> access layer? 

Yes it is.
libburn does the job of a driver for SCSI/ATAPI MMC devices, because
the CDROM driver of Linux does not refer to the write capabilities of
optical drives.
For that, libburn must analyse in detail the error replies of the MMC
drive. Especially it has to decide whether the drive's refusal is an
error or just a peculiar drive state (of which there are many).
Simple (errno == EIO) won't help.

> But apparently the Linux developers didn't think it's helpful to expose
> this... Would be good to investigate the rationale behind this: trying
> to find the relevant discussion(s) on LKML for example.

<linux/cdrom.h> has a struct request_sense which is part of
struct packet_command and struct cdrom_generic_command.
It is exposed to userspace e.g. after call ioctl(CDROM_SEND_PACKET).
See in dvd+rw-tools file transport.hxx the case for Linux < 2.5.43.

Nevertheless Jens Axboe since 2003 recommended to use ioctl(SG_IO)
instead, which in its struct has
  unsigned char * sbp;        /* [i], [*o] points to sense_buffer memory */
  unsigned char sb_len_wr;    /* [o] byte count actually written to sbp */

What Linux does not expose, are some driver status results of the
REQUEST SENSE command that is issued automatically after the
command from userspace has failed.
FreeBSD and Solaris expose this to some degree.

> You might not be aware that the whole *idea* behind the original Mach
> research was allowing transparent network IPC between processes running
> on machines with possibly different architecture.

This is what gives me remorse about the proposal to transmit structs
without explicit serialization. This will only work if the same source
definition of a struct leads to the same memory layout on both sides of
the RPC.

> I for my part consider the network transparency a legacy I'd prefer to
> get rid of (just like the actual network IPC code we already purged), as
> I believe it's responsible for a substantial part of the complexity of
> Mach RPC. (And possibly other design shortcomings as well.)

As mentioned in my last post: Cloud and number crunching graphics cards
are modern forms of mixed architecture.
So abandoning parts of the RPC abstraction seems to cut away a specific
future option of Hurd.

> MIG has hardly any abstraction

But it does not prevent proper conversion of primitive types.

> Another problem might be that to allow network-transparent IPC, MIG
> would have to guess the appropriate machine-independent types to safely
> hold the value of each field

Solved by SUN Microsystems in form of XDR and SUN RPC at least two years
before the text mig.ps was published. RFC 1014 (SUN XDR, 1987) gives
motivation why certain design decisions were made.
These competitors are the main reason why MIG disappoints me.

I could not use SUN RPC back then, because it was not sufficiently
implemented on Apollo Domain/ix. I finally created my own, quite
restricted, RPC protocol based on sockets with TCP or UDP.
Nevertheless, i endorsed the motivations of XDR - especially its
unambiguous representation of data types at the cost of some
conversion needs on DEC VAX and Intergraph Clipper (few Intel based
workstations were around at that time).

> > nor for a neutral representation of data during transmission.

> What is a neutral representation?

A data representation on byte array level, which allows one to unambiguously
derive the local data representation on the level of multi-byte
data types (e.g. double, int, struct).

SUN XDR prescribes that the primitive types are to be represented
as on SUN MC68000 workstations.
This saves the effort to attach information about byte order, word
size or floating point formats (of which there were several back then).
Intel x86, VAX, Intergraph Clipper, and others had to convert
at the interface. Nevertheless SUN NFS was implemented with
sufficient performance on all these systems using SUN RPC and XDR.

> > Representing arbitrary C-pointer graphs is quite impossible for an RPC
> > system.

> Actually, if the relations of these pointers are known in advance, I
> don't see why they couldn't be expressed in the interface definition

Let's look at the example of Linux sg_io_hdr:
  unsigned int dxfer_len;     /* [i] byte count of data transfer */
  void * dxferp;              /* [i], [*io] points to data transfer memory */

This can only be represented in a serialized (byte array) form, if
the converter knows about the relation of both members. Only then
it can determine how many bytes have to be taken from .dxferp.

The problem is that a serializer has to know more than a simple
subroutine mechanism. It cannot follow the C habit of treating
arrays on the call stack as pointers.
RPC has no common address space on both sides and thus cannot
transfer data bytes by pointers.
So one always will have to transmit the target bytes of the pointer
and not only the pointer's value.

> I hope this is clear by now, but let me say it again just to be sure:
> use individual parameters for all the components, and it would work just
> fine. Would -- if the distributed systems actually existed ;-)

Yes. As soon as a decision is made for the specialized RPC call, it makes
a lot of sense to fan out the parameters in the RPC definition.

But again: i deem it inappropriate for a call of which the kernel side
is listed in struct device_emulation_ops.

> > The assumption of identical data representation on both sides of an
> > RPC

> Another thing that should be clear by now, but just to be sure: there is
> no such assumption in either Mach or MIG.

Yes. And therefore it is extremely questionable whether one
should introduce such an assumption for an RPC implementation.

Thus i think we either need to transmit lots of primitive types,
or else have explicit serialization. (To my view this decision depends
on the decision how specialized the new RPC shall be.)

> Mach *does* require the same sizes for the individual parameters on both
> sides;

Is this really specified that way ?

> > lets me rise the question why to use RPCs at all and not [...]
> > a shared address space.

> As for Mach, that was obviously not an option, considering its original
> purpose...

Yeah. But you are currently discussing away this original purpose.

> Well, "The Cloud" doesn't really tell anything, as everyone has a
> different idea of what it means...

That's why it still has an unlimited potential. :))

> > and using graphics boards as number crunchers.

> I never thought about using Mach IPC for such things... It might be
> actually possible.

I mentioned these examples to counter the assumption that nowadays
everything runs on homogeneous architectures.

> But there is nothing *transparent* about the communication between
> processes running on CPUs and GPUs;

I think that the _published_ architecture of Hurd would be
well suitable to employ GPU servers.
... but if the _implemented_ RPC infrastructure cannot support this ...

(One may call such considerations futile, given the fact that Gnumach
 cannot yet deal with the legacy SCSI controller emulated by qemu,
 not to speak of Nvidia graphics bricks.)


Have a nice day :)

