[DMCA-Activists] Groklaw: Did SCO Reveal Code?


From:	Seth Johnson
Subject:	[DMCA-Activists] Groklaw: Did SCO Reveal Code?
Date:	Fri, 21 Nov 2003 23:16:29 -0500

> http://www.groklaw.net/article.php?story=20031119041719640


Did SCO Really Reveal the Code to IBM, as Darl Claims? 
 
Thursday, November 20 2003 @ 01:17 AM EST  


You may have noticed that in the teleconference on Tuesday, SCO CEO Darl
McBride made the claim that they have shown the code to IBM in discovery and
that IBM knows exactly what code is in dispute. Specifically, he said this: 


" . . .by the way, we have shared the code in question there with IBM under
the litigation event. They know what we're talking about there." 


There is room for skepticism. While it is impossible to rule out that there
may have been code shown privately that is not in the public record yet, if
Darl was referring to the list of files it presented to IBM in discovery so
far in the record, I think we need to look at those lists of "infringing
files" more carefully. 

I noticed, as soon as their discovery list of files was released, coders
everywhere were falling over laughing or snorting in contempt. I'm not a
coder, so I asked some of our readers to explain why the lists strike them
as so pitiful. There were many replies, including some fine comments, but
they included and were based on code that went over my head and would not be
accessible to some of Groklaw's readers either. 

Most people in the world are not programmers, and it's a language we don't
know. So my request was for a translation into English, so we too could
grasp what they were noticing, exactly what SCO has on their lists, how they
likely arrived at the lists, and what it indicates as to how much SCO
actually provided during discovery, so we can understand why IBM filed a
motion to compel discovery after receiving the lists from SCO. 

The final result is mostly Frank Sorenson's work, but it incorporates
helpful input from other Groklaw readers, so it represents the work of a
group. I hope you enjoy looking at it from this fresh perspective. This case
is, after all, about code, so the rest of us can only gain insight by trying
to comprehend that part of the story. 

Because I am not a programmer, I appreciated Justin Rowles' explanation
about the utilities find and grep that Frank talks about in his article: 


"Unix provides highly flexible tools for searching directory trees and the
files they contain.  The two most common ones are called find and grep.  Use
of these tools is taught in 'Unix 101' type classes.  For example, if I
wanted to find all the files on my hard disk that started with 'apple' and
ended in 'pie', I could use the find tool to do so.  It would find files
called 'apple pie', 'apple and blackberry pie' and so on. 

"grep is a similar tool for looking at the contents of files.  It would be
used to look at files and find, for example, which ones contained the word
'custard'.  Usually it searches files in a single specified directory, but
it can also be used to search a list of files generated by another command,
like find.

"Both of these tools are highly flexible, and can be used together by a
competent Unix person to search their disks for highly specific things.  I
could use find to find files that are called 'apple something pie', but not
'apple and redcurrant pie' and then check all of those files with grep to
leave only those which also contain 'custard'.  I can do all this in one
instruction to the computer.

"In fact, in GNU/Linux, grep has been improved.  GNU grep contains the
ability to search directory  structures, so I can dispense with step one
above.  In SCO Unix, you can't do that, so you need to use find."
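
Justin's pie example can be tried out directly. What follows is only a
minimal sketch: the scratch directory, the file names, and their contents
are all invented for illustration, and it assumes a find and grep that
support the options shown (GNU versions do).

```shell
# Hypothetical demo directory; every file name and its contents are invented.
demo=$(mktemp -d)
printf 'serve with custard\n' > "$demo/apple pie"
printf 'serve with cream\n'   > "$demo/apple and blackberry pie"
printf 'serve with custard\n' > "$demo/apple and redcurrant pie"
printf 'custard recipe\n'     > "$demo/cherry pie"

# find picks names that start with "apple" and end in "pie", skipping
# anything with "redcurrant"; grep -l then keeps only the files whose
# contents mention "custard".
matches=$(find "$demo" -type f -name 'apple*pie' ! -name '*redcurrant*' \
            -exec grep -l 'custard' {} +)
echo "$matches"

# GNU grep can descend into the directory by itself with -r.
recursive=$(grep -rl 'custard' "$demo" | sort)
echo "$recursive"
```

The first command is the "one instruction" Justin describes: find selects
files by name and grep narrows them by content. The second shows the GNU
grep shortcut, which ignores file names and so also reports the redcurrant
and cherry files.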


Keep this explanation in mind, all you nonprogrammers, as we take a look now
at the file lists with Frank. The other thing you need to know to
understand what Frank describes is that a Caldera employee, Christoph
Hellwig, was a key Linux contributor, and he wasn't the only one. The
evidence strongly indicates that Caldera knew about the contributions at the
time they were being made. Old SCO also contributed code to Linux. I think
you will conclude, as
I did, that when Darl says that they "deep dived" and looked at the code
every which way, as he again claimed yesterday, he couldn't have been
describing the process used to come up with the lists they have provided to
IBM in the court case. They definitely didn't need spectral analysis, the
missing MIT mathematicians, or physicists to come up with such lists as
those they provided IBM and the court in their Supplemental Responses.
Google and a couple of simple utilities are sufficient. With that
introduction, here is Frank's article.


***************************************************** 


The SCO Group's List of "Infringing" Files -- How Might They Have Come Up
With This List? 


~by Frank Sorenson 


In IBM's Reply Memorandum in Support of their (First) Motion to Compel
Discovery (text here), IBM includes SCO's Supplemental Responses to IBM's
First Set of Interrogatories (text here) and tells the Judge that SCO is
still not answering their questions. One of the responses SCO provided was a
list of files that may or may not be infringing, according to SCO. Why might
IBM view the list as inadequate? To someone without a programming
background, it might be hard to know. 

A closer look by a computer programmer, with English translation for
nonprogrammers, may give a clearer picture of why SCO's responses were
neither "responsive nor identified with meaningful particularity", according
to IBM. It also reveals the likely method SCO used to draw up the list,
which bears on SCO's earlier claims that it had three groups of analysts,
including the MIT mathematicians, analyzing the code. 

SCO's response includes five lists from several categories: 

A list of "source code files identified by SCO thus far ... part of which
include information (including methods) that IBM was required to maintain as
confidential or proprietary...and/or which constitute trade secrets misused
by IBM..." It's a list of 115 files. 

A list of "source code files identified by SCO thus far...which
may...include information (including methods) that IBM was required to
maintain as confidential or proprietary...and/or which constitute trade
secrets misused by IBM..." It's a list of 591 files. 

A list of people at IBM that SCO claims to be aware of "in which part of the
confidential or proprietary and/or trade secrets [were] known or [have] been
disclosed." There are 5 lists of people whose names appear in the Linux code
base, adding up to about 74 people. 

A list of IBM copyrights. This is a list of 22 names. 

A list of people who "likely have knowledge, although their names do not
appear in the Linux code base." It's a list of 62 names. 


First, a little background on Linux/Unix utilities and tools, then we will
examine each of these lists, how they may have been created, and what (if
anything) they mean. We conclude with some general comments. 


Background 


There are a number of useful utilities in Linux/Unix. Because we will be
using some of them in our discussion, we'll briefly mention a few before
moving on: 

One utility is called grep, and it is a utility designed to search inside a
file (or files) for lines containing a certain pattern. In its simplest
form, it is usually used like this: 'grep string filename', but it also
accepts numerous flags (options) to allow it to perform various functions.
When calling grep as egrep, extended pattern matches are enabled. Here, we
will use grep to quickly find files containing strings that we are
interested in. 

Another commonly used utility is find, which is used to search a directory
for files having certain properties, such as a specific name or pattern.
Here, it will be used to locate files that we are interested in searching
the contents of. 

sort does just what it says; it sorts a list of strings. It can also be used
with the -u option (unique) to remove duplicate references. 

cat is used to type out the contents of files, and is very similar to type
under DOS/Windows. 

xargs is used to execute commands on the output of a previous command. We
will be using it to reprocess the output of find commands and the output of
other utilities. 
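
For readers who want to see these utilities working together, here is a
small sketch. The scratch tree, file names, and contents are made up for the
demonstration; only the utilities themselves come from the description
above.

```shell
# Hypothetical scratch tree; names and contents are made up for the demo.
work=$(mktemp -d)
mkdir -p "$work/src/sub dir"
printf 'int smp_enabled;\n' > "$work/src/a.c"
printf 'int other;\n'       > "$work/src/b.c"
printf '#define SMP 1\n'    > "$work/src/sub dir/c.h"
printf 'SMP notes\n'        > "$work/src/notes.txt"

# find: every .c or .h file under the current directory.
found=$(cd "$work/src" && find . -type f -name '*.[ch]' | sort)

# find + xargs + grep: of those files, which ones contain "smp" in any case?
# -print0 and -0 keep the name with a space in it intact.
hits=$(cd "$work/src" && find . -type f -name '*.[ch]' -print0 \
         | xargs -0 grep -il 'smp' | sort)

# cat + sort -u: print a list and drop its duplicate lines.
printf 'b.c\na.c\nb.c\n' > "$work/names"
unique=$(cat "$work/names" | sort -u)

printf '%s\n' "$found" "$hits" "$unique"
```

Note that notes.txt never reaches grep, because find only hands over names
ending in .c or .h.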


--------------------------------------------------------------------------------

SCO's Lists of Files 


Let's start with List 2: The list of "source code files identified by SCO
thus far...which may...include information (including methods) that IBM was
required to maintain as confidential or proprietary...and/or which
constitute trade secrets misused by IBM..." This is a list of 591 files. 

While this list names 591 files from Linux, SCO fails to mention which
kernel version they come from, and only says they're from 2.4 and/or 2.5
kernels. As IBM correctly points out, "This is no small problem since there
are 75 different releases of the Linux kernel 2.5 alone." SCO also says that
they do not claim the entire source code found in those files, but that the
information is interspersed in the 330,000 lines of code the files contain. 

IBM also points out that since it is Unix code (SVRx) that SCO claims was
misappropriated, pointing to the Linux source code does not really answer
their question, which was: from where were the trade secrets
misappropriated? SCO passes this argument off by saying that they have not
completed discovery, and that since IBM hasn't given them everything they've
asked for, they don't know exactly where it came from. 

Because SCO is claiming that it is IBM's trade secrets that were
misappropriated, they don't have the trade secrets yet themselves. In other
words, they need IBM to reveal more information. The question becomes "Why
does SCO believe that this list contains their trade secrets if they don't
know the trade secrets and need IBM to point them out?" 

In attempts to answer this, a number of discussions have occurred, here on
Groklaw, on the Linux Kernel Mailing List, and elsewhere. Here on Groklaw,
Lev managed to narrow the Linux kernel version down to either 2.5.68 or
2.5.69. Many people were quick to point out that most files on the list
contained one or more strings that SCO likes to claim as theirs: SMP, JFS,
RCU, and NUMA. 

By using the appropriate utilities, it is possible to reproduce SCO's list
(number 2) without any manual investigation of the contents of any of those
files. A sorted (and cleaned up) copy of SCO's list number 2 is located here
for reference. While this solution is certainly not the only one, and is
probably not optimal, it is the one that the author managed to construct: 


find . -type f -name "*.[ch]" -print0 \
   | xargs -0 egrep -wil 'smp|rcu|numa' \
   | cut -c 3- > /tmp/output1

find fs/jfs -type f -path "*.[ch]" -print0 \
   | xargs -0 egrep -Li "@sco|@caldera" >> /tmp/output1

egrep -v 'alpha|parisc|sparc|sound|drivers' /tmp/output1 \
   | sort -u > /tmp/SCOFiles-list2.output


This may look like quite a mess, but it can be deconstructed into manageable
pieces. All three lines really consist of several commands strung together
using the |, or pipe. This means that the results of one command are used as
input to the next command. 

Picking apart these lines, first I found all files with a filename ending in
.c or .h (C source code and header files). I searched the contents of these
files for any of the strings 'smp', 'rcu', or 'numa' (without caring about
upper- or lower-case). I placed these matching files into the file
/tmp/output1. Next, I searched the JFS filesystem code for .c or .h
filenames, removing any files that mention someone at SCO or Caldera working
on them. The results were appended to /tmp/output1. Finally, I searched the
/tmp/output1 file and removed all file names referring to alpha, parisc, or
sparc (essentially Sun and HP). References to driver files and sound were
then also removed. 
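
To watch each step do its job without downloading a kernel, the same three
steps can be run against a toy tree. This is a sketch under stated
assumptions: every path and file body below is invented, the output goes to
a temporary directory rather than /tmp/output1, and grep -E stands in for
egrep (the two are equivalent, and newer GNU grep releases warn about the
egrep name).

```shell
# Hypothetical miniature "kernel" tree; paths and contents are invented.
tree=$(mktemp -d)
out=$(mktemp -d)
mkdir -p "$tree/kernel" "$tree/arch/alpha/kernel" "$tree/drivers/net" "$tree/fs/jfs"
printf '/* SMP scheduling */\n'      > "$tree/kernel/sched.c"
printf 'int smp_count;\n'            > "$tree/kernel/fork.c"
printf '/* SMP boot */\n'            > "$tree/arch/alpha/kernel/smp.c"
printf '/* NUMA aware */\n'          > "$tree/drivers/net/card.c"
printf '/* plain JFS code */\n'      > "$tree/fs/jfs/inode.c"
printf '/* dev@caldera.example */\n' > "$tree/fs/jfs/txnmgr.c"

list2=$(
  cd "$tree" || exit 1
  # Step 1: -w matches whole words only, so "smp_count" is not a hit.
  find . -type f -name '*.[ch]' -print0 \
    | xargs -0 grep -E -wil 'smp|rcu|numa' \
    | cut -c 3- > "$out/output1"
  # Step 2: -L lists the JFS files that do NOT match the pattern.
  find fs/jfs -type f -name '*.[ch]' -print0 \
    | xargs -0 grep -E -Li '@sco|@caldera' >> "$out/output1"
  # Step 3: drop the excluded directories, then sort and de-duplicate.
  grep -E -v 'alpha|parisc|sparc|sound|drivers' "$out/output1" | sort -u
)
echo "$list2"
```

Of the six toy files, only kernel/sched.c and fs/jfs/inode.c survive: one
because it mentions SMP as a word, the other simply because it lives in the
JFS directory and names nobody at SCO or Caldera.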

When applying this process to the kernel versions identified by Lev, we get
3 false positives and 3 false negatives with the 2.5.68 kernel and just one
false positive with the 2.5.69 kernel. As the list is otherwise identical to
SCO's, I believe that SCO used the Linux 2.5.69 kernel to generate these
lists. 

The false positive was include/asm-h8300/smplock.h. There may be a number of
explanations for this, one of the most likely being that someone at SCO
messed up, and missed a line when sending the list to the lawyers. This is,
of course, presuming that the person preparing the list used a similar
process, which I believe is likely. 
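
Counting false positives and false negatives is itself a job for a standard
utility. The sketch below uses comm to compare a generated list with a filed
list. It is not necessarily how the comparison was done for this article,
and apart from include/asm-h8300/smplock.h the entries are placeholders
chosen for illustration.

```shell
# Stand-in lists: only include/asm-h8300/smplock.h comes from the article;
# the other entries are placeholders for illustration.
d=$(mktemp -d)
printf '%s\n' fs/jfs/inode.c include/asm-h8300/smplock.h kernel/sched.c \
  | LC_ALL=C sort > "$d/generated"
printf '%s\n' kernel/sched.c fs/jfs/inode.c \
  | LC_ALL=C sort > "$d/filed"

# Lines only in the generated list: false positives.
false_pos=$(LC_ALL=C comm -23 "$d/generated" "$d/filed")
# Lines only in the filed list: false negatives.
false_neg=$(LC_ALL=C comm -13 "$d/generated" "$d/filed")

echo "false positives: $false_pos"
echo "false negatives: ${false_neg:-none}"
```

comm expects both inputs to be sorted the same way, which is why the sort
and comm calls share the same LC_ALL=C setting.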

What does this mean? Essentially, that SCO searched for any reference in the
Linux kernel source for SMP, JFS, RCU, and NUMA, and claimed all of those
files as possibly infringing. They included the entire JFS source code, but,
perhaps realizing that it would look really bad to claim a file that
implicated SCO or Caldera by showing the names of their employees, removed
those files. 

A number of people have pointed out that some of the files are so trivial
that they could not contain trade secrets. For example,
include/asm-arm/spinlock.h contains only 6 lines, but is included in the
list because it contains the string SMP (as in "we don't do SMP"): 


#ifndef __ASM_SPINLOCK_H
#define __ASM_SPINLOCK_H

#error ARM architecture does not support SMP spin locks

#endif /* __ASM_SPINLOCK_H */
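
A quick check, sketched here with a temporary copy of the header quoted
above, shows both its length and why a whole-word, case-insensitive search
for "smp" picks it up. The temporary file is the only assumption.

```shell
# Recreate the six-line header quoted above in a temporary file.
f=$(mktemp)
cat > "$f" <<'EOF'
#ifndef __ASM_SPINLOCK_H
#define __ASM_SPINLOCK_H

#error ARM architecture does not support SMP spin locks

#endif /* __ASM_SPINLOCK_H */
EOF

# Count the lines, then show the single line that triggers the match.
lines=$(wc -l < "$f" | tr -d ' ')
hit=$(grep -wi 'smp' "$f")
echo "$lines lines; matching line: $hit"
```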


In providing this list to IBM, it appears that all SCO has done is to make
vague claims over all of SMP, JFS, RCU, and NUMA, which is hardly news, but
they have given no explanation of how they created their list of possibly
infringing files. They haven't answered IBM's question at all (which relates
to original SVRx code), and they look silly in the process, at least to
those who understand the code and the list. 

It is obvious that SCO did not spend a great deal of time or effort on
answering IBM's question with valuable information. If they actually did
spend time and effort to produce this list, their technical person is not
extremely skilled. 


--------------------------------------------------------------------------------

List 1: A list of "source code files identified by SCO thus far ... part of
which include information (including methods) that IBM was required to
maintain as confidential or proprietary...and/or which constitute trade
secrets misused by IBM...", the list of 115 files. 

The first thing to note is that the files in this list are actually a subset
of the files in List 2. For reference, a copy of SCO's list number 2 can be
found here. Using our trusty Linux utilities, we can again construct a
sequence of commands that produces SCO's list automatically. The following
commands will produce all of SCO's files (again, 100%) with just 2 false
positives: 


cat /tmp/SCOFiles-list2.output \
  | xargs egrep -l 'International Business Machines|ibm.|IBM Corp' \
  > /tmp/output1

cat /tmp/SCOFiles-list2.output \
  | xargs egrep -wl 'IBM|RCU' \
  | xargs egrep -L 'sco' >> /tmp/output1

sort -u /tmp/output1 > /tmp/SCOFiles-list1.output


These commands first search (List 2) for anything that would be easily
identifiable as coming from IBM, files containing "International Business
Machines", "IBM Corp", or "ibm." (as could be contained in an email address
like address@hidden). Next, any mention whatsoever of "IBM" or "RCU" is
included, as long as the file does not also contain "sco". 
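
The two passes can be seen at work on a handful of toy files. As before,
this is a sketch: the four files and their contents are invented stand-ins
for entries in List 2, the working files live in a temporary directory, and
grep -E is used in place of egrep.

```shell
# Hypothetical stand-ins for files named in List 2; contents are invented.
tree=$(mktemp -d)
printf '/* Copyright (C) International Business Machines Corp., 2002 */\n' > "$tree/a.c"
printf '/* RCU implementation */\n'                 > "$tree/b.c"
printf '/* IBM changes, later reworked at sco */\n' > "$tree/c.c"
printf '/* unrelated */\n'                          > "$tree/d.c"

list1=$(
  cd "$tree" || exit 1
  printf '%s\n' a.c b.c c.c d.c > list2
  # Pass 1: obvious IBM markers. The unescaped "." in 'ibm.' matches any
  # character, so it is looser than a literal "ibm." would be.
  cat list2 | xargs grep -E -l 'International Business Machines|ibm.|IBM Corp' > output1
  # Pass 2: whole-word IBM or RCU, minus files containing "sco" anywhere.
  cat list2 | xargs grep -E -wl 'IBM|RCU' | xargs grep -E -L 'sco' >> output1
  sort -u output1
)
echo "$list1"
```

Here a.c is caught by the first pass and b.c by the second, while c.c is
dropped solely because the lowercase string "sco" appears in it. Since that
last filter is a plain substring match, it would also drop a file that
merely contained a word such as "scope".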

Again, while we do not know for certain that this is the method that SCO
used to produce this list, it is easy to demonstrate that even though our
commands do not produce an identical list, SCO spent little more time to
create this list than List 2. 

We are unable to determine whether someone messed up and omitted the two
false positives, arch/ppc/kernel/setup.c and include/linux/list.h, or
whether our search string is not sufficiently developed to produce the same
list. What we do know is that this list of "definitely infringing files" is
little more than files with IBM mentioned, minus files referring to SCO. IBM
is asking for specifics because SCO has given no explanation of how they
built their list. Also, they've avoided the question of where in SVRx these
trade secrets came from, and why SCO believes they are trade secrets. 


--------------------------------------------------------------------------------

List 3: A list of people at IBM that SCO claims to be aware of "in which
part of the confidential or proprietary and/or trade secrets [were] known or
[have] been disclosed." This consists of 5 lists of authors, for a total of
about 74 people. 

In SCO's Supplemental Response, they identify a number of people as having
disclosed proprietary information and/or trade secrets. They break down
these names into "US Authors" (30), "German Authors" (24), "Australian
Authors" (2), "Other" (15), and "Austin Office (JFS)" (3). We won't be going
into the same detail in analyzing this section because it involves the names
and email addresses of people and we have redacted this information from the
text version of the document. Those curious should view SCO's filing to see
examples. 

Suffice it to say that these lists can be regenerated by searching the
kernel source for all files containing an email address at IBM. The lists
contain actual lines from the copyright notices in the Linux kernel. On
more than one, the line also contained references to other email addresses
that the person used, and at least one just ends like this:
"address@hidden or". The next line in the kernel source file contains
the alternate address. 

This list is fairly easy to generate, but does require a bit more manual
intervention than most of the others. Since some people have contributed
using multiple names (such as Pat and Patrick), someone has manually merged
these names together. It was done sloppily, though, since there are other
IBM-related email addresses in the source code which are not mentioned. 
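
Because the real names and addresses are redacted here, the following sketch
uses invented placeholders throughout. It only illustrates the kind of
search that could pull such lines out of a source tree, and why a person
appearing as both "Pat" and "Patrick" would still need merging by hand. It
assumes a grep that supports -r and -o, as GNU grep does.

```shell
# Hypothetical source files; the names and addresses are placeholders.
src=$(mktemp -d)
printf ' * Author: Pat Example <pat.example@us.ibm.com>\n'     > "$src/one.c"
printf ' * Author: Patrick Example <pat.example@us.ibm.com>\n' > "$src/two.c"
printf ' * Author: Someone Else <someone@elsewhere.example>\n' > "$src/three.c"

# Whole lines that carry an IBM address, duplicates removed.
lines=$(grep -rhi '@[a-z0-9.]*ibm\.com' "$src" | sort -u)
# Only the addresses themselves (-o prints just the matching part).
addrs=$(grep -rhoiE '[a-z0-9._-]+@[a-z0-9.-]*ibm\.com' "$src" | sort -u)

printf '%s\n' "$lines" "$addrs"
```

The search returns two different author lines but only one address, so
sort -u alone cannot tell that both lines refer to the same person.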

Here, SCO is apparently telling IBM that they believe that every
contribution from IBM is tainted, but they'll need all the source code ever
written from IBM in order to prove it. I have serious doubts that everyone
that ever contributed to Linux from IBM has done so under such suspicious
circumstances (I actually have serious doubts that _any_ contributions are
tainted in this way). 


--------------------------------------------------------------------------------

List 4: A list of IBM copyrights (a list of 22 names) 

This list is as easy to generate as List 3. It is merely a list of all the
various copyright notices involving IBM in the kernel source. It's actually
a pretty boring list, and doesn't seem to tell anyone much, including IBM.
It can be regenerated merely by searching for "Copyright" or "(C)" in the
same line as "IBM Corporation". They're all just lines like:

Fred So-and-So, IBM Corporation 
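
A search of that kind might look like the sketch below. The three sample
files are invented, and chaining two greps is just one simple way to demand
both strings on the same line; nothing here is known to be the command SCO
ran.

```shell
# Hypothetical source files with invented notice lines.
src=$(mktemp -d)
printf ' * Copyright (C) 2002 Fred So-and-So, IBM Corporation\n' > "$src/one.c"
printf ' * (C) Fred So-and-So, IBM Corporation\n'                > "$src/two.c"
printf ' * Tested on hardware lent by IBM Corporation\n'         > "$src/three.c"

# Keep lines naming IBM Corporation, then keep only those that also
# contain "Copyright" or "(C)".
notices=$(grep -rh 'IBM Corporation' "$src" | grep -E 'Copyright|\(C\)' | sort -u)
echo "$notices"
```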


--------------------------------------------------------------------------------

List 5: A list of people who "likely have knowledge, although their names do
not appear in the Linux code base." (a list of 62 names). 

We've left the best for last. Here, we've left the kernel source, but where
has SCO gotten this list? Ready? Okay... Here goes. They got it from a
Google search. 

Well, at least that is how it appears. The fact is that you can find the
names on this list by searching on Google for email addresses from IBM that
posted to the Linux Kernel Mailing List (LKML). Like I said, I don't
actually know that this is how SCO did it, but if you're really curious,
look at SCO's filing, then check out Google Groups for messages that hit the
Linux Kernel Mailing List: '"ibm.com" group:fa.linux.kernel' (for example). 

Without doing an extensive study, it is difficult to know exactly how much
(or little) work was done to actually build the list, but it is clear that
SCO believes that these individuals "likely have knowledge" because their
email address can be found on the Linux Kernel Mailing List. To test this
theory (in a highly unscientific manner), we chose 5-10 email addresses from
the LKML (compliments of Google) and all were located on SCO's list. We then
tested things the other way around, and had similar results. The addresses
we chose were easy to find on the LKML. One brief example: SCO's list
includes the email address address@hidden, which is easy to find here. 

So SCO produced a list that they believe holds the names of people with
knowledge of Linux. They may have actually searched the Changelogs, as well.
A list of names you can find on Google hardly qualifies as a response to
IBM's interrogatory. 


--------------------------------------------------------------------------------

Some General Comments 

In SCO's list, in the legal document, SCO has replaced all the slashes (/)
in the file names with periods (.). There are several theories in the Linux
community as to why. One possibility is that the lawyers may have written it
up using a program that doesn't like slashes, instead of using Unix or
Linux. While I used GNU utilities such as grep, the person preparing the
list may have used a different platform. 

Regular file/path names can be converted to the dotted format with the
following command (if you so desire): 'cat /tmp/SCOFiles | sed s:/:.:g' At
any rate, they could be converted back easily enough. Interestingly, the
path /arch/ppc64/kernel was also changed to .arch.ppc.64.kernel for some yet
unknown reason. 
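
Here is a short sketch of the conversion in both directions, using one path
from List 2 as the sample. The reverse step rests on an assumption that does
not hold for every possible file name: that the only genuine dot in the path
is the one before a final "c" or "h".

```shell
# Sample path taken from the list discussed above.
path='include/asm-arm/spinlock.h'

# Slashes to dots, as in SCO's filing.
dotted=$(printf '%s\n' "$path" | sed 's:/:.:g')

# Dots back to slashes, then restore the dot before a trailing c or h.
# Assumes no other dots belong in the name.
restored=$(printf '%s\n' "$dotted" | sed -e 's:\.:/:g' -e 's:/\([ch]\)$:.\1:')

echo "$dotted"
echo "$restored"
```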

Whoever prepared these lists was rather sloppy. They didn't pay attention to
detail, missed obvious files and email addresses, and didn't edit very well.
Obvious references to SCO or Caldera have been removed, but some of the
less-obvious ones remain. For example, some contributions to JFS by
Christoph Hellwig (once an employee of SCO) remain. Presumably, at least
some of those contributions occurred while he was working for SCO. 

Some of the files included are trivial and obviously contain no relevant
information. The 6-line files that just say "we don't do SMP" come to mind. 

It is easy for coders to understand IBM's contention that SCO has not been
answering their questions, regardless of the amount of data that they have
produced. They don't explain how anything they have reported is a trade
secret. And the fact that their lists can be recreated over a weekend using
simple scripts indicates to us that their answers are too broad to qualify
as answers to the questions they were asked. 

Maybe SCO hasn't heard the old saying: "Never tangle with a geek when source
code is on the line." 


--------------------------------------------------------------------------------

Prepared by Frank Sorenson 
With numerous helpful comments from other Groklaw Regulars 



-- 

DRM is Theft!  We are the Stakeholders!

New Yorkers for Fair Use
http://www.nyfairuse.org

[CC] Counter-copyright: http://realmeasures.dyndns.org/cc

I reserve no rights restricting copying, modification or distribution of
this incidentally recorded communication.  Original authorship should be
attributed reasonably, but only so far as such an expectation might hold for
usual practice in ordinary social discourse to which one holds no claim of
exclusive rights.