emacs-orgmode
From: Christophe Pouzat
Subject: Re: [O] Efficiency of Org v. LaTeX v. Word ---LOOK AT THE DATA!
Date: Sun, 28 Dec 2014 22:40:24 +0100
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Thunderbird/31.3.0

Hi all,

After seeing Ken's mail:

On 26/12/2014 at 23:47, Ken Mankoff wrote:
People here might be interested in a publication from [2014-12-19 Fri]
available at http://dx.doi.org/10.1371/journal.pone.0115069

Title: An Efficiency Comparison of Document Preparation Systems Used
in Academic Research and Development

Summary: Word users are more efficient and make fewer errors than even
experienced LaTeX users.

Someone here should repeat the experiment and add Org into the mix, perhaps
Org -> ODT and/or Org -> LaTeX, and see if it helps or hurts. I assume
Org would trump LaTeX, but would Org -> ODT or Org -> X -> DOCX (via
pandoc) beat straight Word?

   -k.


and some of the replies it triggered on the list, I went to check the paper. Like many of you, I found some "results" puzzling, in particular:
1. the use of bar graphs when the data would be better displayed directly (that alone qualifies the paper as "low quality" for me);
2. the larger error bars observed for LaTeX compared to Word;
3. the systematic inverse relationship between the heights of the blue and pink bars.

So I went to figshare, downloaded the data and looked at them. A quick and dirty "analysis" is attached to this mail in PDF format (generated with Org, of course, and that awful software called LaTeX!); the source Org file can be found at the bottom of this mail. I used R for the figures (and I'm sure the authors of the paper will criticize me for not using Excel, with which, as everyone knows, errors are generated much more efficiently).

I managed to understand the inverse relationship in point 3 above: the authors considered 3 types of mistakes / errors:
1. Formatting errors and typos.
2. Orthographic and grammatical errors.
3. Missing words and signs.
Following the mail of Tom (Dye) on the list and on the Plos web site, I would argue that formatting errors in LaTeX are bona fide bugs. But the point I want to make is that the third source accounts for 80% of the total errors (what is shown as pink bars in the paper), and the authors clearly counted whatever the subjects did not have time to type as an error of this type. Said differently, the blue and pink bars systematically show the same thing by construction! The second type of error is not a LaTeX issue (and in fact does not differ significantly from the Word case) but an "environment" issue (what spelling checker did the LaTeX users have access to?).
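To make the accounting point concrete, here is a minimal R sketch on made-up numbers (nothing from the study's data set): if every untyped word is counted as a "missing words and signs" error, then that error count is a deterministic function of the completion fraction, so the two quantities must mirror each other.

```r
## Synthetic illustration (made-up numbers, not the study's data):
## if every untyped word is counted as a "missing words and signs"
## error, the error count is determined by the completion fraction.
set.seed(1)
total_words <- 500                   # hypothetical length of the text to copy
fraction_typed <- runif(10, 0.4, 1)  # 10 simulated subjects
missing <- round(total_words * (1 - fraction_typed))
cor(fraction_typed, missing)         # close to -1, by construction
```

The correlation is essentially -1 whatever the simulated completion fractions are, which is the sense in which the blue and pink bars carry the same information.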

There is another strange thing in the table-copy case. In both the expert and the novice LaTeX groups, one subject out of 10 produced 0% of the table yet still managed to produce 22 typographic errors!

The overall worse performance of LaTeX users remains to be explained and, as mentioned in one of the mails on the list, it does not make sense, at least for the continuous-text exercise. The methods section of the paper is too vague, but my guess is that some LaTeX users tried to reproduce the exact layout of the text they had to copy, something LaTeX is definitely not designed to provide quickly.

One more point: how many of you could specify your total number of hours of experience with LaTeX (or with any other software you currently use)? That is what the subjects of this study had to specify...

Let me know what you think,

Christophe

My org buffer:

#+TITLE: An Efficiency Comparison of Document Preparation Systems Used in Academic Research and Development: A Re-analysis.
#+DATE: <2014-12-28 dim.>
#+AUTHOR: Christophe Pouzat
#+EMAIL: address@hidden
#+OPTIONS: ':nil *:t -:t ::t <:t H:3 \n:nil ^:t arch:headline
#+OPTIONS: author:t c:nil creator:comment d:(not "LOGBOOK") date:t
#+OPTIONS: e:t email:nil f:t inline:t num:t p:nil pri:nil stat:t
#+OPTIONS: tags:t tasks:t tex:t timestamp:t toc:nil todo:t |:t
#+CREATOR: Emacs 24.4.1 (Org mode 8.2.10)
#+DESCRIPTION:
#+EXCLUDE_TAGS: noexport
#+KEYWORDS:
#+LANGUAGE: en
#+SELECT_TAGS: export
#+LaTeX_HEADER: \usepackage{alltt}
#+LaTeX_HEADER: \usepackage[usenames,dvipsnames]{xcolor}
#+LaTeX_HEADER: \renewenvironment{verbatim}{\begin{alltt} \scriptsize \color{Bittersweet} \vspace{0.2cm} }{\vspace{0.2cm} \end{alltt} \normalsize \color{black}}
#+LaTeX_HEADER: \definecolor{lightcolor}{gray}{.55}
#+LaTeX_HEADER: \definecolor{shadecolor}{gray}{.85}
#+LaTeX_HEADER: \usepackage{minted}
#+LaTeX_HEADER: \hypersetup{colorlinks=true}

#+NAME: org-latex-set-up
#+BEGIN_SRC emacs-lisp :results silent :exports none
(setq org-latex-listings 'minted)
(setq org-latex-minted-options
      '(("bgcolor" "shadecolor")
        ("fontsize" "\\scriptsize")))
(setq org-latex-pdf-process
      '("pdflatex -shell-escape -interaction nonstopmode -output-directory %o %f"
        "biber %b"
        "pdflatex -shell-escape -interaction nonstopmode -output-directory %o %f"
        "pdflatex -shell-escape -interaction nonstopmode -output-directory %o %f"))
#+END_SRC

* Introduction
This is a re-analysis of the data presented in [[http://dx.doi.org/10.1371/journal.pone.0115069][An Efficiency Comparison of Document Preparation Systems Used in Academic Research and Development]]. My "interest" in this paper was triggered by a discussion on the [[http://article.gmane.org/gmane.emacs.orgmode/93655][emacs org mode mailing list]]. Ignoring the "message" of the paper, what struck me was the systematic use of bar graphs, a way of displaying data that *should never be used*: when many observations are available, a box plot does a much better job, and when, as in the present paper, few observations are available (10 in each of the 4 categories), a direct display or even a simple table does a *much better* job. Since the data are available both on the Plos web site and on [[http://figshare.com/articles/_An_Efficiency_Comparison_of_Document_Preparation_Systems_Used_in_Academic_Research_and_Development_/1275631][figshare]], I decided to re-analyze them.
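As an aside, the difference between the two displays is easy to demonstrate with a self-contained sketch on synthetic numbers (nothing here comes from the study): two groups can produce similar-looking bars while their spreads differ wildly; plotting the 10 points of each group directly keeps that information.

#+BEGIN_SRC R :exports both
## Synthetic data (not the study's): comparable means, very different spreads.
set.seed(42)
a <- rnorm(10, mean = 50, sd = 2)    # tight group
b <- rnorm(10, mean = 50, sd = 20)   # dispersed group
## the bar-graph view reduces each group to a single height
c(mean(a), mean(b))
## the direct display shows every observation
stripchart(list(tight = a, dispersed = b), vertical = TRUE,
           method = "jitter", pch = 1, ylab = "value")
#+END_SRC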

* Getting the data, etc.

We get the data with:

#+BEGIN_SRC sh
wget http://files.figshare.com/1849394/S1_Materials.xlsx
#+END_SRC

#+RESULTS:
Using, for instance, [[http://dag.wiee.rs/home-made/unoconv/][unoconv]], we can convert the =Excel= file into a friendlier =csv= file:

#+BEGIN_SRC sh
unoconv -f csv S1_Materials.xlsx
#+END_SRC

#+RESULTS:
We then read the data with =R='s =read.csv= function (note the =dec=","= argument: the file uses decimal commas):

#+NAME: data-table
#+BEGIN_SRC R :session *R* :results silent
efficiency <- read.csv("S1_Materials.csv",header=TRUE,dec=",")
#+END_SRC
The description of this table is obtained with:

#+BEGIN_SRC sh :exports both :results output
wget http://files.figshare.com/1849395/S2_Materials.txt
cat "S2_Materials.txt"
#+END_SRC

* Making some figures
We can now make a figure out of the same data as figures 4, 5 and 6 of the paper but showing the actual data. We start with the "continuous text" exercise. We represent, in each of the four categories, each of the 10 individuals by a number between 0 and 9. Some horizontal jitter has been added to avoid overlaps. Category 1 corresponds to expert =Word= users; 2 to novice =Word= users; 3 to expert \LaTeX{} users; 4 to novice \LaTeX{} users:

#+HEADER: :file continuous.png :width 1000 :height 1000
#+BEGIN_SRC R :session *R* :exports both :results output graphics
layout(matrix(1:4,nc=2,byrow=TRUE))
par(cex=2)
plot(c(1,4),c(0,100),type="n",xlim=c(0.5,4.5),ylim=c(0,100),
     xlab="User category",ylab="",main="Fraction of text")
with(efficiency,
     sapply(1:4,
            function(k) points(runif(10,k-0.2,k+0.2),
                               PROZENT1[Kenntnisse==k],
                               pch = paste(0:9))))

with(efficiency,
     plot(c(1,4),c(0,100),type="n",
          xlim=c(0.5,4.5),ylim=range(FEHLERSFT),xlab="User category",
          ylab="",main="Formatting errors and typos"))
with(efficiency,
     sapply(1:4,
            function(k) points(runif(10,k-0.2,k+0.2),
                               FEHLERSFT[Kenntnisse==k],
                               pch = paste(0:9))))

with(efficiency,
     plot(c(1,4),c(0,100),type="n",xlim=c(0.5,4.5),
          ylim=range(FEHLEROFT),xlab="User category",ylab="",
          main="Orthographic and grammatical mistakes"))
with(efficiency,
     sapply(1:4,
            function(k) points(runif(10,k-0.2,k+0.2),
                               FEHLEROFT[Kenntnisse==k],
                               pch = paste(0:9))))

with(efficiency,
plot(c(1,4),c(0,100),type="n",xlim=c(0.5,4.5),ylim=range(FEHLENDFT),
          xlab="User category",ylab="",main="Missing words and signs"))
with(efficiency,
     sapply(1:4,
            function(k) points(runif(10,k-0.2,k+0.2),
                               FEHLENDFT[Kenntnisse==k],
                               pch = paste(0:9))))
#+END_SRC


Notice that the number of "missing words and signs" exactly mirrors the fraction of written text. We will see that this observation also holds for the two following exercises. The "missing words and signs" count is always roughly ten times as large as the two other sources of mistakes. This explains the inverse relationship between the blue and pink bars in each of the 3 figures.
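A quick numerical check of this mirroring (assuming the =efficiency= data frame read above is still in the =*R*= session; column names as in the =csv= file):

#+BEGIN_SRC R :session *R* :exports both
## correlation between fraction produced and "missing words and signs",
## for the text, table and equation exercises; strongly negative values
## support the "same thing by construction" reading
with(efficiency,
     c(text     = cor(PROZENT1, FEHLENDFT),
       table    = cor(PROZENT2, FEHLENDT),
       equation = cor(PROZENT3, FEHLENDFOR)))
#+END_SRC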

Let's keep going with the "table exercise":

#+HEADER: :file table.png :width 1000 :height 1000
#+BEGIN_SRC R :session *R* :exports both :results output graphics
layout(matrix(1:4,nc=2,byrow=TRUE))
par(cex=2)
plot(c(1,4),c(0,100),type="n",xlim=c(0.5,4.5),ylim=c(0,100),
     xlab="User category",ylab="",main="Fraction of text")
with(efficiency,sapply(1:4,
                       function(k) points(runif(10,k-0.2,k+0.2),
                                          PROZENT2[Kenntnisse==k],
                                          pch = paste(0:9))))

with(efficiency,plot(c(1,4),c(0,100),type="n",xlim=c(0.5,4.5),
                     ylim=range(FEHLERST),xlab="User category",
                     ylab="",main="Formatting errors and typos"))
with(efficiency,sapply(1:4,
                       function(k) points(runif(10,k-0.2,k+0.2),
                                          FEHLERST[Kenntnisse==k],
                                          pch = paste(0:9))))

with(efficiency,plot(c(1,4),c(0,100),type="n",xlim=c(0.5,4.5),
                     ylim=range(FEHLEROT),xlab="User category",
                     ylab="",main="Orthographic and grammatical mistakes"))
with(efficiency,sapply(1:4,
                       function(k) points(runif(10,k-0.2,k+0.2),
                                          FEHLEROT[Kenntnisse==k],
                                          pch = paste(0:9))))

with(efficiency,plot(c(1,4),c(0,100),type="n",xlim=c(0.5,4.5),
                     ylim=range(FEHLENDT),xlab="User category",ylab="",
                     main="Missing words and signs"))
with(efficiency,sapply(1:4,
                       function(k) points(runif(10,k-0.2,k+0.2),
                                          FEHLENDT[Kenntnisse==k],
                                          pch = paste(0:9))))
#+END_SRC

We also see a strange thing here: among both the expert and the novice \LaTeX{} users, one individual did not write anything but still managed to produce 22 "formatting errors and typos" (!), though luckily no orthographic or grammatical errors...

#+BEGIN_SRC R :session *R* :exports both
with(efficiency,cbind(c(PROZENT2[Kenntnisse==3][10],
                        FEHLERST[Kenntnisse==3][10],
                        FEHLEROT[Kenntnisse==3][10],
                        FEHLENDT[Kenntnisse==3][10]),
                      c(PROZENT2[Kenntnisse==4][7],
                        FEHLERST[Kenntnisse==4][7],
                        FEHLEROT[Kenntnisse==4][7],
                        FEHLENDT[Kenntnisse==4][7])))
#+END_SRC


Now for the "equations" exercise:

#+HEADER: :file equation.png :width 1000 :height 1000
#+BEGIN_SRC R :session *R* :exports both :results output graphics
layout(matrix(1:4,nc=2,byrow=TRUE))
par(cex=2)
plot(c(1,4),c(0,100),type="n",xlim=c(0.5,4.5),ylim=c(0,100),
     xlab="User category",ylab="",main="Fraction of text")
with(efficiency,
     sapply(1:4,
            function(k) points(runif(10,k-0.2,k+0.2),
                               PROZENT3[Kenntnisse==k],
                               pch = paste(0:9))))

with(efficiency,
     plot(c(1,4),c(0,100),type="n",xlim=c(0.5,4.5),
          ylim=range(FEHLERSFOR),xlab="User category",ylab="",
          main="Formatting errors and typos"))
with(efficiency,
     sapply(1:4,
            function(k) points(runif(10,k-0.2,k+0.2),
                               FEHLERSFOR[Kenntnisse==k],
                               pch = paste(0:9))))

with(efficiency,
plot(c(1,4),c(0,100),type="n",xlim=c(0.5,4.5),ylim=range(FEHLEROFOR),
          xlab="User category",ylab="",
          main="Orthographic and grammatical mistakes"))
with(efficiency,
     sapply(1:4,
            function(k) points(runif(10,k-0.2,k+0.2),
                               FEHLEROFOR[Kenntnisse==k],
                               pch = paste(0:9))))

with(efficiency,
     plot(c(1,4),c(0,100),type="n",xlim=c(0.5,4.5),
          ylim=range(FEHLENDFOR),xlab="User category",ylab="",
          main="Missing words and signs"))
with(efficiency,
     sapply(1:4,
            function(k) points(runif(10,k-0.2,k+0.2),
                               FEHLENDFOR[Kenntnisse==k],
                               pch = paste(0:9))))
#+END_SRC



--
A Master Carpenter has many tools and is expert with most of them. If you only 
know how to use a hammer, every problem starts to look like a nail. Stay away 
from that trap.

Richard B Johnson.

--

Christophe Pouzat
MAP5 - Mathématiques Appliquées à Paris 5
CNRS UMR 8145
45, rue des Saints-Pères
75006 PARIS
France

tel: +33142863828
mobile: +33662941034
web: http://xtof.disque.math.cnrs.fr

Attachment: EfficiencyComparison.pdf
Description: Adobe PDF document

