[Top][All Lists]
[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
[GNUnet-SVN] r2149 - in Extractor: . src/plugins/pdf
From: |
grothoff |
Subject: |
[GNUnet-SVN] r2149 - in Extractor: . src/plugins/pdf |
Date: |
Mon, 26 Sep 2005 20:33:40 -0700 (PDT) |
Author: grothoff
Date: 2005-09-26 20:33:37 -0700 (Mon, 26 Sep 2005)
New Revision: 2149
Modified:
Extractor/ChangeLog
Extractor/src/plugins/pdf/pdfextractor.cc
Log:
minor improvements to PDF
Modified: Extractor/ChangeLog
===================================================================
--- Extractor/ChangeLog 2005-09-26 14:21:56 UTC (rev 2148)
+++ Extractor/ChangeLog 2005-09-27 03:33:37 UTC (rev 2149)
@@ -1,3 +1,12 @@
+Mon Sep 26 20:44:10 PDT 2005
+ Minor improvements to the PDF extractor:
+ - first change is to avoid outputting keywords with empty values
+ (for now the new check is only effective when the value is seen
+ as a string of 8-bit characters; I'm not sure how to rewrite it
+ for the Unicode case in the if branch just above.)
+ - second change is to remap PDF Creator as 'software' keyword
+ instead of 'creator'.
+
Sun Sep 25 11:31:51 PDT 2005
Made sure extract returns error code (1) if some files could not
be accessed.
Modified: Extractor/src/plugins/pdf/pdfextractor.cc
===================================================================
--- Extractor/src/plugins/pdf/pdfextractor.cc 2005-09-26 14:21:56 UTC (rev
2148)
+++ Extractor/src/plugins/pdf/pdfextractor.cc 2005-09-27 03:33:37 UTC (rev
2149)
@@ -87,17 +87,45 @@
con = (char*) convertToUtf8((const char*) u, 2, "UNICODE");
strcat(result, con);
free(con);
- }
+ }
next = addKeyword(type,
strdup(result),
next);
free(result);
} else {
- next = addKeyword(type,
- convertToUtf8(s,
- strlen(s),
+ unsigned int len = (NULL == s) ? 0 : strlen(s);
+
+ while(0 < len) {
+ /*
+ * Avoid outputting trailing spaces.
+ *
+ * The following expression might be rewritten as
+ * (! isspace(s[len - 1]) && 0xA0 != s[len - 1]).
+ * There seem to exist isspace() implementations
+ * which do return non-zero from NBSP (maybe locale-dependent).
+ * Remove ISO-8859 non-breaking space (NBSP, hex value 0xA0) from
+ * the expression if it looks suspicious (locale issues for instance).
+ *
+ * Squeezing out all non-printable characters might also be useful.
+ */
+ if ( (' ' != s[len - 1]) && ((char)0xA0 != s[len - 1]) &&
+ ('\r' != s[len - 1]) && ('\n' != s[len - 1]) &&
+ ('\t' != s[len - 1]) && ('\v' != s[len - 1]) &&
+ ('\f' != s[len - 1]) )
+ break;
+
+ else
+ len --;
+ }
+
+ /* there should be a check to truncate preposterously long values. */
+
+ if (0 < len) {
+ next = addKeyword(type,
+ convertToUtf8(s, len,
"ISO-8859-1"),
next);
+ }
}
}
obj.free();
@@ -251,9 +279,14 @@
"Author",
EXTRACTOR_AUTHOR,
result);
+ /*
+ * we now believe that Adobe's Creator
+ * is not a person nor an organisation,
+ * but just a piece of software.
+ */
result = printInfoString(info.getDict(),
"Creator",
- EXTRACTOR_CREATOR,
+ EXTRACTOR_SOFTWARE,
result);
result = printInfoString(info.getDict(),
"Producer",
[Prev in Thread] |
Current Thread |
[Next in Thread] |
- [GNUnet-SVN] r2149 - in Extractor: . src/plugins/pdf,
grothoff <=