emacs-elpa-diffs

From: ELPA Syncer
Subject: [elpa] externals/pyim d8e6a5b 4/6: * pyim.el (pyim-cstring-split-to-list): Add delete-dups and prefer-short-word arguments.
Date: Sun, 28 Feb 2021 01:57:10 -0500 (EST)

branch: externals/pyim
commit d8e6a5b1ee5a6bb90a90cffc91ee4b583a95d977
Author: Feng Shu <tumashu@163.com>
Commit: Feng Shu <tumashu@163.com>

    * pyim.el (pyim-cstring-split-to-list): Add delete-dups and prefer-short-word arguments.
---
 pyim.el | 26 ++++++++++++++++++++++++--
 1 file changed, 24 insertions(+), 2 deletions(-)

diff --git a/pyim.el b/pyim.el
index 0019d8d..953e34b 100644
--- a/pyim.el
+++ b/pyim.el
@@ -4119,7 +4119,7 @@ PUNCT-LIST has a format like:
                       (- current-pos str-beginning-pos)
                       (- str-end-pos current-pos)))))))
 
-(defun pyim-cstring-split-to-list (chinese-string &optional max-word-length)
+(defun pyim-cstring-split-to-list (chinese-string &optional max-word-length delete-dups prefer-short-word)
   "A Chinese word segmentation function based on pyim.  It segments the
 Chinese string CHINESE-STRING into an alist of entries.  Each element of
 the alist is a list whose first item is a segmented word and whose second
@@ -4127,6 +4127,13 @@ PUNCT-LIST has a format like:
 6 characters.  Users can change this with MAX-WORD-LENGTH, but note that
 the larger the value, the slower the segmentation.
 
+If DELETE-DUPS is non-nil, keep only one segmentation of a Chinese string.
+For example:
+
+  我爱北京天安门 => 我爱 北京 天安门
+
+If PREFER-SHORT-WORD is non-nil, prefer shorter words when removing duplicates.
+
 Notes:
 1. This tool segments by brute-force matching; it *cannot detect* Chinese
    words that are not in the pyim dictionaries.
@@ -4177,7 +4184,22 @@ PUNCT-LIST has a format like:
               (dolist (word words)
                 (when (equal word (car string-list))
                   (push string-list result)))))))
-      result)))
+
+      (if delete-dups
+          (cl-delete-duplicates
+           ;;  Check whether two entries overlap in
+           ;;  the string; if they do, keep only one
+           ;;  and delete the others.
+           result
+           :test #'(lambda (x1 x2)
+                     (let ((begin1 (nth 1 x1))
+                           (begin2 (nth 1 x2))
+                           (end1 (nth 2 x1))
+                           (end2 (nth 2 x2)))
+                       (not (or (<= end1 begin2)
+                                (<= end2 begin1)))))
+           :from-end prefer-short-word)
+        result))))
 
 ;; (let ((str "医生随时都有可能被患者及其家属反咬一口"))
 ;;   (benchmark 1 '(pyim-cstring-split-to-list str)))
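
The :test predicate added in this patch can be tried on its own. The following is a minimal sketch, not part of pyim: the helper name `my/entries-overlap-p` is hypothetical, the entries are assumed to have the shape (WORD BEGIN END) that the patch reads with `nth`, and the positions are made up for illustration.

```elisp
;; -*- lexical-binding: t; -*-
(require 'cl-lib)

;; Hypothetical helper (not part of pyim).  Each entry is assumed to be
;; a list (WORD BEGIN END), matching the `nth' accesses in the patch.
(defun my/entries-overlap-p (x1 x2)
  "Return non-nil if entries X1 and X2 cover overlapping ranges."
  (let ((begin1 (nth 1 x1)) (end1 (nth 2 x1))
        (begin2 (nth 1 x2)) (end2 (nth 2 x2)))
    ;; Two ranges are disjoint when one ends at or before the other begins.
    (not (or (<= end1 begin2)
             (<= end2 begin1)))))

;; Illustrative entries with made-up positions.
(my/entries-overlap-p '("北京" 2 4) '("天安门" 4 7)) ; => nil (adjacent)
(my/entries-overlap-p '("北京" 2 4) '("京天" 3 5))   ; => t (share position 3)

;; The same predicate used as a :test, as the patch does.  Which of two
;; overlapping entries survives depends on list order and on :from-end.
(cl-remove-duplicates
 (list '("北京" 2 4) '("京天" 3 5) '("天安门" 4 7))
 :test #'my/entries-overlap-p
 :from-end t)
```

Overlap is not a transitive relation, so the entries kept by `cl-delete-duplicates` depend on the order of the input list. That is presumably why the patch exposes the choice through :from-end.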


