[elpa] externals/pyim d8e6a5b 4/6: * pyim.el (pyim-cstring-split-to-list
From: ELPA Syncer
Subject: [elpa] externals/pyim d8e6a5b 4/6: * pyim.el (pyim-cstring-split-to-list): Add delete-dups and prefer-short-word arguments.
Date: Sun, 28 Feb 2021 01:57:10 -0500 (EST)
branch: externals/pyim
commit d8e6a5b1ee5a6bb90a90cffc91ee4b583a95d977
Author: Feng Shu <tumashu@163.com>
Commit: Feng Shu <tumashu@163.com>
* pyim.el (pyim-cstring-split-to-list): Add delete-dups and
prefer-short-word arguments.
---
pyim.el | 26 ++++++++++++++++++++++++--
1 file changed, 24 insertions(+), 2 deletions(-)
diff --git a/pyim.el b/pyim.el
index 0019d8d..953e34b 100644
--- a/pyim.el
+++ b/pyim.el
@@ -4119,7 +4119,7 @@ PUNCT-LIST format is similar to:
(- current-pos str-beginning-pos)
(- str-end-pos current-pos)))))))
-(defun pyim-cstring-split-to-list (chinese-string &optional max-word-length)
+(defun pyim-cstring-split-to-list (chinese-string &optional max-word-length delete-dups prefer-short-word)
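"A pyim-based Chinese word segmentation function. It segments the Chinese
string CHINESE-STRING and returns an alist of entries. The elements of this alist
are all lists, where the first element is the word obtained by segmentation, and the second element is the entry relative to
@@ -4127,6 +4127,13 @@ PUNCT-LIST format is similar to:
6 characters; users can customize this with MAX-WORD-LENGTH, but note that
the larger this value is, the slower segmentation becomes.
+If DELETE-DUPS is non-nil, only one segmentation is kept for a Chinese string.
+For example: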
+
+ 我爱北京天安门 => 我爱 北京 天安门
+
+If PREFER-SHORT-WORD is non-nil, shorter words are preferred when removing duplicates.
+
Notes:
1. This tool segments by brute-force matching and *cannot detect* Chinese words
that are not in the pyim dictionary.
@@ -4177,7 +4184,22 @@ PUNCT-LIST format is similar to:
(dolist (word words)
(when (equal word (car string-list))
(push string-list result)))))))
- result)))
+
+ (if delete-dups
+ (cl-delete-duplicates
+ ;; Check whether the positions of two entries in the string
+ ;; conflict; if they do, keep only one and
+ ;; delete the others.
+ result
+ :test #'(lambda (x1 x2)
+ (let ((begin1 (nth 1 x1))
+ (begin2 (nth 1 x2))
+ (end1 (nth 2 x1))
+ (end2 (nth 2 x2)))
+ (not (or (<= end1 begin2)
+ (<= end2 begin1)))))
+ :from-end prefer-short-word)
+ result))))
;; (let ((str "医生随时都有可能被患者及其家属反咬一口"))
;; (benchmark 1 '(pyim-cstring-split-to-list str)))
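
The overlap test added in the hunk above can be tried in isolation. The following is only a minimal sketch, not part of the commit: the names demo-spans-overlap-p and demo-drop-overlapping are hypothetical, the entries are hand-written rather than produced by pyim, and each entry is assumed to have the shape (WORD BEGIN END) with END exclusive. It uses the non-destructive cl-remove-duplicates instead of cl-delete-duplicates so that the literal sample list is not modified.

```elisp
;;; -*- lexical-binding: t; -*-
(require 'cl-lib)

(defun demo-spans-overlap-p (x1 x2)
  "Return non-nil if entries X1 and X2 cover overlapping positions.
Assumption: each entry is a list (WORD BEGIN END), END exclusive."
  (let ((begin1 (nth 1 x1))
        (begin2 (nth 1 x2))
        (end1 (nth 2 x1))
        (end2 (nth 2 x2)))
    ;; Two spans are disjoint when one ends at or before the other begins.
    (not (or (<= end1 begin2)
             (<= end2 begin1)))))

(defun demo-drop-overlapping (entries &optional from-end)
  "Return a copy of ENTRIES with conflicting entries removed.
FROM-END is passed to `cl-remove-duplicates' as :from-end."
  (cl-remove-duplicates entries
                        :test #'demo-spans-overlap-p
                        :from-end from-end))

;; Hand-written sample data (an assumption, not real pyim output).
(defvar demo-entries
  '(("北京" 2 4) ("北京天安门" 2 7) ("我爱" 0 2)))

;; Without :from-end, the later of two conflicting entries is kept.
(demo-drop-overlapping demo-entries)
;; => (("北京天安门" 2 7) ("我爱" 0 2))

;; With :from-end, the earlier of two conflicting entries is kept.
(demo-drop-overlapping demo-entries t)
;; => (("北京" 2 4) ("我爱" 0 2))
```

Note that :from-end only decides whether the earlier or the later of two conflicting entries survives. In this sample the shorter word is kept with :from-end because it happens to come first in the list, so the effect on word length depends on how the input is ordered.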