"Jieba" (Chinese for "to stutter") Chinese text segmentation: built to be the best PHP Chinese word segmentation module. The current port is based on jieba-0.33 and will be upgraded gradually. Performance still needs improvement, and interested developers are welcome to join the development. For the Python version, see fxsjy/jieba.
Traditional Chinese is now supported: just switch the dictionary to big mode.
jieba-php English Document
Online Demo
- Demo site URL: http://jieba-php.fukuball.com
- Demo site repo: https://github.com/fukuball/jieba-php.fukuball.com
Feature
- Supports three segmentation modes:
  - Accurate Mode attempts to cut the sentence into the most accurate segmentation, which is suitable for text analysis.
  - Full Mode scans out every word that can be formed from the sentence.
  - Search Engine Mode builds on Accurate Mode and further cuts long words into shorter ones, which improves the recall rate for search engines.
Usage
- Installation: Use composer to install jieba-php, then require the autoload file to use jieba-php.
Algorithm
- Uses a Trie-based structure for efficient word graph scanning: all the possible words that the Chinese characters of a sentence can form make up a directed acyclic graph (DAG).
- Uses memoized search (dynamic programming) to compute the maximum-probability path, which gives the best segmentation based on word frequency.
- For unknown words, uses an HMM over character positions, decoded with the Viterbi algorithm.
- For the meaning of the BEMS tags, see https://github.com/fxsjy/jieba/issues/7.
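The first two steps can be sketched in a few lines of plain PHP. This is only an illustration under stated assumptions, not jieba-php's actual implementation: the word frequencies are invented, a hash lookup stands in for the Trie, and the function names `buildDag`, `bestRoute`, and `segment` are hypothetical.

```php
<?php
// Toy word frequencies, invented for this example.
$freq = [
    '我' => 5, '来' => 3, '到' => 3, '来到' => 8,
    '北' => 1, '京' => 1, '北京' => 10,
];

// Build the DAG: for each start index, every end index that closes a dictionary word.
function buildDag(array $chars, array $freq): array
{
    $n = count($chars);
    $dag = [];
    for ($i = 0; $i < $n; $i++) {
        $ends = [];
        $word = '';
        for ($j = $i; $j < $n; $j++) {
            $word .= $chars[$j];
            if (isset($freq[$word])) {
                $ends[] = $j;
            }
        }
        // Fall back to the single character when nothing matches.
        $dag[$i] = $ends ?: [$i];
    }
    return $dag;
}

// Walk right to left, keeping the best log-probability from each index to the end.
function bestRoute(array $chars, array $dag, array $freq): array
{
    $n = count($chars);
    $logTotal = log(array_sum($freq));
    $route = [$n => [0.0, $n]];
    for ($i = $n - 1; $i >= 0; $i--) {
        $best = null;
        foreach ($dag[$i] as $j) {
            $word = implode('', array_slice($chars, $i, $j - $i + 1));
            $score = log($freq[$word] ?? 1) - $logTotal + $route[$j + 1][0];
            if ($best === null || $score > $best[0]) {
                $best = [$score, $j];
            }
        }
        $route[$i] = $best;
    }
    return $route;
}

// Follow the best route from index 0 to recover the words.
function segment(string $sentence, array $freq): array
{
    $chars = preg_split('//u', $sentence, -1, PREG_SPLIT_NO_EMPTY);
    $route = bestRoute($chars, buildDag($chars, $freq), $freq);
    $words = [];
    $i = 0;
    while ($i < count($chars)) {
        $end = $route[$i][1];
        $words[] = implode('', array_slice($chars, $i, $end - $i + 1));
        $i = $end + 1;
    }
    return $words;
}

print_r(segment('我来到北京', $freq)); // 我 / 来到 / 北京
```

Working from the end of the sentence backwards means each position's best score is computed once and reused, which is the memoization the list above refers to.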
Interface
- The cut method accepts two parameters: 1) the first parameter is the string to segment; 2) the second parameter, cut_all, controls the segmentation mode.
- The string to segment should be a UTF-8 string.
- cutForSearch accepts only one parameter, the string to segment, and cuts the sentence into short words.
- cut and cutForSearch both return an array of segmented words.
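To make the difference between cut and cutForSearch concrete, here is a rough sketch of the search-engine splitting step. It assumes the approach used by the Python jieba (emit every dictionary 2-gram, then every dictionary 3-gram, then the word itself); whether jieba-php follows exactly the same order is an assumption, and `splitForSearch` and the small `$dict` are hypothetical names made up for the example.

```php
<?php
// Hypothetical helper, not part of jieba-php's public API.
// $words: result of an accurate-mode cut; $dict: set of known words (as array keys).
function splitForSearch(array $words, array $dict): array
{
    $result = [];
    foreach ($words as $word) {
        $chars = preg_split('//u', $word, -1, PREG_SPLIT_NO_EMPTY);
        $len = count($chars);
        // Emit dictionary 2-grams first, then 3-grams, then the word itself.
        foreach ([2, 3] as $size) {
            if ($len <= $size) {
                continue; // the word is not longer than the gram size
            }
            for ($i = 0; $i + $size <= $len; $i++) {
                $gram = implode('', array_slice($chars, $i, $size));
                if (isset($dict[$gram])) {
                    $result[] = $gram;
                }
            }
        }
        $result[] = $word;
    }
    return $result;
}

// Tiny stand-in dictionary for the example.
$dict = array_flip(['中国', '科学', '学院', '科学院', '中国科学院', '计算', '计算所']);

print_r(splitForSearch(['中国科学院', '计算所'], $dict));
// 中国 / 科学 / 学院 / 科学院 / 中国科学院 / 计算 / 计算所
```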
Function 1) Segmentation
Example (Tutorial)
    ini_set('memory_limit', '1024M');

    require_once "/path/to/your/vendor/multi-array/MultiArray.php";
    require_once "/path/to/your/vendor/multi-array/Factory/MultiArrayFactory.php";
    require_once "/path/to/your/class/Jieba.php";
    require_once "/path/to/your/class/Finalseg.php";

    use Fukuball\Jieba\Jieba;
    use Fukuball\Jieba\Finalseg;

    // Load the dictionary and the HMM model before segmenting.
    Jieba::init();
    Finalseg::init();

    $seg_list = Jieba::cut("怜香惜玉也得要看对象啊!");
    var_dump($seg_list);

    // Full mode
    $seg_list = Jieba::cut("我来到北京清华大学", true);
    var_dump($seg_list);

    // Accurate mode (the default)
    $seg_list = Jieba::cut("我来到北京清华大学", false);
    var_dump($seg_list);

    $seg_list = Jieba::cut("他来到了网易杭研大厦");
    var_dump($seg_list);

    // Search engine mode
    $seg_list = Jieba::cutForSearch("小明硕士毕业于中国科学院计算所,后在日本京都大学深造");
    var_dump($seg_list);
Output:
    array(7) {
      [0]=> string(12) "怜香惜玉"
      [1]=> string(3) "也"
      [2]=> string(3) "得"
      [3]=> string(3) "要"
      [4]=> string(3) "看"
      [5]=> string(6) "对象"
      [6]=> string(3) "啊"
    }

Full Mode:

    array(15) {
      [0]=> string(3) "我"
      [1]=> string(3) "来"
      [2]=> string(6) "来到"
      [3]=> string(3) "到"
      [4]=> string(3) "北"
      [5]=> string(6) "北京"
      [6]=> string(3) "京"
      [7]=> string(3) "清"
      [8]=> string(6) "清华"
      [9]=> string(12) "清华大学"
      [10]=> string(3) "华"
      [11]=> string(6) "华大"
      [12]=> string(3) "大"
      [13]=> string(6) "大学"
      [14]=> string(3) "学"
    }

Default Mode:

    array(4) {
      [0]=> string(3) "我"
      [1]=> string(6) "来到"
      [2]=> string(6) "北京"
      [3]=> string(12) "清华大学"
    }
    array(6) {
      [0]=> string(3) "他"
      [1]=> string(6) "来到"
      [2]=> string(3) "了"
      [3]=> string(6) "网易"
      [4]=> string(6) "杭研"
      [5]=> string(6) "大厦"
    }

(Here, "杭研" is not in the dictionary, but the Viterbi algorithm still identified it.)

Search Engine Mode:

    array(18) {
      [0]=> string(6) "小明"
      [1]=> string(6) "硕士"
      [2]=> string(6) "毕业"
      [3]=> string(3) "于"
      [4]=> string(6) "中国"
      [5]=> string(6) "科学"
      [6]=> string(6) "学院"
      [7]=> string(9) "科学院"
      [8]=> string(15) "中国科学院"
      [9]=> string(6) "计算"
      [10]=> string(9) "计算所"
      [11]=> string(3) "后"
      [12]=> string(3) "在"
      [13]=> string(6) "日本"
      [14]=> string(6) "京都"
      [15]=> string(6) "大学"
      [16]=> string(18) "日本京都大学"
      [17]=> string(6) "深造"
    }
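The word "杭研" in the output above comes from the HMM step, which labels each character with a B (begin), M (middle), E (end), or S (single) tag. The following sketch shows only the last part of that step: turning a tag sequence into words. The tags here are written by hand rather than produced by a Viterbi decoder, and `tagsToWords` is a hypothetical name, not a jieba-php function.

```php
<?php
// Hypothetical helper: group characters into words from their B/M/E/S tags.
function tagsToWords(array $chars, array $tags): array
{
    $words = [];
    $current = '';
    foreach ($chars as $i => $char) {
        $current .= $char;
        // E closes a multi-character word; S is a one-character word.
        if ($tags[$i] === 'E' || $tags[$i] === 'S') {
            $words[] = $current;
            $current = '';
        }
    }
    // Flush a trailing unfinished word if the sequence ended on B or M.
    if ($current !== '') {
        $words[] = $current;
    }
    return $words;
}

$chars = preg_split('//u', '杭研大厦', -1, PREG_SPLIT_NO_EMPTY);

// Hand-written tags for illustration only.
print_r(tagsToWords($chars, ['B', 'E', 'B', 'E'])); // 杭研 / 大厦
```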
Function 2) Add a custom dictionary
- Developers can specify their own custom dictionary to supplement the jieba dictionary. jieba can identify new words on its own, but adding your own words ensures a higher rate of correct segmentation.
- Usage: Jieba::loadUserDict(file_name) # file_name is the path to the custom dictionary.
- The dictionary format is the same as that of dict.txt: one word per line; each line has two parts, the word itself and its frequency, separated by a space.
- Example:

    云计算 5
    李小福 2
    创新办 3

    Before: 李小福 / 是 / 创新 / 办 / 主任 / 也 / 是 / 云 / 计算 / 方面 / 的 / 专家 /
    After loading the custom dictionary: 李小福 / 是 / 创新办 / 主任 / 也 / 是 / 云计算 / 方面 / 的 / 专家 /
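The two-column format described above is simple enough to validate before handing a file to Jieba::loadUserDict. The sketch below is a hypothetical stand-alone parser (the name `parseUserDict` is made up, and jieba-php's own loader may treat edge cases differently); it reads "word frequency" lines and skips anything malformed.

```php
<?php
// Hypothetical parser for "word frequency" lines; not jieba-php's own loader.
function parseUserDict(string $text): array
{
    $entries = [];
    foreach (preg_split('/\R/u', $text) as $line) {
        $line = trim($line);
        if ($line === '') {
            continue; // ignore blank lines
        }
        $parts = preg_split('/\s+/u', $line);
        // Expect a word followed by a non-negative integer frequency.
        if (count($parts) < 2 || preg_match('/^\d+$/', $parts[1]) !== 1) {
            continue; // skip malformed lines
        }
        $entries[$parts[0]] = (int) $parts[1];
    }
    return $entries;
}

$dict = parseUserDict("云计算 5\n李小福 2\n创新办 3\n");
print_r($dict); // 云计算 => 5, 李小福 => 2, 创新办 => 3
```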
Function 3) Keyword Extraction
- JiebaAnalyse::extractTags($content, $top_k)
- content: the text to extract keywords from
- top_k: the number of keywords with the highest TF-IDF weights to return; the default value is 20
Example (keyword extraction)
    ini_set('memory_limit', '600M');

    require_once "/path/to/your/vendor/multi-array/MultiArray.php";
    require_once "/path/to/your/vendor/multi-array/Factory/MultiArrayFactory.php";
    require_once "/path/to/your/class/Jieba.php";
    require_once "/path/to/your/class/Finalseg.php";
    require_once "/path/to/your/class/JiebaAnalyse.php";

    use Fukuball\Jieba\Jieba;
    use Fukuball\Jieba\Finalseg;
    use Fukuball\Jieba\JiebaAnalyse;

    Jieba::init(array('mode'=>'test','dict'=>'small'));
    Finalseg::init();
    JiebaAnalyse::init();

    // Return the ten highest-weighted keywords.
    $top_k = 10;
    $content = file_get_contents("/path/to/your/dict/lyric.txt");

    $tags = JiebaAnalyse::extractTags($content, $top_k);
    var_dump($tags);
Output:
    array(10) {
      ["是否"]=> float(1.2196321889395)
      ["一般"]=> float(1.0032459890209)
      ["肌迫"]=> float(0.64654314660465)
      ["怯懦"]=> float(0.44762844339349)
      ["藉口"]=> float(0.32327157330233)
      ["逼不得已"]=> float(0.32327157330233)
      ["不安全感"]=> float(0.26548304656279)
      ["同感"]=> float(0.23929673812326)
      ["有把握"]=> float(0.21043366018744)
      ["空洞"]=> float(0.20598261709442)
    }
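The weights in the output above are TF-IDF scores. As a rough sketch of how such a ranking can be computed, the snippet below multiplies each word's relative frequency in the text by an IDF value and keeps the top entries. The IDF numbers are invented, the fallback IDF for unknown words is an assumption, and `topTags` is a hypothetical helper rather than the code behind JiebaAnalyse::extractTags.

```php
<?php
// Hypothetical helper: rank words by (count / total words) * IDF.
function topTags(array $words, array $idf, float $defaultIdf, int $topK): array
{
    $counts = array_count_values($words);
    $total = count($words);
    $scores = [];
    foreach ($counts as $word => $count) {
        // Words missing from the IDF table fall back to $defaultIdf.
        $scores[$word] = ($count / $total) * ($idf[$word] ?? $defaultIdf);
    }
    arsort($scores); // highest score first, keys preserved
    return array_slice($scores, 0, $topK, true);
}

// Already-segmented text and invented IDF values.
$words = ['空洞', '是否', '空洞', '怯懦', '是否', '空洞'];
$idf = ['空洞' => 8.0, '是否' => 4.0, '怯懦' => 9.0];

$tags = topTags($words, $idf, 6.0, 2);
print_r($tags); // 空洞 first, then 怯懦
```

Note how 怯懦 outranks 是否 even though it occurs less often: the rarer word carries the larger IDF.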
Function 4) Word Segmentation and Tagging
- Meaning of the word tags: https://gist.github.com/luw2007/6016931
Example (word tagging)
    ini_set('memory_limit', '600M');

    require_once dirname(dirname(__FILE__))."/vendor/multi-array/MultiArray.php";
    require_once dirname(dirname(__FILE__))."/vendor/multi-array/Factory/MultiArrayFactory.php";
    require_once dirname(dirname(__FILE__))."/class/Jieba.php";
    require_once dirname(dirname(__FILE__))."/class/Finalseg.php";
    require_once dirname(dirname(__FILE__))."/class/Posseg.php";

    use Fukuball\Jieba\Jieba;
    use Fukuball\Jieba\Finalseg;
    use Fukuball\Jieba\Posseg;

    Jieba::init();
    Finalseg::init();
    Posseg::init();

    $seg_list = Posseg::cut("这是一个伸手不见五指的黑夜。我叫孙悟空,我爱北京,我爱Python和C++。");
    var_dump($seg_list);
Output:
    array(21) {
      [0]=> array(2) { ["word"]=> string(3) "这" ["tag"]=> string(1) "r" }
      [1]=> array(2) { ["word"]=> string(3) "是" ["tag"]=> string(1) "v" }
      [2]=> array(2) { ["word"]=> string(6) "一个" ["tag"]=> string(1) "m" }
      [3]=> array(2) { ["word"]=> string(18) "伸手不见五指" ["tag"]=> string(1) "i" }
      [4]=> array(2) { ["word"]=> string(3) "的" ["tag"]=> string(2) "uj" }
      [5]=> array(2) { ["word"]=> string(6) "黑夜" ["tag"]=> string(1) "n" }
      [6]=> array(2) { ["word"]=> string(3) "。" ["tag"]=> string(1) "w" }
      [7]=> array(2) { ["word"]=> string(3) "我" ["tag"]=> string(1) "r" }
      [8]=> array(2) { ["word"]=> string(3) "叫" ["tag"]=> string(1) "v" }
      [9]=> array(2) { ["word"]=> string(9) "孙悟空" ["tag"]=> string(2) "nr" }
      [10]=> array(2) { ["word"]=> string(3) "," ["tag"]=> string(1) "w" }
      [11]=> array(2) { ["word"]=> string(3) "我" ["tag"]=> string(1) "r" }
      [12]=> array(2) { ["word"]=> string(3) "爱" ["tag"]=> string(1) "v" }
      [13]=> array(2) { ["word"]=> string(6) "北京" ["tag"]=> string(2) "ns" }
      [14]=> array(2) { ["word"]=> string(3) "," ["tag"]=> string(1) "w" }
      [15]=> array(2) { ["word"]=> string(3) "我" ["tag"]=> string(1) "r" }
      [16]=> array(2) { ["word"]=> string(3) "爱" ["tag"]=> string(1) "v" }
      [17]=> array(2) { ["word"]=> string(6) "Python" ["tag"]=> string(3) "eng" }
      [18]=> array(2) { ["word"]=> string(3) "和" ["tag"]=> string(1) "c" }
      [19]=> array(2) { ["word"]=> string(3) "C++" ["tag"]=> string(3) "eng" }
      [20]=> array(2) { ["word"]=> string(3) "。" ["tag"]=> string(1) "w" }
    }
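Because each entry is a small array with "word" and "tag" keys, the result is easy to post-process. As one possible use, the sketch below keeps only the words whose tag starts with a given prefix, for example "n" for the noun-like tags n, nr, and ns. The sample entries are copied from the output above; `wordsWithTagPrefix` is a hypothetical helper, not part of jieba-php.

```php
<?php
// A few entries copied from the Posseg::cut output.
$seg_list = [
    ['word' => '我', 'tag' => 'r'],
    ['word' => '叫', 'tag' => 'v'],
    ['word' => '孙悟空', 'tag' => 'nr'],
    ['word' => '我', 'tag' => 'r'],
    ['word' => '爱', 'tag' => 'v'],
    ['word' => '北京', 'tag' => 'ns'],
];

// Hypothetical helper: keep the words whose tag begins with $prefix.
function wordsWithTagPrefix(array $segList, string $prefix): array
{
    $words = [];
    foreach ($segList as $item) {
        if (strncmp($item['tag'], $prefix, strlen($prefix)) === 0) {
            $words[] = $item['word'];
        }
    }
    return $words;
}

$nouns = wordsWithTagPrefix($seg_list, 'n');
print_r($nouns); // 孙悟空 / 北京
```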
Function 5) Use Traditional Chinese
Example (Tutorial)
    ini_set('memory_limit', '1024M');

    require_once dirname(dirname(__FILE__))."/vendor/multi-array/MultiArray.php";
    require_once dirname(dirname(__FILE__))."/vendor/multi-array/Factory/MultiArrayFactory.php";
    require_once dirname(dirname(__FILE__))."/class/Jieba.php";
    require_once dirname(dirname(__FILE__))."/class/Finalseg.php";

    use Fukuball\Jieba\Jieba;
    use Fukuball\Jieba\Finalseg;

    // The big dictionary covers both Simplified and Traditional Chinese.
    Jieba::init(array('mode'=>'default','dict'=>'big'));
    Finalseg::init();

    $seg_list = Jieba::cut("怜香惜玉也得要看对象啊!");
    var_dump($seg_list);

    $seg_list = Jieba::cut("憐香惜玉也得要看對象啊!");
    var_dump($seg_list);
Output:
    array(7) {
      [0]=> string(12) "怜香惜玉"
      [1]=> string(3) "也"
      [2]=> string(3) "得"
      [3]=> string(3) "要"
      [4]=> string(3) "看"
      [5]=> string(6) "对象"
      [6]=> string(3) "啊"
    }
    array(7) {
      [0]=> string(12) "憐香惜玉"
      [1]=> string(3) "也"
      [2]=> string(3) "得"
      [3]=> string(3) "要"
      [4]=> string(3) "看"
      [5]=> string(6) "對象"
      [6]=> string(3) "啊"
    }
Function 6) Keep Japanese or Korean original text
Example (Tutorial)
    ini_set('memory_limit', '1024M');

    require_once dirname(dirname(__FILE__))."/vendor/multi-array/MultiArray.php";
    require_once dirname(dirname(__FILE__))."/vendor/multi-array/Factory/MultiArrayFactory.php";
    require_once dirname(dirname(__FILE__))."/class/Jieba.php";
    require_once dirname(dirname(__FILE__))."/class/Finalseg.php";

    use Fukuball\Jieba\Jieba;
    use Fukuball\Jieba\Finalseg;

    Jieba::init(array('cjk'=>'all'));
    Finalseg::init();

    $seg_list = Jieba::cut("한국어 또는 조선말은 제주특별자치도를 제외한 한반도 및 그 부속 도서와 한민족 거주 지역에서 쓰이는 언어로");
    var_dump($seg_list);

    $seg_list = Jieba::cut("日本語は、主に日本国内や日本人同士の間で使われている言語である。");
    var_dump($seg_list);

    // Loading a custom Japanese dictionary enables simple word segmentation
    Jieba::loadUserDict("/path/to/your/japanese/dict.txt");
    $seg_list = Jieba::cut("日本語は、主に日本国内や日本人同士の間で使われている言語である。");
    var_dump($seg_list);
Output:
    array(15) {
      [0]=> string(9) "한국어"
      [1]=> string(6) "또는"
      [2]=> string(12) "조선말은"
      [3]=> string(24) "제주특별자치도를"
      [4]=> string(9) "제외한"
      [5]=> string(9) "한반도"
      [6]=> string(3) "및"
      [7]=> string(3) "그"
      [8]=> string(6) "부속"
      [9]=> string(9) "도서와"
      [10]=> string(9) "한민족"
      [11]=> string(6) "거주"
      [12]=> string(12) "지역에서"
      [13]=> string(9) "쓰이는"
      [14]=> string(9) "언어로"
    }
    array(21) {
      [0]=> string(6) "日本"
      [1]=> string(3) "語"
      [2]=> string(3) "は"
      [3]=> string(3) "主"
      [4]=> string(3) "に"
      [5]=> string(6) "日本"
      [6]=> string(6) "国内"
      [7]=> string(3) "や"
      [8]=> string(6) "日本"
      [9]=> string(3) "人"
      [10]=> string(6) "同士"
      [11]=> string(3) "の"
      [12]=> string(3) "間"
      [13]=> string(3) "で"
      [14]=> string(3) "使"
      [15]=> string(3) "わ"
      [16]=> string(6) "れて"
      [17]=> string(6) "いる"
      [18]=> string(6) "言語"
      [19]=> string(3) "で"
      [20]=> string(6) "ある"
    }
    array(17) {
      [0]=> string(9) "日本語"
      [1]=> string(3) "は"
      [2]=> string(6) "主に"
      [3]=> string(9) "日本国"
      [4]=> string(3) "内"
      [5]=> string(3) "や"
      [6]=> string(9) "日本人"
      [7]=> string(6) "同士"
      [8]=> string(3) "の"
      [9]=> string(3) "間"
      [10]=> string(3) "で"
      [11]=> string(3) "使"
      [12]=> string(3) "わ"
      [13]=> string(6) "れて"
      [14]=> string(6) "いる"
      [15]=> string(6) "言語"
      [16]=> string(9) "である"
    }
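When input is mixed, it can help to know which scripts a string contains before choosing init options such as 'cjk'=>'all'. The sketch below is an illustration using PCRE Unicode script properties; it is not how jieba-php detects scripts internally, and `scriptsIn` is a hypothetical helper.

```php
<?php
// Hypothetical helper: report which CJK scripts appear in a UTF-8 string.
function scriptsIn(string $text): array
{
    $patterns = [
        'han'    => '/\p{Han}/u',
        'kana'   => '/[\p{Hiragana}\p{Katakana}]/u',
        'hangul' => '/\p{Hangul}/u',
    ];
    $found = [];
    foreach ($patterns as $name => $pattern) {
        if (preg_match($pattern, $text) === 1) {
            $found[] = $name;
        }
    }
    return $found;
}

print_r(scriptsIn('한국어 또는'));  // hangul
print_r(scriptsIn('日本語は'));     // han, kana
print_r(scriptsIn('我来到北京'));   // han
```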