"Jieba" (Chinese for "to stutter") Chinese text segmentation: built to be the best PHP Chinese word segmentation module.

"Jieba" (Chinese for "to stutter") Chinese text segmentation: built to be the best PHP Chinese word segmentation module, sentences using Chinese

"結巴"中文分詞:做最好的 PHP 中文分詞、中文斷詞組件,目前翻譯版本為 jieba-0.33 版本,未來再慢慢往上升級,效能也需要再改善,請有興趣的開發者一起加入開發!若想使用 Python 版本請前往 fxsjy/jieba

Traditional Chinese is now supported! Just switch the dictionary to "big" mode.

"Jieba" (Chinese for "to stutter") Chinese text segmentation: built to be the best PHP Chinese word segmentation module.

Scroll down for English documentation.

jieba-php Documentation

Online Demo

Features

  • Supports three segmentation modes:
    1. Accurate Mode: attempts to cut the sentence into the most accurate segmentation, which is suitable for text analysis.
    2. Full Mode: scans the sentence for all possible words; this is very fast, but it cannot resolve ambiguity.
    3. Search Engine Mode: based on Accurate Mode, it additionally cuts long words into several short words, which improves the recall rate for search engines.

Usage

  • Installation: Use composer to install jieba-php, then require the autoload file to use jieba-php.
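As a concrete sketch of that step (the Packagist package name and version constraint below are assumptions; check the project's repository for the exact command):

```shell
# Install jieba-php via composer (package name assumed to be fukuball/jieba-php)
composer require fukuball/jieba-php:dev-master
```

Afterwards, require vendor/autoload.php in your script to make the Fukuball\Jieba classes available.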

Algorithm

  • Based on a Trie tree structure for efficient word-graph scanning, a directed acyclic graph (DAG) is built for all possible word combinations of the Chinese characters in a sentence.
  • Memoized search is used to calculate the maximum-probability path, finding the most likely segmentation based on word-frequency combinations.
  • For unknown words, a character-position HMM-based model is used, applying the Viterbi algorithm.
  • The meaning of BEMS: https://github.com/fxsjy/jieba/issues/7.
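As an illustration of the maximum-probability step, here is a minimal, self-contained sketch; it is not jieba-php's actual implementation, and the bestPath function, the $dag structure, and the $logProb values are all invented for this example:

```php
<?php
// Hypothetical sketch: dynamic programming over a word DAG to pick the
// maximum-probability segmentation (illustrative, not jieba-php's code).
// $dag[$i] lists the end indices of dictionary words starting at char $i;
// $logProb maps words to made-up log frequencies.
function bestPath(array $chars, array $dag, array $logProb): array
{
    $n = count($chars);
    $route = [$n => [0.0, 0]]; // [best score from this position, chosen end index]
    for ($i = $n - 1; $i >= 0; $i--) { // work backwards from the sentence end
        $best = null;
        foreach ($dag[$i] as $end) {
            $word = implode('', array_slice($chars, $i, $end - $i + 1));
            $score = ($logProb[$word] ?? -20.0) + $route[$end + 1][0];
            if ($best === null || $score > $best[0]) {
                $best = [$score, $end];
            }
        }
        $route[$i] = $best;
    }
    // Follow the recorded choices forward to emit the words.
    $words = [];
    for ($i = 0; $i < $n; $i = $route[$i][1] + 1) {
        $words[] = implode('', array_slice($chars, $i, $route[$i][1] - $i + 1));
    }
    return $words;
}

$chars = ['来', '到', '北', '京'];
$dag = [0 => [0, 1], 1 => [1], 2 => [2, 3], 3 => [3]];
$logProb = ['来到' => -5.0, '北京' => -4.0, '来' => -8.0, '到' => -8.0, '北' => -9.0, '京' => -9.0];
var_dump(bestPath($chars, $dag, $logProb)); // ["来到", "北京"]
```

Walking right-to-left and memoizing the best score at each position keeps the search linear in the number of DAG edges instead of exponential in sentence length.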

Interface

  • The cut method accepts two parameters: 1) the string to segment; 2) the optional cut_all parameter, which controls the segmentation mode.
  • The string to segment must be a UTF-8 string.
  • cutForSearch accepts only one parameter: the string to segment; it cuts the sentence into short words.
  • cut and cutForSearch return an array of segmented words.

Function 1) Segmentation

Example (Tutorial)

ini_set('memory_limit', '1024M');
require_once "/path/to/your/vendor/multi-array/MultiArray.php";
require_once "/path/to/your/vendor/multi-array/Factory/MultiArrayFactory.php";
require_once "/path/to/your/class/Jieba.php";
require_once "/path/to/your/class/Finalseg.php";
use Fukuball\Jieba\Jieba;
use Fukuball\Jieba\Finalseg;
Jieba::init();
Finalseg::init();
$seg_list = Jieba::cut("怜香惜玉也得要看对象啊!");
var_dump($seg_list);
$seg_list = Jieba::cut("我来到北京清华大学", true);
var_dump($seg_list); // Full Mode
$seg_list = Jieba::cut("我来到北京清华大学", false);
var_dump($seg_list); // Accurate Mode (default)
$seg_list = Jieba::cut("他来到了网易杭研大厦");
var_dump($seg_list);
$seg_list = Jieba::cutForSearch("小明硕士毕业于中国科学院计算所,后在日本京都大学深造"); // Search Engine Mode
var_dump($seg_list);

Output:

array(7) {
 [0]=>
 string(12) "怜香惜玉"
 [1]=>
 string(3) "也"
 [2]=>
 string(3) "得"
 [3]=>
 string(3) "要"
 [4]=>
 string(3) "看"
 [5]=>
 string(6) "对象"
 [6]=>
 string(3) "啊"
}
Full Mode:
array(15) {
 [0]=>
 string(3) "我"
 [1]=>
 string(3) "来"
 [2]=>
 string(6) "来到"
 [3]=>
 string(3) "到"
 [4]=>
 string(3) "北"
 [5]=>
 string(6) "北京"
 [6]=>
 string(3) "京"
 [7]=>
 string(3) "清"
 [8]=>
 string(6) "清华"
 [9]=>
 string(12) "清华大学"
 [10]=>
 string(3) "华"
 [11]=>
 string(6) "华大"
 [12]=>
 string(3) "大"
 [13]=>
 string(6) "大学"
 [14]=>
 string(3) "学"
}
Default Mode:
array(4) {
 [0]=>
 string(3) "我"
 [1]=>
 string(6) "来到"
 [2]=>
 string(6) "北京"
 [3]=>
 string(12) "清华大学"
}
array(6) {
 [0]=>
 string(3) "他"
 [1]=>
 string(6) "来到"
 [2]=>
 string(3) "了"
 [3]=>
 string(6) "网易"
 [4]=>
 string(6) "杭研"
 [5]=>
 string(6) "大厦"
}
(Here, "杭研" is not in the dictionary, but the Viterbi algorithm still identified it.)
Search Engine Mode:
array(18) {
 [0]=>
 string(6) "小明"
 [1]=>
 string(6) "硕士"
 [2]=>
 string(6) "毕业"
 [3]=>
 string(3) "于"
 [4]=>
 string(6) "中国"
 [5]=>
 string(6) "科学"
 [6]=>
 string(6) "学院"
 [7]=>
 string(9) "科学院"
 [8]=>
 string(15) "中国科学院"
 [9]=>
 string(6) "计算"
 [10]=>
 string(9) "计算所"
 [11]=>
 string(3) "后"
 [12]=>
 string(3) "在"
 [13]=>
 string(6) "日本"
 [14]=>
 string(6) "京都"
 [15]=>
 string(6) "大学"
 [16]=>
 string(18) "日本京都大学"
 [17]=>
 string(6) "深造"
}

Function 2) Add a custom dictionary

  • Developers can specify their own custom dictionary to supplement the jieba dictionary. jieba can identify new words on its own, but adding your own new words ensures a higher rate of correct segmentation.

  • Usage: Jieba::loadUserDict($file_name) // $file_name is the path to the custom dictionary file.

  • The dictionary format is the same as that of dict.txt: one word per line, with each line divided into two parts separated by a space; the first part is the word itself, and the second is the word frequency.

  • Example:

    云计算 5
    李小福 2
    创新办 3

    Before: 李小福 / 是 / 创新 / 办 / 主任 / 也 / 是 / 云 / 计算 / 方面 / 的 / 专家 /

    After loading the custom dictionary: 李小福 / 是 / 创新办 / 主任 / 也 / 是 / 云计算 / 方面 / 的 / 专家 /
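Parsing that one-word-per-line format is straightforward; here is a minimal sketch (parseDictLine is a hypothetical helper, not part of jieba-php's API):

```php
<?php
// Hypothetical helper: parse one user-dictionary line ("word frequency").
function parseDictLine(string $line): ?array
{
    $parts = preg_split('/\s+/', trim($line));
    if (count($parts) < 2 || !ctype_digit($parts[1])) {
        return null; // skip blank or malformed lines
    }
    return ['word' => $parts[0], 'freq' => (int) $parts[1]];
}

var_dump(parseDictLine("云计算 5")); // ["word" => "云计算", "freq" => 5]
```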

Function 3) Keyword Extraction

  • JiebaAnalyse::extractTags($content, $top_k)
  • content: the text from which to extract keywords
  • top_k: the number of keywords with the highest TF-IDF weights to return; the default value is 20
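The top_k selection described above amounts to sorting candidates by weight and keeping the highest ones; here is a minimal sketch with invented weights (topK is a hypothetical helper, and the values are not real TF-IDF output):

```php
<?php
// Hypothetical sketch of top_k selection: keep the $topK entries with the
// highest weights, preserving word => weight keys.
function topK(array $weights, int $topK): array
{
    arsort($weights); // sort by weight, highest first, keeping keys
    return array_slice($weights, 0, $topK, true);
}

$weights = ['是否' => 1.22, '一般' => 1.0, '怯懦' => 0.45, '空洞' => 0.21];
var_dump(topK($weights, 2)); // ["是否" => 1.22, "一般" => 1.0]
```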

Example (keyword extraction)

ini_set('memory_limit', '600M');
require_once "/path/to/your/vendor/multi-array/MultiArray.php";
require_once "/path/to/your/vendor/multi-array/Factory/MultiArrayFactory.php";
require_once "/path/to/your/class/Jieba.php";
require_once "/path/to/your/class/Finalseg.php";
require_once "/path/to/your/class/JiebaAnalyse.php";
use Fukuball\Jieba\Jieba;
use Fukuball\Jieba\Finalseg;
use Fukuball\Jieba\JiebaAnalyse;
Jieba::init(array('mode'=>'test','dict'=>'small'));
Finalseg::init();
JiebaAnalyse::init();
$top_k = 10;
$content = file_get_contents("/path/to/your/dict/lyric.txt");
$tags = JiebaAnalyse::extractTags($content, $top_k);
var_dump($tags);

Output:

array(10) {
 ["是否"]=>
 float(1.2196321889395)
 ["一般"]=>
 float(1.0032459890209)
 ["肌迫"]=>
 float(0.64654314660465)
 ["怯懦"]=>
 float(0.44762844339349)
 ["藉口"]=>
 float(0.32327157330233)
 ["逼不得已"]=>
 float(0.32327157330233)
 ["不安全感"]=>
 float(0.26548304656279)
 ["同感"]=>
 float(0.23929673812326)
 ["有把握"]=>
 float(0.21043366018744)
 ["空洞"]=>
 float(0.20598261709442)
}

Function 4) Word Segmentation and Tagging

Example (word tagging)

ini_set('memory_limit', '600M');
require_once dirname(dirname(__FILE__))."/vendor/multi-array/MultiArray.php";
require_once dirname(dirname(__FILE__))."/vendor/multi-array/Factory/MultiArrayFactory.php";
require_once dirname(dirname(__FILE__))."/class/Jieba.php";
require_once dirname(dirname(__FILE__))."/class/Finalseg.php";
require_once dirname(dirname(__FILE__))."/class/Posseg.php";
use Fukuball\Jieba\Jieba;
use Fukuball\Jieba\Finalseg;
use Fukuball\Jieba\Posseg;
Jieba::init();
Finalseg::init();
Posseg::init();
$seg_list = Posseg::cut("这是一个伸手不见五指的黑夜。我叫孙悟空,我爱北京,我爱Python和C++。");
var_dump($seg_list);

Output:

array(21) {
 [0]=>
 array(2) {
 ["word"]=>
 string(3) "这"
 ["tag"]=>
 string(1) "r"
 }
 [1]=>
 array(2) {
 ["word"]=>
 string(3) "是"
 ["tag"]=>
 string(1) "v"
 }
 [2]=>
 array(2) {
 ["word"]=>
 string(6) "一个"
 ["tag"]=>
 string(1) "m"
 }
 [3]=>
 array(2) {
 ["word"]=>
 string(18) "伸手不见五指"
 ["tag"]=>
 string(1) "i"
 }
 [4]=>
 array(2) {
 ["word"]=>
 string(3) "的"
 ["tag"]=>
 string(2) "uj"
 }
 [5]=>
 array(2) {
 ["word"]=>
 string(6) "黑夜"
 ["tag"]=>
 string(1) "n"
 }
 [6]=>
 array(2) {
 ["word"]=>
 string(3) "。"
 ["tag"]=>
 string(1) "w"
 }
 [7]=>
 array(2) {
 ["word"]=>
 string(3) "我"
 ["tag"]=>
 string(1) "r"
 }
 [8]=>
 array(2) {
 ["word"]=>
 string(3) "叫"
 ["tag"]=>
 string(1) "v"
 }
 [9]=>
 array(2) {
 ["word"]=>
 string(9) "孙悟空"
 ["tag"]=>
 string(2) "nr"
 }
 [10]=>
 array(2) {
 ["word"]=>
 string(3) ","
 ["tag"]=>
 string(1) "w"
 }
 [11]=>
 array(2) {
 ["word"]=>
 string(3) "我"
 ["tag"]=>
 string(1) "r"
 }
 [12]=>
 array(2) {
 ["word"]=>
 string(3) "爱"
 ["tag"]=>
 string(1) "v"
 }
 [13]=>
 array(2) {
 ["word"]=>
 string(6) "北京"
 ["tag"]=>
 string(2) "ns"
 }
 [14]=>
 array(2) {
 ["word"]=>
 string(3) ","
 ["tag"]=>
 string(1) "w"
 }
 [15]=>
 array(2) {
 ["word"]=>
 string(3) "我"
 ["tag"]=>
 string(1) "r"
 }
 [16]=>
 array(2) {
 ["word"]=>
 string(3) "爱"
 ["tag"]=>
 string(1) "v"
 }
 [17]=>
 array(2) {
 ["word"]=>
 string(6) "Python"
 ["tag"]=>
 string(3) "eng"
 }
 [18]=>
 array(2) {
 ["word"]=>
 string(3) "和"
 ["tag"]=>
 string(1) "c"
 }
 [19]=>
 array(2) {
 ["word"]=>
 string(3) "C++"
 ["tag"]=>
 string(3) "eng"
 }
 [20]=>
 array(2) {
 ["word"]=>
 string(3) "。"
 ["tag"]=>
 string(1) "w"
 }
}

Function 5) Use Traditional Chinese

Example (Tutorial)

ini_set('memory_limit', '1024M');
require_once dirname(dirname(__FILE__))."/vendor/multi-array/MultiArray.php";
require_once dirname(dirname(__FILE__))."/vendor/multi-array/Factory/MultiArrayFactory.php";
require_once dirname(dirname(__FILE__))."/class/Jieba.php";
require_once dirname(dirname(__FILE__))."/class/Finalseg.php";
use Fukuball\Jieba\Jieba;
use Fukuball\Jieba\Finalseg;
Jieba::init(array('mode'=>'default','dict'=>'big'));
Finalseg::init();
$seg_list = Jieba::cut("怜香惜玉也得要看对象啊!");
var_dump($seg_list);
$seg_list = Jieba::cut("憐香惜玉也得要看對象啊!");
var_dump($seg_list);

Output:

array(7) {
 [0]=>
 string(12) "怜香惜玉"
 [1]=>
 string(3) "也"
 [2]=>
 string(3) "得"
 [3]=>
 string(3) "要"
 [4]=>
 string(3) "看"
 [5]=>
 string(6) "对象"
 [6]=>
 string(3) "啊"
}
array(7) {
 [0]=>
 string(12) "憐香惜玉"
 [1]=>
 string(3) "也"
 [2]=>
 string(3) "得"
 [3]=>
 string(3) "要"
 [4]=>
 string(3) "看"
 [5]=>
 string(6) "對象"
 [6]=>
 string(3) "啊"
}

Function 6) Keep Japanese or Korean text intact

Example (Tutorial)

ini_set('memory_limit', '1024M');
require_once dirname(dirname(__FILE__))."/vendor/multi-array/MultiArray.php";
require_once dirname(dirname(__FILE__))."/vendor/multi-array/Factory/MultiArrayFactory.php";
require_once dirname(dirname(__FILE__))."/class/Jieba.php";
require_once dirname(dirname(__FILE__))."/class/Finalseg.php";
use Fukuball\Jieba\Jieba;
use Fukuball\Jieba\Finalseg;
Jieba::init(array('cjk'=>'all'));
Finalseg::init();
$seg_list = Jieba::cut("한국어 또는 조선말은 제주특별자치도를 제외한 한반도 및 그 부속 도서와 한민족 거주 지역에서 쓰이는 언어로");
var_dump($seg_list);
$seg_list = Jieba::cut("日本語は、主に日本国内や日本人同士の間で使われている言語である。");
var_dump($seg_list);
// Loading a custom Japanese dictionary enables simple Japanese word segmentation
Jieba::loadUserDict("/path/to/your/japanese/dict.txt");
$seg_list = Jieba::cut("日本語は、主に日本国内や日本人同士の間で使われている言語である。");
var_dump($seg_list);

Output:

array(15) {
 [0]=>
 string(9) "한국어"
 [1]=>
 string(6) "또는"
 [2]=>
 string(12) "조선말은"
 [3]=>
 string(24) "제주특별자치도를"
 [4]=>
 string(9) "제외한"
 [5]=>
 string(9) "한반도"
 [6]=>
 string(3) "및"
 [7]=>
 string(3) "그"
 [8]=>
 string(6) "부속"
 [9]=>
 string(9) "도서와"
 [10]=>
 string(9) "한민족"
 [11]=>
 string(6) "거주"
 [12]=>
 string(12) "지역에서"
 [13]=>
 string(9) "쓰이는"
 [14]=>
 string(9) "언어로"
}
array(21) {
 [0]=>
 string(6) "日本"
 [1]=>
 string(3) "語"
 [2]=>
 string(3) "は"
 [3]=>
 string(3) "主"
 [4]=>
 string(3) "に"
 [5]=>
 string(6) "日本"
 [6]=>
 string(6) "国内"
 [7]=>
 string(3) "や"
 [8]=>
 string(6) "日本"
 [9]=>
 string(3) "人"
 [10]=>
 string(6) "同士"
 [11]=>
 string(3) "の"
 [12]=>
 string(3) "間"
 [13]=>
 string(3) "で"
 [14]=>
 string(3) "使"
 [15]=>
 string(3) "わ"
 [16]=>
 string(6) "れて"
 [17]=>
 string(6) "いる"
 [18]=>
 string(6) "言語"
 [19]=>
 string(3) "で"
 [20]=>
 string(6) "ある"
}
array(17) {
 [0]=>
 string(9) "日本語"
 [1]=>
 string(3) "は"
 [2]=>
 string(6) "主に"
 [3]=>
 string(9) "日本国"
 [4]=>
 string(3) "内"
 [5]=>
 string(3) "や"
 [6]=>
 string(9) "日本人"
 [7]=>
 string(6) "同士"
 [8]=>
 string(3) "の"
 [9]=>
 string(3) "間"
 [10]=>
 string(3) "で"
 [11]=>
 string(3) "使"
 [12]=>
 string(3) "わ"
 [13]=>
 string(6) "れて"
 [14]=>
 string(6) "いる"
 [15]=>
 string(6) "言語"
 [16]=>
 string(9) "である"
}

GitHub
