Installing the elasticsearch-analysis-hanlp word segmentation plugin on Elasticsearch 6.5.1

Install Elasticsearch 6.5.1

docker run -p 9200:9200 -p 9300:9300 -e "http.host=0.0.0.0" -e "http.cors.enabled=true" -e "http.cors.allow-origin=*" -d docker.elastic.co/elasticsearch/elasticsearch:6.5.1
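
Once the container is up, confirm the node is reachable (assuming the default port mapping above):

curl http://localhost:9200

The response should be a JSON document reporting version 6.5.1.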

Plugin source: https://github.com/KennFalcon/elasticsearch-analysis-hanlp

Installation

Enter the container
docker exec -it <container name> /bin/bash

Install JDK 1.8
yum install java-1.8.0-openjdk* -y
java -version

Install elasticsearch-analysis-hanlp
./bin/elasticsearch-plugin install https://github.com/KennFalcon/elasticsearch-analysis-hanlp/releases/download/v6.5.1/elasticsearch-analysis-hanlp-6.5.1.zip
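
To verify the plugin landed, list the installed plugins (elasticsearch-plugin ships with Elasticsearch):

./bin/elasticsearch-plugin list

The HanLP plugin should appear in the output.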

Download the HanLP data package

The HanLP data releases are listed at https://github.com/hankcs/HanLP/releases; download data-for-1.7.0.zip.
Remove the plugin's bundled base data, then fetch and unpack the full package:
cd /usr/share/elasticsearch/plugins/analysis-hanlp
rm -rf data
wget http://hanlp.linrunsoft.com/release/data-for-1.7.0.zip
unzip data-for-1.7.0.zip
chmod -R 777 data/
rm -rf data-for-1.7.0.zip
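A quick sanity check of the unpacked layout; the data package is expected to contain at least dictionary and model subdirectories:

ls data/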
Rename the Chinese-named dictionary files to the ASCII names that hanlp.properties already uses in its CustomDictionaryPath setting
cd /usr/share/elasticsearch/plugins/analysis-hanlp/data/dictionary/custom
mv \#U4eba#U540d#U8bcd#U5178.txt PersonalName.txt
mv \#U5168#U56fd#U5730#U540d#U5927#U5168.txt ChinesePlaceName.txt
mv \#U673a#U6784#U540d#U8bcd#U5178.txt OrganizationName.txt
mv \#U73b0#U4ee3#U6c49#U8bed#U8865#U5145#U8bcd#U5e93.txt ModernChineseSupplementaryWord.txt
mv \#U4e0a#U6d77#U5730#U540d.txt ShanghaiPlaceName.txt
Change the HanLP root path to an absolute path
vi /usr/share/elasticsearch/config/analysis-hanlp/hanlp.properties
root=/usr/share/elasticsearch/plugins/analysis-hanlp/

Restart the container
docker restart <container name>
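
Tail the container logs to confirm Elasticsearch comes back up cleanly and the plugin loads without errors (the exact messages depend on the logging configuration):

docker logs -f <container name>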

Add a custom dictionary with hot reloading

cd /usr/share/elasticsearch/plugins/analysis-hanlp/data/dictionary/custom
vi Test.txt
西双版纳的姑娘
我爱蓝天
我爱
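
Each line of a HanLP custom dictionary holds one word; a part-of-speech tag and a frequency may optionally follow, separated by spaces. A hypothetical entry tagged with the ns (place name) part of speech:

西双版纳的姑娘 ns 1024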
vi /usr/share/elasticsearch/config/analysis-hanlp/hanlp.properties
Append Test.txt ns; after ShanghaiPlaceName.txt:
# Custom dictionary path
CustomDictionaryPath=data/dictionary/custom/CustomDictionary.txt; ModernChineseSupplementaryWord.txt; ChinesePlaceName.txt ns; PersonalName.txt; OrganizationName.txt; ShanghaiPlaceName.txt; Test.txt ns;data/dictionary/person/nrf.txt nrf;
Wait about one minute; the dictionary is then loaded automatically.
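
Once the reload has run, the new entries should come back as single tokens. A quick check against the _analyze API (assuming the port mapping above):

curl -X POST "http://localhost:9200/_analyze" -H 'Content-Type: application/json' -d'
{
  "text": "西双版纳的姑娘",
  "tokenizer": "hanlp"
}'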

Test

POST http://localhost:9200/twitter2/_analyze
{
  "text": "美国阿拉斯加州发生8.0级地震",
  "tokenizer": "hanlp_index"
}
{
  "tokens" : [
    {
      "token" : "美国",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "nsf",
      "position" : 0
    },
    {
      "token" : "阿拉斯加州",
      "start_offset" : 0,
      "end_offset" : 5,
      "type" : "nsf",
      "position" : 1
    },
    {
      "token" : "发生",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "v",
      "position" : 2
    },
    {
      "token" : "8.0",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "m",
      "position" : 3
    },
    {
      "token" : "级",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "q",
      "position" : 4
    },
    {
      "token" : "地震",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "n",
      "position" : 5
    }
  ]
}
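
The same request from the shell, for reference (the twitter2 index must already exist, since index-level _analyze resolves against it):

curl -X POST "http://localhost:9200/twitter2/_analyze" -H 'Content-Type: application/json' -d'
{
  "text": "美国阿拉斯加州发生8.0级地震",
  "tokenizer": "hanlp_index"
}'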

Provided segmentation modes

hanlp: HanLP default segmentation

hanlp_standard: standard segmentation

hanlp_index: index segmentation

hanlp_nlp: NLP segmentation

hanlp_n_short: N-shortest-path segmentation

hanlp_dijkstra: shortest-path segmentation (Dijkstra)

hanlp_crf: CRF segmentation (deprecated since HanLP 1.6.6)

hanlp_speed: extreme-speed dictionary segmentation
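
Any of these analyzers can be wired into an index mapping. A minimal sketch for Elasticsearch 6.x, assuming a hypothetical index my_index with a single text field; indexing with hanlp_index and searching with hanlp is a common split:

curl -X PUT "http://localhost:9200/my_index" -H 'Content-Type: application/json' -d'
{
  "mappings": {
    "_doc": {
      "properties": {
        "content": {
          "type": "text",
          "analyzer": "hanlp_index",
          "search_analyzer": "hanlp"
        }
      }
    }
  }
}'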

