• BERT NER in practice

    2021-04-22 13:53:07

0. Competition overview

This project comes from a Kaggle NER competition: competition link

The pipeline and code here are adapted from:

    1. https://www.kaggle.com/tungmphung/coleridge-matching-bert-ner?select=kaggle_run_ner.py
    2. https://www.kaggle.com/tungmphung/pytorch-bert-for-named-entity-recognition

1. BERT NER fine-tuning

Data preparation

First, the data needs to be converted into the JSON format used for NER.

Raw data

    train.csv

0007f880-0a9b-492d-9a58-76eb0b0e0bd7.json (one of the papers)

Since Ids repeat in train.csv, first group rows with the same Id into a single row:

import pandas as pd

train = pd.read_csv('train.csv')  # the competition's train.csv (adjust the path to your setup)

train = train.groupby('Id').agg({
    'pub_title': 'first',
    'dataset_title': '|'.join,
    'dataset_label': '|'.join,
    'cleaned_label': '|'.join
}).reset_index()
    
    print(f'No. grouped training rows: {len(train)}')
    

    No. grouped training rows: 14316
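The conversion code in the next section also assumes a dict papers that maps each Id to its parsed article JSON (a list of sections with section_title and text fields). The post does not show how it is built; a minimal loader along those lines, with the file layout assumed from the competition data, might look like this:

    import glob
    import json
    import os

    # hypothetical loader: one JSON file per paper, named <Id>.json (adjust the directory to your setup)
    papers = {}
    for path in glob.glob('train/*.json'):
        paper_id = os.path.splitext(os.path.basename(path))[0]
        with open(path) as f:
            papers[paper_id] = json.load(f)  # list of {'section_title': ..., 'text': ...} dicts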

Data conversion

Straight to the code:

import random
from tqdm import tqdm

# clean_training_text and shorten_sentences are helper functions from the referenced
# notebooks; `papers` maps each Id to its parsed article JSON (see the loader sketch above)

cnt_pos, cnt_neg = 0, 0  # number of sentences that do / do not contain labels
ner_data = []

pbar = tqdm(total=len(train))
    for i, id, dataset_label in train[['Id', 'dataset_label']].itertuples():
        # paper
        paper = papers[id]
        
        # labels
        labels = dataset_label.split('|')
        labels = [clean_training_text(label) for label in labels]
        
        # sentences
        sentences = set([clean_training_text(sentence) for section in paper 
                     for sentence in section['text'].split('.') 
                    ])
        sentences = shorten_sentences(sentences) # make sentences short
        sentences = [sentence for sentence in sentences if len(sentence) > 10] # only accept sentences with length > 10 chars
        
        # positive sample
        for sentence in sentences:
            is_positive, tags = tag_sentence(sentence, labels)
            if is_positive:
                cnt_pos += 1
                ner_data.append(tags)
            elif any(word in sentence.lower() for word in ['data', 'study']): 
                ner_data.append(tags)
                cnt_neg += 1
        
        # process bar
        pbar.update(1)
        pbar.set_description(f"Training data size: {cnt_pos} positives + {cnt_neg} negatives")
    
    # shuffling
    random.shuffle(ner_data)
    

The first half of the code mainly cleans the data, e.g. dropping short sentences. The important part is the tag_sentence step that follows, which is essentially string matching, using BIO tags. The implementation is as follows:

import re

def tag_sentence(sentence, labels): # requirement: both sentence and labels are already cleaned
        sentence_words = sentence.split()
        
        if labels is not None and any(re.findall(f'\\b{label}\\b', sentence)
                                      for label in labels): # positive sample
            nes = ['O'] * len(sentence_words)
            for label in labels:
                label_words = label.split()
    
                all_pos = find_sublist(sentence_words, label_words)
                for pos in all_pos:
                    nes[pos] = 'B'
                    for i in range(pos+1, pos+len(label_words)):
                        nes[i] = 'I'
    
            return True, list(zip(sentence_words, nes))
            
        else: # negative sample
            nes = ['O'] * len(sentence_words)
            return False, list(zip(sentence_words, nes))
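The helper find_sublist also comes from the referenced notebooks and is not reproduced in the post; as an illustration, a minimal version consistent with how it is used above might be (a sketch, not the original implementation):

    def find_sublist(big_list, small_list):
        # return the start index of every occurrence of small_list inside big_list
        positions = []
        for start in range(len(big_list) - len(small_list) + 1):
            if big_list[start:start + len(small_list)] == small_list:
                positions.append(start)
        return positions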
    

This produces the trainable data file train_ner.json.
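The post does not show the step that writes ner_data out; a minimal sketch of turning the (word, tag) pairs into the tokens/tags JSON-lines format shown below (the file name comes from the post, the writer itself is an assumption) could be:

    import json

    # each element of ner_data is a list of (word, tag) pairs produced by tag_sentence
    with open('train_ner.json', 'w') as f:
        for pairs in ner_data:
            words, tags = zip(*pairs)
            f.write(json.dumps({'tokens': list(words), 'tags': list(tags)}) + '\n')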

Sample lines

    {"tokens": ["Ongoing", "projects", "managing", "flowthrough", "water", "data", "include", "SAMOS", "the", "U"], "tags": ["O", "O", "O", "O", "O", "O", "O", "O", "O", "O"]}
    {"tokens": ["The", "numbers", "and", "percentages", "from", "which", "the", "figures", "are", "drawn", "are", "contained", "in", "a", "set", "The", "Survey", "of", "Earned", "Doctorates", "collects", "information", "on", "research", "doctorates", "only"], "tags": ["O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "B", "I", "I", "I", "O", "O", "O", "O", "O", "O"]}
    

Model training

For training you can directly use run_ner.py from the huggingface GitHub repo.

It can be run straight from the command line:

    !python ../input/kaggle-ner-utils/kaggle_run_ner.py \
    --model_name_or_path 'bert-base-cased' \
    --train_file './train_ner.json' \
    --validation_file './train_ner.json' \
    --num_train_epochs 1 \
    --per_device_train_batch_size 8 \
    --per_device_eval_batch_size 8 \
    --save_steps 15000 \
    --output_dir './output' \
    --report_to 'none' \
    --seed 43 \
    --do_train 
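After training, run_ner.py saves the fine-tuned model to ./output. As a quick sanity check (this snippet is not part of the original post; the model path and aggregation_strategy are assumptions), the transformers token-classification pipeline can be pointed at it:

    from transformers import pipeline

    # load the checkpoint written by run_ner.py into --output_dir
    ner = pipeline('token-classification', model='./output', aggregation_strategy='simple')
    print(ner('The Survey of Earned Doctorates collects information on research doctorates.'))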
    
• NLP: BERT NER entity recognition deployment notes

2020-05-25 17:31:01

I have been meaning to write up NLP-related algorithms for a while, but I keep putting it off...

1. Environment setup

    Python >= 3.5

    Tensorflow >= 1.10

# my setup: python==3.6
    # tensorflow==1.13.1
    pip3 install --upgrade tensorflow==1.13.1

Installing bert-base:

# recommended
pip3 install bert-base==0.0.9 -i https://pypi.python.org/simple
# or install from source
    git clone https://github.com/macanv/BERT-BiLSTM-CRF-NER
    cd BERT-BiLSTM-CRF-NER/
    python3 setup.py install

2. Model download

I have put together a copy here: https://pan.baidu.com/s/1xrlxgAZesEaBv3BDRqRpJA  password: y4j1

Directory structure (shown as a screenshot in the original post):

Or find the download links on GitHub: https://github.com/macanv/BERT-BiLSTM-CRF-NER

3. Usage

3.1 Starting the bert-base service:

modePath=/Users/Davide/resource_test # the directory where your models are stored
bert-base-serving-start -model_dir $modePath/Bert_NER -bert_model_dir $modePath/chinese_L-12_H-768_A-12 -model_pb_dir $modePath/Bert_NER -mode NER -port=7777 -port_out=7778
# note: I set the ports to 7777 and 7778 here

3.2 Entity recognition:

    import time
    from bert_base.client import BertClient
# note: change the default ip and ports here
    with BertClient(ip='localhost', port=7777, port_out=7778,show_server_config=False, check_version=False, check_length=False, mode='NER') as bc:
        start_t = time.perf_counter()
        str_list = ['特朗普团队和特朗普基金会','中英法俄会谈。']
        rst_list = bc.encode(str_list)
        for sent,rst in zip(str_list,rst_list):
            for word,ner in zip(list(sent),rst):
                print(word,ner)
            print()
        print(time.perf_counter() - start_t)

Output:

Always check that the ip and ports are correct, otherwise you will get no results! The defaults are 5555 and 5556.

     

References:

    https://github.com/macanv/BERT-BiLSTM-CRF-NER
    https://github.com/google-research/bert
    https://github.com/kyzhouhzau/BERT-NER
    https://github.com/zjy-ucas/ChineseNER


• Using BERT NER on a GPU

    2020-11-12 09:29:04

Check GPU support

# check GPU support
    nvidia-smi
    Wed Nov 11 09:28:37 2020
    +-----------------------------------------------------------------------------+
    | NVIDIA-SMI 418.67       Driver Version: 418.67       CUDA Version: 10.1     |
    |-------------------------------+----------------------+----------------------+
    | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
    |===============================+======================+======================|
    |   0  Tesla T4            Off  | 00000000:21:01.0 Off |                    0 |
    | N/A   56C    P0    28W /  70W |      0MiB / 15079MiB |      6%      Default |
    +-------------------------------+----------------------+----------------------+
    
    
    +-----------------------------------------------------------------------------+
    | Processes:                                                       GPU Memory |
    |  GPU       PID   Type   Process name                             Usage      |
    |=============================================================================|
    |  No running processes found                                                 |
+-----------------------------------------------------------------------------+
    

TensorFlow / CUDA version compatibility


Setting up the BERT environment

# set up the bert environment
    conda create --name leon python=3.6 pip tensorflow-gpu==1.11.0 numpy pandas
    
# this will install cuda 9 and cuDNN 7
    cudatoolkit:         9.2-0                     https://mirrors.ustc.edu.cn/anaconda/pkgs/main
    cudnn:               7.6.5-cuda9.2_0           https://mirrors.ustc.edu.cn/anaconda/pkgs/main
    tensorboard:         1.11.0-py36hf484d3e_0     https://mirrors.ustc.edu.cn/anaconda/pkgs/main
    tensorflow:          1.11.0-gpu_py36h9c9050a_0 https://mirrors.ustc.edu.cn/anaconda/pkgs/main
    tensorflow-base:     1.11.0-gpu_py36had579c0_0 https://mirrors.ustc.edu.cn/anaconda/pkgs/main
    tensorflow-gpu:      1.11.0-h0d30ee6_0         https://mirrors.ustc.edu.cn/anaconda/pkgs/main
    
# activate the environment
    source activate leon
    
    # Note that the server MUST be running on Python >= 3.5 with Tensorflow >= 1.10 (one-point-ten) 
    
    pip install bert-serving-server  # server
    pip install bert-serving-client  # client, independent of `bert-serving-server`
# note: the server does not support Python 2!
    
    pip install keras_preprocessing
    pip install keras_applications
    pip install h5py==2.8.0
    
    pip install gas
    
    pip install bert-base==0.9.0
    

Starting the BERT server

Pre-trained model download link:
chinese_L-12_H-768_A-12

# start the bert server
    bert-base-serving-start -bert_model_dir /home/Ner/chinese_L-12_H-768_A-12 -model_pb_dir /home/Ner/model_pb_dir -model_dir /home/Ner/model_pb_dir   -mode NER  -num_worker=1
    
    bert_model_dir = /home/Ner/chinese_L-12_H-768_A-12
               ckpt_name = bert_model.ckpt
             config_name = bert_config.json
                    cors = *
                     cpu = False
              device_map = []
                    fp16 = False
     gpu_memory_fraction = 0.5
        http_max_connect = 10
               http_port = None
            mask_cls_sep = False
          max_batch_size = 1024
             max_seq_len = 128
                    mode = NER
               model_dir = /home/Ner/model_pb_dir
            model_pb_dir = /home/Ner/model_pb_dir
              num_worker = 1
           pooling_layer = [-2]
        pooling_strategy = REDUCE_MEAN
                    port = 5555
                port_out = 5556
           prefetch_size = 10
     priority_batch_size = 16
         tuned_model_dir = None
                 verbose = False
                     xla = False
    
    
    I:VENTILATOR:[__i:__i: 91]:lodding ner model, could take a while...
    pb_file exits /home/Ner/model_pb_dir/ner_model.pb
    I:VENTILATOR:[__i:__i:100]:optimized graph is stored at: /home/Ner/model_pb_dir/ner_model.pb
    I:VENTILATOR:[__i:_ru:148]:bind all sockets
    I:VENTILATOR:[__i:_ru:153]:open 8 ventilator-worker sockets, ipc://tmpUp04dl/socket,ipc://tmpG9e8EV/socket,ipc://tmpG51b6v/socket,ipc://tmpMLsgx6/socket,ipc://tmpKBelYG/socket,ipc://tmpExAqph/socket,ipc://tmp0zzwQR/socket,ipc://tmpIeRChs/socket
    I:VENTILATOR:[__i:_ru:157]:start the sink
    I:SINK:[__i:_ru:317]:ready
    I:VENTILATOR:[__i:_ge:239]:get devices
    I:VENTILATOR:[__i:_ge:271]:device map:
            worker  0 -> gpu  0
    I:WORKER-0:[__i:_ru:497]:use device gpu: 0, load graph from /home/Ner/model_pb_dir/ner_model.pb
    
    
    # Use
    from bert_serving.client import BertClient
    bc = BertClient()
    bc.encode(['First do it', 'then do it right', 'then do it better'])
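The encode call above is the generic bert-serving client and returns sentence embeddings. Since this server was started with -mode NER, entity tagging goes through the bert_base client instead, as in the deployment notes earlier on this page; a sketch using the default ports 5555/5556 printed in the startup log above:

    from bert_base.client import BertClient

    # NER-mode client; ports are the defaults printed in the server startup log
    with BertClient(ip='localhost', port=5555, port_out=5556, show_server_config=False,
                    check_version=False, check_length=False, mode='NER') as bc:
        rst = bc.encode(['特朗普团队和特朗普基金会'])
        print(rst)  # per-character NER tags for each input sentence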
    
    
    

Reference

Dr. Han Xiao's bert-as-service

• A roundup of BERT papers, articles, and code resources

A roundup of BERT papers, articles, and code resources

BERT has been extremely hot recently, so riding the wave, here is a roundup of related resources, including papers, code, and explanatory articles.

1. Official Google resources:

    1) BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

It all started with this paper, which Google dropped in October and which instantly set the entire AI community, self-media included, on fire: https://arxiv.org/abs/1810.04805

    2) Github: https://github.com/google-research/bert

In November Google released the code and pre-trained models, triggering another round of collective excitement.

    3) Google AI Blog: Open Sourcing BERT: State-of-the-Art Pre-training for Natural Language Processing

2. Third-party explainers:
1) Dr. Zhang Junlin's explainer, Zhihu column: From Word Embedding to BERT - the evolution of pre-training techniques in NLP

We reposted this article, along with the slides Dr. Zhang Junlin shared, on the AINLP WeChat public account; feel free to follow it:

2) Zhihu: How do you evaluate the BERT model?

3) [NLP] Google BERT explained in detail

4) [NLP] An in-depth analysis of Google's BERT model

    5) BERT Explained: State of the art language model for NLP

6) An introduction to BERT

7) Paper walkthrough: the BERT model and fine-tuning

8) A detailed reading of BERT, NLP's breakthrough result

9) Practical tips | The ultimate BERT fine-tuning tutorial: 奇点智能's hands-on BERT tutorial, training a 79+ model on the AI Challenger 2018 reading comprehension task.

10) [BERT in detail] "Dissecting BERT" by Miguel Romero Calvo
    Dissecting BERT Part 1: The Encoder
    Understanding BERT Part 2: BERT Specifics
    Dissecting BERT Appendix: The Decoder

11) BERT+BiLSTM-CRF-NER for NER tagging

12) AI empowering law | A brief discussion of applying Google's BERT, NLP's strongest model, in the intelligent justice domain

3. Third-party code:

1) pytorch-pretrained-BERT: https://github.com/huggingface/pytorch-pretrained-BERT
The PyTorch BERT implementation recommended by Google, which can load Google's pre-trained models: PyTorch version of Google AI's BERT model with script to load Google's pre-trained models

    2) BERT-pytorch: https://github.com/codertimo/BERT-pytorch
Another PyTorch implementation: Google AI 2018 BERT pytorch implementation

    3) BERT-tensorflow: https://github.com/guotong1988/BERT-tensorflow
A TensorFlow version: BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

    4) bert-chainer: https://github.com/soskek/bert-chainer
A Chainer version: Chainer implementation of "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding"

    5) bert-as-service: https://github.com/hanxiao/bert-as-service
Encodes sentences of different lengths with a pre-trained BERT model, mapping each to a fixed-length vector: Mapping a variable-length sentence to a fixed-length vector using pretrained BERT model
This one is quite interesting; going one small step further, could it be turned into a sentence-similarity service? Anyone want to give it a try?

    6) bert_language_understanding: https://github.com/brightmart/bert_language_understanding
BERT in practice: Pre-training of Deep Bidirectional Transformers for Language Understanding: pre-train TextCNN

    7) sentiment_analysis_fine_grain: https://github.com/brightmart/sentiment_analysis_fine_grain
BERT in practice, multi-label text classification, tried on the AI Challenger 2018 fine-grained sentiment analysis task: Multi-label Classification with BERT; Fine Grained Sentiment Analysis from AI challenger

    8) BERT-NER: https://github.com/kyzhouhzau/BERT-NER
BERT in practice, named entity recognition: Use google BERT to do CoNLL-2003 NER!

    9) BERT-keras: https://github.com/Separius/BERT-keras
A Keras version: Keras implementation of BERT with pre-trained weights

    10) tbert: https://github.com/innodatalabs/tbert
    PyTorch port of BERT ML model

    11) BERT-Classification-Tutorial: https://github.com/Socialbird-AILab/BERT-Classification-Tutorial

12) BERT-BiLSTM-CRF-NER: https://github.com/macanv/BERT-BiLSTM-CRF-NER
    Tensorflow solution of NER task Using BiLSTM-CRF model with Google BERT Fine-tuning

    13) bert-Chinese-classification-task
BERT Chinese text classification in practice

14) bert-chinese-ner: https://github.com/ProHiryu/bert-chinese-ner
Chinese NER using the pre-trained language model BERT

15) BERT-BiLSTM-CRF-NER
    Tensorflow solution of NER task Using BiLSTM-CRF model with Google BERT Fine-tuning

    16) bert-sequence-tagging: https://github.com/zhpmatrix/bert-sequence-tagging
BERT-based Chinese sequence tagging

Reposted from: https://www.cnblogs.com/jfdwd/p/11232715.html

  • https://zhuanlan.zhihu.com/p/295248694
  • BERT-NER Version 2 Use Google's BERT for named entity recognition (CoNLL-2003 as the dataset). The original version (see old_version for more detail) contains some hard codes and lacks ...
  • Chinese NER using Bert BERT for Chinese NER. dataset list cner: datasets/cner CLUENER: https://github.com/CLUEbenchmark/CLUENER model list BERT+Softmax BERT+CRF BERT+Span requirement 1.1.0 =&...
  • BERT_NER

    2021-02-22 12:06:00
    BERT_NER
• BERT-NER-Pytorch: Chinese NER experiments with BERT in three different modes
• This repository contains a solution to the NER task based on PyTorch, following the paper by Jacob Devlin, Ming-Wei Chang, Kenton Lee and Kristina Toutanova. The implementation can load any pre-trained TensorFlow...
• BERT Chinese - preface ... python BERT_NER.py --data_dir=data/ --bert_config_file=checkpoint/bert_config.json --init_checkpoint=checkpoint/bert_model.ckpt --vocab_file=vocab.txt --output_d
• Use Google BERT for CoNLL-2003 NER! Train the model with Python and run inference with C++. Requirements: python3, pip3 install -r requirements.txt. Run: python run_ner.py --data_dir=data/ --bert_model=bert-base-cased --task_...
• Chinese NER using BERT: BERT for Chinese NER. Dataset list: cner (datasets/cner), CLUENER. Model list: BERT+Softmax, BERT+CRF, BERT+Span. Requirements: 1.1.0 <= PyTorch < 1.5.0, cuda=9.0, python 3.6+. Input format ...
• BERT-NER Version 2: named entity recognition with Google's BERT (CoNLL-2003 as the dataset). The original version (see old_version for more detail) contains some hard-coded values and lacks the corresponding comments, which makes it hard to follow. Therefore, in this updated version, ...
  • bert_ner_punc

    2020-06-26 23:43:08
Combining BERT and entity recognition (1). Advantages: combining multi-task learning with adversarial training allows task-invariant information to be learned from the extra entity recognition task. The entity recognition task is used as an auxiliary task via multi-task learning to further improve the performance of the punctuation prediction task. An adversarial loss is used to prevent the shared...
• TensorFlow 2.0 data preprocessing for named entity recognition (1)
