  • BERT code

    2019-11-06 16:18:52


    Around 11 pm Beijing time on October 31, the official BERT code was released:

    https://github.com/google-research/bert

    For a walkthrough of the paper and of the PyTorch version, see https://daiwk.github.io/posts/nlp-bert.html

    See also the Synced (机器之心) write-up "Google finally open-sources the BERT code" for a full analysis.

    Repository layout:

    `-- bert
        |-- CONTRIBUTING.md
        |-- create_pretraining_data.py
        |-- extract_features.py
        |-- __init__.py
        |-- LICENSE
        |-- modeling.py
        |-- modeling_test.py
        |-- optimization.py
        |-- optimization_test.py
        |-- README.md
        |-- run_classifier.py
        |-- run_pretraining.py
        |-- run_squad.py
        |-- sample_text.txt
        |-- tokenization.py
        `-- tokenization_test.py
    
    1 directory, 16 files
    

    pretrained model

    Several variants are available (cased vs. uncased refers to whether case is preserved before WordPiece tokenization; uncased lowercases all text first):

    • BERT-Base, Uncased: 12-layer, 768-hidden, 12-heads, 110M parameters
    • BERT-Large, Uncased: 24-layer, 1024-hidden, 16-heads, 340M parameters
    • BERT-Base, Cased: 12-layer, 768-hidden, 12-heads, 110M parameters
    • BERT-Large, Cased: 24-layer, 1024-hidden, 16-heads, 340M parameters (Not available yet. Needs to be re-generated).
    • BERT-Base, Multilingual: 102 languages, 12-layer, 768-hidden, 12-heads, 110M parameters
    • BERT-Base, Chinese: Chinese Simplified and Traditional, 12-layer, 768-hidden, 12-heads, 110M parameters

    Each zip contains the following three files (a small sketch for inspecting them follows the example below):

    • A TensorFlow checkpoint (bert_model.ckpt) holding the pre-trained weights (actually three files sharing that prefix)
    • A vocab file (vocab.txt) mapping WordPiece tokens to word ids
    • A config file (bert_config.json) storing the model hyperparameters

    For example:

    uncased_L-12_H-768_A-12
    |-- bert_config.json
    |-- bert_model.ckpt.data-00000-of-00001
    |-- bert_model.ckpt.index
    |-- bert_model.ckpt.meta
    |-- checkpoint
    `-- vocab.txt
    
    0 directories, 6 files
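
    To sanity-check a downloaded model, a minimal sketch along these lines can be used (assuming TensorFlow 1.x and the google-research/bert repo on the PYTHONPATH; the directory path is a placeholder):

    # Sketch: inspect a downloaded BERT checkpoint.
    import tensorflow as tf
    import modeling      # from google-research/bert
    import tokenization  # from google-research/bert

    BERT_BASE_DIR = "/path/to/uncased_L-12_H-768_A-12"  # placeholder path

    # bert_config.json holds the hyperparameters.
    config = modeling.BertConfig.from_json_file(BERT_BASE_DIR + "/bert_config.json")
    print(config.num_hidden_layers, config.hidden_size, config.num_attention_heads)

    # vocab.txt maps WordPiece tokens to ids; do_lower_case must match the model (uncased -> True).
    tokenizer = tokenization.FullTokenizer(
        vocab_file=BERT_BASE_DIR + "/vocab.txt", do_lower_case=True)
    tokens = tokenizer.tokenize("Jacksonville is nice")
    print(tokens, tokenizer.convert_tokens_to_ids(tokens))

    # The checkpoint triple (.data / .index / .meta) is addressed by its common prefix.
    for name, shape in tf.train.list_variables(BERT_BASE_DIR + "/bert_model.ckpt"):
        print(name, shape)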
    

    Sentence (and sentence-pair) classification tasks

    The GLUE datasets

    Download the GLUE data with the script at https://gist.github.com/W4ngatang/60c2bdb54d156a41194446737ce03e2e (make sure to run it with Python 3!). From behind the Great Firewall, though, the download never seems to finish.

    python download_glue_data.py --data_dir glue_data --tasks all
    

    Official repo: https://github.com/nyu-mll/GLUE-baselines

    If you are in mainland China, first clone https://github.com/wasiahmad/paraphrase_identification

    and then run

    python download_glue_data.py --data_dir glue_data --tasks all --path_to_mrpc=paraphrase_identification/dataset/msr-paraphrase-corpus
    

    Note: if you want to use GloVe, the 840B zip downloaded from https://nlp.stanford.edu/projects/glove/ is over 2 GB and plain unzip cannot extract it. Use 7-Zip instead:

    7z x glove.840B.300d.zip
    

    which handles it without complaint:

    7z x glove.840B.300d.zip
    
    7-Zip [64] 16.02 : Copyright (c) 1999-2016 Igor Pavlov : 2016-05-21
    p7zip Version 16.02 (locale=en_US,Utf16=on,HugeFiles=on,64 bits,56 CPUs x64)
    
    Scanning the drive for archives:
    1 file, 2176768927 bytes (2076 MiB)
    
    Extracting archive: glove.840B.300d.zip
    --
    Path = glove.840B.300d.zip
    Type = zip
    Physical Size = 2176768927
    64-bit = +
    
    Everything is Ok
    
    Size:       5646236541
    Compressed: 2176768927
    

    Following https://github.com/nyu-mll/GLUE-baselines, install allennlp==0.7.0 and torch>=0.4.1 and you can run the baselines on the GLUE datasets:

    py=/home/xxx/python-3-tf-cpu/bin/python3.6
    alias superhead='/opt/compiler/gcc-4.8.2/lib/ld-linux-x86-64.so.2 --library-path /opt/compiler/gcc-4.8.2/lib:$LD_LIBRARY_PATH '
    alias python='superhead $py'
    python main.py \
        --exp_dir EXP_DIR \
        --run_dir RUN_DIR \
        --train_tasks all \
        --cove 0 \
        --cuda -1 \
        --eval_tasks all \
        --glove 1 \
        --word_embs_file ./emb_dir/glove.840B.300d.txt
    

    Running the classifier

    export BERT_BASE_DIR=/path/to/bert/uncased_L-12_H-768_A-12
    export GLUE_DIR=/path/to/glue
    
    python run_classifier.py \
      --task_name=MRPC \
      --do_train=true \
      --do_eval=true \
      --data_dir=$GLUE_DIR/MRPC \
      --vocab_file=$BERT_BASE_DIR/vocab.txt \
      --bert_config_file=$BERT_BASE_DIR/bert_config.json \
      --init_checkpoint=$BERT_BASE_DIR/bert_model.ckpt \
      --max_seq_length=128 \
      --train_batch_size=32 \
      --learning_rate=2e-5 \
      --num_train_epochs=3.0 \
      --output_dir=/tmp/mrpc_output/
    

    Output:

    ***** Eval results *****
      eval_accuracy = 0.845588
      eval_loss = 0.505248
      global_step = 343
      loss = 0.505248
    

    This means 84.55% accuracy on the dev set. For a small dataset like MRPC (one of the GLUE tasks), dev-set accuracy can vary quite a bit across runs even when starting from the same pretrained checkpoint (repeated runs may land anywhere between 84% and 88%).

    pretraining

    step1. create-pretraining-data

    The data-generation code used for the paper was written in C++; here it is re-implemented in Python. It builds the training instances for the masked LM and next sentence prediction tasks.

    Input file format: one sentence per line (this matters for next sentence prediction), with blank lines separating documents. For example, sample_text.txt:

    Something glittered in the nearest red pool before him.
    Gold, surely!
    But, wonderful to relate, not an irregular, shapeless fragment of crude ore, fresh from Nature's crucible, but a bit of jeweler's handicraft in the form of a plain gold ring.
    Looking at it more attentively, he saw that it bore the inscription, "May to Cass."
    Like most of his fellow gold-seekers, Cass was superstitious.
    
    The fountain of classic wisdom, Hypatia herself.
    As the ancient sage--the name is unimportant to a monk--pumped water nightly that he might study by day, so I, the guardian of cloaks and parasols, at the sacred doors of her lecture-room, imbibe celestial knowledge.
    From my youth I felt in me a soul above the matter-entangled herd.
    She revealed to me the glorious fact, that I am a spark of Divinity itself.
    

    The output is a set of tf.train.Example protos serialized into TFRecord files.

    Note: this script loads the entire input file into memory, so for large files you should split the input into shards, run the script once per shard to get a set of tf_examples.tf_record* files, and then feed all of them to the next script, run_pretraining.py.

    Flags:

    • max_predictions_per_seq: the maximum number of masked LM predictions per sequence. Set it to roughly max_seq_length * masked_lm_prob (the script does not do this automatically); e.g. 128 * 0.15 ≈ 19, hence 20 below.
    python create_pretraining_data.py \
      --input_file=./sample_text.txt \
      --output_file=/tmp/tf_examples.tfrecord \
      --vocab_file=$BERT_BASE_DIR/vocab.txt \
      --do_lower_case=True \
      --max_seq_length=128 \
      --max_predictions_per_seq=20 \
      --masked_lm_prob=0.15 \
      --random_seed=12345 \
      --dupe_factor=5
    

    The output looks like this:

    INFO:tensorflow:*** Example ***
    INFO:tensorflow:tokens: [CLS] indeed , it was recorded in [MASK] star that a fortunate early [MASK] ##r had once picked up on the highway a solid chunk [MASK] gold quartz which the [MASK] had freed from its inc [MASK] ##ing soil , and washed into immediate and [MASK] popularity . [SEP] rainy season , [MASK] insult show habit of body , and seldom lifted their eyes to the rift ##ed [MASK] india - ink washed skies [MASK] them . " cass " beard [MASK] elliot early that morning , but not with a view to [MASK] . a leak in his [MASK] roof , - - quite [MASK] with his careless , imp ##rov ##ide ##nt habits , - - had rouse ##d him at 4 a [MASK] m [SEP]
    INFO:tensorflow:input_ids: 101 5262 1010 2009 2001 2680 1999 103 2732 2008 1037 19590 2220 103 2099 2018 2320 3856 2039 2006 1996 3307 1037 5024 20000 103 2751 20971 2029 1996 103 2018 10650 2013 2049 4297 103 2075 5800 1010 1998 8871 2046 6234 1998 103 6217 1012 102 16373 2161 1010 103 15301 2265 10427 1997 2303 1010 1998 15839 4196 2037 2159 2000 1996 16931 2098 103 2634 1011 10710 8871 15717 103 2068 1012 1000 16220 1000 10154 103 11759 2220 2008 2851 1010 2021 2025 2007 1037 3193 2000 103 1012 1037 17271 1999 2010 103 4412 1010 1011 1011 3243 103 2007 2010 23358 1010 17727 12298 5178 3372 14243 1010 1011 1011 2018 27384 2094 2032 2012 1018 1037 103 1049 102
    INFO:tensorflow:input_mask: 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
    INFO:tensorflow:segment_ids: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
    INFO:tensorflow:masked_lm_positions: 7 12 13 25 30 36 45 52 53 54 68 74 81 82 93 99 103 105 125 0
    INFO:tensorflow:masked_lm_ids: 17162 2220 4125 1997 4542 29440 20332 4233 1037 16465 2030 2682 2018 13763 5456 6644 1011 8335 1012 0
    INFO:tensorflow:masked_lm_weights: 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 0.0
    INFO:tensorflow:next_sentence_labels: 0
    INFO:tensorflow:*** Example ***
    INFO:tensorflow:tokens: [CLS] and there burst on phil ##am ##mon ' s astonished eyes a vast semi ##ci ##rcle of blue sea [MASK] ring ##ed with palaces and towers [MASK] [SEP] like most of [MASK] fellow gold - seekers , cass was super ##sti [MASK] . [SEP]
    INFO:tensorflow:input_ids: 101 1998 2045 6532 2006 6316 3286 8202 1005 1055 22741 2159 1037 6565 4100 6895 21769 1997 2630 2712 103 3614 2098 2007 22763 1998 7626 103 102 2066 2087 1997 103 3507 2751 1011 24071 1010 16220 2001 3565 16643 103 1012 102 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
    INFO:tensorflow:input_mask: 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
    INFO:tensorflow:segment_ids: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
    INFO:tensorflow:masked_lm_positions: 10 20 23 27 32 39 42 0 0 0 0 0 0 0 0 0 0 0 0 0
    INFO:tensorflow:masked_lm_ids: 22741 1010 2007 1012 2010 2001 20771 0 0 0 0 0 0 0 0 0 0 0 0 0
    INFO:tensorflow:masked_lm_weights: 1.0 1.0 1.0 1.0 1.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
    INFO:tensorflow:next_sentence_labels: 1
    INFO:tensorflow:Wrote 60 total instances
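
    To double-check what was written, a small sketch like the following can decode the first example back into Python (TF 1.x APIs; the feature names are the ones shown in the log above):

    # Sketch: read back /tmp/tf_examples.tfrecord and print the first tf.train.Example.
    import tensorflow as tf

    record_path = "/tmp/tf_examples.tfrecord"
    for record in tf.python_io.tf_record_iterator(record_path):
        example = tf.train.Example.FromString(record)
        feats = example.features.feature
        print("input_ids:", list(feats["input_ids"].int64_list.value))
        print("input_mask:", list(feats["input_mask"].int64_list.value))
        print("segment_ids:", list(feats["segment_ids"].int64_list.value))
        print("masked_lm_positions:", list(feats["masked_lm_positions"].int64_list.value))
        print("masked_lm_ids:", list(feats["masked_lm_ids"].int64_list.value))
        print("masked_lm_weights:", list(feats["masked_lm_weights"].float_list.value))
        print("next_sentence_labels:", list(feats["next_sentence_labels"].int64_list.value))
        break  # only the first example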
    

    step2. run-pretraining

    • If you are pretraining from scratch, do not include init_checkpoint.
    • The model configuration (including vocab size) is specified in bert_config_file.
    • In practice num_train_steps usually needs to be 10,000 or more.
    • max_seq_length and max_predictions_per_seq must match the values passed to create_pretraining_data.py.
    python run_pretraining.py \
      --input_file=/tmp/tf_examples.tfrecord \
      --output_dir=/tmp/pretraining_output \
      --do_train=True \
      --do_eval=True \
      --bert_config_file=$BERT_BASE_DIR/bert_config.json \
      --init_checkpoint=$BERT_BASE_DIR/bert_model.ckpt \
      --train_batch_size=32 \
      --max_seq_length=128 \
      --max_predictions_per_seq=20 \
      --num_train_steps=20 \
      --num_warmup_steps=10 \
      --learning_rate=2e-5
    

    While running, it grabs essentially all available GPU memory (the details are unclear, but a card with too little memory probably cannot run it). Because sample_text.txt is tiny, the model overfits. The log looks like this (at the end an eval_results.txt file is written containing the ***** Eval results ***** section):

    INFO:tensorflow:Done running local_init_op.
    INFO:tensorflow:Evaluation [10/100]
    INFO:tensorflow:Evaluation [20/100]
    INFO:tensorflow:Evaluation [30/100]
    INFO:tensorflow:Evaluation [40/100]
    INFO:tensorflow:Evaluation [50/100]
    INFO:tensorflow:Evaluation [60/100]
    INFO:tensorflow:Evaluation [70/100]
    INFO:tensorflow:Evaluation [80/100]
    INFO:tensorflow:Evaluation [90/100]
    INFO:tensorflow:Evaluation [100/100]
    INFO:tensorflow:Finished evaluation at 2018-10-31-18:13:12
    INFO:tensorflow:Saving dict for global step 20: global_step = 20, loss = 0.27842212, masked_lm_accuracy = 0.94665253, masked_lm_loss = 0.27976906, next_sentence_accuracy = 1.0, next_sentence_loss = 0.0002133457
    INFO:tensorflow:Saving 'checkpoint_path' summary for global step 20: ./pretraining_output/model.ckpt-20
    INFO:tensorflow:***** Eval results *****
    INFO:tensorflow:  global_step = 20
    INFO:tensorflow:  loss = 0.27842212
    INFO:tensorflow:  masked_lm_accuracy = 0.94665253
    INFO:tensorflow:  masked_lm_loss = 0.27976906
    INFO:tensorflow:  next_sentence_accuracy = 1.0
    INFO:tensorflow:  next_sentence_loss = 0.0002133457
    

    You can inspect the run in the corresponding TensorBoard. It is fairly laggy, presumably because the model is large.

    [TensorBoard screenshot omitted]

    There is also an embedding projector:

    [projector screenshot omitted]

    On the left you can choose which checkpoint and which layer to visualize.

    In the plot in the middle you can select a point; the panel on the right then shows the n points nearest to it, using cosine or Euclidean distance as the metric.

    [projector screenshot omitted]

    pretrain tips and caveats

    • If your task has a large domain-specific corpus, it usually pays off to run additional pretraining steps on that corpus, starting from the released BERT checkpoint.
    • The paper uses a learning rate of 1e-4; if you continue pretraining from an existing BERT checkpoint, use a smaller learning rate (e.g. 2e-5).
    • At release time the models were English-only; models for more languages were announced for late November 2018.
    • Longer sequences are disproportionately expensive because attention is quadratic in the sequence length. For example, a batch of 64 sequences of length 512 is much more expensive than a batch of 256 sequences of length 128: for fully connected or convolutional layers the cost is the same, but for attention the length-512 batch costs far more (see the small sketch after this list). The recommendation is therefore to pretrain for ~90k steps with sequence length 128 and then another ~10k steps with length 512. Very long sequences are mostly needed to learn the positional embeddings, which happens quite quickly; note that this scheme requires generating the data twice, with different max_seq_length values.
    • Pretraining from scratch is computationally very expensive, especially on GPUs. The recommendation is to pretrain BERT-Base on a preemptible Cloud TPU v2 (about two weeks, roughly $500). On a single Cloud TPU you have to scale the batch size down; use the largest batch size that still fits in TPU memory.
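
    A back-of-the-envelope sketch of the quadratic-cost point above (an illustration only, ignoring constant factors):

    # Attention score computation scales with batch_size * seq_len^2,
    # while fully connected / convolutional layers scale with batch_size * seq_len.
    def relative_attention_cost(batch_size, seq_len):
        return batch_size * seq_len ** 2

    long_batch = relative_attention_cost(batch_size=64, seq_len=512)    # 16,777,216
    short_batch = relative_attention_cost(batch_size=256, seq_len=128)  # 4,194,304
    print(long_batch / short_batch)  # 4.0 -- same number of tokens, ~4x the attention cost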

    Extracting feature vectors (ELMo-style)

    Format of the input file input.txt:

    • For a sentence pair: sentence A ||| sentence B
    • For a single sentence: just sentence A, with no delimiter
    python extract_features.py \
      --input_file=input.txt \
      --output_file=/tmp/output.json \
      --vocab_file=$BERT_BASE_DIR/vocab.txt \
      --bert_config_file=$BERT_BASE_DIR/bert_config.json \
      --init_checkpoint=$BERT_BASE_DIR/bert_model.ckpt \
      --layers=-1,-2,-3,-4 \
      --max_seq_length=128 \
      --batch_size=8
    

    For example, if the input is 『大家』, the resulting output.json looks like this:

    The "linex_index" field is the index of the input line:

    {
      "linex_index": 0,
      "features": [{
        "token": "[CLS]",
        "layers": [{
          "index": -1,
          "values": [1.507966, -0.155272, 0.108119, ..., 0.111]
        }, {
          "index": -2,
          "values": [1.39443, 0.307064, 0.483496, ..., 0.332]
        }, {
          "index": -3,
          "values": [0.961682, 0.757408, 0.720898, ..., 0.332]
        }, {
          "index": -4,
          "values": [-0.275457, 0.632056, 1.063737, ..., 0.332]
        }]
      }, {
        "token": "大",
        "layers": [{
          "index": -1,
          "values": [0.326004, -0.313136, 0.233399, ..., 0.111]
        }, {
          "index": -2,
          "values": [0.795364, 0.361322, -0.116774, ..., 0.332]
        }, {
          "index": -3,
          "values": [0.807957, 0.206743, -0.359639, ..., 0.332]
        }, {
          "index": -4,
          "values": [-0.226106, -0.129655, -0.128466, ..., 0.332]
        }]
      }, {
        "token": "家",
        "layers": [{
          "index": -1,
          "values": [1.768678, -0.814265, 0.016321, ..., 0.111]
        }, {
          "index": -2,
          "values": [1.76887, -0.020193, 0.44832, 0.193271, ..., 0.332]
        }, {
          "index": -3,
          "values": [1.695086, 0.050979, 0.188321, -0.537057, ..., 0.332]
        }, {
          "index": -4,
          "values": [0.745073, -0.09894, 0.166217, -1.045382, ..., 0.332]
        }]
      }, {
        "token": "[SEP]",
        "layers": [{
          "index": -1,
          "values": [0.881939, -0.34753, 0.210375, ..., 0.111]
        }, {
          "index": -2,
          "values": [-0.047698, -0.030813, 0.041558, ..., 0.332]
        }, {
          "index": -3,
          "values": [-0.049113, -0.067705, 0.018293, ..., 0.332]
        }, {
          "index": -4,
          "values": [0.000215, -0.057331, -3.2e-05, ..., 0.332]
        }]
      }]
    }
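
    A small sketch for consuming output.json, e.g. building one ELMo-style vector per token by summing the four dumped layers (the file path matches the command above; summing is just one common pooling choice, not something the script prescribes):

    # Sketch: turn extract_features.py output (one JSON object per input line)
    # into a single vector per token by summing layers -1..-4.
    import json
    import numpy as np

    with open("/tmp/output.json") as f:
        for line in f:
            example = json.loads(line)
            for feature in example["features"]:
                token = feature["token"]
                layers = {l["index"]: np.array(l["values"]) for l in feature["layers"]}
                vector = layers[-1] + layers[-2] + layers[-3] + layers[-4]
                print(token, vector.shape)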
    

    Trying it myself

    Using the vocab from the pretrained Chinese model, I shrank the network and trained it on a single CPU machine over 1.9M Chinese sentences (still with the default WordPiece tokenization). Each sentence is treated as one document, with the sentence itself used as sentence B and the sentence's tag used as sentence A.
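
    Under that setup, the corpus fed to create_pretraining_data.py consists of two-line "documents" (tag line, then sentence line, separated by blank lines). A minimal, hypothetical conversion sketch (the tab-separated input format and both file names are assumptions):

    # Sketch: turn a "tag<TAB>sentence" file into the two-sentences-per-document
    # format expected by create_pretraining_data.py (tag as sentence A, sentence as sentence B).
    with open("tagged_corpus.tsv", encoding="utf-8") as fin, \
         open("pretrain_corpus.txt", "w", encoding="utf-8") as fout:
        for line in fin:
            line = line.rstrip("\n")
            if not line:
                continue
            tag, sentence = line.split("\t", 1)
            fout.write(tag + "\n")        # sentence A
            fout.write(sentence + "\n")   # sentence B
            fout.write("\n")              # blank line = document boundary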

    The model configuration:

    {
      "attention_probs_dropout_prob": 0.1, 
      "directionality": "bidi", 
      "hidden_act": "gelu", 
      "hidden_dropout_prob": 0.1, 
      "hidden_size": 64, 
      "initializer_range": 0.02, 
      "intermediate_size": 3072, 
      "max_position_embeddings": 512, 
      "num_attention_heads": 8, 
      "num_hidden_layers": 2, 
      "pooler_fc_size": 64, 
      "pooler_num_attention_heads": 12, 
      "pooler_num_fc_layers": 3, 
      "pooler_size_per_head": 32, 
      "pooler_type": "first_token_transform", 
      "type_vocab_size": 2, 
      "vocab_size": 21128
    }
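
    For a rough sense of how small this network is, a back-of-the-envelope parameter count from the config above (an approximation that ignores the MLM/NSP output heads and the embedding-layer norm):

    # Approximate parameter count for the shrunk config above.
    H, L, I = 64, 2, 3072               # hidden_size, num_hidden_layers, intermediate_size
    V, P, T = 21128, 512, 2             # vocab_size, max_position_embeddings, type_vocab_size

    embeddings = V * H + P * H + T * H  # word + position + token-type embeddings
    per_layer = (
        4 * (H * H + H)                 # query/key/value/output projections (+ biases)
        + (H * I + I) + (I * H + H)     # feed-forward up- and down-projections
        + 4 * H                         # two layer norms (gamma + beta)
    )
    pooler = H * H + H
    print(embeddings + L * per_layer + pooler)  # roughly 2.2M parameters, vs ~110M for BERT-Base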
    

    The run parameters:

    ## g_max_predictions_per_seq approx_to g_max_seq_length * g_masked_lm_prob
    
    # online or offline
    export train_mode=offline
    export param_name=param1
    export g_train_batch_size=128
    export g_num_train_steps=10000
    export g_max_seq_length=128
    export g_max_predictions_per_seq=20
    export g_masked_lm_prob=0.15
    export g_dupe_factor=3
    
    sh -x scripts/run_train_bert.sh  > log/$param_name.log &
    
    # online or offline
    export train_mode=offline
    export param_name=param2
    export g_train_batch_size=64
    export g_num_train_steps=10000
    export g_max_seq_length=128
    export g_max_predictions_per_seq=20
    export g_masked_lm_prob=0.15
    export g_dupe_factor=3
    
    sh -x scripts/run_train_bert.sh  > log/$param_name.log &
    
    # online or offline
    export train_mode=offline
    export param_name=param3
    export g_train_batch_size=128
    export g_num_train_steps=10000
    export g_max_seq_length=128
    export g_max_predictions_per_seq=8
    export g_masked_lm_prob=0.05
    export g_dupe_factor=5
    
    sh -x scripts/run_train_bert.sh  > log/$param_name.log &
    
    # online or offline
    export train_mode=offline
    export param_name=param4
    export g_train_batch_size=64
    export g_num_train_steps=10000
    export g_max_seq_length=128
    export g_max_predictions_per_seq=8
    export g_masked_lm_prob=0.05
    export g_dupe_factor=5
    
    sh -x scripts/run_train_bert.sh  > log/$param_name.log &
    
    # online or offline
    export train_mode=offline
    export param_name=param5
    export g_train_batch_size=32
    export g_num_train_steps=10000
    export g_max_seq_length=128
    export g_max_predictions_per_seq=20
    export g_masked_lm_prob=0.15
    export g_dupe_factor=3
    
    sh -x scripts/run_train_bert.sh  > log/$param_name.log &
    
    # online or offline
    export train_mode=offline
    export param_name=param6
    export g_train_batch_size=32
    export g_num_train_steps=10000
    export g_max_seq_length=128
    export g_max_predictions_per_seq=8
    export g_masked_lm_prob=0.05
    export g_dupe_factor=5
    
    sh -x scripts/run_train_bert.sh  > log/$param_name.log &
    
    wait
    
    

    After training for 10k steps the results look like this (I forget which configuration the 20k-step run in the chart used):

    [loss curves omitted]

    As the chart shows, for the same 10k steps, param1 takes the longest to train but reaches the lowest loss.

    Examples per second:

    [chart omitted]

    Global steps per second:

    [chart omitted]

    Next-sentence accuracy at eval time:

    [chart omitted]

    The masked LM accuracy at eval time is, well, less impressive:

    [chart omitted]

    Note

    Interestingly, there is no tf.summary-related code anywhere in the repo, yet TensorBoard still has data to show.

    That is because TPUEstimator is used: "The TPUEstimator API does not support custom TensorBoard summaries. However, basic summaries are automatically recorded to the event files in the model directory."

    https://cloud.google.com/tpu/docs/tutorials/migrating-to-tpuestimator-api


    This is an original article; please credit the source when reposting.
    Link: http://daiwk.github.io/posts/nlp-bert-code.html

  • Training BERT from scratch (code)

    2021-02-22 17:30:01
    Code for training BERT from scratch; the unzip password is given at https://blog.csdn.net/herosunly/article/details/113937736
  • Reading the BERT code

    2021-09-01 10:14:00

    Reading the BERT code

    BERT source code

    Flow chart

    Brief notes

    input_ids: token ids

    input_ids are the ids of the input tokens.

    An input token goes through two transformations:

    1. token text -----------------> token id
    2. token id -----embedding----> token vector

    input_mask: is this token padding?

    # max_seq_length: ___________________________________________________________________________
    #  tokens:        [CLS] is this jack ##son ##ville ? [SEP] no it is not . [SEP] 0 0 0 0 ... 0
    #  input_mask:    1     1   1     1   1       1    1  1     1  1  1  1      1   0 0 0 0 ... 0
    # input_mask: marks whether a position is a real token (1) or padding added to reach max_seq_length (0)
    
    

    input_mask appears in, for example:

    1. the function convert_single_example() in run_classifier.py
    2. the function __init__() in modeling.py

    token_type_ids: which sentence does a token belong to?

    token_type_ids indicates whether a token belongs to the first sentence or the second.

    The first sentence is marked 0, the second sentence 1.

    In the code, token_type_ids, segment_ids and type_ids all mean the same thing (a small sketch of how these arrays are built follows the comment block below).

    
    # The following is from the function convert_single_example() in run_classifier.py:
    # The convention in BERT is:
    # (a) For sequence pairs:
    #  tokens:   [CLS] is this jack ##son ##ville ? [SEP] no it is not . [SEP]
    #  type_ids: 0     0  0    0    0     0       0 0     1  1  1  1   1 1
    # (b) For single sequences:
    #  tokens:   [CLS] the dog is hairy . [SEP]
    #  type_ids: 0     0   0   0  0     0 0
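
    A minimal sketch of how input_ids / input_mask / segment_ids are built for a sentence pair, mirroring the convention above (assuming a FullTokenizer from tokenization.py; the vocab path and max_seq_length are placeholders):

    # Sketch: build the three arrays for "[CLS] A [SEP] B [SEP]" as in convert_single_example().
    import tokenization  # from google-research/bert

    max_seq_length = 16  # hypothetical, for illustration
    tokenizer = tokenization.FullTokenizer(vocab_file="vocab.txt", do_lower_case=True)

    tokens_a = tokenizer.tokenize("is this jacksonville ?")
    tokens_b = tokenizer.tokenize("no it is not .")

    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    segment_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)

    input_ids = tokenizer.convert_tokens_to_ids(tokens)
    input_mask = [1] * len(input_ids)        # 1 = real token

    while len(input_ids) < max_seq_length:   # pad up to max_seq_length
        input_ids.append(0)
        input_mask.append(0)                 # 0 = padding
        segment_ids.append(0)

    print(tokens)
    print(input_ids, input_mask, segment_ids, sep="\n")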
    

    segment_ids comes from the function convert_single_example() in run_classifier.py.

    token_type_ids comes from the function __init__() in modeling.py.

    Hold the reference first, initialize the actual values later

    The usual logic would be: initialize the object first, then hold a reference to it.

    But here the pattern is:

    • hold the reference first and initialize the underlying values later, or
    • create the object first, use the reference in between, and fill in its values last.

    Get a reference to the word embeddings first, initialize them afterwards

    The word embeddings returned by embedding_lookup are only a reference; their actual values are filled in by the later call to init_from_checkpoint (a condensed sketch follows the pseudocode below).

    model_fn_builder in run_classifier.py:

    #file run_classifier.py:
    def model_fn_builder(...):
        def model_fn(...):
            # create_model calls embedding_lookup; the embeddings it returns are only references
            ...=create_model(...)
            ...
            
            # the embeddings are actually initialized by the later call to init_from_checkpoint
            if ...:
                ...init_from_checkpoint(...)
            else:
                ...init_from_checkpoint(...)
                
        return model_fn
    
    def create_model(...):
        model = modeling.BertModel(...)
        ...
    
    
    #file modeling.py:
    def __init__(...):
        # the embeddings returned by embedding_lookup are only references
        ...=embedding_lookup(...)
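
    A condensed sketch of the mechanism (the two calls below are the ones used inside model_fn after the model graph has been built; the checkpoint path is a placeholder):

    # Sketch (TF 1.x): variables are created when the graph is built (e.g. by
    # modeling.BertModel / embedding_lookup); their initial values are then mapped
    # in from the checkpoint before the variables are initialized.
    import tensorflow as tf
    import modeling  # from google-research/bert

    init_checkpoint = "/path/to/bert_model.ckpt"  # placeholder

    tvars = tf.trainable_variables()  # references to the already-created variables
    assignment_map, initialized_names = modeling.get_assignment_map_from_checkpoint(
        tvars, init_checkpoint)

    # Registers the mapping; the values are loaded when the variables get initialized.
    tf.train.init_from_checkpoint(init_checkpoint, assignment_map)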
    
    
    debug

    [debug screenshot omitted]

  • Debugging the BERT code

    2019-06-16 18:25:33

    1. Overview

    The BERT code can be roughly divided into three parts:

    • Input processing: convert tokens into embeddings and add positional embedding and token type embedding information. The positional embedding encodes a token's position; the token type embedding encodes which sentence a token belongs to. After this stage the input is a tensor of shape [batch_size, seq_length, width], where width is the embedding size.
    • Encoder: encode the sentence with a Transformer, essentially the same encoder as in the "Attention Is All You Need" paper, with only minor differences.
    • Decoder: turn the encoded sentence into the outputs for the training objectives.

    2. Input

    The input is capped at a maximum sentence length, with the remaining positions padded with 0 in the token ids. The ids are then looked up and turned into word vectors:

    # Perform embedding lookup on the word ids.
    (self.embedding_output, self.embedding_table) = embedding_lookup(
        input_ids=input_ids,
        vocab_size=config.vocab_size,
        embedding_size=config.hidden_size,
        initializer_range=config.initializer_range,
        word_embedding_name="word_embeddings",
        use_one_hot_embeddings=use_one_hot_embeddings)
    

    In addition to the word embeddings, the input also gets positional embedding and token type embedding information added to it.
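
    In modeling.py this is handled by embedding_postprocessor, called right after embedding_lookup. Conceptually it just adds two more lookup tables to the word embeddings; a toy numpy sketch of the idea (all shapes and values below are made up; layer norm and dropout omitted):

    # Toy sketch: word embeddings + token-type embeddings + position embeddings.
    import numpy as np

    batch, seq_len, width, type_vocab, max_pos = 2, 6, 8, 2, 512
    rng = np.random.default_rng(0)

    word_emb = rng.normal(size=(batch, seq_len, width))        # output of embedding_lookup
    token_type_ids = np.array([[0, 0, 0, 1, 1, 1]] * batch)    # sentence A = 0, sentence B = 1
    token_type_table = rng.normal(size=(type_vocab, width))
    position_table = rng.normal(size=(max_pos, width))

    output = (word_emb
              + token_type_table[token_type_ids]                 # token type embeddings
              + position_table[np.arange(seq_len)][None, :, :])  # position embeddings
    print(output.shape)  # (2, 6, 8) == [batch_size, seq_length, width]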

    3. Attention and further processing

    The encoder first builds an attention mask: attention_mask = create_attention_mask_from_input_mask(input_ids, input_mask). This keeps the model from attending to the padded positions, so their attention weights end up effectively zero.
    Some of the relevant code:

    self.all_encoder_layers = transformer_model(
        input_tensor=self.embedding_output,
        attention_mask=attention_mask,
        hidden_size=config.hidden_size,
        num_hidden_layers=config.num_hidden_layers,
        num_attention_heads=config.num_attention_heads,
        intermediate_size=config.intermediate_size,
        intermediate_act_fn=get_activation(config.hidden_act),
        hidden_dropout_prob=config.hidden_dropout_prob,
        attention_probs_dropout_prob=config.attention_probs_dropout_prob,
        initializer_range=config.initializer_range,
        do_return_all_layers=True)
    
    def transformer_model(input_tensor,
                          attention_mask=None,
                          hidden_size=768,
                          num_hidden_layers=12,
                          num_attention_heads=12,
                          intermediate_size=3072,
                          intermediate_act_fn=gelu,
                          hidden_dropout_prob=0.1,
                          attention_probs_dropout_prob=0.1,
                          initializer_range=0.02,
                          do_return_all_layers=False):
    # `input_tensor` has shape [batch_size, seq_length, hidden_size]
    
    # `query_layer` = [B*F, N*H]
    query_layer = tf.layers.dense(
        from_tensor_2d,
        num_attention_heads * size_per_head,
        activation=query_act,
        name="query",
        kernel_initializer=create_initializer(initializer_range))
    
    # `key_layer` = [B*T, N*H]
    key_layer = tf.layers.dense(
        to_tensor_2d,
        num_attention_heads * size_per_head,
        activation=key_act,
        name="key",
        kernel_initializer=create_initializer(initializer_range))
    
    # `value_layer` = [B*T, N*H]
    value_layer = tf.layers.dense(
        to_tensor_2d,
        num_attention_heads * size_per_head,
        activation=value_act,
        name="value",
        kernel_initializer=create_initializer(initializer_range))
    
    attention_scores = tf.matmul(query_layer, key_layer, transpose_b=True)
    attention_scores = tf.multiply(attention_scores,
                                   1.0 / math.sqrt(float(size_per_head)))
    
    if attention_mask is not None:
      # `attention_mask` = [B, 1, F, T]
      attention_mask = tf.expand_dims(attention_mask, axis=[1])
    
      # Since attention_mask is 1.0 for positions we want to attend and 0.0 for
      # masked positions, this operation will create a tensor which is 0.0 for
      # positions we want to attend and -10000.0 for masked positions.
      adder = (1.0 - tf.cast(attention_mask, tf.float32)) * -10000.0
    
      # Since we are adding it to the raw scores before the softmax, this is
      # effectively the same as removing these entirely.
      attention_scores += adder
    # Normalize the attention scores to probabilities.
    # `attention_probs` = [B, N, F, T]
    attention_probs = tf.nn.softmax(attention_scores)
    
    # seem a bit unusual, but is taken from the original Transformer paper.
    attention_probs = dropout(attention_probs, attention_probs_dropout_prob)
    
    attention_output = tf.concat(attention_heads, axis=-1)
    
    attention_output = dropout(attention_output, hidden_dropout_prob)
    attention_output = layer_norm(attention_output + layer_input)
    
    # The activation is only applied to the "intermediate" hidden layer.
    with tf.variable_scope("intermediate"):
      intermediate_output = tf.layers.dense(
          attention_output,
          intermediate_size,
          activation=intermediate_act_fn,
          kernel_initializer=create_initializer(initializer_range))
    
    # Down-project back to `hidden_size` then add the residual.
    with tf.variable_scope("output"):
      layer_output = tf.layers.dense(
          intermediate_output,
          hidden_size,
          kernel_initializer=create_initializer(initializer_range))
      layer_output = dropout(layer_output, hidden_dropout_prob)
      layer_output = layer_norm(layer_output + attention_output)
      prev_output = layer_output
      all_layer_outputs.append(layer_output)
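
    To see the masking trick from the excerpt above in isolation (a toy example, not code from the repo):

    # Toy example of the additive attention mask: padded positions get -10000 added to
    # their raw scores, so after the softmax their attention probability is ~0.
    import numpy as np

    input_mask = np.array([1, 1, 1, 0, 0], dtype=np.float32)   # last two positions are padding
    raw_scores = np.array([2.0, 1.0, 0.5, 3.0, 3.0], dtype=np.float32)

    adder = (1.0 - input_mask) * -10000.0
    masked = raw_scores + adder
    probs = np.exp(masked) / np.exp(masked).sum()
    print(probs.round(3))  # [0.629 0.231 0.14  0.    0.   ] -- padding gets no weight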
    
    

    4. transformer

    As in the original Transformer paper, the code first computes attention_head_size and reshapes the input into a 2D matrix, then applies self-attention followed by a linear projection, dropout and layer norm. This block is repeated once per layer to produce the final output.

    def transformer_model(input_tensor,
                          attention_mask=None,
                          hidden_size=768,
                          num_hidden_layers=12,
                          num_attention_heads=12,
                          intermediate_size=3072,
                          intermediate_act_fn=gelu,
                          hidden_dropout_prob=0.1,
                          attention_probs_dropout_prob=0.1,
                          initializer_range=0.02,
                          do_return_all_layers=False):
      if hidden_size % num_attention_heads != 0:
        raise ValueError(
            "The hidden size (%d) is not a multiple of the number of attention "
            "heads (%d)" % (hidden_size, num_attention_heads))
    
      attention_head_size = int(hidden_size / num_attention_heads)
      input_shape = get_shape_list(input_tensor, expected_rank=3)
      batch_size = input_shape[0]
      seq_length = input_shape[1]
      input_width = input_shape[2]
    
      if input_width != hidden_size:
        raise ValueError("The width of the input tensor (%d) != hidden size (%d)" %
                         (input_width, hidden_size))
    
      prev_output = reshape_to_matrix(input_tensor)
    
      all_layer_outputs = []
      for layer_idx in range(num_hidden_layers):
        with tf.variable_scope("layer_%d" % layer_idx):
          layer_input = prev_output
    
          with tf.variable_scope("attention"):
            attention_heads = []
            with tf.variable_scope("self"):
              attention_head = attention_layer(
                  from_tensor=layer_input,
                  to_tensor=layer_input,
                  attention_mask=attention_mask,
                  num_attention_heads=num_attention_heads,
                  size_per_head=attention_head_size,
                  attention_probs_dropout_prob=attention_probs_dropout_prob,
                  initializer_range=initializer_range,
                  do_return_2d_tensor=True,
                  batch_size=batch_size,
                  from_seq_length=seq_length,
                  to_seq_length=seq_length)
              attention_heads.append(attention_head)
    
            attention_output = None
            if len(attention_heads) == 1:
              attention_output = attention_heads[0]
            else:
              attention_output = tf.concat(attention_heads, axis=-1)
            with tf.variable_scope("output"):
              attention_output = tf.layers.dense(
                  attention_output,
                  hidden_size,
                  kernel_initializer=create_initializer(initializer_range))
              attention_output = dropout(attention_output, hidden_dropout_prob)
              attention_output = layer_norm(attention_output + layer_input)
    
          with tf.variable_scope("intermediate"):
            intermediate_output = tf.layers.dense(
                attention_output,
                intermediate_size,
                activation=intermediate_act_fn,
                kernel_initializer=create_initializer(initializer_range))
    
          with tf.variable_scope("output"):
            layer_output = tf.layers.dense(
                intermediate_output,
                hidden_size,
                kernel_initializer=create_initializer(initializer_range))
            layer_output = dropout(layer_output, hidden_dropout_prob)
            layer_output = layer_norm(layer_output + attention_output)
            prev_output = layer_output
            all_layer_outputs.append(layer_output)
    
      if do_return_all_layers:
        final_outputs = []
        for layer_output in all_layer_outputs:
          final_output = reshape_from_matrix(layer_output, input_shape)
          final_outputs.append(final_output)
        return final_outputs
      else:
        final_output = reshape_from_matrix(prev_output, input_shape)
        return final_output
    
  • A detailed guide to using the BERT code

    2021-03-18 14:21:04

    A detailed explanation of the BERT code is coming next.
    Stay tuned.

  • BERT code walkthrough

    10k+ views, highly upvoted · 2019-06-16 19:41:08
    flags.DEFINE_string( "output_file", None, "Output TF example file (or comma-separated list of files).") flags.DEFINE_string("vocab_file", None, "The vocabulary file that the BERT model was trained on....
  • BERT code walkthrough, part 2: the full model

    10k+ views, highly upvoted · 2018-12-31 14:05:40
    A walkthrough of the model part of the BERT code. Parameter configuration in bert_config.json: { "attention_probs_dropout_prob": 0.1, # dropout applied after the attention softmax "hidden_act": "gelu", # activation function...
  • Studying the BERT code

    2020-08-28 10:50:30
    Based on the official https://github.com/google-research/bert: if mode == tf.estimator.ModeKeys.TRAIN: train_op = optimization.create_optimizer( total_loss, learning_rate, num_train_steps, num_warmup_steps, ...
  • BERT code explained in detail (part 1)

    10k+ views, highly upvoted · 2019-03-12 12:23:09
    This is the PyTorch version of BERT (equivalent to the TensorFlow one but simpler; once you understand this one, the TF code is easy too): https://github.com/huggingface/pytorch-pretrained-BERT. The main content is in the pytorch_pretrained_bert/modeling file...
  • BERT code analysis

    1k+ views · 2019-01-08 20:08:42
    https://daiwk.github.io/posts/nlp-bert-code-annotated-framework.html#get-pooled-output https://blog.csdn.net/weixin_39470744/article/details/84401339 Model construction ...
  • BERT code walkthrough, part 4: Chinese named entity recognition

    1k+ views · 2019-01-01 23:48:37
    A BERT code walkthrough for Chinese NER. Use google BERT to do CoNLL-2003 NER. Data processing: 20,864 sentences, train-0: tokens tokens: Chinese characters input_ids: the corresponding ids in the vocabulary input_mask: the corresponding mask, here it just...
  • PyTorch BERT: parsing modeling_bert.py

    1k+ views · 2020-10-28 21:49:06
    modeling_bert.py contains the download URLs of the pretrained models; if no already-downloaded model path is set when loading, the model is fetched automatically from these addresses: BERT_PRETRAINED_MODEL_ARCHIVE_MAP = { 'bert-base-uncased': ...
  • Notes from reading the BERT code

    2020-01-18 16:13:04
    The code studied is the BertForTokenClassification model from the PyTorch code in the previous post, run on the NER example: https://github.com/huggingface/transformers/blob/master/examples/run_ner.py. 1. Model overview: the model used is multi...
  • Reference material for the BERT code

    2019-01-14 20:35:32
    unicodedata: 1. Unicode normalization: https://python3-cookbook.readthedocs.io/zh_CN/latest/c02/p09_normalize_unicode_text_to_regexp.html 2. The meaning of CJK for Unicode characters: ...
  • BERT code: implementation and walkthrough

    1k+ views · 2019-08-01 13:42:00
    For the attention mechanism, see the earlier post on attention and how to understand it. Transformer Block; the dot-product attention model in BERT. Formula: Code: class Attention(nn.Module): """ Scaled Dot Product Attention """ ...
  • BERT code explained in detail (part 2)

    1k+ views · 2019-03-12 12:30:08
    This is the PyTorch version of BERT (equivalent to the TensorFlow one but simpler; once you understand this one, the TF code is easy too): https://github.com/huggingface/pytorch-pretrained-BERT. The main content is in the pytorch_pretrained_bert/modeling file...
  • BERT code in detail: modeling.py

    1k+ views · 2019-08-23 17:30:35
    From the official BERT GitHub: git clone https://github.com/google-research/bert.git ... The first file covered is modeling.py, the core of the BERT implementation, consisting of 2 classes and 17 functions: I. Classes: 1. class Bert...
  • Notes on errors while running the BERT code

    2021-04-11 09:16:04
    Afterword: as a beginner it took me about a month, on and off, to get the whole BERT codebase working, while also writing other parts of my thesis. At first it seemed easy: just download someone else's source, run it once, understand the basic flow and apply it to my own data, but in practice all kinds of strange...
  • A simple read-through of the BERT code

    2020-02-19 21:20:54
    Text classification with a pretrained BERT model. In short, BERT prepends the special classification embedding [CLS] to every sentence; [CLS] captures the information of the whole sentence, and its final hidden state (the Transformer output) is used as the aggregate sequence representation for the classification task...
  • Parameter configuration in bert_config.json: { "attention_probs_dropout_prob": 0.1, # dropout after the attention softmax "hidden_act": "gelu", # activation function "hidden_dropout_prob": 0.1, # hidden-layer dropout probability ...
  • https://zhuanlan.zhihu.com/p/120315111
