
  • 0. Competition overview
  • 1. Bert NER Finetune
    • Data preparation
      • Raw data
      • Data conversion
    • Model training

0. Competition overview

This project comes from a Kaggle NER competition: competition link

The pipeline and code are adapted from:

  1. https://www.kaggle.com/tungmphung/coleridge-matching-bert-ner?select=kaggle_run_ner.py
  2. https://www.kaggle.com/tungmphung/pytorch-bert-for-named-entity-recognition

1. Bert NER Finetune


Data preparation

The data first has to be converted into the JSON format used for NER.



Raw data

0007f880-0a9b-492d-9a58-76eb0b0e0bd7.json (one of the papers)

Because Id values are repeated in train.csv, first use a group-by to merge all rows that share an Id into a single row:

train = train.groupby('Id').agg({
    'pub_title': 'first',
    'dataset_title': '|'.join,
    'dataset_label': '|'.join,
    'cleaned_label': '|'.join
}).reset_index()

print(f'No. grouped training rows: {len(train)}')

No. grouped training rows: 14316
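To make the effect of the aggregation concrete, here is a minimal sketch on a toy table. The rows and values are invented for illustration and are not taken from train.csv; only the `groupby(...).agg(...)` pattern mirrors the snippet above.

```python
import pandas as pd

# Toy stand-in for train.csv (hypothetical values): Id "p1" appears twice
toy = pd.DataFrame({
    'Id': ['p1', 'p1', 'p2'],
    'pub_title': ['Paper one', 'Paper one', 'Paper two'],
    'dataset_label': ['Dataset A', 'Dataset B', 'Dataset C'],
})

grouped = toy.groupby('Id').agg({
    'pub_title': 'first',       # the title is the same within an Id, keep one
    'dataset_label': '|'.join,  # concatenate all labels of an Id with "|"
}).reset_index()

print(grouped['dataset_label'].tolist())  # ['Dataset A|Dataset B', 'Dataset C']
```

After grouping, each paper occupies exactly one row, and its labels can be recovered later with `split('|')`.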



Data conversion

cnt_pos, cnt_neg = 0, 0  # number of sentences that contain/not contain labels
ner_data = []

pbar = tqdm(total=len(train))
for i, id, dataset_label in train[['Id', 'dataset_label']].itertuples():
    # paper
    paper = papers[id]

    # labels
    labels = dataset_label.split('|')
    labels = [clean_training_text(label) for label in labels]

    # sentences
    sentences = set([clean_training_text(sentence) for section in paper
                     for sentence in section['text'].split('.')])
    sentences = shorten_sentences(sentences)  # make sentences short
    sentences = [sentence for sentence in sentences if len(sentence) > 10]  # only accept sentences with length > 10 chars

    # positive sample
    for sentence in sentences:
        is_positive, tags = tag_sentence(sentence, labels)
        if is_positive:
            cnt_pos += 1
            ner_data.append(tags)
        elif any(word in sentence.lower() for word in ['data', 'study']):
            ner_data.append(tags)
            cnt_neg += 1

    # process bar
    pbar.update(1)
    pbar.set_description(f"Training data size: {cnt_pos} positives + {cnt_neg} negatives")

# shuffling
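The loop above calls `clean_training_text` and `shorten_sentences`, whose definitions are not shown in this post. The following is only a sketch of what such helpers might look like. The regular expression and the `MAX_LENGTH` and `OVERLAP` values are assumptions made for illustration, not the confirmed originals.

```python
import re

MAX_LENGTH = 64  # assumed maximum number of words per sentence
OVERLAP = 20     # assumed number of words shared by consecutive chunks


def clean_training_text(txt):
    """Replace every run of non-alphanumeric characters with one space."""
    return re.sub(r'[^A-Za-z0-9]+', ' ', str(txt)).strip()


def shorten_sentences(sentences):
    """Split sentences longer than MAX_LENGTH words into overlapping chunks."""
    short_sentences = []
    step = MAX_LENGTH - OVERLAP
    for sentence in sentences:
        words = sentence.split()
        if len(words) > MAX_LENGTH:
            for start in range(0, len(words), step):
                short_sentences.append(' '.join(words[start:start + MAX_LENGTH]))
        else:
            short_sentences.append(sentence)
    return short_sentences


print(clean_training_text("Hello, world! (2020)"))  # Hello world 2020
```

Keeping an overlap between chunks means a dataset name that straddles a chunk boundary still appears whole in at least one chunk, provided it is shorter than the overlap.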


import re

def tag_sentence(sentence, labels):  # requirement: both sentence and labels are already cleaned
    sentence_words = sentence.split()

    if labels is not None and any(re.findall(f'\\b{label}\\b', sentence)
                                  for label in labels):  # positive sample
        nes = ['O'] * len(sentence_words)
        for label in labels:
            label_words = label.split()

            # mark the first word of each match as 'B' and the rest as 'I'
            all_pos = find_sublist(sentence_words, label_words)
            for pos in all_pos:
                nes[pos] = 'B'
                for i in range(pos+1, pos+len(label_words)):
                    nes[i] = 'I'

        return True, list(zip(sentence_words, nes))

    else:  # negative sample
        nes = ['O'] * len(sentence_words)
        return False, list(zip(sentence_words, nes))
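`tag_sentence` relies on a `find_sublist` helper that is not shown here. A minimal sketch, assuming it returns every start index at which the label's word list occurs in the sentence's word list, could look like this (the implementation is an illustration, not the confirmed original):

```python
def find_sublist(big_list, small_list):
    """Return every start index at which small_list occurs inside big_list."""
    all_positions = []
    if not small_list:  # an empty label matches nowhere
        return all_positions
    for i in range(len(big_list) - len(small_list) + 1):
        if big_list[i:i + len(small_list)] == small_list:
            all_positions.append(i)
    return all_positions


words = "The Survey of Earned Doctorates collects information".split()
positions = find_sublist(words, "Survey of Earned Doctorates".split())
print(positions)  # [1]
```

With the start index 1, `tag_sentence` would tag "Survey" as B and the following three words as I.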



{"tokens": ["Ongoing", "projects", "managing", "flowthrough", "water", "data", "include", "SAMOS", "the", "U"], "tags": ["O", "O", "O", "O", "O", "O", "O", "O", "O", "O"]}
{"tokens": ["The", "numbers", "and", "percentages", "from", "which", "the", "figures", "are", "drawn", "are", "contained", "in", "a", "set", "The", "Survey", "of", "Earned", "Doctorates", "collects", "information", "on", "research", "doctorates", "only"], "tags": ["O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "B", "I", "I", "I", "O", "O", "O", "O", "O", "O"]}
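The records above have one JSON object per line, with parallel `tokens` and `tags` lists. As a rough sketch, the `(word, tag)` pairs collected in `ner_data` could be written out in this layout as follows. The function name `write_ner_json` and the sample file name are hypothetical; the real notebook may write the file differently.

```python
import json
import os
import tempfile


def write_ner_json(ner_data, path):
    """Write one JSON object per line with parallel 'tokens' and 'tags' lists."""
    with open(path, 'w') as f:
        for row in ner_data:
            if not row:  # skip empty sentences
                continue
            words, nes = zip(*row)
            f.write(json.dumps({'tokens': list(words), 'tags': list(nes)}) + '\n')


# Toy input in the (word, tag) form returned by tag_sentence
sample = [
    [('The', 'O'), ('Survey', 'B'), ('of', 'I'), ('Earned', 'I'), ('Doctorates', 'I')],
    [('No', 'O'), ('dataset', 'O'), ('here', 'O')],
]
sample_path = os.path.join(tempfile.gettempdir(), 'train_ner_sample.json')
write_ner_json(sample, sample_path)
```

In the actual pipeline the target path would be the `./train_ner.json` file that the training command reads.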


Model training

Training can directly use the run_ner.py script from the Hugging Face GitHub repository:


!python ../input/kaggle-ner-utils/kaggle_run_ner.py \
--model_name_or_path 'bert-base-cased' \
--train_file './train_ner.json' \
--validation_file './train_ner.json' \
--num_train_epochs 1 \
--per_device_train_batch_size 8 \
--per_device_eval_batch_size 8 \
--save_steps 15000 \
--output_dir './output' \
--report_to 'none' \
--seed 43

