Scraping Classical Chinese Poetry with Python and Writing It to Neo4j

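I have recently started building a question-answering tool for classical Chinese poetry. The first step is to scrape the poems and build a knowledge graph from them. The nodes are authors, collective names (合称), dynasties, categories, and poem titles; the relationships are author live_in dynasty, author write poem, and poem belong_to category. The code is as follows: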

import random
import urllib.request
from urllib.request import urlopen

from bs4 import BeautifulSoup
from py2neo import Graph

# Pool of User-Agent strings; one is picked at random for each author page.
ua_list = [
    "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.79 Safari/537.36",  # Chrome
    "Mozilla/5.0 (Windows NT 6.1; rv:40.0) Gecko/20100101 Firefox/40.0",  # Firefox
    "Mozilla/5.0 (compatible, MSIE 11, Windows NT 6.3; Trident/7.0; rv:11.0) like Gecko",  # IE
    "Opera/9.99 (Windows NT 5.1; U; zh-CN) Presto/9.9.9",  # Opera
]


def not_empty(s):
    return s and s.strip()


# Names of the category nodes already created; a node must not be created twice.
list_sub = []


def get_everypoetry(sub_url, author, dynasty):
    # `ua` and `graph` are module-level globals set in the __main__ block.
    req1 = urllib.request.Request(sub_url, headers={'User-agent': ua})
    html1 = urlopen(req1).read()
    soup1 = BeautifulSoup(html1, 'lxml')
    # The list page only links to each poem. The category is shown on the
    # poem's own page, so every link has to be followed.
    page = soup1.find_all('div', {'class': 'shici_list_main'})
    for item in page:
        # Link to the poem
        text0 = item.find('a', {'target': '_blank'})['href']
        # Fetch the poem page
        content_url = 'http://www.shicimingju.com' + text0
        req2 = urllib.request.Request(content_url, headers={'User-agent': ua})
        html2 = urlopen(req2).read()
        soup2 = BeautifulSoup(html2, 'lxml')

        # Poem title
        try:
            title = soup2.find('h1').text
        except Exception:
            print("Failed to get title")
            print(content_url)
            continue  # nothing below can be stored without a title

        # Add the poem node. Values are passed as query parameters so that
        # quotes in the scraped text cannot break the Cypher statement.
        try:
            graph.run("CREATE (poem:Poetry {name: $title})", title=title)
        except Exception:
            print("Failed to write title to Neo4j")
            print(content_url)
        # Link the poem to its author
        graph.run("MATCH (p:Person {name: $author}), (t:Poetry {name: $title}) "
                  "CREATE (p)-[:write]->(t)", author=author, title=title)

        # Poem text. It is stored whole; word segmentation can be done later,
        # at retrieval time.
        try:
            contents = soup2.find('div', {'class': 'item_content'}).text.strip()
            graph.run("MATCH (p:Poetry {name: $title}) SET p.content = $content",
                      title=title, content=contents)
        except Exception:
            print("Failed to get or store poem text")
            print(content_url)

        # Appreciation (commentary on the poem)
        try:
            appreciation = soup2.find('div', {'class': 'shangxi_content'}).text.strip()
            graph.run("MATCH (t:Poetry {name: $title}) SET t.zzapp = $appreciation",
                      title=title, appreciation=appreciation)
        except Exception:
            print("Failed to get appreciation")
            print(content_url)

        # Poem categories
        try:
            poetry_type = soup2.find('div', {'class': 'shici-mark'}).text.strip().split('\n')
        except Exception:
            print("Failed to read type")
            print(content_url)
            poetry_type = ['类型', '其它']  # fall back to the category "其它" (other)
        # Skip element 0 (the "类型" label) and drop empty strings.
        poetry_type_list = [t.strip() for t in poetry_type[1:] if t.strip()]
        for ty in poetry_type_list:
            if ty not in list_sub:
                graph.run("CREATE (types:Types {name: $type})", type=ty)
                list_sub.append(ty)
            # Link the poem to its category
            graph.run("MATCH (t:Poetry {name: $title}), (p:Types {name: $type}) "
                      "CREATE (t)-[:belong_to]->(p)", title=title, type=ty)


if __name__ == "__main__":
    # Connect to the graph database
    graph = Graph("http://localhost:11003/", username="admin", password="password")
    # Names of the author and dynasty nodes already created; no duplicates allowed.
    list_main = []
    url = 'http://www.shicimingju.com/chaxun/zuozhe/'
    for i in range(1, 652):
        ua = random.choice(ua_list)
        main_url = url + str(i) + '.html'
        req = urllib.request.Request(main_url, headers={'User-agent': ua})
        html = urlopen(req).read()
        soup = BeautifulSoup(html, 'lxml')
        try:
            # The author page gives the poet, dynasty, biography, and poem count.
            author = soup.find('div', {'class': 'card about_zuozhe'}).find('h4').text  # node
            brief = soup.find('div', {'class': 'des'}).text  # property
            dynasty = soup.find('div', {'class': 'aside_val'}).text  # node
            total_poetry = soup.find('div', {'class': 'aside_right'}).find('div', {'class': 'aside_val'}).text  # property
            if author not in list_main:
                graph.run("CREATE (author:Person {name: $name, brief: $brief, total_poetry: $total_poetry})",
                          name=author, brief=brief, total_poetry=total_poetry)
            if dynasty not in list_main:
                graph.run("CREATE (dynasty:Time {name: $name})", name=dynasty)
            if author not in list_main or dynasty not in list_main:
                graph.run("MATCH (p:Person {name: $author}), (t:Time {name: $dynasty}) "
                          "CREATE (p)-[:live_in]->(t)", author=author, dynasty=dynasty)
            list_main.append(author)
            list_main.append(dynasty)
        except Exception:
            print("Failed to get poet, poem count, dynasty, or biography")
            print(main_url)
            continue  # author and dynasty are unknown, so skip this page

        # Poems listed on the author's first page
        get_everypoetry(main_url, author, dynasty)

        # Find out how many list pages this author has
        try:
            number = soup.find('div', {'id': 'list_nav_all'}).find_all('a')
        except Exception:
            print("Failed to get page count")
            number = []
        page_number = len(number)
        # Visit the remaining list pages
        for j in range(2, page_number):
            sub_url = url + str(i) + '_' + str(j) + '.html'
            get_everypoetry(sub_url, author, dynasty)
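
The script avoids duplicate nodes by remembering names in the Python lists `list_main` and `list_sub`, which only works within a single run. One possible alternative, sketched below, is to let Neo4j handle it with Cypher's `MERGE` clause and to pass every scraped value as a query parameter. The helper name `build_poem_statements` is hypothetical and not part of the original script; it only builds the statements and runs nothing, so it can be checked without a database. It keeps the script's labels (`Person`, `Time`, `Poetry`, `Types`) and still identifies a poem by title alone, so two poems with the same title would share one node, as they do in the script.

```python
from typing import Dict, List, Tuple

# One Cypher statement together with its parameters.
Statement = Tuple[str, Dict[str, str]]


def build_poem_statements(author: str, dynasty: str, title: str,
                          types: List[str]) -> List[Statement]:
    """Return (cypher, parameters) pairs for one poem. Nothing is executed here."""
    statements: List[Statement] = [
        # MERGE matches the pattern if it exists and creates it otherwise,
        # so re-running the scraper does not duplicate nodes or relationships.
        (
            "MERGE (p:Person {name: $author}) "
            "MERGE (t:Time {name: $dynasty}) "
            "MERGE (p)-[:live_in]->(t)",
            {"author": author, "dynasty": dynasty},
        ),
        (
            "MERGE (p:Person {name: $author}) "
            "MERGE (poem:Poetry {name: $title}) "
            "MERGE (p)-[:write]->(poem)",
            {"author": author, "title": title},
        ),
    ]
    for ty in types:
        statements.append(
            (
                "MERGE (poem:Poetry {name: $title}) "
                "MERGE (k:Types {name: $type}) "
                "MERGE (poem)-[:belong_to]->(k)",
                {"title": title, "type": ty},
            )
        )
    return statements


if __name__ == "__main__":
    # Example values chosen for illustration only.
    for cypher, params in build_poem_statements("李白", "唐", "静夜思", ["思乡"]):
        print(cypher, params)
```

With a py2neo connection like the script's `graph`, each pair could then be executed as `graph.run(cypher, params)`. The `$name` parameter syntax assumes a reasonably recent Neo4j server.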
