python: pandas 、dataframe 与hdf5

HDF(Hierarchical Data Format），翻译一下就是层次数据格式，其实就是一个高级文件夹，或一个文件系统。
一、group、dataset
h5文件中有两个核心的概念：组“group”和数据集“dataset”。一个h5文件就是 “dataset” 和 “group” 二合一的容器。
dataset ：简单来讲类似数组组织形式的数据集合，一个类似n维的矩阵。
group：包含了其它 dataset(数组) 和其它 group ，像字典一样工作。

关于group和dataset:

    path =r"C:\Users\rustr\Desktop\test.h5"arr1 = np.array([(1,2,3),(4,5,6)])arr2 =  np.array([(2,4,6),(8,9,10)])f = h5py.File(path,'w')g1=f.create_group("\data1")g2=f.create_group("\data2")g3 =f.create_group("\data3")g1.create_dataset("arr1",data =arr1)g2.create_dataset("arr1",data =arr2)print("f keys:",f.keys())print("f keys len:", len(f.keys()))# 如何判断 某group是否存在？if "\data1" in f:print("\\data1 is a existing group!" )# 如何判断某dataset是否存在？if "arr1" in f["\data1"]:print("arr1 is a existing dataset!")for p in f: # groupprint("group p value:",f[p]) # get group valueprint("p:",p,"  ,f[p] keys: ",f[p].keys(),"  len:",len(f[p].keys()))for q in f[p]: # datasetprint("=>")print("dataset:", q," value:",f[p][q][()]) # get dataset valuef.close()

输出如下：

\data1 is a existing group!
arr1 is a existing dataset!
group p value: <HDF5 group “/\data1” (1 members)>
p: \data1 ,f[p] keys: <KeysViewHDF5 [‘arr1’]> len: 1
=>
dataset: arr1 value: [[1 2 3]
[4 5 6]]
group p value: <HDF5 group “/\data2” (1 members)>
p: \data2 ,f[p] keys: <KeysViewHDF5 [‘arr1’]> len: 1
=>
dataset: arr1 value: [[ 2 4 6]
[ 8 9 10]]
group p value: <HDF5 group “/\data3” (0 members)>
p: \data3 ,f[p] keys: <KeysViewHDF5 []> len: 0

二、python 中库：PyTables和h5py

Python中的HDF5库通常有PyTables和h5py。

h5py提供了一种直接而高级的HDF5 API访问接口；

PyTables则抽象了HDF5的许多细节以提供多种灵活的数据容器、表索引、查询功能等支持。比如可以进行where索引。

pandas有一个最小化的类似于字典的HDFStore类，它通过PyTables存储pandas对象。

在h5的文件存储中，storer（fixed format）和table（more scalable random-access and query abilities）格式的存储，在读写也是存在着很大的差异。前者的速度快，但是更占空间。后者的速度要慢一些。

从方便程度来看，pandas内置的方法对df类型的数据封装较好，应用最为方便，但有人认为速度较差；从速度来看，h5py最优，但在具体数据处理上更复杂。

在2015年SciPy上，来自PyTables，h5py，HDF Group，Pandas以及社区成员的开发人员坐下来讨论如何使Python和HDF5的协作更加精简和更易于维护。

这个是未来几个库之间关系的设想：

# -*- coding: utf-8 -*-import h5py
import pandas as pd
import numpy as np
import time as t# 把日期分解为int 数组，提高效率；
# 把代码存为int,建立解析关系
path =r'C:\Users\rustr\Desktop\000001.XSHE.csv'
path_h5 =r'C:\Users\rustr\Desktop\000001.h5'
path_h5_2 =r'C:\Users\rustr\Desktop\000001_2.h5'def to_hdf_test():t0= t.time()df1 = pd.read_csv(path,  encoding='gb18030')print('csv=>df read cost time:',t.time()-t0,'s')df2 = df1;print('df1的格式：',type(df1))print('df 的行',df1.shape[0],' 列：',df1.shape[1])print("hdf5=>write")t1 =t.time()df1.to_hdf(path_h5,'df',mode='w', format='table')print("to_hdf write=> cost time :",t.time()-t1,'s')t2=t.time()print("hdf5=>append")df2.to_hdf(path_h5, 'data', append=True)print("append mode: to_hdf write=> cost time :",t.time()-t2,'s')t3 =t.time()print("hdf5=>read")df3 = pd.read_hdf(path_h5, 'data')print("read hdf5 =>cost time:",t.time()-t3,'s')print('type df3: ',type(df3))#print(df3)def hdf5Store_test():# read csvt0 =t.time()df = pd.read_csv(path,  encoding='gb18030')print('read csv cost time:',t.time()-t0,'s')print('df 的行',df.shape[0],' 列：',df.shape[1])# write# way 1print("hdf5=>write")store = pd.HDFStore(path_h5_2)store['data'] = dfstore.close()# way 2 :appendprint("hdf5=>write")store = pd.HDFStore(path_h5_2)for i in range(2):store.append('data', df) #追加的形式,label为data,也可以取别的名print('hdf5 append work is close')store.close()# readprint('hdf5 => read')t1= t.time()store = pd.HDFStore(path_h5_2)data = store['data']store.close()print('read hdf5 cost time:',t.time()-t1,'s')print('hdf5 data 的行',data.shape[0],' 列：',data.shape[1])print('type data:',type(data))#print(data)def h5py_test():# hdf5 象是文件夹方式# 比如'\files\IC.csv'# dataset是数据元素的一个多维数组以及支持元数据（metadata）# n = 10000 #重复次数now_time_str = '2019-08-18 13:53:00'.encode('utf-8')*nnow_time_int = np.array([2019, 8, 18,13.,53,0])*npath1 = r'C:\Users\rustr\Desktop\test1.h5'path2 = r'C:\Users\rustr\Desktop\test2.h5'# 写入HDF5# 1、strprint('now is writing hdf5 file test')t1 = t.time()f1 = h5py.File(path1,'w')#f.create_dataset('/group/time_str', data=now_time_str) # 方式1#放在mygroup下，以dataset存放f1['/mygroup/time_str'] = now_time_str                     #方式2# 打个标识f1['/mygroup/time_str'].attrs["string"] = 'string'f1.close()print('dt 写入hdf5 =>字符串花时：',t.time()-t1,'s')# 2、int ; 不设groupt2 = t.time()f2 = h5py.File(path2,'w')f2.create_dataset('time_int', data=now_time_int)#f2['time_int'] = now_time_intf2.close()print('dt 写入hdf5 =>int花时：',t.time()-t2,'s')# 读取HDF5print('now is reading hdf5 file test......')# str_t1 =t.time()for i in range(100):g1 = h5py.File(path1,'r')time_str_dataset =g1[r'/mygroup/time_str'] # datasettime_str = time_str_dataset[()] # 从dataset中取出元素#time_str = g1[r'/mygroup/time_str'].value => depreciated#print(time_str)g1.close()print('dt str type => read hdf5=> cost:',t.time()-_t1,'s')#int_t2 =t.time()for i in range(100):g2 = h5py.File(path2,'r')time_int_dataset = g2['time_int'] #取出datasettime_int = time_int_dataset [()] #取出value#print(time_int)g2.close()print('dt int type => read hdf5=> cost:',t.time()-_t2,'s')

关于不同的字符串和数值型的读写比较：

to_hdf_test()
csv=>df read cost time: 0.2543447017669678 s
df1的格式： <class ‘pandas.core.frame.DataFrame’>
df 的行 116638 列： 14
hdf5=>write
to_hdf write=> cost time : 0.09382462501525879 s
hdf5=>append
append mode: to_hdf write=> cost time : 0.09407901763916016 s
hdf5=>read
read hdf5 =>cost time: 0.12476396560668945 s
type df3: <class ‘pandas.core.frame.DataFrame’>

pandas的read_csv的速度也让我很吃惊。

h5py_test()
now is writing hdf5 file test
dt 写入hdf5 =>字符串花时： 0.001994609832763672 s
dt 写入hdf5 =>int花时： 0.001024007797241211 s
now is reading hdf5 file test…
dt str type => read hdf5=> cost: 0.051293134689331055 s
dt int type => read hdf5=> cost: 0.03966259956359863 s

可以看出，hdf5读的速度非常快，优势非常明显。

python: pandas 、dataframe 与hdf5相关推荐

python pandas DataFrame 替换 NaN 值和删除 NaN 所在的行。
python pandas DataFrame 替换 NaN 值和删除 NaN 所在的行. import pandas as pd import numpy as np df1 = pd.Data ...
python pandas DataFrame 查找NaN所在的位置
python pandas DataFrame 查找 NaN 所在的位置 import pandas as pd import numpy as np df1 = pd.DataFrame({'日期' ...
python pandas dataframe 列转换为离散值
python pandas dataframe 列转换为离散值 import pandas as pd import numpy as np df1 = pd.DataFrame({'日期': [' ...
python pandas DataFrame 排序
python pandas DataFrame 排序 import pandas as pd import numpy as np df1 = pd.DataFrame({'日期': ['2021-7 ...
python pandas DataFrame 字符串转日期格式
python pandas DataFrame 字符串转日期格式 import pandas as pd import numpy as np df1 = pd.DataFrame({'日期': [' ...
python pandas DataFrame 数据替换
python pandas DataFrame 替换 import pandas as pd import numpy as np df1 = pd.DataFrame({'日期': ['2021-7 ...
python pandas DataFrame 转置
python pandas DataFrame 转置 import pandas as pd df1 = pd.DataFrame({'日期': ['2021-7-2', '2021-8-2', '2 ...
python pandas dataframe 行列选择，切片操作原创 2017年02月15日 21:43:18 标签： python 30760 python pandas dataframe
python pandas dataframe 行列选择,切片操作原创 2017年02月15日 21:43:18 标签: python / 30760 编辑删除 python pandas dat ...
Python pandas.DataFrame.combine_first函数方法的使用
Pandas是基于NumPy 的一种工具,该工具是为了解决数据分析任务而创建的.Pandas 纳入了大量库和一些标准的数据模型,提供了高效地操作大型数据集所需的工具.Pandas提供了大量能使我们快速 ...
Python pandas.DataFrame.tz_localize函数方法的使用
Pandas是基于NumPy 的一种工具,该工具是为了解决数据分析任务而创建的.Pandas 纳入了大量库和一些标准的数据模型,提供了高效地操作大型数据集所需的工具.Pandas提供了大量能使我们快速 ...

python: pandas 、dataframe 与hdf5

python: pandas 、dataframe 与hdf5相关推荐

最新文章

热门文章