Scraping Berlin Hostels and Building a Tableau Viz With It

One of the coolest things about working on our own personal projects is that we get to explore topics we're genuinely interested in. In my case, I had the chance to backpack around the world for more than a year between 2016 and 2017, and it was one of the best experiences of my life.

During my travels, I stayed in A LOT of hostels. From Hanoi to the Iguassu Falls, passing through Tokyo, Delhi and many other places, one always needs a place to rest after a long day exploring a city. Funny enough, it was in some of those hostels that I got interested in learning how to code, which set me on the path to becoming the data analyst I am today.

So, I'm very interested in understanding what makes one hostel better than another, how to compare them, and so on. Thinking about that led me to this tutorial idea. Today we are going to do two things:

  • Scrape data from Hostel World, using Berlin as our study case, and save it into a data frame.
  • Use that data to build a Tableau dashboard that allows us to pick a hostel based on different criteria.

Why Berlin hostels? Because Berlin is an amazing city, and there are plenty of hostel options there for us to explore. There are many different websites for finding hostels; we will use my favorite, Hostel World, which I used many times myself and trust for the accuracy of the information it provides.

Photo by Ricardo Gomez Angel on Unsplash

My goal is to show you that we can do the whole collect/transform/visualize process in a simple yet effective way, so you can start doing your own projects. To fully enjoy this tutorial, you should be familiar with Python and pandas, and comfortable with basic HTML and Tableau concepts.

You can follow along with the notebook containing the code here, and access the Tableau Dashboard here.

Always Explore the Website First!

I highly recommend taking some time to explore the structure of the website before you start coding. If you're using Chrome, just right-click and select “Inspect”. This is what you get:

The HTML structure of our target page. Source: author.
Think of the HTML structure as a tree, with all its branches holding the page's information. Try to find which class holds information about the hostel name, ratings, etc. More importantly, notice how each hostel's information has its own “branch”, or container. That means that once we figure out how to access one of them, we can apply the same logic to all the other hostels/containers.

In the code below I show you how to get the raw information, then how to figure out how many pages of hostels there are (we will need that to iterate later), and then how to separate out the information about the first hostel so we can explore it. Take your time to read the code and comments; I wrote them especially for you:

# importing the libraries to use on the scraping
from requests import get
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np
import time
import re

# getting the html info to be used
url = 'https://www.hostelworld.com/hostels/Berlin'
response = get(url)

# create soup
soup = BeautifulSoup(response.text, 'html.parser')

# creating individual containers; each one holds information about one hostel
hostel_containers = soup.find_all(class_='fabresult rounded clearfix hwta-property')

# figuring out how many pages of hostels are available -- important when iterating over pages later
total_pages = soup.find_all(class_='pagination-page-number')
final_page = pd.to_numeric(total_pages[-1].text)
print(final_page)

# checking how many hostels we have on the first page
print(len(hostel_containers))

first_hostel = hostel_containers[0]
print(first_hostel.prettify())

The output of this code will be, first, a “3”, the number of pages with hostel info; then a “30”, the number of hostels per page; and finally a long chunk of HTML, which is the information about the first hostel on the list. The information we will extract today is the following:

  • Name
  • Link
  • Distance from centre (km)
  • Average rating
  • Number of reviews
  • Average price in USD

Using our super HTML skills, we figured out that the code to extract all of that is the one below. If you have used Beautiful Soup before, could you get the same information in a different way? If so, I would love to see that in the comments.

# hostel name
first_hostel.h2.a.text

# hostel link
first_hostel.h2.a.get('href')

# distance from city centre in km
first_hostel.find(class_='addressline').text[12:18].replace('k', '').replace('m', '').strip()

# average rating
first_hostel.find(class_='hwta-rating-score').text.replace('\n', '').strip()

# number of reviews
first_hostel.find(class_='hwta-rating-counter').text.replace('\n', '').strip()

# average price per night in USD
first_hostel.find(class_='price').text.replace('\n', '').strip()[3:]

Note that we will need some Python string essentials, like replace and strip, along with some operators from the Beautiful Soup package, mostly find, find_all and get. Knowing how to combine them takes some practice, but I can guarantee that, once you understand the idea, it is pretty simple.

Now that we know how to access the information we need in the first container, we will apply the same logic across all the hostels on the first page, and then across all the pages with hostel information. How do we do that? First by using our well-known for loop, then saving the information into empty lists, and finally using those lists to create a data frame:

# first, create the empty lists
hostel_names = []
hostel_links = []
hostel_distance = []
hostel_ratings = []
hostel_reviews = []
hostel_prices = []

# iterate over the pages, using the final_page value we got at the beginning
for page in np.arange(1, final_page + 1):
    url = 'https://www.hostelworld.com/hostels/Berlin?page=' + str(page)
    response = get(url)
    soup = BeautifulSoup(response.text, 'html.parser')
    hostel_containers = soup.find_all(class_='fabresult rounded clearfix hwta-property')

    # iterate over the results on each page
    for item in range(len(hostel_containers)):
        hostel_names.append(hostel_containers[item].h2.a.text)
        hostel_links.append(hostel_containers[item].h2.a.get('href'))
        hostel_distance.append(hostel_containers[item].find(class_='addressline').text[12:18].replace('k', '').replace('m', '').strip())
        hostel_ratings.append(hostel_containers[item].find(class_='hwta-rating-score').text.replace('\n', '').strip())
        hostel_reviews.append(hostel_containers[item].find(class_='hwta-rating-counter').text.replace('\n', '').strip())
        hostel_prices.append(hostel_containers[item].find(class_='price').text.replace('\n', '').strip()[3:])

    time.sleep(2)  # pause between pages so we don't push too hard on the website

# using the lists to create a brand new dataframe
hw_berlin = pd.DataFrame({
    'hostel_name': hostel_names,
    'distance_centre_km': hostel_distance,
    'average_rating': hostel_ratings,
    'number_reviews': hostel_reviews,
    'average_price_usd': hostel_prices,
    'hw_link': hostel_links
})
hw_berlin.head()

And now we can appreciate the beauty of what we have just created:

First lines of the Berlin Hostels data frame. Source: author.
After that we just need to clean up the data a little, removing non-numerical characters and converting strings, initially saved as object dtype, to numbers. Finally, we save our results into a .csv file.

# removing non-numerical characters from the column distance_centre_km
hw_berlin.distance_centre_km = [re.sub('[^0-9.]', '', x) for x in hw_berlin.distance_centre_km]

# converting numerical columns to the proper format
list_to_convert = ['distance_centre_km', 'average_rating', 'number_reviews', 'average_price_usd']
for column in list_to_convert:
    hw_berlin[column] = pd.to_numeric(hw_berlin[column], errors='coerce')

# saving the final version into a .csv file
hw_berlin.to_csv('hw_berlin_basic_info.csv')

Tableau Fun Time!

Tableau is one of the most powerful BI tools available today, and it offers a free version, Tableau Public, that lets you do A LOT of cool stuff. However, it can get pretty complex very fast, even for basic graphs. I can't cover every step I took here, as it involved a lot of click-and-drag actions; unlike code, you can't simply type it all out and reproduce it.

So, if you are new to Tableau and want to understand how I built my visualization, the way to do that is to download the .twb file, which is available here, open it on your computer, and do what we call “reverse engineering”: checking and playing, yourself, with the files I've created. Trust me, this is the most effective way to learn Tableau, and even when you can see the engineering behind a viz, it can be hard to reproduce the same visualization. Shall we give it a try?

Tableau offers different filters that help you to slice and visualize our recently scraped data. Source: author.
As data or business analysts, our basic job is to make data readable and easy to manipulate. The visualization I've built for this tutorial offers you just that: you can slice and play with the hostels based on the different criteria we have available, filtering the options and finding the ones you are interested in, just like a stakeholder would. Besides the filters, I've also included a scatter plot where we can check the relationship between price and reviews.

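Before putting that relationship on a dashboard, it can be worth a quick sanity check in pandas. The snippet below is a minimal sketch using a few made-up rows standing in for the scraped `hw_berlin` frame (with real data you would load `hw_berlin_basic_info.csv` instead):

```python
import pandas as pd

# Hypothetical rows standing in for the scraped hw_berlin data frame
hw_berlin = pd.DataFrame({
    'hostel_name': ['A', 'B', 'C', 'D'],
    'average_price_usd': [15.0, 22.0, 30.0, 18.0],
    'number_reviews': [1200, 400, 300, 950],
})

# Pearson correlation between price and number of reviews
corr = hw_berlin['average_price_usd'].corr(hw_berlin['number_reviews'])
print(round(corr, 2))
```

A single number is no substitute for the scatter plot, of course, but it tells you whether there is any linear relationship worth visualizing at all.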
The dashboard is pretty simple, and I've kept it that way on purpose; I would like to see you building it yourself and sharing a link to your results in the comments. What different insights can you get from the data we've scraped? Could you do the same analysis with hostels in Paris, New York or Rio de Janeiro? I'll leave those questions for you to answer with your own code and dashboard.

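If you do try another city, one approach is to wrap the paginated URLs in a small helper. This is only a sketch: it assumes Hostel World keeps the same `/hostels/<City>?page=N` pattern we used for Berlin, which you should verify in the browser first:

```python
def hostel_world_urls(city, final_page):
    """Build the paginated listing URLs for a city (assumed URL pattern)."""
    base = 'https://www.hostelworld.com/hostels/' + city
    return [base + '?page=' + str(page) for page in range(1, final_page + 1)]

# Each URL would then go through the same get() / BeautifulSoup pipeline as before
for url in hostel_world_urls('Paris', 3):
    print(url)
```

Remember to keep the `time.sleep()` pause between requests whatever city you scrape.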
That's all for today! I hope this tutorial helps you learn more about data scraping and Tableau. Feel free to connect with me on LinkedIn and to check out my other texts and code on my Medium and GitHub profiles.

Translated from: https://towardsdatascience.com/scraping-berlin-hostels-and-building-a-tableau-viz-with-it-a73ce5b88e22
