Part 1: Yelp Dataset Profiling and Understanding

Yelp Dataset: work through it in the web interface.

1. Profile the data by finding the total number of records for each of the tables below:

i. Attribute table = 10000
ii. Business table = 10000
iii. Category table = 10000
iv. Checkin table = 10000
v. elite_years table = 10000
vi. friend table = 10000
vii. hours table = 10000
viii. photo table = 10000
ix. review table = 10000
x. tip table = 10000
xi. user table = 10000
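
The record counts above can be gathered in one pass rather than one query per table. This is a minimal sketch, assuming a SQLite database; the two toy tables and their rows are hypothetical stand-ins for the Yelp tables, and count_rows is an illustrative helper name.

```python
import sqlite3

# Toy in-memory database standing in for the Yelp tables (hypothetical rows).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE business (id TEXT PRIMARY KEY, name TEXT);
    CREATE TABLE tip (user_id TEXT, business_id TEXT, likes INTEGER);
    INSERT INTO business VALUES ('b1', 'A'), ('b2', 'B');
    INSERT INTO tip VALUES ('u1', 'b1', 0), ('u1', 'b2', 1), ('u2', 'b1', 0);
""")

def count_rows(conn, tables):
    """Return {table_name: row_count} for each table name given."""
    counts = {}
    for table in tables:
        # Table names cannot be bound as query parameters, so they are
        # quoted into the statement; only pass trusted names.
        counts[table] = conn.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()[0]
    return counts

row_counts = count_rows(conn, ["business", "tip"])
print(row_counts)
```

On the real dataset the list would hold the eleven table names from the question.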

2. Find the total distinct records by either the foreign key or primary key for each table. If two foreign keys are listed in the table, please specify which foreign key.

i. Business = 10000
ii. Hours = 1562
iii. Category = 2643
iv. Attribute = 1115
v. Review = 10000
vi. Checkin = 493
vii. Photo = 10000
viii. Tip = 537 distinct user IDs and 3979 distinct business IDs (No Primary Key)
ix. User = 10000
x. Friend = 11
xi. Elite_years = 2780
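
For a table with two foreign keys and no primary key, such as tip, both distinct counts can come from a single query. A small sketch under the same assumptions as before (SQLite, hypothetical toy rows):

```python
import sqlite3

# Toy tip table with two foreign keys and no primary key (hypothetical rows).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tip (user_id TEXT, business_id TEXT, likes INTEGER);
    INSERT INTO tip VALUES ('u1', 'b1', 0), ('u1', 'b2', 1), ('u2', 'b1', 0);
""")

# COUNT(DISTINCT col) ignores duplicates and NULLs in that column.
distinct_users, distinct_businesses = conn.execute(
    "SELECT COUNT(DISTINCT user_id), COUNT(DISTINCT business_id) FROM tip"
).fetchone()
print(distinct_users, distinct_businesses)
```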

3. Are there any columns with null values in the Users table? Indicate “yes,” or “no.”

Answer:
No. There are no null values in the Users table.
SQL code used to arrive at answer:

SELECT COUNT(id)
FROM user
WHERE id IS NULL OR name IS NULL OR review_count IS NULL
   OR yelping_since IS NULL OR useful IS NULL OR funny IS NULL
   OR cool IS NULL OR fans IS NULL OR average_stars IS NULL
   OR compliment_hot IS NULL OR compliment_more IS NULL
   OR compliment_profile IS NULL OR compliment_cute IS NULL
   OR compliment_list IS NULL OR compliment_note IS NULL
   OR compliment_plain IS NULL OR compliment_cool IS NULL
   OR compliment_funny IS NULL OR compliment_writer IS NULL
   OR compliment_photos IS NULL
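
Typing every column by hand is easy to get wrong, so the null check can also be generated from the table definition. This is only a sketch, assuming SQLite (it relies on PRAGMA table_info); the three-column user table is a hypothetical cut-down version, and null_counts is an illustrative helper name.

```python
import sqlite3

# Hypothetical cut-down user table with one NULL planted in "name".
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE user (id TEXT, name TEXT, fans INTEGER);
    INSERT INTO user VALUES ('u1', 'Ann', 3), ('u2', NULL, 0);
""")

def null_counts(conn, table):
    """Return {column_name: number_of_NULLs} for every column of the table."""
    # PRAGMA table_info yields one row per column; index 1 is the column name.
    columns = [row[1] for row in conn.execute(f'PRAGMA table_info("{table}")')]
    # COUNT(*) counts all rows, COUNT(col) skips NULLs, so the difference
    # is the number of NULLs in that column.
    select_list = ", ".join(f'COUNT(*) - COUNT("{c}")' for c in columns)
    values = conn.execute(f'SELECT {select_list} FROM "{table}"').fetchone()
    return dict(zip(columns, values))

nulls = null_counts(conn, "user")
print(nulls)
```

A table has no nulls when every value in the returned dictionary is 0.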

4. For each table and column listed below, display the smallest (minimum), largest (maximum), and average (mean) value for the following fields:

i. Table: Review, Column: Stars
min: 1        max: 5        avg: 3.7082

ii. Table: Business, Column: Stars
min: 1.0        max: 5.0        avg: 3.6549

iii. Table: Tip, Column: Likes
min: 0        max: 2        avg: 0.0144

iv. Table: Checkin, Column: Count
min: 1        max: 53        avg: 1.9414

v. Table: User, Column: Review_count
min: 0        max: 2000        avg: 24.2995
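
The worksheet does not ask for the SQL here, so the following is only a sketch of how such figures could be produced, assuming SQLite. The toy review table and its three rows are hypothetical; the same SELECT would be repeated per table and column.

```python
import sqlite3

# Toy review table (hypothetical rows): stars of 1, 5 and 4.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE review (id TEXT, stars INTEGER);
    INSERT INTO review VALUES ('r1', 1), ('r2', 5), ('r3', 4);
""")

# MIN, MAX and AVG all ignore NULLs; ROUND keeps four decimals
# to match the precision used in the answers.
lo, hi, mean = conn.execute(
    "SELECT MIN(stars), MAX(stars), ROUND(AVG(stars), 4) FROM review"
).fetchone()
print(lo, hi, mean)
```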

5. List the cities with the most reviews in descending order:

SQL code used to arrive at answer:

SELECT city, SUM(review_count) AS city_reviews
FROM business
GROUP BY city
ORDER BY city_reviews DESC

Copy and Paste the Result Below:
Results:
+-----------------+--------------+
| city            | city_reviews |
+-----------------+--------------+
| Las Vegas       | 82854        |
| Phoenix         | 34503        |
| Toronto         | 24113        |
| Scottsdale      | 20614        |
| Charlotte       | 12523        |
| Henderson       | 10871        |
| Tempe           | 10504        |
| Pittsburgh      | 9798         |
| Montréal        | 9448         |
| Chandler        | 8112         |
| Mesa            | 6875         |
| Gilbert         | 6380         |
| Cleveland       | 5593         |
| Madison         | 5265         |
| Glendale        | 4406         |
| Mississauga     | 3814         |
| Edinburgh       | 2792         |
| Peoria          | 2624         |
| North Las Vegas | 2438         |
| Markham         | 2352         |
| Champaign       | 2029         |
| Stuttgart       | 1849         |
| Surprise        | 1520         |
| Lakewood        | 1465         |
| Goodyear        | 1155         |
+-----------------+--------------+
(Output limit exceeded, 25 of 362 total rows shown)

6. Find the distribution of star ratings to the business in the following cities:

i. Avon

SQL code used to arrive at answer:

SELECT stars, COUNT(id) AS num_businesses
FROM business
WHERE city = 'Avon'
GROUP BY stars
ORDER BY stars ASC

Copy and Paste the Resulting Table Below (2 columns - star rating and count):
Results
+-------+----------------+
| stars | num_businesses |
+-------+----------------+
| 1.5   | 1              |
| 2.5   | 2              |
| 3.5   | 3              |
| 4.0   | 2              |
| 4.5   | 1              |
| 5.0   | 1              |
+-------+----------------+

ii. Beachwood

SQL code used to arrive at answer:

SELECT stars, COUNT(id) AS num_businesses
FROM business
WHERE city = 'Beachwood'
GROUP BY stars
ORDER BY stars ASC

Copy and Paste the Resulting Table Below (2 columns - star rating and count):
Results
+-------+----------------+
| stars | num_businesses |
+-------+----------------+
| 2.0   | 1              |
| 2.5   | 1              |
| 3.0   | 2              |
| 3.5   | 2              |
| 4.0   | 1              |
| 4.5   | 2              |
| 5.0   | 5              |
+-------+----------------+

7. Find the top 3 users based on their total number of reviews:

SQL code used to arrive at answer:

SELECT name, id, review_count
FROM user
ORDER BY review_count DESC
LIMIT 3

Copy and Paste the Result Below:
Results
+--------+------------------------+--------------+
| name   | id                     | review_count |
+--------+------------------------+--------------+
| Gerald | -G7Zkl1wIWBBmD0KRy_sCw | 2000         |
| Sara   | -3s52C4zL_DHRK0ULG6qtg | 1629         |
| Yuri   | -8lbUNlXVSoXqaRRiHiSNg | 1339         |
+--------+------------------------+--------------+

8. Does posting more reviews correlate with more fans?

Please explain your findings and interpretation of the results:

Answer: No. Review counts do not appear to be correlated with fan counts. When I sorted the top 10 users by review count, their fan counts did not follow the same order, and sorting the top 10 by fans showed the same mismatch. For example, Sara has 1629 reviews but only 50 fans, while Amy has the most fans (503) with 609 reviews.

Code:

SELECT name, id, review_count, fans
FROM user
ORDER BY review_count DESC
LIMIT 10

Results review_count output :
+-----------+------------------------+--------------+------+
| name      | id                     | review_count | fans |
+-----------+------------------------+--------------+------+
| Gerald    | -G7Zkl1wIWBBmD0KRy_sCw | 2000         | 253  |
| Sara      | -3s52C4zL_DHRK0ULG6qtg | 1629         | 50   |
| Yuri      | -8lbUNlXVSoXqaRRiHiSNg | 1339         | 76   |
| .Hon      | -K2Tcgh2EKX6e6HqqIrBIQ | 1246         | 101  |
| William   | -FZBTkAZEXoP7CYvRV2ZwQ | 1215         | 126  |
| Harald    | --2vR0DIsmQ6WfcSzKWigw | 1153         | 311  |
| eric      | -gokwePdbXjfS0iF7NsUGA | 1116         | 16   |
| Roanna    | -DFCC64NXgqrxlO8aLU5rg | 1039         | 104  |
| Mimi      | -8EnCioUmDygAbsYZmTeRQ | 968          | 497  |
| Christine | -0IiMAZI2SsQ7VmyzJjokQ | 930          | 173  |
+-----------+------------------------+--------------+------+

Code:

SELECT name, id, review_count, fans
FROM user
ORDER BY fans DESC
LIMIT 10

Results fans output:
+-----------+------------------------+--------------+------+
| name      | id                     | review_count | fans |
+-----------+------------------------+--------------+------+
| Amy       | -9I98YbNQnLdAmcYfb324Q | 609          | 503  |
| Mimi      | -8EnCioUmDygAbsYZmTeRQ | 968          | 497  |
| Harald    | --2vR0DIsmQ6WfcSzKWigw | 1153         | 311  |
| Gerald    | -G7Zkl1wIWBBmD0KRy_sCw | 2000         | 253  |
| Christine | -0IiMAZI2SsQ7VmyzJjokQ | 930          | 173  |
| Lisa      | -g3XIcCb2b-BD0QBCcq2Sw | 813          | 159  |
| Cat       | -9bbDysuiWeo2VShFJJtcw | 377          | 133  |
| William   | -FZBTkAZEXoP7CYvRV2ZwQ | 1215         | 126  |
| Fran      | -9da1xk7zgnnfO1uTVYGkA | 862          | 124  |
| Lissa     | -lh59ko3dxChBSZ9U7LfUw | 834          | 120  |
+-----------+------------------------+--------------+------+
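
Comparing two top-10 lists is only a rough check, because it looks at 20 users at most. One way to put a number on the relationship would be Pearson's correlation coefficient over all users. The sketch below assumes SQLite and computes the coefficient in Python; the four user rows are hypothetical, so the value it prints says nothing about the Yelp data, and pearson is an illustrative helper name.

```python
import math
import sqlite3

# Toy user table (hypothetical rows, not the Yelp data).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE user (id TEXT, review_count INTEGER, fans INTEGER);
    INSERT INTO user VALUES
        ('u1', 10, 1), ('u2', 20, 2), ('u3', 30, 2), ('u4', 40, 5);
""")

def pearson(pairs):
    """Pearson correlation coefficient for a list of (x, y) pairs."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    var_y = sum((y - mean_y) ** 2 for _, y in pairs)
    return cov / math.sqrt(var_x * var_y)

pairs = conn.execute("SELECT review_count, fans FROM user").fetchall()
r = pearson(pairs)
print(round(r, 4))
```

A value near 1 or -1 indicates a strong linear relationship and a value near 0 indicates little or none, so running this over the full user table would test the answer above more directly than the two sorted lists.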

9. Are there more reviews with the word “love” or with the word “hate” in them?

Answer: There are more reviews with the word “love”: 1780 reviews contain “love” and 232 contain “hate”.

SQL code used to arrive at answer:

SELECT COUNT(*) FROM review WHERE text LIKE '%love%';
SELECT COUNT(*) FROM review WHERE text LIKE '%hate%';
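
Both counts can also be taken in one pass over the review table. The sketch below assumes SQLite and five hypothetical toy reviews. It also shows a pitfall: in a condition of the form text LIKE '%love%' OR '%hate%', the second operand is a bare string rather than a second LIKE test, and SQLite evaluates it as false, so only the “love” rows match.

```python
import sqlite3

# Toy review table (hypothetical rows).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE review (id TEXT, text TEXT);
    INSERT INTO review VALUES
        ('r1', 'I love this place'),
        ('r2', 'I hate waiting'),
        ('r3', 'Love it, hate the parking'),
        ('r4', 'Fine'),
        ('r5', 'Lovely staff');
""")

# One conditional sum per word; a review containing both words counts in both.
love, hate = conn.execute("""
    SELECT SUM(CASE WHEN text LIKE '%love%' THEN 1 ELSE 0 END),
           SUM(CASE WHEN text LIKE '%hate%' THEN 1 ELSE 0 END)
    FROM review
""").fetchone()

# Pitfall: the bare string '%hate%' is not a LIKE test. SQLite casts it to a
# number (0, i.e. false), so this filter behaves like the 'love' test alone.
loose = conn.execute(
    "SELECT COUNT(*) FROM review WHERE text LIKE '%love%' OR '%hate%'"
).fetchone()[0]

print(love, hate, loose)
```

Note that LIKE matches substrings, so “Lovely” counts as “love”, and that SQLite's LIKE is case-insensitive for ASCII letters by default.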

10. Find the top 10 users with the most fans:

SQL code used to arrive at answer:

SELECT name, id, fans
FROM user
ORDER BY fans DESC
LIMIT 10

Copy and Paste the Result Below:
Results
+-----------+------------------------+------+
| name      | id                     | fans |
+-----------+------------------------+------+
| Amy       | -9I98YbNQnLdAmcYfb324Q | 503  |
| Mimi      | -8EnCioUmDygAbsYZmTeRQ | 497  |
| Harald    | --2vR0DIsmQ6WfcSzKWigw | 311  |
| Gerald    | -G7Zkl1wIWBBmD0KRy_sCw | 253  |
| Christine | -0IiMAZI2SsQ7VmyzJjokQ | 173  |
| Lisa      | -g3XIcCb2b-BD0QBCcq2Sw | 159  |
| Cat       | -9bbDysuiWeo2VShFJJtcw | 133  |
| William   | -FZBTkAZEXoP7CYvRV2ZwQ | 126  |
| Fran      | -9da1xk7zgnnfO1uTVYGkA | 124  |
| Lissa     | -lh59ko3dxChBSZ9U7LfUw | 120  |
+-----------+------------------------+------+

Part 2: Inferences and Analysis

1. Pick one city and category of your choice and group the businesses in that city or category by their overall star rating.

Compare the businesses with 2-3 stars to the businesses with 4-5 stars and answer the following questions. Include your code.

i. Do the two groups you chose to analyze have a different distribution of hours?
Answer: No. Each business returned by the query falls in a different category.

Results:
+-----------+------------------+-------+
| city      | category         | stars |
+-----------+------------------+-------+
| Las Vegas | Beauty & Spas    | 2.5   |
| Las Vegas | Restaurants      | 3.0   |
| Las Vegas | Bars             | 3.5   |
| Las Vegas | Health & Medical | 4.0   |
| Las Vegas | Japanese         | 4.5   |
| Las Vegas | Doctors          | 5.0   |
+-----------+------------------+-------+
Code:

SELECT b.city, c.category, b.stars
FROM business b
LEFT JOIN category c ON b.id = c.business_id
WHERE c.category IS NOT NULL
  AND b.city = 'Las Vegas'
  AND b.stars BETWEEN 2.0 AND 5.0
GROUP BY b.stars;

ii. Do the two groups you chose to analyze have a different number of reviews?
Answer: Yes. The review counts differ across the star ratings, as the results below show.

Results:
+-----------+-------+--------------+
| city      | stars | review_count |
+-----------+-------+--------------+
| Las Vegas | 2.0   | 8            |
| Las Vegas | 2.5   | 19           |
| Las Vegas | 3.0   | 355          |
| Las Vegas | 3.5   | 26           |
| Las Vegas | 4.0   | 4            |
| Las Vegas | 4.5   | 10           |
| Las Vegas | 5.0   | 4            |
+-----------+-------+--------------+
Code:

SELECT city, stars, review_count
FROM business
WHERE city = 'Las Vegas'
  AND stars BETWEEN 2.0 AND 5.0
GROUP BY stars;
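
When a query groups by stars but selects non-aggregated columns such as review_count, SQLite returns the value from one arbitrary row per group, so each result row describes a single business rather than the whole group. A possible alternative, sketched below under the assumption of SQLite and with hypothetical toy rows, is to label the two rating bands with CASE and aggregate within each band. The rating_band label is an illustrative name.

```python
import sqlite3

# Toy business table (hypothetical rows, not the Yelp data).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE business (id TEXT, city TEXT, stars REAL, review_count INTEGER);
    INSERT INTO business VALUES
        ('b1', 'Las Vegas', 2.0, 10),
        ('b2', 'Las Vegas', 3.0, 30),
        ('b3', 'Las Vegas', 4.0, 5),
        ('b4', 'Las Vegas', 5.0, 15),
        ('b5', 'Las Vegas', 3.5, 99),   -- falls in neither band
        ('b6', 'Phoenix',   4.5, 50);   -- other city
""")

# Label each business with its rating band, then aggregate per band.
bands = conn.execute("""
    SELECT CASE
               WHEN stars BETWEEN 2.0 AND 3.0 THEN '2-3 stars'
               WHEN stars BETWEEN 4.0 AND 5.0 THEN '4-5 stars'
           END AS rating_band,
           COUNT(*) AS num_businesses,
           AVG(review_count) AS avg_reviews
    FROM business
    WHERE city = 'Las Vegas'
      AND (stars BETWEEN 2.0 AND 3.0 OR stars BETWEEN 4.0 AND 5.0)
    GROUP BY rating_band
    ORDER BY rating_band
""").fetchall()

for row in bands:
    print(row)
```

Each output row then summarizes every business in its band, which makes the 2-3 star and 4-5 star groups directly comparable.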

iii. Are you able to infer anything from the location data provided between these two groups? Explain.
Answer: Yes. I pulled the neighborhood and address information to look for differences between these two groups.
I found that the “Beauty & Spas” and “Restaurants” businesses are both located on Tropicana Ave, yet their review counts differ greatly.
“Beauty & Spas” received only 6 reviews, whereas “Restaurants” received 123.

+-----------+------------------+-------+---------------------+--------------+---------------+-----------------------------+
| city      | category         | stars | hours               | review_count | neighborhood  | address                     |
+-----------+------------------+-------+---------------------+--------------+---------------+-----------------------------+
| Las Vegas | Beauty & Spas    | 2.5   | Saturday|8:00-22:00 | 6            | Eastside      | 3808 E Tropicana Ave        |
| Las Vegas | Restaurants      | 3.0   | Saturday|11:00-0:00 | 123          |               | 5045 W Tropicana Ave        |
| Las Vegas | Bars             | 3.5   | Saturday|0:00-0:00  | 105          | Southwest     | 4785 Blue Diamond Rd        |
| Las Vegas | Health & Medical | 4.0   | Saturday|8:00-12:00 | 16           | Spring Valley | 6070 S Rainbow Blvd, Ste 10 |
| Las Vegas | Japanese         | 4.5   | None                | 3            | Eastside      | 3480 S Maryland Pkwy        |
| Las Vegas | Doctors          | 5.0   | None                | 5            | Anthem        | 2779 W Horizon Ridge Pkwy   |
+-----------+------------------+-------+---------------------+--------------+---------------+-----------------------------+
SQL code used for analysis:

SELECT b.city, c.category, b.stars, h.hours, b.review_count, b.neighborhood, b.address
FROM business b
-- left join the business and category tables on the business id to find the categories
LEFT JOIN category c ON b.id = c.business_id
-- left join the business and hours tables on the business id to find the working hours
LEFT JOIN hours h ON b.id = h.business_id
WHERE c.category IS NOT NULL
  AND b.city = 'Las Vegas'
  AND b.stars BETWEEN 2.0 AND 5.0
GROUP BY b.stars;

2. Group business based on the ones that are open and the ones that are closed. What differences can you find between the ones that are still open and the ones that are closed? List at least two differences and the SQL code you used to arrive at your answer.

i. Difference 1:
Answer: I grouped the businesses in the city “Las Vegas” by open status and compared two of them, “Jacques Cafe” and “Desert Medical Equipment”, which differ in rating.
“Jacques Cafe”, which is closed, has a rating of 4.0, whereas “Desert Medical Equipment”, which is open, has a rating of 5.0.

ii. Difference 2:
Answer: The second difference is that “Desert Medical Equipment” has far fewer reviews than “Jacques Cafe”:
it has received only 4 reviews, whereas “Jacques Cafe” has 168.

+--------------------------+-----------+-------------+-------+----------------------+--------------+------------+
| name                     | city      | category    | stars | hours                | review_count | Open/Close |
+--------------------------+-----------+-------------+-------+----------------------+--------------+------------+
| Jacques Cafe             | Las Vegas | Gluten-Free | 4.0   | Saturday|11:00-20:00 | 168          | CLOSE      |
| Desert Medical Equipment | Las Vegas | Shopping    | 5.0   | Monday|8:00-17:00    | 4            | OPEN       |
+--------------------------+-----------+-------------+-------+----------------------+--------------+------------+
SQL code used for analysis:

SELECT b.name, b.city, c.category, b.stars, h.hours, b.review_count,
       CASE WHEN b.is_open = 1 THEN 'OPEN' ELSE 'CLOSE' END AS "Open/Close"
FROM business b
LEFT JOIN category c ON b.id = c.business_id
LEFT JOIN hours h ON b.id = h.business_id
WHERE h.hours IS NOT NULL
  AND c.category IS NOT NULL
  AND b.city = 'Las Vegas'
GROUP BY b.is_open;
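
Grouping by is_open while selecting plain columns yields one example business per group. To compare the open and closed groups as a whole, aggregates per group may be more informative. The following is only a sketch, assuming SQLite and hypothetical toy rows; it is not the query that produced the results above.

```python
import sqlite3

# Toy business table (hypothetical rows): is_open is 1 for open, 0 for closed.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE business (id TEXT, is_open INTEGER, stars REAL, review_count INTEGER);
    INSERT INTO business VALUES
        ('b1', 1, 5.0, 4),
        ('b2', 1, 4.0, 6),
        ('b3', 0, 4.0, 20),
        ('b4', 0, 3.0, 10);
""")

# One summary row per group: size, average rating, average review count.
summary = conn.execute("""
    SELECT is_open,
           COUNT(*) AS num_businesses,
           AVG(stars) AS avg_stars,
           AVG(review_count) AS avg_reviews
    FROM business
    GROUP BY is_open
    ORDER BY is_open
""").fetchall()

for row in summary:
    print(row)
```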

3. For this last part of your analysis, you are going to choose the type of analysis you want to conduct on the Yelp dataset and are going to prepare the data for analysis.

Ideas for analysis include: Parsing out keywords and business attributes for sentiment analysis, clustering businesses to find commonalities or anomalies between them, predicting the overall star rating for a business, predicting the number of fans a user will have, and so on. These are just a few examples to get you started, so feel free to be creative and come up with your own problem you want to solve. Provide answers, in-line, to all of the following:

i. Indicate the type of analysis you chose to do:
Answer: I am interested in yoga, so I wanted to see how many yoga-friendly places Yelp could help me find. I moved to New York and thought I would explore some new yoga places around me, but mostly I was curious, and I was very excited to do this.

ii. Write 1-2 brief paragraphs on the type of data you will need for your analysis and why you chose that data:

Answer: Since I am looking specifically for “Yoga”, I first needed to search the “category” table to see how many yoga places there are.
Starting from the “category” table, I also wanted to know each place's city, state, stars, name, review count and so on, so I joined the “business” table to the category table with a left join on the business ID, i.e. the business_id of the category table and the id of the business table.
I was also curious about the user reviews of the yoga places, so I brought in the “review” table to pull the “text” column for each yoga business.

iii. Output of your finished dataset:

+--------------------------------------+---------+-------+-------+----------+--------------+------+
| name                                 | city    | state | stars | category | review_count | text |
+--------------------------------------+---------+-------+-------+----------+--------------+------+
| None                                 | None    | None  | None  | Yoga     | None         | None |
| Lifestyles Fitness Personal Training | Tempe   | AZ    | 5.0   | Yoga     | 17           | None |
| None                                 | None    | None  | None  | Yoga     | None         | None |
| None                                 | None    | None  | None  | Yoga     | None         | None |
| None                                 | None    | None  | None  | Yoga     | None         | None |
| None                                 | None    | None  | None  | Yoga     | None         | None |
| None                                 | None    | None  | None  | Yoga     | None         | None |
| None                                 | None    | None  | None  | Yoga     | None         | None |
| The Gym at 99 Sudbury                | Toronto | ON    | 3.0   | Yoga     | 14           | None |
| None                                 | None    | None  | None  | Yoga     | None         | None |
| None                                 | None    | None  | None  | Yoga     | None         | None |
| None                                 | None    | None  | None  | Yoga     | None         | None |
| None                                 | None    | None  | None  | Yoga     | None         | None |
+--------------------------------------+---------+-------+-------+----------+--------------+------+

iv. Provide the SQL code you used to create your final dataset:

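SELECT b.name, b.city, b.state, b.stars, c.category, b.review_count, r.text
FROM category c
LEFT JOIN business b ON c.business_id = b.id
-- reviews link to a business through business_id; joining on r.id never matches, which leaves text empty
LEFT JOIN review r ON b.id = r.business_id
WHERE c.category = 'Yoga';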
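
The choice of join key decides whether any review text comes back. The sketch below assumes SQLite and a review table that carries a business_id column; all rows are hypothetical. It runs the same three-table join twice, once matching reviews on business_id and once on the review's own id.

```python
import sqlite3

# Toy tables (hypothetical rows). 'b9' has a category row but no business row.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE category (business_id TEXT, category TEXT);
    CREATE TABLE business (id TEXT, name TEXT, city TEXT);
    CREATE TABLE review (id TEXT, business_id TEXT, text TEXT);
    INSERT INTO category VALUES ('b1', 'Yoga'), ('b9', 'Yoga'), ('b2', 'Bars');
    INSERT INTO business VALUES ('b1', 'Studio A', 'Tempe'), ('b2', 'Bar B', 'Toronto');
    INSERT INTO review VALUES ('r1', 'b1', 'Great class'), ('r2', 'b2', 'Loud');
""")

QUERY = """
    SELECT b.name, c.category, r.text
    FROM category c
    LEFT JOIN business b ON c.business_id = b.id
    LEFT JOIN review r ON {review_key} = b.id
    WHERE c.category = 'Yoga'
    ORDER BY c.business_id
"""

# Reviews matched to businesses through the foreign key.
by_business_id = conn.execute(QUERY.format(review_key="r.business_id")).fetchall()
# Review ids compared with business ids: nothing matches, so text is NULL.
by_review_id = conn.execute(QUERY.format(review_key="r.id")).fetchall()

print(by_business_id)
print(by_review_id)
```

Because the query starts from category and uses left joins, a category row with no matching business still appears, with None in every business and review column.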

Takeaway: learning SQL is a good first step to take; keep pushing yourself.
