Hidden Gems: Finding the Best Secret Trails in America

There are plenty of reasons why one would want to find solitude in the wilderness, from the therapeutic effects of being immersed in nature, to not wanting to contribute to trail degradation and soil erosion on busier trails.

Now more than ever, the reprieve of the outdoors is greatly needed. But in a post-COVID-19 world, where it can be practically impossible to maintain proper social distancing when passing hikers on a narrow trail, it is especially important to find less frequented trails to hike.

I set out on a mission to use data science and machine learning to find the best little-known trails in America. You can check out the code on my GitHub if you want to jump into the nitty-gritty, or read on for analysis and a list of the hidden gems in your state!

The Approach

If you’re anything like me, before you go anywhere or buy anything, you’re going to read all the reviews. When looking for trails to hike, a popular medium for discovering where to go is AllTrails.com.

When I first approached this project, I wanted to answer the question, “What makes a trail good?” That is, what combination of features and statistics about a trail would lead to it having a high overall rating?

What I pretty quickly found out, though, is that across the 35,000 trails I scraped and analyzed, basically all of them were rated “pretty good”. With an average user rating of 4.2 out of 5 stars and a standard deviation of less than 0.6, it was really hard to distinguish which trails were excellent and which were just okay from their 5-star rating alone.

What did vary hugely across the trails, though, was their popularity, as represented by the total number of reviews each trail had. While the vast majority of trails had only 100 or so reviews, a select few had several thousand! What was making these trails so popular?

I thus pivoted: rather than predicting the rating of a trail, I set out to determine, via a data-driven model, the relationship between the various features of a given trail and its popularity. Having found those commonalities, I could then apply the model to unpopular trails to find which ones check all the same boxes and are likely to be great, even though they haven’t been discovered yet.

Methodology

  1. With Selenium and Beautiful Soup, scrape AllTrails.com to obtain data on about 35,000 trails in the United States. This included the length of each hike, its elevation gain, its location, and a list of all of the features (such as waterfalls, wildflowers, or paving) the trail had.
  2. Clean this data and create a Pandas DataFrame. This included one-hot encoding all of the categorical feature columns as dummy variables.
  3. Use the VADER Sentiment Analysis module to analyze each trail’s text reviews via simple natural language processing and determine a mean composite score.
  4. Use linear regression modeling, including Statsmodels OLS, to determine the relationship between a trail’s features and its popularity.
  5. Perform feature engineering and regularization via LassoCV to reduce multicollinearity amongst those features and optimize the model.
  6. Apply that model to trails described as “lightly trafficked” to find ones that would be expected to be popular based on their combination of features but just haven’t been discovered yet.
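
To make step 2 a little more concrete, here is a minimal sketch of one-hot encoding a trail's feature tags with Pandas. The column names, tag strings, and values are made up for illustration and are not taken from the actual scraped data; the real cleaning code may look quite different.

```python
import pandas as pd

# Hypothetical scraped rows; names, numbers, and tags are illustrative only.
trails = pd.DataFrame(
    {
        "name": ["Trail A", "Trail B", "Trail C"],
        "length_miles": [3.2, 7.5, 1.1],
        "elevation_gain_ft": [450, 2100, 60],
        # Assumed storage format: tags joined with "|" in a single string.
        "tags": ["waterfall|wild flowers", "rocky|scramble|no shade", "paved"],
    }
)

# One 0/1 dummy column per distinct tag.
tag_dummies = trails["tags"].str.get_dummies(sep="|")

# Numeric stats plus dummy columns form the table a regression can consume.
model_table = pd.concat([trails.drop(columns=["tags"]), tag_dummies], axis=1)
print(model_table.columns.tolist())
```

Each tag becomes its own column, so a feature such as "rocky" can receive its own coefficient in the regression.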

Findings

A linear regression model was fit to the trails’ stats, with the number of reviews (and hence popularity) serving as the target variable. The model yielded a list of the features that most influence a trail’s popularity. These included there being a fee, having a high sentiment analysis score, being rocky, and having a scramble and no shade, amongst others.

I interpret those important features like this:

  • A fee: If the most popular trails have a fee to use, this indicates they are likely located inside National Parks. As many National Parks are closed due to COVID, or may be very busy, it is even more important to find alternatives.

  • Sentiment analysis score: Since all trails have roughly the same score out of 5 stars, it’s hard to gather a lot of reliable information about their quality from this rating alone. By using natural language processing to analyze the written reviews themselves, I was able to gain an actually useful metric for how people feel about the trail. The higher the score (on a scale of -1 = very negative to +1 = very positive), the more positively people felt toward the trail, which was super useful in finding hidden gems.

  • Rocky/scramble/no shade: What this says to me is that the very popular trails take place above tree line! It’s on those more difficult hikes with higher elevation gain that you encounter these features. And with higher elevation, you’ll likely get better views! As it turns out, people love these tougher trails.

The R² of this model was optimized to 0.19. Though this isn’t a very high score, the reason is that the relationship between trail features and popularity simply isn’t linear. The residuals plot, showing the difference between the predicted and actual popularity values, demonstrates this pretty clearly: if the relationship were linear, the residuals would all fall in a fairly horizontal band around 0! So what’s actually determining a trail’s popularity, if not having all the right features of a popular trail?
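
As a rough illustration of why a low R² and patterned residuals point to a non-linear relationship, the sketch below fits ordinary least squares with NumPy to two synthetic targets, one truly linear and one curved. The data is invented and `fit_ols` is a hypothetical helper, not the Statsmodels code used in the project.

```python
import numpy as np


def fit_ols(X, y):
    """Least-squares fit with an intercept; returns (coefficients, r_squared, residuals)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    design = np.column_stack([np.ones(len(y)), X])  # prepend an intercept column
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    residuals = y - design @ coef
    ss_res = float(np.sum(residuals ** 2))          # unexplained variation
    ss_tot = float(np.sum((y - y.mean()) ** 2))     # total variation
    return coef, 1.0 - ss_res / ss_tot, residuals


# Synthetic feature values (an assumption, not trail data).
x = np.linspace(-1.0, 1.0, 201)

_, r2_linear, res_linear = fit_ols(x, 2.0 + 3.0 * x)  # target is linear in x
_, r2_curved, res_curved = fit_ols(x, x ** 2)         # target is curved in x

print("R² linear target:", r2_linear)
print("R² curved target:", r2_curved)
print("curved residuals at ends / middle:", res_curved[0], res_curved[100], res_curved[-1])
```

For the curved target the straight line explains almost none of the variation, and the residuals are positive at both ends and negative in the middle rather than scattered evenly around 0.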

My key finding was that AllTrails’ algorithm shows the trails with the most reviews first and foremost, which leads to a form of recursive confirmation bias. If all trails have roughly the same rating, users will turn to the reviews to determine whether a trail is good and will choose one with a lot of reviews, feeding into the loop that makes the very few busiest trails even busier. Meanwhile, other similar trails may have plenty of potential but go neglected.
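
The feedback loop can be sketched with a toy simulation. This is a deliberately simplified model and an assumption on my part: every trail is equally good, each new hiker picks a trail with probability proportional to its current review count, and then leaves one review. It is not AllTrails’ actual ranking algorithm.

```python
import random


def simulate_reviews(n_trails=50, n_hikers=5000, seed=0):
    """Toy rich-get-richer model: hikers favor trails that already have more reviews."""
    rng = random.Random(seed)
    reviews = [1] * n_trails  # every trail starts with a single review
    for _ in range(n_hikers):
        # Choice probability is proportional to the current review counts.
        chosen = rng.choices(range(n_trails), weights=reviews, k=1)[0]
        reviews[chosen] += 1  # the hiker leaves one more review
    return reviews


reviews = simulate_reviews()
ranked = sorted(reviews, reverse=True)
top_share = sum(ranked[:5]) / sum(ranked)
print(f"top 5 of 50 identical trails hold {top_share:.0%} of all reviews")
```

Even though no trail is better than any other in this model, early luck compounds and a handful of trails end up with a disproportionate share of the reviews.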

So What Makes a Trail Popular?

There are tens of thousands of hikes listed on AllTrails.com, but its search algorithm always offers viewers the most popular hikes first. Trails with the most reviews get the most hikers, and hence even more reviews, while lesser-known trails may be just as good but are harder to find on the website, and with so few ratings it is hard to know for sure whether they’ll be good.

So what makes a trail popular? Ultimately, AllTrails does.

It’s time we break out of that feedback loop and find some amazing alternative hikes where we can avoid the crowds. But how will you know if a trail is going to be worth your time? Well, I used machine learning to do that work for you.

I fit the best model on a subset of trails designated as “lightly trafficked”, and the R² for these trails was 0.08. This was actually encouraging, considering that these are specifically trails which aren’t popular but which, according to the model and given their features, should be.
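
One way to turn that idea into a ranking is sketched below: score each lightly trafficked trail by how far its predicted popularity exceeds its actual review count. The coefficients, intercept, trail records, and the gap-based ranking rule are all hypothetical choices made for illustration; they are not the fitted model or the exact selection rule from this project.

```python
def predicted_popularity(trail, intercept, coefs):
    """Linear-model prediction: intercept plus the weighted sum of feature values."""
    return intercept + sum(w * trail.get(name, 0.0) for name, w in coefs.items())


def hidden_gems(trails, intercept, coefs, top_n=3):
    """Rank lightly trafficked trails by predicted popularity minus actual reviews."""
    light = [t for t in trails if t["traffic"] == "light"]
    scored = [
        (predicted_popularity(t, intercept, coefs) - t["reviews"], t["name"])
        for t in light
    ]
    return [name for gap, name in sorted(scored, reverse=True)[:top_n]]


# Hypothetical coefficients (expected reviews contributed by each feature).
coefs = {"fee": 120.0, "sentiment": 300.0, "rocky": 80.0, "scramble": 60.0, "no_shade": 40.0}
intercept = 50.0

# Hypothetical trail records.
trails = [
    {"name": "Quiet Ridge", "traffic": "light", "reviews": 30,
     "sentiment": 0.9, "rocky": 1, "scramble": 1, "no_shade": 1},
    {"name": "Creek Loop", "traffic": "light", "reviews": 40, "sentiment": 0.5},
    {"name": "Famous Falls", "traffic": "heavy", "reviews": 4000, "fee": 1, "sentiment": 0.8},
    {"name": "Mesa Spur", "traffic": "light", "reviews": 25, "sentiment": 0.7, "rocky": 1},
]

print(hidden_gems(trails, intercept, coefs, top_n=2))
```

Trails that score well on the features popular trails share, yet have few reviews, rise to the top; heavily trafficked trails are excluded before ranking.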

A potential area of future work for this project is fitting a polynomial features model instead of a linear one. Early exploration of this method yielded a promising R² improvement to 0.26, but it did induce some feature collinearity by duplicating features, which would need to be engineered out. I’m looking forward to continuing this work once I have more machine learning tools at my disposal! But I’m absolutely thrilled to present you with this list of the best lesser-known trails in America as my very first end-to-end data science project.
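
One plausible way a polynomial expansion duplicates features (an assumption on my part about what happened here) is that squaring a 0/1 dummy column gives back the same column. The sketch below builds degree-2 terms by hand for a few made-up rows and lists the terms that come out identical; the helper names are hypothetical.

```python
from itertools import combinations_with_replacement


def degree_two_terms(rows, names):
    """Return {term_name: column} for the original features plus all degree-2 products."""
    columns = {n: tuple(r[n] for r in rows) for n in names}
    for a, b in combinations_with_replacement(names, 2):
        term = f"{a}^2" if a == b else f"{a}*{b}"
        columns[term] = tuple(x * y for x, y in zip(columns[a], columns[b]))
    return columns


def duplicated_terms(columns):
    """Pairs of term names whose columns are identical, i.e. perfectly collinear."""
    names = list(columns)
    return [
        (p, q)
        for i, p in enumerate(names)
        for q in names[i + 1:]
        if columns[p] == columns[q]
    ]


# Made-up rows: two 0/1 dummy features and one continuous feature.
rows = [
    {"rocky": 1, "fee": 0, "length": 3.2},
    {"rocky": 0, "fee": 1, "length": 7.5},
    {"rocky": 1, "fee": 1, "length": 1.1},
    {"rocky": 0, "fee": 0, "length": 4.0},
]

expanded = degree_two_terms(rows, ["rocky", "fee", "length"])
print(duplicated_terms(expanded))
```

Dropping the squared terms for dummy columns, or letting a regularized model zero them out, would be one way to engineer this collinearity back out.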

Hike the Trails

Check out the Hidden Gems in your State below!

Translated from: https://towardsdatascience.com/hidden-gems-finding-the-best-secret-trails-in-america-d9203e8ad073
