

Editor’s note: The Towards Data Science podcast’s “Climbing the Data Science Ladder” series is hosted by Jeremie Harris. Jeremie helps run a data science mentorship startup called SharpestMinds. You can listen to the podcast below:

Data science is about much more than Jupyter notebooks, because data science problems are about more than machine learning.


What data should I collect? How good does my model need to be to be “good enough” to solve my problem? What form should my project take for it to be useful? Should it be a dashboard, a live app, or something else entirely? How do I deploy it? How do I make sure something awful and unexpected doesn’t happen when it’s deployed in production?

None of these questions can be answered by importing sklearn and pandas and hacking away in a Jupyter notebook. Data science problems take a unique combination of business savvy and software engineering know-how, and that’s why Emmanuel Ameisen wrote a book called Building Machine Learning Powered Applications: Going from Idea to Product. Emmanuel is a machine learning engineer at Stripe, and formerly worked as Head of AI at Insight Data Science, where he oversaw the development of dozens of machine learning products.

Our conversation focused on the missing links in most online data science education: business instinct, data exploration, model evaluation and deployment. Here are some of my favourite takeaways:

  • Data exploration is a critical step in the data science lifecycle, but its value is really hard to quantify. How would you know if someone failed to find interesting insights in a dataset because there weren’t any insights to be found, or because they’re not skilled enough for the job? Companies tend to assess employees on the aspects of job performance that are easy to measure, and that bias means that data exploration is often de-prioritized. A good way around this is for companies or teams to carve out time explicitly for open-ended exploration tasks, so that data scientists don’t shy away from doing them when they’re needed.
  • One aspect of productionization that’s often undervalued by new data scientists and machine learning engineers is the importance of model robustness. What happens if someone tries to generate a socially unacceptable output from your model? What if your model encounters an input that it can’t predict with high confidence? Sometimes, adding a layer of rules that prevents models from producing outputs when a compromising or questionable user input is provided can be mission-critical.
  • Many people make the mistake of thinking about model optimization in a “top-down” manner. If their first model doesn’t work, they decide to use another (usually more complicated) model, rather than investigating the kinds of errors their model is making and trying to engineer features or design heuristics that might help tackle those errors. That’s a problem because most data science problems can only be solved by carefully examining the decision surface of a faulty model, and escalating model complexity rather than resorting to feature engineering on a simpler model tends to make this task harder, not easier.
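
To make the “layer of rules” idea from the robustness point more concrete, here is a minimal sketch in plain Python. It is not taken from the podcast or from Emmanuel’s book. The names `guarded_predict`, `GuardedResult`, `BLOCKED_TERMS`, `MIN_CONFIDENCE` and `toy_model` are all hypothetical, and the blocklist and the 0.8 threshold are arbitrary values picked for illustration. The sketch assumes the model exposes a function that maps an input string to class probabilities.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional

# Hypothetical blocklist and threshold, chosen only for illustration.
BLOCKED_TERMS = {"blockedword"}
MIN_CONFIDENCE = 0.8


@dataclass
class GuardedResult:
    label: Optional[str]  # None when the guard declines to answer
    reason: str           # "ok", "blocked_input" or "low_confidence"


def guarded_predict(
    text: str,
    predict_proba: Callable[[str], Dict[str, float]],
    blocked_terms=BLOCKED_TERMS,
    min_confidence: float = MIN_CONFIDENCE,
) -> GuardedResult:
    """Wrap a model with two simple rules that can suppress its output."""
    # Rule 1: refuse inputs containing a blocked term, before the model runs.
    tokens = set(text.lower().split())
    if tokens & set(blocked_terms):
        return GuardedResult(None, "blocked_input")

    # Rule 2: refuse when the top class probability is below the threshold.
    probs = predict_proba(text)
    label, confidence = max(probs.items(), key=lambda kv: kv[1])
    if confidence < min_confidence:
        return GuardedResult(None, "low_confidence")

    return GuardedResult(label, "ok")


def toy_model(text: str) -> Dict[str, float]:
    # Stand-in for a real classifier: confident only when it sees "great".
    if "great" in text.lower():
        return {"positive": 0.95, "negative": 0.05}
    return {"positive": 0.55, "negative": 0.45}


if __name__ == "__main__":
    for query in ["this is great", "hmm", "great blockedword"]:
        print(query, "->", guarded_predict(query, toy_model))
```

Returning an explicit reason rather than a bare `None` is a design choice in this sketch: it lets the calling application decide what to show the user in each case, and it makes refusals easy to log and review later.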

You can also follow Emmanuel on Twitter here to keep up with his work, and me here.

Translated from: https://towardsdatascience.com/beyond-the-jupyter-notebook-how-to-build-data-science-products-50d942fc25d8




