This article describes how we built a comprehensive data platform to conduct revenue forecasting at Airbnb. It focuses on data warehousing and workflow orchestration management, and shares the solutions we developed for the challenges we encountered along the way.

Authors: Zi (Jerry) Chu, Markus Schmaus, Didi Shi, Cillian Murphy, Kai Brusch, Wendy Shi

1. Background

Airbnb is a leading online marketplace for arranging or offering lodging. It had over 300M nights booked in 2019, across listings in more than 190 countries, and counting. We, the Financial Forecasting team, carry the vision that a reliable forecast is essential to uncovering future revenue and growth opportunities.

We’ve reached a very high level of forecasting accuracy on a revenue volume on the order of magnitude of billions of US dollars. This achievement is powered by Delphi, an engineering and scientific project developed by our team. Delphi provides insights into business trends at Airbnb, which are used in a variety of ways, such as helping Finance allocate resources, guiding investment decisions, and serving as a measuring stick for performance.

Figure 1. Architecture of Delphi

Figure 1 illustrates the architecture of Delphi with the following key components:

  • Data Platform: it has a set of data warehouses to collect training data from different query engines, including Hive, Presto and Druid. It uses Airflow to construct a complex directed acyclic graph (DAG) to manage workflow orchestration. More specifically, Machine Learning models are created from training data and R libraries. The models power a metric graph consisting of layers of metrics, such as users, bookings, and revenue. These models predict Airbnb business metrics and generate forecasting data.

  • Compute Engine: we use a hybrid forecasting approach of interpolation and extrapolation. Simply speaking, Machine Learning models provide raw data as the baseline (interpolation), while our data scientists modify some business trends with human expertise (extrapolation). This component is Delphi’s brain. We create a large Python object in memory, and encapsulate all the data and methods into it. In this way, the user can modify the raw forecasting data in memory on the fly. From the perspective of scientific computation, this is a very complex procedure. In the graph of different layers of metrics, when we adjust the forecast data of one metric, we also need to recalculate all of its downstream metrics (a minimal sketch of this recalculation follows this list). The adjustment succeeds only if the forecast calculations of all the involved metrics converge. The Compute Engine leverages Python scientific libraries including Numpy and Pandas. We also use Cython as a wrapper between Python code and R code. After compilation, Cython code can achieve C-like performance, which greatly speeds up the in-memory calculations.

  • Web Service: it’s not desirable for our users to install the Python tool with a long list of dependencies, especially since many of them are non-technical colleagues. To make Delphi easily accessible across the company, we built the Web Service running on Kubernetes instances. (1) Front-end: Javascript running within the React framework. (2) Back-end: each Kubernetes instance runs several Flask Web applications, and each Flask application runs a Delphi service process.
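
To make the metric-graph recalculation concrete, below is a minimal sketch of the idea, assuming a simple dict-based graph of metrics; the real Compute Engine holds much richer in-memory objects and also verifies that the recalculation converges.

from collections import defaultdict, deque


def topological_order(parents):
    """Order metrics so that every metric comes after the metrics it is derived from."""
    children = defaultdict(list)
    indegree = {m: len(p) for m, p in parents.items()}
    for metric, inputs in parents.items():
        for p in inputs:
            children[p].append(metric)
    ready = deque(m for m, d in indegree.items() if d == 0)
    order = []
    while ready:
        metric = ready.popleft()
        order.append(metric)
        for child in children[metric]:
            indegree[child] -= 1
            if indegree[child] == 0:
                ready.append(child)
    return order, children


def apply_adjustment(forecasts, parents, formulas, adjusted_metric, new_series):
    """Overwrite one metric's forecast and recompute every metric derived from it."""
    order, children = topological_order(parents)
    # Collect all metrics transitively downstream of the adjusted one.
    downstream, stack = set(), [adjusted_metric]
    while stack:
        for child in children[stack.pop()]:
            if child not in downstream:
                downstream.add(child)
                stack.append(child)
    forecasts[adjusted_metric] = new_series
    for metric in order:
        if metric in downstream:
            # formulas[metric] combines the parent series, e.g. revenue = bookings * avg_fee
            forecasts[metric] = formulas[metric](*(forecasts[p] for p in parents[metric]))
    return forecasts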

We plan to launch a series of tech articles explaining the different key components of Delphi, and this article focuses on the Data Platform. Before diving into the tech details, we want to share a few fun facts about Delphi.

  • Located in upper central Greece, Delphi hosts the Sanctuary of Apollo, the site of the ancient Oracle.
  • The word Delphi is likely best known for the Oracle of Delphi. Our Forecasting project was named after Delphi, due to aphorisms like “know thyself” (Greek: γνῶθι σεαυτόν) and “foresee the future” (Greek: Ορα το μελλον).

2. Challenges

Delphi forecasts more than 100 business metrics at Airbnb, ranging from users and bookings to fees and revenue. Not surprisingly, it consumes a large variety of upstream data sources developed and maintained by different teams across the company.

Having access to good-quality data is essential for forecasting. We met the following challenges while building the data platform:

  • Consistency: the data needs to accurately reflect the metrics that are reported internally and externally. The reality is that different teams (data owners) may have slightly different definitions of a metric. In other words, how can we guarantee consistent data definitions?

  • Speed: it’s advantageous if the data queries are fast, to avoid long run times for the forecast update. We need to balance the tradeoff between speed and resources. For example, Druid supports pre-aggregation very well, but it’s sensitive to high-cardinality dimensions. We need fallback logic that lets queries with largely different resource requirements switch among different query engines, including Druid, Presto and Hive.

  • SLA: in order to build the forecasting pipeline on a reliable time frame, the data needs to be accessible when the forecast needs to be updated. Different data sources are generated by different data pipelines, and some data lands later than others. How can we guarantee the SLA of forecasting data?

  • Single logic and dynamic DAG: the DAG plays the role of data supplier, generating forecast data consumed by the Compute Engine. These two parties need to agree on what data to generate. Instead of duplicating logic in both of them, we adopt the design philosophy of single logic, which resides in the Compute Engine, the brain of Delphi. When the configuration of the Compute Engine changes (e.g., a new metric is added and the metric graph changes), the structure of the DAG needs to update accordingly. How can the Compute Engine inform Airflow to support a dynamic DAG, so the two parties stay in sync?

  • Limiting overhead: overhead can hide in many places in a complex data platform, such as the scheduler delay between one job completing and the next job starting, or the cost of saving and loading data. Such little things can add up and slow down the pipeline.

We address these challenges with our solutions and reasoning in the following sections.

3. Data Warehousing

3.1. Data Schema

In the design stage of the Delphi data warehouse, one of the major decisions to make is the data schema, which should suit the data modelling at Airbnb. Let’s give a typical example of a booking on the Airbnb platform. Guest Alice from the United States books a listing in Germany on 2020-01-01, for a 3-night stay starting from 2020-06-01. In this case, the date delta is the interval between the travel date and the booking date, as bookings are usually made in advance. To store the summation of such booking events on a daily basis, we need the following dimensions:

  • Origin geo of guest: 27 values at the country-group level
  • Destination geo of listing: 80 values at the market level
  • Guest type: 10 values (i.e., past booker, recent sign-up, etc)
  • Date: historical dates of the past 8 years
  • Date delta: 400 future dates

A full table containing all the above dimensions would end up with around 25 billion records (27 × 80 × 10 × ~2,900 historical dates × 400 date deltas ≈ 25 billion). It’s definitely not a practical schema. Our improvement is inspired by the Snowflake schema: a logical arrangement of tables in a multidimensional database such that the entity relationship diagram resembles a snowflake shape. This schema is represented by centralized fact tables connected to multiple dimensions. Figure 2 centers on a fact table, which consists of atomic events, surrounded by a series of dimension tables that bring in dimensional data. The joins are all done via foreign keys (i.e., unique IDs).

Figure 2. A well-known example of Snowflake schema
(Graph source: https://en.wikipedia.org/wiki/Snowflake_schema)

Instead of creating Delphi’s own Snowflake schema from scratch, we are lucky to be able to use the Minerva Metrics framework developed by the Metrics Infra team at Airbnb (refer to this tech talk for more details). This framework enables the user to define metrics in one place, input strong metadata for them, and access them wherever needed. Minerva uses the same methodology as Snowflake, but renames a few terms:

  • Fact tables are renamed as Event Sources; metrics are based on events defined in event sources.
  • Dimension tables are renamed as Dimension Sources.
  • Foreign keys are renamed as Subjects.

The schema of bookings inside Minerva is illustrated by Figure 3. The bookings metric is defined in an Event Source. It uses a few dimensions, and each dimension is defined in a Dimension Source. For example, dim_origin_geo is defined by the origin_geo.yaml Dimension Source.

Figure 3. Example of Event Sources and Dimension Sources

Figure 4 gives a close look at how the data used by Forecasting is defined inside the bookings Event Source.

  • Metrics: this Event Source provides a set of booking-related metrics, and the one used by Delphi is ‘bookings’.

  • Dimension_sets: every project (or data pipeline) configures its own set by choosing the dimensions it needs. Delphi defines a block called ‘forecasting’. The dimensions are defined separately in their own Dimension Sources. These dimensions can be shared across different Event Sources, following the design philosophy of define-once-and-use-everywhere.

Figure 4. Definition of bookings Event Source

As Airbnb grows, different teams (data owners) under different organizations may have slightly different definitions of a metric. However, forecasting data needs to accurately reflect the metrics that are reported internally and externally. Minerva provides the source of truth for metric definitions across the company, and solves the challenge of inconsistent metric definitions.

A short conclusion of this section: we use Minerva because

  • It helps Delphi define a Snowflake schema of metrics and dimensions
  • It provides consistent metrics definitions

3.2. Data Warehouse Architecture

Figure 5. Architecture of Delphi Data Warehouse

Figure 5 illustrates the architecture of the Delphi Data Warehouse. The top half is the Minerva Metrics Framework, containing:

  • Metrics Configs layer, where we define the data schema for Delphi, including appropriate Event Sources and Dimension Sources.
  • Computation pipeline: the metrics and dimensions used by Delphi get computed and joined together.

The bottom half is the Data Access Layer of Delphi, containing:

  • Denormalized Hive and Presto views created on top of Minerva data
  • Druid datasources ingested from Hive
  • Data Warehouse Clients that support different query engines, including Druid, Presto and Hive

3.3. Support for Multiple Query Engines

Delphi imposes some demanding requirements on data collection, including

  • Speed: it’s advantageous if the data queries are fast to avoid long run times for the forecast update.

  • Massive data volume: our Machine Learning models need training data from the past 8 years. The large time window, coupled with some other high-cardinality dimensions, explodes the data volume. Recall the extreme example of 25 billion records in the imaginary bookings table.

Nobody is more familiar with the data usage than the data owner. Thus, we push ourselves to figure out the solutions.

In terms of improving speed, we realized that the finest granularity of training data required by the forecasting models is the daily summation. In other words, our queries are of the aggregation style (‘group by’ queries):

SELECT origin_geo, destination_geo, guest_type, ds, sum(value)
FROM forecasting.bookings_view
WHERE ds BETWEEN '2012-01-01' AND '2020-05-01'
GROUP BY 1,2,3,4;

Code block 1: SQL-style query

Combining the data schema defined within Minerva (metrics and dimensions) and the aggregation query style, Druid appears to be an ideal query engine for Delphi. By design, Druid works best with event-oriented data, and supports aggregation very nicely.

relations:
  - type: DataWarehouse
    client_type: druid

Code block 2: config of DataWarehouse task

On the other side, the massive data volume requires a tradeoff between speed and resources. For example, Druid is sensitive to high-cardinality dimensions: an expensive groupby query may fail due to the memory limits of the Druid cluster. Thus we need fallback logic to support queries with largely different resource requirements. More specifically, the Delphi Data Warehouse should enable a query to switch among different query engines. The implementation contains:

  • Firstly, according to the query speed, we define the preference order as Druid, Presto and Hive. If a query cannot succeed with a higher-preference engine, it downgrades to a lower-preference one. The client type of a DataWarehouse task is easily specified in the Delphi yaml configuration (a minimal sketch of this fallback appears after Code block 3).
  • Secondly, the data sources are created for every query engine. According to Figure 5, the underlying data generated by Minerva is stored as Hive tables. Denormalized views are created for Hive and Presto, respectively. Then data is ingested from Hive to Druid. Thus, the storage is mainly consumed by base Hive tables and Druid segments.
  • Thirdly, a query is rendered for the different syntaxes of the query engines. The difference between Presto and Hive is subtle, as they both follow SQL style. A Druid query looks very different, as it follows JSON style. Each client of the Delphi Data Warehouse uses a template to assemble metrics, dimensions and syntax into a query.

{
    'metrics': ['bookings'],
    'dimensions': ['dim_origin_geo', 'dim_destination_geo', 'dim_guest_type'],
    'start_date': '2019-01-01',
    'end_date': '2019-01-21',
    'aggregation_granularity': 'D',
    'truncate_incomplete_trailing_data': True,
    'truncate_incomplete_leading_data': True,
    'include_all_value_for_dimensions': False,
}

Code block 3: native JSON-style query
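
To make the fallback behavior concrete, here is a minimal sketch of the preference-order downgrade, assuming hypothetical Druid/Presto/Hive client wrappers that each render the query spec into their own syntax and return a Pandas DataFrame; the production clients differ in detail.

class QueryEngineError(Exception):
    """Raised by a client when its engine cannot complete the query."""


ENGINE_PREFERENCE = ["druid", "presto", "hive"]   # fastest engine first


def run_with_fallback(query_spec, clients):
    """Try the query on each engine in preference order, downgrading on failure.

    query_spec: the metrics/dimensions/date-range request (see Code block 3).
    clients: dict mapping engine name -> client object exposing run(query_spec).
    """
    last_error = None
    for engine in ENGINE_PREFERENCE:
        client = clients.get(engine)
        if client is None:
            continue
        try:
            # Each client renders query_spec into its own syntax
            # (Druid JSON vs. Presto/Hive SQL) before executing it.
            return client.run(query_spec)
        except QueryEngineError as err:
            last_error = err   # e.g. a Druid groupby hit the cluster's memory limit
    raise RuntimeError("All query engines failed") from last_error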

3.4. Case Study of Druid Client

Among the clients of Druid, Presto and Hive, we conduct a case study of Druid, as it raises a few interesting challenges we’d like to share with our readers. We recommend this tech blog, which provides good insights into using Druid to conduct big-data analytics.

At Airbnb, we don’t directly query the Druid cluster due to concerns about data access and integrity. Instead, our Metric Infra team (also the owner of Minerva) set up an API service called Minerva Inquiry for internal users to query Minerva data in Druid. Delphi’s Druid client posts queries as JSON payloads to the API endpoints. The API server runs the query on Druid, and returns the result in the format of a Pandas DataFrame.

Delphi needs to collect training data from the past eight years, and this window is too large for Druid to aggregate in one query. We have to split the entire window into much smaller ones. The query window size is an adjustable argument, set to 20 days by default.

Doing some simple math, the range of eight years splits into around 150 small windows. Delphi queries about 20 metrics. Without any concurrency control, Delphi might send 3,000 queries to the groupby API endpoint at the same time. That is certainly not a responsible usage pattern.

In practice, we combine the following techniques to build a scalable and resilient query client:

  • Divide and conquer
  • Concurrency
  • Retry
Figure 6. Divide & Conquer of Druid client

Figure 6 illustrates the logic breakdown of the Druid client.

  • The client maintains a sequential queue for the entire query range of eight years. Each of the ~150 queries covers a 20-day window (an adjustable argument), in chronological order.
  • We set the pool size of concurrent threads to 50 (an adjustable argument). The client implements an asynchronous dispatch function which leverages the Python concurrent.futures library to schedule parallel threads (sketched below).

  • At the beginning, the first 50 queries in the queue are assigned to the pool.
  • Each thread sends its query to the Minerva Inquiry API. It can retry up to 5 times (an adjustable argument). It receives the result in the format of a Pandas DataFrame covering the 20-day window.
  • When any thread slot is freed, the next query in the queue is assigned. For example, when q2 completes, q51 is assigned to take over its slot. From the standpoint of the query queue, the queries are executed in chronological order.
  • After the completion of all the queries, the respective DataFrames are concatenated into a single one containing the final result for the entire 8-year query range.
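
The sketch below illustrates this divide-and-conquer client under a few assumptions: the post_query helper standing in for the HTTP call to the Minerva Inquiry API is a placeholder, and the payload fields mirror Code block 3; the production client adds more bookkeeping.

import concurrent.futures
from datetime import timedelta

import pandas as pd

WINDOW_DAYS = 20     # adjustable query window size
POOL_SIZE = 50       # adjustable number of concurrent threads
MAX_RETRIES = 5      # adjustable retry count per window


def split_windows(start, end, days=WINDOW_DAYS):
    """Split [start, end] into consecutive windows of at most `days` days."""
    windows, cursor = [], start
    while cursor <= end:
        window_end = min(cursor + timedelta(days=days - 1), end)
        windows.append((cursor, window_end))
        cursor = window_end + timedelta(days=1)
    return windows


def post_query(payload):
    """Placeholder for the HTTP POST to the Minerva Inquiry groupby endpoint.

    The real client sends the JSON payload and converts the response into
    a Pandas DataFrame.
    """
    raise NotImplementedError


def query_window(metric, dimensions, window):
    """Query one small window, retrying up to MAX_RETRIES times."""
    start, end = window
    payload = {
        "metrics": [metric],
        "dimensions": dimensions,
        "start_date": start.isoformat(),
        "end_date": end.isoformat(),
        "aggregation_granularity": "D",
    }
    for attempt in range(MAX_RETRIES):
        try:
            return post_query(payload)
        except Exception:
            if attempt == MAX_RETRIES - 1:
                raise


def fetch_metric(metric, dimensions, start, end):
    """Fetch the full date range by fanning the windows out to a bounded thread pool."""
    windows = split_windows(start, end)
    # Windows are dispatched chronologically; a freed slot picks up the next one.
    with concurrent.futures.ThreadPoolExecutor(max_workers=POOL_SIZE) as pool:
        frames = list(pool.map(lambda w: query_window(metric, dimensions, w), windows))
    # Concatenate the per-window DataFrames into the final 8-year result.
    return pd.concat(frames, ignore_index=True)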

In the end, we were rewarded with a solid performance gain.

  • After converting the majority of Forecasting queries from the SQL style (executed by Presto/Hive) to the Druid JSON style, the accumulated query time of the entire data warehouse dropped by 40%.
  • At the individual query level, we’ve observed substantial time reductions (around 50% to 70%) for queries which used to run more than 20 minutes on Presto or Hive.

3.5. Tracking SLA

In order to build the forecasting pipeline in a reliable time frame, it’s important to have service-level agreements (SLAs) on upstream data sources. Delphi depends on a wide range of data sources generated by different data pipelines. A practical challenge is that some of the data sources aren’t consistently available each day. Thanks to SLA Tracker, an internal data tool built by the Analytics Infra team at Airbnb, we now have visibility into the landing times of upstream data sources, enabling monitoring and improvement of SLAs.

Figure 7. SLA Tracker of Druid data sources used by Delphi

Figure 7 shows a screenshot of the SLA Tracker built for Delphi to track its Druid data sources. SLA Tracker uses an innovative event stream for the landing times of data objects, so it provides statistics on SLA performance (i.e., the p10 and p90 landing times over the past 90 days). Beyond the binary result of whether a data source is late today, we now have more insight into its historical SLA performance. This enables us to locate the weak links in our data pipelines.
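
As a back-of-the-envelope illustration (not how SLA Tracker is implemented), the p10/p90 statistics can be thought of as simple quantiles over a history of landing times:

import pandas as pd

# Hypothetical history of landing times (hours after UTC midnight) per data source.
landings = pd.DataFrame({
    "data_source": ["bookings_druid", "bookings_druid", "bookings_druid"],
    "ds": ["2020-05-01", "2020-05-02", "2020-05-03"],
    "landed_at_hour": [7.5, 9.2, 8.1],
})

stats = (
    landings.groupby("data_source")["landed_at_hour"]
    .quantile([0.10, 0.90])
    .unstack()
    .rename(columns={0.10: "p10_landing_hour", 0.90: "p90_landing_hour"})
)
print(stats)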

4. Workflow Orchestration Management

Delphi’s Data Platform conducts a logical flow of tasks to build Machine Learning models that forecast business metrics, in the following stages:

  • Running data warehouse tasks to collect training data for metrics
  • Building Machine Learning models via R libraries
  • Forecasting business metrics and generating forecasting data

Delphi forecasts more than one hundred metrics. They form a nested metric graph consisting of layers of metrics, meaning some metrics are derived from others. Managing such a complex workflow orchestration is challenging. Fortunately, Delphi can use Airflow to construct a directed acyclic graph (DAG) for management. Airflow is an open-source workflow management platform started at Airbnb. Inside Airbnb, the Workflow Orchestration team owns the Airflow platform, and uses a repository named Data to host the code of individual DAGs. Thus, Delphi has a dedicated DAG folder in the Data repository to schedule and run the Forecasting DAG.

Airbnb’s internal Airflow platform solves our main pain point of workflow orchestration management, but it also raises the following challenges for us:

  • Single logic and dynamic DAG
  • Executing different types of code in a single Airflow operator
  • Limiting overhead

We elaborate on each of these challenges and our solutions in the following sub-sections.

4.1. Single Logic and Dynamic DAG

In Figure 1, the DAG and the Compute Engine pair up as data supplier and consumer. The DAG generates actuals and forecast data consumed by the Compute Engine. These two parties need to agree on what data to generate and consume. This leads to a natural question: how should we host the control logic? Delphi is our main repository, and Airbnb’s Data repository requires each DAG to have an individual folder in it for global management. This situation leaves us with two options:

  • Option 1: duplicating logic in both repositories
  • Option 2: adopting the design philosophy of single logic, which resides inside the Compute Engine, the brain of the Delphi repository

Option 1 is costly and error-prone, as we would need to maintain duplicate logic across the Delphi repository and the Data repository. Here are a few lessons learned from our daily operations.

  • Multiple versions of logic: Delphi has a set of code branches to support the different stages of a model’s life cycle, including the DS feature branch, candidate branch, staging branch, and finally the production branch. We conduct a DAG run on each of the code branches (i.e., a staging DAG on the staging code branch). If the model performance is validated by our data scientists, it advances into the next stage. It’s hard to maintain duplicates of multiple versions of logic in the Data repository.
  • Code convergence: the Airflow cluster has hundreds of AWS instances to run tasks. When we make a code change in Delphi, we need to wait for the Airflow instances to pull the latest Delphi code. Converging takes time (usually 30 minutes or more), and a small percentage of instances may fail to converge, meaning they still use out-of-date Delphi code. This raises tricky errors that are very difficult to trace.

Option 2 eliminates the above drawbacks, but creates a new problem of keeping the two parties synced. Figure 8 illustrates that, when the configuration of the Compute Engine changes (i.e., a new metric is added and the metric graph changes), the structure of the DAG needs to update accordingly.

Figure 8. The DAG structure change caused by Delphi config change

How could Delphi inform Airflow to support dynamic DAGs? We use the method illustrated by Figure 9.

Figure 9. Passing control logic from Delphi to Airflow

After we commit a code change in the Delphi repo and get ready to run a DAG, the Compute Engine module generates two types of JSON-style configurations and uploads them to S3 buckets. We put a helper function in the Forecasting DAG folder under the Data repo to process the configurations passed from Delphi. In this way, we avoid duplicating control logic in the Data repo.

  • DAG-level configuration: it provides enough information for the Airflow scheduler to parse a DAG object. This object contains the complete DAG structure, including Airflow tasks and their corresponding Airflow operators, and the interdependency between them.

  • Task-level configuration: when the DAG is about to execute an Airflow task, it receives a dedicated config to instruct the Airflow operator how to execute the task, including what Delphi code to use, and what command to run the code.
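
Below is a minimal sketch of how the helper function might turn a DAG-level configuration into an Airflow DAG object. The JSON shape, bucket/field names, and the placeholder callable are illustrative assumptions rather than the production format, and the Airflow import path varies by version.

# Illustrative shape of the DAG-level configuration uploaded to S3 by Compute Engine:
# {
#   "dag_id": "delphi_forecasting",
#   "tasks": [
#     {"task_id": "dw_bookings",    "code_type": "python", "upstream": []},
#     {"task_id": "model_bookings", "code_type": "r",      "upstream": ["dw_bookings"]}
#   ]
# }
import json
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonVirtualenvOperator


def run_delphi_task(task_id):
    """Placeholder callable; section 4.2 sketches the real dispatch logic."""
    print(f"executing Delphi task {task_id}")


def build_dag_from_config(config_text):
    """Helper in the Forecasting DAG folder: parse the config passed from Delphi into a DAG."""
    config = json.loads(config_text)
    dag = DAG(dag_id=config["dag_id"], start_date=datetime(2020, 1, 1), schedule_interval=None)
    operators = {}
    for task in config["tasks"]:
        operators[task["task_id"]] = PythonVirtualenvOperator(
            task_id=task["task_id"],
            python_callable=run_delphi_task,
            op_kwargs={"task_id": task["task_id"]},
            requirements=["boto3", "pandas"],   # illustrative dependency list
            system_site_packages=False,
            dag=dag,
        )
    # Re-create the interdependencies described in the configuration.
    for task in config["tasks"]:
        for upstream_id in task["upstream"]:
            operators[upstream_id] >> operators[task["task_id"]]
    return dag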

4.2. Executing Different Types of Code

Delphi is primarily a Python codebase, assisted by Cython and R. Executing Delphi code of different types on Airflow raises the following challenges:

  • How should we enable a single Airflow operator to execute different types of code? For example, executing an R script is very different from calling a Python module.
  • Where should we install the variety of libraries (Python/Cython/R) required by Delphi code? If we installed them on the Airflow instances, they would increase the convergence time, and this is not fair to other DAGs which don’t need them. Besides, they may introduce version conflicts with existing libraries. Trying to be a good citizen in the Airflow community, we should minimize the exposure to others.

We leverage the Airflow PythonVirtualenvOperator, which provides solutions to the above challenges. The right part of Figure 9 illustrates how a Delphi task is executed inside a PythonVirtualenvOperator. An Airflow worker (AWS instance) has a few cgroups (a Linux kernel feature similar to Docker and Kubernetes instances). When the DAG schedules an Airflow task, the task is assigned to an available cgroup and follows this procedure:

  • a PythonVirtualenvOperator is initialized
  • the task-level configuration file is downloaded from S3 and parsed
  • all the required libraries are installed via the PIP package installer
  • the execution command is decided by the code type:

Python code: python3 -m delphi.compute_engine.execute

Cython code: python3 setup.py build_ext --inplace

R code: called subprocess-style, via Popen.communicate()
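
For illustration, a callable along the following lines could implement the dispatch (replacing the placeholder in the previous sub-section's sketch); it is passed as the python_callable of the PythonVirtualenvOperator, and the S3 location, config fields, and the Rscript entry point are assumptions rather than Delphi's actual conventions.

def run_delphi_task(config_s3_bucket, config_s3_key):
    """Runs inside the operator's virtualenv, so all imports live in the function body."""
    import json
    import subprocess

    import boto3

    # Download and parse the task-level configuration uploaded by Compute Engine.
    s3 = boto3.client("s3")
    body = s3.get_object(Bucket=config_s3_bucket, Key=config_s3_key)["Body"].read()
    config = json.loads(body)

    code_type = config["code_type"]                    # illustrative field name
    if code_type == "python":
        cmd = ["python3", "-m", "delphi.compute_engine.execute"]
    elif code_type == "cython":
        cmd = ["python3", "setup.py", "build_ext", "--inplace"]
    elif code_type == "r":
        cmd = ["Rscript", config["script_path"]]       # illustrative R entry point
    else:
        raise ValueError(f"unknown code type: {code_type}")

    # All three paths are invoked subprocess-style, mirroring Popen.communicate().
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    stdout, stderr = proc.communicate()
    if proc.returncode != 0:
        raise RuntimeError(stderr.decode())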

4.3. Limiting Overhead

Overhead can hide in many places in a complex data platform, and such little things can add up to hurt the pipeline running time and punish our negligence.

While tuning the Forecasting DAG, we found two outstanding sources of overhead:

  • Scheduling delay: Airflow scheduler delay between a job completing and the next job starting

  • IO delay: delay caused by saving and loading data

The Forecasting DAG has around 12,000 tasks, which can be roughly categorized as

  • Fast tasks (usually under two minutes), such as Cython wrappers conducting arithmetic and algebra operations including aggregation, distribution and so on
  • Slow tasks (usually over twenty minutes), such as Data Warehouse tasks and R modelling tasks

The scheduling delay introduced by Airflow is at least five minutes. Besides, each task has a setup cost of around three minutes spent on initializing the PythonVirtualenvOperator and installing required libraries (explained in the previous sub-section). Thus, if we manage to combine multiple tasks into a single Airflow operator and execute them sequentially, we save at least eight minutes per task. However, to some extent, sequential execution hurts parallelism. It was fun to design the following chaining algorithm to strike a balance between sequentialism and parallelism.

Our chaining algorithm is inspired by greedy algorithms and graph theory. It aims to combine individual tasks into chains to reduce the number of operators launched in Airflow. The algorithm contains the following major steps:

  • [starting point] It generates the original execution order of tasks in the DAG, and starts from the tail of the order.

  • [iteration] Each task is only visited once. If any of the conditions (explained in the next paragraph) is met, the task is added to a chain. Whenever a chain is created or extended, its precedents and dependents need to be updated.

  • [ending point] When the DAG structure no longer changes, remove duplicate tasks in each chain (if any).

Let’s consider two simple cases of inter-dependency of tasks.

  • Case 1: t1 -> t2 -> t3, where t1 is the precedent of t2, and t2 is the precedent of t3. It’s straightforward to combine them into a chain of [t1, t2, t3], and execute them sequentially inside the chain.
  • Case 2: t1 -> t2 & t3, where t1 is the precedent of both t2 and t3. When t1 completes, the DAG can start t2 and t3 in parallel. If we combine them into a chain of [t1, t2, t3], the sequential execution hurts the original parallelism.
Figure 10. Single precedent chaining

If we only chain tasks like Case 1, and skip those like Case 2, only a tiny percentage of tasks in the Forecasting DAG get chained. Because we’re willing to make tradeoffs to bring down the overall overhead, we came up with two chaining conditions (a simplified sketch follows Figure 11).

  • Condition 1 (single precedent chaining): if a fast task only has one precedent, its precedent and itself are combined into a chain. The Forecasting DAG contains many appearances of DataWarehouse task -> Cython task, where the DataWarehouse task generates many data files which can be processed locally by the Cython task. The advantage of data locality leaves very few data files to pass downstream.
  • Condition 2 (replicate-to-dependent chaining): when a task is fast and has dependent(s), combine it with each of its dependents as an individual chain (creating a chain if necessary, otherwise prepending it). The idea is that we pay the cost of executing a fast task multiple times in order to run its dependents in parallel. This is a good tradeoff between parallelism and sequentialism, as long as the extra running time of a fast task (usually two minutes) is easily justified by saving the time to set up an Airflow operator.
Figure 11. Replicate-to-dependent chaining
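
A simplified sketch of the two conditions is shown below; it only illustrates the chaining decisions, whereas the production algorithm also re-wires precedents and dependents after every merge, iterates until the DAG stops changing, and removes duplicates at the end.

from collections import defaultdict


def chain_tasks(precedents, fast_tasks):
    """Greedy chaining sketch (illustration only, not Delphi's production code).

    precedents: dict mapping each task to the set of tasks it depends on.
    fast_tasks: set of tasks cheap enough to merge or replicate.
    Returns: dict {operator: ordered list of tasks that operator runs sequentially}.
    """
    dependents = defaultdict(set)
    for task, preds in precedents.items():
        for p in preds:
            dependents[p].add(task)

    chains = {t: [t] for t in precedents}    # one Airflow operator per task to start

    # Walk the tasks from the tail of the execution order, visiting each once
    # (dict insertion order stands in for the real execution order here).
    for task in reversed(list(precedents)):
        if task not in fast_tasks or task not in chains:
            continue
        preds, deps = precedents[task], dependents[task]

        if len(preds) == 1 and next(iter(preds)) in chains:
            # Condition 1: single-precedent chaining -- run the fast task
            # sequentially inside its only precedent's operator.
            chains[next(iter(preds))].extend(chains.pop(task))

        elif deps and all(d in chains for d in deps):
            # Condition 2: replicate-to-dependent chaining -- pay the fast task's
            # cost once per dependent so its dependents stay parallel.
            for d in deps:
                chains[d] = chains[task] + chains[d]
            chains.pop(task)

    return chains


# Example: two warehouse tasks feed a fast aggregation that two models consume.
precedents = {
    "dw1": set(), "dw2": set(),
    "agg": {"dw1", "dw2"},
    "m1": {"agg"}, "m2": {"agg"},
}
print(chain_tasks(precedents, fast_tasks={"agg"}))
# -> 4 operators instead of 5; "agg" is prepended to both "m1" and "m2" chains.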

A quick summary: our chaining algorithm manages to reduce the number of Airflow operators from 12,000 to 2,000, and thus brings down the overall scheduling delay at the DAG level.

Finally, let’s tackle the IO delay caused by saving and loading data. The Forecasting DAG generates tens of thousands of data files in the format of numpy.ndarray. Typically, an Airflow task may need to download input files generated by its precedents, and store output files to be consumed by its dependents. How should we store those data files? We tried HDFS, but the IO performance was not satisfactory. Eventually we switched to AWS S3. The Airflow tasks run on AWS EC2 instances, and the transfer speed between EC2 and S3 is around 100Mbps. S3 satisfies Delphi’s storage requirements in terms of elasticity, cost, availability and data integrity.
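
A minimal sketch of the save/load helpers, assuming boto3 and an illustrative bucket name:

import io

import boto3
import numpy as np

s3 = boto3.client("s3")
BUCKET = "delphi-intermediate-data"    # illustrative bucket name


def save_array(array, key):
    """Serialize a numpy.ndarray and upload it for dependent tasks to consume."""
    buffer = io.BytesIO()
    np.save(buffer, array)
    s3.put_object(Bucket=BUCKET, Key=key, Body=buffer.getvalue())


def load_array(key):
    """Download and deserialize a numpy.ndarray produced by a precedent task."""
    body = s3.get_object(Bucket=BUCKET, Key=key)["Body"].read()
    return np.load(io.BytesIO(body))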

5. Conclusion

Reviewing how we built the data platform of Delphi, we’ve learned that:

  • Data schema is the foundation of a data warehouse, and everything else builds on top of it. No one is more familiar with the data usage than the data owner. We need to design an efficient schema to suit the business requirements.
  • If the data warehouse has external data dependencies, it’s important to have consistent data definitions across the organization. The data should accurately reflect the metrics that are reported internally and externally.
  • Druid is a good query engine for event-oriented data, and supports aggregation very well. Druid may be a good fit if your use case fits a few of the following scenarios:
  • The majority of queries are aggregation and reporting queries
  • Query latencies are expected to be under a few seconds
  • Data operations are mostly insertions with very few modifications
  • The data warehouse may need compatibility with different query engines to support queries with largely different requirements (i.e., Druid’s fast aggregation speed vs. Hive’s capacity for high-cardinality dimensions).
  • When the data warehouse depends on a wide range of upstream data sources, it’s useful to have an SLA tracking tool that provides historical statistics of landing times. It helps with monitoring and improving SLAs.
  • When using a tool to manage the workflow orchestration of the data platform, we need to decide how to host the control logic. Do we choose single logic or duplicate it? We weigh the pros and cons of each option.
  • Overhead can hide in many places in a complex data platform, and such little things may add up to hurt the pipeline running time. After completing the pipeline, it’s worth spending some effort on tuning and harvesting the low-hanging fruit. Don’t get punished for negligence.

Finally, we’d like to recognize the contribution and support from our colleagues at Airbnb, who we are proud to work with:

  • Engineering Management: Mike Schierberl, Dwalu Khasu, Ninad Khisti
  • Finance Intelligence team: Chris Lindsey, Wanli Yang, Leon Zhu
  • FP&A team: Sean Wilson, Jiwoo Song, Claire Wang, Harrison Katz
  • Metric Infra team: Clint Kelly, Amit Pahwa, Robert Chang
  • Workflow Orchestration team: Kevin Yang
  • Analytics Infra team: John Bodley, Michelle Thomas, Sylvia Tomiyama
