
Reorganize the documentation files and copy over part of the official English-language content

YuCheng Hu, 1 year ago (branch pull/10/head, commit a71a0228bf)
  1. DataIngestion/dataformats.md (26 changed lines)
  2. DataIngestion/datamanage.md (2 changed lines)
  3. DataIngestion/faq.md (6 changed lines)
  4. DataIngestion/hadoopbased.md (6 changed lines)
  5. DataIngestion/ingestion.md (6 changed lines)
  6. DataIngestion/kafka.md (2 changed lines)
  7. DataIngestion/native.md (10 changed lines)
  8. DataIngestion/taskrefer.md (2 changed lines)
  9. Development/index.md (1 changed line)
  10. Development/thrift.md (1 changed line)
  11. Querying/Aggregations.md (2 changed lines)
  12. Querying/dimensionspec.md (6 changed lines)
  13. Querying/druidsql.md (28 changed lines)
  14. Querying/filters.md (2 changed lines)
  15. Querying/lookups.md (4 changed lines)
  16. Querying/makeNativeQueries.md (2 changed lines)
  17. Querying/multi-value-dimensions.md (2 changed lines)
  18. Querying/multitenancy.md (4 changed lines)
  19. Querying/postaggregation.md (2 changed lines)
  20. Querying/queryexecution.md (2 changed lines)
  21. SUMMARY.md (32 changed lines)
  22. _sidebar.md (12 changed lines)
  23. design/Broker.md (0 changed lines)
  24. design/Coordinator.md (0 changed lines)
  25. design/Deepstorage.md (0 changed lines)
  26. design/Design.md (2 changed lines)
  27. design/Historical.md (0 changed lines)
  28. design/Indexer.md (2 changed lines)
  29. design/Metadata.md (0 changed lines)
  30. design/MiddleManager.md (0 changed lines)
  31. design/Overlord.md (0 changed lines)
  32. design/Peons.md (0 changed lines)
  33. design/Processes.md (2 changed lines)
  34. design/Router.md (2 changed lines)
  35. design/Segments.md (0 changed lines)
  36. design/Zookeeper.md (0 changed lines)
  37. design/img/druid-architecture.png (0 changed lines)
  38. design/img/druid-column-types.png (0 changed lines)
  39. design/img/druid-timeline.png (0 changed lines)
  40. design/index.md (100 changed lines)
  41. development/JavaScript.md (0 changed lines)
  42. development/S3-compatible.md (0 changed lines)
  43. development/avro-extensions.md (0 changed lines)
  44. development/datasketches-extension.md (0 changed lines)
  45. development/experimental.md (0 changed lines)
  46. development/extensions-contrib/aliyun-oss-extensions.md (54 changed lines)
  47. development/extensions-contrib/ambari-metrics-emitter.md (99 changed lines)
  48. development/extensions-contrib/cassandra.md (30 changed lines)
  49. development/extensions-contrib/cloudfiles.md (98 changed lines)
  50. development/extensions-contrib/distinctcount.md (99 changed lines)
  51. development/extensions-contrib/gce-extensions.md (103 changed lines)
  52. development/extensions-contrib/graphite.md (117 changed lines)
  53. development/extensions-contrib/influx.md (67 changed lines)
  54. development/extensions-contrib/influxdb-emitter.md (74 changed lines)
  55. development/extensions-contrib/kafka-emitter.md (54 changed lines)
  56. development/extensions-contrib/materialized-view.md (136 changed lines)
  57. development/extensions-contrib/momentsketch-quantiles.md (125 changed lines)
  58. development/extensions-contrib/moving-average-query.md (349 changed lines)
  59. development/extensions-contrib/opentsdb-emitter.md (62 changed lines)
  60. development/extensions-contrib/redis-cache.md (58 changed lines)
  61. development/extensions-contrib/sqlserver.md (56 changed lines)
  62. development/extensions-contrib/statsd.md (71 changed lines)
  63. development/extensions-contrib/tdigestsketch-quantiles.md (151 changed lines)
  64. development/extensions-contrib/thrift.md (87 changed lines)
  65. development/extensions-contrib/time-min-max.md (104 changed lines)
  66. development/extensions-core/approximate-histograms.md (320 changed lines)
  67. development/extensions-core/avro.md (32 changed lines)
  68. development/extensions-core/azure.md (43 changed lines)
  69. development/extensions-core/bloom-filter.md (179 changed lines)
  70. development/extensions-core/datasketches-extension.md (39 changed lines)
  71. development/extensions-core/datasketches-hll.md (121 changed lines)
  72. development/extensions-core/datasketches-quantiles.md (137 changed lines)
  73. development/extensions-core/datasketches-theta.md (284 changed lines)
  74. development/extensions-core/datasketches-tuple.md (174 changed lines)
  75. development/extensions-core/druid-basic-security.md (544 changed lines)
  76. development/extensions-core/druid-kerberos.md (125 changed lines)
  77. development/extensions-core/druid-lookups.md (155 changed lines)
  78. development/extensions-core/druid-pac4j.md (46 changed lines)
  79. development/extensions-core/druid-ranger-security.md (127 changed lines)
  80. development/extensions-core/examples.md (26 changed lines)
  81. development/extensions-core/google.md (58 changed lines)
  82. development/extensions-core/hdfs.md (169 changed lines)
  83. development/extensions-core/kafka-extraction-namespace.md (66 changed lines)
  84. development/extensions-core/kafka-ingestion.md (417 changed lines)
  85. development/extensions-core/kinesis-ingestion.md (473 changed lines)
  86. development/extensions-core/lookups-cached-global.md (378 changed lines)
  87. development/extensions-core/mysql.md (173 changed lines)
  88. development/extensions-core/orc.md (84 changed lines)
  89. development/extensions-core/parquet.md (36 changed lines)
  90. development/extensions-core/postgresql.md (152 changed lines)
  91. development/extensions-core/protobuf.md (239 changed lines)
  92. development/extensions-core/s3.md (126 changed lines)
  93. development/extensions-core/simple-client-sslcontext.md (52 changed lines)
  94. development/extensions-core/stats.md (171 changed lines)
  95. development/extensions-core/test-stats.md (117 changed lines)
  96. development/extensions.md (0 changed lines)
  97. development/index.md (3 changed lines)
  98. development/modules.md (339 changed lines)
  99. development/orc-extensions.md (0 changed lines)
  100. development/overview.md (157 changed lines)
Some files were not shown because too many files have changed in this diff.

26
DataIngestion/dataformats.md

@ -132,10 +132,10 @@ TSV `inputFormat` 有以下组件:
#### ORC
> [!WARNING]
> 使用ORC输入格式之前,首先需要包含 [druid-orc-extensions](../Development/orc-extensions.md)
> 使用ORC输入格式之前,首先需要包含 [druid-orc-extensions](../development/orc-extensions.md)
> [!WARNING]
> 如果您正在考虑从早于0.15.0的版本升级到0.15.0或更高版本,请仔细阅读 [从contrib扩展的迁移](../Development/orc-extensions.md#从contrib扩展迁移)。
> 如果您正在考虑从早于0.15.0的版本升级到0.15.0或更高版本,请仔细阅读 [从contrib扩展的迁移](../development/orc-extensions.md#从contrib扩展迁移)。
一个加载ORC格式数据的 `inputFormat` 示例:
```json
@ -169,7 +169,7 @@ ORC `inputFormat` 有以下组件:
#### Parquet
> [!WARNING]
> 使用Parquet输入格式之前,首先需要包含 [druid-parquet-extensions](../Development/parquet-extensions.md)
> 使用Parquet输入格式之前,首先需要包含 [druid-parquet-extensions](../development/parquet-extensions.md)
一个加载Parquet格式数据的 `inputFormat` 示例:
```json
@ -277,7 +277,7 @@ Parquet `inputFormat` 有以下组件:
> [!WARNING]
> parser在 [本地批任务](native.md), [Kafka索引任务](kafka.md) 和 [Kinesis索引任务](kinesis.md) 中已经废弃,在这些类型的摄入方式中考虑使用 [inputFormat](#数据格式)
该部分列出来了所有默认的以及核心扩展中的解析器。对于社区的扩展解析器,请参见 [社区扩展列表](../Development/extensions.md#社区扩展)
该部分列出来了所有默认的以及核心扩展中的解析器。对于社区的扩展解析器,请参见 [社区扩展列表](../development/extensions.md#社区扩展)
#### String Parser
@ -291,7 +291,7 @@ Parquet `inputFormat` 有以下组件:
#### Avro Hadoop Parser
> [!WARNING]
> 需要添加 [druid-avro-extensions](../Development/avro-extensions.md) 来使用 Avro Hadoop解析器
> 需要添加 [druid-avro-extensions](../development/avro-extensions.md) 来使用 Avro Hadoop解析器
该解析器用于 [Hadoop批摄取](hadoopbased.md)。在 `ioConfig` 中,`inputSpec` 中的 `inputFormat` 必须设置为 `org.apache.druid.data.input.avro.AvroValueInputFormat`。您可能想在 `tuningConfig` 中的 `jobProperties` 选项设置Avro reader的schema, 例如:`"avro.schema.input.value.path": "/path/to/your/schema.avsc"` 或者 `"avro.schema.input.value": "your_schema_JSON_object"`。如果未设置Avro读取器的schema,则将使用Avro对象容器文件中的schema,详情可以参见 [avro规范](http://avro.apache.org/docs/1.7.7/spec.html#Schema+Resolution)
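As a hedged illustration of the `jobProperties` option mentioned above, a minimal sketch of where the Avro reader schema path could be set (the surrounding `tuningConfig` wrapper and the schema path are placeholders, not taken from this diff):
```json
"tuningConfig": {
  "type": "hadoop",
  "jobProperties": {
    "avro.schema.input.value.path": "/path/to/your/schema.avsc"
  }
}
```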
@ -339,10 +339,10 @@ Avro parseSpec可以包含使用"root"或"path"字段类型的 [flattenSpec](#fl
#### ORC Hadoop Parser
> [!WARNING]
> 需要添加 [druid-orc-extensions](../Development/orc-extensions.md) 来使用ORC Hadoop解析器
> 需要添加 [druid-orc-extensions](../development/orc-extensions.md) 来使用ORC Hadoop解析器
> [!WARNING]
> 如果您正在考虑从早于0.15.0的版本升级到0.15.0或更高版本,请仔细阅读 [从contrib扩展的迁移](../Development/orc-extensions.md#从contrib扩展迁移)。
> 如果您正在考虑从早于0.15.0的版本升级到0.15.0或更高版本,请仔细阅读 [从contrib扩展的迁移](../development/orc-extensions.md#从contrib扩展迁移)。
该解析器用于 [Hadoop批摄取](hadoopbased.md)。在 `ioConfig` 中,`inputSpec` 中的 `inputFormat` 必须设置为 `org.apache.orc.mapreduce.OrcInputFormat`
@ -564,7 +564,7 @@ Avro parseSpec可以包含使用"root"或"path"字段类型的 [flattenSpec](#fl
#### Parquet Hadoop Parser
> [!WARNING]
> 需要添加 [druid-parquet-extensions](../Development/parquet-extensions.md) 来使用Parquet Hadoop解析器
> 需要添加 [druid-parquet-extensions](../development/parquet-extensions.md) 来使用Parquet Hadoop解析器
该解析器用于 [Hadoop批摄取](hadoopbased.md)。在 `ioConfig` 中,`inputSpec` 中的 `inputFormat` 必须设置为 `org.apache.druid.data.input.parquet.DruidParquetInputFormat`
@ -690,7 +690,7 @@ Parquet Hadoop 解析器支持自动字段发现,如果提供了一个带有 `
> 考虑在该解析器之上使用 [Parquet Hadoop Parser](#parquet-hadoop-parser) 来摄取Parquet文件。 两者之间的不同之处参见 [Parquet Hadoop解析器 vs Parquet Avro Hadoop解析器]() 部分
> [!WARNING]
> 使用Parquet Avro Hadoop Parser需要同时加入 [druid-parquet-extensions](../Development/parquet-extensions.md) 和 [druid-avro-extensions](../Development/avro-extensions.md)
> 使用Parquet Avro Hadoop Parser需要同时加入 [druid-parquet-extensions](../development/parquet-extensions.md) 和 [druid-avro-extensions](../development/avro-extensions.md)
该解析器用于 [Hadoop批摄取](hadoopbased.md), 该解析器首先将Parquet数据转换为Avro记录,然后再解析它们后摄入到Druid。在 `ioConfig` 中,`inputSpec` 中的 `inputFormat` 必须设置为 `org.apache.druid.data.input.parquet.DruidParquetAvroInputFormat`
@ -763,7 +763,7 @@ Parquet Avro Hadoop 解析器支持自动字段发现,如果提供了一个带
#### Avro Stream Parser
> [!WARNING]
> 需要添加 [druid-avro-extensions](../Development/avro-extensions.md) 来使用Avro Stream解析器
> 需要添加 [druid-avro-extensions](../development/avro-extensions.md) 来使用Avro Stream解析器
该解析器用于 [流式摄取](streamingest.md), 直接从一个流来读取数据。
@ -909,7 +909,7 @@ Avro Bytes Decorder首先提取输入消息的 `subject` 和 `id`, 然后使
#### Protobuf Parser
> [!WARNING]
> 需要添加 [druid-protobuf-extensions](../Development/protobuf-extensions.md) 来使用Protobuf解析器
> 需要添加 [druid-protobuf-extensions](../development/protobuf-extensions.md) 来使用Protobuf解析器
此解析器用于 [流接收](streamingest.md),并直接从流中读取协议缓冲区数据。
@ -949,7 +949,7 @@ Avro Bytes Decorder首先提取输入消息的 `subject` 和 `id`, 然后使
}
}
```
有关更多详细信息和示例,请参见 [扩展说明](../Development/protobuf-extensions.md)。
有关更多详细信息和示例,请参见 [扩展说明](../development/protobuf-extensions.md)。
### ParseSpec
@ -1117,7 +1117,7 @@ JSON数据也可以包含多值维度。维度的多个值必须在接收的数
注意: JavaScript解析器必须完全解析数据,并在JS逻辑中以 `{key:value}` 格式返回。这意味着任何展平或解析多维值都必须在这里完成。
> [!WARNING]
> 默认情况下禁用基于JavaScript的功能。有关使用Druid的JavaScript功能的指南,包括如何启用它的说明,请参阅 [Druid JavaScript编程指南](../Development/JavaScript.md)。
> 默认情况下禁用基于JavaScript的功能。有关使用Druid的JavaScript功能的指南,包括如何启用它的说明,请参阅 [Druid JavaScript编程指南](../development/JavaScript.md)。
#### 时间和维度解析规范

2
DataIngestion/datamanage.md

@ -163,7 +163,7 @@ Druid使用 `ioConfig` 中的 `inputSpec` 来知道要接收的数据位于何
### 删除数据
Druid支持永久的将标记为"unused"状态(详情可见架构设计中的 [段的生命周期](../Design/Design.md#段生命周期))的段删除掉
Druid支持永久的将标记为"unused"状态(详情可见架构设计中的 [段的生命周期](../design/Design.md#段生命周期))的段删除掉
杀死任务负责从元数据存储和深度存储中删除掉指定时间间隔内的不被使用的段
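A minimal sketch of the kill task described above; the datasource name and interval are placeholders:
```json
{
  "type": "kill",
  "dataSource": "wikipedia",
  "interval": "2020-01-01/2020-02-01"
}
```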

6
DataIngestion/faq.md

@ -34,7 +34,7 @@ Druid会拒绝时间窗口之外的事件, 确认事件是否被拒绝了的
### 摄取之后段存储在哪里
段的存储位置由 `druid.storage.type` 配置决定的,Druid会将段上传到 [深度存储](../Design/Deepstorage.md)。 本地磁盘是默认的深度存储位置。
段的存储位置由 `druid.storage.type` 配置决定的,Druid会将段上传到 [深度存储](../design/Deepstorage.md)。 本地磁盘是默认的深度存储位置。
### 流摄取任务没有发生段切换递交
@ -49,11 +49,11 @@ Druid会拒绝时间窗口之外的事件, 确认事件是否被拒绝了的
### 如何让HDFS工作
确保在类路径中包含 `druid-hdfs-storage` 和所有的hadoop配置、依赖项(可以通过在安装了hadoop的计算机上运行 `hadoop classpath`命令获得)。并且,提供必要的HDFS设置,如 [深度存储](../Design/Deepstorage.md) 中所述。
确保在类路径中包含 `druid-hdfs-storage` 和所有的hadoop配置、依赖项(可以通过在安装了hadoop的计算机上运行 `hadoop classpath`命令获得)。并且,提供必要的HDFS设置,如 [深度存储](../design/Deepstorage.md) 中所述。
### 没有在Historical进程中看到Druid段
您可以查看位于 `<Coordinator_IP>:<PORT>` 的Coordinator控制台, 确保您的段实际上已加载到 [Historical进程](../Design/Historical.md)中。如果段不存在,请检查Coordinator日志中有关复制错误容量的消息。不下载段的一个原因是,Historical进程的 `maxSize` 太小,使它们无法下载更多数据。您可以使用(例如)更改它:
您可以查看位于 `<Coordinator_IP>:<PORT>` 的Coordinator控制台, 确保您的段实际上已加载到 [Historical进程](../design/Historical.md)中。如果段不存在,请检查Coordinator日志中有关复制错误容量的消息。不下载段的一个原因是,Historical进程的 `maxSize` 太小,使它们无法下载更多数据。您可以使用(例如)更改它:
```json
-Ddruid.segmentCache.locations=[{"path":"/tmp/druid/storageLocation","maxSize":"500000000000"}]

6
DataIngestion/hadoopbased.md

@ -13,7 +13,7 @@
## 基于Hadoop的摄入
Apache Druid当前支持通过一个Hadoop摄取任务来支持基于Apache Hadoop的批量索引任务, 这些任务被提交到 [Druid Overlord](../Design/Overlord.md)的一个运行实例上。详情可以查看 [基于Hadoop的摄取vs基于本地批摄取的对比](ingestion.md#批量摄取) 来了解基于Hadoop的摄取、本地简单批摄取、本地并行摄取三者的比较。
Apache Druid当前支持通过一个Hadoop摄取任务来支持基于Apache Hadoop的批量索引任务, 这些任务被提交到 [Druid Overlord](../design/Overlord.md)的一个运行实例上。详情可以查看 [基于Hadoop的摄取vs基于本地批摄取的对比](ingestion.md#批量摄取) 来了解基于Hadoop的摄取、本地简单批摄取、本地并行摄取三者的比较。
运行一个基于Hadoop的批量摄取任务,首先需要编写一个如下的摄取规范, 然后提交到Overlord的 [`druid/indexer/v1/task`](../Operations/api.md#overlord) 接口,或者使用Druid软件包中自带的 `bin/post-index-task` 脚本。
@ -388,7 +388,7 @@ Hadoop的 [MapReduce文档](https://hadoop.apache.org/docs/stable/hadoop-mapredu
```json
classification=yarn-site,properties=[mapreduce.reduce.memory.mb=6144,mapreduce.reduce.java.opts=-server -Xms2g -Xmx2g -Duser.timezone=UTC -Dfile.encoding=UTF-8 -XX:+PrintGCDetails -XX:+PrintGCTimeStamps,mapreduce.map.java.opts=758,mapreduce.map.java.opts=-server -Xms512m -Xmx512m -Duser.timezone=UTC -Dfile.encoding=UTF-8 -XX:+PrintGCDetails -XX:+PrintGCTimeStamps,mapreduce.task.timeout=1800000]
```
* 按照 [Hadoop连接配置](../GettingStarted/chapter-4.md#Hadoop连接配置) 指导,使用EMR master中 `/etc/hadoop/conf` 的XML文件。
* 按照 [Hadoop连接配置](../tutorials/img/chapter-4.md#Hadoop连接配置) 指导,使用EMR master中 `/etc/hadoop/conf` 的XML文件。
### Kerberized Hadoop集群
@ -472,7 +472,7 @@ spec文件需要包含一个JSON对象,其中的内容与Hadoop索引任务中
| `password` | String | DB的密码 | 是 |
| `segmentTable` | String | DB中使用的表 | 是 |
这些属性应该模仿您为 [Coordinator](../Design/Coordinator.md) 配置的内容。
这些属性应该模仿您为 [Coordinator](../design/Coordinator.md) 配置的内容。
**segmentOutputPath配置**

6
DataIngestion/ingestion.md

@ -16,7 +16,7 @@
Druid中的所有数据都被组织成*段*,这些段是数据文件,通常每个段最多有几百万行。在Druid中加载数据称为*摄取或索引*,它包括从源系统读取数据并基于该数据创建段。
在大多数摄取方法中,加载数据的工作由Druid [MiddleManager](../Design/MiddleManager.md) 进程(或 [Indexer](../Design/Indexer.md) 进程)完成。一个例外是基于Hadoop的摄取,这项工作是使用Hadoop MapReduce作业在YARN上完成的(尽管MiddleManager或Indexer进程仍然参与启动和监视Hadoop作业)。一旦段被生成并存储在 [深层存储](../Design/Deepstorage.md) 中,它们将被Historical进程加载。有关如何在引擎下工作的更多细节,请参阅Druid设计文档的[存储设计](../Design/Design.md) 部分。
在大多数摄取方法中,加载数据的工作由Druid [MiddleManager](../design/MiddleManager.md) 进程(或 [Indexer](../design/Indexer.md) 进程)完成。一个例外是基于Hadoop的摄取,这项工作是使用Hadoop MapReduce作业在YARN上完成的(尽管MiddleManager或Indexer进程仍然参与启动和监视Hadoop作业)。一旦段被生成并存储在 [深层存储](../design/Deepstorage.md) 中,它们将被Historical进程加载。有关如何在引擎下工作的更多细节,请参阅Druid设计文档的[存储设计](../design/Design.md) 部分。
### 如何使用本文档
@ -394,7 +394,7 @@ Druid以两种可能的方式来解释 `dimensionsSpec` : *normal* 和 *schemale
##### `granularitySpec`
`granularitySpec` 位于 `dataSchema` -> `granularitySpec`, 用来配置以下操作:
1. 通过 `segmentGranularity` 来将数据源分区到 [时间块](../Design/Design.md#数据源和段)
1. 通过 `segmentGranularity` 来将数据源分区到 [时间块](../design/Design.md#数据源和段)
2. 如果需要的话,通过 `queryGranularity` 来截断时间戳
3. 通过 `interval` 来指定批摄取中应创建段的时间块
4. 通过 `rollup` 来指定是否在摄取时进行汇总
@ -418,7 +418,7 @@ Druid以两种可能的方式来解释 `dimensionsSpec` : *normal* 和 *schemale
| 字段 | 描述 | 默认值 |
|-|-|-|
| type | `uniform` 或者 `arbitrary` ,大多数时候使用 `uniform` | `uniform` |
| segmentGranularity | 数据源的 [时间分块](../Design/Design.md#数据源和段) 粒度。每个时间块可以创建多个段, 例如,当设置为 `day` 时,同一天的事件属于同一时间块,该时间块可以根据其他配置和输入大小进一步划分为多个段。这里可以提供任何粒度。请注意,同一时间块中的所有段应具有相同的段粒度。 <br><br> 如果 `type` 字段设置为 `arbitrary` 则忽略 | `day` |
| segmentGranularity | 数据源的 [时间分块](../design/Design.md#数据源和段) 粒度。每个时间块可以创建多个段, 例如,当设置为 `day` 时,同一天的事件属于同一时间块,该时间块可以根据其他配置和输入大小进一步划分为多个段。这里可以提供任何粒度。请注意,同一时间块中的所有段应具有相同的段粒度。 <br><br> 如果 `type` 字段设置为 `arbitrary` 则忽略 | `day` |
| queryGranularity | 每个段内时间戳存储的分辨率, 必须等于或比 `segmentGranularity` 更细。这将是您可以查询的最细粒度,并且仍然可以查询到合理的结果。但是请注意,您仍然可以在比此粒度更粗的场景进行查询,例如 "`minute`"的值意味着记录将以分钟的粒度存储,并且可以在分钟的任意倍数(包括分钟、5分钟、小时等)进行查询。<br><br> 这里可以提供任何 [粒度](../Querying/AggregationGranularity.md) 。使用 `none` 按原样存储时间戳,而不进行任何截断。请注意,即使将 `queryGranularity` 设置为 `none`,也将应用 `rollup`。 | `none` |
| rollup | 是否在摄取时使用 [rollup](#rollup)。 注意:即使 `queryGranularity` 设置为 `none`,rollup也仍然是有效的,当数据具有相同的时间戳时数据将被汇总 | `true` |
| interval | 描述应该创建段的时间块的间隔列表。如果 `type` 设置为`uniform`,则此列表将根据 `segmentGranularity` 进行拆分和舍入。如果 `type` 设置为 `arbitrary` ,则将按原样使用此列表。<br><br> 如果该值不提供或者为空值,则批处理摄取任务通常会根据在输入数据中找到的时间戳来确定要输出的时间块。<br><br> 如果指定,批处理摄取任务可以跳过确定分区阶段,这可能会导致更快的摄取。批量摄取任务也可以预先请求它们的所有锁,而不是逐个请求。批处理摄取任务将丢弃任何时间戳超出指定间隔的记录。<br><br> 在任何形式的流摄取中忽略该配置。 | `null` |
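A minimal `granularitySpec` sketch assembled from the fields in the table above (the interval value is a placeholder, and the list-valued field is written as `intervals`):
```json
"granularitySpec": {
  "type": "uniform",
  "segmentGranularity": "day",
  "queryGranularity": "none",
  "rollup": true,
  "intervals": ["2020-01-01/2020-02-01"]
}
```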

2
DataIngestion/kafka.md

@ -186,7 +186,7 @@ curl -X POST -H 'Content-Type: application/json' -d @supervisor-spec.json http:/
Kafka索引服务同时支持通过 [`inputFormat`](dataformats.md#inputformat) 和 [`parser`](dataformats.md#parser) 来指定数据格式。 `inputFormat` 是一种新的且推荐的用于Kafka索引服务中指定数据格式的方式,但是很遗憾的是目前它还不支持过时的 `parser` 所有支持的所有格式(未来会支持)。
`inputFormat` 支持的格式包括 [`csv`](dataformats.md#csv), [`delimited`](dataformats.md#TSV(Delimited)), [`json`](dataformats.md#json)。可以使用 `parser` 来读取 [`avro_stream`](dataformats.md#AvroStreamParser), [`protobuf`](dataformats.md#ProtobufParser), [`thrift`](../Development/thrift.md) 格式的数据。
`inputFormat` 支持的格式包括 [`csv`](dataformats.md#csv), [`delimited`](dataformats.md#TSV(Delimited)), [`json`](dataformats.md#json)。可以使用 `parser` 来读取 [`avro_stream`](dataformats.md#AvroStreamParser), [`protobuf`](dataformats.md#ProtobufParser), [`thrift`](../development/overview.md) 格式的数据。
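For illustration, a hedged sketch of how an `inputFormat` might be placed inside a Kafka supervisor `ioConfig` (the topic name and broker address are placeholders):
```json
"ioConfig": {
  "topic": "metrics",
  "inputFormat": { "type": "json" },
  "consumerProperties": { "bootstrap.servers": "localhost:9092" }
}
```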
### 操作

10
DataIngestion/native.md

@ -39,7 +39,7 @@ Apache Druid当前支持两种类型的本地批量索引任务, `index_parall
传统的 [`firehose`](#firehoses%e5%b7%b2%e5%ba%9f%e5%bc%83) 支持其他一些云存储类型。下面的 `firehose` 类型也是可拆分的。请注意,`firehose` 只支持文本格式。
* [`static-cloudfiles`](../Development/rackspacecloudfiles.md)
* [`static-cloudfiles`](../development/rackspacecloudfiles.md)
您可能需要考虑以下事项:
* 您可能希望控制每个worker进程的输入数据量。这可以使用不同的配置进行控制,具体取决于并行摄取的阶段(有关更多详细信息,请参阅 [`partitionsSpec`](#partitionsspec)。对于从 `inputSource` 读取数据的任务,可以在 `tuningConfig` 中设置 [分割提示规范](#分割提示规范)。对于合并无序段的任务,可以在 `tuningConfig` 中设置`totalNumMergeTasks`。
@ -235,7 +235,7 @@ PartitionsSpec用于描述辅助分区方法。您应该根据需要的rollup模
基于哈希分区的并行任务类似于 [MapReduce](https://en.wikipedia.org/wiki/MapReduce)。任务分为两个阶段运行,即 `部分段生成``部分段合并`
* 在 `部分段生成` 阶段,与MapReduce中的Map阶段一样,并行任务根据分割提示规范分割输入数据,并将每个分割分配给一个worker。每个worker(`partial_index_generate` 类型)从 `granularitySpec` 中的`segmentGranularity(主分区键)` 读取分配的分割,然后按`partitionsSpec` 中 `partitionDimensions(辅助分区键)`的哈希值对行进行分区。分区数据存储在 [MiddleManager](../Design/MiddleManager.md) 或 [Indexer](../Design/Indexer.md) 的本地存储中。
* 在 `部分段生成` 阶段,与MapReduce中的Map阶段一样,并行任务根据分割提示规范分割输入数据,并将每个分割分配给一个worker。每个worker(`partial_index_generate` 类型)从 `granularitySpec` 中的`segmentGranularity(主分区键)` 读取分配的分割,然后按`partitionsSpec` 中 `partitionDimensions(辅助分区键)`的哈希值对行进行分区。分区数据存储在 [MiddleManager](../design/MiddleManager.md) 或 [Indexer](../design/Indexer.md) 的本地存储中。
* `部分段合并` 阶段类似于MapReduce中的Reduce阶段。并行任务生成一组新的worker(`partial_index_merge` 类型)来合并在前一阶段创建的分区数据。这里,分区数据根据要合并的时间块和分区维度的散列值进行洗牌;每个worker从多个MiddleManager/Indexer进程中读取落在同一时间块和同一散列值中的数据,并将其合并以创建最终段。最后,它们将最后的段一次推送到深层存储。
**基于单一维度范围分区**
@ -254,7 +254,7 @@ PartitionsSpec用于描述辅助分区方法。您应该根据需要的rollup模
`single-dim` 分区下,并行任务分为3个阶段进行,即 `部分维分布`、`部分段生成` 和 `部分段合并`。第一个阶段是收集一些统计数据以找到最佳分区,另外两个阶段是创建部分段并分别合并它们,就像在基于哈希的分区中那样。
* 在 `部分维度分布` 阶段,并行任务分割输入数据,并根据分割提示规范将其分配给worker。每个worker任务(`partial_dimension_distribution` 类型)读取分配的分割并为 `partitionDimension` 构建直方图。并行任务从worker任务收集这些直方图,并根据 `partitionDimension` 找到最佳范围分区,以便在分区之间均匀分布行。请注意,`targetRowsPerSegment` 或 `maxRowsPerSegment` 将用于查找最佳分区。
* 在 `部分段生成` 阶段,并行任务生成新的worker任务(`partial_range_index_generate` 类型)以创建分区数据。每个worker任务都读取在前一阶段中创建的分割,根据 `granularitySpec` 中的`segmentGranularity(主分区键)`的时间块对行进行分区,然后根据在前一阶段中找到的范围分区对行进行分区。分区数据存储在 [MiddleManager](../Design/MiddleManager.md) 或 [Indexer](../Design/Indexer.md)的本地存储中。
* 在 `部分段生成` 阶段,并行任务生成新的worker任务(`partial_range_index_generate` 类型)以创建分区数据。每个worker任务都读取在前一阶段中创建的分割,根据 `granularitySpec` 中的`segmentGranularity(主分区键)`的时间块对行进行分区,然后根据在前一阶段中找到的范围分区对行进行分区。分区数据存储在 [MiddleManager](../design/MiddleManager.md) 或 [Indexer](../design/Indexer.md)的本地存储中。
* 在 `部分段合并` 阶段,并行索引任务生成一组新的worker任务(`partial_index_generic_merge`类型)来合并在上一阶段创建的分区数据。这里,分区数据根据时间块和 `partitionDimension` 的值进行洗牌;每个工作任务从多个MiddleManager/Indexer进程中读取属于同一范围的同一分区中的段,并将它们合并以创建最后的段。最后,它们将最后的段推到深层存储。
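A hedged sketch of a `single-dim` `partitionsSpec` of the kind described above; the partition dimension and row target are placeholders:
```json
"partitionsSpec": {
  "type": "single_dim",
  "partitionDimension": "country",
  "targetRowsPerSegment": 5000000
}
```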
> [!WARNING]
@ -654,7 +654,7 @@ PartitionsSpec用于描述辅助分区方法。您应该根据需要的rollup模
#### S3输入源
> [!WARNING]
> 您需要添加 [`druid-s3-extensions`](../Development/S3-compatible.md) 扩展以便使用S3输入源。
> 您需要添加 [`druid-s3-extensions`](../development/S3-compatible.md) 扩展以便使用S3输入源。
S3输入源支持直接从S3读取对象。可以通过S3 URI字符串列表或S3位置前缀列表指定对象,该列表将尝试列出内容并摄取位置中包含的所有对象。S3输入源是可拆分的,可以由 [并行任务](#并行任务) 使用,其中 `index_parallel` 的每个worker任务将读取一个或多个对象。
@ -734,7 +734,7 @@ S3对象:
| `accessKeyId` | S3输入源访问密钥的 [Password Provider](../Operations/passwordproviders.md) 或纯文本字符串 | None | 如果 `secretAccessKey` 被提供的话,则为必须 |
| `secretAccessKey` | S3输入源访问密钥的 [Password Provider](../Operations/passwordproviders.md) 或纯文本字符串 | None | 如果 `accessKeyId` 被提供的话,则为必须 |
**注意**: *如果 `accessKeyId``secretAccessKey` 未被指定的话, 则将使用默认的 [S3认证](../Development/S3-compatible.md#S3认证方式)*
**注意**: *如果 `accessKeyId``secretAccessKey` 未被指定的话, 则将使用默认的 [S3认证](../development/S3-compatible.md#S3认证方式)*
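A minimal S3 input source sketch based on the description above, assuming a placeholder bucket and an environment-variable password provider for the secret key:
```json
"inputSource": {
  "type": "s3",
  "uris": ["s3://my-bucket/path/to/file.json"],
  "properties": {
    "accessKeyId": "YOUR_ACCESS_KEY_ID",
    "secretAccessKey": { "type": "environment", "variable": "AWS_SECRET_ACCESS_KEY" }
  }
}
```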
#### 谷歌云存储输入源

2
DataIngestion/taskrefer.md

@ -21,7 +21,7 @@
任务API主要在两个地方是可用的:
* [Overlord](../Design/Overlord.md) 进程提供HTTP API接口来进行提交任务、取消任务、检查任务状态、查看任务日志与报告等。 查看 [任务API文档](../Operations/api.md) 可以看到完整列表
* [Overlord](../design/Overlord.md) 进程提供HTTP API接口来进行提交任务、取消任务、检查任务状态、查看任务日志与报告等。 查看 [任务API文档](../Operations/api.md) 可以看到完整列表
* Druid SQL包括了一个 [`sys.tasks`](../Querying/druidsql.md#系统Schema) ,保存了当前任务运行的信息。 此表是只读的,并且可以通过Overlord API查询完整信息的有限制的子集。
### 任务报告

1
Development/index.md

@ -1 +0,0 @@
## 开发指南

1
Development/thrift.md

@ -1 +0,0 @@
<!-- toc -->

2
Querying/Aggregations.md

@ -306,7 +306,7 @@ Double/Float/Long/String的ANY聚合器不能够使用在摄入规范中,只
```
> [!WARNING]
> 基于JavaScript的功能默认是禁用的。 如何启用它以及如何使用Druid JavaScript功能,参考 [JavaScript编程指南](../Development/JavaScript.md)。
> 基于JavaScript的功能默认是禁用的。 如何启用它以及如何使用Druid JavaScript功能,参考 [JavaScript编程指南](../development/JavaScript.md)。
### 近似聚合(Approximate Aggregations)
#### 唯一计数(Count distinct)

6
Querying/dimensionspec.md

@ -60,7 +60,7 @@
该项功能仅仅对多值维度是比较有用的。如果你在Apache Druid中有一个值为 ["v1","v2","v3"] 的行,当发送一个带有对维度值为"v1"进行[查询过滤](filters.md)的GroupBy/TopN查询, 在响应中,将会得到包含"v1","v2","v3"的三行数据。这个行为在大多数场景是不适合的。
之所以会发生这种情况,是因为"查询过滤器"是在位图上内部使用的,并且只用于匹配要包含在查询结果处理中的行。对于多值维度,"查询过滤器"的行为类似于包含检查,它将匹配维度值为["v1"、"v2"、"v3"]的行。有关更多详细信息,请参阅[段](../Design/Segments.md)中"多值列"一节, 然后groupBy/topN处理管道"分解"所有多值维度,得到3行"v1"、"v2"和"v3"。
之所以会发生这种情况,是因为"查询过滤器"是在位图上内部使用的,并且只用于匹配要包含在查询结果处理中的行。对于多值维度,"查询过滤器"的行为类似于包含检查,它将匹配维度值为["v1"、"v2"、"v3"]的行。有关更多详细信息,请参阅[段](../design/Segments.md)中"多值列"一节, 然后groupBy/topN处理管道"分解"所有多值维度,得到3行"v1"、"v2"和"v3"。
除了有效地选择要处理的行的"查询过滤器"之外,还可以使用带过滤的DimensionSpec来筛选多值维度值中的特定值。这些维度规范采用代理维度规范和筛选条件。从"分解"行中,查询结果中只返回与给定筛选条件匹配的行。
@ -87,7 +87,7 @@
#### 带Lookup的DimensionSpec
> [!WARNING]
> Lookups是一个[实验性的特性](../Development/experimental.md)
> Lookups是一个[实验性的特性](../development/experimental.md)
带Lookup的DimensionSpec可用于将lookup实现直接定义为维度规范。一般来说,有两种不同类型的查找实现。第一种是在查询时像map实现一样传递的。
@ -296,7 +296,7 @@ null字符串被认定为长度为0
```
> [!WARNING]
> 基于JavaScript的功能默认是禁用的。 如何启用它以及如何使用Druid JavaScript功能,参考 [JavaScript编程指南](../Development/JavaScript.md)。
> 基于JavaScript的功能默认是禁用的。 如何启用它以及如何使用Druid JavaScript功能,参考 [JavaScript编程指南](../development/JavaScript.md)。
#### 已注册的Lookup提取函数

28
Querying/druidsql.md

@ -131,7 +131,7 @@ Druid的原生类型系统允许字符串可能有多个值。这些 [多值维
在默认模式(`true`)下,Druid将NULL和空字符串互换处理,而不是根据SQL标准。在这种模式下,Druid SQL只部分支持NULL。例如,表达式 `col IS NULL``col = ''` 等效,如果 `col` 包含空字符串,则两者的计算结果都为true。类似地,如果`col1`是空字符串,则表达式 `COALESCE(col1,col2)` 将返回 `col2`。当 `COUNT(*)` 聚合器计算所有行时,`COUNT(expr)` 聚合器将计算expr既不为空也不为空字符串的行数。此模式中的数值列不可为空;任何空值或缺少的值都将被视为零。
在SQL兼容模式(`false`)中,NULL的处理更接近SQL标准,该属性同时影响存储和查询,因此为了获得最佳行为,应该在接收时和查询时同时设置该属性。处理空值的能力会带来一些开销;有关更多详细信息,请参阅 [段文档](../Design/Segments.md#SQL兼容的空值处理)。
在SQL兼容模式(`false`)中,NULL的处理更接近SQL标准,该属性同时影响存储和查询,因此为了获得最佳行为,应该在接收时和查询时同时设置该属性。处理空值的能力会带来一些开销;有关更多详细信息,请参阅 [段文档](../design/Segments.md#SQL兼容的空值处理)。
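The two modes described above are switched by a runtime property; a minimal sketch, assuming the standard `druid.generic.useDefaultValueForNull` flag (the property name is not given in the text above):
```properties
# Assumed property name; false enables SQL-compatible null handling, true (the default) keeps the legacy behavior
druid.generic.useDefaultValueForNull=false
```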
### 聚合函数
@ -148,14 +148,14 @@ Druid的原生类型系统允许字符串可能有多个值。这些 [多值维
| `MAX(expr)` | 取数字的最大值 |
| `AVG(expr)` | 取平均值 |
| `APPROX_COUNT_DISTINCT(expr)` | 唯一值的计数,该值可以是常规列或hyperUnique。这始终是近似值,而不考虑"useApproximateCountDistinct"的值。该函数使用了Druid内置的"cardinality"或"hyperUnique"聚合器。另请参见 `COUNT(DISTINCT expr)` |
| `APPROX_COUNT_DISTINCT_DS_HLL(expr, [lgK, tgtHllType])` | 唯一值的计数,该值可以是常规列或[HLL sketch](../Configuration/core-ext/datasketches-hll.md)。`lgk` 和 `tgtHllType` 参数在HLL Sketch文档中做了描述。 该值也始终是近似值,而不考虑"useApproximateCountDistinct"的值。另请参见 `COUNT(DISTINCT expr)`, 使用该函数需要加载 [DataSketches扩展](../Development/datasketches-extension.md) |
| `APPROX_COUNT_DISTINCT_DS_THETA(expr, [size])` | 唯一值的计数,该值可以是常规列或[Theta sketch](../Configuration/core-ext/datasketches-theta.md)。`size` 参数在Theta Sketch文档中做了描述。 该值也始终是近似值,而不考虑"useApproximateCountDistinct"的值。另请参见 `COUNT(DISTINCT expr)`, 使用该函数需要加载 [DataSketches扩展](../Development/datasketches-extension.md) |
| `DS_HLL(expr, [lgK, tgtHllType])` | 在表达式的值上创建一个 [`HLL sketch`](../Configuration/core-ext/datasketches-hll.md), 该值可以是常规列或者包括HLL Sketch的列。`lgk` 和 `tgtHllType` 参数在HLL Sketch文档中做了描述。使用该函数需要加载 [DataSketches扩展](../Development/datasketches-extension.md) |
| `DS_THETA(expr, [size])` | 在表达式的值上创建一个[`Theta sketch`](../Configuration/core-ext/datasketches-theta.md),该值可以是常规列或者包括Theta Sketch的列。`size` 参数在Theta Sketch文档中做了描述。使用该函数需要加载 [DataSketches扩展](../Development/datasketches-extension.md) |
| `APPROX_COUNT_DISTINCT_DS_HLL(expr, [lgK, tgtHllType])` | 唯一值的计数,该值可以是常规列或[HLL sketch](../Configuration/core-ext/datasketches-hll.md)。`lgk` 和 `tgtHllType` 参数在HLL Sketch文档中做了描述。 该值也始终是近似值,而不考虑"useApproximateCountDistinct"的值。另请参见 `COUNT(DISTINCT expr)`, 使用该函数需要加载 [DataSketches扩展](../development/datasketches-extension.md) |
| `APPROX_COUNT_DISTINCT_DS_THETA(expr, [size])` | 唯一值的计数,该值可以是常规列或[Theta sketch](../Configuration/core-ext/datasketches-theta.md)。`size` 参数在Theta Sketch文档中做了描述。 该值也始终是近似值,而不考虑"useApproximateCountDistinct"的值。另请参见 `COUNT(DISTINCT expr)`, 使用该函数需要加载 [DataSketches扩展](../development/datasketches-extension.md) |
| `DS_HLL(expr, [lgK, tgtHllType])` | 在表达式的值上创建一个 [`HLL sketch`](../Configuration/core-ext/datasketches-hll.md), 该值可以是常规列或者包括HLL Sketch的列。`lgk` 和 `tgtHllType` 参数在HLL Sketch文档中做了描述。使用该函数需要加载 [DataSketches扩展](../development/datasketches-extension.md) |
| `DS_THETA(expr, [size])` | 在表达式的值上创建一个[`Theta sketch`](../Configuration/core-ext/datasketches-theta.md),该值可以是常规列或者包括Theta Sketch的列。`size` 参数在Theta Sketch文档中做了描述。使用该函数需要加载 [DataSketches扩展](../development/datasketches-extension.md) |
| `APPROX_QUANTILE(expr, probability, [resolution])` | 在数值表达式或者[近似图](../Configuration/core-ext/approximate-histograms.md) 表达式上计算近似分位数,"probability"应该是位于0到1之间(不包括1),"resolution"是用于计算的centroids,更高的resolution将会获得更精确的结果,默认值为50。使用该函数需要加载 [近似直方图扩展](../Configuration/core-ext/approximate-histograms.md) |
| `APPROX_QUANTILE_DS(expr, probability, [k])` | 在数值表达式或者 [Quantiles sketch](../Configuration/core-ext/datasketches-quantiles.md) 表达式上计算近似分位数,"probability"应该是位于0到1之间(不包括1), `k`参数在Quantiles Sketch文档中做了描述。使用该函数需要加载 [DataSketches扩展](../Development/datasketches-extension.md) |
| `APPROX_QUANTILE_DS(expr, probability, [k])` | 在数值表达式或者 [Quantiles sketch](../Configuration/core-ext/datasketches-quantiles.md) 表达式上计算近似分位数,"probability"应该是位于0到1之间(不包括1), `k`参数在Quantiles Sketch文档中做了描述。使用该函数需要加载 [DataSketches扩展](../development/datasketches-extension.md) |
| `APPROX_QUANTILE_FIXED_BUCKETS(expr, probability, numBuckets, lowerLimit, upperLimit, [outlierHandlingMode])` | 在数值表达式或者[fixed buckets直方图](../Configuration/core-ext/approximate-histograms.md) 表达式上计算近似分位数,"probability"应该是位于0到1之间(不包括1), `numBuckets`, `lowerLimit`, `upperLimit``outlierHandlingMode` 参数在fixed buckets直方图文档中做了描述。 使用该函数需要加载 [近似直方图扩展](../Configuration/core-ext/approximate-histograms.md) |
| `DS_QUANTILES_SKETCH(expr, [k])` | 在表达式的值上创建一个[`Quantiles sketch`](../Configuration/core-ext/datasketches-quantiles.md),该值可以是常规列或者包括Quantiles Sketch的列。`k`参数在Quantiles Sketch文档中做了描述。使用该函数需要加载 [DataSketches扩展](../Development/datasketches-extension.md) |
| `DS_QUANTILES_SKETCH(expr, [k])` | 在表达式的值上创建一个[`Quantiles sketch`](../Configuration/core-ext/datasketches-quantiles.md),该值可以是常规列或者包括Quantiles Sketch的列。`k`参数在Quantiles Sketch文档中做了描述。使用该函数需要加载 [DataSketches扩展](../development/datasketches-extension.md) |
| `BLOOM_FILTER(expr, numEntries)` | 根据`expr`生成的值计算bloom筛选器,其中`numEntries`在假阳性率增加之前具有最大数量的不同值。详细可以参见 [Bloom过滤器扩展](../Configuration/core-ext/bloom-filter.md) |
| `TDIGEST_QUANTILE(expr, quantileFraction, [compression])` | 根据`expr`生成的值构建一个T-Digest sketch,并返回分位数的值。"compression"(默认值100)确定sketch的精度和大小。更高的compression意味着更高的精度,但更多的空间来存储sketch。有关更多详细信息,请参阅 [t-digest扩展文档](../Configuration/core-ext/tdigestsketch-quantiles.md) |
| `TDIGEST_GENERATE_SKETCH(expr, [compression])` | 根据`expr`生成的值构建一个T-Digest sketch。"compression"(默认值100)确定sketch的精度和大小。更高的compression意味着更高的精度,但更多的空间来存储sketch。有关更多详细信息,请参阅 [t-digest扩展文档](../Configuration/core-ext/tdigestsketch-quantiles.md) |
@ -326,7 +326,7 @@ Druid的原生类型系统允许字符串可能有多个值。这些 [多值维
**HLL Sketch函数**
以下函数操作在 [DataSketches HLL sketches](../Configuration/core-ext/datasketches-hll.md) 之上,使用这些函数之前需要加载 [DataSketches扩展](../Development/datasketches-extension.md)
以下函数操作在 [DataSketches HLL sketches](../Configuration/core-ext/datasketches-hll.md) 之上,使用这些函数之前需要加载 [DataSketches扩展](../development/datasketches-extension.md)
| 函数 | 描述 |
|-|-|
@ -337,7 +337,7 @@ Druid的原生类型系统允许字符串可能有多个值。这些 [多值维
**Theta Sketch函数**
以下函数操作在 [theta sketches](../Configuration/core-ext/datasketches-theta.md) 之上,使用这些函数之前需要加载 [DataSketches扩展](../Development/datasketches-extension.md)
以下函数操作在 [theta sketches](../Configuration/core-ext/datasketches-theta.md) 之上,使用这些函数之前需要加载 [DataSketches扩展](../development/datasketches-extension.md)
| 函数 | 描述 |
|-|-|
@ -349,7 +349,7 @@ Druid的原生类型系统允许字符串可能有多个值。这些 [多值维
**Quantiles Sketch函数**
以下函数操作在 [quantiles sketches](../Configuration/core-ext/datasketches-quantiles.md) 之上,使用这些函数之前需要加载 [DataSketches扩展](../Development/datasketches-extension.md)
以下函数操作在 [quantiles sketches](../Configuration/core-ext/datasketches-quantiles.md) 之上,使用这些函数之前需要加载 [DataSketches扩展](../development/datasketches-extension.md)
| 函数 | 描述 |
|-|-|
@ -647,7 +647,7 @@ try (Connection connection = DriverManager.getConnection(url, connectionProperti
**连接粘性**
Druid的JDBC服务不在Broker之间共享连接状态。这意味着,如果您使用JDBC并且有多个Druid Broker,您应该连接到一个特定的Broker,或者使用启用了粘性会话的负载平衡器。Druid Router进程在平衡JDBC请求时提供连接粘性,即使使用普通的非粘性负载平衡器,也可以用来实现必要的粘性。请参阅 [Router文档](../Design/Router.md) 以了解更多详细信息
Druid的JDBC服务不在Broker之间共享连接状态。这意味着,如果您使用JDBC并且有多个Druid Broker,您应该连接到一个特定的Broker,或者使用启用了粘性会话的负载平衡器。Druid Router进程在平衡JDBC请求时提供连接粘性,即使使用普通的非粘性负载平衡器,也可以用来实现必要的粘性。请参阅 [Router文档](../design/Router.md) 以了解更多详细信息
注意:非JDBC的 [HTTP POST](#http-post) 是无状态的,不需要粘性
@ -759,10 +759,10 @@ segments表提供了所有Druid段的详细信息,无论该段是否被发布
| `partition_num` | LONG | 分区号(整数,在数据源+间隔+版本中是唯一的;不一定是连续的) |
| `num_replicas` | LONG | 当前正在服务的此段的副本数 |
| `num_rows` | LONG | 当前段中的行数,如果查询时Broker未知,则此值可以为空 |
| `is_published` | LONG | 布尔值表示为long类型,其中1=true,0=false。1表示此段已发布到元数据存储且 `used=1`。详情查看 [架构页面](../Design/Design.md) |
| `is_available` | LONG | 布尔值表示为long类型,其中1=true,0=false。1表示此段当前由任何进程(Historical或Realtime)提供服务。详情查看 [架构页面](../Design/Design.md) |
| `is_published` | LONG | 布尔值表示为long类型,其中1=true,0=false。1表示此段已发布到元数据存储且 `used=1`。详情查看 [架构页面](../design/Design.md) |
| `is_available` | LONG | 布尔值表示为long类型,其中1=true,0=false。1表示此段当前由任何进程(Historical或Realtime)提供服务。详情查看 [架构页面](../design/Design.md) |
| `is_realtime` | LONG | 布尔值表示为long类型,其中1=true,0=false。如果此段仅由实时任务提供服务,则为1;如果任何Historical进程正在为此段提供服务,则为0。 |
| `is_overshadowed` | LONG | 布尔值表示为long类型,其中1=true,0=false。如果此段已发布,并且被其他已发布的段完全覆盖则为1。目前,对于未发布的段,`is_overshadowed` 总是false,尽管这在未来可能会改变。可以通过过滤 `is_published=1``is_overshadowed=0` 来筛选"应该发布"的段。如果段最近被替换,它们可以短暂地被发布,也可以被掩盖,但还没有被取消发布。详情查看 [架构页面](../Design/Design.md) |
| `is_overshadowed` | LONG | 布尔值表示为long类型,其中1=true,0=false。如果此段已发布,并且被其他已发布的段完全覆盖则为1。目前,对于未发布的段,`is_overshadowed` 总是false,尽管这在未来可能会改变。可以通过过滤 `is_published=1``is_overshadowed=0` 来筛选"应该发布"的段。如果段最近被替换,它们可以短暂地被发布,也可以被掩盖,但还没有被取消发布。详情查看 [架构页面](../design/Design.md) |
| `payload` | STRING | JSON序列化数据段负载 |
例如,要检索数据源"wikipedia"的所有段,请使用查询:

2
Querying/filters.md

@ -116,7 +116,7 @@ JavaScript函数需要一个维度值的参数,返回值要么是true或者fal
JavaScript过滤器支持使用提取函数,详情可见 [带提取函数的过滤器](#带提取函数的过滤器)
> [!WARNING]
> 基于JavaScript的功能默认是禁用的。 如何启用它以及如何使用Druid JavaScript功能,参考 [JavaScript编程指南](../Development/JavaScript.md)。
> 基于JavaScript的功能默认是禁用的。 如何启用它以及如何使用Druid JavaScript功能,参考 [JavaScript编程指南](../development/JavaScript.md)。
### **提取过滤器(Extraction Filter)**

4
Querying/lookups.md

@ -2,7 +2,7 @@
## Lookups
> [!WARNING]
> Lookups是一个 [实验性的特性](../Development/experimental.md)
> Lookups是一个 [实验性的特性](../development/experimental.md)
Lookups是Apache Druid中的一个概念,在Druid中维度值(可选地)被新值替换,从而允许类似join的功能。在Druid中应用Lookup类似于在数据仓库中的联接维度表。有关详细信息,请参见 [维度说明](querydimensions.md)。在这些文档中,"key"是指要匹配的维度值,"value"是指其替换的目标值。所以如果你想把 `appid-12345` 映射到`Super Mega Awesome App`,那么键应该是 `appid-12345`,值就是 `Super Mega Awesome App`
@ -85,7 +85,7 @@ GROUP BY 1
### 动态配置
> [!WARNING]
> 动态Lookup配置是一个 [实验特性](../Development/experimental.md), 不再支持静态配置。下面的文档说明了集群范围的配置,该配置可以通过Coordinator进行访问。配置通过服务器的"tier"概念传播。"tier"被定义为一个应该接收一组Lookup的服务集合。例如,您可以让所有Historical都是 `_default`,而Peon是它们所负责的数据源的各个层的一部分。Lookups的tier完全独立于Historical tiers。
> 动态Lookup配置是一个 [实验特性](../development/experimental.md), 不再支持静态配置。下面的文档说明了集群范围的配置,该配置可以通过Coordinator进行访问。配置通过服务器的"tier"概念传播。"tier"被定义为一个应该接收一组Lookup的服务集合。例如,您可以让所有Historical都是 `_default`,而Peon是它们所负责的数据源的各个层的一部分。Lookups的tier完全独立于Historical tiers。
这些配置都可以通过以下URI模板来使用JSON获取到:

2
Querying/makeNativeQueries.md

@ -36,7 +36,7 @@ curl -X POST '<queryable_host>:<port>/druid/v2/?pretty' -H 'Content-Type:applica
Druid的原生查询级别相对较低,与内部执行计算的方式密切相关。Druid查询被设计成轻量级的,并且非常快速地完成。这意味着对于更复杂的分析,或者构建更复杂的可视化,可能需要多个Druid查询。
即使查询通常是向Broker或Router发出的,但是它们也可以被 [Historical进程](../Design/Historical.md) 和运行流摄取任务的 [peon(任务jvm)](../Design/Peons.md) 接受。如果您想查询由特定进程提供服务的特定段的结果,这可能很有价值。
即使查询通常是向Broker或Router发出的,但是它们也可以被 [Historical进程](../design/Historical.md) 和运行流摄取任务的 [peon(任务jvm)](../design/Peons.md) 接受。如果您想查询由特定进程提供服务的特定段的结果,这可能很有价值。
### 可用的查询

2
Querying/multi-value-dimensions.md

@ -3,7 +3,7 @@
Apache Druid支持多值字符串维度。当输入字段中包括一个数组值而非单一值(例如,JSON数组,或者包括多个 `listDelimiter` 分割的TSV字段)时即可生成多值维度。
本文档描述了对一个维度进行聚合时,多值维度上的GroupBy查询行为(TopN很类似)。对于多值维度的内部详细信息可以查看 [Segments](../Design/Segments.md) 文档的多值列部分。本文档中的示例都为 [原生Druid查询](makeNativeQueries.md)格式,对于多值维度在SQL中的使用情况请查阅 [Druid SQL 文档](druidsql.md)
本文档描述了对一个维度进行聚合时,多值维度上的GroupBy查询行为(TopN很类似)。对于多值维度的内部详细信息可以查看 [Segments](../design/Segments.md) 文档的多值列部分。本文档中的示例都为 [原生Druid查询](makeNativeQueries.md)格式,对于多值维度在SQL中的使用情况请查阅 [Druid SQL 文档](druidsql.md)

4
Querying/multitenancy.md

@ -46,10 +46,10 @@ Druid还通过提供可配置的数据分发方式来支持多租户。Druid的H
### 支持高查询并发
Druid的基本计算单位是[段](../Design/Segments.md)。进程并行地扫描段,给定进程可以根据`druid.processing.numThreads`的配置并发扫描。为了并行处理更多的数据并提高性能,可以向集群中添加更多的核。Druid段的大小应该使任何给定段上的计算都能在最多500毫秒内完成。
Druid的基本计算单位是[段](../design/Segments.md)。进程并行地扫描段,给定进程可以根据`druid.processing.numThreads`的配置并发扫描。为了并行处理更多的数据并提高性能,可以向集群中添加更多的核。Druid段的大小应该使任何给定段上的计算都能在最多500毫秒内完成。
Druid在内部将扫描段的请求存储在优先队列中。如果一个给定的查询需要扫描比集群中可用处理器总数更多的段,并且许多类似昂贵的查询同时运行,我们不希望任何查询都被耗尽。Druid的内部处理逻辑将扫描一个查询中的一组段,扫描完成后立即释放资源,允许继续扫描来自另一个查询的第二组段。通过保持段计算时间非常小,我们确保不断地产生资源,并且与不同查询相关的段都被处理。
Druid查询可以选择在[查询上下文](query-context.md)中设置`priority`标志。已知速度较慢的查询(下载或报告样式的查询)可以取消优先级,交互程度更高的查询可以具有更高的优先级。
Broker进程也可以专用于给定的层。例如,一组Broker进程可以专用于快速交互查询,另一组Broker进程可以专用于较慢的报告查询。Druid还提供了一个[Router](../Design/Router.md)进程,可以根据各种查询参数(datasource、interval等)将查询路由到不同的Broker。
Broker进程也可以专用于给定的层。例如,一组Broker进程可以专用于快速交互查询,另一组Broker进程可以专用于较慢的报告查询。Druid还提供了一个[Router](../design/Router.md)进程,可以根据各种查询参数(datasource、interval等)将查询路由到不同的Broker。

2
Querying/postaggregation.md

@ -113,7 +113,7 @@ postAggregation : {
```
> [!WARNING]
> 基于JavaScript的功能默认是禁用的。 如何启用它以及如何使用Druid JavaScript功能,参考 [JavaScript编程指南](../Development/JavaScript.md)。
> 基于JavaScript的功能默认是禁用的。 如何启用它以及如何使用Druid JavaScript功能,参考 [JavaScript编程指南](../development/JavaScript.md)。
### 超唯一基数后置聚合器(HyperUnique Cardinality post-aggregator)

2
Querying/queryexecution.md

@ -23,7 +23,7 @@ Druid的查询执行方法因查询的 [数据源类型](#数据源类型) 而
直接在 [表数据源](datasource.md#table) 上操作的查询使用由Broker进程引导的**分散-聚集**方法执行。过程如下:
1. Broker根据 `"interval"` 参数确定哪些 [](../Design/Segments.md) 与查询相关。段总是按时间划分的,因此任何间隔与查询间隔重叠的段都可能是相关的。
1. Broker根据 `"interval"` 参数确定哪些 [](../design/Segments.md) 与查询相关。段总是按时间划分的,因此任何间隔与查询间隔重叠的段都可能是相关的。
2. 如果输入数据使用 [`single_dim` partitionsSpec](../DataIngestion/native.md#partitionsSpec) 按范围分区,并且过滤器与用于分区的维度匹配,则Broker还可以根据 `"filter"` 进一步修剪段列表。
3. Broker在删除了查询的段列表之后,将查询转发到当前为这些段提供服务的数据服务器(如Historical或者运行在MiddleManagers的任务)。
4. 对于除 [Scan](scan.md) 之外的所有查询类型,数据服务器并行处理每个段,并为每个段生成部分结果。所做的具体处理取决于查询类型。如果启用了 [查询缓存](querycached.md),则可以缓存这些部分结果。对于Scan查询,段由单个线程按顺序处理,结果不被缓存。

32
SUMMARY.md

@ -16,7 +16,7 @@
* [新手入门]()
* [Druid介绍](GettingStarted/chapter-1.md)
* [快速开始](GettingStarted/chapter-2.md)
* [Docker](GettingStarted/Docker.md)
* [Docker](tutorials/docker.md)
* [单服务器部署](GettingStarted/chapter-3.md)
* [集群部署](GettingStarted/chapter-4.md)
@ -35,20 +35,20 @@
* [Kerberized HDFS存储](tutorials/chapter-12.md)
* [架构设计]()
* [整体设计](Design/Design.md)
* [段设计](Design/Segments.md)
* [进程与服务](Design/Processes.md)
* [Coordinator](Design/Coordinator.md)
* [Overlord](Design/Overlord.md)
* [Historical](Design/Historical.md)
* [MiddleManager](Design/MiddleManager.md)
* [Broker](Design/Broker.md)
* [Router](Design/Router.md)
* [Indexer](Design/Indexer.md)
* [Peon](Design/Peons.md)
* [深度存储](Design/Deepstorage.md)
* [元数据存储](Design/Metadata.md)
* [Zookeeper](Design/Zookeeper.md)
* [整体设计](design/Design.md)
* [段设计](design/Segments.md)
* [进程与服务](design/Processes.md)
* [Coordinator](design/Coordinator.md)
* [Overlord](design/Overlord.md)
* [Historical](design/Historical.md)
* [MiddleManager](design/MiddleManager.md)
* [Broker](design/Broker.md)
* [Router](design/Router.md)
* [Indexer](design/Indexer.md)
* [Peon](design/Peons.md)
* [深度存储](design/Deepstorage.md)
* [元数据存储](design/Metadata.md)
* [Zookeeper](design/Zookeeper.md)
* [数据摄取]()
* [摄取概述](DataIngestion/ingestion.md)
@ -107,7 +107,7 @@
* [操作指南](Operations/index.md)
* [开发指南]()
* [开发指南](Development/index.md)
* [开发指南](development/index.md)
* [其他相关]()
* [其他相关](misc/index.md)

12
_sidebar.md

@ -3,9 +3,9 @@
- [公众平台](CONTACT.md)
- 开始使用
- [从文件中载入数据](yong-zhou/ling-ling/mao-ping-li-cun/index.md)
- [从 Kafka 中载入数据](yong-zhou/ling-ling/tang-fu-cun/index.md)
- [从 Hadoop 中载入数据](yong-zhou/ling-ling/zhao-jia-wan-cun/index.md)
- [Druid 介绍](design/index.md)
- [快速开始](tutorials/index.md)
- [Docker 容器](tutorials/docker.md)
- 设计(Design)
- [JWT](jwt/README.md)
@ -16,7 +16,11 @@
- [面试问题和经验](interview/index.md)
- [算法题](algorithm/index.md)
- 查询(Querying)
- 开发(Development)
- [在 Druid 中进行开发](development/index.md)
- [创建扩展(extensions)](development/modules.md)
- 其他杂项(Misc)
- [Druid 资源快速导航](misc/index.md)

0
Design/Broker.md → design/Broker.md

0
Design/Coordinator.md → design/Coordinator.md

0
Design/Deepstorage.md → design/Deepstorage.md

2
Design/Design.md → design/Design.md

@ -139,7 +139,7 @@ clarity-cloud0_2018-05-21T16:00:00.000Z_2018-05-21T17:00:00.000Z_2018-05-21T15:5
### 查询处理
查询首先进入[Broker](../Design/Broker.md), Broker首先鉴别哪些段可能与本次查询有关。 段的列表总是按照时间进行筛选和修剪的,当然也可能由其他属性,具体取决于数据源的分区方式。然后,Broker将确定哪些[Historical](../Design/Historical.md)和[MiddleManager](../Design/MiddleManager.md)为这些段提供服务、并向每个进程发送一个子查询。 Historical和MiddleManager进程接收查询、处理查询并返回结果,Broker将接收到的结果合并到一起形成最后的结果集返回给调用者。
查询首先进入[Broker](/Broker.md), Broker首先鉴别哪些段可能与本次查询有关。 段的列表总是按照时间进行筛选和修剪的,当然也可能由其他属性,具体取决于数据源的分区方式。然后,Broker将确定哪些[Historical](/Historical.md)和[MiddleManager](/MiddleManager.md)为这些段提供服务、并向每个进程发送一个子查询。 Historical和MiddleManager进程接收查询、处理查询并返回结果,Broker将接收到的结果合并到一起形成最后的结果集返回给调用者。
Broker精简是Druid限制每个查询扫描数据量的一个重要方法,但不是唯一的方法。对于比Broker更细粒度级别的精简筛选器,每个段中的索引结构允许Druid在查看任何数据行之前,找出哪些行(如果有的话)与筛选器集匹配。一旦Druid知道哪些行与特定查询匹配,它就只访问该查询所需的特定列。在这些列中,Druid可以从一行跳到另一行,避免读取与查询过滤器不匹配的数据。

0
Design/Historical.md → design/Historical.md

2
Design/Indexer.md → design/Indexer.md

@ -14,7 +14,7 @@
## Indexer
> [!WARNING]
> 索引器是一个可选的和[实验性](../Development/experimental.md)的功能, 其内存管理系统仍在开发中,并将在以后的版本中得到显著增强。
> 索引器是一个可选的和[实验性](../development/experimental.md)的功能, 其内存管理系统仍在开发中,并将在以后的版本中得到显著增强。
Apache Druid索引器进程是MiddleManager + Peon任务执行系统的另一种可替代选择。索引器在单个JVM进程中作为单独的线程运行任务,而不是为每个任务派生单独的JVM进程。

0
Design/Metadata.md → design/Metadata.md

0
Design/MiddleManager.md → design/MiddleManager.md

0
Design/Overlord.md → design/Overlord.md

0
Design/Peons.md → design/Peons.md

2
Design/Processes.md → design/Processes.md

@ -86,7 +86,7 @@ Data服务执行摄取作业并存储可查询数据。
[Indexer](./Indexer.md) 进程是MiddleManager和Peon的替代方法。Indexer在单个JVM进程中作为单个线程运行任务,而不是为每个任务派生单独的JVM进程。
与MiddleManager + Peon系统相比,Indexer的设计更易于配置和部署,并且能够更好地实现跨任务的资源共享。Indexer是一种较新的功能,由于其内存管理系统仍在开发中,因此目前被指定为[实验性的特性](../Development/experimental.md)。它将在Druid的未来版本中继续成熟。
与MiddleManager + Peon系统相比,Indexer的设计更易于配置和部署,并且能够更好地实现跨任务的资源共享。Indexer是一种较新的功能,由于其内存管理系统仍在开发中,因此目前被指定为[实验性的特性](../development/experimental.md)。它将在Druid的未来版本中继续成熟。
通常,您可以部署MiddleManagers或indexer,但不能同时部署两者。

2
Design/Router.md → design/Router.md

@ -98,7 +98,7 @@ Router有一个可配置的策略列表,用于选择将查询路由到哪个Br
```
> [!WARNING]
> 默认情况下禁用基于JavaScript的功能。有关使用Druid的JavaScript功能的指南,包括如何启用它的说明,请参阅[Druid JavaScript编程指南](../Development/JavaScript.md)。
> 默认情况下禁用基于JavaScript的功能。有关使用Druid的JavaScript功能的指南,包括如何启用它的说明,请参阅[Druid JavaScript编程指南](../development/JavaScript.md)。
### Avatica查询平衡

0
Design/Segments.md → design/Segments.md

0
Design/Zookeeper.md → design/Zookeeper.md

0
Design/img/druid-architecture.png → design/img/druid-architecture.png

Image moved unchanged (131 KiB before and after).

0
Design/img/druid-column-types.png → design/img/druid-column-types.png

Image moved unchanged (91 KiB before and after).

0
Design/img/druid-timeline.png → design/img/druid-timeline.png

Image moved unchanged (24 KiB before and after).

100
design/index.md

@ -0,0 +1,100 @@
---
id: index
title: "Introduction to Apache Druid"
---
<!--
~ Licensed to the Apache Software Foundation (ASF) under one
~ or more contributor license agreements. See the NOTICE file
~ distributed with this work for additional information
~ regarding copyright ownership. The ASF licenses this file
~ to you under the Apache License, Version 2.0 (the
~ "License"); you may not use this file except in compliance
~ with the License. You may obtain a copy of the License at
~
~ http://www.apache.org/licenses/LICENSE-2.0
~
~ Unless required by applicable law or agreed to in writing,
~ software distributed under the License is distributed on an
~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
~ KIND, either express or implied. See the License for the
~ specific language governing permissions and limitations
~ under the License.
-->
## What is Druid?
Apache Druid is a real-time analytics database designed for fast slice-and-dice analytics
("[OLAP](http://en.wikipedia.org/wiki/Online_analytical_processing)" queries) on large data sets. Druid is most often
used as a database for powering use cases where real-time ingest, fast query performance, and high uptime are important.
As such, Druid is commonly used for powering GUIs of analytical applications, or as a backend for highly-concurrent APIs
that need fast aggregations. Druid works best with event-oriented data.
Common application areas for Druid include:
- Clickstream analytics (web and mobile analytics)
- Network telemetry analytics (network performance monitoring)
- Server metrics storage
- Supply chain analytics (manufacturing metrics)
- Application performance metrics
- Digital marketing/advertising analytics
- Business intelligence / OLAP
Druid's core architecture combines ideas from data warehouses, timeseries databases, and logsearch systems. Some of
Druid's key features are:
1. **Columnar storage format.** Druid uses column-oriented storage, meaning it only needs to load the exact columns
needed for a particular query. This gives a huge speed boost to queries that only hit a few columns. In addition, each
column is stored optimized for its particular data type, which supports fast scans and aggregations.
2. **Scalable distributed system.** Druid is typically deployed in clusters of tens to hundreds of servers, and can
offer ingest rates of millions of records/sec, retention of trillions of records, and query latencies of sub-second to a
few seconds.
3. **Massively parallel processing.** Druid can process a query in parallel across the entire cluster.
4. **Realtime or batch ingestion.** Druid can ingest data either real-time (ingested data is immediately available for
querying) or in batches.
5. **Self-healing, self-balancing, easy to operate.** As an operator, to scale the cluster out or in, simply add or
remove servers and the cluster will rebalance itself automatically, in the background, without any downtime. If any
Druid servers fail, the system will automatically route around the damage until those servers can be replaced. Druid
is designed to run 24/7 with no need for planned downtimes for any reason, including configuration changes and software
updates.
6. **Cloud-native, fault-tolerant architecture that won't lose data.** Once Druid has ingested your data, a copy is
stored safely in [deep storage](architecture.html#deep-storage) (typically cloud storage, HDFS, or a shared filesystem).
Your data can be recovered from deep storage even if every single Druid server fails. For more limited failures affecting
just a few Druid servers, replication ensures that queries are still possible while the system recovers.
7. **Indexes for quick filtering.** Druid uses [Roaring](https://roaringbitmap.org/) or
[CONCISE](https://arxiv.org/pdf/1004.0403) compressed bitmap indexes to create indexes that power fast filtering and
searching across multiple columns.
8. **Time-based partitioning.** Druid first partitions data by time, and can additionally partition based on other fields.
This means time-based queries will only access the partitions that match the time range of the query. This leads to
significant performance improvements for time-based data.
9. **Approximate algorithms.** Druid includes algorithms for approximate count-distinct, approximate ranking, and
computation of approximate histograms and quantiles. These algorithms offer bounded memory usage and are often
substantially faster than exact computations. For situations where accuracy is more important than speed, Druid also
offers exact count-distinct and exact ranking.
10. **Automatic summarization at ingest time.** Druid optionally supports data summarization at ingestion time. This
summarization partially pre-aggregates your data, and can lead to big cost savings and performance boosts.
## When should I use Druid?
Druid is used by many companies of various sizes for many different use cases. Check out the
[Powered by Apache Druid](/druid-powered) page.
Druid is likely a good choice if your use case fits a few of the following descriptors:
- Insert rates are very high, but updates are less common.
- Most of your queries are aggregation and reporting queries ("group by" queries). You may also have searching and
scanning queries.
- You are targeting query latencies of 100ms to a few seconds.
- Your data has a time component (Druid includes optimizations and design choices specifically related to time).
- You may have more than one table, but each query hits just one big distributed table. Queries may potentially hit more
than one smaller "lookup" table.
- You have high cardinality data columns (e.g. URLs, user IDs) and need fast counting and ranking over them.
- You want to load data from Kafka, HDFS, flat files, or object storage like Amazon S3.
Situations where you would likely _not_ want to use Druid include:
- You need low-latency updates of _existing_ records using a primary key. Druid supports streaming inserts, but not streaming updates (updates are done using
background batch jobs).
- You are building an offline reporting system where query latency is not very important.
- You want to do "big" joins (joining one big fact table to another big fact table) and you are okay with these queries
taking a long time to complete.

0
Development/JavaScript.md → development/JavaScript.md

0
Development/S3-compatible.md → development/S3-compatible.md

0
Development/avro-extensions.md → development/avro-extensions.md

0
Development/datasketches-extension.md → development/datasketches-extension.md

0
Development/experimental.md → development/experimental.md

54
development/extensions-contrib/aliyun-oss-extensions.md

@ -0,0 +1,54 @@
---
id: aliyun-oss
title: "Aliyun OSS"
---
<!--
~ Licensed to the Apache Software Foundation (ASF) under one
~ or more contributor license agreements. See the NOTICE file
~ distributed with this work for additional information
~ regarding copyright ownership. The ASF licenses this file
~ to you under the Apache License, Version 2.0 (the
~ "License"); you may not use this file except in compliance
~ with the License. You may obtain a copy of the License at
~
~ http://www.apache.org/licenses/LICENSE-2.0
~
~ Unless required by applicable law or agreed to in writing,
~ software distributed under the License is distributed on an
~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
~ KIND, either express or implied. See the License for the
~ specific language governing permissions and limitations
~ under the License.
-->
To use this Apache Druid extension, make sure to [include](../../development/extensions.md#loading-extensions) `aliyun-oss-extensions` extension.
## Deep Storage
[Aliyun](https://www.aliyun.com) is the 3rd largest cloud infrastructure provider in the world. It provides its own storage solution known as OSS, [Object Storage Service](https://www.aliyun.com/product/oss).
To use Aliyun OSS as deep storage, first configure the following properties:
|Property|Description|Possible Values|Default|
|--------|---------------|-----------|-------|
|`druid.oss.accessKey`|the `AccessKey ID` of your account which can be used to access the bucket| |Must be set.|
|`druid.oss.secretKey`|the `AccessKey Secret` of your account which can be used to access the bucket| |Must be set. |
|`druid.oss.endpoint`|the endpoint url of your OSS storage| |Must be set.|
If you want to use OSS as deep storage, use the configurations below:
|Property|Description|Possible Values|Default|
|--------|---------------|-----------|-------|
|`druid.storage.type`| Global deep storage provider. Must be set to `oss` to make use of this extension. | oss |Must be set.|
|`druid.storage.oss.bucket`|storage bucket name.| | Must be set.|
|`druid.storage.oss.prefix`|a prefix string prepended to the file names for the segments published to aliyun OSS deep storage| druid/segments | |
To save index logs to OSS, apply the configurations below:
|Property|Description|Possible Values|Default|
|--------|---------------|-----------|-------|
|`druid.indexer.logs.type`| Global deep storage provider. Must be set to `oss` to make use of this extension. | oss |Must be set.|
|`druid.indexer.logs.oss.bucket`|the bucket used to keep logs| |Must be set.|
|`druid.indexer.logs.oss.prefix`|a prefix string prepended to the log files.| | |
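Putting the three tables above together, a hedged `common.runtime.properties` sketch; the endpoint, bucket names, and credentials are placeholders:
```properties
druid.oss.accessKey=your-accesskey-id
druid.oss.secretKey=your-accesskey-secret
druid.oss.endpoint=oss-cn-hangzhou.aliyuncs.com

druid.storage.type=oss
druid.storage.oss.bucket=druid-bucket
druid.storage.oss.prefix=druid/segments

druid.indexer.logs.type=oss
druid.indexer.logs.oss.bucket=druid-bucket
druid.indexer.logs.oss.prefix=druid/indexing-logs
```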

99
development/extensions-contrib/ambari-metrics-emitter.md

@ -0,0 +1,99 @@
---
id: ambari-metrics-emitter
title: "Ambari Metrics Emitter"
---
<!--
~ Licensed to the Apache Software Foundation (ASF) under one
~ or more contributor license agreements. See the NOTICE file
~ distributed with this work for additional information
~ regarding copyright ownership. The ASF licenses this file
~ to you under the Apache License, Version 2.0 (the
~ "License"); you may not use this file except in compliance
~ with the License. You may obtain a copy of the License at
~
~ http://www.apache.org/licenses/LICENSE-2.0
~
~ Unless required by applicable law or agreed to in writing,
~ software distributed under the License is distributed on an
~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
~ KIND, either express or implied. See the License for the
~ specific language governing permissions and limitations
~ under the License.
-->
To use this Apache Druid extension, make sure to [include](../../development/extensions.md#loading-extensions) `ambari-metrics-emitter` extension.
## Introduction
This extension emits Druid metrics to an ambari-metrics carbon server.
Events are sent after being [pickled](http://ambari-metrics.readthedocs.org/en/latest/feeding-carbon.html#the-pickle-protocol); the size of the batch is configurable.
## Configuration
All the configuration parameters for ambari-metrics emitter are under `druid.emitter.ambari-metrics`.
|property|description|required?|default|
|--------|-----------|---------|-------|
|`druid.emitter.ambari-metrics.hostname`|The hostname of the ambari-metrics server.|yes|none|
|`druid.emitter.ambari-metrics.port`|The port of the ambari-metrics server.|yes|none|
|`druid.emitter.ambari-metrics.protocol`|The protocol used to send metrics to ambari metrics collector. One of http/https|no|http|
|`druid.emitter.ambari-metrics.trustStorePath`|Path to trustStore to be used for https|no|none|
|`druid.emitter.ambari-metrics.trustStoreType`|trustStore type to be used for https|no|none|
|`druid.emitter.ambari-metrics.trustStorePassword`|trustStore password to be used for https|no|none|
|`druid.emitter.ambari-metrics.batchSize`|Number of events to send as one batch.|no|100|
|`druid.emitter.ambari-metrics.eventConverter`| Filter and converter of druid events to ambari-metrics timeline event(please see next section). |yes|none|
|`druid.emitter.ambari-metrics.flushPeriod` | Queue flushing period in milliseconds. |no|1 minute|
|`druid.emitter.ambari-metrics.maxQueueSize`| Maximum size of the queue used to buffer events. |no|`MAX_INT`|
|`druid.emitter.ambari-metrics.alertEmitters`| List of emitters where alerts will be forwarded to. |no| empty list (no forwarding)|
|`druid.emitter.ambari-metrics.emitWaitTime` | Wait time in milliseconds to try to send the event; otherwise the emitter will throw the event away. |no|0|
|`druid.emitter.ambari-metrics.waitForEventTime` | waiting time in milliseconds if necessary for an event to become available. |no|1000 (1 sec)|
### Druid to Ambari Metrics Timeline Event Converter
Ambari Metrics Timeline Event Converter defines a mapping between druid metrics name plus dimensions to a timeline event metricName.
ambari-metrics metric path is organized using the following schema:
`<namespacePrefix>.[<druid service name>].[<druid hostname>].<druid metrics dimensions>.<druid metrics name>`
Properly naming the metrics is critical to avoid conflicts, confusing data and potentially wrong interpretation later on.
Example `druid.historical.hist-host1:8080.MyDataSourceName.GroupBy.query/time`:
* `druid` -> namespace prefix
* `historical` -> service name
* `hist-host1:8080` -> druid hostname
* `MyDataSourceName` -> dimension value
* `GroupBy` -> dimension value
* `query/time` -> metric name
We have two different implementations of the event converter:
#### Send-All converter
The first implementation called `all`, will send all the druid service metrics events.
The path will be in the form `<namespacePrefix>.[<druid service name>].[<druid hostname>].<dimensions values ordered by dimension's name>.<metric>`
User has control of `<namespacePrefix>.[<druid service name>].[<druid hostname>].`
```json
druid.emitter.ambari-metrics.eventConverter={"type":"all", "namespacePrefix": "druid.test", "appName":"druid"}
```
#### White-list based converter
The second implementation called `whiteList`, will send only the white listed metrics and dimensions.
Same as for the `all` converter user has control of `<namespacePrefix>.[<druid service name>].[<druid hostname>].`
White-list based converter comes with the following default white list map located under resources in `./src/main/resources/defaultWhiteListMap.json`
However, the user can override the default white list map by supplying a property called `mapPath`.
This property is a String containing the path for the file containing **white list map JSON object**.
For example the following converter will read the map from the file `/pathPrefix/fileName.json`.
```json
druid.emitter.ambari-metrics.eventConverter={"type":"whiteList", "namespacePrefix": "druid.test", "ignoreHostname":true, "appName":"druid", "mapPath":"/pathPrefix/fileName.json"}
```
**Druid emits a huge number of metrics; we highly recommend using the `whiteList` converter.**
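A hedged minimal configuration sketch for this emitter; the collector hostname and port are placeholders, and enabling the emitter via `druid.emitter` is assumed rather than taken from the text above:
```properties
druid.emitter=ambari-metrics
druid.emitter.ambari-metrics.hostname=ambari-collector.example.com
druid.emitter.ambari-metrics.port=6188
druid.emitter.ambari-metrics.eventConverter={"type":"whiteList", "namespacePrefix": "druid.test", "appName":"druid"}
```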

30
development/extensions-contrib/cassandra.md

@ -0,0 +1,30 @@
---
id: cassandra
title: "Apache Cassandra"
---
<!--
~ Licensed to the Apache Software Foundation (ASF) under one
~ or more contributor license agreements. See the NOTICE file
~ distributed with this work for additional information
~ regarding copyright ownership. The ASF licenses this file
~ to you under the Apache License, Version 2.0 (the
~ "License"); you may not use this file except in compliance
~ with the License. You may obtain a copy of the License at
~
~ http://www.apache.org/licenses/LICENSE-2.0
~
~ Unless required by applicable law or agreed to in writing,
~ software distributed under the License is distributed on an
~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
~ KIND, either express or implied. See the License for the
~ specific language governing permissions and limitations
~ under the License.
-->
To use this Apache Druid extension, make sure to [include](../../development/extensions.md#loading-extensions) the `druid-cassandra-storage` extension.
[Apache Cassandra](http://www.datastax.com/what-we-offer/products-services/datastax-enterprise/apache-cassandra) can also
be leveraged for deep storage. This requires some additional Druid configuration as well as setting up the necessary
schema within a Cassandra keyspace.
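As a minimal sketch, loading the extension follows the standard extension-loading mechanism in `common.runtime.properties`; the Cassandra-specific storage properties and the keyspace schema setup are described by the extension itself and are not reproduced here:

```
# illustrative only: enable the Cassandra deep storage extension
druid.extensions.loadList=["druid-cassandra-storage"]
```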

98
development/extensions-contrib/cloudfiles.md

@ -0,0 +1,98 @@
---
id: cloudfiles
title: "Rackspace Cloud Files"
---
<!--
~ Licensed to the Apache Software Foundation (ASF) under one
~ or more contributor license agreements. See the NOTICE file
~ distributed with this work for additional information
~ regarding copyright ownership. The ASF licenses this file
~ to you under the Apache License, Version 2.0 (the
~ "License"); you may not use this file except in compliance
~ with the License. You may obtain a copy of the License at
~
~ http://www.apache.org/licenses/LICENSE-2.0
~
~ Unless required by applicable law or agreed to in writing,
~ software distributed under the License is distributed on an
~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
~ KIND, either express or implied. See the License for the
~ specific language governing permissions and limitations
~ under the License.
-->
To use this Apache Druid extension, make sure to [include](../../development/extensions.md#loading-extensions) the `druid-cloudfiles-extensions` extension.
## Deep Storage
[Rackspace Cloud Files](http://www.rackspace.com/cloud/files/) is another option for deep storage. This requires some additional Druid configuration.
|Property|Possible Values|Description|Default|
|--------|---------------|-----------|-------|
|`druid.storage.type`|cloudfiles||Must be set.|
|`druid.storage.region`||Rackspace Cloud Files region.|Must be set.|
|`druid.storage.container`||Rackspace Cloud Files container name.|Must be set.|
|`druid.storage.basePath`||Rackspace Cloud Files base path to use in the container.|Must be set.|
|`druid.storage.operationMaxRetries`||Number of tries before canceling a Rackspace operation.|10|
|`druid.cloudfiles.userName`||Rackspace Cloud username.|Must be set.|
|`druid.cloudfiles.apiKey`||Rackspace Cloud API key.|Must be set.|
|`druid.cloudfiles.provider`|rackspace-cloudfiles-us,rackspace-cloudfiles-uk|Name of the provider depending on the region.|Must be set.|
|`druid.cloudfiles.useServiceNet`|true,false|Whether to use the internal service net.|true|
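For illustration, the deep storage settings from the table above might look like this in `common.runtime.properties` (all values below are placeholders):

```
druid.storage.type=cloudfiles
druid.storage.region=DFW
druid.storage.container=druid-segments
druid.storage.basePath=druid/segments
druid.cloudfiles.userName=your-rackspace-username
druid.cloudfiles.apiKey=your-rackspace-api-key
druid.cloudfiles.provider=rackspace-cloudfiles-us
```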
## Firehose
<a name="firehose"></a>
#### StaticCloudFilesFirehose
This firehose ingests events, similar to the StaticAzureBlobStoreFirehose, but from Rackspace's Cloud Files.
Data is newline delimited, with one JSON object per line and parsed as per the `InputRowParser` configuration.
The storage account is shared with the one used for Rackspace's Cloud Files deep storage functionality, but blobs can be in a different region and container.
As with the Azure blob store, a blob is assumed to be gzipped if its file name ends in `.gz`.
This firehose is _splittable_ and can be used by [native parallel index tasks](../../ingestion/native-batch.md#parallel-task).
Since each split represents an object in this firehose, each worker task of `index_parallel` will read an object.
Sample spec:
```json
"firehose" : {
    "type" : "static-cloudfiles",
    "blobs": [
        {
          "region": "DFW",
          "container": "container",
          "path": "/path/to/your/file.json"
        },
        {
          "region": "ORD",
          "container": "anothercontainer",
          "path": "/another/path.json"
        }
    ]
}
```
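To show how this fits into a task (a sketch only; values are placeholders and the surrounding `dataSchema` and `tuningConfig` are omitted), the firehose is placed inside the task's `ioConfig`:

```json
"ioConfig" : {
  "type" : "index_parallel",
  "firehose" : {
    "type" : "static-cloudfiles",
    "blobs": [
      { "region": "DFW", "container": "container", "path": "/path/to/your/file.json" }
    ]
  }
}
```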
This firehose provides caching and prefetching features. In IndexTask, a firehose can be read twice if intervals or
shardSpecs are not specified; in this case, caching can be useful. Prefetching is preferred when the direct scan of objects is slow.
|property|description|default|required?|
|--------|-----------|-------|---------|
|type|This should be `static-cloudfiles`.|N/A|yes|
|blobs|JSON array of Cloud Files blobs.|N/A|yes|
|maxCacheCapacityBytes|Maximum size of the cache space in bytes. 0 means disabling cache. Cached files are not removed until the ingestion task completes.|1073741824|no|
|maxFetchCapacityBytes|Maximum size of the fetch space in bytes. 0 means disabling prefetch. Prefetched files are removed immediately once they are read.|1073741824|no|
|fetchTimeout|Timeout for fetching a Cloud Files object.|60000|no|
|maxFetchRetry|Maximum retry for fetching a Cloud Files object.|3|no|
Cloud Files Blobs:
|property|description|default|required?|
|--------|-----------|-------|---------|
|container|Name of the Cloud Files container|N/A|yes|
|path|The path where data is located.|N/A|yes|

99
development/extensions-contrib/distinctcount.md

@ -0,0 +1,99 @@
---
id: distinctcount
title: "DistinctCount Aggregator"
---
<!--
~ Licensed to the Apache Software Foundation (ASF) under one
~ or more contributor license agreements. See the NOTICE file
~ distributed with this work for additional information
~ regarding copyright ownership. The ASF licenses this file
~ to you under the Apache License, Version 2.0 (the
~ "License"); you may not use this file except in compliance
~ with the License. You may obtain a copy of the License at
~
~ http://www.apache.org/licenses/LICENSE-2.0
~
~ Unless required by applicable law or agreed to in writing,
~ software distributed under the License is distributed on an
~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
~ KIND, either express or implied. See the License for the
~ specific language governing permissions and limitations
~ under the License.
-->
To use this Apache Druid extension, make sure to [include](../../development/extensions.md#loading-extensions) the `druid-distinctcount` extension.
Additionally, follow these steps:
1. First, use a single-dimension, hash-based partition spec to partition data by a single dimension, for example `visitor_id`. This ensures that all rows with a particular value for that dimension go into the same segment; otherwise the query might over-count.
2. Second, when using distinctCount to calculate the distinct count, make sure queryGranularity is divided exactly by segmentGranularity, or else the result will be wrong.
There are some limitations. When used with groupBy, the number of groupBy keys should not exceed maxIntermediateRows in any segment; if it does, the result will be wrong. When used with topN, numValuesPerPass should not be too big; if it is, distinctCount will use a lot of memory and might cause the JVM to run out of memory.
Example:
## Timeseries query
```json
{
"queryType": "timeseries",
"dataSource": "sample_datasource",
"granularity": "day",
"aggregations": [
{
"type": "distinctCount",
"name": "uv",
"fieldName": "visitor_id"
}
],
"intervals": [
"2016-03-01T00:00:00.000/2013-03-20T00:00:00.000"
]
}
```
## TopN query
```json
{
"queryType": "topN",
"dataSource": "sample_datasource",
"dimension": "sample_dim",
"threshold": 5,
"metric": "uv",
"granularity": "all",
"aggregations": [
{
"type": "distinctCount",
"name": "uv",
"fieldName": "visitor_id"
}
],
"intervals": [
"2016-03-06T00:00:00/2016-03-06T23:59:59"
]
}
```
## GroupBy query
```json
{
"queryType": "groupBy",
"dataSource": "sample_datasource",
"dimensions": ["sample_dim"],
"granularity": "all",
"aggregations": [
{
"type": "distinctCount",
"name": "uv",
"fieldName": "visitor_id"
}
],
"intervals": [
"2016-03-06T00:00:00/2016-03-06T23:59:59"
]
}
```

103
development/extensions-contrib/gce-extensions.md

@ -0,0 +1,103 @@
---
id: gce-extensions
title: "GCE Extensions"
---
<!--
~ Licensed to the Apache Software Foundation (ASF) under one
~ or more contributor license agreements. See the NOTICE file
~ distributed with this work for additional information
~ regarding copyright ownership. The ASF licenses this file
~ to you under the Apache License, Version 2.0 (the
~ "License"); you may not use this file except in compliance
~ with the License. You may obtain a copy of the License at
~
~ http://www.apache.org/licenses/LICENSE-2.0
~
~ Unless required by applicable law or agreed to in writing,
~ software distributed under the License is distributed on an
~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
~ KIND, either express or implied. See the License for the
~ specific language governing permissions and limitations
~ under the License.
-->
To use this Apache Druid extension, make sure to [include](../../development/extensions.md#loading-extensions) `gce-extensions`.
At the moment, this extension only enables Druid to autoscale instances in GCE.
The extension manages the instances to be scaled up and down through the use of the [Managed Instance Groups](https://cloud.google.com/compute/docs/instance-groups/creating-groups-of-managed-instances#resize_managed_group)
of GCE (MIG from now on). This choice has been made to ease the configuration of the machines and simplify their
management.
For this reason, in order to use this extension, the user must have created:
1. An instance template with the right machine type and image to be used to run the MiddleManager
2. A MIG that has been configured to use the instance template created in the point above
Moreover, in order to be able to rescale the machines in the MIG, the Overlord must run with a service account
guaranteeing the following two scopes from the [Compute Engine API](https://developers.google.com/identity/protocols/googlescopes#computev1)
- `https://www.googleapis.com/auth/cloud-platform`
- `https://www.googleapis.com/auth/compute`
## Overlord Dynamic Configuration
The Overlord can dynamically change worker behavior.
The JSON object can be submitted to the Overlord via a POST request at:
```
http://<OVERLORD_IP>:<port>/druid/indexer/v1/worker
```
Optional header parameters for auditing the config change can also be specified.
|Header Param Name| Description | Default |
|----------|-------------|---------|
|`X-Druid-Author`| author making the config change|""|
|`X-Druid-Comment`| comment describing the change being done|""|
A sample worker config spec is shown below:
```json
{
"autoScaler": {
"envConfig" : {
"numInstances" : 1,
"projectId" : "super-project",
"zoneName" : "us-central-1",
"managedInstanceGroupName" : "druid-middlemanagers"
},
"maxNumWorkers" : 4,
"minNumWorkers" : 2,
"type" : "gce"
}
}
```
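For example (an illustrative sketch only; the Overlord address is a placeholder and `gce-worker-config.json` is a hypothetical file holding the spec above), the configuration could be submitted with curl:

```
curl -X POST -H 'Content-Type: application/json' \
  -H 'X-Druid-Author: jane' \
  -H 'X-Druid-Comment: enable GCE autoscaling' \
  -d @gce-worker-config.json \
  http://<OVERLORD_IP>:<port>/druid/indexer/v1/worker
```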
The configuration of the autoscaler is quite simple and is made of only two levels.
The external level specifies the `type` (always `gce` in this case) and two numeric values,
`maxNumWorkers` and `minNumWorkers`, which define the boundaries between which the
number of instances must remain at all times.
The internal level is the `envConfig` and it is used to specify:
- The `numInstances`, which specifies how many workers will be spawned at each
  request to provision more workers. This is safe to leave at `1`
- The `projectId`, the name of the project in which the MIG resides
- The `zoneName`, which identifies in which zone of the world the MIG is
- The `managedInstanceGroupName`, the name of the MIG containing the instances to be created or
  removed
Please refer to the Overlord Dynamic Configuration section in the main [documentation](../../configuration/index.md)
for parameters other than the ones specified here, such as `selectStrategy` etc.
## Known limitations
- The module internally uses the [ListManagedInstances](https://cloud.google.com/compute/docs/reference/rest/v1/instanceGroupManagers/listManagedInstances)
  call from the API and, while the documentation of the API states that the call can be paged through using the
  `pageToken` argument, the responses to such a call do not provide any `nextPageToken` to set that parameter. This means
  that the extension can operate safely with a maximum of 500 MiddleManager instances at any time (the maximum number
  of instances returned for each call).

117
development/extensions-contrib/graphite.md

@ -0,0 +1,117 @@
---
id: graphite
title: "Graphite Emitter"
---
<!--
~ Licensed to the Apache Software Foundation (ASF) under one
~ or more contributor license agreements. See the NOTICE file
~ distributed with this work for additional information