type
status
date
slug
summary
tags
category
icon
password
这篇文章是 Clickhouse.cloud 整理的在使用ClickHouse时可能会遇到的 13 个致命的问题,并提供了一系列解决方案和最佳实践。
Clickhouse.cloud 是提供 ClickHouse 的云服务厂商,他们对 ClickHouse 的理解还是很专业的,建议所有工作内容涉及到 ClickHouse 的同学都读一遍。
翻译工作是由GPT-4和我完成的,中文如果不好理解可以结合英文一起看看。
原文链接:https://clickhouse.com/blog/common-getting-started-issues-with-clickhousea
Introduction
At ClickHouse, we are constantly thinking about our getting started experience and how we can help users get value from our products in the shortest possible time. While most users have a smooth onboarding, we appreciate that ClickHouse is a complex piece of software that introduces new concepts for many. Coupled with the challenges of managing ClickHouse at scale, this was one of the reasons that led us to develop our serverless ClickHouse Cloud solution, which automatically handles many of the common getting-started and subsequent scaling challenges.
在ClickHouse,我们一直在思考我们的入门体验,以及如何帮助用户在最短的时间内从我们的产品中获得价值。虽然大多数用户都有顺利的入门体验,但我们也意识到ClickHouse是一个复杂的软件,对许多人来说引入了新的概念。加上管理大规模ClickHouse的挑战,这是我们开发无服务器ClickHouse Cloud解决方案的原因之一,它可以自动处理许多常见的入门和后续扩展挑战。
However, some issues are simply the result of misconfiguration or, more commonly, misunderstanding of ClickHouse behavior and appropriate feature usage. In this post, we highlight the top 13 problems we see our new users encounter as the result of either using ClickHouse in an anti-pattern or simply not adhering to best usage practices: aka, the 13 deadly sins of getting started with ClickHouse. All of these apply to self-managed users, with a subset still having relevance to ClickHouse Cloud. For each, we recommend a resolution or correct approach.
然而,有些问题仅仅是由于配置错误或更常见的是对ClickHouse行为和适当功能使用的误解而导致的。在本文中,我们重点介绍了我们看到的新用户在使用ClickHouse时遇到的前13个问题,这些问题要么是由于使用了反模式,要么是由于没有遵循最佳使用实践,也就是说,使用ClickHouse时的13个致命错误。所有这些都适用于自部署的用户,其中一部分仍然与ClickHouse Cloud相关。对于每个问题,我们推荐了一个解决方案或正确的方法。
1. Too many parts
An often-seen ClickHouse error, this usually points to incorrect ClickHouse usage and lack of adherence to best practices. This error will often be experienced when inserting data and will be present in ClickHouse logs or in a response to an INSERT request. To understand this error, users need to have a basic understanding of the concept of a part in ClickHouse.
这是一个经常出现的ClickHouse错误,通常指的是错误的ClickHouse使用和缺乏遵循最佳实践。在插入数据时经常会遇到这个错误,并且会在ClickHouse日志中或作为对INSERT请求的响应中出现。要理解这个错误,用户需要对ClickHouse中的分区概念有基本的了解。
A table in ClickHouse consists of data parts sorted by the user's specified primary key (by default, the ORDER BY clause on table creation but see Index Design for the details). When data is inserted in a table, separate data parts are created, and each of them is lexicographically sorted by primary key. For example, if the primary key is (CounterID, Date), the data in the part is sorted first by CounterID, and within each CounterID value by Date. In the background, ClickHouse merges data parts for more efficient storage, similar to a Log-structured merge tree. Each part has its own primary index to allow efficient scanning and identification of where values lie within the parts. When parts are merged, then the merged part’s primary indexes are also merged.
ClickHouse中的表由用户指定的主键排序的data parts组成(默认情况下,在表创建时使用ORDER BY子句,但详细信息请参阅索引设计)。当数据插入表中时,会创建单独的data parts,并且每个部分都按照主键的字典顺序进行排序。例如,如果主键是
(CounterID, Date)
,则 data parts 中的数据首先按 CounterID
排序,CounterID
相同的按 Date
排序。在后台,ClickHouse合并data parts以实现更高效的存储,类似于 Log-structured merge tree (LSM)。每个data parts都有自己的主索引,以便在部分内高效地扫描和识别值所在的位置。当部分合并时,合并部分的主索引也会合并。As the number of parts increases, queries invariably will slow as a result of the need to evaluate more indices and read more files. Users may also experience slow startup times in cases where the part count is high. The creation of too many parts thus results in more internal merges and “pressure” to keep the number of parts low and query performance high. While merges are concurrent, in cases of misuse or misconfiguration, the number of parts can exceed internal configurable limits[1][2]. While these limits can be adjusted, at the expense of query performance, the need to do so will more often point to issues with your usage patterns. As well as causing query performance to degrade, high part counts can also place greater pressure on ClickHouse Keeper in replicated configurations.
随着data parts数量的增加,由于需要评估更多的索引和读取更多的文件,查询必然会变慢。在data parts数量较多的情况下,用户可能还会遇到启动时间缓慢的问题。因此,创建过多的data parts会导致更多的内部合并,并给系统带来更多压力(需要让data parts数量保持在较低的水平,维持较高的查询性能)。虽然合并是并发的,但在滥用或配置错误的情况下,部件数量可能超过内部可配置的限制[1] [2]。虽然可以调整这些限制,但会牺牲查询性能,这更常常指向您使用模式存在问题的情况。过多的data parts不仅会导致查询性能下降,还会在复制配置中给ClickHouse Keeper带来更大的压力。
So, how is it possible to have too many of these parts?
那么,什么情况下会生成这么多data parts呢?
Poorly chosen partitioning key 选择不当的分区键
A common reason is using a partition key with excessive cardinality. On creating a table, users can specify a column as a partition key by which data will be separated. A new file system directory will be created for every key value. This is typically a data management technique, allowing users to cleanly separate data logically in a table, e.g., by day. Operations such as DROP PARTITION subsequently allow fast deletion of data subsets. This powerful feature can, however, easily be misused, with users interpreting it as a simple optimization technique for queries. Importantly, parts belonging to different partitions are never merged. If a key of high cardinality, e.g., date_time_ms, is chosen as a partition key then parts spread across thousands of folders will never be merge candidates - exceeding preconfigured limits and causing the dreaded "Too many inactive parts (N). Parts cleaning are processing significantly slower than inserts" error on subsequent INSERTs. Addressing this problem is simple: choose a sensible partition key with cardinality < 1000.
常见的原因是使用具有高基数的分区键。在创建表时,用户可以指定一个列作为分区键,通过该键值将数据分隔开。每个键值都会创建一个新的文件系统目录。这通常是一种数据管理技术,允许用户在表中逻辑上清晰地分隔数据,例如按天分隔。随后的操作,如DROP PARTITION,可以快速删除数据子集。然而,这个强大的功能很容易被误用,用户可能将其理解为查询的简单优化技术。重要的是,属于不同分区的部分永远不会合并。如果选择一个基数很高的键,例如
date_time_ms
,作为分区键,那么分布在成千上万个文件夹中的部分将永远不会成为合并候选项,超过预配置的限制,并在随后的INSERT操作中导致可怕的"Too many inactive parts (N). Parts cleaning are processing significantly slower than inserts"错误。解决这个问题很简单:选择一个基数小于1000的合理分区键。Many small inserts 大量小的插入操作
As well as the poor selection of a partition key, this issue can manifest itself as a result of many small inserts. Each INSERT into ClickHouse results in an insert block being converted to a part. To keep the number of parts manageable, users should therefore buffer data client-side and insert data as batches - ideally, at least 1000 rows, although this should be tuned. If client-side buffering is not possible, users can defer this task to ClickHouse through async inserts. In this case, ClickHouse will buffer inserts on the local disk before merging them together for insertion into the underlying table.
除了选择错误的分区键之外,这个问题还可能是由于大量小的插入操作导致的。每次向ClickHouse插入数据都会将插入块(insert block)转换为一个data part。为了保持 data parts 的数量可管理,用户应该在客户端缓冲数据并批量插入数据 - 最好是至少1000行(可以根据根据业务场景来调整)。如果客户端无法进行缓冲,用户可以通过异步插入将此任务推迟给ClickHouse。在这种情况下,ClickHouse将在本地磁盘上缓冲插入操作,然后将它们合并到底层表中进行插入。
Buffer tables are also an alternative option here but are less resilient to failure as they hold inserts in memory until a flush occurs. They do have some advantages over async inserts- principally the data will be queryable whilst in the buffer and their compatibility as a buffer to the target table of a materialized view.
缓冲表(Buffer tables)在这里也是一个备选选项,但它们在故障时不够弹性,因为它们将插入的数据保存在内存中,直到进行刷新。它们确实比异步插入有一些优势-主要是在缓冲区中的数据可以进行查询,并且与物化视图的目标表兼容。
Excessive materialized views 过多的物化视图
Other possible causes of this error are excessive materialized views. Materialized views are, in effect, a trigger that runs when a block is inserted into a table. They transform the data e.g., through a GROUP BY, before inserting the result into a different table. This technique is often used to accelerate certain queries by precomputing aggregations at INSERT time. Users can create these materialized views, potentially resulting in many parts. Generally, we recommended that users create views while being aware of the costs and consolidate them where possible.
Too many parts的其他可能原因是过多的物化视图。物化视图实际上是一个触发器,当一个块被插入到表中时运行。它们通过对数据进行转换(例如通过GROUP BY)然后将结果插入到另一个表中。这种技术通常用于通过在插入时预计算聚合来加速某些查询。用户可以创建这些物化视图,可能导致很多data parts。通常,我们建议用户在创建视图时要意识到成本,并在可能的情况下进行合并。
The above list is not an exhaustive cause of this error. For example, mutations (as discussed below) can also cause merge pressure and an accumulation of parts. Finally, we should note that this error, while the most common, is only one manifestation of the above misconfigurations. For example, users can experience other issues as a result of a poor partitioning key. These include, but are not limited to, “no free inodes on the filesystem”, backups taking a long time, and delays on replication (and high load on ClickHouse Keeper).
以上列表并非此错误的详尽原因。例如,变更(如下所讨论的)也可能导致合并压力和parts的积累。最后,我们应该注意到,这个错误虽然最常见,但只是上述错误配置的一种表现。例如,用户可能因为分区键不良而遇到其他问题。这些问题包括但不限于“文件系统上没有可用的inode”,备份时间过长以及复制延迟(以及ClickHouse Keeper的高负载)。
2. Going horizontal too early 过早地转向横向扩展(加机器)
We often have new self-managed users asking us to provide recommendations around orchestration and how to scale to dozens, if not hundreds, of nodes. While technologies such as Kubernetes have made the deployment of multiple instances of stateless applications relatively simple, this pattern should, in nearly all cases, not be required for ClickHouse. Unlike other databases, which may be restricted to a machine size due to inherent limits, e.g., JVM heap size, ClickHouse was designed from the ground up to utilize the full resources of a machine. We commonly find successful deployments with ClickHouse deployed on servers with hundreds of cores, terabytes of RAM, and petabytes of disk space. Most analytical queries have a sort, filter, and aggregation stage. Each of these can be parallelized independently and will, by default, use as many threads as cores, thus utilizing the full machine resources for a query.
我们经常有新的自我管理用户向我们咨询关于编排和如何扩展到数十甚至数百个节点的建议。虽然像Kubernetes这样的技术使得部署多个无状态应用实例相对简单,但在几乎所有情况下,ClickHouse不需要这种模式。与其他数据库不同,其他数据库在单节点上可能会有一些固有限制(例如JVM堆大小),ClickHouse从一开始就设计为充分利用机器的全部资源。我们经常发现ClickHouse成功地部署在拥有数百个核心、数TB的内存和PB级磁盘空间的服务器上。大多数分析查询都有排序、过滤和聚合阶段。每个阶段都可以独立并行化,并且默认情况下会使用与核心数相同的线程数,从而充分利用查询的全部机器资源。
Scaling vertically first has a number of benefits, principally cost efficiency, lower cost of ownership (with respect to operations), and better query performance due to the minimization of data on the network for operations such as JOINs. Of course, users need redundancy in their infrastructure, but two machines should be sufficient for all but the largest use cases.
首先进行纵向扩展(换更好的机器)具有许多好处,主要包括成本效益、拥有成本较低(就运营而言)以及由于在操作中最小化数据在网络上的传输,从而提高查询性能,例如JOIN操作。当然,用户需要在基础设施中具备冗余性,但对于除了最大的使用场景以外,两台机器应该足够了。
For this reason, in addition to simpler scaling mechanics, we prefer to auto-scale vertically in ClickHouse Cloud before considering horizontal scaling. In summary, go vertical before going horizontal!
因此,除了更简单的扩展机制外,在考虑横向扩展之前,我们更倾向于在ClickHouse Cloud中进行纵向自动扩展。总之,在进行横向扩展之前,请先进行纵向扩展!(这个就是云服务的好处了,不用管扩容...)
3. Mutation Pain 变更之痛
While rare in OLAP use cases, the need to modify data is sometimes unavoidable. To address this requirement, ClickHouse offers mutation functionality which allows users to modify inserted data through ALTER queries. ClickHouse performs best on immutable data, and any design pattern which requires data to be updated post-insert should be reviewed carefully.
在OLAP使用案例中很少见,但有时修改数据是不可避免的。为了满足这个需求,ClickHouse提供了变更数据的功能,允许用户通过ALTER查询修改插入的数据。ClickHouse在不可变数据上表现最佳,任何需要在插入后更新数据的设计模式都应该仔细审查。
Internally, mutations work by rewriting whole data parts. This process relies on the same thread pool as merges. Note also that the mutation needs to be applied on all replicas by default. For this reason, mutations are both CPU and IO-intensive and should be scheduled cautiously with permission to run limited to administrators. Resource pressure as a result of mutations manifests itself in several ways. Typically, normally scheduled merges accumulate, which in turn causes our earlier “too many parts” issue. Furthermore, users may experience replication delays. The system.mutations table should give administrators an indication of currently scheduled mutations. Note that mutations can be cancelled, but not rolled back, with the KILL MUTATION query.
变更通过重写整个data parts来工作。这个过程依赖于与合并相同的线程池。还要注意,默认情况下,变更需要在所有副本上应用。因此,变更既是CPU密集型又是IO密集型的,应该谨慎安排,并且只允许管理员有限制地运行。由于变更导致的资源压力会以多种方式表现出来。通常,按计划进行的合并会累积,从而引起我们之前提到的“部分过多”问题。此外,用户可能会遇到复制延迟。system.mutations表应该为管理员提供当前计划的变更的指示。请注意,可以使用KILL MUTATION查询取消变更,但无法回滚。
Deduplication 去重
We often see users needing to schedule merges as a result of duplicate data. Typically we suggest users address this issue upstream and deduplicate prior to insertion into ClickHouse. If this is not possible, users have a number of options: deduplicate at query time or utilize a ReplacingMergeTree.
我们经常看到用户需要安排合并操作,因为存在重复数据。通常我们建议用户在将数据插入ClickHouse之前解决这个问题,进行去重操作。如果这不可行,用户有几个选择:在查询时进行去重,或者使用ReplacingMergeTree。
Deduplicating at query time can be achieved by grouping the data on the fields, which uniquely identify a row, and using the argMax function with a date field to identify the last value for other fields. ReplacingMergeTree allows rows with the same sorting key (ORDER BY key) to be deduplicated on merges. Note this is “best effort” only: sometimes parts will not be merged with the merge process scheduled at non-deterministic intervals. It, therefore, does not guarantee the absence of duplicates. Users can also utilize the FINAL modifier to force this deduplication at SELECT time (again, use cautiously as it is resource intensive and can be slow despite recent improvements) or force merging on disk via an OPTIMIZE FINAL.
查询时去重可以通过根据唯一标识行的字段对数据进行分组,并使用带有日期字段的argMax函数来识别其他字段的最后一个值来实现。ReplacingMergeTree允许在合并时对具有相同排序键(ORDER BY键)的行进行去重。请注意,这仅是“尽力而为”的操作:有时部分数据将不会在(非确定性间隔的)合并过程中合并。因此,它不能保证不存在重复项。用户还可以使用FINAL修饰符来强制在查询时进行去重(同样,要谨慎使用,因为它会消耗资源,并且尽管最近有所改进,但可能会很慢),或者通过OPTIMIZE FINAL在磁盘上强制合并。
In the case where data needs to be deleted from ClickHouse e.g., for compliance or deduplication reasons, users can also utilize lightweight deletes instead of mutations. These take the form of a DELETE statement which accepts WHERE clause to filter rows. This marks rows as deleted only. These marks will be used to filter rows out at query time and will be removed when parts are merged.
在需要从ClickHouse中删除数据的情况下,例如出于合规性或去重的原因,用户还可以使用轻量级删除而不是变异。这些删除操作采用DELETE语句的形式,接受WHERE子句来过滤行。这只是将行标记为已删除。这些标记将在查询时用于过滤行,并在合并部分时被删除。
Note: this feature is experimental and requires the setting SET allow_experimental_lightweight_delete = true;. It is more efficient than using a mutation in most cases, with the exception of if you are doing a large-scale bulk delete.
注意:此功能是实验性的,需要设置
SET allow_experimental_lightweight_delete = true;
。在大多数情况下,它比使用变更更高效,但如果您要进行大规模批量删除,则除外。4. Unnecessary use of complex types 不必要使用复杂类型
As well as supporting the usual primitive types, ClickHouse has rich support for complex types such as Nested, Tuple, Map, and even JSON. These are supported for good reasons - sometimes, there is no other way to model data, but we recommend using primitive types where possible since they offer the best insertion and query time performance.
除了支持常见的原始类型外,ClickHouse还对复杂类型(如Nested、Tuple、Map甚至JSON)提供了丰富的支持。这些支持是有充分理由的 - 有时候,没有其他方式来建模数据,但我们建议在可能的情况下使用原始类型,因为它们提供了最佳的插入和查询时间性能。
As an example, we have recently seen users keen to exploit the JSON features added to ClickHouse in 22.4. This powerful feature allows the table schema to be dynamically inferred from the data, avoiding the need for the user to specify column types. Use this capability with caution and not as a replacement for avoiding specifying columns explicitly. Specifically, this feature has limitations users should be aware of:
作为一个例子,我们最近看到用户热衷于利用在22.4版本中添加到ClickHouse的JSON功能。这个强大的功能允许从数据中动态推断出表模式,避免了用户需要指定列类型的需要。请谨慎使用这个功能,不要将其作为避免明确指定列的替代品。具体来说,这个功能有一些限制,用户应该注意。
- Increased cost at insert time as columns need to be dynamically created 插入时成本增加,因为需要动态创建列
- Sub-optimal type usage, i.e., no codecs and unnecessary use of Nullable. 次优的类型使用,即没有编解码器和不必要地使用Nullable。
- No ability to use JSON columns in a primary key 无法在主键中使用JSON列
The last two of these invariably lead to poorer compression and query/insert performance. Rather than using it for all of your rows, use this specific type for selective columns e.g., Kubernetes tags, where the data is subject to change. In summary, if you know your schema…specify it!
这两种方式往往会导致较差的压缩和查询/插入性能。与其将其用于所有行,不如仅将其用于选择性的列,例如Kubernetes标签,其中数据可能会发生变化。总之,如果你了解你的模式...就明确指定它!
Note: The JSON Object type is experimental and is undergoing improvements. Our advice with respect to this feature is evolving and may therefore change in later versions.
注意:JSON对象类型是实验性的,正在进行改进。关于此功能的建议正在不断发展,因此可能在以后的版本中发生变化。
We additionally often see users reaching for the Nullable type. This allows the value Null to be differentiated from the default value for a type. This can be useful but requires an additional Uint8 column to determine which values are null. This incurs an extra byte per value with respect to storage (although it compresses well), as well as adding a query time overhead. Only use Nullable if you really need it!
我们经常看到用户使用Nullable类型。这允许将Null值与类型的默认值区分开来。这可能很有用,但需要额外的Uint8列来确定哪些值为null。这会增加每个值一个额外的字节的存储开销(尽管它可以很好地压缩),并增加查询时间开销。只有在真正需要时才使用Nullable!
5. Deduplication at insert time 插入时去重
New users to ClickHouse Cloud are often surprised by ClickHouse’s deduplication strategy. This usually occurs when identical inserts appear to not have any effect. For example, consider the following:
新用户使用ClickHouse Cloud时,常常对ClickHouse的去重策略感到惊讶。当相同的插入操作似乎没有任何效果时,通常会出现这种情况。例如,考虑以下情况:
暂时无法在飞书文档外展示此内容
A new user might be surprised by the result here, especially if their prior experience was on a single local instance of ClickHouse. This behavior is the result of the replicated_deduplication_window setting.
一个新用户可能会对这里的结果感到惊讶,特别是如果他们之前的经验是在一个单独的本地ClickHouse实例上。这种行为是
replicated_deduplication_window
设置的结果。When data is inserted into ClickHouse, it creates one or more blocks (parts). In replicated environments, such as ClickHouse Cloud, a hash is also written in ClickHouse Keeper. Subsequent inserted blocks are compared against these hashes and ignored if a match is present. This is useful since it allows clients to safely retry inserts in the event of no acknowledgement from ClickHouse e.g., because of a network interruption. This requires blocks to be identical i.e., the same size with the same rows in the same order. These hashes are stored for only the most recent 100 blocks, although this can be modified. Note higher values will slow down inserts due to the need for more comparisons.
当数据插入ClickHouse时,它会创建一个或多个块(部分)。在复制环境中,比如ClickHouse Cloud,哈希值也会被写入ClickHouse Keeper。后续插入的块会与这些哈希值进行比较,如果存在匹配,则会被忽略。这很有用,因为它允许客户端在没有从ClickHouse收到确认的情况下安全地重试插入,例如由于网络中断。这要求块是相同的,即具有相同的大小、相同的行并且按相同的顺序。这些哈希值仅存储最近的100个块,尽管可以进行修改。请注意,较高的值会因为需要进行更多的比较而减慢插入速度。
This same behavior can be enabled for non-replicated instances via the setting non_replicated_deduplication_window. In this case, the hashes are stored on a local disk.
通过设置
non_replicated_deduplication_window
,可以为非复制实例启用相同的行为。在这种情况下,哈希值存储在本地磁盘上。6. Poor Primary Key Selection 选择不当的主键
Users new to ClickHouse often struggle to fully understand its unique primary key concepts. Unlike B(+)-Tree-based OLTP databases, which are optimized for fast location of specific rows, ClickHouse utilizes a sparse index designed for millions of inserted rows per second and petabyte-scale datasets. In contrast to OLTP databases, this index relies on the data on disk being sorted for fast identification of groups of rows that could possibly match a query - a common requirement in analytical queries. The index, in effect, allows the matching sections of part files to be rapidly identified before they are streamed into the processing engine. For more detail on the layout of the data on disk, we highly recommend this guide.
对于刚接触ClickHouse的用户来说,他们常常难以完全理解其独特的主键概念。与为快速定位特定行而优化的基于B(+)-Tree的OLTP数据库不同,ClickHouse利用了一种稀疏索引,用于每秒插入数百万行和拥有PB级数据集的场景。与OLTP数据库相比,这种索引依赖于磁盘上的数据排序,以便快速识别可能与查询匹配的行组 - 这是分析查询中常见的需求。实际上,该索引允许在将匹配的部分文件流式传输到处理引擎之前快速识别它们。关于磁盘上数据布局的更多详细信息,请参考本指南。
The effectiveness of this approach, for both query performance and compression, relies on the user selecting good primary key columns via the ORDER BY clause when creating a table. In general, users should select columns for which they will often filter tables with more than 2 to 3 columns rarely required. The order of these columns is critical and can affect the compression and filtering by columns other than the first entry. For both the efficient filtering of secondary key columns in queries and the compression ratio of a table's column files, it is optimal to order the columns in a primary key by their cardinality in ascending order. A full explanation of the reasoning can be found here.
这种方法的有效性,无论是查询性能还是压缩,都依赖于用户在创建表时通过ORDER BY子句选择好的主键列。一般来说,用户应该选择那些经常用于过滤表的列,而不是那些很少需要的2到3列。这些列的顺序是关键的,可以影响到除第一个条目之外的列的压缩和过滤。为了在查询中高效过滤次要键列和表的列文件的压缩比率,最好按照它们的基数升序排列主键列。关于这个推理的完整解释可以在这里找到。
7. Overuse of Data Skipping indices 过度使用数据跳过索引
Primary keys are rightly the first tool users turn to when needing to accelerate queries. However, tables are limited to a single primary key, and query access patterns can render this ineffective i.e., for diverse use cases, queries which cannot exploit the primary key efficiently are inevitable. In these cases ClickHouse can be forced to perform a full table scan of each column when applying a WHERE clause condition. Often this will still be sufficiently fast, but in some cases users reach for data skipping indices, hoping to accelerate these queries easily.
主键是用户在需要加速查询时首先使用的工具。然而,表只能有一个主键,查询访问模式可能会使其失效,即对于多样化的用例,无法有效利用主键的查询是不可避免的。在这些情况下,当应用WHERE子句条件时,ClickHouse可能被迫对每个列执行完整的表扫描。通常情况下,这仍然足够快,但在某些情况下,用户会寻求数据跳过索引,希望轻松加速这些查询。
These indices add data structures which allow ClickHouse to skip reading significant chunks of data that are guaranteed to have no matching values. More specifically, they create an index over blocks granules (effectively marks) allowing these to be skipped if the WHERE clause is not satisfied.
这些索引添加了数据结构,使得ClickHouse能够跳过那些肯定没有匹配值的重要数据块的读取。更具体地说,它们在块粒度上创建了一个索引(有效地标记),如果WHERE子句不满足,则可以跳过这些块。
In some circumstances, these can accelerate specific queries, but are typically overused, not intuitive and require careful design to be effective. As a result, we often see them simply complicating table design and slowing insert performance while rarely (if ever) improving query performance. We always encourage users to read the concepts and best practices.
在某些情况下,这些可以加速特定的查询,但通常被过度使用,不直观,并且需要仔细设计才能发挥作用。因此,我们经常看到它们只是复杂化表设计并减慢插入性能,而很少(如果有的话)改善查询性能。我们始终鼓励用户阅读概念和最佳实践。
In most cases skip indices should only be considered once other alternatives have been exhausted - specifically this advanced functionality should only be used after investigating other alternatives such as modifying the primary key (see Options for creating additional primary indexes), using projections or materialized views. In general, only consider skip-indices if there is a strong correlation between the primary key and the targeted, non-primary column/expression. In the absence of any real correlation, the skipping index will match for most blocks - resulting in all granules being read into memory and evaluated. In this case, the index cost has been incurred for no benefit, effectively slowing the full table scan.
在大多数情况下,只有在其他替代方案耗尽之后才应考虑使用跳过索引 - 具体来说,只有在调查其他替代方案(例如修改主键(请参阅创建其他主索引的选项)、使用投影或物化视图)之后才应使用此高级功能。一般来说,只有在主键和目标非主列/表达式之间存在强相关性时才考虑使用跳过索引。如果没有任何真正的相关性,跳过索引将匹配大多数块 - 导致所有block被读入内存并进行评估。在这种情况下,索引成本没有任何好处,实际上会减慢全表扫描的速度。
8. LIMIT doesn’t always short circuit + point lookups LIMIT 并不总是点查找的捷径。
We often find OLTP users new to ClickHouse reaching for the LIMIT clause to optimize queries by limiting the number of results returned. If coming from an OLTP database this should intuitively optimize queries: less data returned = faster result, surely? Yes and no.
我们经常发现刚接触ClickHouse的OLTP用户会使用LIMIT子句来通过限制返回结果的数量来优化查询。如果从一个OLTP数据库过来,这样做在直觉上应该能优化查询:返回的数据越少,结果就越快,对吗?是也不是。
The effectiveness of this technique depends on whether the query can be run entirely in a streaming fashion. Some queries, such as SELECT * FROM table LIMIT 10 will scan only a few granules of the first few parts before reaching 10 results and returning the result to the user. This is also true for cases where the user orders the SELECT by a primary key field due to the optimize_in_read_order setting defaulting to 1. However, if the user runs SELECT a from table ORDER BY b LIMIT N, whereby the table is ordered by a and not by b, ClickHouse cannot avoid reading the entire table i.e., no early termination of the query is possible.
该技术的有效性取决于查询是否可以完全以流式方式运行。某些查询,例如
SELECT * FROM table LIMIT 10
,在达到10个结果并将结果返回给用户之前,只会扫描前几个部分的几个粒度。对于用户按主键字段对SELECT进行排序的情况,默认情况下 optimize_in_read_order
设置为1,也是如此。然而,如果用户运行 SELECT a from table ORDER BY b LIMIT N
,其中表按 a
而不是按 b
排序,ClickHouse无法避免读取整个表,即查询无法提前终止。For aggregations, things are a little more complex. A full table scan is also required unless the user is grouping by the primary key and sets optimize_aggregation_in_order=1. In this case, a propagation signal is sent once sufficient results are acquired. Provided previous steps of the query are capable of streaming the data, e.g., filter, then this mechanism will work, and the query will terminate early. Normally, however, an aggregation must consume all table data before returning and applying the LIMIT as the final stage.
对于聚合操作,情况会稍微复杂一些。除非用户按照主键进行分组并设置
optimize_aggregation_in_order=1
,否则也需要进行完整的表扫描。在这种情况下,一旦获取到足够的结果,就会发送一个传播信号。如果查询的前几个步骤能够流式处理数据,例如过滤操作,那么这个机制就会起作用,并且查询会提前终止。然而,通常情况下,聚合操作必须在返回结果之前消耗所有的表数据,并将LIMIT应用为最后一个阶段。As an example, we create and load the table from our UK Property Price Paid tutorial with 27.55 million rows. This dataset is available within our play.clickhouse.com environment.
作为一个例子,我们从我们的英国房产价格支付教程中创建并加载了一张拥有2755万行的表。这个数据集可以在我们的play.clickhouse.com环境中获得。
With optimize_aggregation_in_order=0 this aggregation query, that is grouping by the primary keys, performs a full table scan before applying the LIMIT 1 clause:
使用这个聚合查询,即按主键分组,会在应用LIMIT 1子句之前执行全表扫描
With optimize_aggregation_in_order=1, the query is able to shortcut and as a result process less data:
使用
optimize_aggregation_in_order=1
,查询能够进行快速处理,从而处理更少的数据We also see even experienced users being caught by less obvious LIMIT behavior in multi-node environments where a table has many shards. Sharding allows users to split or replicate their data across multiple instances of ClickHouse. When a query with a LIMIT N clause is sent to a sharded table e.g. via a distributed table, this clause will be propagated down to each shard. Each shard will, in turn, need to collate the top N results, returning them to the coordinating node. This can prove particularly resource-intensive when users run queries that require a full table scan. Typically these are “point lookups” where the query aims to just identify a few rows. While this can be achieved in ClickHouse with careful index design a non-optimized variant, coupled with a LIMIT clause, can prove extremely resource-intensive.
我们还看到即使是经验丰富的用户也会在多节点环境中遇到较不明显的LIMIT行为问题,其中一个表有许多分片。分片允许用户将数据分割或复制到ClickHouse的多个实例中。当一个带有LIMIT N子句的查询被发送到一个分片表,例如通过分布式表,这个子句将被传播到每个分片。每个分片将需要整理出前N个结果,并将它们返回给协调节点。当用户运行需要对整个表进行扫描的查询时,这可能会消耗大量资源。通常这些查询是“点查找”,即查询只是为了识别几行数据。虽然在ClickHouse中可以通过仔细设计索引来实现这一点,但非优化的变体结合LIMIT子句可能会消耗大量资源。
9. IP Filtering in Cloud
At ClickHouse, we consider security a first-class citizen and consider this in everything we do. This is epitomized by the need for users to specify the IP addresses from which access is permitted when first creating a cluster. By default, we encourage users to be restrictive and modify the allow list as needed. This, unfortunately, can lead to some confusion when users attempt to connect to external Cloud services, e.g., when connecting from Grafana Cloud. We will continue to optimize this experience and provide helpful guidance when this is the case, but we also recommend users obtain the IPs of any external services early during cluster creation to avoid frustrating connection-denied errors.
在ClickHouse中,我们将安全性视为一等公民,并在我们所做的一切中考虑到这一点。这体现在用户在首次创建集群时需要指定允许访问的IP地址。默认情况下,我们鼓励用户采取限制性措施,并根据需要修改允许列表。不幸的是,当用户尝试连接到外部云服务时(例如,从Grafana Cloud连接时),这可能会导致一些困惑。我们将继续优化这一体验,并在这种情况下提供有用的指导,但我们也建议用户在集群创建早期获取任何外部服务的IP地址,以避免令人沮丧的连接被拒绝错误。
10. Readonly tables 只读表
Although not an issue in ClickHouse Cloud, read-only tables continue to raise their head in self-managed clusters. This occurs in replicated environments when a node loses its connection to ZooKeeper. This is typically nearly always the result of ZooKeeper issues. While many of the challenges associated with ZooKeeper were addressed with the release of ClickHouse Keeper, under-resourcing of this component can still cause this issue to manifest itself. Common causes are the hosting of the keeper on the same host as ClickHouse in production or poorly tuned ZooKeeper JVM resources. This is usually easily resolved by ensuring this component is separated on dedicated hardware and given adequate resources.
尽管在ClickHouse Cloud中不是一个问题,但只读表在自管理的集群中仍然存在。这在复制环境中发生,当一个节点失去与ZooKeeper的连接时。这通常几乎总是由于ZooKeeper的问题导致的。虽然ClickHouse Keeper的发布解决了与ZooKeeper相关的许多挑战,但对该组件的资源配置不足仍可能导致此问题出现。常见原因是在生产环境中将keeper托管在与ClickHouse相同的主机上,或者ZooKeeper的JVM资源调优不当。通常可以通过确保将此组件分离到专用硬件上并提供足够的资源来轻松解决这个问题。
11. Memory Limit Exceeded for Query 查询的内存限制超出了
As a new user, ClickHouse can often seem like magic - every query is super fast, even on the largest datasets and most ambitious queries. Invariably though, real-world usage tests even the limits of ClickHouse. Queries exceeding memory can be the result of a number of causes. Most commonly, we see large joins or aggregations on high cardinality fields. If performance is critical, and these queries are required, we often recommend users simply scale up - something ClickHouse Cloud does automatically and effortlessly to ensure your queries remain responsive. We appreciate, however, that in self-managed scenarios, this is sometimes not trivial, and maybe optimal performance is not even required. Users, in this case, have a few options.
作为一个新用户,ClickHouse常常看起来像是魔法 - 每个查询都非常快速,即使在最大的数据集和最复杂的查询上也是如此。然而,现实世界的使用会不可避免地测试ClickHouse的极限。超过内存限制的查询可能是多种原因导致的。最常见的情况是在高基数字段上进行大型连接或聚合。如果性能至关重要,并且需要这些查询,我们通常建议用户简单地扩容 - 这是ClickHouse Cloud自动且轻松地完成的,以确保您的查询保持响应。然而,我们也理解,在自管理的情况下,这有时并不容易,甚至可能不需要最佳性能。在这种情况下,用户有几个选择。
Aggregations 聚合
For memory-intensive aggregations or sorting scenarios, users can use the settings max_bytes_before_external_group_by and max_bytes_before_external_sort respectively. The former of these is discussed extensively here. In summary, this ensures any aggregations can “spill” out to disk if a memory threshold is exceeded. This will invariably impact query performance but will help ensure queries do not OOM. The latter sorting setting helps address similar issues with memory-intensive sorts. This can be particularly important in distributed environments where a coordinating node receives sorted responses from child shards. In this case, the coordinating server can be asked to sort a dataset larger than its available memory. With max_bytes_before_external_sort, sorting can be allowed to spill over to disk. This setting is also helpful for cases where the user has an ORDER BY after a GROUP BY with a LIMIT, especially in cases where the query is distributed.
对于内存密集型的聚合或排序场景,用户可以分别使用设置
max_bytes_before_external_group_by
和 max_bytes_before_external_sort
。前者在这里进行了广泛讨论。简而言之,如果超过内存阈值,这将确保任何聚合可以“溢出”到磁盘。这必然会影响查询性能,但有助于确保查询不会OOM。后者的排序设置有助于解决内存密集型排序的类似问题。这在协调节点从子分片接收排序响应的分布式环境中尤为重要。在这种情况下,可以要求协调服务器对比其可用内存更大的数据集进行排序。使用 max_bytes_before_external_sort
,排序可以允许溢出到磁盘。这个设置对于用户在 GROUP BY
之后使用 LIMIT
的 ORDER BY
的情况也很有帮助,特别是在查询分布式的情况下。JOINs 连接
For joins, users can select different JOIN algorithms, which can assist in lowering the required memory. By default, joins use the hash join, which offers the most completeness with respect to features and often the best performance. This algorithm loads the right-hand table of the JOIN into an in-memory hash table, against which the left-hand table is then evaluated. To minimize memory, users should thus place the smaller table on the right side. This approach still has limitations in memory-bound cases, however. In these cases, partial_merge join can be enabled via the join_algorithm setting. This derivative of the sort-merge algorithm, first sorts the right table into blocks and creates a min-max index for them. It then sorts parts of the left table by the join key and joins them over the right table. The min-max index is used to skip unneeded right table blocks. This is less memory-intensive at the expense of performance. Taking this concept further, the full_sorting_merge algorithm allows a JOIN to be performed when the right-hand side is very large and doesn't fit into memory and lookups are impossible, e.g. a complex subquery. In this case, both the right and left side are sorted on disk if they do not fit in memory, allowing large tables to be joined.
对于连接操作,用户可以选择不同的连接算法,这可以帮助降低所需的内存。默认情况下,连接使用哈希连接,它在功能上最完整,并且通常具有最佳性能。该算法将连接的右表加载到内存中的哈希表中,然后对左表进行评估。为了最小化内存使用,用户应该将较小的表放在右侧。然而,在内存受限的情况下,这种方法仍然有局限性。在这些情况下,可以通过设置启用
join_algorithm
连接。这种派生自排序合并算法的连接首先将右表按块进行排序,并为其创建一个最小-最大索引。然后,它按连接键对左表的部分进行排序,并在右表上进行连接。最小-最大索引用于跳过不需要的右表块。这种方法在性能上牺牲了一部分,但对内存的需求较低。进一步扩展这个概念,当右侧非常大且无法放入内存并且无法进行查找时(例如复杂子查询),可以使用 full_sorting_merge
算法执行连接操作。 在这种情况下,如果内存无法容纳,右侧和左侧都会在磁盘上进行排序,从而实现大表的连接。Since 20.3, ClickHouse has supported an auto value for the join_algorithm setting. This instructs ClickHouse to apply an adaptive join approach, where the hash-join algorithm is preferred until memory limits are violated, at which point the partial_merge algorithm is attempted. Finally, concerning joins, we encourage readers to be aware of the behavior of distributed joins and how to minimize their memory consumption - more information here.
自20.3版本以来,ClickHouse支持
auto
设置的 join_algorithm
值。这指示ClickHouse应用自适应连接方法,优先选择哈希连接算法,直到内存限制被违反,然后尝试使用partial_merge算法。最后,关于连接操作,我们鼓励读者了解分布式连接的行为以及如何最小化其内存消耗-更多信息请参阅此处。Rogue queries 流氓查询
Other causes for memory issues are unrestricted users. In these cases, we see users issuing rogue queries with no quotas or restrictions on query complexity. These controls are essential in providing a robust service if exposing a ClickHouse instance to a broad and diverse set of users. Our own play.clickhouse.com environment uses these effectively to restrict usage and provide a stable environment.
其他导致内存问题的原因是无限制的用户。在这些情况下,我们看到用户发出了没有配额或查询复杂性限制的恶意查询。如果将ClickHouse实例暴露给广泛和多样化的用户群,这些控制措施是提供稳定服务的关键。我们自己的play.clickhouse.com环境有效地使用这些控制措施来限制使用并提供稳定的环境。
ClickHouse also recently introduced new Memory overcommit capabilities. Historically queries would be limited by the max_memory_usage setting (default 10GB), which provided a hard and rather crude limit. Users could raise this at the expense of a single query, potentially impacting other users. Memory overcommit allows more memory-intensive queries to run, provided sufficient resources exist. When the max server memory limit is reached, ClickHouse will determine which queries are most overcommitted and try to kill the query. This may or may not be the query that triggered this condition. If not, the query will wait a period to allow the high-memory query to be killed before continuing to run. This allows low-memory queries to always run, while more intensive queries can run when the server is idle, and resources are available. This behavior can be tuned at a server and user level.
ClickHouse最近还引入了新的内存超额配置功能。在历史上,查询会受到设置(默认为10GB)的限制,这提供了一个硬性且相当粗糙的限制。用户可以在牺牲单个查询的情况下提高这个限制,可能会影响其他用户。内存超额配置允许更多内存密集型查询运行,前提是存在足够的资源。当达到最大服务器内存限制时,ClickHouse将确定哪些查询超额配置最多,并尝试终止该查询。这可能是触发此条件的查询,也可能不是。如果不是,查询将等待一段时间,以允许高内存查询在继续运行之前被终止。这样,低内存查询始终可以运行,而更密集的查询可以在服务器空闲且资源可用时运行。这种行为可以在服务器和用户级别进行调整。
12. Issues relating to Materialized Views 关于物化视图的问题
Materialized views are a powerful feature of ClickHouse. By allowing the reorientation and transformation of data at insert time, users can optimize for specific queries. We often see users using this technique when more than a single primary index is required. There are a number of common issues with materialized views, probably sufficient for their own blog post. Summarizing the most common:
物化视图是ClickHouse的一个强大功能。通过在插入数据时重新定向和转换数据,用户可以针对特定查询进行优化。当需要多个主索引时,我们经常看到用户使用这种技术。物化视图存在一些常见问题,可能需要单独写一篇博客来总结。以下是最常见的问题:
- We often see users misunderstanding how Materialized views work. They have no knowledge of the source table data and are effectively only triggers on inserts - capable of running over the inserted data block only. They have no visibility of merges, partition drop, or mutations. If users change the source table, they must, therefore, also update any attached materialized views - there is no functionality for keeping these in sync. 我们经常看到用户对物化视图的工作原理存在误解。他们对源表数据一无所知,实际上只是插入触发器 - 只能在插入的数据块上运行。它们无法看到合并、分区删除或变异。因此,如果用户更改源表,他们还必须更新任何附加的物化视图 - 没有保持这些视图同步的功能。
- Users add too many materialized views to a single table. These views aren’t free and must be run on each insert. More than 50 materialized views for a table is typically excessive and will slow inserts. As well as the compute overhead, each materialized view will create a new part from the block over which it runs - potentially causing the “Too Many Parts” issue discussed earlier. Note that performance can be improved by parallelizing the running of the views via the setting
parallel_view_processing
. 用户向单个表添加了太多的物化视图。这些视图并非免费,每次插入都必须运行。一个表上超过50个物化视图通常是过多的,会减慢插入速度。除了计算开销外,每个物化视图还会在其运行的块上创建一个新的部分,可能导致之前讨论过的“部分过多”问题。请注意,通过设置parallel_view_processing
可以通过并行运行视图来提高性能。
- State functions are a compelling feature of ClickHouse and allow data to be summarized for later queries using Aggregate functions. Materialized views with many of these, especially those computing quantile states, can be CPU intensive and lead to slow inserts. 状态函数是ClickHouse的一个引人注目的特性,它允许使用聚合函数对数据进行汇总,以便后续查询。具有许多这些状态函数的物化视图,特别是计算分位数状态的视图,可能会对CPU造成负载,并导致插入速度变慢。
- We often see users mismatching the columns of a target aggregation/summing merge tree with those of the materialized view. The ORDER BY clause of the target table must be consistent with the GROUP BY of the SELECT clause in the materialized view. If these are different, you’ll find data will be corrupted for those inconsistent columns. Correct examples are shown below: 我们经常看到用户在目标聚合/求和合并树与物化视图的列不匹配。目标表的ORDER BY子句必须与物化视图中SELECT子句的GROUP BY一致。如果它们不同,你会发现那些不一致的列的数据会被破坏。以下是正确的示例:
- Similar to the above, the column names of the materialized view’s SELECT must match those of the destination table - do not rely on the order of the columns. Utilize alias to ensure these match. Note that the target table can have default values, so the view’s columns can be a subset of the target table. A correct example is shown below - note the need to alias
count() as counter
: 与上述类似,物化视图的SELECT语句中的列名必须与目标表的列名匹配 - 不要依赖列的顺序。使用别名来确保它们匹配。请注意,目标表可以有默认值,因此视图的列可以是目标表的子集。下面是一个正确的示例 - 请注意需要给count() as counter
起别名:
13. Experimental features in production 生产中的实验性功能
At ClickHouse, we regularly release new features. In some cases, new features are marked “experimental”, which means they would benefit from a period of real-world usage and feedback from the community. Eventually, these features evolve to the point of being deemed “production ready”, or deprecated if it turns out they are not generally useful or there is another way to achieve the original goal. While we encourage users to try out experimental features, we caution against building the core functionality of your apps around them or relying on them in production. For this reason, we require users to request these to be enabled on ClickHouse Cloud and understand the caveats and risks.
在ClickHouse中,我们定期发布新功能。在某些情况下,新功能被标记为“实验性”,这意味着它们需要经过一段时间的实际使用和社区反馈来获得改进。最终,这些功能会发展到被认为“可用于生产环境”的程度,或者如果它们被证明不常用或者有其他方法可以实现原始目标,则被弃用。虽然我们鼓励用户尝试实验性功能,但我们警告不要在核心功能中构建应用程序或依赖它们在生产环境中使用。因此,我们要求用户在ClickHouse Cloud上请求启用这些功能,并理解其中的注意事项和风险。
We label all features as experimental in our docs, and any usage requires the user to set a setting to enable a specific experimental feature, e.g. SET allow_experimental_lightweight_delete = true.
我们在文档中将所有功能标记为实验性,并且任何使用都需要用户设置一个设置来启用特定的实验性功能,例如
SET allow_experimental_lightweight_delete = true
。Conclusion 结论
If you've read this far you should be well prepared to manage a ClickHouse cluster in production - or at least avoid many of the common pitfalls! Managing ClickHouse Clusters with petabytes of data invariably brings its challenges, however, even for the most experienced operators. To avoid these challenges and still experience the speed and power of ClickHouse, try ClickHouse Cloud and start a free trial now.
如果你已经读到这里,你应该已经准备好在生产环境中管理ClickHouse集群了 - 或者至少能够避免许多常见的陷阱!管理拥有PB级数据的ClickHouse集群无疑会带来挑战,即使对于最有经验的操作员来说也是如此。为了避免这些挑战,同时体验ClickHouse的速度和强大功能,试试ClickHouse Cloud,现在开始免费试用吧。