
ClickHouse storage architecture and optimization

Ashraf Sharif


In analytical databases, storage is the dominant factor behind performance, stability, and cost. ClickHouse is designed to scan large volumes of data efficiently, but how well it performs in production depends heavily on storage layout, disk throughput, and merge behavior. 

For operations and support engineers, storage decisions determine whether ClickHouse is predictable and cost-efficient or fragile under load.

This blog post focuses on ClickHouse storage architecture and optimization in production, with particular attention to hybrid on-prem and cloud deployments.

Storage types supported by ClickHouse

ClickHouse supports multiple storage backends, allowing flexible deployment across bare metal, virtualized environments, and cloud platforms. Since ClickHouse supports many types of table engines, we are going to focus on the MergeTree family of table engines, which is the core data storage technology behind ClickHouse.

The base MergeTree table engine can be considered the default table engine for single-node ClickHouse instances because it is versatile and practical for a wide range of use cases. For production usage, ReplicatedMergeTree is the way to go, because it adds high availability on top of all the features of the regular MergeTree engine. A bonus is automatic data deduplication on ingestion, so clients can safely retry an insert after a network issue.
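
For illustration, here is a minimal sketch of a replicated table definition. The table schema is a placeholder, and the Keeper path and the {shard}/{replica} macros must match your own cluster configuration (ReplicatedMergeTree requires a ZooKeeper or ClickHouse Keeper ensemble):

-- Minimal ReplicatedMergeTree sketch; path and macros are placeholders
CREATE TABLE events_replicated
(
    event_date Date,
    event_type String,
    user_id    UInt64
)
ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/events_replicated', '{replica}')
ORDER BY (event_date, user_id);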

If you are wondering why it is called “MergeTree”: it stores data in sorted parts, periodically merges them in the background, and is optimized for analytical workloads. The following are the supported storage types for the MergeTree table engines:

  • Local disks
  • Network disks
  • Memory
  • Object storage

Local disks

ClickHouse’s local disk architecture is designed around fast, reliable block storage, with SSD or NVMe disks as the preferred foundation for production deployments. Data is written in a column-oriented format and grouped into immutable parts, which are stored sequentially on disk and later merged in the background. This access pattern minimizes random I/O and aligns well with modern SSDs. In a simple standalone setup, ClickHouse may run on a single server with one or more local disks mounted under /var/lib/clickhouse, often using a single high-capacity NVMe device to store all table data and metadata.

In more robust on-premises environments, local disks are often combined using RAID or JBOD depending on operational priorities. RAID 10 across multiple SSDs or NVMe drives is a common choice for production systems, providing both high throughput and resilience to disk failures. RAID 0 is sometimes used to maximize performance, but only when ClickHouse replication is in place to mitigate the risk of data loss. Alternatively, JBOD (Just a Bunch of Disks) can be used to expose individual disks directly to the filesystem, allowing ClickHouse to consume capacity without RAID overhead, but at the cost of higher operational complexity and less predictable failure handling.
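
Whichever physical layout you choose, it is worth confirming what ClickHouse itself sees once the disks are mounted and declared in the storage configuration. A quick sanity check against the system tables (disk and policy names depend entirely on your setup):

-- Disks registered with ClickHouse and their capacity
SELECT name, path, formatReadableSize(free_space) AS free, formatReadableSize(total_space) AS total
FROM system.disks;

-- How disks are grouped into volumes and storage policies
SELECT policy_name, volume_name, disks
FROM system.storage_policies;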

We highly recommend reading the Storage Subsystem tips before installing ClickHouse.

Network disks

From a ClickHouse perspective, any “network disk” exposed as a standard filesystem by the operating system can be used and configured via storage_configuration, similar to local disks. However, “usability” does not imply “recommendation”.

Network-mounted filesystems (like NFS, SMB, and GlusterFS) are considered risky for ClickHouse. This is because ClickHouse relies heavily on features like atomic renames, reliable fsync, and predictable latency, especially during merge operations. These network filesystems often fail to provide reliable atomic renaming guarantees and consistent locking, and they are prone to latency spikes. Such issues can lead to merge stalls, unpredictable query latency, replica desynchronization, and, in the worst cases, data corruption.

Conversely, network block storage solutions are generally safe and production-proven. These include technologies like iSCSI, Fibre Channel, NVMe-oF, or cloud-specific block-storage services like AWS EBS, GCP Persistent Disk, and Azure Managed Disks. They are safe because they present themselves to the OS as direct, raw, local block devices. This setup provides highly predictable latency and correct fsync semantics, making it the most common and robust “network disk” configuration in cloud ClickHouse deployments.

Memory

ClickHouse does support memory-based storage through engines such as Memory and Buffer, but this support is limited and not designed to replace disk-based storage. A Memory table keeps all data in RAM, offers no persistence, no replication, no compression, and loses all data on restart. 

The Buffer engine is often misunderstood as storage, but in practice it is an ingestion optimization layer that temporarily holds data in memory before flushing it to a disk-backed table. Neither engine participates in the MergeTree architecture that underpins ClickHouse’s scalability and reliability.
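
To make that distinction concrete, the sketch below buffers small inserts in RAM and flushes them into a hypothetical disk-backed table; the numeric thresholds are illustrative and should be tuned to your workload:

-- Disk-backed destination table (hypothetical schema, assumed to live in the default database)
CREATE TABLE page_hits
(
    event_time DateTime,
    url        String
)
ENGINE = MergeTree
ORDER BY event_time;

-- In-memory buffer in front of it: data is flushed to page_hits when all min_*
-- conditions or any single max_* condition is met (16 layers, 10-100 s, 10k-1M rows, 10-100 MB)
CREATE TABLE page_hits_buffer AS page_hits
ENGINE = Buffer(default, page_hits, 16, 10, 100, 10000, 1000000, 10000000, 100000000);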

Because ClickHouse is optimized for sequential disk access, compression, and background merges, using memory as primary storage works against its design. Memory tables do not benefit from merges, TTLs, or deduplication, and they scale poorly as data volume grows. In production, they introduce operational risk: data loss on restart, unpredictable memory pressure, and lack of high availability. In most real-world cases, fast local SSDs combined with proper schema design and the OS page cache deliver near-memory performance without sacrificing durability.

With that being said, memory-based tables do have valid, narrowly scoped use cases. They are appropriate for temporary or session-level data, small lookup or dimension tables, and controlled query acceleration scenarios where the dataset is bounded and data loss is acceptable. 

For anything that represents core business data like events, logs, metrics, or analytical facts, the disk-backed MergeTree tables remain the correct default. From an operations perspective, memory storage in ClickHouse should be treated as ephemeral infrastructure, not a foundation for persistent analytics.

The following is an example of creating a Memory table that keeps a minimum of 4 KB and a maximum of 16 KB of data in RAM (once the upper bound is exceeded, the oldest blocks are dropped):

CREATE TABLE memory (i UInt32) 
ENGINE = Memory 
SETTINGS min_bytes_to_keep = 4096, max_bytes_to_keep = 16384;

Object storage

ClickHouse has native support for S3-compatible object storage and makes it practical to store cold data outside local disks. Typical use cases are:

  • Long-term data retention storage
  • Cost optimization
  • Elastic scaling without disk re-provisioning

Object storage can be configured as a disk in ClickHouse and combined with storage policies to tier data, typically keeping hot data on local disks while moving colder data to object storage via TTL rules. From ClickHouse’s perspective, object storage is treated as an external disk, with local storage still required for metadata and coordination.

In addition to leveraging major object storage providers (such as Amazon S3, Azure, Google Cloud, or Digital Ocean), organizations can deploy self-hosted solutions like SeaweedFS or Garage. Scaling these self-hosted object stores is straightforward, typically involving the addition of more servers.

However, utilizing object storage introduces operational trade-offs compared to local disks. While object storage offers compelling advantages like lower cost and virtually limitless capacity, these benefits are balanced by drawbacks, including higher latency, variable throughput, and a strong dependency on network stability.

The following is an example of the storage configuration inside /etc/clickhouse-server/config.xml:

<storage_configuration>
  <disks>
    <s3_disk1>
      <type>object_storage</type>
      <object_storage_type>s3</object_storage_type>
      <endpoint>https://clickhouse-data.s3.amazonaws.com/data/</endpoint>
      <access_key_id>ACCESS_KEY</access_key_id>
      <secret_access_key>SECRET_KEY</secret_access_key>
      <metadata_path>/var/lib/clickhouse/disks/s3/</metadata_path>
    </s3_disk1>
  </disks>
  <policies>
    <s3_main>
      <volumes>
        <main>
          <disk>s3_disk1</disk>
        </main>
      </volumes>
    </s3_main>
  </policies>
</storage_configuration>

We can then use the S3 bucket as data storage by referencing the storage policy for a table:

CREATE TABLE s3_table1 (`id` UInt64, `column1` String) 
ENGINE = MergeTree
ORDER BY id
SETTINGS storage_policy = 's3_main';
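
After inserting data, one way to confirm that parts are actually landing on the object storage disk is to check system.parts (the table name matches the example above):

SELECT disk_name, count() AS parts, formatReadableSize(sum(bytes_on_disk)) AS size_on_disk
FROM system.parts
WHERE table = 's3_table1' AND active
GROUP BY disk_name;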

ClickHouse storage in hybrid environments

MergeTree table engines can distribute data across several block devices, which is highly beneficial for implementing implicit “hot” and “cold” data tiers. Specifically, high-speed storage, such as NVMe SSDs or in-memory storage, can house the relatively small volume of frequently accessed, recent (“hot”) data. In contrast, the large historical volumes of rarely accessed (“cold”) data can be relegated to slower, more cost-effective media like HDDs or object storage.

This configuration is achieved by defining a storage policy. A storage policy can be defined globally or applied per table via the table settings at creation time. Let’s put it into an example. We can configure our storage as below:

<storage_configuration>
  <disks>
    <s3_disk>
      <type>object_storage</type>
      <object_storage_type>s3</object_storage_type>
      <endpoint>https://clickhouse-data.s3.amazonaws.com/data/</endpoint>
      <access_key_id>ACCESS_KEY</access_key_id>
      <secret_access_key>SECRET_KEY</secret_access_key>
      <metadata_path>/var/lib/clickhouse/disks/s3/</metadata_path>
    </s3_disk>
  </disks>

  <policies>
    <hybrid_policy>
      <volumes>
        <hot>
          <disk>default</disk>
        </hot>
        <cold>
          <disk>s3_disk</disk>
        </cold>
      </volumes>
    </hybrid_policy>
  </policies>
</storage_configuration>

Hot data is written to the default disk, which in this case is the local filesystem under /var/lib/clickhouse/. As that volume fills up (by default, when its free space drops below the policy’s move_factor threshold of 10%), older parts are moved to our S3 bucket called “clickhouse-data”.
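
In practice, movement between the hot and cold volumes is usually driven explicitly by a TTL rule rather than by waiting for the hot volume to fill up. A minimal sketch, assuming a hypothetical event table and the hybrid_policy defined above:

CREATE TABLE events_tiered
(
    event_date Date,
    user_id    UInt64,
    payload    String
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(event_date)
ORDER BY (event_date, user_id)
TTL event_date + INTERVAL 90 DAY TO VOLUME 'cold'   -- parts older than 90 days move to S3
SETTINGS storage_policy = 'hybrid_policy';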

Even when data is stored in S3, local disk space is still essential for reliable metadata management and merge coordination. A common mistake is under-provisioning local disks because “most data lives in S3”. When merges or TTL moves run, the local disk fills up and the system degrades. It is recommended to always have 30-50% local disk free space and perform inserts in batches to avoid small parts.

Performance optimization for ClickHouse’s storage layer

The performance of ClickHouse’s storage layer heavily depends on the physical arrangement and compression of data on the disk. Partitioning is the most critical element and must be set up correctly. Partitions should be large-grained, for instance, monthly divisions based on a date column. This approach minimizes the total number of partitions, thereby reducing the overhead associated with metadata and the filesystem. 

Conversely, partitioning using values with high uniqueness (high-cardinality), such as a user ID or request ID, should be avoided because it generates too many small partitions and increases the pressure on the merge process. For example, partitioning a log table by toYYYYMM(event_date) (monthly partitions) is generally better than partitioning by user_id, which would create millions of tiny partitions and slow down overall performance.
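
As a sketch, a log table with coarse monthly partitions might be declared like this (table and column names are illustrative); the high-cardinality identifier belongs in the sorting key, not the partition key:

CREATE TABLE app_logs
(
    event_date Date,
    user_id    UInt64,
    message    String
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(event_date)   -- roughly 12 partitions per year
ORDER BY (user_id, event_date);     -- user_id is used for sorting, never for partitioning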

Compression and indexing further enhance the efficiency of data reads within these partitions. ClickHouse compresses data on a per-column basis. Utilizing stronger codecs like ZSTD for data that is accessed less often (“colder” data) can substantially decrease disk usage and I/O operations, though it uses slightly more CPU resources. 

For time-series data, specialized codecs such as Delta, DoubleDelta, or Gorilla can achieve even better compression and cache utilization. Data skipping indexes, including minmax, set, and bloom_filter, provide an additional layer of optimization: they enable ClickHouse to bypass entire ranges of data during query execution. These indexes are particularly beneficial for log and observability workloads, where queries often filter on specific fields or tags that are not part of the primary sorting key (ORDER BY), reducing unnecessary disk reads and improving query response times.
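
A hypothetical metrics table pulling these pieces together, with per-column codecs and a data skipping index, might look like the following; the codec choices are illustrative and should be benchmarked against your own data:

CREATE TABLE metrics
(
    ts       DateTime CODEC(DoubleDelta, LZ4),   -- monotonic timestamps compress well with DoubleDelta
    value    Float64  CODEC(Gorilla, ZSTD(1)),   -- Gorilla suits slowly changing gauge values
    host     LowCardinality(String),
    trace_id String,
    -- lets queries filtering on trace_id skip whole granule ranges
    INDEX idx_trace trace_id TYPE bloom_filter(0.01) GRANULARITY 4
)
ENGINE = MergeTree
ORDER BY (host, ts);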

Avoiding common ClickHouse storage operations pitfalls

Frequent updates and deletes (mutations)

  • Description: Applying frequent UPDATE or DELETE operations triggers internal “mutations.” These mutations are costly, as they involve rewriting the entire affected data part in the background.
  • Impact: In busy, write-heavy systems, excessive mutations can overload the merge process. This leads to long queues, delayed visibility of data changes, and temporary increases in disk usage as old and new data versions coexist.
  • Recommendation: Treat mutations as exceptional administrative tasks, not a standard mechanism for routine data modification.
  • Example: Instead of an hourly UPDATE employees SET salary = new_salary WHERE id = 123, adopt a pattern where the new salary record is appended with a timestamp, and queries select only the latest record for that employee ID, as sketched below.
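
A minimal sketch of that append-only pattern, using hypothetical table and column names:

-- Append a new row for every salary change instead of rewriting existing parts
CREATE TABLE employee_salaries
(
    id         UInt64,
    salary     Decimal(12, 2),
    updated_at DateTime
)
ENGINE = MergeTree
ORDER BY (id, updated_at);

-- Read the current salary per employee without any mutation
SELECT id, argMax(salary, updated_at) AS current_salary
FROM employee_salaries
GROUP BY id;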

Schema and ingestion suboptimization

  • Overuse of Nullable Columns:
    • Description: Using Nullable data types adds complexity to storage and query execution.
    • Impact: This increases storage overhead and requires additional checks during query execution, reducing CPU efficiency and compression effectiveness.
    • Example: Unless data sparsity is extremely high or a NULL value is semantically necessary, prefer non-nullable columns and use a default value (e.g., 0, an empty string, or a specific placeholder) if a value is missing.
  • Very Small, Frequent Inserts:
    • Description: Each small INSERT query creates a new, tiny data part.
    • Impact: Thousands of small parts dramatically increase the load on the merge subsystem and stress disk I/O. This backlog causes unpredictable query latency and can eventually stall new inserts if disk resources are exhausted.
    • Recommendation: Always batch your inserts into larger transactions.
    • Example: Instead of inserting one event at a time (e.g., one INSERT per second), buffer events and perform a single batch insert every 5-10 seconds containing thousands of records (see the sketch after this list).
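
The simplest fix is client-side batching: accumulate rows and send them as one multi-row INSERT. On recent ClickHouse versions, asynchronous inserts can additionally coalesce small inserts on the server side. A sketch with a hypothetical events_raw table:

-- Preferred: one multi-row INSERT instead of thousands of single-row statements
INSERT INTO events_raw (event_time, user_id, message) VALUES
    ('2024-01-01 00:00:01', 1, 'login'),
    ('2024-01-01 00:00:02', 2, 'click'),
    ('2024-01-01 00:00:03', 3, 'logout');

-- Alternative: let the server buffer and merge small inserts into larger parts
SET async_insert = 1, wait_for_async_insert = 1;
INSERT INTO events_raw (event_time, user_id, message) VALUES ('2024-01-01 00:00:04', 4, 'purchase');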

Using disk IO and object storage latency monitoring in ClickHouse

In ClickHouse, problems related to storage rarely manifest as clear-cut error messages; instead, they often surface indirectly through a decline in query performance or unreliable data ingestion. For instance, if users start experiencing slow queries that take 15 seconds instead of the usual 3 seconds, or intermittent timeouts during peak hours, the root cause is frequently disk or storage pressure rather than a poorly written SELECT statement.

Therefore, continuous monitoring of the storage layer is crucial for early detection. Key metrics to track include disk read and write throughput. If, for example, your disk’s write throughput suddenly jumps from a typical 50 MB/s to over 200 MB/s during a large batch insert, it signals potential saturation or an abnormal spike that needs investigation, especially during heavy merge activity.

Beyond simple I/O, ClickHouse-specific metrics provide vital context. The number of active parts and the size of the merge backlog are strong indicators of storage health. If the part count for a table rapidly increases from 1,000 to 5,000 over a few hours, or if the system reports a merge backlog of “100 merges pending”, it indicates that the system is struggling to consolidate incoming data. These conditions increase disk I/O (e.g., more random reads/writes), increase temporary disk space usage (e.g., temporary merge files consuming an extra 500GB), and make query response times unpredictable. Monitoring these allows teams to distinguish between normal background merges and situations where the system is genuinely falling behind, requiring intervention, like increasing disk speed or adjusting merge settings.

For cloud deployments using object storage (like S3 or Azure Blob Storage), extra layers of monitoring are necessary. Metrics like object storage latency (e.g., reads taking 500ms instead of 50ms), request error rates (e.g., seeing 5xx errors from the storage service), and throttling events can directly slow down queries and background tasks like TTL moves. Even when data is remote, issues may appear as stalled merges or slow queries because ClickHouse still relies on local disks for coordination. Therefore, correlating all metrics from local disk I/O, local CPU, to remote object storage metrics is essential. When queries slow down, such as a report query that usually runs in 10 seconds now taking 30, the culprit is frequently a bottleneck in storage, not the query logic itself.
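
ClickHouse exposes the underlying counters through its system tables, so a rough starting point for this correlation looks like the queries below; exact metric names vary between versions, which is why the filters use pattern matching rather than fixed names:

-- Cumulative object-storage related counters since server start
SELECT event, value
FROM system.events
WHERE event ILIKE '%s3%'
ORDER BY value DESC;

-- Point-in-time capacity figures for each configured disk
SELECT metric, value
FROM system.asynchronous_metrics
WHERE metric ILIKE 'disk%'
ORDER BY metric;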

ClickHouse storage operations best practices for support teams

For support teams operating ClickHouse in production, proactive storage monitoring and alerting are essential to prevent performance degradation and outages. A minimum alerting baseline should include the following (a sample query for the first two checks is sketched after the list):

  • Set alerts to catch capacity issues early, e.g., warning at 20% free space and critical at 10% free space remaining. This is vital given the temporary space required during merges.
  • Monitor merge activity via the system.merges table and part counts in system.parts, and alert if the merge backlog exceeds a predefined threshold (e.g., merges running for more than 15 minutes or part counts climbing steadily for hours). This identifies when background merges are falling behind ingestion.
  • Alerts on INSERT queries failing with errors related to “disk full” or “no space left on device” are often a late warning sign that requires immediate action.
  • In environments using S3 or similar object storage, abnormal latency (e.g., PUT requests > 500ms) or an increased error rate (e.g., > 0.5% 5xx errors) should also trigger alerts, as they can indirectly stall queries and background operations.
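
The first two checks can be driven directly from system tables; a hedged sketch (the 20% threshold matches the warning level suggested above):

-- Disks whose free space has dropped below 20% of capacity
SELECT name, round(free_space / total_space * 100, 1) AS free_pct
FROM system.disks
WHERE free_space / total_space < 0.20;

-- Merges currently running, a rough proxy for merge pressure
SELECT count() AS running_merges
FROM system.merges;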

In day-to-day operations, ClickHouse’s system tables are indispensable for diagnosing storage-related issues. Support engineers should regularly inspect these tables:

  • system.parts: Provides visibility into data part counts and sizes, helping to identify tables with excessive small parts that could lead to poor query performance (e.g., SELECT table, count() FROM system.parts WHERE active GROUP BY table HAVING count() > 1000).
  • system.merges: Exposes merge operations currently in progress, useful for diagnosing a backlog (e.g., SELECT database, table, elapsed, progress, num_parts FROM system.merges ORDER BY elapsed DESC).
  • system.disks: Shows disk usage and available space across configured volumes, allowing engineers to correlate free space alerts with the actual physical location (e.g., SELECT name, path, formatReadableSize(free_space) AS free, formatReadableSize(total_space) AS total FROM system.disks).

Regularly inspecting these tables allows support engineers to correlate alerts with actual internal state, distinguish transient spikes from structural problems, and take informed corrective actions before storage issues escalate into user-facing incidents.

The essential role of routine storage maintenance in ClickHouse

Routine maintenance is a critical but often overlooked aspect of operating ClickHouse at scale. For instance, regularly reviewing part counts helps identify unhealthy ingestion patterns, such as a high volume of INSERT INTO statements generating thousands of tiny data parts, which increases merge pressure. Unused tables and obsolete partitions should be dropped proactively, like removing a staging table that’s no longer needed or dropping a partition for data older than the legally required retention period. 

This not only reclaims disk space but also reduces metadata overhead and background maintenance work. Validating TTL (Time-To-Live) and tiering policies ensures that data is aging out or moving between storage tiers as intended, for example, confirming that data older than 90 days is successfully compressed and moved from fast NVMe storage to a slower, cheaper HDD-based volume. This prevents silent growth of hot storage or unexpected retention costs.

When storage usage or workload characteristics change, rebalancing storage volumes may also be necessary. This can involve redistributing data across disks, such as using the ALTER TABLE ... MOVE PARTITION command, adjusting storage policies, or revisiting which data belongs on fast local storage versus slower or cheaper tiers. Without this periodic upkeep, ClickHouse systems tend to degrade slowly: merges take longer, queries become less predictable, and resource usage creeps upward without a single clear failure point. Consistent maintenance allows support teams to keep performance stable and avoid the cumulative effects of neglect that are far harder to correct under pressure.
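
For example, a single partition of older data could be relocated manually to the cold volume and the result verified from system.parts; the table, partition value, and volume name below are hypothetical and assume a tiered storage policy like the one shown earlier:

-- Move one monthly partition to the cheaper volume of the storage policy
ALTER TABLE events_tiered MOVE PARTITION 202401 TO VOLUME 'cold';

-- Confirm where each partition's parts now live
SELECT partition, disk_name, count() AS parts
FROM system.parts
WHERE table = 'events_tiered' AND active
GROUP BY partition, disk_name
ORDER BY partition;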

Hybrid and cloud-specific considerations for optimal storage health and performance

In hybrid and cloud environments, storage optimization is a dual concern: managing performance while controlling costs and network traffic. When compute is separated from object storage, for instance an on-premises ClickHouse cluster accessing data in Amazon S3, query speed, merges, and data movement depend heavily on available bandwidth and network latency.

Network latency

If the network link between the ClickHouse compute nodes and the S3 bucket has high latency (e.g., >50ms), a query that needs to read 100 small files might take significantly longer than if the data were on local storage, even if total bandwidth is high.

Support teams must actively monitor sustained and peak bandwidth usage. A constrained network link, like a 1 Gbps connection consistently running at 90% utilization, can create “backpressure.” This backpressure can manifest as slow-running queries, or, in the worst case, completely stall background operations like data merges or replications.

Cost management

Cloud egress charges (the cost to move data out of a cloud region) can quickly balloon. If a daily reporting job reads 1TB of data from an S3 bucket in Region A and sends the final report to an application in Region B, this data movement can incur significant egress fees.

Your tiering strategy is equally vital to cost management. Policies for data tiering (e.g., moving old data to cheaper, slower storage tiers via TTLs) should be intentional, based on known access patterns and retention rules, not simply enabled by default. A poorly planned tiering strategy, such as moving frequently-accessed recent data to a low-cost “Archive” tier, can increase both cloud spending (due to expensive retrieval fees from the archive) and query latency. Conversely, a well-planned policy ensures that rarely-accessed historical data moves to the lowest-cost tier without impacting the performance of active workloads.

A word on storage for ClickHouse within multi-database stacks

ClickHouse typically complements PostgreSQL/MySQL for transactional consistency, low-latency lookups, and row updates (e.g., order creation); Elasticsearch for fast, unstructured data search (e.g., product search); and object storage (S3/GCS) for cost-efficient, durable archiving (e.g., archived data, old logs).

ClickHouse itself is optimized differently, for high-throughput scans, heavy compression, and append-only data ingestion. It excels at analytical queries (e.g., aggregating sales data for dashboards) but does not provide transactional guarantees or efficient row-level mutations. Teams should align storage strategies and retention policies across these systems.

Conclusion

Effective ClickHouse operations depend heavily on thoughtful storage design, as storage choices directly influence system stability, performance, and operational risk. Local disks remain the most predictable option for hot data and latency-sensitive workloads, while object storage offers flexibility and cost advantages for colder data when used appropriately. 

Hybrid architectures combine these approaches but require disciplined monitoring and capacity planning to avoid hidden bottlenecks. Ultimately, storage optimization in ClickHouse is not just a performance concern but also a cost-management exercise, especially in cloud and hybrid environments where storage decisions directly impact long-term operational expenses.
