This article applies to ClickHouse 20.10.*

# 1. Hierarchical (level-based) memory accounting

## MemoryTracker

A MemoryTracker records, for the object it is attached to, the current memory usage (amount), the peak usage (peak), and the size limit (limit).

```cpp
class MemoryTracker
{
private:
    std::atomic<Int64> amount {0};
    std::atomic<Int64> peak {0};
    std::atomic<Int64> hard_limit {0};
    std::atomic<Int64> profiler_limit {0};

    /// The following are used for fault injection testing and allocation sampling
    double fault_probability = 0;
    double sample_probability = 0;

    /// Parent tracker
    std::atomic<MemoryTracker *> parent {};
    ……
```

## Level definitions

Memory is accounted for and limited level by level. The levels are defined as follows:

```cpp
enum class VariableContext
{
    Global = 0,
    User,     /// Group of processes
    Process,  /// For example, a query or a merge
    Thread,   /// A thread of a process
    Snapshot  /// Does not belong to anybody
};
```

## Global MemoryTracker

The global MemoryTracker accounts for and limits the memory usage of the whole ClickHouse Server process.

```cpp
MemoryTracker total_memory_tracker(nullptr, VariableContext::Global);
```

## Tree-shaped hierarchy

Multiple MemoryTracker objects, each accounting for memory at a different level, form a tree. The root of the tree is the global MemoryTracker (total\_memory\_tracker).

- Global: the level of the global MemoryTracker is Global.
- User: a user may run several queries concurrently; each user has one MemoryTracker whose level is "User".
- Process: the MemoryTracker of a single query normally has the level "Process".
- Thread: a query operator may run on several worker threads; each thread has its own MemoryTracker.

![img](../../../../ff_internal_upload/img/2020/image-20201009185014952.png)

On alloc and free, a child tracker forwards the allocation or release to its parent, so the update propagates all the way up to the root:

```cpp
void MemoryTracker::alloc(Int64 size)
{
    ……
    if (auto * loaded_next = parent.load(std::memory_order_relaxed))
        loaded_next->alloc(size);
    ……

//-------------------------------

void MemoryTracker::free(Int64 size)
{
    ……
    if (auto * loaded_next = parent.load(std::memory_order_relaxed))
        loaded_next->free(size);
    ……
```

# 2. The per-thread MemoryTracker

## Thread context

Every worker thread of a query operator carries a thread context data structure, ThreadStatus.

```cpp
extern thread_local ThreadStatus * current_thread;

class ThreadStatus : public boost::noncopyable
{
public:
    /// Linux's PID (or TGID) (the same id is shown by ps util)
    const UInt64 thread_id = 0;

    MemoryTracker memory_tracker{VariableContext::Thread};

    /// Small amount of untracked memory (per thread atomic-less counter)
    Int64 untracked_memory = 0;
    /// Each thread could new/delete memory in range of (-untracked_memory_limit, untracked_memory_limit) without access to common counters.
    Int64 untracked_memory_limit = 4 * 1024 * 1024;

protected:
    ThreadGroupStatusPtr thread_group;
    ……
```

Where:

- memory\_tracker: performs memory accounting and enforces limits for this thread.
- thread\_group: the thread group this ThreadStatus belongs to. The worker threads of one query normally form one thread group, which corresponds to the "Process" level.

## untracked_memory

Accounting for every small alloc and free would hurt performance, mainly because all threads would contend on the shared counters of the ancestor MemoryTrackers. To avoid this, ThreadStatus has the members untracked\_memory and untracked\_memory\_limit: the current thread first accumulates a local count, and only when that count exceeds untracked\_memory\_limit (4 MiB by default) is it flushed to the MemoryTracker.

Consequently, a value obtained through MemoryTracker::get() can be off by up to 4 MiB per thread.

The relevant code:

```cpp
namespace CurrentMemoryTracker
{
    using DB::current_thread;

    void alloc(Int64 size)
    {
        if (auto * memory_tracker = DB::CurrentThread::getMemoryTracker())
        {
            current_thread->untracked_memory += size;
            if (current_thread->untracked_memory > current_thread->untracked_memory_limit)
            {
                /// Zero untracked before track. If tracker throws out-of-limit we would be able to alloc up to untracked_memory_limit bytes
                /// more. It could be useful to enlarge Exception message in rethrow logic.
                Int64 tmp = current_thread->untracked_memory;
                current_thread->untracked_memory = 0;
                memory_tracker->alloc(tmp);
            }
        }
    }

    void free(Int64 size)
    {
        if (auto * memory_tracker = DB::CurrentThread::getMemoryTracker())
        {
            current_thread->untracked_memory -= size;
            if (current_thread->untracked_memory < -current_thread->untracked_memory_limit)
            {
                memory_tracker->free(-current_thread->untracked_memory);
                current_thread->untracked_memory = 0;
            }
        }
    }
}
```

# 3. Operator-level memory accounting

Judging from the accounting structure, MemoryTracker does not track the memory of an individual operator. Take the aggregation operator (Aggregator) as an example:

```cpp
bool Aggregator::executeOnBlock(Columns columns, UInt64 num_rows, AggregatedDataVariants & result,
    ColumnRawPtrs & key_columns, AggregateColumns & aggregate_columns, bool & no_more_keys)
{
    ……
    size_t result_size = result.sizeWithoutOverflowRow();
    Int64 current_memory_usage = 0;
    if (auto * memory_tracker_child = CurrentThread::getMemoryTracker())   // MemoryTracker of the current thread
        if (auto * memory_tracker = memory_tracker_child->getParent())     // its parent is the Process-level MemoryTracker
            current_memory_usage = memory_tracker->get();

    /// Here all the results in the sum are taken into account, from different threads.
    auto result_size_bytes = current_memory_usage - memory_usage_before_aggregation;
```

As the code shows, current\_memory\_usage is read from the Process-level MemoryTracker, that is, it is the memory usage of the whole current query.

A test case confirms this. The data set is the TPC-H database at scale factor 1. The SQL statement and its query plan are:

```sql
EXPLAIN
SELECT MAX(cnt)
FROM (SELECT count(*) as cnt FROM LINEITEM group by L_ORDERKEY);

┌─explain───────────────────────────────────────────────┐
│ Expression (Projection)                               │
│   Expression (Before ORDER BY and SELECT)             │
│     Aggregating                                       │
│       Expression (Before GROUP BY)                    │
│         Expression (Projection)                       │
│           Expression (Before ORDER BY and SELECT)     │
│             Aggregating                               │
│               Expression (Before GROUP BY)            │
│                 ReadFromStorage (Read from MergeTree) │
└───────────────────────────────────────────────────────┘
```

The query contains two levels of aggregation. The inner aggregation produces 1500000 rows, while the outer aggregation produces a single row.

Debugging the clickhouse-server process with gdb shows that current\_memory\_usage is about 76 MB in the inner aggregation and also about 76 MB in the outer aggregation, even though the outer aggregation operator itself needs very little memory. The memory accounting here is therefore not at operator granularity.
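
To make the upward propagation and the per-thread batching described above concrete, here is a minimal, self-contained C++ sketch. It is not ClickHouse code: SimpleTracker and ThreadCounter are hypothetical, simplified stand-ins for MemoryTracker and for the ThreadStatus / CurrentMemoryTracker pair, and limits, profiling, and fault injection are deliberately left out.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical, simplified stand-in for MemoryTracker:
// keeps amount and peak, and forwards every update to its parent.
class SimpleTracker
{
public:
    explicit SimpleTracker(SimpleTracker * parent_ = nullptr) : parent(parent_) {}

    void alloc(std::int64_t size)
    {
        std::int64_t will_be = amount.fetch_add(size, std::memory_order_relaxed) + size;

        // Raise the peak if this allocation set a new maximum.
        std::int64_t old_peak = peak.load(std::memory_order_relaxed);
        while (will_be > old_peak
               && !peak.compare_exchange_weak(old_peak, will_be, std::memory_order_relaxed))
        {
        }

        if (parent)
            parent->alloc(size);   // propagate towards the root
    }

    void free(std::int64_t size)
    {
        amount.fetch_sub(size, std::memory_order_relaxed);
        if (parent)
            parent->free(size);    // propagate towards the root
    }

    std::int64_t get() const { return amount.load(std::memory_order_relaxed); }
    std::int64_t getPeak() const { return peak.load(std::memory_order_relaxed); }

private:
    std::atomic<std::int64_t> amount{0};
    std::atomic<std::int64_t> peak{0};
    SimpleTracker * parent;
};

// Hypothetical stand-in for the per-thread state: a plain (non-atomic)
// counter that is flushed to the tracker only after it passes the limit.
struct ThreadCounter
{
    SimpleTracker tracker;
    std::int64_t untracked = 0;
    std::int64_t untracked_limit = 4 * 1024 * 1024;

    explicit ThreadCounter(SimpleTracker * process_tracker) : tracker(process_tracker) {}

    void alloc(std::int64_t size)
    {
        assert(size >= 0);
        untracked += size;
        if (untracked > untracked_limit)
        {
            std::int64_t tmp = untracked;
            untracked = 0;          // zero first, then report, as in the original snippet
            tracker.alloc(tmp);
        }
    }

    void free(std::int64_t size)
    {
        assert(size >= 0);
        untracked -= size;
        if (untracked < -untracked_limit)
        {
            tracker.free(-untracked);
            untracked = 0;
        }
    }
};
```

For instance, with a Global, User, Process chain of SimpleTracker objects and two ThreadCounter objects attached to the Process tracker, a 1 MiB allocation on one thread stays invisible to the Process tracker until that thread's pending count exceeds 4 MiB. Once both threads have flushed, the Process tracker holds the sum of their allocations, which is why a reading taken at that level cannot be attributed to a single operator.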