
What? A single PostgreSQL table shouldn't exceed 890 million rows?

digoal PostgreSQL码农集散地 2024-07-08



Why is a single PostgreSQL table not recommended to exceed 890 million rows?

《PostgreSQL 垃圾回收参数优化之 - maintenance_work_mem , autovacuum_work_mem》 (an earlier post on tuning these vacuum memory parameters)

What are the maintenance_work_mem and autovacuum_work_mem parameters used for?

This memory is used to record the tuple IDs (TIDs) of dead tuples. While the vacuum process scans the heap, once the collected dead tuple IDs fill that memory (autovacuum_work_mem or maintenance_work_mem), it pauses vacuuming the heap, records how far the heap scan has gotten, and starts vacuuming the table's indexes.

When scanning an index, the index tuples to be cleaned up are identified by matching them against the TIDs recorded in that memory.

Once every index has been scanned and cleaned, vacuum resumes the heap scan from the recorded position.
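
To make that round-trip concrete, here is a minimal, self-contained simulation of the loop. It is only a sketch: the table size, dead-tuple density, and memory setting are made-up numbers, and this is not PostgreSQL's actual vacuumlazy.c code.

/*
 * Toy simulation of the pre-PostgreSQL-17 lazy-vacuum loop: the heap scan
 * accumulates 6-byte dead TIDs until the memory budget is full, then every
 * index must be vacuumed before the heap scan can resume.
 * All numbers below are illustrative assumptions, not real measurements.
 */
#include <stdio.h>
#include <stdint.h>

int main(void)
{
    const uint64_t table_blocks   = 1000000;             /* hypothetical heap: 1M pages (~8GB)  */
    const uint64_t dead_per_block = 40;                  /* hypothetical dead tuples per page   */
    const uint64_t tid_bytes      = 6;                   /* ItemPointerData is 6 bytes          */
    const uint64_t work_mem       = 64ULL * 1024 * 1024; /* e.g. autovacuum_work_mem = '64MB'   */
    const uint64_t max_tids       = work_mem / tid_bytes;

    uint64_t ndead = 0;
    uint64_t index_passes = 0;

    for (uint64_t blk = 0; blk < table_blocks; blk++)
    {
        ndead += dead_per_block;    /* heap scan collects this page's dead TIDs           */
        if (ndead >= max_tids)      /* memory full: pause heap scan, vacuum ALL indexes,  */
        {                           /* then resume the heap scan from this position       */
            index_passes++;
            ndead = 0;
        }
    }
    if (ndead > 0)
        index_passes++;             /* final index pass for the remaining dead TIDs       */

    printf("index vacuum passes needed: %llu\n", (unsigned long long) index_passes);
    return 0;
}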

Clearly, if autovacuum_work_mem is too small, every index gets scanned multiple times during a single vacuum of the table, wasting resources and time.

So a larger autovacuum_work_mem is generally better, but setting it above 1GiB is pointless, because the dead tuple ID array used a single allocation and so was limited to 1GB.

Putting the above together: why is a single PG table not recommended to exceed 890 million rows?

From the header files, a tuple ID (ItemPointerData) is 6 bytes long, so 1GiB holds roughly 179 million dead tuple IDs. By default, garbage collection is triggered when dead tuples reach about 20% of the table; for a table of 890 million rows, 20% is roughly 179 million dead tuples. Beyond 890 million rows, vacuuming that table therefore starts requiring multiple index scan passes.
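
For reference, the arithmetic behind the 890 million figure, written out as a tiny standalone program. Only the 6-byte TID size and the 1GiB cap come from PostgreSQL itself; the 20% is the default autovacuum_vacuum_scale_factor.

#include <stdio.h>

int main(void)
{
    const double tid_bytes    = 6.0;                  /* ItemPointerData: 4-byte block + 2-byte offset */
    const double limit_bytes  = 1024.0 * 1024 * 1024; /* hard 1GiB cap on the dead-TID array (pre-17)  */
    const double scale_factor = 0.20;                 /* default autovacuum_vacuum_scale_factor        */

    const double max_dead_tids = limit_bytes / tid_bytes;      /* ~178,956,970 TIDs fit in 1GiB         */
    const double max_rows = max_dead_tids / scale_factor;      /* ~894,784,853 rows, i.e. ~890 million  */

    printf("dead TIDs that fit in 1GiB: %.0f\n", max_dead_tids);
    printf("rows at which 20%% dead tuples overflow 1GiB: %.0f\n", max_rows);
    return 0;
}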

PostgreSQL 17 introduces a new data structure, TidStore. It is no longer an array of (blockid, itemid) pairs; instead, entries are keyed by blockid with a bitmap of offsets as the value, which means all dead tuples on a page can be stored in a single entry.
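
A toy illustration of that layout change. This is not PostgreSQL's real BlocktableEntry or radix-tree code, just the idea that one heap page's dead offsets collapse into a single key plus a bitmap:

#include <stdio.h>
#include <stddef.h>
#include <stdint.h>

/* Old layout: one 6-byte entry per dead tuple (mirrors ItemPointerData: three uint16 fields). */
typedef struct { uint16_t bi_hi; uint16_t bi_lo; uint16_t offset; } TidArrayEntry;

/* New idea: one entry per heap page; each set bit marks a dead offset.
 * A made-up stand-in for TidStore's per-block value, not the real BlocktableEntry. */
typedef struct
{
    uint32_t block;
    uint64_t offset_bitmap[5];  /* 320 bits, enough for the max heap tuples on an 8KB page */
} BlockBitmapEntry;

int main(void)
{
    const size_t dead_on_this_page = 100;

    printf("old array bytes for 100 dead tuples on one page: %zu\n",
           dead_on_this_page * sizeof(TidArrayEntry));   /* 100 * 6 = 600 bytes             */
    printf("bitmap entry bytes for the same page:           %zu\n",
           sizeof(BlockBitmapEntry));                     /* 48 bytes, including padding     */
    return 0;
}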

  • No more 1GiB cap tied to autovacuum_work_mem

  • Dead tuple storage takes much less space

  • Looking a TID up in the TidStore is faster than searching the dead tuple array.

  • Overall performance improves by several times, even tens of times (when index order is correlated with heap tuple order, i.e. the ordering of index entries tracks the ordering of their heap tuples, so the dead tuples on one heap page are more likely to map to index entries on a single index page).

PostgreSQL 17 fixes the following problems:

  • The array used a single allocation and so was limited to 1GB.

  • The allocation was pessimistically sized according to table size.

  • Lookup with binary search was slow because of poor CPU cache and branch prediction behavior.
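
To see why the last point matters, here is a rough sketch of what the old lookup amounted to: during index vacuum, every index tuple's heap TID was probed with a binary search over the big sorted array. This is a simplified stand-in, not the actual pre-17 vacuumlazy.c code.

#include <stdio.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* Simplified stand-in for ItemPointerData. */
typedef struct { uint32_t block; uint16_t offset; } Tid;

static int tid_cmp(const void *a, const void *b)
{
    const Tid *x = (const Tid *) a;
    const Tid *y = (const Tid *) b;
    if (x->block != y->block)
        return (x->block < y->block) ? -1 : 1;
    if (x->offset != y->offset)
        return (x->offset < y->offset) ? -1 : 1;
    return 0;
}

/* Called once per index tuple: O(log n) random probes into an array that can be
 * hundreds of MB, so most probes are cache misses with hard-to-predict branches. */
static bool tid_is_dead(const Tid *dead, size_t ndead, Tid probe)
{
    return bsearch(&probe, dead, ndead, sizeof(Tid), tid_cmp) != NULL;
}

int main(void)
{
    Tid dead[] = { {1, 2}, {1, 7}, {3, 5} };    /* the array is kept sorted */
    Tid probe = { 1, 7 };

    printf("index tuple points to a dead heap tuple? %s\n",
           tid_is_dead(dead, sizeof(dead) / sizeof(dead[0]), probe) ? "yes" : "no");
    return 0;
}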

https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=667e65aac354975c6f8090c6146fceb8d7b762d6

Use TidStore for dead tuple TIDs storage during lazy vacuum.
author    Masahiko Sawada <msawada@postgresql.org>  Tue, 2 Apr 2024 01:15:37 +0000 (10:15 +0900)
committer Masahiko Sawada <msawada@postgresql.org>  Tue, 2 Apr 2024 01:15:37 +0000 (10:15 +0900)
commit    667e65aac354975c6f8090c6146fceb8d7b762d6
tree      ecefa5c922788f32aa643e4639561d7ba9ac1801
parent    d5d2205c8ddc6670fa87474e172fdfab162b7a73
Use TidStore for dead tuple TIDs storage during lazy vacuum.

Previously, we used a simple array for storing dead tuple IDs during
lazy vacuum, which had a number of problems:

* The array used a single allocation and so was limited to 1GB.
* The allocation was pessimistically sized according to table size.
* Lookup with binary search was slow because of poor CPU cache and
branch prediction behavior.

This commit replaces that array with the TID store from commit
30e144287a.

Since the backing radix tree makes small allocations as needed, the
1GB limit is now gone. Further, the total memory used is now often
smaller by an order of magnitude or more, depending on the
distribution of blocks and offsets. These two features should make
multiple rounds of heap scanning and index cleanup an extremely rare
event. TID lookup during index cleanup is also several times faster,
even more so when index order is correlated with heap tuple order.

Since there is no longer a predictable relationship between the number
of dead tuples vacuumed and the space taken up by their TIDs, the
number of tuples no longer provides any meaningful insights for users,
nor is the maximum number predictable. For that reason this commit
also changes to byte-based progress reporting, with the relevant
columns of pg_stat_progress_vacuum renamed accordingly to
max_dead_tuple_bytes and dead_tuple_bytes.

For parallel vacuum, both the TID store and supplemental information
specific to vacuum are shared among the parallel vacuum workers. As
with the previous array, we don't take any locks on TidStore during
parallel vacuum since writes are still only done by the leader
process.

Bump catalog version.

Reviewed-by: John Naylor, (in an earlier version) Dilip Kumar
Discussion: https://postgr.es/m/CAD21AoAfOZvmfR0j8VmZorZjL7RhTiQdVttNuC4W-Shdc2a-AA%40mail.gmail.com

The documentation diff in this commit removes the old 1GB-limit notes from maintenance_work_mem and autovacuum_work_mem:

- <para>
- Note that for the collection of dead tuple identifiers,
- <command>VACUUM</command> is only able to utilize up to a maximum of
- <literal>1GB</literal> of memory.
- </para>
</listitem>
</varlistentry>

@@ -1946,13 +1941,6 @@ include_dir 'conf.d'
<filename>postgresql.conf</filename> file or on the server command
line.
</para>
- <para>
- For the collection of dead tuple identifiers, autovacuum is only able
- to utilize up to a maximum of <literal>1GB</literal> of memory, so
- setting <varname>autovacuum_work_mem</varname> to a value higher than
- that has no effect on the number of dead tuples that autovacuum can
- collect while scanning a table.
- </para>

https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=30e144287a

Add TIDStore, to store sets of TIDs (ItemPointerData) efficiently.
author    Masahiko Sawada <msawada@postgresql.org>  Thu, 21 Mar 2024 01:08:42 +0000 (10:08 +0900)
committer Masahiko Sawada <msawada@postgresql.org>  Thu, 21 Mar 2024 01:08:42 +0000 (10:08 +0900)
commit    30e144287a72529c9cd9fd6b07fe96eb8a1e270e
tree      2ced2409e5f62c9f198486b6be3cd74cec18682e
parent    995e0fbc1c57c9b705c57de456d25c6e448bc5dd
Add TIDStore, to store sets of TIDs (ItemPointerData) efficiently.

TIDStore is a data structure designed to efficiently store large sets
of TIDs. For TID storage, it employs a radix tree, where the key is
a block number, and the value is a bitmap representing offset
numbers. The TIDStore can be created on a DSA area and used by
multiple backend processes simultaneously.

There are potential future users such as tidbitmap.c, though it's very
likely the interface will need to evolve as we come to understand the
needs of different kinds of users. For example, we can support
updating the offset bitmap of existing values.

Currently, the TIDStore is not used for anything yet, aside from the
test code. But an upcoming patch will use it.

This includes a unit test module, in src/test/modules/test_tidstore.

Co-authored-by: John Naylor
Discussion: https://postgr.es/m/CAD21AoAfOZvmfR0j8VmZorZjL7RhTiQdVttNuC4W-Shdc2a-AA%40mail.gmail.com

/*-------------------------------------------------------------------------
 *
 * tidstore.c
 *    TID (ItemPointerData) storage implementation.
 *
 * TidStore is a in-memory data structure to store TIDs (ItemPointerData).
 * Internally it uses a radix tree as the storage for TIDs. The key is the
 * BlockNumber and the value is a bitmap of offsets, BlocktableEntry.
 *
 * TidStore can be shared among parallel worker processes by passing DSA area
 * to TidStoreCreate(). Other backends can attach to the shared TidStore by
 * TidStoreAttach().
 *
 * Portions Copyright (c) 1996-2024, PostgreSQL Global Development Group
 * Portions Copyright (c) 1994, Regents of the University of California
 *
 * IDENTIFICATION
 *    src/backend/access/common/tidstore.c
 *
 *-------------------------------------------------------------------------
 */

Data structure comparison: the old VacDeadItems array (PostgreSQL 16 and earlier) vs. the new VacDeadItemsInfo used with TidStore (PostgreSQL 17):

/*
 * VacDeadItems stores TIDs whose index tuples are deleted by index vacuuming.
 */
typedef struct VacDeadItems
{
    int             max_items;      /* # slots allocated in array */
    int             num_items;      /* current # of entries */

    /* Sorted array of TIDs to delete from indexes */
    ItemPointerData items[FLEXIBLE_ARRAY_MEMBER];
} VacDeadItems;

/*
 * VacDeadItemsInfo stores supplemental information for dead tuple TID
 * storage (i.e. TidStore).
 */
typedef struct VacDeadItemsInfo
{
    size_t          max_bytes;      /* the maximum bytes TidStore can use */
    int64           num_items;      /* current # of entries */
} VacDeadItemsInfo;


#ifndef TIDSTORE_H
#define TIDSTORE_H

#include "storage/itemptr.h"
#include "utils/dsa.h"

typedef struct TidStore TidStore;
typedef struct TidStoreIter TidStoreIter;

/* Result struct for TidStoreIterateNext */
typedef struct TidStoreIterResult
{
    BlockNumber     blkno;
    int             max_offset;
    int             num_offsets;
    OffsetNumber   *offsets;
} TidStoreIterResult;

extern TidStore *TidStoreCreate(size_t max_bytes, dsa_area *dsa, int tranche_id);
extern TidStore *TidStoreAttach(dsa_area *dsa, dsa_pointer rt_dp);
extern void TidStoreDetach(TidStore *ts);
extern void TidStoreLockExclusive(TidStore *ts);
extern void TidStoreLockShare(TidStore *ts);
extern void TidStoreUnlock(TidStore *ts);
extern void TidStoreDestroy(TidStore *ts);
extern void TidStoreSetBlockOffsets(TidStore *ts, BlockNumber blkno, OffsetNumber *offsets,
                                    int num_offsets);
extern bool TidStoreIsMember(TidStore *ts, ItemPointer tid);
extern TidStoreIter *TidStoreBeginIterate(TidStore *ts);
extern TidStoreIterResult *TidStoreIterateNext(TidStoreIter *iter);
extern void TidStoreEndIterate(TidStoreIter *iter);
extern size_t TidStoreMemoryUsage(TidStore *ts);
extern dsa_pointer TidStoreGetHandle(TidStore *ts);

#endif                          /* TIDSTORE_H */
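
Going only by the signatures above, a rough sketch of how a vacuum-like caller might use this API. This is not actual PostgreSQL code; in particular, passing dsa = NULL for a backend-local store and tranche_id = 0 are assumptions about this early version of the interface.

/* Hedged sketch built only from the declarations above; not real backend code. */
#include "postgres.h"
#include "access/tidstore.h"

static void
tidstore_usage_sketch(size_t max_bytes)
{
    /* Backend-local store: dsa = NULL, tranche_id unused (assumption). */
    TidStore   *ts = TidStoreCreate(max_bytes, NULL, 0);

    /* Heap scan phase: all dead offsets of one heap page go in with a single call. */
    OffsetNumber dead_offsets[] = {2, 5, 9};
    TidStoreSetBlockOffsets(ts, (BlockNumber) 123, dead_offsets, 3);

    /* Index vacuum phase: probe each index tuple's heap TID for membership. */
    ItemPointerData tid;
    ItemPointerSet(&tid, (BlockNumber) 123, (OffsetNumber) 5);
    if (TidStoreIsMember(ts, &tid))
    {
        /* the index tuple points to a dead heap tuple -> delete it from the index */
    }

    TidStoreDestroy(ts);
}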


