prototype: Worse is better
<h2>Talking About File Systems (2018-02-12)</h2>
<p>It has been a long time since I last wrote a post. Sometimes I was just lazy, sometimes I felt I did not understand a topic well enough, and most of the time I simply thought my writing was too poor.</p>
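<p>Today let's talk about file systems, which are a huge topic in their own right. Let's start with distributed storage, which is popular right now. It is usually divided into object storage, block storage, and file storage. All three already existed on a single machine; cloud vendors simply offer them as services, moving single-machine storage into the cloud. Strictly speaking, they should be called distributed object storage, block storage, and file storage in the cloud.</p>
<p>Block storage starts with Linux devices. Treating everything as a file is a tradition inherited from Unix and a fine piece of design, so a device is naturally a file too. By access pattern, devices are divided into block devices (read and written in units of blocks), character devices (read and written byte by byte), and others. Below the virtual file system sits the block device layer, which calls the driver to read and write the disk. Take the block device service that this layer offers to the VFS, move it to the cloud, and you have block storage.</p>
<p>Object storage is essentially key-value storage. In a language such as Java, the value is often the serialized form of an object, hence the name (please tell me if I have this wrong!).</p>
<p>File storage means a storage service that conforms to the POSIX standard. In Linux user space, reading or writing a file must go through system-provided interfaces such as read, write, and open, and POSIX is the standard that specifies those operating system interfaces. In other words, a storage service that offers read, write, open, and similar interfaces is file storage.</p>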
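<h3 id="文件系统">File systems</h3>
<p>There are a great many file systems: ext, ext2, ext3, ext4, and so on; a look at the modules under linux/fs makes that clear. To unify them, Linux abstracts a virtual file system (VFS). The VFS represents a file as an inode, and each inode is associated with a set of function pointers that implement the read and write operations of the specific file system.</p>
<p>When Linux boots it mounts a root file system, which is specified by your boot parameters. Some special file systems such as sockfs, tmpfs, and procfs are also set up during boot. All of them are attached to a mount tree whose root is <code class="language-plaintext highlighter-rouge">/</code>, the root directory for this boot. You can keep mounting new nodes on the tree, and each node corresponds to a concrete file system.</p>
<p>When you perform an operation such as open("/foo/bar"), the VFS first looks up the mount point in the mount tree, finds the corresponding inode from the mount point, and then resolves the rest of the path starting from that inode.</p>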
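<h3 id="ext2文件系统">The EXT2 file system</h3>
<p>EXT2 is compatible with EXT3 and is a simple file system, which makes it well suited for study. Let's dissect it.</p>
<p>As mentioned above, a block device is a device that is read and written in blocks; EXT2 is the system that decides how data is laid out on that block device.</p>
<p>A file system is conceptually simple: there are only files and directories. A directory may contain directories and files, forming a tree, and a file may hold data ranging from a few KB to several GB. Choosing a suitable file system therefore starts with understanding your storage scenario; the many file systems that exist are the result of different trade-offs for different scenarios.</p>
<p>EXT2 divides the blocks into Block Groups. Every Block Group has a Group Descriptor; the Group Descriptors are stored together right after the SuperBlock, and each one records where the group's block bitmap, inode bitmap, and inode table are located.</p>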
<h4 id="superblock">SuperBlock</h4>
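<p>For EXT2, everything starts with the SuperBlock, a special block that records the global attributes of the file system. It is located 1024 bytes from the start of the disk: with a BlockSize of 1024 bytes it is in the second Block, and with a BlockSize of 4096 bytes it is in the first Block. The BlockSize is chosen when the file system is created.</p>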
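<p>As a minimal sketch of the rule above (the function names are hypothetical, not taken from the kernel), the block that contains the SuperBlock follows directly from the fixed 1024-byte offset. The sketch also assumes the usual EXT2 convention that the block size is 1024 shifted left by s_log_block_size:</p>

```c
#include <stdint.h>

/* The SuperBlock always starts 1024 bytes into the device. */
#define EXT2_SUPERBLOCK_OFFSET 1024u

/* Block size in bytes, assuming block_size = 1024 << s_log_block_size. */
uint32_t ext2_block_size(uint32_t s_log_block_size) {
    return 1024u << s_log_block_size;
}

/* Index of the block that contains the SuperBlock. */
uint32_t ext2_superblock_block(uint32_t block_size) {
    return EXT2_SUPERBLOCK_OFFSET / block_size;
}

/* Byte offset of the SuperBlock inside that block. */
uint32_t ext2_superblock_offset_in_block(uint32_t block_size) {
    return EXT2_SUPERBLOCK_OFFSET % block_size;
}
```

<p>With 1 KB blocks the SuperBlock fills block 1 (the second block); with 4 KB blocks it sits in block 0, at byte offset 1024 inside that block.</p>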
<pre><code class="language-C">/*
* Structure of the super block
*/
struct ext2_super_block {
unsigned long s_inodes_count; /* Inodes count */
unsigned long s_blocks_count; /* Blocks count */
unsigned long s_r_blocks_count;/* Reserved blocks count */
unsigned long s_free_blocks_count;/* Free blocks count */
unsigned long s_free_inodes_count;/* Free inodes count */
unsigned long s_first_data_block;/* First Data Block */
unsigned long s_log_block_size;/* Block size */
long s_log_frag_size; /* Fragment size */
unsigned long s_blocks_per_group;/* # Blocks per group */
unsigned long s_frags_per_group;/* # Fragments per group */
unsigned long s_inodes_per_group;/* # Inodes per group */
unsigned long s_mtime; /* Mount time */
unsigned long s_wtime; /* Write time */
unsigned short s_mnt_count; /* Mount count */
short s_max_mnt_count; /* Maximal mount count */
unsigned short s_magic; /* Magic signature */
unsigned short s_state; /* File system state */
unsigned short s_errors; /* Behaviour when detecting errors */
unsigned short s_pad;
unsigned long s_lastcheck; /* time of last check */
unsigned long s_checkinterval; /* max. time between checks */
unsigned long s_reserved[238]; /* Padding to the end of the block */
};
</code></pre>
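<p>The key fields are s_blocks_count and s_blocks_per_group, which together determine how many block groups the partition has, and s_log_block_size, which gives the block size.</p>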
<h4 id="blockdescriptor">BlockDescriptor</h4>
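<p>The Block Descriptors (one group descriptor per Block Group) immediately follow the SuperBlock. Their number can be computed with the formula <code class="language-plaintext highlighter-rouge">(s_blocks_count - s_first_data_block - 1) / s_blocks_per_group + 1</code>. Each Block Descriptor must be no larger than BlockSize, so the descriptors can be read a block at a time, possibly all of them in a single read.</p>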
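<p>The formula can be written as a small helper. This is only a sketch: the function name is hypothetical, and the parameters mirror the SuperBlock fields of the same name.</p>

```c
#include <stdint.h>

/* Number of block groups, using the formula from the text.
 * Subtracting 1 before the division and adding 1 afterwards rounds up,
 * so a partially filled last group is still counted. */
uint32_t ext2_group_count(uint32_t s_blocks_count,
                          uint32_t s_first_data_block,
                          uint32_t s_blocks_per_group) {
    return (s_blocks_count - s_first_data_block - 1) / s_blocks_per_group + 1;
}
```

<p>For example, with s_first_data_block = 1 and s_blocks_per_group = 8192, a file system of 8193 blocks has one group, and adding a single block creates a second group.</p>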
<pre><code class="language-C">struct ext2_group_desc
{
unsigned long bg_block_bitmap; /* Blocks bitmap block */
unsigned long bg_inode_bitmap; /* Inodes bitmap block */
unsigned long bg_inode_table; /* Inodes table block */
unsigned short bg_free_blocks_count; /* Free blocks count */
unsigned short bg_free_inodes_count; /* Free inodes count */
unsigned short bg_used_dirs_count; /* Directories count */
unsigned short bg_pad;
unsigned long bg_reserved[3];
};
</code></pre>
<h4 id="inode">Inode</h4>
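<p>We already met the inode in the virtual file system. The inode here is roughly the VFS inode serialized to disk; every directory and every file corresponds to one inode.</p>
<p>The inode number of the root directory is fixed (2 in EXT2), and every lookup starts from there. To read a particular inode you first need its inode number. Inode numbers start at 1, so dividing (inode number - 1) by s_inodes_per_group from the SuperBlock gives the Block Group and therefore the Block Descriptor to use, while the remainder gives the index within that group's inode table, from which the inode is read.</p>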
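<p>A minimal sketch of that lookup arithmetic follows. The struct and function names are hypothetical; the only assumption beyond the text is that EXT2 inode numbers start at 1, which is why 1 is subtracted before dividing.</p>

```c
#include <stdint.h>

struct inode_location {
    uint32_t group;  /* which block group (and group descriptor) */
    uint32_t index;  /* slot inside that group's inode table */
};

/* Map an inode number to its block group and its slot in the inode table. */
struct inode_location ext2_locate_inode(uint32_t ino,
                                        uint32_t s_inodes_per_group) {
    struct inode_location loc;
    loc.group = (ino - 1) / s_inodes_per_group;
    loc.index = (ino - 1) % s_inodes_per_group;
    return loc;
}
```

<p>With 2048 inodes per group, the root inode (number 2) lands in group 0 at slot 1, and inode 2049 is the first slot of group 1.</p>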
<pre><code class="language-C">/*
* Structure of an inode on the disk
*/
struct ext2_inode {
unsigned short i_mode; /* File mode */
unsigned short i_uid; /* Owner Uid */
unsigned long i_size; /* Size in bytes */
unsigned long i_atime; /* Access time */
unsigned long i_ctime; /* Creation time */
unsigned long i_mtime; /* Modification time */
unsigned long i_dtime; /* Deletion Time */
unsigned short i_gid; /* Group Id */
unsigned short i_links_count; /* Links count */
unsigned long i_blocks; /* Blocks count */
unsigned long i_flags; /* File flags */
unsigned long i_reserved1;
unsigned long i_block[EXT2_N_BLOCKS];/* Pointers to blocks */
unsigned long i_version; /* File version (for NFS) */
unsigned long i_file_acl; /* File ACL */
unsigned long i_dir_acl; /* Directory ACL */
unsigned long i_faddr; /* Fragment address */
unsigned char i_frag; /* Fragment number */
unsigned char i_fsize; /* Fragment size */
unsigned short i_pad1;
unsigned long i_reserved2[2];
};
</code></pre>
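<p>Reading and writing a file really means reading and writing through its inode. How does an inode address a file's data? The answer lies in i_block.</p>
<p>Each inode has 15 block pointers. The first 12 are direct pointers, each pointing straight at a data block.</p>
<p>Once those blocks are used up, the 13th pointer is an indirect pointer: it points to a block that holds block pointers.</p>
<p>The 14th pointer is a double indirect pointer: it points to a block of pointers to blocks of block pointers.</p>
<p>The 15th pointer is a triple indirect pointer, which adds one more level of indirection.</p>
<p>With this multi-level design a file can grow very large, while small files pay no performance penalty.</p>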
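<p>To make the levels concrete, here is a sketch that tells which kind of pointer covers the n-th block of a file. The function name is hypothetical, and it assumes 4-byte block pointers, so one block holds block_size / 4 of them.</p>

```c
#include <stdint.h>

#define EXT2_NDIR_BLOCKS 12u   /* direct pointers in i_block */

/* Returns 0 for a direct block, 1/2/3 for single/double/triple indirect,
 * or -1 if the index is beyond what the inode can address. */
int ext2_pointer_level(uint64_t file_block, uint32_t block_size) {
    uint64_t per_block = block_size / 4;
    uint64_t single = per_block;
    uint64_t dbl = per_block * per_block;
    uint64_t triple = per_block * per_block * per_block;

    if (file_block < EXT2_NDIR_BLOCKS) return 0;
    file_block -= EXT2_NDIR_BLOCKS;
    if (file_block < single) return 1;
    file_block -= single;
    if (file_block < dbl) return 2;
    file_block -= dbl;
    if (file_block < triple) return 3;
    return -1;
}
```

<p>With 1 KB blocks each indirect block holds 256 pointers, so blocks 0 to 11 are direct, 12 to 267 go through the single indirect pointer, and the double indirect pointer takes over from block 268.</p>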
<h4 id="direntry">direntry</h4>
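<p>A direntry is a variable-length record (rec_len gives its length) that represents one entry in a directory: it maps a name to an inode number. The data blocks of a directory's inode hold the direntries of the subdirectories and files it contains.</p>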
<pre><code class="language-C">struct ext2_dir_entry {
unsigned long inode; /* Inode number */
unsigned short rec_len; /* Directory entry length */
unsigned short name_len; /* Name length */
char name[EXT2_NAME_LEN]; /* File name */
};
</code></pre>
<h2>Memory Paging (2017-08-24)</h2>
<h3 id="内存管理-内存分页">Memory management: paging</h3>
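<p>I have always felt that memory management is the most complex module of an operating system, bar none. My knowledge of it used to stop at the concepts I learned in school (heap, stack, memory mapping, page tables, allocators, and so on), and I was hazy on how any of it is implemented. Last year I set out to write an operating system from scratch, got as far as hello world, and then abandoned it.</p>
<p>A few days ago the interest came back. After digging through a lot of material and articles I finally pieced together a toy kernel that boots. Without further ado, here is the link: <a href="https://github.com/flex1988/phenix">Github</a></p>
<h4 id="1为何要内存分页">1. Why paging?</h4>
<p>Memory management is a very large topic. In this post I only want to cover memory mapping and will ignore everything else.</p>
<p>Memory management matters a great deal to C/C++ programmers, and it does not hurt other programmers to know about it either.</p>
<p>Very early machines had no paging. A program was loaded and run directly in physical memory, and when it finished the next one was loaded. Running two programs at the same time stretched memory thin. Segmentation came next, where a logical address consists of segment + offset. Segmentation was probably still not flexible enough, so paging followed.</p>
<p>Before looking at how paging works, let's ask why we need it.</p>
<p>Every process has its own virtual address space. If each process's memory were mapped one-to-one onto physical memory, there would never be enough physical memory, and usually only a small part of a virtual address space is in use anyway. Paging therefore saves physical memory. Virtual address spaces also free program code from depending on physical addresses and isolate processes from one another.</p>
<p>This post only covers paging on a 32-bit system. The principle on 64-bit is largely the same, just with more levels of page tables (four on x86-64).</p>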
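<h4 id="2页表">2. Page tables</h4>
<p>The core of paging is the page table, which has two levels: the first level is the page directory and the second is the page table. Each table is 4 KB and holds 1024 entries of 4 bytes each.</p>
<p>Each page directory entry is 4 bytes: bits 0-8 are flags, bits 9-11 are available to the operating system, and bits 12-31 hold the upper 20 bits of the physical address of a page table (which is 4 KB aligned).</p>
<p>Meaning of each flag:</p>
<ul>
<li>P <code class="language-plaintext highlighter-rouge">present</code>: the mapped physical page is in memory. If the bit is clear, for example because the page was swapped out to disk, an access raises a page fault and the fault handler loads the page back into memory</li>
<li>R <code class="language-plaintext highlighter-rouge">read/write</code>: whether the page may be written</li>
<li>U <code class="language-plaintext highlighter-rouge">User/Supervisor</code>: whether the page is accessible from user mode or only from kernel mode</li>
<li>W <code class="language-plaintext highlighter-rouge">Write-Through</code>: whether write-through caching is used instead of write-back</li>
<li>D <code class="language-plaintext highlighter-rouge">Cache Disable</code>: whether caching is disabled for the page</li>
<li>A <code class="language-plaintext highlighter-rouge">Accessed</code>: whether the page has been accessed</li>
<li>S <code class="language-plaintext highlighter-rouge">Page Size</code>: whether the entry maps a 4 MB page or points to a table of 4 KB pages</li>
</ul>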
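<p>The flag bits can be spelled out as masks. The sketch below uses the bit positions of a 32-bit x86 page directory entry; the macro and function names are hypothetical and are not taken from the toy kernel.</p>

```c
#include <stdbool.h>
#include <stdint.h>

/* Flag bits of a 32-bit x86 page directory entry. */
#define PDE_PRESENT       (1u << 0)  /* P */
#define PDE_WRITABLE      (1u << 1)  /* R */
#define PDE_USER          (1u << 2)  /* U */
#define PDE_WRITE_THROUGH (1u << 3)  /* W */
#define PDE_CACHE_DISABLE (1u << 4)  /* D */
#define PDE_ACCESSED      (1u << 5)  /* A */
#define PDE_PAGE_SIZE     (1u << 7)  /* S: set means a 4 MB page */

/* Bits 12-31: physical address of the 4 KB aligned page table. */
#define PDE_ADDR_MASK     0xfffff000u

/* True if the given flag bit is set in the entry. */
bool pde_has(uint32_t entry, uint32_t flag) {
    return (entry & flag) != 0;
}

/* Physical address of the page table the entry points to. */
uint32_t pde_table_address(uint32_t entry) {
    return entry & PDE_ADDR_MASK;
}

/* Build an entry from a table address and a set of flags. */
uint32_t pde_make(uint32_t table_phys, uint32_t flags) {
    return (table_phys & PDE_ADDR_MASK) | (flags & 0xfffu);
}
```

<p>For instance, a present, writable entry for a page table at physical address 0x00400000 is encoded as 0x00400003.</p>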
<p><img src="http://ww1.sinaimg.cn/large/7cb11947ly1fiva7ystjrj20c0073t96.jpg" alt="img" /></p>
<p>A page table entry is largely the same as a page directory entry; only the meaning of some flag bits differs.</p>
<p><img src="http://ww1.sinaimg.cn/large/7cb11947ly1fiva8egq7mj20c00740t3.jpg" alt="img" /></p>
<p>The CR3 register holds the address of the first-level table (the page directory), so the overall paging structure looks like this:</p>
<p><img src="http://ww1.sinaimg.cn/large/7cb11947ly1fiva7e1qecg20m70gn3zm.gif" alt="img" /></p>
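<h4 id="3地址转换">3. Address translation</h4>
<p>The MMU is the unit in an Intel CPU that actually performs paging. How does it resolve an address?</p>
<p>Suppose we have a virtual address p = 0x12345678, a 4-byte address. In binary it is 0001001000 1101000101 011001111000.</p>
<ol>
<li>
<p>The top 10 bits of the address, 0001001000, select an entry in the first-level table. Use this value as the index to find entry1, then check entry1's flags to see whether access is permitted and whether the entry is mapped.</p>
</li>
<li>
<p>Take the upper 20 bits of entry1 (a physical address) and set the lower 12 bits to 0; the 4 KB of physical memory this points to is the second-level page table.</p>
</li>
<li>
<p>Use the middle 10 bits of p as the index to fetch entry2 from the second-level page table.</p>
</li>
<li>
<p>The upper 20 bits of entry2 combined with the lower 12 bits of p form the real physical address, and the translation is complete.</p>
</li>
</ol>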
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">uint32_t</span> <span class="nf">virt_to_phys</span><span class="p">(</span><span class="kt">uint32_t</span> <span class="n">virtualaddr</span><span class="p">)</span> <span class="p">{</span>
<span class="kt">int</span> <span class="n">pdidx</span> <span class="o">=</span> <span class="n">virtualaddr</span> <span class="o">>></span> <span class="mi">22</span><span class="p">;</span>
<span class="kt">int</span> <span class="n">ptidx</span> <span class="o">=</span> <span class="n">virtualaddr</span> <span class="o">>></span> <span class="mi">12</span> <span class="o">&</span> <span class="mh">0x03ff</span><span class="p">;</span>
<span class="kt">int</span> <span class="n">offset</span> <span class="o">=</span> <span class="n">virtualaddr</span> <span class="o">&</span> <span class="mh">0xfff</span><span class="p">;</span>
<span class="n">page_tabl_t</span><span class="o">*</span> <span class="n">tabl</span> <span class="o">=</span> <span class="n">_kernel_pd</span><span class="o">-></span><span class="n">tabls</span><span class="p">[</span><span class="n">pdidx</span><span class="p">];</span>
<span class="n">page_t</span><span class="o">*</span> <span class="n">page</span> <span class="o">=</span> <span class="o">&</span><span class="n">tabl</span><span class="o">-></span><span class="n">pages</span><span class="p">[</span><span class="n">ptidx</span><span class="p">];</span>
<span class="k">return</span> <span class="p">(</span><span class="n">page</span><span class="o">-></span><span class="n">addr</span> <span class="o"><<</span> <span class="mi">12</span><span class="p">)</span> <span class="o">+</span> <span class="n">offset</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>
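<p>The same walk can be tried out in user space. The following is a sketch under simplifying assumptions: the page directory is modelled as an array of C pointers to tables rather than entries holding physical addresses, and all names are hypothetical.</p>

```c
#include <stddef.h>
#include <stdint.h>

#define PTE_PRESENT 0x1u

/* Split a 32-bit virtual address into its three fields. */
static inline uint32_t pd_index(uint32_t va)    { return va >> 22; }
static inline uint32_t pt_index(uint32_t va)    { return (va >> 12) & 0x3ffu; }
static inline uint32_t page_offset(uint32_t va) { return va & 0xfffu; }

/* Walk a simulated two-level table. page_dir has 1024 slots, each either
 * NULL or a pointer to a table of 1024 entries. Returns 0 and stores the
 * physical address in *pa on success, -1 if the address is not mapped. */
int translate(uint32_t *const *page_dir, uint32_t va, uint32_t *pa) {
    const uint32_t *table = page_dir[pd_index(va)];
    if (table == NULL)
        return -1;
    uint32_t pte = table[pt_index(va)];
    if (!(pte & PTE_PRESENT))
        return -1;
    /* Upper 20 bits come from the entry, lower 12 from the address. */
    *pa = (pte & 0xfffff000u) | page_offset(va);
    return 0;
}
```

<p>For p = 0x12345678 the directory index is 0x48, the table index is 0x345, and the offset is 0x678, matching the binary split shown earlier.</p>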
<h4 id="4page-faults">4.Page Faults</h4>
<p>A page fault is raised when a process touches memory that has been swapped out, writes to read-only memory, or tries to write kernel memory from user mode. The CPU pushes an error code onto the stack, and the fault handler then takes care of the rest.</p>
<h2>Does write Become a Bottleneck When the Disk Is Full? (2017-06-21)</h2>
<h3 id="背景">Background</h3>
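<p>Recently, during a canary release in the test environment, the business side reported that the service was unresponsive, so we immediately took the node out of rotation to restore service.</p>
<p>Checking the test machine showed that the process was alive and its logs looked normal.</p>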
<p><code class="language-plaintext highlighter-rouge">printf "stats\r\n"|nc localhost 10101</code></p>
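<p>Probing the port showed that the command hung and took about 10 s to reply.</p>
<h3 id="syscallwrite耗时">Time spent in syscall:write</h3>
<p>df -h showed that the disk was full. But would a full disk really make writes this slow?</p>
<p>Use strace to see how long each write takes:</p>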
<p><img src="http://ww1.sinaimg.cn/large/7cb11947ly1fgt4c2zmn9j213o0psgx3.jpg" alt="img" /></p>
<p>Each write took 250 ms or more, which is rather long.</p>
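<p>The same measurement can be made from inside a program. This is a minimal sketch, not the code used in the investigation: timed_write is a hypothetical helper that wraps a single write() call with clock_gettime.</p>

```c
#define _POSIX_C_SOURCE 200809L
#include <stddef.h>
#include <sys/types.h>
#include <time.h>
#include <unistd.h>

/* Calls write() once and stores the elapsed wall-clock time, in
 * nanoseconds, in *elapsed_ns. Returns whatever write() returned. */
ssize_t timed_write(int fd, const void *buf, size_t len,
                    long long *elapsed_ns) {
    struct timespec start, end;
    clock_gettime(CLOCK_MONOTONIC, &start);
    ssize_t n = write(fd, buf, len);
    clock_gettime(CLOCK_MONOTONIC, &end);
    *elapsed_ns = (end.tv_sec - start.tv_sec) * 1000000000LL
                + (end.tv_nsec - start.tv_nsec);
    return n;
}
```

<p>Logging the elapsed time whenever it exceeds a threshold gives a cheap way to notice slow writes without attaching strace.</p>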
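<p>After commenting out the body of the _log function and recompiling, the service returned to normal.</p>
<p>My intuition was that a full disk should not make file writes this slow. So what was going on?</p>
<p>I took a fresh machine, wrote a small program to fill its disk, and then started the same service. Everything tested normal.</p>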
<p><img src="http://ww1.sinaimg.cn/large/7cb11947ly1fgt4fv7poyj20jm0aotcw.jpg" alt="img" /></p>
<p>This is interesting. So where is the problem?</p>
<h3 id="ext4">ext4</h3>
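<p>uname -r showed that the healthy machine runs kernel 3.10.0-229.el7.x86_64, while the problematic one runs 2.6.32-431.11.2.el6.toa.2.x86_64.</p>
<p>df -T showed that both file systems are ext4, so I suspected that some ext4 mechanism was responsible.</p>
<p>Use perf stat to count all ext4 trace events:</p>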
<p><code class="language-plaintext highlighter-rouge">sudo perf stat -e "ext4:*" -p 23488 sleep 10</code></p>
<ol>
<li>
<p>The problematic machine</p>
<p><img src="http://ww1.sinaimg.cn/large/7cb11947ly1fgt4gd89grj20cl0gxjsw.jpg" alt="img" /></p>
</li>
<li>
<p>The healthy machine</p>
<p><img src="http://ww1.sinaimg.cn/large/7cb11947ly1fgt4gm73kxj20bw0jsjtb.jpg" alt="img" /></p>
</li>
</ol>
<p>The ext4 execution paths differ considerably. Next, trace ext4 and syscall:write together:</p>
<p><code class="language-plaintext highlighter-rouge">sudo perf stat -e "syscalls:sys_enter_write,ext4:*" -p 23488 sleep 10</code></p>
<p><img src="http://ww1.sinaimg.cn/large/7cb11947ly1fgt4gtu8bqj20bk0h6gn6.jpg" alt="img" /></p>
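<p>Several runs showed that the counts for syscall:write and ext4:ext4_da_write_begin are exactly equal, which means every write calls ext4's ext4_da_write_begin. What happens after that is where the two machines differ.</p>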
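<h3 id="内核源码">Kernel source</h3>
<p>Reading the kernel source shows that ext4_da_write_begin differs a lot between 3.10 and 2.6. I spent a long time with systemtap scripts trying to reconstruct the different paths through ext4_da_write_begin.</p>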
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="kt">int</span> <span class="nf">ext4_da_write_begin</span><span class="p">(){</span>
<span class="cm">/* omitted */</span>
<span class="k">if</span> <span class="p">(</span><span class="n">ret</span> <span class="o">==</span> <span class="o">-</span><span class="n">ENOSPC</span> <span class="o">&&</span> <span class="n">ext4_should_retry_alloc</span><span class="p">(</span><span class="n">inode</span><span class="o">-></span><span class="n">i_sb</span><span class="p">,</span> <span class="o">&</span><span class="n">retries</span><span class="p">))</span>
<span class="k">goto</span> <span class="n">retry</span><span class="p">;</span>
<span class="nl">out:</span>
<span class="k">return</span> <span class="n">ret</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>
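<p>When the disk is full and ENOSPC is returned, ext4_should_retry_alloc is the prime suspect: I suspected it takes a different path that makes the older kernel retry over and over.</p>
<p>Tracing with a script confirmed a difference inside ext4_should_retry_alloc: the problematic machine frequently calls jbd2_journal_force_commit_nested, while the healthy machine simply returns 0.</p>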
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="nf">ext4_should_retry_alloc</span><span class="p">(</span><span class="k">struct</span> <span class="n">super_block</span> <span class="o">*</span><span class="n">sb</span><span class="p">,</span> <span class="kt">int</span> <span class="o">*</span><span class="n">retries</span><span class="p">)</span>
<span class="p">{</span>
<span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">ext4_has_free_blocks</span><span class="p">(</span><span class="n">EXT4_SB</span><span class="p">(</span><span class="n">sb</span><span class="p">),</span> <span class="mi">1</span><span class="p">)</span> <span class="o">||</span>
<span class="p">(</span><span class="o">*</span><span class="n">retries</span><span class="p">)</span><span class="o">++</span> <span class="o">></span> <span class="mi">3</span> <span class="o">||</span>
<span class="o">!</span><span class="n">EXT4_SB</span><span class="p">(</span><span class="n">sb</span><span class="p">)</span><span class="o">-></span><span class="n">s_journal</span><span class="p">)</span>
<span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="n">jbd_debug</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="s">"%s: retrying operation after ENOSPC</span><span class="se">\n</span><span class="s">"</span><span class="p">,</span> <span class="n">sb</span><span class="o">-></span><span class="n">s_id</span><span class="p">);</span>
<span class="k">return</span> <span class="n">jbd2_journal_force_commit_nested</span><span class="p">(</span><span class="n">EXT4_SB</span><span class="p">(</span><span class="n">sb</span><span class="p">)</span><span class="o">-></span><span class="n">s_journal</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>
<p>The stap script:</p>
<pre><code class="language-stp">global joural
global ext4
probe module("ext4").function("ext4_da_write_begin").call {
ext4++;
}
probe module("jbd2").function("jbd2_journal_force_commit_nested").call {
joural++;
}
probe module("ext4").function("ext4_has_free_clusters").return {
printf("%x\n",@cast($sb->s_fs_info,"ext4_sb_info")->s_journal);
}
probe timer.s(1), end {
ansi_clear_screen();
printf("%10s %10s\n","ext4_write_begin","jbd2_journal_force_commit_nested");
printf("%10d %10d\n",ext4,joural);
}
</code></pre>
<ol>
<li>
<p>The problematic machine</p>
<p><img src="http://ww1.sinaimg.cn/large/7cb11947ly1fgt4hu91xjj209y00yweg.jpg" alt="img" /></p>
</li>
<li>
<p>The healthy machine</p>
<p><img src="http://ww1.sinaimg.cn/large/7cb11947ly1fgt4i3pmzzj20a200w748.jpg" alt="img" /></p>
</li>
</ol>
<p>Reading on:</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="nf">ext4_has_free_blocks</span><span class="p">(</span><span class="k">struct</span> <span class="n">ext4_sb_info</span> <span class="o">*</span><span class="n">sbi</span><span class="p">,</span> <span class="n">s64</span> <span class="n">nblocks</span><span class="p">)</span>
<span class="p">{</span>
<span class="cm">/* omitted */</span>
<span class="cm">/* Hm, nope. Are (enough) root reserved blocks available? */</span>
<span class="k">if</span> <span class="p">(</span><span class="n">sbi</span><span class="o">-></span><span class="n">s_resuid</span> <span class="o">==</span> <span class="n">current_fsuid</span><span class="p">()</span> <span class="o">||</span>
<span class="p">((</span><span class="n">sbi</span><span class="o">-></span><span class="n">s_resgid</span> <span class="o">!=</span> <span class="mi">0</span><span class="p">)</span> <span class="o">&&</span> <span class="n">in_group_p</span><span class="p">(</span><span class="n">sbi</span><span class="o">-></span><span class="n">s_resgid</span><span class="p">))</span> <span class="o">||</span>
<span class="n">capable</span><span class="p">(</span><span class="n">CAP_SYS_RESOURCE</span><span class="p">))</span> <span class="p">{</span>
<span class="k">if</span> <span class="p">(</span><span class="n">free_blocks</span> <span class="o">>=</span> <span class="p">(</span><span class="n">nblocks</span> <span class="o">+</span> <span class="n">dirty_blocks</span><span class="p">))</span>
<span class="k">return</span> <span class="mi">1</span><span class="p">;</span>
<span class="p">}</span>
<span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>
<p>ext4_has_free_blocks checks whether free blocks remain, which decides whether to retry. In 3.10 the corresponding function is:</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="kt">int</span> <span class="nf">ext4_has_free_clusters</span><span class="p">(</span><span class="k">struct</span> <span class="n">ext4_sb_info</span> <span class="o">*</span><span class="n">sbi</span><span class="p">,</span>
<span class="n">s64</span> <span class="n">nclusters</span><span class="p">,</span> <span class="kt">unsigned</span> <span class="kt">int</span> <span class="n">flags</span><span class="p">)</span>
<span class="p">{</span>
<span class="cm">/* omitted */</span>
<span class="cm">/* No free blocks. Let's see if we can dip into reserved pool */</span>
<span class="k">if</span> <span class="p">(</span><span class="n">flags</span> <span class="o">&</span> <span class="n">EXT4_MB_USE_RESERVED</span><span class="p">)</span> <span class="p">{</span>
<span class="k">if</span> <span class="p">(</span><span class="n">free_clusters</span> <span class="o">>=</span> <span class="p">(</span><span class="n">nclusters</span> <span class="o">+</span> <span class="n">dirty_clusters</span><span class="p">))</span>
<span class="k">return</span> <span class="mi">1</span><span class="p">;</span>
<span class="p">}</span>
<span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>
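<p>The check differs. In 3.10, ext4_has_free_clusters is called with flags set to 0, so the space reserved for root is not considered. In 2.6 the function checks whether the caller is root; if so, the root-reserved space is counted as available when deciding whether enough free blocks remain.</p>
<p>A script to print free_blocks, nblocks, dirty_blocks, and root_blocks:</p>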
<pre><code class="language-stp">probe module("ext4").function("ext4_has_free_blocks") {
printf("free: %d dirty: %x nblocks: %d root: %d\n",$sbi->s_freeblocks_counter->count,$sbi->s_dirtyblocks_counter->count,$nblocks,$sbi->s_es->s_r_blocks_count_hi<<32|$sbi->s_es->s_free_blocks_count_lo);
}
</code></pre>
<p><img src="http://ww1.sinaimg.cn/large/7cb11947ly1fgtxoumbg3j20bv05fjsm.jpg" alt="img" /></p>
<p>For root the check is free >= nblocks + dirty, which is true, so the write keeps retrying.</p>
<p>For a non-root user the check is free >= nblocks + dirty + root, which is false, so there is no retry.</p>
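<p>The difference between the two kernels can be modelled in a few lines. This is a simplified sketch based only on the excerpts shown here: the function names and the privileged / use_reserved parameters are hypothetical stand-ins for the uid, gid, capability, and flag checks, and the first comparison is assumed to be the same in both versions.</p>

```c
#include <stdbool.h>
#include <stdint.h>

/* Model of the 2.6.32 check: a privileged caller may dip into the
 * blocks reserved for root. */
bool has_free_blocks_2_6(int64_t free_blocks, int64_t dirty_blocks,
                         int64_t root_blocks, int64_t nblocks,
                         bool privileged) {
    if (free_blocks >= root_blocks + nblocks + dirty_blocks)
        return true;
    if (privileged && free_blocks >= nblocks + dirty_blocks)
        return true;
    return false;
}

/* Model of the 3.10 check: the reserved pool only counts when the
 * caller explicitly asks for it through the flags argument. */
bool has_free_clusters_3_10(int64_t free_clusters, int64_t dirty_clusters,
                            int64_t root_clusters, int64_t nclusters,
                            bool use_reserved) {
    if (free_clusters >= root_clusters + nclusters + dirty_clusters)
        return true;
    if (use_reserved && free_clusters >= nclusters + dirty_clusters)
        return true;
    return false;
}
```

<p>With, say, 100 free blocks and 500 reserved for root (made-up numbers), the 2.6 model reports free space to root but not to an ordinary user, while the 3.10 model called without the reserved flag reports none to either.</p>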
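<h3 id="验证">Verification</h3>
<p>This looks like the cause, and it is easy to verify: on the older kernel only root may use the reserved space, so starting the program as another user should make the problem disappear.</p>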
<p><img src="http://ww1.sinaimg.cn/large/7cb11947ly1fgt4h0iadkj20jp092q5y.jpg" alt="img" /></p>
<p>After restarting the program as a non-root user, strace showed completely normal behaviour.</p>
<h3 id="总结">Summary</h3>
<p>So the problem is caused by a shortcoming in the file system code of the older kernel. Kernel versions in production vary, though, so if a bug or some other process fills the disk, the service may well become unavailable.</p>
<p>Preventive measures:</p>
<ol>
<li>Add monitoring on logs so that they cannot fill the disk</li>
<li>Run processes as a non-root user whenever possible</li>
</ol>
<h2>Principles and Implementation of Locks on Linux (2017-02-19)</h2>
<h3 id="锁能解决什么问题">What problem do locks solve?</h3>
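<p>Before discussing how locks are implemented, we should ask why locks are needed and what problem they solve.</p>
<p>Locks exist to keep operations on shared memory consistent when multiple cores run concurrently (in fact the problem exists even on a single core, because interrupts can preempt a running thread).</p>
<p>Take the following code:</p>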
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#include <pthread.h>
#include <stdio.h>
</span>
<span class="k">static</span> <span class="kt">int</span> <span class="n">count</span><span class="p">;</span>
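<span class="kt">void</span> <span class="o">*</span><span class="nf">add_count</span><span class="p">(</span><span class="kt">void</span> <span class="o">*</span><span class="n">arg</span><span class="p">)</span> <span class="p">{</span>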
<span class="kt">int</span> <span class="n">i</span><span class="p">;</span>
<span class="k">for</span> <span class="p">(</span><span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o"><</span> <span class="mi">10000</span><span class="p">;</span><span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
<span class="n">count</span><span class="o">++</span><span class="p">;</span>
<span class="p">}</span>
<span class="k">return</span> <span class="nb">NULL</span><span class="p">;</span>
<span class="p">}</span>
<span class="kt">int</span> <span class="nf">main</span><span class="p">()</span> <span class="p">{</span>
<span class="n">count</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
<span class="n">pthread_t</span> <span class="n">t1</span><span class="p">,</span> <span class="n">t2</span><span class="p">;</span>
<span class="n">pthread_create</span><span class="p">(</span><span class="o">&</span><span class="n">t1</span><span class="p">,</span> <span class="nb">NULL</span><span class="p">,</span> <span class="n">add_count</span><span class="p">,</span> <span class="nb">NULL</span><span class="p">);</span>
<span class="n">pthread_create</span><span class="p">(</span><span class="o">&</span><span class="n">t2</span><span class="p">,</span> <span class="nb">NULL</span><span class="p">,</span> <span class="n">add_count</span><span class="p">,</span> <span class="nb">NULL</span><span class="p">);</span>
<span class="n">pthread_join</span><span class="p">(</span><span class="n">t1</span><span class="p">,</span><span class="nb">NULL</span><span class="p">);</span>
<span class="n">pthread_join</span><span class="p">(</span><span class="n">t2</span><span class="p">,</span><span class="nb">NULL</span><span class="p">);</span>
<span class="n">printf</span><span class="p">(</span><span class="s">"count is %d</span><span class="se">\n</span><span class="s">"</span><span class="p">,</span> <span class="n">count</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>
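<p>The program is simple: it starts two threads, each of which increments count 10000 times. In theory the result should be 20000, but it is not, and it differs from run to run.</p>
<p>That is because count++ is not atomic. It consists of the following steps:</p>
<ol>
<li>Load the value of count from memory into a register</li>
<li>Add one to the value in the register</li>
<li>Write the new value from the register back to count's memory address</li>
</ol>
<p>Each thread has its own register state. When several threads increment concurrently, t2 may read the value before t1 has written its incremented value back; t2 then overwrites what t1 wrote, and an increment is lost.</p>
<p>The fix is simple: use a lock. Here is the locked version:</p>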
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#include <pthread.h>
#include <stdio.h>
</span>
<span class="k">static</span> <span class="kt">int</span> <span class="n">count</span><span class="p">;</span>
<span class="n">pthread_mutex_t</span> <span class="n">m</span><span class="p">;</span>
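<span class="kt">void</span> <span class="o">*</span><span class="nf">add_count</span><span class="p">(</span><span class="kt">void</span> <span class="o">*</span><span class="n">arg</span><span class="p">)</span> <span class="p">{</span>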
<span class="kt">int</span> <span class="n">i</span><span class="p">;</span>
<span class="k">for</span> <span class="p">(</span><span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o"><</span> <span class="mi">10000</span><span class="p">;</span><span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
<span class="n">pthread_mutex_lock</span><span class="p">(</span><span class="o">&</span><span class="n">m</span><span class="p">);</span>
<span class="n">count</span><span class="o">++</span><span class="p">;</span>
<span class="n">pthread_mutex_unlock</span><span class="p">(</span><span class="o">&</span><span class="n">m</span><span class="p">);</span>
<span class="p">}</span>
<span class="k">return</span> <span class="nb">NULL</span><span class="p">;</span>
<span class="p">}</span>
<span class="kt">int</span> <span class="nf">main</span><span class="p">()</span> <span class="p">{</span>
<span class="n">count</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
<span class="n">pthread_t</span> <span class="n">t1</span><span class="p">,</span> <span class="n">t2</span><span class="p">;</span>
<span class="n">pthread_mutex_init</span><span class="p">(</span><span class="o">&</span><span class="n">m</span><span class="p">,</span><span class="nb">NULL</span><span class="p">);</span>
<span class="n">pthread_create</span><span class="p">(</span><span class="o">&</span><span class="n">t1</span><span class="p">,</span> <span class="nb">NULL</span><span class="p">,</span> <span class="n">add_count</span><span class="p">,</span> <span class="nb">NULL</span><span class="p">);</span>
<span class="n">pthread_create</span><span class="p">(</span><span class="o">&</span><span class="n">t2</span><span class="p">,</span> <span class="nb">NULL</span><span class="p">,</span> <span class="n">add_count</span><span class="p">,</span> <span class="nb">NULL</span><span class="p">);</span>
<span class="n">pthread_join</span><span class="p">(</span><span class="n">t1</span><span class="p">,</span><span class="nb">NULL</span><span class="p">);</span>
<span class="n">pthread_join</span><span class="p">(</span><span class="n">t2</span><span class="p">,</span><span class="nb">NULL</span><span class="p">);</span>
<span class="n">printf</span><span class="p">(</span><span class="s">"count is %d</span><span class="se">\n</span><span class="s">"</span><span class="p">,</span> <span class="n">count</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>
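<p>Every run now prints 20000. Here a mutex solves the problem: it provides mutually exclusive access to the critical section.</p>
<p>Besides mutex, there are many other mechanisms for synchronizing concurrent threads around a critical section, such as semaphore, spinlock, and condition variables.</p>
<h3 id="同步原语">Synchronization primitives</h3>
<p>Locks are built on synchronization primitives, which guarantee atomic operations and in turn need hardware support.</p>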
<h4 id="test_and_set">Test_and_Set</h4>
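<p>test_and_set is one primitive for building locks. It provides an atomic operation that swaps a value in memory with a value you supply and returns the old value.</p>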
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="nf">test_and_set</span><span class="p">(</span><span class="kt">int</span> <span class="o">*</span><span class="n">p</span><span class="p">,</span><span class="kt">int</span> <span class="n">val</span><span class="p">){</span>
<span class="kt">int</span> <span class="n">r</span> <span class="o">=</span> <span class="o">*</span><span class="n">p</span><span class="p">;</span>
<span class="o">*</span><span class="n">p</span> <span class="o">=</span> <span class="n">val</span><span class="p">;</span>
<span class="k">return</span> <span class="n">r</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>
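<p>The C above only describes the semantics; as plain C it is not atomic. On x86 a real test_and_set relies on the xchg instruction, which swaps two values, and on the lock prefix, which prevents other processors from accessing the target memory during the operation. xchg with a memory operand is locked implicitly.</p>
<p>The operation atomically swaps val with a value in memory. Let 0 mean unlocked and 1 mean locked.</p>
<p>Initially *p is 0, meaning unlocked. If threads t1 and t2 call test_and_set at the same time, the lock semantics guarantee that they cannot both take the lock.</p>
<p>Say t1 wins: it swaps 1 with the 0 in memory and gets 0 back, which means t1 holds the lock. It can increment count and then release the lock.</p>
<p>t2 loses: it gets 1 back and must keep retrying until it swaps out a 0, which means it has acquired the lock.</p>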
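<p>The walkthrough above can be sketched as a tiny spinlock built on gcc's __sync_lock_test_and_set and __sync_lock_release builtins. The tas_ function names are hypothetical, and this is only an illustration of the idea, not how pthread mutexes are implemented.</p>

```c
/* Spin until the previous value was 0, that is, until this caller is
 * the one that changed the lock word from 0 to 1. */
void tas_lock(int *p) {
    while (__sync_lock_test_and_set(p, 1) != 0)
        ;  /* someone else holds the lock: retry */
}

/* Single attempt: returns 1 if the lock was taken, 0 if it was busy. */
int tas_trylock(int *p) {
    return __sync_lock_test_and_set(p, 1) == 0;
}

/* Release the lock by atomically writing 0 back. */
void tas_unlock(int *p) {
    __sync_lock_release(p);
}
```

<p>A thread that gets 0 back from the swap owns the lock; every other thread gets 1 back and keeps spinning until the owner calls tas_unlock.</p>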
<h4 id="compare_and_swap">Compare_and_Swap</h4>
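<p>Compare_and_Swap is more powerful than Test_and_Set; many modern locks and related mechanisms are built on CAS.</p>
<p>CAS compares first: if the value in memory equals the expected value you supply (old), it writes the new value (new) to memory. Either way it returns the value that was in memory. Described in C:</p>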
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="nf">compare_and_swap</span><span class="p">(</span><span class="kt">int</span> <span class="o">*</span><span class="n">p</span><span class="p">,</span><span class="kt">int</span> <span class="n">old</span><span class="p">,</span><span class="kt">int</span> <span class="n">new</span><span class="p">){</span>
<span class="kt">int</span> <span class="n">r</span> <span class="o">=</span> <span class="o">*</span><span class="n">p</span><span class="p">;</span>
<span class="k">if</span><span class="p">(</span><span class="n">r</span> <span class="o">==</span> <span class="n">old</span><span class="p">)</span> <span class="o">*</span><span class="n">p</span> <span class="o">=</span> <span class="n">new</span><span class="p">;</span>
<span class="k">return</span> <span class="n">r</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>
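<p>CAS can also replace the lock entirely for simple updates. As a sketch (cas_increment is a hypothetical name), an atomic increment reads the current value and retries the swap until no other thread has changed it in between. It uses gcc's __sync_val_compare_and_swap and __atomic_load_n builtins.</p>

```c
/* Atomically add one to *p and return the new value. */
int cas_increment(int *p) {
    int old;
    do {
        /* Snapshot the current value. */
        old = __atomic_load_n(p, __ATOMIC_RELAXED);
        /* Install old + 1 only if *p still equals the snapshot;
         * otherwise another thread got in first, so try again. */
    } while (__sync_val_compare_and_swap(p, old, old + 1) != old);
    return old + 1;
}
```

<p>Calling cas_increment(&amp;count) from both threads in the earlier example would give 20000 without a mutex, because a lost race is detected and retried instead of silently overwriting the other thread's update.</p>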
<h3 id="实现锁">Implementing a lock</h3>
<p>With these primitives, implementing a lock is straightforward. Let's start with a simple one.</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="nf">lock</span><span class="p">(</span><span class="kt">int</span> <span class="o">*</span><span class="n">p</span><span class="p">)</span> <span class="p">{</span>
<span class="k">while</span> <span class="p">(</span><span class="o">!</span><span class="n">__sync_bool_compare_and_swap</span><span class="p">(</span><span class="n">p</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">))</span>
<span class="p">;</span>
<span class="p">}</span>
<span class="kt">void</span> <span class="nf">unlock</span><span class="p">(</span><span class="kt">int</span> <span class="o">*</span><span class="n">p</span><span class="p">)</span> <span class="p">{</span> <span class="o">*</span><span class="n">p</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="p">}</span>
</code></pre></div></div>
<p>gcc provides these operations as builtins, so we can call __sync_bool_compare_and_swap directly.</p>
<p>With the help of CAS we now have a lock. The original program becomes:</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#include <pthread.h>
#include <stdio.h>
#include <stdbool.h>
</span>
<span class="k">static</span> <span class="kt">int</span> <span class="n">count</span><span class="p">;</span>
<span class="kt">void</span> <span class="nf">lock</span><span class="p">(</span><span class="kt">int</span> <span class="o">*</span><span class="n">p</span><span class="p">)</span> <span class="p">{</span>
<span class="k">while</span> <span class="p">(</span><span class="o">!</span><span class="n">__sync_bool_compare_and_swap</span><span class="p">(</span><span class="n">p</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">))</span>
<span class="p">;</span>
<span class="p">}</span>
<span class="kt">void</span> <span class="nf">unlock</span><span class="p">(</span><span class="kt">int</span> <span class="o">*</span><span class="n">p</span><span class="p">)</span> <span class="p">{</span> <span class="n">__sync_lock_release</span><span class="p">(</span><span class="n">p</span><span class="p">);</span> <span class="p">}</span>
<span class="kt">int</span> <span class="n">lock_t</span><span class="p">;</span>
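<span class="kt">void</span> <span class="o">*</span><span class="nf">add_count</span><span class="p">(</span><span class="kt">void</span> <span class="o">*</span><span class="n">arg</span><span class="p">)</span> <span class="p">{</span>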
<span class="kt">int</span> <span class="n">i</span><span class="p">;</span>
<span class="k">for</span> <span class="p">(</span><span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o"><</span> <span class="mi">10000</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
<span class="n">lock</span><span class="p">(</span><span class="o">&</span><span class="n">lock_t</span><span class="p">);</span>
<span class="n">count</span><span class="o">++</span><span class="p">;</span>
<span class="n">unlock</span><span class="p">(</span><span class="o">&</span><span class="n">lock_t</span><span class="p">);</span>
<span class="p">}</span>
<span class="k">return</span> <span class="nb">NULL</span><span class="p">;</span>
<span class="p">}</span>
<span class="kt">int</span> <span class="nf">main</span><span class="p">()</span> <span class="p">{</span>
<span class="n">count</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
<span class="n">pthread_t</span> <span class="n">t1</span><span class="p">,</span> <span class="n">t2</span><span class="p">;</span>
<span class="n">lock_t</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
<span class="n">pthread_create</span><span class="p">(</span><span class="o">&</span><span class="n">t1</span><span class="p">,</span> <span class="nb">NULL</span><span class="p">,</span> <span class="n">add_count</span><span class="p">,</span> <span class="nb">NULL</span><span class="p">);</span>
<span class="n">pthread_create</span><span class="p">(</span><span class="o">&</span><span class="n">t2</span><span class="p">,</span> <span class="nb">NULL</span><span class="p">,</span> <span class="n">add_count</span><span class="p">,</span> <span class="nb">NULL</span><span class="p">);</span>
<span class="n">pthread_join</span><span class="p">(</span><span class="n">t1</span><span class="p">,</span> <span class="nb">NULL</span><span class="p">);</span>
<span class="n">pthread_join</span><span class="p">(</span><span class="n">t2</span><span class="p">,</span> <span class="nb">NULL</span><span class="p">);</span>
<span class="n">printf</span><span class="p">(</span><span class="s">"count is %d</span><span class="se">\n</span><span class="s">"</span><span class="p">,</span> <span class="n">count</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>
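<p>Compiling and running it prints 20000 every time, so the lock works. However, the loop while (!__sync_bool_compare_and_swap(p, 0, 1)) introduces a problem.</p>
<p>A thread that fails to take the lock keeps retrying immediately. This is in fact a spin lock.</p>
<p>A spin lock suits very short critical sections: a few retries are usually enough to get the lock, which avoids frequent context switches.</p>
<p>For long critical sections it is better to put the threads that failed to take the lock to sleep, and to wake one of them when the lock is released if any are still waiting.</p>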
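<h3 id="进化">Evolution</h3>
<p>The idea is this: every thread that tries to take the lock either succeeds or fails. For the failure case we keep a blocking queue that holds all the threads waiting for the lock.</p>
<p>To avoid starvation, we want this to be a priority queue in which the thread that went to sleep first gets the lock first.</p>
<p>When the thread holding the lock releases it, it checks the queue; if the queue is not empty, it takes out the first thread and wakes it up.</p>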
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#define _GNU_SOURCE
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
</span>
<span class="cp">#include <linux/futex.h>
#include <sys/syscall.h>
</span>
<span class="k">static</span> <span class="kt">int</span> <span class="n">count</span><span class="p">;</span>
<span class="k">struct</span> <span class="n">lock_t</span> <span class="n">_l</span><span class="p">;</span>
<span class="k">static</span> <span class="kt">int</span> <span class="nf">futex</span><span class="p">(</span><span class="kt">int</span> <span class="o">*</span><span class="n">uaddr</span><span class="p">,</span> <span class="kt">int</span> <span class="n">futex_op</span><span class="p">,</span> <span class="kt">int</span> <span class="n">val</span><span class="p">,</span> <span class="k">const</span> <span class="k">struct</span> <span class="n">timespec</span> <span class="o">*</span><span class="n">timeout</span><span class="p">,</span> <span class="kt">int</span> <span class="o">*</span><span class="n">uaddr2</span><span class="p">,</span> <span class="kt">int</span> <span class="n">val3</span><span class="p">)</span> <span class="p">{</span>
<span class="k">return</span> <span class="n">syscall</span><span class="p">(</span><span class="n">SYS_futex</span><span class="p">,</span> <span class="n">uaddr</span><span class="p">,</span> <span class="n">futex_op</span><span class="p">,</span> <span class="n">val</span><span class="p">,</span> <span class="n">timeout</span><span class="p">,</span> <span class="n">uaddr2</span><span class="p">,</span> <span class="n">val3</span><span class="p">);</span>
<span class="p">}</span>
<span class="k">struct</span> <span class="n">thread_node</span> <span class="p">{</span>
<span class="n">pthread_t</span> <span class="n">id</span><span class="p">;</span>
<span class="kt">int</span> <span class="n">flock</span><span class="p">;</span>
<span class="n">pthread_t</span> <span class="o">*</span><span class="n">t</span><span class="p">;</span>
<span class="k">struct</span> <span class="n">thread_node</span> <span class="o">*</span><span class="n">next</span><span class="p">;</span>
<span class="p">};</span>
<span class="k">struct</span> <span class="n">lock_t</span> <span class="p">{</span>
<span class="kt">int</span> <span class="n">lock</span><span class="p">;</span>
<span class="k">struct</span> <span class="n">thread_node</span> <span class="o">*</span><span class="n">node</span><span class="p">;</span>
<span class="p">};</span>
<span class="kt">void</span> <span class="nf">lock</span><span class="p">(</span><span class="k">struct</span> <span class="n">lock_t</span> <span class="o">*</span><span class="n">l</span><span class="p">)</span> <span class="p">{</span>
<span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">__sync_bool_compare_and_swap</span><span class="p">(</span><span class="o">&</span><span class="n">l</span><span class="o">-></span><span class="n">lock</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">))</span> <span class="p">{</span>
<span class="k">struct</span> <span class="n">thread_node</span> <span class="o">*</span><span class="n">n</span> <span class="o">=</span> <span class="n">malloc</span><span class="p">(</span><span class="k">sizeof</span><span class="p">(</span><span class="k">struct</span> <span class="n">thread_node</span><span class="p">));</span>
<span class="n">n</span><span class="o">-></span><span class="n">id</span> <span class="o">=</span> <span class="n">pthread_self</span><span class="p">();</span>
<span class="n">n</span><span class="o">-></span><span class="n">flock</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
<span class="n">n</span><span class="o">-></span><span class="n">next</span> <span class="o">=</span> <span class="nb">NULL</span><span class="p">;</span>
<span class="k">if</span> <span class="p">(</span><span class="n">l</span><span class="o">-></span><span class="n">node</span> <span class="o">==</span> <span class="nb">NULL</span><span class="p">)</span>
<span class="n">l</span><span class="o">-></span><span class="n">node</span> <span class="o">=</span> <span class="n">n</span><span class="p">;</span>
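<span class="k">else</span> <span class="p">{</span>
<span class="cm">/* append at the tail so the earliest waiter is woken first (FIFO) */</span>
<span class="k">struct</span> <span class="n">thread_node</span> <span class="o">*</span><span class="n">tail</span> <span class="o">=</span> <span class="n">l</span><span class="o">-></span><span class="n">node</span><span class="p">;</span>
<span class="k">while</span> <span class="p">(</span><span class="n">tail</span><span class="o">-></span><span class="n">next</span> <span class="o">!=</span> <span class="nb">NULL</span><span class="p">)</span>
<span class="n">tail</span> <span class="o">=</span> <span class="n">tail</span><span class="o">-></span><span class="n">next</span><span class="p">;</span>
<span class="n">tail</span><span class="o">-></span><span class="n">next</span> <span class="o">=</span> <span class="n">n</span><span class="p">;</span>
<span class="p">}</span>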
<span class="n">futex</span><span class="p">(</span><span class="o">&</span><span class="n">n</span><span class="o">-></span><span class="n">flock</span><span class="p">,</span> <span class="n">FUTEX_WAIT</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="nb">NULL</span><span class="p">,</span> <span class="nb">NULL</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="kt">void</span> <span class="nf">unlock</span><span class="p">(</span><span class="k">struct</span> <span class="n">lock_t</span> <span class="o">*</span><span class="n">l</span><span class="p">)</span> <span class="p">{</span>
<span class="k">if</span> <span class="p">(</span><span class="n">l</span><span class="o">-></span><span class="n">node</span> <span class="o">!=</span> <span class="nb">NULL</span><span class="p">)</span> <span class="p">{</span>
<span class="k">struct</span> <span class="n">thread_node</span> <span class="o">*</span><span class="n">n</span> <span class="o">=</span> <span class="n">l</span><span class="o">-></span><span class="n">node</span><span class="p">;</span>
<span class="n">l</span><span class="o">-></span><span class="n">node</span> <span class="o">=</span> <span class="n">n</span><span class="o">-></span><span class="n">next</span><span class="p">;</span>
<span class="n">n</span><span class="o">-></span><span class="n">flock</span> <span class="o">=</span> <span class="mi">1</span><span class="p">;</span>
<span class="n">futex</span><span class="p">(</span><span class="o">&</span><span class="n">n</span><span class="o">-></span><span class="n">flock</span><span class="p">,</span> <span class="n">FUTEX_WAKE</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="nb">NULL</span><span class="p">,</span> <span class="nb">NULL</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>
<span class="n">free</span><span class="p">(</span><span class="n">n</span><span class="p">);</span>
<span class="p">}</span> <span class="k">else</span> <span class="p">{</span>
<span class="n">l</span><span class="o">-></span><span class="n">lock</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
<span class="p">}</span>
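<span class="kt">void</span> <span class="o">*</span><span class="nf">add_count</span><span class="p">(</span><span class="kt">void</span> <span class="o">*</span><span class="n">arg</span><span class="p">)</span> <span class="p">{</span>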
<span class="kt">int</span> <span class="n">i</span><span class="p">;</span>
<span class="k">for</span> <span class="p">(</span><span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o"><</span> <span class="mi">10000</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
<span class="n">lock</span><span class="p">(</span><span class="o">&</span><span class="n">_l</span><span class="p">);</span>
<span class="n">count</span><span class="o">++</span><span class="p">;</span>
<span class="n">printf</span><span class="p">(</span><span class="s">"%d</span><span class="se">\n</span><span class="s">"</span><span class="p">,</span> <span class="n">count</span><span class="p">);</span>
<span class="n">unlock</span><span class="p">(</span><span class="o">&</span><span class="n">_l</span><span class="p">);</span>
<span class="p">}</span>
<span class="k">return</span> <span class="nb">NULL</span><span class="p">;</span>
<span class="p">}</span>
<span class="kt">int</span> <span class="nf">main</span><span class="p">()</span> <span class="p">{</span>
<span class="n">count</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
<span class="n">pthread_t</span> <span class="n">t1</span><span class="p">,</span> <span class="n">t2</span><span class="p">,</span> <span class="n">t3</span><span class="p">,</span> <span class="n">t4</span><span class="p">,</span> <span class="n">t5</span><span class="p">;</span>
<span class="n">_l</span><span class="p">.</span><span class="n">node</span> <span class="o">=</span> <span class="nb">NULL</span><span class="p">;</span>
<span class="n">_l</span><span class="p">.</span><span class="n">lock</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
<span class="n">pthread_create</span><span class="p">(</span><span class="o">&</span><span class="n">t1</span><span class="p">,</span> <span class="nb">NULL</span><span class="p">,</span> <span class="n">add_count</span><span class="p">,</span> <span class="nb">NULL</span><span class="p">);</span>
<span class="n">pthread_create</span><span class="p">(</span><span class="o">&</span><span class="n">t2</span><span class="p">,</span> <span class="nb">NULL</span><span class="p">,</span> <span class="n">add_count</span><span class="p">,</span> <span class="nb">NULL</span><span class="p">);</span>
<span class="n">pthread_create</span><span class="p">(</span><span class="o">&</span><span class="n">t3</span><span class="p">,</span> <span class="nb">NULL</span><span class="p">,</span> <span class="n">add_count</span><span class="p">,</span> <span class="nb">NULL</span><span class="p">);</span>
<span class="n">pthread_create</span><span class="p">(</span><span class="o">&</span><span class="n">t4</span><span class="p">,</span> <span class="nb">NULL</span><span class="p">,</span> <span class="n">add_count</span><span class="p">,</span> <span class="nb">NULL</span><span class="p">);</span>
<span class="n">pthread_create</span><span class="p">(</span><span class="o">&</span><span class="n">t5</span><span class="p">,</span> <span class="nb">NULL</span><span class="p">,</span> <span class="n">add_count</span><span class="p">,</span> <span class="nb">NULL</span><span class="p">);</span>
<span class="n">pthread_join</span><span class="p">(</span><span class="n">t1</span><span class="p">,</span> <span class="nb">NULL</span><span class="p">);</span>
<span class="n">pthread_join</span><span class="p">(</span><span class="n">t2</span><span class="p">,</span> <span class="nb">NULL</span><span class="p">);</span>
<span class="n">pthread_join</span><span class="p">(</span><span class="n">t3</span><span class="p">,</span> <span class="nb">NULL</span><span class="p">);</span>
<span class="n">pthread_join</span><span class="p">(</span><span class="n">t4</span><span class="p">,</span> <span class="nb">NULL</span><span class="p">);</span>
<span class="n">pthread_join</span><span class="p">(</span><span class="n">t5</span><span class="p">,</span> <span class="nb">NULL</span><span class="p">);</span>
<span class="n">printf</span><span class="p">(</span><span class="s">"count is %d</span><span class="se">\n</span><span class="s">"</span><span class="p">,</span> <span class="n">count</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>
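<p>The list-based lock above modifies its waiter list without any synchronization, so it is best read as an illustration of the idea. As a hedged alternative, here is a minimal sketch of a sleeping lock that keeps all of its state in a single futex word (0 = free, 1 = locked, 2 = locked with possible waiters). It uses gcc's newer __atomic builtins instead of __sync. The names futex_call, mutex_lock, mutex_unlock and run_counter_demo are hypothetical, and the sketch does not guarantee FIFO wake-up order.</p>

```c
#define _GNU_SOURCE
#include <linux/futex.h>
#include <pthread.h>
#include <stddef.h>
#include <sys/syscall.h>
#include <unistd.h>

static int lock_word;     /* 0 = free, 1 = locked, 2 = locked, maybe waiters */
static int shared_count;  /* counter protected by lock_word */
static int demo_iters;    /* increments per thread */

/* Thin wrapper around the futex system call (WAIT/WAKE only). */
static long futex_call(int *uaddr, int op, int val) {
  return syscall(SYS_futex, uaddr, op, val, NULL, NULL, 0);
}

static void mutex_lock(int *w) {
  int c = 0;
  /* Fast path: 0 -> 1 with no system call. */
  if (__atomic_compare_exchange_n(w, &c, 1, 0, __ATOMIC_ACQUIRE,
                                  __ATOMIC_RELAXED))
    return;
  /* Slow path: c holds the value we saw (1 or 2). Mark the lock as
   * contended; if the old value was 0 we now own the lock. */
  if (c != 2)
    c = __atomic_exchange_n(w, 2, __ATOMIC_ACQUIRE);
  while (c != 0) {
    /* The kernel puts us to sleep only if *w is still 2. */
    futex_call(w, FUTEX_WAIT, 2);
    c = __atomic_exchange_n(w, 2, __ATOMIC_ACQUIRE);
  }
}

static void mutex_unlock(int *w) {
  /* Old value 1: nobody waited, the decrement already freed the lock. */
  if (__atomic_fetch_sub(w, 1, __ATOMIC_RELEASE) != 1) {
    __atomic_store_n(w, 0, __ATOMIC_RELEASE);
    futex_call(w, FUTEX_WAKE, 1); /* wake at most one sleeper */
  }
}

static void *worker(void *arg) {
  (void)arg;
  for (int i = 0; i < demo_iters; i++) {
    mutex_lock(&lock_word);
    shared_count++;
    mutex_unlock(&lock_word);
  }
  return NULL;
}

/* Runs up to 16 threads and returns the final counter value. */
int run_counter_demo(int nthreads, int iters) {
  pthread_t tids[16];
  if (nthreads > 16) nthreads = 16;
  lock_word = 0;
  shared_count = 0;
  demo_iters = iters;
  for (int i = 0; i < nthreads; i++)
    pthread_create(&tids[i], NULL, worker, NULL);
  for (int i = 0; i < nthreads; i++)
    pthread_join(tids[i], NULL);
  return shared_count;
}
```

<p>Calling run_counter_demo(4, 10000) from main should return 40000. The uncontended path never enters the kernel; a thread only sleeps in FUTEX_WAIT while the word is still 2, and re-checks the word after every wake-up.</p>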
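<p>futex is Linux's fast userspace mutex mechanism. We use it to put threads to sleep and to wake them up, and we keep the waiting threads in a queue that records their arrival order, which avoids starving any thread.</p>What problems can locks solveHow many bytes can a single write write at most2016-10-21T06:29:07+00:002016-10-21T06:29:07+00:00/kernel/2016/10/21/how-many-bytes-can-write-once<h3 id="背景">Background</h3>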
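<p>Recently I ran into a production bug: a process was writing an in-memory buffer of about 24 GB to disk, but no single write call ever wrote the full 24 GB. The retries then filled up the disk and made the service unavailable.</p>
<p>While fixing the bug I became very curious about how many bytes a single write can write at most. A test program showed that each call writes roughly 2 GB at most.</p>
<p>After reading some kernel code, everything became clear.</p>
<h3 id="write限制">Limits of write</h3>
<p>write may write fewer bytes than requested in the following three cases:</p>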
<ol>
<li>There is not enough space on the underlying physical medium</li>
<li>The RLIMIT_FSIZE limit is reached</li>
<li>The write is interrupted by a signal</li>
</ol>
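<p>Because of these cases, a caller that needs the whole buffer on disk has to continue from where the previous call stopped instead of starting over. A minimal sketch follows; write_all is a hypothetical helper introduced for illustration and is not part of the original code.</p>

```c
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: keep calling write() until all len bytes are written.
 * Returns len on success, or -1 with errno set on a real error. */
ssize_t write_all(int fd, const void *buf, size_t len) {
  const char *p = buf;  /* next byte to write */
  size_t left = len;    /* bytes still to write */
  while (left > 0) {
    ssize_t n = write(fd, p, left);
    if (n < 0) {
      if (errno == EINTR)
        continue;       /* interrupted before writing anything: retry */
      return -1;        /* e.g. ENOSPC or EFBIG: report, do not retry blindly */
    }
    p += n;             /* advance past the bytes that were accepted */
    left -= (size_t)n;
  }
  return (ssize_t)len;
}
```

<p>The important detail is that the pointer advances by the number of bytes actually written, so a short write never causes the same data to be written twice.</p>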
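<p>In the kernel code, when count is greater than MAX_RW_COUNT it is set to MAX_RW_COUNT.</p>
<p>MAX_RW_COUNT is a macro that expands to INT_MAX & PAGE_MASK.</p>
<p>INT_MAX is also a macro, expanding to ((int)(~0U&gt;&gt;1)): the unsigned value 0 is inverted, shifted right by one bit and cast to int, which gives 2^31-1.</p>
<p>PAGE_MASK is also a macro, expanding to (~(PAGE_SIZE-1)), and PAGE_SIZE expands to (_AC(1,UL) &lt;&lt; PAGE_SHIFT). PAGE_SHIFT is 12 here (x86 with 4K pages), so PAGE_SIZE is 1 shifted left by 12 bits, that is 2^12 bytes per page. PAGE_MASK is PAGE_SIZE-1 inverted, a mask that clears the low 12 bits.</p>
<p>So MAX_RW_COUNT, being INT_MAX & PAGE_MASK, is the largest int with its low 12 bits masked off, which keeps it aligned to 4K.</p>
<p>In theory, then, each write can write at most 2^31-2^12=2147479552 bytes, which matches the test results.</p>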
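<p>The same arithmetic can be reproduced in user space. This is a small sketch; max_rw_count is a hypothetical name, and the page size is passed in as a parameter rather than read from the kernel headers.</p>

```c
#include <limits.h>

/* Hypothetical helper mirroring INT_MAX & PAGE_MASK for a given page size.
 * page_size must be a power of two, e.g. 4096. */
long max_rw_count(long page_size) {
  long page_mask = ~(page_size - 1); /* clears the low bits, like PAGE_MASK */
  return (long)INT_MAX & page_mask;
}
```

<p>With a 4096-byte page, max_rw_count(4096) is 0x7FFFF000, that is 2147479552.</p>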
<h3 id="测试代码">Test code</h3>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <fcntl.h>
#include <sys/resource.h>
#include <signal.h>
#include <errno.h>
#include <string.h>
#include <unistd.h>
</span><span class="kt">void</span> <span class="nf">sigsegvHandler</span><span class="p">(</span><span class="kt">int</span> <span class="n">sig</span><span class="p">,</span> <span class="n">siginfo_t</span> <span class="o">*</span><span class="n">info</span><span class="p">,</span> <span class="kt">void</span> <span class="o">*</span><span class="n">secret</span><span class="p">)</span> <span class="p">{</span>
<span class="n">printf</span><span class="p">(</span><span class="s">"%d %p</span><span class="se">\n</span><span class="s">"</span><span class="p">,</span><span class="n">sig</span><span class="p">,</span><span class="n">secret</span><span class="p">);</span>
<span class="p">}</span>
<span class="kt">void</span> <span class="nf">sigtermHandler</span><span class="p">(</span><span class="kt">int</span> <span class="n">sig</span><span class="p">)</span> <span class="p">{</span>
<span class="n">printf</span><span class="p">(</span><span class="s">"%d</span><span class="se">\n</span><span class="s">"</span><span class="p">,</span><span class="n">sig</span><span class="p">);</span>
<span class="p">}</span>
<span class="kt">void</span> <span class="nf">setupSignalHandlers</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span> <span class="p">{</span>
<span class="k">struct</span> <span class="n">sigaction</span> <span class="n">act</span><span class="p">;</span>
<span class="n">sigemptyset</span><span class="p">(</span><span class="o">&</span><span class="n">act</span><span class="p">.</span><span class="n">sa_mask</span><span class="p">);</span>
<span class="n">act</span><span class="p">.</span><span class="n">sa_flags</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
<span class="n">act</span><span class="p">.</span><span class="n">sa_handler</span> <span class="o">=</span> <span class="n">sigtermHandler</span><span class="p">;</span>
<span class="n">sigaction</span><span class="p">(</span><span class="n">SIGTERM</span><span class="p">,</span> <span class="o">&</span><span class="n">act</span><span class="p">,</span> <span class="nb">NULL</span><span class="p">);</span>
<span class="n">sigemptyset</span><span class="p">(</span><span class="o">&</span><span class="n">act</span><span class="p">.</span><span class="n">sa_mask</span><span class="p">);</span>
<span class="n">act</span><span class="p">.</span><span class="n">sa_flags</span> <span class="o">=</span> <span class="n">SA_NODEFER</span> <span class="o">|</span> <span class="n">SA_RESETHAND</span> <span class="o">|</span> <span class="n">SA_SIGINFO</span><span class="p">;</span>
<span class="n">act</span><span class="p">.</span><span class="n">sa_sigaction</span> <span class="o">=</span> <span class="n">sigsegvHandler</span><span class="p">;</span>
<span class="n">sigaction</span><span class="p">(</span><span class="n">SIGSEGV</span><span class="p">,</span> <span class="o">&</span><span class="n">act</span><span class="p">,</span> <span class="nb">NULL</span><span class="p">);</span>
<span class="n">sigaction</span><span class="p">(</span><span class="n">SIGBUS</span><span class="p">,</span> <span class="o">&</span><span class="n">act</span><span class="p">,</span> <span class="nb">NULL</span><span class="p">);</span>
<span class="n">sigaction</span><span class="p">(</span><span class="n">SIGFPE</span><span class="p">,</span> <span class="o">&</span><span class="n">act</span><span class="p">,</span> <span class="nb">NULL</span><span class="p">);</span>
<span class="n">sigaction</span><span class="p">(</span><span class="n">SIGILL</span><span class="p">,</span> <span class="o">&</span><span class="n">act</span><span class="p">,</span> <span class="nb">NULL</span><span class="p">);</span>
<span class="k">return</span><span class="p">;</span>
<span class="p">}</span>
<span class="kt">int</span> <span class="nf">main</span><span class="p">(</span><span class="kt">int</span> <span class="n">argc</span><span class="p">,</span><span class="kt">char</span><span class="o">**</span> <span class="n">argv</span><span class="p">){</span>
<span class="n">signal</span><span class="p">(</span><span class="n">SIGHUP</span><span class="p">,</span> <span class="n">SIG_IGN</span><span class="p">);</span>
<span class="n">signal</span><span class="p">(</span><span class="n">SIGPIPE</span><span class="p">,</span> <span class="n">SIG_IGN</span><span class="p">);</span>
<span class="n">setupSignalHandlers</span><span class="p">();</span>
<span class="c1">//extern int errno;</span>
<span class="k">struct</span> <span class="n">rlimit</span> <span class="n">limit</span><span class="p">;</span>
<span class="n">limit</span><span class="p">.</span><span class="n">rlim_cur</span> <span class="o">=</span> <span class="n">RLIM_INFINITY</span><span class="p">;</span>
<span class="n">limit</span><span class="p">.</span><span class="n">rlim_max</span> <span class="o">=</span> <span class="n">RLIM_INFINITY</span><span class="p">;</span>
<span class="k">if</span><span class="p">(</span><span class="n">setrlimit</span><span class="p">(</span><span class="n">RLIMIT_FSIZE</span><span class="p">,</span><span class="o">&</span><span class="n">limit</span><span class="p">)){</span>
<span class="n">printf</span><span class="p">(</span><span class="s">"set limit failed</span><span class="se">\n</span><span class="s">"</span><span class="p">);</span>
<span class="p">}</span>
<span class="kt">long</span> <span class="kt">long</span> <span class="n">buff_size</span> <span class="o">=</span> <span class="n">atoll</span><span class="p">(</span><span class="n">argv</span><span class="p">[</span><span class="mi">1</span><span class="p">]);</span>
<span class="kt">char</span> <span class="o">*</span><span class="n">buff</span> <span class="o">=</span> <span class="n">malloc</span><span class="p">(</span><span class="n">buff_size</span><span class="o">*</span><span class="k">sizeof</span><span class="p">(</span><span class="kt">char</span><span class="p">));</span>
<span class="kt">int</span> <span class="n">fd</span> <span class="o">=</span> <span class="n">open</span><span class="p">(</span><span class="s">"io.dat"</span><span class="p">,</span> <span class="n">O_CREAT</span><span class="o">|</span><span class="n">O_WRONLY</span><span class="o">|</span><span class="n">O_TRUNC</span><span class="p">,</span> <span class="mo">0644</span><span class="p">);</span>
<span class="kt">ssize_t</span> <span class="n">ret</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
<span class="kt">long</span> <span class="kt">long</span> <span class="n">written</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="c1">// total bytes written so far</span>
<span class="k">while</span><span class="p">(</span><span class="n">buff_size</span><span class="o">></span><span class="mi">0</span><span class="p">){</span>
<span class="c1">// continue from where the previous write stopped</span>
<span class="n">ret</span> <span class="o">=</span> <span class="n">write</span><span class="p">(</span><span class="n">fd</span><span class="p">,</span><span class="n">buff</span><span class="o">+</span><span class="n">written</span><span class="p">,</span><span class="n">buff_size</span><span class="p">);</span>
<span class="n">printf</span><span class="p">(</span><span class="s">"written %zd bytes</span><span class="se">\n</span><span class="s">"</span><span class="p">,</span><span class="n">ret</span><span class="p">);</span>
<span class="k">if</span><span class="p">(</span><span class="n">ret</span><span class="o"><</span><span class="mi">0</span><span class="p">){</span>
<span class="n">printf</span><span class="p">(</span><span class="s">"write error: %s</span><span class="se">\n</span><span class="s">"</span><span class="p">,</span><span class="n">strerror</span><span class="p">(</span><span class="n">errno</span><span class="p">));</span>
<span class="k">return</span> <span class="mi">1</span><span class="p">;</span>
<span class="p">}</span>
<span class="n">written</span><span class="o">+=</span><span class="n">ret</span><span class="p">;</span>
<span class="n">buff_size</span><span class="o">-=</span><span class="n">ret</span><span class="p">;</span>
<span class="p">}</span>
<span class="n">printf</span><span class="p">(</span><span class="s">"wrote %lld bytes in total</span><span class="se">\n</span><span class="s">"</span><span class="p">,</span><span class="n">written</span><span class="p">);</span>
<span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>Background