Editor's Recommendation
冬瓜哥's devotion to technology has reached the point of obsession. Compared with ten years ago, his writing and analysis are more incisive and his technical understanding more precise. Every article on his WeChat public account is a weathervane for the storage industry.
About the Book
The book is organized into chapters on flexible data layout; application awareness and visualized storage intelligence; storage-class chips; gleanings from the sea of storage; clusters and multiple controllers; traditional storage systems; emerging storage systems; a plain talk on optical storage systems; system architecture; the I/O protocol stack and performance analysis; storage software; and solid-state storage. Each chapter contains multiple sections, and every section is a self-contained topic. In the author's consistent style, the book is written entirely from the reader's point of view, in language that is vivid and thorough and remarkably wide in scope. Beyond storage technology itself, it also weaves in interpretations of computer system technology and grid technology, opening readers' eyes, clearing up long-standing confusions, and sparking their interest in reading on.
This book is suitable for everyone working in the storage field to read and study, and it also serves as an extended, more advanced companion resource for readers of 《大话存储(终极版)》.
About the Author
冬瓜哥 (Zhang Dong) is currently a system architect at a semiconductor company and the author of the 《大话存储》 series. He is a technical expert and evangelist in the storage field.
Table of Contents
Chapter 1  Flexible Data Layout
1.1  Raid1.0 and Raid1.5
1.2  Raid5EE and Raid2.0
1.3  Lun2.0/SmartMotion

Chapter 2  Application Awareness and Visualized Storage Intelligence
2.1  Application-aware, fine-grained automatic storage tiering
2.2  Application-aware, fine-grained SmartMotion
2.3  Application-aware, fine-grained QoS
2.4  Productization and visual presentation
2.5  Packaging up concepts and crafting the PPT
2.6  A look at Inspur's "active" storage concept

Chapter 3  Storage-Class Chips
3.1  Channel and Raid controller architecture
3.2  SAS Expander architecture

Chapter 4  Gleanings from the Sea of Storage
4.1  Two classy storage devices you would never expect
4.2  What is inside a JBOD
4.3  The woes of the Raid4 dedicated parity disk
4.4  Why a Raid card is really a small computer
4.5  Why Raid card batteries were replaced by supercapacitors
4.6  What exactly separates firmware from microcode
4.7  Is an FC loop hub really a ring inside?
4.8  Why SAS and FC burden the CPU less than TCP/IP over Ethernet
4.9  What traffic actually crosses the heartbeat link between dual-controller arrays

Chapter 5  Clusters and Multiple Controllers
5.1  A brief look at active-active and multipathing
5.2  A "brief" look at disaster recovery and active-active data centers (Part 1)
5.3  A "brief" look at disaster recovery and active-active data centers (Part 2)
5.4  An in-depth illustrated review of how cluster file system architectures evolved
5.5  From multi-controller cache management to cluster locks
5.6  Shared versus distributed, discussed in turn
5.7  "冬瓜哥 draws a PPT": active-active is a trap

Chapter 6  Traditional Storage Systems
6.1  Some basic topics around storage systems
6.2  Tales from the high-end storage arena!
6.3  Surprised? This is how high-end storage architecture actually evolved!
6.4  How traditional high-end storage kills three birds with one stone by centralizing the data cache outside the controllers
6.5  Traditional external storage is approaching dusk
6.6  Old guard versus newcomers in the storage world
6.7  Traditional storage grows old; can emerging storage shoulder the load?

Chapter 7  Next-Generation Storage Systems
7.1  Playing next-generation storage with an old rifle
7.2  The next-generation storage system with the most traditional-storage flavor
7.3  The next-generation storage system best suited to large-scale data centers
7.4  The highest-performing next-generation storage system
7.5  The most application-aware next-generation storage system
7.6  The next-generation storage system with the most flexible data management

Chapter 8  Optical Storage Systems
8.1  Basic principles of optical storage
8.2  The mysterious laser pickup and Blu-ray technology
8.3  Dissecting a Blu-ray storage system
8.4  The optical storage ecosystem
8.5  Looking back at the present from the future

Chapter 9  System Architecture
9.1  A plain talk on many-core processor architecture
9.2  A tribute to Loongson! 冬瓜哥 hand-designed a CPU instruction decoder!
9.3  The NUMA architecture lands for the first time in the InCloudRack cabinet
9.4  A look at Macrosan (宏杉科技) and its CloudSAN architecture
9.5  Who knew memory could be played like this?!
9.6  PCIe switching: what on earth is it?
9.7  A chat about FPGA/GPCPU/PCIe/Cache-Coherency
9.8  [Primer] How does a supercomputer actually compute?

Chapter 10  The I/O Protocol Stack and Performance Analysis
10.1  The most complete summary of storage system interfaces, protocols, and connection methods
10.2  Research trends at the frontier of the I/O protocol stack
10.3  What is the right Stripe Size for a Raid group?
10.4  Concurrent I/O: the foundation of system performance!
10.5  How long have you been misled about I/O latency?
10.6  How do you measure the concurrency along an entire I/O path?
10.7  What exactly links queue depth, latency, concurrency, and throughput
10.8  Why Raid gives no speedup at all in some scenarios
10.9  Why performance shines in testing yet collapses in production
10.10  What happens when the queue depth is too shallow?
10.11  What queue depth is ideal?
10.12  Why does a mechanical disk's average random I/O latency dip transiently?
10.13  How exactly does data layout affect performance?
10.14  Misconceptions about synchronous I/O versus blocking I/O
10.15  Atomic writes: what on earth are they?!
10.16  Why not build a USB Target?
10.17  One of 冬瓜哥's new storage technology patents has been officially granted
10.18  A quick walk through the lower layers of iSCSI
10.19  A brief look at FC's four Login stages

Chapter 11  Storage Software
11.1  Thin provisioning is a trap: use it and regret it!
11.2  The evolution of storage system operating systems

Chapter 12  Solid-State Storage
12.1  How solid-state media are used inside storage systems
12.2  Misconceptions about SSD metadata and power-loss protection
12.3  Misconceptions about Host Based versus Device Based flash FTLs
12.4  On SSD HMB and CMB
12.5  同有科技 (Toyou) spreads its wings and returns
12.6  Crosstalk with Old Tang on SSD performance testing: the "jade" installment
12.7  How exactly should Raid be built on solid-state drives?
12.8  When Raid2.0 meets all-flash storage
12.9  Upper/lower pages, fast/slow pages, MSB/LSB: what are they all about?
12.10  The steps involved when writing 0 to the MSB/LSB
Excerpt
1.1 Raid1.0 and Raid1.5
In the era of mechanical disks, only two fundamental factors ultimately determined I/O performance. One sits at the top of the stack: the way applications issue I/O and the properties of that I/O. The other sits at the bottom: the form and state in which the data is ultimately laid out across some number of mechanical disks. How applications issue I/O is entirely outside the storage system's control, so there is little a storage system can do about performance from that end. But how the data is organized and laid out is absolutely one of the storage system's most important jobs.

That layout has been evolving ever since Raid was born. Take the simplest example, the progression from Raid3 to Raid4 to Raid5. Raid3 was designed to maximize throughput for single-threaded, large, sequential I/O. To achieve that, its stripes were extremely narrow, so narrow that the target address of almost every I/O issued from above landed on all of the disks. Nearly every I/O therefore had multiple disks reading or writing in parallel on its behalf, while the other I/Os had to wait. That is why we say that under Raid3 the I/Os from the upper layers cannot run concurrently with one another, although a single I/O can be served concurrently by multiple disks. So if the system has only one thread (or user, program, workload), and that thread issues large sequential I/O chasing throughput, Raid3 fits very well. Most workloads are not like that: they want the upper-layer I/Os to execute in parallel as much as possible, for example I/Os issued by multiple threads or users being serviced concurrently. In that case the stripe has to be enlarged to a suitable value, so that one I/O's target address range no longer drags every disk in the Raid group into serving it; there is then a reasonable chance that one set of disks can service several I/Os at the same time, and the more disks there are, the higher that chance. Raid4 is essentially Raid3 with an adjustable stripe, but its dedicated parity disk not only became a failure-prone hot spot, it also throttled I/Os that could otherwise have run concurrently: every I/O must update the parity block of its stripe, and since all the parity blocks live on that single disk, the upper-layer I/Os can only execute one after another rather than in parallel. Raid5 then scattered the parity blocks across all the disks in the group, finally enabling concurrent I/O. Most storage vendors let you configure the stripe width, say from 32KB to 128KB. Suppose an I/O reads 16KB on an 8-disk Raid5 group. With a 32KB stripe, each disk's segment is 4KB, so this I/O occupies at least 4 disks; assuming a 100% concurrency hit rate, the group can serve two such 16KB I/Os, or eight 4KB I/Os, in parallel. Widen the stripe to 128KB and, under the same 100% assumption, it can serve eight I/Os of 16KB or less in parallel.
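To make that arithmetic concrete, here is a minimal Python sketch (a hypothetical helper, assuming the quoted stripe width is the full-stripe size divided evenly across the disks, that I/Os are aligned to segment boundaries, and ignoring parity placement); it reproduces the cases in the text:

    import math

    def raid_concurrency(num_disks, stripe_width_kb, io_size_kb):
        # Per-disk segment: the full-stripe width divided across all disks
        # (simplification: parity rotation and alignment effects ignored).
        segment_kb = stripe_width_kb / num_disks
        disks_per_io = max(1, math.ceil(io_size_kb / segment_kb))
        # Best case: how many such I/Os the group can serve at once.
        return segment_kb, num_disks // disks_per_io

    print(raid_concurrency(8, 32, 16))    # (4.0, 2)  -> two 16KB I/Os
    print(raid_concurrency(8, 32, 4))     # (4.0, 8)  -> eight 4KB I/Os
    print(raid_concurrency(8, 128, 16))   # (16.0, 8) -> eight I/Os of <=16KB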
We can already see that merely tuning the stripe width and optimizing the parity layout produces markedly different performance. But however much you tweak, I/O performance remains confined to the pitifully few disks in a Raid group, a handful or a dozen or so. Why only that many? Couldn't you build one huge Raid5 group out of 100 disks and create every logical volume on top of it to boost each volume's performance? You would not choose to do that. The moment a disk fails and the system has to rebuild, you will regret the decision, because the whole system's performance drops sharply and no logical volume is spared: 99 disks are reading data at full speed while the system computes the xor parity and writes it to the hot spare. You can, of course, throttle the rebuild to relieve the online I/O, but the price is a longer rebuild window, and if another disk fails within that window, all the data is gone. So the failure domain has to be kept small, which is why a Raid group is best limited to a handful or a dozen disks. That is awkward, so people came up with a workaround: concatenate several small Raid5/6 groups into one big Raid0, namely Raid50/60, and spread the logical volumes across it, as the sketch below illustrates. Vendors, having largely run out of new tricks, like to package this big Raid50/60 as a "Pool", which misleads some people into believing that storage is innovating again and still bursting with vitality.
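As a rough illustration of that workaround, the following is a hypothetical address-mapping sketch (not any vendor's actual Raid50 layout, and ignoring parity rotation and hot spares): a logical address is first striped Raid0-style across the small Raid5 groups, then across the disks inside the chosen group.

    def raid50_locate(lba_kb, disks_per_group, num_groups, segment_kb=64):
        # Raid0 across the small Raid5 groups, plain striping inside each group;
        # parity rotation and hot spares are ignored for brevity.
        data_per_group_stripe = segment_kb * (disks_per_group - 1)
        group = (lba_kb // data_per_group_stripe) % num_groups
        disk_in_group = (lba_kb % data_per_group_stripe) // segment_kb
        return group, disk_in_group

    # Ten 10-disk Raid5 groups concatenated into one pool:
    for lba in (0, 256, 512, 10_000):
        print(lba, raid50_locate(lba, disks_per_group=10, num_groups=10))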
So 冬瓜哥 might as well go with the flow and do a little hyping of his own: call the traditional Raid group Raid1.0, and call Raid50/60 Raid1.5. There is a cyclical, spiraling pattern here. Early on, with few disks, performance for different workloads was tuned mainly through stripe width; later people realized: why not just use Raid50 and spread the data directly across hundreds of disks? Wouldn't that be splendid? The concurrent upper-layer I/O threads could then fan out massively at the bottom and reach very high throughput. At that point, people were carried away by the success, and nobody stopped to think about another frightening problem.
Even as these words were being put to paper, nobody had thought it through, at least judging by vendors' product moves. The reason is probably another shift happening underneath: solid-state media. The wheels at the bottom keep accelerating while the forms at the top go around in circles, and sometimes the upper layer leaps forward, skipping a form that should have existed in between. That form may be fleeting, or may never appear at all, but someone will always strike a spark, however faint.
That frightening problem was in fact overshadowed by an even more frightening one: rebuild times that are far too long. Take a 4TB SATA disk. Even writing at full speed, its rotational speed caps its sustained throughput at roughly 80MB/s, and a quick calculation shows the rebuild would take about 14 hours. In practice, to protect online workloads, the rebuild is usually throttled to a medium rate of about 40MB/s, which stretches it to roughly 28 hours, more than a full day and night, and the figure only grows as disks get bigger. I would bet no system administrator sleeps well during that window.
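A back-of-the-envelope sketch of that estimate (illustrative figures only; real rebuilds also contend with foreground I/O and controller overhead):

    def rebuild_hours(capacity_tb, write_mb_per_s):
        # Lower-bound estimate: the spare must absorb the whole capacity
        # at the given sustained write rate (decimal TB and MB assumed).
        return capacity_tb * 1_000_000 / write_mb_per_s / 3600

    print(round(rebuild_hours(4, 80), 1))   # ~13.9 hours at full speed
    print(round(rebuild_hours(4, 40), 1))   # ~27.8 hours when throttled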
1.2 Raid5EE and Raid2.0
Some twenty years ago, someone invented a technique called Raid5EE with two goals: first, put the hot spare disk, which otherwise sits around idle, to work; second, speed up the rebuild.
Clearly, if the space of the hot spare disk, marked "H (hot spare)" in the figure below, is scattered across all the disks in the same way as the parity, the layout on the right of the figure results, with every P block followed by an H block. The Raid group then has one more disk doing useful work than before. And since the H space is also scattered, the rebuild after a disk failure ought to be faster, because multiple disks can now absorb the writes in parallel. In reality it is not, because the system's rebuild speed was never limited by that single hot spare disk; it is limited by all the disks together. For the hot spare to absorb the rebuilt data at full speed, every other disk must read its data at full speed so the system can xor it. Scattering the hot spare, or even replacing it with an SSD or with RAM, changes nothing in the result.
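That limit is easy to see in a toy reconstruction (a sketch of the xor principle only, not any vendor's implementation): to regenerate the lost block, every surviving block of the stripe must first be read and xored, so the spare's write speed is never the only bottleneck.

    from functools import reduce

    def xor_blocks(blocks):
        # XOR equal-length byte blocks together.
        return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), blocks)

    # A toy 4+1 stripe: four data blocks plus one parity block.
    data = [bytes([v]) * 8 for v in (1, 2, 3, 4)]
    parity = xor_blocks(data)

    # Disk 2 fails: regenerating its block needs ALL surviving blocks read first.
    survivors = [data[0], data[1], data[3], parity]
    assert xor_blocks(survivors) == data[2]
    print("rebuilt block:", xor_blocks(survivors).hex())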
So how can the rebuild actually be accelerated? The only way is what the figure below shows: take the stripes that used to be packed onto 5 disks and scatter them horizontally across many more disks. Note that the scattering must be done at stripe granularity; scattering a single disk achieves nothing. Only then does the rebuild speed scale up by multiples.
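A minimal sketch of why stripe-granularity scattering helps (a hypothetical random layout, not the actual Raid2.0 placement algorithm): with whole stripes spread over a large pool, almost every disk holds a surviving member of some affected stripe and can contribute to the rebuild in parallel.

    import random

    def place_stripes(num_disks, num_stripes, width):
        # Hypothetical declustered layout: each stripe's members land on
        # `width` randomly chosen disks out of the whole pool.
        return [random.sample(range(num_disks), width) for _ in range(num_stripes)]

    random.seed(0)
    layout = place_stripes(num_disks=100, num_stripes=1000, width=5)
    failed = 7

    # Every disk holding a surviving member of an affected stripe joins the
    # rebuild, so reads (and the rebuilt writes) fan out across the pool
    # instead of funneling into one hot spare.
    helpers = {d for stripe in layout if failed in stripe for d in stripe if d != failed}
    print(len(helpers), "disks share the rebuild work")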
Preface
In the blink of an eye, eight years have passed since 《大话存储》 was published. Over those eight years, 冬瓜哥 has kept learning, accumulating, and writing, and in May 2015 he created the WeChat public account "大话存储" to keep summing up and publishing knowledge about storage systems of every kind, all of it original. This book is a reworked collection of what 冬瓜哥 has written over the past year or so, with an extra 30% of never-before-published material added specifically for it.
If the 《大话存储》 series is a systematic novel about the inner workings of storage systems, then this book is more like a collection of essays: loose in form but unified in spirit, moving freely between the lower and upper layers of the storage and computing worlds. Each essay covers a particular field, topic, or technology and develops the discussion around it. 冬瓜哥 has grouped the book into twelve technical areas, each containing several related articles.
Some of the articles include my hand-drawn figures. To keep the original flavor, I decided to leave them as they are; my apologies if they offend your aesthetic sensibilities.
Reading this book assumes some familiarity with storage systems, ideally quite a lot; otherwise it will feel like hard going. Hard going is not a bad thing, though, for it means there is room to grow, so go get a copy of 《大话存储终极版》, read the main story first, and then come back to this sequel. When 冬瓜哥 first read some of these documents, it was hard going for him too, but it always felt interesting, so he stuck with it.
Some may wonder whether a 《大话存储外传》 will follow. Well, perhaps; let it come naturally!
冬瓜哥