Bulletproof Storage
Disk systems will repair themselves or can be left unrepaired for years.
You can fly a two-engine plane with one engine, but how many passengers would want to be on it?
That’s the idea behind “bulletproof storage,” a concept that IBM has been developing for two years and plans to begin unveiling incrementally over the next one to three years.
IBM’s technology initiative deals with fault tolerance in every part of a storage system: disk, controller, network cards, power supplies and software. By building more-robust storage systems that can defer replacement of failed parts for up to three years because of redundant components, IBM believes it can also eliminate many human errors that happen when failing components are replaced.
According to Stanley Zaffos, an analyst at Gartner Inc., the bulletproof storage concept is still five to 10 years away from being broadly embraced by users. But once it is, storage systems will require less maintenance and, therefore, cost less to maintain.
“We know how to build very reliable code. We use appliances every day that have software built into them that work forever: your automobile, your calculator, the disk drive in your PC, your telephone,” Zaffos says.
But IBM is looking to attack far more complex systems than telephones or calculators.
Under its bulletproof initiative, IBM is addressing disk-sector failures that grow along with disk capacity. While disk capacities double every 12 to 18 months, uncorrectable read/write error rates haven’t improved, nor has the probability of an uncorrectable error occurring on a disk read decreased. There are more sectors on today’s disks and, therefore, a greater chance of an uncorrectable error.
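The scaling argument above can be made concrete with a little arithmetic. The following is only a rough sketch: the per-bit uncorrectable error rate of 1e-14 is an assumed, illustrative figure (it does not come from IBM or from this article), errors are assumed to be independent, and the function name is hypothetical.

```python
import math


def p_unrecoverable_read(capacity_bytes: float, bit_error_rate: float = 1e-14) -> float:
    """Chance of at least one uncorrectable error when reading a whole disk.

    Assumes independent bit errors at a fixed rate (an illustrative model,
    not a vendor specification): P = 1 - (1 - rate) ** bits_read.
    """
    bits_read = capacity_bytes * 8
    # expm1/log1p keep the result accurate when the rate is tiny.
    return -math.expm1(bits_read * math.log1p(-bit_error_rate))


# The error rate stays fixed while capacity doubles, so the risk per full read grows.
for capacity_tb in (1, 2, 4, 8):
    risk = p_unrecoverable_read(capacity_tb * 1e12)
    print(f"{capacity_tb} TB full read: {risk:.1%} chance of an uncorrectable error")
```

Under these assumptions, each doubling of capacity roughly doubles the chance of hitting an uncorrectable sector during a full read, at least while that chance remains small.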
The answer is to create self-healing capabilities for storage management software and more-robust RAID configurations.
IBM says that in about a year it will release storage systems that can survive three simultaneous disk-drive failures in a single array. It plans to do so by introducing additional parity disks into RAID configurations, offering many times the resiliency of a RAID configuration with two parity disks. Today, standard systems tolerate only two disk failures.
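To get a feel for why tolerating one more failure matters, the sketch below compares the chance of losing an array that tolerates two failures with one that tolerates three. It is a simplified model, not IBM's analysis: disk failures are assumed to be independent, the array size and per-disk failure probability are made-up illustrative inputs, and the function name is hypothetical.

```python
from math import comb


def p_array_loss(n_disks: int, tolerated_failures: int, p_disk_fail: float) -> float:
    """Probability that more disks fail than the array can tolerate.

    Binomial model: each of n_disks fails independently with probability
    p_disk_fail during the period in which no repairs are made.
    """
    return sum(
        comb(n_disks, k) * p_disk_fail**k * (1 - p_disk_fail) ** (n_disks - k)
        for k in range(tolerated_failures + 1, n_disks + 1)
    )


# Illustrative inputs only: 16 disks, 5% chance that any one disk fails
# while replacement is deferred.
two_parity = p_array_loss(16, 2, 0.05)
three_parity = p_array_loss(16, 3, 0.05)
print(f"tolerates 2 failures: {two_parity:.4%} chance of data loss")
print(f"tolerates 3 failures: {three_parity:.4%} chance of data loss")
```

Real arrays see correlated failures and rebuild-time effects that this model ignores, so the numbers show only the direction of the effect.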
But Zaffos argues that 80% of downtime today is caused by user error and software failures, not hardware failures. He says that the failures resulting from software are created by complexity and that there is an almost infinite number of failures that can occur in a complex system.
IBM is addressing those code failures with a software project called N-Version Programming, in which two independently written pieces of code in the same application each save the data and then compare the results to ensure that there are no errors.
In N-Version Programming, two copies of data are protected using different means. One copy might be protected by standard RAID-5 programming coded by Programmer A.
The second copy is protected by a different algorithm coded by Programmer B. That way, if the first copy gets corrupted due to a particular bug in the program written by Programmer A, then the second copy can be used.
The second copy may have its own bugs, but they will manifest in different ways at different times, and when they do, the first copy will be the good one that you can use. It’s like having a second person check the work of the first and fix any mistakes he or she finds.
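The two-programmer idea described above can be sketched in a few lines. This is an illustration only, not IBM's code: instead of RAID-5 and a second parity scheme, it swaps in two simple integrity checks (CRC-32 for "Programmer A" and SHA-256 for "Programmer B"), and every function name here is hypothetical.

```python
import hashlib
import zlib
from typing import Optional


def store_a(data: bytes) -> bytes:
    """Version A: append a 4-byte CRC-32 of the data."""
    return data + zlib.crc32(data).to_bytes(4, "big")


def load_a(blob: bytes) -> Optional[bytes]:
    """Return the data if version A's check passes, else None."""
    if len(blob) < 4:
        return None
    data, tag = blob[:-4], blob[-4:]
    return data if zlib.crc32(data).to_bytes(4, "big") == tag else None


def store_b(data: bytes) -> bytes:
    """Version B: prefix the data with its 32-byte SHA-256 digest."""
    return hashlib.sha256(data).digest() + data


def load_b(blob: bytes) -> Optional[bytes]:
    """Return the data if version B's check passes, else None."""
    if len(blob) < 32:
        return None
    tag, data = blob[:32], blob[32:]
    return data if hashlib.sha256(data).digest() == tag else None


def read_n_version(blob_a: bytes, blob_b: bytes) -> bytes:
    """Read both copies and fall back to whichever one still verifies."""
    a, b = load_a(blob_a), load_b(blob_b)
    if a is not None and b is not None:
        if a != b:
            raise ValueError("both copies verify but disagree")
        return a
    if a is not None:
        return a
    if b is not None:
        return b
    raise ValueError("both copies are corrupted")


data = b"payroll records"
copy_a, copy_b = store_a(data), store_b(data)
damaged_a = b"X" + copy_a[1:]  # simulate a bug corrupting version A's copy
print(read_n_version(damaged_a, copy_b) == data)
```

The design point is that the two versions share no code, so a bug that damages one copy is unlikely to damage the other in the same way at the same time.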
One way IBM plans to detect and correct corrupted data is to create more-resilient storage software with repairable data structures. The code checks that certain conditions, which are described in rules, are met. For example, in a file system with multiple files, the sum of the space taken by the files plus the free space in the system must be equal to the total available space. The code will check this property automatically at various times and use a procedure to repair and fix problems if the property isn’t met.
In this case, the software checks neither the code itself to see that it’s functioning properly nor the contents of the data. It checks only the properties of the data structures, and if certain properties aren’t met, the software knows how to fix those structures.
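The file-system rule described above (space used by files plus free space must equal total space) lends itself to a small sketch. The class, field names and repair policy below are assumptions made for illustration; the article does not say how IBM's software actually repairs a violated property.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Volume:
    """Toy file-system bookkeeping with one checkable rule."""

    total_blocks: int
    files: Dict[str, int] = field(default_factory=dict)  # file name -> blocks used
    free_blocks: int = 0

    def used_blocks(self) -> int:
        return sum(self.files.values())

    def invariant_holds(self) -> bool:
        # The rule from the text: used space + free space == total space.
        return self.used_blocks() + self.free_blocks == self.total_blocks

    def check_and_repair(self) -> bool:
        """Check the rule; repair if it is violated. Returns True if repaired."""
        if self.invariant_holds():
            return False
        if self.used_blocks() > self.total_blocks:
            raise ValueError("file sizes exceed the volume; cannot repair by this rule")
        # Assumed repair policy: trust the per-file sizes and rebuild the free counter.
        self.free_blocks = self.total_blocks - self.used_blocks()
        return True


vol = Volume(total_blocks=100, files={"a.log": 30, "b.db": 50}, free_blocks=20)
vol.free_blocks = 35  # simulate a corrupted free-space counter
print(vol.check_and_repair(), vol.free_blocks)
```

Note that the check never inspects file contents or the code that wrote them; it looks only at whether the bookkeeping is self-consistent, which matches the distinction the article draws.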
But don’t expect to see fruit from N-Version Programming or checkable data structures for another two to three years.