Deep Web
Most writers these days do a significant part of their research using the World Wide Web, with the help of powerful search engines such as Google and Yahoo. There is so much information available that one could be forgiven for thinking that “everything” is accessible this way, but nothing could be further from the truth. For example, as of August 2005, Google claimed to have indexed 8.2 billion Web pages and 2.1 billion images. That sounds impressive, but it’s just the tip of the iceberg. Behold the deep Web.
According to Mike Bergman, chief technology officer at BrightPlanet Corp., more than 500 times as much information as traditional search engines “know about” is available in the deep Web. This massive store of information is locked up inside databases from which Web pages are generated in response to specific queries. Although these dynamic pages have a unique URL address with which they can be retrieved again, they are not persistent or stored as static pages, nor are there links to them from other pages.
The deep Web also includes sites that require registration or otherwise restrict access to their pages, prohibiting search engines from browsing them and creating cached copies.
Let’s recap how conventional search engines create their databases. Programs called spiders or Web crawlers start by reading pages from a starting list of Web sites. A spider first reads each page on a site, indexes its content, and adds the words it finds to the search engine’s growing database. When a spider finds a hyperlink to another page, it adds that new link to the list of pages to be indexed. In time, the program reaches all linked pages, presuming that the search engine doesn’t run out of time or storage space. These linked pages constitute what most of us use and refer to as the Internet or the Web. In fact, we have only scratched the surface, which is why this realm of information is often called the surface Web.
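The crawl loop described above is essentially a breadth-first traversal of a link graph. The sketch below illustrates it with a hypothetical in-memory `SITE` dictionary standing in for real HTTP fetches; the page names and contents are invented for illustration only:

```python
from collections import deque

# Hypothetical in-memory "web": URL -> (page text, outgoing links).
# A real spider would fetch these pages over HTTP instead.
SITE = {
    "/index": ("welcome home", ["/about", "/news"]),
    "/about": ("who we are", ["/index"]),
    "/news": ("latest stories", ["/archive"]),
    "/archive": ("old stories", []),
    "/hidden": ("no inbound links", []),  # unreachable by link-following
}

def crawl(start_urls):
    """Index every page reachable by following links from start_urls."""
    index = {}                    # word -> set of URLs containing it
    queue = deque(start_urls)     # frontier of pages still to visit
    seen = set(start_urls)
    while queue:
        url = queue.popleft()
        text, links = SITE[url]
        for word in text.split():
            index.setdefault(word, set()).add(url)
        for link in links:        # newly discovered links join the frontier
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return index, seen

index, visited = crawl(["/index"])
print(sorted(visited))  # "/hidden" never appears: it lies outside this spider's surface Web
```

Note how `/hidden` is never visited: with no inbound links from the starting list, it is invisible to the spider, exactly as the passage describes.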
Why don’t our search engines find the deeper information? For starters, let’s consider a typical data store that an individual or enterprise has collected, containing books, texts, articles, images, laboratory results and various other kinds of data in diverse formats. Typically we access such database information by means of a query or search: we type in the subject or keyword we’re looking for, the database retrieves the appropriate content, and we are shown a page of results to our query.
If we can do this easily, why can’t a search engine? We assume that the search engine can reach the query input (or search) page, and it will capture the text on that page and in any pages that may have static hyperlinks to it. But unlike the typical human user, the spider can’t know what words it should type into the query field. Clearly, it can’t type in every word it knows about, and it doesn’t know what’s relevant to that particular site or database. If there’s no easy way to query, the underlying data remains invisible to the search engine. Indeed, any pages that are not eventually connected by links from pages in a spider’s initial list will be invisible and thus are not part of the surface Web as that spider defines it.
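This asymmetry can be made concrete with a toy example: a query interface over a small record store, with all names and contents invented for illustration. A human who knows the right keyword reaches the generated page; a spider, which only follows static links, has no keyword to type and so never sees it:

```python
# Hypothetical records behind a search form; they exist only as query
# results, never as static, linked pages.
RECORDS = {
    "astronomy": "Catalog of 10,000 star charts",
    "geology": "Core-sample lab results, 1990-2004",
}

def run_query(keyword):
    """Generate a result page on the fly, as a dynamic site would."""
    hit = RECORDS.get(keyword.lower())
    return f"<html>{hit}</html>" if hit else "<html>No results</html>"

# A human who thinks to type "geology" reaches the deep content:
print(run_query("geology"))

# A spider crawling the search form sees only an empty input field.
# It cannot enumerate every possible keyword, so RECORDS stays invisible.
```

The dynamic result page even has a URL that could be bookmarked and revisited, yet it remains part of the deep Web because no static page links to it.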
How Deep? How Big?
According to a 2001 BrightPlanet study, the deep Web is very big indeed: The company found that the 60 largest deep Web sources contained 84 billion pages of content with about 750TB of information. These 60 sources constituted a resource 40 times larger than the surface Web. Today, BrightPlanet reckons the deep Web totals 7500TB, with more than 250,000 sites and 500 billion individual documents. And that’s just for Web sites in English or European character sets. (For comparison, remember that Google, the largest crawler-based search engine, now indexes some 8 billion pages.)
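The figures above also imply an estimate of the surface Web's size at the time of the study: if 750TB across the 60 largest deep sources was 40 times the surface Web, the surface Web held roughly 19TB. A quick arithmetic check of the quoted ratios:

```python
deep_top60_tb = 750     # 60 largest deep Web sources (2001 study)
ratio_vs_surface = 40   # those sources were 40x the surface Web
deep_total_tb = 7500    # BrightPlanet's later estimate for the whole deep Web

surface_tb = deep_top60_tb / ratio_vs_surface
print(f"Implied surface Web size: {surface_tb:.2f} TB")   # 18.75 TB

# The overall ratio works out to ~400x, the same order of magnitude as
# Bergman's "more than 500 times" figure quoted earlier.
print(f"Deep/surface ratio: {deep_total_tb / surface_tb:.0f}x")  # 400x
```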
The deep Web is getting deeper and bigger all the time. Two factors seem to account for this. First, newer data sources (especially those not in English) tend to be of the dynamic-query/searchable type, which are generally more useful than static pages. Second, governments at all levels around the world have made commitments to making their official documents and records available on the Web.
Interestingly, deep Web sites appear to receive 50% more monthly traffic than surface sites do, and they have more sites linked to them, even though they are not really known to the public. They are typically narrower in scope but likely to have deeper, more detailed content.
軟考備考資料免費(fèi)領(lǐng)取
去領(lǐng)取
共收錄117.93萬道題
已有25.02萬小伙伴參與做題