Deep Web
Most writers these days do a significant part of their research using the World Wide Web, with the help of powerful search engines such as Google and Yahoo. There is so much information available that one could be forgiven for thinking that “everything” is accessible this way, but nothing could be further from the truth. For example, as of August 2005, Google claimed to have indexed 8.2 billion Web pages and 2.1 billion images. That sounds impressive, but it’s just the tip of the iceberg. Behold the deep Web.
According to Mike Bergman, chief technology officer at BrightPlanet Corp., more than 500 times as much information as traditional search engines “know about” is available in the deep Web. This massive store of information is locked up inside databases from which Web pages are generated in response to specific queries. Although these dynamic pages have a unique URL address with which they can be retrieved again, they are not persistent or stored as static pages, nor are there links to them from other pages.
The deep Web also includes sites that require registration or otherwise restrict access to their pages, prohibiting search engines from browsing them and creating cached copies.
Let’s recap how conventional search engines create their databases. Programs called spiders or Web crawlers start by reading pages from a starting list of Web sites. A spider first reads each page on a site, indexes its content, and adds the words it finds to the search engine’s growing database. When a spider finds a hyperlink to another page, it adds that new link to the list of pages to be indexed. In time, the program reaches all linked pages, presuming that the search engine doesn’t run out of time or storage space. These linked pages constitute what most of us use and refer to as the Internet or the Web. In fact, we have only scratched the surface, which is why this realm of information is often called the surface Web.
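The crawl loop described above is essentially a breadth-first traversal of a link graph. The sketch below illustrates it with a hypothetical in-memory `SITE` dictionary standing in for real HTTP fetches; the page names and contents are invented for illustration only:

```python
from collections import deque

# Hypothetical in-memory "web": URL -> (page text, outgoing links).
# A real spider would fetch these pages over HTTP instead.
SITE = {
    "/index": ("welcome home", ["/about", "/news"]),
    "/about": ("who we are", ["/index"]),
    "/news": ("latest stories", ["/archive"]),
    "/archive": ("old stories", []),
    "/hidden": ("no inbound links", []),  # unreachable by link-following
}

def crawl(start_urls):
    """Index every page reachable by following links from start_urls."""
    index = {}                    # word -> set of URLs containing it
    queue = deque(start_urls)     # frontier of pages still to visit
    seen = set(start_urls)
    while queue:
        url = queue.popleft()
        text, links = SITE[url]
        for word in text.split():
            index.setdefault(word, set()).add(url)
        for link in links:        # newly discovered links join the frontier
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return index, seen

index, visited = crawl(["/index"])
print(sorted(visited))  # "/hidden" never appears: it lies outside this spider's surface Web
```

Note how `/hidden` is never visited: with no inbound links from the starting list, it is invisible to the spider, exactly as the passage describes.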
Why don’t our search engines find the deeper information? For starters, let’s consider a typical data store that an individual or enterprise has collected, containing books, texts, articles, images, laboratory results and various other kinds of data in diverse formats. Typically we access such database information by means of a query or search: we type in the subject or keyword we’re looking for, the database retrieves the appropriate content, and we are shown a page of results to our query.
If we can do this easily, why can’t a search engine? We assume that the search engine can reach the query input (or search) page, and it will capture the text on that page and in any pages that may have static hyperlinks to it. But unlike the typical human user, the spider can’t know what words it should type into the query field. Clearly, it can’t type in every word it knows about, and it doesn’t know what’s relevant to that particular site or database. If there’s no easy way to query, the underlying data remains invisible to the search engine. Indeed, any pages that are not eventually connected by links from pages in a spider’s initial list will be invisible and thus are not part of the surface Web as that spider defines it.
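This asymmetry can be made concrete with a toy example: a query interface over a small record store, with all names and contents invented for illustration. A human who knows the right keyword reaches the generated page; a spider, which only follows static links, has no keyword to type and so never sees it:

```python
# Hypothetical records behind a search form; they exist only as query
# results, never as static, linked pages.
RECORDS = {
    "astronomy": "Catalog of 10,000 star charts",
    "geology": "Core-sample lab results, 1990-2004",
}

def run_query(keyword):
    """Generate a result page on the fly, as a dynamic site would."""
    hit = RECORDS.get(keyword.lower())
    return f"<html>{hit}</html>" if hit else "<html>No results</html>"

# A human who thinks to type "geology" reaches the deep content:
print(run_query("geology"))

# A spider crawling the search form sees only an empty input field.
# It cannot enumerate every possible keyword, so RECORDS stays invisible.
```

The dynamic result page even has a URL that could be bookmarked and revisited, yet it remains part of the deep Web because no static page links to it.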
How Deep? How Big?
According to a 2001 BrightPlanet study, the deep Web is very big indeed: The company found that the 60 largest deep Web sources contained 84 billion pages of content with about 750TB of information. These 60 sources constituted a resource 40 times larger than the surface Web. Today, BrightPlanet reckons the deep Web totals 7500TB, with more than 250,000 sites and 500 billion individual documents. And that’s just for Web sites in English or European character sets. (For comparison, remember that Google, the largest crawler-based search engine, now indexes some 8 billion pages.)
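The figures above also imply an estimate of the surface Web's size at the time of the study: if 750TB across the 60 largest deep sources was 40 times the surface Web, the surface Web held roughly 19TB. A quick arithmetic check of the quoted ratios:

```python
deep_top60_tb = 750     # 60 largest deep Web sources (2001 study)
ratio_vs_surface = 40   # those sources were 40x the surface Web
deep_total_tb = 7500    # BrightPlanet's later estimate for the whole deep Web

surface_tb = deep_top60_tb / ratio_vs_surface
print(f"Implied surface Web size: {surface_tb:.2f} TB")   # 18.75 TB

# The overall ratio works out to ~400x, the same order of magnitude as
# Bergman's "more than 500 times" figure quoted earlier.
print(f"Deep/surface ratio: {deep_total_tb / surface_tb:.0f}x")  # 400x
```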
The deep Web is getting deeper and bigger all the time. Two factors seem to account for this. First, newer data sources (especially those not in English) tend to be of the dynamic-query/searchable type, which are generally more useful than static pages. Second, governments at all levels around the world have made commitments to making their official documents and records available on the Web.
Interestingly, deep Web sites appear to receive 50% more monthly traffic than surface sites do, and they have more sites linked to them, even though they are not really known to the public. They are typically narrower in scope but likely to have deeper, more detailed content.
軟考備考資料免費(fèi)領(lǐng)取
去領(lǐng)取
共收錄117.93萬道題
已有25.02萬小伙伴參與做題