Web Harvesting
As the amount of information on the Web grows, that information becomes ever harder to keep track of and use. Search engines are a big help, but they can do only part of the work, and they are hard-pressed to keep up with daily changes.
Consider that even when you use a search engine to locate data, you still have to do the following tasks to capture the information you need: scan the content until you find the information; mark the information (usually by highlighting with a mouse); switch to another application (such as a spreadsheet, database or word processor); and paste the information into that application.
A better solution, especially for companies that are aiming to exploit a broad swath of data about markets or competitors, lies with Web harvesting tools.
Web harvesting software automatically extracts information from the Web and picks up where search engines leave off, doing the work the search engine can't. Extraction tools automate the reading, copying and pasting necessary to collect information for analysis, and they have proved useful for pulling together information on competitors, prices and financial data of all types.
There are three ways we can extract more useful information from the Web.
The first technique, Web content harvesting, is concerned directly with the specific content of documents or their descriptions, such as HTML files, images or e-mail messages. Since most text documents are relatively unstructured (at least as far as machine interpretation is concerned), one common approach is to exploit what's already known about the general structure of documents and map this to some data model.
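The mapping described above can be sketched with Python's standard-library HTML parser. This is a minimal illustration, not a production scraper: the page layout it assumes (items marked up as `<span class="name">` and `<span class="price">`) is hypothetical, standing in for whatever "already known" structure a real harvester would exploit.

```python
# Sketch of Web content harvesting: exploiting a known HTML structure
# and mapping it to a simple data model (a list of field/value pairs).
# The markup conventions below are assumed for illustration only.
from html.parser import HTMLParser

class ItemExtractor(HTMLParser):
    """Collects text from <span class="name"> and <span class="price"> tags."""
    def __init__(self):
        super().__init__()
        self.items = []       # the extracted data model
        self._field = None    # field we are currently inside, if any

    def handle_starttag(self, tag, attrs):
        cls = dict(attrs).get("class")
        if tag == "span" and cls in ("name", "price"):
            self._field = cls

    def handle_data(self, data):
        if self._field:
            self.items.append((self._field, data.strip()))
            self._field = None

page = ('<div class="item"><span class="name">Widget</span>'
        '<span class="price">9.99</span></div>')
extractor = ItemExtractor()
extractor.feed(page)
print(extractor.items)  # → [('name', 'Widget'), ('price', '9.99')]
```

A real tool would apply the same idea at scale, fetching pages and normalizing the extracted fields into a database for analysis.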
Another approach to Web content harvesting involves trying to improve on the content searches that tools like search engines perform. This type of content harvesting goes beyond keyword extraction and the production of simple statistics relating to words and phrases in documents.
Another technique, Web structure harvesting, takes advantage of the fact that Web pages can reveal more information than just their obvious content. Links from other sources that point to a particular Web page indicate the popularity of that page, while links within a Web page that point to other resources may indicate the richness or variety of topics covered in that page. This is like analyzing bibliographical citations—a paper that's often cited in bibliographies and other papers is usually considered to be important.
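The citation analogy can be made concrete with a toy link graph: counting how many pages point at each page gives a crude popularity score. The graph below is invented for illustration; real systems refine this idea into algorithms such as PageRank.

```python
# Sketch of Web structure harvesting: ranking pages by in-link count,
# like counting how often a paper is cited. The link graph is made up.
from collections import Counter

links = {  # page -> pages it links to
    "a.html": ["b.html", "c.html"],
    "b.html": ["c.html"],
    "d.html": ["c.html", "b.html"],
}

# Count incoming links per page across the whole graph.
in_degree = Counter(target for targets in links.values() for target in targets)
print(in_degree.most_common())  # → [('c.html', 3), ('b.html', 2)]
```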
The third technique, Web usage harvesting, uses data recorded by Web servers about user interactions to help understand user behavior and evaluate the effectiveness of the Web structure.
General access-pattern tracking analyzes Web logs to understand access patterns and trends in order to identify structural issues and resource groupings.
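Access-pattern tracking of this kind can be sketched as a tally over server log lines. The entries below are fabricated samples in the Common Log Format; a real analysis would also segment by time, referrer, and session.

```python
# Sketch of general access-pattern tracking: tallying which resources
# a Web server log shows being requested most often. Log entries are
# invented samples in the Common Log Format.
import re
from collections import Counter

log = """\
10.0.0.1 - - [01/Jan/2024:10:00:00 +0000] "GET /index.html HTTP/1.1" 200 512
10.0.0.2 - - [01/Jan/2024:10:00:05 +0000] "GET /prices.html HTTP/1.1" 200 1024
10.0.0.1 - - [01/Jan/2024:10:00:09 +0000] "GET /prices.html HTTP/1.1" 200 1024
"""

# Pull the requested path out of each "GET <path> HTTP..." request line.
pattern = re.compile(r'"GET (\S+) HTTP')
hits = Counter(pattern.findall(log))
print(hits.most_common(1))  # → [('/prices.html', 2)]
```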
Customized usage tracking analyzes individual trends so that Web sites can be personalized to specific users. Over time, based on access patterns, a site can be dynamically customized for a user in terms of the information displayed, the depth of the site structure and the format of the resources presented.
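Customized tracking can be sketched as a per-user profile built from a clickstream: the sections a user visits most often would then drive what the site displays first. The users and sections below are hypothetical.

```python
# Sketch of customized usage tracking: building per-user profiles of
# visited sections from a clickstream, as a basis for personalization.
# Users and section names are hypothetical.
from collections import Counter, defaultdict

visits = [  # (user, section) pairs observed over time
    ("alice", "finance"), ("alice", "finance"), ("alice", "sports"),
    ("bob", "tech"), ("bob", "tech"), ("bob", "finance"),
]

profiles = defaultdict(Counter)
for user, section in visits:
    profiles[user][section] += 1

# Each user's most-visited section could be surfaced first for them.
for user, counts in profiles.items():
    print(user, counts.most_common(1)[0][0])
# → alice finance
#   bob tech
```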