Web Harvesting
As the amount of information on the Web grows, that information becomes ever harder to keep track of and use. Search engines are a big help, but they can do only part of the work, and they are hard-pressed to keep up with daily changes.
Consider that even when you use a search engine to locate data, you still have to do the following tasks to capture the information you need: scan the content until you find the information; mark the information (usually by highlighting with a mouse); switch to another application (such as a spreadsheet, database or word processor); and paste the information into that application.
A better solution, especially for companies that are aiming to exploit a broad swath of data about markets or competitors, lies with Web harvesting tools.
Web harvesting software automatically extracts information from the Web and picks up where search engines leave off, doing the work the search engine can't. Extraction tools automate the reading, copying and pasting necessary to collect information for analysis, and they have proved useful for pulling together information on competitors, prices and financial data of all types.
There are three ways we can extract more useful information from the Web.
The first technique, Web content harvesting, is concerned directly with the specific content of documents or their descriptions, such as HTML files, images or e-mail messages. Since most text documents are relatively unstructured (at least as far as machine interpretation is concerned), one common approach is to exploit what's already known about the general structure of documents and map this to some data model.
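The mapping described above can be sketched with Python's standard-library HTML parser. This is a minimal illustration, not a production scraper: the page layout it assumes (items marked up as `<span class="name">` and `<span class="price">`) is hypothetical, standing in for whatever "already known" structure a real harvester would exploit.

```python
# Sketch of Web content harvesting: exploiting a known HTML structure
# and mapping it to a simple data model (a list of field/value pairs).
# The markup conventions below are assumed for illustration only.
from html.parser import HTMLParser

class ItemExtractor(HTMLParser):
    """Collects text from <span class="name"> and <span class="price"> tags."""
    def __init__(self):
        super().__init__()
        self.items = []       # the extracted data model
        self._field = None    # field we are currently inside, if any

    def handle_starttag(self, tag, attrs):
        cls = dict(attrs).get("class")
        if tag == "span" and cls in ("name", "price"):
            self._field = cls

    def handle_data(self, data):
        if self._field:
            self.items.append((self._field, data.strip()))
            self._field = None

page = ('<div class="item"><span class="name">Widget</span>'
        '<span class="price">9.99</span></div>')
extractor = ItemExtractor()
extractor.feed(page)
print(extractor.items)  # → [('name', 'Widget'), ('price', '9.99')]
```

A real tool would apply the same idea at scale, fetching pages and normalizing the extracted fields into a database for analysis.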
Another approach to Web content harvesting involves trying to improve on the content searches that tools like search engines perform. This type of content harvesting goes beyond keyword extraction and the production of simple statistics relating to words and phrases in documents.
Another technique, Web structure harvesting, takes advantage of the fact that Web pages can reveal more information than just their obvious content. Links from other sources that point to a particular Web page indicate the popularity of that page, while links within a Web page that point to other resources may indicate the richness or variety of topics covered in that page. This is like analyzing bibliographical citations—a paper that's often cited in bibliographies and other papers is usually considered to be important.
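The citation analogy can be made concrete with a toy link graph: counting how many pages point at each page gives a crude popularity score. The graph below is invented for illustration; real systems refine this idea into algorithms such as PageRank.

```python
# Sketch of Web structure harvesting: ranking pages by in-link count,
# like counting how often a paper is cited. The link graph is made up.
from collections import Counter

links = {  # page -> pages it links to
    "a.html": ["b.html", "c.html"],
    "b.html": ["c.html"],
    "d.html": ["c.html", "b.html"],
}

# Count incoming links per page across the whole graph.
in_degree = Counter(target for targets in links.values() for target in targets)
print(in_degree.most_common())  # → [('c.html', 3), ('b.html', 2)]
```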
The third technique, Web usage harvesting, uses data recorded by Web servers about user interactions to help understand user behavior and evaluate the effectiveness of the Web structure.
General access-pattern tracking analyzes Web logs to understand access patterns and trends in order to identify structural issues and resource groupings.
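Access-pattern tracking of this kind can be sketched as a tally over server log lines. The entries below are fabricated samples in the Common Log Format; a real analysis would also segment by time, referrer, and session.

```python
# Sketch of general access-pattern tracking: tallying which resources
# a Web server log shows being requested most often. Log entries are
# invented samples in the Common Log Format.
import re
from collections import Counter

log = """\
10.0.0.1 - - [01/Jan/2024:10:00:00 +0000] "GET /index.html HTTP/1.1" 200 512
10.0.0.2 - - [01/Jan/2024:10:00:05 +0000] "GET /prices.html HTTP/1.1" 200 1024
10.0.0.1 - - [01/Jan/2024:10:00:09 +0000] "GET /prices.html HTTP/1.1" 200 1024
"""

# Pull the requested path out of each "GET <path> HTTP..." request line.
pattern = re.compile(r'"GET (\S+) HTTP')
hits = Counter(pattern.findall(log))
print(hits.most_common(1))  # → [('/prices.html', 2)]
```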
Customized usage tracking analyzes individual trends so that Web sites can be personalized to specific users. Over time, based on access patterns, a site can be dynamically customized for a user in terms of the information displayed, the depth of the site structure and the format of the resources presented.
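Customized tracking can be sketched as a per-user profile built from a clickstream: the sections a user visits most often would then drive what the site displays first. The users and sections below are hypothetical.

```python
# Sketch of customized usage tracking: building per-user profiles of
# visited sections from a clickstream, as a basis for personalization.
# Users and section names are hypothetical.
from collections import Counter, defaultdict

visits = [  # (user, section) pairs observed over time
    ("alice", "finance"), ("alice", "finance"), ("alice", "sports"),
    ("bob", "tech"), ("bob", "tech"), ("bob", "finance"),
]

profiles = defaultdict(Counter)
for user, section in visits:
    profiles[user][section] += 1

# Each user's most-visited section could be surfaced first for them.
for user, counts in profiles.items():
    print(user, counts.most_common(1)[0][0])
# → alice finance
#   bob tech
```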