Abstract
The data for Web mining is usually extracted from the WWW server or proxy server log files. The paper examines the advantages and disadvantages of exploiting another source of input data – the browser buffer. The properties of data extracted from different types of sources are compared. The browser buffer contains data about user navigational habits as well as the formal properties and the content of all recently accessed WWW objects. The paper uses the data obtained from this source to examine the statistical properties of different types of texts extracted from HTML pages.
Copyright information
© 2006 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Siemiński, A. (2006). Local Buffer as Source of Web Mining Data. In: Gabrys, B., Howlett, R.J., Jain, L.C. (eds) Knowledge-Based Intelligent Information and Engineering Systems. KES 2006. Lecture Notes in Computer Science, vol 4253. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11893011_99
DOI: https://doi.org/10.1007/11893011_99
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-46542-3
Online ISBN: 978-3-540-46544-7
eBook Packages: Computer Science, Computer Science (R0)