Learn how to analyze web server log files to understand how users browse a website and to predict the content they will request next. This article explains how to apply an extensible Markov model to cluster the pages of a website and predict where a user will go next. The solution uses InfoSphere® Streams and R to make predictions on a regular basis from the model.

Introduction

Web server log files are often used to analyze how users browse a site. For example, in “Predicting Web Users’ Next Access Based on Log Data”, Rituparna Sen and Mark Hansen used a mixture of first-order Markov models to analyze clusters of pages on a website. They applied these models to predict which page a user is likely to visit next, and they suggested using this information to pre-fetch a resource before the user actually requests it. This article explains how to use IBM InfoSphere Streams, combined with R, to run a similar analysis of web server logs.
This solution uses extensible Markov models (EMMs), first introduced in 2004 by Margaret Dunham, Yu Meng, and Jie Huang, which combine a stream clustering algorithm with a Markov chain. A Markov chain is a mathematical system that undergoes transitions from one state to another, in which the next state depends only on the current state and not on the sequence of events that came before it.
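To make the Markov property concrete, here is a minimal Python sketch that estimates first-order transition probabilities from a single sequence of states. The function names and the sample page labels are hypothetical and chosen only for illustration; this is not the code used in the solution, which relies on R.

```python
from collections import Counter, defaultdict


def estimate_transitions(sequence):
    """Estimate first-order transition probabilities from one state sequence."""
    counts = defaultdict(Counter)
    # Count each observed move from a state to its immediate successor.
    for current, following in zip(sequence, sequence[1:]):
        counts[current][following] += 1
    # Normalize each row of counts so that its probabilities sum to 1.
    return {
        state: {nxt: n / sum(successors.values()) for nxt, n in successors.items()}
        for state, successors in counts.items()
    }


def most_likely_next(transitions, state):
    """Return the most probable successor of `state`, or None if none was seen."""
    successors = transitions.get(state)
    if not successors:
        return None
    return max(successors, key=successors.get)


# Hypothetical sequence of pages visited, in order (illustrative data only).
visits = ["home", "products", "cart", "home", "products",
          "support", "home", "products", "cart"]
transitions = estimate_transitions(visits)
prediction = most_likely_next(transitions, "products")
```

Note that the prediction for "products" depends only on what followed "products" in the past, not on how the user reached that page. That is the first-order assumption described above.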
The states of the Markov chain are the clusters identified by the stream clustering algorithm. The EMM can change over time by adding new states as they are discovered and by decaying or pruning existing states. As a result, the model can adapt over time. This capability is especially important in systems whose usage patterns change. For example, a website is likely to show changing usage patterns, as well as changes in its structure, over time.
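The following Python sketch illustrates these three behaviors: adding states, decaying counts, and pruning stale states. It is a simplified stand-in, not the EMM algorithm as published. It assumes a threshold-based nearest-neighbor clustering in which each cluster center is fixed at the first point assigned to it, and the class name `SimpleEMM` and its parameters (`threshold`, `decay`, `prune_below`) are hypothetical.

```python
import math
from collections import defaultdict


class SimpleEMM:
    """Illustrative EMM-like model: threshold clustering plus decayed counts."""

    def __init__(self, threshold=1.0, decay=0.99, prune_below=0.05):
        self.threshold = threshold      # max distance for joining an existing state
        self.decay = decay              # factor applied to all counts on each update
        self.prune_below = prune_below  # states with a smaller count are removed
        self.centers = {}               # state id -> center (first point seen)
        self.state_counts = {}          # state id -> decayed visit count
        self.transitions = defaultdict(dict)  # state id -> {next id: decayed count}
        self.current = None
        self._next_id = 0

    def _nearest(self, point):
        best, best_dist = None, float("inf")
        for state, center in self.centers.items():
            dist = math.dist(point, center)
            if dist < best_dist:
                best, best_dist = state, dist
        return best, best_dist

    def update(self, point):
        """Consume one observation and return the state it was assigned to."""
        # Fade old evidence so that recent behavior carries more weight.
        for state in self.state_counts:
            self.state_counts[state] *= self.decay
        for row in self.transitions.values():
            for state in row:
                row[state] *= self.decay
        # Join the nearest state, or create a new one if none is close enough.
        state, dist = self._nearest(point)
        if state is None or dist > self.threshold:
            state = self._next_id
            self._next_id += 1
            self.centers[state] = tuple(point)
            self.state_counts[state] = 0.0
        self.state_counts[state] += 1.0
        # Record the transition from the previous state to this one.
        if self.current is not None:
            row = self.transitions[self.current]
            row[state] = row.get(state, 0.0) + 1.0
        self.current = state
        self._prune()
        return state

    def _prune(self):
        """Drop states whose decayed count has fallen below the cutoff."""
        stale = [s for s, c in self.state_counts.items()
                 if c < self.prune_below and s != self.current]
        for s in stale:
            del self.centers[s]
            del self.state_counts[s]
            self.transitions.pop(s, None)
            for row in self.transitions.values():
                row.pop(s, None)

    def predict_next(self):
        """Return the most likely next state, or None if nothing is known."""
        row = self.transitions.get(self.current)
        if not row:
            return None
        return max(row, key=row.get)


# Hypothetical two-dimensional observations alternating between two regions.
emm = SimpleEMM(threshold=1.0)
states = [emm.update(p) for p in [(0, 0), (5, 5), (0.1, 0.1), (5.1, 4.9), (0, 0.2)]]
predicted = emm.predict_next()
```

Because learning and prediction share the same `update` step, the model keeps adapting while it is in use. A fuller implementation would also update the cluster centers as new points arrive.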

Advantages of integration

Most machine learning models used for prediction are trained offline on large amounts of training data. Once a model has been trained, predictions can be made immediately. This approach suits many kinds of problems, but if the patterns being predicted change frequently, it can produce models that lag behind the system they are trying to predict. Because EMMs can be trained dynamically, they are well suited to modelling systems such as network traffic, automobile traffic, or any other system in which clustering patterns change over time. Web server traffic is one such domain. Server logs provide an endless source of streaming data for training the model while the system is already making predictions.

Predicting content requests from web server logs

Web servers keep logs of resource requests. Each log entry contains the IP address of the user, the timestamp of the request, and the location of the requested resource. Together, this information characterizes the user and the requests made to the website.
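As an illustration, the following Python sketch extracts these three fields from log lines and groups the requested paths by IP address in time order. It assumes the logs use the NCSA Common Log Format, which the article does not specify. The sample lines, the pattern, and the function names are hypothetical.

```python
import re
from collections import defaultdict
from datetime import datetime

# Assumed layout (Common Log Format):
#   host ident authuser [timestamp] "METHOD path protocol" status bytes
LOG_PATTERN = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*"'
)


def parse_entry(line):
    """Return (ip, timestamp, path) for one log line, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    # Month abbreviations such as "Oct" are parsed with the current locale.
    timestamp = datetime.strptime(match.group("timestamp"), "%d/%b/%Y:%H:%M:%S %z")
    return match.group("ip"), timestamp, match.group("path")


def paths_by_user(lines):
    """Group requested paths by IP address, ordered by request time."""
    entries = [e for e in map(parse_entry, lines) if e is not None]
    entries.sort(key=lambda entry: entry[1])
    paths = defaultdict(list)
    for ip, _, path in entries:
        paths[ip].append(path)
    return dict(paths)


# Made-up sample entries using documentation-only IP addresses.
sample_lines = [
    '192.0.2.10 - - [10/Oct/2023:13:55:36 +0000] "GET /index.html HTTP/1.1" 200 2326',
    '192.0.2.11 - - [10/Oct/2023:13:55:40 +0000] "GET /products.html HTTP/1.1" 200 5120',
    '192.0.2.10 - - [10/Oct/2023:13:56:02 +0000] "GET /products.html HTTP/1.1" 200 5120',
    'not a log line',
]
user_paths = paths_by_user(sample_lines)
```

Each per-address list of paths is the kind of ordered sequence that a Markov model can be trained on. Treating an IP address as a single user is a simplification, since several users can share one address.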

Summary

This article shows how to use web server log files to model user behavior on a website and predict content requests. Both the modelling and the prediction are done with an EMM. The solution presented here is a proof of concept, and further work is needed to develop a production-ready solution. Next steps include improving performance by clustering groups of web pages, learning incrementally, and using InfoSphere Streams to run multiple instances of R.