Daniel A. Menascé Bruno D.
Virgílio A.F. Almeida Flávia P. Ribeiro
|Dept. of Computer Science||Dept. of Computer Science|
|George Mason University||Universidade Federal de Minas Gerais|
|Fairfax, VA 22030 USA||Belo Horizonte, MG 31270 Brazil|
|+1 703 993 1537||+55 31 3499 5860|
|Dept of Information and Software Engineering|
|George Mason University|
|Fairfax, VA 22030 USA|
|+1 703 993 1627|
Keywords: Web workload characterization, e-commerce, fractal clustering, fractal dimension, -means clustering.
Word count: 6,300.
Understanding the workload of Web and e-commerce sites is a fundamental step in sizing the IT infrastructure that supports these sites and in planning for their evolution so that Quality of Service (QoS) goals are met within cost constraints . The first attempts at characterizing the workload of Web sites focused on information providing sites and considered only the stream of HTTP requests coming to the sites [3,4,8].
Some of the characteristics analyzed in previous studies are: file size distributions, file popularity distribution, self-similarity in HTTP traffic, reference locality, and user request patterns. A number of studies of different Web sites found that file sizes have heavy-tailed distributions and that the relationship between object popularity and frequency of access follows a Zipf distribution. Other studies of different Web site environments demonstrated long-range dependencies in the user request process, primarily resulting from strong correlations in the user requests.
More recently , the authors proposed a characterization approach for e-commerce workloads that takes into account the user model and its interactions with the e-commerce site. In particular, they proposed a hierarchical view of the workload consisting of a session layer, a function layer, and protocol (i.e., HTTP) layer. A session is a sequence of requests of different types made by a single customer during a single visit to a site. During a session, a customer requests the execution of various e-business functions such as browse, search, select, add to the shopping cart, register, and pay. A request to execute an e-business function may generate many HTTP requests to the site. For example, several images may have to be retrieved to display the page that contains the results of the execution of an e-business function. Using this approach to characterize the workload of online bookstores and auction sites, they were able to show, among other things, that i) most sessions last less than 1,000 sec, ii) more than 70% of the functions performed are product selection functions as opposed to product ordering functions, iii) the popularity of search terms follows a Zipf distribution, iv) there is a very strong correlation in the arrival process at the HTTP request level, indicated by a Hurst parameter with a value of 0.9273, v) at least 16% of the requests are generated by robots, vi) 88% of the sessions have less than 10 requests to execute e-business functions, and vii) the session length, measured in number of requests to execute e-business functions, distribution is heavy-tailed, especially for sites subject to requests generated by robots.
One of the main potential benefits of properly characterizing the workload is improve the quality of a customer's experience at the site. Once one understands what customers do, what navigational patterns they follow, one can improve their experience and even attempt to dynamically identify a customer with a well-known behavior in order to provide a better experience for the customer, i.e., personalized services, and maybe generate more profit to e-commerce sites.
Menascé and Almeida proposed two approaches to characterize user sessions: Customer Behavior Model Graphs (CBMGs) and Customer Visit Models (CVMs) . The CBMG is a state transition graph, in which the nodes correspond to states in the session (e.g., browsing, searching, selecting, checking out, and paying) and the arcs correspond to transitions between states. Probabilities are associated with transitions as in a Markov Chain. In , a clustering-based method was presented to process HTTP logs and obtain clusters of CBMGs with ``similar'' patterns (e.g., heavy buyers, occasional buyers).
A Customer Visit Model (CVM) represents sessions as a collection of session vectors, one per session. A session vector for the j-th session indicates the number of times, , that each of the different functions (e.g., search, browse, add to cart, etc) were invoked during the session. In this paper, we use clustering techniques based on the CVM model rather than on the CBMG since the CVM is a more compact representation of the workload than a CBMG. This reduces the number of attributes that have to be considered. More importantly, we use the fractal dimension  to reduce the dimensionality of the space and extract the most relevant attributes that should be used to characterize the workload. Besides using distance-based clustering algorithms (e.g., k-means, minimum spanning tree), as done in , we use fractal clustering (FC) . One of the motivations for using fractal clustering is to improve the ``quality'' of the clusters so that one can assign meaningful interpretations to each of them. Fractal clustering is quite appropriate to find sets of points that are ``similar'' with respect to a fractal dimension. This implies that clusters do not have to be shaped as hyperspheres, as is the case with distance-based clustering approaches. In this paper, we compare the results obtained with fractal clustering and distance-based clustering on a dataset generated from an e-commerce workload.
The rest of this paper is organized as follows. Section 2 briefly summarizes the results of previous work relevant to this paper. Section 3 discusses fractal dimension and its application to Web workloads and shows, through the use of Pair-count plots, that the workload analyzed is fractal in nature. The next section presents the results of applying fractal clustering to our Web workloads and offers an interpretation of the resulting clusters. Section 5 presents the results of performing attribute selection (i.e., dimension reduction) to our Web workloads. Section 6 compares FC with k-means clustering on the same dataset. Finally, section 7 presents some concluding remarks.
While there are many references on workload characterization in the WWW focusing on information provider Web sites , there is much less work on workload characterization for e-commerce sites.
In , the notion of a session which consists of many individual HTTP requests is introduced. Reference  presents several models (e.g., the Customer Behavior Model Graph and the Customer Visit Model) and shows how user sessions can be represented by these models and how to obtain them from HTTP logs. In , the authors use a hierarchical model to characterize e-commerce workloads from a real Web store log. This method takes into account, not only requests for files, but also user sessions consisting of requests for e-commerce functions (i.e., add to cart and pay). In addition, they detect and characterize some entities widely mentioned in this work as Shopbots and Crawlers. They also present the summary and the characteristics of the Webstore log we are using in this work.
In the database field, fractal dimension has been proven to be a suitable tool to estimate spatial joins and range queries , indexing and feature selection  amongst others. A good coverage of fractal dimension, including the correlation fractal dimension , is given by Belussi . As well as presenting the concepts behind those metrics, they also show an efficient algorithm to compute them.
In , the Pair Count Exponent power law is presented and its application to estimating the selectivity on spatial joins queries. As a special case, it includes . Since our datasets follow this perfect power law, it motivated us to examine the fractal properties of our workload.
Last, the method used to perform attribute selection in our datasets is described in . In addition to presenting the algorithm, they also show results using real and synthetic databases. We are using the Fractal Clustering algorithm (FC) proposed by Barbará et. al.  to cluster datasets of n-dimensional points and group them in clusters not restricted by its shape.
Many phenomena observed in nature seem to have chaotic behavior, but many, under close inspection, exhibit patterns that repeat themselves as the observation scale varies. These patterns are called fractals and the phenomenom is called self-similarity [5,11]. Self-similarity has been observed in many phenomena related to computer systems and to Web and e-commerce workloads. For example, the number of bytes retrieved from a Web server in time slots of duration is a self-similar process over a wide range of values of . So, is the number of HTTP requests arriving at an e-commerce site over a time slot of duration .
One of the goals of this work is to use fractal properties
to refine the characterization of e-commerce workloads. These
workloads are characterized at higher levels of abstraction,
i.e., sessions as opposed to HTTP requests. For that purpose,
we analyzed large HTTP logs of an online bookstore. The logs
correspond to 15 days in which 955,818 HTTP requests were
processed. Entries corresponding to images, errors, etc were
deleted and the URLs of the remaining entries in the log were
mapped to one of 12 e-business functions defined in
|Acc||Account Login and Creation|
|Add||Add items to the cart|
|Browse||Navigate through product categories|
|Home||Request the site's home page|
|Info||Obtain product information|
|Pay||Pay for an item|
|Postpay||Request a summary of items paid for|
|Prepay||Submit payment info|
|Robot||Request the file robot.txt|
|Search||Search for products based on keywords|
Then, sections within the log were identified using a combination of session ids generated by the online bookstore and a 30-minute inactivity period threshold . A session vector was generated for each session. This vector indicates the number of times, , that each of the 12 different functions (e.g., Search, Browse, Add) was invoked during the session. The resulting dataset is what we call the complete dataset. We define the session length, , as the total number of requests to execute e-business functions during the session. So, .
We derived three additional datasets from the complete dataset. One, called UpTo50, only includes session vectors with session length not exceeding 50. By considering only sessions whose length does not exceeed 50, we exclude sessions generated by robots and a few sessions generated by human beings as well. It should be noted however that the UpTo50 log contains 99.5% of the sessions of the complete log. This log can be used to focus on human-generated requests only.
The other dataset, called Add, includes sessions in which customers had an intention to buy. This is indicated by , i.e., at least one item was placed in the shopping cart. The other dataset, called Pay, includes sessions in which and and , i.e., sessions in which a purchase occurred.
As a preliminary investigation of the fractal nature of our datasets, we used the method in  to obtain the pair-count plot or PC-plot of the dataset. The pair count, , of a dataset for a given radius is defined as the number of pairs in the dataset in which points are within a distance of each other. The study in  showed that follows a power law for many real datasets for a usable range of distances. In other words, where is a constant and the exponent is called the pair-count exponent. The PC-plot is then a plot of vs. . Clearly, if follows a perfect power law, the PC-plot is a straight line for a given range of distances. The pair-count exponent is the slope of the straight line in the PC-plot and is also the ``correlation fractal dimension'' [10,17] discussed later in this section. This is also called the intrinsic dimensionality of the dataset.
Figure 1 shows the PC-plot for the complete dataset. The distance metric used is the Euclidean distance and the natural logarithm was used for both axes. As expected , similar results were obtained with the Manhattan distance  and when the dataset was sampled. As it can be seen, for a range of distances ranging from 1 () to 5.5 () the PC-plot is pretty much linear with a slope (i.e., correlation fractal dimension) of approximately 3.86.
A closer look at the PC-plot in Figure 1 reveals that the plot has
some inflection points, which indicates the presence of more
than one fractal cluster. In order to better analyze the
fractal dimension of the dataset, we can refer to
Figure 2 that
uses the approach in . The idea is to
partition the space into a grid of -dimensional cells of side equal to . We then count the number of points in the -th cell of side and compute the frequency, , in which points fall within cell , i.e., the count of points in the
cell divided by the total number of points in the dataset. The
``correlation fractal dimension'' is
then defined as
for the range in which the dataset is self-similar [10,17]. The correlation fractal dimension is quite useful in data mining and measures the probability that two points chosen at random will be within a certain distance from each other .
Figure 2 plots the log of the sum of the squares of the cell occupancy frequencies vs. the log of . Thus, the slope of this graph in selected ranges is the fractal dimension of the dataset. The figure shows that the dataset is self-similar but that there is more than one range with different values of .
Our original dataset has dimension equal to 12 since there are 12 e-business functions in the session vector. Some of the attributes may not be relevant when attempting to group sessions into clusters. We examine in this section how one can select the relevant attributes in a dataset. Since we intend to use fractal clustering (see Section 3), which places points in clusters by taking into account the fractal dimension of each cluster, we use the fractal dimension as the criterium for attribute selection.
The basic process works as follows. We start by computing the fractal dimension of the complete dataset. Then we examine one attribute (say attribute ) at a time and compute the partial fractal dimension , defined as the fractal dimension of the dataset using all attributes except attribute . Select the attribute such that . Set equal to and remove attribute from the dataset. The process continues until all attributes have been removed.
Figure 3 shows the results of applying this process to the complete dataset. The y-axis value for a given attribute is the partial fractal dimension before the attribute is removed. The order in which attributes are removed as a result of executing the procedure outlined above is from right to left. So, as the figure shows, removing Postpay through Undef does not significantly change the fractal dimension of the dataset. Similar results for the UpTo50, Add, and Pay datasets, are shown in Figs. 4, 5, and 6, respectively. These results indicate that fractal clustering analysis of these datasets can be carried out with a number of attributes much smaller than the original 12.
Finding correlations between attributes is important in explaining user behavior. These correlations can be found through the use of association rules . The data is assumed to be a basket of items, which in its simplest form is a vector of binary values: an item is either in the basket or it is not. This technique finds rules of the form , where and are itemsets in the data, i.e., sets of items. The rule comes with two measures: the support, which indicates the probability of the itemset , in the data set (number of times and occur together in the baskets divided by the number of baskets), and the confidence, or probability that occurs in the baskets already containing (number of times and occur together in the baskets divided by the number of baskets containing ). The rules by themselves do not indicate correlation.
However, a different measure, called the lift of the rule, defined as the ratio between the confidence and the probability of the consequent (in our example the itemset ) is useful in establishing correlations. A lift greater than 1 indicates a positive correlation and a lift less than 1 a negative correlation.
In our logs, we took as the baskets the sessions, as items the presence or absence of the functions (attributes of the log). For instance, one can have the item or , indicating whether the session used the browse function or not.
The most interesting correlations were found in the Pay dataset and are shown in Table 2.
The first two indicate a strong correlation between browsing and visiting the information pages (the customer either uses both or none). The third one indicates that when the rule is broken (the customer does not use the browsing function, but visits the information pages), then the search function is commonly used.
This section presents the results of performing FC on the online bookstore dataset and offers an interpretation of the resulting clusters. The FC algorithm used is described in . There, the authors presented a novel algorithm for clustering data points, based on exploiting self-similarity. This algorithm, named FC for Fractal Clustering, incrementally processes each point by placing it in the cluster in which it causes the minimum fractal impact, that is the cluster whose fractal dimension changes the least when the point is added to it.
Since the workload data sets exhibit a high degree of self-similarity, we wanted to use FC to cluster them, showing the advantage that FC exhibits over other clustering methods such as -means.
The results of using FC over the complete set of web logs are shown in Table 3 in the form of the centroids of the three clusters found. The last column indicates the percentage of points in each cluster. The three clusters correspond to human sessions, since the robot sessions were considered as outliers by FC (and form a fourth very small cluster). The clustering was done in a reduced-attribute set of five attributes: add, browse, home, info, and search. The centroids suggest the presence of three groups of human sessions, whose main difference is the intensity on using the functions.
Clustering techniques based on the definition of a distance (e.g., Euclidean or Manhattan) are commonly used. Some examples of these algorithms include -means, minimum spanning tree, and others . The purpose of this section is to compare results obtained with k-means and FC clustering with respect to the quality of the clusters obtained and the interpretations obtained from them.
The k-means algorithm chooses points as the initial clusters and adds each remaining point to the closest cluster using a predefined distance metric. Every time a point is added to a cluster, the coordinates of the centroid of the cluster have to be recomputed . Point allocation to clusters may have to be repeated until no points change their cluster allocation or a maximum number of passes is performed.
We present in Tables 4, 5, and 6 the results of applying -means clustering to our e-commerce workload,
where, as before, each point is a session vector. Table 4 shows the results of applying
-means clustering to the complete
log using 5, 6, and 7 clusters. Only seven--acc, add, browse,
home, info, pay, and search--of the twelve attributes were used
in the clustering processes. Using more attributes did not
improve the quality of the clustering process. For each number
of clusters, the table shows the coordinates of the centroid
for each cluster. These coordinates represent the average
number of executions of each of the seven e-business functions
by sessions represented by that cluster. Column 9 indicates the
percentage of sessions in each cluster. The next column, S,
indicates the sum of the number of executions of each of the
seven e-business functions for that cluster. This sum can be
considered as the session length restricted to these seven
functions. The last column of the table presents a possible
interpretation for the type of sessions represented by each
We used the following labels in Tables 4, 5, and 6 to indicate the following categories of sessions:
The characterization of sessions could be useful for planning purposes. It helps to answer questions such as What would be the response time if the percentage of bots increase by 100%? How could one reduce the percentage of chm customers? What kind of customers usually buy? Is there a common reason why customers change their mind?
In each of the three cases (, and 7) in Table 4, there is one cluster of sessions labeled as ``buyers''. One interesting observation is that the centroid of the ``buyers'' cluster is the same regardless of the number of clusters selected: . The analysis of the complete log indicates also that shopbots and crawlers seem to be very well separated for , and 7. The coordinates of the centroid of the ``crawlers'' cluster is the same in all three cases and the coordinates of the ``shopbots'' centroid are not much different. The non robot clusters, referred here as human clusters, are separated into three groups when . One is composed of buyers and the other two belong to the hit & run category. When the number of clusters is increased to 6 and 7, we start to see a new type of customer, those who change their mind. In fact, with 7 clusters we can see that 9.2% of the customers changed their mind, 0.3% bought something, 85.8% are non-serious customers (hit&run), and 4.7% are robots.
Table 5 shows the results when only sessions that added at least one item to the cart are considered. In this case, the ``buyers'' centroid in the Add log is the same for 5 and 7 clusters and is very similar to the ``buyers'' centroid in the complete log. We can see that this log does not provide a clean distinction between shopbots and crawlers. Still, the bots with a very large number of requests manage to remain isolated for both and 7.
Finally, Table 6 shows the results of applying -means when robots are excluded. The results show two ``hit&run'' clusters and three clusters that show add to cart activity but with product selection dominated by either info, browsing, or search requests.
Looking at the results of Table 3, obtained when FC was performed, we were able to derive a relationship between the sum of the variables browse, info, and search and the add variable. This relationship would not have revealed itself without performing FC. We then performed Chi-Square tests over the variables and the sum , along with a linear regression between these variables. The results are summarized in Table 7. The interesting fact is that the variables are correlated in all cases (each cluster and complete set), but while the correlation is positive in Cluster 1 and the complete set, it is negative in both of the other clusters (the slope of the regression curve is negative).
Clusters 2 and 3 have significantly more Add activity than cluster 1 (see Table 3). In these two cases, as customers increase their product selection activity, they tend to add less items to the shopping cart (negative slope). Another interesting difference between the results obtained between FC and -means can be seen by comparing Table 6 for -means without robots and Table 3 for FC without robots. In the first case, there is very little separation due to Add activity while in the second case the centroids provide a very clear distinction based on Add to cart activity. It may be more useful, from a management standpoint, to separate customers based on their product ordering activity behavior.
The interactions of customers of Web and e-commerce sites are represented by sessions, which are sequences of requests of different types made by a single customer during a single visit to a site. During a session, a customer requests the execution of various e-business functions such as browse, search, select, add to the shopping cart, register, and pay. In this paper, we used the Customer Visit Model (CVM)  to represent an e-commerce site workload. A CVM represents sessions as a collection of session vectors, one per session. A session vector for the j-th session indicates the number of times, , that each of the different functions (e.g., search, browse, add to cart, etc) were invoked during the session.
We used clustering techniques to group ``similar'' sessions and to understand the behavior of users within each cluster. In this paper, we proposed a fractal-based approach to characterize Web workloads. First, we showed how the fractal dimension of the dataset can be used to reduce the number of attributes, i.e., the dimension of the session vector. By analyzing variations in the fractal dimension, the method selects those attributes that are more relevant to the workload characterization. This process leads to a reduced complexity of the Web workload characterization and a better understanding of real workloads.
Then, fractal-based clustering algorithms were used to identify groups of user sessions that share some common features. We compared the results of the fractal clustering with those obtained by traditional clustering techniques, based on the -means algorithm. Additional contributions include the following:
Obtaining logs from actual e-commerce sites is not trivial due to the proprietary nature of these logs. We were fortunate to be given logs from a real site, that shall remain unnamed for obvious reasons. The results reported in this paper should not be seen as general results that apply to many e-commerce sites but as an application of a methodology, based on fractal clustering, to characterize and reduce the complexity of Web and E-commerce workloads.
The authors would like to thank Ping Chen and Julia Couto for their help in running some of the experiments.
Daniel A. Menascé is a Professor of Computer Science, co-director of the E-Center for E-business, and Director of the MS in E-commerce program at George Mason University.
Bruno D. Abrahão is a student in the BS in Computer Science program at the Federal University of Minas Gerais, Brazil.
Daniel Barbará is an Associate Professor of Information and Software Engineering at George Mason University.
Virgilio Almeida is a Professor of Computer Science at the Federal University of Minas Gerais, Brazil.
Flávia Ribeiro is a Ph.D. student in the Computer Science Department at the Federal University of Minas Gerais, Brazil.