Introduction
Today, many companies have thousands or even billions of links on their websites. Some of these links are useless or never visited, yet they remain on the site for a long time. This morning I decided to run an experiment by browsing IBM's website. The experiment went as follows: I started on the home page (ibm.com), clicked an arbitrary link to reach a new page, clicked another arbitrary link from that page to reach yet another page, and so on. I could keep finding new links, and thus new pages, almost indefinitely. The interesting observation is that it was not easy to get back to my starting page. Following successive links and pages can be fun and fascinating, but is it necessary for the reader? Can the reader still follow the original idea that brought him to the website while browsing through successive links and pages? Two notable drawbacks of having thousands or billions of links are that the site becomes difficult to navigate and that the information we are looking for is rarely one click away. In my opinion, a website should be like a novel: the reader should be able to move through the chapters without losing the main idea and without spending a lot of time.
The aim of this post is to show how you can use an analytics technique to optimize your website and avoid endless link clicking. The principal technique is graph mining, that is, analytics on a large graph. A graph is a set of objects in which some pairs of objects are connected by links. Euler introduced graph theory in the 18th century, and it has become more and more widely used in many fields of research such as chemistry, physics, economics, etc.
Dataset:
Since it is not easy to access the link data of a real website, we simulate a dataset on a virtual website. A node (or vertex) is a page of the website, and an edge is a link used to move from one page to another. We use the Barabási model to simulate a random weighted graph with 500 nodes and 499 edges. The weight of an edge is the probability of moving from one page to another: the probability of visiting page B coming from page A is (1/NL)*(NB/NA), where NL is the total number of links on page A, NB is the number of readers who visit B coming from A, and NA is the number of visitors of page A. We decide to penalize pages that have no outgoing link: their probability of reaching a new page is 0. The probability of being on the home page of the website is 1.
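The edge weight defined above can be written as a small helper function. Here is a minimal sketch in Python; the function name and the example numbers are my own illustrations, not taken from the post.

```python
def transition_probability(nl, nb, na):
    """Probability of moving from page A to page B, as defined above:
    (1/NL) * (NB/NA), where NL is the number of links on page A,
    NB the number of readers who reached B from A, and NA the number
    of visitors of page A. Pages with no outgoing link get probability 0."""
    if nl == 0 or na == 0:
        return 0.0
    return (1.0 / nl) * (nb / na)

# Illustrative numbers: page A carries 4 links and had 1000 visitors,
# 200 of whom moved on to page B.
print(transition_probability(4, 200, 1000))  # 0.05
print(transition_probability(0, 200, 1000))  # 0.0 (dead-end page)
```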
We draw below a sample of the weighted graph. Note that the width of each edge depends on its weight.
When we study a graph, some useful statistics are the node degree, the degree distribution, the clustering coefficient and the connected components. Suppose we want to know how many pages can be reached from page "A", or how many pages point to page "A": the node degree gives us this statistic. The degree of a vertex "A" in the graph is the number of edges incident on the vertex. Within a directed graph we distinguish the out-degree, the number of edges pointing away from the node, from the in-degree, the number of edges pointing towards the node. In an undirected graph there is no such distinction: every edge incident on a node counts once towards its degree.
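To make the in-degree/out-degree distinction concrete, here is a short stdlib-Python sketch that counts both from a directed edge list; the mini-site in the example is invented for illustration.

```python
from collections import Counter

def degrees(edges):
    """Count the out-degree and in-degree of each node in a directed edge list."""
    out_deg, in_deg = Counter(), Counter()
    for src, dst in edges:
        out_deg[src] += 1   # edge points away from src
        in_deg[dst] += 1    # edge points towards dst
    return out_deg, in_deg

# Hypothetical mini-site: the home page links to pages a and b, and a links to b.
edges = [("home", "a"), ("home", "b"), ("a", "b")]
out_deg, in_deg = degrees(edges)
print(out_deg["home"], in_deg["b"])  # 2 2
```

In the undirected case one would simply count each edge once for both endpoints, so the distinction disappears.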
In addition, the degree distribution is the probability distribution of the node degrees. In our graph the maximum degree is 58: if "A" denotes the page of degree 58, this means that page "A" has 58 incoming and outgoing links in total. The minimum degree is 1, so every page has at least one incoming or outgoing link. On average, every page of our website has two incoming or outgoing links.
The clustering coefficient measures the transitivity of the graph (Watts and Strogatz, 1998); in the language of social networks, a friend of my friend is likely to also be my friend. Watts and Strogatz introduced the local clustering coefficient C_v = (number of triangles connected to vertex v) / (number of triples centered on vertex v), with the convention that C_v = 0 for vertices of degree 0 or 1, for which both the numerator and the denominator are zero. According to this definition, the local clustering C_v is the number of links among the vertices within the neighborhood of v divided by the number of links that could possibly exist between them: if the vertex v has k neighbors, at most k(k-1)/2 edges can exist among them (Tang et al., 2009). In most graphs the local clustering decreases when the degree of the node increases; roughly, C_v behaves like 1/d_v (Newman, 2003). In our case study the clustering coefficient is zero: with 500 nodes and only 499 edges our graph is a tree, so it contains no triangles at all.
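The definition of C_v can be checked in a few lines of Python; the graph is stored as a node-to-neighbor-set mapping, and the toy graph below is my own example, not the simulated website.

```python
def local_clustering(adj, v):
    """C_v = (links that exist among the neighbors of v) / (k(k-1)/2 possible
    links), with C_v = 0 by convention when v has degree 0 or 1."""
    neighbors = list(adj[v])
    k = len(neighbors)
    if k < 2:
        return 0.0
    links = sum(1 for i in range(k) for j in range(i + 1, k)
                if neighbors[j] in adj[neighbors[i]])
    return links / (k * (k - 1) / 2)

# Toy undirected graph: a triangle 0-1-2 plus a pendant node 3 attached to 2.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
print(local_clustering(adj, 0))  # 1.0 (the two neighbors of 0 are connected)
print(local_clustering(adj, 3))  # 0.0 (degree-1 convention)
```

Node 2 has three neighbors {0, 1, 3} but only the pair (0, 1) is linked, so its coefficient is 1/3.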
A connected subgraph is a set of vertices in which every pair of nodes is joined by a path. Plotting the number of components against the component size gives an idea of the distribution of the connected subgraphs. For example, the figure tells us that the size of the connected subgraphs follows a power-law distribution, just like the degree distribution.
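Connected components can be extracted with a plain breadth-first search. A self-contained sketch, on an invented example graph:

```python
from collections import deque

def connected_components(adj):
    """Return the connected components of an undirected graph
    given as a node -> set-of-neighbors mapping."""
    seen, components = set(), []
    for start in adj:
        if start in seen:
            continue
        queue, comp = deque([start]), set()
        seen.add(start)
        while queue:                    # BFS over the component of `start`
            node = queue.popleft()
            comp.add(node)
            for nb in adj[node]:
                if nb not in seen:
                    seen.add(nb)
                    queue.append(nb)
        components.append(comp)
    return components

# Two components: a path 0-1-2 and an isolated pair 3-4.
adj = {0: {1}, 1: {0, 2}, 2: {1}, 3: {4}, 4: {3}}
print(sorted(map(sorted, connected_components(adj))))  # [[0, 1, 2], [3, 4]]
```

Tallying the component sizes returned by such a routine is exactly what the size-versus-count plot above summarizes.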
II. Clustering implementation
There are many graph clustering algorithms, for example the Markov Cluster Algorithm (MCL) (van Dongen, 2000), the fast greedy algorithm (Clauset et al., 2004), the random walk algorithm (Pons and Latapy, 2005) and so on. We choose to implement the MCL algorithm in this post because it is easy and fast to compute.
We only present the outline of the algorithm here and invite the reader to follow http://micans.org/mcl/ or to read (van Dongen, 2000) for more details.
Here is the algorithm:
Step 1: The input is an undirected graph, a power parameter e and an inflation parameter r. We choose r = 2 and e = 2 for our case study.
Step 2: Create the associated matrix: we compute the adjacency matrix of the graph.
Step 3: Add self-loops to each node (optional); we skip this step.
Step 4: Normalize the adjacency matrix: divide each element of a column by the sum of that column, so that each column sums to 1.
Step 5: Expand by taking the e-th power of the matrix: we square the matrix.
Step 6: Inflate the resulting matrix with parameter r: we raise each element of the expanded matrix to the power 2 and normalize each column again.
Step 7: Repeat steps 5 and 6 until a steady state (convergence) is reached.
Step 8: Interpret the resulting matrix to discover the clusters.
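The steps above can be sketched in pure Python on a tiny illustrative graph. This is my own minimal implementation with e = r = 2, not the igraph code used for the post; unlike the post, it does add self-loops (step 3), which helps convergence on such a small example.

```python
def normalize(m):
    """Divide each column by its sum so every column sums to 1 (step 4)."""
    n = len(m)
    for j in range(n):
        s = sum(m[i][j] for i in range(n))
        if s > 0:
            for i in range(n):
                m[i][j] /= s
    return m

def expand(m):
    """Square the matrix (step 5 with e = 2)."""
    n = len(m)
    return [[sum(m[i][k] * m[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def inflate(m, r=2):
    """Raise each entry to the power r, then renormalize columns (step 6)."""
    return normalize([[x ** r for x in row] for row in m])

def mcl(edges, n, iterations=50):
    # Step 2: adjacency matrix of the undirected graph; step 3: self-loops.
    m = [[0.0] * n for _ in range(n)]
    for a, b in edges:
        m[a][b] = m[b][a] = 1.0
    for i in range(n):
        m[i][i] = 1.0
    m = normalize(m)
    for _ in range(iterations):          # steps 5-7: iterate to convergence
        m = inflate(expand(m))
    # Step 8: rows that keep nonzero entries are "attractors"; the columns
    # an attractor row covers form one cluster.
    clusters = set()
    for i in range(n):
        members = frozenset(j for j in range(n) if m[i][j] > 1e-6)
        if members:
            clusters.add(members)
    return clusters

# Two triangles joined by a single bridge edge: MCL separates the triangles.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
print(sorted(sorted(c) for c in mcl(edges, 6)))
```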
We run the MCL algorithm in R using the powerful "igraph" package. The igraph package offers many useful functions for graph operations, graph plotting, graph attributes, etc. The package can be downloaded from CRAN[1].
Figure 4: MCL clustering result.
The algorithm finds 190 communities. This means that the five hundred pages can be gathered into only 190 pages. One community contains 27 nodes, which means that 27 pages of our website can be assembled into just one page. By scrupulously following the result of the clustering, we can design our website well and be sure that the reader will not get lost within it.
The plot of the connected components versus their size[2] shows that many communities contain only two pages. More than 10 communities contain more than five pages. The table below gives the frequency of the community sizes.
Number of pages within the community:   1   2   3   4   5   6   7   8   9  10  11  12  27
Number of communities:                 87  41  26  11   5   4   3   5   2   2   2   1   1
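As a quick sanity check, the counts in the table should add up to 190 communities covering all 500 pages; a few lines of Python confirm this.

```python
# Community-size frequencies from the table above: size -> number of communities.
freq = {1: 87, 2: 41, 3: 26, 4: 11, 5: 5, 6: 4, 7: 3,
        8: 5, 9: 2, 10: 2, 11: 2, 12: 1, 27: 1}

total_communities = sum(freq.values())                       # should be 190
total_pages = sum(size * count for size, count in freq.items())  # should be 500
print(total_communities, total_pages)  # 190 500
```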
The value of this technique is that it allows you to gather the information of 500 pages into only 190 pages: the site shrinks to 38% of its original size, a reduction of about 62%. The user can then go straight to the information he needs on the website.
The main difference between this post and the previous one is that we obtain more communities. In addition, more pages that were initially linked to others end up alone. We think that these pages can be removed from the website or should be analyzed with more attention.
Bibliography
Clauset, A., Newman, M. E. J. and Moore, C. (2004). Finding community structure in very large networks. Physical Review E 70, 066111. http://arxiv.org/abs/cond-mat/0408187
Newman, M. E. J. (2003). The structure and function of complex networks. SIAM Review 45, 167-256.
Pons, P. and Latapy, M. (2005). Computing communities in large networks using random walks. http://arxiv.org/abs/physics/0512106
Tang, J., Musolesi, M., Mascolo, C. and Latora, V. (2009). Temporal distance metrics for social network analysis. http://conferences.sigcomm.org/sigcomm/2009/workshops/wosn/papers/p31.pdf
van Dongen, S. (2000). Graph Clustering by Flow Simulation. PhD thesis, University of Utrecht, The Netherlands.
Watts, D. J. and Strogatz, S. H. (1998). Collective dynamics of "small-world" networks. Nature 393, 440-442.



