Wednesday, January 25, 2012

HOW CAN YOU OPTIMIZE MY WEBSITE?


Introduction
Today, many companies have thousands or even billions of links on their websites. Some of these links are never accessed or are useless to users, yet they remain on the website for a long time. This morning, I decided to run an experiment by browsing IBM's website. The experiment went as follows: I started on the home page (ibm.com), clicked an arbitrary link to reach a new page, clicked an arbitrary link on that page to reach another, and so on. At every step there was a chance of finding a new link and thus a new page. The interesting remark is that it was not easy to come back to my initial page. It can be entertaining to follow link after link and page after page, but is it necessary for the reader? Is it easy for readers to stay with the idea that brought them to the website while browsing through successive links and pages? Notable drawbacks of having thousands or billions of links are the difficulty of navigating and the difficulty of reaching the information you are looking for in one click. In my opinion, a website should be like a novel: the reader can pass through the chapters without losing the main idea and without spending a lot of time.
The aim of this post is to show how you can use analytics techniques to optimize your website and avoid endless link clicking. The principal technique is graph mining, that is, analytics on a large graph. A graph is a set of objects in which some pairs of objects are connected by links. Euler introduced graph theory in the 18th century, and it has become more and more used in many fields of research such as chemistry, physics, economics, etc.



Dataset:
Since it is not easy to access the dataset of a real website, we simulate a dataset on a virtual website. A node (or vertex) is one of the website's pages and an edge is a link used to move from one page to another. We use the Barabási model to simulate a random weighted graph with 500 nodes and 499 edges. The weight of an edge is the probability of moving from one page to another: the probability of visiting a page B coming from A is (1/NL)*(NB/NA), where NL is the total number of links on page A, NB is the number of readers who visit B coming from A, and NA is the number of visitors of page A. We decide to penalize pages that have no links: their probability of leading to a new page is 0. The probability of being on the home page of the website is one.
We draw below a sample of the weighted graph. Note that the width of an edge depends on its weight.
Figure 1: Example of graph
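For readers who want to reproduce this, here is a minimal sketch in R with the igraph package. The graph model and sizes come from this post; the uniform random weights are only an illustrative stand-in for the transition probabilities defined above.

library(igraph)

set.seed(2012)
g <- barabasi.game(500, directed = FALSE)   # 500 nodes, 499 edges (a tree)
E(g)$weight <- runif(ecount(g))             # placeholder for (1/NL)*(NB/NA)

# Edge width proportional to its weight, as in Figure 1
plot(g, vertex.size = 2, vertex.label = NA,
     edge.width = 3 * E(g)$weight)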
When we study a graph, there are some useful statistics such as node degree, degree distribution, clustering coefficient and connected components.
Suppose we want to know how many pages can be reached from page "A", or how many pages point to page "A"; the node degree gives us this statistic. The degree of a vertex is the number of edges incident on it. In a directed graph, we distinguish the out-degree, the number of edges pointing away from the node, from the in-degree, the number of edges pointing towards it. For an undirected graph, the in-degree and out-degree of every node coincide.
In addition, the degree distribution is the probability distribution of the node degrees. In our simulated graph the maximum degree is 58: if we denote by "A" the page with degree 58, this means that page "A" has 58 incoming and outgoing links in total. The minimum degree is 1, so every page has at least one incoming or outgoing link. On average, each page of our website has two incoming or outgoing links.
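These statistics are one line each in R with igraph (using the simulated graph g from above):

deg <- degree(g)
max(deg)    # maximum degree (58 in the simulation described above)
min(deg)    # minimum degree (1: every page has at least one link)
mean(deg)   # average degree, 2*499/500, i.e. about 2

dd <- degree.distribution(g)   # relative frequency of each degree value
plot(0:max(deg), dd, type = "h", xlab = "Degree", ylab = "Relative frequency")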

The clustering coefficient measures the transitivity of the graph (Watts and Strogatz, 1998); in the language of social networks, a friend of a friend is likely to be my friend too. Watts and Strogatz introduced the local clustering coefficient C_v = (number of triangles connected to vertex v) / (number of triples centered on vertex v), with the convention that C_v = 0 for vertices of degree 0 or 1, for which both numerator and denominator are zero. In other words, the local clustering coefficient C_v is the number of links among the vertices in the neighborhood of v divided by the number of links that could possibly exist between them; if v has k neighbors, this maximum is k(k-1)/2. (John Tang, 2009)
In most graphs, the local clustering coefficient decreases as the degree of the node increases; C_v tends to behave like 1/d_v (Newman, 2003). In our case study the clustering coefficient is zero, which is expected: a Barabási graph with 500 nodes and 499 edges is a tree and therefore contains no triangles.
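In igraph, the local and global clustering coefficients are given by the transitivity function:

transitivity(g, type = "local")    # C_v for every vertex (NaN for degree < 2)
transitivity(g, type = "global")   # 0 here: a tree contains no triangles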
A connected subgraph (or component) is a set of vertices such that for every pair of nodes in the set there exists a path connecting them. Plotting the number of components against component size gives an idea of the distribution of the connected subgraphs; such a figure can show, for example, that the component sizes follow a power-law distribution, like the degree distribution.
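The connected components and their sizes can be computed as follows:

comp <- clusters(g)
comp$no            # number of connected subgraphs
table(comp$csize)  # frequency of each component size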

II. Clustering implementation
There are many graph clustering algorithms, for example the Markov Cluster Algorithm (MCL) (Stijn, 2000), the fast greedy algorithm (A Clauset, 2004), the random walk algorithm (Pascal Pons, Dec 2005), and so on. We choose to implement the MCL algorithm in this post because it is easy and fast to compute.
We present only an outline of the algorithm and invite the reader to follow the links for more detail (http://micans.org/mcl/, or read (Stijn, 2000)). Here is the algorithm:
Step 1: The input is an undirected graph, a power parameter e, and an inflation parameter r. We choose r = 2 and e = 2 for our case study.
Step 2: Create the associated matrix; we compute the adjacency matrix of the graph.
Step 3: Add self-loops to each node (optional); this step has been skipped.
Step 4: Normalize the adjacency matrix by dividing each element of a column by the sum of that column.
Step 5: Expand by taking the e-th power of the matrix; with e = 2 we square the matrix.
Step 6: Inflate by raising each element of the expanded matrix to the power r (here r = 2) and normalizing each column again.
Step 7: Repeat steps 5 and 6 until a steady state is reached (convergence).
Step 8: Interpret the resulting matrix to discover the clusters.
We run the MCL algorithm in R using the powerful igraph package. The igraph package offers useful functions for graph operations, graph plotting, graph attributes, etc., and can be downloaded from CRAN[1].
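For illustration, here is a minimal MCL sketch in R that follows the steps above. It works directly on the adjacency matrix and is not the official implementation from http://micans.org/mcl/; the interpretation of Step 8 (grouping columns by their attractor row) is a simplification.

library(igraph)

mcl_sketch <- function(g, e = 2, r = 2, max_iter = 100, tol = 1e-6) {
  M <- as.matrix(get.adjacency(g))                        # Step 2: adjacency matrix
  # note: adding self-loops (Step 3) usually helps convergence; skipped as in the post
  normalize <- function(M) sweep(M, 2, colSums(M), "/")   # Step 4: column-normalize
  M <- normalize(M)
  for (i in seq_len(max_iter)) {
    Mexp <- M
    for (k in seq_len(e - 1)) Mexp <- Mexp %*% M          # Step 5: expansion (e-th power)
    Minf <- normalize(Mexp ^ r)                           # Step 6: inflation
    if (max(abs(Minf - M)) < tol) break                   # Step 7: convergence test
    M <- Minf
  }
  # Step 8 (simplified): columns attracted by the same row form one community
  attractor <- apply(M, 2, which.max)
  split(seq_len(vcount(g)), attractor)
}

communities <- mcl_sketch(g)
length(communities)                  # number of communities found
table(sapply(communities, length))   # frequency of each community size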

Figure 4: MCL clustering result.
The algorithm finds 190 communities. This means that the five hundred pages can be gathered into only 190 pages. One community contains 27 nodes, meaning that 27 pages of our website can be assembled into a single page. By scrupulously following the result of the clustering, we can design our website well and be sure that the reader will not get lost within it.
The plot of the number of communities versus their size[2] shows that many communities contain two pages, and more than 10 communities contain more than five pages. The table below gives the frequency of each community size.

Number of pages within the community:   1   2   3   4   5   6   7   8   9  10  11  12  27
Number of communities:                 87  41  26  11   5   4   3   5   2   2   2   1   1
The value of this technique is that it allows you to gather the information of 500 pages into only 190 pages, a reduction of about 62% (the site shrinks to 38% of its original number of pages). Users can go straight to the information they need on the website.
The main difference between this post and the previous one is that we obtain more communities. In addition, some pages that were initially linked to others have become isolated; we think these pages can be removed from the website or should be analyzed with more attention.



Bibliography

A Clauset, M. E. J. Newman, and C. Moore. Finding community structure in very large networks. Physical Review E 70, 066111, 2004. http://arxiv.org/abs/cond-mat/0408187
D. J. Watts and S. H. Strogatz. Collective dynamics of "small-world" networks. Nature 393, pp. 440-442, 1998.
John Tang, Mirco Musolesi, Cecilia Mascolo, and Vito Latora. Temporal Distance Metrics for Social Network Analysis. WOSN 2009. http://conferences.sigcomm.org/sigcomm/2009/workshops/wosn/papers/p31.pdf
M. E. J. Newman. The structure and function of complex networks. SIAM Review 45, pp. 167-256, 2003.
Pascal Pons and Matthieu Latapy. Computing communities in large networks using random walks. Dec 2005. http://arxiv.org/abs/physics/0512106
Stijn van Dongen. Graph Clustering by Flow Simulation. PhD Thesis, University of Utrecht, The Netherlands, 2000.



Tuesday, January 17, 2012

LOGISTIC REGRESSION IN THE MARKETING DEPARTMENT



Suppose the marketing department wants to launch a new product (see picture) on the market. After selecting the design of the product and deciding on its taste in the laboratory, how can we be sure that the product will be accepted by consumers? Machine learning can help you make a good decision about the taste and the design of the product.
We take a sample of consumers who taste the product and give it a score out of 100 for both taste and design. At the end of the test they decide whether or not the product should be sold (Y = 1 if admitted, 0 otherwise).
Sample of the population

Tasting Score   Design Score   Y
34.62365962     78.02469282    0
30.28671077     43.89499752    0
35.84740877     72.90219803    0
60.18259939     86.3085521     1
79.03273605     75.34437644    1
45.08327748     56.31637178    0
61.10666454     96.51142588    1
75.02474557     46.55401354    1
76.0987867      87.42056972    1
84.43281996     43.53339331    1
95.86155507     38.22527806    0
75.01365839     30.60326323    0
82.30705337     76.4819633     1
69.36458876     97.71869196    1
39.53833914     76.03681085    0
53.97105215     89.20735014    1
                                               Source: Adapted from Stanford ML Course
Data visualization
A first look at the data shows that, in general, a product with a design score above 50 and a tasting score above 50 is admitted.
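A minimal plotting sketch in R, assuming the table above has been saved as a CSV file product_scores.csv with columns taste, design and y (file name and column names are hypothetical):

scores <- read.csv("product_scores.csv")
plot(scores$taste, scores$design,
     col = ifelse(scores$y == 1, "blue", "red"),
     pch = ifelse(scores$y == 1, 19, 4),
     xlab = "Tasting score", ylab = "Design score")
legend("bottomright", legend = c("Admitted", "Not admitted"),
       col = c("blue", "red"), pch = c(19, 4))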


Logistic regression is one of the most useful supervised techniques for classification problems. We use it here to answer the question.
The coefficients of the model are Theta_0 = -25.16, Theta_1 = 0.206, and Theta_2 = 0.201; the positive slopes show that both variables (tasting score and design score) positively affect the admission of the product.
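In R, the model can be fitted with glm; the fitted coefficients play the role of Theta_0, Theta_1 and Theta_2 (same hypothetical data frame as above):

fit <- glm(y ~ taste + design, data = scores, family = binomial)
coef(fit)  # intercept (Theta_0) and the two slopes (Theta_1, Theta_2)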
Misclassification matrix

                        Predicted: Not admitted   Predicted: Admitted   Total
Actual: Not admitted              35                       5              40
Actual: Admitted                   5                      55              60
Total                             40                      60             100

The model misclassifies 8% of the admitted products (5/60) and 12.5% of the not admitted products (5/40). Globally, the error rate is 10% ((5+5)/100) and the overall accuracy rate is 90% ((35+55)/100): the model classifies 90% of the products correctly.
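The same arithmetic in R, rebuilding the matrix above:

conf <- matrix(c(35,  5,
                  5, 55), nrow = 2, byrow = TRUE,
               dimnames = list(actual = c("Not admitted", "Admitted"),
                               predicted = c("Not admitted", "Admitted")))
sum(diag(conf)) / sum(conf)      # overall accuracy: 0.90
1 - sum(diag(conf)) / sum(conf)  # error rate: 0.10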

The decision boundary of the model is drawn in the following figure. All points above the boundary line are predicted to be admitted and the others are predicted to be not admitted. The model produces five false positives (negative points that the model predicts as positive) and five false negatives (positive points that the model predicts as negative), consistent with the misclassification matrix above.

Applying the model to a product with a tasting score of 50 and a design score of 90 gives an admission probability of 96.38%.
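As a quick check, the probability is the logistic function of the linear score built from the reported coefficients (the small gap from 96.38% comes from rounding the coefficients):

theta <- c(-25.16, 0.206, 0.201)
z <- theta[1] + theta[2] * 50 + theta[3] * 90
1 / (1 + exp(-z))  # about 0.96

# Or directly from the hypothetical fitted model above
predict(fit, newdata = data.frame(taste = 50, design = 90), type = "response")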

Thursday, January 12, 2012

HOW CAN I USE LINEAR REGRESSION IN MY REAL LIFE?



Sometimes in life we have a big investment to make, and it is worth spending a few hours investigating instead of relying on a commercial advisor whose aim is to earn money from you, whether because of your ignorance or because you are in a hurry to invest. Suppose you are selling your house and you want to know what a good market price would be. The aim of this article is to open your mind about what you can do to save money (as a buyer) or earn money (as a seller) when you decide to buy or sell a house.

First step: collect information on recent houses sold in your region.
Size of the house (square feet)   Number of bedrooms   Price (US$)
2104                              3                    399900
1600                              3                    329900
2400                              3                    369000
1416                              2                    232000
3000                              4                    539900
1985                              4                    299900
1534                              3                    314900
1427                              3                    198999
1380                              3                    212000
1494                              3                    242500
1940                              4                    239999
Source: housing prices in Portland, Oregon
The first column is the size of the house (in square feet), the second column is the number of bedrooms, and the third column is the price of the house.

Dataset visualization.

Second step: download the Octave software and install it on your laptop or desktop.

Third step: download the program, modify its parameters, and execute it in Octave.
The parameters of the program are the location of your raw data file, the number of bedrooms in your house, and its size.
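The original program is written in Octave; for readers who prefer R (used in the first post), here is a minimal equivalent sketch, assuming the table above is saved as houses.csv with columns size, bedrooms and price (file and column names are hypothetical):

houses <- read.csv("houses.csv")
fit <- lm(price ~ size + bedrooms, data = houses)
summary(fit)  # fitted coefficients

# Predicted price for a 1650 square-foot house with 3 bedrooms
# (the post's figures below come from the full Portland dataset,
# so this small sample will give a somewhat different number)
predict(fit, newdata = data.frame(size = 1650, bedrooms = 3))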

Last step: draw your conclusion.
The model predicts the price of a house of 1650 square feet with 3 bedrooms to be US$293,081.4634. Meanwhile, for a house of 1650 square feet with 4 bedrooms, the predicted price is US$284,343.445.

If you need the code of the program, you can contact me.