Wednesday, January 25, 2012

HOW CAN YOU OPTIMIZE MY WEBSITE?


Introduction
Today, many companies have thousands or even billions of links on their websites. Some of these links are never accessed or are useless to users, yet they remain on the website for a long time. This morning, I decided to run an experiment by browsing IBM's website. The experiment went as follows: I started on the home page (ibm.com), clicked an arbitrary link to reach a new page, clicked an arbitrary link on that page to reach another, and so on. At every step there was a chance of finding a new link and thus a new page. The interesting remark is that it was not easy to come back to my initial page. It can be entertaining to follow link after link and page after page, but is it necessary for the reader? Is it easy for readers to stay with the idea that brought them to the website while browsing through successive links and pages? Notable drawbacks of having thousands or billions of links are the difficulty of navigating and the difficulty of reaching the information you are looking for in one click. In my opinion, a website should be like a novel: the reader can pass through the chapters without losing the main idea and without spending a lot of time.
The aim of this post is to show how you can use analytics techniques to optimize your website and avoid endless link clicking. The principal technique is graph mining, that is, analytics on a large graph. A graph is a set of objects in which some pairs of objects are connected by links. Euler introduced graph theory in the 18th century, and it has become more and more used in many fields of research such as chemistry, physics, economics, etc.



Dataset:
Since it is not easy to access the dataset of a real website, we simulate a dataset on a virtual website. A node (or vertex) is one of the website's pages and an edge is a link used to move from one page to another. We use the Barabási model to simulate a random weighted graph with 500 nodes and 499 edges. The weight of an edge is the probability of moving from one page to another: the probability of visiting a page B coming from A is (1/NL)*(NB/NA), where NL is the total number of links on page A, NB is the number of readers who visit B coming from A, and NA is the number of visitors of page A. We decide to penalize pages that have no links: their probability of leading to a new page is 0. The probability of being on the home page of the website is one.
We draw below a sample of the weighted graph. Note that the width of an edge depends on its weight.
Figure 1: Example of graph
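For readers who want to reproduce this, here is a minimal sketch in R with the igraph package. The graph model and sizes come from this post; the uniform random weights are only an illustrative stand-in for the transition probabilities defined above.

library(igraph)

set.seed(2012)
g <- barabasi.game(500, directed = FALSE)   # 500 nodes, 499 edges (a tree)
E(g)$weight <- runif(ecount(g))             # placeholder for (1/NL)*(NB/NA)

# Edge width proportional to its weight, as in Figure 1
plot(g, vertex.size = 2, vertex.label = NA,
     edge.width = 3 * E(g)$weight)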
When we study a graph, there are some useful statistics such as node degree, degree distribution, clustering coefficient and connected components.
Suppose we want to know how many pages can be reached from page "A", or how many pages point to page "A"; the node degree gives us this statistic. The degree of a vertex is the number of edges incident on it. In a directed graph, we distinguish the out-degree, the number of edges pointing away from the node, from the in-degree, the number of edges pointing towards it. For an undirected graph, the in-degree and out-degree of every node coincide.
In addition, the degree distribution is the probability distribution of the node degrees. In our simulated graph the maximum degree is 58: if we denote by "A" the page with degree 58, this means that page "A" has 58 incoming and outgoing links in total. The minimum degree is 1, so every page has at least one incoming or outgoing link. On average, each page of our website has two incoming or outgoing links.
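These statistics are one line each in R with igraph (using the simulated graph g from above):

deg <- degree(g)
max(deg)    # maximum degree (58 in the simulation described above)
min(deg)    # minimum degree (1: every page has at least one link)
mean(deg)   # average degree, 2*499/500, i.e. about 2

dd <- degree.distribution(g)   # relative frequency of each degree value
plot(0:max(deg), dd, type = "h", xlab = "Degree", ylab = "Relative frequency")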

The clustering coefficient measures the transitivity of the graph (Watts and Strogatz, 1998); in the language of social networks, a friend of a friend is likely to be my friend too. Watts and Strogatz introduced the local clustering coefficient C_v = (number of triangles connected to vertex v) / (number of triples centered on vertex v), with the convention that C_v = 0 for vertices of degree 0 or 1, for which both numerator and denominator are zero. In other words, the local clustering coefficient C_v is the number of links among the vertices in the neighborhood of v divided by the number of links that could possibly exist between them; if v has k neighbors, this maximum is k(k-1)/2. (John Tang, 2009)
In most graphs, the local clustering coefficient decreases as the degree of the node increases; C_v tends to behave like 1/d_v (Newman, 2003). In our case study the clustering coefficient is zero, which is expected: a Barabási graph with 500 nodes and 499 edges is a tree and therefore contains no triangles.
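In igraph, the local and global clustering coefficients are given by the transitivity function:

transitivity(g, type = "local")    # C_v for every vertex (NaN for degree < 2)
transitivity(g, type = "global")   # 0 here: a tree contains no triangles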
A connected subgraph (or component) is a set of vertices such that for every pair of nodes in the set there exists a path connecting them. Plotting the number of components against component size gives an idea of the distribution of the connected subgraphs; such a figure can show, for example, that the component sizes follow a power-law distribution, like the degree distribution.
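The connected components and their sizes can be computed as follows:

comp <- clusters(g)
comp$no            # number of connected subgraphs
table(comp$csize)  # frequency of each component size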

II. Clustering implementation
There are many graph clustering algorithms, for example the Markov Cluster Algorithm (MCL) (Stijn, 2000), the fast greedy algorithm (A Clauset, 2004), the random walk algorithm (Pascal Pons, Dec 2005), and so on. We choose to implement the MCL algorithm in this post because it is easy and fast to compute.
We present only an outline of the algorithm and invite the reader to follow the links for more detail (http://micans.org/mcl/, or read (Stijn, 2000)). Here is the algorithm:
Step 1: The input is an undirected graph, a power parameter e, and an inflation parameter r. We choose r = 2 and e = 2 for our case study.
Step 2: Create the associated matrix; we compute the adjacency matrix of the graph.
Step 3: Add self-loops to each node (optional); this step has been skipped.
Step 4: Normalize the adjacency matrix by dividing each element of a column by the sum of that column.
Step 5: Expand by taking the e-th power of the matrix; with e = 2 we square the matrix.
Step 6: Inflate by raising each element of the expanded matrix to the power r (here r = 2) and normalizing each column again.
Step 7: Repeat steps 5 and 6 until a steady state is reached (convergence).
Step 8: Interpret the resulting matrix to discover the clusters.
We run the MCL algorithm in R using the powerful igraph package. The igraph package offers useful functions for graph operations, graph plotting, graph attributes, etc., and can be downloaded from CRAN[1].
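For illustration, here is a minimal MCL sketch in R that follows the steps above. It works directly on the adjacency matrix and is not the official implementation from http://micans.org/mcl/; the interpretation of Step 8 (grouping columns by their attractor row) is a simplification.

library(igraph)

mcl_sketch <- function(g, e = 2, r = 2, max_iter = 100, tol = 1e-6) {
  M <- as.matrix(get.adjacency(g))                        # Step 2: adjacency matrix
  # note: adding self-loops (Step 3) usually helps convergence; skipped as in the post
  normalize <- function(M) sweep(M, 2, colSums(M), "/")   # Step 4: column-normalize
  M <- normalize(M)
  for (i in seq_len(max_iter)) {
    Mexp <- M
    for (k in seq_len(e - 1)) Mexp <- Mexp %*% M          # Step 5: expansion (e-th power)
    Minf <- normalize(Mexp ^ r)                           # Step 6: inflation
    if (max(abs(Minf - M)) < tol) break                   # Step 7: convergence test
    M <- Minf
  }
  # Step 8 (simplified): columns attracted by the same row form one community
  attractor <- apply(M, 2, which.max)
  split(seq_len(vcount(g)), attractor)
}

communities <- mcl_sketch(g)
length(communities)                  # number of communities found
table(sapply(communities, length))   # frequency of each community size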

Figure 4: MCL clustering result.
The algorithm finds 190 communities. This means that the five hundred pages can be gathered into only 190 pages. One community contains 27 nodes, meaning that 27 pages of our website can be assembled into a single page. By scrupulously following the result of the clustering, we can design our website well and be sure that the reader will not get lost within it.
The plot of the number of communities versus their size[2] shows that many communities contain two pages, and more than 10 communities contain more than five pages. The table below gives the frequency of each community size.

Number of pages within the community:   1   2   3   4   5   6   7   8   9  10  11  12  27
Number of communities:                 87  41  26  11   5   4   3   5   2   2   2   1   1
The value of this technique is that it allows you to gather the information of 500 pages into only 190 pages, a reduction of about 62% (the site shrinks to 38% of its original number of pages). Users can go straight to the information they need on the website.
The main difference between this post and the previous one is that we obtain more communities. In addition, some pages that were initially linked to others have become isolated; we think these pages can be removed from the website or should be analyzed with more attention.



Bibliography

A Clauset, M. E. J. Newman, and C. Moore. Finding community structure in very large networks. Physical Review E 70, 066111, 2004. http://arxiv.org/abs/cond-mat/0408187
D. J. Watts and S. H. Strogatz. Collective dynamics of "small-world" networks. Nature 393, pp. 440-442, 1998.
John Tang, Mirco Musolesi, Cecilia Mascolo, and Vito Latora. Temporal Distance Metrics for Social Network Analysis. WOSN 2009. http://conferences.sigcomm.org/sigcomm/2009/workshops/wosn/papers/p31.pdf
M. E. J. Newman. The structure and function of complex networks. SIAM Review 45, pp. 167-256, 2003.
Pascal Pons and Matthieu Latapy. Computing communities in large networks using random walks. Dec 2005. http://arxiv.org/abs/physics/0512106
Stijn van Dongen. Graph Clustering by Flow Simulation. PhD Thesis, University of Utrecht, The Netherlands, 2000.



Tuesday, January 17, 2012

LOGISTIC REGRESSION IN THE MARKETING DEPARTMENT



Suppose the marketing department wants to launch a new product (see picture) on the market. After selecting the design of the product and deciding on its taste in the laboratory, how can we be sure that the product will be accepted by consumers? Machine learning can help you make a good decision about the taste and the design of the product.
We take a sample of consumers who taste the product and give it a score out of 100 for both taste and design. At the end of the test they decide whether or not the product should be sold (Y = 1 if admitted, 0 otherwise).
Sample of the population

Tasting Score   Design Score   Y
34.62365962     78.02469282    0
30.28671077     43.89499752    0
35.84740877     72.90219803    0
60.18259939     86.3085521     1
79.03273605     75.34437644    1
45.08327748     56.31637178    0
61.10666454     96.51142588    1
75.02474557     46.55401354    1
76.0987867      87.42056972    1
84.43281996     43.53339331    1
95.86155507     38.22527806    0
75.01365839     30.60326323    0
82.30705337     76.4819633     1
69.36458876     97.71869196    1
39.53833914     76.03681085    0
53.97105215     89.20735014    1
                                               Source: Adapted from Stanford ML Course
Data visualization
A first look at the data shows that, in general, a product with a design score above 50 and a tasting score above 50 is admitted.
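A minimal plotting sketch in R, assuming the table above has been saved as a CSV file product_scores.csv with columns taste, design and y (file name and column names are hypothetical):

scores <- read.csv("product_scores.csv")
plot(scores$taste, scores$design,
     col = ifelse(scores$y == 1, "blue", "red"),
     pch = ifelse(scores$y == 1, 19, 4),
     xlab = "Tasting score", ylab = "Design score")
legend("bottomright", legend = c("Admitted", "Not admitted"),
       col = c("blue", "red"), pch = c(19, 4))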


Logistic regression is one of the most useful supervised techniques for classification problems. We use it here to answer the question.
The coefficients of the model are Theta_0 = -25.16, Theta_1 = 0.206, and Theta_2 = 0.201; the positive slopes show that both variables (tasting score and design score) positively affect the admission of the product.
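In R, the model can be fitted with glm; the fitted coefficients play the role of Theta_0, Theta_1 and Theta_2 (same hypothetical data frame as above):

fit <- glm(y ~ taste + design, data = scores, family = binomial)
coef(fit)  # intercept (Theta_0) and the two slopes (Theta_1, Theta_2)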
Misclassification matrix

                        Predicted: Not admitted   Predicted: Admitted   Total
Actual: Not admitted              35                       5              40
Actual: Admitted                   5                      55              60
Total                             40                      60             100

The model misclassifies 8% of the admitted products (5/60) and 12.5% of the not admitted products (5/40). Globally, the error rate is 10% ((5+5)/100) and the overall accuracy rate is 90% ((35+55)/100): the model classifies 90% of the products correctly.
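The same arithmetic in R, rebuilding the matrix above:

conf <- matrix(c(35,  5,
                  5, 55), nrow = 2, byrow = TRUE,
               dimnames = list(actual = c("Not admitted", "Admitted"),
                               predicted = c("Not admitted", "Admitted")))
sum(diag(conf)) / sum(conf)      # overall accuracy: 0.90
1 - sum(diag(conf)) / sum(conf)  # error rate: 0.10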

The decision boundary of the model is drawn in the following figure. All points above the boundary line are predicted to be admitted and the others are predicted to be not admitted. The model produces five false positives (negative points that the model predicts as positive) and five false negatives (positive points that the model predicts as negative), consistent with the misclassification matrix above.

Applying the model to a product with a tasting score of 50 and a design score of 90 gives an admission probability of 96.38%.
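As a quick check, the probability is the logistic function of the linear score built from the reported coefficients (the small gap from 96.38% comes from rounding the coefficients):

theta <- c(-25.16, 0.206, 0.201)
z <- theta[1] + theta[2] * 50 + theta[3] * 90
1 / (1 + exp(-z))  # about 0.96

# Or directly from the hypothetical fitted model above
predict(fit, newdata = data.frame(taste = 50, design = 90), type = "response")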

Thursday, January 12, 2012

HOW CAN I USE LINEAR REGRESSION IN MY REAL LIFE?



Sometimes in life we have a big investment to make, and it is worth spending a few hours investigating instead of relying on a commercial advisor whose aim is to earn money from you, whether because of your ignorance or because you are in a hurry to invest. Suppose you are selling your house and you want to know what a good market price would be. The aim of this article is to open your mind about what you can do to save money (as a buyer) or earn money (as a seller) when you decide to buy or sell a house.

First step: collect information on recent houses sold in your region.
Size of the house (square feet)   Number of bedrooms   Price (US$)
2104                              3                    399900
1600                              3                    329900
2400                              3                    369000
1416                              2                    232000
3000                              4                    539900
1985                              4                    299900
1534                              3                    314900
1427                              3                    198999
1380                              3                    212000
1494                              3                    242500
1940                              4                    239999
Source: housing prices in Portland, Oregon
The first column is the size of the house (in square feet), the second column is the number of bedrooms, and the third column is the price of the house.

Dataset visualization.

Second step: download the Octave software and install it on your laptop or desktop.

Third step: download the program, modify its parameters, and execute it in Octave.
The parameters of the program are the location of your raw data file, the number of bedrooms in your house, and its size.
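The original program is written in Octave; for readers who prefer R (used in the first post), here is a minimal equivalent sketch, assuming the table above is saved as houses.csv with columns size, bedrooms and price (file and column names are hypothetical):

houses <- read.csv("houses.csv")
fit <- lm(price ~ size + bedrooms, data = houses)
summary(fit)  # fitted coefficients

# Predicted price for a 1650 square-foot house with 3 bedrooms
# (the post's figures below come from the full Portland dataset,
# so this small sample will give a somewhat different number)
predict(fit, newdata = data.frame(size = 1650, bedrooms = 3))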

Last step: draw your conclusion.
The model predicts the price of a house of 1650 square feet with 3 bedrooms to be US$293,081.4634. Meanwhile, for a house of 1650 square feet with 4 bedrooms, the predicted price is US$284,343.445.

If you need the code of the program, you can contact me.