Marketing department main duty is to maximize the company’s profit by minimizing the advertising expense. Communicate efficiently at lower cost can be done if and only if you understand very well your customers behaviours. The customer’s behaviours can be characterised by the customer attitude beside the product or service. In addition, customer’s behaviours can be affected by the effect of their environment. Generally, people has risk-averse attitude mainly when they want to buy new product. So, they seem to be influenced by their neighbours or friends when they want to take important decision. In this post, we are going to show how your company can exploit the influence of the customer environment to minimize the advertising cost. This concept is also call the viral marketing. The viral marketing consists to target people with high connectivity in the communities and advertise beside them. The probability that this targeted people brings client to the company is high than the probability of those with low connectivity. Here our understanding does not focus only on a snapshot of the customer’s communities but on the whole time period

How does analytics can help you target this people with large connection? What is the structure of our communities over the time? In other words, how our communities evolve over the time? Traditional statistics cannot help us to answer easily these questions. That is the raison why we focus on graph mining techniques.

First of all, we present the definition of some graph concepts and secondly we explore the graph technique with its application on the dataset of phone call over ten days.

The graph is characterized by its form and the relationships among its elements. We analyze the topology of our graph by focus principally on it cohesiveness. Cohesiveness indicates the presence of strong relationships among network’s member and also the likelihood of their access to the same information in few time intervals. So, more our graph is cohered more our marketing communication can spread rapidly through the social communities. There are many metrics but we will describe here only useful one which fit with our marketing domain.

• The distance between two vertices is the minimum number of edges connecting them (in case of a weighted graph, the distance is the path whose total weight is the lowest possible). According to the definition, the information costs more if this distance is large between members.

• Neighbors (or first neighbors) of a vertex are all the other vertices directly connected with it; the neighborhood of a vertex is the subgraph containing it and all its first neighbors. In our graph, nodes with more neighbors can be targeted as the cornerstone which we can focus our communicating plan and probably be sure the information will affect many people with less cost.

• The clustering coefficient determines the transitivity in the graph (H.Strogate, 1998), i.e. in the language of social network, the friend of a friend is more likely to be also my friend. Watts and Strogatz introduced the local clustering coefficient . Assuming that, for vertices with degree 0 or 1, for which both numerator and denominator are zero, . According to this definition, the local clustering is the proportion of links among the vertices within its neighborhood divided by the number of links that could possibly exist between them. This local clustering is also formulated as the fraction of links that exist between the neighbors of the vertex , over the total possible numbers of edges . (John Tang, 2009)

We find that, the local clustering (in most graphs) decreases when the degree of the node increases. Moreover, we have (Newman, 2003). In our case study the clustering coefficient is zero.

• The closeness of a vertex is the reciprocal of the sum of geodesic distances to all other vertices in the graph; it can be interpreted as a measure of how long it will take information to spread from a given vertex to others in the network.

• The betweenness is a centrality measure of a vertex within a graph. It is normally calculated as the fraction of shortest paths between node pairs that pass through the one of interest. In social sciences, the betweenness is a measure of the importance or of the influence of an actor in a group.

• The degree of a vertex is the number of edges that connects it to other vertices (if the graph is directed, the degree can be distinguished in in-degree and out-degree, meaning the number of arcs arriving to or departing from the vertex).

• The connected subgraph is a set of vertices where for every pair of the nodes in the set there exists a path connecting them. The plot of the number of the component over the number of size can give an idea about the distribution of the connected subgraph.

Recall that all these metrics will be analyzed in the context of dynamism.

II. Data collection

The data we collect is from the cell phone on Isla del Sueño. This data covers ten-day period in June 2006 and it narrow to about 400 unique cell phones during the period.

The data set includes records with the following fields:

From: Identifier for the calling phone

To: Identifier for the receiving phone

Datetime: Format for date and time is: yyyymmdd hhmm

Duration: Duration of the call in seconds

Cell Tower: Location of the call origination cell towe

III. Data visualization

In this section, we plot the statistical metrics we define in the previous section over the ten days call phone. According to these graphics we note that the number of callers (nodes) decreases in the eighth day but in other hand the number of call increase that day. The number of calls differs significantly from day to day and varies between 900 and 1500 calls over the ten days. The particularity of our graph structure over the ten days is that, more we have nodes less we have call among them.

The number of user over the ten days and the number of degree are opposite. In few words, the degree metric helps to capture the activity of node within the graph. In this case we have in average five incoming/outgoing calling between callers per day. Although we have fewer callers in the eighth day phone call, the call activity is at the top; we find that the average degree of the graph is around 5.5.

Recall that the local clustering coefficient measures the probability that the third person calls the first one knowing that the second calls the third. By averaging this metric, we have the clustering coefficient. We observe this statistic on our graphs over the ten days and find that it is very low. This let’s conclude that our graph is less dense over the ten days. This is confirmed by the high values of betweeness and edges betweeness. In practice, it can mean that those guys know less each other in other hands our population can be more heterogeneous.

IV. Clustering implementation

There is many graphs’ clustering algorithm for example Markov Cluster Algorithm (MCL algorithm) (Stijn, 2000), fast greedy algorithm (A Clauset, 2004), random walk algorithm (Pascal Pons, Dec 2005) and so one. We choose to implement MCL algorithm in this paper because of it is easy and fast to compute.

a. MCL algorithm description

We decide to present just guideline of the algorithm and invite people for more detail to follow links (http://micans.org/mcl/, or read (Stijn, 2000)). Here is the algorithm

Step1: Input is an un-directed graph, power parameter e, and inflation parameter r. we choose the parameter r=2 and e=2 for our case study.

Step2: Create the associated matrix; we compute the adjacent matrix of the graph

Step3: Add self loops to each node (optional); this step has been skipped.

Step4: Normalize the adjacent matrix; the normalization consists to divide each element in the adjacent matrix column by the sum of the column.

Step5: Expand by taking the eth power of the matrix: we multiply the matrix with power two

Step6: Inflate by taking inflation of the resulting matrix with parameter r. We multiply each element of the expanded matrix by power two and normalize each column.

Step7: Repeat steps 5 and 6 until a steady state is reached (convergence).

Step8: Interpret resulting matrix to discover clusters.

We compute the MCL algorithm with the software R by using a powerful package “igraph”. The R igraph package has some useful functions on graph operation, graph plotting, graph attribution, etc. The package can be downloaded on R-cran package[1].

b. Interpretation

Recall that the connected subgraph is a set of vertices where for every pair of the nodes in the set there exists a path connecting them. Over the ten days of call phone, we have up to 22 nodes in one community. The number of nodes into the community varies between 12 and 22 and differs significantly from day to day. This proves the heterogeneity of our population.

The first column in the table below shows the visualization of the graph over the ten days and the clustering algorithm result applied on it. We have 17 communities with more than five members. Many vertices (more than 25% of communities) are alone when we applied the clustering algorithm. The can explain the heterogeneity of the graph. We decide now to observe the graph on the first day call; it is in the second column in the following table. On the first day we have people who calls or receives 45 calls. They are more 11 over the 374 person in the graph who has more than 15 incoming and or out-going call. In addition, we have 34% of the population receive more than 5 incoming and or out-going call. When we applied clustering algorithm on it we have 12 communities with more than 5members.

c. Communities evolution


First day call	Second day call

Analyze the community over the time involve to know how many new vertices we have between two times or how many vertices leave the community. The previous graph shows our community with 22 vertices in the first day but in the second day we have two new vertices (green) and one vertex (red) leaves the community.

In this section, we analyze over the ten days call the evolution of our communities we detected previously. There are specific indicators that help to catch the vertices activity in the dynamic graph. We will present here four metrics. They are the vertex activity within the community , the edge activity within the community: , the vertex and edge stationarity respectively .

One defines the vertex activity of a connected subgraph to measure the activities of the community . (Gergely Palla, 2007), where is the number of the common vertices (members) in the community at time and time ; and is the number of vertices in the union of and .

Similarly, one defines the edge activity as within the community: ,

is the number of new edges appear in the community in step corresponds to the maximum of edges in the community in two consecutives step.

The vertex and edge activities measure the fraction of relative overlap between two steps and .

Finally, one defines the vertex and edge stationarity respectively as an average of vertex and edge activity during the time. There are and

denotes the birth of the community, is the last step before the extinction of the community. and allow appreciating the level of variation of each community.

Let’s consider the huge connected subgraph in the time period 1 we want to understand the structure of this community over the ten days. We compute the vertex activity and the vertex stationarity of this connected subgraph. The plot below shows the users activity of our community. The values are very low so our community lost more than 50% of its members over the ten days call.

The figure shows how the vertices activity was significantly equal over the nine others day phone call. This proofs that we can build a sophisticate marketing strategy on each community. We can focus only on each community headquarter and be sure that our marketing message will reach others in the same community.

V. Conclusion and future work

The post is to show how graph mining can help company to reduce their marketing department budget. This can be done by targeting customer with more relation into the social network. We apply the method on ten days phone call graph. To learn about the community over the whole period, we define a vertices activity metric and find that it is significantly equal over the time. This stability help drive the conclusion that the community element change rarely. So, good marketing strategy can be designed on each community headquarter and be sure this program can be drive over the ten days.

We note also that the number of user varies over the time and on the eight day the number of user was the lowest but the call activity is high.

Note that all this analysis is based only on two variables, the vertices features are not considered what is the impact of this features on the community structure?

Bibliographie

A Clauset MEJ Newman, C Moore Finding community structure in very large networks [Revue] // Cornell University / éd. 10.1103/PhysRevE.70.066111. - [s.l.] : http://arxiv.org/abs/condmat/0408187, 2004. - 066111 : Vol. 70. - E 70, 066111 .

H.Strogate D.J. Watts and Collective dynamics of “small-world” network [Livre] / éd. Nature. - 1998. - Vol. 393 : pp. pp440-442.

John Tang Mirco Musolesi, Cecilia Mascolo, Vito Latora Temporal Distance Metrics for Social Network Analysis [Revue]. - [s.l.] : http://conferences.sigcomm.org/sigcomm/2009/workshops/wosn/papers/p31.pdf, 2009.

Gergely Palla, Albert-László Barabás and Tamás Vicsek Quantifying social group evolution [Revue] // Nature. - Budapest : [s.n.], 2007. - 446. - pp. 664-667.

Newman M. E. J. The structure and function of complex networks [Revue] // SIAM Review. - 2003. - pp. 168-240.

Pascal Pons Matthieu Latapy Computing communities in large networks using random walks [Revue]. - Cornell University : Cornell University library, http://arxiv.org/abs/physics/0512106, Dec 2005.

Stijn Van Dongen Graph Clustering by Flow Simulation [Report] : PhD Thesis. - Netherlands : University of Utrecht, The, 2000.

[1] http://cran.r-project.org/web/packages/igraph/index.html

Data analytics group

Wednesday, July 25, 2012

LEARN MORE ABOUT YOUR CUSTOMERS COMMUNITIES

Bibliographie