Marketing
department main duty is to maximize the company’s profit by minimizing the
advertising expense. Communicate efficiently at lower cost can be done if and
only if you understand very well your customers behaviours. The customer’s
behaviours can be characterised by the customer attitude beside the product or
service. In addition, customer’s behaviours can be affected by the effect of
their environment. Generally, people has risk-averse attitude mainly when they
want to buy new product. So, they seem to be influenced by their neighbours or
friends when they want to take important decision. In this post, we are going
to show how your company can exploit the influence of the customer environment
to minimize the advertising cost. This concept is also call the viral marketing.
The viral marketing consists to target people with high connectivity in the
communities and advertise beside them. The probability that this targeted
people brings client to the company is high than the probability of those with
low connectivity. Here our understanding does not focus only on a snapshot of
the customer’s communities but on the whole time period
How does
analytics can help you target this people with large connection? What is the
structure of our communities over the time? In other words, how our communities
evolve over the time? Traditional statistics cannot help us to answer easily
these questions. That is the raison why we focus on graph mining techniques.
First of all,
we present the definition of some graph concepts and secondly we explore the
graph technique with its application on the dataset of phone call over ten
days.
The graph is
characterized by its form and the relationships among its elements. We analyze
the topology of our graph by focus principally on it cohesiveness. Cohesiveness
indicates the presence of strong relationships among network’s member and also
the likelihood of their access to the same information in few time intervals.
So, more our graph is cohered more our marketing communication can spread
rapidly through the social communities. There are many metrics but we will
describe here only useful one which fit with our marketing domain.
• The distance between two
vertices is the minimum number of edges connecting them (in case of a weighted
graph, the distance is the path whose total weight is the lowest possible). According
to the definition, the information costs more if this distance is large between
members.
• Neighbors (or first neighbors)
of a vertex are all the other vertices directly connected with it; the neighborhood
of a vertex is the subgraph containing it and all its first neighbors. In our
graph, nodes with more neighbors can be targeted as the cornerstone which we
can focus our communicating plan and probably be sure the information will
affect many people with less cost.
• The clustering coefficient
determines the transitivity in the graph (H.Strogate, 1998), i.e. in the
language of social network, the friend of a friend is more likely to be also my
friend. Watts and Strogatz introduced the local clustering coefficient
.
Assuming
that, for vertices with degree 0 or 1,
for which both numerator and denominator are zero,
. According to this
definition, the local clustering
is the proportion of links among the vertices
within its neighborhood divided by the number of links that could possibly
exist between them. This local clustering is also formulated as the fraction of
links that exist between the neighbors
of the vertex
, over the total possible
numbers of edges
. (John Tang, 2009)
We find that, the local clustering (in most graphs) decreases when the
degree of the node increases. Moreover, we have
(Newman, 2003). In our case study
the clustering coefficient is zero.
• The closeness of a vertex
is the reciprocal of the sum of geodesic distances to all other vertices in the
graph; it can be interpreted as a measure of how long it will take information
to spread from a given vertex to others in the network.
• The betweenness is a
centrality measure of a vertex within a graph. It is normally calculated as the
fraction of shortest paths between node pairs that pass through the one of
interest. In social sciences, the betweenness is a measure of the importance or
of the influence of an actor in a group.
• The degree
of a vertex is the number of edges that connects it to other vertices (if
the graph is directed, the degree can be distinguished in in-degree and out-degree,
meaning the number of arcs arriving to or departing from the vertex).
• The connected
subgraph is a set of vertices where for every pair of the nodes in the set
there exists a path connecting them. The plot of the number of the component
over the number of size can give an idea about the distribution of the
connected subgraph.
Recall that all these metrics will be analyzed in the context of
dynamism.
The data we collect is from the cell phone on Isla del Sueño. This data covers ten-day period in June 2006 and it narrow to about 400 unique cell phones during the period.
The data we collect is from the cell phone on Isla del Sueño. This data covers ten-day period in June 2006 and it narrow to about 400 unique cell phones during the period.
The data set includes
records with the following fields:
From: Identifier
for the calling phone
To: Identifier for the receiving
phone
Datetime: Format for date and time is: yyyymmdd hhmm
Duration: Duration of the call in seconds
Cell Tower: Location
of the call origination cell towe
The number of user over the ten days
and the number of degree are opposite. In few words, the degree metric helps to
capture the activity of node within the graph. In this case we have in average
five incoming/outgoing calling between callers per day. Although we have fewer
callers in the eighth day phone call, the call activity is at the top; we find
that the average degree of the graph is around 5.5.
Recall that the local clustering
coefficient measures the probability that the third person calls the first one
knowing that the second calls the third. By averaging this metric, we have the
clustering coefficient. We observe this statistic on our graphs over the ten
days and find that it is very low. This let’s conclude that our graph is less
dense over the ten days. This is confirmed by the high values of betweeness and
edges betweeness. In practice, it can mean that those guys know less each other
in other hands our population can be more heterogeneous.
There is many
graphs’ clustering algorithm for example Markov Cluster Algorithm (MCL
algorithm) (Stijn, 2000), fast greedy
algorithm (A Clauset, 2004), random walk
algorithm (Pascal Pons,
Dec 2005)
and so one. We choose to implement MCL algorithm in this paper because of it is
easy and fast to compute.
a. MCL algorithm description
We decide to
present just guideline of the algorithm and invite people for more detail to
follow links (http://micans.org/mcl/,
or read (Stijn, 2000)).
Here is the algorithm
Step1: Input is an un-directed graph, power parameter
e, and inflation parameter r. we choose the parameter r=2 and e=2 for our case
study.
Step2: Create the associated matrix; we compute the
adjacent matrix of the graph
Step3: Add self loops to each node (optional); this
step has been skipped.
Step4: Normalize the adjacent matrix; the normalization
consists to divide each element in the adjacent matrix column by the sum of the
column.
Step5: Expand by taking the eth power of the matrix: we
multiply the matrix with power two
Step6: Inflate by taking inflation of the resulting
matrix with parameter r. We multiply each element of the expanded matrix by
power two and normalize each column.
Step7: Repeat steps 5 and 6 until a steady state is
reached (convergence).
Step8: Interpret resulting matrix to discover clusters.
We compute
the MCL algorithm with the software R by using a powerful package “igraph”. The
R igraph package has some useful functions on graph operation, graph plotting,
graph attribution, etc. The package can be downloaded on R-cran package[1].
b. Interpretation
Recall that
the connected subgraph is a set
of vertices where for every pair of the nodes in the set there exists a path
connecting them. Over the ten days of call phone, we have up to 22 nodes in one
community. The number of nodes into the community varies between 12 and 22 and
differs significantly from day to day. This proves the heterogeneity of our
population.
The first
column in the table below shows the visualization of the graph over the ten
days and the clustering algorithm result applied on it. We have 17 communities
with more than five members. Many vertices (more than 25% of communities) are
alone when we applied the clustering algorithm. The can explain the
heterogeneity of the graph. We decide now to observe the graph on the first day
call; it is in the second column in the following table. On the first day we
have people who calls or receives 45 calls. They are more 11 over the 374
person in the graph who has more than 15 incoming and or out-going call. In
addition, we have 34% of the population receive more than 5 incoming and or
out-going call. When we applied clustering algorithm on it we have 12
communities with more than 5members.
|
| |
First day call
|
Second day call
|
Analyze the
community over the time involve to know how many new vertices we have between
two times or how many vertices leave the community. The previous graph shows
our community with 22 vertices in the first day but in the second day we have
two new vertices (green) and one vertex (red) leaves the community.
In this
section, we analyze over the ten days call the evolution of our communities we
detected previously. There are specific indicators that help to catch the
vertices activity in the dynamic graph. We will present here four metrics. They
are the vertex activity within the community
, the edge activity within the community:
, the vertex and edge stationarity
respectively
.
One defines
the vertex activity
of a connected
subgraph
to measure
the activities of the community
.
(Gergely
Palla, 2007),
where
is the
number of the common vertices (members) in the community at time
and time
; and
is the
number of vertices in the union of
and
.
Similarly, one
defines the edge activity as within the community:
,
is the
number of new edges appear in the community in step
corresponds
to the maximum of edges in the community in two consecutives step.
The vertex
and edge activities measure the fraction of relative overlap between two steps
and
.
Finally, one
defines the vertex and edge stationarity
respectively
as an
average of vertex and edge activity during the time. There are
and
denotes
the birth of the community,
is the
last step before the extinction of the community.
and
allow
appreciating the level of variation of each community.
Let’s
consider the huge connected subgraph in the time period 1 we want to understand
the structure of this community over the ten days. We compute the vertex
activity
and the
vertex stationarity
of this
connected subgraph. The plot below shows the users activity of our community. The
values are very low so our community lost more than 50% of its members over the
ten days call.
The figure shows how
the vertices activity was significantly equal over the nine others day phone
call. This proofs that we can build a sophisticate marketing strategy on each
community. We can focus only on each community headquarter and be sure that our
marketing message will reach others in the same community.
The post is to
show how graph mining can help company to reduce their marketing department
budget. This can be done by targeting customer with more relation into the social
network. We apply the method on ten days phone call graph. To learn about the
community over the whole period, we define a vertices activity metric and find
that it is significantly equal over the time. This stability help drive the
conclusion that the community element change rarely. So, good marketing
strategy can be designed on each community headquarter and be sure this program
can be drive over the ten days.
We note also that the number of user varies over the
time and on the eight day the number of user was the lowest but the call
activity is high.
Note that all this analysis is based only on two
variables, the vertices features are not considered what is the impact of this
features on the community structure?
Bibliographie
A Clauset MEJ Newman, C Moore Finding
community structure in very large networks [Revue] // Cornell University
/ éd. 10.1103/PhysRevE.70.066111. - [s.l.] :
http://arxiv.org/abs/condmat/0408187, 2004. - 066111 : Vol.
70. - E 70, 066111 .
H.Strogate D.J. Watts and Collective dynamics of
“small-world” network [Livre] / éd. Nature. - 1998. - Vol. 393
: pp. pp440-442.
John Tang Mirco Musolesi, Cecilia Mascolo, Vito Latora Temporal
Distance Metrics for Social Network Analysis [Revue]. - [s.l.] :
http://conferences.sigcomm.org/sigcomm/2009/workshops/wosn/papers/p31.pdf,
2009.
Gergely Palla, Albert-László
Barabás and Tamás Vicsek Quantifying social group evolution [Revue] //
Nature. - Budapest : [s.n.], 2007. - 446. - pp. 664-667.
Newman M. E. J. The structure and function
of complex networks [Revue] // SIAM Review. - 2003. - pp.
168-240.
Pascal Pons Matthieu Latapy Computing communities in
large networks using random walks [Revue]. - Cornell University :
Cornell University library, http://arxiv.org/abs/physics/0512106, Dec 2005.
Stijn Van Dongen Graph Clustering by Flow
Simulation [Report] : PhD Thesis. - Netherlands : University
of Utrecht, The, 2000.



No comments:
Post a Comment