Towards A Better Measure of Business Proximity:
Topic Modeling for Industry Intelligence
July 31, 2015
Abstract
In this article, we propose a new data-analytic approach to measure firms’ dyadic
business proximity. Specifically, our method analyzes the unstructured texts that
describe firms’ businesses using the statistical learning technique of topic modeling, and
constructs a novel business proximity measure based on the output. When compared
with existent methods, our approach is scalable for large datasets and provides finer
granularity on quantifying firms’ positions in the spaces of product, market, and
technology. We then validate our business proximity measure in the context of industry
intelligence and show the measure’s effectiveness in an empirical application of
analyzing mergers and acquisitions in the U.S. high technology industry. Based on the
research, we also build a cloud-based information system to facilitate competitive
intelligence on the high technology industry.
Keywords: Big Data analytics, business proximity, topic modeling, industry in-
telligence, information system
1 Introduction
Business proximity measures firms’ relatedness in the spaces of product, market, and
technology, which is an important concept in industry intelligence and also a central building
block in many studies of firm strategy and industrial organization. Not surprisingly, prior
studies in different management disciplines have used or developed a handful of measures
of business proximity. One common practice has been to classify firms into industries (or
sub-industries) and to operationalize business proximity as a binary variable that indicates
common industry (or sub-industry) membership. Under this definition, two firms’ businesses
are either identical or completely different. A refined extension of the binary definition has
been to better utilize the hierarchical information provided by some industry classification
system, such as Standard Industrial Classification (SIC) or North American Industry
Classification System (NAICS). For example, in Wang and Zajac (2007), the similarity of two
firms’ businesses was determined by the number of common consecutive digits in their
industry classification codes under NAICS. Since they used the first four digits in NAICS, the
similarity quantity was one of five possible values: 0.00, 0.25, 0.50, 0.75, or 1.00. However,
this measure is still discrete, and the level of granularity it can achieve is constrained by the
industry classification system on which it depends. There are several other measures that
were aimed at some specific aspect of firms’ businesses, and they typically had stronger data
requirements. Stuart (1998), Mowery et al. (1998), and others constructed a “technological
overlap” measure using data on firms’ patent holdings. The closeness of a pair of firms was
assumed to be proportional to the number of common antecedent patents cited. While this
is an elegant, continuous measure in the technology space, it requires complete data on firms’
patent portfolios and does not explicitly cover the product and market spaces. Mitsuhashi
and Greve (2009) applied the Jaccard distance on firms’ customer geographic regions in
measuring “market complementarity.” Likewise, this measure focuses only on the (geographic)
market space and requires all relevant firms’ customer geography data to be available.
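To make the two classical measures above concrete, the following is a minimal sketch in Python. It assumes NAICS codes are given as digit strings and customer regions as sets; the function names and the example codes are illustrative choices and are not taken from the cited studies.

```python
def naics_similarity(code_a: str, code_b: str, digits: int = 4) -> float:
    """Share of the first `digits` digits that two codes have in common,
    counting only consecutive matches from the left."""
    common = 0
    for a, b in zip(code_a[:digits], code_b[:digits]):
        if a != b:
            break
        common += 1
    return common / digits


def jaccard_distance(regions_a: set, regions_b: set) -> float:
    """1 minus the Jaccard index of two sets of customer regions."""
    union = regions_a | regions_b
    if not union:
        return 0.0  # convention chosen here for two empty sets
    return 1.0 - len(regions_a & regions_b) / len(union)


# Illustrative inputs only.
sim = naics_similarity("511210", "511110")          # first three digits match
dist = jaccard_distance({"US", "CA"}, {"US", "MX"})  # one shared region out of three
```

With four digits the similarity can only take the five values 0.00, 0.25, 0.50, 0.75, and 1.00, which is the granularity limit discussed above.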
While these measures have served the researchers’ purposes well, we see an opportunity
for a new and more general methodology in light of the increasing availability of public,
unstructured data and recent advances in Big Data analytics. In this paper, we propose a
method that requires little manual preprocessing yet provides finer granularity on
quantifying firms’ positions in the spaces of product, market, and technology. Utilizing a
statistical learning technique called topic modeling (Blei 2012), we analyze the publicly
available, unstructured texts that describe firms’ businesses. Our automatic approach, the
core of which is a Latent Dirichlet Allocation (LDA) algorithm, represents each firm’s textual
description as a probabilistic distribution over a set of underlying topics, which we interpret
as aspects of its business. The data-analytic framework thus greatly reduces the complexity
of representing the business environment, and produces structured information that enables
further examination and derivation. The new business proximity measure is then naturally
constructed by quantifying the “distance” between a pair of firms’ topic distributions.
An important advantage of our method for measuring business proximity is that it
imposes a much weaker requirement on structured data than the existing measures. This
makes our approach particularly appealing when the firms under study are small and
privately held, for which detailed information on industry classification, patent holding, and
product/customer is either highly sparse or not available at all. Motivated by this advantage,
we choose the U.S. high technology (high-tech) industry as the empirical context to
demonstrate our approach. We collect data from CrunchBase, an open and comprehensive source
for high-tech startup activity. For the majority of companies in our dataset, the standardized
industry classification code is unavailable, and due to various strategic reasons, most do
not disclose their customer information and key intellectual property, so the conventional
methods for measuring business proximity cannot be operationalized. Using this dataset as
an example, we detail the procedure of our data-analytic approach, and compute business
proximity for each pair of the companies. We then show the validity and effectiveness of
the new measure in the context of industry intelligence by (1) examining the relationships
between business proximity and simple category classification, between business proximity
and job mobility, and between business proximity and investment respectively, and (2) using
the measure in a novel empirical application of modeling the matching of companies in
mergers and acquisitions (M&As). Our comprehensive, continuous measure enables the
analysis to show the nuanced relationship between M&A transactions and the firms’ business
similarity and complementarity. Methodologically, to recognize the increasingly networked
business environment as well as to accommodate the relational nature of the matching data,
we employ an innovative statistical framework called Exponential Random Graph Models
(ERGMs) in the M&A analysis.
This research joins the rapidly growing stream of information systems literature that
leverages newly developed data science techniques in examining Big Data for business
analytics (e.g., Adomavicius and Tuzhilin 2005, Shmueli and Koppius 2011, Chen et al. 2012,
Chiang et al. 2012, Ghose et al. 2012, Shi et al. 2014, Xu et al. 2014). Our research shows
how Big Data analytics can potentially transform competitive intelligence, particularly for
the high-tech industry, where recent years have seen an “entrepreneurial boom”
characterized by the explosion of digital startups. Such explosion has made it ever more difficult
to rely purely on individuals’ industry knowledge to depict the rapidly changing landscape of
the startup world. Our empirical analysis demonstrates the potential of extracting
economically meaningful information from publicly available, unstructured data through large-scale
computation as well as the value of the proposed business proximity measure as an important
metric in the analytics of M&A matching and as a search tool for navigating the networked
startup world. To further illuminate the practical implication of our data-analytic
framework, we build an information system that allows managers and analysts to use business
proximity to explore the competitive landscape of the U.S. high-tech industry. The back
end of our system handles data collection, storage, and large-scale computation using a Big
Data computation platform (Condor), NoSQL database technology (MongoDB), and various
programming languages (Python, Scala). The front end of the system is hosted on Google’s
Cloud Platform and provides users an easy-to-use web interface. It can be accessed at
http://146.6.99.242/bizprox.
We organize the remainder of this paper as follows. To provide a context for describing
the data-analytic method, we first introduce our dataset in Section 2. In Section 3, we
elaborate the procedure for constructing our business proximity measure. In Section 4, we
demonstrate the validity and effectiveness of our measure. We describe the information system
implementation in Section 5. Lastly, we discuss and conclude the paper.
2 Data
The dataset for demonstrating our methodology was collected from CrunchBase.[1]
CrunchBase is an open and free database of high-tech companies, people, and investors. Regarded
as the Wikipedia of the high-tech industry, it provides a comprehensive view of the “startup
world.” CrunchBase keeps track of the industry by automatically retrieving and extracting
information from professionally edited news articles on technology-focused websites.[2]
In addition, ordinary users can contribute to CrunchBase in a crowdsourcing manner. For quality
assurance, each update is reviewed by moderators. Existing data points are also constantly
reviewed by the editors. Compared with other high-tech-focused data vendors, CrunchBase
has the advantage of more complete coverage on early-stage startups, especially those not
(yet) funded by venture capitalists.

[1] http://www.crunchbase.com.
[2] For example, http://www.allthingsd.com, http://www.techcrunch.com,
and http://www.businessinsider.com.
Data collection was carried out between April 2013 and April 2015. The companies and
their information were collected at the beginning of the period. We limit our dataset to the
U.S.-based companies and exclude those for which some basic information (e.g., founding
date, business description) is missing. We further exclude companies that had already been
acquired as of April 2013. The resultant dataset contains 24,382 companies, the vast majority
of which are privately held, early-stage startups that are unclassified under SIC or NAICS.
As of April 2013, 345 of the companies (1.41%) in the dataset were publicly traded, and the
median age of the whole sample was 5.66 years. For each company, we also observe its
headquarters location, industry sector (CrunchBase-defined category), (co)founders, board
members, key employees, angel and venture investors that participated in each of its funding
rounds, acquisitions, investments, and a business description. Confirming the common
knowledge about the high-tech industry, we observe considerable geographic clustering.
Figure 1(a) visualizes the spatial distribution of the companies using the headquarters-location
data aggregated at the city level. The circles are centered at the cities and their radii
are proportional to the number of companies. The major high-tech hub cities include New
York City (8.08% of the companies), San Francisco (7.92%), Los Angeles (2.17%), Chicago
(2.10%), Seattle (1.93%), Austin (1.84%), and Palo Alto (1.81%). At the state level, as
shown in Figure 2(a), California leads with 34.72% of the companies, followed by New York
(11.99%), Massachusetts (5.89%), and Texas (5.20%). We also observe a highly uneven
distribution of companies across the 19 industry sectors (CrunchBase-defined categories). The
leading sectors are “software” (19.23%) and “web” (17.13%), and the trailing sectors are
“semiconductor” (1.00%) and “legal” (0.73%), as shown in Figure 2(b). In the dataset, the
people’s profiles also contain their past professional experiences. The unstructured, textual
descriptions are mostly of short to moderate length, comprising one or more paragraphs on
the key facts about the companies’ products, markets, and technologies.
For the validation of the proposed method, we use three types of inter-firm interactions:
M&A (one firm acquires another), investment (one firm invests in another), and job
mobility (an individual changes jobs from one firm to another). We monitored these
activities continuously through April 2015. Our dataset includes a total of 1,689 M&A
transactions since 2008. Figure 1(b) geo-maps each of the M&A transactions using the
headquarters locations of the involved companies. A little less than two-thirds (62.59%) of the
deals are cross-state. A numerically similar portion of transactions (63.56%) is cross-sector.
The distribution of the number of transactions per company is highly skewed: the top 10 and
top 20 buyers made 14.32% and 21.23% of all the deals, respectively. Among these M&A
transactions, 394 (23.32%) occurred between April 2013 and April 2015. For investments, a
total of 531 transactions are recorded and the post-April-2013 number is 129 (24.29%).
Lastly, the job mobility data are computed based on position changes among the 24,334
people in the dataset. There are 19,697 company pairs connected by the job transitions in
total and 9,792 pairs (49.71%) by post-April-2013 activities.

Figure 1: Geo-mapping Company Locations and M&A Transactions
(panels: (a) Companies; (b) M&A Transactions)

Figure 2: Distribution of Companies over State and Industry Sector
(panels: (a) State, company count by state; (b) Industry Sector, company count by industry)
3 Measuring Business Proximity: Data-Analytic Framework
Business proximity measures firms’ closeness in the spaces of product, market, and
technology. Our objective is to develop a data-driven, analytics-based business proximity measure
to improve on scalability, classification granularity, and comprehensiveness. The input of our
method (an unstructured, textual business description for each firm) requires no manual
classification, and is also much more likely to be available than structured information such
as NAICS/SIC code or patent portfolio, especially for high-tech startups.
Our approach builds upon a text mining technique called topic modeling, a statistical
method that discovers abstract “topics” from a large collection of documents. At present,
the most common topic modeling algorithm is Latent Dirichlet Allocation (Blei et al. 2003).
LDA does not require manually labeling each document, so it is an unsupervised learning
algorithm. The underlying model of LDA is generative — the assumption is that each word
in each document is probabilistically drawn from the vocabulary of a topic discussed in that
document. Given a large collection of documents, the vocabularies of topics and the topics
of the documents are jointly estimated.
More formally, we let the number of input descriptions (i.e., the total number of
companies) be $D$, where each description $d \in \{1, 2, \ldots, D\}$ is a collection of words
$\{w^d_n \mid n = 1, 2, \ldots, N_d\}$. Let the total number of latent “topics” (business aspects)
expressed by the descriptions be $K$. Each topic $k \in \{1, 2, \ldots, K\}$ is a probabilistic
distribution over the whole vocabulary, i.e., the set of unique words in the description corpus.
This distribution is denoted $\phi^k$, where $\phi^k_w$ is the probability of word $w$ in topic $k$.
The topic proportions for description $d$ are $\theta^d$, where $\theta^d_k$ is the topic proportion
for topic $k$ in description $d$. Assume $z^d_n$ is the topic assignment of the $n$’th word in
description $d$. Then, given $\theta^d$ and $\phi^k$, the probability of observing description $d$ is

$$\prod_{n=1}^{N_d} \sum_{k=1}^{K} P(w^d_n \mid z^d_n = k, \phi^k)\, P(z^d_n = k \mid \theta^d)
= \prod_{n=1}^{N_d} \sum_{k=1}^{K} \phi^k_{w^d_n}\, \theta^d_k, \tag{1}$$

where the term inside the product operator is the probability of the $n$’th word in description
$d$ being $w^d_n$. LDA takes the Bayesian approach and is a complete generative model. It further
assumes Dirichlet priors for both $\theta$ and $\phi$, with hyperparameters $\alpha$ and $\beta$ respectively. Thus,
the generative process of LDA can be represented by the following joint distribution:

$$P(w, z, \theta, \phi \mid \alpha, \beta) = \prod_{k=1}^{K} P(\phi^k \mid \beta) \prod_{d=1}^{D} P(\theta^d \mid \alpha)
\prod_{n=1}^{N_d} P(w^d_n \mid z^d_n, \phi^k)\, P(z^d_n \mid \theta^d). \tag{2}$$

Having observed the descriptions, hence $w$, we compute the posterior distribution

$$P(z, \theta, \phi \mid \alpha, \beta, w) = \frac{P(w, z, \theta, \phi \mid \alpha, \beta)}{P(w \mid \alpha, \beta)}, \tag{3}$$

using Monte Carlo methods in Bayesian statistics. Finally, the estimates of $\theta$ and $\phi$ are
obtained by examining the posterior distribution.
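As a small numerical illustration of equation (1) and of the generative assumption behind it, the sketch below evaluates the probability of a toy description for fixed topic proportions and topic-word distributions, and then draws words the way the model assumes they arise. The two topics, the three-word vocabulary, and all numbers are hypothetical; the Dirichlet priors and the posterior inference step are deliberately left out, so this is not an LDA estimator.

```python
import random


def description_probability(words, theta_d, phi):
    """Equation (1): product over words of sum_k phi[k][word] * theta_d[k]."""
    prob = 1.0
    for w in words:
        prob *= sum(phi_k[w] * theta_dk for phi_k, theta_dk in zip(phi, theta_d))
    return prob


def sample_description(n_words, theta_d, phi, rng):
    """Draw each word by first drawing a topic z from theta_d,
    then drawing a word from that topic's distribution phi[z]."""
    words = []
    for _ in range(n_words):
        z = rng.choices(range(len(theta_d)), weights=theta_d)[0]
        vocab = list(phi[z])
        words.append(rng.choices(vocab, weights=[phi[z][v] for v in vocab])[0])
    return words


# Hypothetical topic-word distributions (each sums to 1 over the vocabulary).
phi = [
    {"cloud": 0.5, "storage": 0.4, "music": 0.1},  # topic 1
    {"cloud": 0.1, "storage": 0.1, "music": 0.8},  # topic 2
]
theta_d = [0.75, 0.25]  # hypothetical topic proportions of one description
description = ["cloud", "storage", "cloud"]

p = description_probability(description, theta_d, phi)
# Per-word terms: cloud 0.4, storage 0.325, cloud 0.4, so p = 0.052.

simulated = sample_description(5, theta_d, phi, random.Random(0))
```

In practice neither the topic proportions nor the topic-word distributions are known; they are what the posterior in equation (3) is used to estimate.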
In summary, LDA is utilized in the data-analytic framework to analyze the textual
descriptions of the firms. Each description is a document, and all the descriptions together
are the input of LDA. The algorithm produces $K$ topics ($K$ is a parameter specified by the
researcher), each of which is represented by a probabilistic distribution over the set of words.
In addition, LDA computes the topic distribution for each company description. For each
company, a probability value, or weight, is assigned to each discovered topic and the values
sum up to 1. Essentially, through topic modeling, a company $i$’s description is represented
by a topic distribution $T_i = \{T_{i,1}, T_{i,2}, \ldots, T_{i,K}\}$, where $T_{i,k}$ is the weight on the
$k$-th topic and $\sum_{k=1}^{K} T_{i,k} = 1$.
We interpret the discovered topics as the different components of the companies’
businesses. If a particular $T_{i,k} = 0$, then component $k$ is irrelevant to company $i$’s business.
Finally, we define the business proximity $p_b(i, j)$ between two companies $i$ and $j$ as the
cosine similarity[3] of the two corresponding topic distributions $T_i$ and $T_j$, which can be
written as follows:

$$p_b(i, j) = \frac{T_i \cdot T_j}{\|T_i\|\,\|T_j\|}
= \frac{\sum_{k=1}^{K} T_{i,k}\, T_{j,k}}{\sqrt{\sum_{k=1}^{K} (T_{i,k})^2}\,\sqrt{\sum_{k=1}^{K} (T_{j,k})^2}}. \tag{4}$$

The resulting proximity values range between 0 and 1, where a bigger value indicates closer
proximity between the pair of companies. The measure equals 0 if and only if the two firms
have no common business component; the measure equals 1 if and only if the two firms share
exactly the same business components as well as the same weights.

[3] Cosine similarity is one measure of similarity between two distributions. We can apply other similarity
measures such as normalized Euclidean distance. We can also view each topic distribution as a set where
the elements are the topics with strictly positive probability, and then use set comparison metrics such as
Jaccard index and Dice’s coefficient. Our main results are robust to these alternative measures.
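Equation (4) can be sketched in a few lines of Python. The topic distributions below are hypothetical four-topic examples chosen so that the boundary cases (no shared component, identical distributions) are easy to check by hand; the function name is an illustrative choice.

```python
import math


def business_proximity(t_i, t_j):
    """Cosine similarity of two topic distributions, as in equation (4)."""
    dot = sum(a * b for a, b in zip(t_i, t_j))
    norm_i = math.sqrt(sum(a * a for a in t_i))
    norm_j = math.sqrt(sum(b * b for b in t_j))
    return dot / (norm_i * norm_j)


# Hypothetical topic distributions over K = 4 topics (each sums to 1).
t_a = [0.5, 0.5, 0.0, 0.0]
t_b = [0.0, 0.0, 0.5, 0.5]  # shares no topic with t_a
t_c = [0.5, 0.0, 0.5, 0.0]  # shares one topic with t_a

no_overlap = business_proximity(t_a, t_b)  # 0: no common business component
partial = business_proximity(t_a, t_c)     # 0.25 / 0.5 = 0.5
identical = business_proximity(t_a, t_a)   # 1 up to floating-point rounding
```

Because all weights are non-negative, the value cannot fall below 0, which is why the measure lies in [0, 1] rather than in [-1, 1] as for general vectors.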
Topic  Dimension           Top 5 Words
1      Product             video, music, digital, entertainment, artists
2      Product             news, site, blog, articles, publishing
3      Product             job, jobs, search, employers, career
4      Product             people, community, members, share, friends
30     Technology/Product  phone, email, text, voice, messaging
31     Technology/Product  wireless, networks, communications, internet, providers
32     Technology/Product  cloud, storage, hosting, server, servers
33     Technology/Product  app, apps, iphone, android, applications
38     Market              sales, customer, lead, email, leads
39     Market              solution, cost, costs, applications, enterprise

Table 1: LDA Results of CrunchBase Data (Partial)
Note: Only top five words are presented for brevity.
We carry out the proposed method on the CrunchBase dataset. We run the LDA model
and compute the corresponding business proximity for a set of different $K$ values: 50, 100,
200, and 500. The main results on coefficient signs and their statistical significance reported
in the empirical validation and application section are robust to the different choices. Due
to the page limit, we report the results for $K = 50$ in the main text. To illustrate that the
topic model results comprehensively capture multiple dimensions of a firm’s business, in
Table 1 we list 10 topics that LDA produces from our dataset. Note that each topic is a
distribution over all words in the vocabulary and that, for brevity, we only show the top five
words in terms of their probability. The full 50-topic list is shown in Table 9 in Appendix A.
We have checked all 50 topics and find that each topic consists of frequent words that are
tightly related to each other. We also observe that the topics capture the current trends in
the high-tech industry. Using the LDA results, we compute business proximity for all company
pairs in the dataset. Owing to the huge number of pairs (close to 300 million), we parallelize
the computation algorithm for speedy processing.
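The pair count quoted above follows directly from the dataset size, and the all-pairs computation splits naturally into independent chunks. The sketch below checks the arithmetic and shows one simple way to cut the pair list into work units. The chunking helper is a hypothetical illustration and is not the Condor-based implementation described in the paper.

```python
import math
from itertools import combinations, islice

n_companies = 24_382
n_pairs = math.comb(n_companies, 2)  # 24,382 * 24,381 / 2 = 297,228,771


def pair_chunks(ids, chunk_size):
    """Yield lists of unordered pairs, each small enough for one worker."""
    pairs = combinations(ids, 2)
    while True:
        chunk = list(islice(pairs, chunk_size))
        if not chunk:
            return
        yield chunk


# Tiny demonstration with 5 hypothetical company ids: 10 pairs in chunks of 4.
chunk_sizes = [len(chunk) for chunk in pair_chunks(range(5), 4)]
```

Each chunk only needs the topic distributions of the companies it mentions, so chunks can be scored in any order and on any number of workers.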
Our new data-driven approach for measuring business proximity has overcome many of
the limitations faced by the existing methods. First, the approach is scalable because the
construction of the business aspects and business proximity is automated, which is in sharp
contrast to the domain-expert-based industry classification in which manual annotation is
required as the first step. Second, our approach is generally applicable to a wide range of
firms (either public or private) as long as textual business descriptions exist for the firms. In
contrast, industry classification is only sparsely available for small companies and financial
filings data are only available for public companies. Note that only 1.41% of the high-tech
companies in our dataset are public, as discussed in Section 2. Third, our approach provides
finer granularity than the existing discrete similarity measures as the algorithm provides
continuous similarity measures. Fourth, the proposed method provides flexibility to cope
with dynamic industry changes. As the underlying business descriptions in the industry
change, the algorithm can automatically detect the emerging topics in the industry and
incorporate them into the business proximity.
4 Empirical Validation and Application
4.1 Validation
To validate the constructed business proximity measure, we first examine the relationship
between it and a simple category-based classification. Because the NAICS-based proximity
cannot be operationalized due to the data limitation (in fact, the CrunchBase companies are
already in a narrowly focused industry), we leverage the simple industry sector information,
i.e., the categories defined by CrunchBase (see Figure 2). We construct a binary indicator
for same-category membership, denoted category match, and let it serve as a benchmark
business proximity measure. We then compare the distributions of the proposed
analytics-based measure in two groups of company pairs: (i) company pairs in the same category
(category match = 1), and (ii) those belonging to different categories (category match = 0).
Figure 3 compares the business proximity values between the two groups. The lower and
upper hinges of the boxes indicate the first and third quartiles (the 25th and 75th percentiles).
The results show that the same-category company group (mean: 0.12) has a mean business
proximity value twice as large as the other (mean: 0.06). The Pearson’s correlation coefficient
between business proximity and category match is 0.11, with the t-statistic being 61.94 and
the p-value being smaller than 2.2e-16. The large t-statistic and low p-value indicate a
highly statistically significant positive correlation between the proposed business proximity
and the simple category-based classification.
Figure 3: Distributions of Business Proximity: Same- and Cross-Category Company Pairs
(box plot; x-axis: category_match (0, 1); y-axis: business proximity ([0, 1]))
Note: The lower and upper hinges of the boxes indicate the 25th and 75th percentiles.

For further validation, we test the predictive power of the proposed business proximity on
three types of inter-firm interactions: M&A, investment, and job mobility.[4] Operationally,
we compare the realized business proximity among four groups (M&A, invest, job mobility,
and random) of company pairs to test if the business proximity has a leading effect on the
corresponding inter-firm interactions. One caveat is that high business proximity values
could be the result of firm transactions. For instance, after an M&A transaction takes place,
it is very likely that the acquiring company’s business description will incorporate various
aspects of the acquired company. To avoid this reversal effect, we only consider the
inter-firm transactions after April 2013, which is the time when all the company descriptions
were collected. Our inter-firm interaction dataset contains 394 company pairs associated with
M&A transactions, 129 with inter-firm investments, and 9,792 with job mobility.[5] Lastly,
to construct the baseline, we randomly select company pairs from the whole dataset.

Figure 4: Distributions of Business Proximity: M&A, Investment, Job Mobility, and Random
Samples
(box plot; x-axis: group (M&A, invest, jobmob, random); y-axis: business proximity ([0, 1]))
Note: The lower and upper hinges of the boxes indicate the 25th and 75th percentiles.
Figure 4 compares the distribution of business proximity value among the company pairs
defined by M&A, investments, job mobility, and random selection. We find that the proposed
business proximity has higher values between company pairs connected by the three types
of inter-firm interactions than random pairs, thus indicating a positive association between
each of the transactions and the proximity. On average, the first three groups have more than
three times higher proximity than the randomly-paired group: M&A (0.293), investments
(0.224), job mobility (0.218), and random (0.068). Given the fact that M&A is a rare,
significant inter-firm transaction, it is intuitive to find that M&A-paired firms have higher
similarities than the other two interaction types (investments and job mobility).
4.2 Empirical Application on M&As
In this subsection, we demonstrate the business proximity measure’s value for empirical
modeling. Specifically, we apply it in analyzing high-tech M&As. Recognizing the increasingly
networked business environment,[6] we construct a network structure by incorporating firm
proximity in different dimensions, and then use a statistical network model to analyze their
interactions. Our objective is to examine the relationship between the likelihood of a pair
of firms’ matching in an M&A transaction and their individual and pairwise characteristics,
among which the newly developed business proximity is of our primary interest. We first
summarize the theoretical basis for the importance of business proximity as well as proximity
in three other dimensions in modeling M&As. Next, we introduce the statistical network
analysis method and explain our empirical specifications. Lastly, we present estimation
results.

[4] The rationale for choosing these interactions is the following: M&A is an important inter-firm transaction
that in theory creates business synergy (e.g., Rhodes-Kropf and Robinson 2008); inter-firm investments are
associated with technological or market overlaps (e.g., Mowery et al. 1998), and may lead to future M&A
transactions (Mikkelson and Ruback 1985); the labor economics literature found evidence that a significant
portion of the job moves involve companies that are in the same industry (e.g., Moscarini and Thomsson
2007, Fallick et al. 2006).
[5] For job mobility, if a person made a job transition from a company A to another company B, then we
consider A and B to be associated.
4.2.1 Proximity and M&A
The high-tech industry is characterized by active and rapid innovation, significant geographic
clustering (at a handful of high-tech hubs), rapid job mobility, high concentration of
ownership at the company level, and strong influence of angel and venture investors. We posit that
business proximity, geographic vicinity, social linkage, and common ownership are associated
with the likelihood of two firms’ matching in an M&A transaction.
Business Proximity
Business proximity measures firms’ relatedness in the spaces of product, market, and
technology. It has been widely recognized in the finance and management literature that the
potential synergy in products, markets, and technologies is a key driver for M&As (e.g.,
Rhodes-Kropf and Robinson 2008) and is especially important in high-tech acquisitions (e.g.,
Ahuja and Katila 2001). The central idea of business synergy is that economic surplus can
be created from novel recombination of the acquirer and target’s resources and
capabilities. Hence, one of the determinants for the matching of the acquirer and target should
be the recombination potential, which is in turn influenced by the relatedness of the two
firms’ products, markets, and technology. Therefore, we expect that business proximity is
associated with the M&A matching likelihood.
[6] See “Revolution in Progress: The Networked Economy,” MIT Technology Review Custom, August 27,
2014.
Geographic Proximity
Geographic or spatial proximity refers to the closeness of physical locations and it has been
shown to have a moderating effect in a diversity of financial transactions. In the M&A
domain, Erel et al. (2012) analyzed cross-border mergers to show that, among other factors,
geographic proximity increases the likelihood of mergers between two countries. At the firm
level, Chakrabarti and Mitchell (2013) found that chemical manufacturers prefer spatially
proximate acquisition targets. The main reasoning behind these findings is that information
propagation is subject to spatial distance; geographic proximity brings a higher level of
knowledge exchange and hence a lower level of information asymmetry. For the same reason,
we predict that geographic proximity is positively associated with the M&A likelihood.
We operationalize geographic proximity by measuring the great-circle distance[7] between
two companies’ headquarters. First, we translate the street address of each company’s
headquarters into its latitude ($\phi$) and longitude ($\lambda$) coordinates by using the Google Maps API.[8]
For companies whose full street address is missing, we use the city center as an approximation.
Next, we use the latitude and longitude coordinates to calculate the great-circle distance.
Specifically, let $(\phi_i, \lambda_i)$ and $(\phi_j, \lambda_j)$ be the coordinates for companies $i$ and $j$, and
$\Delta\lambda$ be the absolute difference in their longitudes. Then the geographic proximity $p_g(i, j)$
between companies $i$ and $j$ is defined as

$$p_g(i, j) = -R \arccos(\sin\phi_i \sin\phi_j + \cos\phi_i \cos\phi_j \cos\Delta\lambda), \tag{5}$$

where the constant $R$ is the sphere radius of the earth. The negative sign is to convert
distance to proximity.
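A minimal sketch of equation (5) in Python follows, assuming coordinates are given in decimal degrees and using a mean earth radius of 6,371 km. The unit and the radius value are assumptions made here; the paper only states that R is the sphere radius of the earth.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean radius; the unit choice is ours


def geographic_proximity(lat_i, lon_i, lat_j, lon_j, radius=EARTH_RADIUS_KM):
    """Negative great-circle distance between two points, as in equation (5)."""
    phi_i, phi_j = math.radians(lat_i), math.radians(lat_j)
    delta_lambda = math.radians(abs(lon_i - lon_j))
    cos_angle = (math.sin(phi_i) * math.sin(phi_j)
                 + math.cos(phi_i) * math.cos(phi_j) * math.cos(delta_lambda))
    # Guard against rounding slightly outside [-1, 1] before taking arccos.
    cos_angle = max(-1.0, min(1.0, cos_angle))
    return -radius * math.acos(cos_angle)


# Two points on the equator, 90 degrees of longitude apart: a quarter circle.
quarter = geographic_proximity(0.0, 0.0, 0.0, 90.0)
same_place = geographic_proximity(40.0, -74.0, 40.0, -74.0)
```

Closer pairs have values nearer to 0 and distant pairs have more negative values, so a larger value always means closer proximity, matching the direction of the other measures.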
Social Proximity
Social proximity of two firms is defined according to the social linkage between the
individuals associated with the two firms. Personal linkage is an important factor in coordinating
transactions and promoting private information exchange between business entities through
mutual trust and kinship (e.g., Hochberg et al. 2007, Cohen et al. 2008, Stuart and Yim
2010). We believe two factors about the high-tech industry greatly contribute to the
importance of personal linkage’s role in transmitting vital information across companies. First,
the U.S. high-tech industry, especially the startup sphere of it, is characterized by high job
mobility, which creates the paths and opportunities for private information flow (Fallick et
al. 2006). Second, early-stage digital startups are mostly very small in size; thus,
information about them is often scarce outside the teams’ social circles. Moreover, many startups
intentionally stay in a “stealth mode” before their products and technologies mature. Hence,
we argue that companies with closer social proximity are likely to be aware of each other’s
products and intellectual property, which would lead to a higher M&A probability.

[7] http://en.wikipedia.org/wiki/Great-circle_distance
[8] https://developers.google.com/maps/
We operationalize social proximity by using the “people” part of our dataset. For each
company, we observe the individuals who are or have previously been affiliated with it either
as a (co)founder, or as a board member, or as an employee. Let $S_i$ denote this set of
individuals for company $i$. Then we define the social proximity $p_s(i, j)$ between two companies
$i$ and $j$ as

$$p_s(i, j) = |S_i \cap S_j|, \tag{6}$$

i.e., the number of people who are identified as having experience at both companies.
Investor Proximity
Investor proximity is defined according to the common angel and venture investors who have funded the firms. In the high-tech industry, startups depend on external investments to support product development before they establish a stable cash flow. Compared with other types of investors, angel and venture investors often play a more active role in management and can be highly influential on strategic decisions (e.g., Amit et al. 1990, Gompers 1995), such as pursuing M&A opportunities. Hence, common early investors of two high-tech companies can form a critical information bridge, or even act as an initiator and enabler of collaboration between them, which we predict leads to a higher likelihood of M&A.
Our operationalization of investor proximity is methodologically similar to that of social proximity. Given two companies $i$ and $j$, their investor proximity $p_f(i, j)$ is defined as

$$p_f(i, j) = |I_i \cap I_j|, \qquad (7)$$

where $I_i$ and $I_j$ are the sets of investors who have funded companies $i$ and $j$, respectively, in any of the funding rounds.
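Because social and investor proximity are both defined as the size of a set intersection, one small helper covers equations (6) and (7). The sketch below is illustrative only: the helper name and the company, person, and investor identifiers are hypothetical and are not taken from the CrunchBase data.

```python
def overlap_proximity(group_i, group_j):
    """Number of members shared by two collections (duplicates are ignored)."""
    return len(set(group_i) & set(group_j))


# Hypothetical records, for illustration only.
people = {
    "firm_a": {"alice", "bob", "carol"},
    "firm_b": {"bob", "carol", "dave"},
}
investors = {
    "firm_a": {"fund_x", "angel_y"},
    "firm_b": {"fund_z"},
}

social = overlap_proximity(people["firm_a"], people["firm_b"])          # equation (6)
investor = overlap_proximity(investors["firm_a"], investors["firm_b"])  # equation (7)
```

Both measures are symmetric in the two companies and take non-negative integer values, which is why their distributions are discrete.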
Correlation Analysis
We explore the realizations of the business, geographic, social, and investor proximities in our CrunchBase dataset and analyze their correlations with the matching of M&A. Note that we compute all proximity measures using company data collected in April 2013 and only use the M&A transactions that occurred between April 2013 and April 2015 to avoid any possible reversal effect.
For each of the four proximity measures, we compare its distributions in two groups of company pairs: (1) M&A-matched company pairs and (2) randomly selected pairs. Figure 5 shows the empirical cumulative distribution functions (CDFs) of the four proximity measures. For the geographic dimension in panel (b), we plot distance rather than proximity for intuitiveness. Also note that the business and geographic proximity values are continuous, whereas the other two are discrete. In each subfigure, the red line represents the distribution for the group of company pairs defined by M&A transactions and the green line shows that of random pairs. For each proximity measure, we observe a distinction between the two lines, suggesting a dependency between the proximity measures and M&A transactions (the differences in the two lower subplots are visually less distinct because both social and investor proximity measures are discrete and have a large point mass at 0). Next, we appeal to a more rigorous statistical model for further analysis.
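The comparison of empirical CDFs described above can be sketched as follows. The helper names are hypothetical, and the summary statistic (the largest vertical gap between the two curves, i.e., the two-sample Kolmogorov-Smirnov statistic) is an addition for illustration; the analysis in the text relies on visual inspection of Figure 5.

```python
from bisect import bisect_right


def ecdf(sample):
    """Return the empirical CDF of `sample` as a function F(x) = share of values <= x."""
    ordered = sorted(sample)
    n = len(ordered)

    def cdf(x):
        return bisect_right(ordered, x) / n

    return cdf


def max_cdf_gap(sample_a, sample_b):
    """Largest vertical distance between the two empirical CDFs."""
    cdf_a, cdf_b = ecdf(sample_a), ecdf(sample_b)
    # The supremum is attained at one of the observed values.
    return max(abs(cdf_a(x) - cdf_b(x)) for x in set(sample_a) | set(sample_b))
```

For a discrete measure with a large point mass at 0, both curves jump to a high value at 0, which is why the two lines are hard to tell apart visually even when they differ.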
4.2.2 Statistical Model
Using statistical terminology, the matching of a pair of firms is a binary outcome: Either they are part of an M&A transaction or they are not. Thus it could be tempting to use binary response econometric models such as logistic regression for the empirical analysis. However, they are inappropriate in this context due to the relational nature of the data. For example, an M&A transaction between firms $i$ and $j$ and that between $i$ and $k$ (which would be two observations in a logistic regression) are correlated since they involve a common party, i.e., firm $i$. Hence, the key assumption of independent observations, which underlies the binary response econometric models, is clearly violated. So instead of treating the M&A transactions as independent observations, we model all of them together as a network.
Figure 5: Distributions of Proximity: M&A Sample vs. Random Sample. Panels: (a) Business, (b) Geographic, (c) Social, (d) Investor. Each panel shows the empirical CDF for the M&A group and the random group; the measured quantities are business proximity ([0, 1]), geographic distance (km), social proximity (# people), and investor proximity (# investors).
Note: In (b), we plot geographic distance rather than geographic proximity.

Exponential random graph models (ERGMs), also known as p* models, have been developed in statistical network analysis over the past three decades and have recently become perhaps the most important and popular class of statistical models of network structure (see Goldenberg et al. 2010 for a survey of models in this field). As far as we are aware, this modeling framework has not been widely used in the information systems literature thus far, so we briefly introduce it here.[9] We also provide a list of important notations used in this and the following sections in Table 6 in Appendix A for reference.
A network is a way to represent relational data in the form of a mathematical graph. A graph consists of a set of nodes and a set of edges, where an edge is a directed or undirected link between a pair of nodes. A network of $n$ nodes can also be mathematically represented by an $n \times n$ adjacency matrix $Y$, where each element $Y_{ij}$ can be zero or one, with one indicating the existence of the $i$-$j$ edge and zero meaning otherwise. Self-edges are disallowed, so $Y_{ii} = 0 \;\forall\, i$. If edges are undirected (i.e., the $i$-$j$ edge is not distinguished from the $j$-$i$ edge), then $Y_{ij} = Y_{ji} \;\forall\, i, j$ (i.e., $Y$ is a symmetric matrix).
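The adjacency-matrix representation can be illustrated with a minimal sketch that builds $Y$ for an undirected network from a list of edges. The function name and the integer node labels are assumptions made for this example.

```python
def adjacency_matrix(n, edges):
    """Build the symmetric 0/1 adjacency matrix of an undirected graph on nodes 0..n-1."""
    matrix = [[0] * n for _ in range(n)]
    for i, j in edges:
        if i == j:
            raise ValueError("self-edges are disallowed")
        # An undirected edge sets both the (i, j) and the (j, i) entries.
        matrix[i][j] = 1
        matrix[j][i] = 1
    return matrix
```

For instance, `adjacency_matrix(4, [(0, 1), (2, 3)])` yields a matrix with a zero diagonal in which entries (0, 1), (1, 0), (2, 3), and (3, 2) equal one.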
In applications, the nodes in a network are used to represent economic or social entities, and the edges are used to represent certain relations between the entities. In the present research, the nodes and the edges are high-tech companies and the M&A transactions between them, respectively, and together they form an M&A network. In terms of the adjacency-matrix representation, we define

$$Y_{ij} = \begin{cases} 1, & \text{if } i \text{ and } j \text{ are part of an M\&A transaction,} \\ 0, & \text{otherwise.} \end{cases}$$

With this definition, the resultant M&A network is undirected.[10]
[9] The only papers using ERGMs by information systems scholars that we are aware of are Skerlavaj et al. (2010) and Faraj and Johnson (2011).
[10] Alternatively, we could define a directed “acquisition network” where the edges are asymmetric. That is, we could distinguish the acquirer and the acquired. For our purpose of assessing the business proximity measure, the distinction is not very important since business proximity is symmetric (as are the other three proximity measures). In addition, our assumption of an undirected M&A network reduces the time needed for computation when we perform the estimations.

ERGMs treat the network graph, or equivalently the adjacency matrix $Y$, as a random outcome. For a network of $n$ nodes, the set of all possible graphs (denoted $\mathcal{Y}$) is finite. The observed network is one realization of the underlying random graph generation process. For some $y \in \mathcal{Y}$, the probability of it occurring is assumed to be

$$P(Y = y) = \frac{1}{\Psi} \exp\left\{\sum_{k=1}^{K} \theta_k z_k(y)\right\}, \qquad (8)$$
where $K$ is the number of network statistics, $z_k(y)$ is the $k$-th network statistic, the $\theta_k$’s are parameters, and the denominator $\Psi$ is a normalizing constant.[11] The $z_k(y)$ terms capture certain properties of the network and are assumed to affect the likelihood of its occurring. They are analogous to the independent variables in a regression model. One common example of a network statistic is the total number of edges in the network (or a constant multiple of it). $z_k(y)$ can be a function of not only the network graph $y$, but also other exogenous covariates on the nodes. For example, suppose we have a categorical variable on the nodes. Then one such statistic is the number of edges where the two end nodes belong to the same category. To interpret the parameters $\theta_k$, we can rewrite equation (8) in terms of the log-odds of the conditional probability:

$$\operatorname{logit}\big(P(Y_{ij} = 1 \mid Y_{-ij})\big) = \sum_{k=1}^{K} \theta_k \, \Delta z_k, \qquad (9)$$

where $Y_{-ij}$ is all but the $ij$ element of the adjacency matrix and $\Delta z_k$ is the change in $z_k$ when the $i$-$j$ edge is formed. Therefore, the interpretation of $\theta_k$ is: If forming the $i$-$j$ edge increases $z_k$ by 1 and the other statistics stay constant, then the log-odds of it forming increases by $\theta_k$.[12][13]
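For a very small network, equations (8) and (9) can be checked directly by enumerating every possible graph. The sketch below does so for undirected graphs; the function names, the two example statistics (edge count and number of nodes with degree of at least two), and the coefficient values are assumptions chosen for illustration, and brute-force enumeration is feasible only for a handful of nodes.

```python
import itertools
import math


def all_undirected_graphs(n):
    """Yield every undirected graph on n nodes as a frozenset of (i, j) pairs with i < j."""
    pairs = list(itertools.combinations(range(n), 2))
    for bits in itertools.product((0, 1), repeat=len(pairs)):
        yield frozenset(pair for pair, bit in zip(pairs, bits) if bit)


def edge_count(graph, n):
    return len(graph)


def nodes_with_degree_at_least_two(graph, n):
    degree = [0] * n
    for i, j in graph:
        degree[i] += 1
        degree[j] += 1
    return sum(1 for d in degree if d >= 2)


def ergm_distribution(n, theta, statistics):
    """Return ({graph: probability}, psi) by explicit enumeration, as in equation (8)."""
    weights = {
        graph: math.exp(sum(t * z(graph, n) for t, z in zip(theta, statistics)))
        for graph in all_undirected_graphs(n)
    }
    psi = sum(weights.values())  # normalizing constant
    return {graph: w / psi for graph, w in weights.items()}, psi


def conditional_log_odds(probabilities, graph, edge):
    """Log-odds that `edge` is present, holding the rest of `graph` fixed."""
    with_edge = graph | {edge}
    without_edge = graph - {edge}
    return math.log(probabilities[with_edge] / probabilities[without_edge])


theta = (-1.0, 0.5)  # hypothetical coefficients for the two statistics
probs, psi = ergm_distribution(4, theta, (edge_count, nodes_with_degree_at_least_two))
rest = frozenset({(0, 1)})
log_odds = conditional_log_odds(probs, rest, (1, 2))
```

In this example, adding the edge (1, 2) to a graph that already contains (0, 1) raises the edge count by one and gives node 1 a second edge, so the conditional log-odds equals the sum of the two coefficients. With the edge count as the only statistic, the log-odds is the same for every pair regardless of the rest of the graph, which is the independent-edges case.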
4.2.3 Specification
Our ERGM specification includes the statistics ($z_k$’s) for degree distribution, selective mixing, and proximity. We enumerate them and explain their interpretations in the M&A context in the following paragraphs. In the discussion, we translate the generic terms nodes and edges into the more specific terms firms and transactions.

[11] $\sum_{y \in \mathcal{Y}} P(Y = y) = 1$, so $\Psi = \sum_{y \in \mathcal{Y}} \exp\{\sum_{k=1}^{K} \theta_k z_k(y)\}$.
[12] It is noteworthy that if the changes in the $z_k$’s do not depend on $Y_{-ij}$ for all $i, j$, then the edges are independent of each other, and hence the ERGM reduces to a standard logistic regression where each edge is considered an independent observation.
[13] The above summarizes the basic formulation of ERGMs. Despite its relatively straightforward interpretation and analytic convenience, applications had been limited until just a few years ago due to significant computational burdens. The difficulty lies in evaluating the normalizing constant in equation (8), which involves a sum over a very large sample space even for a moderate $n$. It is not hard to see that the number of possible graphs is $2^{n(n-1)}$ if the network is directed and $2^{n(n-1)/2}$ if the network is undirected. Recent advances in computing capability and Monte Carlo estimation techniques (Snijders 2002, Handcock et al. 2008, among others) have made possible the significant growth of ERGM applications in academic fields such as sociology and demography.
The degree distribution statistics include $t$, the total number of M&A transactions, and $d_2$, the number of firms that are each a party to at least two different transactions. $t$ measures the density of transactions in the M&A network, and its coefficient plays a role similar to that of the constant term in a regression model. In fact, equation (9) implies that the coefficient of $t$ would be the log-odds of a transaction happening if $t$ were the only statistic in the equation. Given the sparsity of the M&A network, we expect $t$’s coefficient to be negative. We also include the $d_2$ statistic because prior research has demonstrated that firms with different relational capabilities (Lorenzoni and Lipparini 1999) participate in significantly different levels of M&A activities. Wang and Zajac (2007) specifically showed that an acquisition is more likely to occur if either of the two parties has prior acquisition experience. Moreover, we found in the exploratory data analysis in Section 2 that the number of M&A transactions in which a firm is a party follows a power-law distribution. Hence we predict that a transaction in which either of the two parties has previously engaged in M&A transactions should have a different likelihood than one in which neither has. The $d_2$ statistic captures exactly this effect, and we expect its coefficient to be positive.
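The two degree distribution statistics are simple to compute from a list of transactions. The following sketch assumes each transaction is given as an unordered pair of firm identifiers; the function name and the identifiers used in any call are hypothetical.

```python
from collections import Counter


def degree_statistics(transactions):
    """Return (t, d2): the number of distinct transactions and the number of firms
    that are a party to at least two of them."""
    # Treat each transaction as an unordered pair and drop exact duplicates.
    edges = {frozenset(pair) for pair in transactions}
    degree = Counter(firm for edge in edges for firm in edge)
    t = len(edges)
    d2 = sum(1 for count in degree.values() if count >= 2)
    return t, d2
```

For example, three transactions A-B, A-C, and D-E give $t = 3$ and $d_2 = 1$, since only firm A takes part in two transactions.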
Selective mixing captures the matching of firms according to the combination of their nodal-level characteristics. In other words, these characteristics are first defined at the individual firm level, then combined at the pair level, and lastly aggregated into the corresponding network statistics. In the network analysis literature, one widely adopted form of selective mixing is assortative mixing: Social and economic entities tend to form relationships with others that are “similar.” We include two groups of statistics that reflect an analogous kind of selective mixing in M&As; they are constructed from two categorical covariates we have on the firms, i.e., state and industry sector. We expect that a pair of firms belonging to the same category is more likely to match than a pair that does not. Specifically, statistic $h^{sta}_s$ is the number of transactions between two firms whose headquarters are both located in state $s$, where $s$ is one of the 50 states plus the District of Columbia; $h^{cat}_c$ is the number of transactions between two firms that belong to the same industry sector (category) $c$, where $c$ is any of the 19 sectors described in Section 2. We also want to point out that these two groups of statistics can serve as alternative operationalizations of geographic and business proximity.
Lastly, the statistics of most interest are the four proximity measures that capture the matching process based on dyadic-level characteristics. We normalize the four proximity measures to ensure they have the same standard deviation. Each of the four statistics equals the sum of the corresponding characteristic values over all transactions. We use $p_g$, $p_s$, $p_f$, and $p_b$ to denote the sums of geographic proximity, social proximity, investor proximity, and business proximity, respectively. The rationale for including them has been discussed in Section 4.2.1. In the benchmark specification, we include a linear term for $p_b$. We also estimate an additional specification with a quadratic term of $p_b$ to allow for a curvilinear effect of business proximity on matching.
To sum up, our benchmark model specification can be written as

$$P(Y = y) = \frac{1}{\Psi} \exp\Big\{\theta_t t + \theta_{d2} d_2 + \sum_s \theta^{sta}_s h^{sta}_s + \sum_c \theta^{cat}_c h^{cat}_c + \theta_g p_g + \theta_s p_s + \theta_f p_f + \theta_b p_b\Big\}, \qquad (10)$$
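and the corresponding conditional form is

$$\begin{aligned}
\operatorname{logit}\big(P(Y_{ij} = 1 \mid Y_{-ij})\big)
&= \theta_t \Delta t + \theta_{d2} \Delta d_2 + \sum_s \theta^{sta}_s \Delta h^{sta}_s + \sum_c \theta^{cat}_c \Delta h^{cat}_c + \theta_g \Delta p_g + \theta_s \Delta p_s + \theta_f \Delta p_f + \theta_b \Delta p_b \\
&= \theta_t + \theta_{d2} \Delta d_2 + \sum_s \theta^{sta}_s I(s_i = s_j = s) + \sum_c \theta^{cat}_c I(c_i = c_j = c) \\
&\quad + \theta_g p_{g,ij} + \theta_s p_{s,ij} + \theta_f p_{f,ij} + \theta_b p_{b,ij},
\end{aligned} \qquad (11)$$

where $\Delta$ denotes the change in a statistic when the $i$-$j$ edge is formed, $I(\cdot)$ is an indicator function, and, for instance, $I(s_i = s_j = s)$ means companies $i$ and $j$ are both in state $s$ and $I(c_i = c_j = c)$ means $i$ and $j$ both belong to sector $c$.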
4.2.4 Results
The nal dataset contains a total of 24,382 companies. This seemingly moderate number of
nodes is actually huge for estimating network models, since the number of potential edges
in our case un-ordered pairs close to 300 million. Given our current computational
capacity, we cannot handle the whole dataset in one estimation procedure. To carry out the
analysis, we decide to randomly select 25% of the whole dataset for estimation and repeatedly
do so 100 times. Since the estimation for each subsample is an independent, computation-
intensive task, we parallelized the estimation job using Condor system,
14
which is a Big Data
14
http://research.cs.wisc.edu/htcondor/
22
Number of Number of Median
Samples with Samples with Coecient
Expected Sign p-value Value
< 1.0%
θ
t
edges 100(<0) 98 -14.7837
θ
d2
degree> 2 97(>0) 92 3.0064
Table 2: Degree Distribution Coecients (100 Samples)
platform to support high throughput computing. For each of the 100 dierent samples (6,096
companies each), we estimate the model coecients by using the Markov Chain Monte Carlo
maximum likelihood estimation procedure outlined in Hunter and Handcock (2006).
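The scale of the estimation problem and the subsampling scheme can be reproduced with a short sketch. The company count comes from the text; the function name, the rounding rule (rounding up, which matches the reported 6,096 companies per sample), the fixed seed, and the choice to draw each subsample independently of the others are assumptions.

```python
import math
import random

N_COMPANIES = 24382

# Number of unordered company pairs, i.e., potential edges in the undirected network.
potential_edges = math.comb(N_COMPANIES, 2)

# Size of a 25% subsample, rounded up.
sample_size = math.ceil(0.25 * N_COMPANIES)


def draw_subsamples(company_ids, n_samples=100, fraction=0.25, seed=0):
    """Draw `n_samples` random subsets, each without replacement within the subset."""
    rng = random.Random(seed)
    size = math.ceil(fraction * len(company_ids))
    return [rng.sample(company_ids, size) for _ in range(n_samples)]
```

Each subsample can then be estimated as an independent job, which is what makes the task straightforward to parallelize.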
We summarize the resultant 100 sets of coefficients for the degree distribution, selective mixing, and proximity statistics in Tables 2, 3, and 4, respectively. For each statistic, we report the number of samples that yield a coefficient with the expected sign, and the number(s) of samples that yield a coefficient that has the expected sign and is statistically significant at one or more selected confidence level(s). Also, to provide an example, we report the full estimation result for one particular sample in Table 7 in Appendix A.
Table 2 reports the coefficients of the degree distribution statistics. Among the 100 samples, all $\theta_t$ coefficients are negative and 97 $\theta_{d2}$ coefficients are positive. At the 99.0% confidence level, 98 $\theta_t$ estimates and 92 $\theta_{d2}$ estimates are significant. Hence the results for the two degree distribution statistics are both consistent with our expectations. As discussed, the negativity of $\theta_t$ merely reflects the overall small probability of an M&A transaction occurring; the positive sign of $\theta_{d2}$ means that an M&A transaction involving firms with some M&A experience is more likely to occur.
In part (a) of Table 3, we find that most state-based selective mixing statistics are dropped. This is due to the sparsity of M&A transactions during the data collection period: the likelihood that two same-state companies merged in an individual sample is low for most states. Indeed, the states that yield the most coefficients, namely CA, NY, and MA, are where well-known high-tech hubs are located. In part (b) of Table 3, we observe that for almost all category-based selective mixing statistics, an overwhelmingly large proportion of the coefficient estimates are positive, but their statistical significance, when using the 99.0% confidence level, is not strongly supported. One possible explanation of their statistical insignificance is the inclusion of our business proximity measure. As mentioned, the selective mixing statistics based on industry sector can also be thought of as alternative, but coarser, operationalizations of business proximity. Therefore, when both the selective mixing statistics and our business proximity measure are included in the ERGM specification, the effect of the selective mixing statistics is superseded by the effect of the more refined proximity measure, causing the model to produce insignificant coefficients for the selective mixing statistics. To test the validity of this explanation, we also estimate another ERGM specification, which excludes the business proximity measure; we report the corresponding results for the selective mixing coefficients in Table 8 in Appendix A. Comparing the last columns of Tables 3 and 8, we find that under the specification without the proposed business proximity measure, a much higher proportion of the samples produces statistically significant (at the 1.0% significance level) estimates for the selective mixing coefficients. This is thus supporting evidence for the superiority of the proximity measures we use: They are correlated with the alternative, coarser measures, but statistically more powerful in explaining the matching in M&As.

Table 3: Selective Mixing Coefficients (100 Samples)
Columns report the number of samples (out of 100) with an estimated coefficient, with a coefficient > 0, and with a p-value < 1.0%.

(a) State
State  With coef.  Coef. > 0  p < 1.0%  |  State  With coef.  Coef. > 0  p < 1.0%
AK     0           -          -         |  MT     0           -          -
AL     0           -          -         |  NC     0           -          -
AR     0           -          -         |  ND     0           -          -
AZ     0           -          -         |  NE     0           -          -
CA     100         94         43        |  NH     5           5          3
CO     7           7          7         |  NJ     4           4          3
CT     0           -          -         |  NM     0           -          -
DC     5           5          4         |  NV     0           -          -
DE     0           -          -         |  NY     61          61         22
FL     0           -          -         |  OH     0           -          -
GA     7           7          6         |  OK     0           -          -
HI     0           -          -         |  OR     0           -          -
IA     0           -          -         |  PA     0           -          -
ID     0           -          -         |  RI     0           -          -
IL     5           5          5         |  SC     0           -          -
IN     0           -          -         |  SD     0           -          -
KS     0           -          -         |  TN     0           -          -
KY     0           -          -         |  TX     19          19         13
LA     0           -          -         |  UT     0           -          -
MA     28          28         16        |  VA     0           -          -
MD     6           6          5         |  VT     0           -          -
ME     0           -          -         |  WA     11          11         6
MI     0           -          -         |  WI     0           -          -
MN     0           -          -         |  WV     0           -          -
MO     0           -          -         |  WY     0           -          -
MS     0           -          -         |

(b) Category
Category     With coef.  Coef. > 0  p < 1.0%  |  Category       With coef.  Coef. > 0  p < 1.0%
advertising  26          25         7         |  mobile         28          26         11
biotech      38          37         5         |  net hosting    7           6          6
cleantech    11          11         6         |  other          0           -          -
consulting   11          10         3         |  pub rel        8           8          8
ecommerce    13          13         3         |  search         0           -          -
education    0           -          -         |  security       0           -          -
enterprise   22          22         20        |  semiconductor  15          15         5
games video  26          25         11        |  software       87          78         37
hardware     32          31         25        |  web            76          66         21
legal        0           -          -         |

Table 4: Proximity Coefficients (100 Samples)
Columns report the number of samples (out of 100) meeting each condition, followed by the median estimate.
Coefficient        Coef. > 0  p < 5.0%  p < 1.0%  p < 0.1%  Median estimate
θ_g  Geographic    46         8         5         3         -0.0173
θ_s  Social        79         73        70        69        0.1460
θ_f  Investor      62         52        51        46        0.0689
θ_b  Business      100        92        86        79        0.5315
In Table 4 we report the estimation results for the four proximity measures. First and foremost, the new business proximity measure is found to be strongly associated with the matching likelihood: All the samples produce positive coefficients, and among them 79 estimates are significant at the 99.9% confidence level. Furthermore, when comparing the proximity measures across the rows, we observe that three of the four proximity measures (all except geographic proximity, $\theta_g$) are positively associated with the likelihood of matching in M&As, and in particular, our newly developed business proximity measure outperforms the other three in terms of statistical significance. Moreover, since we normalize the proximity measures, we can evaluate their economic significance by comparing the magnitudes of the coefficients. Using the median estimate from the 100 samples (last column of Table 4), we find that the business proximity measure has the largest effect on the matching likelihood: A 1-standard-deviation increase in business proximity has the same effect as a 3.64-standard-deviation increase in social proximity (0.5315/0.1460), or a 7.71-standard-deviation increase in investor proximity (0.5315/0.0689). These results thus support the value of business proximity in modeling M&As. Interestingly, in our dataset, geographic proximity appears to play an insignificant role in identifying high-tech firms’ matching in M&As.
The estimation result of equation (10) shows that business proximity is positively associated with the M&A matching likelihood. However, a linear structure might not best capture the true relationship between business proximity and M&A matching, since the economic benefits of merging two firms’ businesses may result from not only their similarity but also their complementarity (e.g., Chung et al. 2000, Sears and Hoetker 2013). The value of M&A could decrease in cases where two firms’ businesses are too similar but lack complementarity, so that little synergy can be achieved through a merger. We test this hypothesis by estimating a specification that includes a squared term of business proximity, $\theta_{b2} p_{b2} = \theta_{b2} p_{b,ij}^2$, and that is otherwise the same as equation (10). We expect $\theta_{b2}$ to be negative and $\theta_b$ to remain positive. The estimation results on the proximity measures (for the 100 samples) are reported in Table 5. We do observe that for a large number of the samples, business proximity is estimated to have a curvilinear effect on the M&A matching likelihood. Specifically, for 86 out of the 100 samples, the coefficient of the squared term is negative and that of the linear term is positive, suggesting that the matching likelihood first increases with business proximity and then decreases after a certain point. This evidence is thus consistent with our expectation. Meanwhile, we note that the evidence for the statistical significance of the squared term is not as strong as that for the linear term.
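To make the curvilinear interpretation concrete, the short derivation below locates the turning point implied by a linear-plus-squared specification. It is a sketch under the assumption that the business proximity terms enter the conditional log-odds as $\theta_b p + \theta_{b2} p^2$ for a pair with business proximity $p$; the symbols $f$ and $p^{*}$ are introduced here for illustration only.

$$f(p) = \theta_b p + \theta_{b2} p^2, \qquad f'(p) = \theta_b + 2\theta_{b2} p = 0 \;\Longrightarrow\; p^{*} = -\frac{\theta_b}{2\theta_{b2}}.$$

With $\theta_b > 0$ and $\theta_{b2} < 0$, $p^{*}$ is positive and $f''(p) = 2\theta_{b2} < 0$, so the contribution to the log-odds rises for $p < p^{*}$ and falls beyond it. Whether $p^{*}$ lies within the observed range of (normalized) business proximity depends on the estimated magnitudes, which Table 5 does not report.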
Table 5: Proximity Coefficients (100 Samples): Equation (10) plus $\theta_{b2} p_{b2}$
Columns report the number of samples (out of 100) meeting each condition.
Coefficient        Expected sign  p < 5.0%  p < 1.0%  p < 0.1%
θ_g  Geographic    47 (>0)        6         4         2
θ_s  Social        85 (>0)        77        77        73
θ_f  Investor      67 (>0)        56        52        50
θ_b  Business      100 (>0)       86        76        61
θ_b2 Business^2    86 (<0)        42        28        13
5 Scaling up to Big Data: A System Prototype for Navigating the Networked Startup World
During the recent boom of the high-tech industry, the media are often full of reports about high-profile M&As involving startups. It is well known that M&As are an important alternative to IPOs as an exit option for high-tech entrepreneurs and early investors. Meanwhile, industry giants spend tens of billions of dollars each year acquiring smaller firms for market entrance, strategic intellectual property, and talented employees.[15] Venture capitalists also arrange mergers between their partially owned startups in order to consolidate resources and reduce competitive pressure.[16] The fierce competition in both demand and supply instantaneously creates the problem of matching between acquirers and targets, since the value (or disvalue) of an M&A critically depends on the synergy of the companies’ products, technologies, and markets. More broadly, the challenge lies in the search for startups. While almost everyone knows who the top competitors are in a particular space, it is a difficult and time-consuming task to find the small companies in the vast startup universe with the right products or technology. The problem can only become more challenging over time given the speed of technological innovation. Solving this search problem will be beneficial not only for M&A executives, but also for entrepreneurs to position their products and identify competitors, for venture capitalists to monitor niche markets, and for high-tech analysts to examine industry trends. Observers have noted that data analytics can complement executives’ industry knowledge in alleviating many of these problems and transform the way M&A matching and startup search have been done. It is reported that many large M&A players have already been investing heavily in analytics to identify win-win matches by rendering the decision-making processes more “data-driven.”[17]

[15] See “Internet Mergers and Takeovers: Platforms upon Platforms,” The Economist, May 25, 2013.
[16] An example is the acquisition of Summize by Twitter in 2008. See “Finding A Perfect Match,” Twitter Blog, https://blog.twitter.com/2008/finding-perfect-match and Nick Bilton’s 2013 book Hatching Twitter: A True Story of Money, Power, Friendship, and Betrayal.
Along these lines, our empirical analysis indicates the potential practical value of the proposed business proximity measure as an important metric in the analytics of M&A matching and as a search tool for navigating the networked startup world. To show the practical application in a concrete way, we build a prototype for a cloud-based information system that allows entrepreneurs, managers, and analysts to explore the competitive landscape of the U.S. high-tech industry (Whinston and Geng 2004). By incorporating business proximity and making it explicitly available to the users in the search and navigation tools, the platform expedites the process of startup search and competition analysis and facilitates efficient discovery of new niche markets. Built upon the latest Big Data and cloud technologies, the system largely consists of two components, as shown in Figure 6: The back end collects raw data from the data sources, integrates and cleans the data, computes business proximity, and stores the processed data in local databases. The front end is a web application that enables users to explore the data stored in a cloud-based database.
5.1 Back-End System
The back-end system comprises two modules and two databases. The first module is the data collector, written in Python, which retrieves data from our data sources, including CrunchBase. The collector runs periodically to ensure our data is up to date. The raw data is stored in a MongoDB[18] database, which is a document-oriented NoSQL database that stores records as JSON-like documents. We do not use a relational database because the structure of the company data may change over time, so a traditional relational database, which requires a pre-defined schema, is not the best technology for our system. Another feature of MongoDB is that it supports scalability: As the data size grows, load balancing can be performed using the sharding mechanism. This is a basis for the cloud-based information system.

[17] See “Google Ventures Stresses Science of Deal, Not Art of the Deal,” New York Times, June 23, 2013, and “One of The Richest Men in The World Is Backing a Startup That Ranks Wall Street’s Hedge Funds,” Business Insider, http://read.bi/1KqhHzr.
[18] https://www.mongodb.org.

Figure 6: Prototype Architecture and Components. Back end: industry data (Crunchbase, etc.), data collector (Python), raw data (MongoDB), topic model builder (Scala), processed data (MongoDB). Front end: webpages (HTML/CSS, JavaScript), API Engine (Google App Engine), cloud DB (Google Datastore, Google Cloud Storage) holding company meta info (JSON) and business proximity (JSON); users issue searches and view company info.
The second module, the topic model builder, constructs and estimates topic models using the textual company descriptions extracted from the raw data in MongoDB. To run the LDA topic modeling algorithm, we use the Scala implementation in the Stanford Topic Modeling Toolbox.[19] The topic model builder produces two sets of results: First, the underlying business topics of the whole industry are generated, where each topic is essentially a set of related keywords that represent the topic. Second, each company’s profile is transformed into a topic vector, which is stored in the database of processed data in MongoDB.

[19] http://www-nlp.stanford.edu/software/tmt/tmt-0.4/.

We then compute business proximity to identify the top N nearest neighbors of each firm. A naive, brute-force approach that calculates the business proximity values for all pairs of companies can be used to find the nearest neighbors. However, as we continuously collect data and the dataset grows, the number of company pairs increases quadratically, to the point that exhaustive computation is impractical for a real-world system. Hence, we propose an algorithm that reduces the required computation while providing a reasonable approximation in finding nearest neighbors. The intuition behind the algorithm is that a pair of companies is likely to have a high proximity value only if they share high weights on some common topics in their topic distributions. Hence, we maintain a bucket list for each topic that keeps track of the companies with a high weight on that specific topic. Then we only compute the business proximity values for company pairs that co-occur in at least one of the bucket lists, because those pairs that do not fall into any of the bucket lists are unlikely to be very close to each other. The pseudocode is given in Algorithm 1.
input : set of companies C, companies’ topic distributions T, number of topics K, threshold θ
output: N nearest neighbors for each company
for each topic k ∈ {1, ..., K} do
    B_k ← ∅
end
for each company c ∈ C do
    for each topic k ∈ {1, ..., K} do
        if T[c][k] >= θ then
            B_k ← B_k ∪ {c}
        end
    end
end
for each company c ∈ C do
    Nset ← ∅
    for each topic k ∈ {1, ..., K} do
        if T[c][k] >= θ then
            Nset ← Nset ∪ B_k
        end
    end
    for each company c′ ∈ Nset do
        bizprox[c′] ← cosinesimilarity(T[c], T[c′])
    end
    Find N nearest neighbors by sorting bizprox list
end
Algorithm 1: Faster Nearest-Neighbor Computation
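The pseudocode can be turned into a compact Python sketch. This is an illustrative implementation under stated assumptions, not the production code of the prototype: topic distributions are passed as a dictionary from company identifier to a list of topic weights, a company is excluded from its own neighbor list, ties are broken by identifier, and the comparison counter tallies ordered pairs. All function and variable names are hypothetical.

```python
import math
from collections import defaultdict


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest_neighbors(topic_dist, threshold, n_neighbors):
    """Approximate top-N neighbors per company using per-topic bucket lists.

    Returns (neighbors, comparisons), where `comparisons` counts the
    similarity evaluations performed (each ordered pair counted once).
    """
    # Step 1: one bucket per topic, holding companies with a high weight on it.
    buckets = defaultdict(set)
    for company, weights in topic_dist.items():
        for k, weight in enumerate(weights):
            if weight >= threshold:
                buckets[k].add(company)

    neighbors = {}
    comparisons = 0
    for company, weights in topic_dist.items():
        # Step 2: candidates are companies sharing at least one bucket.
        candidates = set()
        for k, weight in enumerate(weights):
            if weight >= threshold:
                candidates |= buckets[k]
        candidates.discard(company)  # skip the trivial self-match (an addition to the pseudocode)

        # Step 3: score only the candidates and keep the N most similar.
        scores = {other: cosine_similarity(weights, topic_dist[other])
                  for other in candidates}
        comparisons += len(scores)
        ranked = sorted(scores, key=lambda other: (-scores[other], other))
        neighbors[company] = ranked[:n_neighbors]
    return neighbors, comparisons
```

Raising the threshold shrinks the buckets and therefore the number of similarity evaluations, at the risk of missing neighbors whose shared topics all fall below the threshold.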
To measure the speed of business proximity computation and the accuracy of nearest-neighbor identification, we run experiments using the dataset described in Section 2. The results are reported in Figure 7. In terms of computation speed, we count the number of business proximity values calculated. We use this metric instead of the actual computation time to avoid potential environmental biases. The brute-force algorithm, which computes all pairwise proximity values, requires 341 million calculations. By contrast, our algorithm with threshold 0.00 needs only 123 million, which is 36% of the naive approach. As we increase the threshold to 0.30, only 3% of the calculations are needed. Faster computation comes at a modest cost in accuracy. We compare the N nearest neighbors identified by the algorithm with different thresholds and vary N to be 10, 20, 30, 50, and 100. As expected, the algorithm provides accurate results for the closest neighbors, while the performance degrades gracefully for the not-so-near neighbors. We want to note that the algorithm with threshold 0.00 provides 100% accurate neighborhood sets compared to the brute-force algorithm. Even for the case of threshold 0.30, the algorithm gives 92.5% accuracy in identifying the 50 nearest neighbors.

Figure 7: Performance Measures of Algorithm 1. (a) Calculation speed: number of comparisons (million) for the brute-force algorithm and the fast algorithm with thresholds 0.0, 0.1, 0.15, 0.20, and 0.30. (b) Accuracy of nearest neighbors: accuracy (%) for the top 10, 20, 30, 50, and 100 neighbors at thresholds 0.00, 0.10, 0.15, 0.20, and 0.30.
Figure 8: Prototype Front End: User Interface Screenshots. Panels: (a) Search companies and topics of interest; (b) Search results; (c) Focal company with its competitors based on business proximity.
5.2 Front-End System
The front end is a cloud-based web application, available at http://146.6.99.242/bizprox,
to let users explore various company information with the proposed business proximity.
Figure 8 shows the screenshots of the user interface. Given a keyword from the user, the
search results show the topics and companies associated with the keyword. By selecting a
topic, the user can interpret it through 20 (additional) relevant keywords and the significance
of each. If a company is selected from the search results, the interface provides (1) the basic
information about the company along with the topic distribution, and (2) a list of nearest
neighbors to the focal company. The basic information of a company includes the founding
date, founders, headquarters, and a short business description. With the topic distribution,
users can recognize various business aspects of the company. The nearest neighbors are
computed using Algorithm 1 and are sorted by the business proximity.
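The keyword search described above can be illustrated with a minimal sketch. The paper does not describe how the search is implemented, so the lookup logic, the function names, the company names, and every weight below are hypothetical. Only the topic keywords are borrowed from Table 9 (topics 19 and 29).

```python
# Hypothetical keyword weights per topic; the words come from Table 9, the weights are made up.
TOPIC_KEYWORDS = {
    19: {"games": 0.05, "game": 0.04, "gaming": 0.03},
    29: {"payment": 0.06, "card": 0.04, "credit": 0.03},
}

# Hypothetical companies with made-up topic distributions (topic id -> weight).
COMPANY_TOPICS = {
    "AlphaPay": {29: 0.7, 19: 0.1},
    "BetaGames": {19: 0.8},
    "GammaCard": {29: 0.5, 19: 0.3},
}


def search_topics(keyword):
    """Return (topic id, keyword weight) pairs for topics listing the keyword, heaviest first."""
    hits = [(tid, words[keyword])
            for tid, words in TOPIC_KEYWORDS.items() if keyword in words]
    return sorted(hits, key=lambda hit: -hit[1])


def search_companies(keyword, min_weight=0.2):
    """Return (company, weight) pairs for companies loading on the matching topics.

    min_weight is an assumed cutoff that hides companies with little weight
    on the topics associated with the keyword.
    """
    topic_ids = {tid for tid, _ in search_topics(keyword)}
    scored = []
    for name, dist in COMPANY_TOPICS.items():
        weight = sum(w for tid, w in dist.items() if tid in topic_ids)
        if weight >= min_weight:
            scored.append((name, weight))
    return sorted(scored, key=lambda item: (-item[1], item[0]))


payment_topics = search_topics("payment")
payment_companies = search_companies("payment")
```

In this toy setup a search for "payment" surfaces topic 29 and the two companies that load on it, mirroring the topic-and-company result list shown in Figure 8(b).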
From the system architecture perspective, the front end is a cloud-based system leveraging
platform-as-a-service (PaaS). The static webpages in HTML/CSS are hosted by our
local Apache Web Server. The server handles user inputs such as keyword searches and
page navigation. Each webpage is instrumented with Google Analytics (footnote 20) so
that web analytics can be performed to understand user engagement and potentially optimize
the service. An API Engine, deployed in Google App Engine (footnote 21), receives queries
from the HTML pages and returns relevant data from the cloud database. The cloud database
consists of two components: first, the dynamic data is managed in Google Cloud Datastore
(footnote 22), a cloud-based NoSQL database system; second, the static data is stored in
Google Cloud Storage (footnote 23), which provides a cost-effective content distribution
service for static information. The cloud-based approach gives two main benefits: scalability
(e.g., the system scales automatically according to user demand and data size) and
availability (e.g., almost no downtime due to replication).
20 http://www.google.com/analytics/.
21 https://developers.google.com/appengine/.
22 https://developers.google.com/datastore/.
23 https://cloud.google.com/products/cloud-storage/.
6 Discussion and Conclusion
The advent of the digital economy is creating a business environment characterized by
unprecedented technological complexity and connectedness between firms and people.
With the goal of making the business landscape easier to understand and depict,
we set out in this paper to develop a general, data-analytic framework for quantifying firms’
positions in the spaces of product, market, and technology and for measuring firms’ dyadic
business proximity. Using a unique dataset of the U.S. high-tech industry as an example,
we detailed the procedure and system of using topic models to analyze the publicly available
textual descriptions of company business and constructing proximity according to the
structured results. We then validated the new measure by relating it to the simple
category-based classification and analyzing its statistical relationships with firm interactions
including M&A, investment, and job mobility. In a more rigorous statistical analysis, we also
demonstrated the new measure’s usefulness in modeling matching of M&As, where we
constructed a network of high-tech companies and documented empirical evidence on the
nuanced relationship between matching and business proximity. Moreover, to show the
practical value of the proposed data-analytic framework, we deployed various Big Data and
analytics technologies to build a prototype of a cloud-based information system for industry
intelligence.
This research sheds light on the value of leveraging data science techniques in the
development of novel measures (Einav and Levin 2013) for large-scale business analytics. Our
data-driven, analytics-based approach requires no expert preprocessing, provides finer
granularity (compared with the SIC- or NAICS-based methods), is more comprehensive in
quantifying firms’ positions in the spaces of product, market, and technology (compared with
the patent- or customer-based methods), and can be better automated and scaled to Big Data
(compared with all previous methods). When built into an automated system as in Section 5,
the method is also more responsive in capturing industry trends than any
human-annotation-based approach. Substantively, the comprehensive, granular business
proximity measure is an enabler in the M&A application to show the nuanced relationship
between the transaction likelihood and the firms’ business similarity and complementarity.
The result shows that economically meaningful information can be extracted from
unstructured data through careful analysis and large-scale computation. Thus, our
methodology greatly complements the toolkit for measuring business proximity, and it is
especially useful when researchers or analysts are studying either an already narrowly
focused industry or a highly dynamic industry, or when the firms under study are small and
privately held (e.g., startups) so that industry classification is largely unavailable.
Meanwhile, we wish to stress that our measure is not intended as a replacement for the
existing methods in all scenarios. For instance, when the research question is at a relatively
macro level, only firms’ broad industry membership is important, and all firms’ SIC or NAICS
codes are available, the researcher should not hesitate to use the SIC- or NAICS-based
methods.
More broadly, the data-analytic framework used in the study presents a general approach
for understanding industry structure, and it also demonstrates the potential transformation
Big Data analytics can bring to both industry intelligence practice and strategy and
industrial organization research. For analytics-minded managers, firms’ relatedness in
business is a very important metric for identifying potential partners, competitors, and
alliance or acquisition targets. The saying in management goes, “if you cannot measure it,
you cannot manage it.” As shown in our study, the proposed proximity measure provides finer
granularity and proves effective in high-tech M&A analytics. More importantly, as a general
approach to organizing unstructured data for industry intelligence, the usefulness of the
proposed framework is not limited to measuring proximity and analyzing M&As. Rather, as
argued and demonstrated in Section 5, it provides handy leverage for entrepreneurs, venture
capitalists, and analysts to navigate the constantly changing landscape of the networked
business environment, which is much needed in light of the rapid evolution of technology
and increasing complexity of the digital economy. Our prototype can be the first step in
building a Business Intelligence platform to fully realize the framework’s practical potential.
In response to the transformation, even outside the domain of industry intelligence,
organizations need to invest in IT infrastructure and capability to better organize and analyze
unstructured data, as the ability to distill value from unstructured data will be an
important competitive advantage in the digital economy. Our prototype is also an example of
organizing unstructured data and integrating the state-of-the-art storage and computation
technologies to build a decision support system. For business and economics scholars, our
method can perhaps be adapted to serve as an alternative approach to defining market
boundaries or identifying industry rivals, which is a crucial step in the empirical research of
industrial organization. Additionally, future research can explore the possibility of
combining topic modeling results and clustering algorithms to build an industry hierarchy,
which could be a data-driven alternative to the expert-labeled systems that are currently in
use. A data-driven approach is especially desirable for industries such as high-tech because
the underlying technology is rapidly changing and the manually labeled industry
classification system can be stale.
This research also advances the understanding and analysis of M&As. We documented
systematic evidence on the relationship between matching and firm proximity in the high-tech
industry, complementing the previous empirical M&A literature, which largely focused
on larger, public corporations (Betton et al. 2008). The proposed new measure also enabled
us to test the non-monotone relationship between business proximity and M&A matching.
More importantly, we constructed a network structure using firm proximity measured in four
different dimensions and adopted the statistical modeling framework of ERGMs to
accommodate the relational nature of the matching data. The network/graph approach has been
fruitfully applied to analyzing a variety of economic exchanges and markets (as surveyed
in Easley and Kleinberg 2010, Jackson 2010). However, whereas the literature is abundant
with studies on how networks affect the interaction and performance of firms, research
using rigorous statistical methods to analyze the structure of inter-firm networks is relatively
underdeveloped. To our knowledge, the M&A application in the study is the first to use a
statistical network model to analyze relational transactions among companies. We believe
statistical network models are currently underutilized by management scholars in their
empirical research on inter-organizational linkage, despite the fact that relational data is
not uncommon in the studies of many very important questions. For example, strategic
alliances, investments, and patent license agreements among companies can all be visualized
and carefully analyzed as graphs/networks. We predict that with the growing availability of
network datasets and ongoing development of large-scale computing technologies, statistical
network models’ value in management research will be increasingly recognized.
In closing, we wish to point out some additional caveats and limitations of the research.
First, since SIC- or NAICS-based industry classification or patent data is unavailable for most
companies in CrunchBase, we could not directly compare the proposed business proximity
measure with that based on industry hierarchy (Wang and Zajac 2007) or the measure based
on patent citation (Stuart 1998) in terms of their explanatory power for M&A matching.
Though this is less crucial for this paper, since our goal is not to search for the best empirical
model for M&As, it could be an interesting research project to find a suitable dataset where
all the new and traditional measures could be operationalized and compared directly. Second,
for our data-analytic approach, the number of topics in LDA is a free parameter for users to
choose. When performing topic modeling on the CrunchBase descriptions, we selected a finite
set of values for this parameter, which is sufficient for our purpose of illustrating the general
methodology. Nevertheless, from a practical point of view, it is worth investigating whether
an “optimal” number of topics exists, and if so, how it should be determined. Third, in the
machine learning literature, there are several extensions to the LDA algorithm (e.g., Teh et
al. 2006, Inouye et al. 2014). Future research could investigate how these extensions could
benefit understanding company businesses through text analysis. Fourth, some important
company-level characteristics, notably company size and revenue, are unavailable in our
dataset, which inevitably limited our ability to extend our empirical application on M&A
matching. For instance, had we observed company size, we would have been able to study the
moderating effect of companies’ size on the relationship between business proximity and the
matching likelihood. Lastly, the model we employed in the empirical analysis is a static
network model. To deepen our understanding of the dependence structure of M&A
transactions, future research could examine the evolution of the M&A network by using
dynamic network models.
References
[1] Adomavicius, G. and A. Tuzhilin 2005, “Toward the Next Generation of Recommender
Systems: A Survey of the State-of-the-Art and Possible Extensions,” IEEE Transactions
on Knowledge and Data Engineering, 17(6), 734-749.
[2] Ahuja, G. and R. Katila 2001, “Technological Acquisitions and the Innovation Per-
formance of Acquiring Firms: A Longitudinal Study,” Strategic Management Journal,
22(3), 197-220.
[3] Amit, R., L. Glosten, and E. Muller 1990, “Entrepreneurial Ability, Venture Invest-
ments, and Risk Sharing,” Management Science, 36(10), 1233-1246.
[4] Betton, S., B.E. Eckbo, and K.S. Thorburn 2008, “Corporate Takeovers,” Chapter 15
in B.E. Eckbo ed., Handbook of Corporate Finance: Empirical Corporate Finance ed.
1, Vol. 2, 291-430. Elsevier/North-Holland, 2008.
[5] Blei, D.M. 2012, “Introduction to Probabilistic Topic Models,” Communications of the
ACM, 55(4), 77-84.
[6] Blei, D.M., A.Y. Ng, and M.I. Jordan 2003, “Latent Dirichlet Allocation,” Journal of
Machine Learning Research, 3, 993-1022.
[7] Chakrabarti, A. and W. Mitchell 2013, “The Persistent Effect of Geographic Distance
in Acquisition Target Selection,” Organization Science, 24(6), 1805-1826.
[8] Chen, H., R.H.L. Chiang, and V.C. Storey 2012, “Business Intelligence and Analytics:
From Big Data to Big Impact,” MIS Quarterly, 36(4), 1165-1188.
[9] Chiang, R.H.L., P. Goes, and E.A. Stohr 2012, “Business Intelligence and Analytics Ed-
ucation and Program Development: A Unique Opportunity for the Information Systems
Discipline,” ACM Transactions on Management Information Systems, 3(3), 12:1–12:13.
[10] Chung, S., H. Singh, and K. Lee 2000, “Complementarity, Status Similarity and Social
Capital as Drivers of Alliance Formation,” Strategic Management Journal, 21(1), 1-22.
[11] Cohen, L., A. Frazzini, and C.J. Malloy 2008, “The Small World of Investing: Board
Connections and Mutual Fund Returns,” Journal of Political Economy, 116(5), 951-979.
[12] Easley, D. and J. Kleinberg 2010, Networks, Crowds, and Markets: Reasoning About a
Highly Connected World. Cambridge University Press, 2010.
[13] Einav, L. and J.D. Levin 2013, “The Data Revolution and Economic Analysis,” NBER
Working Paper 19035, May 2013.
[14] Erel, I., R.C. Liao, and M.S. Weisbach 2012, “Determinants of Cross-Border Mergers
and Acquisitions,” Journal of Finance, 67(3), 1045-1082.
[15] Fallick, B., C.A. Fleischman, and J.B. Rebitzer 2006, “Job-Hopping in Silicon Valley:
Some Evidence concerning the Microfoundations of a High-Technology Cluster,” The
Review of Economics and Statistics, 88(3), 472-481.
[16] Faraj, S. and S.L. Johnson 2011, “Network Exchange Patterns in Online Communities,”
Organization Science, 22(6), 1464-1480.
[17] Ghose, A., P.G. Ipeirotis, and B. Li 2012, “Designing Ranking Systems for Hotels
on Travel Search Engines by Mining User-Generated and Crowd-Sourced Content,”
Marketing Science, 31(3), 493-520.
[18] Goldenberg, A., A.X. Zheng, S.E. Fienberg, and E.M. Airoldi 2010, “A Survey of Sta-
tistical Network Models,” Foundations and Trends in Machine Learning, 2(2), 129-233.
[19] Gompers, P.A. 1995, “Optimal Investment, Monitoring, and the Staging of Venture
Capital,” Journal of Finance, 50(5), 1461-1489.
[20] Griffiths, T.L. and M. Steyvers 2004, “Finding Scientific Topics,” Proceedings of the
National Academy of Sciences, 101, 5228-5235.
[21] Handcock, M.S., D.R. Hunter, C.T. Butts, S.M. Goodreau, and M. Morris 2008,
“statnet: Software Tools for the Representation, Visualization, Analysis and
Simulation of Network Data,” Journal of Statistical Software, 24, 1-11.
[22] Hochberg, Y., A. Ljungqvist, and Y. Lu 2007, “Whom You Know Matters: Venture
Capital Networks and Investment Performance,” Journal of Finance, 62(1), 251-301.
[23] Hunter, D.R. and M.S. Handcock 2006, “Inference in Curved Exponential Family Models
for Networks,” Journal of Computational and Graphical Statistics, 15(3), 565-583.
[24] Inouye, D., P. Ravikumar, and I. Dhillon 2014, “Admixture of Poisson MRFs: A Topic
Model with Word Dependencies,” Proceedings of International Conference on Machine
Learning, 31, 683–691.
[25] Jackson, M.O. 2010, Social and Economic Networks. Princeton University Press, 2010.
[26] Lorenzoni, G. and A. Lipparini 1999, “The Leveraging of Interfirm Relationships as A
Distinctive Organizational Capability: A Longitudinal Study,” Strategic Management
Journal, 20(4), 317-338.
[27] Mikkelson, W.H. and R.S. Ruback 1985, “An Empirical Analysis of The Interfirm Equity
Investment Process,” Journal of Financial Economics, 14(4), 523-553.
[28] Mitsuhashi, H. and H.R. Greve 2009, “A Matching Theory of Alliance Formation and
Organizational Success: Complementarity and Compatibility,” Academy of Management
Journal, 52(5), 975-995.
[29] Moscarini, G. and K. Thomsson 2007, “Occupational and Job Mobility in the US,”
Scandinavian Journal of Economics, 109(4), 807-836.
[30] Mowery, D.C., J.E. Oxley, and B.S. Silverman 1998, “Technological Overlap and
Interfirm Cooperation: Implications for The Resource-Based View of The Firm,” Research
Policy, 27(5), 507-523.
[31] Rhodes-Kropf, M. and D.T. Robinson 2008, “The Market for Mergers and the
Boundaries of the Firm,” Journal of Finance, 63(3), 1161-1211.
[32] Sears, J. and G. Hoetker 2014, “Technological Overlap, Technological Capabilities, and
Resource Recombination in Technological Acquisitions,” Strategic Management Journal,
35(1), 48-67.
[33] Shi, Z., H. Rui, and A.B. Whinston 2014, “Content Sharing in A Social Broadcasting
Environment: Evidence from Twitter,” MIS Quarterly, 38(1), 123-142.
[34] Shmueli, G. and O.R. Koppius 2011, “Predictive Analytics in Information Systems
Research,” MIS Quarterly, 35(3), 553-572.
[35] Skerlavaj, M., V. Dimovski, and K.C. Desouza 2010, “Patterns and Structures of Intra-
Organizational Learning Networks within a Knowledge-Intensive Organization,” Jour-
nal of Information Technology, 25(2), 189-204.
[36] Snijders, T.A.B. 2002, “Markov Chain Monte Carlo Estimation of Exponential Random
Graph Models,” Journal of Social Structure, 3(2), 1-40.
[37] Stuart, T.E. 1998, “Network Positions and Propensities to Collaborate: An Investiga-
tion of Strategic Alliance Formation in a High-Technology Industry,” Administrative
Science Quarterly, 43(3), 668-698.
[38] Stuart, T.E. and S. Yim 2010, “Board Interlocks and The Propensity to Be Targeted
in Private Equity Transactions,” Journal of Financial Economics, 97(1), 174-189.
[39] Teh, Y.W., M.I. Jordan, M.J. Beal, and D.M. Blei 2006, “Hierarchical Dirichlet Pro-
cesses,” Journal of the American Statistical Association, 101, 1566-1581.
[40] Wang, L. and E.J. Zajac 2007, “Alliance or Acquisition? A Dyadic Perspective on
Interfirm Resource Combinations,” Strategic Management Journal, 28(13), 1291-1317.
[41] Whinston, A.B. and X. Geng 2004, “Operationalizing The Essential Role of The Information
Technology Artifact in Information Systems Research: Gray Area, Pitfalls, and The
Importance of Strategic Ambiguity,” MIS Quarterly, 28(2), 149-159.
[42] Xu, L., J.A. Duan, and A.B. Whinston 2014, “Path to Purchase: A Mutually Exciting
Point Process Model for Online Advertising and Conversion,” Management Science,
60(6), 1392-1412.
A Additional Tables
Symbol                   Description
Network graph
  $Y$, $Y_{ij}$          a random network graph matrix, its $i, j$ element
  $Y_{-ij}$              all elements except $i, j$
  $\mathcal{Y}$          the set of all possible graphs for a fixed set of nodes
  $y$, $y_{ij}$          a realization of the random network graph and its $i, j$ element
  $z_k(y)$               a statistic of network graph $y$
Network statistics
  $t$                    total number of edges
  $d_2$                  number of nodes which have at least 2 edges
  $h^{sta}_s$            number of edges within state $s$
  $h^{cat}_c$            number of edges within category $c$
  $p_g$                  sum of geographic proximity over all edges
  $p_s$                  sum of social proximity over all edges
  $p_f$                  sum of investor proximity over all edges
  $p_b$                  sum of business proximity over all edges
Nodal characteristics
  $s_i$                  state where $i$’s headquarters is located
  $c_i$                  category to which $i$ belongs
Dyadic characteristics
  $p_{g,ij}$             geographic proximity of $i$ and $j$
  $p_{s,ij}$             social proximity of $i$ and $j$
  $p_{f,ij}$             investor proximity of $i$ and $j$
  $p_{b,ij}$             business proximity of $i$ and $j$
Table 6: ERGM Notations
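For orientation, the notation in Table 6 can be read against the generic exponential random graph model. The display below is the standard textbook form written with the symbols of Table 6. It is a sketch and not a reproduction of the paper's Equation (10), which appears earlier and may differ in detail. The symbols $y^{+}_{ij}$ and $y^{-}_{ij}$ (the graph $y$ with the $i, j$ element set to 1 and to 0, respectively) and the generic coefficient labels $\theta_k$ are introduced here for illustration; $\theta_b$ is the coefficient on $p_b$, as in the caption of Table 8.

```latex
% Probability of observing graph y, normalized over all graphs in \mathcal{Y}
P_{\theta}(Y = y) =
  \frac{\exp\{\sum_{k} \theta_k z_k(y)\}}
       {\sum_{y' \in \mathcal{Y}} \exp\{\sum_{k} \theta_k z_k(y')\}}

% Proximity statistics are sums over realized edges, e.g. for business proximity
% (p_g, p_s, and p_f are defined analogously)
p_b(y) = \sum_{(i,j):\, y_{ij} = 1} p_{b,ij}

% Conditional log-odds of an edge between i and j, holding the rest of the graph fixed
\log \frac{P_{\theta}(Y_{ij} = 1 \mid Y_{-ij} = y_{-ij})}
          {P_{\theta}(Y_{ij} = 0 \mid Y_{-ij} = y_{-ij})}
  = \sum_{k} \theta_k \left[ z_k(y^{+}_{ij}) - z_k(y^{-}_{ij}) \right]
```

Under this form, adding the edge between $i$ and $j$ changes $p_b$ by exactly $p_{b,ij}$, so the business proximity term contributes $\theta_b \, p_{b,ij}$ to the log-odds of that edge.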
Coe S.E. p-value Coe S.E. p-value
Geographic -0.2699 0.3440 0.4326 NV - - -
Social 0.0532 0.0108 0.0000 NY - - -
Investor 0.0270 0.0522 0.6049 OH - - -
Business 0.4635 0.1378 0.0008 OK - - -
Edges -12.5625 3.7908 0.0009 OR - - -
Degree> 2 2.4820 0.6438 0.0001 PA - - -
State RI - - -
AL - - - SC - - -
AR - - - SD - - -
AZ - - - TN - - -
CA 2.3899 0.8178 0.0035 TX - - -
CO - - - UT - - -
CT - - - VA - - -
DC - - - VT - - -
DE - - - WA - - -
FL - - - WI - - -
GA - - - WV - - -
HI - - - WY - - -
IA - - - Category
ID - - - advertising - - -
IL - - - biotech - - -
IN - - - cleantech - - -
KS - - - consulting - - -
KY - - - ecommerce - - -
LA - - - education - - -
MA 4.6361 1.1201 0.0000 enterprise 2.9201 0.8882 0.0010
MD - - - games video 3.0284 1.0953 0.0057
ME - - - hardware 3.7045 1.7912 0.0386
MI - - - legal - - -
MN - - - mobile 1.8611 1.2047 0.1223
MO - - - network hosting - - -
MS - - - other - - -
MT - - - public relations - - -
NC - - - search - - -
NE - - - security - - -
NH 9.7899 1.5931 0.0000 semiconductor - - -
NJ 5.6899 1.6428 0.0005 software - - -
NM - - - web -0.9020 2.1375 0.6731
Table 7: Model Coecients from Sample 1
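As a reading aid for Table 7, the sketch below converts a coefficient into an odds ratio and recomputes a two-sided p-value from the coefficient and its standard error. It relies on the standard ERGM interpretation of a coefficient as a change in the conditional log-odds of a tie. The normal (Wald) approximation for the p-value is an assumption made here, since this section does not state how the reported p-values were obtained, and the function names are illustrative.

```python
import math


def odds_ratio(coef, delta=1.0):
    """Multiplicative change in the odds of a tie when the statistic rises by delta."""
    return math.exp(coef * delta)


def wald_p_value(coef, se):
    """Two-sided p-value from a normal approximation to coef / se (assumed test)."""
    z = abs(coef / se)
    # erfc(z / sqrt(2)) equals twice the upper-tail probability of a standard normal.
    return math.erfc(z / math.sqrt(2.0))


# Business proximity row of Table 7.
business_coef, business_se = 0.4635, 0.1378

business_or = odds_ratio(business_coef)
business_p = wald_p_value(business_coef, business_se)

print(round(business_or, 2), round(business_p, 4))
```

For the business proximity row, the odds ratio is roughly 1.59 per unit of $p_{b,ij}$, and the recomputed p-value agrees with the reported 0.0008 to four decimals. If proximity is bounded between 0 and 1, a smaller `delta` such as 0.1 gives a more natural comparison.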
Column key: N = number of samples with coefficient; N(> 0) = number of samples with
coefficient > 0; N(p < 1.0%) = number of samples with p-value < 1.0%.

Category         N   N(> 0)   N(p < 1.0%)    Category         N   N(> 0)   N(p < 1.0%)
advertising     28       28            14    mobile          27       27            16
biotech         37       37            32    net hosting      8        8             6
cleantech       12       12            10    other            0        -             -
consulting      12       12             9    pub rel         10       10             6
ecommerce       12       12             6    search           0        -             -
education        0        -             -    security         0        -             -
enterprise      22       22            20    semiconductor   17       17            14
games video     28       28            16    software        89       85            55
hardware        31       31            29    web             78       70            22
legal            0        -             -
(a) Category
Table 8: Category-Based Selective Mixing Coefficients (100 Samples): Equation (10)
excluding $\theta_b p_b$
Topic Dimension Top 5 Words
1 Product video,music,digital,entertainment,artists
2 Product news,site,blog,articles,publishing
3 Product job,jobs,search,employers,career
4 Product people,community,members,share,friends
5 Product facebook,friends,share,twitter,photos
6 Product energy,power,solar,systems,water
7 Product systems,design,applications,devices,semiconductor
8 Product consulting,clients,support,systems,experience
9 Product event,sports,events,fans,tickets
10 Product insurance,financial,credit,tax,mortgage
11 Product deals,shopping,consumers,local,retailers
12 Product health,care,medical,healthcare,patient
13 Product students,learning,education,college,school
14 Product food,restaurants,fitness,restaurant,pet
15 Product investment,financial,investors,capital,trading
16 Product advertising,publishers,advertisers,brands,digital
17 Product manage,project,documents,document,tools
18 Product treatment,medical,research,clinical,diseases
19 Product games,game,gaming,virtual,entertainment
20 Product security,compliance,secure,protection,access
21 Product search,engine,website,seo,optimization
22 Product search,user,engine,results,relevant
23 Product fashion,art,brands,custom,design
24 Product equipment,repair,car,home,accessories
25 Product law,legal,government,public,federal
26 Product analytics,research,analysis,intelligence,performance
27 Product travel,travelers,vacation,hotel,hotels
28 Product real,estate,home,buyers,property
29 Product payment,card,cards,credit,payments
30 Technology/Product phone,email,text,voice,messaging
31 Technology/Product wireless,networks,communications,internet,providers
32 Technology/Product cloud,storage,hosting,server,servers
33 Technology/Product app,apps,iphone,android,applications
34 Technology/Product design,applications,application,custom,website
35 Technology/Product site,website,free,allows,user
36 Technology/Product testing,test,monitoring,tracking,performance
37 Market/Technology digital,clients,brand,agency,design
38 Market sales,customer,lead,email,leads
39 Market solution,cost,costs,applications,enterprise
40 Market organizations,community,support,organization,businesses
41 Market make,people,time,just,way
42 Market quality,customer,needs,clients,provide
43 Market systems,operates,headquartered,subsidiary,serves
44 Market united,states,offices,america,europe
45 Market san,york,city,california,francisco
46 Market award,magazine,awards,best,world
47 Market million,world,leading,largest,global
48 Market/Team team,experience,industry,world,market
49 Team partners,ventures,capital,including,san
50 Team launched,million,product,ceo,acquirede
Table 9: LDA Results of CrunchBase Data