Towards A Better Measure of Business Proximity:

Topic Modeling for Industry Intelligence

July 31, 2015

Abstract

In this article, we propose a new data-analytic approach to measure firms’ dyadic business proximity. Specifically, our method analyzes the unstructured texts that describe firms’ businesses using the statistical learning technique of topic modeling, and constructs a novel business proximity measure based on the output. Compared with existing methods, our approach is scalable to large datasets and provides finer granularity in quantifying firms’ positions in the spaces of product, market, and technology. We then validate our business proximity measure in the context of industry intelligence and show the measure’s effectiveness in an empirical application of analyzing mergers and acquisitions in the U.S. high technology industry. Based on the research, we also build a cloud-based information system to facilitate competitive intelligence on the high technology industry.

Keywords: Big Data analytics, business proximity, topic modeling, industry intelligence, information system


1 Introduction

Business proximity measures firms’ relatedness in the spaces of product, market, and technology. It is an important concept in industry intelligence and a central building block in many studies of firm strategy and industrial organization. Not surprisingly, prior studies in different management disciplines have used or developed a handful of measures of business proximity. One common practice has been to classify firms into industries (or sub-industries) and to operationalize business proximity as a binary variable that indicates common industry (or sub-industry) membership. Under this definition, two firms’ businesses are either identical or completely different. A refined extension of the binary definition has been to better utilize the hierarchical information provided by an industry classification system, such as the Standard Industrial Classification (SIC) or the North American Industry Classification System (NAICS). For example, in Wang and Zajac (2007), the similarity of two firms’ businesses was determined by the number of common consecutive digits in their industry classification codes under NAICS. Since they used the first four digits in NAICS, the similarity quantity was one of five possible values: 0.00, 0.25, 0.50, 0.75, or 1.00. However, this measure is still discrete, and the level of granularity it can achieve is constrained by the industry classification system on which it depends. Several other measures were aimed at some specific aspect of firms’ businesses, and they typically had stronger data requirements. Stuart (1998), Mowery et al. (1998), and others constructed a “technological overlap” measure using data on firms’ patent holdings. The closeness of a pair of firms was assumed to be proportional to the number of common antecedent patents cited. While this is an elegant, continuous measure in the technology space, it requires complete data on firms’ patent portfolios and does not explicitly cover the product and market spaces. Mitsuhashi and Greve (2009) applied the Jaccard distance to firms’ customer geographic regions in measuring “market complementarity.” Likewise, this measure focuses only on the (geographic) market space and requires all relevant firms’ customer geography data to be available.
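To make the discreteness of the digit-based measure concrete, the following minimal sketch computes a similarity from the number of matching leading digits of two classification codes. It assumes, as one reading of the description above, that the similarity is the count of consecutive common leading digits divided by four; the function name and the example codes are hypothetical and serve only as illustration.

```python
def naics_prefix_similarity(code_a: str, code_b: str, digits: int = 4) -> float:
    """Share of the first `digits` digits that match consecutively from the left."""
    matched = 0
    for a, b in zip(code_a[:digits], code_b[:digits]):
        if a != b:
            break  # stop at the first mismatch: only consecutive matches count
        matched += 1
    return matched / digits


# Hypothetical codes, chosen only to show that few distinct values are possible.
same_code = naics_prefix_similarity("5112", "5112")
two_digits = naics_prefix_similarity("5112", "5182")
no_overlap = naics_prefix_similarity("5112", "3341")
```

Whatever the pair of codes, the function can return only 0.00, 0.25, 0.50, 0.75, or 1.00, which is the granularity limit noted above.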

While these measures have served the researchers’ purposes well, we see an opportunity for a new and more general methodology in light of the increasing availability of public, unstructured data and recent advances in Big Data analytics. In this paper, we propose a method that requires little manual preprocessing yet provides finer granularity in quantifying firms’ positions in the spaces of product, market, and technology. Utilizing a statistical learning technique called topic modeling (Blei 2012), we analyze the publicly available, unstructured texts that describe firms’ businesses. Our automatic approach, the core of which is a Latent Dirichlet Allocation (LDA) algorithm, represents each firm’s textual description as a probabilistic distribution over a set of underlying topics, which we interpret as aspects of its business. The data-analytic framework thus greatly reduces the complexity of representing the business environment, and produces structured information that enables further examination and derivation. The new business proximity measure is then naturally constructed by quantifying the “distance” between a pair of firms’ topic distributions.

An important advantage of our method for measuring business proximity is that it imposes a much weaker requirement on structured data than the existing measures. This makes our approach particularly appealing when the firms under study are small and privately held, for which detailed information on industry classification, patent holdings, and products/customers is either highly sparse or not available at all. Motivated by this advantage, we choose the U.S. high technology (high-tech) industry as the empirical context to demonstrate our approach. We collect data from CrunchBase, an open and comprehensive source for high-tech startup activity. For the majority of companies in our dataset, the standardized industry classification code is unavailable, and for various strategic reasons, most do not disclose their customer information and key intellectual property, so the conventional methods for measuring business proximity cannot be operationalized. Using this dataset as an example, we detail the procedure of our data-analytic approach and compute business proximity for each pair of companies. We then show the validity and effectiveness of the new measure in the context of industry intelligence by (1) examining the relationships between business proximity and simple category classification, between business proximity and job mobility, and between business proximity and investment, respectively, and (2) using the measure in a novel empirical application of modeling the matching of companies in mergers and acquisitions (M&As). Our comprehensive, continuous measure enables the analysis to show the nuanced relationship between M&A transactions and the firms’ business similarity and complementarity. Methodologically, to recognize the increasingly networked business environment as well as to accommodate the relational nature of the matching data, we employ an innovative statistical framework called Exponential Random Graph Models (ERGMs) in the M&A analysis.

This research joins the rapidly growing stream of information systems literature that leverages newly developed data science techniques in examining Big Data for business analytics (e.g., Adomavicius and Tuzhilin 2005, Shmueli and Koppius 2011, Chen et al. 2012, Chiang et al. 2012, Ghose et al. 2012, Shi et al. 2014, Xu et al. 2014). Our research shows how Big Data analytics can potentially transform competitive intelligence, particularly for the high-tech industry, where recent years have seen an “entrepreneurial boom” characterized by the explosion of digital startups. This explosion has made it ever more difficult to rely purely on individuals’ industry knowledge to depict the rapidly changing landscape of the startup world. Our empirical analysis demonstrates the potential of extracting economically meaningful information from publicly available, unstructured data through large-scale computation, as well as the value of the proposed business proximity measure as an important metric in the analytics of M&A matching and as a search tool for navigating the networked startup world. To further illuminate the practical implications of our data-analytic framework, we build an information system that allows managers and analysts to use business proximity to explore the competitive landscape of the U.S. high-tech industry. The back end of our system handles data collection, storage, and large-scale computation using a Big Data computation platform (Condor), NoSQL database technology (MongoDB), and various programming languages (Python, Scala). The front end of the system is hosted on Google’s Cloud Platform and provides users an easy-to-use web interface. It is available at http://146.6.99.242/bizprox.

We organize the remainder of this paper as follows. To provide a context for describing the data-analytic method, we first introduce our dataset in Section 2. In Section 3, we elaborate the procedure for constructing our business proximity measure. In Section 4, we demonstrate the validity and effectiveness of our measure. We describe the information system implementation in Section 5. Lastly, we discuss and conclude our paper.

2 Data

The dataset for demonstrating our methodology was collected from CrunchBase.[1] CrunchBase is an open and free database of high-tech companies, people, and investors. Regarded as the Wikipedia of the high-tech industry, it provides a comprehensive view of the “startup world.” CrunchBase keeps track of the industry by automatically retrieving and extracting information from professionally edited news articles on technology-focused websites.[2] In addition, ordinary users can contribute to CrunchBase in a crowdsourcing manner. For quality assurance, each update is reviewed by moderators. Existing data points are also constantly reviewed by the editors. Compared with other high-tech-focused data vendors, CrunchBase has the advantage of more complete coverage on early-stage startups, especially those not (yet) funded by venture capitalists.

[1] http://www.crunchbase.com.

[2] For example, http://www.allthingsd.com, http://www.techcrunch.com, and http://www.businessinsider.com.

Data collection was carried out between April 2013 and April 2015. The companies and their information were collected at the beginning of the period. We limit our dataset to U.S.-based companies and exclude those for which some basic information (e.g., founding date, business description) is missing. We further exclude companies that had already been acquired as of April 2013. The resultant dataset contains 24,382 companies, the vast majority of which are privately held, early-stage startups that are unclassified under SIC or NAICS. As of April 2013, 345 of the companies (1.41%) in the dataset were publicly traded, and the median age of the whole sample was 5.66 years. For each company, we also observe its headquarters location, industry sector (CrunchBase-defined category), (co)founders, board members, key employees, angel and venture investors that participated in each of its funding rounds, acquisitions, investments, and a business description. Confirming the common knowledge about the high-tech industry, we observe considerable geographic clustering. Figure 1(a) visualizes the spatial distribution of the companies using the headquarters-location data aggregated at the city level. The circles are centered at the cities, and their radii are proportional to the number of companies. The major high-tech hub cities include New York City (8.08% of the companies), San Francisco (7.92%), Los Angeles (2.17%), Chicago (2.10%), Seattle (1.93%), Austin (1.84%), and Palo Alto (1.81%). At the state level, as shown in Figure 2(a), California leads with 34.72% of the companies, followed by New York (11.99%), Massachusetts (5.89%), and Texas (5.20%). We also observe a highly uneven distribution of companies across the 19 industry sectors (CrunchBase-defined categories). The leading sectors are “software” (19.23%) and “web” (17.13%), and the trailing sectors are “semiconductor” (1.00%) and “legal” (0.73%), as shown in Figure 2(b). In the dataset, the people’s profiles also contain their past professional experiences. The unstructured, textual descriptions are mostly of short to moderate length, comprising one or more paragraphs on the key facts about the companies’ products, markets, and technologies.

For the validation of the proposed method, we use three types of inter-firm interactions: M&A (one firm acquires another), investment (one firm invests in another), and job mobility (an individual changes jobs from one firm to another). We continuously monitored these activities through April 2015. Our dataset includes a total of 1,689 M&A transactions since 2008. Figure 1(b) geo-maps each of the M&A transactions using the headquarters locations of the involved companies. A little less than two-thirds (62.59%) of the deals are cross-state. A numerically similar portion of transactions (63.56%) is cross-sector. The distribution of the number of transactions per company is highly skewed: the top 10 and top 20 buyers made 14.32% and 21.23% of all the deals, respectively. Among these M&A transactions, 394 (23.32%) occurred between April 2013 and April 2015. For investments, a total of 531 transactions are recorded, and the post-April-2013 number is 129 (24.29%). Lastly, the job mobility data are computed based on position changes among the 24,334 people in the dataset. In total, 19,697 company pairs are connected by job transitions, 9,792 (49.71%) of them by post-April-2013 activities.

Figure 1: Geo-mapping Company Locations and M&A Transactions. Panels: (a) Companies; (b) M&A Transactions.

Figure 2: Distribution of Companies over State and Industry Sector. Panels: (a) State (axes: state, count); (b) Industry Sector (axes: industry, count).


3 Measuring Business Proximity: Data-Analytic Framework

Business proximity measures firms’ closeness in the spaces of product, market, and technology. Our objective is to develop a data-driven, analytics-based business proximity measure that improves on scalability, classification granularity, and comprehensiveness. The input of our method, an unstructured, textual business description for each firm, requires no manual classification and is also much more likely to be available than structured information such as NAICS/SIC codes or patent portfolios, especially for high-tech startups.

Our approach builds upon a text mining technique called topic modeling, a statistical method that discovers abstract “topics” from a large collection of documents. At present, the most common topic modeling algorithm is Latent Dirichlet Allocation (Blei et al. 2003). LDA does not require manually labeling each document, so it is an unsupervised learning algorithm. The underlying model of LDA is generative: the assumption is that each word in each document is probabilistically drawn from the vocabulary of a topic discussed in that document. Given a large collection of documents, the vocabularies of topics and the topics of the documents are jointly estimated.

More formally, we let the number of input descriptions (i.e., the total number of companies) be $D$, where each description $d \in \{1, 2, \ldots, D\}$ is a collection of words $\{w^d_n \mid n = 1, 2, \ldots, N^d\}$. Let the total number of latent “topics” (business aspects) expressed by the descriptions be $K$. Each topic $k \in \{1, 2, \ldots, K\}$ is a probabilistic distribution over the whole vocabulary, i.e., the set of unique words in the description corpus. This distribution is denoted $\phi^k$, where $\phi^k_w$ is the probability of word $w$ in topic $k$. The topic proportions for description $d$ are $\theta^d$, where $\theta^d_k$ is the topic proportion for topic $k$ in description $d$. Assume $z^d_n$ is the topic assignment of the $n$’th word in description $d$. Then, given $\theta^d$ and $\phi^k$, the probability of observing description $d$ is

$$\prod_{n=1}^{N^d} \left( \sum_{k=1}^{K} P(w^d_n \mid z^d_n = k, \phi^k) \, P(z^d_n = k \mid \theta^d) \right) = \prod_{n=1}^{N^d} \left( \sum_{k=1}^{K} \phi^k_{w^d_n} \theta^d_k \right), \tag{1}$$

where the term inside the product operator is the probability of the $n$’th word in description $d$ being $w^d_n$. LDA takes the Bayesian approach and is a complete generative model. It further assumes Dirichlet priors for both $\theta$ and $\phi$, with hyperparameters $\alpha$ and $\beta$ respectively. Thus, the generative process of LDA can be represented by the following joint distribution:

$$P(w, z, \theta, \phi \mid \alpha, \beta) = \prod_{k=1}^{K} P(\phi^k \mid \beta) \prod_{d=1}^{D} P(\theta^d \mid \alpha) \left( \prod_{n=1}^{N^d} P(w^d_n \mid z^d_n, \phi^k) \, P(z^d_n \mid \theta^d) \right). \tag{2}$$

Having observed the descriptions, hence $w$, we compute the posterior distribution

$$P(z, \theta, \phi \mid \alpha, \beta, w) = \frac{P(w, z, \theta, \phi \mid \alpha, \beta)}{P(w \mid \alpha, \beta)}, \tag{3}$$

using Monte Carlo methods in Bayesian statistics. Finally, the estimates of $\theta$ and $\phi$ are obtained by examining the posterior distribution.
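As a small numerical illustration of Equation (1), the sketch below evaluates the probability of one description when the topic proportions and the topic-word distributions are already given. It does not estimate anything and is not the inference procedure used in this paper; the function name, the toy vocabulary of three words, and all numbers are hypothetical. The computation is carried out in logs, a common choice to avoid numerical underflow for long descriptions.

```python
import math


def description_log_likelihood(word_ids, theta_d, phi):
    """Log of Equation (1): sum over words of log(sum_k phi[k][w] * theta_d[k])."""
    total = 0.0
    for w in word_ids:
        # Probability that the n-th word equals w, mixing over the K topics.
        p_word = sum(phi_k[w] * theta_dk for phi_k, theta_dk in zip(phi, theta_d))
        total += math.log(p_word)
    return total


# Toy inputs (hypothetical): K = 2 topics over a 3-word vocabulary.
phi = [
    [0.7, 0.2, 0.1],  # topic 0: distribution over words 0..2
    [0.1, 0.3, 0.6],  # topic 1: distribution over words 0..2
]
theta_d = [0.5, 0.5]  # topic proportions of one description
word_ids = [0, 2]     # the description consists of word 0 followed by word 2

likelihood = math.exp(description_log_likelihood(word_ids, theta_d, phi))
```

In this toy case the two per-word probabilities are 0.4 and 0.35, so the description probability is their product, 0.14.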

In summary, LDA is utilized in the data-analytic framework to analyze the textual descriptions of the firms. Each description is a document, and all the descriptions together are the input of LDA. The algorithm produces $K$ topics ($K$ is a parameter specified by the researcher), each of which is represented by a probabilistic distribution over the set of words. In addition, LDA computes the topic distribution for each company description. For each company, a probability value, or weight, is assigned to each discovered topic and the values sum up to 1. Essentially, through topic modeling, a company $i$’s description is represented by a topic distribution $T_i = \{T_{i,1}, T_{i,2}, \ldots, T_{i,K}\}$, where $T_{i,k}$ is the weight on the $k$-th topic and $\sum_{k=1}^{K} T_{i,k} = 1$.

We interpret the discovered topics as the different components of the companies’ businesses. If a particular $T_{i,k} = 0$, then component $k$ is irrelevant to company $i$’s business. Finally, we define the business proximity $p_b(i, j)$ between two companies $i$ and $j$ as the cosine similarity[3] of the two corresponding topic distributions $T_i$ and $T_j$, which can be written as follows:

$$p_b(i, j) = \frac{T_i \cdot T_j}{\|T_i\| \, \|T_j\|} = \frac{\sum_{k=1}^{K} T_{i,k} T_{j,k}}{\sqrt{\sum_{k=1}^{K} (T_{i,k})^2} \, \sqrt{\sum_{k=1}^{K} (T_{j,k})^2}}. \tag{4}$$

The resulting proximity values range between 0 and 1, where a larger value indicates closer proximity between the pair of companies. The measure equals 0 if and only if the two firms have no common business component; the measure equals 1 if and only if the two firms share exactly the same business components as well as the same weights.

[3] Cosine similarity is one measure of similarity between two distributions. We can apply other similarity measures such as normalized Euclidean distance. We can also view each topic distribution as a set where the elements are the topics with strictly positive probability, and then use set comparison metrics such as the Jaccard index and Dice’s coefficient. Our main results are robust to these alternative measures.
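Equation (4) and the set-based alternative mentioned in the footnote can be sketched in a few lines. The function names and the four-topic example distributions below are hypothetical and chosen only to show the boundary cases; the sketch assumes each input is a list of non-negative weights of equal length with at least one positive entry.

```python
import math


def business_proximity(t_i, t_j):
    """Cosine similarity of two topic distributions, as in Equation (4)."""
    dot = sum(a * b for a, b in zip(t_i, t_j))
    norm_i = math.sqrt(sum(a * a for a in t_i))
    norm_j = math.sqrt(sum(b * b for b in t_j))
    return dot / (norm_i * norm_j)


def jaccard_on_support(t_i, t_j):
    """Jaccard index of the sets of topics with strictly positive weight."""
    s_i = {k for k, w in enumerate(t_i) if w > 0}
    s_j = {k for k, w in enumerate(t_j) if w > 0}
    return len(s_i & s_j) / len(s_i | s_j)


# Hypothetical topic distributions over K = 4 topics (each sums to 1).
firm_a = [0.5, 0.5, 0.0, 0.0]
firm_b = [0.5, 0.5, 0.0, 0.0]  # same components and weights as firm_a
firm_c = [0.0, 0.0, 0.3, 0.7]  # no component in common with firm_a
firm_d = [0.0, 0.5, 0.5, 0.0]  # shares one component with firm_a
```

Here `business_proximity(firm_a, firm_b)` is 1 up to floating-point rounding, `business_proximity(firm_a, firm_c)` is 0, and `firm_a` versus `firm_d` falls in between, which mirrors the two boundary conditions stated above.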


Topic  Dimension           Top 5 Words
1      Product             video, music, digital, entertainment, artists
2      Product             news, site, blog, articles, publishing
3      Product             job, jobs, search, employers, career
4      Product             people, community, members, share, friends
30     Technology/Product  phone, email, text, voice, messaging
31     Technology/Product  wireless, networks, communications, internet, providers
32     Technology/Product  cloud, storage, hosting, server, servers
33     Technology/Product  app, apps, iphone, android, applications
38     Market              sales, customer, lead, email, leads
39     Market              solution, cost, costs, applications, enterprise

Table 1: LDA Results of CrunchBase Data (Partial)
Note: Only the top five words are presented for brevity.

We carry out the proposed method on the CrunchBase dataset. We run the LDA model and compute the corresponding business proximity for a set of different K values: 50, 100, 200, and 500. The main results on coefficient signs and their statistical significance reported in the empirical validation and application section are robust to the different choices. Due to the page limit, we report the results for K = 50 in the main text. To illustrate that the topic model results comprehensively capture multiple dimensions of a firm’s business, in Table 1 we list 10 topics that LDA produces from our dataset. Note that each topic is a distribution over all words in the vocabulary and that, for brevity, we show only the top five words in terms of their probability. The full 50-topic list is shown in Table 9 in Appendix A. We have checked all 50 topics and find that each topic consists of frequent words that are tightly related to each other. We also observe that the topics capture the current trends in the high-tech industry. Using the LDA results, we compute business proximity for all company pairs in the dataset. Owing to the huge number of pairs (close to 300 million), we parallelize the computation algorithm for speedy processing.
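The pair count quoted above follows directly from the number of companies in the dataset. The sketch below reproduces it and shows one simple way to split the work into contiguous row blocks; the `row_blocks` helper and the number of workers are hypothetical, and this partition is not necessarily the scheme used in our system.

```python
import math

n_companies = 24_382  # number of companies reported in Section 2
n_pairs = math.comb(n_companies, 2)  # unordered company pairs
print(n_pairs)  # 297228771


def row_blocks(n, n_workers):
    """Split rows 0..n-1 into contiguous (start, end) blocks, end exclusive."""
    size = math.ceil(n / n_workers)
    return [(start, min(start + size, n)) for start in range(0, n, size)]


# Each (hypothetical) worker would compute proximities for the rows in its block.
blocks = row_blocks(n_companies, 8)
```

Because only pairs with i < j need to be computed, later row blocks contain fewer pairs than earlier ones, so a production setup would likely balance the blocks by pair count rather than by row count.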

Our new data-driven approach for measuring business proximity overcomes many of the limitations faced by the existing methods. First, the approach is scalable because the construction of the business aspects and business proximity is automated, in sharp contrast to domain-expert-based industry classification, in which manual annotation is required as the first step. Second, our approach is generally applicable to a wide range of firms (either public or private) as long as textual business descriptions exist for the firms. In contrast, industry classification is only sparsely available for small companies, and financial filings data are available only for public companies. Note that only 1.41% of the high-tech companies in our dataset are public, as discussed in Section 2. Third, our approach provides finer granularity than the existing discrete similarity measures because the algorithm produces continuous similarity values. Fourth, the proposed method provides the flexibility to cope with dynamic industry changes. As the underlying business descriptions in the industry change, the algorithm can automatically detect the emerging topics in the industry and incorporate them into the business proximity.

4 Empirical Validation and Application

4.1 Validation

To validate the constructed business proximity measure, we first examine the relationship between it and a simple category-based classification. Because the NAICS-based proximity cannot be operationalized due to the data limitation (in fact, the CrunchBase companies are already in a narrowly focused industry), we leverage the simple industry sector information, i.e., the categories defined by CrunchBase (see Figure 2). We construct a binary indicator for same-category membership, denoted category match, and let it serve as a benchmark business proximity measure. We then compare the distributions of the proposed analytics-based measure in two groups of company pairs: (i) company pairs in the same category (category match = 1), and (ii) those belonging to different categories (category match = 0). Figure 3 compares the business proximity values between the two groups. The lower and upper hinges of the boxes indicate the first and third quartiles (the 25th and 75th percentiles). The results show that the same-category group (mean: 0.12) has a mean business proximity value twice as large as the cross-category group (mean: 0.06). The Pearson’s correlation coefficient between business proximity and category match is 0.11, with a t-statistic of 61.94 and a p-value smaller than 2.2e−16. The large t-statistic and low p-value indicate that the positive correlation between the proposed business proximity and the simple category-based classification is highly statistically significant, although its magnitude (0.11) is modest.
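The relationship between a correlation coefficient and its t-statistic helps in reading these numbers. Under the standard test of zero correlation, t = r·sqrt(n − 2)/sqrt(1 − r²), so a correlation of 0.11 produces a very large t-statistic once the number of pairs is large. The sketch below illustrates this; the sample sizes are hypothetical, since the number of company pairs entering the test is not stated in the text.

```python
import math


def pearson_t_statistic(r, n):
    """t-statistic for testing zero correlation given Pearson r and sample size n."""
    return r * math.sqrt(n - 2) / math.sqrt(1.0 - r * r)


# r is the reported correlation; both n values are hypothetical sample sizes.
t_small = pearson_t_statistic(0.11, 1_000)
t_large = pearson_t_statistic(0.11, 300_000)
```

With the same correlation, the t-statistic grows with the square root of the sample size, which is why the test statistic speaks to statistical significance rather than to the strength of the association.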

For further validation, we test the predictive power of the proposed business proximity on three types of inter-firm interactions: M&A, investment, and job mobility.[4] Operationally, we compare the realized business proximity among four groups (M&A, investment, job mobility, and random) of company pairs to test whether business proximity has a leading effect on the corresponding inter-firm interactions. One caveat is that high business proximity values could be the result of firm transactions. For instance, after an M&A transaction takes place, it is very likely that the acquiring company’s business description will incorporate various aspects of the acquired company. To avoid this reverse effect, we consider only the inter-firm transactions after April 2013, which is when all the company descriptions were collected. Our inter-firm interaction dataset contains 394 company pairs associated with M&A transactions, 129 with inter-firm investments, and 9,792 with job mobility.[5] Lastly, to construct the baseline, we randomly select company pairs from the whole dataset.

Figure 3: Distributions of Business Proximity: Same- and Cross-Category Company Pairs. Axes: category_match (0, 1) and business proximity ([0, 1]).
Note: The lower and upper hinges of the boxes indicate the 25th and 75th percentiles.

Figure 4: Distributions of Business Proximity: M&A, Investment, Job Mobility, and Random Samples. Axes: group (M&A, invest, jobmob, random) and business proximity ([0, 1]).
Note: The lower and upper hinges of the boxes indicate the 25th and 75th percentiles.

Figure 4 compares the distribution of business proximity values among the company pairs defined by M&A, investments, job mobility, and random selection. We find that the proposed business proximity has higher values between company pairs connected by the three types of inter-firm interactions than between random pairs, indicating a positive association between each type of interaction and the proximity. On average, each of the first three groups has more than three times the proximity of the randomly paired group: M&A (0.293), investments (0.224), job mobility (0.218), and random (0.068). Given that M&A is a rare, significant inter-firm transaction, it is intuitive that M&A-paired firms have higher similarity than firms paired by the other two interaction types (investments and job mobility).

4.2 Empirical Application on M&As

In this subsection, we demonstrate the business proximity measure’s value for empirical modeling. Specifically, we apply it in analyzing high-tech M&As. Recognizing the increasingly networked business environment,[6] we construct a network structure by incorporating firm proximity in different dimensions, and then use a statistical network model to analyze their interactions. Our objective is to examine the relationship between the likelihood of a pair of firms’ matching in an M&A transaction and their individual and pairwise characteristics, among which the newly developed business proximity is of primary interest. We first summarize the theoretical basis for the importance of business proximity as well as proximity in three other dimensions in modeling M&As. Next, we introduce the statistical network analysis method and explain our empirical specifications. Lastly, we present estimation results.

[4] The rationale for choosing these interactions is the following: M&A is an important inter-firm transaction that in theory creates business synergy (e.g., Rhodes-Kropf and Robinson 2008); inter-firm investments are associated with technological or market overlaps (e.g., Mowery et al. 1998), and may lead to future M&A transactions (Mikkelson and Ruback 1985); the labor economics literature found evidence that a significant portion of job moves involve companies that are in the same industry (e.g., Moscarini and Thomsson 2007, Fallick et al. 2006).

[5] For job mobility, if a person made a job transition from a company A to another company B, then we consider A and B to be associated.

4.2.1 Proximity and M&A

The high-tech industry is characterized by active and rapid innovation, significant geographic clustering (at a handful of high-tech hubs), rapid job mobility, high concentration of ownership at the company level, and strong influence of angel and venture investors. We posit that business proximity, geographic vicinity, social linkage, and common ownership are associated with the likelihood of two firms’ matching in an M&A transaction.

Business Proximity

Business proximity measures firms’ relatedness in the spaces of product, market, and technology. It has been widely recognized in the finance and management literature that the potential synergy in products, markets, and technologies is a key driver for M&As (e.g., Rhodes-Kropf and Robinson 2008) and is especially important in high-tech acquisitions (e.g., Ahuja and Katila 2001). The central idea of business synergy is that economic surplus can be created from novel recombination of the acquirer’s and target’s resources and capabilities. Hence, one of the determinants for the matching of the acquirer and target should be the recombination potential, which is in turn influenced by the relatedness of the two firms’ products, markets, and technologies. Therefore, we expect business proximity to be associated with the M&A matching likelihood.

[6] See “Revolution in Progress: The Networked Economy,” MIT Technology Review Custom, August 27, 2014.


Geographic Proximity

Geographic or spatial proximity refers to the closeness of physical locations, and it has been shown to have a moderating effect in a variety of financial transactions. In the M&A domain, Erel et al. (2012) analyzed cross-border mergers to show that, among other factors, geographic proximity increases the likelihood of mergers between two countries. At the firm level, Chakrabarti and Mitchell (2013) found that chemical manufacturers prefer spatially proximate acquisition targets. The main reasoning behind these findings is that information propagation is subject to spatial distance; geographic proximity brings a higher level of knowledge exchange and hence a lower level of information asymmetry. For the same reason, we predict that geographic proximity is positively associated with the M&A likelihood.

We operationalize geographic proximity by measuring the great-circle distance[7] between two companies’ headquarters. First, we translate the street address of each company’s headquarters into its latitude ($\phi$) and longitude ($\lambda$) coordinates by using the Google Maps API.[8] For companies whose full street address is missing, we use the city center as an approximation. Next, we use the latitude and longitude coordinates to calculate the great-circle distance. Specifically, let $(\phi_i, \lambda_i)$ and $(\phi_j, \lambda_j)$ be the coordinates for companies $i$ and $j$, and $\Delta\lambda$ be the absolute difference in their longitudes. Then the geographic proximity $p_g(i, j)$ between companies $i$ and $j$ is defined as

$$p_g(i, j) = -R \arccos\left(\sin\phi_i \sin\phi_j + \cos\phi_i \cos\phi_j \cos\Delta\lambda\right), \tag{5}$$

where the constant $R$ is the radius of the earth. The negative sign converts distance to proximity.
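Equation (5) can be evaluated directly from coordinates in degrees. The following is a minimal sketch; the function name, the choice of kilometers with a mean earth radius of 6371 km, and the approximate city-center coordinates are assumptions made for illustration and are not taken from our dataset.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean earth radius; the unit (km) is an assumption


def geographic_proximity(lat_i, lon_i, lat_j, lon_j, radius=EARTH_RADIUS_KM):
    """Negative great-circle distance between two points given in degrees."""
    phi_i, phi_j = math.radians(lat_i), math.radians(lat_j)
    delta_lambda = math.radians(abs(lon_i - lon_j))
    cos_angle = (math.sin(phi_i) * math.sin(phi_j)
                 + math.cos(phi_i) * math.cos(phi_j) * math.cos(delta_lambda))
    # Clamp to [-1, 1] so rounding error cannot push acos out of its domain.
    cos_angle = max(-1.0, min(1.0, cos_angle))
    return -radius * math.acos(cos_angle)


# Approximate city-center coordinates, used purely for illustration.
san_francisco = (37.7749, -122.4194)
new_york = (40.7128, -74.0060)
p_sf_ny = geographic_proximity(*san_francisco, *new_york)
```

Larger (less negative) values indicate closer headquarters; two companies at the same location have a proximity of zero, the maximum possible value.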

Social Proximity

Social proximity of two firms is defined according to the social linkage between the individuals associated with the two firms. Personal linkage is an important factor in coordinating transactions and promoting private information exchange between business entities through mutual trust and kinship (e.g., Hochberg et al. 2007, Cohen et al. 2008, Stuart and Yim 2010). We believe two factors about the high-tech industry greatly contribute to the importance of personal linkage’s role in transmitting vital information across companies. First, the U.S. high-tech industry, especially its startup sphere, is characterized by high job mobility, which creates the paths and opportunities for private information flow (Fallick et al. 2006). Second, early-stage digital startups are mostly very small in size; thus, information about them is often scarce outside the teams’ social circles. Moreover, many startups intentionally stay in a “stealth mode” before their products and technologies mature. Hence, we argue that companies with closer social proximity are likely to be aware of each other’s products and intellectual property, which would lead to a higher M&A probability.

[7] http://en.wikipedia.org/wiki/Great-circle_distance

[8] https://developers.google.com/maps/

We operationalize social proximity by using the “people” part of our dataset. For each company, we observe the individuals who are or have previously been affiliated with it as a (co)founder, a board member, or an employee. Let $S_i$ denote this set of individuals for company i. Then we define the social proximity $p_s(i, j)$ between two companies i and j as

$$p_s(i, j) = |S_i \cap S_j|, \tag{6}$$

i.e., the number of people who are identified as having experience in both companies.

Investor Proximity

Investor proximity is defined according to the common angel and venture investors who have funded the firms. In the high-tech industry, startups depend on external investments to support product development before they establish a stable cash flow. Compared with other types of investors, angel and venture investors often play a more active role in management and can be highly influential on strategic decisions (e.g., Amit et al. 1990, Gompers 1995), such as pursuing M&A opportunities. Hence, common early investors of two high-tech companies can form a critical information bridge or even act as an initiator and enabler of collaboration between them, which we predict leads to a higher likelihood of M&A.

Our operationalization of investor proximity is methodologically similar to that of social proximity. Given two companies i and j, their investor proximity $p_f(i, j)$ is defined as

$$p_f(i, j) = |I_i \cap I_j|, \tag{7}$$

where $I_i$ and $I_j$ are the sets of investors who have funded companies i and j, respectively, in any of the funding rounds.
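
Equations (6) and (7) share the same form, the size of a set intersection, so one small helper covers both. The sketch below is illustrative only; the function name and the example identifiers are hypothetical and do not come from the dataset.

```python
def overlap_proximity(members_i, members_j):
    """Count the members shared by two companies, as in equations (6) and (7).

    `members_i` and `members_j` are iterables of identifiers: people for social
    proximity, investors for investor proximity.
    """
    return len(set(members_i) & set(members_j))


# Hypothetical identifiers, for illustration only.
people_a = {"person_1", "person_2", "person_3"}
people_b = {"person_3", "person_4"}
social_ab = overlap_proximity(people_a, people_b)
```

Because set intersection is symmetric, both measures are symmetric in i and j by construction.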


Correlation Analysis

We explore the realizations of the business, geographic, social, and investor proximities in

our CrunchBase dataset and analyze their correlations with the matching of M&A. Note

that we compute all proximity measures using company data collected in April 2013 and

only use the M&A transactions that occurred between April 2013 and April 2015 to avoid

any possible reversal effect.

For each of the four proximity measures, we compare its distributions in two groups of company pairs: (1) M&A-matched company pairs and (2) randomly selected pairs. Figure 5 shows the empirical cumulative distribution functions (CDFs) of the four proximity measures. For the geographic dimension (b), we plot the distance rather than proximity for intuitiveness. Also note that the business and geographic proximity values are continuous, whereas the other two are discrete. In each subfigure, the red line represents the distribution for the group of company pairs defined by M&A transactions and the green line shows that of random pairs. For each proximity measure, we observe a distinction between the two lines, suggesting the existence of dependency between the proximity measures and M&A transactions (the differences in the two lower subplots are visually less distinct because both social and investor proximity measures are discrete and have a large point mass at 0). Next, we appeal to a more rigorous statistical model for further analysis.
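
An empirical CDF of the kind compared in Figure 5 can be computed in a few lines. The sketch below is only illustrative: the function name and the toy sample are hypothetical and are not taken from the dataset.

```python
import bisect


def empirical_cdf(sample):
    """Return a function x -> share of observations in `sample` that are <= x."""
    ordered = sorted(sample)
    n = len(ordered)

    def cdf(x):
        # bisect_right counts how many ordered values are <= x.
        return bisect.bisect_right(ordered, x) / n

    return cdf


# Toy discrete sample with most of its mass at 0, mimicking a count-based measure.
toy_cdf = empirical_cdf([0, 0, 0, 1, 2])
```

With a discrete measure such as the toy sample, the CDF already jumps to a high value at 0, which is why two groups can look visually similar even when their upper tails differ.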

4.2.2 Statistical Model

Using statistical terminology, the matching of a pair of firms is a binary outcome: Either they are part of an M&A transaction or they are not. Thus it could be tempting to use binary response econometric models such as logistic regression for the empirical analysis. However, they are inappropriate in this context due to the relational nature of the data. For example, an M&A transaction between firms i and j and that between i and k (which would be two observations in a logistic regression) are correlated since they involve a common party, i.e., firm i. Hence, the key assumption of independent observations, which underlies the binary response econometric models, is clearly violated. So instead of treating the M&A transactions as independent observations, we model all of them together as a network.

Exponential random graph models (ERGMs), also known as $p^*$ models, have been developed in statistical network analysis over the past three decades and have recently become perhaps the most important and popular class of statistical models of network structure (see Goldenberg et al. 2010 for a survey of models in this field). As far as we are aware, this modeling framework has not been widely used in the information systems literature thus far, so we briefly introduce it here.⁹ We also provide a list of important notations used in this and the following sections in Table 6 in Appendix A for reference.

[Figure 5: Distributions of Proximity: M&A Sample vs. Random Sample. Four panels, each plotting the empirical CDF for the M&A group and the random group: (a) Business, business proximity in [0, 1]; (b) Geographic, geographic distance (km); (c) Social, social proximity (# people); (d) Investor, investor proximity (# investors). Note: In (b), we plot geographic distance rather than geographic proximity.]

A network is a way to represent relational data in the form of a mathematical graph. A graph consists of a set of nodes and a set of edges, where an edge is a directed or undirected link between a pair of nodes. A network of n nodes can also be mathematically represented by an $n \times n$ adjacency matrix $Y$, where each element $Y_{ij}$ can be zero or one, with one indicating the existence of the i-j edge and zero meaning otherwise. Self-edges are disallowed, so $Y_{ii} = 0$ $\forall i$. If edges are undirected (i.e., the i-j edge is not distinguished from the j-i edge), then $Y_{ij} = Y_{ji}$ $\forall i, j$ (i.e., $Y$ is a symmetric matrix).

In applications, the nodes in a network are used to represent economic or social entities, and the edges are used to represent certain relations between the entities. In the present research, the nodes and the edges are high-tech companies and the M&A transactions between them, respectively, and together they form an M&A network. In terms of the adjacency-matrix representation, we define

$$Y_{ij} = \begin{cases} 1, & \text{if } i \text{ and } j \text{ are part of an M\&A transaction,} \\ 0, & \text{otherwise.} \end{cases}$$

With this definition, the resultant M&A network is undirected.¹⁰
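
The adjacency-matrix representation can be illustrated with a short sketch. It assumes firms are indexed 0 to n − 1 and that transactions arrive as index pairs; both conventions, as well as the function name, are hypothetical choices made for the example.

```python
def build_adjacency(n, transactions):
    """Return the symmetric 0/1 adjacency matrix Y of an undirected M&A network.

    `transactions` is an iterable of (i, j) index pairs; the order within a pair
    is ignored because the network is undirected.
    """
    y = [[0] * n for _ in range(n)]
    for i, j in transactions:
        if i == j:
            raise ValueError("self-edges are disallowed")
        y[i][j] = 1
        y[j][i] = 1  # mirror the entry so that Y stays symmetric
    return y


# Toy network with four firms and two transactions.
toy_y = build_adjacency(4, [(0, 1), (2, 1)])
```

Storing a dense matrix is practical only for small examples; for tens of thousands of firms a sparse edge list would be the more natural choice.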

⁹ The only papers using ERGMs by information systems scholars that we are aware of are Skerlavaj et al. (2010) and Faraj and Johnson (2011).

¹⁰ Alternatively, we could define a directed “acquisition network” where the edges are asymmetric. That is, we could distinguish the acquirer and the acquired. For our purpose of assessing the business proximity measure, the distinction is not very important since business proximity is symmetric (as are the other three proximity measures). In addition, our assumption of an undirected M&A network reduces the time needed for computation when we perform the estimations.

ERGMs treat the network graph, or equivalently the adjacency matrix $Y$, as a random outcome. For a network of n nodes, the set of all possible graphs (denoted $\mathcal{Y}$) is finite. The observed network is one realization of the underlying random graph generation process. For some $y \in \mathcal{Y}$, the probability of it occurring is assumed to be

$$P(Y = y) = \frac{1}{\Psi} \exp\Big\{\sum_{k=1}^{K} \theta_k z_k(y)\Big\}, \tag{8}$$

where $K$ is the number of network statistics, $z_k(y)$ is the k-th network statistic, the $\theta_k$’s are parameters, and the denominator $\Psi$ is a normalizing constant.¹¹ The $z_k(y)$ terms capture certain properties of the network and are assumed to affect the likelihood of its occurring. They are analogous to the independent variables in a regression model. One common example of a network statistic is the total number of edges in the network (or a constant multiple of it). $z_k(y)$ can be a function of not only the network graph $y$, but also other exogenous covariates on the nodes. For example, suppose we have a categorical variable on the nodes. Then one such statistic is the number of edges where the two end nodes belong to the same category. To interpret the parameters $\theta_k$, we can rewrite equation (8) in terms of the log-odds of the conditional probability:

$$\operatorname{logit}(P(Y_{ij} = 1 \mid Y_{-ij})) = \sum_{k=1}^{K} \theta_k \Delta z_k, \tag{9}$$

where $Y_{-ij}$ is all but the ij element in the adjacency matrix and $\Delta z_k$ is the change in $z_k$ when $Y_{ij}$ switches from 0 to 1 with $Y_{-ij}$ held fixed. Therefore, the interpretation of $\theta_k$ is: If forming the i-j edge increases $z_k$ by 1 and the other statistics stay constant, then the log-odds of it forming is $\theta_k$.¹² ¹³
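
To make equations (8) and (9) concrete, the sketch below enumerates every undirected graph on a handful of nodes, computes the normalizing constant Ψ by brute force, and evaluates the conditional log-odds of one edge from the change statistics. It assumes just two statistics (the edge count and the number of nodes with degree of at least two) and arbitrary parameter values chosen by the caller; all function names are hypothetical. It is not the estimation procedure behind the results reported here, and the enumeration is feasible only for very small n.

```python
import itertools
import math


def all_undirected_graphs(n):
    """Yield every undirected graph on n nodes as a frozenset of (i, j) edges, i < j."""
    pairs = list(itertools.combinations(range(n), 2))
    for mask in itertools.product([0, 1], repeat=len(pairs)):
        yield frozenset(pair for pair, bit in zip(pairs, mask) if bit)


def stats(graph, n):
    """Illustrative statistics: (number of edges, number of nodes with degree >= 2)."""
    degree = [0] * n
    for i, j in graph:
        degree[i] += 1
        degree[j] += 1
    return (len(graph), sum(1 for d in degree if d >= 2))


def weight(graph, n, theta):
    """Unnormalized weight exp{sum_k theta_k z_k(y)} from equation (8)."""
    return math.exp(sum(t * z for t, z in zip(theta, stats(graph, n))))


def ergm_probability(graph, n, theta):
    """P(Y = y) with the normalizing constant summed over all possible graphs."""
    psi = sum(weight(g, n, theta) for g in all_undirected_graphs(n))
    return weight(graph, n, theta) / psi


def conditional_logit(graph, edge, n, theta):
    """Right-hand side of equation (9); `edge` must be given as (i, j) with i < j."""
    with_edge = stats(graph | {edge}, n)
    without_edge = stats(graph - {edge}, n)
    return sum(t * (a - b) for t, a, b in zip(theta, with_edge, without_edge))
```

For n = 4 there are only 2^6 = 64 graphs, so the log-odds obtained from the joint probabilities can be compared directly with the change-statistic form in equation (9).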

¹¹ $\sum_{y \in \mathcal{Y}} P(Y = y) = 1$, so $\Psi = \sum_{y \in \mathcal{Y}} \exp\{\sum_{k=1}^{K} \theta_k z_k(y)\}$.

¹² It is noteworthy that if the $\Delta z_k$’s do not depend on $Y_{-ij}$ $\forall i, j$, then the edges are independent of each other, and hence the ERGM reduces to a standard logistic regression where each edge is considered an independent observation.

¹³ The above summarizes the basic formulation of ERGMs. Despite its relatively straightforward interpretation and analytic convenience, applications had been limited until just a few years ago due to significant computational burdens. The difficulty lies in evaluating the normalizing constant in equation (8), which involves a sum over a very large sample space even for a moderate n. It is not hard to see that the number of possible graphs is $2^{n(n-1)}$ if the network is directed, and $2^{n(n-1)/2}$ if the network is undirected. Recent advances in computing capability and Monte Carlo estimation techniques (Snijders 2002, Handcock et al. 2008, among others) have made possible the significant growth of ERGM applications in academic fields such as sociology and demography.

4.2.3 Specification

Our ERGM specification includes the statistics ($z_k$’s) for degree distribution, selective mixing, and proximity. We enumerate them and explain their interpretations in the M&A context in the following paragraphs. In the discussion, we translate the generic terms nodes and edges into the more specific terms firms and transactions.

The degree distribution statistics include $t$, the total number of M&A transactions, and $d_2$, the number of firms that are each a party to at least two different transactions. $t$ measures the density of transactions in the M&A network, and its coefficient serves a similar role to the constant term in a regression model. In fact, equation (9) implies that the coefficient of $t$ is the log-odds of a transaction happening if $t$ were the only statistic in the equation. Given the sparsity of the M&A network, we expect $t$’s coefficient to be negative. We also include the $d_2$ statistic because prior research has demonstrated that firms with different relational capabilities (Lorenzoni and Lipparini 1999) participate in significantly different levels of M&A activities. Wang and Zajac (2007) specifically showed that an acquisition is more likely to occur if either of the two parties has prior acquisition experience. Moreover, we have found in the exploratory data analysis in Section 2 that the number of M&A transactions in which a firm is a party follows a power-law distribution. Hence we predict that a transaction in which either party has previously engaged in M&A transactions should have a different likelihood than one in which neither has. The $d_2$ statistic captures exactly this effect, and we expect its coefficient to be positive.

Selective mixing captures the matching of firms according to the combination of their nodal-level characteristics. In other words, these characteristics are first defined at the individual firm level, then combined to the pair level, and lastly aggregated to the corresponding network statistics. In the network analysis literature, one widely adopted form of selective mixing is assortative mixing: Social and economic entities tend to form relationships with others that are “similar.” We include two groups of statistics that reflect an analogous kind of selective mixing in M&As; they are constructed based on two categorical covariates we have on the firms, i.e., state and industry sector. We expect that a pair of firms belonging to the same category are more likely to match than otherwise. Specifically, statistic $h^{sta}_s$ is the number of transactions between two firms whose headquarters are both located in state $s$, where $s$ is one of the 50 states plus the District of Columbia; $h^{cat}_c$ is the number of transactions between two firms that belong to the same industry sector $c$, where $c$ is any of the 19 sectors described in Section 2. We also want to point out that these two groups of statistics can serve as alternative operationalizations of geographic and business proximity.


Lastly, the statistics of most interest are the four proximity measures that capture the matching process based on dyadic-level characteristics. We normalize the four proximity measures to ensure they have the same standard deviation. The four statistics each equal the sum of the corresponding characteristic values over all transactions. We use $p_g$, $p_s$, $p_f$, and $p_b$ to denote the sums of geographic proximity, social proximity, investor proximity, and business proximity, respectively. The rationale for including them has been discussed in Section 4.2.1. In the benchmark specification, we include a linear term for $p_b$. We also estimate an additional specification with a quadratic term of $p_b$ to allow for a curvilinear effect of business proximity on matching.

To sum up, our benchmark model specification can be written:

$$P(Y = y) = \frac{1}{\Psi} \exp\Big\{\theta_t t + \theta_{d2} d_2 + \sum_s \theta^{sta}_s h^{sta}_s + \sum_c \theta^{cat}_c h^{cat}_c + \theta_g p_g + \theta_s p_s + \theta_f p_f + \theta_b p_b\Big\}, \tag{10}$$

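and the corresponding conditional form is

$$\begin{aligned}
\operatorname{logit}(P(Y_{ij} = 1 \mid Y_{-ij})) &= \theta_t \Delta t + \theta_{d2} \Delta d_2 + \sum_s \theta^{sta}_s \Delta h^{sta}_s + \sum_c \theta^{cat}_c \Delta h^{cat}_c + \theta_g \Delta p_g + \theta_s \Delta p_s + \theta_f \Delta p_f + \theta_b \Delta p_b \\
&= \theta_t + \theta_{d2} \Delta d_2 + \sum_s \theta^{sta}_s I(s_i = s_j = s) + \sum_c \theta^{cat}_c I(c_i = c_j = c) + \theta_g p_{g,ij} + \theta_s p_{s,ij} + \theta_f p_{f,ij} + \theta_b p_{b,ij},
\end{aligned} \tag{11}$$

where $I(\cdot)$ is an indicator function, and, for instance, $I(s_i = s_j = s)$ means companies i and j are in the same state $s$ and $I(c_i = c_j = c)$ means i and j belong to the same sector $c$.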

4.2.4 Results

The final dataset contains a total of 24,382 companies. This seemingly moderate number of nodes is actually huge for estimating network models, since the number of potential edges (in our case, unordered pairs) is close to 300 million. Given our current computational capacity, we cannot handle the whole dataset in one estimation procedure. To carry out the analysis, we randomly select 25% of the whole dataset for estimation and repeat this 100 times. Since the estimation for each subsample is an independent, computation-intensive task, we parallelize the estimation job using the Condor system,¹⁴ which is a Big Data platform to support high-throughput computing. For each of the 100 different samples (6,096 companies each), we estimate the model coefficients by using the Markov chain Monte Carlo maximum likelihood estimation procedure outlined in Hunter and Handcock (2006).

¹⁴ http://research.cs.wisc.edu/htcondor/

                            Number of Samples    Number of Samples      Median
                            with Expected Sign   with p-value < 1.0%    Coefficient Value
$\theta_t$     edges        100 (<0)             98                     -14.7837
$\theta_{d2}$  degree > 2   97 (>0)              92                     3.0064

Table 2: Degree Distribution Coefficients (100 Samples)
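
The size of the problem and the subsampling step can be sketched as follows. With n = 24,382 firms, n(n − 1)/2 gives 297,228,771 unordered pairs, which matches the figure of close to 300 million. The function names, the fixed random seed, and the choice to round the subsample size up (24,382 × 0.25 = 6,095.5, reported as 6,096) are assumptions made for illustration.

```python
import math
import random


def count_unordered_pairs(n):
    """Number of potential undirected edges among n firms."""
    return n * (n - 1) // 2


def draw_subsamples(company_ids, fraction=0.25, repeats=100, seed=0):
    """Draw `repeats` independent random subsets, each with `fraction` of the firms."""
    rng = random.Random(seed)  # seeded only to make the sketch reproducible
    size = math.ceil(len(company_ids) * fraction)  # rounding up is an assumption
    return [rng.sample(company_ids, size) for _ in range(repeats)]


subsamples = draw_subsamples(list(range(24382)))
```

Each subsample of 6,096 firms has about 18.6 million potential edges, roughly one sixteenth of the full network, and the 100 estimations are independent of one another, which is what makes them easy to run in parallel.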

We summarize the resulting 100 sets of coefficients for the degree distribution, selective mixing, and proximity statistics in Tables 2, 3, and 4, respectively. For each statistic, we report the number of samples that yield a coefficient with the expected sign, and the number(s) of samples that yield a coefficient that has the expected sign and is statistically significant at one or more selected confidence level(s). Also, to provide an example, we report the full estimation result for one particular sample in Table 7 in Appendix A.

Table 2 reports the coefficients of the degree distribution statistics. Among the 100 samples, all $\theta_t$ coefficients are negative and 97 $\theta_{d2}$ coefficients are positive. At the 99.0% confidence level, 98 $\theta_t$ estimates are significant and 92 $\theta_{d2}$ estimates are significant. Hence the results for the two degree distribution statistics are both consistent with our expectations. As discussed, the negativity of $\theta_t$ indicates only the overall small probability of an M&A transaction occurring; the positive sign of $\theta_{d2}$ means that an M&A transaction in which firms with some M&A experience are involved is more likely to occur.

In part (a) of Table 3, we find most state-based selective mixing statistics are dropped. This is due to the sparsity of M&A transactions during the data collection period: the likelihood that two same-state companies merged in an individual sample is low for most states. Indeed, the states that yield the most coefficients, namely CA, NY, and MA, are where well-known high-tech hubs are located. In part (b) of Table 3, we observe that for almost all category-based selective mixing statistics, an overwhelmingly large proportion of the coefficient estimates are positive, but their statistical significance, when using the 99.0% confidence level, is not strongly supported. One possible explanation of their statistical insignificance is the inclusion of our business proximity measure. As mentioned, the selective mixing statistics based on industry sector can also be thought of as alternative, but coarser, operationalizations of business proximity. Therefore, when including both the selective mixing statistics and our business proximity measure in the ERGM specification, the effect of the selective mixing statistics is superseded by the effect of the more refined proximity measure, causing the model to produce insignificant coefficients for the selective mixing statistics. To test the validity of this explanation, we also estimate another ERGM specification, which excludes the business proximity measure and for which we report the corresponding results for the selective mixing coefficients in Table 8 in Appendix A. Comparing the last columns of Tables 3 and 8, we find that when using the specification without the proposed business proximity, a much higher proportion of the samples produces statistically significant (at the 1.0% significance level) estimates for the selective mixing coefficients. This is thus supporting evidence for the superiority of the proximity measures we use: They are correlated with the alternative, coarser measures, but statistically more powerful in explaining the matching in M&As.

Each entry below is a number of samples: with coefficients, with coefficient > 0, and with p-value < 1.0%.

State  With coef.  Coef. > 0  p < 1.0%    State  With coef.  Coef. > 0  p < 1.0%
AK     0           -          -           MT     0           -          -
AL     0           -          -           NC     0           -          -
AR     0           -          -           ND     0           -          -
AZ     0           -          -           NE     0           -          -
CA     100         94         43          NH     5           5          3
CO     7           7          7           NJ     4           4          3
CT     0           -          -           NM     0           -          -
DC     5           5          4           NV     0           -          -
DE     0           -          -           NY     61          61         22
FL     0           -          -           OH     0           -          -
GA     7           7          6           OK     0           -          -
HI     0           -          -           OR     0           -          -
IA     0           -          -           PA     0           -          -
ID     0           -          -           RI     0           -          -
IL     5           5          5           SC     0           -          -
IN     0           -          -           SD     0           -          -
KS     0           -          -           TN     0           -          -
KY     0           -          -           TX     19          19         13
LA     0           -          -           UT     0           -          -
MA     28          28         16          VA     0           -          -
MD     6           6          5           VT     0           -          -
ME     0           -          -           WA     11          11         6
MI     0           -          -           WI     0           -          -
MN     0           -          -           WV     0           -          -
MO     0           -          -           WY     0           -          -
MS     0           -          -

(a) State

Category       With coef.  Coef. > 0  p < 1.0%    Category       With coef.  Coef. > 0  p < 1.0%
advertising    26          25         7           mobile         28          26         11
biotech        38          37         5           net hosting    7           6          6
cleantech      11          11         6           other          0           -          -
consulting     11          10         3           pub rel        8           8          8
ecommerce      13          13         3           search         0           -          -
education      0           -          -           security       0           -          -
enterprise     22          22         20          semiconductor  15          15         5
games video    26          25         11          software       87          78         37
hardware       32          31         25          web            76          66         21
legal          0           -          -

(b) Category

Table 3: Selective Mixing Coefficients (100 Samples)

                         Number of Samples with                                 Median
                         Coef. > 0   p < 5.0%   p < 1.0%   p < 0.1%             Estimate
$\theta_g$  Geographic   46          8          5          3                    -0.0173
$\theta_s$  Social       79          73         70         69                   0.1460
$\theta_f$  Investor     62          52         51         46                   0.0689
$\theta_b$  Business     100         92         86         79                   0.5315

Table 4: Proximity Coefficients (100 Samples)

In Table 4 we report the estimation results for the four proximity measures. First and foremost, the new business proximity measure is found to be strongly associated with the matching likelihood: All the samples produce positive coefficients, and among them 79 estimates are significant at the 99.9% confidence level. Furthermore, when comparing the proximity measures across the rows, we observe that three of the four proximity measures (all except geographic, $\theta_g$) are positively associated with the likelihood of matching in M&As, and in particular, our newly developed business proximity measure also outperforms the other three in terms of statistical significance. Moreover, since we normalize the proximity measures, we can evaluate their economic significance by comparing the magnitude of the coefficients. Using the median estimate from the 100 samples (last column of Table 4), we find that the business proximity measure has the largest effect on the matching likelihood: A 1-standard-deviation increase in business proximity has the same effect as a 3.64-standard-deviation increase in social proximity, or a 6.89-standard-deviation increase in investor proximity. These results thus support the value of business proximity in modeling M&As. Interestingly, in our dataset, geographic proximity appears to play an insignificant role in identifying high-tech firms’ matching in M&As.

The estimation result of equation (10) shows business proximity is positively associated with the M&A matching likelihood. However, a linear structure might not best capture the true relationship between business proximity and M&A matching, since the economic benefits of merging two firms’ businesses may result from not only their similarity but also their complementarity (e.g., Chung et al. 2000, Sears and Hoetker 2013). The value of an M&A could decrease in cases where two firms’ businesses are too similar but lack complementarity, so that little synergy can be achieved through the merger. We test this hypothesis by estimating a specification that includes a squared term of business proximity, $\theta_{b2} p_{b2} = \theta_{b2} \sum p_{b,ij}^2$, and that is otherwise the same as equation (10). We expect $\theta_{b2}$ to be negative and $\theta_b$ to still be positive. The estimation results on the proximity measures (of the 100 samples) are reported in Table 5. We do observe that for a large number of the samples business proximity is estimated to have a curvilinear effect on the M&A matching likelihood. Specifically, for 86 out of the 100 samples, the coefficient of the squared term is negative and that of the linear term is positive, suggesting the matching likelihood first increases with business proximity and then decreases after a certain point. This evidence is thus consistent with our expectation. Meanwhile, we note that the evidence for the statistical significance of the squared term is not as strong as that for the linear term.
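
The point after which the matching likelihood starts to decrease follows from a short derivation. Holding all other terms fixed, the contribution of business proximity to the conditional log-odds is a quadratic in $p_{b,ij}$; under the expected signs ($\theta_b > 0$, $\theta_{b2} < 0$) it is concave, so the first-order condition gives its maximum. The symbols $f$ and $p_b^*$ are introduced here only for illustration, and $p_b^*$ is expressed in the normalized units of the proximity measure.

```latex
f(p_{b,ij}) = \theta_b\, p_{b,ij} + \theta_{b2}\, p_{b,ij}^2,
\qquad
f'(p_{b,ij}) = \theta_b + 2\,\theta_{b2}\, p_{b,ij} = 0
\;\Longrightarrow\;
p_b^* = -\frac{\theta_b}{2\,\theta_{b2}} .
```

Whether the decreasing branch is reached in practice depends on whether $p_b^*$ lies inside the observed range of business proximity values.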

                            Number of Samples with
                            Coefficient
                            Expected Sign   p < 5.0%   p < 1.0%   p < 0.1%
$\theta_g$     Geographic   47 (>0)         6          4          2
$\theta_s$     Social       85 (>0)         77         77         73
$\theta_f$     Investor     67 (>0)         56         52         50
$\theta_b$     Business     100 (>0)        86         76         61
$\theta_{b2}$  Business²    86 (<0)         42         28         13

Table 5: Proximity Coefficients (100 Samples): Equation (10) plus $\theta_{b2} p_{b2}$

5 Scaling up to Big Data: A System Prototype for Navigating the Networked Startup World

During the recent boom of the high-tech industry, the media have often been full of reports about high-profile M&As involving startups. It is well known that M&As are an important alternative to IPOs as an exit option for high-tech entrepreneurs and early investors. Meanwhile, industry giants spend tens of billions of dollars each year acquiring smaller firms for market entrance, strategic intellectual property, and talented employees.¹⁵ Venture capitalists also arrange mergers between their partially owned startups in order to consolidate resources and reduce competitive pressure.¹⁶ The fierce competition in both demand and supply instantaneously creates the problem of matching between acquirers and targets, since the value (or disvalue) of an M&A critically depends on the synergy of the companies’ products, technologies, and markets. More broadly, the challenge lies in the search for startups. While almost everyone knows who the top competitors are in a particular space, it is a difficult and time-consuming task to find the small companies in the vast startup universe with the right products or technology. The problem can only become increasingly challenging over time given the speed of technological innovation. Solving this search problem will be beneficial not only for M&A executives, but also for entrepreneurs to position their products and identify competitors, for venture capitalists to monitor niche markets, and for high-tech analysts to examine the industry trend. Observers have noted that data analytics can complement executives’ industry knowledge in alleviating many of the problems and transform the way M&A matching and startup search have been done; it is reported that many large M&A players have already been investing heavily in analytics for identifying win-win matches by rendering the decision-making processes more “data-driven.”¹⁷

¹⁵ See “Internet Mergers and Takeovers: Platforms upon Platforms,” The Economist, May 25, 2013.

¹⁶ An example is the acquisition of Summize by Twitter in 2008. See “Finding A Perfect Match,” Twitter Blog, https://blog.twitter.com/2008/finding-perfect-match and Nick Bilton’s 2013 book Hatching Twitter: A True Story of Money, Power, Friendship, and Betrayal.

Along these lines, our empirical analysis indicates the potential practical value of the proposed business proximity measure as an important metric in the analytics of M&A matching and as a search tool for navigating the networked startup world. To show the practical application in a concrete way, we build a prototype of a cloud-based information system that allows entrepreneurs, managers, and analysts to explore the competitive landscape of the U.S. high-tech industry (Whinston and Geng 2004). By incorporating business proximity and making it explicitly available to users in the search and navigation tools, the platform expedites the process of startup search and competition analysis and facilitates efficient discovery of new niche markets. Built upon the latest Big Data and cloud technologies, the system consists of two main components, as shown in Figure 6: The back end collects raw data from the data sources, integrates and cleans the data, computes business proximity, and stores the processed data in local databases. The front end is a web application that enables users to explore the data stored in a cloud-based database.

5.1 Back-End System

The back-end system comprises two modules and two databases. The first module is the data collector, written in Python, which retrieves data from our data sources, including CrunchBase. The collector runs periodically to ensure our data is up to date. The raw data is stored in a MongoDB¹⁸ database, which is a document-oriented NoSQL database that stores records in JSON format. We do not use a relational database because the structure of the company data may change over time, so a traditional relational database, which requires a predefined schema, is not the best technology for our system. Another feature of MongoDB is that it supports scalability: As the data size grows, load balancing can be performed using the sharding mechanism. This is a basis for the cloud-based information system.

¹⁷ See “Google Ventures Stresses Science of Deal, Not Art of the Deal,” New York Times, June 23, 2013, and “One of The Richest Men in The World Is Backing a Startup That Ranks Wall Street’s Hedge Funds,” Business Insider, http://read.bi/1KqhHzr.

¹⁸ https://www.mongodb.org.

[Figure 6: Prototype Architecture and Components. Back end: industry data (Crunchbase, etc.), data collector (Python), raw data (MongoDB), topic model builder (Scala), processed data (MongoDB), company meta info (JSON), business proximity (JSON). Front end: cloud DB (Google Datastore, Google Cloud Storage), API engine (Google App Engine), webpages (HTML/CSS, JavaScript), and users issuing search and company info requests.]

The second module, the topic model builder, constructs and estimates topic models using the textual company descriptions extracted from the raw data in MongoDB. To run the LDA topic modeling algorithm, we use a Scala implementation in the Stanford Topic Modeling Toolbox.¹⁹ The topic model builder produces two sets of results: First, underlying business topics of the whole industry are generated, where each topic is essentially a set of related keywords that represent the topic. Second, each company’s profile is transformed into a topic vector, which is stored in the database of processed data in MongoDB.

We then compute business proximity to identify the top N nearest neighbors of each firm. A naive, brute-force approach that calculates the business proximity values for all pairs of companies can be used to find the nearest neighbors. However, as we continuously collect data and the dataset grows, the number of company pairs increases quadratically, to a point where the exhaustive computation is impractical for a real-world system. Hence, we propose an algorithm that reduces the required computation while providing a reasonable approximation in finding nearest neighbors. The intuition behind the algorithm is that a pair of companies is likely to have a high proximity value only if they share high weights on some common topics in their topic distributions. Hence, we maintain a bucket list for each topic that keeps track of the companies with a high weight on that specific topic. Then we only compute the business proximity values for company pairs that co-occur in at least one of the bucket lists, because those pairs that do not co-occur in any of the bucket lists are unlikely to be very close to each other. The pseudocode is given in Algorithm 1.

¹⁹ http://www-nlp.stanford.edu/software/tmt/tmt-0.4/.

input : set of companies C, companies’ topic distributions T, number of topics K, threshold θ
output: N nearest neighbors for each company

for each topic k ∈ {1, ..., K} do
    B_k ← ∅
end
for each company c ∈ C do
    for each topic k ∈ {1, ..., K} do
        if T[c][k] >= θ then
            B_k ← B_k ∪ {c}
        end
    end
end
for each company c ∈ C do
    Nset ← ∅
    for each topic k ∈ {1, ..., K} do
        if T[c][k] >= θ then
            Nset ← Nset ∪ B_k
        end
    end
    for each company c′ ∈ Nset do
        bizprox[c′] ← cosinesimilarity(T[c], T[c′])
    end
    Find N nearest neighbors by sorting bizprox list
end

Algorithm 1: Faster Nearest-Neighbor Computation
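
A runnable sketch of Algorithm 1 is given below in plain Python. It follows the pseudocode closely but makes a few assumptions that the pseudocode leaves open: topic distributions are passed as a dictionary from company to a list of K weights, a company is excluded from its own neighbor list, and ties in similarity are broken by company name. The function names are hypothetical.

```python
import math
from collections import defaultdict


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length weight vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def nearest_neighbors(topics, n_neighbors, threshold):
    """Approximate top-N neighbors per company using per-topic buckets.

    topics: dict mapping company -> list of K topic weights.
    """
    # Step 1: bucket B_k holds the companies whose weight on topic k reaches the threshold.
    buckets = defaultdict(set)
    for company, weights in topics.items():
        for k, w in enumerate(weights):
            if w >= threshold:
                buckets[k].add(company)

    result = {}
    for company, weights in topics.items():
        # Step 2: candidates are companies sharing at least one bucket with `company`.
        candidates = set()
        for k, w in enumerate(weights):
            if w >= threshold:
                candidates |= buckets[k]
        candidates.discard(company)  # assumption: a company is not its own neighbor

        # Step 3: score only the candidates and keep the N most similar.
        scored = [(cosine_similarity(weights, topics[other]), other) for other in candidates]
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        result[company] = [other for _, other in scored[:n_neighbors]]
    return result


# Toy topic distributions over K = 3 topics, for illustration only.
toy_topics = {
    "a": [0.9, 0.1, 0.0],
    "b": [0.8, 0.2, 0.0],
    "c": [0.0, 0.1, 0.9],
    "d": [0.1, 0.0, 0.9],
}
```

With a threshold of 0.0 and non-negative weights, every company lands in every bucket, so the sketch degenerates to the brute-force comparison of all pairs; raising the threshold shrinks the candidate sets at the cost of possibly missing some true neighbors.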

To measure the speed of business proximity computation and the accuracy of nearest-neighbor identification, we run experiments using the dataset described in Section 2. The results are reported in Figure 7. In terms of computation speed, we count the number of business proximity values calculated. We use this metric instead of the actual computation time to avoid potential environmental biases. The brute-force algorithm, which computes all pairwise proximity values, requires 341 million calculations. By contrast, our algorithm with threshold 0.00 needs only 123 million, which is 36% of the calculations required by the naive approach. As we increase the threshold to 0.30, only 3% of the calculations are needed. Faster computation comes at a modest cost in accuracy. We compare the N nearest neighbors identified by the algorithm under different thresholds with those identified by the brute-force algorithm, and vary N to be 10, 20, 30, 50, and 100. As expected, the algorithm provides accurate results for the closest neighbors, and performance degrades gracefully for the not-so-near neighbors. We note that the algorithm with threshold 0.00 provides 100% accurate neighborhood sets compared with the brute-force algorithm. Even with threshold 0.30, the algorithm achieves 92.5% accuracy in identifying the 50 nearest neighbors.

[Figure 7, panel (a) Calculation speed: number of comparisons (million) for brute-force and for the fast algorithm with th = 0.0, 0.1, 0.15, 0.20, 0.30. Panel (b) Accuracy of nearest neighbors: accuracy (%) at th = 0.00, 0.10, 0.15, 0.20, 0.30 for top10, top20, top30, top50, and top100.]

Figure 7: Performance Measures of Algorithm 1
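The two performance measures can be illustrated with a small, self-contained experiment. The sketch below rests on several assumptions that are not taken from our experiments: the data are synthetic topic distributions drawn from a symmetric Dirichlet, a comparison is counted once per distinct unordered company pair, and accuracy is defined as the share of brute-force top-N neighbors that the bucketed method also returns. All function names are hypothetical.

```python
import math
import random


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))


def top_n(dist, c, candidates, n):
    """Rank candidates by similarity to c; ties broken by id."""
    return sorted(candidates, key=lambda o: (-cosine(dist[c], dist[o]), o))[:n]


def brute_force(dist, n):
    ids = list(dist)
    comparisons = len(ids) * (len(ids) - 1) // 2  # all unordered pairs
    neighbors = {c: top_n(dist, c, [o for o in ids if o != c], n) for c in ids}
    return neighbors, comparisons


def bucketed(dist, theta, n):
    k_topics = len(next(iter(dist.values())))
    buckets = [{c for c, w in dist.items() if w[k] >= theta}
               for k in range(k_topics)]
    pairs, neighbors = set(), {}
    for c, w in dist.items():
        cand = set().union(*(buckets[k] for k in range(k_topics)
                             if w[k] >= theta)) - {c}
        pairs.update(frozenset((c, o)) for o in cand)  # distinct pairs scored
        neighbors[c] = top_n(dist, c, cand, n)
    return neighbors, len(pairs)


def accuracy(approx, exact):
    """Share of exact top-N neighbors recovered by the approximation."""
    hits = sum(len(set(approx[c]) & set(exact[c])) for c in exact)
    return hits / sum(len(exact[c]) for c in exact)


def random_distribution(k, alpha=0.2):
    """Draw from a symmetric Dirichlet via normalized gamma variates."""
    draws = [random.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(draws)
    return [d / total for d in draws]


random.seed(0)
data = {i: random_distribution(10) for i in range(200)}  # synthetic companies
exact, n_exact = brute_force(data, 10)
approx, n_approx = bucketed(data, 0.3, 10)
share = n_approx / n_exact   # fraction of brute-force comparisons needed
acc = accuracy(approx, exact)
print(f"comparisons: {share:.1%} of brute force, top-10 accuracy: {acc:.1%}")
```

The figures printed by this sketch depend on the synthetic data and are not comparable to the results in Figure 7; the sketch only shows how the two quantities can be computed side by side.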

[Figure 8 panels: (a) Search companies and topics of interest; (b) Search results; (c) Focal company with its competitors based on business proximity]

Figure 8: Prototype Front End: User Interface Screenshots

5.2 Front-End System

The front end is a cloud-based web application, available at http://146.6.99.242/bizprox, that lets users explore company information using the proposed business proximity. Figure 8 shows screenshots of the user interface. Given a keyword from the user, the search results show the topics and companies associated with the keyword. By selecting a topic, the user can interpret it through 20 (additional) relevant keywords and the significance of each. If a company is selected from the search results, the interface provides (1) the basic information about the company along with its topic distribution, and (2) a list of nearest neighbors to the focal company. The basic information of a company includes the founding date, founders, headquarters, and a short business description. With the topic distribution, users can recognize various business aspects of the company. The nearest neighbors are computed using Algorithm 1 and are sorted by business proximity.

From the system architecture perspective, the front end is a cloud-based system leveraging platform-as-a-service (PaaS). The static webpages in HTML/CSS are hosted by our local Apache Web Server, which handles user inputs such as keyword searches and page navigation. Each webpage is instrumented with Google Analytics²⁰ so that web analytics can be used to understand user engagement and potentially optimize the service. An API Engine, deployed in Google App Engine,²¹ receives queries from the HTML pages and returns relevant data from the cloud database. The cloud database consists of two components: first, the dynamic data is managed in Google Cloud Datastore,²² a cloud-based NoSQL database system; second, the static data is stored in Google Cloud Storage,²³ which provides a cost-effective content distribution service for static information. The cloud-based approach gives two main benefits: scalability (e.g., the system scales automatically according to user demand and data size) and availability (e.g., almost no downtime due to replication).

20 http://www.google.com/analytics/.

21 https://developers.google.com/appengine/.

22 https://developers.google.com/datastore/.

23 https://cloud.google.com/products/cloud-storage/.

6 Discussion and Conclusion

The advent of the digital economy is creating a business environment characterized by unprecedented technological complexity and connectedness between firms and people. With the goal of making the business landscape easier to understand and depict, we set out in this paper to develop a general, data-analytic framework for quantifying firms’ positions in the spaces of product, market, and technology and for measuring firms’ dyadic business proximity. Using a unique dataset of the U.S. high-tech industry as an example, we detailed the procedure and system for using topic models to analyze publicly available textual descriptions of company business and for constructing proximity from the structured results. We then validated the new measure by relating it to the simple category-based classification and analyzing its statistical relationships with firm interactions including M&A, investment, and job mobility. In a more rigorous statistical analysis, we also demonstrated the new measure’s usefulness in modeling M&A matching, where we constructed a network of high-tech companies and documented empirical evidence on the nuanced relationship between matching and business proximity. Moreover, to show the practical value of the proposed data-analytic framework, we deployed various Big Data and analytics technologies to build a prototype of a cloud-based information system for industry intelligence.

This research sheds light on the value of leveraging data science techniques in the development of novel measures (Einav and Levin 2013) for large-scale business analytics. Our data-driven, analytics-based approach requires no expert preprocessing, provides finer granularity (compared with the SIC- or NAICS-based methods), is more comprehensive in quantifying firms’ positions in the spaces of product, market, and technology (compared with the patent- or customer-based methods), and can be better automated and scaled to Big Data (compared with all previous methods). When built into an automated system as in Section 5, the method is also more responsive in capturing industry trends than any human-annotation-based approach. Substantively, the comprehensive, granular business proximity measure is an enabler in the M&A application, revealing the nuanced relationship between the transaction likelihood and the firms’ business similarity and complementarity. The result shows that economically meaningful information can be extracted from unstructured data through careful analysis and large-scale computation. Thus, our methodology greatly complements the toolkit for measuring business proximity, and it is especially useful when researchers or analysts are studying either an already narrowly focused industry or a highly dynamic industry, or when the firms under study are small and privately held (e.g., startups) so that industry classification is largely unavailable. Meanwhile, we wish to stress that our measure is not intended as a replacement for the existing methods in all scenarios. For instance, when the research question is at a relatively macro level, only firms’ broad industry membership is important, and all firms’ SIC or NAICS codes are available, the researcher should not hesitate to use the SIC- or NAICS-based methods.

More broadly, the data-analytic framework used in the study presents a general approach for understanding industry structure, and it demonstrates the potential transformation that Big Data analytics can bring to industry intelligence practice and to strategy and industrial organization research. For analytics-minded managers, firms’ relatedness in business is a very important metric for identifying potential partners, competitors, and alliance or acquisition targets. The saying in management goes, “if you cannot measure it, you cannot manage it.” As shown in our study, the proposed proximity measure provides finer granularity and proves effective in high-tech M&A analytics. More importantly, as a general approach to organizing unstructured data for industry intelligence, the usefulness of the proposed framework is not limited to measuring proximity and analyzing M&As. Rather, as argued and demonstrated in Section 5, it provides a handy tool for entrepreneurs, venture capitalists, and analysts to navigate the constantly changing landscape of the networked business environment, which is much needed in light of the rapid evolution of technology and the increasing complexity of the digital economy. Our prototype can be the first step in building a Business Intelligence platform to fully realize the framework’s practical potential. In response to this transformation, even outside the domain of industry intelligence, organizations need to invest in IT infrastructure and capability to better organize and analyze unstructured data, as the ability to distill value from unstructured data will be an important competitive advantage in the digital economy. Our prototype is also an example of organizing unstructured data and integrating state-of-the-art storage and computation technologies to build a decision support system. For business and economics scholars, our method can perhaps be adapted to serve as an alternative approach to defining market boundaries or identifying industry rivals, which is a crucial step in empirical research on industrial organization. Additionally, future research can explore the possibility of combining topic modeling results with clustering algorithms to build an industry hierarchy, which could be a data-driven alternative to the expert-labeled systems currently in use. A data-driven approach is especially desirable for industries such as high-tech because the underlying technology changes rapidly and a manually labeled industry classification system can become stale.

This research also advances the understanding and analysis of M&As. We documented systematic evidence on the relationship between matching and firm proximity in the high-tech industry, complementing the previous empirical M&A literature, which largely focused on larger, public corporations (Betton et al. 2008). The proposed new measure also enabled us to test the non-monotone relationship between business proximity and M&A matching. More importantly, we constructed a network structure using firm proximity measured in four different dimensions and adopted the statistical modeling framework of ERGMs to accommodate the relational nature of the matching data. The network/graph approach has been fruitfully applied to analyzing a variety of economic exchanges and markets (as surveyed in Easley and Kleinberg 2010, Jackson 2010). However, whereas the literature is abundant with studies on how networks affect the interaction and performance of firms, research using rigorous statistical methods to analyze the structure of inter-firm networks is relatively underdeveloped. To our knowledge, the M&A application in the study is the first to use a statistical network model to analyze relational transactions among companies. We believe statistical network models are currently underutilized by management scholars in their empirical research on inter-organizational linkage, even though relational data is not uncommon in studies of many important questions. For example, strategic alliances, investments, and patent license agreements among companies can all be visualized and carefully analyzed as graphs/networks. We predict that with the growing availability of network datasets and the ongoing development of large-scale computing technologies, the value of statistical network models in management research will be increasingly recognized.

In closing, we wish to point out some additional caveats and limitations of the research. First, since SIC- or NAICS-based industry classification or patent data is unavailable for most companies in CrunchBase, we could not directly compare the proposed business proximity measure with the measure based on industry hierarchy (Wang and Zajac 2007) or the measure based on patent citation (Stuart 1998) in terms of their explanatory power for M&A matching. Though this is less crucial for this paper, since our goal is not to search for the best empirical model for M&As, it could be an interesting research project to find a suitable dataset where all the new and traditional measures could be operationalized and compared directly. Second, for our data-analytic approach, the number of topics in LDA is a free parameter for users to choose. When performing topic modeling on the CrunchBase descriptions, we selected a finite set of values for this parameter, which is sufficient for our purpose of illustrating the general methodology. Nevertheless, from a practical point of view, it is worth investigating whether an “optimal” number of topics exists, and if so, how it should be determined. Third, in the machine learning literature, there are several extensions to the LDA algorithm (e.g., Teh et al. 2006, Inouye et al. 2014). Future research could investigate how these extensions could benefit the understanding of company businesses through text analysis. Fourth, some important company-level characteristics, notably company size and revenue, are unavailable in our dataset, which inevitably limited our ability to extend our empirical application on M&A matching. For instance, had we observed company size, we would have been able to study the moderating effect of company size on the relationship between business proximity and the matching likelihood. Lastly, the model we employed in the empirical analysis is a static network model. To deepen our understanding of the dependence structure of M&A transactions, future research could examine the evolution of the M&A network using dynamic network models.

References

[1] Adomavicius, G. and A. Tuzhilin 2005, “Toward the Next Generation of Recommender Systems: A Survey of the State-of-the-Art and Possible Extensions,” IEEE Transactions on Knowledge and Data Engineering, 17(6), 734-749.

[2] Ahuja, G. and R. Katila 2001, “Technological Acquisitions and the Innovation Performance of Acquiring Firms: A Longitudinal Study,” Strategic Management Journal, 22(3), 197-220.

[3] Amit, R., L. Glosten, and E. Muller 1990, “Entrepreneurial Ability, Venture Investments, and Risk Sharing,” Management Science, 36(10), 1233-1246.

[4] Betton, S., B.E. Eckbo, and K.S. Thorburn 2008, “Corporate Takeovers,” Chapter 15 in B.E. Eckbo ed., Handbook of Corporate Finance: Empirical Corporate Finance ed. 1, Vol. 2, 291-430. Elsevier/North-Holland, 2008.

[5] Blei, D.M. 2012, “Introduction to Probabilistic Topic Models,” Communications of the ACM, 55(4), 77-84.

[6] Blei, D.M., A.Y. Ng, and M.I. Jordan 2003, “Latent Dirichlet Allocation,” Journal of Machine Learning Research, 3, 993-1022.

[7] Chakrabarti, A. and W. Mitchell 2013, “The Persistent Effect of Geographic Distance in Acquisition Target Selection,” Organization Science, 24(6), 1805-1826.

[8] Chen, H., R.H.L. Chiang, and V.C. Storey 2012, “Business Intelligence and Analytics: From Big Data to Big Impact,” MIS Quarterly, 36(4), 1165-1188.

[9] Chiang, R.H.L., P. Goes, and E.A. Stohr 2012, “Business Intelligence and Analytics Education and Program Development: A Unique Opportunity for the Information Systems Discipline,” ACM Transactions on Management Information Systems, 3(3), 12:1–12:13.

[10] Chung, S., H. Singh, and K. Lee 2000, “Complementarity, Status Similarity and Social Capital as Drivers of Alliance Formation,” Strategic Management Journal, 21(1), 1-22.

[11] Cohen, L., A. Frazzini, and C.J. Malloy 2008, “The Small World of Investing: Board Connections and Mutual Fund Returns,” Journal of Political Economy, 116(5), 951-979.

[12] Easley, D. and J. Kleinberg 2010, Networks, Crowds, and Markets: Reasoning About a Highly Connected World. Cambridge University Press, 2010.

[13] Einav, L. and J.D. Levin 2013, “The Data Revolution and Economic Analysis,” NBER Working Paper 19035, May 2013.

[14] Erel, I., R.C. Liao, and M.S. Weisbach 2012, “Determinants of Cross-Border Mergers and Acquisitions,” Journal of Finance, 67(3), 1045-1082.

[15] Fallick, B., C.A. Fleischman, and J.B. Rebitzer 2006, “Job-Hopping in Silicon Valley: Some Evidence concerning the Microfoundations of a High-Technology Cluster,” The Review of Economics and Statistics, 88(3), 472-481.

[16] Faraj, S. and S.L. Johnson 2011, “Network Exchange Patterns in Online Communities,” Organization Science, 22(6), 1464-1480.

[17] Ghose, A., P.G. Ipeirotis, and B. Li 2012, “Designing Ranking Systems for Hotels on Travel Search Engines by Mining User-Generated and Crowd-Sourced Content,” Marketing Science, 31(3), 493-520.

[18] Goldenberg, A., A.X. Zheng, S.E. Fienberg, and E.M. Airoldi 2010, “A Survey of Statistical Network Models,” Foundations and Trends in Machine Learning, 2(2), 129-233.

[19] Gompers, P.A. 1995, “Optimal Investment, Monitoring, and the Staging of Venture Capital,” Journal of Finance, 50(5), 1461-1489.

[20] Griffiths, T.L. and M. Steyvers 2004, “Finding Scientific Topics,” Proceedings of the National Academy of Sciences, 101, 5228-5235.

[21] Handcock, M.S., D.R. Hunter, C.T. Butts, S.M. Goodreau, and M. Morris 2008, “statnet: Software Tools for the Representation, Visualization, Analysis and Simulation of Network Data,” Journal of Statistical Software, 24, 1-11.

[22] Hochberg, Y., A. Ljungqvist, and Y. Lu 2007, “Whom You Know Matters: Venture Capital Networks and Investment Performance,” Journal of Finance, 62(1), 251-301.

[23] Hunter, D.R. and M.S. Handcock 2006, “Inference in Curved Exponential Family Models for Networks,” Journal of Computational and Graphical Statistics, 15(3), 565-583.

[24] Inouye, D., P. Ravikumar, and I. Dhillon 2014, “Admixture of Poisson MRFs: A Topic Model with Word Dependencies,” Proceedings of International Conference on Machine Learning, 31, 683–691.

[25] Jackson, M.O. 2010, Social and Economic Networks. Princeton University Press, 2010.

[26] Lorenzoni, G. and A. Lipparini 1999, “The Leveraging of Interfirm Relationships as A Distinctive Organizational Capability: A Longitudinal Study,” Strategic Management Journal, 20(4), 317-338.

[27] Mikkelson, W.H. and R.S. Ruback 1985, “An Empirical Analysis of The Interfirm Equity Investment Process,” Journal of Financial Economics, 14(4), 523-553.

[28] Mitsuhashi, H. and H.R. Greve 2009, “A Matching Theory of Alliance Formation and Organizational Success: Complementarity and Compatibility,” Academy of Management Journal, 52(5), 975-995.

[29] Moscarini, G. and K. Thomsson 2007, “Occupational and Job Mobility in the US,” Scandinavian Journal of Economics, 109(4), 807-836.

[30] Mowery, D.C., J.E. Oxley, and B.S. Silverman 1998, “Technological Overlap and Interfirm Cooperation: Implications for The Resource-Based View of The Firm,” Research Policy, 27(5), 507-523.

[31] Rhodes-Kropf, M. and D.T. Robinson 2008, “The Market for Mergers and the Boundaries of the Firm,” Journal of Finance, 63(3), 1161-1211.

[32] Sears, J. and G. Hoetker 2014, “Technological Overlap, Technological Capabilities, and Resource Recombination in Technological Acquisitions,” Strategic Management Journal, 35(1), 48-67.

[33] Shi, Z., H. Rui, and A.B. Whinston 2014, “Content Sharing in A Social Broadcasting Environment: Evidence from Twitter,” MIS Quarterly, 38(1), 123-142.

[34] Shmueli, G. and O.R. Koppius 2011, “Predictive Analytics in Information Systems Research,” MIS Quarterly, 35(3), 553-572.

[35] Skerlavaj, M., V. Dimovski, and K.C. Desouza 2010, “Patterns and Structures of Intra-Organizational Learning Networks within a Knowledge-Intensive Organization,” Journal of Information Technology, 25(2), 189-204.

[36] Snijders, T.A.B. 2002, “Markov Chain Monte Carlo Estimation of Exponential Random Graph Models,” Journal of Social Structure, 3(2), 1-40.

[37] Stuart, T.E. 1998, “Network Positions and Propensities to Collaborate: An Investigation of Strategic Alliance Formation in a High-Technology Industry,” Administrative Science Quarterly, 43(3), 668-698.

[38] Stuart, T.E. and S. Yim 2010, “Board Interlocks and The Propensity to Be Targeted in Private Equity Transactions,” Journal of Financial Economics, 97(1), 174-189.

[39] Teh, Y.W., M.I. Jordan, M.J. Beal, and D.M. Blei 2006, “Hierarchical Dirichlet Processes,” Journal of the American Statistical Association, 101, 1566-1581.

[40] Wang, L. and E.J. Zajac 2007, “Alliance or Acquisition? A Dyadic Perspective on Interfirm Resource Combinations,” Strategic Management Journal, 28(13), 1291-1317.

[41] Whinston, A.B. and X. Geng 2004, “Operationalizing The Essential Role of The Information Technology Artifact in Information Systems Research: Gray Area, Pitfalls, and The Importance of Strategic Ambiguity,” MIS Quarterly, 28(2), 149-159.

[42] Xu, L., J.A. Duan, and A.B. Whinston 2014, “Path to Purchase: A Mutually Exciting Point Process Model for Online Advertising and Conversion,” Management Science, 60(6), 1392-1412.

A Additional Tables

Network graph
  $Y$, $Y_{ij}$        a random network graph matrix and its $(i, j)$ element
  $Y_{-ij}$            all elements except $(i, j)$
  $\mathcal{Y}$        the set of all possible graphs for a fixed set of nodes
  $y$, $y_{ij}$        a realization of the random network graph and its $(i, j)$ element
  $z_k(y)$             a statistic of network graph $y$

Network statistics
  $t$                  total number of edges
  $d_2$                number of nodes which have at least 2 edges
  $h^{sta}_s$          number of edges within state $s$
  $h^{cat}_c$          number of edges within category $c$
  $p_g$                sum of geographic proximity over all edges
  $p_s$                sum of social proximity over all edges
  $p_f$                sum of investor proximity over all edges
  $p_b$                sum of business proximity over all edges

Nodal characteristics
  $s_i$                state where $i$’s headquarters is located
  $c_i$                category to which $i$ belongs

Dyadic characteristics
  $p_{g,ij}$           geographic proximity of $i$ and $j$
  $p_{s,ij}$           social proximity of $i$ and $j$
  $p_{f,ij}$           investor proximity of $i$ and $j$
  $p_{b,ij}$           business proximity of $i$ and $j$

Table 6: ERGM Notations
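To make the network statistics in Table 6 concrete, the following sketch computes several of them from a small edge list. It is an illustration under stated assumptions, not the estimation code used in the paper: the graph is treated as undirected, an edge counts as "within" a state or category when both endpoints carry that label, and the function name, data layout, and toy values are hypothetical.

```python
from collections import Counter


def network_statistics(edges, state, category, business_prox):
    """Compute t, d_2, h^sta_s, h^cat_c, and p_b for an undirected graph.

    edges:         list of (i, j) node pairs, one entry per edge.
    state:         dict node -> state label (s_i in Table 6).
    category:      dict node -> category label (c_i in Table 6).
    business_prox: dict frozenset({i, j}) -> business proximity p_{b,ij}.
    """
    degree = Counter()
    for i, j in edges:
        degree[i] += 1
        degree[j] += 1

    t = len(edges)                                   # total number of edges
    d2 = sum(1 for deg in degree.values() if deg >= 2)  # nodes with >= 2 edges
    # Edges whose two endpoints share the same state / category.
    h_sta = Counter(state[i] for i, j in edges if state[i] == state[j])
    h_cat = Counter(category[i] for i, j in edges if category[i] == category[j])
    # Sum of dyadic business proximity over all edges.
    p_b = sum(business_prox[frozenset((i, j))] for i, j in edges)
    return {"t": t, "d2": d2, "h_sta": dict(h_sta),
            "h_cat": dict(h_cat), "p_b": p_b}


# Hypothetical toy network: four companies, two transactions.
edges = [(1, 2), (2, 3)]
state = {1: "CA", 2: "CA", 3: "MA", 4: "MA"}
category = {1: "web", 2: "software", 3: "software", 4: "web"}
business_prox = {frozenset((1, 2)): 0.5, frozenset((2, 3)): 0.25}

stats = network_statistics(edges, state, category, business_prox)
```

In this toy network, company 2 is the only node with two edges, one edge lies within California, and one lies within the software category. The remaining proximity sums ($p_g$, $p_s$, $p_f$) would be computed in the same way as $p_b$ from their respective dyadic values.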


Term           Coeff    S.E.  p-value  |  Term                Coeff    S.E.  p-value
Geographic   -0.2699  0.3440   0.4326  |  NV                      -       -        -
Social        0.0532  0.0108   0.0000  |  NY                      -       -        -
Investor      0.0270  0.0522   0.6049  |  OH                      -       -        -
Business      0.4635  0.1378   0.0008  |  OK                      -       -        -
Edges       -12.5625  3.7908   0.0009  |  OR                      -       -        -
Degree > 2    2.4820  0.6438   0.0001  |  PA                      -       -        -
State                                  |  RI                      -       -        -
AL                 -       -        -  |  SC                      -       -        -
AR                 -       -        -  |  SD                      -       -        -
AZ                 -       -        -  |  TN                      -       -        -
CA            2.3899  0.8178   0.0035  |  TX                      -       -        -
CO                 -       -        -  |  UT                      -       -        -
CT                 -       -        -  |  VA                      -       -        -
DC                 -       -        -  |  VT                      -       -        -
DE                 -       -        -  |  WA                      -       -        -
FL                 -       -        -  |  WI                      -       -        -
GA                 -       -        -  |  WV                      -       -        -
HI                 -       -        -  |  WY                      -       -        -
IA                 -       -        -  |  Category
ID                 -       -        -  |  advertising             -       -        -
IL                 -       -        -  |  biotech                 -       -        -
IN                 -       -        -  |  cleantech               -       -        -
KS                 -       -        -  |  consulting              -       -        -
KY                 -       -        -  |  ecommerce               -       -        -
LA                 -       -        -  |  education               -       -        -
MA            4.6361  1.1201   0.0000  |  enterprise         2.9201  0.8882   0.0010
MD                 -       -        -  |  games video        3.0284  1.0953   0.0057
ME                 -       -        -  |  hardware           3.7045  1.7912   0.0386
MI                 -       -        -  |  legal                   -       -        -
MN                 -       -        -  |  mobile             1.8611  1.2047   0.1223
MO                 -       -        -  |  network hosting         -       -        -
MS                 -       -        -  |  other                   -       -        -
MT                 -       -        -  |  public relations        -       -        -
NC                 -       -        -  |  search                  -       -        -
NE                 -       -        -  |  security                -       -        -
NH            9.7899  1.5931   0.0000  |  semiconductor           -       -        -
NJ            5.6899  1.6428   0.0005  |  software                -       -        -
NM                 -       -        -  |  web               -0.9020  2.1375   0.6731

Table 7: Model Coefficients from Sample 1

Number of samples:
                  with    Coefficient    p-value  |                     with    Coefficient    p-value
Category   Coefficient            > 0     < 1.0%  |  Category    Coefficient            > 0     < 1.0%
advertising         28             28         14  |  mobile               27             27         16
biotech             37             37         32  |  net hosting           8              8          6
cleantech           12             12         10  |  other                 0              -          -
consulting          12             12          9  |  pub rel              10             10          6
ecommerce           12             12          6  |  search                0              -          -
education            0              -          -  |  security              0              -          -
enterprise          22             22         20  |  semiconductor        17             17         14
games video         28             28         16  |  software             89             85         55
hardware            31             31         29  |  web                  78             70         22
legal                0              -          -  |

(a) Category

Table 8: Category-Based Selective Mixing Coefficients (100 Samples): Equation (10) excluding $\theta_b p_b$

Topic Dimension Top 5 Words

1 Product video,music,digital,entertainment,artists

2 Product news,site,blog,articles,publishing

3 Product job,jobs,search,employers,career

4 Product people,community,members,share,friends

5 Product facebook,friends,share,twitter,photos

6 Product energy,power,solar,systems,water

7 Product systems,design,applications,devices,semiconductor

8 Product consulting,clients,support,systems,experience

9 Product event,sports,events,fans,tickets

10 Product insurance,financial,credit,tax,mortgage

11 Product deals,shopping,consumers,local,retailers

12 Product health,care,medical,healthcare,patient

13 Product students,learning,education,college,school

14 Product food,restaurants,fitness,restaurant,pet

15 Product investment,financial,investors,capital,trading

16 Product advertising,publishers,advertisers,brands,digital

17 Product manage,project,documents,document,tools

18 Product treatment,medical,research,clinical,diseases

19 Product games,game,gaming,virtual,entertainment

20 Product security,compliance,secure,protection,access

21 Product search,engine,website,seo,optimization

22 Product search,user,engine,results,relevant

23 Product fashion,art,brands,custom,design

24 Product equipment,repair,car,home,accessories

25 Product law,legal,government,public,federal

26 Product analytics,research,analysis,intelligence,performance

27 Product travel,travelers,vacation,hotel,hotels

28 Product real,estate,home,buyers,property

29 Product payment,card,cards,credit,payments

30 Technology/Product phone,email,text,voice,messaging

31 Technology/Product wireless,networks,communications,internet,providers

32 Technology/Product cloud,storage,hosting,server,servers

33 Technology/Product app,apps,iphone,android,applications

34 Technology/Product design,applications,application,custom,website

35 Technology/Product site,website,free,allows,user

36 Technology/Product testing,test,monitoring,tracking,performance

37 Market/Technology digital,clients,brand,agency,design

38 Market sales,customer,lead,email,leads

39 Market solution,cost,costs,applications,enterprise

40 Market organizations,community,support,organization,businesses

41 Market make,people,time,just,way

42 Market quality,customer,needs,clients,provide

43 Market systems,operates,headquartered,subsidiary,serves

44 Market united,states,offices,america,europe

45 Market san,york,city,california,francisco

46 Market award,magazine,awards,best,world

47 Market million,world,leading,largest,global

48 Market/Team team,experience,industry,world,market

49 Team partners,ventures,capital,including,san

50 Team launched,million,product,ceo,acquirede

Table 9: LDA Results of CrunchBase Data
