Statistical Methodology in the Social Sciences 2017: Session 1

Okay, so I tried to group the papers into sessions. And this first session is perhaps about some new methodologies that a lot of people across different areas of social science are interested in. Our first speaker is Lauren Peritz. I've asked speakers to talk for 25 minutes and leave five minutes for discussion. [SOUND]
>>So, did you->>Yes. Okay. Sorry. [INAUDIBLE] [INAUDIBLE].>>Great, thank you. Okay, hopefully not too technical. All right, well thank you so much for inviting me to participate
in this conference. I’m delighted to be able to
present some ongoing work. The project here is entitled "Do Preferential Trade Agreements Increase Trade?", and it uses network analysis statistical models to try to get at this question. Let me begin with some current
events to help motivate the project. As most of you are probably aware,
the Trans Pacific Partnership is a multi-lateral trade agreement
negotiated among 12 different countries, including the US, Japan,
Korea, and a number of others. And the Obama administration had been a major proponent, as a way to expand trade with numerous Asian and Pacific partners and to lock in trade rules that
would be strategically favorable to the United States going forward. One of the many justifications has been
to secure favorable trade relations with partners that might
otherwise be drawn to China. However, the Obama administration,
despite having negotiated for several years and arrived at a highly
detailed multilateral treaty, encountered resistance at
the ratification stage. Dissent in Congress meant that
the trade agreement did not go up for a vote before the 2016 election. And then one of the first moves of the new
Trump administration was to withdraw the US from
the Trans Pacific Partnership entirely. Among other reasons,
the Trump administration argued that the TPP would do little to expand exports from the US to foreign markets, instead leading to an import surge. On both sides of the aisle, there seems to be this assumption that trade agreements such as the TPP have a very large aggregate effect on trade flows. And both sides suggest that the effect would apply not only to members of the trade agreement, but also to non-members, for example by excluding China,
which is not involved in the negotiations. So preferential trade agreements
are treaties that governments sign in order to set terms
of international trade. They typically grant each other
preferential treatment, meaning more favorable market access. A familiar example is the North American Free Trade Agreement, NAFTA, which eliminates tariff and quota restrictions for the most part for the US, Canada, and Mexico, while allowing each country to independently set its trade policies for non-members. Notably, these trade agreements can take different forms, with some being more stringent than others. Almost all countries are part of at
least one preferential trade agreement. And the data I use in
this study accounts for approximately 200 reciprocal
agreements currently in force, and many others that have been in
force over the past few decades. Because some countries may
have overlapping commitments, it’s informative to think of
them as forming a network. So the substantive questions
that I’m interested in, first, when do countries form
preferential trade agreements? You could think of an example. The US has a free trade agreement with
Korea and a separate one with Australia. At the same time, Korea and Australia
have trade agreements with each other. And so when Korea sets its trade policy, it's beholden to both its separate commitments to the US and to Australia. And so does that incentivize different trade agreements to be formed? Certainly there are proximate reasons: governments may wish to enhance trade with other members of preferential trade agreements. But there's also anecdotal evidence for secondhand effects or externalities. When South American countries negotiated, for instance, they banded together to gain leverage, or bargaining power, when interacting with the North American
members of the free trade area. In other words, one PTA may beget another. The second related question is, do preferential trade agreements
actually promote trade? It’s important to distinguish
between expansion of trade dependence which may bring efficiency
gains and trade diversion. If preferential trade agreements simply
redirect trade from one partner to another, there may be
little economic payoff for negotiating these
complicated legal contracts. And this has been a point of contention
as the European Union expands. To give you a sense of the scale
of the trade agreement network, let me illustrate some snapshots and
it’s evolution. This figure shows trade agreements
formed between 1966 and 1970. Most of the activity was regionally
focused, including, as you can see, the predecessor up to the African Union. That activity intensified in
subsequent years with expansion of multilateral agreements. And in 1991 to 5 we see significant
expansion of the European Union membership, for instance. So what I’m doing here is treating
each country as a node and the treaty agreements as ties or
edges between those nodes. More recently, preferential trade agreements have been
formed cross-regionally as is evident in the illustration of PTAs
formed between 2001 and 2005. So that’s more recent. Okay, so as you’ve probably gathered, the
number of preferential trade agreements in force has substantially grown
over recent decades, and this effort towards legalization has not
been limited to any particular countries. Rather, the number of countries that are
members to at least one PTA has steadily grown to the point that nearly every
country participates in this network. As countries have more trade agreements,
we can think of each country as a node, the trade agreements as ties or edges, as I had mentioned, and the density statistic as
an informative measure. This network has become more dense over
time, particularly since the 1990s. So lots of work on international politics
uses descriptive networks statistics as I’ve shown so far. But to really get a hold on
the mechanisms I’m interested in, I want to do something inferential. And so to figure out how trade and
PTA formation is related, we need to account for the important
indirect effects of the network structure. And the indirect effects of
trade diversion, and so forth. In the political science department here at Davis, Professors Maoz and Kenny have been very innovative in building and applying inferential network analysis to political problems. So what I propose is to apply some of
those same insights to the domain of the international political economy to
better understand how governments use treaties to deepen their
economic ties to one another. The two research questions
I address are, first, when do countries form preferential trade agreements? This is naturally a network question, since the choice to form a tie clearly depends on other nodes' actions. The literature on trade cooperation lends insight. And importantly, there are [INAUDIBLE] explanations, which I test, but
I wanna just highlight a couple for you. As I illustrated by the demise of
the Trans-Pacific partnership, domestic politics within
countries clearly matter. Governments that faced many constraints as
the Obama administration did when trying to float the TPP through a Republican
Congress had difficulty ratifying new treaties. So we wanna think of the characteristics of the countries in determining
the network [INAUDIBLE]. In addition when countries form
preferential trade agreements, there’s a stronger incentive for countries to
band together to gain economic leverage. So in this literature, [INAUDIBLE] et al. have identified closure effects. In other words, governments have a higher propensity of signing PTAs with others to whom they're indirectly tied. The second question is
technically linked to the first. What is the impact of the PTA
network on trade flows? The political economy
literature establishes that preferential trade agreements lead to an uptick in trade between members
in the subsequent five to ten years. Different timing of the impact has been identified when one disaggregates between intensive versus extensive margins. So how much of this effect comes from increased trade dependence? How much comes from the negative externalities of PTAs, in other words, trade diversion away from non-members? So I'll speak a little bit more about
the conflict with the multi-lateral trade regime that people are interested
in that substantively. But just to give you a hypothetical
cartoon of what I’m modeling, it highlights why traditional regression
analysis would be unsuitable. By assuming away the effect of one node's status on another, as is required by basic regression model assumptions, we could end up with biased estimates. So you can think of, say, four countries
which are all trading with each other. We’re not gonna worry about the direction,
a directed network, but rather undirected bilateral trade. And you can imagine that three of these
countries form a preferential trade agreement. If indeed that intensifies
trade within that community, it might detract from trade
outside the community, right? And so
this is one hypothetical outcome and that would incentivize a country
outside the community to try to join. So this is the type of dynamic
that we are interested in. Okay, so thus far I’ve discussed
the substantive puzzle and the inferential challenges. Now I wanna explain the statistical models
that allow me to counter these effects. For inferential network analysis I rely
on experiential random graph models. The basic version uses characteristics
of each nodes, countries and their relationships to predict,
observe network ties, PTAs. And there’s variations on this urban
framework which I’d be happy to discuss. There are a number of alternative network
models which have their own strengths and weaknesses. And in a recent comparison of different
network modelling alternatives the urban model which I choose seems
to best suit the substantive puzzle. So how was this work? What is the random,
exponential random graph model? Considers the network conditional
series of predictor terms. These predictors, networks, statistics
represent configurations of ties, for example triangles of three
nodes of a common attribute. Better have all the sides to occur
more often than expected by chance. These terms along with your coefficients
define the probability of each edge, and the probability of the entire network. The true example is
the Homogeneous Familiar Model. The configuration is an edge, the predictor is the total
number of other edges. And the network may be viewed as
a collection of independent and identically distributed variables. Like this on and
off switch between pairs of nodes. Building on this baseline model,
one can also account for more sophisticated networks. So, suppose you want a model on
an empirical case where not all nodes are equally likely. This can be modelled with [INAUDIBLE]
that obtain maximum likely the estimates that describe
the impact of local selection forces, including node to edge attributes
on the structure of a network. For example if two nodes,
two countries sharing a common attribute. For example both low income countries
may be more likely to form an edge, a preferential trade agreement
with one another, and edges with certain attributes
may be more likely to form. For example, if there’s more bilateral
trade shared between those countries. So this n(x) term defines the support, the set of all obtainable networks, as a function of the nodal characteristics. If you had just one simple network statistic, like an edge count, that would enter into the vector of network statistics here. If you had covariates of interest, like whether two countries share a contiguous geographic border, those would enter this x term. We can talk more about the mechanics of it if that's of interest to people. The normalizing factor, this c, captures the summation over the sample space, and estimating that is a computational challenge for ERGMs.
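For reference, the model being described has a standard general form; writing it out in generic ERGM notation (my own symbols, not the slides'): the probability of observing network y, given covariates X, statistics g, and coefficients theta, is

```latex
% Standard ERGM form (a sketch in generic notation):
% the normalizing constant c sums over the sample space of obtainable networks.
P_{\theta}(Y = y \mid X) = \frac{\exp\{\theta^{\top} g(y, X)\}}{c(\theta, X)},
\qquad
c(\theta, X) = \sum_{y' \in \mathcal{Y}(X)} \exp\{\theta^{\top} g(y', X)\}.
```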
And so to estimate these models I rely on an R package by Hunter et al. that's available open source, so it's easy to get access to. So what do I find from my first analysis? Here, I model all of the predictors of the preferential trade agreement network at various slices of time.
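As a rough sketch of what that estimation step can look like in R, presumably with the ergm package by Hunter et al. (part of statnet); the network, attribute names, and covariates below are made-up placeholders, not the paper's actual data or specification.

```r
library(ergm)     # ERGM estimation (Hunter et al., part of statnet)
library(network)

# Hypothetical PTA network for one time slice: countries are nodes,
# pta[i, j] = 1 if countries i and j share a preferential trade agreement.
set.seed(1)
n   <- 30
pta <- matrix(rbinom(n * n, 1, 0.1), n, n)
pta[lower.tri(pta)] <- t(pta)[lower.tri(pta)]   # symmetrize (undirected ties)
diag(pta) <- 0
net <- network(pta, directed = FALSE)

net %v% "log_gdp" <- rnorm(n, 10, 1)            # nodal covariate (placeholder)
trade <- matrix(abs(rnorm(n * n)), n, n)        # dyadic covariate (placeholder)
trade <- (trade + t(trade)) / 2

# Edge count, nodal and dyadic covariates, plus GWESP for the transitivity effect.
fit <- ergm(net ~ edges + nodecov("log_gdp") + edgecov(trade) +
              gwesp(0.5, fixed = TRUE))
summary(fit)
```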
is indeed a strong predictor that they will form new PTAs with one another, even
accounting for trade with other countries. Importantly, there’s a strong
transitivity effect. So this last term, which stands for geometrically weighted edgewise shared partners, it's a whole long name. Basically it gives you a sense of the number of pairs of nodes that are connected both by a direct edge and by a two-path through another node. And this significant coefficient points to transitivity in the network that's beyond what may be explained solely by nodal characteristics. This tends to suggest that countries
prefer to form trade agreements with other connecting countries. These plots show estimates over time,
and what I just want to highlight for you here is that consistently over time, smaller economies have been more
prone to form these connections. Bilateral trade is a strong predictor. And over here, in more recent years governments that
have a lot of these internal checks and balances through different party held by,
say the legislature and executive in the US case are less
prone to form these trade agreements. Which is exactly what we saw when
the US failed to ratify the TPP. So it may be part of a broader trend. It's informative to compare these estimates to standard logit model results, in other words without accounting for network effects. The table shows the two analyses with the exact same specification, same data, save the network statistics. And trade intensity is an even stronger predictor in the network setting, suggesting that concerns about competition, or trade diversion, may be
a little exaggerated. Okay, so an important extension to the ERG model takes values on the ties, and this allows one to model the strength of the ties between nodes. It introduces a non-binary reference distribution to allow for count or continuous data. These models have some problems with degeneracy: if, for instance, you have a very sparse or a very complete network, estimation can be [INAUDIBLE] challenging and give less reliable estimates.
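A minimal sketch of what a count-valued ERGM call can look like, assuming the ergm and ergm.count packages; the simulated network and its "trade" edge values are placeholders, not the paper's data or specification.

```r
library(ergm)
library(ergm.count)   # count-valued reference distributions

set.seed(2)
n   <- 15
adj <- matrix(rpois(n * n, 1), n, n)            # fake trade counts between countries
adj[lower.tri(adj)] <- t(adj)[lower.tri(adj)]   # symmetrize (undirected)
diag(adj) <- 0
net <- as.network(adj, directed = FALSE, matrix.type = "adjacency",
                  ignore.eval = FALSE, names.eval = "trade")  # keep the edge values

# Model tie strength, not just tie presence, against a Poisson reference.
fit_valued <- ergm(net ~ sum + nonzero, response = "trade", reference = ~Poisson)
summary(fit_valued)
```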
I applied these valued ERGM models to examine two additional questions. First, as I noted, trade agreements vary in their design, with some imposing more stringent obligations on their members than others. One might expect that governments that initially trade more with one another choose these deeper, more stringent legal mechanisms. For example, the European Union has extremely stringent requirements over trade, whereas more shallow trade agreements, such as between the US and [INAUDIBLE], have very minimal requirements. So, the intensity of trade does predict the depth of these [INAUDIBLE] trade agreements, but the differences are actually quite small. Second, and I think more interestingly,
I aim to model subsequent trade as a function of previous preferential trade agreements. So, here the question is: does trade between member states expand in the years following the implementation of a new trade agreement? If so, is this driven by trade diversion,
or market growth? So, this table shows the year over
year percentage change in bilateral trade as a function of the formation
of new preferential trade agreements. These ones I look at the ones
implemented in the preceding five years. Although I tried all sorts
of different time lags and seemed to find pretty consistent results. So, what this shows is a reliably
knew preferential trade agreements lead to an expansion in trade between
those countries, even accounting for other preferential trade
agreements in the network. I did a number of other analyses, including focusing on the temporal aspects. Another variation of these ERG models
looks at the network evolution over time. So, one can model two separate processes,
the formation of new ties and the dissolution of existing ties. This kind of extension, however, is substantively pretty tricky
in this particular application. In the international politics literature,
some scholars have highlighted the existence of so-called zombie institutions. So, what this means is that countries tend
to establish international agreements, which eventually become obsolete, but
they don’t bother to disband them. And so, these institutions
continue to persist on paper yet maybe have no effect on the economic
relations between the governments, and it will vary in practice. And so, sometimes this happens when agreements get superseded by new international agreements. Nevertheless, the upshot is that the dissolution of ties is not easily measurable. So, I do look at some of the factors that predict how durable a preferential trade agreement is, allowing for the fact that at least some of these ties will be meaningless ones. So, let me just kind of wrap up some of
the findings that I've shown to you today and some of the extensions that I've explored. First and most reliably, preferential trade agreement networks reflect established trading relations between countries. Once we account for the indirect effects, it's clear that countries form PTAs
when they trade more together. There are clear community effects. However, there is less support for
a balancing effect. In which countries form preferential
trade agreements to improve bargaining leverage over others. Second, preferential trade agreements
reliably promote an increase in trade, and this has been a point of
contention in [INAUDIBLE]. If I measure it as a percentage
change over previous levels, I do see an uptick in trade in the subsequent five to ten years. The benefits for members of PTAs appear to outweigh losses for non-members. And the ERG models do not, unfortunately, allow me to pin down whether I can attribute the causal mechanism to the trade agreement itself. Does it have any
independent causal effect, or is it merely serving as a marker for
future expansion in trade? In other words, it is not necessarily the case that preferential trade agreements actually prompt more trade. Rather, it's possible that these PTAs are signed when countries anticipate deepening economic relations, as
a way to kind of pave the way for more cooperative future interactions. So, of course, given observational data,
it’s hard to pin down causal effects. However, third,
in an analysis of these temporal models, I did find that preferential trade
agreements with legalized means for dispute resolution tend to last longer. So, you can think of
NAFTA as a prime example. When there’s a disagreement over
trade barriers that arises between, say, the US and Canada, as has been the case in many instances, those countries can engage in a
quasi-judicial process to reach settlement rather than resort to trade wars and dissolve their preferential
trade agreement. This tendency echoes a broader finding in the literature on international cooperation: that dispute settlement mechanisms can stabilize an international regime, making it more durable over time. And finally, there are only slight differences between different types of preferential trade agreements. Another point of interest in
the literature has been whether these trade agreement networks conflict with
the World Trade Organization, and here I don’t find very much
evidence in this respect. In effect, do countries that are dissatisfied with the multilateral trade regime, as embodied by the WTO, seek to secure preferential treatment outside of that umbrella? Well, in this respect the fears might be a little exaggerated, because countries that have been involved in a lot of disputes within the WTO don't seem to subsequently seek more trade agreements outside of that regime. Finally, since this is an ongoing project,
I wanna outline for you my next steps. And hopefully that gives you some
sense of where I need suggestions. So, first I’m working on
rerunning the analysis, where I look instead of at a total trade. Instead of looking at total trade,
I examine the intensive and extensive margins of trade
at the country level. This distinguishes between
increasing volumes of trade along country cares that
already had trade of some sort. And so, this allows you to get the entry
of new firms into trading relations, or new product levels. And. And on the other hand, the initiation of trade between
countries that previously had none. So, there’s some literature on this and I’m kinda working through what
one would expect reasonably. But one might expect that, PTAs do
little to open entirely new markets, but can facilitate the expansion of
trade between counties whose markets are already complementary. The second point I’m working
on concerns regional dynamics. As you saw, there’s clear qualitative
differences between regional clustering and cross regional
preferential trade agreements. And then finally, I’ve spent some time
working on estimating the joint relation between high selection and
node variable influence on the network. Several statisticians have been working
on a model that combines the ergom with random field modelling and
the scenario of active research, which I try to find out of my data So,
with that said, I’d like to just wrap up. And thanks for your attention, and
I look forward to your comments.>>[APPLAUSE]
>>Okay, so we got five minutes for discussion. Any questions? Yes? So question,
within [INAUDIBLE] regression so you’re running a binary variable
[INAUDIBLE] on the binary variable [INAUDIBLE]
>>The trade is a continuous variable, so it’s the [INAUDIBLE]
>>Even if it’s a binomial if it’s>>[INAUDIBLE] I tried between one and five years at the lab, and
it was pretty consistent. [INAUDIBLE] as a percentage of [INAUDIBLE]
>>Yeah, yeah, so I looked at trade dependence, and then,
I also looked at the percentage change in trade as the outcome variable in
the second portion of the analysis.>>[INAUDIBLE]
>>Correct, yeah, the second, so the first analysis was like, trade agreements as the outcome, looking at [INAUDIBLE] preceding years. The second analysis, I flip it around. Clearly, [INAUDIBLE] However, all of
these things are so, there’s only so much you can do. But, the outcome was a value network where
I looked at the percentage change in trade after the years following
a new preferential trade agreement.>>Okay so when you say trade
[INAUDIBLE] means that trade becomes 75 [INAUDIBLE]
>>I tried it a couple of different ways, like a one year change,
and then, like the levels>>Yeah, and so the results I’m sure they were like [INAUDIBLE].>>Is it possible [INAUDIBLE] network structures [INAUDIBLE]? Relationships trade is sequential?>>Yes.
>>[INAUDIBLE]>>Yeah, so I tried a couple different ways to get at how much closure there is, where all of them are linked to each other, versus a hub-and-spoke kind of model. I guess I haven't found any really
striking patterns on that, but some of it’s model specification, so I gotta
play around with that a little bit more. But thank you, that’s a good guess. That would be interesting. I’ll push on that point a little further.>>So I’m not familiar with these models,
and so I found them a little bit hard to understand what
the unit of observation was? So, subgroups and so on would help. And one of the things I know is that, if you just look at bilateral trade between countries, then you're dealing with country pairs. There's a real serious problem with [INAUDIBLE], that you have to allow for correlations across country pairs. And the models that we use do not pick up on that correlation. In fact, in my own work I found out that you can be out by a factor of three, or three to five, on standard errors. So I couldn't tell from these models, but is that an issue here? The standard errors that you're getting, do they adequately control for the dependencies? That's my concern.>>I mean in theory yes,
but I don’t have any good measures to tell you how much information or [INAUDIBLE]
it is.>>Well if anything happens
it will be a regression, the question is whether
there is a regression. And then the other thing was, with these models, and on many models, to what extent are the results that you get just simply statistical significance versus meaningful economic magnitudes? So there are a lot of stars, but were these bigger effects or smaller effects? So for example, towards the end, it seemed to say that it's changing. And then, I saw numbers like 0.77, which to me, 0.77% is very small. Yes.>>I think with the ERGM,
it is a really complicated setup, but because it's really similar to logistic regression, with the coefficients that you get from this model you can actually get something, and then that is a measure that other people, at the university level, can understand better than
>>[INAUDIBLE]>>Okay, so it'd be good to translate.>>I think that, that would help them.>>Yeah, I started down that road by
doing a comparison to the standard logistic results, and so I’ll push that
further then we can get a sense of scale. Thank you.>>Okay, well thank you Jan.>>[APPLAUSE] All right.
Cool. My name is Tyler Scott, and I'm actually brand new here to campus, over in Wickson Hall in Environmental Science and Policy. And so, I'm pretty proud of myself for actually finding my way
to this building today. So, off to a good start. This sort of encapsulates
my research topic, but essentially, the talk I’m giving today is
really network analysis in a messy world. So you might not care about forest,
and fish, and water and environmental government, but, many of you probably do have network type
applications and likely some messy data. So hopefully, this can speak to that. There’s a lot to digest in
computational social science. And so one of the themes today, that I’m gonna present, is that even if
you don’t know how to do all of it, and you can’t, you’re not the best at all of
it, there’s some really simple tools and strategies that you can use in your
particular messy applications. So for me, what that looks like,
is that I study coordination and institutional complexity. And I’ll show you an example of
that in environmental governance. And so this is my story about how I
combined text mining and graph models, older models [INAUDIBLE] to get at these
questions that I’m most interested in. So here’s my first example, you’ve probably seen a lot in the news
recently about the City of Houston. It’s a great example about institutional
complexity in environmental governance. There are more than 1,400 different
independently regulated water providers just in the Houston metropolitan area. Which is just a ridiculous
number of independent utilities. And because of that they face
significant coordination problems. So they’re all fighting for
the same supply resources, they all have straws in
the same groundwater aquifers. They’re fighting for infrastructure and finance stability to sort of tax and
spend. And then they have to coordinate on
all sorts of service [INAUDIBLE] challenges as well. Because obviously not all 1,400 of these
systems have the capacity to sort of go it on their own. And in particular, a lot of these are essentially like you and your neighbors in your cul-de-sac deciding that we're gonna form our own water utility. So lots of interesting
coordination problems. And you have these local neighborhoods, who are also operating
almost as their own city. And it’s obviously been in the news
recently since Hurricane Harvey. You’ve probably,
if you read a lot of these articles, it’ll often mention a lot
of these small utilities. These are the ones that are having
trouble getting back on track. They’re not sure what
the quality of their water is. It’s not the city of
Houston in the center here. It’s all of these tiny thousands
of districts out in the outskirts. So if I’m interested in how they
coordinate and how they share information, how they provide clean water. The classic way we've done that in social science, when we're studying environmental governance, is with surveys. And they pose issues that many of you are familiar with. So the first one is recall. It's one thing to ask students in a high school classroom, who are your friends in class? There's only 30 people, they can sort of,
think through all of that and name their friends. It’s another to ask people in that type
of complex institutional environment, who are all the people
you coordinate with? Second, even if they did answer that,
even if you have all the time and money in the world to keep rolling out
your own surveys, these folks are busy. And they don’t have time, they don’t
really wanna answer your surveys. And so even if they do have perfect
recall and you have all the monies, you’re still not gonna be able to get good
results going forward because ultimately we care about how this changes over time. And so we have significant problems
with repeating our surveys as well. This is sort of an idiosyncratic feature that also sort of shows up in survey data as well. We get this sort of person-organization fuzziness, or in her case it might be a company-country fuzziness. Well, it's like I send a survey to these water providers, and one person gets it, and they fill
out who are my coordination partners? But if you think about the city of
Houston, they have hundreds of employees. And so
if you give that list to someone else, maybe they talk with a completely
different subset of organizations. Whereas if it goes to a small
organization, that one person might be able to speak authoritatively to
all of their network connections. And so, really, in practice what this means is your discussion section gets longer as you sort of justify to reviewers why you think it's okay to sort of aggregate to organizations. But it's nonetheless sort of an ongoing latent problem that's always bubbling in the undercurrent: what are these nodes actually representing? And surveys also, particularly when
we zoom down to sort of local and low level governments,
they miss key things. So, one issue here,
is these types of small, operators. They’re not necessarily showing up in
large scale regional planning gigs. So, we have sort of this high level, like, big agencies, big municipalities doing
things like regional coordination bodies. Then we have these small folks who
are just trying to keep the water going. And so, they’re not gonna show up if you
survey like regional coordination bodies. And also, likely, if you ask, well, who are your collaborators? They're not gonna list their actual collaborators. They're not gonna list the people you care about, like the person they pay to do their taxes, and the person they pay to keep the water going, and their finance expert. Things like that. So one thing that my co-authors and
how can we better measure coordination in these types of networks
without having to resort to survey data? And so one thing we hit upon is personnel
records as an alternative network measure. So when there’s lots of evidence
that shows that expertise, skilled labor is a key driver of
environmental outcome in service delivery. So you have to have access to
people who know how to treat water, who know how to operate a debt
financing scheme, things like that. And so it’s a key driver of
outcomes they all have to rely on. And the cool part about studying these
types of administrative processes is they also all have to file lots
of administrative records. So you can go to the state of Texas’s
government website and there’s a page for every single water provider. And they list all their contacts,
all the licensed technicians they employ, all of their other experts
that are on their payroll. And so what we argue is that personnel who
work for multiple providers are a conduit for information-sharing in connectivity,
just like social relationships might be. And so, instead of a social
relation connecting to nodes, we’re actually going to use people,
as the connection. So, when you think about that, one question isn’t necessarily
interesting at all. Because small providers are inherently
going to share do to sort of the transaction costs economic logic. They don’t need a full time employee,
so they pay somebody part time. So we’re not interested in
predicting contracting, because that’s already
sort of well settled. What we’re interested in is asking this
question of if I am going to share, who do I share with? Do I share with people I’m competing with? Do I share with large organizations,
small organizations? Those types of questions. So here’s a little bit about
what this network looks like. So we're gonna use a bipartite network graph, and that's essentially just a two-level network. So on one level we have people, and on the other level we have organizations. And thankfully Lauren's already done a lot of the heavy lifting in explaining how these models work, so I don't have to get into that too much. Essentially, as someone in the back mentioned, we're talking about the same kind of thing. And in essence, you could use a logit
model to estimate the same thing or you could estimate your
logit model in an ERGM. What sets apart an ERGM typically is
we don’t just have system attributes like what kind of system you are and
how big are you. We have these autoregressive
structural terms. And these can get complex as we sort of
wade into the sociology literature for all the names they have for them. But if you’re familiar with
a spatial modeling application, it's a similar type of thing. We're simply regressing, we're simply controlling for surrounding ties on a given tie we're trying to predict. We're saying that your neighbors matter. So if I run a water utility and
you run a water utility and we already share one person, you might be
more likely to share another, as well. That type of thing. The other interesting thing is
sort of a messy network problem. It's very common when you start out with a two-mode network to actually collapse the matrix, take the cross
products to get a one mode network. But that would be sort of
weird in this case because you would start with this person who works
for three systems and you’d collapse it. And then what you would be modeling
is a full set of ties between all of these nodes. So an example would be the graph that
Lauren showed again where you saw that in some of these bilateral trade agreements
because everyone is part of the agreement, everyone has a tie with everyone else. That can pose some serious modeling
challenges when you have one contractor who works for 10 different systems. And then you tell the model, all 10 of these organizations
have a tie to each other. It's also, I guess I should say, I think it's also sort of more conceptually appropriate in this case, because this really is what our network looks like, and otherwise we sort of obfuscate that.
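A minimal sketch of that two-mode setup, assuming the statnet network package; the incidence matrix and names are made up for illustration, and the one-mode cross-product is shown only as the shortcut being argued against.

```r
library(network)

# Incidence matrix: rows are people, columns are water providers (1 = works for).
inc <- matrix(c(1, 1, 0,
                0, 1, 1,
                1, 0, 0,
                0, 0, 1),
              nrow = 4, byrow = TRUE,
              dimnames = list(c("p1", "p2", "p3", "p4"), c("orgA", "orgB", "orgC")))

# Keep the bipartite structure: person nodes and organization nodes stay distinct.
two_mode <- network(inc, bipartite = TRUE, directed = FALSE)

# The common shortcut collapses to one mode via a cross-product, which
# manufactures full cliques among every provider a shared person works for.
one_mode <- t(inc) %*% inc          # provider-by-provider shared-personnel counts
diag(one_mode) <- 0
```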
So a couple of quick headline results. As sort of a proof of concept, size behaves as it should: we simply see that if you have more employees, you're on the margin more likely to share someone, if you are a large provider. A more interesting question, then: there are lots of small providers who are
drawing upon the same groundwater aquifer. And the way Texas works is that if you have the straw in that drink, you can pull out as much as you want. And they have lots of carrots trying to get people to stop. But it's a massive
collective action problem. And so one thing that’s actually
interesting about this is we wouldn’t expect that these providers competing for
the same resource share personnel because that information about what
they're doing gets passed back and forth. But if we're sort of looking at it from another direction, this might be a way to help all of them actually work together towards the collective solution that these regional groundwater authorities are hoping they work towards. Another example, and the standard errors are somewhat large on this, but again, we're really just interested at this point, as a proof of concept, in which direction this effect goes: we see that districts that buy and sell from one another are also
likely to share personnel. So plans for building on this,
because I think as a first step, it’s even important just to demonstrate
that these types of network ties follow sort of basic intuition. We intend to connect personnel
sharing to performance outcomes, particularly, since pre and
post hurricane Harvey. Looking at who employees and
what employees you share, how that affects how quickly you
were able to get back online. And well maybe inverting that and
looking at turnover and asking if you had, sort of,
a severe event when the hurricane hit, did you end up switching sort of and
going with some other personnel. Data collection's ongoing for this, and what that essentially looks like is: every month my Google calendar sends me a ping, and I fire up the web scraper again and go to all the different district websites and scrape everything.
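A hedged illustration of what that monthly scraping step could look like with the rvest package; the URL, CSS selector, and table layout here are hypothetical placeholders, not the actual Texas state site.

```r
library(rvest)

district_url <- "https://example.texas.gov/water-district/12345"   # placeholder URL

page <- read_html(district_url)

# Pull the personnel table; the selector is an assumption about the page layout.
personnel <- page |>
  html_element("table.licensed-operators") |>
  html_table()

# Stamp the scrape date so monthly snapshots can be stacked into a panel.
personnel$scraped_on <- Sys.Date()
```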
So hopefully in a couple of years we'll have something pretty cool. Because that's ongoing, I'm gonna pivot quickly to another example to show how we take these types of data and
try and extend them over time. So the second example is hydropower facility relicensing that's run by the Federal Energy Regulatory Commission. And it's a really cool example of environmental collaboration, because FERC actually says,
you all have to collaborate. So, this isn’t a case where everybody
says, hey, let’s work together, this is going to be great. This is a case,
where a Federal agency says, you all need to get your act together and
work this out. Or else this is just going to go to court
forever, and nobody really wants that. So what we see in these cases, it might be one small dam way
up in the mountains somewhere, so like the case we analyzed was way
up in the North Cascades National Park. And we still have tons of
different stakeholder groups, so private firms, resource agencies,
local citizens, local towns. And the challenge they have to come
together on is to develop a 30 to 50 year operating plan to specify all
the contingencies about how they’re gonna operate this dam, to provide energy and
protect fish, and do all sorts of other things for
the next 30 to 50 years. So the types of questions that I'm interested in, looking at these sorts of mandated instances of collaboration, are: who actually shows up when FERC says you all have to do this? Who becomes the leader in the process, or who sort of maintains
leadership throughout the time? And who leaves, right? One thing you may be concerned about is,
a collaborative institution becomes a coalition over time and
sort of peripheral people drop out and it’s sort of a collaborative
in name only at that point. So the challenge in this type of a process
is it’s way too big to hand code, and it’s way too long to survey. So we’re looking at again,
one relatively small dam, way up in the mountains in
the state of Washington. And this takes 16 years, and they held more than 600 meetings. We were able to find meeting summaries for 591 of them, and no one wants to read through all those. And we also can't really
follow up with the survey and ask the people what was
your participation like? Cuz they just can’t give us good responses
over what they did for the past 16 years. Thankfully, as I said, someone does take notes, and so that's sort of our point of entry into analyzing this stack of texts. So the first thing I'm going to ask is who shows up. And so
we are gonna perform text mining, and we're gonna use a text mining tool called entity extraction. And essentially how it works is it takes these meeting summaries and it tokenizes them into phrases,
and I say phrases not words, because some phrases like fast
food are actually one word, right. And so we're using a machine learning classification tool to say this is a word, but this compound word is a word as well. And then we're going to again use
a different machine learning technique, it’s a pre-trained classifier that then
looks at these words or phrases and says this is a person, this is a location,
this is a date, this is a number, so pretty basic, ultimately,
it’s statistics on the back end, right. Because these machine learning algorithms
have been pre-trained on thousands of books and they use that pre-training,
that dataset, to then say, looking at this new sentence I’ve never seen before,
I think that is also a person’s name. And so once we get this extracted set of
people who show up in these meetings, we supplement it with hand coding. So, a few examples of what that looks like: because real people had to write these meeting notes, there are misspellings. We use fuzzy regular expression matching to pick up where David has the i and the v reversed, things like that.
Another example would be: if you've ever heard of a Dolly Varden, it's actually a kind of trout, but IBM's sort of baseline classifier doesn't know that that's not a person, that's a fish. And so, that would be something going
forward as we build our own customized model we’ll pick that up but for
now we filter that out in the backend. So we don’t just wanna know who shows up, we also wanna know what they
are doing when they arrive. And so, the second text mining tool we are gonna use is part-of-speech tagging. So this time we are going
to take these texts and we are gonna break them up into
sentences rather than phrases. And then we’re gonna feed it back into
a machine learning classifier and it’s going to spit us back out a probabilistic
model that says this is the object of a sentence, this is the subject of
a sentence, and this is the verb. And here we're particularly interested in the subject and verb, so the person who's doing something and what they're doing.
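A hedged sketch of both text-mining steps (entity extraction, then parts of speech) using the spacyr package as a stand-in for the IBM/Watson tooling the talk mentions; the sentence and model name are placeholders.

```r
library(spacyr)
spacy_initialize(model = "en_core_web_sm")   # pre-trained English model

note <- "Jane Doe walked us through the new fish passage report."

parsed <- spacy_parse(note, pos = TRUE, entity = TRUE, dependency = TRUE)

# Who shows up: pull named entities tagged as persons.
people <- subset(entity_extract(parsed), entity_type == "PERSON")

# What they do: keep sentence subjects and verbs from the parse.
subjects <- subset(parsed, dep_rel == "nsubj")
verbs    <- subset(parsed, pos == "VERB")
```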
So a quick snapshot of how we're doing when we actually try and run these things: we hand-coded 50 different meetings.
we hand coded 50 different meetings. And so in the top, you can see how we
performed with respect to attendance. The machine has false positives but it
doesn’t have false negatives because it’s not gonna generate a name out of thin air. So, it’s only going to find cases where
someone wasn’t actually in the meeting, but their name came up. So, it might be a reference to President
Obama, or it might be someone saying, let’s invite so and so next time, and so
that’s why the hand coding tends to track below machine coding, is because some of
those names weren't actually present. When we look at the part-of-speech tagging, we're a little more all over the board: we hit or miss, sometimes high, mainly low. This is a noisy thing even for human coders; it was actually sort of vexing, when we sat down to code those 50 meetings, how much ambiguity there was even for us in saying who did what, and dealing with things like generic nouns that refer to multiple sets of people. So I think this is an ongoing
thing to be explored. And then the final step, and
we use this very simply right now, but we take the verbs that come out the back end and we feed them into the VerbNet lexicon dictionary. Linguists are all about verbs, and they have great classifiers that'll tell you way more than you wanna know about what this verb is and what it means. Right now, we use it to filter out past tense and future tense: we don't wanna know what someone did in the last meeting, we don't wanna know what they're gonna do next time, we wanna know what
find if you’ve applied this is the way people write up these types
of documents is very different than how we formally classify language. And so Watson’s program thinks there’s
a lot of physical activities going on at these meetings because every time
someone gives a presentation, the person who’s taking the notes says, so
and so walked us through the new report.>>[LAUGH]
>>Right, and so it says, wow, these people are walking around, there's lots of activity, and it's sort of weird. That's why we use this for
filtering right now and not for a live analysis, cuz we need to spend
a lot more time with these verb types. So, again, thankfully Lauren
did all the heavy lifting. Essentially here, we're going to use a temporal ERGM model; we're going to use Philip Leifeld's btergm package, and it's not quite as complex as it sounds. It's essentially a bootstrapped version of the temporal ERGM that Lauren is using. The big advantage here is that it has a major speed boost. It uses a pseudo-likelihood estimator and then bootstraps that, so the standard errors are not biased downwards. And the upshot, and I know it sounds silly to say a large network when I have 770 people, right, that's small data, but for ERGM models that can be a big deal, right: when you have upwards of 500 nodes and you're fitting an MCMC-driven ERGM, if you're not careful that can be weeks, right, or you might never even get it to fit. So the ability to get something
that gives you results in a minute is a big plus. And so that's sort of the exciting takeaway about the [INAUDIBLE] versus a lot of the other ERGM modelling tools that are out there. And essentially we're just modelling this network change over time across these [INAUDIBLE].
So a quick snapshot of the types of things we learned from this. We find that the people who
started out as leaders? They’re leaders the whole time. So people don’t sort of rise up somewhere
in a 16-year process and take charge. If you’re in charge at the start you’re gonna dominate
the conversation at the end. We found that anyone who folks said, this person has a key sign-off role, they tend to dominate the conversation as well. If you're vested with a little bit of veto power, you're gonna talk more, you're gonna present more, you're gonna dominate the process. The interesting thing is after
the license is approved, it enters this five-year implementation phase and
there’s no more mandate to collaborate. And so what we see after this phase, when
we interact the stability of individual interactions with this phase is
that the stability drops, right? And so we see less regular interaction, we
see people who just show up in one meeting and they don’t show up again for a year. So it appears that that mandate from does
sort of provide some key support that keeps people coming and
keeps them participating. Which matters because meetings suck,
right? And it takes time to drive there and
you have better things to do. And so anything you can do to sort of support people continuing in this type of collaborative process can be valuable. So, wrapping up, for a lot of network
research, actually running the model, it’s like 10% of the work, right? Getting a network that you
can model is the real hurdle. And if you can pull it off, graph models
are useful even if you do not have something that’s sort
of a standard network. So a lot of times we think it has to be
a sort of a truly social network, but really what we use graph models for
is when we think there is some sort of structural interdependence, when we
think that the neighboring ties matter. So whatever the nodes and edges are, if you think there’s some sort
of interdependence that you can’t capture simply by sort of applying fixed effects
and random effects for particular actors, you might need to explore an ERGM-type approach. And then finally, if I can do this,
any of you can, right? This is not high-level text mining,
for instance. Someone like Dan is so far beyond in terms of what he’s actually
doing when he’s processing language. We’re doing fairly simple things
like spitting out people and spitting out the actions they take. But we can learn a lot from that. So I’m excited about sort of
applying these types of simple tools to important real world problems. So with that I’ll thank you and
I look forward to your feedback.>>[APPLAUSE]
>>Questions?>>So I’ll ask a question.>>Sure.>>So we see a lot of stuff about
networks and these graphical models, and it's a very active area at the moment. Is there a good reference?>>Yes. My favorite is the one which I think one of your advisors is also an author on, [INAUDIBLE]?>>Yeah.
>>Yeah, so that's my favorite textbook, but there's a lot online; the University of Washington has a ton of resources. They have a super active user group,
so I think the first thing I’d go to is statnet.org
>>The thing is, a lot of us are actually fairly knowledgeable on statistical methods in general. My view is it's a question of finding the reference. I can mention a lot of papers, but presumably there's one or two papers out there that, for instance, would explain, when we say bootstrap, you need to bootstrap over something that's independent, otherwise you're doing an incorrect bootstrap. So I'd be interested to know, for example, what's the bootstrap that's being used?>>So the way that one works is, and
what they do is, it only works for temporal network models, because they actually re-sample those network periods. So they'll pull a set randomly and they'll run the model on those, and they'll do it again. And so they have a 2012 paper, and I forget where it's published, but it essentially shows that when they apply that method, the pseudo-likelihood estimates for a temporal ERGM, they do generate reasonable standard errors. But I think the high-hanging fruit out there is still figuring out how we can do a similar thing for
a non-temporal network model.>>Any other questions? Okay.
Well, thank you again.>>[APPLAUSE]>>Let me start. We read a lot of stuff
about machine learning, deep learning, this, that and the other, and the previous paper particularly showed a lot of the tools out there that are being used. And I know really nothing about this, but I just want to know things, all right? So I committed, rather foolishly, to teaching a PhD course, which first meant six weeks on machine
learning to force myself to learn and
spend a lot of time on it and so on. I figure I can explain it in 30 minutes,
so we’ll see how it goes. Okay, so this is the key,
this slide, all right? The goal with the machine
learning is prediction, right? That’s the goal,
which is often different from a lot of what we wanna do which is
to get a beta hat, right? Which has some physical interpretation. And in pure machine learning
there’s no structural model. There’s no knowledge that there’s
a gravity model of trade and trade flows between countries will
depend on the size of the countries, the distance from the countries and so on. It’s just throwing in this data and
then there’s some algorithm, and given that algorithm,
it spits out an answer, right? So there's no kind of structural model, no model in the purest form of machine learning. That's why it's called machine learning. But it's an algorithm, and
it’s using existing data, okay? These train the machine to come up with
a prediction model, and then this model is then used to apply to a different set
of data to make some predictions, right? What we would call out-of-sample prediction, that's what it is, okay? And then you've got to be concerned about overfitting the existing data, all right? So if you like, there's some
true expected model of y conditional on x; that's what we want to fit. But what it's aiming for is the y that's observed. So if there's an outlier, it's going to be chasing that outlier and explaining that outlier well. When you apply it to a different data set, the outliers are somewhere else; it's just not gonna do as well. So there are methods to guard against overfitting. And this is really something that we just don't do very much of, all right? If you like, an adjusted [INAUDIBLE] is some attempt. But really I'd say this is perhaps the biggest difference between machine learning and what we do. I think there's a lot more
concern about overfitting, right? So there are lots of different algorithms. They go by names like deep nets, [INAUDIBLE], and so on, right? And that's rather like saying there's a logit model, there's a Gaussian model, there's a whatever, but there's so many of them. And this is the hard thing: the different algorithms work better on different types of data, right? So a regression tree might be fantastic for one type of problem, but hopeless on some other problem. For some other problem we'd want, I don't know, a neural network, or all that, so okay. And then as we saw from the previous talk, forming the data to input could
be an art in itself, right? There’s a term for this,
data carpentry. So generally, in the problems that I work on, that most of the people here work on, usually we have a good idea about what Y is and what the Xs would be, right? But don't ask me how you get data out of a photo of a face and then do facial recognition, all right? How do you get Ys and Xs out of that? I have no idea, right? But the thing is, that's a problem that
a lot of people are dealing in, so people who deal in that problem have got off-the-shelf methods to capture the features, right? It'll be distance between the eyes, etc. And what could go wrong? Well, we always say correlation does not imply causation, right? This is where, in any field, your base knowledge comes in; the starting point is that trade will depend on these distances and so on. And then, as we saw in the first talk, the trade partnerships will matter. And the challenge here is
you bring in the models, you bring in some structure. But also, we generally want, we know that getting causal inference can be difficult. And what people are doing is, if you have some causal approach that along the way requires some part of it to be flexibly modeled, then maybe we use the machine learner to get that flexible modeling part, and put it back in with the rest of our methods. But then we have to make sure our inference guards against the fact that we could be back in an overfitting situation when we use the machine learner, okay? So this is the bottom line,
right, this is what’s going on. And then a lot of my talk is just now,
I think, mentioning buzzwords.>>[LAUGH]
>>Fail. Start. Okay, I’ll be good. Where is that? It’s behind?>>I stole it.>>[LAUGH]
>>This is good for that. Okay, so
the topic is called machine learning, but other names are statistical learning,
data learning, data analytics. And a big thing is, the term big data is
used, but it doesn’t have to be big data. And then the literature distinguishes
between supervised learning, where there is a Y and an X, and we call that regression. Within that there's regression, where our outcome, well, it's continuous, but it could also be something like a count or whatever, but it's one where we have some, I guess, cardinal measure, right? A 6 is 2 times 3, all right? The alternative is classification, right? Which is where we would use a logit model or a multinomial model. And then in unsupervised learning there's no Y. All right, unsupervised learning, we are just trying to get Xs and
see some pattern in them. And a big one would be in psychometrics
where we’ve got all these attributes on people. We put them together and we see, there’s this cluster of people
that have these common attributes? And then we look at them and
we say, yeah it looks like that person is an aggressive person,
we’ll say that’s a type A individual. So what I’m gonna focus on is just
the first one, the regression, okay? So a little bit about classification and really almost nothing
about cluster analysis. Okay, so
a good thing is this not over-fitting, so the term is a training dataset, or an estimation sample, and
then a test dataset, or a holdout sample, or
a validation set, okay? And then kind of the fitting along
was here and then deciding whether it’s good or not, having a race between
different models, would be using mine. So the way we might get
the beta house from here and then use it in predicting here. And then see which model was
getting beta hats that did a better job of predicting here. So as I've shown in this example, the big thing is mean [INAUDIBLE]. With machine learning you got to have a loss function. And the loss function that's used for a continuous [INAUDIBLE] is mean [INAUDIBLE], and think of this as, like we've done, some, we have multiple observations [INAUDIBLE]. So this might even be quite [INAUDIBLE]. This is the average of them. And then the loss where
it was two [INAUDIBLE], okay? So which shall I go for? Shall I go for the Y that connects the dots? Or should I do some sort of smoother, right? Now what mean squared error is doing is balancing the two, all right. So this prediction here could be biased, all right? The truth is somewhat higher, but I've gotten a smoother fit. And mean squared error is the sum of bias squared plus variance. But a lot of the machine learners use biased estimators, even asymptotically, because of this smoothing, right?
with nonparametric regression, right, nonparametric regression if
you use cross validation, that’s essentially using
this mean squared error. Okay, so I’m gonna create an example
Okay, so I'm gonna create an example where the data-generating process is quadratic, believe it or not. But when I run the regressions I'm deciding whether it's x cubed or x to the fourth or x that matters most. And here I set it up so that I had 20 observations here and 20 here, and I set it up so that the x's were the same, all right, in my estimation sample and my validation sample. Using the estimation data, the quality of my fit is this. Now I've got the same x's over here, so it's gonna be the same predictions, but I'm comparing to the actual test data, and you can see that I'm just getting more outliers here, right? It's just not predicting as well out of sample. Okay, so when I do things in sample with the linear model, quadratic, cubic, quartic, it's getting better and better and better. But when I go out of sample, the lowest is actually the quadratic, which was the data-generating process I used here. It didn't necessarily have to be this way, because there is randomness here, but in most simulations it's going to turn out that way. So I just have this here to try and show that it's very easy to go wrong and particularly to overfit, right? The in-sample fit is a sign that quartic is best, when my data-generating process was a quadratic.
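For concreteness, a minimal sketch of that kind of exercise in R (the coefficients, seed, and noise level here are illustrative assumptions, not the numbers behind the slides):

    # Quadratic data-generating process; 20 estimation and 20 validation points
    # with the same x's; compare polynomial fits of degree 1 through 4.
    set.seed(42)
    x <- runif(20, 0, 10)
    train <- data.frame(x = x, y = 2 + 1.5 * x - 0.3 * x^2 + rnorm(20, sd = 3))
    test  <- data.frame(x = x, y = 2 + 1.5 * x - 0.3 * x^2 + rnorm(20, sd = 3))
    mse <- function(y, yhat) mean((y - yhat)^2)
    for (p in 1:4) {
      fit <- lm(y ~ poly(x, p, raw = TRUE), data = train)
      cat(sprintf("degree %d: in-sample MSE %.2f, out-of-sample MSE %.2f\n", p,
                  mse(train$y, fitted(fit)), mse(test$y, predict(fit, newdata = test))))
    }

In-sample error keeps falling as the degree rises, while the out-of-sample error will typically, though not on every draw, favor the quadratic.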
Okay, so what could we do? Well, what I've done here is basic validation. But I've lost precision, right? To do this I had to say, I've got 40 observations, I'll only use 20 in estimation, so I can keep the other 20 outside for my validation. And furthermore, the results will depend on where I do the split; there are many 50-50 splits I can do with the data, okay? So what k-fold cross-validation does is this: here I've got, say, five splits. We randomly split the data into fifths. We estimate on four-fifths of the data,
and predict on the other one-fifth, and we can do that five times, right? So first, I hold out the first fold, estimate on the others, and then use those estimates to predict on the first fold. Then I'll hold out the second fold, so I'll estimate on the first, third, fourth, and fifth folds, get beta hats there, and then predict on the second fold. So this shows the root mean squared error I got when I left the first fold out, then when I left the second fold out, the third, fourth, fifth. And this was for one particular model, for the quadratic, okay? And then the average is 12.3935, okay? So I've five times estimated models on four-fifths of the data and predicted on the remaining fifth, and I've got the average mean squared error across those five. Now I'm gonna do the same thing, not just for the quadratic but for the linear model, the quadratic, cubic, and quartic, and we'll see which gives the best result. And now I don't know why this is, [LAUGH] actually, when I look at it, why the linear came up the same as the quadratic. But it is showing the quadratic has the lowest mean squared error. And again, because of randomness, and only 20, well, 40 observations, it didn't necessarily have to turn out that way; it won't turn out that way on every realization of a random sample.
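A sketch of that five-fold procedure in R, done by hand (the data here are simulated, so the numbers will not match the 12.3935 from the slide):

    set.seed(1)
    n <- 40
    d <- data.frame(x = runif(n, 0, 10))
    d$y <- 2 + 1.5 * d$x - 0.3 * d$x^2 + rnorm(n, sd = 3)
    fold <- sample(rep(1:5, length.out = n))        # random assignment to 5 folds
    for (p in 1:4) {                                # linear up to quartic
      cv_mse <- sapply(1:5, function(k) {
        fit <- lm(y ~ poly(x, p, raw = TRUE), data = d[fold != k, ])
        mean((d$y[fold == k] - predict(fit, newdata = d[fold == k, ]))^2)
      })
      cat(sprintf("degree %d: cross-validated MSE %.2f\n", p, mean(cv_mse)))
    }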
Okay, so I spent a lot of time on that because I really think it forces us to think a little bit about what we're doing in our own work. Where do we run the risk of overfitting? It depends on how much structure you put on the model. So, for example, a linear regression where I've specified the functional form up front, right, that's quite a structured model. In that case there is theory out there on information criteria, which give a penalty for the model size. Okay, so the bigger the model, the better the fit: as we have more complex models, the log-likelihood goes up and minus the log-likelihood goes down. We want this to be small, but then there's a penalty. The AIC is actually a pretty weak penalty; the BIC, the Bayesian information criterion, is a bigger penalty. Okay, and there are other ones, such as Mallows' Cp and, I suppose, adjusted R squared. They're much more model-specific, and in some circumstances you can show that cross-validation is doing essentially the same thing. Cross-validation is just the universal thing, but it's computationally more expensive. So for some problems it may be good, maybe for some initial analysis, to use an information criterion to speed things up, and then later on do the cross-validation more properly. Even cross-validation, I mean, it's pretty arbitrary. Why is it that we're using bias squared plus variance? Why not two times bias squared plus variance as our loss function? And underlying that, why use a squared loss? Why not use an absolute loss? So I mean, there is arbitrariness. Okay, so when we use AIC and BIC, in this example they both come up with the quadratic as best. Because the AIC doesn't have such a big penalty, there wasn't so much difference between the quadratic and the quartic, whereas there was a bigger difference with the BIC.
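In R the information criteria are one line each; a sketch on simulated quadratic data (illustrative numbers only), where AIC = -2 log L + 2k and BIC = -2 log L + k log n:

    set.seed(1)
    d <- data.frame(x = runif(40, 0, 10))
    d$y <- 2 + 1.5 * d$x - 0.3 * d$x^2 + rnorm(40, sd = 3)
    for (p in 1:4) {
      fit <- lm(y ~ poly(x, p, raw = TRUE), data = d)
      cat(sprintf("degree %d: AIC %.1f, BIC %.1f\n", p, AIC(fit), BIC(fit)))
    }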
All right, so now we're getting into regression. And the first thing is that generally you think, linear regression, pretty restrictive, all right? But if you had a ton of data, you could put in all sorts of interactions and quadratics and so on, right? A linear regression might actually do a very good job of predicting. And it's easy to work with. I mean, when we say linear regression with all those interactions and so on, it is nonlinear in the explanatory variables, but it's linear in the parameters, all right? And that makes it very quick and fast to estimate, okay? And then the other reason for going with linear regression is that it's the basis of the more complex methods.
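As a small illustration (using R's built-in mtcars data, nothing from the talk itself), a specification that is nonlinear in the regressors but still linear in the parameters, so ordinary least squares handles it instantly:

    # Pairwise interactions plus squared terms: flexible in x, linear in beta.
    fit <- lm(mpg ~ (wt + hp + disp)^2 + I(wt^2) + I(hp^2), data = mtcars)
    length(coef(fit))
    summary(fit)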
Okay, so, methods to reduce model complexity. Don't go too overboard: choose a subset of regressors; shrink regression coefficients towards zero; reduce the dimension of the regressors, so if we've got 100 regressors, I'll just go with the best few linear combinations of those regressors. I'll go through these. And then there's one I forgot on here, but it may predict well. Okay, so the first one would be easiest if you had things set up just as a regression. This assumes we've decided: these are the 133 regressors I'm gonna consider, right? I'm bringing in my interactions and so on, and then we wanna choose the best subset of those 133, okay? And there are these methods that say, okay, let's first of all have a rule for what would be the best model with one regressor, then the best with two, with three, so you come up with the best model for one, two, three, up to 133 regressors. And then you go and do the cross-validation and say which of those predicts best out of sample, or we could use AIC, or something that penalizes the model size. In R it's all automated, right? Once it's set up it's a one-line command that will spit out the best model of each size. And not only that, but this is a standard problem, so the algorithms that are used will be about 100,000 times faster than if you tried to code it up yourself. There's a thing called the leaps and bounds procedure that makes this pretty quick.
The bias-variance tradeoff, I kind of explained that already. Shrinkage methods. Okay, so with shrinkage methods the objective function is the sum of squared residuals, and to avoid overfitting, or to reduce model complexity, they shrink the coefficients towards zero. One way to shrink towards zero is what we typically do, which is, a regressor is either in the model or it's not. So we test statistical significance of regressors, and in some areas you will exclude things that are statistically insignificant, okay? Well, the lasso is doing that, but it's not doing it using statistical significance. Okay, and the shrinkage methods work on the parameters themselves, on the betas. And the simplest ones treat each beta as having the same impact, right? So the standard thing to do is, well, you have to rescale the regressors and standardize them, so that they each have mean zero and variance one, right? In that case each beta is telling me, if I have a one standard deviation change in this x, then what is the change in y. It's not clear that's necessarily the best. What's the difference between a one standard deviation change in x versus a one standard deviation change in x squared? Maybe I should have something cleverer than that, but this is the default. And if you wanted to standardize your data in some other way, you could always do it and then put it into these methods, okay?
So the ridge penalty is a multiple of the sum of the squared coefficients, and the lasso penalty uses absolute values, okay? So let's, first of all, do ridge. For the ridge we're going to minimize the sum of squared residuals plus lambda times the sum of the squared coefficients. So as we add more regressors this sum is going to go up, and that's a penalty, because we wanna get the whole thing down. And this lambda is a tuning parameter; this lambda is like a bandwidth when you're doing a nonparametric regression. And one can use cross-validation methods to choose it properly, or you could use AIC or BIC, okay? When you go through the algebra, and I just put this in, the ridge estimator is (X'X + lambda I) inverse times X'y. So that makes it clear we're dealing with what we would think of as a biased estimator, because the least squares estimator would be this without the lambda I. Okay, now what the ridge does is shrink all coefficients towards zero, and it's quick to compute for many different values of lambda.
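A direct sketch of that formula in R on standardized regressors (illustrative data; in practice you would use a package that computes the whole path of lambda values):

    X <- scale(as.matrix(mtcars[, c("wt", "hp", "disp")]))   # mean 0, variance 1
    y <- mtcars$mpg - mean(mtcars$mpg)
    lambda <- 10
    ridge <- solve(t(X) %*% X + lambda * diag(ncol(X))) %*% t(X) %*% y
    ols   <- solve(t(X) %*% X) %*% t(X) %*% y
    round(cbind(ols = ols[, 1], ridge = ridge[, 1]), 3)      # ridge is shrunk toward zero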
The lasso instead has an absolute-value penalty, and this next diagram shows what's going on. So for the ridge, we have beta 1 and beta 2. Okay, so we're minimizing the sum over i of (yi minus beta 1 x1i minus beta 2 x2i) squared. That's a quadratic, so it'll have these ellipses, and this point is what least squares would have given us, right? But we're restricting ourselves to beta 1 hat squared plus beta 2 hat squared lying inside this circle. So we're gonna have to move down in this direction, and as we move down in this direction we're moving away from minimizing the sum of squared residuals. So this is a larger sum of squared residuals, a larger still sum of squared residuals, larger still, and this is the best we can do subject to the constraint. And because the way we're constraining things is this circle, and this is an ellipse, we're very likely to end up where both beta 1 and beta 2 are nonzero; we're not likely to be here or here, right? Whereas with the lasso, the constraint was in terms of absolute values, so this diamond is our constraint, and now for a variety of ellipses it's much more likely to be in a corner. And a corner means we're dropping a variable, right? And the lasso in particular has certainly been very popular, because of that.
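A sketch with the glmnet package, which fits both ridge and lasso and picks lambda by cross-validation (assumes glmnet is installed; mtcars is only a placeholder dataset):

    library(glmnet)
    X <- as.matrix(mtcars[, -1])            # all regressors other than mpg
    y <- mtcars$mpg
    cv_lasso <- cv.glmnet(X, y, alpha = 1)  # alpha = 1 is lasso, alpha = 0 is ridge
    coef(cv_lasso, s = "lambda.min")        # some coefficients set exactly to zero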
So, dimension reduction: like I said, it's just going from a large number of x's to a smaller number of linear combinations of the x's. And high-dimensional models, what that means is that the number of regressors is large relative to n. In that case there can be some problems, because some of the methods require that we have more observations than parameters. There are some solutions.
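One simple version of dimension reduction is principal components regression; a sketch in R (mtcars again, purely for illustration):

    X  <- scale(as.matrix(mtcars[, -1]))
    pc <- prcomp(X)                                  # principal components of the x's
    d  <- data.frame(y = mtcars$mpg, pc$x[, 1:3])    # keep the first three components
    summary(lm(y ~ PC1 + PC2 + PC3, data = d))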
Nonlinear models. Okay, so polynomial regression, I've already talked about that, quadratics and so on. Then there are step functions, regression splines, smoothing splines, B-splines, wavelets. What these methods do is kinda break things into pieces. The simplest would be, if I had a scalar x, to just split it into deciles and fit a linear regression on the lowest 10% of the x's, and then another one on the next 10%, and another one, and then make sure that those pieces connect, okay? Well, the big one of these is cubic splines. It's done with cubic regressions and you can very easily set it up. It's very simple to not only have the pieces touch, so it's connected, but to have the first derivatives and the second derivatives the same at the points where they touch. And then what those methods will do is, you do worry about the fit in the tails going out too far, so they'll linearize the tails. And from what I can see, maybe the real problem is moving into multiple dimensions, but for a one-dimensional thing, to me this just makes so much more sense than doing one global polynomial regression, right? I think the big reason for the usual approach is you can prove a lot more things, but those methods generally have a fixed bandwidth, whereas this is essentially adaptive: it doesn't have to be every 10% of the data, but where your data are dense the pieces are narrow, and you're still using the same number of observations as in the sparser areas.
idea of just how much stuff there is out there, right? Any one thing is not gonna too hard,
right? If someone can tell you that your problem
leads to the machine learning methods that are out there and so on,
you’re in business, right? But if you have to go out on your own,
no, you wanna go and talk to someone who’s spent
many years knowing about all these different methods and
language such as this. Okay, neural networks, in neural networks, So Y is predicted by Xs and what we’ll say is that my addition function will be this. And this is a linear
combination of some Zs. And then the Zs, in turn, are in linear
combinations of things at a lower layer. And you keep going back until you get to,
all right, so it’s a linear combination this and then somehow
we’re tying all these things together. Next level, we are getting more and
more and more and more. And somehow or other this works fantastic.>>[LAUGH]
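A tiny sketch of a single-hidden-layer network in R with the nnet package that ships with R (inputs are scaled first, which these fitters generally want; the architecture and penalty are arbitrary choices):

    library(nnet)
    set.seed(1)
    d <- data.frame(scale(mtcars[, c("mpg", "wt", "hp", "disp")]))
    fit <- nnet(mpg ~ wt + hp + disp, data = d,
                size = 3, linout = TRUE, decay = 0.1, trace = FALSE)
    cor(d$mpg, predict(fit))   # in-sample fit of the network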
>>Okay. All right, so they're particularly good for prediction: speech recognition, image recognition, Google Translate. Google Translate was trying to do things the way we do things. It was trying to teach what the rules of language were, all right? Here's the dictionary, here's how we put things together in English, here's how we put things together in French, right? And then they got this guy from Toronto to come in, and with about ten people working with him, in three months he'd improved things as much as Google Translate, doing things the old way, had been able to improve them over the last few years, all right? And so all of a sudden they threw all their resources in. And they must have done it for a lot of the translations now; they're rolling out several translations a week, okay? And so this is off-the-shelf software. Right, so first of all you use things to get your image, or text, or whatever, into some sort of data. I think now it's estimated using a thing called stochastic gradient descent. So you bring randomness into the optimization, rather than doing something like a Newton-Raphson algorithm.
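A toy sketch of stochastic gradient descent in R for a simple linear model, just to show the one-observation-at-a-time updates (learning rate and step count are arbitrary):

    set.seed(1)
    n <- 1000
    X <- cbind(1, rnorm(n))                  # intercept and one regressor
    y <- X %*% c(2, 3) + rnorm(n)
    beta <- c(0, 0); rate <- 0.01
    for (step in 1:20000) {
      i <- sample(n, 1)                      # one randomly chosen observation
      resid <- as.numeric(y[i] - sum(X[i, ] * beta))
      beta <- beta + rate * resid * X[i, ]   # gradient step on squared error
    }
    beta                                     # should end up near c(2, 3)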
I’ve been to their websites, and they promise you the world,
and it’s all very easy.>>[LAUGH]
>>Haven’t tested that theory out. And okay, I’m almost out of time. Regression trees, okay,
so regression trees, you simply use squared error loss. So the starting point is the sum of (yi minus y-bar) squared, okay? Then what you do is think of every possible way you could split up the data, and for each split you use the means within the two pieces, all right? So here it's discovered that if I split at education less than or equal to 12 versus education greater than 12, the sum of (yi minus y1-bar) squared over these people, plus the sum of (yi minus y2-bar) squared over those people, is the smallest of any split. And then we do it again and think of every split, and it turns out here the next thing to do was to split this group again, all right, by male, female, then age, and so on, and you get the tree, okay? The trouble with this regression tree is it's what's called a greedy algorithm. When it made the split here, it didn't think about the consequences further down, all right? Once it splits, it's just set forever.
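A sketch of a regression tree in R with rpart, which ships with R and does exactly this greedy splitting on squared error (mtcars as a placeholder dataset):

    library(rpart)
    tree <- rpart(mpg ~ wt + hp + disp + cyl, data = mtcars, method = "anova")
    print(tree)   # each node reports its split rule and the mean of y in that node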
So more refined things, random forests and so on, what they do is bring in randomness. One way to do it would be to subsample your data and grow trees on those subsamples, right?
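A sketch with the randomForest package (assumes it is installed), which grows many trees on bootstrap subsamples, tries a random subset of regressors at each split, and averages the predictions:

    library(randomForest)
    set.seed(1)
    rf <- randomForest(mpg ~ ., data = mtcars, ntree = 500)
    mean((mtcars$mpg - predict(rf))^2)   # out-of-bag mean squared error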
And I don't have time to go through all of them, but you've got all of this in here. Some of these methods are ways of getting a number of different predictions: maybe we started the tree this way, maybe we started it that way, and so on. And combining forecasts, combining forecasts is usually viewed as the best thing to do. So there's an annual prediction competition around the world between teams of graduate students. And was it last year or the year before, the Stat Department here at Davis, one of the teams of graduate students, they were using about four different prediction methods. And their data was simply codes from some Internet shopping company; all they had were these numbers and letters, and there wasn't even anything saying the first three digits mean one variable and the next six mean something else. It was just completely anonymous. Somehow these machine learners and so on get some good predictions.
Classification: so now we've got categories, with no natural ordering. The simplest approach would be the logit or multinomial models I mentioned, and then a lot of people use linear and quadratic discriminant analysis, and support vector classifiers, with support vector machines as generalizations.
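A quick sketch of two of those classifiers in R, a logit model and linear discriminant analysis from the MASS package that ships with R (the binary outcome is made up for illustration):

    library(MASS)
    d <- data.frame(high = factor(mtcars$mpg > 20), wt = mtcars$wt, hp = mtcars$hp)
    logit  <- glm(high ~ wt + hp, data = d, family = binomial)   # logit model
    ldafit <- lda(high ~ wt + hp, data = d)                      # discriminant analysis
    table(predicted = predict(ldafit)$class, actual = d$high)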
I'm way over time, okay. There's also unsupervised learning, and then causal inference. So basically the idea with causal inference is, if you're familiar with selection on observables, right: if I could just get the right controls in there, then, for my one variable of interest, I could believe my results, okay. So what we want is to get really good controls. Well, if you could use a machine learner to get those controls, the machine learner is gonna do a better job of prediction, and then your controls will be controlling for more. And then you might feel better about assuming that anything that hasn't been controlled for is not gonna be correlated with the variable of interest. So that's one example.
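The talk doesn't spell out a specific procedure, but one common way to do this is to let the lasso pick the controls that predict the outcome and the controls that predict the treatment, then run an ordinary regression with the union of the two sets, in the spirit of post-double-selection. A rough sketch on simulated data (every name and number here is illustrative):

    library(glmnet)
    set.seed(1)
    n <- 500; p <- 50
    X <- matrix(rnorm(n * p), n, p)                  # many potential controls
    d <- X[, 1] + rnorm(n)                           # "treatment", driven by X1
    y <- 1 * d + 2 * X[, 1] + rnorm(n)               # X1 confounds d and y
    sel_y <- which(as.numeric(coef(cv.glmnet(X, y)))[-1] != 0)  # controls that predict y
    sel_d <- which(as.numeric(coef(cv.glmnet(X, d)))[-1] != 0)  # controls that predict d
    keep  <- union(sel_y, sel_d)
    summary(lm(y ~ d + X[, keep]))$coef["d", ]       # estimated effect of d, close to 1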
And I've got this here, so just to finish off. [SOUND] Okay, if you go to my website, I've got, sorry, you have to go down a bit, but I've got this thing here, machine learning. I've got 120 slides there and discussion of [INAUDIBLE], but I particularly like this book, which is an introduction to statistical learning with applications in R. It's available as a free PDF, or you can get it for about $25, not as a hard copy but as a soft cover. And then I've got what is going on at the moment in economics. And then finally,
I said that there was this book, a little cautionary tale. Sorry. Weapons of Math Destruction.>>[LAUGH]
>>How big data increases inequality and threatens democracy.>>[LAUGH]
>>Okay, and the idea is that agencies have these algorithms to predict, what is the probability of someone, I guess, not honoring their debts, okay. It's fantastic at that. But is that the same thing as a good predictor of whether someone renting a house will make their monthly payment? All right, so what happens if you want to rent a house? You've gotta watch your credit score, right? It's just very easy to misuse these things. Another issue is they're very black-box, all right: who should decide whether someone goes on parole? Is it a judge? Or is it a model that spits out, there is a 93% chance this person will commit a crime in the next year, all right? Or who should, okay, anyway, I think I'm out of time. Okay, I'm done.>>[APPLAUSE]
