Okay, so

I tried to group papers into groups. And this first session is perhaps

some new methodologies that a lot of people across different areas of

social science are interested in. Our first speaker is Lauren Peritz. I’ve asked speakers to talk for 25 minutes and leave five minutes for the discussion. [SOUND]

>>So, did you->>Yes. Okay. Sorry. [INAUDIBLE] [INAUDIBLE].>>Great, thank you. Okay, hopefully not too technical. All right, well thank you so much for inviting me to participate

in this conference. I’m delighted to be able to

present some ongoing work. The project asks: do preferential trade agreements increase trade? And I use network analysis statistical models to try to get at this question. Let me begin with some current

events to help motivate the project. As most of you are probably aware,

the Trans Pacific Partnership is a multi-lateral trade agreement

negotiated among 12 different countries, including the US, Japan,

Korea, and a number of others. And the Obama administration

had been a major proponent, as a way to expand trade with

numerous Asian and Pacific partners. And to lock in trade rules that

would be strategically favorable to the United States going forward. One of the many justifications has been

to secure favorable trade relations with partners that might

otherwise be drawn to China. However, the Obama administration,

despite having negotiated for several years and arrived at a highly

detailed multilateral treaty, encountered resistance at

the ratification stage. Dissent in Congress meant that

the trade agreement did not go up for a vote before the 2016 election. And then one of the first moves of the new

Trump administration was to withdraw the US from

the Trans Pacific Partnership entirely. Among other reasons,

the Trump administration argued that the TPP would do little to

expand exports from the US to foreign markets, instead leading to an import surge. On both sides of the aisle, there seems to

be this assumption that trade agreements such as the TPP have a very large

aggregate effect on trade flows. And both suggest that the effect would

apply not only to members of the trade agreement, but also to non-members, for example by excluding China,

which is not involved in the negotiations. So preferential trade agreements

are treaties that governments sign in order to set terms

of international trade. They typically grant each other

preferential treatment meaning more favorable market access. A familiar example is

the North American Free Trade Agreement, NAFTA, which, for the most part, eliminates tariffs and quota restrictions for the US, Canada,

and Mexico, while allowing each country to independently set their

trade policies for non-members. Notably, these trade agreements

can take different forms, with some being more stringent than others. Almost all countries are part of at

least one preferential trade agreement. And the data I use in

this study accounts for approximately 200 reciprocal

agreements currently in force, and many others that have been in

force over the past few decades. Because some countries may

have overlapping commitments, it’s informative to think of

them as forming a network. So the substantive questions

that I’m interested in, first, when do countries form

preferential trade agreements? You could think of an example. The US has a free trade agreement with

Korea and a separate one with Australia. At the same time, Korea and Australia

have trade agreements with each other. And so when Korea sets its trade policy, it’s beholden to both its separate

commitments to the US and to Australia. And so does that incentivize different

trade agreements to be formed? Certainly there are proximate reasons. Governments may wish to enhance

trade with other members of preferential trade agreements. But there’s also anecdotal evidence for

secondhand effects or externalities. When South American

countries negotiated for instance, they banded together to

gain bargaining leverage when interacting with the North American

members of the free trade area. In other words, one PTA may beget another. The second related question is, do preferential trade agreements

actually promote trade? It’s important to distinguish

between expansion of trade dependence which may bring efficiency

gains and trade diversion. If preferential trade agreements simply

redirect trade from one partner to another, there may be

little economic payoff for negotiating these

complicated legal contracts. And this has been a point of contention

as the European Union expands. To give you a sense of the scale

of the trade agreement network, let me illustrate some snapshots and

it’s evolution. This figure shows trade agreements

formed between 1966 and 1970. Most of the activity was regionally

focused, including, as you can see, the predecessor to the African Union. That activity intensified in

subsequent years with expansion of multilateral agreements. And in 1991 to 1995 we see significant

expansion of the European Union membership, for instance. So what I’m doing here is treating

each country as a node and the treaty agreements as ties or

edges between those nodes. More recently, preferential trade agreements have been

formed cross-regionally as is evident in the illustration of PTAs

formed between 2001 and 2005. So that’s more recent. Okay, so as you’ve probably gathered, the

number of preferential trade agreements in force has substantially grown

over recent decades, and this effort towards legalization has not

been limited to any particular countries. Rather, the number of countries that are

members to at least one PTA has steadily grown to the point that nearly every

country participates in this network. As countries have more trade agreements,

we can think of each country as a node, the trade agreements as a tire

edge as I had mentioned and the density statistic as

an informative measure. This network has become more dense over

time, particularly since the 1990s. So lots of work on international politics

uses descriptive network statistics as I’ve shown so far. But to really get a handle on

the mechanisms I’m interested in, I want to do something inferential. And so to figure out how trade and

PTA formation is related, we need to account for the important

indirect effects of the network structure. And the indirect effects of

trade diversion, and so forth. In the political science department

here at Davis, professors Maoz and Kenny have been very innovative in building and applying inferential network

analysis to political problems. So what I propose is to apply some of

those same insights to the domain of the international political economy to

better understand how governments use treaties to deepen their

economic ties to one another. The two research questions

I address are, first, when do countries form preferential trade agreements? This is naturally a network question, since the choice to form a tie clearly

depends on other nodes’ actions. The literature on trade

cooperation lends insight. And importantly,

there are a number of [INAUDIBLE] explanations, which I test, but

I wanna just highlight a couple for you. As I illustrated by the demise of

the Trans-Pacific partnership, domestic politics within

countries clearly matter. Governments that faced many constraints as

the Obama administration did when trying to float the TPP through a Republican

Congress had difficulty ratifying new treaties. So we wanna think of the characteristics of the countries in determining

the network [INAUDIBLE]. In addition when countries form

preferential trade agreements, there’s a stronger incentive for countries to

band together to gain economic leverage. So in this literature, [INAUDIBLE] et al. have identified closure effects. In other words, governments have

a higher propensity of signing PTAs with others to whom they’re indirectly tied. The second question is

technically linked to the first. What is the impact of the PTA

network on trade flows? The political economy

literature establishes that preferential trade agreements lead to an uptick in trade between members

in the subsequent five to ten years. Different timing of the impact has been identified when one disaggregates between the intensive and extensive margins. So how much of this effect comes from increased trade dependence? How much comes from the negative

externalities of PTAs? In other words trade

diversion away from non-members. So I’ll speak a little bit more about

the conflict with the multilateral trade regime, which people are substantively interested in. But just to give you a hypothetical

cartoon of what I’m modeling, it highlights why traditional regression

analysis would be unsuitable. By assuming the way the effect of one

node’s status on another as is required with basic regression model assumptions,

we could end up with biased estimates. So you can think of, say, four countries

which are all trading with each other. We’re not gonna worry about the direction,

a directed network, but rather undirected bilateral trade. And you can imagine that three of these

countries form a preferential trade agreement. If indeed that intensifies

trade within that community, it might detract from trade

outside the community, right? And so

this is one hypothetical outcome and that would incentivize a country

outside the community to try to join. So this is the type of dynamic

that we are interested in. Okay, so thus far I’ve discussed

the substantive puzzle and the inferential challenges. Now I wanna explain the statistical models

that allow me to counter these effects. For inferential network analysis I rely

on exponential random graph models, or ERGMs. The basic version uses characteristics

of each node, the countries, and their relationships to predict the observed network ties, the PTAs. And there are variations on this ERGM

framework which I’d be happy to discuss. There are a number of alternative network

models which have their own strengths and weaknesses. And in a recent comparison of different

network modelling alternatives, the ERGM, which I choose, seems

to best suit the substantive puzzle. So how does this work? What is the exponential random graph model? It considers the network conditional on a series of predictor terms. These predictors, network statistics,

represent configurations of ties, for example triangles of three nodes sharing a common attribute, that occur more often than expected by chance. These terms, along with their coefficients,

define the probability of each edge, and the probability of the entire network. The simplest example is the homogeneous Bernoulli model. The configuration is an edge, the predictor is the total

number of other edges. And the network may be viewed as

a collection of independent and identically distributed variables. Like this on and

off switch between pairs of nodes. Building on this baseline model,

one can also account for more sophisticated networks. So, suppose you want to model an empirical case where not all nodes are equally likely to form ties. This can be modelled with [INAUDIBLE]

that obtain maximum likelihood estimates that describe

the impact of local selection forces, including node and edge attributes

on the structure of a network. For example, if two nodes, two countries, share a common attribute, for example both being low-income countries, they

may be more likely to form an edge, a preferential trade agreement

with one another, and edges with certain attributes

may be more likely to form. For example, if there’s more bilateral

trade shared between those countries. So this term defines the support, the set of all obtainable networks, as a function

of the nodal characteristics. If you had just one simple network

statistic like an edge count, that would enter into the vector of statistics here. If you had a covariate of interest,

like whether two countries share a contiguous geographic border,

that would enter here as well. We can talk more about the mechanics

of it if that’s of interest to people. The normalizing factor,

this c, captures the summation

over the sample space. And so estimating part of this

is a computational challenge for ERGMs. And so to estimate these models I

rely on an R package by Hunter, et al, that’s available open source,

so it’s easy to get access to. So what do I find from my first analysis? Here, I model all of the predictors of the

preferential trade agreement network at various slices of time. Bilateral trade between pairs of countries

is indeed a strong predictor that they will form new PTAs with one another, even

accounting for trade with other countries. Importantly, there’s a strong

transitivity effect. So this last term, which stands for geometrically weighted edgewise shared

partners, it’s a whole long thing. Basically it gives you a sense of

the number of pairs of nodes that are connected both by a direct edge and by a two-path through another node. And this significant coefficient

points to transitivity in the network that’s beyond what may be

explained solely by nodal characteristics. This tends to suggest that countries

prefer to form trade agreements with other connected countries. These plots show estimates over time,

and what I just want to highlight for you here is that consistently over time, smaller economies have been more

prone to form these connections. Bilateral trade is a strong predictor. And over here, in more recent years governments that

have a lot of internal checks and balances, with different parties holding, say, the legislature and the executive in the US case, are less

prone to form these trade agreements. Which is exactly what we saw when

the US failed to ratify the TPP. So it may be part of a broader trend. It’s informative to compare these

estimates to standard logit model results, in other words without accounting for network effects. The table shows the two analyses

with the exact same specification, same data, save the network statistics. And trade intensity is an even stronger

predictor in the network settings, suggesting that concerns

about competition, or trade diversion may be

a little exaggerated. Okay, so an important extension to the ERG

model, it takes values, and the ties, and this allows one to model the strength

of the ties between nodes. It introduces a nonbinary

reference distribution, to allow for count or continuous data. So, these models have some problems with degeneracy. If, for instance, you have a very sparse or a very complete network, estimation can be [INAUDIBLE] challenging, with less reliable estimates. I applied the valued ERGM models

to examine two additional questions. First, as I noted,

trade agreements vary in their design. With some imposing more stringent

obligations on their members than others. One might expect that governments that

initially trade with one another chose these deeper more

stringent legal mechanisms. For example, the European Union has extremely

stringent requirements over trade. Whereas there are more

shallow trade agreements, such as between the US [INAUDIBLE]

have very minimal requirements. So, the intensity of trade predicts the stringency of these [INAUDIBLE] trade agreements, though the differences are actually quite small. Second, and I think more interestingly,

I aim to model subsequent trade as a function of previous

preferential trade agreements. So, here the question is does trade

between member states expand in the years following the implementation

of a new trade agreement. If so, is this driven by trade diversion,

or market growth? So, this table shows the year over

year percentage change in bilateral trade as a function of the formation

of new preferential trade agreements. Here I look at the ones

implemented in the preceding five years. Although I tried all sorts

of different time lags and seemed to find pretty consistent results. So, what this shows is that, reliably, new preferential trade agreements lead to an expansion in trade between

those countries, even accounting for other preferential trade

agreements in the network. I did a number of other analyses, including

focusing on the temporal aspects. Another variation of these ERG models

looks at the network evolution over time. So, one can model two separate processes,

the formation of new ties and the dissolution of existing ties. This kind of extension, however, is substantively pretty tricky

in this particular application. In the international politics literature,

some scholars have highlighted the existence of so-called

zombie institutions. So, what this means is that countries tend

to establish international agreements, which eventually become obsolete, but

they don’t bother to disband them. And so, these institutions

continue to persist on paper yet maybe have no effect on the economic

relations between the governments. This varies in practice; sometimes this happens when agreements get superseded

by new international agreements. Nevertheless, the risk of this

is that the dissolution of ties is not easily measurable. So, I do look at some of the factors that

predict how durable a preferential trade agreement is, allowing for the fact that at least some of these

ties will be meaningless ones. So, let me just kind of wrap up some of

the findings that I’ve shown to you today and some of the extensions

that I’ve explored. First and most reliably,

preferential trade agreement networks reflect established trade

relations between countries. Once we account for the indirect effect, it’s clear that countries form PTAs

when they trade more together. There are clear community effects. However, there is less support for

a balancing effect, in which countries form preferential

trade agreements to improve bargaining leverage over others. Second, preferential trade agreements

reliably promote an increase in trade, and this has been a point of

contention in [INAUDIBLE]. If I measure it as a percentage

change over previous levels, I do see an uptick in trade in

the subsequent five to ten years. The benefits for members of PTAs appear

to outweigh losses for non-members. And the ERG models unfortunately do not

allow you to pin down whether I can attribute the causal mechanism

to the trade agreement itself. Does that have any

independent causal effect, or is it merely serving as a marker for

future expansion in trade? In other words we can think

of it as not necessarily the case that preferential trade agreements actually prompt more trade. Rather it’s possible that these PTAs

are signed when countries anticipate deepening economic relations as

a way to kind of pave the way for more cooperative future interactions. So, of course, given observational data,

it’s hard to pin down causal effects. However, third,

in an analysis of these temporal models, I did find that preferential trade

agreements with legalized means for dispute resolution tend to last longer. So, you can think of

NAFTA as a prime example. When there’s a disagreement over

trade barriers that arises between, say the US and Canada,

as has been the case in many instances. Those countries can engage in a

quasi-judicial process to reach settlement rather than resort to trade wars and dissolve their preferential

trade agreement. This tendency echoes a broader finding

in the literature on international cooperation: that dispute settlement mechanisms can

stabilize an international regime, making it more durable over time. And finally, there are only slight

differences between different types of preferential trade agreements. Another point of interest in

the literature has been whether these trade agreement networks conflict with

the World Trade Organization, and here I don’t find very much

evidence in this respect. In effect,

do countries that are dissatisfied with the multilateral trade regime

as embodied by the WTO, seek to secure preferential

treatment outside of that umbrella. Well, in this respect the fears

might be a little exaggerated, because states that are involved in a lot of disputes within the WTO don’t seem to subsequently seek more

trade agreements outside of that regime. Finally, since this is an ongoing project,

I wanna outline for you my next steps. And hopefully that gives you some

sense of where I need suggestions. So, first I’m working on

rerunning the analysis. Instead of looking at total trade,

I examine the intensive and extensive margins of trade

at the country level. This distinguishes between

increasing volumes of trade along country pairs that

already had trade of some sort. And so, this allows you to get the entry

of new firms into trading relations, or new product levels, and on the other hand, the initiation of trade between

countries that previously had none. So, there’s some literature on this and I’m kinda working through what

one would expect reasonably. But one might expect that, PTAs do

little to open entirely new markets, but can facilitate the expansion of

trade between countries whose markets are already complementary. The second point I’m working

on concerns regional dynamics. As you saw, there’s clear qualitative

differences between regional clustering and cross regional

preferential trade agreements. And then finally, I’ve spent some time

working on estimating the joint relation between tie selection and

node variable influence on the network. Several statisticians have been working

on a model that combines the ERGM with random field modelling. This is an area of active research, which I’m trying to apply to my data. So,

with that said, I’d like to just wrap up. And thanks for your attention, and

I look forward to your comments.>>[APPLAUSE]

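The homogeneous Bernoulli model described in the talk can be made concrete with a small numerical sketch. This is an illustration only, not the talk’s actual estimation: the toy network, country labels, and every number below are hypothetical. In the Bernoulli ERGM the only statistic is the edge count, each tie forms independently, and the maximum likelihood tie probability equals the observed density.

```python
import math
from itertools import combinations

# Toy PTA network: countries as nodes, agreements as undirected ties.
# Hypothetical data, for illustration only.
nodes = {"USA", "KOR", "AUS", "MEX", "CHN"}
ties = {frozenset(p) for p in [("USA", "KOR"), ("USA", "AUS"),
                               ("KOR", "AUS"), ("USA", "MEX")]}

# Density: the share of possible ties that are present.
possible = len(list(combinations(nodes, 2)))   # 5 * 4 / 2 = 10 dyads
density = len(ties) / possible                 # 4 / 10 = 0.40

# Homogeneous Bernoulli ERGM: P(Y = y) is proportional to
# exp(theta * edge_count(y)), so every tie forms independently with
# probability inverse-logit(theta). At the maximum likelihood estimate,
# theta is the log-odds of the observed density.
theta = math.log(density / (1 - density))
p_tie = math.exp(theta) / (1 + math.exp(theta))

print(f"density = {density:.2f}, P(tie) = {p_tie:.2f}")  # both 0.40
```

Richer ERGM terms, covariates or the GWESP statistic, change the vector of statistics but keep this same exponential-family form.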
>>Okay, so we got five minutes for discussion. Any questions? Yes? So question,

within [INAUDIBLE] regression so you’re running a binary variable

[INAUDIBLE] on the binary variable [INAUDIBLE]

>>The trade is a continuous variable, so it’s the [INAUDIBLE]

>>Even if it’s a binomial, if it’s>>[INAUDIBLE] I tried between one and five years at the lag, and

it was pretty consistent. [INAUDIBLE] as a percentage of [INAUDIBLE]

>>Yeah, yeah, so I looked at trade dependence, and then,

I also looked at the percentage change in trade as the outcome variable in

the second portion of the analysis.>>[INAUDIBLE]

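The outcome variable under discussion, the percentage change in bilateral trade, can be sketched directly; the flow values below are hypothetical, for illustration only.

```python
# Year-over-year percentage change in bilateral trade for one country pair.
# The flow values are hypothetical, for illustration only.
flows = {2010: 80.0, 2011: 92.0, 2012: 96.6}  # billions USD

def pct_change(flows, year):
    """Percentage change in trade relative to the previous year."""
    prev = flows[year - 1]
    return 100.0 * (flows[year] - prev) / prev

print(pct_change(flows, 2011))            # 15.0
print(round(pct_change(flows, 2012), 1))  # 5.0
```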
>>Correct, yeah, the second so the first analysis was

like, treaty agreements as the outcome, looking at [INAUDIBLE] preceding years. The second analysis, I flip it around. Clearly, [INAUDIBLE] However, all of

these things are so, there’s only so much you can do. But, the outcome was a value network where

I looked at the percentage change in trade after the years following

a new preferential trade agreement.>>Okay so when you say trade

[INAUDIBLE] means that trade becomes 75 [INAUDIBLE]

>>I tried it a couple of different ways, like a one year change,

and then, like the levels>>Yeah, and so the results I’m sure they were like [INAUDIBLE].>>Is it possible [INAUDIBLE] network structures [INAUDIBLE]? Relationships trade is sequential?>>Yes.

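The network-structure question here connects to the closure effects and the GWESP term from the talk. In its simplest form the underlying quantity is the number of edgewise shared partners: for a tied pair of countries, how many third countries are tied to both. A toy sketch with hypothetical ties; GWESP then applies a geometric down-weighting to these counts.

```python
# Toy undirected PTA network as an adjacency dict (hypothetical ties).
adj = {
    "A": {"B", "C", "D"},
    "B": {"A", "C"},
    "C": {"A", "B"},
    "D": {"A"},
}

def edgewise_shared_partners(adj, u, v):
    """Number of third nodes tied to both endpoints of the edge (u, v)."""
    return len(adj[u] & adj[v])

# A-B sits in a closed triangle through C; A-D has no shared partner yet.
print(edgewise_shared_partners(adj, "A", "B"))  # 1
print(edgewise_shared_partners(adj, "A", "D"))  # 0
```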
>>[INAUDIBLE]>>Yeah, so I tried a couple different ways to

get out how much closure where all of them are linked to each other,

versus hub and spoke kind of model. I guess I haven’t found any really

striking patterns on that, but some of it’s model specification, so I gotta

play around with that a little bit more. But thank you, that’s a good guess. That would be interesting. I’ll push on that point a little further.>>So I’m not familiar with these models,

and so I found them a little bit hard to understand what

the unit of observation was? So, subgroups and so on would help. And one of the things I know is that, if you just look at bilateral

trade between countries. And then,

you’re dealing with country pairs. There’s a real serious

problem with [INAUDIBLE], that you have to allow for

correlations across country pairs. And the models that we use do

not pick up on that correlation. In fact, in my own work I found out that

you can be out by a factor of three, or three to five on standard errors. So I could count a total from these

models, but is that an issue here? The standard errors that you’re getting, do they adequately account for the dependencies?>>I mean in theory yes,

but I don’t have any good measures to tell you how much information or [INAUDIBLE]

it is.>>Well if anything happens

it will be a regression, the question is whether

there is a regression. And then the other thing was with

these models, and on many models, to what extent are the results that you get just statistical significance versus meaningful

economic magnitudes. So there are a lot of stars, but

were these big effects or small effects? So for example, towards the end,

it said to say that it’s changing. And then, I saw numbers like 0.77, which to me 0.77% is very small. Yes.>>I think with the ERGM,

it’s a really complicated setup, but because it’s really similar to statistical regression, with the coefficients that you get with this model, you can actually get [INAUDIBLE], and that is a measure that other

people at the university level can understand better than

>>[INAUDIBLE]>>Okay, so it’d be good to translate.>>I think that, that would help them.>>Yeah, I started down that road by

doing a comparison to the standard logistic results, and so I’ll push that

further and then we can get a sense of scale. Thank you.>>Okay, well thank you, Lauren.>>[APPLAUSE] All right.

Cool. My name is Tyler Scott, and I’m actually

brand new here to campus, over in Wickson Hall in Environmental Science and

Policy. And so, I’m pretty proud of myself for actually finding my way

to this building today. So, off to a good start. This sort of encapsulates

my research topic, but essentially, the talk I’m giving today is

really network analysis in a messy world. So you might not care about forest,

and fish, and water, and environmental governance, but many of you probably do have network-type

applications and likely some messy data. So hopefully, this can speak to that. There’s a lot to digest in

computational social science. And so one of the themes today, that I’m gonna present, is that even if

you don’t know how to do all of it, and you can’t, you’re not the best at all of

it, there’s some really simple tools and strategies that you can use in your

particular messy applications. So for me, what that looks like,

is that I study coordination and institutional complexity. And I’ll show you an example of

that in environmental governance. And so this is my story about how I

combined text mining and graph models, older models [INAUDIBLE] to get at these

questions that I’m most interested in. So here’s my first example, you’ve probably seen a lot in the news

recently about the City of Houston. It’s a great example about institutional

complexity in environmental governance. There are more than 1,400 different

independently regulated water providers just in the Houston metropolitan area. Which is just a ridiculous

number of independent utilities. And because of that they face

significant coordination problems. So they’re all fighting for

the same supply resources, they all have straws in

the same groundwater aquifers. They’re fighting for infrastructure and finance stability to sort of tax and

spend. And then they have to coordinate on

all sorts of service [INAUDIBLE] challenges as well. Because obviously not all 1,400 of these

systems have the capacity to sort of go it on their own. And in particularly, a lot of these are

essentially like you and your neighbors in your cul-de-sac decide that,

we’re gonna form our own water utility. So lots of interesting

coordination problems. And you have these local neighborhoods, who are also operating

almost as their own city. And it’s obviously been in the news

recently since Hurricane Harvey. You’ve probably,

if you read a lot of these articles, it’ll often mention a lot

of these small utilities. These are the ones that are having

trouble getting back on track. They’re not sure what

the quality of their water is. It’s not the city of

Houston in the center here. It’s all of these tiny thousands

of districts out in the outskirts. So if I’m interested in how they

coordinate and how they share information, how they provide clean water. The classic way we’ve done

that in social science, when we were studying the environmental

governance is with surveys. And they pose issues that many

of you are all familiar with. So the first one is recall. It’s one thing to ask students

at a high school classroom, who are your friends in class? There’s only 30 people, they can sort of,

think through all of that and name their friends. It’s another to ask people in that type

of complex institutional environment, who are all the people

you coordinate with? Second, even if they did answer that,

even if you have all the time and money in the world to keep rolling out

your own surveys, these folks are busy. And they don’t have time, they don’t

really wanna answer your surveys. And so even if they do have perfect

recall and you have all the money, you’re still not gonna be able to get good

results going forward because ultimately we care about how this changes over time. And so we have significant problems

with repeating our surveys as well. This is sort of an idiosyncratic feature

that also sort of shows up in data as well. We get this sort of

person-organization fuzziness or in her case it might be

a company-country fuzziness. Well, it’s like I sent

a survey to these water providers. And one person gets it, and they fill

out who are my coordination partners? But if you think about the city of

Houston, they have hundreds of employees. And so

if you give that list to someone else, maybe they talk with a completely

different subset of organizations. Whereas if it goes to a small

organization, that one person might be able to speak authoritatively to

all of their network connections. And so, really in practice what this means

is your discussion section gets longer as you sort of justify to reviewers why

you think it’s okay to sort of aggregate to organizations. But it’s nonetheless sort of an ongoing

latent problem, always bubbling in the undercurrent, of what

are these nodes actually representing. And surveys also particularly when

we zoom down to sort of local and low level governments,

they miss key things. So, one issue here,

is these types of small operators. They’re not necessarily showing up in

large scale regional planning gigs. So, we have sort of this high level, like, big agencies, big municipalities doing

things like regional coordination bodies. Then we have these small folks who

are just trying to keep the water going. And so, they’re not gonna show up if you

survey like regional coordination bodies. And they also likely if you ask,

well, who are your collaborators? They’re not gonna list

their actual collaborators. They’re not gonna list the people you

care about like the person they pay to do their taxes and the person they

pay to keep the water going and their finance expert. Things like that. So one thing that my co-authors and I are always trying to think through is,

how can we better measure coordination in these types of networks

without having to resort to survey data? And so one thing we hit upon is personnel

records as an alternative network measure. So, there's lots of evidence showing that expertise, skilled labor, is a key driver of environmental outcomes and of service delivery. You have to have access to people who know how to treat water, who know how to operate a debt financing scheme, things like that. So it's a key driver of outcomes that they all have to rely on. And the cool part about studying these

types of administrative processes is they also all have to file lots

of administrative records. So you can go to the state of Texas’s

government website and there’s a page for every single water provider. And they list all their contacts,

all the licensed technicians they employ, all of their other experts

that are on their payroll. And so what we argue is that personnel who work for multiple providers are a conduit for information sharing and connectivity, just like social relationships might be. And so, instead of a social relation connecting two nodes, we're actually going to use people as the connection. So, when you think about that, one question isn't necessarily

interesting at all, because small providers are inherently going to share, due to a transaction-cost economic logic: they don't need a full-time employee, so they pay somebody part time. So we're not interested in

predicting contracting, because that’s already

sort of well settled. What we’re interested in is asking this

question of if I am going to share, who do I share with? Do I share with people I’m competing with? Do I share with large organizations,

small organizations? Those types of questions. So here’s a little bit about

what this network looks like. So we’re gonna use a bipartite

network graph, and that’s essentially just

a two level network. So on one level we have people and

at the other level we have organizations. And thankfully Lauren's already done a lot of the heavy lifting in explaining how these work, so I don't have to get into that too much. Essentially, as someone in the back mentioned, we're talking about an ERGM. And in essence, you could use a logit model to estimate the same thing, or you could estimate your logit model in an ERGM. What sets an ERGM apart, typically, is

we don't just have system attributes, like what kind of system you are and how big you are. We have these autoregressive

structural terms. And these can get complex as we sort of

wade into the sociology literature for all the names they have for them. But if you’re familiar with

a spatial modeling application, it's a similar type of thing: we're simply controlling for the surrounding ties around a given tie we're trying to predict. We're saying that your neighbors matter. So if I run a water utility and you run a water utility and we already share one person, we might be more likely to share another as well. That type of thing. The other interesting thing is

sort of a messy network problem. It's very common when you start out

with a two mode network to actually collapse the matrix, take the cross

products to get a one mode network. But that would be sort of

weird in this case because you would start with this person who works

for three systems and you’d collapse it. And then what you would be modeling

is a full set of ties between all of these nodes. So an example would be the graph that Lauren showed, where you saw that in some of these multilateral trade agreements, because everyone is part of the agreement, everyone has a tie with everyone else. That can pose some serious modeling challenges when you have one contractor who works for 10 different systems and you then tell the model that all 10 of these organizations

have a tie to each other. I should also say, I think it's also more conceptually appropriate in this case, because this really is what our network looks like, and otherwise we'd sort of obfuscate that. So, a couple of quick initial results, as a proof of concept.
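The two-mode-to-one-mode collapse just described is easy to see in code. Here is a minimal pure-Python sketch; the people and systems are invented for illustration:

```python
from itertools import combinations

# Invented two-mode network: each person maps to the water systems
# that employ them.
employment = {
    "shared_operator": ["system_1", "system_2", "system_3"],
    "tax_contractor": ["system_3", "system_4"],
}

def project_to_one_mode(two_mode):
    """Collapse to one mode: tie two systems whenever they share a person."""
    ties = set()
    for person, systems in two_mode.items():
        ties.update(combinations(sorted(systems), 2))
    return ties

print(sorted(project_to_one_mode(employment)))
# One person serving 3 systems already implies 3 pairwise ties; a single
# contractor serving 10 systems would imply 45 of them:
print(len(list(combinations(range(10), 2))))  # 45
```

Keeping both modes avoids asserting all 45 of those ties outright, which is exactly the modeling headache described above.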

Size behaves as it should: larger providers simply have more employees, and because you have more employees, you're on the margin more likely to share someone. A more interesting question, then: there are lots of small providers who are

drawing upon the same groundwater aquifer. And the way Texas works is, if you have a straw in that drink, you can pull out as much as you want. And they have lots of carrots trying to get people to stop. But it's a massive

collective action problem. And so one thing that's actually interesting here is that we might not expect these providers, competing for the same resource, to share personnel, because information about what they're doing gets passed back and forth. But looking at it from another direction, this might be a way to help all of them actually work together towards the collective solution these regional groundwater authorities are hoping for. Another example, and the standard

errors are somewhat large on this. But again, at this point, as a proof of concept, we're really just interested in which direction this effect goes. We see that districts that buy and sell from one another are also

likely to share personnel. So, plans for building on this, because I think as a first step it's important just to demonstrate that these types of network ties follow basic intuition. We intend to connect personnel sharing to performance outcomes, particularly pre- and post-Hurricane Harvey: looking at whom you employ and which employees you share, and how that affects how quickly you were able to get back online. And maybe inverting that and looking at turnover, asking: if you had a severe event when the hurricane hit, did you end up switching and going with some other personnel? Data collection's ongoing for this, and

what that essentially looks like is: every month my Google calendar sends me a ping, and I fire up the web scraper again and go to all the different district websites and scrape everything. So hopefully in a couple of years we'll have something pretty cool. Because that's ongoing, I'm gonna pivot quickly to another example, to show how we take these types of data and try and extend them over time. So the second example is a hydropower facility licensing that's run by the Federal Energy Regulatory Commission. And it's a really cool example

of environmental collaboration, because FERC actually says,

you all have to collaborate. So, this isn’t a case where everybody

says, hey, let’s work together, this is going to be great. This is a case,

where a Federal agency says, you all need to get your act together and

work this out. Or else this is just going to go to court

forever, and nobody really wants that. So what we see in these cases, it might be one small dam way

up in the mountains somewhere, so like the case we analyzed was way

up in the North Cascades National Park. And we still have tons of

different stakeholder groups, so private firms, resource agencies,

local citizens, local towns. And the challenge they have to come

together on is to develop a 30 to 50 year operating plan to specify all

the contingencies about how they’re gonna operate this dam, to provide energy and

protect fish, and do all sorts of other things for

the next 30 to 50 years. So the types of questions I'm interested in, looking at these sorts of mandated instances of collaboration, are: who actually shows up when FERC says you all have to do this? Who becomes the leader in the process, or who maintains leadership throughout? And who leaves, right? One thing you may be concerned about is,

a collaborative institution becomes a coalition over time and

sort of peripheral people drop out and it’s sort of a collaborative

in name only at that point. So the challenge in this type of a process

is it’s way too big to hand code, and it’s way too long to survey. So we’re looking at again,

one relatively small dam, way up in the mountains in

the state of Washington. And this takes 16 years, and they held more than 600 meetings. We were able to find meeting summaries for 591 of them, and no one wants to read through all those. And we also can't really

follow up with the survey and ask the people what was

your participation like? Cuz they just can’t give us good responses

over what they did for the past 16 years. Thankfully, as I said, someone does take notes, and so that's our point of entry into analyzing this stack of texts. So the first thing I'm going to ask is: who shows up? And so we're gonna perform text mining, and we're gonna use a text mining

tool called entity extraction. And essentially how it works is it takes these meeting summaries and tokenizes them into phrases, and I say phrases, not words, because some phrases, like 'fast food', are really one token, right? So it uses a machine learning classification tool to say this single word is a token, but this compound phrase is one token as well. And then we're going to again use

a different machine learning technique, it’s a pre-trained classifier that then

looks at these words or phrases and says this is a person, this is a location,

this is a date, this is a number, so pretty basic, ultimately,

it's statistics on the back end, right? These machine learning algorithms have been pre-trained on thousands of books, and they use that pre-training dataset when looking at a new sentence they've never seen before.
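As a crude illustration of the shape of this step, here is a deliberately toy, rule-based stand-in (emphatically not the pretrained classifier itself): grab capitalized two-word phrases as candidate person names, then filter known non-person terms on the back end. The meeting text and names are invented.

```python
import re

# Invented meeting-summary text for illustration only.
summary = ("Maria Lopez walked us through the Hurricane Harvey report. "
           "John Smith asked about staffing.")

NOT_PEOPLE = {"Hurricane Harvey"}  # back-end filter for known non-persons

def extract_people(text):
    # Treat capitalized two-word phrases as candidate person names...
    candidates = re.findall(r"\b[A-Z][a-z]+ [A-Z][a-z]+\b", text)
    # ...then filter known non-person terms on the back end.
    return [c for c in candidates if c not in NOT_PEOPLE]

print(extract_people(summary))  # ['Maria Lopez', 'John Smith']
```

A real pretrained model replaces the regex with learned classification, but the extract-then-filter pipeline is the same.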

They say: I think that is also a person's name. And so once we get this extracted set of people who show up in these meetings, we supplement it with hand coding. So, a few examples of

what that looks like: because real people had to write these meeting notes, there are misspellings. We use fuzzy regular expression matching to pick up where 'David' has the i and the v reversed, things like that. Another example is the Dolly Varden.
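The misspelling cleanup just described can be sketched with the standard library. The actual pipeline uses fuzzy regular expressions; `difflib`'s ratio-based matching shows the same idea, and the roster names here are invented.

```python
from difflib import get_close_matches

# Invented roster of known attendees.
roster = ["David Chen", "Maria Lopez", "John Smith"]

def match_name(raw, names, cutoff=0.8):
    """Map a possibly misspelled name to the closest roster entry."""
    hits = get_close_matches(raw, names, n=1, cutoff=cutoff)
    return hits[0] if hits else None

print(match_name("Davdi Chen", roster))  # i and v reversed -> David Chen
```

Names with no sufficiently close roster entry return `None` and get flagged for hand coding.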

If you've ever heard of a Dolly Varden, it's actually a kind of trout, but IBM's baseline classifier doesn't know that that's not a person, that's a fish. Going forward, as we build our own customized model, we'll pick that up, but for now we filter it out on the back end. So we don't just wanna know who shows up, we also wanna know what they

are doing when they arrive. And so the second text mining tool we are gonna use is part-of-speech tagging. So this time we are going

to take these texts and we are gonna break them up into

sentences rather than phrases. And then we’re gonna feed it back into

a machine learning classifier and it’s going to spit us back out a probabilistic

model that says this is the object of a sentence, this is the subject of

a sentence and this is the verb. And here we’re particularly

interested in subject and verb, so the person who’s doing something and

what they’re doing. So a quick snapshot of how we’re

doing when we actually try and run these things,

we hand coded 50 different meetings. And so in the top, you can see how we

performed with respect to attendance. The machine has false positives but it doesn't really have false negatives, because it's not gonna miss a name that's actually written down. It's just also going to find cases where someone wasn't actually in the meeting but their name came up. So, it might be a reference to President

Obama, or it might be someone saying, let’s invite so and so next time, and so

that's why the hand coding tends to track below the machine coding: some of those names weren't actually present. When we look at the part-of-speech tagging, we're a little more all over the board: we sometimes hit high, mainly low. This is a noisy thing even for human coders; it was actually sort of vexing, when we sat down to code those 50 meetings, how much ambiguity there was even for us in saying who did what, and dealing with things like generic nouns

that refer to multiple sets of people. So I think this is an ongoing

thing to be explored. And then the final step, which we use very simply right now: we take the verbs that come out of the back end and we feed them into the VerbNet lexicon. Linguists are all about verbs, and they

have great classifiers that’ll tell you way more than you wanna know about

what this verb is and what it means. Right now, we use it to filter out past tense and future tense: we don't wanna know what someone did in the last meeting, we don't wanna know what they're gonna do next time, we wanna know what happened in each particular meeting. One other thing that you'll likely

find if you’ve applied this is the way people write up these types

of documents is very different than how we formally classify language. And so Watson’s program thinks there’s

a lot of physical activities going on at these meetings because every time

someone gives a presentation, the person who’s taking the notes says, so

and so walked us through the new report.>>[LAUGH]

>>Right, and so it says, wow, these people are walking around, there's lots of activity, and it gets sort of weird. That's why we use this for filtering right now and not for live analysis, cuz we need to spend a lot more time with these verb types. So, again, thankfully Lauren

did all the heavy lifting. Essentially here, we're going to use a temporal ERGM: we're going to use Philip Leifeld's btergm package, and it's not quite as complex as it sounds. It's essentially a bootstrapped version of the temporal ERGM that Lauren is using. The big advantage here is that it gives a major speed-up: it uses a pseudo-likelihood estimator and then bootstraps that, so the standard errors are not biased downwards. And the upshot, and I know it sounds silly to call it a large network when I have 770 people, right, that's small data, but for ERGMs that can be a big deal: when you have upwards of 500 nodes and you're fitting an MCMC-driven ERGM, if you're not careful that can be weeks, or you might never even get it to fit. So the ability to get something that gives you results in a minute is a big plus. And that's sort of the exciting takeaway about btergm versus a lot of the other ERGM modeling tools that are out there. And essentially we're just modeling this network change over time across these [INAUDIBLE]. So a quick snapshot of the types

of things we learned from this. We find that the people who

started out as leaders? They’re leaders the whole time. So people don’t sort of rise up somewhere

in a 16-year process and take charge. If you’re in charge at the start you’re gonna dominate

the conversation at the end. We found that anyone who folks said,

this person has a key sign-off role, they tend to dominate

the conversation as well. If you're vested with a little bit of veto power, you're gonna talk more, you're gonna present more, you're gonna dominate the process. The interesting thing is, after

the license is approved, it enters this five-year implementation phase and

there's no more mandate to collaborate. And what we see, when we interact the stability of individual interactions with this phase, is that stability drops, right? We see less regular interaction; we see people who show up in one meeting and then don't show up again for a year. So it appears that that mandate from FERC does sort of provide some key support that keeps people coming and keeps them participating. Which matters, because meetings suck,

right? And it takes time to drive there and

you have better things to do. And so anything you can do to sort of support people continuing in this type of collaborative process can be valuable. So, wrapping up: for a lot of network

research, actually running the model, it’s like 10% of the work, right? Getting a network that you

can model is the real hurdle. And if you can pull it off, graph models

are useful even if you do not have something that’s sort

of a standard network. So a lot of times we think it has to be

a sort of a truly social network, but really what we use graph models for

is when we think there is some sort of structural interdependence, when we

think that the neighboring ties matter. So whatever the nodes and edges are, if you think there’s some sort

of interdependence that you can’t capture simply by sort of applying fixed effects

and random effects for particular actors. You might need to explore an ERGM-type approach. And then finally, if I can do this,

any of you can, right? This is not high-level text mining,

for instance. Someone like Dan is so far beyond in terms of what he’s actually

doing when he’s processing language. We’re doing fairly simple things

like spitting out people and spitting out the actions they take. But we can learn a lot from that. So I’m excited about sort of

applying these types of simple tools to important real world problems. So with that I’ll thank you and

I look forward to your feedback.>>[APPLAUSE]

>>Questions?>>So I’ll ask a question.>>Sure.>>So we see a lot of stuff about

networks and these graphical models, and it's a very active area at the moment. Is there a good reference?>>Yes. My favorite is the one which I think one of your advisors is also an author on, [INAUDIBLE]?>>Yeah.
>>Yeah, so that's my favorite textbook, but a lot of it is online. The University of Washington has a ton of resources. They have a super active user group,

so I think the first thing I’d go to is statnet.org

>>The thing is, a lot of us are actually fairly knowledgeable on statistical methods in general. My view is it's a question of finding the reference. I can mention a lot of papers. But presumably there's one or two pages out there that would explain, for instance: when we say bootstrap, you need to bootstrap over something that's independent, otherwise you've done an incorrect bootstrap. So I'd be interested to know, for example, what's the bootstrap scheme that's being used?>>So the way that one works is, it only works for temporal network models, because what they actually do is re-sample those network periods. So they'll pull a set of periods randomly, and

they’ll run the model on those and they’ll do it again. And so they have a 2012 paper and

I forget where it's published, but it essentially shows that when they apply that method with the pseudo-likelihood estimates for a temporal ERGM, they do generate reasonable standard errors. But I think the high-hanging fruit out there is still figuring out how we can do a similar thing for a non-temporal network model.>>Any other questions? Okay.

Well, thank you again.>>[APPLAUSE]>>Let me start. We read a lot of stuff

about machine learning, deep learning, this that and

the other, and the previous paper particularly showed a lot of the tools out there that are being used. And I knew really nothing about this, but I just want to know things, all right? So I committed, rather foolishly,

to teaching a PhD course that first meant six weeks on machine

learning to force myself to learn and

spend a lot of time on it and so on. I figure I can explain it in 30 minutes,

so we’ll see how it goes. Okay, so this is the key,

this slide, all right? The goal with machine learning is prediction, right? That's the goal,

which is often different from a lot of what we wanna do which is

to get a beta hat, right? Which has some physical interpretation. And in pure machine learning

there’s no structural model. There’s no knowledge that there’s

a gravity model of trade and trade flows between countries will

depend on the size of the countries, the distance from the countries and so on. It’s just throwing in this data and

then there’s some algorithm, and given that algorithm,

it spits out an answer, right? So there's no kind of structural model, no model in the purest form of machine learning. That's why it's called machine learning. But it's an algorithm, and it's using existing data, okay? These data train the machine to come up with a prediction model, and then this model is applied to a different set of data to make some predictions, right? What we would call that is out-of-sample prediction; that's what it is, okay? And then you've got to be concerned about overfitting the existing data, all right? So, if you like, there's some

true expected value of y conditional on x; that's what we want to fit. But what it's aiming for is the observed y. So if there's an outlier, it's going to chase that outlier and fit that outlier well. When you apply it to a different data set, the outliers are somewhere else, and it's just not gonna do as well. So there are methods to

guard against overfitting. And this is really something that we just don't do very much of, all right? If you like, an adjusted R-squared is some attempt. But really, I'd say this is perhaps the biggest difference between machine learning and what we do: I think there's a lot more

concern about overfitting, right? So there are lots of different algorithms. They go by names like deep nets, [INAUDIBLE], and so on, right? New ones just keep coming up, and it's rather like: there's a logit model, there's a Gaussian model, there's a whatever; there are so many of them. And this is the hard thing: different algorithms work better on different types of data, right? So a regression tree might be fantastic for one type of problem, but hopeless on some other problem. For some other problem we'd want, I don't know, a neural network or whatever, okay. And then, as we saw from the previous talk, forming the data to input can

be an art in itself, right? There's a term for this: data carpentry. So generally, in the problems that I work on, that most of the people here work on, usually you have a good idea about what Y is and what the answers would be, right? But don't ask me how you get data out of a photo of a face and then do facial recognition, all right? How do you get Ys and Xs out of that? I have no idea, right? But the thing is, that's a problem that

a lot of people are dealing with, so people who deal with that problem have got off-the-shelf methods to capture the features, right? It'll be distance between the eyes, etc. And what could go wrong? Well, we always say correlation does

not imply causation, right? This is where, in any field, your base knowledge is the starting point: that trade will depend on these distances and so on, and then, as we saw in the first talk, the trade partnerships will matter. And the challenge here is you bring in the models, you bring in some structure. But also, we know that getting causal inference right can be difficult. And what people are doing is: if you have some causal approach that along the way requires some part of it to be flexibly modeled, then maybe we use the machine learner to get that flexible modeling part, and put it back in with the rest of our methods. But then we have to make sure our inference guards against the fact that we could be back in an overfitting situation when we use the machine learner, okay? So this is the bottom line,

right, this is what’s going on. And then a lot of my talk is just now,

I think, mentioning buzzwords.>>[LAUGH]

>>Fail. Start. Okay, I’ll be good. Where is that? It’s behind?>>I stole it.>>[LAUGH]

>>This is good for that. Okay, so

the topic is called machine learning, but other names are statistical learning,

data learning, data analytics. And a big thing is, the term big data is

used, but it doesn’t have to be big data. And then the literature distinguishes

between supervised learning, where there is a Y and an X; we'd call that regression-type analysis. Within that, there's regression, where the outcome, well, it's continuous, but it could also be something like a count or whatever; it's one where we have some, I guess, cardinal measure, right? A 6 is two times a 3, all right? The alternative is classification, right? Which is where we would use a logit model or a multinomial model. And then in unsupervised learning there's no Y. All right, in unsupervised learning we are just trying to take the Xs and

see some pattern in them. And a big one would be in psychometrics

where we’ve got all these attributes on people. We put them together and we see, there’s this cluster of people

that have these common attributes? And then we look at them and

we say, yeah it looks like that person is an aggressive person,

we’ll say that’s a type A individual. So what I’m gonna focus on is just

the first one, the regression, okay? So a little bit about classification and really almost nothing

about cluster analysis. Okay, so a big thing is guarding against overfitting, and the terminology is a training dataset, or an estimation sample, and then a test dataset, or a holdout sample, or a validation set, okay? And then the fitting would be done here, on the training data, and then deciding whether it's good or not, having a race between different models, would be done here, on the test data. So we might get the beta hats from here and then use them in predicting here, and then see which model was getting beta hats that did a better job of predicting here. So, as shown in this example, the big thing is mean squared error. With machine learning you

got to have a loss function. And the loss function used for a continuous outcome is mean squared error, and think of this as, we have multiple observations [INAUDIBLE]. So this might even be quite [INAUDIBLE]. This is the average of them. And then the loss where [INAUDIBLE], okay? So which shall I go for? Shall I go for

the Y that connects the dots? Or should I do some sort of smoothing, right? Now, what mean squared error is doing is it's balancing the two, all right? So this prediction here could be biased, all right? The truth is somewhat higher, but I've gotten a smoother fit. And mean squared error is the sum of bias squared plus variance. But a lot of the machine learners use biased estimates, even asymptotically, because of this smoothing, right? And one place, if you're familiar with nonparametric regression: nonparametric regression, if you use cross-validation, is essentially using this mean squared error. Okay, so I'm gonna create an example

where the data-generating process is quadratic, believe it or not. But when I run the regressions I'm [INAUDIBLE] deciding whether it's x cubed or x to the fourth or x [INAUDIBLE] most important. But here I set it up so that I had 20 observations here and 20 here, and I had set it up so that the x's were the same, all right? In my estimation sample and my validation sample. But using this data,

the quality of my fit is this. So now I've got the same x's over here, so it's gonna be the same predictions. But I'm comparing to actual data, and you can see that I'm just getting more outliers here, right? It's just not predicting

as well out of sample. Okay, so now I do things in sample, with the linear model, quadratic, cubic, quartic.
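A self-contained sketch of this exercise, with invented numbers and a quadratic data-generating process (the seed and coefficients are mine, not the speaker's). In-sample error can only fall as terms are added; held-out error has no such guarantee.

```python
import random

# Invented quadratic data-generating process: 20 estimation points and
# 20 validation points drawn at the same x's with fresh noise.
random.seed(1)

def dgp(x):
    return 2.0 + 1.5 * x - 0.5 * x * x + random.gauss(0, 1)

xs = [i * 0.2 for i in range(20)]
train_y = [dgp(x) for x in xs]
test_y = [dgp(x) for x in xs]  # same x's, fresh noise

def fit_poly(xs, ys, degree):
    """Least-squares polynomial fit via the normal equations,
    solved with plain Gaussian elimination (no numpy needed)."""
    n = degree + 1
    a = [[sum(x ** (i + j) for x in xs) for j in range(n)] for i in range(n)]
    b = [sum((x ** i) * y for x, y in zip(xs, ys)) for i in range(n)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[piv] = a[piv], a[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(col + 1, n):
            f = a[r][col] / a[col][col]
            for c in range(col, n):
                a[r][c] -= f * a[col][c]
            b[r] -= f * b[col]
    beta = [0.0] * n
    for i in reversed(range(n)):
        beta[i] = (b[i] - sum(a[i][j] * beta[j]
                              for j in range(i + 1, n))) / a[i][i]
    return beta

def mse(beta, xs, ys):
    return sum((y - sum(c * x ** k for k, c in enumerate(beta))) ** 2
               for x, y in zip(xs, ys)) / len(xs)

for degree in (1, 2, 3, 4):
    beta = fit_poly(xs, train_y, degree)
    # In-sample MSE is non-increasing in degree (nested least squares);
    # the held-out MSE need not be.
    print(degree, round(mse(beta, xs, train_y), 3),
          round(mse(beta, xs, test_y), 3))
```

On most draws the held-out column bottoms out around the quadratic, but as the talk notes, randomness means it doesn't have to.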

In sample, it's getting better and better and better. But when I go out of sample, the lowest is actually the quadratic, which was the data-generating process I used here. It didn't necessarily have to be this way, because there is randomness here, but in most simulations it's going to turn out that way. So I just have this here to try and show that it's very easy to go wrong, and particularly to overfit, right? The in-sample comparison is a sign that the quartic is

best, when my data-generating process was a quadratic. Okay, so what could we do? Well, what I've done here

is basic validation. But I've lost precision, right? To do this I had to say: I've got 40 observations, I'll only use 20 in estimation, so I can have this other 20 held outside for my validation. And then furthermore, the results will depend on where I do the split; there are many 50-50 splits I can do with the data, okay? So here's what k-fold cross-validation does. Here I've got, say, five splits: we randomly split the data into fifths. We estimate on four-fifths of the data,

and predict on the other one-fifth. And we can do that five times, right? So first, I hold out the first fold, estimate on the others, and then use those estimates to predict on the held-out first fold. Then I repeat.
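The fold bookkeeping just described can be sketched in a few lines (data sizes from the talk; the actual model fitting is omitted):

```python
import random

# 40 observations, randomly split into 5 folds, so each observation
# is held out exactly once.
random.seed(0)
idx = list(range(40))
random.shuffle(idx)
folds = [idx[i::5] for i in range(5)]

for k, held_out in enumerate(folds):
    train = [i for i in idx if i not in held_out]
    # estimate on `train`, predict on `held_out`, store that fold's MSE
    print(k, len(train), len(held_out))
```

Each pass trains on 32 observations and scores the held-out 8; averaging the five held-out errors gives the cross-validated MSE.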

Next, I'll hold out the second fold: I'll estimate on the first, third, fourth, and fifth, get beta hats there, and then predict on the second fold. This shows, when I left the first fold out, the root mean squared error I got; then when I left the second fold out; third, fourth, fifth. And this was for one particular model, the quadratic, okay? And then the average is 12.3935, okay? So I've five times estimated models on

four-fifths of the data and predicted on the fifth, and I’ve got the average

mean squared error across those five. So now I'm gonna do the same thing, not just for the quadratic but for the linear model, the quadratic, cubic, and quartic, and we'll see which gives the best result. And now, I don't know why this is, [LAUGH] actually, when I look at it, why the linear came up the same as the quadratic. But it is showing the quadratic has the lowest mean squared error. And again, because of randomness, and only 20 and 40 observations, it didn't necessarily have to turn out that way; it wouldn't on every realization of a random sample. Okay, so I spent a lot of

time on that because I really think it forces us to think a little bit about what we're doing in our own work. Where do we run the risk of overfitting when we put structure on the model? So, for example, it's a linear regression with, well, a log-likelihood function, right? So that's quite a structural model. In that case, there is theory out there on information criteria, which give a penalty for the model size. Okay, so the bigger the model: we maximize the likelihood, and as we have more complex models, better fit, the log-likelihood goes up, so minus the log-likelihood goes down. We want this to be small,

but then there's a penalty. The AIC is actually a pretty weak penalty; the Bayesian information criterion, the BIC, is a bigger penalty. Okay, and there are other ones, known as Mallows' Cp, and one we'd all know is adjusted R-squared. They're much more model-specific. And in some circumstances you can

show that cross-validation's doing the same thing. Cross-validation's just a universal thing, but it's computationally more expensive. And maybe you can do this: for some problems it may be good, maybe for some initial analysis, to speed things up, and then later on do a cross-validation more properly. Even cross-validation, I mean, it's pretty arbitrary. Why is it that we're using bias squared plus variance? Why not two times bias squared plus variance, okay, as our loss function? And underlying [INAUDIBLE], why use a squared loss? Why not use an absolute loss? So, I mean, there is arbitrariness. Okay, so when we use AIC and BIC, in this example they both come up with the quadratic as best. Because the AIC didn't have such

why use a squared loss? Why not use an absolute loss? So I mean there is arbitrariness. Okay, so when we use AIC and BIC, in this example they both come up with the quadratic we don’t invest. Because the AIC didn’t have such

a big penalty, there wasn’t so much difference between the quadratic and

the quartic. Whereas there was a bigger

difference with the BIC. All right, so

now we’re getting into regression. And the first thing is

that generally you think, linear regression,

pretty restrictive, all right? But if you had a ton of data, and you

could put in all sorts of interaction and quadratics and so on, right? A linear regression might actually

do a very good job of predicting. And it’s easy to work with. I mean, when we say linear regression,

with all those interactions and so on, it is nonlinear in the explanatory variables,

but it’s linear in parameters, all right? And that makes it very quick and

fast to estimate, okay? And then the other reason for going with linear regression is that’s

the basis of the more complex models. Okay, so

methods to reduce model complexity. Don’t go too overboard. Choose a subset of regressors. Shrink regression

coefficients towards zero. Reduce the dimension of regressors. So say we’ve got 100 regressors,

I’ll just go with the best [INAUDIBLE] linear combinations of

those regressors in terms of it. So I’ll go through these. And then I forgot this one,

but it may predict well. Direction on how to [INAUDIBLE]. Okay, so this would be easiest if you

had something just on regressing. This assumes we decided. These are the 133 regressors I’m gonna go,

right? I’m bringing in my interactions and

so on, and then we wanna choose which

are the best 133, okay? And there are these methods to say, okay, let’s first of all have a rule for

what would be the best model with one regressor? With two, what would be the best? With three? So then you come up with the best model for one, two, three, up to 133 regressors. And then you do the cross-validation and say which of those is best at predicting out of sample. Or we could use AIC, or something else that penalizes the model size. In R, it’s all automated, right? Once it’s set up, it’s a one-line command that will spit out the best model of each size. Okay, and then not only that, but

this is a standard problem, so the algorithms that are used will be about 100,000 times faster than if you tried to code that yourself. There’s a thing called the leaps and bounds procedure that makes this pretty quick.
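A brute-force version of the subset-selection procedure just described can be sketched as follows. Everything here, the simulated data and the naive normal-equations solver, is an illustrative stand-in, not the leaps-and-bounds algorithm itself:

```python
import itertools, random

def ols_ssr(X, y, cols):
    # Fit y on the chosen columns (plus an intercept) by solving the
    # normal equations with naive Gaussian elimination; return the
    # sum of squared residuals.
    Z = [[1.0] + [row[j] for j in cols] for row in X]
    n, k = len(y), len(Z[0])
    A = [[sum(Z[i][p] * Z[i][q] for i in range(n)) for q in range(k)] for p in range(k)]
    b = [sum(Z[i][p] * y[i] for i in range(n)) for p in range(k)]
    for p in range(k):  # forward elimination with partial pivoting
        piv = max(range(p, k), key=lambda r: abs(A[r][p]))
        A[p], A[piv] = A[piv], A[p]
        b[p], b[piv] = b[piv], b[p]
        for r in range(p + 1, k):
            f = A[r][p] / A[p][p]
            for q in range(p, k):
                A[r][q] -= f * A[p][q]
            b[r] -= f * b[p]
    beta = [0.0] * k
    for p in reversed(range(k)):  # back substitution
        beta[p] = (b[p] - sum(A[p][q] * beta[q] for q in range(p + 1, k))) / A[p][p]
    return sum((yi - sum(z * bb for z, bb in zip(row, beta))) ** 2
               for row, yi in zip(Z, y))

random.seed(0)
n_obs, n_reg = 100, 5
X = [[random.gauss(0, 1) for _ in range(n_reg)] for _ in range(n_obs)]
# Toy model: only regressors 0 and 2 actually matter
y = [2.0 * row[0] - 1.5 * row[2] + random.gauss(0, 0.5) for row in X]

best = {}
for size in range(1, n_reg + 1):
    best[size] = min(itertools.combinations(range(n_reg), size),
                     key=lambda c: ols_ssr(X, y, c))
print(best[2])  # should recover the two relevant regressors
```

Leaps and bounds gets the same answer while pruning subsets that provably cannot win, which is why the automated routines are so much faster than this exhaustive search.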

The bias-variance tradeoff, I kind of explained that already. Shrinkage methods. Okay, so with shrinkage methods the objective function is the sum of squared residuals, and to avoid overfitting, or to reduce model complexity, they shrink the coefficients towards zero. And one way to shrink towards zero is what we typically do, which is:

it’s either in the model or it’s not. So we test statistical significance of regressors, and in some areas you will exclude things that are statistically insignificant, okay? Well, the lasso is doing that, but it’s not doing it using statistical significance. Okay, and the shrinkage methods work on the parameters themselves, on the betas. And the simplest ones treat each beta as having the same impact, right? And so the standard thing to do is, well, you have to rescale the regressors and standardize them so that they each have mean zero and variance one, right? In that case, each beta is telling me: if I have a one standard deviation change in this x, then what is the change in y? It’s not clear that’s

necessarily the best. What’s the difference between a one standard deviation change in x versus a one standard deviation change in x squared? Maybe I should have something cleverer than that, but this is the default. And if you wanted to standardize your data in some other way, you could always do it and put it into these methods, okay? So the ridge penalty is a multiple of the sum of the squared coefficients, and for the lasso the penalty is in absolute values, okay? So let’s, first of all, do ridge. For the ridge, we’re going to minimize the sum of squared residuals plus lambda times this sum, the sum of squared coefficients. So as we add more regressors

this is going to go up; that’s a penalty, because we wanna get this thing down. And then this lambda is a tuning parameter; this lambda is like a bandwidth when you’re doing a nonparametric regression. And one can use cross-validation methods to choose it properly, or you could use AIC or BIC, okay? When you go through the algebra, I just put this in, the ridge estimator: it’s (X transpose X plus lambda I) inverse, times X transpose y. So that makes it clear we’re dealing with what we would think of as a biased estimator, cuz least squares would be this without the lambda. Okay, now what the ridge does is shrink all coefficients towards zero, and it can be computed quickly for many different values of lambda.
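The shrinkage in that formula is easiest to see with a single regressor and no intercept, where (X'X + lambda I) inverse times X'y collapses to dividing the least-squares numerator by (sum of x squared plus lambda). A small simulated sketch; the true slope of 3 and the lambda values are made up for illustration:

```python
import random

def ridge_1d(x, y, lam):
    # One-regressor, no-intercept case of (X'X + lambda*I)^-1 X'y:
    # the OLS numerator is unchanged, the denominator grows with lambda,
    # so the coefficient is pulled toward zero.
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    sxx = sum(xi * xi for xi in x)
    return sxy / (sxx + lam)

random.seed(1)
x = [random.gauss(0, 1) for _ in range(500)]
y = [3.0 * xi + random.gauss(0, 1) for xi in x]

for lam in [0.0, 10.0, 100.0, 1000.0]:
    print(lam, ridge_1d(x, y, lam))  # estimates shrink as lambda grows
```

At lambda = 0 this is just least squares; as lambda grows the estimate heads monotonically toward zero, which is the shrinkage the penalty buys.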

The lasso has an absolute-value penalty instead, and this next diagram shows what’s going on. So for the ridge, we have beta 1 and beta 2, okay, so we’re minimizing the sum over i of y i minus beta 1 times x 1i minus beta 2 times x 2i, all squared. So that’s gonna give us

a quadratic, so it’ll have these ellipses. And this is what least squares would’ve given us, right? But we’re restricting ourselves to beta 1 hat squared plus beta 2 hat squared, in here. So we’re gonna have to move

down in this direction. As we move down in this direction, we’re moving away from minimizing

the sum of squared residuals. So this contour is a larger sum of squared residuals, this one larger still, and larger still, and this is the best we can do subject to the constraint. And because we’ve got this circle as the shape we’re constraining things

to be, and this is an ellipse, we’re very likely to end up at a point where both beta 1 and beta 2 are nonzero. We’re not likely to be here or here, right? Whereas with the lasso, the constraint was in terms of absolute values,

so this is our constraint and now for a variety of ellipses it’s much

more likely to be in a corner. And a corner means we’re dropping something, right? And the lasso in particular has certainly been very popular, because of that variable-selection property.
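The corner solutions have a clean algebraic counterpart: in the special case of orthonormal regressors, the lasso coefficient is the least-squares estimate “soft-thresholded” toward zero, so small coefficients land exactly at zero. The threshold and coefficient values below are arbitrary illustrations:

```python
def soft_threshold(b_ols, lam):
    # Orthonormal-regressor case of the lasso: the coefficient is the
    # least-squares estimate pulled toward zero by lam, and set exactly
    # to zero once |b_ols| <= lam -- the "corner" of the constraint set.
    if b_ols > lam:
        return b_ols - lam
    if b_ols < -lam:
        return b_ols + lam
    return 0.0

print([soft_threshold(b, 0.5) for b in [2.0, 0.3, -0.1, -1.2]])
# -> [1.5, 0.0, 0.0, -0.7]
```

Ridge, by contrast, would rescale each of these toward zero without ever making them exactly zero, which is why it keeps every regressor in the model.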

Dimension reduction: like I said, it’s just going from a large number of xs to a smaller number of linear combinations of the xs. [INAUDIBLE] High-dimensional models: what that means is p is large relative to n. In that case, there can be some

problems, in that some of the methods require that we have more observations than parameters. There are some solutions. Nonlinear models. Okay, so polynomial regression, I’ve already

talked about that, quadratics and so on. Step functions, regression splines,

smoothing splines, B-splines, wavelets. What these methods do is kinda break things into pieces. And the simplest would be: if I had a scalar X, just split the Xs into deciles, and fit a linear regression on the lowest 10% of the xs, and then another one on the next 10%, and another, and then make sure that those connect, okay? Well, the big one of these is cubic splines. So it’s done with cubic regressions, and you can very easily set it up. It’s very simple to not only have the pieces touch, so it’s connected, but to have the first derivatives and the second derivatives equal at the points where they touch. And then those methods, because you do worry about the tails going out too far, will linearize the tails. And from what I can see, to me, maybe the hard problem is moving into multiple dimensions; but for a one-dimensional thing, to me it just makes so much more sense

than doing a global polynomial regression, right? I think the big reason for the global approach is that you can prove a lot more things. But kernel-type methods generally have a fixed bandwidth, whereas this essentially allows things to adapt. It doesn’t have to be every 10% of the data; where your data is dense the pieces are narrow, right? You’re still using the same number of observations as when you’re in sparser areas. Anyway, I’m just trying to give you an

idea of just how much stuff there is out there, right? Any one thing is not gonna be too hard, right? If someone can tell you that your problem maps to the machine learning methods that are out there and so on, you’re in business, right? But if you have to go out on your own, no, you wanna go and talk to someone who’s spent many years learning about all these different methods and language such as this. Okay, neural networks. In neural networks, Y is predicted by Xs, and what we’ll say is that my prediction function will be this: a linear

combination of some Zs. And then the Zs, in turn, are transformations of linear combinations of things at a lower layer. And you keep going back until you get to the original Xs, all right? So it’s a linear combination, and then somehow we’re tying all these things together, and at each next level we are getting more and more and more. And somehow or other this works fantastically.>>[LAUGH]
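The layered structure just described, hidden units formed from linear combinations of the inputs and feeding a final linear combination, can be sketched as a single-hidden-layer forward pass. The tanh activation and the random placeholder weights are assumptions for the sketch; a real network learns the weights from data:

```python
import math, random

def forward(x, W1, b1, w2, b2):
    # Hidden units z: nonlinear transforms of linear combinations of the inputs
    z = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
         for row, b in zip(W1, b1)]
    # Prediction: a linear combination of the hidden units
    return sum(w * zi for w, zi in zip(w2, z)) + b2

random.seed(0)
n_in, n_hidden = 3, 4
W1 = [[random.gauss(0, 1) for _ in range(n_in)] for _ in range(n_hidden)]
b1 = [0.0] * n_hidden
w2 = [random.gauss(0, 1) for _ in range(n_hidden)]
b2 = 0.0
print(forward([0.5, -1.0, 2.0], W1, b1, w2, b2))
```

Deeper networks just repeat the same pattern: each layer’s units are transforms of linear combinations of the layer below.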

>>Okay. All right, so they’re particularly good for prediction: speech recognition, image recognition, Google Translate. Google Translate was trying to do things the way we do things: it was trying to teach what the rules of language were, all right? Here’s the dictionary, here’s how we put things together in English, here’s how we put things together in French, right? And then they got this guy from Toronto to come in, and with about ten people working with him, in three months he’d improved things as much as Google Translate, doing things the old way, had been able to improve over the last few years, all right? And so all of a sudden they threw all their resources in. And they must have done it for a lot of languages; they’re rolling out several translations a week, okay? And so this is off-the-shelf software. Right, so first of all,

you use things to get your image, or text, or whatever, into some sort of numeric data. I think now it’s estimated using a thing called stochastic gradient descent. So you bring randomness into it, where otherwise we could do something like a Newton-Raphson algorithm. There are these programs out there, I’ve been to their websites, and they promise you the world, and it’s all very easy.>>[LAUGH]
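The estimation step just mentioned can be sketched on a toy model. This is plain stochastic gradient descent for a one-regressor linear model; the learning rate, epoch count, and simulated data are illustrative choices, not anything specific to the software being described:

```python
import random

def sgd_linear(data, lr=0.01, epochs=50):
    # Stochastic gradient descent on squared-error loss for y = a + b*x:
    # nudge the parameters after each single observation, visited in a
    # fresh random order every epoch.
    a, b = 0.0, 0.0
    for _ in range(epochs):
        random.shuffle(data)
        for x, y in data:
            err = (a + b * x) - y
            a -= lr * err
            b -= lr * err * x
    return a, b

random.seed(2)
data = [(x / 10, 1.0 + 2.0 * (x / 10) + random.gauss(0, 0.1))
        for x in range(-50, 50)]
a_hat, b_hat = sgd_linear(data)
print(a_hat, b_hat)  # near the true intercept 1 and slope 2
```

Each pass visits the observations one at a time in random order, which is the “stochastic” part; a Newton-type method would instead use the full sample, and second derivatives, at every step.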

>>Haven’t tested that theory out. And okay, I’m almost out of time. Regression trees, okay, so for regression trees you simply use squared error loss. So the original loss is the sum of (yi minus y bar) squared, okay? Then what you do is you think of every possible way I could split up my data into two groups, and then we’ll use the means of those two splits, all right? So here it’s discovered that if I do this

split at education less than or equal to 12 versus education greater than 12, then the sum of (yi minus y1 bar) squared over these people, plus the sum of (yi minus y2 bar) squared over those people, is the smallest over all possible splits. And then we do things again and

think of every split. And it turns out here, the next thing to do was to split

this into something, all right? By male, female, then age, oops. Then, all right, and

you get the tree, okay? The trouble with this regression tree

is it’s what’s called a grading app/ algorithm. When it made the split here, it didn’t

think about the consequences further down, all right? Once it’s splits you,

it’s just set forever. So more refined things,
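The greedy split step just described can be sketched directly: try every cut point of one regressor and keep the cut minimizing the two within-group sums of squared deviations. The education/wage numbers are invented to mirror the example in the talk:

```python
def best_split(x, y):
    # One greedy step of a regression tree: try every cut point of a
    # single regressor, keep the cut minimizing the within-group sums
    # of squared deviations around each group mean.
    def sse(vals):
        m = sum(vals) / len(vals)
        return sum((v - m) ** 2 for v in vals)

    best = None
    for cut in sorted(set(x))[:-1]:  # largest value can't be a cut
        left = [yi for xi, yi in zip(x, y) if xi <= cut]
        right = [yi for xi, yi in zip(x, y) if xi > cut]
        total = sse(left) + sse(right)
        if best is None or total < best[1]:
            best = (cut, total)
    return best

# Invented data with an earnings jump after 12 years of education
educ = [8, 9, 10, 11, 12, 13, 14, 15, 16]
wage = [10, 11, 10, 11, 11, 21, 22, 21, 23]
print(best_split(educ, wage))  # cut at educ <= 12
```

A full tree just recurses on each side; the greediness is visible here, because the cut is chosen without looking ahead to later splits.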

So more refined things, random forests and so on, what they do is bring in randomness. One way to do it would be to subsample your data and grow trees on those subsamples, right? And I don’t have time to go through all of

them, but you’ve got all of this in here. And then some of these methods are ways of

getting a number of different predictions: maybe we started the tree this way, or we started it that way, and so on. And combining forecasts is usually viewed as the best thing to do. So there’s an annual prediction competition around the world between teams of graduate students. And it was last year or the year before that one of the teams of graduate students from the Stat Department here at Davis entered, using about four different prediction methods. And their data was simply coded data from some Internet shopping company; all they had were these numbers and letters. There wasn’t even anything saying the first three digits mean something, that’s one variable, and the next six are something else; it was just completely opaque. Somehow these machine learners and so on get some good predictions. Classification: so now we’ve got categories that

may lack a natural ordering. So the [INAUDIBLE] would be. And then a lot of people use linear and quadratic discriminant analysis, with support vector classifiers and support vector machines as generalizations. I’m way over, okay. Challenging:

unsupervised learning; challenging: causal inference. So basically the idea with causal inference, if you’re familiar with selection on observables, right, is that if I could just get the right controls in there, and maybe one other variable, then I could believe my results, okay. So what we want is to get really good controls. Well, if you could use a machine learner to get those controls, the machine learner is gonna do a better job of prediction, and then your controls will be controlling for more. And then you might feel better about assuming that anything that hasn’t been included is not gonna be correlated with the [INAUDIBLE] of interest. So that’s one example. And I’ve got this here,

so just to finish off. [SOUND] Okay,

if you go to my website, I’ve got, sorry, you have to go down a bit, but I’ve got this thing here, machine learning. And I’ve got 120 slides here and discussion of [INAUDIBLE], but I particularly like this [INAUDIBLE] book, which is an introduction to [INAUDIBLE] with applications. And this is available as a free PDF, or you can get it for $25, not a hard copy but probably a soft copy. And then I’ve got what is going on at the moment in economics. And then finally, I said that there was this book, a little cautionary tale. Sorry, How Big Data Increases, Weapons of Math Destruction.>>[LAUGH]

>>How big data increases inequality and threatens democracy.>>[LAUGH]

>>Okay, and the idea is that agencies have these algorithms to predict what is the probability of someone, I guess, not honoring their debts, okay. It’s fantastic at that. Is that the same thing as a good predictor of whether someone renting a house will make their monthly payment? All right, what happens if you want to rent a house? You gotta watch your credit score, right? It’s just very easy to

misuse these things. Another one is that they’re very black-box, all right: who should decide whether someone goes on parole? Is it a judge? Or is it a model that spits out, there’s a 93% chance this person will commit a crime in the next year, all right? Or who should, okay, anyway, I think I’m out of time. Okay, I’m done.>>[APPLAUSE]